Performance Tuning

There's no such thing as the fastest code...

About Me

HRT Compiler Explorer

Placing Orders

  • Many exchanges use FIX protocol
  • ASCII KEY=VALUE pairs
  • Separated by SOH byte ('␁' / '\x01')
  • Not very efficient
  • Requires several binary-to-decimal conversions
8=FIX.4.2 9=230 35=D 34=000000698 49=zxcy-01 52=20150910-18:38:30 56=CME 57=G 60=20150910-18:38:30 167=FUT 21=1 204=0 9702=2 40=2 59=0 11=RS 54=1 38=4 44=-65 55=CL 107=CLF6-CLG6 1028=N 50=MOO 142=US,UT 1=237504 7928=40433 1091=N 10=027

Simplified Order

NEW stock price quantity

/// hide
#include <sstream>
#include <string>
/// unhide
std::string newOrder(std::string stock, int price, int quantity) {
  std::stringstream s;
  s << "NEW " << stock << " " << price << " " << quantity;
  return s.str();
}

How Fast Is It?

  • Time 10 million orders, get average time
  • ~1040ns per order (Intel i7-10510U @ 1.8GHz, GCC 11 static)
  • We can do better!

Profile

$ perf record ./app
$ perf report
20.64%  __dynamic_cast
 6.17%  __cxxabiv1::__si_class_type_info::__do_dyncast
 5.78%  __ostream_insert
 4.96%  __cxxabiv1::__vmi_class_type_info::__do_dyncast
 4.33%  __strcmp_avx2
 3.69%  ostream::_M_insert<long>
 3.51%  basic_streambuf::xsputn
 3.10%  newOrder
 2.87%  std::locale::id::_M_id
 2.84%  num_put::_M_insert_int

Dynamic cast?

__dynamic_cast()
bool std::has_facet(std::locale const&)
std::basic_ios::_M_cache_locale(std::locale const&)
std::basic_ios::init(...)
std::basic_istream::basic_istream(...)
std::basic_iostream::basic_iostream(...)
std::basic_stringstream::basic_stringstream(...)
newOrder

Dynamic cast?


template<typename _CharT, typename _Traits>
void basic_ios<_CharT, _Traits>::_M_cache_locale(
    const locale& __loc) {
  if (has_facet<__ctype_type>(__loc))
    _M_ctype = &use_facet<__ctype_type>(__loc);
  else _M_ctype = 0;

  if (has_facet<__num_put_type>(__loc))
    _M_num_put = &use_facet<__num_put_type>(__loc);
  else _M_num_put = 0;

  if (has_facet<__num_get_type>(__loc)
    _M_num_get = &use_facet<__num_get_type>(__loc);
  else _M_num_get = 0;
}

Dynamic cast?


template<typename _Facet> bool
has_facet(const locale& __loc) {
  const size_t __i = _Facet::id._M_id();
  const locale::facet** __facets = __loc._M_impl->_M_facets;
  return (__i < __loc._M_impl->_M_facets_size
    && dynamic_cast<const _Facet*>(__facets[__i]));
}
    

Fixed in GCC 13!

dynamic_caststatic_cast short-circuit


#define _GLIBCXX_STD_FACET(...) \
  if constexpr (__is_same(const _Facet, const __VA_ARGS__)) \
    return static_cast<const _Facet*>(__facets[__i])
    

Profile - today (GCC 16)

11.97%  __ostream_insert
 8.75%  basic_streambuf::xsputn
 7.27%  newOrder
 6.72%  ostream::_M_insert<long>
 5.86%  num_put::_M_insert_int
 4.92%  basic_ios::_M_cache_locale
 4.43%  basic_stringbuf::overflow
 3.89%  __int_to_char

~1040ns → ~640ns  (~1.6× faster)

Lesson

  • Use the newest compiler you can
  • (GCC 16 for the rest of this talk)

Take Two


/// hide
#include <cstdio>
/// unhide
void newOrder(char *buf, const char *stock,
              int price, int quantity) {
  sprintf(buf, "NEW %s %d %d", stock, price, quantity);
}

Take Two - Results

  • ~260ns per order
  • ~4× faster
  • We can still do better!
47.38%  vfprintf
21.51%  _IO_default_xsputn
 9.81%  __strchrnul
 6.80%  _itoa_word

Take Three - Rethink

  • Don't need full generality of printf
  • Create a custom formatter
  • Maybe end up with a faster _itoa_word

How to do itoa?

  • Rightmost digit is value % 10
  • Next digit is (value / 10) % 10
  • And so on...
  • Digits come out in reverse order

Take Three

class Format {
  char _buffer[2048];
  int _ptr{0};
public:
  void append(char c) { _buffer[_ptr++] = c; }
  void append(const char *data) {
    while (*data) append(*data++);
  }
  void finish() { append('\x00'); }
  void decimalAppend(int value);
  void decimalAppendNonNeg(unsigned value);
};

Take Three

void Format::decimalAppend(int value) {
  if (value < 0) {
    append('-');
    value = -value;
  }
  decimalAppendNonNeg(value);
}

Take Three

void Format::decimalAppendNonNeg(unsigned value) {
  int startPos = _ptr;
  do {
    append(value % 10 + '0');
    value /= 10;
  } while (value);
  // Reverse the digits.
  auto end = &_buffer[_ptr - 1];
  auto start = &_buffer[startPos];
  while (end > start)
    std::swap(*start++, *end--);
}

Take Three


append(value % 10 + '0');
value /= 10;

mov r10d, 0xcccccccd
mov eax, esi          ; esi is value, move to eax
mul r10d              ; tmp = value * 0xcccccccd
                      ; eax = bot32(tmp)
                      ; edx = top32(tmp)
shr edx, 3            ; edx = value * 0xcccccccd / 2^35
                      ; 0xcccccccd / 2^35 = 0.10000000000582077
                      ; edx = floor(value / 10)

lea eax, [rdx+rdx*4]  ; eax = edx * 5
add eax, eax          ; eax *= 2, eax is now edx*10
sub esi, eax          ; esi = value - floor(value / 10) * 10
                      ; esi = value % 10
add esi, 48           ; add '0' to get ASCII
mov [rdi+rcx], sil    ; write out

Take Three


/// hide
#include <utility>
class Format {
  char _buffer[2048];
  int _ptr{0};
public:
  void append(char c) { _buffer[_ptr++] = c; }
  void append(const char *data) { while (*data) append(*data++); }
  void finish() { append('\x00'); }
  void decimalAppend(int value) {
    if (value < 0) { append('-'); value = -value; }
    decimalAppendNonNeg(value);
  }
  void decimalAppendNonNeg(unsigned value) {
    int startPos = _ptr;
    do { append(value % 10 + '0'); value /= 10; } while (value);
    auto end = &_buffer[_ptr - 1];
    auto start = &_buffer[startPos];
    while (end > start) std::swap(*start++, *end--);
  }
};
/// unhide
void newOrder(Format &format, const char *stock,
              int price, int quantity) {
  format.append("NEW ");
  format.append(stock);
  format.append(' ');
  format.decimalAppend(price);
  format.append(' ');
  format.decimalAppendNonNeg(quantity);
  format.finish();
}

Take Three - Results

  • ~80ns / order
  • ~13× faster than Take One
  • ~3.3× faster than Take Two

Take Three - Profile

94.93% newOrder(Format&, char const*, int, int)
 4.86% main

Take Three - Profile


0.68% mov    %esi,%edx
1.66% mov    %ecx,%eax
2.09% movslq %edx,%r8
1.99% lea    0x1(%rdx),%esi
0.68% mul    %r10d
5.51% shr    $0x3,%edx
0.83% lea    (%rdx,%rdx,4),%r9d
1.65% add    %r9d,%r9d
2.06% sub    %r9d,%ecx
2.20% add    $0x30,%ecx
1.55% test   %edx,%edx
1.31% mov    %cl,(%rdi,%r8,1)
2.29% mov    %edx,%ecx

Take Three - continued

  • Can we do better?
  • Absolutely...

Best (so far)

  • Do two digits at a time table[x % 100]; x/= 100;
  • Number of digits in result, N is 1 + log₁₀(x):
    
              log₂(x) ~= 31 - clz(x)
              log₁₀(x) = log₂(x) * (ln(2)/ln(10))
              ln(2)/ln(10) ~= 0.30103 ~= 1233/4096
            
  • Generate bespoke routine for N digits
  • ~76ns / order (a few % over Take Three)

static unsigned const PowersOf10[] =
  {1, 10, 100, 1000, 10000, 100000,
    1000000, 10000000, 100000000, 1000000000};

static unsigned numDigits(unsigned v) {
  auto log2 = 31 - __builtin_clz(v);
  auto log10Approx = (log2 + 1) * 1233 >> 12;
  auto log10 = log10Approx - (v < PowersOf10[log10Approx]);
  return log10 + 1;
}

Of Course

  • Trading system much more than this
  • Networking
  • Algorithm
  • Safeties
  • Logging and auditing
  • Testing

Bonus: std::format

What about std::format?

  • C++20, type-safe, no printf format-string footguns
  • What you'd write today instead of stringstream

/// hide
#include <format>
#include <string>
#include <string_view>
/// unhide
std::string newOrder(std::string_view stock,
                     int price, int quantity) {
  return std::format("NEW {} {} {}",
                     stock, price, quantity);
}

std::format_to - in-place

If you already own the buffer (à la Take Two):


/// hide
#include <format>
/// unhide
void newOrder(char *buf, const char *stock,
              int price, int quantity) {
  std::format_to(buf, "NEW {} {} {}",
                 stock, price, quantity);
}

fmt::format_to - upstream

fmtlib/fmt is the upstream of std::format:

#include <fmt/core.h>

void newOrder(char *buf, const char *stock,
              int price, int quantity) {
  fmt::format_to(buf, "NEW {} {} {}",
                 stock, price, quantity);
}

Is it free?

Version ns / order
stringstream ~1040
sprintf ~270
std::format ~310
std::format_to ~310
fmt::format_to ~210
custom Format ~80
"best" ~76
  • std::formatsprintf - type-safe, not faster
  • fmt::format_to ~25% faster than sprintf - same API, different implementation
  • For the last 2.5×, the custom Format still wins