OoO Crikey!

Out of Order Execution
Matt Godbolt @mattgodbolt

Overview

Image source: Intel

In-order Pipelines

Out of Order Execution

Image credit: Helder Ribeiro

ASM Overview

Intel Syntax

                    ret                           ; return
                    inc dest                      ; dest++
                    add dest, src                 ; dest += src
                    vfmadd132ss dest, src2, src3  ; dest = dest * src3 + src2
                

ASM Overview

Register operand:

                89 d0                  mov eax, edx
                48 89 d0               mov rax, rdx
                66 89 d0               mov ax, dx
                
  • rax / eax / ax...
  • rdi, rsi, rbx, rcx, rdx, rbp, rsp, r8-r15
  • xmm0-15, ymm0-15, zmm0-15 (or 31)...
  • ABI (Windows vs Linux)

ASM Overview

Constant operand:

                b9 ef be ad de          mov ecx, 0xdeadbeef
                48 b8 ef cd ab 89 67 45
                23 01                   mov rax, 0x0123456789abcdef

ASM Overview

Memory operand

                4c 8b 05[04 f3 ff 7f]   mov r8, QWORD PTR [global]
                01 84 8b[10 33 fe 7f]   add DWORD PTR array[rbx + 4 * rdx], eax

                addr = &array[b + d * 4];
                tmp = *addr;
                tmp += a;
                *addr = tmp;
                

Example!


void maxArray(double *x, const double *y)
{
  for (int i = 0; i < 65536; i++)
  {
    if (y[i] > x[i]) x[i] = y[i];
  }
}
                
Compiler Explorer

The Core2 Pipeline

The Core2 Pipeline

  • Branch prediction
  • Fetch
  • Decode
  • µop cache
  • Rename
  • Reorder buffer read
  • Reservation station
  • Execution
  • Reorder buffer write
  • Retire

Branch Prediction

  • Pipeline overlaps work
  • What about branches?
  • Informed guess!

Branch Prediction

  • Is there a branch?
  • Where does it go?
  • Is it taken?

Branch Target Buffer

Is there a branch? Where does it go?

My investigations...

Branch Prediction

Is the branch taken?

Branch Prediction

Is it taken? Where does it go?

Branch Prediction

Local history (older Core2)

Branch Prediction

Local history (older Core2)

Branch Prediction with local history

  • Doesn't scale well
  • Loop predictors mitigate
  • SNB+ use globals

Branch Prediction with global history

Does it matter?


def test(array):
  total = num_clipped = clipped_total = 0
  for i in array:
    total += i
    if i < 128:
      num_clipped += 1
      clipped_total += i

  return total / len(array), clipped_total / num_clipped

BPU → Fetcher

Branch Predictor
Fetch Address
Fetch & Predecoder

Fetcher & Predecoder

  • Reads 16-byte blocks
  • Works out where instructions are

31 c0           xor eax, eax
f2 0f 10 04 06  movsd xmm0, QWORD PTR [rsi+rax]
66 0f 2e 04 07  ucomisd xmm0, QWORD PTR [rdi+rax]
76 05           jbe skip
f2 0f ...       movsd QWORD PTR...
                    
31 c0 f2 0f 10 04 06 66 0f 2e 04 07 76 05 f2 0f

Fetcher → Decoder

Fetcher & Predecoder
Instruction bytes & instruction offsets
Decoder

Decode

  • Generate µops for each instruction
  • Up to 4 instructions/cycle
  • CISC → internal RISC
  • Micro-fusion & macro-fusion

4 Decoders

  • 1-1-1-1 (+1)
  • 2-1-1 (+1)
  • 3
  • 4
  • µcode ROM

Decode


maxArray:
xor eax, eax
.forLoop
movsd xmm0, QWORD PTR [rsi+rax]
ucomisd xmm0, QWORD PTR [rdi+rax]
jbe skipIf
movsd QWORD PTR [rdi+rax], xmm0
.skipIf
add rax, 8
cmp rax, 524288 jne forLoop
ret

clear(rax)
tmp = rsi + rax; xmm0 = rd64(tmp) // µ
tmp = rdi + rax; tmp = rd64(tmp); flags = compare(xmm0, tmp)
if (flags.be) goto skipIf
tmp = rdi + rax; wr64(tmp, xmm0)
rax = rax + 8
flags = compare(rax, 524288); if (flags.ne) goto forLoop // macro
rsp = rsp + 8; goto rd64(rsp - 8)

Decode


t00: clear(rax)
t08: tmp = rsi + rax; xmm0 = rd64(tmp)
t0d: tmp = rdi + rax; tmp = rd64(tmp); flags = compare(xmm0, tmp)
t12: if (flags.be) goto t19                             // predicted taken
t19: rax = rax + 8
t1d: flags = compare(rax, 524288) if (flags.ne) goto t08// predicted taken
t08: tmp = rsi + rax; xmm0 = rd64(tmp)
t0d: tmp = rdi + rax; tmp = rd64(tmp); flags = compare(xmm0, tmp)
t12: if (flags.be) goto t19                             // predicted not taken
t14: tmp = rdi + rax; wr64(tmp, xmm0)
t19: rax = rax + 8
                

Decoder → µop Cache

Decoder
µops & addrs (in predicted order)
µop Cache

µop Cache

  • 1536µops total
  • 32 sets of 8-ways
  • 6 µops per line
  • Complex rules...

BP → µop Cache

Branch Predictor
Fetch Address
µop Cache
Bypass

µop Cache → Rename

µop Cache
µops & addrs (in predicted order)
Rename

Renaming

  • 16 architectural registers
  • 100+ physical registers
  • "Rewrite" µops to use them
  • Unlock ILP!
  • Allocates load/store buffers

Renaming


xmm0 = rd64(rsi + rax)
flags = compare(xmm0, rd64(rdi + rax))
if (flags.be) goto t19
rax = rax + 8
flags = compare(rax, 524288) if (flags.ne) goto t08
xmm0 = rd64(rsi + rax)
flags = compare(xmm0, rd64(rdi + rax))
if (flags.be) goto t19
rax = rax + 8
flags = compare(rax, 524288) if (flags.ne) goto t08
                

Renaming


xmm0 = rd64(rsi + rax)
  flags = compare(xmm0, rd64(rdi + rax))
    if (flags.be) goto t19
    rax = rax + 8
      flags = compare(rax, 524288) if (flags.ne) goto t08
      xmm0 = rd64(rsi + rax)
        flags = compare(xmm0, rd64(rdi + rax))
          if (flags.be) goto t19
          rax = rax + 8
            flags = compare(rax, 524288) if (flags.ne) goto t08
                
7 CPU cycles

Renaming

  • Register Alias Table (RAT)
  • Maps arch register → internal register
  • reg->reg moves
  • Zeroing idioms

Renaming


xmm0_1 = rd64(rsi + rax_1)
flags_1 = compare(xmm0_1, rd64(rdi + rax_1))
if (flags_1.be) goto t19
rax_2 = rax_1 + 8
flags_2 = compare(rax_2, 524288) if (flags_2.ne) goto t08
xmm0_2 = rd64(rsi + rax_2)
flags_3 = compare(xmm0_2, rd64(rdi + rax_2))
if (flags_3.be) goto t19
rax_3 = rax_2 + 8
flags_4 = compare(rax_3, 524288) if (flags_4.ne) goto t08
                

Renaming


xmm0_1 = rd64(rsi + rax_1)
  flags_1 = compare(xmm0_1, rd64(rdi + rax_1))
    if (flags_1.be) goto t19
rax_2 = rax_1 + 8
  flags_2 = compare(rax_2, 524288) if (flags_2.ne) goto t08
  xmm0_2 = rd64(rsi + rax_2)
    flags_3 = compare(xmm0_2, rd64(rdi + rax_2))
      if (flags_3.be) goto t19
  rax_3 = rax_2 + 8
    flags_4 = compare(rax_3, 524288) if (flags_4.ne) goto t08
                
4 CPU cycles (vs 7)

Rename → Reorder buffer read

Rename
Renamed µops
Reorder buffer read

Reorder buffer

  • First out of order stage!
  • Blocks µops awaiting operands
  • Reads from PRF or internal reg

ROB read → Reservation Station

Reorder buffer read
µops with operand values
Reservation Station

Reservation Station

  • Schedules ready µops
  • Execution ports

Reservation Station → Execute

Reservation Station
µops with operand values
Execute port 0
Execute port 1
Execute port 2
Execute port 3
Execute port 4
Execute port 5
Execute port 6
Execute port 7

Execute

Something actually happens!!

Execution ports (Skylake)

Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 Port 6 Port 7
ALU
1
ALU
1
load
& addr
load
& addr
store ALU
1
ALU
1
addr
vec str
3
vec alu
3
vector
permute
branch
1-2
FPU
4
FPU
4
x87 FPU
branch
1-2
vec mul
5
PCLMUL
7
divide
& sqrt

Execute → Reorder buffer write

Execute
µops with output values
Reorder buffer write

Reorder buffer write

  • Results become available
  • Bypass back to ROB read

ROB write → Retire

Reorder buffer write
µops with output values
Retire

Retirement

  • Back in order!
  • "Commit" instruction
  • Write to PRF
  • Flush load/store buffer
  • Exceptions
  • Page faults
  • IRQs

And that's all there is to it!

The Core2 Pipeline

  • Branch prediction
  • Fetch
  • Decode
  • µop cache
  • Rename
  • Reorder buffer read
  • Reservation station
  • Execution
  • Reorder buffer write
  • Retire

Thank you!