Out of Order Execution

OoO Crikey!

Out of Order Execution

Matt Godbolt @mattgodbolt

Overview

Image source: Intel

In-order Pipelines

Out of Order Execution

Image credit: Helder Ribeiro

ASM Overview

Intel Syntax


                    ret                           ; return
                    inc dest                      ; dest++
                    add dest, src                 ; dest += src
                    vfmadd132ss dest, src2, src3  ; dest = dest * src3 + src2

ASM Overview

Register operand:


                89 d0                  mov eax, edx
                48 89 d0               mov rax, rdx
                66 89 d0               mov ax, dx

rax / eax / ax...
rdi, rsi, rbx, rcx, rdx, rbp, rsp, r8-r15
xmm0-15, ymm0-15, zmm0-15 (or 31)...
ABI (Windows vs Linux)

ASM Overview

Constant operand:


                b9 ef be ad de          mov ecx, 0xdeadbeef
                48 b8 ef cd ab 89 67 45
                23 01                   mov rax, 0x0123456789abcdef

ASM Overview

Memory operand


                4c 8b 05[04 f3 ff 7f]   mov r8, QWORD PTR [global]
                01 84 8b[10 33 fe 7f]   add DWORD PTR array[rbx + 4 * rdx], eax


                addr = &array[b + d * 4];
                tmp = *addr;
                tmp += a;
                *addr = tmp;

Example!


void maxArray(double *x, const double *y)
{
  for (int i = 0; i < 65536; i++)
  {
    if (y[i] > x[i]) x[i] = y[i];
  }
}

Compiler Explorer

The Core2 Pipeline

Branch prediction
Fetch
Decode
µop cache
Rename

Reorder buffer read
Reservation station
Execution
Reorder buffer write
Retire

Branch Prediction

Pipeline overlaps work
What about branches?
Informed guess!

Branch Prediction

Is there a branch?
Where does it go?
Is it taken?

Branch Target Buffer

Is there a branch? Where does it go?

My investigations...

hints that Haswell+ the µop cache is involved
From Agner:
The measured throughput for jumps and branches varies between one branch per clock cycle and one branch per two clock cycles for jumps and predicted taken branches. Predicted not taken branches have an even higher throughput of up to two branches per clock cycle. The high throughput for taken branches of one per clock was observed for up to 128 branches with no more than one branch per 16 bytes of code. If there is more than one branch per 16 bytes of code then the throughput is reduced to one jump per two clock cycles. If there are more than 128 branches in the critical part of the code, and if they are spaced by at least 16 bytes, then apparently the first 128 branches have the high throughput and the remaining have the low throughput. These observations may indicate that there are two branch prediction methods: a fast method tied to the µop cache and the instruction cache, and a slower method using a branch target buffer.
From intel's docs: "Conditional branches, which are never taken, do not consume BTB resources"

Branch Prediction

Is the branch taken?

Branch Prediction

Is it taken? Where does it go?

Branch Prediction

Local history (older Core2)

Branch Prediction

Local history (older Core2)

Branch Prediction with local history

Doesn't scale well
Loop predictors mitigate
SNB+ use globals

Branch Prediction with global history

Does it matter?


def test(array):
  total = num_clipped = clipped_total = 0
  for i in array:
    total += i
    if i < 128:
      num_clipped += 1
      clipped_total += i

  return total / len(array), clipped_total / num_clipped

BPU → Fetcher

Branch Predictor

Fetch Address

Fetch & Predecoder

Fetcher & Predecoder

Reads 16-byte blocks
Works out where instructions are


31 c0           xor eax, eax
f2 0f 10 04 06  movsd xmm0, QWORD PTR [rsi+rax]
66 0f 2e 04 07  ucomisd xmm0, QWORD PTR [rdi+rax]
76 05           jbe skip
f2 0f ...       movsd QWORD PTR...

31 c0 f2 0f 10 04 06 66 0f 2e 04 07 76 05 f2 0f

Fetcher → Decoder

Fetcher & Predecoder

Instruction bytes & instruction offsets

Decoder

Decode

Generate µops for each instruction
Up to 4 instructions/cycle
CISC → internal RISC
Micro-fusion & macro-fusion

4 Decoders

1-1-1-1 (+1)
2-1-1 (+1)
3
4
µcode ROM

Decode


maxArray:
  xor     eax, eax
.forLoop
  movsd   xmm0, QWORD PTR [rsi+rax]
  ucomisd xmm0, QWORD PTR [rdi+rax]

  jbe     skipIf
  movsd   QWORD PTR [rdi+rax], xmm0
.skipIf
  add     rax, 8
  cmp     rax, 524288
  jne     forLoop
  ret


clear(rax)

tmp = rsi + rax; xmm0 = rd64(tmp) // µ
tmp = rdi + rax; tmp = rd64(tmp);
  flags = compare(xmm0, tmp)
if (flags.be) goto skipIf
tmp = rdi + rax; wr64(tmp, xmm0)

rax = rax + 8
flags = compare(rax, 524288);
  if (flags.ne) goto forLoop // macro
rsp = rsp + 8;  goto rd64(rsp - 8)

Decode


t00: clear(rax)
t08: tmp = rsi + rax; xmm0 = rd64(tmp)
t0d: tmp = rdi + rax; tmp = rd64(tmp); flags = compare(xmm0, tmp)
t12: if (flags.be) goto t19                             // predicted taken
t19: rax = rax + 8
t1d: flags = compare(rax, 524288) if (flags.ne) goto t08// predicted taken
t08: tmp = rsi + rax; xmm0 = rd64(tmp)
t0d: tmp = rdi + rax; tmp = rd64(tmp); flags = compare(xmm0, tmp)
t12: if (flags.be) goto t19                             // predicted not taken
t14: tmp = rdi + rax; wr64(tmp, xmm0)
t19: rax = rax + 8

Decoder → µop Cache

Decoder

µops & addrs (in predicted order)

µop Cache

1536µops total
32 sets of 8-ways
6 µops per line
Complex rules...

BP → µop Cache

Branch Predictor

Fetch Address

µop Cache

Bypass

µop Cache → Rename

µop Cache

µops & addrs (in predicted order)

Rename

Renaming

16 architectural registers
100+ physical registers
"Rewrite" µops to use them
Unlock ILP!
Allocates load/store buffers

Renaming


xmm0 = rd64(rsi + rax)
flags = compare(xmm0, rd64(rdi + rax))
if (flags.be) goto t19
rax = rax + 8
flags = compare(rax, 524288) if (flags.ne) goto t08
xmm0 = rd64(rsi + rax)
flags = compare(xmm0, rd64(rdi + rax))
if (flags.be) goto t19
rax = rax + 8
flags = compare(rax, 524288) if (flags.ne) goto t08

Renaming


xmm0 = rd64(rsi + rax)
  flags = compare(xmm0, rd64(rdi + rax))
    if (flags.be) goto t19
    rax = rax + 8
      flags = compare(rax, 524288) if (flags.ne) goto t08
      xmm0 = rd64(rsi + rax)
        flags = compare(xmm0, rd64(rdi + rax))
          if (flags.be) goto t19
          rax = rax + 8
            flags = compare(rax, 524288) if (flags.ne) goto t08

7 CPU cycles

Renaming

Register Alias Table (RAT)
Maps arch register → internal register
reg->reg moves
Zeroing idioms

Renaming


xmm0_1 = rd64(rsi + rax_1)
flags_1 = compare(xmm0_1, rd64(rdi + rax_1))
if (flags_1.be) goto t19
rax_2 = rax_1 + 8
flags_2 = compare(rax_2, 524288) if (flags_2.ne) goto t08
xmm0_2 = rd64(rsi + rax_2)
flags_3 = compare(xmm0_2, rd64(rdi + rax_2))
if (flags_3.be) goto t19
rax_3 = rax_2 + 8
flags_4 = compare(rax_3, 524288) if (flags_4.ne) goto t08

Renaming


xmm0_1 = rd64(rsi + rax_1)
  flags_1 = compare(xmm0_1, rd64(rdi + rax_1))
    if (flags_1.be) goto t19
rax_2 = rax_1 + 8
  flags_2 = compare(rax_2, 524288) if (flags_2.ne) goto t08
  xmm0_2 = rd64(rsi + rax_2)
    flags_3 = compare(xmm0_2, rd64(rdi + rax_2))
      if (flags_3.be) goto t19
  rax_3 = rax_2 + 8
    flags_4 = compare(rax_3, 524288) if (flags_4.ne) goto t08

4 CPU cycles (vs 7)

Rename → Reorder buffer read

Rename

Renamed µops

Reorder buffer read

Reorder buffer

First out of order stage!
Blocks µops awaiting operands
Reads from PRF or internal reg

ROB read → Reservation Station

Reorder buffer read

µops with operand values

Reservation Station

Schedules ready µops
Execution ports

Reservation Station → Execute

Reservation Station

µops with operand values

Execute port 0

Execute port 1

Execute port 2

Execute port 3

Execute port 4

Execute port 5

Execute port 6

Execute port 7

Execute

Something actually happens!!

Execution ports (Skylake)

Port 0	Port 1	Port 2	Port 3	Port 4	Port 5	Port 6	Port 7
ALU 1	ALU 1	load & addr	load & addr	store	ALU 1	ALU 1	addr
vec str 3	vec alu 3				vector permute	branch 1-2
FPU 4	FPU 4				x87 FPU
branch 1-2	vec mul 5				PCLMUL 7
divide & sqrt

Execute → Reorder buffer write

Execute

µops with output values

Reorder buffer write

Results become available
Bypass back to ROB read

ROB write → Retire

Reorder buffer write

µops with output values

Retire

Retirement

Back in order!
"Commit" instruction
Write to PRF
Flush load/store buffer
Exceptions
Page faults
IRQs

And that's all there is to it!

The Core2 Pipeline

Branch prediction
Fetch
Decode
µop cache
Rename

Reorder buffer read
Reservation station
Execution
Reorder buffer write
Retire

Thank you!

All CE folks
Sponsor me on GH or Patreon!
twoscomplement.org podcast