Memory & Caches

How they work, and how they affect performance

Matt Godbolt @mattgodbolt

Outline

  • Physical memory
  • Caches
  • Measuring
  • Virtual memory
  • Multiprocessing

Types of RAM

  • Static RAM
  • Dynamic RAM
  • Non-volatile

1 Bit of Static RAM

Static RAM

  • 6 transistors per bit
  • Requires constant power
  • Logic-level
  • Repeating layout
  • <1ns access times
  • Why not make all RAM this way?

    COST & POWER!

1 Bit of Dynamic RAM

Dynamic RAM

  • 1 transistor & 1 capacitor per bit
  • Even simpler repeating layout
  • No power needed unless active
  • So what's the catch?

Dynamic RAM drawbacks

  • Capacitor leaks charge
    • Requires refresh
  • It's analogue!
    • Needs ADC
    • Takes time to charge/discharge
  • Chance of crosstalk
    • Row hammer attack!
  • Used in external chips

Array of Dynamic RAM

Reading from DRAM (short version)

  • Precharge bit-lines (tRP ~18 cycles)
  • Select row (tRCD ~18 cycles)
  • Sense & latch row data (~8192 bits)
    • Feedback starts recharging cells
  • Select column (CAS ~16 cycles)
    • Data is now available
  • Need to wait for next precharge (tRAS ~36 memory cycles)
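
Adding up the figures above: a read that needs a fresh row pays roughly tRP + tRCD + CAS ≈ 18 + 18 + 16 = 52 memory-bus cycles before the first data arrives, which is why the access-time numbers later in this talk matter far more than the raw transfer rate.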

DRAM timings — read

Image source: Ulrich Drepper

DRAM timings — precharge

Image source: Ulrich Drepper

Memory Controller

  • Embedded in CPU
  • Opens/closes rows
  • Reorders load/stores
  • Schedules refreshes (~64ms)
  • Handles DMA
Image source: Ulrich Drepper

Getting DRAM data out

  • A DDR4-3200 bus runs at 1600MHz
  • 64 bits per transfer, two transfers per clock (double-pumped)
  • 1600MHz × 2 × 64 bits ÷ 8 bits/byte
    = 25,600MB/sec

Access times

  • First word: 10.67ns
  • Fourth word: 11.67ns
  • Eighth word: 13.00ns

Access times can dominate over data rates

Caching

  • Static RAM is small and expensive
  • Dynamic RAM is cheap, large & slow
  • Use SRAM as cache for DRAM!
  • But how to do this in hardware?
  • Break into "entries" that can be cached

Cache entries

  • Organise data in "lines" of 64 bytes
    • Bottom 6 address bits index into this
  • Use some address bits to choose a "set"
    • e.g. 6 bits for L1 cache; 11 for L2; 13+ for L3
  • Remaining bits are a "tag"
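
For example, a 32KB, 8-way L1 with 64-byte lines has 32768 / 64 / 8 = 64 sets, so S = 6 set bits.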

Cache entries

bits 63 .. 6+S          bits 6+S-1 .. 6    bits 5 .. 0
Tag (64 - 6 - S bits)   Set (S bits)       Offset (6 bits)

Cache lookup

  • The S bits choose a set
  • Each set has a number of "ways" (e.g. 8)
  • Each way has "tag" and 64-bytes of cache data
  • Search all the ways for a match with "tag"

Cache lookup


#include <cstdint>

// Cache config parameters:
constexpr auto NumLineBits = 6u; // 6 bits on Intel CPUs, 64-byte lines
constexpr auto NumSetBits = 6u;  // 6 bits on Intel L1 caches
constexpr auto NumWays = 8u;     // 8 ways on Intel L1 cache

constexpr auto LineSize = 1u << NumLineBits;
struct Line { char data[LineSize]; };

struct Way { uint64_t tag; Line line; };

struct Set { Way ways[NumWays]; };

constexpr auto NumSets = 1u << NumSetBits;
struct Cache {
  Set sets[NumSets];
  int readByte(uint64_t addr) const; // defined on the next slide
};

Cache lookup


int Cache::readByte(uint64_t addr) const {
  constexpr auto LineMask = (1u << NumLineBits) - 1u;
  auto lineOffset = addr & LineMask;

  constexpr auto SetMask = (1u << NumSetBits) - 1u;
  const Set &set = sets[(addr >> NumLineBits) & SetMask];

  constexpr auto TagMask = ~((uint64_t{1} << (NumLineBits + NumSetBits)) - 1);
  auto tag = addr & TagMask;

  for (const auto &way : set.ways)
    if (way.tag == tag)
      return way.line.data[lineOffset];
  return -1; // miss (a real cache would now fetch the line from the next level)
}

We just implemented a bad hash map!

  • Bad hash function
  • Fixed size bucket chain
  • Why is it this way?

Caching

Layers

  • Physical proximity important
  • Caches layered: L1, L2, L3
  • Lower-numbered cache smaller, nearer, faster
  • L3 shared between all cores (on x86)
  • Levels may be inclusive or exclusive of each other

Timings

Approximate figures for Sandy Bridge:

Level   Size       Access time
L1      32KB       ~3 cycles
L2      256KB      ~8 cycles
L3      10-20MB    ~30-40 cycles
(RAM)              ~250 cycles

Real-world measurements

  • Measured on a Westmere X5690
    (32KB L1 / 256KB L2 / 12MB L3)
  • Timed pointer chasing
  • Vary:
    • Element size
    • Total # elements
    • Order of links

Real-world measurements


#include <cstdint>

// ElementSize and NumElements were varied per run; example values shown here.
constexpr auto ElementSize = 64u;      // bytes per element (must exceed sizeof(Element *))
constexpr auto NumElements = 1u << 20; // total number of elements

struct Element {
  Element *next;
  uint8_t padding[ElementSize - sizeof(Element *)];
};

Element elements[NumElements];
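
A minimal sketch of the timing harness using the Element struct above, assuming the elements have already been linked in the chosen order (sequential or shuffled); illustrative, not the talk's exact code:

#include <chrono>
#include <cstdio>

// Follow `next` pointers for `hops` steps and report nanoseconds per hop.
// Each hop is a dependent load, so the CPU cannot overlap the misses.
void timeChase(const Element *start, long hops) {
  const auto begin = std::chrono::steady_clock::now();
  const Element *p = start;
  for (long i = 0; i < hops; ++i)
    p = p->next;
  const auto end = std::chrono::steady_clock::now();
  const auto ns = std::chrono::duration<double, std::nano>(end - begin).count();
  // Print `p` so the compiler can't optimise the whole chase away.
  std::printf("ended at %p: %.2f ns/hop\n", static_cast<const void *>(p), ns / hops);
}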

Real-world measurements

Sequential

Result: Sequential reads

Discussion


  • Why so fast?
  • Automatic prefetching
  • Cache spots patterns in misses

Result: Sequential reads

Discussion


  • 8192 byte stride?
    • L3 cache is 12-way 12288-set associative
    • 1024, 2048, 4096 divide into 12288
    • 8192 doesn't

Real-world measurements

Random

Sequential vs Random Reads

Discussion


  • Why worse than 250 cycles?
    • Prefetching can hurt
  • Linear scan 5x faster
  • Explicit prefetch instructions are available (see the sketch after this list)
    • Hard to get right
    • Overrides Critical Word Load
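
A hedged sketch of explicit prefetching using the GCC/Clang __builtin_prefetch intrinsic (compiler-specific; the distance constant is a made-up tuning parameter you would have to measure):

#include <cstddef>

// Sum an array while asking the cache to start fetching data a little ahead
// of where we are reading. Too small a distance does nothing; too large
// evicts useful data before we reach it.
long sumWithPrefetch(const long *data, std::size_t n) {
  constexpr std::size_t PrefetchDistance = 16; // elements ahead (hypothetical)
  long total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (i + PrefetchDistance < n)
      __builtin_prefetch(&data[i + PrefetchDistance], /*rw=*/0, /*locality=*/3);
    total += data[i];
  }
  return total;
}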

Writes vs Reads

Writes

Virtual memory

  • How do virtual addresses fit into this?
  • Page tables take 4 dependent reads to walk, one per level (see the sketch below)
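
As an illustration of why the walk takes four reads: a 48-bit x86-64 virtual address carries four 9-bit table indices plus a 12-bit page offset, one index per level (a sketch; the level names are the usual x86-64 ones, not from the talk):

#include <cstdint>
#include <cstdio>

// Split a 48-bit x86-64 virtual address into its four 9-bit page-table
// indices (PML4, PDPT, PD, PT) and the 12-bit page offset.
void splitVirtualAddress(uint64_t addr) {
  auto offset = addr & 0xfffu;          // bits 11:0
  auto pt     = (addr >> 12) & 0x1ffu;  // bits 20:12
  auto pd     = (addr >> 21) & 0x1ffu;  // bits 29:21
  auto pdpt   = (addr >> 30) & 0x1ffu;  // bits 38:30
  auto pml4   = (addr >> 39) & 0x1ffu;  // bits 47:39
  std::printf("PML4=%llu PDPT=%llu PD=%llu PT=%llu offset=%llu\n",
              (unsigned long long)pml4, (unsigned long long)pdpt,
              (unsigned long long)pd, (unsigned long long)pt,
              (unsigned long long)offset);
}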

TLBs

  • Translation cache; two-level, 4-way (Sandy Bridge)
  • L1 DTLB: 64 entries for 4KB pages / 32 entries for 2MB pages
  • L2 TLB: 512 entries for 4KB pages
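
That gives the TLBs limited reach: 64 × 4KB = 256KB of address space with small pages, but 32 × 2MB = 64MB with huge pages, which is one reason huge pages (below) help.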

Virtual Memory & Caching

  • L1 is fast & small
    • Physically tagged, virtually indexed
    • 64 sets: bits 11:6 of the address
    • Those bits sit inside the 4KB page offset, so they need no translation
    • DTLB lookup proceeds in parallel with the cache fetch
  • L2 and above
    • Cache physical addresses

Example with Huge Pages

More on Huge Pages


~15 cycle improvement

Multiprocessing

  • Bus between CPUs is fast... but not that fast
  • TLB, L1 & L2 are local to a core (shared only by hyperthread siblings)
  • L3 is shared by all cores on a socket
  • How do we ensure consistency?

MESI

  • Modified — exclusive, dirty
  • Exclusive — exclusive, non-dirty
  • Shared — possibly shared, non-dirty
  • Invalid

MESI

  • Avoid RFOs (Requests For Ownership)
  • The minimum coherency transaction is a whole cache line
  • Pad shared data to cache line size (sketch below)
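
A sketch of that padding advice, assuming a C++17 standard library that provides std::hardware_destructive_interference_size (alignas(64) is a common fallback on x86):

#include <atomic>
#include <new> // std::hardware_destructive_interference_size

// Per-thread counters. Without the alignas, neighbouring counters could share
// a cache line, and every increment would bounce that line between cores via
// RFOs (false sharing).
struct alignas(std::hardware_destructive_interference_size) PaddedCounter {
  std::atomic<long> value{0};
};

PaddedCounter counters[8]; // one per thread, each on its own cache line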

NUMA

  • Most high-end machines are NUMA
  • Each socket has its own local RAM
  • Accessing a remote socket's RAM is slower

What can we do?

Make good use of cache

  • Keep data small
  • Prefer linear scans
  • Avoid chasing pointers (see the example below)
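
One concrete (and simplified) illustration of the last two points, not taken from the talk: a contiguous std::vector walks memory linearly, while a node-based std::list turns every element into a dependent pointer load.

#include <list>
#include <vector>

// Contiguous storage: sequential loads the prefetcher can stream.
long sumVector(const std::vector<long> &v) {
  long total = 0;
  for (long x : v) total += x;
  return total;
}

// Node-based storage: each element is a pointer chase, likely a cache miss.
long sumList(const std::list<long> &l) {
  long total = 0;
  for (long x : l) total += x;
  return total;
}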

Maybe...

  • Prefetch
  • Bypass cache

Allocate carefully

  • Allocate from the right NUMA node
  • Use huge pages for giant slabs (sketch below)
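
A Linux-specific sketch of the huge-page advice using mmap with MAP_HUGETLB (assumes the system has huge pages reserved; madvise(MADV_HUGEPAGE) for transparent huge pages is an alternative). For the NUMA bullet, libnuma's numa_alloc_onnode is the usual route.

#include <sys/mman.h>
#include <cstddef>

// Try to map `size` bytes backed by huge pages (typically 2MB); fall back to
// normal 4KB pages if none are available.
void *allocateSlab(std::size_t size) {
  void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED)
    p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}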

Thanks!

Questions?

Based on: "What Every Programmer Should Know About Memory"
by Ulrich Drepper
https://www.akkadia.org/drepper/cpumemory.pdf