zindex

My problem

  • 1,000s of historical giant gzip text files
  • Needed to be able to search
  • zgrep is cool!
  • ...but slow

godbolt:~ $ ls -hl some-file.csv.gz
505M Oct 14 14:38 some-file.csv.gz
godbolt:~ $ time zgrep AAPL.O some-file.csv.gz | wc -l
864221
Executed in   27.58 secs

Enter zq


godbolt:~ $ time zq some-file.csv.gz AAPL.O | wc -l
864221
Executed in    4.40 secs

The catch


godbolt:~ $ time zindex -d , -f 1 some-file.csv.gz
Executed in  108.26 secs

Dirty secret


godbolt:~ $ ls -hl some-file.csv.gz.zindex
2.3G Oct 14 14:43 some-file.csv.gz.zindex

More realistic


godbolt:~ $ zindex ~/some-file.csv.gz
godbolt:~ 14.6s $ ls -lh ~/some-file.csv.gz.zindex
629K Oct 14 16:58 some-file.csv.gz.zindex

So how does it work?

How does DEFLATE work?

  • Lempel-Ziv
  • Huffman coding

Lempel-Ziv

compress the compressible
0         1         2
0123456789012345678901234
compress the compressible
"compress the "
(go back 13, copy 8)
"ible"
13 + (backref, len) + 4
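The step above can be sketched as a minimal LZ decoder (illustration only, not zindex code): "compress the " is 13 literal bytes, so the back-reference to the earlier "compress" reaches back 13 and copies 8, then 4 more literals finish the word.

```python
def lz_decode(tokens):
    """Decode a token stream of literal byte runs and (distance, length)
    back-references into the original bytes."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, bytes):
            out += tok                  # literal run
        else:
            dist, length = tok          # back-reference into history
            for _ in range(length):
                out.append(out[-dist])  # byte-at-a-time, so overlaps work
    return bytes(out)

print(lz_decode([b"compress the ", (13, 8), b"ible"]))
# b'compress the compressible'
```

Copying byte-at-a-time matters: a back-reference whose length exceeds its distance legitimately overlaps itself (e.g. distance 1, length 100 repeats one byte 100 times).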

Backref?

Use new byte values for backref!
  • 256 = end of block
  • 257 = copy 3
  • 258 = copy 4
  • 259 = copy 5
  • ...etc...
(offsets coded after)
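A sketch of that alphabet, per RFC 1951 (the DEFLATE spec): symbols 0-255 are plain bytes, 256 ends the block, and 257 upward encode copy lengths. The first few length symbols need no extra bits; longer lengths append extra bits after the symbol, which this toy omits.

```python
def base_length(symbol):
    """Copy length for the first eight DEFLATE length symbols.

    Valid for symbols 257..264 only (lengths 3..10); symbols 265+
    carry extra bits and are omitted in this sketch.
    """
    assert 257 <= symbol <= 264
    return symbol - 254

print([(s, base_length(s)) for s in range(257, 261)])
# [(257, 3), (258, 4), (259, 5), (260, 6)]
```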

But Matt, bytes are 8 bits?!?

Didn't we just make everything worse?

Worry not - Huffman to the rescue!

Huffman Trees

compress the compressible
s e ' ' r p m o c l b i h t
4 4 2 2 2 2 2 2 1 1 1 1 1

Magic

s 00
' ' 010
e 101
l 0110
p 1100
c 1001
r 1101
o 1110
m 1111
i 01110
b 01111
t 10000
h 10001
compress the compressible
1001 1110 1111 1100 1101 101 00 00 010 # compress_
10000 10001 101 010 # the_
1001 1110 1111 1100 1101 101 00 00 # compress
01110 01111 0110 101 # ible
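The "magic" is an ordinary Huffman tree build: repeatedly merge the two least frequent nodes. A sketch (tie-breaking differs between implementations, so the exact bit patterns may not match the table above, but frequent symbols still get the shortest codes):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free code: merge the two lowest-frequency nodes
    until one tree remains, prefixing 0/1 as subtrees combine."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
        i += 1
    return heap[0][2]

codes = huffman_codes("compress the compressible")
# 's' and 'e' (4 occurrences each) end up with the shortest codes,
# singletons like 't' and 'h' with the longest.
```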
Huffman

Combining with LZ

  • Symbols not restricted to 8 bits
  • Backreferences, offsets, lengths go in the dictionary too!
LZ+Huffman

DEFLATE & zindex

  • Supports streaming
  • Limited backreference window - 32,768
  • Resets Huffman coding every ~100k
  • Each reset has minimal state
  • Checkpoint!

Live demo!

zindex

  • Scan compressed file
  • Every XX bytes, store AccessPoint at next block boundary
  • Store line offsets, indices

zq

  • Query index
  • Find nearest AccessPoint
  • Coerce zlib
  • Discard until actual output
  • Profit
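Both halves can be sketched with Python's zlib. This is an in-memory toy of the idea, not how zindex is actually written (real zindex serializes the 32 KiB window and line offsets into an index file on disk; every name below is invented for illustration):

```python
import zlib

CHUNK = 16384  # read the compressed file in fixed-size pieces

def build_index(path, every=4 * 1024 * 1024):
    """Scan once; snapshot the inflate state roughly every `every`
    bytes of uncompressed output (the AccessPoint idea, in memory)."""
    points = []  # (uncompressed_offset, compressed_offset, saved state)
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # | 16: accept gzip wrapper
    out_pos = comp_pos = last = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            out_pos += len(d.decompress(chunk))
            comp_pos += len(chunk)
            if out_pos - last >= every:
                points.append((out_pos, comp_pos, d.copy()))
                last = out_pos
    return points

def read_at(path, points, target, size):
    """Return `size` uncompressed bytes starting at offset `target`,
    resuming from the nearest earlier access point."""
    candidates = [(0, 0, None)] + [p for p in points if p[0] <= target]
    out_pos, comp_pos, state = max(candidates, key=lambda p: p[0])
    d = state.copy() if state else zlib.decompressobj(zlib.MAX_WBITS | 16)
    result = bytearray()
    with open(path, "rb") as f:
        f.seek(comp_pos)
        while len(result) < (target - out_pos) + size:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            result += d.decompress(chunk)
    skip = target - out_pos  # discard until actual output
    return bytes(result[skip:skip + size])
```

`decompressobj.copy()` only works because snapshots are taken at input chunk boundaries, so resuming from the recorded compressed offset replays exactly the bytes the saved state expects next.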
hellige's au does it too
https://github.com/hellige/au