zindex

My problem

  • 1,000s of historical giant gzip text files
  • Needed to be able to search
  • zgrep is cool!
  • ...but slow

godbolt:~ $ ls -hl some-file.csv.gz
505M Oct 14 14:38 some-file.csv.gz
godbolt:~ $ time zgrep AAPL.O some-file.csv.gz | wc -l
864221
Executed in   27.58 secs

Enter zq


godbolt:~ $ time zq some-file.csv.gz AAPL.O | wc -l
864221
Executed in    4.40 secs

The catch


godbolt:~ $ time zindex -d , -f 1 some-file.csv.gz
Executed in  108.26 secs

Dirty secret


godbolt:~ $ ls -hl some-file.csv.gz.zindex
2.3G Oct 14 14:43 some-file.csv.gz.zindex

More realistic


godbolt:~ $ zindex ~/some-file.csv.gz
godbolt:~ 14.6s $ ls -lh ~/some-file.csv.gz.zindex
629K Oct 14 16:58 some-file.csv.gz.zindex

So how does it work?

How does DEFLATE work?

  • Lempel-Ziv
  • Huffman coding

Lempel-Ziv

compress the compressible
0         1         2
0123456789012345678901234
compress the compressible
"compress the "
(go back 13, copy 8)
"ible"
13 + (backref, len) + 4
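The step above can be sketched as a minimal LZ decoder (illustration only, not zindex code): "compress the " is 13 literal bytes, so the back-reference to the earlier "compress" reaches back 13 and copies 8, then 4 more literals finish the word.

```python
def lz_decode(tokens):
    """Decode a token stream of literal byte runs and (distance, length)
    back-references into the original bytes."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, bytes):
            out += tok                  # literal run
        else:
            dist, length = tok          # back-reference into history
            for _ in range(length):
                out.append(out[-dist])  # byte-at-a-time, so overlaps work
    return bytes(out)

print(lz_decode([b"compress the ", (13, 8), b"ible"]))
# b'compress the compressible'
```

Copying byte-at-a-time matters: a back-reference whose length exceeds its distance legitimately overlaps itself (e.g. distance 1, length 100 repeats one byte 100 times).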

Backref?

Use new byte values for backref!
  • 256 = end of block
  • 257 = copy 3
  • 258 = copy 4
  • 259 = copy 5
  • ...etc...
(offsets coded after)
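A sketch of that alphabet, per RFC 1951 (the DEFLATE spec): symbols 0-255 are plain bytes, 256 ends the block, and 257 upward encode copy lengths. The first few length symbols need no extra bits; longer lengths append extra bits after the symbol, which this toy omits.

```python
def base_length(symbol):
    """Copy length for the first eight DEFLATE length symbols.

    Valid for symbols 257..264 only (lengths 3..10); symbols 265+
    carry extra bits and are omitted in this sketch.
    """
    assert 257 <= symbol <= 264
    return symbol - 254

print([(s, base_length(s)) for s in range(257, 261)])
# [(257, 3), (258, 4), (259, 5), (260, 6)]
```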

But Matt, bytes are 8 bits?!?

Didn't we just make everything worse?

Worry not - Huffman to the rescue!

Huffman Trees

compress the compressible
s e ' ' r p m o c l b i h t
4 4 2 2 2 2 2 2 1 1 1 1 1

Magic

s 00
' ' 010
e 101
l 0110
p 1100
c 1001
r 1101
o 1110
m 1111
i 01110
b 01111
t 10000
h 10001
compress the compressible
1001 1110 1111 1100 1101 101 00 00 010 # compress_
10000 10001 101 010 # the_
1001 1110 1111 1100 1101 101 00 00 # compress
01110 01111 0110 101 # ible
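The "magic" is an ordinary Huffman tree build: repeatedly merge the two least frequent nodes. A sketch (tie-breaking differs between implementations, so the exact bit patterns may not match the table above, but frequent symbols still get the shortest codes):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free code: merge the two lowest-frequency nodes
    until one tree remains, prefixing 0/1 as subtrees combine."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
        i += 1
    return heap[0][2]

codes = huffman_codes("compress the compressible")
# 's' and 'e' (4 occurrences each) end up with the shortest codes,
# singletons like 't' and 'h' with the longest.
```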
Huffman

Combining with LZ

  • Symbols not restricted to 8 bits
  • Backreferences, offsets, lengths go in the dictionary too!
LZ+Huffman

DEFLATE & zindex

  • Supports streaming
  • Limited backreference window - 32,768
  • Resets Huffman coding every ~100k
  • Each reset has minimal state
  • Checkpoint!

Live demo!

zindex

  • Scan compressed file
  • Every XX bytes, store AccessPoint at next block boundary
  • Store line offsets, indices

zq

  • Query index
  • Find nearest AccessPoint
  • Coerce zlib
  • Discard until actual output
  • Profit
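Both halves can be sketched with Python's zlib. This is an in-memory toy of the idea, not how zindex is actually written (real zindex serializes the 32 KiB window and line offsets into an index file on disk; every name below is invented for illustration):

```python
import zlib

CHUNK = 16384  # read the compressed file in fixed-size pieces

def build_index(path, every=4 * 1024 * 1024):
    """Scan once; snapshot the inflate state roughly every `every`
    bytes of uncompressed output (the AccessPoint idea, in memory)."""
    points = []  # (uncompressed_offset, compressed_offset, saved state)
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # | 16: accept gzip wrapper
    out_pos = comp_pos = last = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            out_pos += len(d.decompress(chunk))
            comp_pos += len(chunk)
            if out_pos - last >= every:
                points.append((out_pos, comp_pos, d.copy()))
                last = out_pos
    return points

def read_at(path, points, target, size):
    """Return `size` uncompressed bytes starting at offset `target`,
    resuming from the nearest earlier access point."""
    candidates = [(0, 0, None)] + [p for p in points if p[0] <= target]
    out_pos, comp_pos, state = max(candidates, key=lambda p: p[0])
    d = state.copy() if state else zlib.decompressobj(zlib.MAX_WBITS | 16)
    result = bytearray()
    with open(path, "rb") as f:
        f.seek(comp_pos)
        while len(result) < (target - out_pos) + size:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            result += d.decompress(chunk)
    skip = target - out_pos  # discard until actual output
    return bytes(result[skip:skip + size])
```

`decompressobj.copy()` only works because snapshots are taken at input chunk boundaries, so resuming from the recorded compressed offset replays exactly the bytes the saved state expects next.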
hellige's au does it too
https://github.com/hellige/au