My problem
- 5000+ read-only archived JSON log files on NFS
- Average 450MiB gzipped, 4GiB uncompressed
- Need to find entries by a unique ID
zgrep
$ zgrep -E '^\{"eventId":63181572,' audit.log.gz
{"eventId":63181572,...}
- Takes 14 seconds
- zgrep has to uncompress every byte of the file to search it!
- Could index the file...
- ...but then we'd need to be able to seek within the gzip file
Inside a gzipped file
- Sequence of blocks
- Blocks may be uncompressed or compressed
- Compressed blocks use Huffman-encoded LZ77
LZ77 decoder
Each symbol can be:
- A literal byte (e.g. 'A' or '\xff')
- A back-reference: a length and a distance ("Copy 5, distance 3")
- The "end of block" marker
Example
"Nom nom nom gzip files!!!!!"
Compressed block   | Output
(start)            | ""
"Nom n"            | "Nom n"
Copy 7, distance 4 | "Nom nom nom "
"gzip files!"      | "Nom nom nom gzip files!"
Copy 4, distance 1 | "Nom nom nom gzip files!!!!!"
End of block       | "Nom nom nom gzip files!!!!!"
Seeking
[Diagram: the uncompressed stream (blocks "Uncompressed 1", "Uncompressed 2", "Uncompressed 3" starting at offsets 0, 45632, 88775) lined up against the gzip file (blocks "Comp block 1", "Comp block 2", "Comp block 3" starting at offsets 0, 3788, 9876)]

Offset | gzip offset | Context
45632  | 3788        | ...ressed 1
88775  | 9876        | ...ressed 2

- "Context" holds the trailing bytes of uncompressed output at each checkpoint (hence the truncated "...ressed 1")
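One way to represent a row of that checkpoint table in code. The field names are mine, not zindex's actual on-disk schema:

  #include <stdint.h>

  #define WINSIZE 32768U  /* DEFLATE's LZ77 window: 32 KiB */

  typedef struct {
      uint64_t out;   /* "Offset": position in the uncompressed stream */
      uint64_t in;    /* "gzip offset": byte position in the .gz file  */
      int bits;       /* DEFLATE block boundaries aren't byte-aligned, */
                      /* so a bit offset into that byte is needed too  */
      unsigned char window[WINSIZE];  /* "Context": last 32 KiB output */
      unsigned window_len;            /* < WINSIZE near start of file  */
  } Checkpoint;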
zindex
- Decompress the whole file once, recording each line's offset in the index
- Checkpoint the gzip decompressor's state every few MiB
- Each checkpoint stores 32KiB of data (the LZ77 window; see the sketch below)
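This is the recipe from zlib's examples/zran.c, which is (as I understand it) the technique zindex builds on. A trimmed sketch, assuming zlib >= 1.2.8 for inflateGetDictionary(), with error handling and the line-offset bookkeeping elided; Checkpoint and WINSIZE are from the sketch above:

  #include <stdio.h>
  #include <stdlib.h>
  #include <zlib.h>

  #define CHUNK 16384
  #define SPAN  (4L << 20)  /* aim for a checkpoint every ~4 MiB out */

  /* Decompress the whole file once, recording a Checkpoint at the
   * first DEFLATE block boundary after every SPAN output bytes. */
  static int build_checkpoints(FILE *f, Checkpoint **out, int *n) {
      z_stream s = {0};
      unsigned char in[CHUNK], buf[WINSIZE];
      Checkpoint *list = NULL;
      int count = 0, ret;
      uint64_t totin = 0, totout = 0, last = 0;

      if (inflateInit2(&s, 47) != Z_OK)  /* 32+15: expect gzip header */
          return -1;
      do {
          s.avail_in = (uInt)fread(in, 1, CHUNK, f);
          s.next_in = in;
          if (s.avail_in == 0) {         /* EOF early: truncated file */
              inflateEnd(&s); free(list); return -1;
          }
          do {
              if (s.avail_out == 0) { s.avail_out = WINSIZE; s.next_out = buf; }
              totin += s.avail_in;
              totout += s.avail_out;
              ret = inflate(&s, Z_BLOCK); /* stop at block boundaries */
              totin -= s.avail_in;
              totout -= s.avail_out;
              if (ret == Z_NEED_DICT || ret == Z_MEM_ERROR || ret == Z_DATA_ERROR) {
                  inflateEnd(&s); free(list); return -1;
              }
              if (ret == Z_STREAM_END)
                  break;
              /* data_type bit 7: at a block boundary; bit 6: end of
               * stream; bits 0-2: bit offset within the last byte. */
              if ((s.data_type & 128) && !(s.data_type & 64) &&
                  totout - last >= SPAN) {
                  list = realloc(list, (count + 1) * sizeof *list);
                  Checkpoint *cp = &list[count++];
                  cp->out = totout;
                  cp->in = totin;
                  cp->bits = s.data_type & 7;
                  /* Grab the current sliding window: the "Context". */
                  inflateGetDictionary(&s, cp->window, &cp->window_len);
                  last = totout;
              }
          } while (s.avail_in);
      } while (ret != Z_STREAM_END);
      inflateEnd(&s);
      *out = list;
      *n = count;
      return 0;
  }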
zq
- Look up the queried ID's line number in the index
- Look up that line's file offset in the index
- Find the latest checkpoint prior to that offset
- Initialize the gzip library with the checkpoint's 32KiB context
- Decompress and discard until the offset is reached (see the sketch below)
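A matching sketch of the random-access read, again following zran.c; inflatePrime(), inflateSetDictionary() and raw inflate are real zlib calls, while Checkpoint, WINSIZE and CHUNK come from the sketches above. zindex's actual code will differ in the details:

  #include <stdint.h>
  #include <stdio.h>
  #include <zlib.h>

  /* Read `len` bytes starting at uncompressed offset `target`,
   * resuming from the latest checkpoint at or before it. */
  static int extract(FILE *f, const Checkpoint *cp, uint64_t target,
                     unsigned char *dst, unsigned len) {
      z_stream s = {0};
      unsigned char in[CHUNK], discard[WINSIZE];
      uint64_t skip = target - cp->out;  /* bytes to inflate and drop */
      int ret;

      if (inflateInit2(&s, -15) != Z_OK) /* -15: raw DEFLATE, no header */
          return -1;
      /* Block boundaries aren't byte-aligned: back up one byte and
       * hand inflate the leftover bits of it. */
      fseeko(f, (off_t)(cp->in - (cp->bits ? 1 : 0)), SEEK_SET);
      if (cp->bits) {
          int byte = getc(f);
          inflatePrime(&s, cp->bits, byte >> (8 - cp->bits));
      }
      /* Restore the 32 KiB of context back-references point into. */
      inflateSetDictionary(&s, cp->window, cp->window_len);

      do {
          if (s.avail_in == 0) {
              s.avail_in = (uInt)fread(in, 1, CHUNK, f);
              s.next_in = in;
              if (s.avail_in == 0) break;  /* truncated input */
          }
          if (skip > 0) {         /* still before target: discard */
              s.avail_out = skip < WINSIZE ? (uInt)skip : WINSIZE;
              s.next_out = discard;
          } else {                /* at target: fill caller's buffer */
              s.avail_out = len;
              s.next_out = dst;
          }
          uInt want = s.avail_out;
          ret = inflate(&s, Z_NO_FLUSH);
          uInt got = want - s.avail_out;
          if (skip > 0) skip -= got;
          else { dst += got; len -= got; }
      } while (ret == Z_OK && len > 0);
      inflateEnd(&s);
      return len == 0 ? 0 : -1;
  }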
Building an index
$ zindex \
--regex '^\{"eventId":([0-9]+),' \
--unique --numeric audit.log.gz
- Creates a SQLite index file
- 1.6GiB compressed input, 63MiB index
- Takes 90s
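- As I read the flags: the regex's capture group (here, the numeric event ID) becomes the lookup key, --numeric stores it as an integer, and --unique asserts each key appears on at most one line; check zindex --help for your version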
Querying
$ zq audit.log.gz 63181572
{"eventId":63181572,...}
- Takes 0.03s
- 47,000% faster than zgrep (14s)