My problem
- 5000+ read-only archived JSON log files on NFS
- Average 450MiB gzipped, 4GiB uncompressed
- Need to find entries by a unique ID
zgrep
$ zgrep -E '^\{"eventId":63181572,' audit.log.gz
{"eventId":63181572,...}
- Takes 14 seconds
- zgrep has to uncompress every byte of the file to search it!
- Could index the file...
- ...but then we'd need to be able to seek within the gzip file
Inside a gzipped file
- Sequence of blocks
- Blocks may be uncompressed or compressed
- Compressed blocks use Huffman-encoded LZ77
LZ77 decoder
Each symbol can be:
- A literal byte (e.g. 'A' or '\xff')
- A back-reference: a length and a distance ("Copy 5, distance 3")
- The "end of block" marker
Example
"Nom nom nom gzip files!!!!!"
Compressed block   | Output
(start)            | ""
"Nom n"            | "Nom n"
Copy 7, distance 4 | "Nom nom nom "
"gzip files!"      | "Nom nom nom gzip files!"
Copy 4, distance 1 | "Nom nom nom gzip files!!!!!"
End of block       | "Nom nom nom gzip files!!!!!"
Seeking
[Diagram: the uncompressed stream (blocks "Uncompressed 1", "Uncompressed 2", "Uncompressed 3" starting at offsets 0, 45632, 88775) lined up against the gzip file (blocks "Comp block 1", "Comp block 2", "Comp block 3" starting at offsets 0, 3788, 9876)]

Offset | gzip offset | Context
45632  | 3788        | ...ressed 1
88775  | 9876        | ...ressed 2

- "Context" holds the trailing bytes of uncompressed output at each checkpoint (hence the truncated "...ressed 1")
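One way to represent a row of that checkpoint table in code. The field names are mine, not zindex's actual on-disk schema:

  #include <stdint.h>

  #define WINSIZE 32768U  /* DEFLATE's LZ77 window: 32 KiB */

  typedef struct {
      uint64_t out;   /* "Offset": position in the uncompressed stream */
      uint64_t in;    /* "gzip offset": byte position in the .gz file  */
      int bits;       /* DEFLATE block boundaries aren't byte-aligned, */
                      /* so a bit offset into that byte is needed too  */
      unsigned char window[WINSIZE];  /* "Context": last 32 KiB output */
      unsigned window_len;            /* < WINSIZE near start of file  */
  } Checkpoint;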
zindex
- Decompress the whole file once, recording each line's offset in the index
- Checkpoint the gzip decompressor's state every few MiB
- Each checkpoint stores 32KiB of data (the LZ77 window; see the sketch below)
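This is the recipe from zlib's examples/zran.c, which is (as I understand it) the technique zindex builds on. A trimmed sketch, assuming zlib >= 1.2.8 for inflateGetDictionary(), with error handling and the line-offset bookkeeping elided; Checkpoint and WINSIZE are from the sketch above:

  #include <stdio.h>
  #include <stdlib.h>
  #include <zlib.h>

  #define CHUNK 16384
  #define SPAN  (4L << 20)  /* aim for a checkpoint every ~4 MiB out */

  /* Decompress the whole file once, recording a Checkpoint at the
   * first DEFLATE block boundary after every SPAN output bytes. */
  static int build_checkpoints(FILE *f, Checkpoint **out, int *n) {
      z_stream s = {0};
      unsigned char in[CHUNK], buf[WINSIZE];
      Checkpoint *list = NULL;
      int count = 0, ret;
      uint64_t totin = 0, totout = 0, last = 0;

      if (inflateInit2(&s, 47) != Z_OK)  /* 32+15: expect gzip header */
          return -1;
      do {
          s.avail_in = (uInt)fread(in, 1, CHUNK, f);
          s.next_in = in;
          if (s.avail_in == 0) {         /* EOF early: truncated file */
              inflateEnd(&s); free(list); return -1;
          }
          do {
              if (s.avail_out == 0) { s.avail_out = WINSIZE; s.next_out = buf; }
              totin += s.avail_in;
              totout += s.avail_out;
              ret = inflate(&s, Z_BLOCK); /* stop at block boundaries */
              totin -= s.avail_in;
              totout -= s.avail_out;
              if (ret == Z_NEED_DICT || ret == Z_MEM_ERROR || ret == Z_DATA_ERROR) {
                  inflateEnd(&s); free(list); return -1;
              }
              if (ret == Z_STREAM_END)
                  break;
              /* data_type bit 7: at a block boundary; bit 6: end of
               * stream; bits 0-2: bit offset within the last byte. */
              if ((s.data_type & 128) && !(s.data_type & 64) &&
                  totout - last >= SPAN) {
                  list = realloc(list, (count + 1) * sizeof *list);
                  Checkpoint *cp = &list[count++];
                  cp->out = totout;
                  cp->in = totin;
                  cp->bits = s.data_type & 7;
                  /* Grab the current sliding window: the "Context". */
                  inflateGetDictionary(&s, cp->window, &cp->window_len);
                  last = totout;
              }
          } while (s.avail_in);
      } while (ret != Z_STREAM_END);
      inflateEnd(&s);
      *out = list;
      *n = count;
      return 0;
  }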
zq
- Look up the queried ID's line number in the index
- Look up that line's file offset in the index
- Find the latest checkpoint prior to that offset
- Initialize the gzip library with the checkpoint's 32KiB context
- Decompress and discard until the offset is reached (see the sketch below)
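A matching sketch of the random-access read, again following zran.c; inflatePrime(), inflateSetDictionary() and raw inflate are real zlib calls, while Checkpoint, WINSIZE and CHUNK come from the sketches above. zindex's actual code will differ in the details:

  #include <stdint.h>
  #include <stdio.h>
  #include <zlib.h>

  /* Read `len` bytes starting at uncompressed offset `target`,
   * resuming from the latest checkpoint at or before it. */
  static int extract(FILE *f, const Checkpoint *cp, uint64_t target,
                     unsigned char *dst, unsigned len) {
      z_stream s = {0};
      unsigned char in[CHUNK], discard[WINSIZE];
      uint64_t skip = target - cp->out;  /* bytes to inflate and drop */
      int ret;

      if (inflateInit2(&s, -15) != Z_OK) /* -15: raw DEFLATE, no header */
          return -1;
      /* Block boundaries aren't byte-aligned: back up one byte and
       * hand inflate the leftover bits of it. */
      fseeko(f, (off_t)(cp->in - (cp->bits ? 1 : 0)), SEEK_SET);
      if (cp->bits) {
          int byte = getc(f);
          inflatePrime(&s, cp->bits, byte >> (8 - cp->bits));
      }
      /* Restore the 32 KiB of context back-references point into. */
      inflateSetDictionary(&s, cp->window, cp->window_len);

      do {
          if (s.avail_in == 0) {
              s.avail_in = (uInt)fread(in, 1, CHUNK, f);
              s.next_in = in;
              if (s.avail_in == 0) break;  /* truncated input */
          }
          if (skip > 0) {         /* still before target: discard */
              s.avail_out = skip < WINSIZE ? (uInt)skip : WINSIZE;
              s.next_out = discard;
          } else {                /* at target: fill caller's buffer */
              s.avail_out = len;
              s.next_out = dst;
          }
          uInt want = s.avail_out;
          ret = inflate(&s, Z_NO_FLUSH);
          uInt got = want - s.avail_out;
          if (skip > 0) skip -= got;
          else { dst += got; len -= got; }
      } while (ret == Z_OK && len > 0);
      inflateEnd(&s);
      return len == 0 ? 0 : -1;
  }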
Building an index
$ zindex \
--regex '^\{"eventId":([0-9]+),' \
--unique --numeric audit.log.gz
- Creates a SQLite index file
- 1.6GiB compressed input, 63MiB index
- Takes 90s
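- As I read the flags: the regex's capture group (here, the numeric event ID) becomes the lookup key, --numeric stores it as an integer, and --unique asserts each key appears on at most one line; check zindex --help for your version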
Querying
$ zq audit.log.gz 63181572
{"eventId":63181572,...}
- Takes 0.03s
- 47,000% faster than zgrep (14s)