zindex

Scratching your own itch

@mattgodbolt

My problem

  • 5000+ read-only archived JSON log files on NFS
  • Average 450MiB gzipped, 4GiB uncompressed
  • Need to find entries by a unique ID

zgrep

$ zgrep -E '^\{"eventId":63181572,' audit.log.gz
{"eventId":63181572,...}
  • Takes 14 seconds
  • zgrep has to uncompress every byte of the file to search it!
  • Could index the file
  • ...but then we'd need to be able to seek within the gzip file

Inside a gzipped file

  • Sequence of blocks
  • Blocks may be uncompressed or compressed
  • Compressed blocks use Huffman-encoded LZ77
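
For the curious: every DEFLATE block begins with a three-bit header
(RFC 1951), one BFINAL bit plus a two-bit BTYPE. A minimal sketch of
reading it, assuming a hypothetical next_bit() helper over the raw
bit stream:

#include <stdio.h>

/* Hypothetical helper: returns the next bit of the raw deflate
 * stream, least-significant bit of each byte first. */
extern int next_bit(void);

void read_block_header(void) {
    int bfinal = next_bit();   /* 1 = last block in the stream */
    int b0 = next_bit();
    int b1 = next_bit();
    switch (b0 | (b1 << 1)) {  /* 2-bit BTYPE, low bit first */
    case 0: puts("stored (uncompressed) block"); break;
    case 1: puts("compressed block, fixed Huffman tables"); break;
    case 2: puts("compressed block, dynamic Huffman tables"); break;
    case 3: puts("reserved: invalid stream"); break;
    }
    (void)bfinal;
}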

LZ77 decoder

Each symbol can be:

  • A literal byte (e.g. 'A' or '\xff')
  • A back-reference: a length plus a distance (e.g. Copy 5, distance 3)
    • Limited to 32KiB back
  • The "end of block" marker

Example

"Nom nom nom gzip files!!!!!"

Compressed block     Output so far
                     ""
"Nom n"              "Nom n"
Copy 7, distance 4   "Nom nom nom "
"gzip files!"        "Nom nom nom gzip files!"
Copy 4, distance 1   "Nom nom nom gzip files!!!!!"
End of block         "Nom nom nom gzip files!!!!!"
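
Putting the two slides together, here is a toy decoder for exactly
this symbol stream (real DEFLATE additionally Huffman-codes the
symbols, which this sketch skips):

#include <stdio.h>
#include <string.h>

enum kind { LITERAL, COPY, END };
struct symbol { enum kind kind; const char *lit; int len, dist; };

int main(void) {
    /* The "Nom nom nom" example as a symbol stream. */
    struct symbol syms[] = {
        { LITERAL, "Nom n", 0, 0 },
        { COPY, NULL, 7, 4 },
        { LITERAL, "gzip files!", 0, 0 },
        { COPY, NULL, 4, 1 },
        { END, NULL, 0, 0 },
    };
    char out[64];
    size_t n = 0;
    for (struct symbol *s = syms; s->kind != END; ++s) {
        if (s->kind == LITERAL) {
            memcpy(out + n, s->lit, strlen(s->lit));
            n += strlen(s->lit);
        } else {
            /* Copy byte by byte: the length may exceed the distance,
             * so a copy can overlap bytes it has just written (the
             * "Copy 4, distance 1" step relies on this). */
            for (int i = 0; i < s->len; ++i, ++n)
                out[n] = out[n - s->dist];
        }
    }
    out[n] = '\0';
    printf("%s\n", out);  /* Nom nom nom gzip files!!!!! */
    return 0;
}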

Seeking

Uncompressed:  |-- Uncompressed 1 --|-- Uncompressed 2 --|-- Uncompressed 3 --|
               0                    45632                88775

Compressed:    |-- Comp block 1 --|-- Comp block 2 --|-- Comp block 3 --|
               0                  3788               9876

Uncompressed offset   gzip offset   Context (last 32KiB)
45632                 3788          ...ressed 1
88775                 9876          ...ressed 2
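
Each row of that table is a checkpoint. A sketch of what one needs to
hold (field names are illustrative, not zindex's actual layout):

#include <stdint.h>

struct checkpoint {
    uint64_t uncompressed_offset;  /* e.g. 45632 */
    uint64_t compressed_offset;    /* bytes into the .gz, e.g. 3788 */
    int      bit_offset;           /* blocks need not start on a byte
                                      boundary */
    uint8_t  window[32768];        /* last 32KiB of output: the
                                      "Context" column */
};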

zindex

  • Decompress the whole file once, recording line offsets in the index
  • Checkpoint the gzip state every few MiB
  • Each checkpoint also stores the last 32KiB of output (the back-reference window)
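
A sketch of the checkpointing loop, following the technique in zlib's
examples/zran.c (zindex itself uses the same idea; error handling and
the circular-window bookkeeping are trimmed here):

#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

#define CHUNK 16384
#define SPAN  (4 * 1048576)  /* checkpoint every ~4MiB of output */

/* Hypothetical hook: persist one checkpoint (e.g. into the index). */
static void save_checkpoint(uint64_t out, uint64_t in, int bits,
                            const unsigned char *window) {
    (void)window;
    printf("checkpoint: uncompressed=%llu gzip=%llu bits=%d\n",
           (unsigned long long)out, (unsigned long long)in, bits);
}

int build_checkpoints(FILE *gz) {
    z_stream strm = {0};
    unsigned char in[CHUNK], win[32768];
    uint64_t totin = 0, totout = 0, last = 0;
    int ret = inflateInit2(&strm, 47);  /* 47 = 32+15: gzip or zlib */
    if (ret != Z_OK) return ret;
    do {
        strm.avail_in = fread(in, 1, CHUNK, gz);
        strm.next_in = in;
        do {
            if (strm.avail_out == 0) {        /* reuse win as the output */
                strm.avail_out = sizeof(win); /* buffer; its contents    */
                strm.next_out = win;          /* are the 32KiB window    */
            }
            totin += strm.avail_in;
            totout += strm.avail_out;
            ret = inflate(&strm, Z_BLOCK);    /* stop at block boundaries */
            totin -= strm.avail_in;
            totout -= strm.avail_out;
            if (ret == Z_STREAM_END) break;
            if (ret != Z_OK) { inflateEnd(&strm); return ret; }
            /* After Z_BLOCK, data_type bit 7 = at a block boundary,
             * bit 6 = inside the last block, low 3 bits = unused bits
             * in the last input byte consumed. */
            if ((strm.data_type & 128) && !(strm.data_type & 64) &&
                totout - last >= SPAN) {
                /* Simplification: zran.c reassembles a circular window
                 * here; we assume the last 32KiB is contiguous in win. */
                save_checkpoint(totout, totin, strm.data_type & 7, win);
                last = totout;
            }
        } while (strm.avail_in != 0);
    } while (ret != Z_STREAM_END && !feof(gz));
    inflateEnd(&strm);
    return Z_OK;
}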

zq

  • Look up the key's line number in the index
  • Look up that line's uncompressed offset in the index
  • Find the latest checkpoint prior to that offset
  • Initialize the gzip library with the checkpoint's 32KiB context
  • Decompress forward, discarding output, until the offset is reached
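
The restore is the clever bit. A sketch, again after zlib's
examples/zran.c: raw inflate, prime the leftover bits, then install
the saved window as the dictionary:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

struct checkpoint {                /* as sketched on the Seeking slide */
    uint64_t uncompressed_offset;
    uint64_t compressed_offset;    /* gzip bytes consumed at the boundary */
    int      bit_offset;           /* unused bits in the last byte */
    unsigned char window[32768];
};

int restore_at(z_stream *strm, FILE *gz, const struct checkpoint *cp) {
    memset(strm, 0, sizeof(*strm));
    int ret = inflateInit2(strm, -15);  /* raw deflate: we are jumping
                                           past the gzip header */
    if (ret != Z_OK) return ret;
    /* Blocks need not start on a byte boundary: back up one byte and
     * hand inflate the leftover bits via inflatePrime(). */
    fseek(gz, (long)(cp->compressed_offset - (cp->bit_offset ? 1 : 0)),
          SEEK_SET);
    if (cp->bit_offset) {
        int byte = getc(gz);
        inflatePrime(strm, cp->bit_offset, byte >> (8 - cp->bit_offset));
    }
    /* The saved 32KiB becomes the back-reference history. */
    inflateSetDictionary(strm, cp->window, sizeof(cp->window));
    return Z_OK;  /* now inflate forward, discarding, until the target */
}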

Building an index

$ zindex \
    --regex '^\{"eventId":([0-9]+),' \
    --unique --numeric audit.log.gz
  • Creates a SQLite index file
  • 1.6GiB compressed input, 63MiB index
  • Takes 90s
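
Since the index is an ordinary SQLite file, a lookup is conceptually a
single indexed query. Purely illustrative: the file name, table, and
columns below are invented, not zindex's real schema:

#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    sqlite3_stmt *stmt;
    /* Index file name assumed; zindex writes one alongside the input. */
    if (sqlite3_open_v2("audit.log.gz.zindex", &db,
                        SQLITE_OPEN_READONLY, NULL) != SQLITE_OK)
        return 1;
    sqlite3_prepare_v2(db,
        "SELECT line, offset FROM key_index WHERE key = ?",
        -1, &stmt, NULL);
    sqlite3_bind_int64(stmt, 1, 63181572);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        printf("line %lld at uncompressed offset %lld\n",
               (long long)sqlite3_column_int64(stmt, 0),
               (long long)sqlite3_column_int64(stmt, 1));
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}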

Querying

$ zq audit.log.gz 63181572
{"eventId":63181572,...}
  • Takes 0.03s
  • ~470× (47,000%) faster than zgrep's 14s

Questions?

github.com/mattgodbolt/zindex
@mattgodbolt