My problem
            
                - 5000+ read-only archived JSON log files on NFS
- Average 450MiB gzipped, 4GiB uncompressed
- Need to find entries by a unique ID
zgrep
            $ zgrep -E '^\{"eventId":63181572,' audit.log.gz
{"eventId":63181572,...}
            
                - Takes 14 seconds
- zgrep has to uncompress every byte of the file to search it!
- Could index file
- ...but then would need to be able to seek within gzip file
Inside a gzipped file
            
                - Sequence of blocks
- Blocks may be uncompressed or compressed
- Compressed blocks use Huffman-encoded LZ77
LZ77 decoder
            Each symbol can be:
            
                - A literal byte (e.g. 'A' or '\xff')
- A back-reference and length (Copy 5, distance 3).
                    
                
- The "end of block" marker
Example
            "Nom nom nom gzip files!!!!!"
            
                
                    | Compressed block | "" | 
                
                    | "Nom n" | "Nom n" | 
                
                    | Copy 7, distance 4 | "Nom nom nom " | 
                
                    | "gzip files!" | "Nom nom nom gzip files!" | 
                
                    | Copy 4, distance 1 | "Nom nom nom gzip files!!!!!" | 
                
                    | End of block | "Nom nom nom gzip files!!!!!" | 
            
        
        
            Seeking
            
                
                    | 0 | 45632 | 88775 | 
                
                    | Uncompressed 1 | Uncompressed 2 | Uncompressed 3 | 
            
            
            
                
                    | 0 | 3788 | 9876 | 
                
                    | Comp block 1 | Comp block 2 | Comp block 3 | 
            
            
            
                
                    | Offset | gzip offset | Context | 
                
                    | 45632 | 3788 | ...ressed 1 | 
                
                    | 88775 | 9876 | ...ressed 2 | 
            
        
        
            zindex
            
                - Decompress, find line offsets and index
- Checkpoint gzip state every few MiB
- 32KiB of data for checkpoint
zq
            
                - Look up line number in index
- Look up file offset of line in index
- Get latest checkpoint prior to offset
- Initialize gzip library with 32KiB context
- Skip forward until offset
Building an index
            $ zindex \
    --regex '^\{"eventId":([0-9]+),' \
    --unique --numeric audit.log.gz
            
                - Creates a SQLite index file
- 1.6GiB compressed input, 63MiB index
- Takes 90s
Querying
            $ zq audit.log.gz 63181572
{"eventId":63181572,...}
            
                - Takes 0.03s
- 47,000% faster than zgrep (14s)