
This guide shows you how to parse binary data formats into structured events. You’ll learn to work with columnar formats like Parquet and Feather, packet captures in PCAP format, Tenzir’s native Bitz format, and compressed data.

The examples use from_file with a parsing subpipeline to illustrate each technique.

Apache Parquet is a columnar format widely used in data lakes and analytics pipelines. Given this Parquet file containing user data:

from_file "users.parquet" {
  read_parquet
}
{id: 1, name: "alice", email: "alice@example.com", role: "admin"}
{id: 2, name: "bob", email: "bob@example.com", role: "user"}
{id: 3, name: "carol", email: "carol@example.com", role: "user"}
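Parsed Parquet events flow through the rest of the pipeline like any other events, so you can filter or transform them right away. As a sketch, keeping only admin users with the where operator:

from_file "users.parquet" {
  read_parquet
}
where role == "admin"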

Parquet files often come from cloud storage:

from_file "s3://datalake/events/*.parquet"

The from_file operator automatically detects Parquet format from the file extension.

Apache Feather is Parquet’s little brother—a lightweight columnar format optimized for fast I/O:

from_file "data.feather" {
  read_feather
}

Use read_feather to parse Feather files.
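You can also convert between columnar formats within a single pipeline. Here is a sketch that re-encodes a Parquet file as Feather, assuming the write_feather and save_file operators are available:

from_file "users.parquet" {
  read_parquet
}
write_feather
save_file "users.feather"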

PCAP is the standard format for packet captures. Use read_pcap to parse captured packets:

from_file "capture.pcap" {
  read_pcap
}
{linktype: 1, timestamp: 2024-01-15T10:30:45.123456Z, captured_packet_length: 74, original_packet_length: 74, data: "ABY88f1tZJ7zvttmCABFAAA8..."}

Use from_nic to read packets directly from a live interface. TQL further comes with lightweight packet processing functions. For example, you can extract protocol headers from raw packet data using the decapsulate function:

from_file "capture.pcap" {
  read_pcap
}
packet = decapsulate(this)
{packet: {ether: {src: "64-9E-F3-BE-DB-66", dst: "00-16-3C-F1-FD-6D", type: 2048}, ip: {src: "192.168.1.100", dst: "10.0.0.1", type: 6}, tcp: {src_port: 54321, dst_port: 443}, community_id: "1:YXWfTYEyYLKVv5Ge4WqijUnKTrM="}}
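The decapsulated headers are ordinary record fields, so downstream operators can match on them. As a sketch, keeping only packets destined for TCP port 443 (this assumes every captured packet carries a TCP header):

from_file "capture.pcap" {
  read_pcap
}
packet = decapsulate(this)
where packet.tcp.dst_port == 443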

Bitz is Tenzir’s native columnar format, optimized for schema-rich security data. Use read_bitz to parse it:

from_file "archive.bitz" {
  read_bitz
}
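Bitz also works in the opposite direction for archiving. Here is a sketch that converts JSON events into a Bitz archive, assuming the read_json, write_bitz, and save_file operators:

from_file "events.json" {
  read_json
}
write_bitz
save_file "archive.bitz"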

Binary formats often come compressed. The from_file operator automatically detects compression based on file extensions like .gz, .zst, .bz2, .lz4, and .br:

from_file "data.parquet.gz" // Auto-detects gzip
from_file "logs.json.zst" // Auto-detects zstd

When automatic detection doesn’t apply (e.g., custom extensions or chained formats), use explicit decompression operators in the parsing subpipeline. These are bytes-to-bytes operators, so they must appear before the parser:

Format     Operator
Gzip       decompress_gzip
Zstandard  decompress_zstd
Bzip2      decompress_bz2
LZ4        decompress_lz4
Brotli     decompress_brotli

Example with explicit decompression:

from_file "capture.pcap.zst" {
  decompress_zstd
  read_pcap
}
