Formats
General
A format is the bridge between raw bytes and structured data. A format provides a parser and/or printer:
- Parser: translates raw bytes into structured event data
- Printer: translates structured events into raw bytes
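Parsers and printers interact with their corresponding dual from a connector: a parser consumes the raw bytes that a connector loads, and a printer produces the raw bytes that a connector saves.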
Formats appear as an argument to the parse and print operators:
parse <field> <format>
print <field> <format>
Parser Schema Inference
Parsers attempt to infer an event schema from the input and, where applicable, from the data format.
Several builtin parsers provide the options below for more specific control over schema inference.
The Suricata, Zeek JSON, and XSV parsers do not provide all of these options.
--merge
(Parsers)
Merges all incoming events into a single schema* that converges over time. This option is usually the fastest for reading highly heterogeneous data, but can lead to huge schemas filled with nulls and imprecise results. Use with caution.
*: In selector mode, only events with the same selector are merged.
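The merging behavior can be illustrated with a minimal Python sketch. It is not Tenzir's implementation: the function name merge_events and the sample events are hypothetical, and the sketch only models the union of field names, not type conflicts between events.

```python
from typing import Any


def merge_events(events: list[dict[str, Any]]) -> tuple[list[str], list[dict[str, Any]]]:
    """Merge events into one schema: the union of all field names seen."""
    fields: list[str] = []
    for event in events:
        for key in event:
            if key not in fields:
                fields.append(key)  # the merged schema grows as new fields appear
    # Pad every event to the merged schema; missing fields become None (null).
    rows = [{name: event.get(name) for name in fields} for event in events]
    return fields, rows


# Hypothetical heterogeneous input.
events = [
    {"src": "10.0.0.1", "port": 53},
    {"user": "alice"},
    {"src": "10.0.0.2", "bytes": 512},
]
fields, rows = merge_events(events)
```

Here the merged schema ends up with four fields, and each row carries a null for every field it did not originally contain, which is how wide, sparse schemas arise with heterogeneous input.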
--schema <schema>
(Parsers)
Explicitly set the output schema.
If a schema with a matching name is installed, the result will always have all fields from that schema.
- Fields that are specified in the schema, but did not appear in the input will be null.
- Fields that appear in the input, but not in the schema will also be kept.
Use --schema-only to reject fields that are not in the schema.
If the given schema does not exist, this option instead assigns the output schema name only.
This option cannot be combined with --selector.
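The field rules for an installed schema can be sketched in Python as follows. This is an illustration under stated assumptions, not Tenzir's code: apply_schema, its schema_only parameter, and the sample field names are hypothetical, and type conversion is left out.

```python
from typing import Any


def apply_schema(
    event: dict[str, Any],
    schema_fields: list[str],
    schema_only: bool = False,
) -> dict[str, Any]:
    """Shape an event according to a list of schema field names."""
    # Every schema field is present in the result; absent ones become None.
    result = {name: event.get(name) for name in schema_fields}
    if not schema_only:
        # Fields outside the schema are kept unless schema_only is set.
        for key, value in event.items():
            if key not in result:
                result[key] = value
    return result


# Hypothetical schema and event.
schema_fields = ["ts", "src", "dst"]
event = {"src": "10.0.0.1", "note": "extra"}

kept = apply_schema(event, schema_fields)
strict = apply_schema(event, schema_fields, schema_only=True)
```

In this sketch, kept contains ts and dst as nulls plus the extra note field, while strict drops note and holds only the three schema fields.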
--selector <field>[:<prefix>]
(Parsers)
Similar to --schema, but uses the value of the field specified in <field> as the schema name.
If the optional <prefix> is specified, it is prepended to the schema name. For example, the selector event_type:suricata applied to an event whose event_type field has the value flow looks for a schema named suricata.flow.
This option cannot be combined with --schema.
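The name lookup can be sketched in a few lines of Python. The helper schema_name_from_selector is hypothetical and only mirrors the example in the text; joining prefix and value with a dot is inferred from the suricata.flow example.

```python
from typing import Any, Optional


def schema_name_from_selector(event: dict[str, Any], selector: str) -> Optional[str]:
    """Derive a schema name from a selector of the form <field>[:<prefix>]."""
    field, _, prefix = selector.partition(":")
    value = event.get(field)
    if value is None:
        return None  # the event has no selector field, so no name can be derived
    # With a prefix, the name is "<prefix>.<value>"; otherwise just the value.
    return f"{prefix}.{value}" if prefix else str(value)


with_prefix = schema_name_from_selector({"event_type": "flow"}, "event_type:suricata")
without_prefix = schema_name_from_selector({"event_type": "flow"}, "event_type")
```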
--schema-only
(Parsers)
When working with an existing schema, this option ensures that the output schema has only the fields from that schema. If the schema name is obtained via a selector and the schema does not exist, this option has no effect.
This option requires either --schema or --selector to be set.
--unnest-separator <separator>
(Parsers)
A delimiter that, if present in keys, causes values to be treated as values of nested records.
A popular example of this is the Zeek JSON format. It includes the fields id.orig_h, id.orig_p, id.resp_h, and id.resp_p at the top level. The data is best modeled as an id record with four nested fields orig_h, orig_p, resp_h, and resp_p.
Without an unnest separator, the four fields stay at the top level as keys whose names contain a dot. With the unnest separator set to a dot (.), Tenzir reads them as a single id record with nested fields.
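The effect of an unnest separator can be sketched in Python. This is an illustrative model, not Tenzir's parser: the function unnest and the sample values are hypothetical, and the sketch assumes no key conflicts (for example, both id and id.orig_h present as separate keys).

```python
from typing import Any


def unnest(event: dict[str, Any], separator: str) -> dict[str, Any]:
    """Turn keys containing the separator into nested records."""
    result: dict[str, Any] = {}
    for key, value in event.items():
        parts = key.split(separator)
        target = result
        # Walk or create the intermediate records for all but the last part.
        for part in parts[:-1]:
            target = target.setdefault(part, {})
        target[parts[-1]] = value
    return result


# Hypothetical flat event with Zeek-style field names.
flat = {
    "id.orig_h": "10.0.0.1",
    "id.orig_p": 49152,
    "id.resp_h": "10.0.0.2",
    "id.resp_p": 443,
    "proto": "tcp",
}
nested = unnest(flat, ".")
```

The flat dictionary corresponds to reading without an unnest separator; nested holds a single id record with the four fields orig_h, orig_p, resp_h, and resp_p, next to the unchanged proto field.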
--raw
(Parsers)
Use only the raw types that are native to the parsed format. Fields that have a type specified in the chosen schema will still be parsed according to the schema.
For example, the JSON format has no notion of an IP address, so this option causes all IP addresses to be parsed as strings, unless the schema specifies the field as an IP address. JSON does have numeric types, however, so numbers are still parsed as numbers.
Use with caution.
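A rough Python model of this rule is shown below. It is a sketch under assumptions, not Tenzir's behavior in detail: parse_field, infer, and the schema_types mapping are hypothetical names, and inference is reduced to recognizing IP addresses.

```python
import ipaddress
from typing import Any, Callable


def infer(value: Any) -> Any:
    """Toy type inference: recognize strings that look like IP addresses."""
    if isinstance(value, str):
        try:
            return ipaddress.ip_address(value)
        except ValueError:
            return value
    return value


def parse_field(
    name: str,
    value: Any,
    raw: bool,
    schema_types: dict[str, Callable[[Any], Any]],
) -> Any:
    """Parse one field: the schema wins, then raw mode, then inference."""
    if name in schema_types:
        return schema_types[name](value)  # schema types apply even in raw mode
    return value if raw else infer(value)


raw_value = parse_field("src", "10.0.0.1", raw=True, schema_types={})
inferred = parse_field("src", "10.0.0.1", raw=False, schema_types={})
typed = parse_field("src", "10.0.0.1", raw=True, schema_types={"src": ipaddress.ip_address})
number = parse_field("port", 53, raw=True, schema_types={})
```

In raw mode without a schema, the address stays a string while the number remains a number; once the hypothetical schema declares src as an IP address, the value is converted even in raw mode.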
MIME Types
When a printer constructs raw bytes, it sets a MIME content type so that savers can make assumptions about the otherwise opaque content. For example, the HTTP connector uses this value to populate the Content-Type header when copying the raw bytes into the HTTP request body.
The printers set the following MIME types:
Format | MIME Type |
---|---|
CSV | text/csv |
JSON | application/json |
NDJSON | application/x-ndjson |
Parquet | application/x-parquet |
PCAP | application/vnd.tcpdump.pcap |
SSV | text/plain |
TSV | text/tab-separated-values |
YAML | application/x-yaml |
Zeek TSV | application/x-zeek |
Available Formats
- bitz: Reads and writes BITZ, Tenzir's internal wire format.
- cef: Parses events in the Common Event Format (CEF).
- csv: The csv format is a configuration of the xsv format:
- feather: Reads and writes the Feather file format, a thin wrapper around
- gelf: Reads Graylog Extended Log Format (GELF) events.
- grok: Parses a string using a grok-pattern, backed by regular expressions.
- json: Reads and writes JSON.
- kv: Reads key-value pairs by splitting strings based on regular expressions.
- leef: Parses events in the Log Event Extended Format (LEEF).
- lines: Parses and prints events as lines.
- parquet: Reads events from a Parquet file. Writes events to a Parquet file.
- pcap: Reads and writes raw network packets in PCAP file format.
- ssv: The ssv format is a configuration of the xsv format:
- suricata: Reads Suricata's EVE JSON output. The parser is an alias
- syslog: Reads syslog messages.
- time: Parses a datetime/timestamp using a strptime-like format string.
- tsv: The tsv format is a configuration of the xsv format:
- xsv: Reads and writes lines with separated values.
- yaml: Reads and writes YAML.
- zeek-json: The zeek-json format is an alias for json with the arguments:
- zeek-tsv: Reads and writes Zeek tab-separated values.