Ingest any Logs with a Custom Schema

While other tools often bang the drum for being able to "ingest everything" regardless of the schema, we found that this often comes at the hidden cost of high ingest latency and slow query response times. VAST takes a different approach: it is blazing fast and can ingest logs at line rate, up to millions of events per second. That is partially because VAST features a strong type system. To leverage its power, users need to provide so-called schemas, which help VAST interpret the datatypes in the stream of incoming events.

VAST ships with some predefined schemas for commonly used tools, like Suricata, Zeek, or Sysmon logs. But what if there is no schema available for the tool you want to use? Here is how you can ingest any logs in VAST, so you can profit from the amazing properties a rich type system has to offer.

This guide walks you through the process of creating a custom schema, using the example of a firewall that produces logs in LDJSON format.

The Log Source

In an imaginary setup, we have a firewall with 10k connected clients per day. The firewall produces about 180 Gigabytes of logs every 24 hours. That is about 2.1 MB/s—and won't make VAST break a sweat. Let's get started!
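A quick back-of-the-envelope check of that number (a Python sketch, not part of the VAST workflow):

```python
# 180 GB of logs per day, spread evenly over 24 hours.
bytes_per_day = 180 * 10**9
seconds_per_day = 24 * 60 * 60

rate_mb_s = bytes_per_day / seconds_per_day / 10**6
print(f"{rate_mb_s:.2f} MB/s")  # roughly 2.1 MB/s
```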

Our example firewall produces logs in line-delimited JSON format. A single log-line looks as follows:

{
  "ts": "2021-02-24T09:16:20.000000+0100",
  "rule": "known malicious source IP",
  "action": "block",
  "src_ip": "6.6.6.6",
  "src_port": 52768,
  "dst_ip": "66.77.88.99",
  "dst_port": 22,
  "service": "sshd"
}
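The line parses as regular JSON. A small Python sketch (illustrative only, independent of VAST) confirms the field names and their count:

```python
import json

# The sample log line from above, as a single LDJSON record.
log_line = (
    '{"ts": "2021-02-24T09:16:20.000000+0100", '
    '"rule": "known malicious source IP", "action": "block", '
    '"src_ip": "6.6.6.6", "src_port": 52768, '
    '"dst_ip": "66.77.88.99", "dst_port": 22, "service": "sshd"}'
)

event = json.loads(log_line)
print(len(event), sorted(event))  # 8 fields
```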

Every log-line that our example firewall produces contains eight fields. Let's look at them in detail:

Field name   Description
----------   ------------------------------------------------------
ts           The timestamp when the log line was recorded
rule         The firewall rule that triggered
action       The action that the firewall applied to the connection
src_ip       The source IP address of the logged connection
src_port     The source port of the logged connection
dst_ip       The destination IP address of the logged connection
dst_port     The destination port of the logged connection
service      The detected service of the destination host/port

Deriving a VAST Schema

In this section we first examine the data closely to understand the types used in the different JSON fields. Then we construct a VAST record to complete our custom schema.

Understanding the Datatypes

We can identify four different types in the firewall logs described above:

  1. Timestamps (the ts field)
  2. Strings (rule, action, and service)
  3. IP addresses (src_ip, dst_ip)
  4. Ports (src_port, dst_port)

VAST supports all these basic types natively. That is great, because we automatically inherit powerful concepts from the rich type system. For example, once our logs are ingested as typed data, we can use IP addresses for searching in a subnet space in CIDR notation. But let's write the schema first:
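The CIDR idea is easy to demonstrate outside of VAST. Here's a minimal Python sketch using the standard ipaddress module; the /31 network below is just an illustration:

```python
import ipaddress

# A /31 covering exactly two addresses: 6.6.6.6 and 6.6.6.7.
network = ipaddress.ip_network("6.6.6.6/31")

for ip in ["6.6.6.6", "6.6.6.7", "6.6.6.8"]:
    member = ipaddress.ip_address(ip) in network
    print(ip, member)
# 6.6.6.6 and 6.6.6.7 are in the /31, 6.6.6.8 is not
```

With typed IP addresses, VAST can answer exactly this kind of subnet query directly on the ingested data, without string matching.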

  1. The ts field contains a datetime string that also serves as the timestamp of the log entry. VAST supports that via the timestamp type.
  2. The rule, action, and service fields are simple strings. The VAST type is simply called string.
  3. src_ip and dst_ip are IP addresses. VAST has native support for IP addresses via the addr type.
  4. Lastly, the src_port and dst_port fields contain JSON numbers. Since port values can never be negative, we use the count type in VAST (not the integer type).
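Two of these mappings are worth a quick sanity check in Python (a sketch, independent of VAST): the ts string parses as a timezone-aware datetime, and the port values are non-negative, as the count type expects:

```python
from datetime import datetime

# The %z directive handles the "+0100" offset in the ts field.
ts = datetime.strptime("2021-02-24T09:16:20.000000+0100",
                       "%Y-%m-%dT%H:%M:%S.%f%z")
print(ts.isoformat())

for port in (52768, 22):
    assert 0 <= port <= 65535  # fits an unsigned count type
```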

Putting It All Together

We will now use the VAST types we derived from the JSON logs above to write a custom schema. Schemas are text files and contain a bunch of records. In our case we need just one record, because our example firewall produces just one kind of log. If your log source were to produce five different log types, you would simply define five different records inside the same schema file.

Records in VAST schemas look much like JSON objects. We declare a VAST type for every field (key) in the JSON object. The record is complete when we provide a name for it, using this syntax:

type <my-name> = record { key: type, ... }

Finally, here is how the full VAST record looks for our example firewall logs:

type custom.firewall = record {
  ts: timestamp,
  rule: string,
  action: string,
  src_ip: addr,
  src_port: count,
  dst_ip: addr,
  dst_port: count,
  service: string
}

Let's store the above snippet in a file called my-firewall.schema.

Ingesting Firewall Logs

To ingest logs from our example firewall we need to pass our custom schema to VAST. This is done by placing the my-firewall.schema file in one of VAST's schema directories. On startup, VAST scans all configured schema directories and becomes aware of all defined records.

The exact schema directory depends on your installation. For example, in Linux deployments you may want to put your schema files in /etc/vast/schema. You can always add custom schema directories in the VAST configuration file.

If your file is syntactically correct and placed in one of the configured schema directories, VAST will read it on startup. From that point on VAST knows how to handle logs of your custom log-source. Once imported, your logs are stored in a strongly typed database, according to your schema.

Verify Schema

We can use VAST's status command to check on all known types in the VAST type registry. If everything went well, we should see a new type for our custom VAST record. We called the new record type custom.firewall in our schema file, so we expect that type to be listed in VAST's type registry. Run the status command with the --debug setting as follows:

vast status --debug | jq '."type-registry".types'

The output should contain our new record type and look similar to the following:

[
"custom.firewall",
"vast.metrics"
]

Ingesting Data

Once the schema is loaded, we can ingest our firewall logs by invoking vast import. We can either pass a file or a steady stream via stdin. Here's how the invocation looks for a single file:

vast -v import json < my-firewall-logs.json

In our example we use a small log file with the following contents:

$ cat my-firewall-logs.json
---------------------------
{"ts": "2021-02-24T09:16:20.000000+0100","rule": "known malicious source IP","action": "block","src_ip": "6.6.6.6","src_port": 52768,"dst_ip": "66.77.88.99","dst_port": 22,"service": "sshd"}
{"ts": "2021-02-24T09:16:21.000000+0100","rule": "known malicious source IP","action": "block","src_ip": "6.6.6.7","src_port": 52768,"dst_ip": "66.77.88.99","dst_port": 22,"service": "sshd"}
{"ts": "2021-02-24T09:16:22.000000+0100","rule": "known malicious source IP","action": "block","src_ip": "6.6.6.8","src_port": 52768,"dst_ip": "66.77.88.99","dst_port": 22,"service": "sshd"}
{"ts": "2021-02-24T09:16:23.000000+0100","rule": "known malicious source IP","action": "block","src_ip": "6.6.6.9","src_port": 52768,"dst_ip": "66.77.88.99","dst_port": 22,"service": "sshd"}
{"ts": "2021-02-24T09:16:24.000000+0100","rule": "known malicious source IP","action": "block","src_ip": "6.6.6.10","src_port": 52768,"dst_ip": "66.77.88.99","dst_port": 22,"service": "sshd"}

Querying VAST

We can now use VAST's query language to search in the data and profit from the strong type system.

# Query VAST for logs of our custom record type:
vast export json '#type == "custom.firewall"'
# Query all events where the action field has the value "block"
vast export json 'action == "block"'
# Query a CIDR range. This will only show the records where the source IP is 6.6.6.6 or 6.6.6.7:
vast export json 'src_ip in 6.6.6.6/31'
# Query records that are timestamped after the specified date:
vast export json ':timestamp > "2021-02-24T10:16:22"'

Check out our query language documentation for details on the different types and data formats you can use in your queries.

Summary

In this guide we have learned how to write a custom schema for VAST and ingest arbitrary log data with strong types. Here's a cheat sheet outlining the steps we took:

  1. Get a sample log-line from your application.
  2. Find the corresponding VAST type for every field in the log-line.
  3. Create a VAST record with the syntax type <my-name> = record { key: type, ... }.
  4. Repeat steps 1-3 if your application outputs more than one log format (see the officially maintained sysmon.schema for a detailed example of an application that outputs many different formats).
  5. Create a schema file that contains all your custom record definitions.
  6. Place that schema file in a VAST schema directory. You can add custom schema directories in VAST's configuration file.
  7. Use vast status --debug to verify that your schema is known to VAST.
  8. Happy ingesting!