Map Data to OCSF
In this tutorial, you'll learn how to map events to the Open Cybersecurity
Schema Framework (OCSF). We walk you through an example of
events from a network monitor and show how you can use Tenzir pipelines to
transform them into OCSF-compliant events.
The diagram above illustrates the data lifecycle and where the OCSF mapping
takes place: you collect data from various data sources, each of which has a
different shape, and then convert it to a standardized representation. The primary
benefit is that normalization decouples data acquisition from downstream
analytics, allowing the processes to scale independently.
OCSF is a vendor-agnostic event schema (a.k.a. "taxonomy") that defines
structure and semantics for security events. Here are some key terms you need to
know to map events:
Attribute: a unique identifier for a specific type, e.g., parent_folder
of type String or observables of type Observable Array.
Event Class: the description of an event defined in terms of attributes,
e.g., HTTP Activity and Detection Finding.
Category: a group of event classes, e.g., System Activity or Findings.
The diagram below illustrates how subsets of attributes form an event class:
The Base Event Class is a special event class that's part of every event
class. Think of it as a mixin of attributes that get automatically added:
For this tutorial, we look at OCSF from the perspective of the mapper persona,
i.e., as someone who converts existing events into the OCSF schema. OCSF also
defines three other personas: author, producer, and analyst. These are out of
scope. Our mission as mapper is now to study the event semantics of the data
source we want to map, and translate the event to the appropriate OCSF event
class.
The Zeek network monitor turns raw network traffic into
detailed, structured logs. The logs range across the OSI stack from link layer
activity to application-specific messages. In addition, Zeek provides a powerful
scripting language to act on network events, making it a versatile tool for
writing network-based detections to raise alerts.
Zeek generates logs in tab-separated values (TSV) or JSON format. Here's an
example of a connection log in TSV format:
We first need to parse the log file into structured form so that we can work
with the individual fields. Thanks to Tenzir's Zeek
support, we can quickly turn TSV logs
into structured data using a single operator:
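To make concrete what this parsing step does, here is a minimal Python sketch. It is not Tenzir's implementation. The sample log is hypothetical and trimmed to a handful of fields, and the sketch assumes the default tab separator. It reads the `#fields` and `#types` header lines and uses them to name and convert the values of each data line:

```python
# Hypothetical, heavily trimmed conn.log; real logs carry many more fields.
SAMPLE = "\n".join([
    "#separator \\x09",
    "#set_separator\t,",
    "#empty_field\t(empty)",
    "#unset_field\t-",
    "#path\tconn",
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\tproto\tduration",
    "#types\ttime\tstring\taddr\tport\tenum\tinterval",
    "1637155963.249475\tCabc123\t10.0.0.1\t49152\ttcp\t-",
    "#close\t2021-11-17-13-32-46",
])


def convert(value, zeek_type, unset):
    """Convert one TSV cell according to its Zeek type (small subset only)."""
    if value == unset:
        return None
    if zeek_type in ("time", "interval", "double"):
        return float(value)
    if zeek_type in ("port", "count", "int"):
        return int(value)
    return value


def parse_zeek_tsv(text):
    """Turn Zeek TSV text into a list of dicts keyed by field name."""
    fields, types, unset, events = [], [], "-", []
    for line in text.splitlines():
        if line.startswith("#fields\t"):
            fields = line.split("\t")[1:]
        elif line.startswith("#types\t"):
            types = line.split("\t")[1:]
        elif line.startswith("#unset_field\t"):
            unset = line.split("\t")[1]
        elif line.startswith("#") or not line:
            continue  # Skip the remaining header and footer lines.
        else:
            values = line.split("\t")
            events.append({
                name: convert(value, zeek_type, unset)
                for name, zeek_type, value in zip(fields, types, values)
            })
    return events
```

Calling `parse_zeek_tsv(SAMPLE)` yields one event whose `id.orig_p` is an integer and whose unset `duration` becomes `None`.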
Now that we have structured data to work with, our objective is to map the
fields from the Zeek conn.log to OCSF. The corresponding event class in OCSF
is Network Activity.
We will be using OCSF v1.3.0 throughout this tutorial.
To make the mapping process more organized, we map per attribute group:
Classification: Important for the taxonomy and schema itself
Occurrence: Temporal characteristics about when the event happened
Context: Enhancing information about the event
Primary: Defines the key semantics of the given event
Within each attribute group, we go through the attributes in the order of the
three requirement flags: required, recommended, and optional.
Here's a template for the mapping pipeline:
Let's unpack this:
With this = { event: this } we move the original event into the field
event. This also has the benefit that we avoid name clashes when creating
new fields in the next steps.
There are several fields we want to reference in expressions in the
subsequent assignment, so we precompute them here.
The giant this = { ... } assignment creates the OCSF event, with a field
order that matches the official OCSF documentation.
We copy the original event into unmapped.
After we have mapped all fields, we explicitly remove them from unmapped.
This has the effect that everything we didn't touch automatically ends up
here.
We give the event a new schema name so that we can easily filter by its shape
in further Tenzir pipelines.
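The six steps can be sketched in Python to show the data flow. This is an illustration of the logic, not TQL. The function name `map_event`, the flat dotted keys, and the schema name `ocsf.network_activity` (chosen by analogy with `ocsf.authentication`) are assumptions made for this sketch:

```python
def map_event(original, schema_name="ocsf.network_activity"):
    # (1) Move the original event into a dedicated field to avoid name clashes.
    this = {"event": dict(original)}
    event = this["event"]
    # (2) Precompute values that later expressions reference.
    class_uid = 4001   # Assumed: OCSF Network Activity.
    activity_id = 6    # Assumed: Traffic.
    # (3) Build the OCSF event in one big record.
    ocsf = {
        "activity_id": activity_id,
        "class_uid": class_uid,
        "src_endpoint": {
            "ip": event.get("id.orig_h"),
            "port": event.get("id.orig_p"),
        },
        # (4) Copy the original event into unmapped.
        "unmapped": dict(event),
    }
    # (5) Remove every field we mapped; whatever remains was not touched.
    for mapped in ("id.orig_h", "id.orig_p"):
        ocsf["unmapped"].pop(mapped, None)
    # (6) Attach the new schema name.
    return {"name": schema_name, "data": ocsf}
```

For an input with `id.orig_h`, `id.orig_p`, and `uid`, only `uid` remains in `unmapped`, which is the property step 5 is after.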
Now that we know the general structure, let's get our hands dirty and go deep
into the actual mapping.
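We start with the Classification group. The following Python sketch (again an illustration, not TQL) shows how these attributes relate to each other. The numeric values are our reading of the OCSF v1.3.0 Network Activity class and should be checked against the schema browser; the function name `classification` is hypothetical:

```python
# Assumed activity_id values of the Network Activity class in OCSF v1.3.0.
ACTIVITY_NAMES = {
    0: "Unknown", 1: "Open", 2: "Close", 3: "Reset",
    4: "Fail", 5: "Refuse", 6: "Traffic", 99: "Other",
}


def classification(activity_id, severity_id=1):
    """Build the Classification attributes for a Network Activity event."""
    class_uid = 4001  # Assumed: Network Activity.
    return {
        "activity_id": activity_id,
        "activity_name": ACTIVITY_NAMES[activity_id],
        "category_name": "Network Activity",
        "category_uid": 4,
        "class_name": "Network Activity",
        "class_uid": class_uid,
        "severity_id": severity_id,  # Assumed: 1 means Informational.
        # OCSF derives the event type from class and activity.
        "type_uid": class_uid * 100 + activity_id,
    }
```

A connection summary is most naturally a Traffic activity, so `classification(6)` gives a `type_uid` of 400106.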
Let's tackle the next group: Occurrence. These attributes are all about time. We
won't repeat the above record fields in the assignment, but the idea is to
incrementally construct a giant statement with the assignment this = { ... }:
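As a rough Python sketch of the Occurrence mapping (not TQL), assume the Zeek fields `ts` (connection start, seconds since the epoch) and `duration` (seconds, possibly unset), and assume OCSF expects timestamps and durations in milliseconds. Using the connection start as the event `time` is also an assumption of this sketch:

```python
def occurrence(event):
    """Derive OCSF Occurrence attributes from Zeek's ts and duration."""
    start_ms = int(round(event["ts"] * 1000))
    result = {"start_time": start_ms, "time": start_ms}
    duration = event.get("duration")
    if duration is not None:
        # Only connections with a known duration get an end time.
        duration_ms = int(round(duration * 1000))
        result["duration"] = duration_ms
        result["end_time"] = start_ms + duration_ms
    return result
```

When `duration` is unset, the sketch omits `duration` and `end_time` instead of guessing a value.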
The Context attributes provide enhancing information. Most notably, the
metadata attribute holds data-source specific information and the unmapped
attribute collects all fields that we cannot map directly to their semantic
counterparts in OCSF.
Note that we're copying the original event into unmapped so that we can in a
later step remove all mapped fields from it.
The primary attributes define the semantics of the event class itself. This is
where the core value of the data is, as we are mapping the most event-specific
information.
For this, we still need to precompute several attributes that we are going to
use in the this = { ... } assignment. You can see the use of if/else here
to create a constant field based on values in the original event.
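One such precomputed attribute is the connection direction. The Python sketch below (an illustration, not TQL) assumes the Zeek fields `local_orig` and `local_resp` and the OCSF `direction_id` values 0 (Unknown), 1 (Inbound), 2 (Outbound), and 3 (Lateral); the function name `direction` is hypothetical:

```python
def direction(local_orig, local_resp):
    """Return (direction_id, direction) from Zeek's locality flags."""
    if local_orig and local_resp:
        return 3, "Lateral"    # Both endpoints are local.
    if local_orig:
        return 2, "Outbound"   # Local originator, remote responder.
    if local_resp:
        return 1, "Inbound"    # Remote originator, local responder.
    return 0, "Unknown"        # Neither flag set, or both unset.
```

Unset flags (`None`) fall through to Unknown, because Zeek can only decide locality when it knows the local networks.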
So far we've mapped just a single event. But Zeek has dozens of different event
types, and we need to write one mapping pipeline for each. But how do we combine
the individual pipelines?
Tenzir's answer to this is topic-based publish-subscribe. The
publish and
subscribe operators send events to, and
read events from, a topic, respectively. Here's an illustration of the conceptual
approach we are going to use:
The first pipeline publishes to the zeek topic:
Then we have one pipeline per Zeek event type X that publishes to the ocsf
topic:
The idea is that all Zeek logs arrive at the topic zeek, and all mapping
pipelines start there by subscribing to the same topic, but each pipeline
selects exactly one event type. Finally, all mapping pipelines publish to the ocsf
topic that represents the combined feed of all OCSF events. Users can then use
the same filtering pattern as with Zeek to get a subset of the OCSF stream,
e.g., subscribe "ocsf" | where @name == "ocsf.authentication" for all OCSF
Authentication events.
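The fan-out and fan-in pattern can be mimicked with a small in-process bus in Python. This is a conceptual sketch, not how Tenzir implements topics. The `Bus` class, the `mapper` helper, and the schema names `zeek.conn`, `zeek.dns`, `zeek.http`, `ocsf.network_activity`, and `ocsf.dns_activity` are assumptions for illustration:

```python
from collections import defaultdict


class Bus:
    """Minimal in-process stand-in for topic-based publish-subscribe."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, predicate, handler):
        self._subscribers[topic].append((predicate, handler))

    def publish(self, topic, event):
        for predicate, handler in self._subscribers[topic]:
            if predicate(event):
                handler(event)


bus = Bus()


def mapper(zeek_name, ocsf_name):
    """Register one mapping pipeline: zeek topic -> filter -> ocsf topic."""
    def handle(event):
        bus.publish("ocsf", {"name": ocsf_name, "unmapped": event["data"]})
    bus.subscribe("zeek", lambda e: e["name"] == zeek_name, handle)


mapper("zeek.conn", "ocsf.network_activity")
mapper("zeek.dns", "ocsf.dns_activity")

# A downstream consumer that only wants one OCSF event class.
dns_only = []
bus.subscribe("ocsf", lambda e: e["name"] == "ocsf.dns_activity", dns_only.append)

for name in ("zeek.conn", "zeek.dns", "zeek.http"):
    bus.publish("zeek", {"name": name, "data": {"uid": "C1"}})
```

After the loop, `dns_only` holds exactly one event: the `zeek.http` event has no mapper and is dropped, and the Network Activity event does not match the consumer's predicate.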
Isn't this inefficient?
You may think that copying the full feed of the zeek topic to every mapping
pipeline is inefficient. The good news is that it is not, for two reasons:
Data transfers between publish and subscribe use the same zero-copy
mechanism that pipelines use internally for sharing of events.
Pipelines of the form subscribe ... | where <predicate> perform
predicate pushdown and send the predicate upstream so that the filtering
can happen as early as possible.
In this tutorial, we demonstrated how you map logs to OCSF event classes. We
used the Zeek network monitor as a case study to illustrate the general mapping
pattern. Finally, we explained how to use Tenzir's pub-sub mechanism to scale
from one to many pipelines, each of which handles a specific OCSF event class.