Event taxonomies address the uphill battle of data normalization.
They enable you to interact with different data formats through a unified access
layer, instead of having to juggle the various naming schemes and
representations of each individual data source. Today, every SIEM has its own
"unified" approach to representing data, e.g.,
Elastic's ECS,
Splunk's CIM,
QRadar's LEEF,
Sentinel's ASIM,
Chronicle's UDM,
Panther's UDM,
and the XDR Alliance's CIM.
There also exist vendor-agnostic data models with varying focus, such as MITRE's
CEE, OSSEM's CDM, or STIX SCOs.
Several vendors joined forces and launched the Open Cybersecurity Schema
Framework (OCSF), an open and extensible project to create a universal
schema.
We could add yet another data model, but our goal is
that you pick one that you already know or like best. We envision a thriving
community around taxonomization, as exemplified by the OCSF. With
Tenzir, we aim to leverage the taxonomy of your choice.
Concepts
A concept is a field mapping/alias that lazily resolves at query
time.
Concepts are not embedded in the schema and can therefore evolve independently
from the data typing. This behavior is different from other systems that
normalize by rewriting the data on ingest, e.g., Elastic with ECS. We
do not advocate for this approach, because it has the following drawbacks:
Data Lock-in: if you want to use a different data model tomorrow, you
would have to rewrite all your past data, which can be infeasible in some
cases.
Compliance Problems: if you need an exact representation of your original
data shape, you cannot perform an irreversible transformation.
Limited Analytics: if you want to run a tool that relies on the original
schema of the data, it will not work.
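To make the contrast with ingest-time rewriting concrete, here is a minimal Python sketch of query-time resolution. It is not Tenzir's implementation: the in-memory event list, the `query` helper, and the value `6.6.6.6` are assumptions made for illustration, and a concept is modeled simply as a named set of field names. The point is that changing the mapping affects past data immediately, while the stored events keep their original shape.

```python
# Hypothetical store: events are kept exactly as they were received.
EVENTS = [
    {"schema": "sysmon.NetworkConnection", "SourceIp": "6.6.6.6"},
    {"schema": "suricata.flow", "src_ip": "6.6.6.6"},
]

def matches(event, concept_fields, value):
    """True if any field mapped to the concept holds the value."""
    return any(event.get(field) == value for field in concept_fields)

def query(events, concepts, concept, value):
    """Resolve the concept at query time; stored events are never modified."""
    fields = concepts[concept]
    return [e for e in events if matches(e, fields, value)]

# A first version of the mapping only knows about Sysmon.
concepts_v1 = {"source_ip": {"SourceIp"}}
# A later version adds Suricata; no stored event has to be rewritten.
concepts_v2 = {"source_ip": {"SourceIp", "src_ip"}}

hits_v1 = query(EVENTS, concepts_v1, "source_ip", "6.6.6.6")
hits_v2 = query(EVENTS, concepts_v2, "source_ip", "6.6.6.6")
print(len(hits_v1), len(hits_v2))  # 1 2
```

Tools that rely on the original schema can still read `EVENTS` unchanged, which is the property the three drawbacks above are about.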
Type aliases and concepts are two different mechanisms to add
semantics to the data. The following table highlights the differences between
the two mechanisms:
| | Aliases | Concepts |
|---|---|---|
| Objective | Tune data representation | Model a domain |
| User | Schema writer | Query writer |
| Typing | Strong | Lazy |
| Location | Embedded in data | Defined outside of data |
| Modification | Only for new data | For past and new data |
| Structure | Type hierarchy | Tag-like collection |
The Imperfection of Data Models
Creating a unified data model is conceptually The Right Thing, but prior to
embarking on a long journey, we have to appreciate that it will always remain an
imperfect approximation in practice, for the following reasons:
Incompleteness: all data models are incomplete because data sources
continuously evolve.
Incorrectness: in addition to lacking information, data models contain
a growing number of errors, for the same evolutionary reasons as above.
Variance: data models vary substantially between products, making it
difficult to mix-and-match semantics.
A concept is a set of extractors to enable more semantic
querying. Tenzir translates a query expression containing a concept to a
disjunction of all extractors.
For example, consider Sysmon and Suricata events, each of which has a notion of
a network connection with a source IP address. The Sysmon event
NetworkConnection contains a field SourceIp and the Suricata event flow
contains a field src_ip for this purpose. Without concepts, querying for a
specific value would involve writing a disjunction of two predicates, one on
SourceIp and one on src_ip. With concepts, you can write this as a single
predicate on the source_ip concept.
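As a rough illustration of this translation, the following Python sketch expands a predicate on a concept into a disjunction over its extractors. The `expand` function, the `CONCEPTS` dictionary, the `==` and `||` operator spellings, and the value `6.6.6.6` are assumptions chosen for the example; they are not taken from Tenzir's query language or code.

```python
# Assumed mapping from concept name to its extractors (field names from the text).
CONCEPTS = {"source_ip": ["SourceIp", "src_ip"]}

def expand(lhs, op, rhs, concepts=CONCEPTS):
    """Rewrite `lhs op rhs` into a disjunction if lhs names a concept."""
    extractors = concepts.get(lhs)
    if extractors is None:
        return f"{lhs} {op} {rhs}"  # plain field name: leave the predicate as-is
    return " || ".join(f"{field} {op} {rhs}" for field in extractors)

with_concept = expand("source_ip", "==", "6.6.6.6")
plain_field = expand("dest_port", "==", "443")  # hypothetical non-concept field
print(with_concept)  # SourceIp == 6.6.6.6 || src_ip == 6.6.6.6
```

Adding a third data source only requires one more entry in the concept's list; the query on `source_ip` itself stays the same.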
Concepts decouple semantics from syntax and allow you to write queries that
"scale" independently of the number of data sources. Remembering every
format-specific name is tedious and error-prone.
Concepts compose. A concept can include other concepts to represent semantic
hierarchies. For example, consider the source_ip concept above. To
generalize this concept to also include MAC addresses, we could define a concept
source that includes both source_ip and a new field that represents a MAC
address. You define the composite concept in a module.
You can add new mappings to an existing concept in any module. For example,
when adding a new data source that contains an event with a source IP address
field, you can add that field to the concept in the corresponding module.
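The composition and per-module extension described above can be sketched in Python as follows. This is an assumption-laden model, not Tenzir's module format: the `MODULES` layout, the `fields` and `concepts` keys, the module name `custom`, and the MAC address field `src_mac` are all hypothetical. Each module contributes mappings, the contributions are merged per concept, and a composite concept is resolved by following its nested concepts recursively.

```python
from collections import defaultdict

# Hypothetical per-module concept definitions.
MODULES = {
    "sysmon": {"source_ip": {"fields": ["SourceIp"]}},
    "suricata": {"source_ip": {"fields": ["src_ip"]}},
    # `source` composes `source_ip` with an assumed MAC address field.
    "custom": {"source": {"fields": ["src_mac"], "concepts": ["source_ip"]}},
}

def merge(modules):
    """Combine the mappings that every module contributes to each concept."""
    merged = defaultdict(lambda: {"fields": [], "concepts": []})
    for definitions in modules.values():
        for name, body in definitions.items():
            merged[name]["fields"] += body.get("fields", [])
            merged[name]["concepts"] += body.get("concepts", [])
    return dict(merged)

def resolve(name, concepts, seen=None):
    """Return all fields of a concept, following nested concepts recursively."""
    seen = set() if seen is None else seen
    if name in seen:  # guard against cyclic definitions
        return []
    seen.add(name)
    entry = concepts[name]
    fields = list(entry["fields"])
    for nested in entry["concepts"]:
        for field in resolve(nested, concepts, seen):
            if field not in fields:
                fields.append(field)
    return fields

concepts = merge(MODULES)
print(resolve("source_ip", concepts))  # ['SourceIp', 'src_ip']
print(resolve("source", concepts))     # ['src_mac', 'SourceIp', 'src_ip']
```

A query on `source` would then expand to a disjunction over all three fields, while a query on `source_ip` only touches the two IP address fields.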