Event taxonomies address the uphill battle of data normalization. They enable you to interact with different data formats with a unified access layer, instead of having to juggle the various naming schemes and representations of each individual data source. Today, every SIEM has its own "unified" approach to represent data, e.g., elastic's ECS, splunk's CIM, QRadar's LEEF, Sentinel's ASIM, Chronicle's UDM, Panther's UDM, and the XDR Alliance's CIM There exist also vendor-agnostic with a varying focus, such as MITRE's CEE, OSSEM's CDM, or STIX SCOs. Several vendors joined forces and launched the Open Cybersecurity Schema Framework (OCSF), an open and extensible project to create a universal schema.
We could add yet another data model, but our goal is that you pick one that you know already or like best. We envision a thriving community around taxonomization, as exemplified with the OCSF. With Tenzir, we aim for leveraging the taxonomy of your choice. There are currently two mechanisms for this purpose:
- Concept: a field mapping/alias that lazily resolves at query time
- Model: a set of concepts that in sum describe a specific entity
Concepts and models are not embedded in the schema and can therefore evolve independently from the data typing. This behavior is different from other systems that normalize by rewriting the data on ingest, e.g., elastic with ECS. We do not advocate for this approach, because it has the following drawbacks:
- Data Lock-in: if you want to use a different data model tomorrow, you would have to rewrite all your past data, which can be infeasible in some cases.
- Compliance Problems: if you need an exact representation of your original data shape, you cannot perform an irreversible transformation.
- Limited Analytics: if you want to run a tool that relies on the original schema of the data, it will not work.
Type aliases and concepts are two different mechanisms to add semantics to the data. The following table highlights the differences between the two mechanisms:
|Objective||Tune data representation||Model a domain|
|User||Schema writer||Query writer|
|Location||Embedded in data||Defined outside of data|
|Modification||Only for new data||For past and new data|
|Structure||Type hierarchy||Tag-like collection|
Creating a unified data model is conceptually The Right Thing, but prior to embarking on a long journey, we have to appreciate that it will always remain an imperfect approximation in practice, for the following reasons:
- Incompleteness: we have to appreciate that all data models are incomplete because data sources continuously evolve.
- Incorrectness: in addition to lacking information, data models contain a growing number of errors, for the same evolutionary reasons as above.
- Variance: data models vary substantially between products, making it difficult to mix-and-match semantics.
A concept is a set of extractors to enable more semantic querying. Tenzir translates a query expression containing a concept to a disjunction of all extractors.
For example, Consider Sysmon and Suricata events, each of which have a notion of
a network connection with a source IP address. The Sysmon event
NetworkConnection contains a field
SourceIp and the Suricata event
contains a field
src_ip for this purpose. Without concepts, querying for a
specific value would involve writing a disjunction of two predicates:
suricata.flow.src_ip == 18.104.22.168 || sysmon.NetworkConnection.SourceIp == 22.214.171.124
With concepts, you can write this as:
source_ip == 126.96.36.199
Concepts decouple semantics from syntax and allow you to write queries that "scale" independent of the number of data sources. No one wants to remember all format-specific names, aside from being an error-prone practice.
You can define a concept in a module as follows:
description: the originator of a network-layer connection
Concepts compose. A concept can include other concepts to represent semantic
hierarchies. For example, consider our above
source_ip concept. If we want to
generalize this concept to also include MAC addresses, we could define a concept
source that includes both
source_ip and a new field that represents a MAC
You define the composite concept in a module as follows:
description: the originator of a connection
You can add new mappings to an existing concept in every module. For example, when adding a new data source that contains an event with a source IP address field, you can define the concept in the corresponding module.
A model is made of one or more concepts. An event fulfills a model if and only if it fulfills all contained concepts.
Consider again Sysmon and Suricata data for formalizing the notion of a
connection that requires the following concepts to be fulfilled:
suricata.flow fulfil all concepts of the model
connection. The model
definition looks as follows:
description: a network connection 4-tuple
Models compose like concepts: you can define a new model out of existing models or out of a mix of concepts and models. However, a concept cannot include a model.
In the above example, the
connection model consists of the
destination_endpoint model, each of which contains two concepts:
You can query a model by providing a record literal:
connection = <_, _, 10.0.0.1, 80>
The query expression resolution begins with models, continues with concepts, and
terminates when the query consists of extractors only. For example, consider the
destination_endpoint = <10.0.0.1, 80> where the left-hand side
being the name of a model and the right-hand side a record value. Tenzir resolves
this query into a conjunction first:
destination_ip == 10.0.0.1 && destination_port == 80
Thereafter, the concept resolution takes place again, assuming that there exist
concept definitions for
destination_port symmetric to
(sysmon.NetworkConnection.RemoteIp == 10.0.0.1
|| suricata.flow.dest_ip == 10.0.0.1
|| zeek.conn.id.resp_h == 10.0.0.1)
(sysmon.NetworkConnection.RemotePort == 80
|| suricata.flow.dest_port == 80
|| zeek.conn.id.resp_p == 80)
The resolution into conjunctions and disjunctions nicely illustrates the duality of models as product types and concepts as sum types.