Version: v4.23

Taxonomies

Event taxonomies address the uphill battle of data normalization. They enable you to interact with different data formats through a unified access layer, instead of having to juggle the various naming schemes and representations of each individual data source. Today, every SIEM has its own "unified" approach to represent data, e.g., Elastic's ECS, Splunk's CIM, QRadar's LEEF, Sentinel's ASIM, Chronicle's UDM, Panther's UDM, and the XDR Alliance's CIM. There also exist vendor-agnostic data models with varying focus, such as MITRE's CEE, OSSEM's CDM, or STIX SCOs. Several vendors joined forces and launched the Open Cybersecurity Schema Framework (OCSF), an open and extensible project to create a universal schema.

We could add yet another data model, but our goal is that you pick one that you already know or like best. We envision a thriving community around taxonomization, as exemplified by the OCSF. With Tenzir, we aim to leverage the taxonomy of your choice.

Concepts

A concept is a field mapping/alias that lazily resolves at query time.

Concepts are not embedded in the schema and can therefore evolve independently from the data typing. This behavior differs from other systems that normalize by rewriting the data on ingest, e.g., Elastic with ECS. We do not advocate for this approach, because it has the following drawbacks:

  • Data Lock-in: if you want to use a different data model tomorrow, you would have to rewrite all your past data, which can be infeasible in some cases.
  • Compliance Problems: if you need an exact representation of your original data shape, you cannot perform an irreversible transformation.
  • Limited Analytics: if you want to run a tool that relies on the original schema of the data, it will not work.

Type aliases and concepts are two different mechanisms to add semantics to the data. The following table highlights the differences between the two mechanisms:

|              | Aliases                  | Concepts                |
|--------------|--------------------------|-------------------------|
| Objective    | Tune data representation | Model a domain          |
| User         | Schema writer            | Query writer            |
| Typing       | Strong                   | Lazy                    |
| Location     | Embedded in data         | Defined outside of data |
| Modification | Only for new data        | For past and new data   |
| Structure    | Type hierarchy           | Tag-like collection     |

The Imperfection of Data Models

Creating a unified data model is conceptually The Right Thing, but prior to embarking on a long journey, we have to appreciate that it will always remain an imperfect approximation in practice, for the following reasons:

  • Incompleteness: all data models are incomplete because data sources continuously evolve.
  • Incorrectness: in addition to lacking information, data models contain a growing number of errors, for the same evolutionary reasons as above.
  • Variance: data models vary substantially between products, making it difficult to mix-and-match semantics.

Concepts

A concept is a set of extractors to enable more semantic querying. Tenzir translates a query expression containing a concept to a disjunction of all extractors.

For example, consider Sysmon and Suricata events, each of which has a notion of a network connection with a source IP address. The Sysmon event NetworkConnection contains a field SourceIp and the Suricata event flow contains a field src_ip for this purpose. Without concepts, querying for a specific value would involve writing a disjunction of two predicates:

suricata.flow.src_ip == 6.6.6.6 || sysmon.NetworkConnection.SourceIp == 6.6.6.6

With concepts, you can write this as:

source_ip == 6.6.6.6

Concepts decouple semantics from syntax and allow you to write queries that "scale" independently of the number of data sources. Nobody wants to remember every format-specific field name, and doing so is error-prone.
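The translation from a concept to a disjunction can be illustrated with a small Python sketch. This is not Tenzir's implementation: the function name expand_predicate and the dictionary layout are assumptions made purely for illustration. It only shows the idea of rewriting a predicate on a concept into one predicate per mapped field.

```python
# Illustrative sketch only; expand_predicate and the table layout are
# hypothetical and do not reflect Tenzir's internals.
SOURCE_IP_CONCEPTS = {
    "source_ip": [
        "suricata.flow.src_ip",
        "sysmon.NetworkConnection.SourceIp",
    ],
}


def expand_predicate(lhs, op, rhs, concepts):
    """Rewrite `lhs op rhs` into a disjunction if `lhs` names a concept."""
    fields = concepts.get(lhs)
    if fields is None:
        # Not a concept: leave the predicate untouched.
        return f"{lhs} {op} {rhs}"
    # One predicate per mapped field, joined with a logical OR.
    return " || ".join(f"{field} {op} {rhs}" for field in fields)


print(expand_predicate("source_ip", "==", "6.6.6.6", SOURCE_IP_CONCEPTS))
```

Because the expansion happens when the query is evaluated, adding a field to the table changes the result for past and new data alike, without touching stored events.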

You can define a concept in a module as follows:

concepts:
  source_ip:
    description: the originator of a network-layer connection
    fields:
    - sysmon.NetworkConnection.SourceIp
    - suricata.flow.src_ip

Concepts compose. A concept can include other concepts to represent semantic hierarchies. For example, consider our source_ip concept from above. If we want to generalize this concept to also include MAC addresses, we could define a concept source that includes both source_ip and a new field that represents a MAC address.

You define the composite concept in a module as follows:

concepts:
  source:
    description: the originator of a connection
    fields:
    - zeek.conn.id.orig_l2_addr
    concepts:
    - source_ip
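To make the composition concrete, the following Python sketch resolves a concept named source that includes source_ip into a flat list of fields. The resolve function, the dictionary layout, and the cycle guard are assumptions for illustration; how Tenzir handles cyclic definitions is not stated here.

```python
# Illustrative sketch only; resolve and the dictionary layout are hypothetical.
def resolve(name, concepts, _seen=None):
    """Return all fields reachable from `name`, following nested concepts."""
    seen = set() if _seen is None else _seen
    if name in seen:
        # Assumed behavior: ignore a concept that was already visited,
        # so that cyclic definitions cannot recurse forever.
        return []
    seen.add(name)
    definition = concepts[name]
    fields = list(definition.get("fields", []))
    for nested in definition.get("concepts", []):
        for field in resolve(nested, concepts, seen):
            if field not in fields:
                fields.append(field)
    return fields


COMPOSITE_CONCEPTS = {
    "source_ip": {
        "fields": ["sysmon.NetworkConnection.SourceIp", "suricata.flow.src_ip"],
    },
    "source": {
        "fields": ["zeek.conn.id.orig_l2_addr"],
        "concepts": ["source_ip"],
    },
}

print(resolve("source", COMPOSITE_CONCEPTS))
```

A query on source would then cover the MAC address field as well as every field mapped by source_ip.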

You can add new mappings to an existing concept in every module. For example, when adding a new data source that contains an event with a source IP address field, you can define the concept in the corresponding module.
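One way to picture per-module mappings is as a union of field lists across modules. The Python sketch below assumes exactly that; the merge_modules function and the field name example.event.src_addr are hypothetical, and the precise merge semantics in Tenzir may differ.

```python
# Illustrative sketch only; merge_modules and example.event.src_addr are
# hypothetical names, and union semantics are an assumption.
def merge_modules(*modules):
    """Combine the concept definitions of several modules."""
    merged = {}
    for module in modules:
        for name, definition in module.get("concepts", {}).items():
            target = merged.setdefault(name, {"description": "", "fields": []})
            # Keep the first non-empty description that was seen.
            if not target["description"]:
                target["description"] = definition.get("description", "")
            # Append fields in order, skipping duplicates.
            for field in definition.get("fields", []):
                if field not in target["fields"]:
                    target["fields"].append(field)
    return merged


BASE_MODULE = {
    "concepts": {
        "source_ip": {
            "description": "the originator of a network-layer connection",
            "fields": ["sysmon.NetworkConnection.SourceIp", "suricata.flow.src_ip"],
        }
    }
}
# A module for a new, made-up data source that extends the same concept.
NEW_MODULE = {"concepts": {"source_ip": {"fields": ["example.event.src_addr"]}}}

MERGED = merge_modules(BASE_MODULE, NEW_MODULE)
print(MERGED["source_ip"]["fields"])
```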