Taxonomies

Experimental Feature

This is an experimental feature: the API is subject to change and robustness is not yet comparable to production-grade features.

Taxonomies enable you to interact with different data formats through a unified, domain-specific access layer, instead of having to juggle the various naming schemes of each individual data source. In addition to standardizing data access, taxonomies lift queries to a richer semantic level. A frequently used synonym for taxonomy is common information model (CIM).

Note

Taxonomies only apply to the query processing of data in VAST and do not affect ingestion.

Consider the scenario of having two types of network security logs: Sysmon from the endpoint and Suricata on the network side. When a user wants to query all flow events from a particular source in both formats, they would write a disjunction of two predicates:

suricata.flow.src_ip == 6.6.6.6 || sysmon.NetworkConnection.SourceIp == 6.6.6.6

Every additional format adds another predicate, so the query grows linearly with the number of data sources. Concepts remove this linear scaling by decoupling semantics from syntax, which takes substantial load off the analyst's mind. Remembering format-specific field names is a poor investment of analyst time and error-prone in practice. All the user wants to express in the above example is that a source IP address has a certain value. So why not let the user write it directly like this?

source_ip == 6.6.6.6

Taxonomies enable exactly that. VAST automatically translates this short query to the more complex one above.

In the following, we introduce the building blocks of taxonomies that make this possible.

Concepts

A concept is the abstract meaning of a data value, independent of the format in which it occurs. The representation of a concept is a name that maps to a set of field names. VAST translates a query expression containing a concept name to a disjunction over all mapped fields. That is, concepts are sum types.

Consider Sysmon and Suricata data, each of which has a notion of a network connection event with a source IP address. The Sysmon event NetworkConnection contains a field SourceIp and the Suricata event flow contains a field src_ip for this purpose. You can define a new concept source_ip by mapping these two fields:

- concept:
    name: source_ip
    description: the originator of a network-layer connection
    fields:
      - sysmon.NetworkConnection.SourceIp
      - suricata.flow.src_ip

An event fulfills a concept if one or more fields of the event type definition occur in the concept mapping. In the above example, the events NetworkConnection and flow both fulfill the concept source_ip.
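The translation from a concept name to a disjunction can be sketched in a few lines of Python. This is only an illustration of the idea, not VAST's implementation: the `CONCEPTS` dictionary and the `expand_concept` helper are hypothetical names, and the query is handled as a plain string.

```python
# Hypothetical in-memory form of the concept mapping above; VAST's actual
# internal representation may differ.
CONCEPTS = {
    "source_ip": [
        "sysmon.NetworkConnection.SourceIp",
        "suricata.flow.src_ip",
    ],
}


def expand_concept(name, op, value):
    """Rewrite `<concept> <op> <value>` as a disjunction over mapped fields."""
    fields = CONCEPTS[name]
    return " || ".join(f"{field} {op} {value}" for field in fields)


print(expand_concept("source_ip", "==", "6.6.6.6"))
```

Every mapped field receives the same operator and right-hand side, which is why a concept behaves like a sum type over its fields.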

Dynamic Typing

Note that concepts have no particular type and the contained field names do not have to share the same type either. Queries remain meaningful even if types differ, because VAST ignores all fields where the type does not match. For example, if you define a concept x that includes two fields a of type real and b of type string, then the query x < 4.2 would only consider field a.
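As a rough sketch of this type filtering, the following Python snippet keeps only the fields whose declared type matches the type of the query value. The type names, the `FIELD_TYPES` table, and both helper functions are assumptions made for illustration; VAST's own type system is richer than this.

```python
# Hypothetical schema: field name -> type name, mirroring the example above.
FIELD_TYPES = {"a": "real", "b": "string"}
CONCEPT_X = ["a", "b"]


def type_of_value(value):
    """Map a Python value to an illustrative type name."""
    if isinstance(value, float):
        return "real"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value: {value!r}")


def applicable_fields(fields, value):
    """Keep only the fields whose declared type matches the value's type."""
    wanted = type_of_value(value)
    return [field for field in fields if FIELD_TYPES[field] == wanted]


print(applicable_fields(CONCEPT_X, 4.2))
```

For the value 4.2 only field a survives, so the query x < 4.2 would be evaluated against a alone.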

Composition

A concept can include other concepts. This enables flexible composition to represent semantic hierarchies.

For example, consider our above source_ip concept. If we want to generalize this concept to also include MAC addresses, we can define a concept source that includes both source_ip and a new field that represents a MAC address.

Naturally, it is equally possible to construct a new concept source_mac to bundle all fields that represent a MAC address source.

Referencing other concepts uses the concepts key:

- concept:
    name: source
    description: the originator of a connection
    fields:
      - zeek.conn.id.orig_l2_addr
    concepts:
      - source_ip
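Resolving a composed concept amounts to collecting its own fields plus the fields of every referenced concept, recursively. The sketch below assumes a hypothetical dictionary layout and a `resolve_fields` helper; the cycle guard is an addition for safety and not something the text above specifies.

```python
# Hypothetical in-memory form of the two concept definitions.
CONCEPTS = {
    "source_ip": {
        "fields": ["sysmon.NetworkConnection.SourceIp", "suricata.flow.src_ip"],
        "concepts": [],
    },
    "source": {
        "fields": ["zeek.conn.id.orig_l2_addr"],
        "concepts": ["source_ip"],
    },
}


def resolve_fields(name, seen=None):
    """Return all fields of a concept, following nested concepts."""
    seen = set() if seen is None else seen
    if name in seen:  # guard against accidental cycles (an assumption)
        return []
    seen.add(name)
    concept = CONCEPTS[name]
    fields = list(concept["fields"])
    for nested in concept["concepts"]:
        for field in resolve_fields(nested, seen):
            if field not in fields:
                fields.append(field)
    return fields


print(resolve_fields("source"))
```

A query on source therefore covers the MAC address field and both IP address fields.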

Dynamic Extension

Since concepts define the common substrate across multiple data formats, adding a new format typically requires extending existing concept mappings. For example, when adding a new log type that contains a source IP address, the corresponding field should be appended to the source_ip concept mapping. From an operator's perspective, the addition of a new format should take place locally, i.e., define syntax (type definitions) and semantics (concept definitions) of the new format in one shot.

Dynamic extension is possible by design, because concepts are sum types. Here is an example where we add a new type zeek.conn and extend the source_ip concept:

# Syntax: define the type layout
- type:
    name: zeek.conn
    fields:
      ...
      - id.orig_h: address
      ...
# Semantics: define the field meanings
- concept:
    name: source_ip
    fields:
      - zeek.conn.id.orig_h

This out-of-band concept definition and extension mechanism makes it possible to use completely different nomenclatures, allowing users to tailor the VAST query language to their needs.
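One way to picture dynamic extension is as a union of all concept entries that share a name. The following Python sketch works on already-parsed definitions; the `merge_concepts` helper and the exact merge behavior are assumptions for illustration and may not match how VAST combines definitions internally.

```python
def merge_concepts(definitions):
    """Union the field lists of all concept entries that share a name."""
    merged = {}
    for entry in definitions:
        concept = entry["concept"]
        fields = merged.setdefault(concept["name"], [])
        for field in concept.get("fields", []):
            if field not in fields:
                fields.append(field)
    return merged


# Existing definition, as it might appear after parsing a taxonomy file.
base = [{"concept": {
    "name": "source_ip",
    "fields": ["sysmon.NetworkConnection.SourceIp", "suricata.flow.src_ip"],
}}]
# Extension shipped alongside the new zeek.conn type.
extension = [{"concept": {"name": "source_ip", "fields": ["zeek.conn.id.orig_h"]}}]

print(merge_concepts(base + extension)["source_ip"])
```

Because a concept is a disjunction over its fields, appending a field never invalidates an existing query; it only widens the set of matching events.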

Experimental Feature

VAST does not support YAML schema definitions yet. Until then, files with type definitions must end in *.schema and files with concept definitions in *.yaml.

Models

A model contains one or more concepts. While a concept applies to a single field of multiple types, a model combines multiple concepts (and thereby fields) into a tuple. In other words, a model is a subset of fields of an event with specific semantics. Models are thus product types. An event fulfills a model if and only if it fulfills all contained concepts.

Models can only be defined in terms of concepts, not individual fields. If VAST allowed individual fields in model definitions, only the event type containing that field could fulfill the model. If only one type fulfills the criteria of a model when it is first defined, the missing concepts must be introduced at that point to provide a clear separation of concerns and a natural update mechanism.

Consider again Sysmon and Suricata data and assume that we want to formalize the notion of a connection, which requires the following concepts to be fulfilled: source_ip, source_port, destination_ip, and destination_port. The model definition looks as follows:

model:
  name: connection
  description: a network connection 4-tuple
  definition:
    - source_ip
    - source_port
    - destination_ip
    - destination_port

Both sysmon.NetworkConnection and suricata.flow fulfill all concepts of the model connection.
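The "fulfills" relation can be sketched as two small predicates: an event type fulfills a concept if it shares at least one field with the concept mapping, and it fulfills a model if it fulfills every contained concept. In the Python sketch below, the source port field names (SourcePort, src_port) and the example.login type are assumptions added for illustration; they do not appear in the definitions above.

```python
CONCEPTS = {
    "source_ip": {"sysmon.NetworkConnection.SourceIp", "suricata.flow.src_ip",
                  "example.login.client_ip"},
    # The source port field names are assumed for this sketch.
    "source_port": {"sysmon.NetworkConnection.SourcePort", "suricata.flow.src_port"},
    "destination_ip": {"sysmon.NetworkConnection.RemoteIp", "suricata.flow.dest_ip"},
    "destination_port": {"sysmon.NetworkConnection.RemotePort",
                         "suricata.flow.dest_port"},
}
MODELS = {
    "connection": ["source_ip", "source_port", "destination_ip", "destination_port"],
}
# Event types as sets of fully qualified field names.
EVENT_TYPES = {
    "sysmon.NetworkConnection": {
        "sysmon.NetworkConnection.SourceIp", "sysmon.NetworkConnection.SourcePort",
        "sysmon.NetworkConnection.RemoteIp", "sysmon.NetworkConnection.RemotePort",
    },
    "suricata.flow": {
        "suricata.flow.src_ip", "suricata.flow.src_port",
        "suricata.flow.dest_ip", "suricata.flow.dest_port",
    },
    # Hypothetical type that only carries a source IP address.
    "example.login": {"example.login.client_ip", "example.login.user"},
}


def fulfills_concept(event_type, concept):
    """True if at least one field of the event type is mapped by the concept."""
    return bool(EVENT_TYPES[event_type] & CONCEPTS[concept])


def fulfills_model(event_type, model):
    """True if the event type fulfills every concept of the model."""
    return all(fulfills_concept(event_type, c) for c in MODELS[model])


for name in EVENT_TYPES:
    print(name, fulfills_model(name, "connection"))
```

The hypothetical example.login type fulfills source_ip but not connection, which illustrates the difference between the "any" semantics of concepts and the "all" semantics of models.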

Experimental Feature

VAST does not support models yet. We already included this feature in the documentation to showcase the coherent design and to solicit feedback from the community.

Composition

Models compose in the same fashion as concepts: it is possible to define a new model entirely out of existing models, or out of a mix of concepts and models. However, a concept cannot include a model.

For example, the connection model can also be composed of two models, source_endpoint and destination_endpoint, each of which contains two concepts.

Another example of a composite model is a network connection tied to an OS process. Let's assume that there exists a concept process_filename that represents the executable path of a loaded process. Then the definition of the process_network_event model could look as follows:

model:
  name: process_network_event
  description: process initiating or receiving a network connection
  definition:
    - connection
    - process_filename

A query using this model could look like this:

process_network_event = <_, _, 10.0.0.1, 80, "firefox.exe">

The model structure (i.e., the nesting) does not have to occur on the RHS because this information is embodied in the model itself. Otherwise users would have to write very clumsy values, e.g., <<_, _>, <10.0.0.1, 80>, "firefox.exe"> which would come at the cost of usability.
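The flattening of a nested model into a positional record can be sketched as follows. The snippet assumes that connection is composed of the source_endpoint and destination_endpoint models, that `_` acts as a wildcard that produces no predicate, and that values are passed as strings; `flatten` and `model_to_conjunction` are hypothetical helper names.

```python
MODELS = {
    "source_endpoint": ["source_ip", "source_port"],
    "destination_endpoint": ["destination_ip", "destination_port"],
    "connection": ["source_endpoint", "destination_endpoint"],
    "process_network_event": ["connection", "process_filename"],
}
WILDCARD = "_"  # assumed to match any value


def flatten(name):
    """Expand a model into its flat, ordered list of concepts."""
    if name not in MODELS:
        return [name]  # a concept: nothing left to expand
    concepts = []
    for part in MODELS[name]:
        concepts.extend(flatten(part))
    return concepts


def model_to_conjunction(model, values):
    """Pair the flattened concepts with the record values, skipping wildcards."""
    concepts = flatten(model)
    if len(concepts) != len(values):
        raise ValueError(f"{model} expects {len(concepts)} values, got {len(values)}")
    predicates = [f"{c} == {v}" for c, v in zip(concepts, values) if v != WILDCARD]
    return " && ".join(predicates)


print(model_to_conjunction(
    "process_network_event", ["_", "_", "10.0.0.1", "80", '"firefox.exe"']))
```

Since the nesting lives in the model definitions, the user only supplies five values in order, and a record of the wrong length is rejected up front.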

Query Processing

From the query perspective, concepts and models behave like "virtual" record fields that override schema definitions. The resolution process starts with models, continues with concepts, and terminates when the query consists of fields only.

The resolution is recursive in that models can contain other models and concepts, and concepts can contain other concepts.

To resolve a query that uses a concept, VAST only needs to look up the concept mapping and build a disjunction over the referenced fields. For example, with a destination_ip concept that maps three fields, the concept expression destination_ip in 10.0.0.0/8 unfolds into a disjunction with three predicates, each of which has the same RHS:

sysmon.NetworkConnection.RemoteIp in 10.0.0.0/8
|| suricata.flow.dest_ip in 10.0.0.0/8
|| zeek.conn.id.resp_h in 10.0.0.0/8

An example for a model query is the predicate destination_endpoint = <10.0.0.1, 80>, with the left-hand side being the name of a model and the right-hand side a data record. Using the same mapping tables as above, VAST resolves this query into a conjunction first:

destination_ip == 10.0.0.1 && destination_port == 80

Thereafter, concept resolution takes place again, assuming that there exists a concept definition for destination_port symmetric to destination_ip:

(sysmon.NetworkConnection.RemoteIp == 10.0.0.1
|| suricata.flow.dest_ip == 10.0.0.1
|| zeek.conn.id.resp_h == 10.0.0.1)
&&
(sysmon.NetworkConnection.RemotePort == 80
|| suricata.flow.dest_port == 80
|| zeek.conn.id.resp_p == 80)

The resolution into conjunctions and disjunctions nicely illustrates the duality of models as product types and concepts as sum types.
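The two resolution steps can be combined into one small Python sketch that turns a model query into a conjunction of disjunctions. It uses the field names from the example above and treats the query as a string; the `resolve` function and the dictionary layout are hypothetical and only meant to make the duality concrete.

```python
CONCEPTS = {
    "destination_ip": [
        "sysmon.NetworkConnection.RemoteIp",
        "suricata.flow.dest_ip",
        "zeek.conn.id.resp_h",
    ],
    "destination_port": [
        "sysmon.NetworkConnection.RemotePort",
        "suricata.flow.dest_port",
        "zeek.conn.id.resp_p",
    ],
}
MODELS = {"destination_endpoint": ["destination_ip", "destination_port"]}


def resolve(model, values):
    """Resolve a model query: one conjunct per concept, one disjunct per field."""
    clauses = []
    for concept, value in zip(MODELS[model], values):
        disjunction = " || ".join(f"{field} == {value}" for field in CONCEPTS[concept])
        clauses.append(f"({disjunction})")
    return " && ".join(clauses)


print(resolve("destination_endpoint", ["10.0.0.1", "80"]))
```

The outer join with && corresponds to the model (product type), the inner join with || to each concept (sum type).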

Usage

Writing Taxonomies

During server startup, VAST picks up taxonomy definitions from YAML files (ending in *.yaml) from the schema directory.

VAST expects a single list in the YAML file. Each element of the list is an object with a single key that represents the definition type, e.g., concept or model.

Writing Concepts

To define a concept, introduce a new YAML list element with a concept object. Here is an example:

- concept:
    name: destination_ip
    fields:
      - suricata.flow.dest_ip
      - zeek.conn.id.resp_h

A concept definition must include a name and a list of fields that VAST should substitute for that name.
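Before handing a taxonomy file to VAST, it can help to check its overall shape. The Python sketch below validates an already-parsed document (for example, the result of loading the YAML with a parser of your choice). The `validate_taxonomy` function is a hypothetical helper, not part of VAST, and it deliberately checks only the list structure, the definition type, and the presence of a name.

```python
KNOWN_KINDS = {"concept", "model"}


def validate_taxonomy(document):
    """Return a list of human-readable problems; empty means the shape is fine."""
    if not isinstance(document, list):
        return ["top level must be a list"]
    problems = []
    for index, entry in enumerate(document):
        if not isinstance(entry, dict) or len(entry) != 1:
            problems.append(f"entry {index}: expected an object with a single key")
            continue
        (kind, body), = entry.items()
        if kind not in KNOWN_KINDS:
            problems.append(f"entry {index}: unknown definition type {kind!r}")
        elif not isinstance(body, dict) or "name" not in body:
            problems.append(f"entry {index}: {kind} needs a name")
    return problems


document = [
    {"concept": {
        "name": "destination_ip",
        "fields": ["suricata.flow.dest_ip", "zeek.conn.id.resp_h"],
    }},
]
print(validate_taxonomy(document))
```

An empty result means the document follows the list-of-single-key-objects layout; it does not guarantee that VAST accepts every field in it.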

Taxonomy Introspection

To inspect existing models and concepts in a given VAST instance, you can use vast status --debug on the command line.

FAQs

How do VAST taxonomies relate to Splunk Data Models?

Splunk Data Models are a superset of VAST concepts. Splunk ships with its own Common Information Model (CIM), which is a collection of data models.

VAST ships with a script taxonomize that converts a Splunk data model JSON file to VAST concept declarations. An example invocation can look as follows:

taxonomize --splunk < Network_Traffic.json | head -n 16

This produces the following output:

# All_Traffic: extracted fields
- concept:
    name: splunk.all_traffic.app
    description: >-
      The application protocol of the traffic.
- concept:
    name: splunk.all_traffic.channel
    description: >-
      The 802.11 channel used by a wireless network.
- concept:
    name: splunk.all_traffic.dest_bunit
    description: >-
      This field is automatically provided by asset and identity correlation
      features of applications like Splunk Enterprise Security. Do not define
      extractions for this field when writing add-ons.

The script only generates concept declarations, not the actual bindings to concrete data formats, which you still need to define separately. For example, to bind a Splunk field name to an existing VAST concept:

- concept:
    name: splunk.all_traffic.src_ip
    concepts:
      - net.src.ip

Contributing

VAST is an open-source project. We are always open to community contributions and to bundling user-provided taxonomies.

How do VAST taxonomies relate to ECS?

ECS is an instance of a taxonomy. ECS defines exactly one nomenclature. In VAST, users can define multiple taxonomies. Extension was part of the design.

Tenzir will provide a best-effort definition of ECS mappings to the schema definitions that are shipped with VAST.

What happens with the schema when the taxonomy changes?

Adding a new field to a concept creates a strict superset of the concept. Removing a field from a concept creates a strict subset of the concept.

Adding or removing elements from a model changes the arity of the record, rendering existing queries invalid. For example, if x = <a, b, c> is a valid model query and the model receives a new concept, the query must include a fourth element: x = <a, b, c, d>.

What happens with the taxonomy when the schema changes?

Since concepts are untyped, a changing field type in the schema would not break the concept semantics.

However, if a field in the schema is renamed, the concept definition must be updated as well. Otherwise the renamed field will no longer be included in concept queries.
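A simple consistency check can catch such stale references after a schema change. The Python sketch below compares the fields referenced by each concept with the set of fields the schema currently defines. The `stale_fields` helper and the renamed field suricata.flow.source_address are hypothetical and serve only as an illustration.

```python
def stale_fields(concepts, schema_fields):
    """Map each concept name to its fields that the schema no longer defines."""
    report = {}
    for name, fields in concepts.items():
        missing = [field for field in fields if field not in schema_fields]
        if missing:
            report[name] = missing
    return report


CONCEPTS = {
    "source_ip": ["sysmon.NetworkConnection.SourceIp", "suricata.flow.src_ip"],
}
# Hypothetical schema state after renaming suricata.flow.src_ip.
SCHEMA_FIELDS = {
    "sysmon.NetworkConnection.SourceIp",
    "suricata.flow.source_address",
}

print(stale_fields(CONCEPTS, SCHEMA_FIELDS))
```

Any field reported here would silently drop out of concept queries until the concept definition is updated to the new name.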