

Taxonomies let you interact with different data formats through a unified, domain-specific access layer, instead of juggling the naming schemes of each individual data source. In addition to standardizing data access, taxonomies lift queries to a richer semantic level. A frequently used synonym for taxonomy is common information model (CIM).


Taxonomies only apply to the query processing of data in VAST and do not affect ingestion.

Consider the scenario of having two types of network security logs: Sysmon from the endpoint and Suricata on the network side. When a user wants to query all flow events from a particular source in both formats, they would write a disjunction of two predicates:

suricata.flow.src_ip == || sysmon.NetworkConnection.SourceIp ==

With this approach, every additional format adds another predicate to the query. Concepts remove this linear scaling by decoupling semantics from syntax, which takes substantial load off the analyst's mind. Remembering format-specific field names is not a good investment of analyst time and is error-prone in practice. All the user wants to express in the above example is that a source IP address has a certain value. So why not let the user write it directly like this?

source_ip ==

Taxonomies enable exactly that. VAST automatically translates this short query to the more complex one above.

In the following, we introduce the building blocks of taxonomies that make this possible.


A concept is the abstract meaning of a data value, independent of the format it occurs in. The representation of a concept is a name that maps to a set of field names. VAST translates a query expression containing a concept name to a disjunction over all mapped fields. That is, concepts are sum types.

Consider Sysmon and Suricata data, each of which has a notion of a network connection event with a source IP address. The Sysmon event NetworkConnection contains a field SourceIp and the Suricata event flow contains a field src_ip for this purpose. You can define a new concept source_ip by mapping these two fields:

- concept:
    name: source_ip
    description: the originator of a network-layer connection
    fields:
      - sysmon.NetworkConnection.SourceIp
      - suricata.flow.src_ip

An event fulfills a concept if one or more fields of the event type definition occur in the concept mapping. In the above example, the events NetworkConnection and flow both fulfill the concept source_ip.
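
To make the translation concrete, here is a minimal Python sketch of how a concept predicate could unfold into a disjunction and how fulfillment could be checked. It is an illustration, not VAST's implementation: the dictionary layout, the function names, and the address 10.0.0.1 are assumptions made for this example.

```python
# Hypothetical in-memory form of the concept mapping shown above.
CONCEPTS = {
    "source_ip": [
        "sysmon.NetworkConnection.SourceIp",
        "suricata.flow.src_ip",
    ],
}


def expand_predicate(concept, op, value):
    """Rewrite `<concept> <op> <value>` as a disjunction over all mapped fields."""
    return " || ".join(f"{field} {op} {value}" for field in CONCEPTS[concept])


def fulfills(event_fields, concept):
    """An event type fulfills a concept if at least one of its fields is mapped."""
    return any(field in event_fields for field in CONCEPTS[concept])


# 10.0.0.1 is a made-up address used only for illustration.
print(expand_predicate("source_ip", "==", "10.0.0.1"))
```

The list order of the mapping determines the order of the predicates; since a disjunction is commutative, the order has no effect on the result set.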

Dynamic Typing#

Note that concepts have no particular type and the contained field names do not have to share the same type either. Queries remain meaningful even if types differ, because VAST ignores all fields where the type does not match. For example, if you define a concept x that includes two fields a of type real and b of type string, then the query x < 4.2 would only consider field a.
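
The type-based filtering can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not VAST's actual type checker: the schema dictionary, the two supported literal types, and the function names are hypothetical.

```python
# Hypothetical schema for the example above: field name -> type name.
FIELD_TYPES = {"a": "real", "b": "string"}
CONCEPTS = {"x": ["a", "b"]}


def literal_type(value):
    """Map a Python literal to the type name used in the sketch schema."""
    if isinstance(value, float):
        return "real"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported literal: {value!r}")


def candidate_fields(concept, value):
    """Keep only the concept's fields whose type matches the query value."""
    wanted = literal_type(value)
    return [field for field in CONCEPTS[concept] if FIELD_TYPES[field] == wanted]


print(candidate_fields("x", 4.2))  # ['a']
```

Fields with a mismatching type are dropped silently instead of causing an error, which mirrors the behavior described above.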


A concept can include other concepts. This enables flexible composition to represent semantic hierarchies.

For example, consider our above source_ip concept. If we want to generalize this concept to also include MAC addresses, we can define a concept source that includes both source_ip and a new field that represents a MAC address.

Naturally, it is equally possible to construct a new concept source_mac to bundle all fields that represent a MAC address source.

Referencing other concepts uses the concepts key:

- concept:
    name: source
    description: the originator of a connection
    concepts:
      - source_ip
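
One way to picture nested concepts is as a recursive expansion into a flat field list. The following Python sketch assumes a dictionary representation with fields and concepts entries; the field example.event.src_mac is a hypothetical placeholder for a MAC address field, and the cycle guard is an addition of this sketch rather than documented VAST behavior.

```python
# Hypothetical parsed concept definitions; example.event.src_mac is a placeholder.
CONCEPTS = {
    "source_ip": {
        "fields": ["sysmon.NetworkConnection.SourceIp", "suricata.flow.src_ip"],
        "concepts": [],
    },
    "source": {
        "fields": ["example.event.src_mac"],
        "concepts": ["source_ip"],
    },
}


def resolve_fields(name, seen=None):
    """Return all fields of a concept, following nested concepts recursively."""
    seen = set() if seen is None else seen
    if name in seen:  # guard against cyclic definitions
        return []
    seen.add(name)
    entry = CONCEPTS[name]
    fields = list(entry["fields"])
    for nested in entry["concepts"]:
        for field in resolve_fields(nested, seen):
            if field not in fields:
                fields.append(field)
    return fields


print(resolve_fields("source"))
```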

Dynamic Extension#

Since concepts define the common substrate across multiple data formats, adding a new format typically requires extending existing concept mappings. For example, when adding a new log type that contains a source IP address, the corresponding field should be appended to the source_ip concept mapping. From an operator's perspective, the addition of a new format should take place locally, i.e., define syntax (type definitions) and semantics (concept definitions) of the new format in one shot.

Dynamic extension is possible by design, because concepts are sum types. Here is an example where we add a new type zeek.conn and extend the source_ip concept:

# Syntax: define the type layout
- type:
    name: zeek.conn
    record:
      - id.orig_h: address
# Semantics: define the field meanings
- concept:
    name: source_ip
    fields:
      - zeek.conn.id.orig_h

This out-of-band concept definition and extension mechanism makes it possible to use completely different nomenclatures, allowing users to tailor the VAST query language to their needs.
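
Because concepts are sum types, extension amounts to a union of field lists across all definitions that share a name. The Python sketch below illustrates this merging under the assumption that each definition is a dictionary with name and fields entries; the function name and the data layout are illustrative, not taken from VAST.

```python
def merge_concepts(definitions):
    """Union the field lists of all concept definitions that share a name."""
    merged = {}
    for definition in definitions:
        fields = merged.setdefault(definition["name"], [])
        for field in definition.get("fields", []):
            if field not in fields:  # keep first occurrence, skip duplicates
                fields.append(field)
    return merged


# Base definition plus a later extension that ships with the new format.
base = {
    "name": "source_ip",
    "fields": ["sysmon.NetworkConnection.SourceIp", "suricata.flow.src_ip"],
}
extension = {"name": "source_ip", "fields": ["zeek.conn.id.orig_h"]}

print(merge_concepts([base, extension]))
```

With this approach the file that introduces a new format never has to touch existing definitions; it only contributes additional fields.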

Experimental Feature

VAST does not support YAML schema definitions yet. Until then, all schema files with type definitions must end in *.schema and files with concept definitions in *.yaml.


A model contains one or more concepts. While a concept applies to a single field of multiple types, a model combines multiple concepts (and thereby fields) into a tuple. In other words, a model is a subset of fields of an event with specific semantics. Models are thus product types. An event fulfills a model if and only if it fulfills all contained concepts.

Models can only be defined in terms of concepts, not individual fields. If VAST allowed individual fields in model definitions, only the event type that owns such a field could ever fulfill the model. If only one type fulfills the criteria of a model when it is first defined, missing concepts must be introduced at that point to provide a clear separation of concerns and a natural update mechanism.

Consider again Sysmon and Suricata data and assume that we want to formalize the notion of a connection, which requires the following concepts to be fulfilled: source_ip, source_port, destination_ip, and destination_port. The model definition looks as follows:

- model:
    name: connection
    description: a network connection 4-tuple
    definition:
      - source_ip
      - source_port
      - destination_ip
      - destination_port

Both sysmon.NetworkConnection and suricata.flow fulfill all concepts of the model connection.
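
The difference between fulfilling a concept (any mapped field) and fulfilling a model (all contained concepts) can be sketched in Python as follows. The source port field names and the data layout are assumptions of this example and do not come from a VAST schema.

```python
# Hypothetical mappings; the source port field names are assumed for illustration.
CONCEPTS = {
    "source_ip": ["sysmon.NetworkConnection.SourceIp", "suricata.flow.src_ip"],
    "source_port": ["sysmon.NetworkConnection.SourcePort", "suricata.flow.src_port"],
    "destination_ip": ["sysmon.NetworkConnection.RemoteIp", "suricata.flow.dest_ip"],
    "destination_port": ["sysmon.NetworkConnection.RemotePort", "suricata.flow.dest_port"],
}
MODELS = {
    "connection": ["source_ip", "source_port", "destination_ip", "destination_port"],
}


def fulfills_concept(event_fields, concept):
    """Sum type: one matching field is enough."""
    return any(field in event_fields for field in CONCEPTS[concept])


def fulfills_model(event_fields, model):
    """Product type: every contained concept must be fulfilled."""
    return all(fulfills_concept(event_fields, c) for c in MODELS[model])


suricata_flow = {
    "suricata.flow.src_ip",
    "suricata.flow.src_port",
    "suricata.flow.dest_ip",
    "suricata.flow.dest_port",
}
print(fulfills_model(suricata_flow, "connection"))  # True
```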


Models compose in the same fashion as concepts: it is possible to define a new model entirely out of existing models, or out of a mix of concepts and models. However, a concept cannot include a model.

For example, the connection model could equally consist of two models, source_endpoint and destination_endpoint, each of which contains two concepts.

Another example of a composite model is a network connection tied to an OS process. Let's assume that there exists a concept process_filename that represents the executable path of a loaded process. Then the definition of the process_network_event model could look as follows:

- model:
    name: process_network_event
    description: process initiating or receiving a network connection
    definition:
      - connection
      - process_filename

A query using this model could look like this:

process_network_event = <_, _,, 80, "firefox.exe">

The model structure (i.e., the nesting) does not have to occur on the RHS because this information is embodied in the model itself. Otherwise users would have to write very clumsy values, e.g., <<_, _>, <, 80>, "firefox.exe"> which would come at the cost of usability.

Query Processing#

From the query perspective, concepts and models behave like "virtual" record fields that override schema definitions. The resolution process starts with models, continues with concepts, and terminates when the query consists of fields only.

The resolution is recursive in that models and concepts can contain other instances of models and concepts.

To resolve a query that uses a concept, VAST only needs to look up the concept mapping and build a disjunction from the referenced fields. In the above example, the concept expression destination_ip in unfolds into a disjunction with three predicates, each of which has the same RHS:

sysmon.NetworkConnection.RemoteIp in
|| suricata.flow.dest_ip in
|| in

An example for a model query is the predicate destination_endpoint = <, 80>, with the left-hand side being the name of a model and the right-hand side a data record. Using the same mapping tables as above, VAST resolves this query into a conjunction first:

destination_ip == && destination_port == 80

Thereafter, the concept resolution takes place again, assuming that there exists a concept definition for destination_port analogous to destination_ip:

(sysmon.NetworkConnection.RemoteIp ==
|| suricata.flow.dest_ip ==
|| ==)
&& (sysmon.NetworkConnection.RemotePort == 80
|| suricata.flow.dest_port == 80
|| == 80)

The resolution into conjunctions and disjunctions nicely illustrates the duality of models as product types and concepts as sum types.
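
The two-stage resolution described above can be sketched in Python: a model predicate first becomes a conjunction of concept predicates, and each concept predicate then becomes a disjunction of field predicates. This is a simplified illustration, not VAST's resolver. The mappings contain only two formats, the address 10.0.0.1 is made up, and skipping the wildcard _ is an assumption about how placeholders behave.

```python
# Hypothetical mappings with two formats only.
CONCEPTS = {
    "destination_ip": ["sysmon.NetworkConnection.RemoteIp", "suricata.flow.dest_ip"],
    "destination_port": ["sysmon.NetworkConnection.RemotePort", "suricata.flow.dest_port"],
}
MODELS = {"destination_endpoint": ["destination_ip", "destination_port"]}


def flatten_model(name):
    """Expand nested models into a flat, ordered list of concept names."""
    leaves = []
    for member in MODELS[name]:
        if member in MODELS:
            leaves.extend(flatten_model(member))
        else:
            leaves.append(member)
    return leaves


def resolve_concept(concept, value):
    """Concept -> disjunction over all mapped fields."""
    return "(" + " || ".join(f"{f} == {value}" for f in CONCEPTS[concept]) + ")"


def resolve_model(model, values):
    """Model -> conjunction of concept predicates; `_` skips a position."""
    concepts = flatten_model(model)
    if len(concepts) != len(values):
        raise ValueError(f"{model} expects {len(concepts)} values, got {len(values)}")
    clauses = [resolve_concept(c, v) for c, v in zip(concepts, values) if v != "_"]
    return " && ".join(clauses)


print(resolve_model("destination_endpoint", ["10.0.0.1", "80"]))
```

Because flatten_model removes the nesting before the values are matched, the right-hand side can stay a flat record, which corresponds to the usability argument made earlier for composite models.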


Writing Taxonomies#

During server startup, VAST picks up taxonomy definitions from YAML files (ending in *.yaml) from the schema directory.

VAST expects a single list in the YAML file. Each element of the list is an object with a single key representing the definition type, e.g., concept or model.

Writing Concepts#

To define a concept, introduce a new YAML list element with a concept object. Here is an example:

- concept:
    name: destination_ip
    fields:
      - suricata.flow.dest_ip

A concept definition must include a name and a list of fields that VAST should substitute for that name.
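
Before deploying a taxonomy file, it can help to check these structural rules on the parsed data. The following Python sketch operates on an already parsed YAML document (a list of dictionaries) and is not part of VAST; the function name, the error messages, and the assumption that the field list lives under a fields key (or a concepts key for nested concepts) are illustrative.

```python
def validate_taxonomy(entries):
    """Return a list of human-readable problems found in parsed taxonomy entries."""
    if not isinstance(entries, list):
        return ["top level must be a list"]
    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict) or len(entry) != 1:
            problems.append(f"entry {index}: expected an object with exactly one key")
            continue
        kind, body = next(iter(entry.items()))
        if kind not in ("concept", "model"):
            problems.append(f"entry {index}: unknown definition type {kind!r}")
            continue
        if not isinstance(body, dict) or not body.get("name"):
            problems.append(f"entry {index}: {kind} is missing a name")
            continue
        if kind == "concept" and not (body.get("fields") or body.get("concepts")):
            problems.append(f"entry {index}: concept {body['name']!r} maps to nothing")
    return problems


good = [{"concept": {"name": "destination_ip", "fields": ["suricata.flow.dest_ip"]}}]
bad = [{"concept": {"fields": ["suricata.flow.dest_ip"]}}, {"alias": {}}]
print(validate_taxonomy(good))
print(validate_taxonomy(bad))
```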

Taxonomy Introspection#

To inspect existing models and concepts in a given VAST instance, you can use vast status --debug on the command line.


How do VAST taxonomies relate to Splunk Data Models?#

Splunk Data Models are a superset of VAST concepts. Splunk ships with its own Common Information Model (CIM), which is a collection of data models.

VAST ships with a script, taxonomize, that converts a Splunk data model JSON file to VAST concept declarations. An example invocation can look as follows:

taxonomize --splunk < Network_Traffic.json | head -n 16

This produces the following output:

# All_Traffic: extracted fields
- concept:
    description: >-
      The application protocol of the traffic.
- concept:
    description: >-
      The 802.11 channel used by a wireless network.
- concept:
    name: splunk.all_traffic.dest_bunit
    description: >-
      This field is automatically provided by asset and identity correlation
      features of applications like Splunk Enterprise Security. Do not define
      extractions for this field when writing add-ons.

The script only generates concept declarations, not the actual bindings to concrete data formats, which still need to be defined separately. For example, to bind a Splunk field name to an existing VAST concept:

- concept:
    name: splunk.all_traffic.src_ip
    concepts:
      - net.src.ip

VAST is an open-source project. We welcome community contributions and are happy to bundle user-provided taxonomies.

How do VAST taxonomies relate to ECS?#

ECS is an instance of a taxonomy. ECS defines exactly one nomenclature. In VAST, users can define multiple taxonomies. Extension was part of the design.

Tenzir will provide a best-effort definition of ECS mappings to the schema definitions that ship with VAST.

What happens with the schema when the taxonomy changes?#

Adding a new field to a concept creates a strict superset of the concept. Removing a field from a concept creates a strict subset of the concept.

Adding or removing elements from a model changes the arity of the record, rendering existing queries invalid. For example, if x = <a, b, c> is a valid model query and the model receives a new concept, the query must include a fourth element: x = <a, b, c, d>.

What happens with the taxonomy when the schema changes?#

Since concepts are untyped, changing the type of a field in the schema does not break the concept semantics.

However, if a field in the schema is renamed, the concept definition must be updated as well. Otherwise the renamed field will no longer be included in concept queries.
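
A simple consistency check can reveal concept entries that no longer match any schema field after a rename. The Python sketch below is a hypothetical helper, not a VAST feature; it assumes that concepts are available as a mapping from name to field list and that the schema is a set of fully qualified field names.

```python
def dangling_fields(concepts, schema_fields):
    """Map each concept to the fields it references that the schema lacks."""
    report = {}
    for name, fields in concepts.items():
        missing = [field for field in fields if field not in schema_fields]
        if missing:
            report[name] = missing
    return report


concepts = {
    "source_ip": ["sysmon.NetworkConnection.SourceIp", "suricata.flow.src_ip"],
}
# Suppose the Suricata field was renamed in the schema (new name is made up).
schema = {"sysmon.NetworkConnection.SourceIp", "suricata.flow.source_ip"}

print(dangling_fields(concepts, schema))
```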

When should I use an alias and when a concept?#

Aliases and concepts are two separate mechanisms to add semantics to the data. Given a record field x: T, an alias applies to the field type T, whereas a concept applies to the field name x.

The following table highlights the differences between the two mechanisms:

| | Aliases | Concepts |
|---|---|---|
| Objective | Tune VAST | Model a domain |
| User | Schema writer | Query writer |
| Typing | Strong / preserved | Lazy (sum type) after resolution |
| Location | Built into the data | Pre-processing at query time |
| Modification | Only for new data | Possible for past and new data |
| Structure | Type hierarchy | Tag-like collection |