This is an experimental feature: the API is subject to change and robustness is not yet comparable to production-grade features.
Taxonomies enable you to interact with different data formats through a single, unified access layer for your domain, instead of having to juggle the various naming schemes of each individual data source. In addition to standardizing data access, taxonomies lift queries to a richer semantic level. A frequently used synonym for taxonomy is common information model (CIM).
Taxonomies only apply to the query processing of data in VAST and do not affect ingestion.
Consider the scenario of having two types of network security logs: Sysmon from the endpoint and Suricata on the network side. When a user wants to query all flow events to a particular destination in both formats, they would write a disjunction of two predicates:
Concepts remove the linear scaling from writing queries by decoupling semantics from syntax. This takes substantial load off the analyst's mind. Trying to remember format-specific field names is also not a good investment of analyst time, aside from being error-prone in practice. All the user wants to express in the above example is that a source IP address has a certain value. So why not let the user write it directly like this?
Taxonomies enable exactly that. VAST automatically translates this short query to the more complex one above.
In the following, we introduce the building blocks of taxonomies that make this possible.
A concept is the abstract meaning of a data value, independent of the format it occurs in. The representation of a concept is a name that maps to a set of field names. VAST translates a query expression containing a concept name to a disjunction of all fields. That is, concepts are sum types.
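The translation from a concept name to a disjunction can be sketched in a few lines of Python. This is an illustration only, not VAST's implementation: the `CONCEPTS` table, the `expand_concept` helper, and the textual rendering with `==` and `||` are assumptions made for this sketch. The field names come from the Sysmon and Suricata example in this document.

```python
# Hypothetical concept table: concept name -> field names it maps to.
CONCEPTS = {
    "source_ip": ["SourceIp", "src_ip"],
}


def expand_concept(name, op, value, concepts):
    """Rewrite `<concept> <op> <value>` into a disjunction over all mapped fields."""
    fields = concepts[name]
    # One predicate per field, all sharing the same operator and right-hand side.
    return " || ".join(f"{field} {op} {value}" for field in fields)


print(expand_concept("source_ip", "==", "10.0.0.1", CONCEPTS))
```

Each mapped field contributes one predicate, so the expanded query grows with the number of fields while the query the user writes stays the same.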
Consider Sysmon and Suricata data, each of which has a notion of a network
connection event with a source IP address. The Sysmon event contains a field
SourceIp and the Suricata event flow contains a field src_ip for this purpose.
You can define a new concept source_ip by mapping these two fields:
An event fulfills a concept if one or more fields of the event type definition
occur in the concept mapping. In the above example, the events
flow both fulfil the concept
Note that concepts have no particular type and the contained field names do not
have to share the same type either. Queries remain meaningful even if types
differ, because VAST ignores all fields where the type does not match. For
example, if you define a concept
x that includes two fields
a of type
b of type
string, then the query
x < 4.2 would only consider
A concept can include other concepts. This enables flexible composition to represent semantic hierarchies.
For example, consider our above
source_ip concept. If we want to generalize
this concept to also include MAC addresses, we can define a concept
that includes both
source_ip and a new field that represents a MAC address:
Naturally, it is equally possible to construct a new concept that bundles all
fields that represent a MAC address source.
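Nested concepts can be flattened into a plain list of fields with a short recursive walk. The following Python sketch is not taken from VAST. The dictionary layout with `fields` and `concepts` keys, the concept name `source`, and the field name `hypothetical_mac_field` are all assumptions chosen for illustration.

```python
# Hypothetical definitions: a concept lists direct fields and, optionally,
# the names of other concepts that it includes.
CONCEPTS = {
    "source_ip": {"fields": ["SourceIp", "src_ip"]},
    "source": {"fields": ["hypothetical_mac_field"], "concepts": ["source_ip"]},
}


def flatten(name, concepts, seen=None):
    """Return all fields reachable from a concept, following nested concepts."""
    seen = set() if seen is None else seen
    if name in seen:
        # Guard against definitions that include each other in a cycle.
        return []
    seen.add(name)
    entry = concepts[name]
    fields = list(entry.get("fields", []))
    for nested in entry.get("concepts", []):
        for field in flatten(nested, concepts, seen):
            if field not in fields:
                fields.append(field)
    return fields


print(flatten("source", CONCEPTS))
```

Because concepts are sum types, flattening is a union: the composite concept matches whenever any of the collected fields matches.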
Referencing other concepts uses the
Since concepts define the common substrate across multiple data formats, adding
a new format typically requires extending existing concept mappings. For
example, when adding a new log type that contains a source IP address, the
corresponding field should be appended to the
source_ip concept mapping. From
an operator's perspective, the addition of a new format should take place
locally, i.e., define syntax (type definitions) and semantics (concept
definitions) of the new format in one shot.
Dynamic extension is possible by design, because concepts are sum types. Here
is an example where we add a new type
zeek.conn and extend the
This out-of-band concept definition and extension mechanism makes it possible to use completely different nomenclatures, allowing users to tailor the VAST query language to their needs.
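Because concepts are sum types, extending one amounts to taking the union of its field lists across all definitions. The Python sketch below illustrates this merge under stated assumptions: `merge_concepts` is a hypothetical helper, and the field name `zeek.conn.id.orig_h` is assumed here as the source address field of the zeek.conn type.

```python
def merge_concepts(definitions):
    """Merge (name, fields) pairs; repeated names extend the same concept."""
    merged = {}
    for name, fields in definitions:
        target = merged.setdefault(name, [])
        for field in fields:
            if field not in target:  # keep the order, drop duplicates
                target.append(field)
    return merged


# Definitions that ship with the existing formats.
base = [("source_ip", ["SourceIp", "src_ip"])]
# Definition that arrives together with a new format (field name assumed).
extension = [("source_ip", ["zeek.conn.id.orig_h"])]

print(merge_concepts(base + extension))
```

The merge never removes fields, so a definition shipped alongside a new format cannot narrow an existing concept.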
VAST does not support YAML schema definitions yet. Until then, all schema files
with type definitions must end in
*.schema and concept definitions in
A model contains one or more concepts. While a concept applies to a single field of multiple types, a model combines multiple concepts (and thereby fields) into a tuple. In other words, a model is a subset of fields of an event with specific semantics. Models are thus product types. An event fulfills a model if and only if it fulfills all contained concepts.
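The difference between fulfilling a concept (any mapped field is present) and fulfilling a model (all contained concepts are fulfilled) can be expressed compactly. The Python sketch below is illustrative only: the helper names, the model `endpoint`, and the port field names `SourcePort` and `src_port` are assumptions, not part of VAST.

```python
# Hypothetical concept table: concept name -> set of field names.
CONCEPTS = {
    "source_ip": {"SourceIp", "src_ip"},
    "source_port": {"SourcePort", "src_port"},
}

# A model is an ordered list of concept names (a product type).
endpoint = ["source_ip", "source_port"]


def fulfills_concept(event_fields, concept_fields):
    """Sum type: at least one mapped field occurs in the event."""
    return any(field in event_fields for field in concept_fields)


def fulfills_model(event_fields, model, concepts):
    """Product type: every contained concept must be fulfilled."""
    return all(fulfills_concept(event_fields, concepts[name]) for name in model)


print(fulfills_model({"src_ip", "src_port", "proto"}, endpoint, CONCEPTS))
print(fulfills_model({"src_ip"}, endpoint, CONCEPTS))
```

The first event provides a field for both concepts and fulfills the model. The second lacks a port field and does not, even though it still fulfills the source_ip concept.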
Models can only be defined in terms of concepts, not individual fields. If VAST allowed individual fields in model definitions, only the field type could be a member of the model. If only one type fulfills the criteria of a model when it is first defined, missing concepts must be introduced at that point to provide a clear separation of concerns and a natural update mechanism.
Consider again Sysmon and Suricata data and assume that we want to formalize
the notion of a
connection, which requires the following concepts to be
dest_port. The model
definition looks as follows:
suricata.flow fulfil all concepts of the
VAST does not support models yet. We already include this feature in the documentation to showcase the coherent design and to solicit feedback from the community.
Models compose in the same fashion as concepts: it is possible to define a new model entirely out of existing models, or out of a mix of concepts and models. However, a concept cannot include a model.
In the above example, the
connection model consists of the
destination_endpoint model, each of which contains two concepts.
Another example for a composite model is a network connection tied to an OS
process. Let's assume that there exists a concept
represents the executable path of a loaded process. Then the model
process_network_event could look as follows:
The definition of the
process_network_event model looks as follows:
A query using this model could look like this:
The model structure (i.e., the nesting) does not have to occur on the RHS
because this information is embodied in the model itself. Otherwise users would
have to write very clumsy values, e.g.,
<<_, _>, <10.0.0.1, 80>, "firefox.exe"> which would come at the cost of
From the query perspective, concepts and models behave like "virtual" record fields that override schema definitions. The resolution process starts with models, continues with concepts, and terminates when the query consists of fields only:
The resolution is recursive in that models/concepts can contain other instances of models/concepts:
To resolve a query that uses a concept, VAST only needs to look up the concept
mapping and build a disjunction using the referenced fields. In the above
example, the predicate destination_ip in 10.0.0.0/8 unfolds into a disjunction
with three predicates, each of which has the same RHS:
An example for a model query is the predicate
destination_endpoint = <10.0.0.1, 80>, with the left-hand side being the name
of a model and the right-hand side a data record. Using the same mapping tables
as above, VAST resolves this query into a conjunction first:
Thereafter, the concept resolution takes place again, assuming that there exist
concept definitions for
source_port symmetric to
The resolution into conjunctions and disjunctions nicely illustrates the duality of models as product types and concepts as sum types.
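The two resolution steps can be combined into a small Python sketch: a model predicate first becomes a conjunction with one conjunct per concept, and each conjunct then becomes a disjunction over the concept's fields. This is a simplified illustration and not VAST's resolver. It handles only flat models, and the concept name `destination_port`, all field names, and the `==`, `||`, `&&` rendering are assumptions.

```python
# Hypothetical mapping tables for illustration.
MODELS = {"destination_endpoint": ["destination_ip", "destination_port"]}
CONCEPTS = {
    "destination_ip": ["DestinationIp", "dest_ip"],
    "destination_port": ["DestinationPort", "dest_port"],
}


def resolve_model(model, values, models, concepts):
    """Resolve `<model> = <v1, v2, ...>` into a conjunction of disjunctions."""
    members = models[model]
    if len(members) != len(values):
        raise ValueError("record arity must match the number of concepts in the model")
    conjuncts = []
    for concept, value in zip(members, values):
        # Concept step: one predicate per mapped field, joined by OR.
        disjunction = " || ".join(f"{field} == {value}" for field in concepts[concept])
        conjuncts.append(f"({disjunction})")
    # Model step: all concepts must hold, joined by AND.
    return " && ".join(conjuncts)


print(resolve_model("destination_endpoint", ["10.0.0.1", "80"], MODELS, CONCEPTS))
```

The arity check makes explicit that the record on the right-hand side must supply one value per concept in the model.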
During server startup, VAST picks up taxonomy definitions from YAML files
(*.yaml) in the schema directory.
VAST expects a single list in the YAML file, each element of which contains an
object with a single key representing the definition type, e.g.,
To define a concept, introduce a new YAML list element with a
object. Here is an example:
A concept definition must include a name and a list of fields that VAST
should substitute for that name.
To inspect existing models and concepts in a given VAST instance, you can use
vast status --debug on the command line.
How do VAST taxonomies relate to Splunk Data Models?
VAST ships with a script taxonomize that converts a Splunk data model JSON
file to VAST concept declarations. An example invocation can look as follows:
This produces the following output:
The script only generates concept declarations, not the actual bindings to concrete data formats, which still need to be defined separately. For example, to bind a Splunk field name to an existing VAST concept:
VAST is an open-source project. We welcome community contributions and are open to bundling user-provided taxonomies.
How do VAST taxonomies relate to ECS?
ECS is an instance of a taxonomy. ECS defines exactly one nomenclature. In VAST, users can define multiple taxonomies. Extension was part of the design.
Tenzir will provide a best-effort definition of ECS mappings to the schema definitions that are shipped with VAST.
What happens with the schema when the taxonomy changes?
Adding a new field to a concept creates a strict superset of the concept. Removing a field from a concept creates a strict subset of the concept.
Adding or removing elements from a model changes the arity of the record,
rendering existing queries invalid. For example, if
x = <a, b, c> is a valid
model query and the model receives a new concept, the model must include a
x = <a, b, c, d>.
What happens with the taxonomy when the schema changes?
Since concepts are untyped, changing the type of a field in the schema does not break the concept semantics.
However, if a field in the schema is renamed, the concept definition must be updated as well. Otherwise the renamed field will no longer be included in concept queries.
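One way to catch such drift is to compare the fields referenced by concepts against the fields that currently exist in the schema. The Python sketch below shows the idea. It is not a VAST feature: `stale_fields` is a hypothetical helper, and `source_address` is an invented name standing in for a renamed field.

```python
def stale_fields(concepts, schema_fields):
    """Return, per concept, the mapped fields that no longer exist in the schema."""
    report = {}
    for name, fields in concepts.items():
        missing = [field for field in fields if field not in schema_fields]
        if missing:
            report[name] = missing
    return report


concepts = {"source_ip": ["SourceIp", "src_ip"]}
# Suppose src_ip was renamed to source_address in the schema (hypothetical).
schema_fields = {"SourceIp", "source_address"}

print(stale_fields(concepts, schema_fields))
```

Any field that shows up in the report needs its concept definition updated to the new name before concept queries cover it again.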