Database Concepts

note

This section covers only the fundamental database aspects of VAST. The simplification purposefully glosses over many implementation details, which other sections cover in depth.

Since VAST stores massive amounts of event data and makes it accessible through a flexible query language, it feels much like a database at its core. In its simplest form, data from various sources goes into the system and queries extract it again.

Data Representation

In VAST's data model, the most granular piece of information is an event, which you can think of as a nestable record with typed fields. VAST always processes batches of events to ensure high throughput in and out of the system. Within a batch, all events have the same type, i.e., the same field names and types. We call a batch of events a table slice to highlight the shape of the data. A group of table slices sharing the same type behaves like one big logical table.
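
To make the data model concrete, here is a minimal C++ sketch of an event and a table slice. The type and field names are illustrative placeholders, not VAST's actual API, and nested records are omitted for brevity:

```cpp
// A minimal sketch of the data model described above; not VAST's actual API.
#include <cstdint>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// A field value. VAST's type system is richer, but a variant conveys the idea.
using value = std::variant<std::int64_t, double, std::string>;

// An event is a record of typed fields (nestable in the real data model).
struct event {
  std::vector<std::pair<std::string, value>> fields;
};

// A table slice is a batch of events that all share the same type, i.e., the
// same field names and field types, so the batch behaves like a table.
struct table_slice {
  std::string schema;         // e.g., "zeek.conn" (hypothetical schema name)
  std::vector<event> events;  // every event conforms to the schema
};
```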

Internally, a table slice is a single contiguous buffer of bytes. The benefit is fast and cache-friendly data access during processing. The downside is that building this buffer takes more effort at ingest time. However, since we can parallelize the ingestion process quite naturally, we get the best of both worlds.
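
The following sketch illustrates why a contiguous layout pays off: if a slice's cells live in one flat buffer grouped by column, scanning a column is a linear walk over adjacent memory. This is a simplified illustration of the idea, not VAST's actual buffer layout:

```cpp
// Illustration only: a slice whose cells live in one contiguous, column-major
// buffer, so aggregating a column touches a single contiguous memory range.
#include <cstddef>
#include <cstdint>
#include <vector>

struct slice_buffer {
  std::size_t rows = 0;
  std::size_t columns = 0;
  std::vector<std::uint64_t> cells; // rows * columns values, column-major

  // Cell at (row, col); column `col` occupies cells [col * rows, (col + 1) * rows).
  std::uint64_t at(std::size_t row, std::size_t col) const {
    return cells[col * rows + row];
  }
};

// Summing one column walks linearly over adjacent memory.
std::uint64_t sum_column(const slice_buffer& slice, std::size_t col) {
  std::uint64_t sum = 0;
  for (std::size_t row = 0; row < slice.rows; ++row)
    sum += slice.at(row, col);
  return sum;
}
```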

VAST associates a unique 64-bit ID with every event. Because data comes in table slices, every slice spans a half-open range of IDs [a, b), where b - a is the number of events in the slice.
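
A small structure suffices to capture such a range; in this sketch, `offset` and `size` are hypothetical names standing in for a and the slice's event count:

```cpp
// A minimal sketch of the half-open ID range [a, b); names are illustrative.
#include <cstdint>

struct id_range {
  std::uint64_t offset; // a: the ID of the first event in the slice
  std::uint64_t size;   // number of events, so b = offset + size

  std::uint64_t begin() const { return offset; }        // a (inclusive)
  std::uint64_t end() const { return offset + size; }   // b (exclusive)

  bool contains(std::uint64_t id) const {
    return id >= begin() && id < end();
  }
};
```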

Persistent State

Two components in VAST store event data:

  • Index: accelerates queries. VAST has a tailor-made bitmap index framework that supports its data model. Given a query, the index returns a set of IDs that represent events in the archive.

  • Archive: stores the raw data. It behaves like a key-value store that maps IDs to events. The interface is vectorized, i.e., users request a set of IDs and get back the table slices containing the corresponding events. The archive backend is pluggable and allows for integrations with various stores, such as custom data lakes or cloud blob stores (see the sketch after this list).
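
The following sketch renders the two roles as hypothetical C++ interfaces that mirror the description above; VAST's actual classes and signatures differ:

```cpp
// A hedged sketch of index and archive as interfaces; not VAST's real types.
#include <cstdint>
#include <string>
#include <vector>

using ids = std::vector<std::uint64_t>; // stand-in for a compressed bitmap
struct expression { std::string ast; }; // parsed query (placeholder)
struct table_slice { /* batch of same-typed events */ };

// The index accelerates queries: given a query expression, it returns the IDs
// of the events that match.
struct index {
  virtual ids lookup(const expression& query) = 0;
  virtual ~index() = default;
};

// The archive is a vectorized key-value store from IDs to events: a set of
// IDs goes in, the table slices containing those events come out.
struct archive {
  virtual std::vector<table_slice> extract(const ids& hits) = 0;
  virtual ~archive() = default;
};
```

Keeping the archive behind a narrow, vectorized interface is what makes the backend pluggable: a data-lake or blob-store integration only needs to implement the extraction of table slices for a set of IDs.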

Ingestion

As data enters the system, VAST assigns unique IDs to the events before relaying the table slices to archive and index.

The steps in detail:

  1. Parse the input, e.g., from a log file, a stream of packets from the NIC, or a socket. Then package it into table slices.

  2. Assign IDs and relay the table slices to archive and index in parallel. Because the data is immutable after ID assignment, it is safe to share the same instance. The implementation takes care of efficient copy-on-write messaging (see the sketch after these steps).

  3. Process the data:

    • At the archive, simply buffer it and write it to disk.
    • At the index, go over the data and build one index per table column.
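
The following sketch ties these steps together. The function and type names are illustrative, not VAST's API, and sharing a pointer to immutable data stands in for the copy-on-write messaging mentioned in step 2:

```cpp
// A hedged sketch of the ingestion flow described above.
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct table_slice {
  std::uint64_t offset = 0; // ID of the first event, assigned at ingest time
  std::uint64_t rows = 0;   // number of events in the slice
  // ... the actual event data lives in a contiguous buffer ...
};

struct archive {
  void store(std::shared_ptr<const table_slice>) { /* buffer, write to disk */ }
};

struct index {
  void build(std::shared_ptr<const table_slice>) { /* one index per column */ }
};

// Assign each parsed slice the next free ID range [offset, offset + rows) and
// relay the now-immutable slice to archive and index.
void ingest(std::vector<table_slice> slices, std::uint64_t& next_id,
            archive& arch, index& idx) {
  for (auto& slice : slices) {
    slice.offset = next_id;
    next_id += slice.rows;
    auto shared = std::make_shared<const table_slice>(std::move(slice));
    arch.store(shared); // step 3a: persist the raw data
    idx.build(shared);  // step 3b: build per-column indexes
  }
}
```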

Querying

To retrieve data, a user issues a query. VAST first consults the index to find the matching event IDs and then takes those IDs to the archive to materialize the result, as sketched after the steps below.

The steps in detail:

  1. Issue a query. VAST parses the textual form into an expression AST internally.

  2. Go to the index to find matching events. The index result (hits) is a sequence of event IDs, represented in a space-efficient format like a compressed bitmap.

  3. The archive receives a set of IDs and loads the corresponding events from its backend key-value store.

  4. VAST sends the materialized events back to the user in incremental batches.
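
The following sketch condenses these four steps into a single function, using minimal stand-in types rather than VAST's actual classes:

```cpp
// A hedged sketch of the query path: parse, look up hits in the index,
// extract matching slices from the archive, and hand results back in batches.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct expression { std::string ast; };       // parsed query
struct table_slice { /* batch of events */ };
using ids = std::vector<std::uint64_t>;       // stand-in for a compressed bitmap

expression parse(const std::string& query) { return {query}; }        // step 1

struct index {
  ids lookup(const expression&) { return {}; }                        // step 2
};

struct archive {
  std::vector<table_slice> extract(const ids&) { return {}; }         // step 3
};

// `sink` receives the materialized result incrementally, one slice at a time.
void run_query(const std::string& query, index& idx, archive& arch,
               const std::function<void(const table_slice&)>& sink) {
  auto expr = parse(query);     // textual query -> expression AST
  auto hits = idx.lookup(expr); // index returns matching event IDs
  for (const auto& slice : arch.extract(hits))
    sink(slice);                // step 4: stream results back in batches
}
```

Streaming results through a sink mirrors the incremental batches in step 4: the user starts seeing events before the full result set is materialized.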

In the next section, we go one step deeper and look at how VAST implements these concepts with the actor model.