Tenzir is a security data pipeline solution for security teams, providing the
following abstractions:
Tenzir's pipelines consist of powerful operators that perform computations
over Arrow data frames. The Tenzir Query Language
(TQL) makes it easy to express pipelines—akin to Splunk and
Kusto.
Tenzir's indexed storage engine persists dataflows in an open format
(Parquet &
Feather) so that you can
access them with any query engine, or run pipelines over selective historical
workloads.
Tenzir nodes offer a managed runtime for pipelines and storage.
Interconnected nodes form a data fabric and pipelines can span across them
to implement sophisticated security architectures.
Tenzir fills a gap in the market—powerful enough for data-intensive security use
cases, but easy enough for security users that are not data engineers.
One of the major challenges in cybersecurity is the increasing volume of data
that organizations need to manage and analyze in order to protect their critical
infrastructure and sensitive data from cyber threats. Traditional security
information and event management (SIEM) systems are not designed to scale with
this increasing volume of data, and can become costly and time-consuming to
maintain. Additionally, cloud and data lakes are often geared towards data
engineers rather than security professionals, leading to security teams having
to spend valuable time and resources wrangling data instead of hunting for
threats. This creates a significant barrier for organizations looking to
effectively protect their data and infrastructure from cyber attacks.
What is needed is a security-native data architecture that enables security
teams to take control of their data and easily deploy and manage the flow of
security data. This solution should be able to integrate seamlessly with
existing data architectures, scale from single nodes to highly distributed data
fabrics, and support a variety of deployment options including cloud,
on-premises, decentralized, and even air-gapped environments.
Importantly, a security-native data architecture should be easy to deploy, use,
and manage without the need for dedicated data engineering resources. It
should also be built around security standards and integrate easily with
security tools in a plug-and-play fashion.
Tenzir aims to fill this gap as an open pipelines and storage engine for
building scalable security architectures. Pipelines make it easy to transport,
filter, reshape, and aggregate security events, whereas the embedded storage and
query engine enables numerous detection and response workloads to move upstream
of SIEM for a more cost-effective and scalable implementation. Tenzir is built
using open standards, such as Apache Arrow for data in motion and Apache Parquet
for data at rest, preventing vendor lock-in and promote full control of your
event data and security content.
Traditional SIEMs support basic search and a fixed set of analytical operations.
For moderate data volumes, the established SIEM use cases perform well. But when
scaling up to high-volume telemetry data, traditional SIEMs fall behind and
costs often spiral out of control. Traditional SIEMs also lack good support for
threat hunting and raw exploratory data analysis. That's why more advanced use
cases, such as feature extraction, model training, and detection engineering,
require additional workbenches built on data lakes.
Tenzir complements a SIEM nicely with the following use cases:
Offloading: route the high-volume telemetry to Tenzir that would otherwise
overload your SIEM or be cost-prohibitive to ingest. By keeping the bulk of
the data in Tenzir, you remove bottlenecks and can selectively forward the
activity that matters to your SIEM.
Compliance: Tenzir supports fine-grained retention configuration to meet
GDPR and
other regulatory requirements. When storage capacity needs careful management,
Tenzir's compaction feature allows for weighted
ageing of your data, so that you can
specify relative importance of event types. Tenzir's powerful pipelines allow
you to anonymize, pseudonymize, or encrypt specific fields—either to sanitize
PII data on import, or ad-hoc
on export when data leaves a pipeline or node.
Data Science: The majority of SIEMs provide an API-only, low-bandwidth
access path to your security data. Tenzir is an Arrow-native engine
with unfettered high-bandwidth access so that you can bring demanding
workloads to your. Bring your own tools, e.g., to run iterative clustering
algorithms, perform feature extraction, compute embeddings, and perform
in-stream model inference.
Recommendation
Unlike a heavy-weight legacy SIEM, Tenzir is highly embeddable so that you can
run it everywhere: containerized in the public cloud, in the data center in the
private cloud, on bare-metal appliances deep in the network, or at the edge.
Data warehouses and
OLAP engines
seem like an appealing choice for immutable structured data. They offer
sufficient ingest bandwidth, perform well on aggregations, come frequently with
advanced operations like joins, and often scale out well.
However, as a cornerstone for security operations, they fall short in supporting
the following relevant use cases where Tenzir has the edge:
Data Onboarding: it takes considerable effort to write and maintain
schemas for the tables of the respective data sources. As a data pipeline
product, Tenzir is well suited for your extract-transform-load (ETL) needs:
purpose-built for security data, available integrations for key security
tools, and many data connectors out of the box.
Rich Typing: modeling security event data with a generic database often
reduces the domain-specific values to strings or integers, as opposed to
retaining their original semantics, such as IP addresses or port numbers.
Tenzir offers a rich type system that can retain such semantics during data
onboarding, giving you the ability to query the data with your own taxonomy at
query time.
Recommendation
Data warehouses may be well-suited once data is fully structured, but you still
need to cover the ETL and reverse-ETL aspects. Thanks to automatic schema
inference, Tenzir significantly reduces the cost of data onboarding for data
warehouses.
Unlike OLAP workloads,
OLTP workloads
have strong transactional and consistency guarantees, e.g., when performing
inserts, updates, and deletes. These extra guarantees come at a cost of
throughput and latency when working with large datasets, but are rarely needed
for analytical workloads. In a world of incomplete data, high data velocity, and
immutable representations of activity, relational databases are rarely
encountered.
Recommendation
If you aim to perform numerous modifications on a small subset of event data,
with medium ingest rates, relational databases (like PostgreSQL or MySQL), might
be a better fit. Tenzir's columnar data representation is ill-suited for
row-level modifications, but shines for analytical workloads.
Document DBs, such as MongoDB, offer worry-free ingestion of unstructured
data. They scale well horizontally and flexible querying.
However, they might not be the best choice for the data plane in security
operations, for the following reasons:
Vertical Scaling: when co-locating a storage engine next to high-volume
data sources, e.g., on a network appliance together with a network monitor,
CPU and memory constraints coupled with a non-negligible IPS overhead prohibit
scaling horizontally to build a "cluster in a box."
Analytical Workloads: the document-oriented storage does not perform well
for analytical workloads, such as aggregations. But such workloads are common
when hunting or when deploying detections.
Economy of Representation: security telemetry data exhibits a lot of
repetitiveness between events, such as similar IP addresses, URL prefixes, or
log message formats. This data compresses much better when transposed into a
columnar format, such as Parquet.
A special case of document DBs are full-text search engines, such as
ElasticSearch. The unit of input is typically unstructured text. These engines
use (inverted) indexes and ranking methods to return the most relevant results
for a given combination of search terms.
Recommendation
Most of the security tools work with structured data. Operators spend a
substantial amount of time to convert data from unstructured to structured. But
if your primary use case involves working with text documents, Tenzir might not
be a good fit. That said, needle-in-haystack search and other information
retrieval techniques are still relevant for security analytics, for which
Tenzir's indexed storage engine also has basic support.
Tenzir combines the flexibility of the document-oriented data model with the
power of structured OLAP. By relying internally on columnar data representation
with Apache Arrow and simultaneously performing schema inference, you get the
best of both worlds.
Timeseries databases share a lot in common with OLAP
engines, but focus their data organization around
time as key dimension.
Recommendation
If you plan to access your event data only through the time dimension and need
to model the majority of data as series, a timeseries DBs may suit the bill. If
you access data through other (spatial) attributes, like IP addresses or
domains, a traditional timeseries DB might not be good fit—especially for
high-cardinality attributes. If your workloads involve running more complex
detections, or include needle-in-haystack searches, Tenzir might be a better
fit.
A key-value store performs a key-based point or range lookups to retrieve one or
more values. Security telemetry is high-dimensional data and there are many more
desired entry points than a single key besides time, e.g., IP address,
application protocol, domain name, or hash value.
Recommendation
Key-value stores alone are not suitable as foundation for running security
analytics workloads. There are narrow use cases where key-value stores can
facilitate certain capabilities, e.g., when processing watch lists. (Tenzir
offers a context plugin for this purpose.)
Graph databases are purpose-built for answering complex queries over networks of
nodes and their relationships, such as finding shortest paths, measuring node
centrality, or identifying connected components. While networks and
communication patterns can naturally be represented as graphs, traditional
security analytics query patterns may not benefit from a graph representation.
Recommendation
If graph-centric queries dominate your use case, Tenzir is not the right
execution engine. Tenzir can still prove valuable as foundation for graph
analytics by storing the raw telemetry and feeding it (via Arrow) into graph
engines that support ad-hoc data frame analysis.
Vector databases operate on embeddings, which are high-dimensional floating
point vectors. For generative AI applications, decision support systems, or
search on unstructured data, embeddings are the building block, and vector
databases offer native operations on them, such as approximate nearest neighbor
search.
Recommendation
Tenzir can efficiently represent embeddings via Apache Arrow, but lacks specific
processing capabilities. Use Tenzir to transport your vectors to more
purpose-built engines that also build on Arrow data frames.