In the previous blog, we explained why Parquet and
Feather are great building blocks for modern investigations. In this blog, we take
a look at how they actually perform on the write path in two dimensions:
Size: how much space does typical security telemetry occupy?
Speed: how fast can we write out to a store?
Parquet and Feather have different goals. While Parquet is an on-disk format
that optimizes for size, Feather is a thin layer around the native Arrow
in-memory representation. This puts them at different points in the spectrum of
throughput and latency.
To better understand this spectrum, we instrumented the write path of VAST,
which consists roughly of the following steps:
Parse the input
Convert it into Arrow record batches
Ship Arrow record batches to a VAST server
Write Arrow record batches out into a Parquet or Feather store
Create an index from Arrow record batches
Since steps (1–3) and (5) are the same for both stores, we ignore them in the
following analysis and solely zoom in on (4).
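To make step (4) concrete, here is a minimal Python sketch with pyarrow. It only illustrates the write calls; the schema and values are made up, and this is not VAST’s actual C++ implementation:

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Hypothetical parsed telemetry standing in for the output of steps (1)-(3).
table = pa.table(
    {
        "ts": pa.array([1_600_000_000_000, 1_600_000_000_500], type=pa.timestamp("ms")),
        "src": ["10.0.0.1", "10.0.0.2"],
        "dst": ["192.168.1.10", "192.168.1.11"],
        "bytes": [1024, 2048],
    }
)

# Step (4): write the same record batches out as a Parquet and a Feather store.
pq.write_table(table, "store.parquet", compression="zstd", compression_level=1)
feather.write_feather(table, "store.feather", compression="zstd", compression_level=1)
```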
For our evaluation, we use a dataset that models a “normal day in a corporate
network” fused with data from real-world attacks. While this approach might
not be ideal for detection engineering, it provides enough diversity to analyze
storage and processing behavior.
Specifically, we rely on a 3.77 GB PCAP trace of the M57 case study. We
also injected real-world attacks from
malware-traffic-analysis.net into the PCAP trace. To
make the timestamps look somewhat realistic, we shifted the timestamps of the
PCAPs so that the corresponding activity appears to happen on the same day. For
this we used editcap, and then merged the resulting PCAPs into one big file
using mergecap.
We then ran Zeek and Suricata over
the trace to produce structured logs. For full reproducibility, we host this
custom dataset in a Google Drive folder.
VAST can ingest PCAP, Zeek, and Suricata natively. All three data sources are
highly valuable for detection and investigation, which is why we use them in
this analysis. They represent a good mix of nested and structured data (Zeek &
Suricata) vs. simple-but-bulky data (PCAP). To give you a flavor, here’s an
example Zeek log:
Note that Zeek’s tab-separated value (TSV) format is already a structured table,
whereas Suricata data needs to be demultiplexed first through the event_type
field.
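For illustration, here is a hedged Python sketch of that demultiplexing step; the file name is a placeholder and this is not VAST’s actual Suricata reader:

```python
import json
from collections import defaultdict

import pyarrow as pa

# Group Suricata EVE JSON lines by their event_type (e.g., "flow", "dns", "alert").
events_by_type = defaultdict(list)
with open("eve.json") as f:  # hypothetical path
    for line in f:
        event = json.loads(line)
        events_by_type[event["event_type"]].append(event)

# One Arrow table per event type, i.e., per suricata.<event_type> schema.
tables = {
    event_type: pa.Table.from_pylist(events)
    for event_type, events in events_by_type.items()
}
```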
The PCAP packet type is currently hard-coded in VAST’s PCAP plugin and looks
like this:
Now that we’ve looked at the structure of the dataset, let’s take a look at our
measurement methodology.
Our objective is to understand the storage and runtime characteristics of
Parquet and Feather on the provided input data. To this end, we instrumented
VAST to produce a measurement trace file that we then analyze with R to gain
insights. The corresponding patch is not meant for production use, so we kept
it separate. But we did find an opportunity to improve VAST along the way and
made the Zstd compression level configurable. Our benchmark script is available for full reproducibility.
Our instrumentation produced a CSV file with the following features:
Store: the type of store plugin used in the measurement, i.e., parquet
or feather.
Construction time: the time it takes to convert Arrow record batches into
Parquet or Feather (see the sketch after this list). We fenced the
corresponding code blocks and computed the difference in nanoseconds.
Input size: the number of bytes that the to-be-converted record batches
consume.
Output size: the number of bytes that the store file takes up.
Number of events: the total number of events in all input record batches.
Number of record batches: the number of Arrow record batches per store.
Schema: the name of the schema; there exists one store file per schema.
Zstd compression level: the applied Zstd compression level.
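To make these features concrete, here is a hedged Python sketch of how one row of such a CSV could be produced for a single Parquet store. VAST’s actual instrumentation is C++ code; the paths and schema name below are made up:

```python
import csv
import os
import time

import pyarrow as pa
import pyarrow.parquet as pq

def measure_store(table: pa.Table, schema_name: str, level: int) -> dict:
    path = f"{schema_name}.parquet"
    start = time.perf_counter_ns()
    pq.write_table(table, path, compression="zstd", compression_level=level)
    construction_time_ns = time.perf_counter_ns() - start
    return {
        "store": "parquet",
        "construction_time_ns": construction_time_ns,
        "input_size": table.nbytes,             # in-memory Arrow buffers
        "output_size": os.path.getsize(path),   # bytes on disk
        "events": table.num_rows,
        "record_batches": len(table.to_batches()),
        "schema": schema_name,
        "zstd_level": level,
    }

# Hypothetical usage: append one row per store file (header writing omitted).
row = measure_store(pa.table({"x": list(range(1000))}), "zeek.conn", level=1)
with open("measurements.csv", "a", newline="") as f:
    csv.DictWriter(f, fieldnames=row.keys()).writerow(row)
```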
Every row corresponds to a single store file where we varied some of these
parameters. We used hyperfine as the benchmark driver, configured with 8 runs.
Let’s take a look at the data.
The schemas belong to three data modules: Zeek, Suricata, and PCAP. A module
is the prefix of a concrete type, e.g., for the schema zeek.conn the module is
zeek and the type is conn. This is only a terminological distinction;
internally, VAST stores the fully-qualified type as the schema name.
How many events do we have per schema?
Code
The above plot (log-scaled y-axis) shows how many events we have per type. The
event counts span nearly the entire range between 1 and 100M events.
What’s the typical event size?
Code
The above plot keeps the x-axis from the previous plot, but swaps the y-axis
to show the normalized event size, measured in memory after parsing. Most
events take up a few hundred bytes, with packet data consuming a bit more, and
one 5x outlier: suricata.ftp.
Such distributions are normal, even with these outliers. Some telemetry events
simply carry more string data as a function of user input. For suricata.ftp
specifically, the event size can grow linearly with the data transmitted.
Here’s a stripped-down example of an event that is greater than 5 kB in its raw JSON:
This matches our mental model. A few hundred bytes per event with some outliers.
On the inside, a store is a concatenation of homogeneous Arrow record batches,
all having the same schema.
The Feather format is essentially the IPC wire format of record batches. Schemas
and dictionaries are only included when they change. For our stores, this means
just once in the beginning. In order to access a given row in a Feather file,
you need to start at the beginning, iterate batch by batch until you arrive at
the desired batch, and then materialize it before you can access the desired
row via random access.
Parquet has row groups that are much like a record batch, except that they are
created at write time, so Parquet determines their size rather than the incoming
data. Parquet offers random access over both the row groups and within an
individual batch that is materialized from a row group. The on-disk layout of
Parquet is still row group by row group, and within each row group column by
column, so there’s no big difference between Parquet and Feather in that
regard. Parquet encodes columns using different encoding techniques than
Arrow’s IPC format.
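From a reader’s perspective, the difference looks roughly like this pyarrow sketch (the file names are placeholders): Parquet exposes its row groups for selective materialization, while a Feather/IPC file is materialized batch by batch.

```python
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Parquet: inspect the row groups and materialize only one of them.
pf = pq.ParquetFile("store.parquet")
print(pf.metadata.num_row_groups, "row groups")
first_group = pf.read_row_group(0)  # an Arrow table for just this row group

# Feather: read the IPC file; rows become accessible once their batch is materialized.
table = feather.read_table("store.feather")
print(table.num_rows, "rows in", len(table.to_batches()), "record batches")
```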
Most stores only consist of a few record batches. PCAP is the only exception.
Small stores are suboptimal because the catalog keeps in-memory state that is a
linear function of the number of stores. (We are aware of this concern and are
exploring improvements, but this topic is out of scope for this post.) The issue
here is catalog fragmentation.
As of v2.3, VAST has automatic rebuilding in place, which
merges underfull partitions to reduce pressure on the catalog. This doesn’t fix
the problem of linear state, but gives us sufficient reach for real-world
deployments.
To better understand the difference between Parquet and Feather, we now take a
look at them right next to each other. In addition to Feather and Parquet, we
use two other types of “stores” for the analysis to facilitate comparison:
Original: the size of the input before it entered VAST, e.g., the raw JSON or
a PCAP file.
Memory: the size of the data in memory, measured as the sum of the Arrow
buffers that make up the table slice (see the sketch after this list).
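As a hedged illustration of that measurement with pyarrow (the record batch below is made up):

```python
import pyarrow as pa

batch = pa.record_batch(
    [pa.array(["10.0.0.1", "10.0.0.2"]), pa.array([443, 80])],
    names=["src", "sport"],
)

# Sum the sizes of all Arrow buffers backing the batch's columns.
in_memory_bytes = sum(
    buf.size
    for column in batch.columns
    for buf in column.buffers()
    if buf is not None  # e.g., absent validity bitmaps
)
print(in_memory_bytes, "bytes in memory,", in_memory_bytes / batch.num_rows, "bytes/event")
```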
Let’s kick off the analysis by getting a better understanding of the size
distribution.
Code
Every boxplot corresponds to one store, with original and memory also treated
like stores. The suffix -Z indicates Zstd level Z, with NA meaning compression
is turned off entirely. Parquet stores on the right (in purple) have the
smallest size, followed by Feather (red), and then their corresponding
in-memory (green) and original (turquoise) representations. The negative Zstd
level -5 actually makes Parquet worse than Feather at positive levels.
Analysis
What stands out is that disabling compression for Feather inflates the data
beyond its original size. This is not the case for Parquet. Why? Because
Parquet has an orthogonal compression layer based on dictionary encoding. This
absorbs inefficiencies in heavy-tailed distributions, which are pretty
standard in machine-generated data.
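To illustrate how these two orthogonal knobs map onto the formats, here’s a pyarrow sketch; it shows the concepts, not VAST’s actual configuration, and the data is made up:

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Repetitive, heavy-tailed string data, typical for machine-generated telemetry.
table = pa.table({"user": ["alice", "bob", "alice", "alice"] * 10_000})

# Parquet: dictionary encoding and Zstd compression are independent settings.
pq.write_table(table, "dict-only.parquet", use_dictionary=True, compression="none")
pq.write_table(table, "dict-zstd.parquet", use_dictionary=True,
               compression="zstd", compression_level=19)

# Feather: only the (optional) Zstd/LZ4 compression layer is available.
feather.write_feather(table, "plain.feather", compression="uncompressed")
feather.write_feather(table, "zstd.feather", compression="zstd", compression_level=19)
```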
The y-axis of the above plot is log-scaled, which makes relative comparison
hard. Let’s focus only on the medians (the bars in the boxes) and bring the
y-axis to a linear scale:
Code
To better understand the compression in numbers, we’ll anchor the original size
at 100% and now show the relative gains of Parquet and Feather:
Store        Class    Bytes/Event  Size (%)  Compression Ratio
parquet+19   parquet         53.5      22.7                4.4
parquet+9    parquet         54.4      23.1                4.3
parquet+1    parquet         55.8      23.7                4.2
feather+19   feather         57.8      24.6                4.1
feather+9    feather         66.9      28.4                3.5
feather+1    feather         68.9      29.3                3.4
parquet+-5   parquet         72.9      31.0                3.2
parquet+NA   parquet         90.8      38.6                2.6
feather+-5   feather         95.8      40.7                2.5
feather+NA   feather        255.1     108.3                0.9
Analysis
Parquet dominates Feather with respect to space savings, but not by much for
high Zstd levels. Zstd levels > 1 do not provide substantial additional space
savings on average, where we observe a compression ratio of ~4x over the base
data. Parquet still provides a 2.6x compression ratio in the absence of
compression because it applies dictionary encoding.
Feather offers competitive compression with ~3x ratio for equal Zstd levels.
However, without compression Feather expands beyond the original dataset size at
a compression ratio of ~0.9.
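To connect the columns of the table: Size (%) and the compression ratio are two views of bytes per event relative to the original, which the table implies is roughly 236 bytes per event. A quick sanity check in Python for the parquet+19 row:

```python
# Original bytes/event implied by the feather+NA row (255.1 bytes at 108.3 %).
original_bpe = 255.1 / 1.083   # ~235.6 bytes/event
parquet19_bpe = 53.5

size_pct = 100 * parquet19_bpe / original_bpe   # ~22.7 %
ratio = original_bpe / parquet19_bpe            # ~4.4x
print(f"{size_pct:.1f} %, {ratio:.1f}x")
```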
The above analysis covered averages across schemas. If we juxtapose Parquet and
Feather per schema, we see the difference between the two formats more clearly:
Code
In the above log-log scatterplot, the straight line is the identity function.
Each point represents the median store size for a given schema. If a point is on
the line, it means there is no difference between Feather and Parquet. We only
look at schemas with more than 100k events to ensure that the constant factor
does not perturb the analysis. (Otherwise we end up with points below the
identity line, which are completely dwarfed by the bulk in practice.) The color
and shape show the different Zstd levels, with NA meaning no compression.
Point clouds closer to the origin mean that the corresponding store class
takes up less space.
Analysis
We observe that disabling compression hits Feather the hardest.
Unexpectedly, a negative Zstd level of -5 does not compress well. The remaining
Zstd levels are difficult to tell apart visually, but the point clouds appear
to form lines parallel to the identity line, indicating stable compression
gains. Notably, compressing PCAP packets works nearly identically with Feather
and Parquet, presumably because of the low entropy of the packet data and
metadata, where general-purpose compressors like Zstd shine.
Zooming into the bottom-left area, with average event sizes of less than 100 B,
and removing the log scaling, we see the following:
Code
The respective point clouds form a parallel to the identity function, i.e., the
compression ratio in this region is pretty constant across schemas. There’s
also no noticeable difference between Zstd levels 1, 9, and 19.
If we pick a single point, e.g., zeek.conn with
4.7M events,
we can confirm that the relative performance matches the results of our analysis
above:
Code
Finally, we look at the fraction of space Parquet takes compared to Feather on a
per schema basis, restricted to schemas with more than 10k events:
Code
The horizontal line is similar to the identity line in the scatterplot,
indicating that Feather and Parquet compress equally well. The bars represent
the ratio of Parquet size divided by Feather size. The shorter the bar, the
smaller the Parquet store, and thus the higher the gain over Feather.
Analysis
We see that Zstd level 19 brings Parquet and Feather close together. Even at
Zstd level 1, the median ratio of Parquet stores is 78%, and the 3rd
quartile 82%. This shows that Feather is remarkably competitive for typical
security analytics workloads.
Now that we have looked at the spatial properties of Parquet and Feather, we
take a look at the runtime. By speed, we mean the time it takes to transform
Arrow record batches into the Parquet and Feather formats. This analysis
considers only CPU time; VAST writes the respective store in memory first and
then flushes it in one sequential write. Our mental model is that Feather is
faster than Parquet. Is that the case when enabling compression for both?
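As a hedged Python sketch of that write pattern (VAST implements it in C++), the store is built in an in-memory buffer and then flushed with a single sequential write; the table is made up:

```python
import pyarrow as pa

table = pa.table({"x": list(range(100_000))})  # stand-in for one store's record batches

# Build the store in an in-memory buffer first. Feather v2 is the Arrow IPC file
# format, so we can use the IPC writer directly; Zstd is enabled via the options.
sink = pa.BufferOutputStream()
options = pa.ipc.IpcWriteOptions(compression="zstd")
with pa.ipc.new_file(sink, table.schema, options=options) as writer:
    writer.write_table(table)

# ... then flush the finished store to disk with one sequential write.
with open("store.feather", "wb") as f:
    f.write(sink.getvalue())
```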
To avoid distortion of small events, we also restrict the analysis to schemas
with more than 100k events.
Code
The above boxplots show the time it takes to write a store for a given store
and compression level combination. The log-scaled y-axis shows the time
normalized to microseconds per event, across the distribution of all schemas.
The sort order is the median processing time, similar to the size discussion
above.
Analysis
As expected, we roughly observe an ordering according to Zstd level: more
compression means a longer runtime.
Unexpectedly, for the same Zstd level, Parquet store creation was always
faster. Our unconfirmed hunch is that Feather compression operates on more and
smaller column buffers, whereas Parquet compression only runs over the
concatenated Arrow buffers, yielding bigger strides.
We don’t have an explanation for why disabling compression for Parquet is
slower compared to Zstd levels -5 and 1. In theory, strictly fewer cycles are
spent by disabling the compression code path. Perhaps compression results in a
different memory layout that is more cache-efficient. Unfortunately, we did not
have the time to dig deeper into the analysis to figure out why disabling
Parquet compression is slower. Please don’t hesitate to reach out, e.g., via our
community chat.
Let’s compare Parquet and Feather by compression level, per schema:
Code
The above scatterplot has an identity line. Points on this line indicate that
there is no speed difference between Parquet and Feather. Feather is faster for
points below the line, and Parquet is faster for points above the line.
Analysis
In addition to the above boxplot, this scatterplot makes it easier to see the
impact of individual schemas.
Interestingly, there is no significant difference between Zstd levels -5 and 1,
while levels 9 and 19 stand further apart. Disabling compression for Feather
has a stronger effect on speed than for Parquet.
Overall, we were surprised that Feather and Parquet are not far apart in terms
of write performance once compression is enabled. Only when compression is
disabled is Parquet substantially slower in our measurements.
Finally, we combine the size and speed analysis into a single benchmark. Our
goal is to find an optimal parameterization, i.e., one that strictly dominates
others. To this end, we now plot size against speed:
Code
Every point in the above log-log scatterplot represents a store with a fixed
schema. Since we have multiple stores for a given schema, we took the median
size and median speed. We then varied the run matrix by Zstd level (color) and
store type (triangle/point shape). Points closer to the origin are “better” in
both dimensions. So we’re looking for the left-most and bottom-most ones.
Disabling compression puts points into the bottom-right area, and maximum
compression into the top-left area.
The point closest to the origin has the schema zeek.dce_rpc for Zstd level 1,
both for Feather and Parquet. Is there anything special about this log file?
Here’s a sample:
It appears to be rather normal: 10 columns, several different data types, unique
IDs, and some short strings. By looking at the data alone, there is no obvious
hint that explains the performance.
With dozens to hundreds of different schemas per data source (sometimes even
more), it becomes difficult to single out individual schemas. Moreover, a point
cloud is unwieldy for relative comparison.
schemas for a given configuration, we can strip the “inner” points and only look
at their convex hull:
Code
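As a hedged aside, computing a hull and its area for one configuration could look like the following Python sketch (the actual plots are produced with R); the (size, speed) points are made up for illustration:

```python
import numpy as np
from scipy.spatial import ConvexHull

# Hypothetical (size, speed) medians per schema for one store configuration.
points = np.array([
    [55.8, 0.21],
    [60.2, 0.35],
    [48.9, 0.18],
    [72.4, 0.40],
    [51.3, 0.27],
])

hull = ConvexHull(points)
print(points[hull.vertices])  # the points forming the hull polygon
print(hull.volume)            # for 2-D input, .volume is the enclosed area
```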
Intuitively, the area of a given polygon captures its variance. A smaller area
is “good” in that it offers more predictable behavior. The high amount of
overlap still makes it difficult to perform clear comparisons. If we facet by
store type, it becomes easier to compare the areas:
Code
Arranging the facets above row-wise makes it easier to compare the y-axis, i.e.,
speed, where lower polygons are better. Arranging them column-wise makes it easier
to compare the x-axis, i.e., size, where the left-most polygons are better:
Code
Analysis
Across both dimensions, Zstd level 1 shows the best average space-time
trade-off for both Parquet and Feather. In the above plots, we also observe our
findings from the speed analysis: Parquet still dominates when compression is
enabled.
In summary, we set out to better understand how Parquet and Feather behave on
the write path of VAST, when acquiring security telemetry from high-volume data
sources. Our findings show that columnar Zstd compression offers great space
savings for both Parquet and Feather. For certain schemas, Feather and Parquet
exhibit only marginal differences. To our surprise, writing Parquet files is
still faster than Feather for our workloads.
The pressing next question is obviously: what about the read path, i.e., query
latency? This is a topic for a future post, stay tuned.