Apache Arrow


Apache Arrow is a cross-language development platform for in-memory columnar data, with bindings for many programming languages.

VAST has first-class support for Arrow: it can store its internal data in the Arrow format, and it can export data of all kinds in Arrow for flexible integration with other tools.

For example, the Python script below reads Arrow-formatted data from stdin and prints it back in a readable format batch by batch.

#! /usr/bin/env python
# Example usage:
# vast export arrow '#type ~ /suricata.*/' | ./scripts/print_arrow.py
import sys
import pyarrow

# Open stdin in binary mode.
istream = pyarrow.input_stream(sys.stdin.buffer)
batch_count = 0
row_count = 0

# An Arrow reader consumes a stream of batches with the same schema. When
# reading the result for a query that returns multiple schemas, VAST will use
# multiple writers. Hence, we try to open record batch readers until an
# exception occurs.
try:
    while True:
        print("open next reader")
        reader = pyarrow.ipc.RecordBatchStreamReader(istream)
        try:
            while True:
                batch = reader.read_next_batch()
                batch_count += 1
                row_count += batch.num_rows
                # Print the batch contents as a column-name-to-values dict.
                print(batch.to_pydict())
        except StopIteration:
            print("done with current reader, rows: " + str(row_count))
            batch_count = 0
            row_count = 0
except pyarrow.ArrowInvalid:
    # Opening a reader fails once stdin holds no further schema message.
    print("done with all readers")

The documentation for the arrow export format describes the command usage in more technical detail.