We've just released Tenzir v4.2, which introduces new connectors: S3 and GCS
for interacting with blob storage, and ZeroMQ for writing distributed
multi-hop pipelines. There are also new HTTP and FTP loaders, a new lines
parser for easier text processing, and a bunch of PCAP quality-of-life
improvements.
The new s3 connector hooks up Tenzir to the vast amounts of data residing on
Amazon S3 and S3-compatible object storage systems. With the s3 loader, you
can access objects in S3 buckets, provided you have the proper credentials:
tenzir 'from s3 s3://bucket/mystuff/file.json'
Internally, we use Arrow's filesystem abstraction to establish connections.
This abstraction already handles AWS's default credentials provider chain. If
you have set up your AWS account in this chain, you don't need to configure
credentials again in config files or similar places.
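For reference, one of the places the default chain looks is the shared
credentials file; a minimal sketch with placeholder values could look like
this:
# ~/.aws/credentials (placeholder values)
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY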
S3 buckets can also be public, meaning you don't need any specific credentials
to access the objects therein. AWS offers tons of such public (read-only)
buckets with scientific data on their Marketplace. Tenzir can consume this
public read-only data as well, for example population density and demographic
estimate data:
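As a sketch, reading a CSV object from a public bucket looks like this (the
bucket and object path are placeholders, not the actual dataset location):
# No credentials needed for public buckets:
tenzir 'from s3 s3://some-public-bucket/data/estimates.csv read csv'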
The original CSV data is a bit unpolished, e.g., there are line breaks
and superfluous commas in the middle of some values. Tenzir's csv parser
will ignore those lines, but the rest of the data is at your fingertips.
The s3 writer uploads the pipeline output to an object in the bucket:
tenzir "export | to s3 s3://mybucket/folder/ok.json"
For S3, the options that can be included in the URI as query parameters are
region, scheme, endpoint_override, access_key, secret_key,
allow_bucket_creation, and allow_bucket_deletion.
The most exciting of these options is endpoint_override, as it allows you to
connect to endpoints of other S3-compatible storage systems:
from s3 s3://examplebucket/test.json?endpoint_override=s3.us-west.mycloudservice.com
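You can also combine several of the options above as standard &-separated
query parameters; here is a sketch with placeholder values:
# Placeholder region and credentials, passed as URI query parameters:
tenzir 'from s3 s3://examplebucket/test.json?region=eu-central-1&access_key=MY_KEY&secret_key=MY_SECRET'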
The s3 connector is a huge step for Tenzir's capability to interact with blob
storage. Our list of connectors is continuously growing, and our modular
framework lets us crank out many more with ease. More connectors, more data,
more information, more value!
The new gcs connector tries to retrieve the appropriate credentials using
Google's Application Default Credentials. This means you can conveniently use
the connector to read from or write to a storage bucket:
from gcs gs://bucket/path/to/file
to gcs gs://bucket/path/to/file
As with s3, you can also override the default endpoint and other options by
passing URI query parameters. Have a look at the connector documentation for
further details.
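For instance, overriding the endpoint looks like this (the hostname is
illustrative):
from gcs gs://bucket/path/to/file?endpoint_override=storage.example.com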
The new zmq connector makes it easy to interact with the raw bytes in ZeroMQ
messages. We model the zmq loader as a subscriber with a SUB socket, and the
saver as a publisher with a PUB socket.
What's nice about ZeroMQ is that the directionality of connection establishment
is independent of the socket type. So either end can bind or connect. We opted
for the subscriber to connect by default, and the publisher to bind. You can
override this with the --bind and --connect flags.
Even though we're using a lossy PUB-SUB socket pair, we've added a thin
layer of reliability in that a Tenzir pipeline won't send or receive ZeroMQ
messages before it has at least one connected socket.
Want to exchange and convert events with just two commands? Here's how you
publish JSON and continue as CSV on the other end:
# Publish some data via a ZeroMQ PUB socket:
tenzir 'show operators | to zmq write json'
# Subscribe to it in another process:
tenzir 'from zmq read json | write csv'
You can also work with operators that exchange raw bytes. Want to send chunks
of network packets to a remote machine? Here you go:
# Publish raw bytes:
tenzir 'load nic eth0 | save zmq'
# Tap into the raw feed at the other end and start parsing:
tenzir 'load zmq | read pcap | decapsulate'
Need to expose the source side of a pipeline as a listening instead of
connecting socket? No problem:
# Bind instead of connect with the ZeroMQ SUB socket:
tenzir 'from zmq --bind'
These examples show the power of composability: Tenzir operators work with both
bytes and events, enabling in-flight reshaping, format conversion, or simply
shipping data with ease.
We've added a new round of loaders for HTTP and FTP, named http, https, ftp,
and ftps. This makes it a lot easier to pull data that lives at a remote web
or file server into a pipeline. No more shell curl shenanigans!
We modeled the http and https loaders after HTTPie, which comes with an
expressive and intuitive command-line syntax. We recommend studying the HTTPie
documentation to understand the full extent of the command-line interface. In
many cases, you can take an HTTPie command line verbatim and use it as a
drop-in with the HTTP loader, e.g., the invocation
http PUT pie.dev/put X-API-Token:123 foo=bar
becomes
from http PUT pie.dev/put X-API-Token:123 foo=bar
More generally, if your HTTPie command line is http X, then you can write from
http X to obtain an event stream or load http X for a byte stream. (Note that
we have implemented only the parts of the HTTPie syntax most relevant to our
users.)
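For instance, here is a quick sketch using pie.dev/get, HTTPie's demo GET
endpoint; both forms should yield the same data, assuming the event form
parses the JSON response for you:
# Obtain an event stream directly:
tenzir 'from http pie.dev/get'
# Or obtain the raw bytes and parse them explicitly:
tenzir 'load http pie.dev/get | read json'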
Internally, we rely on libcurl to perform the actual
file transfer. It is noteworthy that libcurl supports a lot of protocols:
libcurl is a free and easy-to-use client-side URL transfer library, supporting
DICT, FILE, FTP, FTPS, GOPHER, GOPHERS, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS,
MQTT, POP3, POP3S, RTMP, RTMPS, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS,
TELNET and TFTP. libcurl supports SSL certificates, HTTP POST, HTTP PUT, FTP
uploading, HTTP form based upload, proxies, HTTP/2, HTTP/3, cookies,
user+password authentication (Basic, Digest, NTLM, Negotiate, Kerberos), file
transfer resume, http proxy tunneling and more!
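Speaking of protocols, the ftp and ftps loaders follow the same pattern; here
is a minimal sketch with an illustrative host and path:
# Hypothetical FTP server and file; fetch and parse a JSON document:
tenzir 'from ftp ftp.example.org/pub/data.json read json'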
Let us know if you have use cases for any of these. Next, let's take a look at
some more features that you can readily work with.
The new lines parser splits its input at newline characters and
produces events with a single field representing the line. This parser is
especially useful for onboarding line-based text files into pipelines.
The -s|--skip-empty flag ignores empty lines, which comes in handy when
reading text files.
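For example, here is a minimal sketch that reads a hypothetical text file line
by line:
# Hypothetical file path; produces one event per non-empty line:
tenzir 'from file /tmp/notes.txt read lines --skip-empty'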
The pcap parser can now read concatenated PCAP files, allowing you to easily
process large amounts of trace files. This comes in especially handy on the
command line:
cat *.pcap | tenzir 'read pcap'
The nic loader has a new flag --emit-file-headers that prepends a PCAP file
header to every batch of bytes it produces, yielding a stream of concatenated
PCAP files. This gives rise to creative use cases involving packet shipping.
For example, to ship blocks of packets as "micro traces" via 0mq, you could do:
load nic eth0 --emit-file-headers | save zmq
This creates a 0mq PUB socket where subscribers can come and go. Each 0mq
message is a self-contained PCAP trace, which avoids painful resynchronization
logic. You can consume this feed with a remote subscriber:
load zmq | read pcap
Finally, we also made it easier to identify available network interfaces when
using the nic loader: show nics now returns a list of available interfaces.
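For example:
# List the available network interfaces:
tenzir 'show nics'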