Skip to main content
Version: Next

to_hive

Writes events to a URI using hive partitioning.

to_hive uri:string, partition_by=list<field>, format=string, [timeout=duration, max_size=int]

Description

Hive partitioning is a partitioning scheme where a set of fields is used to partition events. For each combination of these fields, a directory is derived under which all events with the same field values will be stored. For example, if the events are partitioned by the fields year and month, then the files in the directory /year=2024/month=10 will contain all events where year == 2024 and month == 10.

uri: string

The base URI for all partitions.

partition_by = list<field>

A list of fields that will be used for partitioning. Note that these fields will be elided from the output, as their value is already specified by the path.

format = string

The name of the format that will be used for writing, for example json or parquet. This will also be used for the file extension.

timeout = duration (optional)

The time after which a new file will be opened for the same partition group. Defaults to 5min.

max_size = int (optional)

The total file size after which a new file will be opened for the same partition group. Note that files will typically be slightly larger than this limit, because it opens a new file when only after it is exceeded. Defaults to 100M.

Examples

Partition by a single field into local JSON files

from [{a: 0, b: 0}, {a: 0, b: 1}, {a: 1, b: 2}]
to_hive "/tmp/out/", partition_by=[a], format="json"
// This pipeline produces two files:
// -> /tmp/out/a=0/1.json:
//    {"b": 0}
//    {"b": 1}
// -> /tmp/out/a=1/2.json:
//    {"b": 2}

Write a Parquet file into Azure Blob Store

Write as Parquet into the Azure Blob Filesystem, partitioned by year, month and day.

to_hive "abfs://domain/bucket", partition_by=[year, month, day], format="parquet"
// -> abfs://domain/bucket/year=<year>/month=<month>/day=<day>/<num>.parquet

Write partitioned JSON into an S3 bucket

Write JSON into S3, partitioned by year and month, opening a new file after 1 GB.

year = ts.year()
month = ts.month()
to_hive "s3://my-bucket/some/subdirectory",
  partition_by=[year, month],
  format="json",
  max_size=1G
// -> s3://my-bucket/some/subdirectory/year=<year>/month=<month>/<num>.json