to_hive
Writes events to a URI using hive partitioning.
Description
Hive partitioning is a partitioning scheme where a set of fields is used to
partition events. For each combination of these fields, a directory is derived
under which all events with the same field values will be stored. For example,
if the events are partitioned by the fields year and month, then the files
in the directory /year=2024/month=10 will contain all events where
year == 2024 and month == 10.
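The directory derivation described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not the operator's implementation; the function name partition_dir and the base path /tmp/out/ are hypothetical.

```python
from typing import Any, Mapping, Sequence


def partition_dir(
    base_uri: str, event: Mapping[str, Any], partition_by: Sequence[str]
) -> str:
    """Build the directory for an event from its partition field values."""
    # One `field=value` path segment per partitioning field, in the given order.
    segments = [f"{field}={event[field]}" for field in partition_by]
    return "/".join([base_uri.rstrip("/"), *segments])


# Hypothetical event and base path, used only for illustration.
event = {"year": 2024, "month": 10, "message": "hello"}
print(partition_dir("/tmp/out/", event, ["year", "month"]))
# prints: /tmp/out/year=2024/month=10
```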
uri: str
The base URI for all partitions.
partition_by = list<field>
A list of fields that will be used for partitioning. Note that these fields will be elided from the output, as their values are already specified by the path.
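To make the elision concrete, the following Python sketch groups events by their partition field values and drops those fields from the stored records. It assumes events are plain dictionaries; the helper name group_and_elide and the sample events are hypothetical and not part of the operator.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple


def group_and_elide(
    events: Sequence[Dict[str, Any]], partition_by: Sequence[str]
) -> Dict[Tuple[Any, ...], List[Dict[str, Any]]]:
    """Group events by partition values and remove the partition fields."""
    groups: Dict[Tuple[Any, ...], List[Dict[str, Any]]] = defaultdict(list)
    for event in events:
        # The partition key is the tuple of values of the partitioning fields.
        key = tuple(event[field] for field in partition_by)
        # The stored record keeps every field except the partitioning ones.
        rest = {k: v for k, v in event.items() if k not in partition_by}
        groups[key].append(rest)
    return dict(groups)


events = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 2}]
print(group_and_elide(events, ["a"]))
# prints: {(0,): [{'b': 0}, {'b': 1}], (1,): [{'b': 2}]}
```

The value of a is no longer present in the records, but it can be recovered from the a=0 or a=1 segment of the directory each group is written to.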
format = str
The name of the format that will be used for writing, for example json or
parquet. This will also be used for the file extension.
timeout = duration (optional)
The time after which a new file will be opened for the same partition group.
Defaults to 5min.
max_size = int (optional)
The total file size after which a new file will be opened for the same partition
group. Note that files will typically be slightly larger than this limit,
because a new file is opened only after the limit has been exceeded. Defaults
to 100M.
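The interplay of timeout and max_size can be modeled with a small Python sketch. It is a simplified model of the behavior described above, not the operator's implementation: the class PartitionWriter is hypothetical, sizes are counted in bytes, the timeout is given in seconds, and the timeout is checked only when data arrives.

```python
import time
from typing import Callable, List, Optional


class PartitionWriter:
    """Models when a new file is started for one partition group."""

    def __init__(
        self,
        max_size: int,
        timeout: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_size = max_size
        self.timeout = timeout
        self.clock = clock
        self.closed_sizes: List[int] = []  # sizes of files already finished
        self.current_size = 0
        self.opened_at: Optional[float] = None  # set on first write to a file

    def _rotate(self) -> None:
        # Finish the current file and start counting for a fresh one.
        self.closed_sizes.append(self.current_size)
        self.current_size = 0
        self.opened_at = None

    def write(self, chunk: bytes) -> None:
        now = self.clock()
        # Time-based rotation: the current file has been open long enough.
        if self.opened_at is not None and now - self.opened_at >= self.timeout:
            self._rotate()
        if self.opened_at is None:
            self.opened_at = now
        self.current_size += len(chunk)
        # Size-based rotation happens only after the limit is exceeded,
        # so a finished file can be larger than max_size.
        if self.current_size > self.max_size:
            self._rotate()


writer = PartitionWriter(max_size=100, timeout=300.0)
for _ in range(3):
    writer.write(b"x" * 60)
print(writer.closed_sizes)
# prints: [120]
```

The first finished file holds 120 bytes although the limit is 100, which mirrors the note that files are typically slightly larger than max_size.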
Examples
Partition by a single field into local JSON files
Write a Parquet file into Azure Blob Store
Write as Parquet into the Azure Blob Filesystem, partitioned by year, month and day.
Write partitioned JSON into an S3 bucket
Write JSON into S3, partitioned by year and month, opening a new file after 1 GB.