to_hive
Writes events to a URI using hive partitioning.
Description
Hive partitioning is a partitioning scheme where a set of fields is used to
partition events. For each combination of values of these fields, a directory is
derived under which all events with the same field values are stored. For
example, if the events are partitioned by the fields year and month, then the
files in the directory /year=2024/month=10 contain all events where
year == 2024 and month == 10.
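The directory derivation described above can be illustrated with a minimal Python sketch. This is not the operator's implementation; the helper name partition_path and the sample events are hypothetical and only model the scheme: one key=value path segment per partition field, with the partition fields dropped from the stored record because the path already carries their values.

```python
from collections import defaultdict


def partition_path(event: dict, partition_by: list) -> tuple:
    """Return the hive-style directory for an event and the event
    without its partition fields (hypothetical helper)."""
    # One "field=value" segment per partition field, in the given order.
    segments = [f"{name}={event[name]}" for name in partition_by]
    # The partition fields are already encoded in the path, so drop them.
    rest = {k: v for k, v in event.items() if k not in partition_by}
    return "/" + "/".join(segments), rest


# Made-up sample events for illustration only.
events = [
    {"year": 2024, "month": 10, "msg": "a"},
    {"year": 2024, "month": 10, "msg": "b"},
    {"year": 2024, "month": 11, "msg": "c"},
]

# Group the remaining records by their derived directory.
groups = defaultdict(list)
for event in events:
    path, rest = partition_path(event, ["year", "month"])
    groups[path].append(rest)
```

In this sketch the first two events end up under /year=2024/month=10 and the third under /year=2024/month=11.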
uri: str
The base URI for all partitions.
partition_by = list<field>
A list of fields that will be used for partitioning. Note that these fields will be elided from the output, as their value is already specified by the path.
format = str
The name of the format that will be used for writing, for example json or
parquet. This will also be used for the file extension.
timeout = duration (optional)
The time after which a new file will be opened for the same partition group.
Defaults to 5min.
max_size = int (optional)
The total file size after which a new file will be opened for the same partition
group. Note that files will typically be slightly larger than this limit,
because a new file is opened only after the limit is exceeded. Defaults to 100M.
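How timeout and max_size interact for a single partition group can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the operator's actual logic: the class PartitionWriter is hypothetical, 100M is assumed to mean 100,000,000 bytes, and the timeout is assumed to be checked when the next write arrives. The size check runs after each write, which is why a finished file can end up larger than the limit.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PartitionWriter:
    """Hypothetical model of file rotation for one partition group."""
    timeout_s: float = 300.0      # mirrors the documented 5min default
    max_size: int = 100_000_000   # assumption: 100M means 10^8 bytes
    files: list = field(default_factory=list)  # sizes of finished files
    _opened_at: Optional[float] = None
    _size: int = 0

    def write(self, nbytes: int, now: float) -> None:
        # Rotate first if the current file has been open for too long.
        if self._opened_at is not None and now - self._opened_at >= self.timeout_s:
            self._close()
        # Open a new file if none is currently open.
        if self._opened_at is None:
            self._opened_at = now
        self._size += nbytes
        # The limit is checked only after writing, so the finished
        # file may be larger than max_size.
        if self._size > self.max_size:
            self._close()

    def _close(self) -> None:
        self.files.append(self._size)
        self._opened_at = None
        self._size = 0


# Small limits so the behavior is easy to follow.
w = PartitionWriter(timeout_s=300, max_size=100)
w.write(60, now=0)
w.write(60, now=1)    # 120 bytes exceeds 100: file is closed at 120 bytes
w.write(10, now=2)    # opens a second file
w.write(10, now=400)  # second file timed out: closed at 10 bytes, third opened
```

After these calls, the model has finished two files of 120 and 10 bytes, and a third file with 10 bytes is still open.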
Examples
Partition by a single field into local JSON files:
Write as Parquet into the Azure Blob Filesystem, partitioned by year, month and day.
Write JSON into S3, partitioned by year and month, opening a new file after 1 GB.