Writes events to one or more objects in Google Cloud Storage.
to_google_cloud_storage url:string, [anonymous=bool, max_size=int, timeout=duration, partition_by=list<field>] { … }
Description
The to_google_cloud_storage operator writes events to Google Cloud Storage,
automatically opening new objects when a rotation condition triggers. It
supports hive-style partitioning through a ** placeholder in the URL and
per-partition unique object names through a {uuid} placeholder.
By default, authentication is handled by Google’s Application Default Credentials (ADC) chain, which may read from multiple sources:
- GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to a service account key file
- User credentials from gcloud auth application-default login
- Service account attached to the compute instance (Compute Engine, GKE)
- Google Cloud SDK credentials
url: string
URL identifying the Google Cloud Storage location where data should be written.
The syntax is gs://<bucket-name>/<full-path-to-object>(?<options>). The
<options> are query parameters. Per the Arrow documentation, the following
options exist:
For GCS, the supported parameters are scheme, endpoint_override, and
retry_limit_seconds.
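As a rough illustration of the URL syntax, the sketch below appends one of the listed query parameters to the object path. The bucket name, object path, and the value 60 are hypothetical, and the placement of the query string after a path containing {uuid} is an assumption based on the syntax shown above.

```tql
// Hypothetical bucket, path, and retry value, for illustration only.
// retry_limit_seconds is passed as a URL query parameter.
to_google_cloud_storage "gs://my-bucket/events/out_{uuid}.json?retry_limit_seconds=60" {
  write_ndjson
}
```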
The path portion may contain two placeholders:
- ** marks the location where hive partition segments are inserted. When present, partition_by must also be set, and vice versa.
- {uuid} expands to a fresh UUIDv7 for every object. This is required when partitioning or when rotation can produce multiple objects, so that rotated or per-partition objects do not overwrite each other.
anonymous = bool (optional)
Use anonymous credentials instead of any configured authentication. This only works for publicly writable buckets.
Defaults to false.
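A minimal sketch of an anonymous write, assuming a publicly writable bucket exists under the hypothetical name used here:

```tql
// Hypothetical bucket name; the bucket must allow unauthenticated writes.
to_google_cloud_storage "gs://my-public-bucket/events/out_{uuid}.json", anonymous=true {
  write_ndjson
}
```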
max_size = int (optional)
Rotates to a new file after the current file exceeds this size in bytes.
Because rotation only fires after the threshold is crossed, individual files
may be slightly larger than max_size.
Defaults to 100M.
timeout = duration (optional)
Rotates to a new file after the current file has been open for this duration. Rotation is measured from the time the file was first opened.
Defaults to 5min.
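The two rotation settings can be combined. In the sketch below, which uses a hypothetical bucket and illustrative values, a new object would be opened once the current one exceeds roughly 10 MB or has been open for one minute, depending on which condition triggers first.

```tql
// Hypothetical bucket and thresholds, for illustration only.
// {uuid} keeps rotated objects from overwriting each other.
to_google_cloud_storage "gs://my-bucket/logs/events_{uuid}.json", max_size=10M, timeout=1min {
  write_ndjson
}
```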
partition_by = list<field> (optional)
A list of fields used to partition events into separate files. For every
distinct combination of partition-field values, a separate file (or group of
rotated files) is written. The URL must contain a ** placeholder, which is
replaced by the hive-style path field1=value1/field2=value2/….
Unlike to_hive, the partitioning fields are not stripped from
the written events; they remain in each record.
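To make the partitioning behavior concrete, the sketch below annotates a pipeline with a hypothetical input event and the object location it would map to under the field=value layout described above. The bucket, field values, and the exact rendering of values in the path are assumptions for illustration.

```tql
// Hypothetical input event: {year: 2024, month: 6, msg: "hello"}
// Object location following the hive-style layout:
//   gs://my-bucket/events/year=2024/month=6/data_<uuid>.json
// The written record still contains year and month, because the
// partitioning fields are not stripped.
to_google_cloud_storage "gs://my-bucket/events/**/data_{uuid}.json", partition_by=[year, month] {
  write_ndjson
}
```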
{ … }
Pipeline that transforms the incoming events into the byte stream that is
written to each file. The pipeline must return bytes, so it must end with a
writer such as write_ndjson, write_parquet, or write_csv, optionally
followed by a compressor such as compress_gzip.
The subpipeline runs once per output file. When rotation or partitioning produces a new file, a new instance of the subpipeline is spawned for it.
Examples
Write a single NDJSON object
to_google_cloud_storage "gs://my-bucket/events/out_{uuid}.json" {
  write_ndjson
}

Partition events into Parquet objects by date
to_google_cloud_storage "gs://my-bucket/events/**/data_{uuid}.parquet", partition_by=[year, month] {
  write_parquet
}

Rotate to a new object every 100 MB
to_google_cloud_storage "gs://my-bucket/logs/events_{uuid}.json.gz", max_size=100M {
  write_ndjson
  compress_gzip
}