Smarter HTTP Ingestion

This release introduces format and compression inference from URLs for HTTP data sources, streamlining data loading workflows. It also includes bug fixes for secret resolution and HTTP server mode.

Download the release on GitHub.

Features

HTTP format and compression inference

The from_http and http operators now automatically infer the file format (such as JSON, CSV, Parquet, etc.) and compression type (such as gzip, zstd, etc.) directly from the URL’s file extension, just like the generic from operator. This makes it easier to load data from HTTP sources without manually specifying the format or decompression step.

If the format or compression cannot be determined from the URL, the operators will fall back to using the HTTP Content-Type and Content-Encoding response headers to determine how to parse and decompress the data.

Examples

Inference Succeeds

from_http "https://example.org/data/events.csv.zst"

The operator infers both the zstd compression and the CSV format from the file extension, decompresses, and parses accordingly.

Inference Fails, Fallback to Headers

from_http "https://example.org/download"

If the URL does not contain a recognizable file extension, the operator will use the HTTP Content-Type and Content-Encoding headers from the response to determine the format and compression.

Manual Specification Required

from_http "https://example.org/archive" {
  decompress_gzip
  read_json
}

If neither the URL nor the HTTP headers provide enough information, you can explicitly specify the decompression and parsing steps using a pipeline argument.

By @raxyte in #5300.

Bug Fixes

Fix crash in `from secret`

We fixed a crash in from secret("key"). This is now gracefully rejected, as generic from cannot resolve secrets.

By @IyeOnline in #5321.

`from_http server=true` assertion failures

from_http server=true failed with internal assertions and stopped the pipeline on receiving requests when metadata_field was specified.

By @raxyte in #5325.