Fetch via HTTP and APIs

This guide shows you how to interact with HTTP APIs using the from_http and to_http operators. You’ll learn to make GET requests, send data, handle authentication, and implement pagination for large result sets.

Choosing the Right Operator

Tenzir has two HTTP client operators that share the same core client options:

from_http is an input operator that starts a pipeline with an HTTP request and parses the response into events. It streams response body chunks into the parser sub-pipeline as they arrive. Use it for standalone API calls and paginated API ingestion.
to_http is an output operator that sends all events from one invocation as a single HTTP request. It streams the request body from its printer sub-pipeline into the HTTP connection. Use it for webhooks and HTTP-based ingestion APIs, and wrap it in the every operator when you want time-based batches.

Most examples in this guide use from_http, because it is the operator for fetching data from APIs.

Basic API Requests

Start with these fundamental patterns for making HTTP requests to APIs.

Simple GET Requests

To fetch data from an API endpoint, pass the URL as the first parameter:

from_http "https://api.example.com/data.json"

The operator makes a GET request by default and sends the response body to the parser. As the server sends response chunks, Tenzir forwards them to the parser pipeline incrementally. If the response has a supported Content-Type header or the URL path has a supported extension, Tenzir infers the parser automatically.

Parsing the HTTP response body

You can omit the parsing sub-pipeline when Tenzir can infer the response format. Tenzir first checks a non-empty Content-Type response header and then falls back to the URL path extension.

Use an explicit parsing sub-pipeline when the response format is ambiguous, custom, or not reflected by the header or URL.

The operator automatically handles HTTP Content-Encoding. If the downloaded file itself is compressed, add the matching decompressor to the sub-pipeline.

For example, if an API returns CSV data, you can parse it as follows:

from_http "https://api.example.com/users" {
  read_csv
}

This parses the response from CSV into structured events that you can process further.

For example, to read a gzip-compressed JSON file, decompress it before parsing:

from_http "https://example.org/archive.json.gz" {
  decompress_gzip
  read_json
}

This decompresses the downloaded gzip file and then parses the response as JSON.

POST Requests with Data

Send data to APIs by specifying the method parameter as “post” and providing the request body in the body parameter:

let $body = {"name": "John", "email": "john@example.com"}
from_http "https://api.example.com/users", method="post", body=$body {
  read_json
}

Use to_http when you want to send existing events to an HTTP API:

from {name: "John", email: "john@example.com"}
to_http "https://api.example.com/users" {
  write_json
}

The from_http operator automatically uses post when you specify a body. The to_http operator uses post by default and sends all events from that operator invocation in one request, streaming the request body as the printer pipeline produces bytes.

Request Configuration

Configure requests with headers, authentication, and other options for different API requirements.

Adding Headers

Include custom headers by providing the headers parameter as a record containing key-value pairs:

let $headers = {
  "Authorization": f"Bearer {secret("YOUR_BEARER_TOKEN")}"
}
from_http "https://api.example.com/data", headers=$headers {
  read_json
}

Headers help you authenticate with APIs and specify request formats. Use the secret function to retrieve sensitive API tokens, as in the example above.

TLS and Security

Configure TLS by passing a record to the tls parameter with certificate paths:

from_http "https://secure-api.example.com/data",
  tls={
    certfile: "/path/to/client.crt",
    keyfile: "/path/to/client.key",
  } {
  read_json
}

Use these options when APIs require client certificate authentication.

To skip peer verification (e.g., for self-signed certificates in development):

from_http "https://dev-api.example.com/data",
  tls={skip_peer_verification: true} {
  read_json
}

Timeout and Retry Configuration

Configure timeouts and retry behavior by setting the connection_timeout, max_retry_count, and retry_delay parameters:

from_http "https://api.example.com/data",
  connection_timeout=10s,
  max_retry_count=3,
  retry_delay=2s {
  read_json
}

These settings retry transient transport failures and HTTP 429 and 5xx responses with exponential backoff.
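As a rough illustration of how these settings interact, the following Python sketch decides which responses trigger a retry and computes how long a client would wait before each attempt. It runs outside of Tenzir and is not the operator's implementation: the helper names should_retry and backoff_delays are hypothetical, and the doubling factor is an assumption, because this guide only states that the backoff is exponential.

```python
def should_retry(status):
    """Retry on HTTP 429 and on any 5xx status, as described above."""
    return status == 429 or 500 <= status <= 599


def backoff_delays(retry_delay, max_retry_count, factor=2.0):
    """Seconds to wait before each retry.

    factor=2.0 is an assumed growth rate, not a documented Tenzir value.
    """
    return [retry_delay * factor**attempt for attempt in range(max_retry_count)]


# With retry_delay=2s and max_retry_count=3, a doubling schedule would wait
# 2, 4, and 8 seconds before the three retries.
delays = backoff_delays(2.0, 3)
```

Under that assumption, the worst-case wait before the pipeline gives up is the sum of the list, which helps when you choose max_retry_count for an API with long outages.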

Response metadata

Use from_http to inspect HTTP response metadata while parsing the response.

Accessing response metadata

With from_http, use the $response variable inside a parsing pipeline to access HTTP status codes and headers:

from_http "https://api.example.com/status" {
  read_json
  status_code = $response.code
  server = $response.headers.Server
}

Pagination and Bulk Processing

Handle APIs that return large datasets across multiple pages.

Link Header Pagination

Many REST APIs (such as GitHub, GitLab, and Jira) include pagination URLs in the HTTP Link response header following RFC 8288. Use paginate="link" to follow these automatically:

from_http "https://api.github.com/repos/tenzir/tenzir/issues?per_page=10",
  paginate="link" {
  read_json
}

The operator parses the Link header, finds the rel=next relation, and continues fetching pages until the response no longer includes a next link.

This works with any API that returns a header like:

Link: <https://api.example.com/items?page=2>; rel="next"

Relative URLs in the Link header are resolved against the request URL, so both absolute and relative pagination links work correctly.
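To make the mechanics concrete, the following Python sketch shows how a client could pick the rel="next" target out of a Link header and resolve it against the request URL. It illustrates the RFC 8288 behavior described above and is not Tenzir's implementation. The helper name next_link is hypothetical, and the parsing is deliberately simplified: it assumes that link targets and parameters contain no commas or semicolons.

```python
import re
from urllib.parse import urljoin

# Matches one link-value: <target> followed by zero or more ;param=value pairs.
_LINK_VALUE = re.compile(r"<([^>]*)>((?:\s*;\s*[^;,]+)*)")


def next_link(link_header, request_url):
    """Return the absolute URL of the rel="next" link, or None if absent."""
    for target, params in _LINK_VALUE.findall(link_header):
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            # A rel parameter may list several space-separated relation types.
            rels = value.strip('"').lower().split()
            if name.strip().lower() == "rel" and "next" in rels:
                # urljoin resolves relative targets against the request URL.
                return urljoin(request_url, target)
    return None


# A relative target resolves against the URL of the current request.
url = next_link('</items?page=2>; rel="next"', "https://api.example.com/items?page=1")
```

Here url is https://api.example.com/items?page=2. When the header has no next relation, the function returns None, which corresponds to the point where pagination stops.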

OData pagination

Some APIs return an OData collection envelope with records in a top-level value array and the next page URL in @odata.nextLink. Microsoft Graph uses this pagination shape for many collection endpoints. Use paginate="odata" with from_http to unpack the envelope and follow the next link automatically:

from_http "https://graph.microsoft.com/v1.0/users",
  headers={
    "Authorization": f"Bearer {secret("MICROSOFT_GRAPH_TOKEN")}",
    "ConsistencyLevel": "eventual",
  },
  paginate="odata" {
  read_json
}

The operator emits each object from the top-level value array as an event. It follows a top-level string @odata.nextLink as an opaque URL, so you do not need to inspect or rebuild query parameters such as $skiptoken. Pagination stops when the response omits @odata.nextLink or when the field is not a string. Follow-up requests use GET and reuse the configured request headers.
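The loop behind this behavior can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Tenzir's implementation: odata_events is a hypothetical helper, fetch stands in for an HTTP GET that returns parsed JSON, and the two in-memory pages replace a real OData service.

```python
def odata_events(fetch, url):
    """Yield each object in the value array, following @odata.nextLink."""
    while url is not None:
        page = fetch(url)  # stand-in for a GET request returning parsed JSON
        yield from page.get("value", [])
        next_link = page.get("@odata.nextLink")
        # Stop when the link is missing or is not a string.
        url = next_link if isinstance(next_link, str) else None


# Two in-memory pages stand in for a real service (hypothetical URLs).
PAGES = {
    "https://example.com/users": {
        "value": [{"id": 1}, {"id": 2}],
        "@odata.nextLink": "https://example.com/users?$skiptoken=abc",
    },
    "https://example.com/users?$skiptoken=abc": {"value": [{"id": 3}]},
}

events = list(odata_events(PAGES.__getitem__, "https://example.com/users"))
```

The sketch treats the next link as an opaque string and never inspects the $skiptoken value, which mirrors the description above.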

Lambda-based pagination

The from_http operator also supports lambda-based pagination for APIs with custom pagination schemes. Pass a lambda as the paginate argument that extracts the next page URL from the parsed response:

from_http "https://api.example.com/search?q=tenzir",
  paginate=(x => x.next_url if x.has_more) {
  read_json
}

The operator continues making requests as long as the pagination lambda returns a valid URL.

You can also build pagination URLs dynamically:

let $base = "https://api.example.com/items"
from_http f"{$base}?category=security&page=1",
  paginate=(x => f"{$base}?category=security&page={x.page + 1}" if x.page < x.total_pages) {
  read_json
}

For APIs that put pagination state in the request body, return a request record from the lambda. Each request record patches the request that produced the current page. This is useful for OpenSearch-compatible APIs, including OpenSearch, Elasticsearch, and the Wazuh indexer, that use a search_after cursor in the request body.

Keep the from_http sub-pipeline focused on parsing the response envelope. Move operators such as unroll after from_http, because the pagination lambda receives the parsed page envelope.

let $headers = {
  "Authorization": f"Bearer {secret("OPENSEARCH_TOKEN")}",
}

from_http "https://opensearch.example.com/logs-*/_search",
  headers=$headers,
  body={
    size: 1000,
    sort: [{"@timestamp": "asc"}, {"_id": "asc"}],
    query: {match_all: {}},
  },
  paginate=(x => {
    body: {
      size: 1000,
      sort: [{"@timestamp": "asc"}, {"_id": "asc"}],
      query: {match_all: {}},
      search_after: x.hits.hits[-1].sort,
    },
  } if x.hits.hits != []) {
  read_json
}
unroll hits.hits
this = hits.hits._source

The first request uses POST because it has a body. Follow-up requests inherit that method and the configured headers. The returned request record replaces only the body.
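The patching behavior is easier to see in isolation. The following Python sketch simulates a search_after loop against an in-memory backend. It is an illustration, not Tenzir's implementation: paginate_search_after, fake_send, and the request dictionary layout are hypothetical, and the backend sorts on a single numeric key to keep the example short.

```python
def paginate_search_after(send, request):
    """Yield hits page by page, patching only the body between requests."""
    while request is not None:
        hits = send(request)["hits"]["hits"]
        yield from hits
        if not hits:
            # Empty page: no follow-up request, so pagination stops.
            request = None
        else:
            # Keep URL, method, and headers; move the cursor to the last hit.
            body = {**request["body"], "search_after": hits[-1]["sort"]}
            request = {**request, "body": body}


# Hypothetical in-memory backend with five documents sorted by one key.
DOCS = [{"_id": str(i), "sort": [i]} for i in range(5)]
SEEN = []  # records every request so the patching is visible


def fake_send(request):
    SEEN.append(request)
    body = request["body"]
    after = body.get("search_after", [-1])[0]
    matching = [doc for doc in DOCS if doc["sort"][0] > after]
    return {"hits": {"hits": matching[: body["size"]]}}


first = {
    "url": "https://opensearch.example.com/logs-*/_search",
    "method": "post",
    "headers": {"Authorization": "Bearer <token>"},
    "body": {"size": 2, "sort": [{"_id": "asc"}], "query": {"match_all": {}}},
}
ids = [hit["_id"] for hit in paginate_search_after(fake_send, first)]
```

Every entry in SEEN carries the same method and headers as the first request, and only the search_after value in the body changes from page to page. The final request returns an empty page, which ends the loop in the same way as the `if x.hits.hits != []` condition in the pipeline above.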

Scroll-style APIs can change the URL and body for the next request while they keep the method and headers:

let $search = "https://opensearch.example.com/logs/_search?scroll=1m"
let $scroll = "https://opensearch.example.com/_search/scroll"
let $headers = {
  "Authorization": f"Bearer {secret("OPENSEARCH_TOKEN")}",
}

from_http $search,
  headers=$headers,
  body={size: 1000, query: {match_all: {}}},
  paginate=(x => {
    url: $scroll,
    body: {
      scroll: "1m",
      scroll_id: x._scroll_id,
    },
  } if x.hits.hits != []) {
  read_json
}
unroll hits.hits
this = hits.hits._source

See the OpenSearch integration and the Wazuh integration for more search-backend examples that use the same pagination pattern.

Rate limiting

Control request frequency by configuring paginate_delay to add delays between pagination requests:

from_http "https://api.example.com/scan?q=example.com",
  paginate=(x => x.next_url if x.has_next),
  paginate_delay=500ms {
  read_json
}

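Use paginate_delay to keep the request rate within the limit that the API documents, so that the server does not throttle you with responses such as HTTP 429.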

Controlling Batching with to_http

When sending events to an HTTP endpoint, you can control how many events go into each request.

Send all events in a single request

By default, to_http collects all input events into one request. The printer sub-pipeline serializes the events and Tenzir streams the body to the connection:

from {message: "event-1"},
     {message: "event-2"},
     {message: "event-3"}
to_http "https://example.com/ingest" {
  write_ndjson
}

This is useful for one-shot pipelines that produce a finite set of events.

Send one request per event

Wrap to_http in the each operator to send a separate HTTP request for every event:

from {message: "event-1"},
     {message: "event-2"},
     {message: "event-3"}
each {
  from $this
  to_http "https://example.com/webhook" {
    write_json
  }
}

This sends three independent POST requests, each carrying one event as JSON. Use parallel to control how many requests run concurrently:

subscribe "alerts"
each parallel=4 {
  from $this
  to_http "https://example.com/webhook" {
    write_json
  }
}

When you send a single event as JSON or form-encoded data, you can use from_http instead. This makes the pipeline more concise and also lets you process the response:

subscribe "alerts"
each {
  from_http f"https://example.com/webhook", body=$this {
    read_json
  }
}

Per-event delivery is ideal for webhooks that expect one payload per call or when each event must be sent immediately.

Send periodic batches

Wrap to_http in the every operator to accumulate events over a time window and then flush them as one request:

subscribe "stream-of-events"
every 5m {
  to_http "https://example.com/ingest" {
    write_parquet
  }
}

Every 5 minutes, the every operator stops the input, which causes to_http to finish the request, wait for the response, and then restart.

Practical Examples

These examples demonstrate typical use cases for API integration in real-world scenarios.

API Monitoring

Monitor API health and response times:

from_http "https://api.example.com/health" {
  read_json
  date = $response.headers.Date.parse_time("%a, %d %b %Y %H:%M:%S %Z")
  latency = now() - date
}

The example above parses the Date header from the HTTP response into a timestamp with the parse_time function and then subtracts it from the current wall-clock time, which the now function returns.
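For comparison, the same calculation can be sketched in Python with the standard library. This is only an illustration of the arithmetic, not part of Tenzir: date_header_age is a hypothetical helper, and the optional now argument exists so that the example is reproducible.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def date_header_age(date_header, now=None):
    """Return how long ago the server stamped the response."""
    sent = parsedate_to_datetime(date_header)  # timezone-aware for "GMT" dates
    if now is None:
        now = datetime.now(timezone.utc)
    return now - sent


# Fixed timestamps make the result reproducible.
age = date_header_age(
    "Wed, 21 Oct 2015 07:28:00 GMT",
    now=datetime(2015, 10, 21, 7, 28, 2, tzinfo=timezone.utc),
)
```

Because the HTTP Date header has a resolution of one second and depends on the server's clock, treat the result as a coarse estimate rather than a precise latency measurement.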

Error Handling

Handle API errors and failures gracefully in your data pipelines.

Retry Configuration

Configure automatic retries by setting max_retry_count to control the number of retry attempts and retry_delay to control the base delay between retries:

from_http "https://unreliable-api.example.com/data",
  max_retry_count=5,
  retry_delay=2s {
  read_json
}

Status code handling

By default, from_http fails the pipeline for non-2xx responses and emits an error instead of producing an event.

If you want to handle HTTP errors gracefully, set error_field so from_http stores the error as a blob in that field and continues emitting events:

from_http "https://my-server/health", error_field="error" {
  read_lines
}
where error != null

Best Practices

Follow these practices for reliable and efficient API integration:

Secure credentials. Access API keys and tokens via secrets, not in code.
Respect rate limits. Use paginate_delay to control request rates.
Configure appropriate retry logic. Set max_retry_count and retry_delay to handle network failures and server unavailability.
Handle errors gracefully. Use error_field in from_http if you want to keep processing non-2xx responses instead of failing the pipeline.