Skip to content

Fetch data from APIs

This guide covers how to fetch data from HTTP APIs using the from_http and http operators. Whether you make simple GET requests, handle authentication, or implement pagination, the operators provide flexible HTTP client capabilities for API integration.

Start with these fundamental patterns for making HTTP requests to APIs.

To fetch data from an API endpoint, pass the URL as the first parameter to the from_http operator:

from_http "https://api.example.com/data"

The operator makes a GET request by default and forwards the response as an event. The from_http operator is an input operator, i.e., it starts a pipeline. The companion operator http is a transformation, allowing you to specify the URL as a field by referencing an event field that contains the URL:

from {url: "https://api.example.com/data"}
http url

This pattern is useful when processing multiple URLs or when URLs are generated dynamically. Most of our subsequent examples use from_http, as the operator options are very similar.

The from_http and http operators attempt to find the right parser for the HTTP response body by inspecting the Content-Type header. For example, if the Content-Type is application/json, then Tenzir applies the read_json operator.

You can manually override the parser for the response body by specifying a parsing pipeline, i.e., a pipeline that transforms bytes to events. For example, if an API returns CSV data without a proper Conten-Type, you can specify the parsing pipeline as follows:

from_http "https://api.example.com/users" {
read_csv
}

This parses the response from CSV into structured events that you can process further.

Send data to APIs by specifying the method parameter as “post” and providing the request body in the payload parameter:

from_http "https://api.example.com/users",
method="post",
payload={"name": "John", "email": "john@example.com"}

Similarly, with the http operator you can also parameterize the entire HTTP request using event fields by referencing field values for each parameter:

from {
url: "https://api.example.com/users",
method: "post",
data: {
name: "John",
email: "john@example.com"
}
}
http url, method=method, payload=data

The operators automatically uses POST method when you specify a payload.

Configure requests with headers, authentication, and other options for different API requirements.

Include custom headers by providing the headers parameter as a record containing key-value pairs:

from_http "https://api.example.com/data", headers={
"Authorization": "Bearer your-token-here",
"Content-Type": "application/json"
}

Headers help you authenticate with APIs and specify request formats.

Enable TLS by setting the tls parameter to true and configure client certificates using the certfile and keyfile parameters:

from_http "https://secure-api.example.com/data",
tls=true,
certfile="/path/to/client.crt",
keyfile="/path/to/client.key"

Use these options when APIs require client certificate authentication.

Configure timeouts and retry behavior by setting the connection_timeout, max_retry_count, and retry_delay parameters:

from_http "https://api.example.com/data",
timeout=10s,
max_retries=3,
retry_delay=2s

These settings help handle network issues and API rate limiting gracefully.

Use HTTP requests to enrich existing data with information from external APIs.

Keep original event data while adding API responses by specifying the response_field parameter to control where the response is stored:

from {
domain: "example.com",
severity: "HIGH",
api_url: "https://threat-intel.example.com/lookup",
response_field: "threat_data",
}
http f"{api_url}?domain={domain}", response_field=response_field

This approach preserves your original data and adds API responses in a specific field.

Capture HTTP response metadata by specifying the metadata_field parameter to store status codes and headers separately from the response body:

from_http "https://api.example.com/status",
response_field=data,
metadata_field=http_meta

The metadata includes status codes and response headers for debugging and monitoring.

Handle APIs that return large datasets across multiple pages.

Implement automatic pagination by providing a lambda function to the paginate parameter that extracts the next page URL from the response:

from_http "https://api.example.com/search?q=query",
paginate=(response => "next_page_url" if response.has_more)

The operator continues making requests as long as the pagination lambda function returns a valid URL.

Handle APIs with custom pagination schemes by building pagination URLs dynamically using expressions that reference response data:

let $base_url = "https://api.example.com/items"
from_http f"{$base_url}?page=1",
paginate=(x => f"{$base_url}?page={x.page + 1}" if x.page < x.total_pages,

This example builds pagination URLs dynamically based on response data.

Control request frequency by configuring the paginate_delay parameter to add delays between requests and the parallel parameter to limit concurrent requests:

from {
url: "https://api.example.com/data",
paginate_delay: 500ms,
parallel: 2
}
http url,
paginate="next_url" if has_next,
paginate_delay=paginate_delay,
parallel=parallel

Use paginate_delay and parallel to manage request rates appropriately.

These examples demonstrate typical use cases for API integration in real-world scenarios.

Monitor API health and response times:

from_http "https://api.example.com/health", metadata_field=metadata
select date=metadata.headers["Date"].parse_time("%a, %d %b %Y %H:%M:%S %Z")
latency = now() - date

The above example parses the Date header from the HTTP response via parse_time into a timestamp and then compares it to the current wallclock time using the now function.

Nit: %T is a shortcut for %H:%M:%S.

Handle API errors and failures gracefully in your data pipelines.

Configure automatic retries by setting the max_retry_count parameter to specify the number of retry attempts and retry_delay to control the time between retries:

from_http "https://unreliable-api.example.com/data",
max_retries=5,
retry_delay=2s

Check HTTP status codes by capturing metadata and filtering based on the code field to handle different response types:

from_http "https://api.example.com/data", metadata_field=metadata
where metadata.code >= 200 and metadata.code < 300

Follow these practices for reliable and efficient API integration:

  1. Use appropriate timeouts. Set a reasonable connection_timeout for your use case.
  2. Implement retry logic. Configure max_retry_count and retry_delay for handling transient failures.
  3. Respect rate limits. Use parallel and paginate_delay to control request rates.
  4. Handle errors gracefully. Check status codes in metdata (metadata_field) and implement fallback logic.
  5. Secure credentials. Access API keys and tokens via secrets, not in code.
  6. Monitor API usage. Track response times and error rates for performance.

Last updated: