
This guide shows you how to read files and monitor directories using the from_file operator. You’ll learn to read individual files, batch-process directories, and set up real-time file monitoring.

The from_file operator handles various file types and formats. Start with these fundamental patterns for reading individual files.

To read a single file, specify the path to the from_file operator:

from_file "/path/to/file.json"

The operator automatically detects the file format from the file extension. This works for all supported formats, including JSON, CSV, and Parquet.
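
For example, a Parquet file is detected and read the same way (the path is illustrative):

from_file "/path/to/file.parquet"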

The operator handles compressed files automatically, with no additional configuration needed:

from_file "/path/to/file.csv.gz"

Supported compression formats include gzip, bzip2, and Zstandard (zstd).
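
For instance, a bzip2-compressed JSON file is decompressed and parsed in one step (illustrative path):

from_file "/path/to/file.json.bz2"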

When automatic format detection doesn’t suffice, specify a custom parsing pipeline:

from_file "/path/to/file.log" {
read_syslog
}

The parsing pipeline runs on the file content and must return events.
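
For example, if a file contains CSV data but lacks a matching extension, you can parse it explicitly, here sketched with a read_csv parser (assuming a CSV parser by that name is available; the path is illustrative):

from_file "/path/to/export.txt" {
  read_csv
}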

You can process multiple files efficiently using glob patterns. This section covers batch processing and recursive directory operations.

Use glob patterns to process multiple files at once:

from_file "/path/to/directory/*.csv.zst"

This example processes all Zstd-compressed CSV files in the specified directory.

You can also use glob patterns to consume files regardless of their format:

from_file "~/data/**"

This processes all files in the ~/data directory and its subdirectories, automatically detecting and parsing each file format.

Use ** to match files recursively through subdirectories:

from_file "/path/to/directory/**.csv"

When you process multiple files with custom parsing, the pipeline runs separately for each file:

from_file "/path/to/directory/*.log" {
read_lines
}

Process all Parquet files in a data directory using recursive globbing:

from_file "/data/exports/**.parquet"

Set up real-time file processing by monitoring directories for changes. These features enable continuous data ingestion workflows.

Use the watch parameter to monitor a directory for new files:

from_file "/path/to/directory/*.csv", watch=true

This sets up continuous monitoring, processing new files as they appear in the directory.

Combine watching with automatic file removal using the remove parameter:

from_file "/path/to/directory/*.csv", watch=true, remove=true

This approach lets you implement file-based queues in which processed files are cleaned up automatically.
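
As a sketch of such a queue, the following watches a hypothetical spool directory, parses each new file with a custom pipeline, and removes it after ingestion:

from_file "/path/to/spool/*.log", watch=true, remove=true {
  read_lines
}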

Monitor a log directory and process files as they arrive:

from_file "/var/log/application/*.log", watch=true {
read_lines
}

Process archived data and remove files after successful ingestion:

from_file "/archive/*.csv.gz", remove=true

Access files directly from cloud storage providers using their native URLs. The operator supports major cloud platforms transparently.

Access S3 buckets directly using s3:// URLs:

from_file "s3://bucket/path/to/file.csv"

Glob patterns work with S3 as well:

from_file "s3://bucket/data/**/*.parquet"

Access GCS buckets using gs:// URLs:

from_file "gs://bucket/path/to/file.csv"

Access Azure Blob Storage using abfs:// URLs:

from_file "abfs://container/path/to/file.csv"

Glob patterns work with Azure Blob Storage as well:

from_file "abfs://container/data/**/*.parquet"

Cloud storage integration uses Apache Arrow’s filesystem APIs and supports the same glob patterns and options as local files, including recursive globbing across cloud storage hierarchies.
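
Because the options carry over, you can, for example, watch an S3 prefix and remove objects after ingestion (a sketch with a hypothetical bucket; removal requires delete permissions on the bucket):

from_file "s3://bucket/incoming/*.json", watch=true, remove=true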
