
This guide shows you how to read files and monitor directories using the from_file operator. You’ll learn to read individual files, batch-process directories, and set up real-time file monitoring.

The from_file operator handles various file types and formats. Start with these fundamental patterns for reading individual files.

To read a single file, specify the path to the from_file operator:

from_file "/path/to/file.json"

The operator automatically detects the file format from the file extension. This works for all supported formats, including JSON, CSV, and Parquet.
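
For example, a Parquet file is detected and read the same way (the path is illustrative):

from_file "/path/to/file.parquet"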

The operator handles compressed files automatically, with no additional configuration needed:

from_file "/path/to/file.csv.gz"

Supported compression formats include gzip, bzip2, and Zstandard (zstd).
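
For instance, a bzip2-compressed JSON file is decompressed and parsed in one step (illustrative path):

from_file "/path/to/file.json.bz2"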

When automatic format detection doesn’t suffice, specify a custom parsing pipeline:

from_file "/path/to/file.log" {
read_syslog
}

The parsing pipeline runs on the file content and must return events.
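
For example, if a file contains CSV data but lacks a matching extension, you can parse it explicitly, here sketched with a read_csv parser (assuming a CSV parser by that name is available; the path is illustrative):

from_file "/path/to/export.txt" {
  read_csv
}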

You can process multiple files efficiently using glob patterns. This section covers batch processing and recursive directory operations.

Use glob patterns to process multiple files at once:

from_file "/path/to/directory/*.csv.zst"

This example processes all Zstd-compressed CSV files in the specified directory.

You can also use glob patterns to consume files regardless of their format:

from_file "~/data/**"

This processes all files in the ~/data directory and its subdirectories, automatically detecting and parsing each file format.

Use ** to match files recursively through subdirectories:

from_file "/path/to/directory/**.csv"

When you process multiple files with custom parsing, the pipeline runs separately for each file:

from_file "/path/to/directory/*.log" {
read_lines
}

Process all Parquet files in a data directory using recursive globbing:

from_file "/data/exports/**.parquet"

Set up real-time file processing by monitoring directories for changes. These features enable continuous data ingestion workflows.

Use the watch parameter to monitor a directory for new files:

from_file "/path/to/directory/*.csv", watch=true

This sets up continuous monitoring, processing new files as they appear in the directory.

Combine watching with automatic file removal using the remove parameter:

from_file "/path/to/directory/*.csv", watch=true, remove=true

This approach lets you implement file-based queues in which processed files are cleaned up automatically.
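
As a sketch of such a queue, the following watches a hypothetical spool directory, parses each new file with a custom pipeline, and removes it after ingestion:

from_file "/path/to/spool/*.log", watch=true, remove=true {
  read_lines
}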

Monitor a log directory and process files as they arrive:

from_file "/var/log/application/*.log", watch=true {
read_lines
}

Process archived data and remove files after successful ingestion:

from_file "/archive/*.csv.gz", remove=true

Access files directly from cloud storage providers using their native URLs. The operator supports major cloud platforms transparently.

Access S3 buckets directly using s3:// URLs:

from_file "s3://bucket/path/to/file.csv"

Glob patterns work with S3 as well:

from_file "s3://bucket/data/**/*.parquet"

Access GCS buckets using gs:// URLs:

from_file "gs://bucket/path/to/file.csv"

Access Azure Blob Storage using abfs:// URLs:

from_file "abfs://container/path/to/file.csv"

Glob patterns work with Azure Blob Storage as well:

from_file "abfs://container/data/**/*.parquet"

Cloud storage integration uses Apache Arrow’s filesystem APIs and supports the same glob patterns and options as local files, including recursive globbing across cloud storage hierarchies.
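
Because the options carry over, you can, for example, watch an S3 prefix and remove objects after ingestion (a sketch with a hypothetical bucket; removal requires delete permissions on the bucket):

from_file "s3://bucket/incoming/*.json", watch=true, remove=true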
