This guide shows you how to clean and normalize values in your data before mapping to a schema. You’ll learn to handle null placeholders, normalize sentinel values, fix types, and provide defaults.
Replace null placeholders
Many data sources use string placeholders instead of actual null values.
Common patterns include "None", "N/A", "-", and empty strings.
Normalize across all fields
Use replace to convert placeholders to null across all string fields:

    from {status: "active", error: "None"},
         {status: "N/A", error: "timeout"},
         {status: "-", error: ""}
    replace what="None", with=null
    replace what="N/A", with=null
    replace what="-", with=null
    replace what="", with=null

    {status: "active", error: null}
    {status: null, error: "timeout"}
    {status: null, error: null}

Normalize specific fields
When you only want to normalize certain fields, specify them explicitly:

    from {enabled: "YES", disabled: "NO", status: "YES"}
    replace enabled, what="YES", with=true
    replace enabled, what="NO", with=false
    replace disabled, what="YES", with=true
    replace disabled, what="NO", with=false
    // status remains unchanged

    {enabled: true, disabled: false, status: "YES"}

Chain replacements
Combine multiple replacements for thorough cleanup:

    from {value: "N/A"}, {value: "null"}, {value: "undefined"}, {value: ""}
    replace what="N/A", with=null
    replace what="null", with=null
    replace what="undefined", with=null
    replace what="", with=null

Normalize sentinel values
Some systems use specific values to indicate special states. Normalize these to consistent representations.
Boolean conversions
Convert string booleans to actual boolean values:

    from {active: "true", verified: "1", enabled: "yes"},
         {active: "false", verified: "0", enabled: "no"}
    active = active == "true"
    verified = verified == "1"
    enabled = enabled == "yes"

    {active: true, verified: true, enabled: true}
    {active: false, verified: false, enabled: false}

Numeric sentinels
Convert special numeric values to null while keeping all other values unchanged:

    from {port: -1, count: 0xFFFFFFFF, size: -999}
    // -1 often means "any port" or "not applicable"
    port = null if port == -1 else port
    // 0xFFFFFFFF often means "unknown" in network data
    count = null if count == 4294967295 else count
    // -999 is sometimes used as a missing value indicator
    size = null if size == -999 else size

Fix types
Raw data often arrives with incorrect types. Convert strings to appropriate native types.
Parse timestamps
Convert string timestamps to proper time values:

    from {ts: "2024-01-15T10:30:45Z", epoch: "1705316445"}
    ts = ts.time()
    epoch = epoch.int().seconds().from_epoch()

    {ts: 2024-01-15T10:30:45Z, epoch: 2024-01-15T11:00:45Z}

Parse IP addresses
Convert string IPs to native IP types:

    from {src: "192.168.1.100", dst: "10.0.0.1"}
    src = src.ip()
    dst = dst.ip()

    {src: 192.168.1.100, dst: 10.0.0.1}

Parse subnets
Convert CIDR notation strings to subnet types:

    from {network: "10.0.0.0/8", allowed: "192.168.0.0/16"}
    network = network.subnet()
    allowed = allowed.subnet()

    {network: 10.0.0.0/8, allowed: 192.168.0.0/16}

Parse numbers
Convert numeric strings to integers or floats:

    from {port: "443", ratio: "0.95", count: "1000"}
    port = port.int()
    ratio = ratio.float()
    count = count.int()

    {port: 443, ratio: 0.95, count: 1000}

Parse durations
Convert duration strings to duration types:

    from {timeout: "30s", interval: "5min", ttl: "24h"}
    timeout = timeout.duration()
    interval = interval.duration()
    ttl = ttl.duration()

    {timeout: 30s, interval: 5min, ttl: 24h}

Provide default values
Use the else keyword to fill in missing values:

    from {name: "alice", score: 85}, {name: "bob"}, {name: "charlie", score: null}
    score = score? else 0
    status = status? else "unknown"

    {name: "alice", score: 85, status: "unknown"}
    {name: "bob", score: 0, status: "unknown"}
    {name: "charlie", score: 0, status: "unknown"}

The ? operator suppresses warnings when accessing fields that don’t exist.
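Defaults can also be combined with the type conversions shown earlier. The following is a minimal sketch, assuming that ?, a conversion function such as int(), and else can be chained in a single expression; the sample events and the fallback value 0 are illustrative only:

```tql
from {name: "alice", port: "443"}, {name: "bob"}
// ? avoids a warning for the event that has no port field,
// int() parses the string, and else supplies the fallback
// (assumed here to be 0) when the result is null.
port = port?.int() else 0
```

Whether 0 is a sensible fallback depends on your data; for ports, leaving the value as null may be the more honest choice.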
Trim and normalize strings
Clean up whitespace and case inconsistencies:

    from {name: " ALICE ", email: " Bob@Example.COM "}
    name = name.trim().to_title()
    email = email.trim().to_lower()

    {name: "Alice", email: "bob@example.com"}

Drop null fields from nested records
Remove null fields from nested records to reduce event size:

    from {user: {name: "alice", email: null, phone: null}}
    drop_null_fields user

    {user: {name: "alice"}}

Avoid using drop_null_fields without arguments on the top-level record. Each unique combination of present fields creates a distinct schema, leading to schema fragmentation that defeats the purpose of normalization. It also removes the fields you just normalized to null.
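To make the fragmentation concrete, here is a small sketch with hypothetical events. It deliberately uses drop_null_fields without arguments, which is the pattern the note above advises against:

```tql
from {name: "alice", email: null, phone: null},
     {name: "bob", email: "bob@example.com", phone: null}
// Not recommended: with no arguments, null fields are dropped from the
// top-level record, so each event keeps a different set of fields.
drop_null_fields
```

The first event would be left with only name and the second with name and email, so downstream consumers would see two different schemas for the same kind of event.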
Practical example
Here’s a complete cleanup pipeline for raw firewall logs:

    from_kafka "firewall-raw"
    this = message.parse_json()

    // Replace string nulls
    replace what="N/A", with=null
    replace what="-", with=null
    replace what="", with=null

    // Convert types
    src_ip = src_ip?.ip()
    dst_ip = dst_ip?.ip()
    src_port = src_port?.int()
    dst_port = dst_port?.int()
    timestamp = timestamp?.time()
    bytes = bytes?.int()

    // Normalize booleans
    blocked = action? == "block"

    // Provide defaults
    action = action? else "unknown"
    protocol = protocol? else "unknown"

    // Drop nulls from user context (e.g., domain, group may be missing)
    drop_null_fields user

Best practices
- Clean early: Apply cleanup transformations at the start of your pipeline before any business logic.
- Handle missing fields: Use ? when accessing fields that might not exist, and provide sensible defaults with else.
- Document your assumptions: Comment which sentinel values you’re normalizing and why.
- Test edge cases: Verify cleanup handles unusual values like empty strings, whitespace-only strings, and mixed case.
- Consider performance: The replace operator is efficient for bulk replacements, but be mindful of chaining many replacements.
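As a sketch of the edge-case checks mentioned above, the following test input mixes a whitespace-only string with a differently cased placeholder. It only uses functions and operators shown earlier in this guide; the sample events and the decision to lowercase the level field are assumptions for illustration:

```tql
from {status: "   ", level: " N/A "},
     {status: "Active", level: "HIGH"}
// Trim first so that whitespace-only strings become empty strings.
status = status.trim()
// Lowercasing makes "N/A", "n/a", and "N/a" compare equal.
level = level.trim().to_lower()
// Now a single replacement per placeholder covers all variants.
replace what="", with=null
replace what="n/a", with=null
```

If the cleanup works as intended, both fields of the first event should end up null, while the second event keeps its values (with level lowercased). Running a handful of such hand-written events through the pipeline is a cheap way to confirm that before pointing it at production data.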