Skip to content

This guide shows you how to clean and normalize values in your data before mapping to a schema. You’ll learn to handle null placeholders, normalize sentinel values, fix types, and provide defaults.

Many data sources use string placeholders instead of actual null values. Common patterns include "None", "N/A", "-", and empty strings.

Use replace to convert placeholders to null across all string fields:

from {status: "active", error: "None"},
{status: "N/A", error: "timeout"},
{status: "-", error: ""}
replace what="None", with=null
replace what="N/A", with=null
replace what="-", with=null
replace what="", with=null
{status: "active", error: null}
{status: null, error: "timeout"}
{status: null, error: null}

When you only want to normalize certain fields, specify them explicitly:

from {enabled: "YES", disabled: "NO", status: "YES"}
replace enabled, what="YES", with=true
replace enabled, what="NO", with=false
replace disabled, what="YES", with=true
replace disabled, what="NO", with=false
// status remains unchanged
{enabled: true, disabled: false, status: "YES"}

Combine multiple replacements for thorough cleanup:

from {value: "N/A"},
{value: "null"},
{value: "undefined"},
{value: ""}
replace what="N/A", with=null
replace what="null", with=null
replace what="undefined", with=null
replace what="", with=null

Some systems use specific values to indicate special states. Normalize these to consistent representations.

Convert string booleans to actual boolean values:

from {active: "true", verified: "1", enabled: "yes"},
{active: "false", verified: "0", enabled: "no"}
active = active == "true"
verified = verified == "1"
enabled = enabled == "yes"
{active: true, verified: true, enabled: true}
{active: false, verified: false, enabled: false}

Convert special numeric values:

from {port: -1, count: 0xFFFFFFFF, size: -999}
// -1 often means "any port" or "not applicable"
port = null if port == -1
// 0xFFFFFFFF often means "unknown" in network data
count = null if count == 4294967295
// -999 is sometimes used as a missing value indicator
size = null if size == -999

Raw data often arrives with incorrect types. Convert strings to appropriate native types.

Convert string timestamps to proper time values:

from {ts: "2024-01-15T10:30:45Z", epoch: "1705316445"}
ts = ts.time()
epoch = epoch.int().seconds().from_epoch()
{ts: 2024-01-15T10:30:45Z, epoch: 2024-01-15T11:00:45Z}

Convert string IPs to native IP types:

from {src: "192.168.1.100", dst: "10.0.0.1"}
src = src.ip()
dst = dst.ip()
{src: 192.168.1.100, dst: 10.0.0.1}

Convert CIDR notation strings to subnet types:

from {network: "10.0.0.0/8", allowed: "192.168.0.0/16"}
network = network.subnet()
allowed = allowed.subnet()
{network: 10.0.0.0/8, allowed: 192.168.0.0/16}

Convert numeric strings to integers or floats:

from {port: "443", ratio: "0.95", count: "1000"}
port = port.int()
ratio = ratio.float()
count = count.int()
{port: 443, ratio: 0.95, count: 1000}

Convert duration strings to duration types:

from {timeout: "30s", interval: "5min", ttl: "24h"}
timeout = timeout.duration()
interval = interval.duration()
ttl = ttl.duration()
{timeout: 30s, interval: 5min, ttl: 24h}

Use the else keyword to fill in missing values:

from {name: "alice", score: 85},
{name: "bob"},
{name: "charlie", score: null}
score = score? else 0
status = status? else "unknown"
{name: "alice", score: 85, status: "unknown"}
{name: "bob", score: 0, status: "unknown"}
{name: "charlie", score: 0, status: "unknown"}

The ? operator suppresses warnings when accessing fields that don’t exist.

Clean up whitespace and case inconsistencies:

from {name: " ALICE ", email: " Bob@Example.COM "}
name = name.trim().to_title()
email = email.trim().to_lower()
{name: "Alice", email: "bob@example.com"}

Remove null fields from nested records to reduce event size:

from {user: {name: "alice", email: null, phone: null}}
drop_null_fields user
{user: {name: "alice"}}

Avoid using drop_null_fields without arguments on the top-level record. Each unique combination of present fields creates a distinct schema, leading to schema fragmentation that defeats the purpose of normalization. It also removes fields you just normalized.

Here’s a complete cleanup pipeline for raw firewall logs:

from_kafka "firewall-raw"
this = message.parse_json()
// Replace string nulls
replace what="N/A", with=null
replace what="-", with=null
replace what="", with=null
// Convert types
src_ip = src_ip?.ip()
dst_ip = dst_ip?.ip()
src_port = src_port?.int()
dst_port = dst_port?.int()
timestamp = timestamp?.time()
bytes = bytes?.int()
// Normalize booleans
blocked = action? == "block"
// Provide defaults
action = action? else "unknown"
protocol = protocol? else "unknown"
// Drop nulls from user context (e.g., domain, group may be missing)
drop_null_fields user
  1. Clean early: Apply cleanup transformations at the start of your pipeline before any business logic.

  2. Handle missing fields: Use ? when accessing fields that might not exist, and provide sensible defaults with else.

  3. Document your assumptions: Comment which sentinel values you’re normalizing and why.

  4. Test edge cases: Verify cleanup handles unusual values like empty strings, whitespace-only strings, and mixed case.

  5. Consider performance: The replace operator is efficient for bulk replacements, but be mindful of chaining many replacements.

Last updated: