Clean up values

This guide shows you how to clean and normalize values in your data before mapping to a schema. You’ll learn to handle null placeholders, normalize sentinel values, fix types, and provide defaults.

Replace null placeholders

Many data sources use string placeholders instead of actual null values. Common patterns include "None", "N/A", "-", and empty strings.

Normalize across all fields

Use replace to convert placeholders to null across all string fields:

from {status: "active", error: "None"},
     {status: "N/A", error: "timeout"},
     {status: "-", error: ""}
replace what="None", with=null
replace what="N/A", with=null
replace what="-", with=null
replace what="", with=null

{status: "active", error: null}
{status: null, error: "timeout"}
{status: null, error: null}

Normalize specific fields

When you only want to normalize certain fields, specify them explicitly:

from {enabled: "YES", disabled: "NO", status: "YES"}
replace enabled, what="YES", with=true
replace enabled, what="NO", with=false
replace disabled, what="YES", with=true
replace disabled, what="NO", with=false
// status remains unchanged

{enabled: true, disabled: false, status: "YES"}

Chain replacements

Combine multiple replacements for thorough cleanup:

from {value: "N/A"},
     {value: "null"},
     {value: "undefined"},
     {value: ""}
replace what="N/A", with=null
replace what="null", with=null
replace what="undefined", with=null
replace what="", with=null

Normalize sentinel values

Some systems use specific values to indicate special states. Normalize these to consistent representations.

Boolean conversions

Convert string booleans to actual boolean values:

from {active: "true", verified: "1", enabled: "yes"},
     {active: "false", verified: "0", enabled: "no"}
active = active == "true"
verified = verified == "1"
enabled = enabled == "yes"

{active: true, verified: true, enabled: true}
{active: false, verified: false, enabled: false}

Numeric sentinels

Convert special numeric values:

from {port: -1, count: 0xFFFFFFFF, size: -999}
// -1 often means "any port" or "not applicable"
port = null if port == -1
// 0xFFFFFFFF often means "unknown" in network data
count = null if count == 4294967295
// -999 is sometimes used as a missing value indicator
size = null if size == -999

Fix types

Raw data often arrives with incorrect types. Convert strings to appropriate native types.

Parse timestamps

Convert string timestamps to proper time values:

from {ts: "2024-01-15T10:30:45Z", epoch: "1705316445"}
ts = ts.time()
epoch = epoch.int().seconds().from_epoch()

{ts: 2024-01-15T10:30:45Z, epoch: 2024-01-15T11:00:45Z}

Parse IP addresses

Convert string IPs to native IP types:

from {src: "192.168.1.100", dst: "10.0.0.1"}
src = src.ip()
dst = dst.ip()

{src: 192.168.1.100, dst: 10.0.0.1}

Parse subnets

Convert CIDR notation strings to subnet types:

from {network: "10.0.0.0/8", allowed: "192.168.0.0/16"}
network = network.subnet()
allowed = allowed.subnet()

{network: 10.0.0.0/8, allowed: 192.168.0.0/16}

Parse numbers

Convert numeric strings to integers or floats:

from {port: "443", ratio: "0.95", count: "1000"}
port = port.int()
ratio = ratio.float()
count = count.int()

{port: 443, ratio: 0.95, count: 1000}

Parse durations

Convert duration strings to duration types:

from {timeout: "30s", interval: "5min", ttl: "24h"}
timeout = timeout.duration()
interval = interval.duration()
ttl = ttl.duration()

{timeout: 30s, interval: 5min, ttl: 24h}

Provide default values

Use the else keyword to fill in missing values:

from {name: "alice", score: 85},
     {name: "bob"},
     {name: "charlie", score: null}
score = score? else 0
status = status? else "unknown"

{name: "alice", score: 85, status: "unknown"}
{name: "bob", score: 0, status: "unknown"}
{name: "charlie", score: 0, status: "unknown"}

The ? operator suppresses warnings when accessing fields that don’t exist.

Trim and normalize strings

Clean up whitespace and case inconsistencies:

from {name: "  ALICE  ", email: "  Bob@Example.COM  "}
name = name.trim().to_title()
email = email.trim().to_lower()

{name: "Alice", email: "bob@example.com"}

Drop null fields from nested records

Remove null fields from nested records to reduce event size:

from {user: {name: "alice", email: null, phone: null}}
drop_null_fields user

{user: {name: "alice"}}

Avoid using drop_null_fields without arguments on the top-level record. Each unique combination of present fields creates a distinct schema, leading to schema fragmentation that defeats the purpose of normalization. It also removes fields you just normalized.

Practical example

Here’s a complete cleanup pipeline for raw firewall logs:

from_kafka "firewall-raw"
this = message.parse_json()

// Replace string nulls
replace what="N/A", with=null
replace what="-", with=null
replace what="", with=null

// Convert types
src_ip = src_ip?.ip()
dst_ip = dst_ip?.ip()
src_port = src_port?.int()
dst_port = dst_port?.int()
timestamp = timestamp?.time()
bytes = bytes?.int()

// Normalize booleans
blocked = action? == "block"

// Provide defaults
action = action? else "unknown"
protocol = protocol? else "unknown"

// Drop nulls from user context (e.g., domain, group may be missing)
drop_null_fields user

Best practices

Clean early: Apply cleanup transformations at the start of your pipeline before any business logic.
Handle missing fields: Use ? when accessing fields that might not exist, and provide sensible defaults with else.
Document your assumptions: Comment which sentinel values you’re normalizing and why.
Test edge cases: Verify cleanup handles unusual values like empty strings, whitespace-only strings, and mixed case.
Consider performance: The replace operator is efficient for bulk replacements, but be mindful of chaining many replacements.