Skip to content

Removes duplicate events based on a common key.

deduplicate [keys…:any, limit=int, distance=int, create_timeout=duration,
write_timeout=duration, read_timeout=duration, count_field=field]

The deduplicate operator removes duplicates from a stream of events, based on the value of one or more fields.

The expressions that form the deduplication key. Pass one or more positional arguments, for example deduplicate src_ip, dst_ip, to build a compound key from multiple fields.

Defaults to this, i.e., deduplicating entire events.

The number of duplicate keys allowed before an event is suppressed.

Defaults to 1, which is equivalent to removing all duplicates.

Distance between two events that can be considered duplicates. A value of 1 means that only adjacent events can be considered duplicate.

When unspecified, the distance is infinite.

The time that needs to pass until a surpressed event is no longer considered a duplicate. The timeout resets when the first event for a given key is let through.

The time that needs to pass until a suppressed event is no longer considered a duplicate. The timeout resets when any event for a given key is let through.

For a limit of 1, the write timeout is equivalent to the create timeout.

The write timeout must be smaller than the create timeout.

The time that needs to pass until a suppressed event is no longer considered a duplicate. The timeout resets when a key is seen, even if the event is suppressed.

The read timeout must be smaller than the write and create timeouts.

When specified, adds a field to each output event containing the number of events that were dropped since the last output for that key. Events that are the first occurrence of a key or that trigger output after expiration have a count of 0.

Deduplicate a stream of events by using deduplicate without arguments:

from \
{foo: 1, bar: "a"},
{foo: 1, bar: "a"},
{foo: 1, bar: "a"},
{foo: 1, bar: "b"},
{foo: null, bar: "b"},
{bar: "b"},
{foo: null, bar: "b"},
{foo: null, bar: "b"}
deduplicate
{foo: 1, bar: "a"}
{foo: 1, bar: "b"}
{foo: null, bar: "b"}
{bar: "b"}

Use deduplicate bar to restrict the deduplication to the values of field bar:

from \
{foo: 1, bar: "a"},
{foo: 1, bar: "a"},
{foo: 1, bar: "a"},
{foo: 1, bar: "b"},
{foo: null, bar: "b"},
{bar: "b"},
{foo: null, bar: "b"},
{foo: null, bar: "b"}
deduplicate bar
{foo: 1, bar: "a"}
{foo: 1, bar: "b"}

When writing deduplicate foo, note how the missing foo field is treated as if it had the value null, i.e., it’s not included in the output.

from \
{foo: 1, bar: "a"},
{foo: 1, bar: "a"},
{foo: 1, bar: "a"},
{foo: 1, bar: "b"},
{foo: null, bar: "b"},
{bar: "b"},
{foo: null, bar: "b"},
{foo: null, bar: "b"}
deduplicate foo?
{foo: 1, bar: "a"}
{foo: null, bar: "b"}

Deduplicate events based on multiple fields

Section titled “Deduplicate events based on multiple fields”

Multiple positional arguments form a tuple that must match entirely to suppress an event. For example, deduplicate foo, bar keeps the first event for each unique combination of foo and bar:

from \
{foo: 1, bar: "a", idx: 1},
{foo: 1, bar: "a", idx: 2},
{foo: 1, bar: "b", idx: 3},
{foo: 2, bar: "a", idx: 4},
{foo: 1, bar: "b", idx: 5}
deduplicate foo, bar
{foo: 1, bar: "a", idx: 1}
{foo: 1, bar: "b", idx: 3}
{foo: 2, bar: "a", idx: 4}

Get up to 10 warnings per hour for each run of a pipeline

Section titled “Get up to 10 warnings per hour for each run of a pipeline”
diagnostics live=true
deduplicate pipeline_id, run, limit=10, create_timeout=1h

Get an event whenever the node disconnected from the Tenzir Platform

Section titled “Get an event whenever the node disconnected from the Tenzir Platform”
metrics "platform", live=true
deduplicate connected, distance=1
where not connected

Use the count_field option to add a field showing how many events were dropped for each key:

from \
{x: 1, seq: 1},
{x: 1, seq: 2},
{x: 1, seq: 3},
{x: 2, seq: 4},
{x: 2, seq: 5},
{x: 1, seq: 6}
deduplicate x, distance=2, count_field=drop_count
{x: 1, seq: 1, drop_count: 0}
{x: 2, seq: 4, drop_count: 0}
{x: 1, seq: 6, drop_count: 2}

The first event has a count of 0. When the next event with x: 1 is emitted at seq: 6, it shows that 2 events were dropped (seq 2 and 3) before this one was allowed through due to the distance=2 constraint. The event at seq: 4 has x: 2, which is a different key, so it also has drop_count: 0. Events that trigger output after timeout expiration also have a count of 0, since the deduplication state for that key was reset.

Last updated: