Removes duplicate events based on a common key.
```tql
deduplicate [keys…:any, limit=int, distance=int, create_timeout=duration,
             write_timeout=duration, read_timeout=duration, count_field=field]
```

## Description

The `deduplicate` operator removes duplicates from a stream of events, based
on the value of one or more fields.
### keys…: any (optional)

The expressions that form the deduplication key. Pass one or more positional
arguments, for example `deduplicate src_ip, dst_ip`, to build a compound key
from multiple fields.

Defaults to `this`, i.e., deduplicating entire events.
### limit = int (optional)

The number of events with the same key that are let through before subsequent
events with that key are suppressed.

Defaults to `1`, which is equivalent to removing all duplicates.
### distance = int (optional)

The maximum distance between two events for them to be considered duplicates.
A value of `1` means that only adjacent events can be considered duplicates.

When unspecified, the distance is infinite.
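As a sketch of the effect (the field `x` is illustrative), with `distance=1` a
repeated key is only suppressed when it immediately follows its previous
occurrence:

```tql
from \
  {x: 1}, {x: 1}, {x: 2}, {x: 1}
deduplicate x, distance=1
```

Given the semantics above, the second `{x: 1}` would be suppressed as an
adjacent duplicate, while the final `{x: 1}` passes because the last event
with that key lies two positions back.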
### create_timeout = duration (optional)

The time that needs to pass until a suppressed event is no longer considered a
duplicate. The timeout resets when the first event for a given key is let
through.
### write_timeout = duration (optional)

The time that needs to pass until a suppressed event is no longer considered a
duplicate. The timeout resets when any event for a given key is let through.

For a limit of 1, the write timeout is equivalent to the create timeout.
The write timeout must be smaller than the create timeout.
### read_timeout = duration (optional)

The time that needs to pass until a suppressed event is no longer considered a
duplicate. The timeout resets when a key is seen, even if the event is
suppressed.

The read timeout must be smaller than the write and create timeouts.
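As a sketch of how the timeouts combine (the topic name `"alerts"` and the
field `id` are illustrative assumptions, not part of the operator):

```tql
subscribe "alerts"
deduplicate id?, write_timeout=10min, read_timeout=2min
```

Under the semantics above, a repeated alert `id` would stay suppressed for up
to 10 minutes after one was last let through, but the key is forgotten 2
minutes after it was last seen at all, so an alert that goes quiet resurfaces
sooner.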
### count_field = field (optional)

When specified, adds a field to each output event containing the number of
events that were dropped since the last output for that key. Events that are
the first occurrence of a key or that trigger output after expiration have a
count of 0.
## Examples

### Deduplicate entire events

Deduplicate a stream of events by using `deduplicate` without arguments:

```tql
from \
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "b"},
  {foo: null, bar: "b"},
  {bar: "b"},
  {foo: null, bar: "b"},
  {foo: null, bar: "b"}
deduplicate
```

```tql
{foo: 1, bar: "a"}
{foo: 1, bar: "b"}
{foo: null, bar: "b"}
{bar: "b"}
```
### Deduplicate events based on single fields

Use `deduplicate bar` to restrict the deduplication to the values of the field
`bar`:

```tql
from \
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "b"},
  {foo: null, bar: "b"},
  {bar: "b"},
  {foo: null, bar: "b"},
  {foo: null, bar: "b"}
deduplicate bar
```

```tql
{foo: 1, bar: "a"}
{foo: 1, bar: "b"}
```

When writing `deduplicate foo?`, note how the missing `foo` field is treated
as if it had the value `null`, i.e., it's not included in the output:

```tql
from \
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "b"},
  {foo: null, bar: "b"},
  {bar: "b"},
  {foo: null, bar: "b"},
  {foo: null, bar: "b"}
deduplicate foo?
```

```tql
{foo: 1, bar: "a"}
{foo: null, bar: "b"}
```
### Deduplicate events based on multiple fields

Multiple positional arguments form a tuple that must match entirely to suppress
an event. For example, `deduplicate foo, bar` keeps the first event for each
unique combination of `foo` and `bar`:

```tql
from \
  {foo: 1, bar: "a", idx: 1},
  {foo: 1, bar: "a", idx: 2},
  {foo: 1, bar: "b", idx: 3},
  {foo: 2, bar: "a", idx: 4},
  {foo: 1, bar: "b", idx: 5}
deduplicate foo, bar
```

```tql
{foo: 1, bar: "a", idx: 1}
{foo: 1, bar: "b", idx: 3}
{foo: 2, bar: "a", idx: 4}
```
### Get up to 10 warnings per hour for each run of a pipeline

```tql
diagnostics live=true
deduplicate pipeline_id, run, limit=10, create_timeout=1h
```
### Get an event whenever the node disconnects from the Tenzir Platform

```tql
metrics "platform", live=true
deduplicate connected, distance=1
where not connected
```
### Track how many duplicates were dropped

Use the `count_field` option to add a field showing how many events were
dropped for each key:

```tql
from \
  {x: 1, seq: 1},
  {x: 1, seq: 2},
  {x: 1, seq: 3},
  {x: 2, seq: 4},
  {x: 2, seq: 5},
  {x: 1, seq: 6}
deduplicate x, distance=2, count_field=drop_count
```

```tql
{x: 1, seq: 1, drop_count: 0}
{x: 2, seq: 4, drop_count: 0}
{x: 1, seq: 6, drop_count: 2}
```

The first event has a count of 0. When the next event with `x: 1` is emitted
at `seq: 6`, it shows that 2 events were dropped (`seq` 2 and 3) before this
one was allowed through due to the `distance=2` constraint. The event at
`seq: 4` has `x: 2`, which is a different key, so it also has `drop_count: 0`.
Events that trigger output after timeout expiration also have a count of 0,
since the deduplication state for that key was reset.