Removes duplicate events based on a common key.
```tql
deduplicate [keys…:any, limit=int, distance=int, create_timeout=duration,
             write_timeout=duration, read_timeout=duration, count_field=field]
```

## Description

The `deduplicate` operator removes duplicates from a stream of events, based
on the value of one or more fields.
### keys…: any (optional)

The expressions that form the deduplication key. Pass one or more positional
arguments, for example `deduplicate src_ip, dst_ip`, to build a compound key
from multiple fields.

Defaults to `this`, i.e., deduplicating entire events.
### limit = int (optional)

The number of events with the same key that are let through before subsequent
events with that key are suppressed.

Defaults to `1`, which is equivalent to removing all duplicates.
### distance = int (optional)

The maximum distance between two events for them to be considered duplicates.
A value of `1` means that only adjacent events can be considered duplicates.

When unspecified, the distance is infinite.
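As a sketch of the effect (the field `x` is illustrative), with `distance=1` a
repeated key is only suppressed when it immediately follows its previous
occurrence:

```tql
from \
  {x: 1}, {x: 1}, {x: 2}, {x: 1}
deduplicate x, distance=1
```

Given the semantics above, the second `{x: 1}` would be suppressed as an
adjacent duplicate, while the final `{x: 1}` passes because the last event
with that key lies two positions back.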
### create_timeout = duration (optional)

The time that needs to pass until a suppressed event is no longer considered a
duplicate. The timeout resets when the first event for a given key is let
through.
### write_timeout = duration (optional)

The time that needs to pass until a suppressed event is no longer considered a
duplicate. The timeout resets when any event for a given key is let through.

For a limit of 1, the write timeout is equivalent to the create timeout.
The write timeout must be smaller than the create timeout.
### read_timeout = duration (optional)

The time that needs to pass until a suppressed event is no longer considered a
duplicate. The timeout resets when a key is seen, even if the event is
suppressed.

The read timeout must be smaller than the write and create timeouts.
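As a sketch of how the timeouts combine (the topic name `"alerts"` and the
field `id` are illustrative assumptions, not part of the operator):

```tql
subscribe "alerts"
deduplicate id?, write_timeout=10min, read_timeout=2min
```

Under the semantics above, a repeated alert `id` would stay suppressed for up
to 10 minutes after one was last let through, but the key is forgotten 2
minutes after it was last seen at all, so an alert that goes quiet resurfaces
sooner.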
### count_field = field (optional)

When specified, adds a field to each output event containing the number of
events that were dropped since the last output for that key. Events that are
the first occurrence of a key or that trigger output after expiration have a
count of 0.
## Examples

### Deduplicate entire events

Deduplicate a stream of events by using `deduplicate` without arguments:

```tql
from \
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "b"},
  {foo: null, bar: "b"},
  {bar: "b"},
  {foo: null, bar: "b"},
  {foo: null, bar: "b"}
deduplicate
```

```tql
{foo: 1, bar: "a"}
{foo: 1, bar: "b"}
{foo: null, bar: "b"}
{bar: "b"}
```
### Deduplicate events based on single fields

Use `deduplicate bar` to restrict the deduplication to the values of the field
`bar`:

```tql
from \
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "b"},
  {foo: null, bar: "b"},
  {bar: "b"},
  {foo: null, bar: "b"},
  {foo: null, bar: "b"}
deduplicate bar
```

```tql
{foo: 1, bar: "a"}
{foo: 1, bar: "b"}
```

When writing `deduplicate foo?`, note how the missing `foo` field is treated
as if it had the value `null`, i.e., it's not included in the output:

```tql
from \
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "a"},
  {foo: 1, bar: "b"},
  {foo: null, bar: "b"},
  {bar: "b"},
  {foo: null, bar: "b"},
  {foo: null, bar: "b"}
deduplicate foo?
```

```tql
{foo: 1, bar: "a"}
{foo: null, bar: "b"}
```
### Deduplicate events based on multiple fields

Multiple positional arguments form a tuple that must match entirely to suppress
an event. For example, `deduplicate foo, bar` keeps the first event for each
unique combination of `foo` and `bar`:

```tql
from \
  {foo: 1, bar: "a", idx: 1},
  {foo: 1, bar: "a", idx: 2},
  {foo: 1, bar: "b", idx: 3},
  {foo: 2, bar: "a", idx: 4},
  {foo: 1, bar: "b", idx: 5}
deduplicate foo, bar
```

```tql
{foo: 1, bar: "a", idx: 1}
{foo: 1, bar: "b", idx: 3}
{foo: 2, bar: "a", idx: 4}
```
### Get up to 10 warnings per hour for each run of a pipeline

```tql
diagnostics live=true
deduplicate pipeline_id, run, limit=10, create_timeout=1h
```
### Get an event whenever the node disconnects from the Tenzir Platform

```tql
metrics "platform", live=true
deduplicate connected, distance=1
where not connected
```
### Track how many duplicates were dropped

Use the `count_field` option to add a field showing how many events were
dropped for each key:

```tql
from \
  {x: 1, seq: 1},
  {x: 1, seq: 2},
  {x: 1, seq: 3},
  {x: 2, seq: 4},
  {x: 2, seq: 5},
  {x: 1, seq: 6}
deduplicate x, distance=2, count_field=drop_count
```

```tql
{x: 1, seq: 1, drop_count: 0}
{x: 2, seq: 4, drop_count: 0}
{x: 1, seq: 6, drop_count: 2}
```

The first event has a count of 0. When the next event with `x: 1` is emitted
at `seq: 6`, it shows that 2 events were dropped (`seq` 2 and 3) before this
one was allowed through due to the `distance=2` constraint. The event at
`seq: 4` has `x: 2`, which is a different key, so it also has `drop_count: 0`.
Events that trigger output after timeout expiration also have a count of 0,
since the deduplication state for that key was reset.