deduplicate
Removes duplicate events based on a common key.
deduplicate [key:any, limit=int, distance=int, create_timeout=duration,
write_timeout=duration, read_timeout=duration]
Description
The deduplicate
operator removes duplicates from a stream of events, based
on the value of one or more fields.
key: any (optional)
The key to deduplicate. To deduplicate multiple fields, use a record expression like {foo: bar, baz: qux}
.
Defaults to this
, i.e., deduplicating entire events.
limit = int (optional)
The number of duplicate keys allowed before an event is suppressed.
Defaults to 1
, which is equivalent to removing all duplicates.
distance = int (optional)
Distance between two events that can be considered duplicates. A value of 1
means that only adjacent events can be considered duplicate.
When unspecified, the distance is infinite.
create_timeout = duration (optional)
The time that needs to pass until a surpressed event is no longer considered a duplicate. The timeout resets when the first event for a given key is let through.
write_timeout = duration (optional)
The time that needs to pass until a suppressed event is no longer considered a duplicate. The timeout resets when any event for a given key is let through.
For a limit of 1
, the write timeout is equivalent to the create timeout.
The write timeout must be smaller than the create timeout.
read_timeout = duration (optional)
The time that needs to pass until a suppressed event is no longer considered a duplicate. The timeout resets when a key is seen, even if the event is suppressed.
The read timeout must be smaller than the write and create timeouts.
Examples
Simple deduplication
Consider the following data:
{"foo": 1, "bar": "a"}
{"foo": 1, "bar": "a"}
{"foo": 1, "bar": "a"}
{"foo": 1, "bar": "b"}
{"foo": null, "bar": "b"}
{"bar": "b"}
{"foo": null, "bar": "b"}
{"foo": null, "bar": "b"}
For deduplicate
, all duplicate events are removed:
{"foo": 1, "bar": "a"}
{"foo": 1, "bar": "b"}
{"foo": null, "bar": "b"}
{"bar": "b"}
If deduplicate bar
is used, only the field bar
is considered when
determining whether an event is a duplicate:
{"foo": 1, "bar": "a"}
{"foo": 1, "bar": "b"}
And for deduplicate foo
, only the field foo
is considered. Note, how the
missing foo
field is treated as if it had the value null
, i.e., it's not
included in the output.
{"foo": 1, "bar": "a"}
{"foo": null, "bar": "b"}
Get up to 10 warnings per hour for each run of a pipeline
diagnostics live=true
deduplicate {id: pipeline_id, run: run}, limit=10, create_timeout=1h
Get an event whenever the node disconnected from the Tenzir Platform
metrics "platform", live=true
deduplicate connected, distance=1
where not connected