Elastic just released their new pipeline query language called ES|QL. This is a conscious attempt to consolidate the language zoo in the Elastic ecosystem (queryDSL, EQL, KQL, SQL, Painless, Canvas/Timelion). Elastic said that they worked on this effort for over a year. The documentation is still sparse, but we still tried to read between the lines to understand what this new pipeline language has to offer.
The reason why we are excited about this announcement is because we have also designed and implemented a pipeline language over the past 8 months that we just launched at BlackHat. First, we see the release of ES|QL as a confirmation for pipeline model. In our blog post about Splunk's Search Processing Language (SPL), we briefly mentioned why SQL might not be the best choice for analysts, and why thinking about one operator at a time provides an easier user experience. Second, we'd like to look under the hood of ES|QL to compare and reflect on our own Tenzir Query Language (TQL).
Walk-through by Example
So, ES|QL, how does it feel?
FROM employees
| EVAL hired_year = TO_INTEGER(DATE_FORMAT(hire_date, "YYYY"))
| WHERE hired_year > 1984
| STATS avg_salary = AVG(salary) BY languages
| EVAL avg_salary = ROUND(avg_salary)
| EVAL lang_code = TO_STRING(languages)
| ENRICH languages_policy ON lang_code WITH lang = language_name
| WHERE NOT IS_NULL(lang)
| KEEP avg_salary, lang
| SORT avg_salary ASC
| LIMIT 3
This syntax reads very straight-forward. Splunk users will immediately grasp what it does, as there is a remarkable similarity in operator naming. Let's go through each pipeline operator individually:
FROM
generates a table with up to 10k rows from a data stream, index, or alias. We asked ourselves why there is a hard-baked 10k limit? Shouldn't that be the job ofLIMIT
? The limit feels a technical limitation rather than a conscious design decision. In TQL, we have unbounded streams but also follow the single responsibility principle: one operator has exactly one job.EVAL
appends new or replaces existing columns. We named this operatorextend
because we found the Splunk-inspired command name "eval" too generic for this use case.1WHERE
filters the input with an expression. We have the samewhere
in TQL.STATS
groups its input viaBY
and applies aggregation functions on select fields of each group. Elastic went with Splunk nomenclature for this central operation, perhaps also to make the transition from Splunk to Elastic as easy as possible.ENRICH
adds data from existing indexes. It's effectively a join operation, and theON
keywords makes it possible to select the join field. Interestingly, the word "join" doesn't appear on the documentation. We hypothesize that this was a conscious choice, as a database join may feel intimidating for beginning and intermediate users.KEEP
selects a set of columns from the input and drops all others. It is the inverse ofDROP
. In TQL, we call these projection operatorsselect
and alsodrop
.SORT
sorts rows by one or more fields.SORT height DESC, first_name ASC
sorts the fieldheight
in descending order and the fieldfirst_name
in ascending order. The syntax of oursort
is identical. Controlling the position of null values works withNULLS FIRST
andNULLS LAST
. In TQL, we went Kusto-like withnulls-first
andnulls-last
.LIMIT
restricts the number of output rows. In TQL, we havehead
andtail
for this purpose.
Sources, Transformations, ... but Sinks?
ES|QL differentiates two types of commands (which we call operators in TQL):
In TQL, an operator is a source, a transformation, or a sink. Some
operators can be of multiple categories, like shell
.
Maybe this is still coming, but ES|QL doesn't appear to offer sinks. We hypothesize that users should consume pipeline output uniformly as JSON through a REST API.
Syntax
Syntactically, the ES|QL language is similar to TQL. The following points stood out:
- The
|
(pipe) symbol separates commands that describe the dataflow. - Comments work as in C++:
//
for single line and/*
and*/
for multi-line comments. - Expressions can occur in
WHERE
,STATS
, and other commands. The following relational operators exist:- Arithmetic comparisons via
<
,<=
,==
,>=
,>
- Set membership via
IN
- Glob-like wildcard search via
LIKE
- Regular expressions via
RLIKE
- Arithmetic comparisons via
- Date-time literals make it easier to express dates (
seconds
,hours
, etc.) and timespans (e.g.,1 day
). We found that expressing numeric values across multiple orders of magnitude is common, e.g., when dealing with GBs. This is why we also offer SI literals in TQL, allowing you to write large numbers as1 Mi
or1 M
. - ES|QL features multiple scalar functions.
that perform value-to-value transformations. Functions can occur in
ROW
,EVAL
, andWHERE
. - Similarly, aggregation functions perform a vector-to-scalar
transformation per group in
STATS
.
Engine
ES|QL comes with its own executor, i.e., it's not transpiled into any of the existing engines. A running pipelines is a task and there exists an API for querying their state, which may return something like:
Data Model
The concept of multi-valued fields exists to bridge the world
between JSON records and 2D tables. This shows the heritage of the type system,
which evolved from document stores as opposed to structured data stores. In
document land, every record may have a different shape (or schema). The term
multi-valued effectively means list, e.g., [1, 2, 3]
.
Noteworthy:
- The order of multi-valued fields is undefined.
- It's possible to impose set semantics by using the
keyword
type. Specifying this type causes duplicate removal on ingest. - Other types, like
long
, do not cause removal of duplicates on ingest.
The output is still semi-structured in that listness is something dynamic on a per-value basis. Consider this output:
The column b
has the list value [1, 2]
in the first row and 3
in the
second. In a strict type system (like TQL), the type of b
could be
list<long>
but then the second row would have value [3]
instead of 3
. Sum
types (called union
or variant
in many languages) are another way to
represent heterogeneous data as in the above example. If we described b
with
the type union<long, list<long>>
instead of long
, then it would be perfectly
fine for b
to take one value [1, 2]
in one row and 3
in another.
For TQL, we built our data model on top of data frames. We express structure in terms of records and lists, and arbitrarily nested combinations of them. It would be up the user to define set semantics that ensures unique values. We consider adding such a set type in the future (possible as type constraint or attribute) as we gain more complete support of the underlying Arrow type system. Similarly, we plan on adding sum types in the future.
Summary
The release of ES|QL witnesses a current trend of convergence in terms of query languages. The pipeline concept now exists for several decades. Splunk was the first company to successfully commercialize this interface with SPL, but today there are many players in the market that have a similar language. Microsoft open-sourced their Kusto language, and we see other vendors embedding it into their products, such as Cribl Search. Most SIEM vendors also have their own inhouse pipeline language.
The data ecosystem has numerous languages for advanced users to offer, such as dplyr, jq, pandas, and polars. And new ones are mushrooming everywhere, e.g., PRQL, Zed.
With our own TQL, we seek to bridge the data and security analytics world, by offering an intuitive language that is easy to grasp, but that internally maps to vectorized execution on top of data frames that can be easily shared with other runtimes.
If you want to look deeper at ES|QL, check out the branch
feature/esql
. Find something interesting about pipelines to
discuss? Swing by our Discord and start a conversation.
- We took the name
extend
from Kusto. In general, we find that Kusto has very self-descriptive operator names. During the design of TQL, we compared many different languages and often favored Kusto's choice of name.↩