PQL

  • PQL is a derivative of SQL derivative that supports selection, projection, aggregation, grouping aggregation. There is no support for Joins or Subqueries.
  • Specifically, for Pinot:
    • Grouping keys always appear in query results, even if not requested
    • Aggregations are computed in parallel
    • Results of aggregations with large amounts of group keys (>1M) are approximated
    • ORDER BY only works for selection queries, for aggregations one must use the TOP keyword

PQL Examples

The Pinot Query Language (PQL) is very similar to standard SQL:

SELECT COUNT(*) FROM myTable

Aggregation

SELECT COUNT(*), MAX(foo), SUM(bar) FROM myTable

Grouping on Aggregation

SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM myTable
  GROUP BY bar, baz TOP 50

Filtering

SELECT COUNT(*) FROM myTable
  WHERE foo = 'foo'
  AND bar BETWEEN 1 AND 20
  OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))

Selection (Projection)

SELECT * FROM myTable
  WHERE quux < 5
  LIMIT 50

Ordering on Selection

SELECT foo, bar FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 100

Pagination on Selection

Note: results might not be consistent if column ordered by has same value in multiple rows.

SELECT foo, bar FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 50, 100

Wild-card match (in WHERE clause only)

To count rows where the column airlineName starts with U

SELECT count(*) FROM SomeTable
  WHERE regexp_like(airlineName, '^U.*')
  GROUP BY airlineName TOP 10

Examples with UDF

As of now, functions have to be implemented within Pinot. Injecting functions is not allowed yet. The examples below demonstrate the use of UDFs

SELECT count(*) FROM myTable
  GROUP BY timeConvert(timeColumnName, 'SECONDS', 'DAYS')

Examples with BYTES column

Pinot supports queries on BYTES column using HEX string. The query response also uses hex string to represent bytes value.

E.g. the query below fetches all the rows for a given UID.

SELECT * FROM myTable
  WHERE UID = "c8b3bce0b378fc5ce8067fc271a34892"

PQL Specification

SELECT

The select statement is as follows

SELECT <outputColumn> (, outputColumn, outputColumn,...)
  FROM <tableName>
  (WHERE ... | GROUP BY ... | ORDER BY ... | TOP ... | LIMIT ...)

outputColumn can be * to project all columns, columns (foo, bar, baz) or aggregation functions like (MIN(foo), MAX(bar), AVG(baz)).

Supported aggregations on single-value columns

  • COUNT
  • MIN
  • MAX
  • SUM
  • AVG
  • MINMAXRANGE
  • DISTINCTCOUNT
  • DISTINCTCOUNTHLL
  • DISTINCTCOUNTRAWHLL: Returns HLL response serialized as string. The serialized HLL can be converted back into an HLL (see pinot-core/**/HllUtil.java as an example) and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
  • FASTHLL (WARN: will be deprecated soon. FASTHLL stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format)
  • PERCENTILE[0-100]: e.g. PERCENTILE5, PERCENTILE50, PERCENTILE99, etc.
  • PERCENTILEEST[0-100]: e.g. PERCENTILEEST5, PERCENTILEEST50, PERCENTILEEST99, etc.

Supported aggregations on multi-value columns

  • COUNTMV
  • MINMV
  • MAXMV
  • SUMMV
  • AVGMV
  • MINMAXRANGEMV
  • DISTINCTCOUNTMV
  • DISTINCTCOUNTHLLMV
  • DISTINCTCOUNTRAWHLLMV: Returns HLL response serialized as string. The serialized HLL can be converted back into an HLL (see pinot-core/**/HllUtil.java as an example) and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
  • FASTHLLMV (WARN: will be deprecated soon. It does not make lots of sense to configure serialized HyperLogLog column as a dimension)
  • PERCENTILE[0-100]MV: e.g. PERCENTILE5MV, PERCENTILE50MV, PERCENTILE99MV, etc.
  • PERCENTILEEST[0-100]MV: e.g. PERCENTILEEST5MV, PERCENTILEEST50MV, PERCENTILEEST99MV, etc.

WHERE

Supported predicates are comparisons with a constant using the standard SQL operators (=, <, <=, >, >=, <>, ‘!=’) , range comparisons using BETWEEN (foo BETWEEN 42 AND 69), set membership (foo IN (1, 2, 4, 8)) and exclusion (foo NOT IN (1, 2, 4, 8)). For BETWEEN, the range is inclusive.

Comparison with a regular expression is supported using the regexp_like function, as in WHERE regexp_like(columnName, 'regular expression')

GROUP BY

The GROUP BY clause groups aggregation results by a list of columns, or transform functions on columns (see below)

ORDER BY

The ORDER BY clause orders selection results by a list of columns. PQL supports ordering DESC or ASC.

TOP

The TOP n clause causes the ‘n’ largest group results to be returned. If not specified, the top 10 groups are returned.

LIMIT

The LIMIT n clause causes the selection results to contain at most ‘n’ results. The LIMIT a, b clause paginate the selection results from the ‘a’ th results and return at most ‘b’ results.

Transform Function in Aggregation and Grouping

In aggregation and grouping, each column can be transformed from one or multiple columns. For example, the following query will calculate the maximum value of column foo divided by column bar grouping on the column time converted form time unit MILLISECONDS to SECONDS:

SELECT MAX(DIV(foo, bar) FROM myTable
  GROUP BY TIMECONVERT(time, 'MILLISECONDS', 'SECONDS')

Supported transform functions

ADD
Sum of at least two values
SUB
Difference between two values
MULT
Product of at least two values
DIV
Quotient of two values
TIMECONVERT
Takes 3 arguments, converts the value into another time unit. e.g. TIMECONVERT(time, 'MILLISECONDS', 'SECONDS') This expression converts the value of coulumn time (taken to be in milliseconds) to the nearest seconds (i.e. the nearest seconds that is lower than the value of date column)
DATETIMECONVERT
Takes 4 arguments, converts the value into another date time format, and buckets time based on the given time granularity. e.g. DATETIMECONVERT(date, '1:MILLISECONDS:EPOCH', '1:SECONDS:EPOCH', '15:MINUTES') This expression converts the column date which is formatted as 1:MILLISECONDS:EPOCH, and converts it into format 1:SECONDS:EPOCH with a granularity of 15:MINUTES (i.e. nearest 15-minute value lower than the value of date column.
VALUEIN
Takes at least 2 arguments, where the first argument is a multi-valued column, and the following arguments are constant values. The transform function will filter the value from the multi-valued column with the given constant values. The VALUEIN transform function is especially useful when the same multi-valued column is both filtering column and grouping column. e.g. VALUEIN(mvColumn, 3, 5, 15)

Differences with SQL

  • JOIN is not supported
  • Use TOP instead of LIMIT for truncation
  • LIMIT n has no effect in grouping queries, should use TOP n instead. If no TOP n defined, PQL will use TOP 10 as default truncation setting.
  • No need to select the columns to group with.

The following two queries are both supported in PQL, where the non-aggregation columns are ignored.

SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable
  GROUP BY bar, baz
  TOP 50

SELECT bar, baz, MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable
  GROUP BY bar, baz
  TOP 50
  • The results will always order by the aggregated value (descending).

The results for query:

SELECT MIN(foo), MAX(foo) FROM myTable
  GROUP BY bar
  TOP 50

will be the same as the combining results from the following queries:

SELECT MIN(foo) FROM myTable
  GROUP BY bar
  TOP 50
SELECT MAX(foo) FROM myTable
  GROUP BY bar
  TOP 50

where we don’t put the results for the same group together.

  • We are beginning work on standard sql support. As a first step, we have introduced ORDER BY.

In order to use ORDER BY certain options need to be set in the request json payload:

  1. groupByMode - Setting this to sql will take the code path of standard sql, and hence accept ORDER BY. By default, this is pql
{
  "pql" : "SELECT COUNT(*) from myTable GROUP BY foo ORDER BY foo DESC TOP 100",
  "queryOptions" : "groupByMode=sql"
}
  1. responseFormat - Setting this to sql will present results in the standard sql way i.e. tabular, with same keys across all aggregations. This only works when used in combination with groupByMode=sql. By default, this is pql
{
  "pql" : "SELECT SUM(foo), SUM(bar) from myTable GROUP BY moo ORDER BY SUM(bar) ASC, moo DESC TOP 10",
  "queryOptions" : "groupByMode=sql;responseFormat=sql"
}

ResultTable looks as follows:

{
  "resultTable": {
    "columns":["moo", "SUM(foo)","SUM(bar)"],
    "results":[["abc", 10, 100],
               ["pqr", 20, 200],
               ["efg", 20, 200],
               ["lmn", 30, 300]]
}

These options are also available on the query console (checkboxes Group By Mode: SQL and Response Format: SQL)