Top-N
Top-N queries ask for the N smallest or largest values ordered by columns. Both smallest and largest values sets are considered Top-N queries. Top-N queries are useful in cases where the need is to display only the N bottom-most or the N top- most records from batch/streaming table on a condition. This result set can be used for further analysis.
Flink uses the combination of a OVER window clause and a filter condition to express a Top-N query. With the power of OVER window clause, Flink also supports per group Top-N. For example, the top five products per category that have the maximum sales in realtime. Top-N queries are supported for SQL on batch and streaming tables.
The following shows the syntax of the Top-N statement:
PARTITION BY col1[, col2...]
: Specifies the partition columns. Each partition will have a Top-N result.- : The
rownum <= N
is required for Flink to recognize this query is a Top-N query. The N represents the N smallest or largest records will be retained.
The unique keys of Top-N query is the combination of partition columns and rownum column. Top-N query can also derive the unique key of upstream. Take following job as an example, say product_id
is the unique key of the ShopSales
, then the unique keys of the Top-N query are [, rownum
] and [product_id
].
No Ranking Output Optimization
As described above, the rownum
field will be written into the result table as one field of the unique key, which may lead to a lot of records being written to the result table. For example, when the record (say product-1001
) of ranking 9 is updated and its rank is upgraded to 1, all the records from ranking 1 ~ 9 will be output to the result table as update messages. If the result table receives too many data, it will become the bottleneck of the SQL job.
The optimization way is omitting rownum field in the outer SELECT clause of the Top-N query. This is reasonable because the number of the top N records is usually not large, thus the consumers can sort the records themselves quickly. Without rownum field, in the example above, only the changed record (product-1001
) needs to be sent to downstream, which can reduce much IO to the result table.
The following example shows how to optimize the above Top-N example in this way: