Scan queries

    The Scan query returns raw Apache Druid rows in streaming mode.

    In addition to straightforward usage where a Scan query is issued to the Broker, the Scan query can also be issued directly to Historical processes or streaming ingestion tasks. This can be useful if you want to retrieve large amounts of data in parallel.

    An example Scan query object is shown below:

    The following are the main parameters for Scan queries:

    The format of the result when resultFormat equals :

    The Scan query currently supports ordering based on timestamp for non-legacy queries. Note that using time ordering will yield results that do not indicate which segment rows are from (segmentId will show up as null). Furthermore, time ordering is only supported where the result set limit is less than druid.query.scan.maxRowsQueuedForOrdering rows or all segments scanned have fewer than druid.query.scan.maxSegmentPartitionsOrderedInMemory partitions. Also, time ordering is not supported for queries issued directly to historicals unless a list of segments is specified. The reasoning behind these limitations is that the implementation of time ordering uses two strategies that can consume too much heap memory if left unbounded. These strategies (listed below) are chosen on a per-Historical basis depending on query result set limit and the number of segments being scanned.

    1. N-Way Merge: For each segment, each partition is opened in parallel. Since each partition’s rows are already time-ordered, an n-way merge can be performed on the results from each partition. This approach doesn’t persist the entire result set in memory (like the Priority Queue) as it streams back batches as they are returned from the merge function. However, attempting to query too many partition could also result in high memory usage due to the need to open decompression and decoding buffers for each. The druid.query.scan.maxSegmentPartitionsOrderedInMemory limit protects from this by capping the number of partitions opened at any times when time ordering is used.

    The Scan query supports a legacy mode designed for protocol compatibility with the former scan-query contrib extension. In legacy mode you can expect the following behavior changes:

    • The __time column is returned as "timestamp" rather than "__time". This will take precedence over any other column you may have that is named "timestamp".
    • Timestamps are returned as ISO8601 time strings rather than integers (milliseconds since 1970-01-01 00:00:00 UTC).

    Legacy mode can be triggered either by passing "legacy" : true in your query JSON, or by setting druid.query.scan.legacy = true on your Druid processes. If you were previously using the scan-query contrib extension, the best way to migrate is to activate legacy mode during a rolling upgrade, then switch it off after the upgrade is complete.

    Configuration properties:

    Sample query context JSON object: