SQL-based ingestion reference

    This topic is a reference guide for the multi-stage query architecture in Apache Druid. For examples of real-world usage, refer to the examples page.

    Use the EXTERN function to read external data.

    Function format:
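
    In outline, with angle-bracketed placeholders standing in for the three JSON-encoded arguments described below:

        SELECT
          <column>
        FROM TABLE(
          EXTERN(
            '<Druid input source>',
            '<Druid input format>',
            '<row signature>'
          )
        )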

    EXTERN consists of the following parts:

    1. Any Druid input source as a JSON-encoded string.
    2. Any Druid input format as a JSON-encoded string.
    3. A row signature, as a JSON-encoded array of column descriptors. Each column descriptor must have a name and a type. The type can be string, long, double, or float. This row signature is used to map the external data into the SQL layer.
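
    For illustration, a hypothetical query that reads gzipped JSON over HTTP; the URI and column names are placeholders, not a real dataset:

        SELECT *
        FROM TABLE(
          EXTERN(
            '{"type": "http", "uris": ["https://example.com/data.json.gz"]}',
            '{"type": "json"}',
            '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "visits", "type": "long"}]'
          )
        )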

    For more information, see Read external data with EXTERN.

    Use the INSERT statement to insert data.

    Unlike standard SQL, INSERT loads data into the target table according to column name, not positionally. If necessary, use AS in your SELECT column list to assign the correct names. Do not rely on their positions within the SELECT clause.

    Statement format:
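
    In outline, with square brackets marking the optional clause:

        INSERT INTO <table name>
        < SELECT query >
        PARTITIONED BY <time granularity>
        [ CLUSTERED BY <column list> ]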

    INSERT consists of the following parts:

    1. Optional context parameters.
    2. An INSERT INTO <dataSource> clause at the start of your query, such as INSERT INTO "your-table".
    3. A SELECT clause for the actual data you want to ingest.
    4. A PARTITIONED BY clause, such as PARTITIONED BY DAY.
    5. An optional CLUSTERED BY clause.
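
    Putting the parts together, a hypothetical example that ingests a local JSON file and uses AS to assign target column names; the file path and columns are illustrative:

        INSERT INTO "your-table"
        SELECT
          TIME_PARSE(ts) AS __time,
          page AS page_name,
          visits
        FROM TABLE(
          EXTERN(
            '{"type": "local", "files": ["/tmp/events.json"]}',
            '{"type": "json"}',
            '[{"name": "ts", "type": "string"}, {"name": "page", "type": "string"}, {"name": "visits", "type": "long"}]'
          )
        )
        PARTITIONED BY DAY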

    You can use the REPLACE function to replace all or some of the data.

    Unlike standard SQL, REPLACE loads data into the target table according to column name, not positionally. If necessary, use AS in your SELECT column list to assign the correct names. Do not rely on their positions within the SELECT clause.

    REPLACE all data

    Function format to replace all data:
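
    In outline:

        REPLACE INTO <target table>
        OVERWRITE ALL
        < SELECT query >
        PARTITIONED BY <time granularity>
        [ CLUSTERED BY <column list> ]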

    REPLACE specific time ranges

    Function format to replace specific time ranges:
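
    In outline:

        REPLACE INTO <target table>
        OVERWRITE WHERE <__time-based condition>
        < SELECT query >
        PARTITIONED BY <time granularity>
        [ CLUSTERED BY <column list> ]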

    REPLACE consists of the following parts:

    1. Optional context parameters.
    2. A REPLACE INTO <dataSource> clause at the start of your query, such as REPLACE INTO "your-table".
    3. An OVERWRITE clause after the datasource, either OVERWRITE ALL or OVERWRITE WHERE:
    • OVERWRITE ALL replaces the entire existing datasource with the results of the query.
    • OVERWRITE WHERE drops the time segments that match the condition you set. Conditions are based on the __time column and use the format __time [< > = <= >=] TIMESTAMP. Use them with AND, OR, and NOT between them, inclusive of the timestamps specified. No other expressions or functions are valid in OVERWRITE.
    4. A SELECT clause for the actual data you want to use for the replacement.
    5. A PARTITIONED BY clause, such as PARTITIONED BY DAY.
    6. An optional CLUSTERED BY clause.
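
    As an illustration, a hypothetical query that rewrites a single day of "your-table" from its own contents, dropping unwanted rows; the dates and column name are placeholders:

        REPLACE INTO "your-table"
        OVERWRITE WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-01-02'
        SELECT *
        FROM "your-table"
        WHERE __time >= TIMESTAMP '2023-01-01'
          AND __time < TIMESTAMP '2023-01-02'
          AND page_name <> 'spam'
        PARTITIONED BY DAY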

    For more information, see Overwrite data with REPLACE.

    The PARTITIONED BY <time granularity> clause is required for INSERT and REPLACE. See Partitioning for details.

    The following granularity arguments are accepted:

    • Time unit: HOUR, DAY, MONTH, or YEAR. Equivalent to FLOOR(__time TO TimeUnit).
    • TIME_FLOOR(__time, 'granularity_string'), where granularity_string is one of the ISO 8601 periods listed below. The first argument must be __time.
    • FLOOR(__time TO TimeUnit), where TimeUnit is any unit supported by the FLOOR function. The first argument must be __time.
    • ALL or ALL TIME, which effectively disables time partitioning by placing all data in a single time chunk. To use LIMIT or OFFSET at the outer level of your INSERT or REPLACE query, you must set PARTITIONED BY to ALL or ALL TIME.
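
    For example, the following three clauses all partition by day and are equivalent per the rules above:

        PARTITIONED BY DAY
        PARTITIONED BY FLOOR(__time TO DAY)
        PARTITIONED BY TIME_FLOOR(__time, 'P1D')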

    The following ISO 8601 periods are supported for TIME_FLOOR:

    • PT1S
    • PT1M
    • PT10M
    • PT15M
    • PT30M
    • PT1H
    • PT6H
    • P1D
    • P1W
    • P1M
    • P3M
    • P1Y

    The CLUSTERED BY <column list> clause is optional for INSERT and REPLACE. It accepts a list of column names or expressions.
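
    For example, a hypothetical query tail that partitions by day and clusters rows within each day by two columns:

        INSERT INTO "your-table"
        SELECT ...  -- your SELECT query
        PARTITIONED BY DAY
        CLUSTERED BY country, city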

    For more information about clustering, see Clustering.

    In addition to the Druid SQL context parameters, the multi-stage query task engine accepts certain context parameters that are specific to it.

    Use context parameters alongside your queries to customize the behavior of the query. If you’re using the API, include the context parameters in the query context when you submit a query:
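
    For example, a minimal request body for the SQL task API might look like the following; the query string and the maxNumTasks value are illustrative:

        {
          "query": "REPLACE INTO ...",
          "context": {
            "maxNumTasks": 3
          }
        }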

    If you’re using the web console, you can specify the context parameters through various UI options.

    The following table lists the context parameters for the MSQ task engine:

    Knowing the limits for the MSQ task engine can help you troubleshoot any errors that you encounter. Many of the errors occur as a result of reaching a limit.

    The following table lists query limits:

    The following table describes error codes you may encounter in the multiStageQuery.payload.status.errorReport.error.errorCode field: