Tutorial: Querying data

    The tutorial assumes that you’ve already completed one of the 4 ingestion tutorials, as we will be querying the sample Wikipedia edits data.

    Druid queries are sent over HTTP. The Druid console includes a view to issue queries to Druid and nicely format the results.

    Druid supports a dialect of SQL for querying.

    This query retrieves the 10 Wikipedia pages with the most page edits on 2015-09-12.

    Let’s look at the different ways to issue this query.

    You can issue the above query from the console.

    The console query view provides autocomplete functionality with inline documentation.

    Query options

    You can also configure extra to be sent with the query from the options menu.

    Note that the console will (by default) wrap your SQL queries in a limit where appropriate so that queries such as SELECT * FROM wikipedia can complete. You can turn off this behavior from the Smart query limit toggle.

    Query SQL via dsql

    For convenience, the Druid package includes a SQL command-line client, located at bin/dsql from the Druid package root.

    Let’s now run bin/dsql; you should see the following prompt:

    1. Welcome to dsql, the command-line client for Druid SQL.
    2. Type "\h" for help.
    3. dsql>

    To submit the query, paste it to the dsql prompt and press enter:

    1. dsql> SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
    2. ┌──────────────────────────────────────────────────────────┬───────┐
    3. page Edits
    4. ├──────────────────────────────────────────────────────────┼───────┤
    5. Wikipedia:Vandalismusmeldung 33
    6. User:Cyde/List of candidates for speedy deletion/Subpage 28
    7. Jeremy Corbyn 27
    8. Wikipedia:Administrators' noticeboard/Incidents │ 21 │
    9. │ Flavia Pennetta │ 20 │
    10. │ Total Drama Presents: The Ridonculous Race │ 18 │
    11. │ User talk:Dudeperson176123 │ 18 │
    12. │ Wikipédia:Le Bistro/12 septembre 2015 │ 18 │
    13. │ Wikipedia:Requests for page protection │ 17 │
    14. └──────────────────────────────────────────────────────────┴───────┘
    15. Retrieved 10 rows in 0.06s.

    The SQL queries are submitted as JSON over HTTP.

    The tutorial package includes an example file that contains the SQL query shown above at quickstart/tutorial/wikipedia-top-pages-sql.json. Let’s submit that query to the Druid Broker:

    1. curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages-sql.json http://localhost:8888/druid/v2/sql

    The following results should be returned:

    More Druid SQL examples

    Here is a collection of queries to try out:

    Query over time

    1. SELECT FLOOR(__time to HOUR) AS HourTime, SUM(deleted) AS LinesDeleted
    2. FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
    3. GROUP BY 1

    Query example

    General group by

    1. FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
    2. GROUP BY channel, page
    3. ORDER BY SUM(added) DESC

    Select raw data

    1. SELECT user, page
    2. FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 02:00:00' AND TIMESTAMP '2015-09-12 03:00:00'
    3. LIMIT 5

    Query example

    Druid SQL has the ability to explain the query plan for a given query. In the console this functionality is accessible from the ... button.

    Using a query from an example above:

    EXPLAIN PLAN FOR SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;

    Druid’s native query format is expressed in JSON.

    Native query via the console

    You can issue native Druid queries from the console’s Query view.

    Here is a query that retrieves the 10 Wikipedia pages with the most page edits on 2015-09-12.

    1. {
    2. "queryType" : "topN",
    3. "dataSource" : "wikipedia",
    4. "intervals" : ["2015-09-12/2015-09-13"],
    5. "granularity" : "all",
    6. "dimension" : "page",
    7. "metric" : "count",
    8. "threshold" : 10,
    9. "aggregations" : [
    10. {
    11. "type" : "count",
    12. "name" : "count"
    13. }
    14. ]
    15. }

    Simply paste it into the console to switch the editor into JSON mode.

    Native query

    We have included a sample native TopN query under quickstart/tutorial/wikipedia-top-pages.json:

    Let’s submit this query to Druid:

    1. curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages.json http://localhost:8888/druid/v2?pretty

    You should see the following query results:

    1. [ {
    2. "result" : [ {
    3. "page" : "Wikipedia:Vandalismusmeldung"
    4. }, {
    5. "count" : 28,
    6. "page" : "User:Cyde/List of candidates for speedy deletion/Subpage"
    7. }, {
    8. "count" : 27,
    9. "page" : "Jeremy Corbyn"
    10. }, {
    11. "count" : 21,
    12. "page" : "Wikipedia:Administrators' noticeboard/Incidents"
    13. }, {
    14. "count" : 20,
    15. "page" : "Flavia Pennetta"
    16. }, {
    17. "count" : 18,
    18. "page" : "Total Drama Presents: The Ridonculous Race"
    19. }, {
    20. "count" : 18,
    21. "page" : "User talk:Dudeperson176123"
    22. }, {
    23. "count" : 18,
    24. "page" : "Wikipédia:Le Bistro/12 septembre 2015"
    25. }, {
    26. "count" : 17,
    27. "page" : "Wikipedia:In the news/Candidates"
    28. }, {
    29. "count" : 17,
    30. "page" : "Wikipedia:Requests for page protection"
    31. } ]

    The Queries documentation has more information on Druid’s native JSON queries.

    The has more information on using Druid SQL queries.