Tutorial: Updating existing data
For this tutorial, we’ll assume you’ve already downloaded Apache Druid as described in the single-machine quickstart and have it running on your local machine.
It will also be helpful to have finished Tutorial: Querying data.
This section of the tutorial will cover how to overwrite an existing interval of data.
Let’s load an initial data set which we will overwrite and append to.
The spec we’ll use for this tutorial creates a datasource called updates-tutorial from the quickstart/tutorial/updates-data.json input file.
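Let’s submit that task. Once it completes, querying the datasource returns three initial rows, each with an animal name and a number: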
dsql> select * from "updates-tutorial";
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time                   │ animal   │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T01:01:00.000Z │ tiger    │     1 │    100 │
│ 2018-01-01T03:01:00.000Z │ aardvark │     1 │     42 │
│ 2018-01-01T03:01:00.000Z │ giraffe  │     1 │  14124 │
└──────────────────────────┴──────────┴───────┴────────┘
Retrieved 3 rows in 1.42s.
Overwrite the initial data
To overwrite this data, we can submit another task for the same interval, but with different input data.
The quickstart/tutorial/updates-overwrite-index.json spec will perform an overwrite on the updates-tutorial datasource.
Note that this task reads input from quickstart/tutorial/updates-data2.json, and appendToExisting is set to false (indicating this is an overwrite).
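The difference between an overwrite and an append comes down to one flag in the task spec. As a minimal sketch, the Python helper below reads that flag from a parsed spec. It assumes the flag lives at spec.ioConfig.appendToExisting, as in a native batch index task, and that a missing flag means overwrite. The function name and the trimmed-down spec are illustrative and are not the contents of the tutorial files.

```python
import json


def ingestion_mode(task_spec: dict) -> str:
    """Return "append" or "overwrite" for a parsed batch ingestion spec."""
    # Assumed layout: {"spec": {"ioConfig": {"appendToExisting": <bool>}}}
    io_config = task_spec.get("spec", {}).get("ioConfig", {})
    # Assumption: an absent flag is treated as false, i.e. an overwrite.
    return "append" if io_config.get("appendToExisting", False) else "overwrite"


# Hypothetical, heavily trimmed spec; a real spec also carries a dataSchema,
# an input definition, and a tuningConfig.
overwrite_spec = json.loads("""
{
  "type": "index",
  "spec": {"ioConfig": {"type": "index", "appendToExisting": false}}
}
""")

print(ingestion_mode(overwrite_spec))
```

Checking the flag this way before submitting a spec can help avoid replacing an interval by accident.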
Let’s submit that task:
bin/post-index-task --file quickstart/tutorial/updates-overwrite-index.json --url http://localhost:8081
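For scripted workflows, the same submission can be sketched in Python. The example below only builds the HTTP request and does not send it. It assumes the task endpoint is /druid/indexer/v1/task on the URL passed to --url above; the helper name build_task_request and the placeholder spec are hypothetical.

```python
import json
import urllib.request

# Assumption: the task submission endpoint, reachable on the quickstart URL.
TASK_ENDPOINT = "/druid/indexer/v1/task"


def build_task_request(task_spec: dict, base_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a task spec as JSON."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + TASK_ENDPOINT,
        data=json.dumps(task_spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical stand-in for the parsed contents of a task spec file.
placeholder_spec = {"type": "index"}
request = build_task_request(placeholder_spec, "http://localhost:8081")
print(request.get_method(), request.full_url)

# To submit for real (requires a running Druid instance):
#     urllib.request.urlopen(request)
```

In practice the spec would be loaded from the JSON file with json.load rather than written inline.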
When Druid finishes loading the new segment from this overwrite task, the “tiger” row now has the value “lion”, the “aardvark” row has a different number, and the “giraffe” row has been replaced. It may take a couple of minutes for the changes to take effect.
Let’s try appending some new data to the updates-tutorial datasource now. We will add the data from quickstart/tutorial/updates-data3.json.
Let’s submit that task.
When Druid finishes loading the new segment from this overwrite task, the new rows will have been added to the datasource. Note that roll-up occurred for the “lion” row:
dsql> select * from "updates-tutorial";
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time                   │ animal   │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T01:01:00.000Z │ lion     │     2 │    400 │
│ 2018-01-01T03:01:00.000Z │ aardvark │     1 │   9999 │
│ 2018-01-01T04:01:00.000Z │ bear     │     1 │    111 │
│ 2018-01-01T05:01:00.000Z │ mongoose │     1 │    737 │
│ 2018-01-01T06:01:00.000Z │ snake    │     1 │   1234 │
│ 2018-01-01T07:01:00.000Z │ octopus  │     1 │    115 │
└──────────────────────────┴──────────┴───────┴────────┘
Retrieved 6 rows in 0.02s.
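Roll-up combines rows that share a timestamp and dimension values at ingestion time, summing their metrics. A minimal Python sketch of the idea follows. The roll_up function is illustrative rather than Druid's implementation, and the two input “lion” rows (numbers 100 and 300) are assumed values, chosen only so that they sum to the 400 shown above.

```python
def roll_up(rows):
    """Merge rows sharing (__time, animal), summing count and number."""
    merged = {}
    for time, animal, count, number in rows:
        key = (time, animal)
        prev_count, prev_number = merged.get(key, (0, 0))
        merged[key] = (prev_count + count, prev_number + number)
    # Dicts preserve insertion order, so first-seen keys come out first.
    return [(t, a, c, n) for (t, a), (c, n) in merged.items()]


# Assumed inputs: the individual lion values are not taken from the tutorial files.
rows = [
    ("2018-01-01T01:01:00.000Z", "lion", 1, 100),
    ("2018-01-01T01:01:00.000Z", "lion", 1, 300),
    ("2018-01-01T04:01:00.000Z", "bear", 1, 111),
]

for row in roll_up(rows):
    print(row)
```

The two lion rows collapse into one row with count 2, while the bear row passes through unchanged because nothing else shares its key.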
Let’s try another way of appending data.
The quickstart/tutorial/updates-append-index2.json task spec reads input from quickstart/tutorial/updates-data4.json and will append its data to the updates-tutorial datasource. Note that appendToExisting is set to true in this spec.
Let’s submit that task.
When the new data is loaded, we can see two additional rows after “octopus”. Note that the new “bear” row with number 222 has not been rolled up with the existing bear-111 row, because the new data is held in a separate segment.
dsql> select * from "updates-tutorial";
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time                   │ animal   │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T01:01:00.000Z │ lion     │     2 │    400 │
│ 2018-01-01T03:01:00.000Z │ aardvark │     1 │   9999 │
│ 2018-01-01T04:01:00.000Z │ bear     │     1 │    111 │
│ 2018-01-01T05:01:00.000Z │ mongoose │     1 │    737 │
│ 2018-01-01T06:01:00.000Z │ snake    │     1 │   1234 │
│ 2018-01-01T07:01:00.000Z │ octopus  │     1 │    115 │
│ 2018-01-01T04:01:00.000Z │ bear     │     1 │    222 │
│ 2018-01-01T09:01:00.000Z │ falcon   │     1 │   1241 │
└──────────────────────────┴──────────┴───────┴────────┘
Retrieved 8 rows in 0.02s.
If we run a GROUP BY query instead of a select *, the two “bear” rows are grouped together at query time:
dsql> select __time, animal, SUM("count"), SUM("number") from "updates-tutorial" group by __time, animal;
┌──────────────────────────┬──────────┬────────┬────────┐
│ __time                   │ animal   │ EXPR$2 │ EXPR$3 │
├──────────────────────────┼──────────┼────────┼────────┤
│ 2018-01-01T01:01:00.000Z │ lion     │      2 │    400 │
│ 2018-01-01T03:01:00.000Z │ aardvark │      1 │   9999 │
│ 2018-01-01T04:01:00.000Z │ bear     │      2 │    333 │
│ 2018-01-01T05:01:00.000Z │ mongoose │      1 │    737 │
│ 2018-01-01T06:01:00.000Z │ snake    │      1 │   1234 │
│ 2018-01-01T07:01:00.000Z │ octopus  │      1 │    115 │
│ 2018-01-01T09:01:00.000Z │ falcon   │      1 │   1241 │
└──────────────────────────┴──────────┴────────┴────────┘
Retrieved 7 rows in 0.23s.
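The two query results can be reproduced with a small model. The sketch below treats each segment as a plain list of rows, which is a simplification and not how Druid stores data, and uses the values from the query results above. A select * concatenates the segments, while a GROUP BY merges matching rows at query time. The function names are illustrative.

```python
# Rows are (__time, animal, count, number), taken from the query results above.
existing_segment = [
    ("2018-01-01T01:01:00.000Z", "lion", 2, 400),
    ("2018-01-01T03:01:00.000Z", "aardvark", 1, 9999),
    ("2018-01-01T04:01:00.000Z", "bear", 1, 111),
    ("2018-01-01T05:01:00.000Z", "mongoose", 1, 737),
    ("2018-01-01T06:01:00.000Z", "snake", 1, 1234),
    ("2018-01-01T07:01:00.000Z", "octopus", 1, 115),
]
appended_segment = [
    ("2018-01-01T04:01:00.000Z", "bear", 1, 222),
    ("2018-01-01T09:01:00.000Z", "falcon", 1, 1241),
]


def select_all(segments):
    """Model of select *: return every stored row, segment by segment."""
    return [row for segment in segments for row in segment]


def group_by_time_and_animal(segments):
    """Model of the GROUP BY query: sum count and number per (__time, animal)."""
    totals = {}
    for time, animal, count, number in select_all(segments):
        prev_count, prev_number = totals.get((time, animal), (0, 0))
        totals[(time, animal)] = (prev_count + count, prev_number + number)
    return sorted((t, a, c, n) for (t, a), (c, n) in totals.items())


segments = [existing_segment, appended_segment]
print(len(select_all(segments)))                 # stored rows across both segments
for row in group_by_time_and_animal(segments):   # bear rows merge at query time
    print(row)
```

The stored data keeps the two bear rows apart because they sit in different segments; only the query combines them.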