Tutorial: Loading a file
For this tutorial, we’ll assume you’ve already downloaded Druid as described in the quickstart using the single-machine configuration and have it running on your local machine. You don’t need to have loaded any data yet.
A data load is initiated by submitting an ingestion task spec to the Druid Overlord. For this tutorial, we’ll be loading the sample Wikipedia page edits data.
An ingestion spec can be written by hand or by using the “Data loader” that is built into the Druid console. The data loader can help you build an ingestion spec by sampling your data and iteratively configuring various ingestion parameters. The data loader currently supports only native batch ingestion (support for streaming, including data stored in Apache Kafka and AWS Kinesis, is coming in future releases). Streaming ingestion is only available through a written ingestion spec today.
We’ve included a sample of Wikipedia edits from September 12, 2015 to get you started.
Navigate to the Druid console and click Load data in the console header.
Select Local disk and click Connect data.
Enter quickstart/tutorial/ as the base directory and wikiticker-2015-09-12-sampled.json.gz as a filter. The base directory and the wildcard file filter are kept separate so that you can ingest data from multiple files.
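To make the split between base directory and wildcard filter concrete, here is a minimal Python sketch of the idea. It is only an illustration: the function name match_files is hypothetical, and Druid's own file matching may differ in detail from Python's fnmatch rules.

```python
import fnmatch
import os


def match_files(base_dir, file_filter):
    """Return sorted paths under base_dir whose file name matches file_filter.

    Hypothetical helper for illustration; it is not part of Druid.
    """
    matches = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            # fnmatchcase applies shell-style wildcards, case-sensitively.
            if fnmatch.fnmatchcase(name, file_filter):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

With this sketch, an exact file name as the filter selects a single file, while a pattern such as *.json.gz would select every matching file under the base directory.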
Click Preview and make sure that the data you are seeing is correct.
Once the data is located, you can click “Next: Parse data” to go to the next step.
The data loader will try to automatically determine the correct parser for the data. In this case it will successfully determine json. Feel free to play around with different parser options to get a preview of how Druid will parse your data.
With the json parser selected, click Next: Parse time to get to the step centered around determining your primary timestamp column.
Druid’s architecture requires a primary timestamp column (internally stored in a column called __time). If you do not have a timestamp in your data, select . In our example, the data loader will determine that the time column in our raw data is the only candidate that can be used as the primary time column.
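Druid stores __time as a long holding milliseconds since the Unix epoch. The sketch below shows what that conversion looks like for one value, assuming the raw time column holds ISO 8601 UTC strings with millisecond precision. The function name and the sample value are illustrative, not taken from the tutorial data.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_epoch_millis(value):
    """Convert an ISO 8601 UTC string like '2015-09-12T00:46:58.771Z' to epoch millis.

    Hypothetical helper; assumes a trailing 'Z' and fractional seconds.
    """
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    delta = parsed - EPOCH
    # Integer arithmetic avoids floating-point rounding on the millisecond part.
    return (delta.days * 86_400 + delta.seconds) * 1000 + delta.microseconds // 1000


print(iso_to_epoch_millis("2015-09-12T00:46:58.771Z"))  # prints 1442018818771
```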
Click Next: ... twice to go past the Transform and Filter steps. You do not need to enter anything in these steps, as applying ingestion-time transforms and filters is out of scope for this tutorial.
Once you are satisfied with the schema, click Next to go to the Partition step where you can fine-tune how the data will be partitioned into segments.
Here, you can adjust how the data will be split up into segments in Druid. Since this is a small dataset, there are no adjustments that need to be made in this step.
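As a rough mental model, Druid first partitions data by time, so rows that fall in the same time chunk end up in the same group of segments. The sketch below groups epoch-millisecond timestamps into day-sized buckets. The day granularity and the function name are assumptions chosen for illustration, and real segment creation involves more than this grouping.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_by_day(epoch_millis_values):
    """Group epoch-millisecond timestamps by their UTC calendar day.

    Hypothetical helper; mimics only the time-chunking idea, not Druid itself.
    """
    buckets = defaultdict(list)
    for ms in epoch_millis_values:
        day = datetime.fromtimestamp(ms // 1000, tz=timezone.utc).strftime("%Y-%m-%d")
        buckets[day].append(ms)
    return dict(buckets)
```

Because the sample file covers a single day, every row would land in one bucket under this model, which is consistent with no partitioning changes being needed here.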
Click through the Tune step to get to the Publish step.
The Publish step is where we can specify the name of the datasource in Druid. Let’s name this datasource wikipedia. Finally, click Next to review your spec.
This is the spec you have constructed. Feel free to go back and make changes in previous steps to see how changes will update the spec. Similarly, you can also edit the spec directly and see it reflected in the previous steps.
Once you are satisfied with the spec, click Submit and an ingestion task will be created.
You will be taken to the task view with the focus on the newly created task. The task view is set to auto-refresh; wait until your task succeeds.
When a task succeeds, it means that it built one or more segments that will now be picked up by the data servers.
Navigate to the Datasources view from the header.
Wait until your datasource (wikipedia) appears. This can take a few seconds as the segments are being loaded.
A datasource is queryable once you see a green (fully available) circle. At this point, you can go to the Query view to run SQL queries against the datasource.
Run a SELECT * FROM "wikipedia" query to see your results.
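The same query can also be sent over HTTP instead of through the Query view. The following is a minimal sketch that posts to Druid's SQL endpoint at /druid/v2/sql. The base URL http://localhost:8888 is an assumption about a default single-machine setup, and the function names are hypothetical.

```python
import json
import urllib.request


def build_sql_request(sql, base_url="http://localhost:8888"):
    """Build a POST request for Druid's SQL endpoint (base_url is an assumption)."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql, base_url="http://localhost:8888"):
    """Send the query and return the decoded JSON result."""
    with urllib.request.urlopen(build_sql_request(sql, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling run_sql('SELECT * FROM "wikipedia" LIMIT 5') against a running cluster should return a small list of rows. Adding a LIMIT keeps the response manageable.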
Loading data with a spec (via console)
The Druid package includes the following sample native batch ingestion task spec at quickstart/tutorial/wikipedia-index.json, shown here for convenience, which has been configured to read the quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz input file:
This spec will create a datasource named “wikipedia”.
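For orientation, the overall shape of a native batch ingestion spec can be sketched as a Python dictionary and serialized to JSON. This is not the shipped wikipedia-index.json verbatim. The granularity settings, the interval, the iso timestamp format, and the empty dimension list are assumptions chosen for illustration, and the function name is hypothetical.

```python
import json


def build_wikipedia_spec():
    """Return an illustrative native batch spec; field values are assumptions."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": "wikipedia",
                # The raw data's "time" column becomes the primary timestamp.
                "timestampSpec": {"column": "time", "format": "iso"},
                # Assumption: an empty list leaves dimension discovery to Druid.
                "dimensionsSpec": {"dimensions": []},
                "metricsSpec": [],
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "day",
                    "queryGranularity": "none",
                    "intervals": ["2015-09-12/2015-09-13"],
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "local",
                    "baseDir": "quickstart/tutorial/",
                    "filter": "wikiticker-2015-09-12-sampled.json.gz",
                },
                "inputFormat": {"type": "json"},
                "appendToExisting": False,
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


print(json.dumps(build_wikipedia_spec(), indent=2))
```

The three top-level parts of the spec mirror the data loader steps: dataSchema covers the timestamp, schema, and partitioning choices, ioConfig covers where the data comes from, and tuningConfig covers the Tune step.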
From the task view, click on Submit task and select Raw JSON task.
This will bring up the spec submission dialog where you can paste the spec above.
Once the spec is submitted, you can follow the same instructions as above to wait for the data to load and then query it.
For convenience, the Druid package includes a batch ingestion helper script at .
This script will POST an ingestion task to the Druid Overlord and poll Druid until the data is available for querying.
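The polling half of that behavior can be sketched in a few lines of Python. This is not the helper script itself. It assumes the Overlord is reachable at http://localhost:8081 and that the task status response nests the state under status.status. The function names are hypothetical, and the sketch only waits for the task to finish, not for the segments to become queryable.

```python
import json
import time
import urllib.request

TERMINAL_STATES = ("SUCCESS", "FAILED")


def fetch_task_state(task_id, base_url="http://localhost:8081"):
    """GET the task status from the Overlord (base_url is an assumption)."""
    url = f"{base_url}/druid/indexer/v1/task/{task_id}/status"
    with urllib.request.urlopen(url) as response:
        payload = json.loads(response.read().decode("utf-8"))
    # Assumed response shape: {"task": ..., "status": {"status": "RUNNING", ...}}
    return payload["status"]["status"]


def wait_for_task(task_id, fetch_state=fetch_task_state, poll_seconds=2.0,
                  max_polls=150, sleep=time.sleep):
    """Poll until the task reaches a terminal state, then return that state."""
    for _ in range(max_polls):
        state = fetch_state(task_id)
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

Passing fetch_state and sleep as parameters keeps the loop easy to exercise without a running cluster.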
Run the following command from the Druid package root:
You should see output like the following:
Once the spec is submitted, you can follow the same instructions as above to wait for the data to load and then query it.
Loading data without the script
Let’s briefly discuss how we would’ve submitted the ingestion task without using the script. You do not need to run these commands.
To submit the task, POST it to Druid in a new terminal window from the apache-druid-0.18.1 directory:
This will print the ID of the task if the submission was successful:
You can monitor the status of this task from the console as outlined above.
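For readers who prefer to see the submission spelled out in code, here is a minimal Python sketch of the same POST. It assumes the Overlord task endpoint is at http://localhost:8081/druid/indexer/v1/task and that a successful response is a JSON object with a task field. The function names and the sample task ID in the usage note are hypothetical.

```python
import json
import urllib.request


def build_task_request(spec, overlord_url="http://localhost:8081"):
    """Build the POST request that submits an ingestion spec (URL is an assumption)."""
    return urllib.request.Request(
        url=overlord_url + "/druid/indexer/v1/task",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_task_id(response_body):
    """Extract the task ID from the submission response body."""
    return json.loads(response_body)["task"]


def submit_task(spec, overlord_url="http://localhost:8081"):
    """Submit the spec and return the ID of the created task."""
    with urllib.request.urlopen(build_task_request(spec, overlord_url)) as response:
        return parse_task_id(response.read().decode("utf-8"))
```

For example, parse_task_id('{"task": "index_example"}') returns "index_example", which is the value you would then look for in the task view.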
Once the data is loaded, please follow the query tutorial to run some example queries on the newly loaded data.
Cleanup
If you wish to go through any of the other ingestion tutorials, you will need to shut down the cluster and reset the cluster state by removing the contents of the var directory under the Druid package, as the other tutorials will write to the same “wikipedia” datasource.
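If you would rather script the reset, the sketch below empties a directory while leaving the directory itself in place. It is a generic helper with a hypothetical name, not something shipped with Druid. Only run it after the cluster has been shut down, and double-check the path you pass in, since the deletion is not reversible.

```python
import shutil
from pathlib import Path


def clear_directory(path):
    """Delete everything inside path, keep path itself, and return removed names."""
    directory = Path(path)
    removed = []
    # sorted() materializes the listing before anything is deleted.
    for entry in sorted(directory.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed
```

From the Druid package root, clear_directory("var") would correspond to the reset described above.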