Tutorial: Roll-up
This tutorial will demonstrate the effects of roll-up on an example dataset.
For this tutorial, we’ll assume you’ve already downloaded Druid as described in the single-machine quickstart and have it running on your local machine.
It will also be helpful to have finished the steps in Tutorial: Loading a file and Tutorial: Querying data.
For this tutorial, we’ll use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular second.
A file containing this sample input data is located at quickstart/tutorial/rollup-data.json.
We'll ingest this data using the following ingestion task spec, located at quickstart/tutorial/rollup-index.json:
{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "rollup-tutorial",
      "dimensionsSpec" : {
        "dimensions" : [
          "srcIP",
          "dstIP"
        ]
      },
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "week",
        "queryGranularity" : "minute",
        "intervals" : ["2018-01-01/2018-01-03"],
        "rollup" : true
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial",
        "filter" : "rollup-data.json"
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      },
      "maxRowsInMemory" : 25000
    }
  }
}
Roll-up has been enabled by setting "rollup" : true in the granularitySpec.
Note that we have srcIP and dstIP defined as dimensions, a longSum metric is defined for each of the packets and bytes columns, and the queryGranularity has been defined as minute.
We will see how these definitions are used after we load this data.
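Before loading, it may help to see what these definitions will do. Conceptually, roll-up at ingestion time behaves like a first-level GROUP BY over the dimension columns and the floored timestamp. As a rough sketch in Druid SQL (the raw_input table is hypothetical, shown only to illustrate the aggregation), the stored rows correspond to:
SELECT
  TIME_FLOOR("timestamp", 'PT1M') AS "__time",  -- floor to queryGranularity (minute)
  "srcIP",
  "dstIP",
  COUNT(*) AS "count",                          -- the "count" metric
  SUM("packets") AS "packets",                  -- the longSum metrics
  SUM("bytes") AS "bytes"
FROM raw_input
GROUP BY 1, 2, 3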
From the apache-druid-24.0.2 package root, run the following command:
bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
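If you prefer not to use the helper script, the spec can also be POSTed directly to Druid's task API. A sketch, assuming the quickstart's combined Coordinator-Overlord on port 8081:
curl -X POST -H 'Content-Type: application/json' \
  -d @quickstart/tutorial/rollup-index.json \
  http://localhost:8081/druid/indexer/v1/task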
After the script completes, we will query the data.
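One way to inspect the ingested data is with Druid's bundled command-line SQL client. Assuming the quickstart's default ports, run bin/dsql from the package root and issue the following query:
select * from "rollup-tutorial";
The result tables shown below are taken from that query's output.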
Let's look at the three events in the original input data that occurred during 2018-01-01T01:01:
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
These three rows have been “rolled up” into the following row:
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
The input rows have been grouped by the timestamp and dimension columns {timestamp, srcIP, dstIP}, with sum aggregations on the metric columns packets and bytes.
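As a quick check of the aggregation: 20 + 255 + 11 = 286 packets, 9024 + 21133 + 5780 = 35937 bytes, and the count of 3 records how many input rows were combined.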
Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the "queryGranularity" : "minute" setting in the ingestion spec. For example, the timestamps 2018-01-01T01:01:35Z, 2018-01-01T01:01:51Z, and 2018-01-01T01:01:59Z are all floored to 2018-01-01T01:01:00Z, which is why they share the same __time value above.
Likewise, the two events in the input data that occurred during 2018-01-01T01:02 have been rolled up into the following row:
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:02:00.000Z │ 366260 │ 2     │ 2.2.2.2 │ 415     │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
Note that the count
metric shows how many rows in the original input data contributed to the final “rolled up” row.
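One practical consequence: because roll-up replaces the original rows, COUNT(*) over the datasource counts rolled-up rows, not ingested events. A minimal sketch of recovering the original event count by summing the count metric defined in the spec:
select sum("count") from "rollup-tutorial";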