Tutorial: Roll-up
This tutorial will demonstrate the effects of roll-up on an example dataset.
For this tutorial, we’ll assume you’ve already downloaded Druid as described in the single-machine quickstart and have it running on your local machine.
It will also be helpful to have finished the steps in Tutorial: Loading a file and Tutorial: Querying data.
For this tutorial, we’ll use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular second.
A file containing this sample input data is located at quickstart/tutorial/rollup-data.json.
We'll ingest this data using the following ingestion task spec, located at quickstart/tutorial/rollup-index.json:
{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "rollup-tutorial",
      "dimensionsSpec" : {
        "dimensions" : [
          "srcIP",
          "dstIP"
        ]
      },
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "week",
        "queryGranularity" : "minute",
        "intervals" : ["2018-01-01/2018-01-03"],
        "rollup" : true
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial",
        "filter" : "rollup-data.json"
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      },
      "maxRowsInMemory" : 25000
    }
  }
}
Roll-up has been enabled by setting "rollup" : true in the granularitySpec.
Note that we have srcIP and dstIP defined as dimensions, a longSum metric is defined for each of the packets and bytes columns, and the queryGranularity has been defined as minute.
We will see how these definitions are used after we load this data.
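Before loading, it may help to see what these definitions will do. Conceptually, roll-up at ingestion time behaves like a first-level GROUP BY over the dimension columns and the floored timestamp. As a rough sketch in Druid SQL (the raw_input table is hypothetical, shown only to illustrate the aggregation), the stored rows correspond to:
SELECT
  TIME_FLOOR("timestamp", 'PT1M') AS "__time",  -- floor to queryGranularity (minute)
  "srcIP",
  "dstIP",
  COUNT(*) AS "count",                          -- the "count" metric
  SUM("packets") AS "packets",                  -- the longSum metrics
  SUM("bytes") AS "bytes"
FROM raw_input
GROUP BY 1, 2, 3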
From the apache-druid-24.0.2 package root, run the following command:
bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
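If you prefer not to use the helper script, the spec can also be POSTed directly to Druid's task API. A sketch, assuming the quickstart's combined Coordinator-Overlord on port 8081:
curl -X POST -H 'Content-Type: application/json' \
  -d @quickstart/tutorial/rollup-index.json \
  http://localhost:8081/druid/indexer/v1/task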
After the script completes, we will query the data.
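One way to inspect the ingested data is with Druid's bundled command-line SQL client. Assuming the quickstart's default ports, run bin/dsql from the package root and issue the following query:
select * from "rollup-tutorial";
The result tables shown below are taken from that query's output.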
Let's look at the three events in the original input data that occurred during 2018-01-01T01:01:
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
These three rows have been “rolled up” into the following row:
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
The input rows have been grouped by the timestamp and dimension columns {timestamp, srcIP, dstIP}, with sum aggregations on the metric columns packets and bytes.
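As a quick check of the aggregation: 20 + 255 + 11 = 286 packets, 9024 + 21133 + 5780 = 35937 bytes, and the count of 3 records how many input rows were combined.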
Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the "queryGranularity" : "minute" setting in the ingestion spec. For example, the timestamps 2018-01-01T01:01:35Z, 2018-01-01T01:01:51Z, and 2018-01-01T01:01:59Z are all floored to 2018-01-01T01:01:00Z, which is why they share the same __time value above.
Likewise, the two events in the input data that occurred during 2018-01-01T01:02 have been rolled up into the following row:
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:02:00.000Z │ 366260 │ 2     │ 2.2.2.2 │ 415     │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
Note that the count
metric shows how many rows in the original input data contributed to the final “rolled up” row.
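One practical consequence: because roll-up replaces the original rows, COUNT(*) over the datasource counts rolled-up rows, not ingested events. A minimal sketch of recovering the original event count by summing the count metric defined in the spec:
select sum("count") from "rollup-tutorial";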