Ingestion spec reference

    Ingestion specs consist of three main components:

    • dataSchema, which configures the datasource name, primary timestamp, dimensions, metrics, and transforms and filters (if needed).
    • ioConfig, which tells Druid how to connect to the source system and how to parse data. For more information, see the documentation for each ingestion method.
    • tuningConfig, which controls various tuning parameters specific to each ingestion method.

    Example ingestion spec for task type index_parallel (native batch):
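    A condensed sketch of such a spec, assembled from the dataSchema, ioConfig, and tuningConfig examples shown later on this page; the local inputSource paths are illustrative placeholders:

    {
      "type": "index_parallel",
      "spec": {
        "dataSchema": {
          "dataSource": "wikipedia",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "page",
              "language",
              { "type": "long", "name": "userId" }
            ]
          },
          "metricsSpec": [
            { "type": "count", "name": "count" },
            { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
            { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
          ],
          "granularitySpec": {
            "segmentGranularity": "day",
            "queryGranularity": "none",
            "intervals": [
              "2013-08-31/2013-09-01"
            ]
          }
        },
        "ioConfig": {
          "type": "index_parallel",
          "inputSource": {
            "type": "local",
            "baseDir": "examples/indexing/",
            "filter": "wikipedia_data.json"
          },
          "inputFormat": {
            "type": "json"
          }
        },
        "tuningConfig": {
          "type": "index_parallel"
        }
      }
    }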

    The specific options supported by these sections will depend on the ingestion method you have chosen. For more examples, refer to the documentation for each ingestion method.

    You can also load data visually, without the need to write an ingestion spec, using the “Load data” functionality available in Druid’s web console. Druid’s visual data loader supports Kafka, Kinesis, and native batch mode.

    dataSchema

    The dataSchema is a holder for the following components:

    • the datasource name (dataSource)
    • the primary timestamp (timestampSpec)
    • dimensions (dimensionsSpec)
    • metrics (metricsSpec)
    • transforms and filters, if needed (transformSpec)

    An example dataSchema is:

    1. "dataSchema": {
    2. "dataSource": "wikipedia",
    3. "timestampSpec": {
    4. "column": "timestamp",
    5. "format": "auto"
    6. },
    7. "dimensionsSpec": {
    8. "dimensions": [
    9. { "page" },
    10. { "language" },
    11. { "type": "long", "name": "userId" }
    12. ]
    13. },
    14. "metricsSpec": [
    15. { "type": "count", "name": "count" },
    16. { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
    17. { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
    18. ],
    19. "granularitySpec": {
    20. "segmentGranularity": "day",
    21. "queryGranularity": "none",
    22. "intervals": [
    23. ]
    24. }
    25. }

    dataSource

    The dataSource is located in dataSchema → dataSource and is simply the name of the datasource that data will be written to. An example dataSource is:

    1. "dataSource": "my-first-datasource"

    timestampSpec

    The timestampSpec is located in dataSchema → timestampSpec and is responsible for configuring the primary timestamp. An example timestampSpec is:

    1. "timestampSpec": {
    2. "column": "timestamp",
    3. "format": "auto"
    4. }

    Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.

    A timestampSpec can have the following components:

    • column: The input column to read the primary timestamp from. (default: timestamp)
    • format: The timestamp format, e.g. iso, millis, posix, or auto. (default: auto)
    • missingValue: The timestamp to use for input records that have no timestamp column at all. (default: none)

    dimensionsSpec

    The dimensionsSpec is located in dataSchema → dimensionsSpec and is responsible for configuring dimensions. An example dimensionsSpec, taken from the dataSchema example above, is:
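    "dimensionsSpec": {
      "dimensions": [
        "page",
        "language",
        { "type": "long", "name": "userId" }
      ]
    }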

    A dimensionsSpec can have the following components:

    • dimensions: A list of dimension names or objects. Cannot have the same column in both dimensions and dimensionExclusions.

      If this and spatialDimensions are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in dimensionExclusions as String-typed dimension columns. See inclusions and exclusions below for details. (default: [])

    • dimensionExclusions: The names of dimensions to exclude from ingestion. Only names are supported here, not objects.

      This list is only used if the dimensions and spatialDimensions lists are both null or empty arrays; otherwise it is ignored. See inclusions and exclusions below for details. (default: [])

    • spatialDimensions: An array of spatial dimensions. (default: [])

    Dimension objects

    Each dimension in the dimensions list can either be a name or an object. Providing a name is equivalent to providing a string type dimension object with the given name, e.g. "page" is equivalent to {"name": "page", "type": "string"}.
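    For instance, under that rule the following two dimensions lists describe the same schema; the column names are taken from the example above:

    "dimensions": ["page", "language"]

    "dimensions": [
      { "type": "string", "name": "page" },
      { "type": "string", "name": "language" }
    ]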

    Inclusions and exclusions

    Druid will interpret a dimensionsSpec in two possible ways: normal or schemaless.

    Normal interpretation occurs when either dimensions or spatialDimensions is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of dimensionExclusions will be ignored.

    Schemaless interpretation occurs when both dimensions and spatialDimensions are empty or null. In this case, the set of dimensions is determined in the following way:

    1. First, start from the set of all root-level fields from the input record, as determined by the input format. “Root-level” includes all fields at the top level of a data structure, but does not include fields nested within maps or lists. To extract these, you must use a flattenSpec. All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
    2. If a flattenSpec is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.
    3. Any field listed in dimensionExclusions is excluded.
    4. The field listed as column in the timestampSpec is excluded.
    5. Any field used as an input to an aggregator from the metricsSpec is excluded.
    6. Any field with the same name as an aggregator from the metricsSpec is excluded.
    7. All other fields are ingested as string typed dimensions with the default settings.

    Note: Fields generated by a transformSpec are not currently considered candidates for schemaless dimension interpretation.
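    As an illustration of schemaless interpretation, a dimensionsSpec sketched like the following (the excluded column name is illustrative) leaves both dimensions and spatialDimensions empty, so Druid ingests every remaining root-level field as a string dimension except userId:

    "dimensionsSpec": {
      "dimensions": [],
      "dimensionExclusions": ["userId"]
    }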

    metricsSpec

    The metricsSpec is located in dataSchema → metricsSpec and is a list of aggregators to apply at ingestion time. This is most useful when rollup is enabled, since it’s how you configure ingestion-time aggregation.

    An example metricsSpec is:

    1. "metricsSpec": [
    2. { "type": "count", "name": "count" },
    3. { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
    4. { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
    5. ]

    granularitySpec

    The granularitySpec is located in dataSchema → granularitySpec and is responsible for configuring the following operations:

    1. Partitioning a datasource into time chunks (via segmentGranularity).
    2. Truncating the timestamp, if desired (via queryGranularity).
    3. Specifying which time chunks of segments should be created, for batch ingestion (via intervals).
    4. Specifying whether ingestion-time rollup should be used or not (via rollup).

    Other than rollup, these operations are all based on the primary timestamp.

    An example granularitySpec is:

    1. "granularitySpec": {
    2. "segmentGranularity": "day",
    3. "queryGranularity": "none",
    4. "intervals": [
    5. "2013-08-31/2013-09-01"
    6. ],
    7. "rollup": true
    8. }

    A granularitySpec can have the following components:

    • type: Either uniform or arbitrary. In most cases you want to use uniform. (default: uniform)

    • segmentGranularity: Time chunking granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to day, the events of the same day fall into the same time chunk, which can optionally be further partitioned into multiple segments based on other configurations and input size. Any granularity can be provided here. Note that all segments in the same time chunk should have the same segment granularity.

      Ignored if type is set to arbitrary. (default: day)

    • queryGranularity: The resolution of timestamp storage within each segment. This must be equal to, or finer than, segmentGranularity. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of minute means that records are stored at minutely granularity and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc.).

      Any granularity can be provided here. Use none to store timestamps as-is, without any truncation. Note that rollup will still be applied if it is set, even when queryGranularity is set to none. (default: none)

    • rollup: Whether to use ingestion-time rollup or not. Note that rollup is still effective even when queryGranularity is set to none; in that case, your data will be rolled up if rows have exactly the same timestamp. (default: true)

    • intervals: A list of intervals describing which time chunks of segments should be created. If type is set to uniform, this list will be broken up and rounded off based on the segmentGranularity. If type is set to arbitrary, this list will be used as-is.

      If null or not provided, batch ingestion tasks will generally determine which time chunks to output based on the timestamps found in the input data.

      If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.

      Ignored for any form of streaming ingestion. (default: null)
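    As a concrete illustration of queryGranularity truncation (the timestamp value is illustrative): with queryGranularity set to minute, an input timestamp is truncated to the start of its minute before storage:

    2013-08-31T01:02:33.000Z  becomes  2013-08-31T01:02:00.000Z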

    transformSpec

    The transformSpec is located in dataSchema → transformSpec and is responsible for transforming and filtering records during ingestion time. It is optional. An example transformSpec is:

    1. "transformSpec": {
    2. "transforms": [
    3. { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
    4. ],
    5. "filter": {
    6. "type": "selector",
    7. "dimension": "country",
    8. "value": "San Serriffe"
    9. }

    Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.

    Transforms

    The transforms list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a “name” which can be referred to by your dimensionsSpec, metricsSpec, etc.

    If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that shadow fields may still refer to the fields they shadow. This can be used to transform a field “in-place”.

    Transforms can refer to the timestamp of an input row by referring to __time as part of the expression. They can also replace the timestamp if you set their “name” to __time. In both cases, __time should be treated as a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied after the timestampSpec.
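    For example, a transform sketched as follows would replace each row’s timestamp by shifting it forward one hour; the 3600000-millisecond offset is purely illustrative:

    { "type": "expression", "name": "__time", "expression": "__time + 3600000" }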

    Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
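    {
      "type": "expression",
      "name": "<output name>",
      "expression": "<expr>"
    }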

    The expression is a Druid query expression.

    Filter

    The filter conditionally filters input rows during ingestion. Only rows that pass the filter will be ingested. Any of Druid’s standard query filters can be used. Note that within a transformSpec, the transforms are applied before the filter, so the filter can refer to a transform.
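    For instance, building on the transformSpec example above, the following sketch filters on the countryUpper transform rather than the raw country column; the value is illustrative:

    "transformSpec": {
      "transforms": [
        { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
      ],
      "filter": {
        "type": "selector",
        "dimension": "countryUpper",
        "value": "SAN SERRIFFE"
      }
    }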

    Legacy dataSchema spec

    The dataSchema spec has been changed in 0.17.0. The new spec is supported by all ingestion methods except for Hadoop ingestion. See the dataSchema section above for the new spec.

    The legacy dataSchema spec has the following two components in addition to the ones listed in the section above.

    parser (Deprecated)

    In the legacy dataSchema, the parser is located in dataSchema → parser and is responsible for configuring a wide variety of items related to parsing input records. The parser is deprecated and it is highly recommended to use inputFormat instead. For details about inputFormat and supported parser types, see the data formats documentation.

    For details about major components of the parseSpec, refer to their subsections:

    • timestampSpec
    • dimensionsSpec
    • flattenSpec

    An example parser is:

    1. "parser": {
    2. "type": "string",
    3. "parseSpec": {
    4. "format": "json",
    5. "flattenSpec": {
    6. "useFieldDiscovery": true,
    7. "fields": [
    8. { "type": "path", "name": "userId", "expr": "$.user.id" }
    9. ]
    10. },
    11. "timestampSpec": {
    12. "column": "timestamp",
    13. "format": "auto"
    14. },
    15. "dimensionsSpec": {
    16. "dimensions": [
    17. { "page" },
    18. { "language" },
    19. { "type": "long", "name": "userId" }
    20. ]
    21. }
    22. }
    23. }

    flattenSpec

    In the legacy dataSchema, the flattenSpec is located in dataSchema → parser → parseSpec → flattenSpec and is responsible for bridging the gap between potentially nested input data (such as JSON, Avro, etc.) and Druid’s flat data model. See Flatten spec for more details.
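    Pulled out of the parser example above, a flattenSpec on its own looks like this:

    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "type": "path", "name": "userId", "expr": "$.user.id" }
      ]
    }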

    ioConfig

    The ioConfig influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted filesystem, or any other supported source system. The inputFormat property applies to all ingestion methods except for Hadoop ingestion. Hadoop ingestion still uses the parser in the legacy dataSchema. The rest of ioConfig is specific to each individual ingestion method. An example ioConfig to read JSON data is:

    1. "ioConfig": {
    2. "type": "<ingestion-method-specific type code>",
    3. "inputFormat": {
    4. "type": "json"
    5. },
    6. ...
    7. }

    For more details, see the documentation provided by each ingestion method.

    tuningConfig

    Tuning properties are specified in a tuningConfig, which goes at the top level of an ingestion spec. Some properties apply to all ingestion methods, but most are specific to each individual ingestion method. An example tuningConfig that sets all of the shared, common properties to their defaults is:

    1. "tuningConfig": {
    2. "type": "<ingestion-method-specific type code>",
    3. "maxRowsInMemory": 1000000,
    4. "maxBytesInMemory": <one-sixth of JVM memory>,
    5. "indexSpec": {
    6. "bitmap": { "type": "roaring" },
    7. "dimensionCompression": "lz4",
    8. "metricCompression": "lz4",
    9. "longEncoding": "longs"
    10. },
    11. <other ingestion-method-specific properties>
    12. }

    indexSpec

    The indexSpec object can include the following properties:

    • bitmap: Compression format for bitmap indexes. Should be a JSON object with type set to roaring or concise. For type roaring, the boolean property compressRunOnSerialization (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient. (default: {"type": "roaring"})

    • dimensionCompression: Compression format for dimension columns. Options are lz4, lzf, or uncompressed. (default: lz4)

    • metricCompression: Compression format for primitive type metric columns. Options are lz4, lzf, uncompressed, or none (which is more efficient than uncompressed, but not supported by older versions of Druid). (default: lz4)

    • longEncoding: Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are auto or longs. auto encodes the values using an offset or lookup table depending on column cardinality, and stores them with variable size. longs stores the value as-is with 8 bytes each. (default: longs)
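    As a sketch of how these properties fit together, the following indexSpec switches bitmap indexes to concise while keeping the other defaults listed above:

    "indexSpec": {
      "bitmap": { "type": "concise" },
      "dimensionCompression": "lz4",
      "metricCompression": "lz4",
      "longEncoding": "longs"
    }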