Ingestion spec reference
Ingestion specs consist of three main components:
- dataSchema, which configures the datasource name, primary timestamp, dimensions, metrics, and transforms and filters (if needed).
- ioConfig, which tells Druid how to connect to the source system and how to parse data. For more information, see the documentation for each ingestion method.
- tuningConfig, which controls various tuning parameters specific to each ingestion method.
Example ingestion spec for task type index_parallel (native batch):
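A minimal sketch of such a spec, assembled from the dataSchema, ioConfig, and tuningConfig fragments shown later in this reference. The outer type/spec wrapper and the inputSource block (a local source with a hypothetical baseDir and filter) are assumptions for illustration and are not taken from this page; adapt them to your own task and source system.

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": {
        "column": "timestamp",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": [
          "page",
          "language",
          { "type": "long", "name": "userId" }
        ]
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
      ],
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": [
          "2013-08-31/2013-09-01"
        ]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/input",
        "filter": "wikipedia_data.json"
      },
      "inputFormat": {
        "type": "json"
      }
    },
    "tuningConfig": {
      "type": "index_parallel"
    }
  }
}
```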
The specific options supported by these sections will depend on the ingestion method you have chosen. For more examples, refer to the documentation for each ingestion method.
You can also load data visually, without the need to write an ingestion spec, using the “Load data” functionality available in Druid’s web console. Druid’s visual data loader supports Kafka, Kinesis, and native batch mode.
dataSchema
The dataSchema is a holder for the following components:
- datasource name
- primary timestamp
- dimensions
- metrics
- transforms and filters (if needed).
An example dataSchema is:
"dataSchema": {
  "dataSource": "wikipedia",
  "timestampSpec": {
    "column": "timestamp",
    "format": "auto"
  },
  "dimensionsSpec": {
    "dimensions": [
      "page",
      "language",
      { "type": "long", "name": "userId" }
    ]
  },
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
  ],
  "granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "none",
    "intervals": [
    ]
  }
}
dataSource
The dataSource is located in dataSchema → dataSource and is simply the name of the datasource that data will be written to. An example dataSource is:
"dataSource": "my-first-datasource"
timestampSpec
The timestampSpec is located in dataSchema → timestampSpec and is responsible for configuring the primary timestamp. An example timestampSpec is:
"timestampSpec": {
  "column": "timestamp",
  "format": "auto"
}
Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.
A timestampSpec can have the following components:
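As a sketch of the components a timestampSpec typically carries: column names the input field to read the timestamp from, and format names how to parse it. The missingValue property is an assumption here, since it does not appear elsewhere in this section; it is meant to supply a fallback timestamp for records that have none. The field name event_time and the fallback value are hypothetical.

```json
{
  "timestampSpec": {
    "column": "event_time",
    "format": "iso",
    "missingValue": "2000-01-01T00:00:00.000Z"
  }
}
```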
You can use the timestamp in a transform expression as __time because Druid parses the timestampSpec before applying transforms. You can also set the expression name to __time to replace the value of the timestamp.
Treat __time as a millisecond timestamp: the number of milliseconds since Jan 1, 1970 at midnight UTC.
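A minimal sketch of replacing the timestamp through a transform, assuming simple arithmetic on __time is enough for your case. Naming the transform __time overwrites the parsed timestamp, and because the value is in milliseconds, adding 3600000 shifts every row forward by one hour. The offset is arbitrary and chosen only for illustration.

```json
{
  "transformSpec": {
    "transforms": [
      { "type": "expression", "name": "__time", "expression": "__time + 3600000" }
    ]
  }
}
```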
dimensionsSpec
The dimensionsSpec is located in dataSchema → dimensionsSpec and is responsible for configuring dimensions. An example dimensionsSpec is:
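A minimal sketch of a dimensionsSpec, reusing the dimension list from the dataSchema example above and leaving the exclusion and spatial lists at their empty defaults:

```json
{
  "dimensionsSpec": {
    "dimensions": [
      "page",
      "language",
      { "type": "long", "name": "userId" }
    ],
    "dimensionExclusions": [],
    "spatialDimensions": []
  }
}
```

Because dimensions is non-empty here, dimensionExclusions would be ignored even if it listed names.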
Field | Description | Default |
---|---|---|
dimensions | A list of dimension names or objects. You cannot include the same column in both dimensions and dimensionExclusions. If dimensions and spatialDimensions are both null or empty arrays, Druid treats all columns other than timestamp or metrics that do not appear in dimensionExclusions as String-typed dimension columns. See inclusions and exclusions for details. As a best practice, put the most frequently filtered dimensions at the beginning of the dimensions list. In this case, it would also be good to consider partitioning by those same dimensions. | [] |
dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects. This list is only used if the dimensions and spatialDimensions lists are both null or empty arrays; otherwise it is ignored. See inclusions and exclusions below for details. | [] |
spatialDimensions | An array of spatial dimensions. | [] |
includeAllDimensions | Set includeAllDimensions to true to ingest both the explicit dimensions in the dimensions field and other dimensions that the ingestion task discovers from input data. In this case, the explicit dimensions appear first, in the order that you specify them, and the dynamically discovered dimensions come after. This flag can be especially useful with auto schema discovery using flattenSpec. If this is not set and the dimensions field is not empty, Druid ingests only the explicit dimensions. If this is not set and the dimensions field is empty, all discovered dimensions are ingested. | false |
Dimension objects
Each dimension in the dimensions list can either be a name or an object. Providing a name is equivalent to providing a string type dimension object with the given name, e.g. "page" is equivalent to {"name": "page", "type": "string"}.
Dimension objects can have the following components:
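As a sketch, the list below mixes a bare name with explicit dimension objects. The type and name properties come from the examples in this reference; createBitmapIndex is an assumption that is not described in this section, shown here only to illustrate that an object can carry settings a bare name cannot. The dimension name comment is hypothetical.

```json
{
  "dimensionsSpec": {
    "dimensions": [
      "page",
      { "type": "string", "name": "language" },
      { "type": "long", "name": "userId" },
      { "type": "string", "name": "comment", "createBitmapIndex": false }
    ]
  }
}
```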
Inclusions and exclusions
Druid will interpret a dimensionsSpec in two possible ways: normal or schemaless.
Normal interpretation occurs when either dimensions or spatialDimensions is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of dimensionExclusions will be ignored.
Schemaless interpretation occurs when both dimensions and spatialDimensions are empty or null. In this case, the set of dimensions is determined in the following way:
- First, start from the set of all root-level fields from the input record, as determined by the inputFormat. “Root-level” includes all fields at the top level of a data structure, but does not include fields nested within maps or lists. To extract these, you must use a flattenSpec. All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
- If a flattenSpec is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.
- Any field listed in dimensionExclusions is excluded.
- The field listed as column in the timestampSpec is excluded.
- Any field used as an input to an aggregator from the metricsSpec is excluded.
- Any field with the same name as an aggregator from the metricsSpec is excluded.
- All other fields are ingested as string typed dimensions with the default settings.
Note: Fields generated by a transformSpec are not currently considered candidates for schemaless dimension interpretation.
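A minimal sketch of a dimensionsSpec that triggers schemaless interpretation. Both dimension lists are empty, so the rules above apply and dimensionExclusions takes effect. The excluded field name session_token is hypothetical.

```json
{
  "dimensionsSpec": {
    "dimensions": [],
    "spatialDimensions": [],
    "dimensionExclusions": ["session_token"]
  }
}
```

With this spec, every remaining root-level field that is not the timestamp column, an aggregator input, or an aggregator name would be ingested as a string dimension.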
metricsSpec
The metricsSpec is located in dataSchema → metricsSpec and is a list of aggregators to apply at ingestion time. This is most useful when rollup is enabled, since it’s how you configure ingestion-time aggregation.
An example metricsSpec is:
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
]
granularitySpec
The granularitySpec is located in dataSchema → granularitySpec and is responsible for configuring the following operations:
- Partitioning a datasource into time chunks (via segmentGranularity).
- Truncating the timestamp, if desired (via queryGranularity).
- Specifying which time chunks of segments should be created, for batch ingestion (via intervals).
- Specifying whether ingestion-time rollup should be used or not (via rollup).
Other than rollup, these operations are all based on the primary timestamp.
An example granularitySpec is:
"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "none",
  "intervals": [
    "2013-08-31/2013-09-01"
  ],
  "rollup": true
}
A granularitySpec can have the following components:
Field | Description | Default |
---|---|---|
type | uniform | uniform |
segmentGranularity | Time chunking granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to day, the events of the same day fall into the same time chunk, which can optionally be further partitioned into multiple segments based on other configurations and input size. Any granularity can be provided here. Note that all segments in the same time chunk should have the same segment granularity. | day |
queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer than, segmentGranularity. This is the finest granularity that you can query at and still receive sensible results, but you can still query at anything coarser than this granularity. For example, a value of minute means that records are stored at minutely granularity and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc.). Any granularity can be provided here. Use none to store timestamps as-is, without any truncation. Note that rollup will be applied if it is set even when the queryGranularity is set to none. | none |
rollup | Whether to use ingestion-time rollup or not. Note that rollup is still effective even when queryGranularity is set to none. Your data will be rolled up if rows have exactly the same timestamp. | true |
intervals | A list of intervals defining time chunks for segments. Specify interval values using ISO8601 format. For example, ["2021-12-06T21:27:10+00:00/2021-12-07T00:00:00+00:00"]. If you omit the time, the time defaults to "00:00:00". Druid breaks the list up and rounds off the list values based on the segmentGranularity. If null or not provided, batch ingestion tasks generally determine which time chunks to output based on the timestamps found in the input data. If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks throw away any records with timestamps outside of the specified intervals. Ignored for any form of streaming ingestion. | null |
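A sketch of a granularitySpec that leans on the behavior described in the table. The queryGranularity (minute) is finer than the segmentGranularity (hour), as required, and intervals is omitted so that a batch task would determine time chunks from the timestamps in the input data. The specific granularities are chosen only for illustration.

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "hour",
    "queryGranularity": "minute",
    "rollup": true
  }
}
```

With rollup enabled, rows whose timestamps truncate to the same minute become candidates for being combined at ingestion time.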
transformSpec
The transformSpec is located in dataSchema → transformSpec and is responsible for transforming and filtering records during ingestion time. It is optional. An example transformSpec is:
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
  ],
  "filter": {
    "type": "selector",
    "dimension": "country",
    "value": "San Serriffe"
  }
}
Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.
Transforms
If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that shadow fields may still refer to the fields they shadow. This can be used to transform a field “in-place”.
Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular, they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field with another field containing all nulls, which will act similarly to removing the field.
Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
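Judging from the transformSpec example earlier in this section, an expression transform takes roughly the following shape. The angle-bracket strings are placeholders for values you supply, not literal values.

```json
{
  "type": "expression",
  "name": "<output name>",
  "expression": "<expr>"
}
```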
The expression is a Druid query expression.
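As a sketch of the in-place pattern described above, the transform below reuses the input field's own name, so the upper-cased value shadows the original country field. The field name follows the transformSpec example in this section.

```json
{
  "transformSpec": {
    "transforms": [
      { "type": "expression", "name": "country", "expression": "upper(country)" }
    ]
  }
}
```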
Filter
The filter conditionally filters input rows during ingestion. Only rows that pass the filter will be ingested. Any of Druid’s standard query filters can be used. Note that within a transformSpec, the transforms are applied before the filter, so the filter can refer to a transform.
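A minimal sketch of a filter that refers to a transform, building on the transformSpec example above. The selector filter matches on countryUpper, which exists only as the output of the transform, so the value to match is the upper-cased form.

```json
{
  "transformSpec": {
    "transforms": [
      { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
    ],
    "filter": {
      "type": "selector",
      "dimension": "countryUpper",
      "value": "SAN SERRIFFE"
    }
  }
}
```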
Legacy dataSchema spec
The dataSchema spec has been changed in 0.17.0. The new spec is supported by all ingestion methods except for Hadoop ingestion. See dataSchema for the new spec.
The legacy dataSchema spec has the following two components in addition to the ones listed in the dataSchema section above.
- input row parser, flattening of nested data (if needed)
parser (Deprecated)
In legacy dataSchema, the parser is located in dataSchema → parser and is responsible for configuring a wide variety of items related to parsing input records. The parser is deprecated and it is highly recommended to use inputFormat instead. For details about inputFormat and supported parser types, see the “Data formats” page.
For details about major components of the parseSpec, refer to their subsections:
- timestampSpec, responsible for configuring the primary timestamp.
- dimensionsSpec, responsible for configuring dimensions.
- flattenSpec, responsible for flattening nested data formats.
An example parser is:
"parser": {
  "type": "string",
  "parseSpec": {
    "format": "json",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "type": "path", "name": "userId", "expr": "$.user.id" }
      ]
    },
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": [
        "page",
        "language",
        { "type": "long", "name": "userId" }
      ]
    }
  }
}
flattenSpec
In the legacy dataSchema, the flattenSpec is located in dataSchema → parser → parseSpec → flattenSpec and is responsible for bridging the gap between potentially nested input data (such as JSON, Avro, etc.) and Druid’s flat data model. See the flattenSpec documentation for more details.
ioConfig
The ioConfig influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted filesystem, or any other supported source system. The inputFormat property applies to all ingestion methods except for Hadoop ingestion. Hadoop ingestion still uses the parser in the legacy dataSchema. The rest of ioConfig is specific to each individual ingestion method. An example ioConfig to read JSON data is:
"ioConfig": {
  "type": "<ingestion-method-specific type code>",
  "inputFormat": {
    "type": "json"
  },
  ...
}
For more details, see the documentation provided by each ingestion method.
tuningConfig
Tuning properties are specified in a tuningConfig, which goes at the top level of an ingestion spec. Some properties apply to all ingestion methods, but most are specific to each individual ingestion method. An example tuningConfig that sets all of the shared, common properties to their defaults is:
"tuningConfig": {
  "type": "<ingestion-method-specific type code>",
  "maxRowsInMemory": 1000000,
  "maxBytesInMemory": <one-sixth of JVM memory>,
  "indexSpec": {
    "bitmap": { "type": "roaring" },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4",
    "longEncoding": "longs"
  },
  <other ingestion-method-specific properties>
}
indexSpec
Field | Description | Default |
---|---|---|
bitmap | Compression format for bitmap indexes. Should be a JSON object with type set to roaring or concise. For type roaring, the boolean property compressRunOnSerialization (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient. | {"type": "roaring"} |
dimensionCompression | Compression format for dimension columns. Options are lz4 , lzf , or uncompressed . | lz4 |
metricCompression | Compression format for primitive type metric columns. Options are lz4 , lzf , uncompressed , or none (which is more efficient than uncompressed , but not supported by older versions of Druid). | lz4 |
longEncoding | Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are auto or longs. auto encodes the values using an offset or a lookup table, depending on column cardinality, and stores them with variable size. longs stores the value as-is with 8 bytes each. | longs |
Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each ingestion method for details.
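A sketch of a tuningConfig that overrides one indexSpec default, using only option values listed in the indexSpec table. The longEncoding is switched from longs to auto, and compressRunOnSerialization is spelled out at its stated default. The index_parallel type code is an assumption for a native batch task; substitute the type code for your ingestion method.

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "indexSpec": {
      "bitmap": { "type": "roaring", "compressRunOnSerialization": true },
      "dimensionCompression": "lz4",
      "metricCompression": "lz4",
      "longEncoding": "auto"
    }
  }
}
```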