Data formats

This page lists all default and core extension data formats supported by Druid. For additional data formats supported with community extensions, please see our community extensions list.

The following samples show data formats that are natively supported in Druid:

JSON

CSV

  2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true","false","false","article","North America","United States","Bay Area","San Francisco",57,200,-143
  2013-08-31T03:32:45Z,"Striker Eureka","en","speed","false","true","true","false","wikipedia","Australia","Australia","Cantebury","Syndey",459,129,330
  2013-08-31T07:11:21Z,"Cherno Alpha","ru","masterYi","false","true","true","false","article","Asia","Russia","Oblast","Moscow",123,12,111
  2013-08-31T11:58:39Z,"Crimson Typhoon","zh","triplets","true","false","true","false","wikipedia","Asia","China","Shanxi","Taiyuan",905,5,900
  2013-08-31T12:41:27Z,"Coyote Tango","ja","cancer","true","false","true","false","wikipedia","Asia","Japan","Kanto","Tokyo",1,10,-9

TSV (Delimited)

  2013-08-31T01:02:33Z "Gypsy Danger" "en" "nuclear" "true" "true" "false" "false" "article" "North America" "United States" "Bay Area" "San Francisco" 57 200 -143
  2013-08-31T03:32:45Z "Striker Eureka" "en" "speed" "false" "true" "true" "false" "wikipedia" "Australia" "Australia" "Cantebury" "Syndey" 459 129 330
  2013-08-31T07:11:21Z "Cherno Alpha" "ru" "masterYi" "false" "true" "true" "false" "article" "Asia" "Russia" "Oblast" "Moscow" 123 12 111
  2013-08-31T11:58:39Z "Crimson Typhoon" "zh" "triplets" "true" "false" "true" "false" "wikipedia" "Asia" "China" "Shanxi" "Taiyuan" 905 5 900
  2013-08-31T12:41:27Z "Coyote Tango" "ja" "cancer" "true" "false" "true" "false" "wikipedia" "Asia" "Japan" "Kanto" "Tokyo" 1 10 -9

Note that the CSV and TSV data do not contain column headers. This becomes important when you specify the data for ingestion.

Besides text formats, Druid also supports binary formats such as ORC and Parquet.

Druid supports custom data formats and can use the Regex parser or the JavaScript parsers to parse these formats. Please note that using any of these parsers for parsing data will not be as efficient as writing a native Java parser or using an external stream processor. We welcome contributions of new Parsers.

All forms of Druid ingestion require some form of schema object. The format of the data to be ingested is specified using the inputFormat entry in your ioConfig.

Configure the JSON inputFormat to load JSON data as follows:

For example:

  "ioConfig": {
    "inputFormat": {
      "type": "json"
    },
    ...
  }

CSV

Configure the CSV inputFormat to load CSV data as follows:

Field | Type | Description | Required
type | String | This should say csv. | yes
listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A)
columns | JSON array | Specifies the columns of the data. The columns should be in the same order as the columns of your data. | yes if findColumnsFromHeader is false or missing
findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that skipHeaderRows will be applied before finding column names from the header. For example, if you set skipHeaderRows to 2 and findColumnsFromHeader to true, the task will skip the first two lines and then extract column information from the third line. columns will be ignored if this is set to true. | no (default = false if columns is set; otherwise null)
skipHeaderRows | Integer | If this is set, the task will skip the first skipHeaderRows rows. | no (default = 0)

For example:

  "ioConfig": {
    "inputFormat": {
      "type": "csv",
      "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"]
    },
    ...
  }
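
The interaction between skipHeaderRows and findColumnsFromHeader can be easier to follow in code. The following Python sketch only mimics the documented order of operations (skip rows first, then read the header); it is not Druid's implementation, and the function name read_csv_rows and the sample text are made up for illustration.

```python
import csv
import io


def read_csv_rows(text, columns=None, find_columns_from_header=False, skip_header_rows=0):
    """Return one dict per data row, following the documented option order."""
    rows = csv.reader(io.StringIO(text))
    # skipHeaderRows is applied first.
    for _ in range(skip_header_rows):
        next(rows, None)
    # Then, if requested, the next row supplies the column names and
    # any explicit `columns` list is ignored.
    if find_columns_from_header:
        columns = next(rows, None)
    if columns is None:
        raise ValueError("no column names: set columns or findColumnsFromHeader")
    return [dict(zip(columns, row)) for row in rows]


# Hypothetical input: two junk lines, a header line, then one data line.
sample = (
    "# exported 2013-08-31\n"
    "# source: wikipedia\n"
    "timestamp,page,added\n"
    '2013-08-31T01:02:33Z,"Gypsy Danger",57\n'
)
records = read_csv_rows(sample, find_columns_from_header=True, skip_header_rows=2)
print(records)
```

With skip_header_rows=2 the two comment lines are dropped and the third line becomes the header, which matches the example given in the findColumnsFromHeader description.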

TSV (Delimited)

Configure the TSV inputFormat to load TSV data as follows:

Field | Type | Description | Required
type | String | This should say tsv. | yes
delimiter | String | A custom delimiter for data values. | no (default = \t)
listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A)
columns | JSON array | Specifies the columns of the data. The columns should be in the same order as the columns of your data. | yes if findColumnsFromHeader is false or missing
findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that skipHeaderRows will be applied before finding column names from the header. For example, if you set skipHeaderRows to 2 and findColumnsFromHeader to true, the task will skip the first two lines and then extract column information from the third line. columns will be ignored if this is set to true. | no (default = false if columns is set; otherwise null)
skipHeaderRows | Integer | If this is set, the task will skip the first skipHeaderRows rows. | no (default = 0)

Be sure to change the delimiter to the appropriate delimiter for your data. Like CSV, you must specify the columns and which subset of the columns you want indexed.

For example:

  "ioConfig": {
    "inputFormat": {
      "type": "tsv",
      "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
      "delimiter":"|"
    },
    ...
  }
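
To illustrate how delimiter and listDelimiter relate, here is a simplified Python sketch. It assumes unquoted values and treats ctrl+A as the character \u0001; the function name parse_delimited_line and the sample line are hypothetical, and real ingestion handles quoting and edge cases that this sketch ignores.

```python
def parse_delimited_line(line, columns, delimiter="\t", list_delimiter="\u0001"):
    """Split one line into named values; values containing list_delimiter become lists."""
    values = line.rstrip("\n").split(delimiter)
    record = {}
    for name, value in zip(columns, values):
        parts = value.split(list_delimiter)
        # A single part is a plain value; several parts form a multi-value dimension.
        record[name] = parts if len(parts) > 1 else value
    return record


# Hypothetical pipe-delimited line whose last column holds two values.
row = parse_delimited_line(
    "2013-08-31T01:02:33Z|Gypsy Danger|en\u0001fr",
    ["timestamp", "page", "language"],
    delimiter="|",
)
print(row)
```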

ORC

To use the ORC input format, load the Druid ORC extension (druid-orc-extensions).

To upgrade from versions earlier than 0.15.0 to 0.15.0 or newer, read Migration from ‘contrib’ extension.

Configure the ORC inputFormat to load ORC data as follows:

Field | Type | Description | Required
type | String | This should say orc. | yes
flattenSpec | JSON Object | Specifies flattening configuration for nested ORC data. See flattenSpec for more info. | no
binaryAsString | Boolean | Specifies if the binary ORC column which is not logically marked as a string should be treated as a UTF-8 encoded string. | no (default = false)

For example:

  "ioConfig": {
    "inputFormat": {
      "type": "orc",
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {
            "type": "path",
            "name": "nested",
            "expr": "$.path.to.nested"
          }
        ]
      },
      "binaryAsString": false
    },
    ...
  }

Parquet

To use the Parquet input format, load the Druid Parquet extension (druid-parquet-extensions).

Configure the Parquet inputFormat to load Parquet data as follows:

Field | Type | Description | Required
type | String | This should be set to parquet to read Parquet files. | yes
flattenSpec | JSON Object | Define a flattenSpec to extract nested values from a Parquet file. Note that only ‘path’ expressions are supported (‘jq’ is unavailable). | no (default will auto-discover ‘root’ level properties)
binaryAsString | Boolean | Specifies if the bytes Parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false)

For example:

  "ioConfig": {
    "inputFormat": {
      "type": "parquet",
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {
            "type": "path",
            "name": "nested",
            "expr": "$.path.to.nested"
          }
        ]
      },
      "binaryAsString": false
    },
    ...
  }

Avro Stream

To use the Avro Stream input format, load the Druid Avro extension (druid-avro-extensions).

For more information on how Druid handles Avro types, see the Avro Types section of the Avro extension documentation.

Configure the Avro inputFormat to load Avro data as follows:

Field | Type | Description | Required
type | String | This should be set to avro_stream to read Avro serialized data. | yes
flattenSpec | JSON Object | Define a flattenSpec to extract nested values from an Avro record. Note that only ‘path’ expressions are supported (‘jq’ is unavailable). | no (default will auto-discover ‘root’ level properties)
avroBytesDecoder | JSON Object | Specifies how to decode bytes to Avro record. | yes
binaryAsString | Boolean | Specifies if the bytes Avro column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false)

For example:

  "ioConfig": {
    "inputFormat": {
      "type": "avro_stream",
      "avroBytesDecoder": {
        "type": "schema_inline",
        "schema": {
          //your schema goes here, for example
          "namespace": "org.apache.druid.data",
          "name": "User",
          "type": "record",
          "fields": [
            { "name": "FullName", "type": "string" },
            { "name": "Country", "type": "string" }
          ]
        }
      },
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {
            "type": "path",
            "name": "someRecord_subInt",
            "expr": "$.someRecord.subInt"
          }
        ]
      },
      "binaryAsString": false
    },
    ...
  }

Avro Bytes Decoder

If type is not included, the avroBytesDecoder defaults to schema_repo.

Inline Schema Based Avro Bytes Decoder

The “schema_inline” decoder reads Avro records using a fixed schema and does not support schema migration. If you may need to migrate schemas in the future, consider one of the other decoders, all of which use a message header that allows the parser to identify the proper Avro schema for reading records.

This decoder can be used if all the input events can be read using the same schema. In this case, specify the schema in the input task JSON itself, as described below.

  ...
  "avroBytesDecoder": {
    "type": "schema_inline",
    "schema": {
      //your schema goes here, for example
      "namespace": "org.apache.druid.data",
      "name": "User",
      "type": "record",
      "fields": [
        { "name": "FullName", "type": "string" },
        { "name": "Country", "type": "string" }
      ]
    }
  }
  ...

Multiple Inline Schemas Based Avro Bytes Decoder

Use this decoder if different input events can have different read schemas. In this case, specify the schema in the input task JSON itself, as described below.

  ...
  "avroBytesDecoder": {
    "type": "multiple_schemas_inline",
    "schemas": {
      //your id -> schema map goes here, for example
      "1": {
        "namespace": "org.apache.druid.data",
        "name": "User",
        "type": "record",
        "fields": [
          { "name": "FullName", "type": "string" },
          { "name": "Country", "type": "string" }
        ]
      },
      "2": {
        "namespace": "org.apache.druid.otherdata",
        "name": "UserIdentity",
        "type": "record",
        "fields": [
          { "name": "Name", "type": "string" },
          { "name": "Location", "type": "string" }
        ]
      },
      ...
    }
  }
  ...

Note that schemas is essentially a map from integer schema ID to Avro schema object. This decoder assumes that each record has the following format:

  • The first byte is the version and must always be 1.
  • The next 4 bytes are the integer schema ID, serialized using big-endian byte order.
  • The remaining bytes contain the serialized Avro message.
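
The record framing described above can be sketched in a few lines of Python. This is only an illustration of the byte layout, not Druid code; the function name split_framed_record and the payload bytes are invented for the example, and the schema ID is assumed to be a signed 32-bit integer.

```python
import struct


def split_framed_record(message: bytes):
    """Return (schema_id, avro_payload) for a record framed as described above."""
    if len(message) < 5:
        raise ValueError("message is shorter than the 5-byte header")
    # ">Bi": big-endian, 1 unsigned byte (version) + 4-byte signed int (schema ID).
    version, schema_id = struct.unpack(">Bi", message[:5])
    if version != 1:
        raise ValueError(f"unsupported version byte: {version}")
    return schema_id, message[5:]


# Hypothetical message: version 1, schema ID 2, followed by placeholder payload bytes.
framed = struct.pack(">Bi", 1, 2) + b"\x06abc"
schema_id, payload = split_framed_record(framed)
print(schema_id, payload)
```

The returned schema ID would then select the matching entry in the schemas map, and the remaining bytes would be handed to an Avro decoder.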

SchemaRepo Based Avro Bytes Decoder

This Avro bytes decoder first extracts subject and id from the input message bytes, and then uses them to look up the Avro schema used to decode the Avro record from bytes. For details, see the schema repo and AVRO-1124. You will need an HTTP service like schema repo to hold the Avro schema. For information on registering a schema on the message producer side, see org.apache.druid.data.input.AvroStreamInputRowParserTest#testParse().

Field | Type | Description | Required
type | String | This should say schema_repo. | no
subjectAndIdConverter | JSON Object | Specifies how to extract the subject and id from message bytes. | yes
schemaRepository | JSON Object | Specifies how to look up the Avro schema from subject and id. | yes

Avro-1124 Subject And Id Converter

This section describes the format of the subjectAndIdConverter object for the schema_repo Avro bytes decoder.

Field | Type | Description | Required
type | String | This should say avro_1124. | no
topic | String | Specifies the topic of your Kafka stream. | yes

Avro-1124 Schema Repository

This section describes the format of the schemaRepository object for the schema_repo Avro bytes decoder.

Field | Type | Description | Required
type | String | This should say avro_1124_rest_client. | no
url | String | Specifies the endpoint URL of your Avro-1124 schema repository. | yes
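
As a rough illustration of how the three tables above fit together, the following Python sketch assembles a schema_repo decoder object and prints it as JSON. The field names and type values come from the tables; the helper name schema_repo_decoder, the topic, and the URL are placeholders chosen for the example.

```python
import json


def schema_repo_decoder(topic, repo_url):
    """Assemble a schema_repo avroBytesDecoder from the documented fields."""
    return {
        "type": "schema_repo",
        "subjectAndIdConverter": {"type": "avro_1124", "topic": topic},
        "schemaRepository": {"type": "avro_1124_rest_client", "url": repo_url},
    }


# Placeholder topic and endpoint; substitute your own values.
decoder = schema_repo_decoder("example_topic", "http://schema-repo.example.com:8081")
print(json.dumps({"avroBytesDecoder": decoder}, indent=2))
```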

Confluent Schema Registry-based Avro Bytes Decoder

This Avro bytes decoder first extracts a unique id from the input message bytes, and then uses it to look up the schema in the Schema Registry used to decode the Avro record from bytes. For details, see the Schema Registry documentation.

Field | Type | Description | Required
type | String | This should say schema_registry. | no
url | String | Specifies the URL endpoint of the Schema Registry. | yes
capacity | Integer | Specifies the max size of the cache (default = Integer.MAX_VALUE). | no
urls | Array | Specifies the URL endpoints of the multiple Schema Registry instances. | yes (if url is not provided)
config | JSON | Additional configuration to send to the Schema Registry. This can be supplied via a DynamicConfigProvider. | no
headers | JSON | Headers to send to the Schema Registry. This can be supplied via a DynamicConfigProvider. | no

For a single Schema Registry instance, use the url field; for multiple instances, use urls.

Multiple Instances:

  ...
  "avroBytesDecoder" : {
    "type" : "schema_registry",
    "urls" : [<schema-registry-url-1>, <schema-registry-url-2>, ...],
    "config" : {
      "basic.auth.credentials.source": "USER_INFO",
      "basic.auth.user.info": "fred:letmein",
      "schema.registry.ssl.truststore.location": "/some/secrets/kafka.client.truststore.jks",
      "schema.registry.ssl.truststore.password": "<password>",
      "schema.registry.ssl.keystore.location": "/some/secrets/kafka.client.keystore.jks",
      "schema.registry.ssl.keystore.password": "<password>",
      "schema.registry.ssl.key.password": "<password>",
      ...
    },
    "headers": {
      "traceID" : "b29c5de2-0db4-490b-b421",
      "timeStamp" : "1577191871865",
      "druid.dynamic.config.provider":{
        "type":"mapString",
        "config":{
          "registry.header.prop.1":"value.1",
          "registry.header.prop.2":"value.2"
        }
      }
      ...
    }
  }
  ...
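
The choice between url and urls can be expressed as a small helper. The Python sketch below is an assumption-laden illustration, not part of Druid: the function name schema_registry_decoder and the example endpoints are hypothetical, and only the field names from the table above are used.

```python
import json


def schema_registry_decoder(urls, capacity=None, config=None, headers=None):
    """Build a schema_registry avroBytesDecoder for one or more endpoints."""
    if not urls:
        raise ValueError("at least one Schema Registry URL is required")
    decoder = {"type": "schema_registry"}
    # One endpoint goes in "url"; several endpoints go in "urls".
    if len(urls) == 1:
        decoder["url"] = urls[0]
    else:
        decoder["urls"] = list(urls)
    # Optional fields are only emitted when a value was given.
    for key, value in (("capacity", capacity), ("config", config), ("headers", headers)):
        if value is not None:
            decoder[key] = value
    return decoder


# Placeholder endpoints; substitute your own Schema Registry addresses.
single = schema_registry_decoder(["http://registry.example.com:8081"])
multi = schema_registry_decoder(
    ["http://registry-1.example.com:8081", "http://registry-2.example.com:8081"],
    capacity=100,
)
print(json.dumps({"avroBytesDecoder": single}, indent=2))
```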

Avro OCF

To use the Avro OCF input format, load the Druid Avro extension (druid-avro-extensions).

Configure the Avro OCF inputFormat to load Avro OCF data as follows:

For example:

  "ioConfig": {
    "inputFormat": {
      "type": "avro_ocf",
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {
            "type": "path",
            "name": "someRecord_subInt",
            "expr": "$.someRecord.subInt"
          }
        ]
      },
      "schema": {
        "namespace": "org.apache.druid.data.input",
        "type": "record",
        "fields" : [
          { "name": "timestamp", "type": "long" },
          { "name": "eventType", "type": "string" },
          { "name": "id", "type": "long" },
          { "name": "someRecord", "type": {
            "type": "record", "name": "MySubRecord", "fields": [
              { "name": "subInt", "type": "int"},
              { "name": "subLong", "type": "long"}
            ]
          }}]
      },
      "binaryAsString": false
    },
    ...
  }

Protobuf

You need to include the Druid Protobuf extension to use the Protobuf input format.

Configure the Protobuf inputFormat to load Protobuf data as follows:

Field | Type | Description | Required
type | String | This should be set to protobuf to read Protobuf serialized data. | yes
flattenSpec | JSON Object | Define a flattenSpec to extract nested values from a Protobuf record. Note that only ‘path’ expressions are supported (‘jq’ is unavailable). | no (default will auto-discover ‘root’ level properties)
protoBytesDecoder | JSON Object | Specifies how to decode bytes to Protobuf record. | yes

For example:

  "ioConfig": {
    "inputFormat": {
      "type": "protobuf",
      "protoBytesDecoder": {
        "type": "file",
        "descriptor": "file:///tmp/metrics.desc",
        "protoMessageType": "Metrics"
      },
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {
            "type": "path",
            "name": "someRecord_subInt",
            "expr": "$.someRecord.subInt"
          }
        ]
      }
    },
    ...
  }

The flattenSpec bridges the gap between potentially nested input data (such as JSON, Avro, etc) and Druid’s flat data model. It is an object within the inputFormat object.

Configure your flattenSpec as follows:

Field | Description | Default
useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by timestampSpec, transformSpec, dimensionsSpec, and metricsSpec. If false, only explicitly specified fields (see fields) will be available for use. | true
fields | Specifies the fields of interest and how they are accessed. See Field flattening specifications for more detail. | []

For example:

  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "name": "baz", "type": "root" },
      { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" },
      { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" }
    ]
  }

After Druid reads the input data records, it applies the flattenSpec before applying any other specs such as timestampSpec, transformSpec, dimensionsSpec, or metricsSpec. Keep this in mind when writing your ingestion spec.

Flattening is only supported for data formats that support nesting, including avro, json, orc, and parquet.

Field flattening specifications

Each entry in the fields list can have the following components:

Field | Description | Default
type | Options are as follows: root, referring to a field at the root level of the record (only really useful if useFieldDiscovery is false); path, referring to a field using JsonPath notation (supported by most data formats that offer nesting, including avro, json, orc, and parquet); jq, referring to a field using jackson-jq notation (only supported for the json format). | none (required)
name | Name of the field after flattening. This name can be referred to by the timestampSpec, transformSpec, dimensionsSpec, and metricsSpec. | none (required)
expr | Expression for accessing the field while flattening. For type path, this should be JsonPath. For type jq, this should be jackson-jq notation. For other types, this parameter is ignored. | none (required for types path and jq)

Notes on flattening

  • For convenience, when defining a root-level field, it is possible to define only the field name, as a string, instead of a JSON object. For example, {"name": "baz", "type": "root"} is equivalent to "baz".
  • Enabling useFieldDiscovery will only automatically detect “simple” fields at the root level that correspond to data types that Druid supports. This includes strings, numbers, and lists of strings or numbers. Other types will not be automatically detected, and must be specified explicitly in the fields list.
  • Duplicate field names are not allowed. An exception will be thrown.
  • If useFieldDiscovery is enabled, any discovered field with the same name as one already defined in the fields list will be skipped, rather than added twice.
  • An online JsonPath evaluator is useful for testing path-type expressions.
  • jackson-jq supports a subset of the full jq syntax. Please refer to the jackson-jq documentation for details.
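
The notes above can be made concrete with a small Python sketch of flattening. It is a simplified model under stated assumptions, not Druid's implementation: it supports only root and path field types, its eval_path helper handles just dotted keys and [n] indexes rather than full JsonPath, and all function names and the sample record are invented for illustration.

```python
import re


def eval_path(record, expr):
    """Evaluate a tiny JsonPath subset: $.a.b and [n] indexing only."""
    if not expr.startswith("$"):
        raise ValueError("path expressions must start with $")
    current = record
    for key, index in re.findall(r"\.([A-Za-z_]\w*)|\[(\d+)\]", expr[1:]):
        try:
            current = current[key] if key else current[int(index)]
        except (KeyError, IndexError, TypeError):
            return None
    return current


def is_simple(value):
    """Strings, numbers, and lists of strings or numbers count as simple."""
    scalar = (str, int, float)
    if isinstance(value, scalar):
        return True
    return isinstance(value, list) and all(isinstance(v, scalar) for v in value)


def flatten(record, spec):
    out = {}
    for field in spec.get("fields", []):
        if isinstance(field, str):  # "baz" is shorthand for a root field
            field = {"name": field, "type": "root"}
        name = field["name"]
        if name in out:
            raise ValueError(f"duplicate field name: {name}")
        if field["type"] == "root":
            out[name] = record.get(name)
        elif field["type"] == "path":
            out[name] = eval_path(record, field["expr"])
        else:
            raise ValueError("this sketch only handles root and path")
    if spec.get("useFieldDiscovery", True):
        for name, value in record.items():
            # Discovered fields never overwrite explicitly defined ones.
            if name not in out and is_simple(value):
                out[name] = value
    return out


record = {"baz": 1, "foo": {"bar": "x"}, "tags": ["a", "b"], "thing": {"food": ["bread", "rice"]}}
spec = {
    "useFieldDiscovery": True,
    "fields": [
        {"name": "foo_bar", "type": "path", "expr": "$.foo.bar"},
        {"name": "first_food", "type": "path", "expr": "$.thing.food[0]"},
    ],
}
flat = flatten(record, spec)
print(flat)
```

In this model the nested objects foo and thing are not discovered automatically because they are not simple types, so their values only appear through the explicit path fields.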

This section lists all default and core extension parsers. For community extension parsers, please see our community extensions list.

String Parser

String-typed parsers operate on text-based inputs that can be split into individual records by newlines. Each line can be further parsed using parseSpec.

Field | Type | Description | Required
type | String | This should say string in general, or hadoopyString when used in a Hadoop indexing job. | yes
parseSpec | JSON Object | Specifies the format, timestamp, and dimensions of the data. | yes

Avro Hadoop Parser

You need to include the druid-avro-extensions as an extension to use the Avro Hadoop Parser.

See the Avro Types section for how Avro types are handled in Druid.

This parser is for Hadoop batch ingestion. The inputFormat of inputSpec in ioConfig must be set to "org.apache.druid.data.input.avro.AvroValueInputFormat". You may want to set the Avro reader’s schema in jobProperties in tuningConfig, e.g.: "avro.schema.input.value.path": "/path/to/your/schema.avsc" or "avro.schema.input.value": "your_schema_JSON_object". If the Avro reader’s schema is not set, the schema in the Avro object container file will be used. See the Avro specification for more information.

Field | Type | Description | Required
type | String | This should say avro_hadoop. | yes
parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be an “avro” parseSpec. | yes
fromPigAvroStorage | Boolean | Specifies whether the data file is stored using AvroStorage. | no (default = false)

An Avro parseSpec can contain a flattenSpec using either the “root” or “path” field types, which can be used to read nested Avro records. The “jq” field type is not currently supported for Avro.

For example, using Avro Hadoop parser with custom reader’s schema file:

  {
    "type" : "index_hadoop",
    "spec" : {
      "dataSchema" : {
        "dataSource" : "",
        "parser" : {
          "type" : "avro_hadoop",
          "parseSpec" : {
            "format": "avro",
            "timestampSpec": <standard timestampSpec>,
            "dimensionsSpec": <standard dimensionsSpec>,
            "flattenSpec": <optional>
          }
        }
      },
      "ioConfig" : {
        "type" : "hadoop",
        "inputSpec" : {
          "type" : "static",
          "inputFormat": "org.apache.druid.data.input.avro.AvroValueInputFormat",
          "paths" : ""
        }
      },
      "tuningConfig" : {
        "jobProperties" : {
          "avro.schema.input.value.path" : "/path/to/my/schema.avsc"
        }
      }
    }
  }

ORC Hadoop Parser

You need to include the druid-orc-extensions as an extension to use the ORC Hadoop Parser.

If you are considering upgrading from a version earlier than 0.15.0 to 0.15.0 or higher, please read Migration from ‘contrib’ extension carefully.

This parser is for Hadoop batch ingestion. The inputFormat of inputSpec in ioConfig must be set to "org.apache.orc.mapreduce.OrcInputFormat".

Field | Type | Description | Required
type | String | This should say orc. | yes
parseSpec | JSON Object | Specifies the timestamp and dimensions of the data (timeAndDims and orc format) and a flattenSpec (orc format). | yes

The parser supports two parseSpec formats: orc and timeAndDims.

The orc parseSpec supports auto field discovery and flattening, if specified with a flattenSpec. If no flattenSpec is specified, useFieldDiscovery will be enabled by default. Specifying a dimensionsSpec is optional if useFieldDiscovery is enabled: if a dimensionsSpec is supplied, the list of dimensions it defines will be the set of ingested dimensions; if it is missing, the discovered fields will make up the list.

The timeAndDims parseSpec must specify which fields will be extracted as dimensions through the dimensionsSpec.

All column types are supported, with the exception of union types. Columns of list type, if filled with primitives, may be used as a multi-value dimension, or specific elements can be extracted with flattenSpec expressions. Likewise, primitive fields may be extracted from map and struct types in the same manner. Auto field discovery will automatically create a string dimension for every (non-timestamp) primitive or list of primitives, as well as any flatten expressions defined in the flattenSpec.

Hadoop job properties

As with most Hadoop jobs, you will get the best results by adding "mapreduce.job.user.classpath.first": "true" or "mapreduce.job.classloader": "true" to the jobProperties section of tuningConfig. If you use "mapreduce.job.classloader": "true", you will likely also need to set mapreduce.job.classloader.system.classes to include -org.apache.hadoop.hive. so that Hadoop loads org.apache.hadoop.hive classes from the application jars instead of the system jars, e.g.

  ...
  "mapreduce.job.classloader": "true",
  "mapreduce.job.classloader.system.classes" : "java., javax.accessibility., javax.activation., javax.activity., javax.annotation., javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., javax.net., javax.print., javax.rmi., javax.script., -javax.security.auth.message., javax.security.auth., javax.security.cert., javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., -org.apache.hadoop.hbase., -org.apache.hadoop.hive., org.apache.hadoop., core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml",
  ...

This is due to the hive-storage-api dependency of the orc-mapreduce library, which provides some classes under the org.apache.hadoop.hive package. If instead using the setting "mapreduce.job.user.classpath.first": "true", then this will not be an issue.

Examples

orc parser, orc parseSpec, auto field discovery, flatten expressions

  {
    "type": "index_hadoop",
    "spec": {
      "ioConfig": {
        "type": "hadoop",
        "inputSpec": {
          "type": "static",
          "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
          "paths": "path/to/file.orc"
        },
        ...
      },
      "dataSchema": {
        "dataSource": "example",
        "parser": {
          "type": "orc",
          "parseSpec": {
            "format": "orc",
            "flattenSpec": {
              "useFieldDiscovery": true,
              "fields": [
                {
                  "type": "path",
                  "name": "nestedDim",
                  "expr": "$.nestedData.dim1"
                },
                {
                  "type": "path",
                  "name": "listDimFirstItem",
                  "expr": "$.listDim[1]"
                }
              ]
            },
            "timestampSpec": {
              "column": "timestamp",
              "format": "millis"
            }
          }
        },
        ...
      },
      "tuningConfig": <hadoop-tuning-config>
    }
  }

orc parser, orc parseSpec, field discovery with no flattenSpec or dimensionsSpec

  {
    "type": "index_hadoop",
    "spec": {
      "ioConfig": {
        "type": "hadoop",
        "inputSpec": {
          "type": "static",
          "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
          "paths": "path/to/file.orc"
        },
        ...
      },
      "dataSchema": {
        "dataSource": "example",
        "parser": {
          "type": "orc",
          "parseSpec": {
            "format": "orc",
            "timestampSpec": {
              "column": "timestamp",
              "format": "millis"
            }
          }
        },
        ...
      },
      "tuningConfig": <hadoop-tuning-config>
    }
  }

orc parser, orc parseSpec, no autodiscovery

  {
    "type": "index_hadoop",
    "spec": {
      "ioConfig": {
        "type": "hadoop",
        "inputSpec": {
          "type": "static",
          "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
          "paths": "path/to/file.orc"
        },
        ...
      },
      "dataSchema": {
        "dataSource": "example",
        "parser": {
          "type": "orc",
          "parseSpec": {
            "format": "orc",
            "flattenSpec": {
              "useFieldDiscovery": false,
              "fields": [
                {
                  "type": "path",
                  "name": "nestedDim",
                  "expr": "$.nestedData.dim1"
                },
                {
                  "type": "path",
                  "name": "listDimFirstItem",
                  "expr": "$.listDim[1]"
                }
              ]
            },
            "timestampSpec": {
              "column": "timestamp",
              "format": "millis"
            },
            "dimensionsSpec": {
              "dimensions": [
                "dim1",
                "dim3",
                "nestedDim",
                "listDimFirstItem"
              ],
              "dimensionExclusions": [],
              "spatialDimensions": []
            }
          }
        },
        ...
      },
      "tuningConfig": <hadoop-tuning-config>
    }
  }

orc parser, timeAndDims parseSpec

  {
    "type": "index_hadoop",
    "spec": {
      "ioConfig": {
        "type": "hadoop",
        "inputSpec": {
          "type": "static",
          "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
          "paths": "path/to/file.orc"
        },
        ...
      },
      "dataSchema": {
        "dataSource": "example",
        "parser": {
          "type": "orc",
          "parseSpec": {
            "format": "timeAndDims",
            "timestampSpec": {
              "column": "timestamp",
              "format": "auto"
            },
            "dimensionsSpec": {
              "dimensions": [
                "dim1",
                "dim2",
                "dim3",
                "listDim"
              ],
              "dimensionExclusions": [],
              "spatialDimensions": []
            }
          }
        },
        ...
      },
      "tuningConfig": <hadoop-tuning-config>
    }
  }

Parquet Hadoop Parser

You need to include the druid-parquet-extensions as an extension to use the Parquet Hadoop Parser.

The Parquet Hadoop Parser is for Hadoop batch ingestion and parses Parquet files directly. The inputFormat of inputSpec in ioConfig must be set to org.apache.druid.data.input.parquet.DruidParquetInputFormat.

The Parquet Hadoop Parser supports auto field discovery and flattening if provided with a flattenSpec with the parquet parseSpec. Parquet nested list and map should operate correctly with JSON path expressions for all supported types.

Field | Type | Description | Required
type | String | This should say parquet. | yes
parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are timeAndDims and parquet. | yes
binaryAsString | Boolean | Specifies if the bytes Parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false)

When the time dimension is a DateType column, a format should not be supplied. When the format is UTF8 (String), either auto or an explicitly defined format is required.

Parquet Hadoop Parser vs Parquet Avro Hadoop Parser

Both parsers read from Parquet files, but slightly differently. The main differences are:

  • The Parquet Hadoop Parser uses a simple conversion while the Parquet Avro Hadoop Parser converts Parquet data into avro records first with the parquet-avro library and then parses avro data using the druid-avro-extensions module to ingest into Druid.
  • The Parquet Hadoop Parser sets a hadoop job property parquet.avro.add-list-element-records to false (which normally defaults to true), in order to ‘unwrap’ primitive list elements into multi-value dimensions.
  • The Parquet Hadoop Parser supports int96 Parquet values, while the Parquet Avro Hadoop Parser does not. There may also be some subtle differences in the behavior of JSON path expression evaluation of flattenSpec.

Based on those differences, we suggest using the Parquet Hadoop Parser over the Parquet Avro Hadoop Parser to allow ingesting data beyond the schema constraints of Avro conversion. However, the Parquet Avro Hadoop Parser was the original basis for supporting the Parquet format, and as such it is a bit more mature.

Examples

parquet parser, parquet parseSpec

parquet parser, timeAndDims parseSpec

  {
    "type": "index_hadoop",
    "spec": {
      "ioConfig": {
        "type": "hadoop",
        "inputSpec": {
          "type": "static",
          "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
          "paths": "path/to/file.parquet"
        },
        ...
      },
      "dataSchema": {
        "dataSource": "example",
        "parser": {
          "type": "parquet",
          "parseSpec": {
            "format": "timeAndDims",
            "timestampSpec": {
              "column": "timestamp",
              "format": "auto"
            },
            "dimensionsSpec": {
              "dimensions": [
                "dim1",
                "dim2",
                "dim3",
                "listDim"
              ],
              "dimensionExclusions": [],
              "spatialDimensions": []
            }
          }
        },
        ...
      },
      "tuningConfig": <hadoop-tuning-config>
    }
  }

Parquet Avro Hadoop Parser

Consider using the Parquet Hadoop Parser over this parser to ingest Parquet files. See Parquet Hadoop Parser vs Parquet Avro Hadoop Parser for the differences between those parsers.

You need to include both druid-parquet-extensions and druid-avro-extensions as extensions to use the Parquet Avro Hadoop Parser.

The Parquet Avro Hadoop Parser is for Hadoop batch ingestion. This parser first converts the Parquet data into Avro records, and then parses them to ingest into Druid. The inputFormat of inputSpec in ioConfig must be set to org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat.

The Parquet Avro Hadoop Parser supports auto field discovery and flattening if provided with a flattenSpec with the avro parseSpec. Parquet nested list and map logical types should operate correctly with JSON path expressions for all supported types. This parser sets a Hadoop job property parquet.avro.add-list-element-records to false (which normally defaults to true), in order to ‘unwrap’ primitive list elements into multi-value dimensions.

Note that the int96 Parquet value type is not supported with this parser.

| Field | Type | Description | Required |
|---|---|---|---|
| type | String | This should say parquet-avro. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Should be avro. | yes |
| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |

When the time dimension is a DateType column, a format should not be supplied. When the format is UTF8 (String), either auto or an explicitly defined format is required.

Example

    {
      "type": "index_hadoop",
      "spec": {
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": {
            "type": "static",
            "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
            "paths": "path/to/file.parquet"
          },
          ...
        },
        "dataSchema": {
          "dataSource": "example",
          "parser": {
            "type": "parquet-avro",
            "parseSpec": {
              "format": "avro",
              "flattenSpec": {
                "useFieldDiscovery": true,
                "fields": [
                  {
                    "type": "path",
                    "name": "nestedDim",
                    "expr": "$.nestedData.dim1"
                  },
                  {
                    "type": "path",
                    "name": "listDimFirstItem",
                    "expr": "$.listDim[1]"
                  }
                ]
              },
              "timestampSpec": {
                "column": "timestamp",
                "format": "auto"
              },
              "dimensionsSpec": {
                "dimensions": [],
                "dimensionExclusions": [],
                "spatialDimensions": []
              }
            }
          },
          ...
        },
        "tuningConfig": <hadoop-tuning-config>
      }
    }

Avro Stream Parser

This parser is for stream ingestion and reads Avro data from a stream directly.

| Field | Type | Description | Required |
|---|---|---|---|
| type | String | This should say avro_stream. | no |
| avroBytesDecoder | JSON Object | Specifies avroBytesDecoder to decode bytes to Avro record. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be an “avro” parseSpec. | yes |

An Avro parseSpec can contain a flattenSpec using either the “root” or “path” field types, which can be used to read nested Avro records. The “jq” field type is not currently supported for Avro.

For example, using Avro stream parser with schema repo Avro bytes decoder:

    "parser" : {
      "type" : "avro_stream",
      "avroBytesDecoder" : {
        "type" : "schema_repo",
        "subjectAndIdConverter" : {
          "type" : "avro_1124",
          "topic" : "${YOUR_TOPIC}"
        },
        "schemaRepository" : {
          "type" : "avro_1124_rest_client",
          "url" : "${YOUR_SCHEMA_REPO_END_POINT}"
        }
      },
      "parseSpec" : {
        "format": "avro",
        "timestampSpec": <standard timestampSpec>,
        "dimensionsSpec": <standard dimensionsSpec>,
        "flattenSpec": <optional>
      }
    }

Protobuf Parser

You need to include druid-protobuf-extensions as an extension to use the Protobuf Parser.

This parser is for stream ingestion and reads Protocol buffer data from a stream directly.

Sample spec:

    "parser": {
      "type": "protobuf",
      "protoBytesDecoder": {
        "type": "file",
        "descriptor": "file:///tmp/metrics.desc",
        "protoMessageType": "Metrics"
      },
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "unit",
            "http_method",
            "http_code",
            "page",
            "metricType",
            "server"
          ],
          "dimensionExclusions": [
            "timestamp",
            "value"
          ]
        }
      }
    }

See the Protobuf extension description for more details and examples.

Protobuf Bytes Decoder

If type is not included, the protoBytesDecoder defaults to schema_registry.

File-based Protobuf Bytes Decoder

This Protobuf bytes decoder first reads a descriptor file, and then parses it to get the schema used to decode the Protobuf record from bytes.

| Field | Type | Description | Required |
|---|---|---|---|
| type | String | This should say file. | yes |
| descriptor | String | Protobuf descriptor file name in the classpath or URL. | yes |
| protoMessageType | String | Protobuf message type in the descriptor. Both short name and fully qualified name are accepted. The parser uses the first message type found in the descriptor if not specified. | no |

Sample spec:

    "protoBytesDecoder": {
      "type": "file",
      "descriptor": "file:///tmp/metrics.desc",
      "protoMessageType": "Metrics"
    }

Confluent Schema Registry-based Protobuf Bytes Decoder

This Protobuf bytes decoder first extracts a unique id from the input message bytes, and then uses it to look up the schema in the Schema Registry used to decode the Protobuf record from bytes. For details, see the Schema Registry documentation and repository.

| Field | Type | Description | Required |
|---|---|---|---|
| type | String | This should say schema_registry. | yes |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
| capacity | Integer | Specifies the max size of the cache (default = Integer.MAX_VALUE). | no |
| urls | Array | Specifies the url endpoints of the multiple Schema Registry instances. | yes (if url is not provided) |
| config | Json | To send additional configurations, configured for Schema Registry. This can be supplied via a DynamicConfigProvider. | no |
| headers | Json | To send headers to the Schema Registry. This can be supplied via a DynamicConfigProvider. | no |

For a single Schema Registry instance, use the url field; for multiple instances, use the urls field.

Single Instance:

    ...
    "protoBytesDecoder": {
      "url": <schema-registry-url>,
      "type": "schema_registry"
    }
    ...

Multiple Instances:

    ...
    "protoBytesDecoder": {
      "urls": [<schema-registry-url-1>, <schema-registry-url-2>, ...],
      "type": "schema_registry",
      "capacity": 100,
      "config" : {
        "basic.auth.credentials.source": "USER_INFO",
        "basic.auth.user.info": "fred:letmein",
        "schema.registry.ssl.truststore.location": "/some/secrets/kafka.client.truststore.jks",
        "schema.registry.ssl.truststore.password": "<password>",
        "schema.registry.ssl.keystore.location": "/some/secrets/kafka.client.keystore.jks",
        "schema.registry.ssl.keystore.password": "<password>",
        "schema.registry.ssl.key.password": "<password>",
        ...
      },
      "headers": {
        "traceID" : "b29c5de2-0db4-490b-b421",
        "timeStamp" : "1577191871865",
        "druid.dynamic.config.provider":{
          "type":"mapString",
          "config":{
            "registry.header.prop.1":"value.1",
            "registry.header.prop.2":"value.2"
          }
        }
        ...
      }
    }
    ...

ParseSpec

The Parser is deprecated for native batch tasks, Kafka indexing service, and Kinesis indexing service. Consider using the input format instead for these types of ingestion.

ParseSpecs serve two purposes:

  • The String Parser uses them to determine the format (i.e., JSON, CSV, TSV) of incoming rows.
  • All Parsers use them to determine the timestamp and dimensions of incoming rows.

If format is not included, the parseSpec defaults to tsv.

JSON ParseSpec

Use this with the String Parser to load JSON.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | This should say json. | no |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |
| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See flattenSpec for more info. | no |

Sample spec:

    "parseSpec": {
      "format" : "json",
      "timestampSpec" : {
        "column" : "timestamp"
      },
      "dimensionsSpec" : {
        "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
      }
    }

JSON Lowercase ParseSpec

The jsonLowercase parser is deprecated and may be removed in a future version of Druid.

This is a special variation of the JSON ParseSpec that lowercases all the column names in the incoming JSON data. This parseSpec is required if you are updating to Druid 0.7.x from Druid 0.6.x, are directly ingesting JSON with mixed-case column names, do not have any ETL in place to lowercase those column names, and would like to make queries that include the data you created using 0.6.x and 0.7.x.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | This should say jsonLowercase. | yes |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |

CSV ParseSpec

Use this with the String Parser to load CSV. Strings are parsed using the com.opencsv library.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | This should say csv. | yes |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |
| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A) |
| columns | JSON array | Specifies the columns of the data. | yes |

Sample spec:

    "parseSpec": {
      "format" : "csv",
      "timestampSpec" : {
        "column" : "timestamp"
      },
      "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
      "dimensionsSpec" : {
        "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
      }
    }

CSV Index Tasks

If your input files contain a header, the columns field is optional and you don’t need to set it. Instead, you can set the hasHeaderRow field to true, which makes Druid automatically extract the column information from the header. Otherwise, you must set the columns field and ensure that it matches the columns of your input data in the same order.

Also, you can skip some header rows by setting skipHeaderRows in your parseSpec. If both skipHeaderRows and hasHeaderRow options are set, skipHeaderRows is first applied. For example, if you set skipHeaderRows to 2 and hasHeaderRow to true, Druid will skip the first two lines and then extract column information from the third line.

Note that hasHeaderRow and skipHeaderRows are effective only for non-Hadoop batch index tasks. Other types of index tasks will fail with an exception.
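
As a sketch of the header options described above, the following parseSpec assumes a CSV file whose first two lines are free-form text and whose third line holds the column names. The file layout and the shortened dimension list are illustrative assumptions. The object is wrapped in outer braces only so that it parses as standalone JSON; in a real task the parseSpec sits inside the parser.

```json
{
  "parseSpec": {
    "format": "csv",
    "timestampSpec": {
      "column": "timestamp"
    },
    "skipHeaderRows": 2,
    "hasHeaderRow": true,
    "dimensionsSpec": {
      "dimensions": ["page", "language", "user"]
    }
  }
}
```

With these settings, Druid would skip the first two lines and take the column names from the third, which is why the columns field is omitted here.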

Other CSV Ingestion Tasks

The columns field must be included, and the order of its entries must match the order of the columns in your input data.

TSV / Delimited ParseSpec

Use this with the String Parser to load any delimited text that does not require special escaping. By default, the delimiter is a tab, so this will load TSV.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | This should say tsv. | yes |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |
| delimiter | String | A custom delimiter for data values. | no (default = \t) |
| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A) |
| columns | JSON String array | Specifies the columns of the data. | yes |

Sample spec:

    "parseSpec": {
      "format" : "tsv",
      "timestampSpec" : {
        "column" : "timestamp"
      },
      "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
      "delimiter":"|",
      "dimensionsSpec" : {
        "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
      }
    }

Be sure to change the delimiter to the appropriate delimiter for your data. Like CSV, you must specify the columns and which subset of the columns you want indexed.

TSV (Delimited) Index Tasks

If your input files contain a header, the columns field is optional and doesn’t need to be set. Instead, you can set the hasHeaderRow field to true, which makes Druid automatically extract the column information from the header. Otherwise, you must set the columns field and ensure that it matches the columns of your input data in the same order.

Also, you can skip some header rows by setting skipHeaderRows in your parseSpec. If both skipHeaderRows and hasHeaderRow options are set, skipHeaderRows is first applied. For example, if you set skipHeaderRows to 2 and hasHeaderRow to true, Druid will skip the first two lines and then extract column information from the third line.

Note that hasHeaderRow and skipHeaderRows are effective only for non-Hadoop batch index tasks. Other types of index tasks will fail with an exception.

Other TSV (Delimited) Ingestion Tasks

The columns field must be included, and the order of its entries must match the order of the columns in your input data.

Regex ParseSpec

The columns field must match the columns of your regex matching groups in the same order. If columns are not provided, default column names (“column_1”, “column_2”, … “column_n”) will be assigned. Ensure that your column names include all your dimensions.
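
A minimal sketch of a regex parseSpec, for input lines such as `2013-08-31T01:02:33Z|Gypsy Danger|en`. The pattern field name, the pipe-separated line layout, and the three column names are assumptions made for illustration and do not come from the text above, so verify the field names against your Druid version. The outer braces are there only so the fragment parses as standalone JSON.

```json
{
  "parseSpec": {
    "format": "regex",
    "timestampSpec": {
      "column": "timestamp"
    },
    "dimensionsSpec": {
      "dimensions": ["page", "language"]
    },
    "columns": ["timestamp", "page", "language"],
    "pattern": "^([^|]+)\\|([^|]+)\\|([^|]+)$"
  }
}
```

The pattern has three capturing groups, and columns lists three names in the same order, so the first group feeds timestamp, the second page, and the third language.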

JavaScript ParseSpec

    "parseSpec":{
      "format" : "javascript",
      "timestampSpec" : {
        "column" : "timestamp"
      },
      "dimensionsSpec" : {
        "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
      },
      "function" : "function(str) { var parts = str.split(\"-\"); return { one: parts[0], two: parts[1] } }"
    }

Note that with the JavaScript parser, data must be fully parsed and returned in a {key:value} format by the JS logic. This means any flattening or parsing of multi-dimensional values must be done here.

JavaScript-based functionality is disabled by default. Please refer to the Druid JavaScript programming guide for guidelines about using Druid’s JavaScript functionality, including instructions on how to enable it.

TimeAndDims ParseSpec

Use this with non-String Parsers to provide them with timestamp and dimensions information. Non-String Parsers handle all formatting decisions on their own, without using the ParseSpec.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | This should say timeAndDims. | yes |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |
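
A timeAndDims parseSpec built from the three fields in the table might look like the sketch below. The timestamp column name and the dimension names dim1, dim2, and dim3 are placeholders borrowed from the Parquet Hadoop Parser example earlier on this page, not required values. The outer braces are there only so the fragment parses as standalone JSON.

```json
{
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": ["dim1", "dim2", "dim3"]
    }
  }
}
```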

Orc ParseSpec

Use this with the Hadoop ORC Parser to load ORC files.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | This should say orc. | no |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |
| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See flattenSpec for more info. | no |

Parquet ParseSpec

Use this with the Hadoop Parquet Parser to load Parquet files.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | This should say parquet. | no |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |
| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See flattenSpec for more info. | no |
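
The following is a hedged sketch of a parquet parseSpec that combines the four fields in the table. The flattenSpec shape is copied from the Parquet Avro Hadoop Parser example earlier on this page, and the nestedData.dim1 path and the nestedDim name are hypothetical. The outer braces are there only so the fragment parses as standalone JSON.

```json
{
  "parseSpec": {
    "format": "parquet",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        {
          "type": "path",
          "name": "nestedDim",
          "expr": "$.nestedData.dim1"
        }
      ]
    },
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": []
    }
  }
}
```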