HDFS

    Besides the above settings, you also need to include all Hadoop configuration files (such as core-site.xml and hdfs-site.xml) in the Druid classpath. One way to do this is to copy all those files under ${DRUID_HOME}/conf/_common.
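
    For example, assuming your Hadoop client configuration lives in /etc/hadoop/conf (an illustrative path; use wherever your cluster keeps these files), the copy could look like this:

        # Copy the Hadoop client configuration into Druid's common classpath directory.
        # /etc/hadoop/conf is an example location.
        cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml ${DRUID_HOME}/conf/_common/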

    If you are using Hadoop ingestion, set your output directory to a location on HDFS and it will work. If you want to eagerly authenticate against a secured Hadoop/HDFS cluster, you must set druid.hadoop.security.kerberos.principal and druid.hadoop.security.kerberos.keytab. This is an alternative to the cron-job method of running the kinit command periodically.
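
    A sketch of the relevant runtime properties; the principal and keytab path below are placeholder values to replace with your own:

        # Eagerly authenticate against a Kerberized HDFS cluster.
        # Principal and keytab path are example values.
        druid.hadoop.security.kerberos.principal=druid@EXAMPLE.COM
        druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/druid.keytab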

    You can also use AWS S3 or Google Cloud Storage as the deep storage via HDFS.

    Configuration for AWS S3
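
    To use AWS S3 as the deep storage via HDFS, point druid.storage.storageDirectory at an s3a:// path. A minimal sketch; the bucket and directory are placeholders:

        # Use the HDFS deep storage implementation with an S3A path.
        druid.storage.type=hdfs
        # Bucket and directory are example values.
        druid.storage.storageDirectory=s3a://my-bucket/druid/segments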

    You also need to include the Hadoop AWS module, especially hadoop-aws.jar, in the Druid classpath. Run the command below to install the jar under ${DRUID_HOME}/extensions/druid-hdfs-storage on all nodes.
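
    A representative invocation using Druid's pull-deps tool; set HADOOP_VERSION to match the Hadoop version your cluster runs:

        # Pull the hadoop-aws artifact, then copy the jar into the extension directory.
        java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"
        cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/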

    Finally, you need to add the properties below to core-site.xml. For more configuration options, see the Hadoop AWS module documentation.
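
    An example set of S3A properties; the access and secret key values are placeholders, and both can be omitted if you authenticate via IAM roles or another credential provider:

        <property>
          <name>fs.s3a.impl</name>
          <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
          <description>The implementation class of the S3A Filesystem.</description>
        </property>
        <property>
          <name>fs.AbstractFileSystem.s3a.impl</name>
          <value>org.apache.hadoop.fs.s3a.S3A</value>
          <description>The implementation class of the S3A AbstractFileSystem.</description>
        </property>
        <property>
          <name>fs.s3a.access.key</name>
          <value>your-access-key</value>
          <description>AWS access key ID. Omit for IAM role-based authentication. Example value.</description>
        </property>
        <property>
          <name>fs.s3a.secret.key</name>
          <value>your-secret-key</value>
          <description>AWS secret key. Omit for IAM role-based authentication. Example value.</description>
        </property>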

    Configuration for Google Cloud Storage

    To use Google Cloud Storage as the deep storage, you need to configure druid.storage.storageDirectory properly.
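
    A minimal sketch; the bucket and directory are placeholders:

        # Use the HDFS deep storage implementation with a gs:// path.
        druid.storage.type=hdfs
        # Bucket and directory are example values.
        druid.storage.storageDirectory=gs://my-bucket/druid/segments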

    Finally, you need to configure core-site.xml with the filesystem and authentication properties needed for GCS. You may want to copy the example properties below. Follow the instructions at https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md for more details. For more configuration options, see the GCS core default and GCS core template files.
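
    Example properties for the GCS connector using service-account authentication; the keyfile path is a placeholder, and the install guide above covers alternative authentication methods:

        <property>
          <name>fs.gs.impl</name>
          <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
          <description>The FileSystem for gs: (GCS) URIs.</description>
        </property>
        <property>
          <name>fs.AbstractFileSystem.gs.impl</name>
          <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
          <description>The AbstractFileSystem for gs: URIs.</description>
        </property>
        <property>
          <name>google.cloud.auth.service.account.enable</name>
          <value>true</value>
          <description>Whether to use a service account for GCS authorization.</description>
        </property>
        <property>
          <name>google.cloud.auth.service.account.json.keyfile</name>
          <value>/path/to/keyfile</value>
          <description>The JSON keyfile of the service account used for GCS access. Path is an example value.</description>
        </property>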

    Tested with Druid 0.17.0, Hadoop 2.8.5 and gcs-connector jar 2.0.0-hadoop2.

    Reading data from HDFS or Cloud Storage

    The HDFS input source is supported by the Parallel task to read files directly from HDFS storage. You may be able to read objects from cloud storage with the HDFS input source, but we highly recommend using a proper input source instead where possible because it is simpler to set up. For now, only the S3 input source and the Google Cloud Storage input source are supported for cloud storage types, so you may still want to use the HDFS input source to read from cloud storage other than those two.
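
    For illustration, an HDFS input source inside a native batch ingestion spec looks roughly like this; the namenode host, path, and input format are placeholders to adapt to your data:

        "ioConfig": {
          "type": "index_parallel",
          "inputSource": {
            "type": "hdfs",
            "paths": "hdfs://namenode_host/path/to/data/"
          },
          "inputFormat": {
            "type": "json"
          }
        }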