Metrics

    Metrics are emitted as JSON objects to a runtime log file or over HTTP (to a service such as Apache Kafka). Metric emission is disabled by default.

    All Druid metrics share a common set of fields:

    • timestamp - the time the metric was created
    • metric - the name of the metric
    • service - the service name that emitted the metric
    • host - the host name that emitted the metric

    Metrics may have additional dimensions beyond those listed above.
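
    As a minimal sketch of how a consumer might separate the common fields from the additional dimensions, the snippet below parses one JSON event. The sample payload, the `value` field, and the `split_event` helper are assumptions made for illustration, not part of the Druid API.

```python
import json

# Field names taken from the list of common fields above.
COMMON_FIELDS = {"timestamp", "metric", "service", "host"}


def split_event(raw_line):
    """Split one JSON metric event into common fields and everything else."""
    event = json.loads(raw_line)
    common = {k: v for k, v in event.items() if k in COMMON_FIELDS}
    extra = {k: v for k, v in event.items() if k not in COMMON_FIELDS}
    return common, extra


# Hypothetical event; "value" and "dataSource" are assumed extra fields.
sample = json.dumps({
    "timestamp": "2024-01-01T00:00:00.000Z",
    "metric": "query/time",
    "service": "druid/historical",
    "host": "historical-1:8083",
    "value": 42,
    "dataSource": "wikipedia",
})

common, extra = split_event(sample)
print(common["metric"], extra)
```

    Keeping the common fields in one dictionary and the remaining keys in another makes it easy to route events by metric name while treating dimensions as free-form labels.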

    Most metric values reset each emission period. By default, the Druid emission period is 1 minute; you can change it by setting the property druid.monitoring.emissionPeriod.

    Query metrics

    Historical

    Metric | Description | Dimensions | Normal Value
    query/time | Milliseconds taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | < 1s
    query/segment/time | Milliseconds taken to query an individual segment. Includes time to page in the segment from disk. | id, status, segment. | several hundred milliseconds
    query/wait/time | Milliseconds spent waiting for a segment to be scanned. | id, segment. | < several hundred milliseconds
    segment/scan/pending | Number of segments in queue waiting to be scanned. | | Close to 0
    query/segmentAndCache/time | Milliseconds taken to query an individual segment or hit the cache (if it is enabled on the Historical process). | id, segment. | several hundred milliseconds
    query/cpu/time | Microseconds of CPU time taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | Varies
    query/count | Number of total queries. This metric is only available if the QueryCountStatsMonitor module is included. | |
    query/success/count | Number of queries successfully processed. This metric is only available if the QueryCountStatsMonitor module is included. | |
    query/failed/count | Number of failed queries. This metric is only available if the QueryCountStatsMonitor module is included. | |
    query/interrupted/count | Number of queries interrupted due to cancellation or timeout. This metric is only available if the QueryCountStatsMonitor module is included. | |
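
    Because these counters reset each emission period, a per-period success ratio can be derived from them directly. The sketch below assumes the counts for one period have already been collected into a dictionary keyed by metric name; the `query_success_ratio` helper and the sample numbers are hypothetical.

```python
def query_success_ratio(counts):
    """Return the fraction of queries that succeeded in one emission period.

    `counts` maps metric names to the values emitted for that period.
    Returns None when no queries were counted, to avoid dividing by zero.
    """
    total = counts.get("query/count", 0)
    if total == 0:
        return None
    return counts.get("query/success/count", 0) / total


# Hypothetical values for a single emission period.
period = {
    "query/count": 200,
    "query/success/count": 190,
    "query/failed/count": 6,
    "query/interrupted/count": 4,
}
print(query_success_ratio(period))
```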

    Real-time

    Metric | Description | Dimensions | Normal Value
    query/time | Milliseconds taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | < 1s
    query/wait/time | Milliseconds spent waiting for a segment to be scanned. | id, segment. | several hundred milliseconds
    segment/scan/pending | Number of segments in queue waiting to be scanned. | | Close to 0
    query/count | Number of total queries. This metric is only available if the QueryCountStatsMonitor module is included. | |
    query/success/count | Number of queries successfully processed. This metric is only available if the QueryCountStatsMonitor module is included. | |
    query/failed/count | Number of failed queries. This metric is only available if the QueryCountStatsMonitor module is included. | |
    query/interrupted/count | Number of queries interrupted due to cancellation or timeout. This metric is only available if the QueryCountStatsMonitor module is included. | |

    Jetty

    Metric | Description | Normal Value
    jetty/numOpenConnections | Number of open Jetty connections. | Not much higher than the number of Jetty threads.

    Cache

    Metric | Description | Normal Value
    query/cache/delta/* | Cache metrics since the last emission. | N/A
    query/cache/total/* | Total cache metrics. | N/A

    Memcached only metrics

    Metric | Description | Dimensions | Normal Value
    query/cache/memcached/total | Cache metrics unique to memcached (only if druid.cache.type=memcached), reported as their actual values. | Variable | N/A
    query/cache/memcached/delta | Cache metrics unique to memcached (only if druid.cache.type=memcached), reported as their delta from the prior event emission. | Variable | N/A

    SQL Metrics

    If SQL is enabled, the Broker will emit the following metrics for SQL.

    Metric | Description | Dimensions | Normal Value
    sqlQuery/time | Milliseconds taken to complete a SQL query. | id, nativeQueryIds, dataSource, remoteAddress, success. | < 1s
    sqlQuery/bytes | Number of bytes returned in the SQL response. | id, nativeQueryIds, dataSource, remoteAddress, success. |

    Ingestion Metrics (Kafka Indexing Service)

    These metrics are applicable for the Kafka Indexing Service.

    Metric | Description | Dimensions | Normal Value
    ingest/kafka/lag | Total lag between the offsets consumed by the Kafka indexing tasks and the latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, should not be a very high number
    ingest/kafka/maxLag | Max lag between the offsets consumed by the Kafka indexing tasks and the latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, should not be a very high number
    ingest/kafka/avgLag | Average lag between the offsets consumed by the Kafka indexing tasks and the latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, should not be a very high number
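
    Since "not a very high number" depends on the workload, one way to act on these metrics is to compare them with a threshold you choose yourself. The following is a sketch only: the `value` field, the `datasources_over_lag` helper, and the threshold of 10000 are assumptions made for illustration.

```python
def datasources_over_lag(events, max_lag):
    """Return {dataSource: lag} for ingest/kafka/maxLag events above max_lag."""
    flagged = {}
    for event in events:
        if event.get("metric") != "ingest/kafka/maxLag":
            continue  # ignore unrelated metrics
        if event["value"] > max_lag:
            flagged[event["dataSource"]] = event["value"]
    return flagged


# Hypothetical events as they might look after JSON decoding.
events = [
    {"metric": "ingest/kafka/maxLag", "dataSource": "clicks", "value": 25000},
    {"metric": "ingest/kafka/maxLag", "dataSource": "orders", "value": 120},
    {"metric": "query/time", "dataSource": "clicks", "value": 42},
]
print(datasources_over_lag(events, max_lag=10000))
```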

    Ingestion Metrics (Kinesis Indexing Service)

    These metrics are applicable for the Kinesis Indexing Service.

    Metric | Description | Dimensions | Normal Value
    ingest/kinesis/lag/time | Total lag time in milliseconds between the current message sequence number consumed by the Kinesis indexing tasks and the latest sequence number in Kinesis across all shards. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, up to max Kinesis retention period in milliseconds
    ingest/kinesis/maxLag/time | Max lag time in milliseconds between the current message sequence number consumed by the Kinesis indexing tasks and the latest sequence number in Kinesis across all shards. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, up to max Kinesis retention period in milliseconds
    ingest/kinesis/avgLag/time | Average lag time in milliseconds between the current message sequence number consumed by the Kinesis indexing tasks and the latest sequence number in Kinesis across all shards. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, up to max Kinesis retention period in milliseconds

    Ingestion metrics (Realtime process)

    These metrics are only available if the RealtimeMetricsMonitor is included in the monitors list for the Realtime process. These metrics are deltas for each emission period.

    Indexing service

    Metric | Description | Dimensions | Normal Value
    task/run/time | Milliseconds taken to run a task. | dataSource, taskId, taskType, taskStatus. | Varies.
    task/action/log/time | Milliseconds taken to log a task action to the audit log. | dataSource, taskId, taskType | < 1000 (subsecond)
    task/action/run/time | Milliseconds taken to execute a task action. | dataSource, taskId, taskType | Varies from subsecond to a few seconds, based on action type.
    segment/added/bytes | Size in bytes of new segments created. | dataSource, taskId, taskType, interval. | Varies.
    segment/moved/bytes | Size in bytes of segments moved/archived via the Move Task. | dataSource, taskId, taskType, interval. | Varies.
    segment/nuked/bytes | Size in bytes of segments deleted via the Kill Task. | dataSource, taskId, taskType, interval. | Varies.
    task/success/count | Number of successful tasks per emission period. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource. | Varies.
    task/failed/count | Number of failed tasks per emission period. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource. | Varies.
    task/running/count | Number of current running tasks. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource. | Varies.
    task/pending/count | Number of current pending tasks. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource. | Varies.
    task/waiting/count | Number of current waiting tasks. This metric is only available if the TaskCountStatsMonitor module is included. | dataSource. | Varies.
    taskSlot/total/count | Number of total task slots per emission period. This metric is only available if the TaskSlotCountStatsMonitor module is included. | | Varies.
    taskSlot/idle/count | Number of idle task slots per emission period. This metric is only available if the TaskSlotCountStatsMonitor module is included. | | Varies.
    taskSlot/used/count | Number of busy task slots per emission period. This metric is only available if the TaskSlotCountStatsMonitor module is included. | | Varies.
    taskSlot/lazy/count | Number of total task slots in lazy marked MiddleManagers and Indexers per emission period. This metric is only available if the TaskSlotCountStatsMonitor module is included. | | Varies.
    taskSlot/blacklisted/count | Number of total task slots in blacklisted MiddleManagers and Indexers per emission period. This metric is only available if the TaskSlotCountStatsMonitor module is included. | | Varies.

    These metrics are for the Druid Coordinator and are reset each time the Coordinator runs the coordination logic.

    Metric | Description | Dimensions | Normal Value
    segment/assigned/count | Number of segments assigned to be loaded in the cluster. | tier. | Varies.
    segment/moved/count | Number of segments moved in the cluster. | tier. | Varies.
    segment/dropped/count | Number of segments dropped due to being overshadowed. | tier. | Varies.
    segment/deleted/count | Number of segments dropped due to rules. | tier. | Varies.
    segment/unneeded/count | Number of segments dropped due to being marked as unused. | tier. | Varies.
    segment/cost/raw | Used in cost balancing. The raw cost of hosting segments. | tier. | Varies.
    segment/cost/normalization | Used in cost balancing. The normalization of hosting segments. | tier. | Varies.
    segment/cost/normalized | Used in cost balancing. The normalized cost of hosting segments. | tier. | Varies.
    segment/loadQueue/size | Size in bytes of segments to load. | server. | Varies.
    segment/loadQueue/failed | Number of segments that failed to load. | server. | 0
    segment/loadQueue/count | Number of segments to load. | server. | Varies.
    segment/dropQueue/count | Number of segments to drop. | server. | Varies.
    segment/size | Total size of used segments in a data source. Emitted only for data sources to which at least one used segment belongs. | dataSource. | Varies.
    segment/count | Number of used segments belonging to a data source. Emitted only for data sources to which at least one used segment belongs. | dataSource. | < max
    segment/overShadowed/count | Number of overshadowed segments. | | Varies.
    segment/unavailable/count | Number of segments (not including replicas) left to load until segments that should be loaded in the cluster are available for queries. | dataSource. | 0
    segment/underReplicated/count | Number of segments (including replicas) left to load until segments that should be loaded in the cluster are available for queries. | tier, dataSource. | 0
    tier/historical/count | Number of available historical nodes in each tier. | tier. | Varies.
    tier/replication/factor | Configured maximum replication factor in each tier. | tier. | Varies.
    tier/required/capacity | Total capacity in bytes required in each tier. | tier. | Varies.
    tier/total/capacity | Total capacity in bytes available in each tier. | tier. | Varies.
    compact/task/count | Number of tasks issued in the auto compaction run. | | Varies.
    compactTask/maxSlot/count | Max number of task slots that can be used for auto compaction tasks in the auto compaction run. | | Varies.
    compactTask/availableSlot/count | Number of available task slots that can be used for auto compaction tasks in the auto compaction run (the max slot count minus any currently running compaction tasks). | | Varies.
    segment/waitCompact/bytes | Total bytes of this datasource waiting to be compacted by auto compaction (only considers intervals/segments that are eligible for auto compaction). | datasource. | Varies.
    segment/waitCompact/count | Total number of segments of this datasource waiting to be compacted by auto compaction (only considers intervals/segments that are eligible for auto compaction). | datasource. | Varies.
    interval/waitCompact/count | Total number of intervals of this datasource waiting to be compacted by auto compaction (only considers intervals/segments that are eligible for auto compaction). | datasource. | Varies.
    segment/compacted/bytes | Total bytes of this datasource that are already compacted with the spec set in the auto compaction config. | datasource. | Varies.
    segment/compacted/count | Total number of segments of this datasource that are already compacted with the spec set in the auto compaction config. | datasource. | Varies.
    interval/compacted/count | Total number of intervals of this datasource that are already compacted with the spec set in the auto compaction config. | datasource. | Varies.
    segment/skipCompact/bytes | Total bytes of this datasource that are skipped (not eligible for auto compaction) by auto compaction. | datasource. | Varies.
    segment/skipCompact/count | Total number of segments of this datasource that are skipped (not eligible for auto compaction) by auto compaction. | datasource. | Varies.
    interval/skipCompact/count | Total number of intervals of this datasource that are skipped (not eligible for auto compaction) by auto compaction. | datasource. | Varies.

    If emitBalancingStats is set to true in the Coordinator dynamic configuration, then log entries for class org.apache.druid.server.coordinator.duty.EmitClusterStatsAndMetrics will have extra information on balancing decisions.

    General Health

    Metric | Description | Dimensions | Normal Value
    segment/max | Maximum byte limit available for segments. | | Varies.
    segment/used | Bytes used for served segments. | dataSource, tier, priority. | < max
    segment/usedPercent | Percentage of space used by served segments. | dataSource, tier, priority. | < 100%
    segment/count | Number of served segments. | dataSource, tier, priority. | Varies.
    segment/pendingDelete | On-disk size in bytes of segments that are waiting to be cleared out. | | Varies.

    JVM

    These metrics are only available if the JVMMonitor module is included.

    Metric | Description | Dimensions | Normal Value
    jvm/pool/committed | Committed pool. | poolKind, poolName. | close to max pool
    jvm/pool/init | Initial pool. | poolKind, poolName. | Varies.
    jvm/pool/max | Max pool. | poolKind, poolName. | Varies.
    jvm/pool/used | Pool used. | poolKind, poolName. | < max pool
    jvm/bufferpool/count | Bufferpool count. | bufferpoolName. | Varies.
    jvm/bufferpool/used | Bufferpool used. | bufferpoolName. | close to capacity
    jvm/bufferpool/capacity | Bufferpool capacity. | bufferpoolName. | Varies.
    jvm/mem/init | Initial memory. | memKind. | Varies.
    jvm/mem/max | Max memory. | memKind. | Varies.
    jvm/mem/used | Used memory. | memKind. | < max memory
    jvm/mem/committed | Committed memory. | memKind. | close to max memory
    jvm/gc/count | Garbage collection count. | gcName (cms/g1/parallel/etc.), gcGen (old/young) | Varies.
    jvm/gc/cpu | CPU time in nanoseconds spent on garbage collection. Note: jvm/gc/cpu represents the total time over multiple GC cycles; divide by jvm/gc/count to get the mean GC time per cycle. | gcName, gcGen | Sum of jvm/gc/cpu should be within 10-30% of sum of jvm/cpu/total, depending on the GC algorithm used (reported by JvmCpuMonitor)
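
    The note on jvm/gc/cpu describes two small calculations: the mean GC time per cycle and the share of total JVM CPU time spent on GC. A minimal sketch of both follows. The function names and sample numbers are hypothetical, and the share calculation assumes both sums have already been converted to the same time unit.

```python
def mean_gc_nanos_per_cycle(gc_cpu_nanos, gc_count):
    """Mean GC CPU time per cycle in nanoseconds: jvm/gc/cpu / jvm/gc/count."""
    if gc_count == 0:
        return 0.0  # no GC cycles in this window
    return gc_cpu_nanos / gc_count


def gc_share_of_cpu(gc_cpu_sum, jvm_cpu_total_sum):
    """Fraction of total JVM CPU time spent on GC.

    Assumes both sums are expressed in the same time unit.
    """
    if jvm_cpu_total_sum == 0:
        return 0.0
    return gc_cpu_sum / jvm_cpu_total_sum


# Hypothetical sums over one window, both expressed in nanoseconds.
print(mean_gc_nanos_per_cycle(3_000_000_000, 12))
print(gc_share_of_cpu(3_000_000_000, 20_000_000_000))
```

    A share between 0.10 and 0.30 would fall inside the range given in the table above.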

    EventReceiverFirehose

    The following metric is only available if the EventReceiverFirehoseMonitor module is included.

    Sys

    Metric | Description | Dimensions | Normal Value
    sys/swap/free | Free swap. | | Varies.
    sys/swap/max | Max swap. | | Varies.
    sys/swap/pageIn | Paged in swap. | | Varies.
    sys/swap/pageOut | Paged out swap. | | Varies.
    sys/disk/write/count | Writes to disk. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | Varies.
    sys/disk/read/count | Reads from disk. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | Varies.
    sys/disk/write/size | Bytes written to disk. Can be used to determine how much paging is occurring with regard to segments. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | Varies.
    sys/disk/read/size | Bytes read from disk. Can be used to determine how much paging is occurring with regard to segments. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | Varies.
    sys/net/write/size | Bytes written to the network. | netName, netAddress, netHwaddr | Varies.
    sys/net/read/size | Bytes read from the network. | netName, netAddress, netHwaddr | Varies.
    sys/fs/used | Filesystem bytes used. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | < max
    sys/fs/max | Filesystem bytes max. | fsDevName, fsDirName, fsTypeName, fsSysTypeName, fsOptions. | Varies.
    sys/mem/used | Memory used. | | < max
    sys/mem/max | Memory max. | | Varies.
    sys/storage/used | Disk space used. | fsDirName. | Varies.
    sys/cpu | CPU used. | cpuName, cpuTime. | Varies.