Metrics
Metrics are emitted as JSON objects to a runtime log file or over HTTP (to a service such as Apache Kafka). Metric emission is disabled by default.
All Druid metrics share a common set of fields:
- timestamp - the time the metric was created
- metric - the name of the metric
- service - the service name that emitted the metric
- host - the host name that emitted the metric
- value - some numeric value associated with the metric
Metrics may have additional dimensions beyond those listed above.
Most metric values reset each emission period. By default, the Druid emission period is 1 minute; this can be changed by setting the property druid.monitoring.emissionPeriod.
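As a rough illustration of the common fields, the following Python sketch splits one emitted JSON event into its shared fields and any additional dimensions. The sample event, its values, and the helper name parse_metric_event are hypothetical, and timestamp is assumed to be the name of the creation-time field.

```python
import json

# Hypothetical sample event; every value here is made up for illustration.
SAMPLE_EVENT = (
    '{"timestamp": "2020-01-01T00:00:00.000Z", "metric": "query/time", '
    '"service": "druid/broker", "host": "localhost:8082", "value": 42, '
    '"dataSource": "wikipedia"}'
)

# Field names shared by all metrics (timestamp is an assumed name).
COMMON_FIELDS = ("timestamp", "metric", "service", "host", "value")


def parse_metric_event(line):
    """Split one JSON metric event into common fields and extra dimensions."""
    event = json.loads(line)
    missing = [name for name in COMMON_FIELDS if name not in event]
    if missing:
        raise ValueError("missing common fields: %s" % missing)
    common = {name: event[name] for name in COMMON_FIELDS}
    dimensions = {k: v for k, v in event.items() if k not in COMMON_FIELDS}
    return common, dimensions


common, dimensions = parse_metric_event(SAMPLE_EVENT)
print(common["metric"], common["value"], dimensions)
```

Anything that is not one of the common fields is treated as a dimension here, which matches the note that metrics may carry additional dimensions.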
Query metrics
Historical
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
query/time | Milliseconds taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | < 1s |
query/segment/time | Milliseconds taken to query individual segment. Includes time to page in the segment from disk. | id, status, segment. | several hundred milliseconds |
query/wait/time | Milliseconds spent waiting for a segment to be scanned. | id, segment. | < several hundred milliseconds |
segment/scan/pending | Number of segments in queue waiting to be scanned. | | Close to 0 |
query/segmentAndCache/time | Milliseconds taken to query individual segment or hit the cache (if it is enabled on the Historical process). | id, segment. | several hundred milliseconds |
query/cpu/time | Microseconds of CPU time taken to complete a query | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | Varies |
query/count | Number of total queries. | This metric is only available if the QueryCountStatsMonitor module is included. | |
query/success/count | Number of queries successfully processed. | This metric is only available if the QueryCountStatsMonitor module is included. | |
query/failed/count | Number of failed queries. | This metric is only available if the QueryCountStatsMonitor module is included. | |
query/interrupted/count | Number of queries interrupted due to cancellation or timeout. | This metric is only available if the QueryCountStatsMonitor module is included. | |
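The four count metrics can be combined into per-period outcome ratios. The Python sketch below is one way to do that; the function name and sample numbers are hypothetical, and it assumes all four values were collected for the same emission period.

```python
def query_outcome_ratios(total, success, failed, interrupted):
    """Return the share of queries in each outcome for one emission period.

    Arguments correspond to query/count, query/success/count,
    query/failed/count and query/interrupted/count.
    """
    if total <= 0:
        # No queries in this period, so every ratio is reported as zero.
        return {"success": 0.0, "failed": 0.0, "interrupted": 0.0}
    return {
        "success": success / total,
        "failed": failed / total,
        "interrupted": interrupted / total,
    }


# Hypothetical values for a single emission period.
ratios = query_outcome_ratios(total=200, success=190, failed=6, interrupted=4)
print(ratios)
```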
Real-time
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
query/time | Milliseconds taken to complete a query. | Common: dataSource, type, interval, hasFilters, duration, context, remoteAddress, id. Aggregation Queries: numMetrics, numComplexMetrics. GroupBy: numDimensions. TopN: threshold, dimension. | < 1s |
query/wait/time | Milliseconds spent waiting for a segment to be scanned. | id, segment. | several hundred milliseconds |
segment/scan/pending | Number of segments in queue waiting to be scanned. | | Close to 0 |
query/count | Number of total queries. | This metric is only available if the QueryCountStatsMonitor module is included. | |
query/success/count | Number of queries successfully processed. | This metric is only available if the QueryCountStatsMonitor module is included. | |
query/failed/count | Number of failed queries. | This metric is only available if the QueryCountStatsMonitor module is included. | |
query/interrupted/count | Number of queries interrupted due to cancellation or timeout. | This metric is only available if the QueryCountStatsMonitor module is included. | |
Jetty
Metric | Description | Normal Value |
---|---|---|
jetty/numOpenConnections | Number of open jetty connections. | Not much higher than number of jetty threads. |
Cache
Metric | Description | Normal Value |
---|---|---|
query/cache/delta/ | Cache metrics since the last emission. | N/A |
query/cache/total/ | Total cache metrics. | N/A |
Memcached only metrics
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
query/cache/memcached/total | Cache metrics unique to memcached (only if druid.cache.type=memcached), reported as their actual values. | Variable | N/A |
query/cache/memcached/delta | Cache metrics unique to memcached (only if druid.cache.type=memcached), reported as their delta from the prior event emission. | Variable | N/A |
If SQL is enabled, the Broker will emit the following metrics for SQL.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
sqlQuery/time | Milliseconds taken to complete a SQL query. | id, nativeQueryIds, dataSource, remoteAddress, success. | < 1s |
sqlQuery/bytes | Number of bytes returned in the SQL response. | id, nativeQueryIds, dataSource, remoteAddress, success. | |
Ingestion Metrics (Kafka Indexing Service)
These metrics are applicable for the Kafka Indexing Service.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
ingest/kafka/lag | Total lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, should not be a very high number |
ingest/kafka/maxLag | Max lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, should not be a very high number |
ingest/kafka/avgLag | Average lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute. | dataSource. | Greater than 0, should not be a very high number |
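To show how the three lag metrics relate to one another, the Python sketch below derives a total, maximum, and average lag from per-partition offsets. It illustrates the arithmetic only and is not how the Kafka Indexing Service computes these values; the function name and the offsets are hypothetical.

```python
def lag_summary(latest_offsets, consumed_offsets):
    """Compute total, max and average lag across partitions.

    Both arguments map a partition id to an offset. A partition with no
    consumed offset is treated as consumed up to offset 0 (an assumption).
    """
    lags = [
        max(latest_offsets[p] - consumed_offsets.get(p, 0), 0)
        for p in sorted(latest_offsets)
    ]
    if not lags:
        return {"lag": 0, "maxLag": 0, "avgLag": 0.0}
    return {
        "lag": sum(lags),               # analogous to ingest/kafka/lag
        "maxLag": max(lags),            # analogous to ingest/kafka/maxLag
        "avgLag": sum(lags) / len(lags),  # analogous to ingest/kafka/avgLag
    }


# Hypothetical offsets for a three-partition topic.
latest = {0: 100, 1: 250, 2: 80}
consumed = {0: 90, 1: 200, 2: 80}
summary = lag_summary(latest, consumed)
print(summary)  # {'lag': 60, 'maxLag': 50, 'avgLag': 20.0}
```

A large gap between maxLag and avgLag suggests that a single partition is falling behind rather than the whole topic.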
These metrics are only available if the RealtimeMetricsMonitor is included in the monitors list for the Realtime process. These metrics are deltas for each emission period.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
ingest/events/thrownAway | Number of events rejected because they are outside the windowPeriod. | dataSource, taskId, taskType. | 0 |
ingest/events/unparseable | Number of events rejected because the events are unparseable. | dataSource, taskId, taskType. | 0 |
ingest/events/duplicate | Number of events rejected because the events are duplicated. | dataSource, taskId, taskType. | 0 |
ingest/events/processed | Number of events successfully processed per emission period. | dataSource, taskId, taskType. | Equal to your # of events per emission period. |
ingest/rows/output | Number of Druid rows persisted. | dataSource, taskId, taskType. | Your # of events with rollup. |
ingest/persists/count | Number of times persist occurred. | dataSource, taskId, taskType. | Depends on configuration. |
ingest/persists/time | Milliseconds spent doing intermediate persist. | dataSource, taskId, taskType. | Depends on configuration. Generally a few minutes at most. |
ingest/persists/cpu | CPU time in nanoseconds spent on intermediate persists. | dataSource, taskId, taskType. | Depends on configuration. Generally a few minutes at most. |
ingest/persists/backPressure | Milliseconds spent creating persist tasks and blocking waiting for them to finish. | dataSource, taskId, taskType. | 0 or very low |
ingest/persists/failed | Number of persists that failed. | dataSource, taskId, taskType. | 0 |
ingest/handoff/failed | Number of handoffs that failed. | dataSource, taskId, taskType. | 0 |
ingest/merge/time | Milliseconds spent merging intermediate segments. | dataSource, taskId, taskType. | Depends on configuration. Generally a few minutes at most. |
ingest/merge/cpu | CPU time in nanoseconds spent on merging intermediate segments. | dataSource, taskId, taskType. | Depends on configuration. Generally a few minutes at most. |
ingest/handoff/count | Number of handoffs that happened. | dataSource, taskId, taskType. | Varies. Generally greater than 0 once every segment granular period if the cluster is operating normally. |
ingest/sink/count | Number of sinks not handed off. | dataSource, taskId, taskType. | 1~3 |
ingest/events/messageGap | Time gap between the data time in event and current system time. | dataSource, taskId, taskType. | Greater than 0, depends on the time carried in event |
Note: If the JVM does not support CPU time measurement for the current thread, ingest/merge/cpu and ingest/persists/cpu will be 0.
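Because ingest/events/processed counts input events and ingest/rows/output counts persisted rows, dividing one by the other gives a rough measure of how much rollup is reducing the data. The Python sketch below shows that calculation; "rollup ratio" is a derived quantity introduced here for illustration and is not a metric Druid emits, and the function name and numbers are hypothetical.

```python
def rollup_ratio(events_processed, rows_output):
    """Return input events per persisted row, or None if no rows were output.

    events_processed corresponds to ingest/events/processed and
    rows_output to ingest/rows/output, summed over the same time span.
    """
    if rows_output <= 0:
        return None
    return events_processed / rows_output


# Hypothetical totals: 10,000 events rolled up into 2,500 rows.
ratio = rollup_ratio(events_processed=10000, rows_output=2500)
print(ratio)  # 4.0
```

A ratio close to 1 would indicate that rollup is combining few or no events.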
Indexing service
Coordination
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
segment/assigned/count | Number of segments assigned to be loaded in the cluster. | tier. | Varies. |
segment/moved/count | Number of segments moved in the cluster. | tier. | Varies. |
segment/dropped/count | Number of segments dropped due to being overshadowed. | tier. | Varies. |
segment/deleted/count | Number of segments dropped due to rules. | tier. | Varies. |
segment/unneeded/count | Number of segments dropped due to being marked as unused. | tier. | Varies. |
segment/cost/raw | Used in cost balancing. The raw cost of hosting segments. | tier. | Varies. |
segment/cost/normalization | Used in cost balancing. The normalization of hosting segments. | tier. | Varies. |
segment/cost/normalized | Used in cost balancing. The normalized cost of hosting segments. | tier. | Varies. |
segment/loadQueue/size | Size in bytes of segments to load. | server. | Varies. |
segment/loadQueue/failed | Number of segments that failed to load. | server. | 0 |
segment/loadQueue/count | Number of segments to load. | server. | Varies. |
segment/dropQueue/count | Number of segments to drop. | server. | Varies. |
segment/size | Total size of used segments in a data source. Emitted only for data sources to which at least one used segment belongs. | dataSource. | Varies. |
segment/count | Number of used segments belonging to a data source. Emitted only for data sources to which at least one used segment belongs. | dataSource. | < max |
segment/overShadowed/count | Number of overshadowed segments. | | Varies. |
segment/unavailable/count | Number of segments (not including replicas) left to load until segments that should be loaded in the cluster are available for queries. | dataSource. | 0 |
segment/underReplicated/count | Number of segments (including replicas) left to load until segments that should be loaded in the cluster are available for queries. | tier, dataSource. | 0 |
tier/historical/count | Number of available historical nodes in each tier. | tier. | Varies. |
tier/replication/factor | Configured maximum replication factor in each tier. | tier. | Varies. |
tier/required/capacity | Total capacity in bytes required in each tier. | tier. | Varies. |
tier/total/capacity | Total capacity in bytes available in each tier. | tier. | Varies. |
If emitBalancingStats is set to true in the Coordinator dynamic configuration, then log entries for class org.apache.druid.server.coordinator.duty.EmitClusterStatsAndMetrics will have extra information on balancing decisions.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
segment/max | Maximum byte limit available for segments. | | Varies. |
segment/used | Bytes used for served segments. | dataSource, tier, priority. | < max |
segment/usedPercent | Percentage of space used by served segments. | dataSource, tier, priority. | < 100% |
segment/count | Number of served segments. | dataSource, tier, priority. | Varies. |
segment/pendingDelete | On-disk size in bytes of segments that are waiting to be cleared out. | | Varies. |
JVM
These metrics are only available if the JVMMonitor module is included.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
jvm/pool/committed | Committed pool. | poolKind, poolName. | close to max pool |
jvm/pool/init | Initial pool. | poolKind, poolName. | Varies. |
jvm/pool/max | Max pool. | poolKind, poolName. | Varies. |
jvm/pool/used | Pool used. | poolKind, poolName. | < max pool |
jvm/bufferpool/count | Bufferpool count. | bufferpoolName. | Varies. |
jvm/bufferpool/used | Bufferpool used. | bufferpoolName. | close to capacity |
jvm/bufferpool/capacity | Bufferpool capacity. | bufferpoolName. | Varies. |
jvm/mem/init | Initial memory. | memKind. | Varies. |
jvm/mem/max | Max memory. | memKind. | Varies. |
jvm/mem/used | Used memory. | memKind. | < max memory |
jvm/mem/committed | Committed memory. | memKind. | close to max memory |
jvm/gc/count | Garbage collection count. | gcName (cms/g1/parallel/etc.), gcGen (old/young) | Varies. |
jvm/gc/cpu | CPU time in nanoseconds spent on garbage collection. Note: jvm/gc/cpu represents the total time over multiple GC cycles; divide by jvm/gc/count to get the mean GC time per cycle. | gcName, gcGen | Sum of jvm/gc/cpu should be within 10-30% of sum of jvm/cpu/total, depending on the GC algorithm used (reported by ) |
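The note on jvm/gc/cpu describes two simple calculations: the mean GC time per cycle and the share of total CPU time spent in GC. The Python sketch below works through both. The function names and sample numbers are hypothetical, and the share calculation assumes the caller has already converted both inputs to the same time unit.

```python
def mean_gc_nanos_per_cycle(gc_cpu_nanos, gc_count):
    """Divide jvm/gc/cpu by jvm/gc/count to get the mean GC time per cycle."""
    if gc_count <= 0:
        return 0.0
    return gc_cpu_nanos / gc_count


def gc_cpu_share(gc_cpu_sum, cpu_total_sum):
    """Return GC CPU time as a fraction of total CPU time.

    Both sums must be expressed in the same unit (an assumption of this sketch).
    """
    if cpu_total_sum <= 0:
        return 0.0
    return gc_cpu_sum / cpu_total_sum


# Hypothetical values: 600 ms of GC CPU time over 12 cycles, 3 s of total CPU.
mean_nanos = mean_gc_nanos_per_cycle(600_000_000, 12)
share = gc_cpu_share(600_000_000, 3_000_000_000)
print(mean_nanos / 1_000_000, "ms per cycle")
print(round(share * 100, 1), "% of CPU time in GC")
```

With these sample numbers the share is 20%, which falls inside the 10-30% range given as the normal value.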
EventReceiverFirehose
The following metrics are only available if the EventReceiverFirehoseMonitor module is included.
Metric | Description | Dimensions | Normal Value |
---|---|---|---|
ingest/events/buffered | Number of events queued in the EventReceiverFirehose’s buffer | serviceName, dataSource, taskId, taskType, bufferCapacity. | Equal to current # of events in the buffer queue. |
ingest/bytes/received | Number of bytes received by the EventReceiverFirehose. | serviceName, dataSource, taskId, taskType. | Varies. |
Sys
These metrics are only available if the SysMonitor module is included.