Metrics

    You can access the metric system from any user function that extends RichFunction by calling getRuntimeContext().getMetricGroup(). This method returns a MetricGroup object on which you can create and register new metrics.

    Flink supports Counters, Gauges, Histograms and Meters.

    Counter

    A Counter is used to count something. The current value can be in- or decremented using inc()/inc(long n) or dec()/dec(long n). You can create and register a Counter by calling counter(String name) on a MetricGroup.

    Scala

    class MyMapper extends RichMapFunction[String,String] {
      @transient private var counter: Counter = _

      override def open(parameters: Configuration): Unit = {
        // Register the counter once, when the function is initialized.
        counter = getRuntimeContext()
          .getMetricGroup()
          .counter("myCounter")
      }

      override def map(value: String): String = {
        counter.inc()
        value
      }
    }

    Alternatively you can also use your own Counter implementation:

    Java

    public class MyMapper extends RichMapFunction<String, String> {
      private transient Counter counter;

      @Override
      public void open(Configuration config) {
        // CustomCounter is a user-provided implementation of the Counter interface.
        this.counter = getRuntimeContext()
          .getMetricGroup()
          .counter("myCustomCounter", new CustomCounter());
      }

      @Override
      public String map(String value) throws Exception {
        this.counter.inc();
        return value;
      }
    }

    Scala

    class MyMapper extends RichMapFunction[String,String] {
      @transient private var counter: Counter = _

      override def open(parameters: Configuration): Unit = {
        // CustomCounter is a user-provided implementation of the Counter interface.
        counter = getRuntimeContext()
          .getMetricGroup()
          .counter("myCustomCounter", new CustomCounter())
      }

      override def map(value: String): String = {
        counter.inc()
        value
      }
    }
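
The examples above pass a CustomCounter without showing it. The following is a minimal sketch of what such a class might look like. It is kept free of Flink dependencies so that it runs on its own; the class name CustomCounterSketch is hypothetical, and in a real job the class would additionally declare that it implements org.apache.flink.metrics.Counter. Backing the count with an AtomicLong is a design assumption made here, not something the text above requires.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a user-defined counter. The method names mirror
// the inc()/dec() operations described above plus a getCount() accessor.
final class CustomCounterSketch {
    private final AtomicLong count = new AtomicLong();

    void inc() { count.incrementAndGet(); }

    void inc(long n) { count.addAndGet(n); }

    void dec() { count.decrementAndGet(); }

    void dec(long n) { count.addAndGet(-n); }

    long getCount() { return count.get(); }

    public static void main(String[] args) {
        CustomCounterSketch counter = new CustomCounterSketch();
        counter.inc();    // 1
        counter.inc(5);   // 6
        counter.dec(2);   // 4
        System.out.println(counter.getCount()); // prints 4
    }
}
```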

    Gauge

    A Gauge provides a value of any type on demand. In order to use a Gauge you must first create a class that implements the org.apache.flink.metrics.Gauge interface. There is no restriction for the type of the returned value. You can register a gauge by calling gauge(String name, Gauge gauge) on a MetricGroup.

    Java

    public class MyMapper extends RichMapFunction<String, String> {
      private transient int valueToExpose = 0;

      @Override
      public void open(Configuration config) {
        // The gauge reads valueToExpose whenever a reporter asks for its value.
        getRuntimeContext()
          .getMetricGroup()
          .gauge("MyGauge", new Gauge<Integer>() {
            @Override
            public Integer getValue() {
              return valueToExpose;
            }
          });
      }

      @Override
      public String map(String value) throws Exception {
        valueToExpose++;
        return value;
      }
    }

    Scala

    class MyMapper extends RichMapFunction[String,String] {
      @transient private var valueToExpose = 0

      override def open(parameters: Configuration): Unit = {
        // The gauge reads valueToExpose whenever a reporter asks for its value.
        getRuntimeContext()
          .getMetricGroup()
          .gauge[Int, ScalaGauge[Int]]("MyGauge", ScalaGauge[Int](() => valueToExpose))
      }

      override def map(value: String): String = {
        valueToExpose += 1
        value
      }
    }

    Note that reporters will turn the exposed object into a String, which means that a meaningful toString() implementation is required.

    Histogram

    A Histogram measures the distribution of long values. You can register one by calling histogram(String name, Histogram histogram) on a MetricGroup.

    Java

    public class MyMapper extends RichMapFunction<Long, Long> {
      private transient Histogram histogram;

      @Override
      public void open(Configuration config) {
        // MyHistogram is a user-provided implementation of the Histogram interface.
        this.histogram = getRuntimeContext()
          .getMetricGroup()
          .histogram("myHistogram", new MyHistogram());
      }

      @Override
      public Long map(Long value) throws Exception {
        this.histogram.update(value);
        return value;
      }
    }

    Scala

    class MyMapper extends RichMapFunction[Long,Long] {
      @transient private var histogram: Histogram = _

      override def open(parameters: Configuration): Unit = {
        // MyHistogram is a user-provided implementation of the Histogram interface.
        histogram = getRuntimeContext()
          .getMetricGroup()
          .histogram("myHistogram", new MyHistogram())
      }

      override def map(value: Long): Long = {
        histogram.update(value)
        value
      }
    }

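    Flink does not provide a default implementation for Histogram, but offers a wrapper that allows the use of Codahale/DropWizard histograms. To use this wrapper, add the flink-metrics-dropwizard dependency (groupId org.apache.flink, the same artifact listed for meters in the Meter section) to your pom.xml.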

    You can then register a Codahale/DropWizard histogram like this:

    Java

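    import com.codahale.metrics.SlidingWindowReservoir;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
    import org.apache.flink.metrics.Histogram;

    public class MyMapper extends RichMapFunction<Long, Long> {
      private transient Histogram histogram;

      @Override
      public void open(Configuration config) {
        // The DropWizard histogram keeps the 500 most recent values.
        com.codahale.metrics.Histogram dropwizardHistogram =
          new com.codahale.metrics.Histogram(new SlidingWindowReservoir(500));

        this.histogram = getRuntimeContext()
          .getMetricGroup()
          .histogram("myHistogram", new DropwizardHistogramWrapper(dropwizardHistogram));
      }

      @Override
      public Long map(Long value) throws Exception {
        this.histogram.update(value);
        return value;
      }
    }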

    Scala

    class MyMapper extends RichMapFunction[Long, Long] {
      @transient private var histogram: Histogram = _

      override def open(config: Configuration): Unit = {
        // The DropWizard histogram keeps the 500 most recent values.
        val dropwizardHistogram =
          new com.codahale.metrics.Histogram(new SlidingWindowReservoir(500))

        histogram = getRuntimeContext()
          .getMetricGroup()
          .histogram("myHistogram", new DropwizardHistogramWrapper(dropwizardHistogram))
      }

      override def map(value: Long): Long = {
        histogram.update(value)
        value
      }
    }

    Meter

    A Meter measures an average throughput. The occurrence of a single event can be registered with the markEvent() method; the occurrence of multiple events at the same time can be registered with the markEvent(long n) method. You can register a meter by calling meter(String name, Meter meter) on a MetricGroup.

    Java

    public class MyMapper extends RichMapFunction<Long, Long> {
      private transient Meter meter;

      @Override
      public void open(Configuration config) {
        // MyMeter is a user-provided implementation of the Meter interface.
        this.meter = getRuntimeContext()
          .getMetricGroup()
          .meter("myMeter", new MyMeter());
      }

      @Override
      public Long map(Long value) throws Exception {
        this.meter.markEvent();
        return value;
      }
    }

    Scala

    class MyMapper extends RichMapFunction[Long,Long] {
      @transient private var meter: Meter = _

      override def open(config: Configuration): Unit = {
        // MyMeter is a user-provided implementation of the Meter interface.
        meter = getRuntimeContext()
          .getMetricGroup()
          .meter("myMeter", new MyMeter())
      }

      override def map(value: Long): Long = {
        meter.markEvent()
        value
      }
    }
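
The MyMeter class used above is not shown. As a rough illustration of the idea of an average throughput, the sketch below counts events and divides by the elapsed time since creation. It is deliberately independent of Flink so that it runs standalone: the class name SimpleMeterSketch and the injectable clock are assumptions made for this example, and a real implementation would implement org.apache.flink.metrics.Meter and might prefer a moving average over a lifetime average.

```java
import java.util.function.LongSupplier;

// Hypothetical lifetime-average meter: events per second since construction.
final class SimpleMeterSketch {
    private final LongSupplier clockMillis; // injectable clock, for testability
    private final long startMillis;
    private long count;

    SimpleMeterSketch(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
        this.startMillis = clockMillis.getAsLong();
    }

    void markEvent() { markEvent(1); }

    void markEvent(long n) { count += n; }

    long getCount() { return count; }

    // Average events per second; 0.0 if no time has elapsed yet.
    double getRate() {
        long elapsedMillis = clockMillis.getAsLong() - startMillis;
        return elapsedMillis <= 0 ? 0.0 : count * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        SimpleMeterSketch meter = new SimpleMeterSketch(System::currentTimeMillis);
        meter.markEvent();
        meter.markEvent(9);
        System.out.println("events so far: " + meter.getCount());
    }
}
```

Injecting the clock keeps the rate computation deterministic in tests; in production code the system clock would normally be used directly.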

    Flink offers a Wrapper that allows usage of Codahale/DropWizard meters. To use this wrapper add the following dependency in your pom.xml:

    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-metrics-dropwizard</artifactId>
      <version>1.14.4</version>
    </dependency>

    You can then register a Codahale/DropWizard meter like this:

    Java

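    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.dropwizard.metrics.DropwizardMeterWrapper;
    import org.apache.flink.metrics.Meter;

    public class MyMapper extends RichMapFunction<Long, Long> {
      private transient Meter meter;

      @Override
      public void open(Configuration config) {
        // Wrap the DropWizard meter so that it can be registered as a Flink Meter.
        com.codahale.metrics.Meter dropwizardMeter = new com.codahale.metrics.Meter();

        this.meter = getRuntimeContext()
          .getMetricGroup()
          .meter("myMeter", new DropwizardMeterWrapper(dropwizardMeter));
      }

      @Override
      public Long map(Long value) throws Exception {
        this.meter.markEvent();
        return value;
      }
    }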

    Scala

    class MyMapper extends RichMapFunction[Long,Long] {
      @transient private var meter: Meter = _

      override def open(config: Configuration): Unit = {
        // Wrap the DropWizard meter so that it can be registered as a Flink Meter.
        val dropwizardMeter: com.codahale.metrics.Meter = new com.codahale.metrics.Meter()

        meter = getRuntimeContext()
          .getMetricGroup()
          .meter("myMeter", new DropwizardMeterWrapper(dropwizardMeter))
      }

      override def map(value: Long): Long = {
        meter.markEvent()
        value
      }
    }

    Scope

    Every metric is assigned an identifier and a set of key-value pairs under which the metric will be reported.

    The identifier is based on 3 components: a user-defined name when registering the metric, an optional user-defined scope and a system-provided scope. For example, if A.B is the system scope, C.D the user scope and E the name, then the identifier for the metric will be A.B.C.D.E.
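
The composition rule above can be illustrated with a few lines of plain Java. This is only a sketch of the concatenation described in the text, not Flink's own code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mirrors the rule: system scope + user scope + name.
final class MetricIdentifierSketch {

    static String identifier(List<String> systemScope, List<String> userScope,
                             String name, String delimiter) {
        List<String> parts = new ArrayList<>(systemScope);
        parts.addAll(userScope);
        parts.add(name);
        return String.join(delimiter, parts);
    }

    public static void main(String[] args) {
        // System scope A.B, user scope C.D, metric name E.
        System.out.println(identifier(List.of("A", "B"), List.of("C", "D"), "E", "."));
        // prints A.B.C.D.E
    }
}
```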

    You can configure which delimiter to use for the identifier (default: .) by setting the metrics.scope.delimiter key in conf/flink-conf.yaml.

    User Scope

    You can define a user scope by calling MetricGroup#addGroup(String name), MetricGroup#addGroup(int name) or MetricGroup#addGroup(String key, String value). These methods affect what MetricGroup#getMetricIdentifier and MetricGroup#getScopeComponents return.

    Scala

    // User scope defined by a group name.
    counter = getRuntimeContext()
      .getMetricGroup()
      .addGroup("MyMetrics")
      .counter("myCounter")

    // User scope defined by a key-value pair.
    counter = getRuntimeContext()
      .getMetricGroup()
      .addGroup("MyMetricsKey", "MyMetricsValue")
      .counter("myCounter")

    System Scope

    Which context information is included can be configured by setting the following keys in conf/flink-conf.yaml. Each of these keys expects a format string that may contain constants (e.g. “taskmanager”) and variables (e.g. “<task_id>”), which are replaced at runtime.

    • metrics.scope.jm
      • Default: <host>.jobmanager
      • Applied to all metrics that were scoped to a job manager.
    • metrics.scope.jm.job
      • Default: <host>.jobmanager.<job_name>
      • Applied to all metrics that were scoped to a job manager and job.
    • metrics.scope.tm
      • Default: <host>.taskmanager.<tm_id>
      • Applied to all metrics that were scoped to a task manager.
    • metrics.scope.tm.job
      • Default: <host>.taskmanager.<tm_id>.<job_name>
      • Applied to all metrics that were scoped to a task manager and job.
    • metrics.scope.task
      • Default: <host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>
      • Applied to all metrics that were scoped to a task.
    • metrics.scope.operator
      • Default: <host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
      • Applied to all metrics that were scoped to an operator.

    There are no restrictions on the number or order of variables. Variables are case sensitive.

    The default scope for operator metrics will result in an identifier akin to localhost.taskmanager.1234.MyJob.MyOperator.0.MyMetric

    If you also want to include the task name but omit the task manager information you can specify the following format:

    metrics.scope.operator: <host>.<job_name>.<task_name>.<operator_name>.<subtask_index>

    This could create the identifier localhost.MyJob.MySource_->_MyOperator.MyOperator.0.MyMetric.

    Note that with this format string an identifier clash can occur if the same job is run multiple times concurrently, which can lead to inconsistent metric data. It is therefore advisable either to use format strings that provide a degree of uniqueness by including IDs (e.g. <job_id>) or to assign unique names to jobs and operators.
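
To make the substitution of variables in a scope format concrete, here is a small standalone sketch. It is not Flink's implementation: the class name, the regular expression for variables, and the choice to leave unknown variables untouched are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical resolver for scope format strings such as "<host>.<job_name>".
final class ScopeFormatSketch {
    // Assumption: variables are lower-case names with underscores in angle brackets.
    private static final Pattern VARIABLE = Pattern.compile("<[a-z_]+>");

    static String resolve(String format, Map<String, String> variables) {
        Matcher matcher = VARIABLE.matcher(format);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            // Unknown variables are left as-is (an assumption of this sketch).
            String value = variables.getOrDefault(matcher.group(), matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> variables = Map.of(
            "<host>", "localhost",
            "<job_name>", "MyJob",
            "<task_name>", "MySource_->_MyOperator",
            "<operator_name>", "MyOperator",
            "<subtask_index>", "0");
        String scope = resolve(
            "<host>.<job_name>.<task_name>.<operator_name>.<subtask_index>", variables);
        System.out.println(scope + ".MyMetric");
        // prints localhost.MyJob.MySource_->_MyOperator.MyOperator.0.MyMetric
    }
}
```

Two jobs with the same name would resolve to the same string here, which is the identifier clash the note above warns about; adding <job_id> to the format would make the results distinct.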

    List of all Variables

    • JobManager: <host>
    • TaskManager: <host>, <tm_id>
    • Job: <job_id>, <job_name>
    • Task: <task_id>, <task_name>, <task_attempt_id>, <task_attempt_num>, <subtask_index>
    • Operator: <operator_id>, <operator_name>, <subtask_index>

    Important: For the Batch API, <operator_id> is always equal to <task_id>.

    User Variables

    You can define a user variable by calling MetricGroup#addGroup(String key, String value). This method affects what MetricGroup#getMetricIdentifier, MetricGroup#getScopeComponents and MetricGroup#getAllVariables() return.

    Important: User variables cannot be used in scope formats.

    Java

    counter = getRuntimeContext()
      .getMetricGroup()
      .addGroup("MyMetricsKey", "MyMetricsValue")
      .counter("myCounter");

    Scala

    counter = getRuntimeContext()
      .getMetricGroup()
      .addGroup("MyMetricsKey", "MyMetricsValue")
      .counter("myCounter")

    For information on how to set up Flink’s metric reporters please take a look at the metric reporters documentation.

    System metrics

    By default Flink gathers several metrics that provide deep insights on the current state. This section is a reference of all these metrics.

    The tables below generally feature 5 columns:

    • The “Scope” column describes which scope format is used to generate the system scope. For example, if the cell contains “Operator” then the scope format for “metrics.scope.operator” is used. If the cell contains multiple values, separated by a slash, then the metrics are reported multiple times for different entities, like for both job- and taskmanagers.

    • The (optional) “Infix” column describes which infix is appended to the system scope.

    • The “Metrics” column lists the names of all metrics that are registered for the given scope and infix.

    • The “Description” column provides information as to what a given metric is measuring.

    • The “Type” column describes which metric type is used for the measurement.

    Note that all dots in the infix/metric name columns are still subject to the “metrics.scope.delimiter” setting.

    Thus, in order to infer the metric identifier:

    1. Take the scope format based on the “Scope” column.
    2. Append the value in the “Infix” column if present, and account for the “metrics.scope.delimiter” setting.
    3. Append the metric name.
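
The three steps can be sketched in plain Java as follows. The helper is hypothetical and assumes that the system scope has already been resolved to a string; it only shows how the infix and metric name are appended and how their dots follow the configured delimiter.

```java
// Hypothetical helper that follows the three steps listed above.
final class SystemMetricIdSketch {

    static String identifier(String resolvedScope, String infix,
                             String metricName, String delimiter) {
        // Step 1: start from the (already resolved) scope.
        StringBuilder id = new StringBuilder(resolvedScope);
        // Step 2: append the infix, if any, using the configured delimiter.
        if (infix != null && !infix.isEmpty()) {
            id.append(delimiter).append(infix.replace(".", delimiter));
        }
        // Step 3: append the metric name.
        id.append(delimiter).append(metricName.replace(".", delimiter));
        return id.toString();
    }

    public static void main(String[] args) {
        System.out.println(identifier(
            "localhost.taskmanager.1234", "Status.JVM.Memory", "Heap.Used", "."));
        // prints localhost.taskmanager.1234.Status.JVM.Memory.Heap.Used
    }
}
```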

    CPU

    Memory

    The memory-related metrics require Oracle’s memory management (also included in OpenJDK’s Hotspot implementation) to be in place. Some metrics might not be exposed when using other JVM implementations (e.g. IBM’s J9).

    | Scope | Infix | Metrics | Description | Type |
    |---|---|---|---|---|
    | Job-/TaskManager | Status.JVM.Memory | Heap.Used | The amount of heap memory currently used (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Heap.Committed | The amount of heap memory guaranteed to be available to the JVM (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Heap.Max | The maximum amount of heap memory that can be used for memory management (in bytes). This value is not necessarily equal to the maximum value specified through -Xmx or the equivalent Flink configuration parameter. Some GC algorithms allocate heap memory that is not available to user code and is therefore not exposed through the heap metrics. | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | NonHeap.Used | The amount of non-heap memory currently used (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | NonHeap.Committed | The amount of non-heap memory guaranteed to be available to the JVM (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | NonHeap.Max | The maximum amount of non-heap memory that can be used for memory management (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Metaspace.Used | The amount of memory currently used in the Metaspace memory pool (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Metaspace.Committed | The amount of memory guaranteed to be available to the JVM in the Metaspace memory pool (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Metaspace.Max | The maximum amount of memory that can be used in the Metaspace memory pool (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Direct.Count | The number of buffers in the direct buffer pool. | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Direct.MemoryUsed | The amount of memory used by the JVM for the direct buffer pool (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Direct.TotalCapacity | The total capacity of all buffers in the direct buffer pool (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Mapped.Count | The number of buffers in the mapped buffer pool. | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Mapped.MemoryUsed | The amount of memory used by the JVM for the mapped buffer pool (in bytes). | Gauge |
    | Job-/TaskManager | Status.JVM.Memory | Mapped.TotalCapacity | The total capacity of all buffers in the mapped buffer pool (in bytes). | Gauge |
    | Job-/TaskManager | Status.Flink.Memory | Managed.Used | The amount of managed memory currently used. | Gauge |
    | Job-/TaskManager | Status.Flink.Memory | Managed.Total | The total amount of managed memory. | Gauge |

    Threads

    | Scope | Infix | Metrics | Description | Type |
    |---|---|---|---|---|
    | Job-/TaskManager | Status.JVM.Threads | Count | The total number of live threads. | Gauge |

    GarbageCollection

    | Scope | Infix | Metrics | Description | Type |
    |---|---|---|---|---|
    | Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.Count | The total number of collections that have occurred. | Gauge |
    | Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.Time | The total time spent performing garbage collection. | Gauge |

    ClassLoader

    | Scope | Infix | Metrics | Description | Type |
    |---|---|---|---|---|
    | Job-/TaskManager | Status.JVM.ClassLoader | ClassesLoaded | The total number of classes loaded since the start of the JVM. | Gauge |
    | Job-/TaskManager | Status.JVM.ClassLoader | ClassesUnloaded | The total number of classes unloaded since the start of the JVM. | Gauge |

    Network

    Default shuffle service

    Metrics related to data exchange between task executors using Netty network communication.

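    | Scope | Infix | Metrics | Description | Type |
    |---|---|---|---|---|
    | TaskManager | Status.Shuffle.Netty | AvailableMemorySegments | The number of unused memory segments. | Gauge |
    | TaskManager | Status.Shuffle.Netty | UsedMemorySegments | The number of used memory segments. | Gauge |
    | TaskManager | Status.Shuffle.Netty | TotalMemorySegments | The number of allocated memory segments. | Gauge |
    | TaskManager | Status.Shuffle.Netty | AvailableMemory | The amount of unused memory in bytes. | Gauge |
    | TaskManager | Status.Shuffle.Netty | UsedMemory | The amount of used memory in bytes. | Gauge |
    | TaskManager | Status.Shuffle.Netty | TotalMemory | The amount of allocated memory in bytes. | Gauge |
    | Task | Shuffle.Netty.Input.Buffers | inputQueueLength | The number of queued input buffers. | Gauge |
    | Task | Shuffle.Netty.Input.Buffers | inPoolUsage | An estimate of the input buffers usage (ignores LocalInputChannels). | Gauge |
    | Task | Shuffle.Netty.Input.Buffers | inputFloatingBuffersUsage | An estimate of the floating input buffers usage (ignores LocalInputChannels). | Gauge |
    | Task | Shuffle.Netty.Input.Buffers | inputExclusiveBuffersUsage | An estimate of the exclusive input buffers usage (ignores LocalInputChannels). | Gauge |
    | Task | Shuffle.Netty.Output.Buffers | outputQueueLength | The number of queued output buffers. | Gauge |
    | Task | Shuffle.Netty.Output.Buffers | outPoolUsage | An estimate of the output buffers usage. | Gauge |
    | Task | Shuffle.Netty.<Input\|Output>.<gate\|partition> (only available if the taskmanager.net.detailed-metrics config option is set) | totalQueueLen | Total number of queued buffers in all input/output channels. | Gauge |
    | Task | Shuffle.Netty.<Input\|Output>.<gate\|partition> (only available if the taskmanager.net.detailed-metrics config option is set) | minQueueLen | Minimum number of queued buffers in all input/output channels. | Gauge |
    | Task | Shuffle.Netty.<Input\|Output>.<gate\|partition> (only available if the taskmanager.net.detailed-metrics config option is set) | maxQueueLen | Maximum number of queued buffers in all input/output channels. | Gauge |
    | Task | Shuffle.Netty.<Input\|Output>.<gate\|partition> (only available if the taskmanager.net.detailed-metrics config option is set) | avgQueueLen | Average number of queued buffers in all input/output channels. | Gauge |
    | Task | Shuffle.Netty.Input | numBytesInLocal | The total number of bytes this task has read from a local source. | Counter |
    | Task | Shuffle.Netty.Input | numBytesInLocalPerSecond | The number of bytes this task reads from a local source per second. | Meter |
    | Task | Shuffle.Netty.Input | numBytesInRemote | The total number of bytes this task has read from a remote source. | Counter |
    | Task | Shuffle.Netty.Input | numBytesInRemotePerSecond | The number of bytes this task reads from a remote source per second. | Meter |
    | Task | Shuffle.Netty.Input | numBuffersInLocal | The total number of network buffers this task has read from a local source. | Counter |
    | Task | Shuffle.Netty.Input | numBuffersInLocalPerSecond | The number of network buffers this task reads from a local source per second. | Meter |
    | Task | Shuffle.Netty.Input | numBuffersInRemote | The total number of network buffers this task has read from a remote source. | Counter |
    | Task | Shuffle.Netty.Input | numBuffersInRemotePerSecond | The number of network buffers this task reads from a remote source per second. | Meter |

    | Scope | Metrics | Description | Type |
    |---|---|---|---|
    | JobManager | numRegisteredTaskManagers | The number of registered taskmanagers. | Gauge |
    | JobManager | numRunningJobs | The number of running jobs. | Gauge |
    | JobManager | taskSlotsAvailable | The number of available task slots. | Gauge |
    | JobManager | taskSlotsTotal | The total number of task slots. | Gauge |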

    Availability

    | Scope | Metrics | Description | Type |
    |---|---|---|---|
    | Job (only available on JobManager) | restartingTime | The time it took to restart the job, or how long the current restart has been in progress (in milliseconds). | Gauge |
    | Job (only available on JobManager) | uptime | The time that the job has been running without interruption (in milliseconds). Returns -1 for completed jobs. | Gauge |
    | Job (only available on JobManager) | downtime | For jobs currently in a failing/recovering situation, the time elapsed during this outage (in milliseconds). Returns 0 for running jobs and -1 for completed jobs. | Gauge |
    | Job (only available on JobManager) | fullRestarts | Attention: deprecated, use numRestarts. | Gauge |
    | Job (only available on JobManager) | numRestarts | The total number of restarts since this job was submitted, including full restarts and fine-grained restarts. | Gauge |

    Checkpointing

    Note that for failed checkpoints, metrics are updated on a best-effort basis and may not be accurate.

    | Scope | Metrics | Description | Type |
    |---|---|---|---|
    | Job (only available on JobManager) | lastCheckpointDuration | The time it took to complete the last checkpoint (in milliseconds). | Gauge |
    | Job (only available on JobManager) | lastCheckpointSize | The total size of the last checkpoint (in bytes). | Gauge |
    | Job (only available on JobManager) | lastCheckpointExternalPath | The path where the last external checkpoint was stored. | Gauge |
    | Job (only available on JobManager) | lastCheckpointRestoreTimestamp | Timestamp when the last checkpoint was restored at the coordinator (in milliseconds). | Gauge |
    | Job (only available on JobManager) | numberOfInProgressCheckpoints | The number of in-progress checkpoints. | Gauge |
    | Job (only available on JobManager) | numberOfCompletedCheckpoints | The number of successfully completed checkpoints. | Gauge |
    | Job (only available on JobManager) | numberOfFailedCheckpoints | The number of failed checkpoints. | Gauge |
    | Job (only available on JobManager) | totalNumberOfCheckpoints | The total number of checkpoints (in progress, completed, failed). | Gauge |
    | Task | checkpointAlignmentTime | The time that the last barrier alignment took to complete, or how long the current alignment has taken so far (in nanoseconds). This is the time between receiving the first and the last checkpoint barrier. You can find more information in the Monitoring State and Checkpoints section. | Gauge |
    | Task | checkpointStartDelayNanos | The time that elapsed between the creation of the last checkpoint and the time when this task started the checkpointing process (in nanoseconds). This delay shows how long it takes for the first checkpoint barrier to reach the task. A high value indicates back-pressure. If only a specific task has a long start delay, the most likely reason is data skew. | Gauge |

    RocksDB

    Certain RocksDB native metrics are available but disabled by default; refer to the RocksDB native metrics documentation for the full list.

    IO

    Connectors

    Kafka Connectors

    Please refer to Kafka monitoring.

    Kinesis Connectors

    | Scope | Metrics | User Variables | Description | Type |
    |---|---|---|---|---|
    | Operator | millisBehindLatest | stream, shardId | The number of milliseconds the consumer is behind the head of the stream, indicating how far behind current time the consumer is, for each Kinesis shard. A particular shard’s metric can be specified by stream name and shard id. A value of 0 indicates record processing is caught up, and there are no new records to process at this moment. A value of -1 indicates that there is no reported value for the metric yet. | Gauge |
    | Operator | sleepTimeMillis | stream, shardId | The number of milliseconds the consumer spends sleeping before fetching records from Kinesis. A particular shard’s metric can be specified by stream name and shard id. | Gauge |
    | Operator | maxNumberOfRecordsPerFetch | stream, shardId | The maximum number of records requested by the consumer in a single getRecords call to Kinesis. If ConsumerConfigConstants.SHARD_USE_ADAPTIVE_READS is set to true, this value is adaptively calculated to maximize the 2 Mbps read limits from Kinesis. | Gauge |
    | Operator | numberOfAggregatedRecordsPerFetch | stream, shardId | The number of aggregated Kinesis records fetched by the consumer in a single getRecords call to Kinesis. | Gauge |
    | Operator | numberOfDeggregatedRecordsPerFetch | stream, shardId | The number of deaggregated Kinesis records fetched by the consumer in a single getRecords call to Kinesis. | Gauge |
    | Operator | averageRecordSizeBytes | stream, shardId | The average size of a Kinesis record in bytes, fetched by the consumer in a single getRecords call. | Gauge |
    | Operator | runLoopTimeNanos | stream, shardId | The actual time taken, in nanoseconds, by the consumer in the run loop. | Gauge |
    | Operator | loopFrequencyHz | stream, shardId | The number of calls to getRecords in one second. | Gauge |
    | Operator | bytesRequestedPerFetch | stream, shardId | The bytes requested (2 Mbps / loopFrequencyHz) in a single call to getRecords. | Gauge |

    HBase Connectors

    | Scope | Metrics | User Variables | Description | Type |
    |---|---|---|---|---|
    | Operator | lookupCacheHitRate | n/a | Cache hit ratio for lookup. | Gauge |

    System resources reporting is disabled by default. When metrics.system-resource is enabled additional metrics listed below will be available on Job- and TaskManager. System resources metrics are updated periodically and they present average values for a configured interval (metrics.system-resource-probing-interval).

    System resources reporting requires an optional dependency to be present on the classpath (for example placed in Flink’s lib directory):

    • com.github.oshi:oshi-core:3.4.0 (licensed under EPL 1.0 license)

    Including its transitive dependencies:

    • net.java.dev.jna:jna-platform:jar:4.2.2
    • net.java.dev.jna:jna:jar:4.2.2

    Failures in this regard will be reported as warning messages like NoClassDefFoundError logged by SystemResourcesMetricsInitializer during the startup.

    System CPU

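    | Scope | Infix | Metrics | Description |
    |---|---|---|---|
    | Job-/TaskManager | System.CPU | Usage | Overall % of CPU usage on the machine. |
    | Job-/TaskManager | System.CPU | Idle | % of CPU Idle usage on the machine. |
    | Job-/TaskManager | System.CPU | Sys | % of System CPU usage on the machine. |
    | Job-/TaskManager | System.CPU | User | % of User CPU usage on the machine. |
    | Job-/TaskManager | System.CPU | IOWait | % of IOWait CPU usage on the machine. |
    | Job-/TaskManager | System.CPU | Irq | % of Irq CPU usage on the machine. |
    | Job-/TaskManager | System.CPU | SoftIrq | % of SoftIrq CPU usage on the machine. |
    | Job-/TaskManager | System.CPU | Nice | % of Nice CPU usage on the machine. |
    | Job-/TaskManager | System.CPU | Load1min | Average CPU load over 1 minute. |
    | Job-/TaskManager | System.CPU | Load5min | Average CPU load over 5 minutes. |
    | Job-/TaskManager | System.CPU | Load15min | Average CPU load over 15 minutes. |
    | Job-/TaskManager | System.CPU | UsageCPU* | % of CPU usage per processor. |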

    System memory

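    | Scope | Infix | Metrics | Description |
    |---|---|---|---|
    | Job-/TaskManager | System.Memory | Available | Available memory in bytes |
    | Job-/TaskManager | System.Memory | Total | Total memory in bytes |
    | Job-/TaskManager | System.Swap | Used | Used swap in bytes |
    | Job-/TaskManager | System.Swap | Total | Total swap in bytes |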

    System network

    Flink allows you to track the latency of records travelling through the system. This feature is disabled by default. To enable latency tracking, set the latencyTrackingInterval to a positive number in either the Flink configuration or the ExecutionConfig.

    At every latencyTrackingInterval, the sources emit a special record called a LatencyMarker. The marker contains a timestamp of the time at which it was emitted at the source. Latency markers cannot overtake regular user records, so if records are queuing up in front of an operator, this adds to the latency tracked by the marker.

    Note that latency markers do not account for the time user records spend in operators, because the markers bypass them. In particular, the markers do not account for the time records spend in, for example, window buffers. The latency measured with the markers reflects such delays only when operators cannot accept new records, so that records queue up.

    The LatencyMarkers are used to derive a distribution of the latency between the sources of the topology and each downstream operator. These distributions are reported as histogram metrics. The granularity of these distributions can be controlled in the Flink configuration. For the highest granularity, subtask, Flink derives the latency distribution between every source subtask and every downstream subtask, which results in a number of histograms that is quadratic in the parallelism.

    Currently, Flink assumes that the clocks of all machines in the cluster are in sync. We recommend setting up an automated clock synchronisation service (like NTP) to avoid false latency results.

    Warning: Enabling latency metrics can significantly impact the performance of the cluster (in particular for subtask granularity). It is highly recommended to only use them for debugging purposes.

    State access latency tracking

    Flink also allows you to track the keyed state access latency for the standard Flink state backends and for custom state backends that extend AbstractStateBackend. This feature is disabled by default. To enable it, set state.backend.latency-track.keyed-state-enabled to true in the Flink configuration.

    Once keyed state access latency tracking is enabled, Flink samples the state access latency every N accesses, where N is defined by state.backend.latency-track.sample-interval (default: 100). A smaller value yields more accurate results but has a higher performance impact because sampling happens more frequently.

    Because these latency metrics are histograms, state.backend.latency-track.history-size controls the maximum number of recorded values kept in history (default: 128). A larger value requires more memory but provides more accurate results.
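
The interplay of the sample interval and the history size can be illustrated with a standalone sketch. This is not how Flink implements state latency tracking; the class below is hypothetical and only shows the general technique of timing every N-th access and keeping a bounded window of the most recent measurements.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical tracker: times every sampleInterval-th access and keeps at
// most historySize measurements, discarding the oldest first.
final class SampledLatencySketch {
    private final int sampleInterval;
    private final int historySize;
    private final Deque<Long> history = new ArrayDeque<>();
    private int accessesSinceSample;

    SampledLatencySketch(int sampleInterval, int historySize) {
        this.sampleInterval = sampleInterval;
        this.historySize = historySize;
    }

    <T> T access(Supplier<T> stateAccess) {
        accessesSinceSample++;
        if (accessesSinceSample < sampleInterval) {
            return stateAccess.get(); // not sampled: no timing overhead
        }
        accessesSinceSample = 0;
        long start = System.nanoTime();
        T result = stateAccess.get();
        record(System.nanoTime() - start);
        return result;
    }

    private void record(long latencyNanos) {
        if (history.size() == historySize) {
            history.removeFirst(); // drop the oldest measurement
        }
        history.addLast(latencyNanos);
    }

    int recordedSamples() { return history.size(); }

    public static void main(String[] args) {
        SampledLatencySketch tracker = new SampledLatencySketch(100, 128);
        for (int i = 0; i < 1000; i++) {
            tracker.access(() -> "value");
        }
        System.out.println(tracker.recordedSamples()); // prints 10
    }
}
```

With the defaults quoted above (100 and 128), one access in a hundred pays the timing cost, and the histogram is computed over at most 128 retained measurements.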

    Metrics can be queried through the Monitoring REST API.

    Below is a list of available endpoints, each with a sample JSON response. All endpoints are of the sample form http://hostname:8081/jobmanager/metrics; below we list only the path part of the URLs.

    Values in angle brackets are variables, for example http://hostname:8081/jobs/<jobid>/metrics will have to be requested for example as http://hostname:8081/jobs/7684be6004e4e955c2a558a9bc463f65/metrics.

    Request metrics for a specific entity:

    • /jobmanager/metrics
    • /taskmanagers/<taskmanagerid>/metrics
    • /jobs/<jobid>/metrics
    • /jobs/<jobid>/vertices/<vertexid>/subtasks/<subtaskindex>/metrics

    Request metrics aggregated across all entities of the respective type:

    • /taskmanagers/metrics
    • /jobs/metrics
    • /jobs/<jobid>/vertices/<vertexid>/subtasks/metrics

    Request metrics aggregated over a subset of all entities of the respective type:

    • /taskmanagers/metrics?taskmanagers=A,B,C
    • /jobs/metrics?jobs=D,E,F
    • /jobs/<jobid>/vertices/<vertexid>/subtasks/metrics?subtask=1,2,3

    Warning: Metric names can contain special characters that you need to escape when querying metrics. For example, “a_+_b” would be escaped to “a_%2B_b”.

    List of characters that should be escaped:

    | Character | Escape Sequence |
    |---|---|
    | # | %23 |
    | $ | %24 |
    | & | %26 |
    | + | %2B |
    | / | %2F |
    | ; | %3B |
    | = | %3D |
    | ? | %3F |
    | @ | %40 |
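
As an illustration, the escape table above can be applied with a small helper like the following. The class is hypothetical and escapes exactly the characters listed in the table, leaving everything else unchanged.

```java
import java.util.Map;

// Hypothetical helper that applies the escape table above to a metric name.
final class MetricNameEscaper {
    private static final Map<Character, String> ESCAPES = Map.of(
        '#', "%23", '$', "%24", '&', "%26",
        '+', "%2B", '/', "%2F", ';', "%3B",
        '=', "%3D", '?', "%3F", '@', "%40");

    static String escape(String metricName) {
        StringBuilder escaped = new StringBuilder(metricName.length());
        for (char c : metricName.toCharArray()) {
            String replacement = ESCAPES.get(c);
            if (replacement != null) {
                escaped.append(replacement);
            } else {
                escaped.append(c);
            }
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a_+_b")); // prints a_%2B_b
    }
}
```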

    Request a list of available metrics:

    GET /jobmanager/metrics

    [
      {
        "id": "metric1"
      },
      {
        "id": "metric2"
      }
    ]

    Request the values for specific (unaggregated) metrics:

    GET /taskmanagers/ABCDE/metrics?get=metric1,metric2

    [
      {
        "id": "metric1",
        "value": "34"
      },
      {
        "id": "metric2",
        "value": "2"
      }
    ]

    Request aggregated values for specific metrics:

    GET /taskmanagers/metrics?get=metric1,metric2

    [
      {
        "id": "metric1",
        "min": 1,
        "max": 34,
        "avg": 15,
        "sum": 45
      },
      {
        "id": "metric2",
        "min": 2,
        "max": 14,
        "avg": 7,
        "sum": 16
      }
    ]

    Request specific aggregated values for specific metrics:

    GET /taskmanagers/metrics?get=metric1,metric2&agg=min,max

    [
      {
        "id": "metric1",
        "min": 1,
        "max": 34
      },
      {
        "id": "metric2",
        "min": 2,
        "max": 14
      }
    ]
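
The request paths shown in this section follow a regular pattern, which the sketch below assembles from its parts. The helper is hypothetical and only builds the path and query string using the get and agg parameters seen above; it does not send a request, and metric names are assumed to be escaped already where necessary.

```java
import java.util.List;

// Hypothetical builder for the metric query paths shown above.
final class MetricsQuerySketch {

    static String query(String basePath, List<String> metrics, List<String> aggregations) {
        StringBuilder path = new StringBuilder(basePath);
        if (!metrics.isEmpty()) {
            path.append("?get=").append(String.join(",", metrics));
            // agg only narrows the result when specific metrics are requested.
            if (!aggregations.isEmpty()) {
                path.append("&agg=").append(String.join(",", aggregations));
            }
        }
        return path.toString();
    }

    public static void main(String[] args) {
        System.out.println(query("/taskmanagers/metrics",
            List.of("metric1", "metric2"), List.of("min", "max")));
        // prints /taskmanagers/metrics?get=metric1,metric2&agg=min,max
    }
}
```

Prefixing the result with the host and port of the REST endpoint (for example http://hostname:8081, as in the sample form above) gives the full URL to request.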

    Dashboard integration

    Metrics that were gathered for each task or operator can also be visualized in the Dashboard. On the main page for a job, select the Metrics tab. After selecting one of the tasks in the top graph you can select metrics to display using the Add Metric drop-down menu.

    • Task metrics are listed as .

    Each metric will be visualized as a separate graph, with the x-axis representing time and the y-axis the measured value. All graphs are automatically updated every 10 seconds, and continue to do so when navigating to another page.