Monitoring Back Pressure

If you see a back pressure warning for a task, this means that it is producing data faster than the downstream operators can consume it. Records in your job flow downstream (e.g. from sources to sinks), and back pressure propagates in the opposite direction, up the stream.

Take a simple Source -> Sink job as an example. If you see a warning for Source, this means that Sink is consuming data slower than Source is producing it. Sink is back pressuring the upstream operator Source.
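The Source -> Sink case can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: time advances in discrete ticks, a single bounded buffer sits between the two operators, and the function name `simulate` and all of its parameters are hypothetical rather than part of Flink.

```python
def simulate(ticks: int, produce_per_tick: int, consume_per_tick: int,
             buffer_capacity: int) -> int:
    """Return the number of ticks in which Source could not emit everything.

    Toy model (not Flink's runtime): Sink drains the shared buffer first,
    then Source tries to emit `produce_per_tick` records into the free space.
    """
    buffered = 0
    blocked_ticks = 0
    for _ in range(ticks):
        # Sink consumes as much as it can from the buffer.
        buffered -= min(buffered, consume_per_tick)
        free = buffer_capacity - buffered
        # Source is back pressured when the buffer cannot take its output.
        if free < produce_per_tick:
            blocked_ticks += 1
        buffered += min(free, produce_per_tick)
    return blocked_ticks


# A fast Source feeding a slow Sink is blocked once the buffer fills up.
slow_sink = simulate(ticks=10, produce_per_tick=3, consume_per_tick=1, buffer_capacity=5)
# A Sink that keeps up never blocks the Source.
fast_sink = simulate(ticks=10, produce_per_tick=1, consume_per_tick=3, buffer_capacity=5)
```

In this model the slow Sink never reports a problem itself. The symptom shows up on Source, which matches where the warning appears in the description above.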

Every parallel instance of a task (subtask) exposes a group of three metrics:

backPressuredTimeMsPerSecond, time that subtask spent being back pressured
idleTimeMsPerSecond, time that subtask spent waiting for something to process
busyTimeMsPerSecond, time that subtask was busy doing some actual work

At any point in time, these three metrics add up to approximately 1000ms.
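Because the back pressured, idle and busy times of a subtask together account for roughly one second, each value can be read as a share of that second. The helper below is a minimal sketch of that arithmetic. The function names, the plain-number inputs and the 50ms tolerance are assumptions made for illustration, not a Flink API.

```python
SECOND_MS = 1000.0


def time_shares(back_pressured_ms: float, idle_ms: float, busy_ms: float) -> dict:
    """Express each per-second time metric as a fraction of one second."""
    return {
        "back_pressured": back_pressured_ms / SECOND_MS,
        "idle": idle_ms / SECOND_MS,
        "busy": busy_ms / SECOND_MS,
    }


def adds_up(back_pressured_ms: float, idle_ms: float, busy_ms: float,
            tolerance_ms: float = 50.0) -> bool:
    """Check that the three values sum to about 1000ms (tolerance is arbitrary)."""
    total = back_pressured_ms + idle_ms + busy_ms
    return abs(total - SECOND_MS) <= tolerance_ms


# Hypothetical sample: 250ms back pressured, 150ms idle, 600ms busy.
shares = time_shares(250.0, 150.0, 600.0)
```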

Internally, back pressure is judged based on the availability of output buffers. If a task has no available output buffers, then that task is considered back pressured. Idleness, on the other hand, is determined by whether or not there is input available.
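The rule above can be restated as a small decision function. This is a simplified sketch, not Flink's internal code: the function name and parameters are hypothetical, and checking output buffers before input is an assumption made here to keep the three states mutually exclusive.

```python
def subtask_state(available_output_buffers: int, input_available: bool) -> str:
    """Classify one instant of a subtask's time (simplified model)."""
    if available_output_buffers == 0:
        # No output buffer to write into: the subtask is back pressured.
        return "back pressured"
    if not input_available:
        # Output is possible, but there is nothing to process.
        return "idle"
    return "busy"
```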

The WebUI aggregates the maximum value of the back pressure and busy metrics from all of the subtasks and presents those aggregated values inside the JobGraph. Besides displaying the raw values, tasks are also color-coded to make the investigation easier.
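The aggregation described above amounts to taking a per-task maximum over the subtasks. A minimal sketch follows, assuming each subtask's readings are available as a dictionary with the hypothetical keys `back_pressured_ms` and `busy_ms`. It illustrates the calculation only and is not the WebUI's actual code.

```python
def aggregate_task(subtasks: list) -> dict:
    """Return the maximum back pressured and busy time across all subtasks."""
    if not subtasks:
        raise ValueError("a task needs at least one subtask")
    return {
        "back_pressured_ms": max(s["back_pressured_ms"] for s in subtasks),
        "busy_ms": max(s["busy_ms"] for s in subtasks),
    }


# Hypothetical readings for a task with parallelism 3.
task_view = aggregate_task([
    {"back_pressured_ms": 20.0, "busy_ms": 900.0},
    {"back_pressured_ms": 700.0, "busy_ms": 250.0},
    {"back_pressured_ms": 0.0, "busy_ms": 400.0},
])
```

Using the maximum means a single struggling subtask is enough to mark the whole task. Note that the two maxima may come from different subtasks, so they need not add up to 1000ms.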

In the Back Pressure tab next to the job overview you can find more detailed metrics.

For subtasks whose status is OK, there is no indication of back pressure. HIGH, on the other hand, means that a subtask is back pressured. Status is defined in the following way:

OK: 0% <= back pressured <= 10%
LOW: 10% < back pressured <= 50%
HIGH: 50% < back pressured <= 100%
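The thresholds above translate directly into a classification function. The sketch below takes the back pressured ratio as a fraction between 0 and 1. The function name is hypothetical, and values between 10% and 50% are labelled LOW, the intermediate status between OK and HIGH.

```python
def back_pressure_status(ratio: float) -> str:
    """Map a back pressured ratio in [0, 1] to a status label."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if ratio <= 0.10:
        return "OK"
    if ratio <= 0.50:
        return "LOW"
    return "HIGH"


# Example: a subtask back pressured for 700ms of every second.
status = back_pressure_status(700.0 / 1000.0)
```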