Production Readiness Checklist

    The max parallelism, set on a per-job and per-operator granularity, determines the maximum parallelism to which a stateful operator can scale. There is currently no way to change the maximum parallelism of an operator after a job has started without discarding that operators state. The reason maximum parallelism exists, versus allowing stateful operators to be infinitely scalable, is that it has some impact on your application’s performance and state size. Flink has to maintain specific metadata for its ability to rescale state which grows linearly with max parallelism. In general, you should choose max parallelism that is high enough to fit your future needs in scalability, while keeping it low enough to maintain reasonable performance.

    You can explicitly set maximum parallelism by using setMaxParallelism(int maxparallelism). If no max parallelism is set Flink will decide using a function of the operators parallelism when the job is first started:

    • MIN(nextPowerOfTwo(parallelism + (parallelism / 2)), 2^15) : for all parallelism > 128.

    See the description of state backends for choosing the right one for your use case.

    is Flink’s primary fault-tolerance mechanism, wherein a snapshot of your job’s state persisted periodically to some durable location. In the case of failure, Flink will restart from the most recent checkpoint and resume processing. A jobs checkpoint interval configures how often Flink will take these snapshots. While there is no single correct answer on the perfect checkpoint interval, the community can guide what factors to consider when configuring this parameter.

    • What is the SLA of your service: Checkpoint interval is best understood as an expression of the jobs service level agreement (SLA). In the worst-case scenario, where a job fails one second before the next checkpoint, how much data can you tolerate reprocessing? A checkpoint interval of 5 minutes implies that Flink will never reprocess more than 5 minutes worth of data after a failure.

    • How much load can your Task Managers sustain: All of Flinks’ built-in state backends support asynchronous checkpointing, meaning the snapshot process will not pause data processing. However, it still does require CPU cycles and network bandwidth from your machines. can be a powerful tool to reduce the cost of any given checkpoint.

    And most importantly, test and measure your job. Every Flink application is unique, and the best way to find the appropriate checkpoint interval is to see how yours behaves in practice.

    The JobManager serves as a central coordinator for each Flink deployment, being responsible for both scheduling and resource management of the cluster. It is a single point of failure within the cluster, and if it crashes, no new jobs can be submitted, and running applications will fail.