Apache Beam Flink Pipeline Engine
The Flink runner supports two modes: the Local Direct Flink Runner and the (remote) Flink Runner.
The Flink Runner and Flink are suitable for large-scale, continuous jobs and provide:
* A streaming-first runtime that supports both batch processing and data streaming programs
* Fault tolerance with exactly-once processing guarantees
* Natural back-pressure in streaming programs
* Integration with YARN and other components of the Apache Hadoop ecosystem
Check the Apache Beam Flink runner docs for more information.
You can also execute pipelines using the `bin/flink run` command. There is a Hop main class you can pass with the `-c` (`--class`) option of the run command:
It accepts 3 arguments:
| Argument | Description |
|---|---|
| 1 | The filename of the pipeline to execute. |
| 2 | The filename of the metadata to load (JSON). You can export metadata in the Hop GUI under the Tools menu (part of the Beam plugin). |
| 3 | The name of the pipeline run configuration to use. |
The Flink run command also needs a fat jar as an argument. This can be generated in the Hop GUI under the Tools menu or from the command line:
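A minimal sketch of generating the fat jar from the command line. This assumes Hop's `hop-conf.sh` script supports a `--generate-fat-jar` option; the flag and output path are placeholders, so verify against your Hop version's CLI help:

```shell
# Hypothetical invocation -- check `sh hop-conf.sh --help` for the exact
# option name in your Hop release; the output path is a placeholder.
sh hop-conf.sh --generate-fat-jar /tmp/hop-fat.jar
```

The resulting jar bundles Hop and its plugins so the Flink cluster does not need a local Hop installation.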
Important: project configurations, environments and related concepts are not valid in the context of the Flink runtime. How best to support these is an open question for the Hop community; your input is welcome. In the meantime, pass variables to the JVM by setting them in the `conf/flink-conf.yaml` file, adding a line:
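For example, a sketch of such a line, using Flink's standard `env.java.opts` configuration key; the variable name `PROJECT_HOME` and its value are placeholders to adapt to your setup:

```yaml
# conf/flink-conf.yaml -- pass a variable to the Flink JVMs as a Java system property
env.java.opts: "-DPROJECT_HOME=/opt/projects/my-project"
```

Variables set this way are visible to the pipeline on every task manager, which is what makes them usable from within the Flink runtime.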
In general, avoid relative paths when specifying filenames for pipelines executed remotely. It is usually better to pick a few root folders and reference them through variables; `PROJECT_HOME` is as good a variable as any.
An example Flink run command might look like this:
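A sketch of a full run command, assuming the Hop Beam main class `org.apache.hop.beam.run.MainBeam` (verify the class name against your Hop release) and placeholder paths for the fat jar, pipeline, metadata export, and run configuration name:

```shell
# The three trailing arguments map to the table above:
#   1. the pipeline filename (.hpl)
#   2. the metadata JSON exported from the Hop GUI
#   3. the name of the pipeline run configuration
bin/flink run \
  --class org.apache.hop.beam.run.MainBeam \
  --parallelism 4 \
  /opt/hop/hop-fat.jar \
  /opt/projects/my-project/input-process-output.hpl \
  /opt/projects/my-project/metadata.json \
  Flink
```

The `--parallelism` value is an example; tune it to the slots available on your Flink cluster.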