Running Apache Flink on Alluxio

    This guide describes how to get Alluxio running with Apache Flink, so that you can easily work with files stored in Alluxio.

    • Install Java 8 Update 161 or higher (8u161+), 64-bit.
    • Alluxio has been set up and is running.
    • Flink has been set up and is running.

    Apache Flink can use Alluxio through its generic file system wrapper for the Hadoop file system. Alluxio is therefore configured mostly through Hadoop configuration files.

    If you have a Hadoop setup next to the Flink installation, add the fs.alluxio.impl property, with the value alluxio.hadoop.FileSystem, to the core-site.xml configuration file.

    If you don't have a Hadoop setup, create a file called core-site.xml with the following contents:

    <configuration>
      <property>
        <name>fs.alluxio.impl</name>
        <value>alluxio.hadoop.FileSystem</value>
      </property>
    </configuration>
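
    As a minimal sketch of the no-Hadoop case, the file can also be generated from the shell. The helper name write_core_site and the target directory are illustrative assumptions, not part of Alluxio or Flink. The helper refuses to overwrite an existing core-site.xml so that a real Hadoop configuration is not clobbered.

```shell
# write_core_site DIR: create DIR/core-site.xml registering the Alluxio
# Hadoop-compatible file system. (Hypothetical helper name.)
write_core_site() {
  conf_dir="$1"
  target="$conf_dir/core-site.xml"

  # Do not clobber an existing Hadoop configuration; edit it by hand instead.
  if [ -e "$target" ]; then
    echo "refusing to overwrite $target" >&2
    return 1
  fi

  mkdir -p "$conf_dir" || return 1
  cat > "$target" <<'EOF'
<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>
EOF
}
```

    For example, write_core_site "$HOME/hadoop-conf" would create the file under a hypothetical hadoop-conf directory. Pointing Flink at that directory is a separate step that this sketch does not cover.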

    To communicate with Alluxio, Flink programs need the Alluxio Core Client jar. We recommend downloading the tarball from the Alluxio download page. Alternatively, advanced users can compile the client jar from the source code by following the instructions in the Alluxio documentation. The Alluxio client jar can be found at /<PATH_TO_ALLUXIO>/client/alluxio-2.5.0-client.jar.

    We need to make the Alluxio jar file available to Flink, because it contains the configured alluxio.hadoop.FileSystem class.

    There are different ways to achieve that:

    • Put the /<PATH_TO_ALLUXIO>/client/alluxio-2.5.0-client.jar file into the lib directory of Flink (for local and standalone cluster setups).
    • Put the /<PATH_TO_ALLUXIO>/client/alluxio-2.5.0-client.jar file into the ship directory for Flink on YARN.
    • Specify the location of the jar file in the HADOOP_CLASSPATH environment variable (make sure it is available on all cluster nodes as well), for example by exporting HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/client/alluxio-2.5.0-client.jar.
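
    When HADOOP_CLASSPATH may already hold other entries, the environment-variable option can be sketched as a small helper that appends the jar only once instead of replacing the variable. The function name add_to_hadoop_classpath is a hypothetical choice for illustration; it would have to run in the environment of every node that launches Flink.

```shell
# add_to_hadoop_classpath JAR: append JAR to HADOOP_CLASSPATH unless it is
# already listed, then export the variable. (Hypothetical helper name.)
add_to_hadoop_classpath() {
  jar="$1"
  case ":${HADOOP_CLASSPATH:-}:" in
    *":$jar:"*)
      ;;  # already on the classpath, nothing to do
    *)
      # Keep existing entries and separate them from the new one with ':'.
      HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$jar"
      ;;
  esac
  export HADOOP_CLASSPATH
}
```

    A call would look like add_to_hadoop_classpath "/<PATH_TO_ALLUXIO>/client/alluxio-2.5.0-client.jar", with the placeholder replaced by the actual installation path.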

    In addition, if there are any client-related properties specified in conf/alluxio-site.properties, translate those to env.java.opts in {FLINK_HOME}/conf/flink-conf.yaml so that Flink picks up the Alluxio configuration. For example, to configure the Alluxio client to use CACHE_THROUGH as the write type, add the following to {FLINK_HOME}/conf/flink-conf.yaml:

    env.java.opts: -Dalluxio.user.file.writetype.default=CACHE_THROUGH
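
    With several client properties, the translation into JVM -D options can be sketched with standard shell tools. This is an illustration under stated assumptions: that "client-related" means keys starting with alluxio.user., that values contain no spaces, and that the options belong under Flink's env.java.opts key. The function name alluxio_props_to_java_opts is hypothetical.

```shell
# alluxio_props_to_java_opts FILE: print one env.java.opts line built from
# the alluxio.user.* entries of a Java properties file.
# Assumption: only alluxio.user.* keys are client-related; values have no spaces.
alluxio_props_to_java_opts() {
  props_file="$1"
  opts="$(grep -E '^[[:space:]]*alluxio\.user\.[^=]+=' "$props_file" \
    | sed -e 's/^[[:space:]]*//' \
          -e 's/[[:space:]]*=[[:space:]]*/=/' \
          -e 's/[[:space:]]*$//' \
          -e 's/^/-D/' \
    | tr '\n' ' ' \
    | sed 's/ $//')"
  # Comment lines and non-client keys were filtered out by grep above.
  printf 'env.java.opts: %s\n' "$opts"
}
```

    The printed line is meant to be reviewed and then pasted into {FLINK_HOME}/conf/flink-conf.yaml by hand, since that file may already define env.java.opts.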

    Flink programs address files in Alluxio with the alluxio:// scheme. If Alluxio is installed locally, a valid path looks like this: alluxio://localhost:19998/user/hduser/gutenberg.
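
    The parts of such a path (scheme, master host, master port, and file path) can be made explicit with a small helper. This is only a sketch: alluxio_uri is a hypothetical name, and the defaults localhost and 19998 simply mirror the example above.

```shell
# alluxio_uri PATH [HOST] [PORT]: print an alluxio:// URI for PATH.
# HOST and PORT default to the values used in the local example.
alluxio_uri() {
  path="$1"
  host="${2:-localhost}"
  port="${3:-19998}"
  # Add a leading slash if the caller left it out.
  path="/${path#/}"
  printf 'alluxio://%s:%s%s\n' "$host" "$port" "$path"
}
```

    The result can be passed to a Flink job wherever it expects an input or output path.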

    This example assumes you have set up Alluxio and Flink as previously described.

    Put the file LICENSE into Alluxio, for example with bin/alluxio fs copyFromLocal LICENSE /LICENSE, assuming you are in the top-level Alluxio project directory.

    Run the following command from the top-level Flink project directory:

    $ bin/flink run examples/batch/WordCount.jar \
      --input alluxio://localhost:19998/LICENSE \