Running Apache Flink on Alluxio
This guide describes how to get Alluxio running with Apache Flink, so that you can easily work with files stored in Alluxio.
- Setup Java for Java 8 Update 161 or higher (8u161+), 64-bit.
- Alluxio has been set up and is running.
- Flink has been set up and is running.
Apache Flink allows to use Alluxio through a generic file system wrapper for the Hadoop file system. Therefore, the configuration of Alluxio is done mostly in Hadoop configuration files.
If you have a Hadoop setup next to the Flink installation, add the following property to the core-site.xml
configuration file:
In case you don’t have a Hadoop setup, you have to create a file called core-site.xml
with the following contents:
<configuration>
<property>
<name>fs.alluxio.impl</name>
<value>alluxio.hadoop.FileSystem</value>
</configuration>
In order to communicate with Alluxio, we need to provide Flink programs with the Alluxio Core Client jar. We recommend you to download the tarball from Alluxio . Alternatively, advanced users can choose to compile this client jar from the source code by following the instructions here. The Alluxio client jar can be found at /<PATH_TO_ALLUXIO>/client/alluxio-2.5.0-client.jar
.
We need to make the Alluxio jar
file available to Flink, because it contains the configured alluxio.hadoop.FileSystem
class.
There are different ways to achieve that:
- Put the
/<PATH_TO_ALLUXIO>/client/alluxio-2.5.0-client.jar
file into thelib
directory of Flink (for local and standalone cluster setups) - Put the
/<PATH_TO_ALLUXIO>/client/alluxio-2.5.0-client.jar
file into theship
directory for Flink on YARN. - Specify the location of the jar file in the
HADOOP_CLASSPATH
environment variable (make sure its available on all cluster nodes as well). For example like this:
In addition, if there are any client-related properties specified in , translate those to env.java.opts
in {FLINK_HOME}/conf/flink-conf.yaml
for Flink to pick up Alluxio configuration. For example, if you want to configure Alluxio client to use CACHE_THROUGH as the write type, you should add the following to {FLINK_HOME}/conf/flink-conf.yaml
.
env.java.opts: -Dalluxio.user.file.writetype.default=CACHE_THROUGH
If Alluxio is installed locally, a valid path would look like this alluxio://localhost:19998/user/hduser/gutenberg
.
This example assumes you have set up Alluxio and Flink as previously described.
Put the file LICENSE
into Alluxio, assuming you are in the top level Alluxio project directory:
Run the following command from the top level Flink project directory:
$ bin/flink run examples/batch/WordCount.jar \
--input alluxio://localhost:19998/LICENSE \