Hadoop formats

To use Hadoop formats, add the flink-hadoop-compatibility dependency to your pom.xml.
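
As a minimal sketch, the flink-hadoop-compatibility entry could look like the following. The _2.12 Scala suffix and the flink.version property are assumptions, not values taken from this page; match them to the Scala and Flink versions of your build.

```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <!-- Assumption: Scala 2.12 build; use the suffix that matches your Scala version -->
    <artifactId>flink-hadoop-compatibility_2.12</artifactId>
    <!-- Assumption: flink.version is a property defined elsewhere in your pom.xml -->
    <version>${flink.version}</version>
</dependency>
```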

If you want to run your Flink application locally (e.g. from your IDE), you also need to add a hadoop-client dependency such as:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.8.3</version>
    <scope>provided</scope>
</dependency>

To use Hadoop InputFormats with Flink, the format must first be wrapped using either readHadoopFile or createHadoopInput of the HadoopInputs utility class. The former is used for input formats derived from FileInputFormat, while the latter has to be used for general-purpose input formats. The resulting InputFormat can be used to create a data source by using ExecutionEnvironment#createInput.

The following example shows how to use Hadoop’s TextInputFormat.

Scala

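import org.apache.flink.api.scala._
import org.apache.flink.hadoopcompatibility.scala.HadoopInputs
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat // assumption: the mapred (not mapreduce) API

val env = ExecutionEnvironment.getExecutionEnvironment
// Wrap Hadoop's TextInputFormat and read (byte offset, line) pairs from textPath.
val input: DataSet[(LongWritable, Text)] =
  env.createInput(HadoopInputs.readHadoopFile(
    new TextInputFormat, classOf[LongWritable], classOf[Text], textPath))
// Do something with the data.
[...]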
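
For an input format that is not derived from FileInputFormat, createHadoopInput takes a job configuration instead of a path. The sketch below is illustrative only: it reuses TextInputFormat from the mapred API and sets the input path on a JobConf by hand, which is roughly what readHadoopFile does on your behalf. The object name and the command-line argument are hypothetical, and the flink-hadoop-compatibility and hadoop-client dependencies are assumed to be on the classpath.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.hadoopcompatibility.scala.HadoopInputs
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// Hypothetical object name, used only for this sketch.
object CreateHadoopInputSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical: the input path is passed as the first program argument.
    val textPath = args(0)

    val env = ExecutionEnvironment.getExecutionEnvironment

    // Configure the Hadoop job by hand instead of passing a path to readHadoopFile.
    val jobConf = new JobConf
    FileInputFormat.addInputPath(jobConf, new Path(textPath))

    // Wrap the Hadoop input format together with the prepared JobConf.
    val input: DataSet[(LongWritable, Text)] =
      env.createInput(HadoopInputs.createHadoopInput(
        new TextInputFormat, classOf[LongWritable], classOf[Text], jobConf))

    // Keep only the line contents and print them; print() triggers execution.
    input.map(_._2.toString).print()
  }
}
```

A format that needs settings other than a file path would place them on the same JobConf before the format is wrapped.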

The following example shows how to use Hadoop’s TextOutputFormat.

Scala

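import org.apache.flink.api.scala._
import org.apache.flink.api.scala.hadoop.mapred.HadoopOutputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}

val hadoopResult: DataSet[(Text, IntWritable)] = [...]
// Wrap Hadoop's TextOutputFormat and configure it through its JobConf.
val hadoopOF = new HadoopOutputFormat[Text, IntWritable](
  new TextOutputFormat[Text, IntWritable],
  new JobConf)
hadoopOF.getJobConf.set("mapred.textoutputformat.separator", " ")
FileOutputFormat.setOutputPath(hadoopOF.getJobConf, new Path(resultPath))
hadoopResult.output(hadoopOF)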
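
Putting the input and output wrappers together, a complete job might look like the sketch below. It is a word count chosen purely for illustration: the object name, the job name, and the two command-line arguments are hypothetical, the mapred API is assumed throughout, and the flink-hadoop-compatibility and hadoop-client dependencies are assumed to be on the classpath.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.hadoop.mapred.HadoopOutputFormat
import org.apache.flink.hadoopcompatibility.scala.HadoopInputs
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextInputFormat, TextOutputFormat}

// Hypothetical object name, used only for this sketch.
object HadoopFormatsWordCount {
  def main(args: Array[String]): Unit = {
    // Hypothetical: input and output paths are passed as program arguments.
    val textPath = args(0)
    val resultPath = args(1)

    val env = ExecutionEnvironment.getExecutionEnvironment

    // Read (byte offset, line) pairs through Hadoop's TextInputFormat.
    val input: DataSet[(LongWritable, Text)] =
      env.createInput(HadoopInputs.readHadoopFile(
        new TextInputFormat, classOf[LongWritable], classOf[Text], textPath))

    // Split each line into lowercase words and count the occurrences of each word.
    val counts: DataSet[(String, Int)] = input
      .map(_._2.toString)
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map((_, 1))
      .groupBy(0)
      .sum(1)

    // Convert to Hadoop Writable types, as expected by the output format.
    val hadoopResult: DataSet[(Text, IntWritable)] =
      counts.map(t => (new Text(t._1), new IntWritable(t._2)))

    // Wrap Hadoop's TextOutputFormat and configure separator and output path.
    val hadoopOF = new HadoopOutputFormat[Text, IntWritable](
      new TextOutputFormat[Text, IntWritable], new JobConf)
    hadoopOF.getJobConf.set("mapred.textoutputformat.separator", " ")
    FileOutputFormat.setOutputPath(hadoopOF.getJobConf, new Path(resultPath))

    hadoopResult.output(hadoopOF)

    // output() only registers a sink; execute() actually runs the job.
    env.execute("Hadoop formats word count")
  }
}
```

Unlike print(), output() does not start the job by itself, which is why the sketch ends with an explicit call to execute.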