Flink and Hadoop MapReduce compatibility
You can:

- use any Hadoop `InputFormat` as a DataSource.
- use any Hadoop `OutputFormat` as a DataSink.
- use a Hadoop `Mapper` as a FlatMapFunction.
- use a Hadoop `Reducer` as a GroupReduceFunction.
This document shows how to use existing Hadoop MapReduce code with Flink. Please refer to the guide on connecting to other systems for reading from Hadoop-supported file systems.
Support for Hadoop is contained in the `flink-hadoop-compatibility` Maven module. If you want to run your Flink application locally (e.g. from your IDE), you also need to add a `hadoop-client` dependency, such as the one shown below.
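A minimal sketch of both Maven dependencies; the Scala suffix of the compatibility artifact and the two version properties are placeholders that must match your own Flink and Hadoop setup:

```xml
<!-- Wrappers for Hadoop Mappers and Reducers; suffix and version are placeholders. -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-hadoop-compatibility_2.12</artifactId>
    <version>${flink.version}</version>
</dependency>

<!-- Required to run the application locally, e.g. from the IDE; version is a placeholder. -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
```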
Hadoop Mappers are semantically equivalent to Flink’s FlatMapFunctions and Hadoop Reducers are equivalent to Flink’s GroupReduceFunctions. Flink provides wrappers for implementations of Hadoop MapReduce’s `Mapper` and `Reducer` interfaces, i.e., you can reuse your Hadoop Mappers and Reducers in regular Flink programs. At the moment, only the `Mapper` and `Reducer` interfaces of Hadoop’s mapred API (`org.apache.hadoop.mapred`) are supported.
The wrappers take a `DataSet<Tuple2<KEYIN,VALUEIN>>` as input and produce a `DataSet<Tuple2<KEYOUT,VALUEOUT>>` as output, where `KEYIN` and `KEYOUT` are the keys and `VALUEIN` and `VALUEOUT` are the values of the Hadoop key-value pairs that are processed by the Hadoop functions. For Reducers, Flink offers a wrapper for a GroupReduceFunction with a Combiner (`HadoopReduceCombineFunction`) and without a Combiner (`HadoopReduceFunction`). The wrappers accept an optional `JobConf` object to configure the Hadoop Mapper or Reducer.
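For illustration, a minimal sketch of passing a `JobConf` to a Reducer wrapper; `MyReducer` and the configuration key are hypothetical, `words` stands for a `DataSet<Tuple2<Text, LongWritable>>` produced earlier in the program, and the imports are the same as in the full example further below:

```java
// Carry Hadoop configuration into the wrapped function via a JobConf.
JobConf conf = new JobConf();
conf.set("my.custom.key", "some-value"); // hypothetical key read by MyReducer

// Wrap the Hadoop Reducer (without a Combiner) and hand over the JobConf.
DataSet<Tuple2<Text, LongWritable>> result = words
    .groupBy(0)
    .reduceGroup(new HadoopReduceFunction<Text, LongWritable, Text, LongWritable>(
        new MyReducer(), conf));
```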
The wrapper classes are:

- `org.apache.flink.hadoopcompatibility.mapred.HadoopMapFunction`
- `org.apache.flink.hadoopcompatibility.mapred.HadoopReduceFunction`
- `org.apache.flink.hadoopcompatibility.mapred.HadoopReduceCombineFunction`

They can be used as regular Flink FlatMapFunctions or GroupReduceFunctions.
The following example shows how to use Hadoop `Mapper` and `Reducer` functions.
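A sketch of the classic word count, assuming two hypothetical Hadoop mapred classes: `Tokenizer`, a `Mapper` that splits each line into `(word, 1)` pairs, and `Counter`, a `Reducer` that sums the counts per word and is reused here as the Combiner:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.mapred.HadoopMapFunction;
import org.apache.flink.hadoopcompatibility.mapred.HadoopReduceCombineFunction;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Obtain the data to process; a small inline sample for illustration.
DataSet<Tuple2<LongWritable, Text>> text = env.fromElements(
    Tuple2.of(new LongWritable(0L), new Text("to be or not to be")));

DataSet<Tuple2<Text, LongWritable>> result = text
    // use the Hadoop Mapper (Tokenizer) as a FlatMapFunction
    .flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(
        new Tokenizer()))
    .groupBy(0)
    // use the Hadoop Reducer (Counter) as a GroupReduceFunction,
    // with a second Counter instance acting as the Combiner
    .reduceGroup(new HadoopReduceCombineFunction<Text, LongWritable, Text, LongWritable>(
        new Counter(), new Counter()));

result.print();
```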
Please note: The Reducer wrapper works on groups as defined by Flink’s `groupBy()` operation. It does not consider any custom partitioners, sort or grouping comparators you might have set in the `JobConf`.