Intro to the Python DataStream API

    The Python DataStream API is the Python version of the DataStream API, which allows Python users to write Python DataStream API jobs.

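    A Python DataStream API program typically follows a common structure: create a StreamExecutionEnvironment, create a DataStream from a source, apply transformations, emit the results, and submit the job. The sections below cover each step.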
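    As a rough sketch of that overall structure, the steps can be laid out as below. The names run_job, increment, and structure_sketch are hypothetical and chosen only for illustration; the PyFlink calls are the ones used elsewhere on this page.

```python
def increment(value):
    """Plain Python function used as the map transformation."""
    return value + 1


def run_job():
    # PyFlink imports are kept inside the function so that this sketch can be
    # loaded (for example by a test runner) without PyFlink installed.
    from pyflink.common.typeinfo import Types
    from pyflink.datastream import StreamExecutionEnvironment

    # 1. Create the execution environment.
    env = StreamExecutionEnvironment.get_execution_environment()

    # 2. Create a DataStream from a source (here: an in-memory list).
    ds = env.from_collection(collection=[1, 2, 3], type_info=Types.INT())

    # 3. Apply transformations.
    ds = ds.map(increment, output_type=Types.INT())

    # 4. Emit the results (here: print to standard output).
    ds.print()

    # 5. Submit the job for execution ('structure_sketch' is a made-up job name).
    env.execute('structure_sketch')
```

    Calling run_job() in an environment with PyFlink installed should submit the job; nothing runs until env.execute is reached.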

    Create a StreamExecutionEnvironment

    The StreamExecutionEnvironment is a central concept of a DataStream API program. The following code example shows how to create a StreamExecutionEnvironment:

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    The DataStream API gets its name from the special DataStream class that is used to represent a collection of data in a Flink program. You can think of a DataStream as an immutable collection of data that can contain duplicates. This data can be either finite or unbounded; the API that you use to work on it is the same.

    A DataStream is similar to a regular Python collection in terms of usage but is quite different in some key ways. It is immutable, meaning that once it is created you cannot add or remove elements. You also cannot simply inspect the elements inside; you can only work on them using the DataStream API operations, which are also called transformations.

    You can create an initial DataStream by adding a source in a Flink program. You can then derive new streams from it and combine them by using API methods such as map, filter, and so on.

    You can create a DataStream from a list object:

    from pyflink.common.typeinfo import Types
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    ds = env.from_collection(
        collection=[(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')],
        type_info=Types.ROW([Types.INT(), Types.STRING()]))

    The parameter type_info is optional; if it is not specified, the output type of the returned DataStream will be Types.PICKLED_BYTE_ARRAY().

    Create using DataStream connectors

    You can also create a DataStream from a DataStream connector using the add_source method, as follows:

    from pyflink.common.serialization import JsonRowDeserializationSchema
    from pyflink.common.typeinfo import Types
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.datastream.connectors import FlinkKafkaConsumer

    env = StreamExecutionEnvironment.get_execution_environment()
    # the sql connector for kafka is used here as it's a fat jar and could avoid dependency issues
    env.add_jars("file:///path/to/flink-sql-connector-kafka.jar")

    deserialization_schema = JsonRowDeserializationSchema.builder() \
        .type_info(type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

    kafka_consumer = FlinkKafkaConsumer(
        topics='test_source_topic',
        deserialization_schema=deserialization_schema,
        # placeholder connection settings (assumed); adjust them to your Kafka setup
        properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

    ds = env.add_source(kafka_consumer)

    Note: Currently, only FlinkKafkaConsumer is supported as a DataStream source connector with the add_source method.

    You can also call the from_source method to create a DataStream using unified DataStream source connectors:

    from pyflink.common.typeinfo import Types
    from pyflink.common.watermark_strategy import WatermarkStrategy
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.datastream.connectors import NumberSequenceSource

    env = StreamExecutionEnvironment.get_execution_environment()
    # the range 1..1000 is an example value; choose bounds that suit your job
    seq_num_source = NumberSequenceSource(1, 1000)
    ds = env.from_source(
        source=seq_num_source,
        watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
        source_name='seq_num_source',
        type_info=Types.LONG())

    Note: Currently, only NumberSequenceSource and FileSource are supported as unified DataStream source connectors.

    Note: A DataStream created using from_source can be executed in both batch and streaming execution mode.

    Table & SQL connectors can also be used to create a DataStream. You first create a Table using a Table & SQL connector and then convert it to a DataStream.

    Note: The StreamExecutionEnvironment env must be specified when creating the TableEnvironment t_env.

    Note: Because all Java Table & SQL connectors can be used in the PyFlink Table API, all of them can also be used in the PyFlink DataStream API.
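    The two steps above might look like the following sketch. It assumes the built-in datagen SQL connector as the source; the table name my_source, the constant SOURCE_DDL, and the helper stream_from_table_source are hypothetical names used only for this illustration.

```python
# DDL for a small generated table; 'datagen' and 'number-of-rows' are
# options of Flink's DataGen SQL connector.
SOURCE_DDL = """
    CREATE TABLE my_source (
        a INT,
        b VARCHAR
    ) WITH (
        'connector' = 'datagen',
        'number-of-rows' = '10'
    )
"""


def stream_from_table_source(env):
    """Create a Table from a SQL connector and convert it to a DataStream."""
    # Imported lazily so this sketch can be loaded without PyFlink installed.
    from pyflink.common.typeinfo import Types
    from pyflink.table import StreamTableEnvironment

    # The StreamExecutionEnvironment must be passed in when creating t_env.
    t_env = StreamTableEnvironment.create(stream_execution_environment=env)
    t_env.execute_sql(SOURCE_DDL)

    # Convert the Table to a DataStream of rows (a INT, b STRING).
    return t_env.to_append_stream(
        t_env.from_path('my_source'),
        Types.ROW([Types.INT(), Types.STRING()]))
```

    The returned DataStream can then be transformed and emitted like any other DataStream created from env.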

    DataStream Transformations

    Operators transform one or more DataStream into a new DataStream. Programs can combine multiple transformations into sophisticated dataflow topologies.

    The following example shows how to convert a DataStream into another DataStream using the map transformation:

    ds = ds.map(lambda a: a + 1)

    Please see the operators documentation for an overview of the available DataStream transformations.

    A DataStream can also be converted to a Table and vice versa.

    # t_env is a StreamTableEnvironment created from env
    # convert a DataStream to a Table
    table = t_env.from_data_stream(ds, 'a, b, c')
    # convert a Table to a DataStream
    ds = t_env.to_append_stream(table, Types.ROW([Types.INT(), Types.STRING()]))
    # or
    ds = t_env.to_retract_stream(table, Types.ROW([Types.INT(), Types.STRING()]))

    Emit Results

    Print

    ds.print()

    You can call the execute_and_collect method to collect the data of a DataStream to the client:

    with ds.execute_and_collect() as results:
        for result in results:
            print(result)

    Note: The execute_and_collect method collects the data of the DataStream into the memory of the client, so it’s good practice to limit the number of rows collected.
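    One way to cap the number of rows held on the client is to stop iterating early. The sketch below uses only the context-manager and iteration behaviour shown above, plus itertools.islice from the standard library; the helper name collect_first is hypothetical.

```python
from itertools import islice


def collect_first(ds, max_rows):
    """Return at most max_rows elements of ds as a list on the client."""
    # Leaving the with block closes the iterator, even though the stream
    # may still have more elements to deliver.
    with ds.execute_and_collect() as results:
        return list(islice(results, max_rows))
```

    For an unbounded stream, this call is expected to block until max_rows elements have arrived, so choose the limit with that in mind.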

    Emit results to a DataStream sink connector

    You can call the add_sink method to emit the data of a DataStream to a DataStream sink connector:
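    A minimal sketch of such a sink, assuming a Kafka broker at localhost:9092 and a topic named test_sink_topic (both placeholders). The helper name add_kafka_sink and the constant PRODUCER_CONFIG are hypothetical, and the import paths follow the ones used for the Kafka source on this page; the Kafka connector jar still has to be registered with env.add_jars.

```python
# Placeholder connection settings; adjust them to your Kafka setup.
PRODUCER_CONFIG = {'bootstrap.servers': 'localhost:9092'}


def add_kafka_sink(ds, topic='test_sink_topic'):
    """Attach a Kafka sink that writes each row of ds as a JSON record."""
    # Imported lazily so this sketch can be loaded without PyFlink installed.
    from pyflink.common.serialization import JsonRowSerializationSchema
    from pyflink.common.typeinfo import Types
    from pyflink.datastream.connectors import FlinkKafkaProducer

    # The row type must match the type of the elements in ds.
    serialization_schema = JsonRowSerializationSchema.builder() \
        .with_type_info(type_info=Types.ROW([Types.INT(), Types.STRING()])) \
        .build()

    kafka_producer = FlinkKafkaProducer(
        topic=topic,
        serialization_schema=serialization_schema,
        producer_config=PRODUCER_CONFIG)

    return ds.add_sink(kafka_producer)
```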

    Note: Currently, only FlinkKafkaProducer, JdbcSink and StreamingFileSink are supported as DataStream sink connectors with the add_sink method.

    Note: The add_sink method can only be used in streaming execution mode.

    You can also call the sink_to method to emit the data of a DataStream to a unified DataStream sink connector:

    from pyflink.common.serialization import Encoder
    from pyflink.datastream.connectors import FileSink, OutputFileConfig

    output_path = '/opt/output/'
    file_sink = FileSink \
        .for_row_format(output_path, Encoder.simple_string_encoder()) \
        .with_output_file_config(
            OutputFileConfig.builder()
            .with_part_prefix('pre')
            .with_part_suffix('suf')
            .build()) \
        .build()
    ds.sink_to(file_sink)

    Note: Currently, only FileSink is supported as a unified DataStream sink connector.

    Note: The sink_to method can be used in both batch and streaming execution mode.

    Table & SQL connectors can also be used to write out a DataStream. You first need to convert the DataStream to a Table and then write it to a Table & SQL sink connector.

    from pyflink.common import Row
    from pyflink.common.typeinfo import Types
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.table import StreamTableEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    t_env = StreamTableEnvironment.create(stream_execution_environment=env)

    # use either option 1 or option 2, not both

    # option 1: the result type of ds is Types.ROW
    def split(s):
        splits = s[1].split("|")
        for sp in splits:
            yield Row(s[0], sp)

    ds = ds.map(lambda i: (i[0] + 1, i[1])) \
        .flat_map(split, Types.ROW([Types.INT(), Types.STRING()])) \
        .key_by(lambda i: i[1]) \
        .reduce(lambda i, j: Row(i[0] + j[0], i[1]))

    # option 2: the result type of ds is Types.TUPLE
    def split(s):
        splits = s[1].split("|")
        for sp in splits:
            yield s[0], sp

    ds = ds.map(lambda i: (i[0] + 1, i[1])) \
        .flat_map(split, Types.TUPLE([Types.INT(), Types.STRING()])) \
        .key_by(lambda i: i[1]) \
        .reduce(lambda i, j: (i[0] + j[0], i[1]))

    # emit ds to print sink
    t_env.execute_sql("""
        CREATE TABLE my_sink (
            a INT,
            b VARCHAR
        ) WITH (
            'connector' = 'print'
        )
    """)

    table = t_env.from_data_stream(ds)
    table_result = table.execute_insert("my_sink")

    Note: The output type of DataStream ds must be a composite type.
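    To see what the map, flat_map, key_by and reduce chain above computes, the following plain-Python sketch replays the same logic without Flink. It assumes ds was created from the three-element list used earlier on this page; the helper name simulate is hypothetical.

```python
def split(record):
    """Yield one (number, word) pair per '|'-separated word."""
    number, text = record
    for word in text.split('|'):
        yield number, word


def simulate(records):
    """Replay map -> flat_map -> key_by -> reduce on an in-memory list."""
    # map: add 1 to the first field
    mapped = [(number + 1, text) for number, text in records]
    # flat_map: one output pair per word
    pairs = [pair for record in mapped for pair in split(record)]
    # key_by on the word, reduce by summing the numbers
    totals = {}
    for number, word in pairs:
        totals[word] = totals.get(word, 0) + number
    return totals


records = [(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')]
print(simulate(records))  # {'aaa': 6, 'bb': 5, 'a': 7}
```

    In streaming mode, reduce emits an updated result for each incoming record, so a real job would also print intermediate sums; the sketch returns only the final total per key.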

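    Finally, call the execute method of the StreamExecutionEnvironment to submit the DataStream API job for execution:

    env.execute()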

    If you convert the DataStream to a Table and then write it to a Table API & SQL sink connector, you may need to submit the job using the execute method of the TableEnvironment instead:

    t_env.execute()