Amazon Kinesis Data Streams SQL Connector
The Kinesis connector allows for reading data from and writing data into Amazon Kinesis Data Streams (KDS).
To use the Kinesis connector, the following dependencies are required both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.
How to create a Kinesis data stream table
Follow the instructions from the Amazon KDS Developer Guide to set up a Kinesis stream. The following example shows how to create a table backed by a Kinesis data stream:
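A minimal sketch of such a table definition is shown here. The table name, column names, stream name, and region are placeholders. The option names connector, stream, and aws.region do not appear in the text on this page; they are given as recalled from the connector's option table and should be checked against the connector version you use.

```sql
-- Sketch only: table, columns, stream name, and region are placeholders.
CREATE TABLE KinesisTable (
  `user_id` BIGINT,
  `item_id` BIGINT,
  `category_id` BIGINT,
  `behavior` STRING,
  `ts` TIMESTAMP(3)
)
PARTITIONED BY (user_id, item_id)
WITH (
  'connector' = 'kinesis',
  'stream' = 'user_behavior',        -- assumed option: name of the backing Kinesis data stream
  'aws.region' = 'us-east-2',        -- assumed option: AWS region where the stream is defined
  'scan.stream.initpos' = 'LATEST',  -- start reading from the latest record
  'format' = 'csv'                   -- format used to (de)serialize record payloads
);
```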
The following metadata can be exposed as read-only (VIRTUAL) columns in a table definition.
The extended CREATE TABLE example demonstrates the syntax for exposing these metadata fields:
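A hedged sketch of such an extended definition follows. The metadata keys (timestamp, shard-id, sequence-number) and their column types are not listed in the text on this page; they are assumptions recalled from the connector's metadata table, and the column names on the left are arbitrary. Verify both against the connector version you use.

```sql
-- Sketch only: metadata keys and types are assumed, not taken from this page.
CREATE TABLE KinesisTable (
  `user_id` BIGINT,
  `item_id` BIGINT,
  `behavior` STRING,
  `ts` TIMESTAMP(3),
  -- Read-only metadata columns: declared with METADATA FROM '<key>' VIRTUAL
  `arrival_time` TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL,
  `shard_id` VARCHAR(128) NOT NULL METADATA FROM 'shard-id' VIRTUAL,
  `sequence_number` VARCHAR(128) NOT NULL METADATA FROM 'sequence-number' VIRTUAL
)
WITH (
  'connector' = 'kinesis',
  'stream' = 'user_behavior',        -- placeholder stream name
  'aws.region' = 'us-east-2',        -- placeholder region
  'scan.stream.initpos' = 'LATEST',
  'format' = 'csv'
);
```

Because the columns are VIRTUAL, they can be selected from the table but are excluded when writing to it.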
Connector Options
Depending on your deployment, you would choose a different Credentials Provider to allow access to Kinesis. By default, the AUTO Credentials Provider is used. If the access key ID and secret key are set in the deployment configuration, this results in using the BASIC provider.
A specific AWSCredentialsProvider can optionally be set using the aws.credentials.provider setting. Supported values are:
- AUTO: Use the default AWS Credentials Provider chain that searches for credentials in the following order: ENV_VARS, SYS_PROPS, WEB_IDENTITY_TOKEN, PROFILE, and EC2/ECS credentials provider.
- BASIC: Use access key ID and secret key supplied as configuration.
- ENV_VAR: Use AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY environment variables.
- SYS_PROP: Use Java system properties aws.accessKeyId and aws.secretKey.
- PROFILE: Use an AWS credentials profile to create the AWS credentials.
- ASSUME_ROLE: Create AWS credentials by assuming a role. The credentials for assuming the role must be supplied.
- WEB_IDENTITY_TOKEN: Create AWS credentials by assuming a role using Web Identity Token.
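As an illustration, the provider can be selected in the WITH clause of the table definition. This is a sketch under assumptions: the sub-option aws.credentials.profile.name is not mentioned in the text above and is recalled from the connector's option table, the profile name flink-kinesis is hypothetical, and the table, stream, and region values are placeholders.

```sql
-- Sketch only: selects the PROFILE credentials provider for this table.
CREATE TABLE KinesisTable (
  `user_id` BIGINT,
  `behavior` STRING
)
WITH (
  'connector' = 'kinesis',
  'stream' = 'user_behavior',                        -- placeholder stream name
  'aws.region' = 'us-east-2',                        -- placeholder region
  'aws.credentials.provider' = 'PROFILE',            -- one of the supported values listed above
  'aws.credentials.profile.name' = 'flink-kinesis',  -- assumed option name, hypothetical profile
  'format' = 'json'
);
```

Leaving aws.credentials.provider unset falls back to AUTO, which is usually preferable to embedding keys in a table definition.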
You can configure table sources to start reading a table-backing Kinesis data stream from a specific position through the scan.stream.initpos option. Available values are:
- LATEST: read shards starting from the latest record.
- TRIM_HORIZON: read shards starting from the earliest record possible (data may be trimmed by Kinesis depending on the current retention settings of the backing stream).
- AT_TIMESTAMP: read shards starting from a specified timestamp. The timestamp value should be specified through the scan.stream.initpos-timestamp option in one of the following formats:
  - A value conforming to a user-defined SimpleDateFormat pattern. If the user does not define a format, the default pattern is yyyy-MM-dd'T'HH:mm:ss.SSSXXX. For example, the timestamp value is 2016-04-04 with the user-defined format yyyy-MM-dd, or the timestamp value is 2016-04-04T19:58:46.480-00:00 when no user-defined format is provided.
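The AT_TIMESTAMP case can be sketched as follows, using the date-only example from the list above. The option name scan.stream.initpos-timestamp-format is an assumption: the text above does not name the option that carries the user-defined pattern. Table, stream, and region values are placeholders.

```sql
-- Sketch only: start reading from records that arrived on or after 2016-04-04.
CREATE TABLE KinesisTable (
  `user_id` BIGINT,
  `behavior` STRING
)
WITH (
  'connector' = 'kinesis',
  'stream' = 'user_behavior',                           -- placeholder stream name
  'aws.region' = 'us-east-2',                           -- placeholder region
  'scan.stream.initpos' = 'AT_TIMESTAMP',
  'scan.stream.initpos-timestamp' = '2016-04-04',
  'scan.stream.initpos-timestamp-format' = 'yyyy-MM-dd', -- assumed option name for the pattern
  'format' = 'json'
);
```

With the default pattern, the format option would be omitted and the timestamp written in full, for example 2016-04-04T19:58:46.480-00:00.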
Kinesis data streams consist of one or more shards, and the sink.partitioner option allows you to control how records written into a multi-shard Kinesis-backed table will be partitioned between its shards. Valid values are:
- fixed: Kinesis PartitionKey values are derived from the Flink subtask index, so each Flink partition ends up in at most one Kinesis partition (assuming that no re-sharding takes place at runtime).
- random: Kinesis PartitionKey values are assigned randomly. This is the default value for tables not defined with a PARTITIONED BY clause.
- Custom FixedKinesisPartitioner subclass: e.g. 'org.mycompany.MyPartitioner'.
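For illustration, a sink table that pins each Flink subtask to a single partition key might be declared as below. This is a sketch: the table, columns, stream name, and region are placeholders, and the stream and aws.region option names are assumed from the connector's option table.

```sql
-- Sketch only: sink table using the fixed partitioner.
CREATE TABLE KinesisSinkTable (
  `user_id` BIGINT,
  `behavior` STRING
)
WITH (
  'connector' = 'kinesis',
  'stream' = 'user_behavior',    -- placeholder stream name
  'aws.region' = 'us-east-2',    -- placeholder region
  'sink.partitioner' = 'fixed',  -- PartitionKey derived from the Flink subtask index
  'format' = 'json'
);
```

To plug in a custom partitioner, the value would instead be the fully qualified class name, such as 'org.mycompany.MyPartitioner' from the list above.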
Note Using EFO will .
You can enable and configure EFO with the following properties:
- scan.stream.recordpublisher: Determines whether to use EFO or POLLING.
- scan.stream.efo.consumername: A name to identify the consumer when the above value is EFO.
- scan.stream.efo.registration: Strategy for (de-)registration of EFO consumers with the name given by the scan.stream.efo.consumername value. Valid strategies are:
  - LAZY (default): Stream consumers are registered when the Flink job starts running. If the stream consumer already exists, it will be reused. This is the preferred strategy for the majority of applications. However, jobs with parallelism greater than 1 will result in tasks competing to register and acquire the stream consumer ARN. For jobs with very large parallelism this can result in an increased start-up time. The describe operation has a limit of 20 transactions per second, which means application startup time will increase by roughly parallelism/20 seconds.
  - EAGER: Stream consumers are registered in the FlinkKinesisConsumer constructor. If the stream consumer already exists, it will be reused. This will result in registration occurring when the job is constructed, either on the Flink Job Manager or in the client environment submitting the job. Using this strategy results in a single thread registering and retrieving the stream consumer ARN, reducing startup time over LAZY (with large parallelism). However, consider that the client environment will require access to the AWS services.
  - NONE: Stream consumer registration is not performed by FlinkKinesisConsumer. Registration must be performed externally by invoking RegisterStreamConsumer. Stream consumer ARNs should be provided to the job via the consumer configuration.
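The three properties above could be combined in a table definition along these lines. This is a sketch: the consumer name my-flink-efo-consumer is hypothetical, the table, stream, and region values are placeholders, and the stream and aws.region option names are assumed from the connector's option table.

```sql
-- Sketch only: source table reading through an EFO consumer.
CREATE TABLE KinesisTable (
  `user_id` BIGINT,
  `behavior` STRING
)
WITH (
  'connector' = 'kinesis',
  'stream' = 'user_behavior',                              -- placeholder stream name
  'aws.region' = 'us-east-2',                              -- placeholder region
  'scan.stream.recordpublisher' = 'EFO',                   -- use EFO instead of POLLING
  'scan.stream.efo.consumername' = 'my-flink-efo-consumer', -- hypothetical consumer name
  'scan.stream.efo.registration' = 'LAZY',                 -- the default strategy, shown explicitly
  'format' = 'json'
);
```

As a rough worked example of the LAZY estimate above, a job with a parallelism of 200 would add about 200/20 = 10 seconds to startup, which is where EAGER may become worth considering.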
Note For a given Kinesis data stream, each EFO consumer must have a unique name. However, consumer names do not have to be unique across data streams. Reusing a consumer name will result in existing subscriptions being terminated.
Note With the LAZY strategy, stream consumers are de-registered when the job is shut down gracefully. In the event that a job terminates without executing the shutdown hooks, stream consumers will remain active. In this situation the stream consumers will be gracefully reused when the application restarts. With the NONE and EAGER strategies, stream consumer de-registration is not performed.
Data Type Mapping
Kinesis stores records as Base64-encoded binary data objects, so it doesn’t have a notion of internal record structure. Instead, Kinesis records are deserialized and serialized by formats, e.g. ‘avro’, ‘csv’, or ‘json’. To determine the data type of the messages in your Kinesis-backed tables, pick a suitable Flink format with the format keyword. Please refer to the Formats pages for more details.