s3 source
To use the s3 source, configure your AWS Identity and Access Management (IAM) permissions to grant Data Prepper access to Amazon S3. You can use a permissions configuration similar to the following JSON:
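The policy below is a minimal sketch of such a permissions configuration. The bucket name, queue name, account ID, Region, and key ID are placeholders that you would replace with your own values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "s3-access",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<YOUR-BUCKET>/*"
    },
    {
      "Sid": "sqs-access",
      "Effect": "Allow",
      "Action": ["sqs:DeleteMessage", "sqs:ReceiveMessage"],
      "Resource": "arn:aws:sqs:<REGION>:<ACCOUNT-ID>:<YOUR-QUEUE>"
    },
    {
      "Sid": "kms-access",
      "Effect": "Allow",
      "Action": "kms:Decrypt",
      "Resource": "arn:aws:kms:<REGION>:<ACCOUNT-ID>:key/<KEY-ID>"
    }
  ]
}
```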
If your S3 objects or Amazon SQS queues do not use AWS KMS encryption, remove the kms:Decrypt permission.
Configuration
You can use the following options to configure the s3 source.

sqs

The following parameters allow you to configure usage for Amazon SQS in the s3 source plugin.
Option | Required | Type | Description |
---|---|---|---|
queue_url | Yes | String | The URL of the Amazon SQS queue from which messages are received. |
maximum_messages | No | Integer | The maximum number of messages to receive from the Amazon SQS queue in any single request. Default value is 10. |
visibility_timeout | No | Duration | The visibility timeout to apply to messages read from the Amazon SQS queue. This should be set to the amount of time that Data Prepper may take to read all the S3 objects in a batch. Default value is 30s. |
wait_time | No | Duration | The amount of time to wait for long polling on the Amazon SQS API. Default value is 20s. |
poll_delay | No | Duration | A delay placed between reading and processing a batch of Amazon SQS messages and making a subsequent request. Default value is 0s. |
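Put together, an sqs block using these options might look like the following sketch; the queue URL is a placeholder:

```yaml
sqs:
  queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"
  maximum_messages: 10
  visibility_timeout: "60s"
  wait_time: "20s"
  poll_delay: "0s"
```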
aws
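As a sketch of an aws block, assuming the commonly used region and sts_role_arn options (the role ARN shown is a placeholder):

```yaml
aws:
  region: "us-east-1"
  sts_role_arn: "arn:aws:iam::123456789012:role/Data-Prepper"
```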
newline codec

The newline codec parses each line of an S3 object as a single log event. This is ideal for most application logs, where each event is written on its own line. It is also suitable for S3 objects that contain one JSON object per line, which pairs well with a JSON-parsing processor applied to each line.
Use the following options to configure the newline codec.
Option | Required | Type | Description |
---|---|---|---|
skip_lines | No | Integer | The number of lines to skip before creating events. You can use this configuration to skip common header rows. Default is 0. |
header_destination | No | String | A key value to assign to the header line of the S3 object. If this option is specified, then each event will contain a header_destination field. |
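For example, assuming a log file whose first line is a header row, the newline codec could be configured as in this sketch:

```yaml
codec:
  newline:
    skip_lines: 1
    header_destination: "header"
```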
json codec
The json codec parses each S3 object as a single JSON document whose top level is a JSON array and then creates a Data Prepper log event for each object in the array.
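For example, an S3 object containing the array [{"status": 200}, {"status": 404}] would produce two log events. A minimal codec configuration sketch:

```yaml
codec:
  json:
```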
csv codec

The csv codec parses objects in comma-separated value (CSV) format, with each row producing a Data Prepper log event. Use the following options to configure the csv codec.
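As a minimal sketch, a csv codec block might look like the following; the delimiter option shown here is an assumption and may differ in your Data Prepper version:

```yaml
codec:
  csv:
    delimiter: ","
```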
Using s3_select with the s3 source
Option | Required | Type | Description |
---|---|---|---|
expression | Yes, when using s3_select | String | The expression used to query the object. Maps directly to the corresponding S3 Select request property. |
expression_type | No | String | The type of the provided expression. Default value is SQL. Maps directly to the ExpressionType property. |
input_serialization | Yes, when using s3_select | String | Provides the S3 Select file format. Amazon S3 uses this format to parse object data into records and returns only records that match the specified SQL expression. May be csv, json, or parquet. |
compression_type | No | String | Specifies an object's compression format. Maps directly to the corresponding S3 Select request property. |
csv | No | csv | Provides the CSV configuration for processing CSV data. |
json | No | json | Provides the JSON configuration for processing JSON data. |
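Assembled into a source configuration, an s3_select block might look like the following sketch; the SQL expression and serialization settings are illustrative, and the queue URL is a placeholder:

```yaml
source:
  s3:
    s3_select:
      expression: "select * from s3object s"
      input_serialization: csv
      compression_type: none
    sqs:
      queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"
    aws:
      region: "us-east-1"
```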
csv
Use the following options in conjunction with the csv configuration for s3_select to determine how your parsed CSV file should be formatted. These options map directly to options available in the corresponding S3 Select data type.
json
Use the following option in conjunction with json for s3_select to determine how S3 Select processes the JSON file.
Option | Required | Type | Description |
---|---|---|---|
type | No | String | The type of JSON array. May be either DOCUMENT or LINES. Maps directly to the corresponding S3 Select property. |
Metrics

The s3 source includes the following metrics.

Counters

* s3ObjectsFailed: The number of S3 objects that the s3 source failed to read.
* s3ObjectsNotFound: The number of S3 objects that the s3 source failed to read due to an S3 "Not Found" error. These are also counted toward s3ObjectsFailed.
* s3ObjectsAccessDenied: The number of S3 objects that the s3 source failed to read due to an "Access Denied" or "Forbidden" error. These are also counted toward s3ObjectsFailed.
* s3ObjectsSucceeded: The number of S3 objects that the s3 source successfully read.
* sqsMessagesReceived: The number of Amazon SQS messages received from the queue by the s3 source.
* sqsMessagesDeleted: The number of Amazon SQS messages deleted from the queue by the s3 source.
* sqsMessagesFailed: The number of Amazon SQS messages that the s3 source failed to parse.
Timers
* s3ObjectReadTimeElapsed: Measures the amount of time the s3 source takes to perform a request to GET an S3 object, parse it, and write events to the buffer.
* sqsMessageDelay: Measures the time elapsed from when S3 creates an object to when the object is fully parsed.
Distribution summaries

* s3ObjectProcessedBytes: Measures the number of bytes processed by the s3 source for a given object. For compressed objects, this is the uncompressed size.
* s3ObjectsEvents: Measures the number of events (sometimes called records) produced by an S3 object.
Example: Uncompressed logs
source:
  s3:
    notification_type: sqs
    codec:
      newline:
    compression: none
    sqs:
      queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"
    aws:
      region: "us-east-1"