s3 source

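    To use the s3 source, configure your AWS Identity and Access Management (IAM) permissions to grant Data Prepper access to Amazon S3. A JSON IAM policy for this typically allows s3:GetObject on the objects in your bucket, sqs:ReceiveMessage and sqs:DeleteMessage on the Amazon SQS queue, and kms:Decrypt on the AWS KMS key.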

    If your S3 objects or Amazon SQS queues do not use AWS Key Management Service (AWS KMS) encryption, remove the kms:Decrypt permission.

    Configuration

    You can use the following options to configure the s3 source.

    sqs

    Use the following options to configure how the s3 source plugin uses Amazon SQS.

    | Option | Required | Type | Description |
    | :--- | :--- | :--- | :--- |
    | queue_url | Yes | String | The URL of the Amazon SQS queue from which messages are received. |
    | maximum_messages | No | Integer | The maximum number of messages to receive from the Amazon SQS queue in any single request. Default value is 10. |
    | visibility_timeout | No | Duration | The visibility timeout to apply to messages read from the Amazon SQS queue. This should be set to the amount of time that Data Prepper may take to read all the S3 objects in a batch. Default value is 30s. |
    | wait_time | No | Duration | The amount of time to wait for long polling on the Amazon SQS API. Default value is 20s. |
    | poll_delay | No | Duration | A delay to place between reading/processing a batch of Amazon SQS messages and making a subsequent request. Default value is 0s. |
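
    The options above can be combined as in the following fragment. This is a minimal sketch rather than a complete pipeline: it assumes the sqs block is nested under the s3 source, uses a placeholder queue URL, and spells out the documented defaults so that each one is easy to adjust.

    ```yaml
    source:
      s3:
        notification_type: sqs
        sqs:
          # Placeholder queue URL; replace with your own queue.
          queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"
          maximum_messages: 10      # documented default
          visibility_timeout: 30s   # documented default; raise it if a batch of S3 objects takes longer to read
          wait_time: 20s            # documented default (long polling)
          poll_delay: 0s            # documented default (no pause between batches)
    ```

    Omitting any option other than queue_url should give the same behavior, because the values shown are the defaults.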

    aws

    newline codec

    The newline codec parses each line as a single log event. This is ideal for most application logs because each event occupies a single line. It is also suitable for S3 objects that contain one JSON object per line, which pairs well with the parse_json processor to parse each line.

    Use the following options to configure the newline codec.

    | Option | Required | Type | Description |
    | :--- | :--- | :--- | :--- |
    | skip_lines | No | Integer | The number of lines to skip before creating events. You can use this configuration to skip common header rows. Default is 0. |
    | header_destination | No | String | A key value to assign to the header line of the S3 object. If this option is specified, then each event will contain a header_destination field. |
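
    As an illustration, the newline codec options might be set as follows. This is a sketch under assumptions: the value 1 for skip_lines and the key name header are illustrative choices, not defaults, and the rest of the source configuration is omitted.

    ```yaml
    source:
      s3:
        notification_type: sqs
        codec:
          newline:
            # Skip one leading line before creating events (illustrative value).
            skip_lines: 1
            # Alternatively, keep the header line and attach it to every event
            # under a key of your choice ("header" is a hypothetical name):
            # header_destination: header
    ```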

    json codec

    The json codec expects each S3 object to contain a JSON array and creates a Data Prepper log event for each object in that array.

    csv codec

    The csv codec parses objects in comma-separated value (CSV) format, with each row producing a Data Prepper log event. Use the following options to configure the csv codec.

    Using s3_select with the s3 source

    | Option | Required | Type | Description |
    | :--- | :--- | :--- | :--- |
    | expression | Yes, when using s3_select | String | The expression used to query the object. Maps directly to the Expression property. |
    | expression_type | No | String | The type of the provided expression. Default value is SQL. Maps directly to the ExpressionType. |
    | input_serialization | Yes, when using s3_select | String | Provides the S3 Select file format. Amazon S3 uses this format to parse object data into records and returns only records that match the specified SQL expression. May be csv, json, or parquet. |
    | compression_type | No | String | Specifies an object's compression format. Maps directly to the CompressionType. |
    | csv | No | csv | Provides the CSV configuration for processing CSV data. |
    | json | No | json | Provides the JSON configuration for processing JSON data. |
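
    A minimal s3_select configuration for CSV objects might look like the following. This is a sketch, assuming that s3_select is nested directly under the s3 source; the query is a generic S3 Select expression that returns every record, and the rest of the source configuration is omitted.

    ```yaml
    source:
      s3:
        notification_type: sqs
        s3_select:
          # Generic S3 Select query that returns every record of the object.
          expression: "select * from s3object s"
          expression_type: SQL        # documented default
          input_serialization: csv    # one of csv, json, or parquet
    ```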

    csv

    Use the following options in conjunction with the csv configuration for s3_select to determine how your parsed CSV file should be formatted.

    These options map directly to options available in the S3 Select CSVInput data type.

    json

    Use the following option in conjunction with json for s3_select to determine how S3 Select processes the JSON file.

    | Option | Required | Type | Description |
    | :--- | :--- | :--- | :--- |
    | type | No | String | The type of JSON array. May be either DOCUMENT or LINES. Maps directly to the Type property. |
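
    For JSON input, the type option might be set as in the following fragment. This is a sketch under the same assumption that s3_select is nested under the s3 source; LINES is chosen here for objects that hold one JSON object per line, while DOCUMENT would be the choice for a single JSON document.

    ```yaml
    source:
      s3:
        notification_type: sqs
        s3_select:
          expression: "select * from s3object s"
          input_serialization: json
          json:
            type: LINES   # use DOCUMENT for a single JSON document
    ```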

    Metrics

    The s3 source includes the following metrics.

    Counters

    • s3ObjectsFailed: The number of S3 objects that the s3 source failed to read.
    • s3ObjectsNotFound: The number of S3 objects that the s3 source failed to read due to an S3 “Not Found” error. These are also counted toward s3ObjectsFailed.
    • s3ObjectsAccessDenied: The number of S3 objects that the s3 source failed to read due to an “Access Denied” or “Forbidden” error. These are also counted toward s3ObjectsFailed.
    • s3ObjectsSucceeded: The number of S3 objects that the s3 source successfully read.
    • sqsMessagesReceived: The number of Amazon SQS messages received from the queue by the s3 source.
    • sqsMessagesDeleted: The number of Amazon SQS messages deleted from the queue by the s3 source.
    • sqsMessagesFailed: The number of Amazon SQS messages that the s3 source failed to parse.

    Timers

    • s3ObjectReadTimeElapsed: Measures the amount of time the s3 source takes to perform a request to GET an S3 object, parse it, and write events to the buffer.
    • sqsMessageDelay: Measures the time elapsed from when S3 creates an object to when it is fully parsed.

    Distribution summaries

    • s3ObjectProcessedBytes: Measures the bytes processed by the s3 source for a given object. For compressed objects, this is the uncompressed size.
    • s3ObjectsEvents: Measures the number of events (sometimes called records) produced by an S3 object.

    Example: Uncompressed logs

    source:
      s3:
        notification_type: sqs
        codec:
          newline:
        compression: none
        sqs:
          queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"
        aws:
          region: "us-east-1"