The Airbyte Amazon S3 connector

Obsolete connector

We consider this connector obsolete. For better performance, flexibility, and reliability, use the S3-compatible Object Storage connector.

You can use this connector for source and target endpoints.

Source endpoint configuration

You can configure your endpoint to transfer CSV, Parquet, Avro, or JSON Lines files.

Warning

If syncing from a private bucket, the credentials you use for the connection must have both read and list access on the S3 bucket.

  1. In the Dataset field, enter the name of the table to create for the data from Amazon S3.

  2. Specify the Path Pattern that identifies the files to transfer. Enter ** to match all files in the bucket, or specify the exact path to the files, including their extensions. The pattern uses wcmatch.glob syntax; see the pattern-matching sketch after these steps.

  3. Provide a Schema as a JSON string. You can specify {} to auto-infer the schema.

  4. Configure properties specific to a format:

    CSV
    • Delimiter is a one-character string. This is a required field.

    • Quote char is used to quote values.

    • Escape char is used to escape special characters. Leave this field blank to disable escaping.

    • Encoding is any encoding from the list of Python encodings. Leave this field blank to use the default UTF-8 encoding.

    • Check the Double quote box if two consecutive quote characters in the CSV files represent a single quote.

    • Check the Newlines in values box if the CSV files in your bucket contain newline characters inside values. Enabling this setting may lower performance.

    • Block size is the number of bytes to process in memory in parallel while reading files. We recommend keeping the default value of 10000.

    • The Additional reader options field accepts a JSON string that represents PyArrow ConvertOptions. If any options specified here duplicate the options above, this field's values take precedence. See the PyArrow sketch after these steps.

    • The Advanced options field accepts a JSON string that represents PyArrow ReadOptions. If any options specified here duplicate the options above, this field's values take precedence.

    Parquet
    • Buffer size enables buffering for column chunks and sets the buffer size. If not specified, each column is fully loaded into memory.

    • Columns lists the names of the columns to transfer. If you specify at least one column name, only the listed columns are transferred. Leave this field empty to transfer all columns.

    • Batch size is the maximum number of records per batch.

    Avro

    This format requires no additional settings.

    JSON Lines
    • The Allow newlines in values checkbox enables newline characters in JSON values. Enabling this parameter may affect transfer performance.

    • The Unexpected field behavior drop-down menu lets you select how to process JSON fields that aren't in the provided schema:

      • Ignore - ignores unexpected JSON fields.
      • Error - returns an error when encountering unexpected JSON fields.
      • Infer - type-infers unexpected JSON fields and includes them in the output. We recommend this option as the default.
    • Block Size is the number of bytes to process in memory in parallel while reading files. We recommend keeping the default value of 10000.

  5. Specify the S3: Amazon Web Services settings:

    • The name of your Bucket.

    • Your AWS Access Key ID. This field isn't necessary if you are accessing a public AWS bucket.

    • Your AWS Secret Access Key. This field isn't necessary if you are accessing a public AWS bucket.

      Tip

      You can find your credentials on the Identity and Access Management (IAM) page in the AWS console. Look for the Access keys for CLI, SDK, & API access section and click Create access key or use an existing one.

    • Path Prefix is the folder path under which to look for files; specifying it speeds up the file search in the bucket.

    • Endpoint name of the S3-compatible service, if you use one. Leave this field blank to use AWS itself.

    • Check the Use SSL box to use an SSL/TLS connection.

    • Check the Verify SSL Cert box to allow self-signed certificates.
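
The sketch below is a local illustration of how Path Pattern matching behaves with wcmatch.glob; it isn't connector code. The object keys are made-up examples, and enabling the GLOBSTAR flag so that ** spans folders is an assumption about the connector's matching flags.

```python
# A minimal sketch of Path Pattern matching with wcmatch.glob.
# Assumptions: the object keys are made-up examples, and the GLOBSTAR flag
# (which lets ** span folder boundaries) reflects how the connector is
# expected to match paths.
from wcmatch import glob

object_keys = [
    "landing/2024/01/orders.csv",
    "landing/2024/02/orders.csv",
    "landing/archive/orders_old.csv",
    "reports/summary.parquet",
]

patterns = {
    "**": "every file in the bucket",
    "landing/**/*.csv": "CSV files anywhere under landing/",
    "reports/summary.parquet": "one exact file",
}

for pattern, description in patterns.items():
    matched = [key for key in object_keys
               if glob.globmatch(key, pattern, flags=glob.GLOBSTAR)]
    print(f"{pattern!r} ({description}): {matched}")
```

Testing patterns locally like this helps you confirm that the pattern selects exactly the files you expect before you start the transfer.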
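Because the Additional reader options and Advanced options fields are JSON representations of PyArrow ConvertOptions and ReadOptions, you can prototype them locally before pasting them into the endpoint form. The sketch below is such a local prototype, not connector code: the file names and option values are assumptions, and treating the JSON Lines choices as mirrors of pyarrow.json's reader options is also an assumption.

```python
# A local prototype (not connector code) of the CSV and JSON Lines settings.
# Assumptions: sample.csv / sample.jsonl are hypothetical local files, and the
# option values are examples only.
import json

import pyarrow.csv as pv_csv
import pyarrow.json as pv_json

# JSON strings like the ones you might paste into the endpoint form.
additional_reader_options = json.loads(
    '{"strings_can_be_null": true, "null_values": ["", "NULL"]}'  # ConvertOptions
)
advanced_options = json.loads('{"block_size": 10000}')            # ReadOptions

csv_table = pv_csv.read_csv(
    "sample.csv",
    read_options=pv_csv.ReadOptions(**advanced_options),
    parse_options=pv_csv.ParseOptions(
        delimiter=",",             # Delimiter
        quote_char='"',            # Quote char
        double_quote=True,         # Double quote
        newlines_in_values=False,  # Newlines in values
    ),
    convert_options=pv_csv.ConvertOptions(**additional_reader_options),
)
print(csv_table.schema)

# The JSON Lines choices (Ignore / Error / Infer) appear to mirror
# pyarrow.json's unexpected_field_behavior values (an assumption here).
jsonl_table = pv_json.read_json(
    "sample.jsonl",
    read_options=pv_json.ReadOptions(block_size=10000),                     # Block Size
    parse_options=pv_json.ParseOptions(unexpected_field_behavior="infer"),  # Infer
)
print(jsonl_table.schema)
```

If the prototype reads your sample files the way you expect, paste the same JSON strings into the corresponding fields.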

To create an Amazon S3 source endpoint with API, use the endpoint.airbyte.S3Source model.

Target endpoint configuration

You can configure your endpoint to transfer data as CSV, JSON, or Raw data files to your AWS bucket or another S3-compatible bucket.

See the list of connectable source endpoints in our Endpoints connectivity matrix.

  1. Specify the Connection settings:

    • Your AWS Access Key ID. Leave empty if you are accessing a public AWS bucket or if you use another S3-compatible service.

    • Your AWS Secret Access Key. Leave empty if you are accessing a public AWS bucket or if you use another S3-compatible service.

      Tip

      You can find your credentials on the Identity and Access Management (IAM) page in the AWS console. Look for the Access keys for CLI, SDK, & API access section and click Create access key or use an existing one.

    • Specify the AWS Region to which the connector will send requests. This field is AWS-specific; leave it empty if you use another S3-compatible service.

    • Specify the Endpoint of an S3-compatible service. Leave this field empty to use AWS.

    • Check the Use SSL box to use a secure SSL/TLS connection.

    • Check the Verify SSL certificate box to allow self-signed SSL certificates.

    • Specify the Bucket name.

    • Specify the Folder name for the objects. The DoubleCloud S3 connector supports date templates for folder naming: YYYY/MM/DD/folder_name.

      The default field value is 2006/01/02. See the layout sketch after these steps.

  2. From the drop-down menu, select the serialization format:

    • JSON
    • CSV
    • Raw data

    This is the format the connector will use to write the data into the bucket.

  3. Select the encoding format for your data:

    • Uncompressed - doesn't apply any compression.

    • Gzip - applies the GNU Gzip file compression.

  4. Specify the advanced settings:

    • Under Buffer size, set the size of object files into which the connector will split the data:

      1. In the input field, specify the number of units.

      2. From the drop-down menu, select the unit:

        • B - byte
        • KB - kilobyte
        • MB - megabyte
        • GB - gigabyte

        The default setting is 512 MB.

    • Under Flush interval, set the interval after which the connector will write the object files regardless of their size:

      1. In the input field, specify the number of time units.

      2. From the drop-down menu, select the time unit:

        • msec
        • sec
        • min
        • hours
        • days

        The default setting is 30 sec.

    • Set the timezone name(s) into which the connector will split the data, using the tz database timezone format, for example, America/Edmonton.

    • Specify the name of the row time column into which the connector will write the logical time of the data.
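
The sketch below is a local illustration of how a date-templated Folder name such as YYYY/MM/DD/folder_name lays out objects in the bucket; it isn't connector code. The bucket, folder, and file names are made up, and rendering the template with the object's write date is an assumption about how the template is resolved.

```python
# An illustration (not connector code) of how a date-templated Folder name
# such as YYYY/MM/DD/folder_name partitions objects by write date.
# Assumptions: the template resolves to the year/month/day of the write time;
# the bucket, folder, and file names are made up.
from datetime import datetime, timezone

bucket = "my-target-bucket"          # hypothetical bucket name
folder_template = "%Y/%m/%d/events"  # YYYY/MM/DD/folder_name rendered via strftime

write_time = datetime(2024, 7, 15, tzinfo=timezone.utc)
folder = write_time.strftime(folder_template)

# An object written on that date would land under a key such as:
print(f"s3://{bucket}/{folder}/part-000000.csv.gz")
# s3://my-target-bucket/2024/07/15/events/part-000000.csv.gz
```

How large each object file grows and how often it's written are controlled by the Buffer size and Flush interval settings above.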

To create an Amazon S3 target endpoint with API, use the endpoint.airbyte.S3Target model.
