The Amazon S3 connector

You can use this connector for source endpoints.

Source endpoint configuration

You can configure your endpoint to transfer CSV, Parquet, Avro or JSON Lines file types.

Warning

If syncing from a private bucket, the credentials you use for the connection must have both read and list access on the S3 bucket.

  1. Name the table you want to create for data from Amazon S3 in the Dataset field.

  2. Specify Path Pattern to identify the files to select for transfer. Enter ** to match all files in a bucket, or specify the exact path to the files with their extensions. This pattern uses wcmatch.glob syntax (see the sketch after this list).

  3. Provide a Schema as a JSON string. You can specify {} to auto-infer the schema (see the sketch after this list).

  4. Configure properties specific to a format (the sketches after this list illustrate how these settings behave):

    For CSV files:

    • Delimiter is a one-character string. This field is required.

    • Quote char is used to quote values.

    • Escape char is used to escape special characters. Leave this field blank to disable escaping.

    • Encoding as shown in the list of Python encodings. Leave this field blank to use the default UTF-8 encoding.

    • Check the Double quote box if two quotes in CSV files correspond to a single quote.

    • Check the Newlines in values box if the CSV files in your bucket contain newline characters. If enabled, this setting might lower performance.

    • Block size is the number of bytes to process in memory in parallel while reading files. We recommend keeping the default value of 10000.

    • The Additional reader options field accepts a JSON string that denotes the PyArrow ConvertOptions. If options specified here duplicate any of the above options, this field's values override the previously specified ones.

    • The Advanced options field accepts a JSON string that denotes the PyArrow ReadOptions. If options specified here duplicate any of the above options, this field's values override the previously specified ones.

    For Parquet files:

    • Buffer size enables buffering for column chunks and specifies the buffer size. If not specified, every column is fully loaded into memory.

    • Columns contains the names of the columns whose data will be transferred. If you specify at least one column name here, only the specified columns are transferred. Leave this field empty to transfer all columns.

    • Batch size is the maximum number of records per batch.

    For Avro files:

    This format requires no additional settings.

    For JSON Lines files:

    • The Allow newlines in values checkbox enables newline characters in JSON values. Enabling this parameter may affect transfer performance.

    • The Unexpected field behavior drop-down menu allows you to select how to process the JSON fields outside the provided schema:

      • Ignore - ignores unexpected JSON fields.
      • Error - returns an error when encountering unexpected JSON fields.
      • Infer - type-infers unexpected JSON fields and includes them in the output. We recommend using this option by default.

    • Block Size is the number of bytes to process in memory in parallel while reading files. We recommend keeping the default value of 10000.

  5. Specify the S3: Amazon Web Services settings (see the connection check sketch after this list):

    • The name of your Bucket.

    • Your AWS Access Key ID. This field isn't necessary if you are accessing a public AWS bucket.

    • Your AWS Secret Access Key. This field isn't necessary if you are accessing a public AWS bucket.

      Tip

      You can find your credentials on the Identity and Access Management (IAM) page in the AWS console. Look for the Access keys for CLI, SDK, & API access section and click Create access key or use an existing one.

    • Path Prefix to point to the folder containing your files and speed up the file search in the bucket.

    • Endpoint name if you use an S3-compatible service. Leave blank to use AWS itself.

    • Check the Use SSL box to use an SSL/TLS connection.

    • Check the Verify SSL Cert box to allow self-signed certificates.
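
The Path Pattern field in step 2 uses wcmatch.glob syntax. The minimal sketch below, with hypothetical object keys, shows how you can test a pattern locally before saving the endpoint:

  # Check which object keys a path pattern would match, using the same
  # wcmatch.glob syntax the Path Pattern field expects.
  from wcmatch import glob

  keys = [
      "raw/2024/orders.csv",       # hypothetical object keys
      "raw/2024/orders.parquet",
      "archive/old_orders.csv",
  ]
  pattern = "raw/**/*.csv"         # match only CSV files under raw/

  matched = [k for k in keys if glob.globmatch(k, pattern, flags=glob.GLOBSTAR)]
  print(matched)                   # ['raw/2024/orders.csv']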
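
For the Schema field in step 3, the sketch below assumes the schema maps column names to type names; the column names and types shown are hypothetical. Pass {} instead to let the connector infer the schema:

  # Build the Schema field value as a JSON string.
  # The column names and types here are hypothetical examples.
  import json

  schema = {
      "order_id": "integer",
      "customer": "string",
      "amount": "number",
  }
  print(json.dumps(schema))  # paste the printed string into the Schema field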
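
For CSV files, Additional reader options and Advanced options denote PyArrow ConvertOptions and ReadOptions, as noted above. The sketch below shows roughly how equivalent options behave when reading a file locally with PyArrow; the mapping of the remaining fields to ParseOptions is an assumption, and the file name and values are only illustrative:

  # Reproduce the CSV settings with PyArrow's CSV reader options.
  import pyarrow.csv as pacsv

  read_opts = pacsv.ReadOptions(
      block_size=10000,            # Block size field
      encoding="utf8",             # Encoding field (blank means UTF-8)
  )
  parse_opts = pacsv.ParseOptions(
      delimiter=",",               # Delimiter field
      quote_char='"',              # Quote char field
      escape_char=False,           # Escape char field (blank disables escaping)
      double_quote=True,           # Double quote checkbox
      newlines_in_values=False,    # Newlines in values checkbox
  )
  convert_opts = pacsv.ConvertOptions(
      strings_can_be_null=True,    # e.g. Additional reader options:
  )                                # {"strings_can_be_null": true}

  table = pacsv.read_csv("orders.csv", read_options=read_opts,
                         parse_options=parse_opts, convert_options=convert_opts)
  print(table.schema)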
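
For Parquet files, a minimal sketch assuming the Buffer size, Columns, and Batch size fields behave like the similarly named options of PyArrow's ParquetFile reader; the file and column names are illustrative:

  # Read a Parquet file in batches with options similar to the Parquet fields.
  import pyarrow.parquet as pq

  pf = pq.ParquetFile(
      "orders.parquet",
      buffer_size=65536,               # Buffer size field; 0 loads full columns
  )
  for batch in pf.iter_batches(
      batch_size=10000,                # Batch size field
      columns=["order_id", "amount"],  # Columns field; omit to transfer all
  ):
      print(batch.num_rows)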
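
For JSON Lines files, a minimal sketch assuming the fields behave like PyArrow's JSON reading options, including the three Unexpected field behavior values; the file name is illustrative:

  # Read a JSON Lines file with options similar to the JSON Lines fields.
  import pyarrow.json as pajson

  read_opts = pajson.ReadOptions(block_size=10000)   # Block Size field
  parse_opts = pajson.ParseOptions(
      newlines_in_values=False,                      # Allow newlines in values
      unexpected_field_behavior="infer",             # Ignore / Error / Infer
  )
  table = pajson.read_json("events.jsonl", read_options=read_opts,
                           parse_options=parse_opts)
  print(table.schema)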
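
Before saving the endpoint, you can run a quick check of the step 5 settings with boto3 to confirm the credentials have the list and read access required for private buckets. This is only a generic sketch, not part of the connector itself; the bucket name, prefix, and key values are placeholders:

  # Verify that the supplied credentials can list and read objects,
  # optionally through an S3-compatible endpoint.
  import boto3

  s3 = boto3.client(
      "s3",
      aws_access_key_id="AKIA...",   # AWS Access Key ID (omit for public buckets)
      aws_secret_access_key="...",   # AWS Secret Access Key
      endpoint_url=None,             # set for an S3-compatible service
      use_ssl=True,                  # Use SSL checkbox
      verify=True,                   # TLS certificate verification
  )

  # List access: needed to discover files under the Path Prefix.
  resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="raw/", MaxKeys=5)
  keys = [obj["Key"] for obj in resp.get("Contents", [])]
  print(keys)

  # Read access: needed to download the matched files.
  if keys:
      s3.get_object(Bucket="my-bucket", Key=keys[0])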

You can see how this endpoint is used in real scenarios: