Loading Data from AWS S3 to AWS Redshift

AWS S3 is a vital component of the AWS (Amazon Web Services) ecosystem. Any type of file can be stored in AWS S3 and retrieved or loaded into most products in the AWS ecosystem seamlessly. The optimal way to load data into AWS Redshift is via AWS S3: very large amounts of data can be loaded into Redshift within a few minutes if the data is already stored in S3. This post shows how to load data into AWS Redshift from AWS S3.

Loading data to Redshift using the COPY command

The command below copies the specified CSV file from S3 into the destination table. The delimiter specified here is a tab ('\t'); set the delimiter to match your file. You can also specify the quote character used in the file. If these additional parameters are not passed, the default values are used.

copy schema.dest_table_name
    from 's3://bucket_name/filename.csv'
    access_key_id 'your_access_key_id'
    secret_access_key 'your_secret_access_key'
    delimiter '\t'
    CSV quote '^'

Loading a compressed file into Redshift from AWS S3

Storing compressed files in S3 reduces storage cost and data transfer. GZIP-compressed CSV files can be loaded into Redshift by simply adding the GZIP keyword to the COPY command.

copy schema.dest_table_name
    from 's3://bucket_name/filename.csv.gz'
    access_key_id 'your_access_key_id'
    secret_access_key 'your_secret_access_key'
    delimiter '\t'
    CSV quote '^'
    gzip
    

Loading multiple files into Redshift from an AWS S3 prefix/folder

Multiple files can be loaded into Redshift in parallel by specifying an S3 folder/prefix path; COPY loads every object whose key begins with that prefix.

copy schema.dest_table_name
    from 's3://bucket_name/folder_name/'
    access_key_id 'your_access_key_id'
    secret_access_key 'your_secret_access_key'
    delimiter '\t'
    CSV quote '^'
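
If only specific objects under a prefix should be loaded, a manifest file can be used instead of a prefix path. The sketch below is a minimal example, with the bucket, table and manifest name (files.manifest) as placeholders; the manifest itself is a small JSON file listing the S3 URLs to load, e.g. {"entries": [{"url": "s3://bucket_name/folder_name/file1.csv", "mandatory": true}]}.

copy schema.dest_table_name
    from 's3://bucket_name/folder_name/files.manifest'
    access_key_id 'your_access_key_id'
    secret_access_key 'your_secret_access_key'
    delimiter '\t'
    CSV quote '^'
    manifest  -- tells COPY to read the list of files from the manifest instead of treating the path as a prefix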

Loading CSV to Redshift with additional options

Various options can be added to the COPY command as required.

copy schema.dest_table_name
    from 's3://bucket_name/folder_name/'
    access_key_id 'your_access_key_id'
    secret_access_key 'your_secret_access_key'
    delimiter '\t'
    CSV quote '^'
    IGNOREHEADER 1
    maxerror as 10
    timeformat 'YYYY-MM-DD HH24:MI:SS'

IGNOREHEADER - the number of header rows to skip at the start of each file (1 skips a single header row).

MAXERROR - the maximum number of rejected rows allowed before the load fails.

TIMEFORMAT - specifies the timestamp format used in the file.
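
When rows are rejected within the MAXERROR limit, the details of the failed rows can be inspected in the STL_LOAD_ERRORS system table. A minimal query to look at the most recent load errors:

select starttime, filename, line_number, colname, err_reason
    from stl_load_errors
    order by starttime desc
    limit 10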

Loading PARQUET file to Redshift from AWS S3

Parquet is an open-source columnar data format widely used in the big data ecosystem. When loading Parquet files, the delimiter and quote options do not apply; COPY maps the Parquet columns to the target table columns by position.

copy schema.dest_table_name
    from 's3://bucket_name/folder_name/'
    access_key_id 'your_access_key_id'
    secret_access_key 'your_secret_access_key'
    FORMAT AS PARQUET

Using IAM roles to authenticate the AWS S3 to Redshift data load
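
Instead of embedding access keys in every COPY command, an IAM role that is attached to the Redshift cluster and allowed to read the bucket can be passed with the IAM_ROLE parameter.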

copy schema.dest_table_name
    from 's3://bucket_name/folder_name/'
    iam_role 'arn:aws:iam::86365107279:role/ReadS3Redshift'
    FORMAT AS PARQUET
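
The role must have a trust relationship allowing redshift.amazonaws.com to assume it, and it has to be associated with the cluster before COPY can use it.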