Loading Data from AWS S3 to AWS Redshift
AWS S3 is a vital component of the AWS (Amazon Web Services) ecosystem. Files of any type can be stored in S3 and retrieved or loaded into most products in the AWS ecosystem seamlessly. The optimal way to load data into AWS Redshift is via AWS S3: very large volumes of data can be loaded into Redshift within a few minutes if the data is already stored in S3. This post shows how to load data into AWS Redshift from AWS S3.
Loading data into Redshift using the COPY command
The command below copies the specified CSV file from S3 into the destination table. The delimiter specified here is a tab ('\t'); set the delimiter to match your file. You can also specify the quote character used in the file. If these optional parameters are not passed, the default values are used.
copy schema.dest_table_name
from 's3://bucket_name/filename.csv'
access_key_id 'your_access_key_id'
secret_access_key 'your_secret_access_key'
delimiter '\t'
CSV quote '^';
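Note that COPY loads into an existing table; it does not create the destination table. A minimal sketch of what schema.dest_table_name might look like for a three-column file (the column names and types here are hypothetical and must match your data):
create table schema.dest_table_name (
    id integer,
    name varchar(256),
    created_at timestamp
);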
Loading compressed files into Redshift from AWS S3
Storing compressed files in S3 reduces both storage cost and data-transfer volume. GZIP-compressed CSV files can be loaded into Redshift simply by adding the GZIP parameter to the COPY command.
copy schema.dest_table_name
from 's3://bucket_name/filename.csv.gz'
access_key_id 'your_access_key_id'
secret_access_key 'your_secret_access_key'
delimiter '\t'
CSV quote '^'
gzip;
Loading multiple files into Redshift from an AWS S3 prefix/folder
Multiple files can be loaded into Redshift in parallel by specifying an S3 folder/prefix path. The COPY command loads every object whose key begins with the specified prefix.
copy schema.dest_table_name
from 's3://bucket_name/folder_name/'
access_key_id 'your_access_key_id'
secret_access_key 'your_secret_access_key'
delimiter '\t'
CSV quote '^';
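If you need to load an explicit list of files rather than everything under a prefix, COPY also accepts a manifest: a JSON file stored in S3 that lists the objects to load. A sketch, where the manifest path and file names are placeholders:
{"entries": [
  {"url": "s3://bucket_name/folder_name/part-001.csv", "mandatory": true},
  {"url": "s3://bucket_name/folder_name/part-002.csv", "mandatory": true}
]}
The COPY command then points at the manifest file and adds the MANIFEST parameter:
copy schema.dest_table_name
from 's3://bucket_name/manifest_file.manifest'
access_key_id 'your_access_key_id'
secret_access_key 'your_secret_access_key'
delimiter '\t'
CSV quote '^'
manifest;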
Loading a CSV into Redshift with additional options
Various options can be added to the COPY command as required.
copy schema.dest_table_name
from 's3://bucket_name/folder_name/'
access_key_id 'your_access_key_id'
secret_access_key 'your_secret_access_key'
delimiter '\t'
CSV quote '^'
IGNOREHEADER 1
maxerror as 10
timeformat 'YYYY-MM-DD HH24:MI:SS';
IGNOREHEADER - number of header rows to skip at the start of each file (here, the first row).
MAXERROR - maximum number of rejected rows allowed before the load fails.
TIMEFORMAT - the format of the timestamp values in the CSV file.
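When rows are rejected under MAXERROR, the reason for each rejection is recorded in Redshift's STL_LOAD_ERRORS system table. A quick way to inspect recent load errors (a sketch; adjust the columns, filter, and limit as needed):
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;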
Loading a Parquet file into Redshift from AWS S3
Parquet is an open-source columnar data format widely used in the big data ecosystem. Redshift can load Parquet files directly by specifying FORMAT AS PARQUET in the COPY command.
copy schema.dest_table_name
from 's3://bucket_name/folder_name/'
access_key_id 'your_access_key_id'
secret_access_key 'your_secret_access_key'
FORMAT AS PARQUET;
Using an IAM role to authenticate the AWS S3 to Redshift data load
Instead of embedding access keys in the COPY command, you can attach an IAM role that has read access to the S3 bucket to the Redshift cluster and reference the role's ARN in the command.
copy schema.dest_table_name
from 's3://bucket_name/folder_name/'
iam_role 'arn:aws:iam::86365107279:role/ReadS3Redshift'
FORMAT AS PARQUET;
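After any COPY completes, a simple sanity check is to count the rows that landed in the destination table:
select count(*) from schema.dest_table_name;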