Redshift COPY – S3, EMR, DynamoDB
The Redshift COPY command can load data from:
- Data files located on Amazon S3, on an Amazon EMR cluster, or on a remote host accessed over an SSH connection
- An Amazon DynamoDB table
To use the COPY command, the user must have the INSERT privilege on the target Amazon Redshift table.
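As a minimal sketch, assuming a hypothetical 'dept' table (matching the examples below) and a hypothetical load user named 'etl_user', the table and privilege could be set up as follows:

-- hypothetical table definition matching the 'dept' examples in this post
create table dept (
  deptno   integer,
  deptname varchar(50),
  location varchar(50)
);

-- grant the INSERT privilege required to run COPY into the table
grant insert on table dept to etl_user;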
Copy from Amazon S3 bucket
To load data from files located in one or more S3 buckets, use the FROM clause to specify the S3 path. The IAM role associated with the Redshift cluster must have READ access to the S3 paths referenced in the COPY command.
Copy from a single S3 file. The following example copies the data located at an S3 path into the 'dept' table:
copy dept from 's3://yourbucket/dept' iam_role 'arn:aws:iam::0999999789012:role/YourRedshiftRole';
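The COPY options depend on the source file format. As a sketch, assuming the S3 object is a CSV file with a header row and the bucket is in us-east-1 (both assumptions, with a hypothetical file name), the command could look like this:

copy dept
from 's3://yourbucket/dept/dept_data.csv'  -- hypothetical CSV file
iam_role 'arn:aws:iam::0999999789012:role/YourRedshiftRole'
csv
ignoreheader 1       -- assumes the file has a single header row
region 'us-east-1';  -- assumes the bucket is in us-east-1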
Copy from multiple S3 files. The following example copies data from the multiple S3 files listed in a manifest file into the 'dept' table.
A manifest is a text file in JSON format that lists the URL of each file to be loaded from Amazon S3.
copy dept from 's3://yourbucket/s3files.manifest' iam_role 'arn:aws:iam::0999999789012:role/YourRedshiftRole' manifest;
The following is a sample manifest file:
{
  "entries": [
    {"url":"s3://yourbucket/deptdata.1","mandatory":true},
    {"url":"s3://yourbucket/deptdata.2","mandatory":true},
    {"url":"s3://yourbucket/deptdata.3","mandatory":false}
  ]
}
Copy from Parquet and ORC file formats
To load data in Apache Parquet or Apache ORC format from one or more S3 buckets, use the FROM clause to specify the S3 path and FORMAT AS to specify the file format. The IAM role associated with the Redshift cluster must have READ access to the S3 paths referenced in the COPY command.
The following example copies Parquet-format data from an S3 path into the 'dept' table:
copy dept from 's3://yourbucket/dept' iam_role 'arn:aws:iam::0999999789012:role/YourRedshiftRole' FORMAT AS PARQUET;
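ORC data is loaded the same way; only the format clause changes. A sketch, assuming the ORC files are stored under a hypothetical 's3://yourbucket/dept_orc' prefix:

copy dept from 's3://yourbucket/dept_orc'
iam_role 'arn:aws:iam::0999999789012:role/YourRedshiftRole'
FORMAT AS ORC;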
For more information, refer to https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html
Copy from Amazon EMR
The COPY command can load data in parallel from an Amazon EMR cluster that is configured to write files to the cluster's Hadoop Distributed File System (HDFS) as fixed-width files, character-delimited files, CSV files, JSON-formatted files, or Avro files.
The Redshift cluster must also be configured for connectivity to the EMR cluster. For the configuration steps, refer to https://docs.aws.amazon.com/redshift/latest/dg/loading-data-from-emr.html
In the COPY command, specify the Amazon EMR cluster ID along with the HDFS file path and file name, as in the following example:
copy dept
from 'emr://youremrclusterid/hdfsfolder/hdfspart*'
iam_role 'arn:aws:iam::0999996789012:role/yourRedshiftRole';
Copy from Amazon DynamoDB
To load data from a DynamoDB table, use the FROM clause to specify the DynamoDB table name. The IAM role associated with the Redshift cluster must have READ access to the DynamoDB tables referenced in the COPY command.
When the COPY command reads data from an Amazon DynamoDB table, the resulting data transfer counts against that table's provisioned throughput. To avoid consuming an excessive amount of provisioned read throughput, set the READRATIO option to a lower value.
The following example loads data from a DynamoDB table named 'dept', using at most 50 percent of its provisioned read throughput:
copy dept from 'dynamodb://dept' iam_role 'arn:aws:iam::0199999789012:role/yourRedshiftRole' readratio 50;
Conclusion
For data files, the best approach is to load the data from files in Amazon S3. The other approaches, such as loading from an EMR cluster or from remote hosts, involve managing SSH keys, which is less secure and adds significant administrative overhead.