Redshift Copy

Redshift COPY – S3, EMR, DynamoDB

The Redshift COPY command loads data from:

  • Data files located on Amazon S3, an Amazon EMR cluster, or a remote host accessed over an SSH connection
  • An Amazon DynamoDB table

To use the COPY command, the user must have INSERT privilege on the target Amazon Redshift table.
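
For example, the privilege can be granted as follows (a minimal sketch, assuming a hypothetical database user named 'loader' and the 'dept' table used in the examples below):

-- allow 'loader' to run COPY into the dept table ('loader' is illustrative)
grant insert on table dept to loader;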

Copy from Amazon S3 bucket

To load data from files located in one or more S3 buckets, use the FROM clause to indicate the S3 path. The IAM role associated with the Redshift cluster must have READ access to the S3 paths referenced in the COPY command.

Copy from a single S3 file. The following command copies data from an S3 path into the ‘dept’ table:

copy dept
from 's3://yourbucket/dept'
iam_role 'arn:aws:iam::0999999789012:role/YourRedshiftRole';
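
COPY also accepts format options for delimited text files. The following is an illustrative sketch only (not required for the basic load above), assuming a pipe-delimited file with a single header row:

copy dept
from 's3://yourbucket/dept'
iam_role 'arn:aws:iam::0999999789012:role/YourRedshiftRole'
delimiter '|'      -- field delimiter in the source file
ignoreheader 1;    -- skip the header row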

Copy from multiple S3 files. The following command copies data from the S3 files listed in a manifest file into the ‘dept’ table.

A manifest is a text file in JSON format that lists the URL of each file to be loaded from Amazon S3. The MANIFEST option tells COPY to interpret the S3 object as a manifest rather than as a data file.

copy dept
from 's3://yourbucket/s3files.manifest'
iam_role 'arn:aws:iam::0999999789012:role/YourRedshiftRole'
manifest;

Following is a sample manifest file:
{
   "entries": [
     {"url":"s3://yourbucket/deptdata.1","mandatory":true},
     {"url":"s3://yourbucket/deptdata.2","mandatory":true},
     {"url":"s3://yourbucket/deptdata.1","mandatory":false}
   ]
 }

Copy from Parquet and ORC file formats

To load data in Apache Parquet or Apache ORC format from one or more S3 buckets, use the FROM clause to indicate the S3 path and specify the format with FORMAT AS PARQUET or FORMAT AS ORC. As with other file formats, the IAM role associated with the Redshift cluster must have READ access to the S3 paths referenced in the COPY command.

The following command copies Parquet-format data from an S3 path into the ‘dept’ table:

copy dept
from 's3://yourbucket/dept'
iam_role 'arn:aws:iam::0999999789012:role/YourRedshiftRole'
FORMAT AS PARQUET;
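
Loading ORC data is analogous; the following sketch assumes the same bucket and role:

copy dept
from 's3://yourbucket/dept'
iam_role 'arn:aws:iam::0999999789012:role/YourRedshiftRole'
FORMAT AS ORC;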

For more information, refer to https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html

Copy from Amazon EMR

The COPY command can load data in parallel from an Amazon EMR cluster that is configured to write text files to the cluster’s Hadoop Distributed File System (HDFS) as fixed-width files, character-delimited files, CSV files, JSON-formatted files, or Avro files.

The Redshift cluster must also be configured for connectivity to the EMR cluster. For the configuration steps, refer to https://docs.aws.amazon.com/redshift/latest/dg/loading-data-from-emr.html

In the COPY command, specify the Amazon EMR cluster ID along with the HDFS path and file name, as in the following example:

copy dept
from 'emr://youremrclusterid/hdfsfolder/hdfspart*'
iam_role 'arn:aws:iam::0999996789012:role/yourRedshiftRole';

Copy from Amazon DynamoDB

To load data from a DynamoDB table, use the FROM clause to indicate the DynamoDB table name. The IAM role associated with the Redshift cluster must have READ access to the DynamoDB tables referenced in the COPY command.

When the COPY command reads data from an Amazon DynamoDB table, the resulting data transfer counts against that table’s provisioned throughput. To avoid consuming an excessive amount of provisioned read throughput, set the READRATIO parameter to a lower value; it specifies the percentage of the table’s provisioned throughput that the load is allowed to use.

The following example loads data from a DynamoDB table named ‘dept’, using at most 50 percent of its provisioned read throughput:

copy dept from 'dynamodb://dept'
 iam_role 'arn:aws:iam::0199999789012:role/yourRedshiftRole'
 readratio 50;

Conclusion

For data files, the best approach is to load from Amazon S3. The alternatives, such as an EMR cluster or a remote host, involve managing SSH keys, which is less secure and adds significant administrative overhead.