Redshift Spectrum – Hive Integration

Steps to Integrate Hive Catalog with Redshift Spectrum

If your S3 data lake catalog resides in a Hive metastore on an EMR cluster, follow the steps below to integrate Redshift Spectrum with your Hive data catalog.

  • Establish connectivity between your Redshift cluster and EMR cluster. To achieve this:
    1. Add an outbound rule to the security group associated with your Redshift cluster that opens connectivity to the EMR master node IP address on port 9083 (the port the Hive Metastore listens on).
    2. Add an inbound rule to the security group associated with your EMR master node that allows connectivity from your Redshift cluster on port 9083.
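The two security-group rules above can be sketched in Python against the boto3 EC2 API (`authorize_security_group_egress` / `authorize_security_group_ingress` are the real calls); the group IDs below are placeholders and assumptions for your environment:

```python
REDSHIFT_SG = "sg-0redshift0000000"    # SG attached to the Redshift cluster (placeholder)
EMR_MASTER_SG = "sg-0emrmaster000000"  # SG attached to the EMR master node (placeholder)

def rule(peer_sg):
    # TCP 9083 is the port the Hive Metastore listens on
    return [{"IpProtocol": "tcp", "FromPort": 9083, "ToPort": 9083,
             "UserIdGroupPairs": [{"GroupId": peer_sg}]}]

def open_connectivity(ec2):
    # 1. Outbound rule on the Redshift SG toward the EMR master SG
    ec2.authorize_security_group_egress(GroupId=REDSHIFT_SG,
                                        IpPermissions=rule(EMR_MASTER_SG))
    # 2. Inbound rule on the EMR master SG from the Redshift SG
    ec2.authorize_security_group_ingress(GroupId=EMR_MASTER_SG,
                                         IpPermissions=rule(REDSHIFT_SG))
```

Apply it with an EC2 client you create yourself, e.g. `open_connectivity(boto3.client("ec2"))`; keeping the client as a parameter leaves the snippet free of AWS calls until you run it.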
  • Create a new database in your Hive catalog; you will reference it in the Redshift Spectrum schema definition.
  • For the IAM role associated with your Redshift cluster, add a policy that allows read access to the S3 bucket that will be referenced in your Redshift Spectrum tables. A sample is given below.
{
     "Version": "2012-10-17",
     "Statement": [{
         "Effect": "Allow",
         "Action": ["s3:GetObject", "s3:ListBucket"],
         "Resource": [
             "arn:aws:s3:::mydatalakeBucket",
             "arn:aws:s3:::mydatalakeBucket/*"
         ]
     }]
 }
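As a sketch, a policy like the sample above can be attached to the role as an inline policy via boto3's `put_role_policy` (a real IAM API call); the role and policy names below are assumptions:

```python
import json

# Read-only policy for the data-lake bucket; bucket name from the sample above.
SPECTRUM_S3_READ_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::mydatalakeBucket",
                     "arn:aws:s3:::mydatalakeBucket/*"],
    }],
}

def attach_to_role(iam, role_name="SpectrumhiveRole"):  # role name is an assumption
    # put_role_policy embeds the JSON document as an inline policy on the role
    iam.put_role_policy(RoleName=role_name,
                        PolicyName="SpectrumS3ReadAccess",
                        PolicyDocument=json.dumps(SPECTRUM_S3_READ_POLICY))
```

Call it with an IAM client, e.g. `attach_to_role(boto3.client("iam"))`, after verifying the bucket ARNs match your environment.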
  • If you have enabled “Enhanced VPC Routing”, a security best practice that routes traffic from your Redshift cluster to S3 through your VPC instead of over the internet, you will need to add the following bucket policy to the S3 bucket. Redshift Spectrum behaves differently when Enhanced VPC Routing is enabled: its traffic is routed through the AWS private network. More information can be found here – https://docs.aws.amazon.com/redshift/latest/mgmt/spectrum-enhanced-vpc.html
{
   "Version": "2012-10-17",
   "Statement": [
   {
      "Sid": "BucketPolicyForSpectrum",
      "Effect": "Allow",
      "Principal": {"AWS": ["arn:aws:iam::987654321012:root"]},
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
         "arn:aws:s3:::mydatalakeBucket",
         "arn:aws:s3:::mydatalakeBucket/*"
      ],
      "Condition": {"StringEquals": {"aws:UserAgent": "AWS Redshift/Spectrum"}}
    }
   ]
 }
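A minimal sketch for applying such a bucket policy with boto3's `put_bucket_policy` (a real S3 API call); the bucket name and account ID come from the sample, and the S3 client is passed in so the snippet makes no AWS calls on its own:

```python
import json

SPECTRUM_BUCKET_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "BucketPolicyForSpectrum",
        "Effect": "Allow",
        "Principal": {"AWS": ["arn:aws:iam::987654321012:root"]},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::mydatalakeBucket",
                     "arn:aws:s3:::mydatalakeBucket/*"],
        # Limits access to requests carrying the Redshift Spectrum user agent
        "Condition": {"StringEquals": {"aws:UserAgent": "AWS Redshift/Spectrum"}},
    }],
}

def apply_bucket_policy(s3, bucket="mydatalakeBucket"):
    # put_bucket_policy replaces any existing policy on the bucket
    s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(SPECTRUM_BUCKET_POLICY))
```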
  • Create a Redshift Spectrum schema. A sample is given below; the database name, metastore URI, and IAM role will vary for your environment.
create external schema if not exists spectrumhive_schema
 from hive metastore
 database 'spectrumhive_database'
 uri 'ip-10-0-222-222.us-east-2.compute.internal' port 9083 
 iam_role 'arn:aws:iam::987654321012:role/SpectrumhiveRole';
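After running the DDL, you can confirm the schema was registered by querying the `SVV_EXTERNAL_SCHEMAS` system view (the schema name below matches the sample; adjust for your environment):

```sql
-- Lists external schemas known to the cluster; the new schema should appear
-- along with its Hive Metastore connection details.
select schemaname, databasename, esoptions
from svv_external_schemas
where schemaname = 'spectrumhive_schema';
```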
  • Create a Redshift external table referring to the data lake S3 bucket.
CREATE EXTERNAL TABLE spectrumhive_schema.example_tbl0(
     col0 INT,
     col1 VARCHAR(255)
 ) PARTITIONED BY (part INT) STORED AS TEXTFILE
 LOCATION 's3://mydatalakeBucket/s3/location/';              
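Because the sample table is partitioned, each partition's S3 prefix must be registered before its data is visible to queries. A sketch, assuming an illustrative partition value and prefix layout:

```sql
-- Register one partition (the value and S3 prefix are assumptions)
alter table spectrumhive_schema.example_tbl0
add if not exists partition (part=0)
location 's3://mydatalakeBucket/s3/location/part=0/';

-- Query the external table through Redshift Spectrum
select part, count(*)
from spectrumhive_schema.example_tbl0
group by part;
```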

Possible Issues

  1. If you receive the following metastore error, ensure that the Hive database referenced in your schema creation exists in your Hive metastore:

ERROR: Hive Metastore error : HOST : ip-10-0-222-222.us-east-2.compute.internal PORT: 9083 ERROR: Default TException. 1 statement failed.
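To check for the database, you can list the databases in the metastore from the Hive CLI or beeline on the EMR master node (the database name matches the earlier sample):

```sql
-- Run on the EMR master node; the database named in the external schema DDL
-- should be listed.
SHOW DATABASES LIKE 'spectrumhive_database';

-- Create it if it is missing
CREATE DATABASE IF NOT EXISTS spectrumhive_database;
```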