AWS Glue Databricks

Post author: admin
Post published: November 21, 2021
Post category: Amazon Glue / Databricks

Use AWS Glue for Databricks

By default, a Databricks deployment uses a central Hive metastore as its data catalog. You can instead use an external Hive metastore instance or the AWS Glue Data Catalog.

This post walks through the steps to integrate the AWS Glue Data Catalog as the metastore for Databricks Runtime.

Step 1 : In AWS Console

Create an IAM role with an instance profile for access to the Glue Data Catalog, and attach the following policy to the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantCatalogAccessToGlue",
      "Effect": "Allow",
      "Action": [
        "glue:BatchCreatePartition",
        "glue:BatchDeletePartition",
        "glue:BatchGetPartition",
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:CreateUserDefinedFunction",
        "glue:DeleteDatabase",
        "glue:DeletePartition",
        "glue:DeleteTable",
        "glue:DeleteUserDefinedFunction",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:UpdateDatabase",
        "glue:UpdatePartition",
        "glue:UpdateTable",
        "glue:UpdateUserDefinedFunction"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
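The role behind an instance profile also needs a trust policy that lets EC2 assume it, and a policy this long is easy to mistype. As a minimal sketch, the Python below builds that trust policy and checks a policy document for missing Glue actions. The function names and the REQUIRED_READ_ACTIONS subset are illustrative assumptions, not part of any AWS SDK, and the check does not expand wildcards such as glue:*.

```python
import json

# Illustrative read-only subset of the Glue actions listed in the policy above.
REQUIRED_READ_ACTIONS = {
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
}


def ec2_trust_policy() -> dict:
    """Trust policy that lets EC2 instances assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def allowed_actions(policy: dict) -> set:
    """Collect the actions named in Allow statements (no wildcard expansion)."""
    actions = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        # "Action" may be a single string or a list of strings.
        actions.update([action] if isinstance(action, str) else action)
    return actions


def missing_actions(policy: dict, required) -> set:
    """Return the required actions that the policy does not explicitly allow."""
    return set(required) - allowed_actions(policy)


if __name__ == "__main__":
    print(json.dumps(ec2_trust_policy(), indent=2))
```

Loading the policy above with json.loads and passing it to missing_actions should return an empty set if every required action is present.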

Step 2 : In AWS Console

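Edit the policy attached to the IAM role that your Databricks deployment uses for EC2 access, so that it includes the iam:PassRole statement shown at the end of the policy below. This allows Databricks to pass the instance profile you created in Step 1 to the EC2 instances for the Spark clusters.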

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1403287045000",
      "Effect": "Allow",
      "Action": [
        "ec2:AssociateDhcpOptions",
        "ec2:AssociateIamInstanceProfile",
        "ec2:AssociateRouteTable",
        "ec2:AttachInternetGateway",
        "ec2:AttachVolume",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateDhcpOptions",
        "ec2:CreateInternetGateway",
        "ec2:CreatePlacementGroup",
        "ec2:CreateRoute",
        "ec2:CreateSecurityGroup",
        "ec2:CreateSubnet",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateVpc",
        "ec2:CreateVpcPeeringConnection",
        "ec2:DeleteInternetGateway",
        "ec2:DeletePlacementGroup",
        "ec2:DeleteRoute",
        "ec2:DeleteRouteTable",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteSubnet",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DeleteVpc",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeIamInstanceProfileAssociations",
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:DescribePlacementGroups",
        "ec2:DescribePrefixLists",
        "ec2:DescribeReservedInstancesOfferings",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSpotInstanceRequests",
        "ec2:DescribeSpotPriceHistory",
        "ec2:DescribeSubnets",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpcs",
        "ec2:DetachInternetGateway",
        "ec2:DisassociateIamInstanceProfile",
        "ec2:ModifyVpcAttribute",
        "ec2:ReplaceIamInstanceProfileAssociation",
        "ec2:RequestSpotInstances",
        "ec2:RevokeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-glue-access>"
    }
  ]
}

Replace the ARN in the iam:PassRole statement above with the ARN of the IAM role you created for Glue access in Step 1.
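To avoid hand-editing the ARN, the PassRole statement can be generated. The following is a sketch only: pass_role_statement and with_pass_role are hypothetical helper names, and the account ID and role name in the usage note are placeholders.

```python
import copy
import re


def pass_role_statement(account_id: str, role_name: str) -> dict:
    """Build the iam:PassRole statement for the Glue access role."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    if not role_name:
        raise ValueError("role_name must not be empty")
    return {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
    }


def with_pass_role(policy: dict, statement: dict) -> dict:
    """Return a copy of the policy with the statement appended once."""
    updated = copy.deepcopy(policy)
    statements = updated.setdefault("Statement", [])
    if statement not in statements:
        statements.append(statement)
    return updated
```

For example, with_pass_role(existing_policy, pass_role_statement("123456789012", "glue-access-role")) leaves existing_policy untouched and returns a new document ready to paste into the IAM console.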

Step 3 : In Databricks Workspace

Add the Glue Catalog instance profile to a Databricks workspace

Go to the admin console.
Click the Instance Profiles tab.
Click the Add Instance Profile button. A dialog displays.
Paste in the Instance Profile ARN from Step 1.
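The same registration can be scripted instead of clicked through. The sketch below assumes the workspace exposes the Instance Profiles REST API at /api/2.0/instance-profiles/add and that you hold a personal access token. The workspace URL, token, and function name are placeholders. The function only builds the request, so nothing is sent until you pass the result to urllib.request.urlopen.

```python
import json
import urllib.request


def build_add_instance_profile_request(
    workspace_url: str, token: str, instance_profile_arn: str
) -> urllib.request.Request:
    """Build (but do not send) the request that registers an instance profile."""
    if ":instance-profile/" not in instance_profile_arn:
        raise ValueError("expected an instance profile ARN, not a role ARN")
    body = json.dumps({"instance_profile_arn": instance_profile_arn}).encode("utf-8")
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/instance-profiles/add",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Note that the API expects the instance profile ARN (arn:aws:iam::<account-id>:instance-profile/<name>), not the role ARN used in the PassRole statement.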

Step 4 : In Databricks Workspace

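Create a cluster.
Click the Instances tab on the cluster creation page.
In the Instance Profiles drop-down list, select the instance profile.
In the Spark config field, set spark.databricks.hive.metastore.glueCatalog.enabled true so that the cluster uses Glue as its metastore.
Verify that you can access the Glue Catalog by running the following command in a notebook: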

show databases;

If the command succeeds and lists the databases defined in your Glue Data Catalog, this Databricks Runtime cluster is configured to use Glue.
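If you create clusters through the Clusters API rather than the UI, the same settings go into the cluster specification. The sketch below builds such a specification as a plain dictionary. It assumes the standard spark_conf and aws_attributes.instance_profile_arn fields, together with the Spark configuration key spark.databricks.hive.metastore.glueCatalog.enabled that turns on Glue metastore support. The function name is hypothetical and the values in the usage note are placeholders.

```python
def glue_cluster_spec(
    name: str,
    spark_version: str,
    node_type_id: str,
    num_workers: int,
    instance_profile_arn: str,
) -> dict:
    """Cluster specification that attaches the instance profile and enables Glue."""
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "aws_attributes": {"instance_profile_arn": instance_profile_arn},
        "spark_conf": {
            # Use the AWS Glue Data Catalog as the metastore for this cluster.
            "spark.databricks.hive.metastore.glueCatalog.enabled": "true",
        },
    }
```

A call such as glue_cluster_spec("glue-demo", "<spark-version>", "<node-type>", 2, "<instance-profile-arn>") returns a dictionary that can be serialized with json.dumps and submitted as the body of a cluster creation request.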

References

https://databricks.com/blog/2019/09/03/transform-your-aws-data-lake-using-databricks-delta-and-aws-glue-data-catalog-service.html

https://docs.databricks.com/data/metastores/aws-glue-metastore.html
