AWS Glue Databricks

Use AWS Glue for Databricks

Databricks Deployment has a central Hive Metastore by Default for Data Catalog. There will be option to utilize External Hive Metastore Instance or AWS Glue Catalog.

This post will walkthrough the steps to Integrate AWS Glue Catalog as the Metastore for Databricks Runtime

Step 1 : In AWS Console

Create an instance profile to access a Glue Data Catalog. Attach the below Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantCatalogAccessToGlue",
      "Effect": "Allow",
      "Action": [
        "glue:BatchCreatePartition",
        "glue:BatchDeletePartition",
        "glue:BatchGetPartition",
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:CreateUserDefinedFunction",
        "glue:DeleteDatabase",
        "glue:DeletePartition",
        "glue:DeleteTable",
        "glue:DeleteUserDefinedFunction",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:UpdateDatabase",
        "glue:UpdatePartition",
        "glue:UpdateTable",
        "glue:UpdateUserDefinedFunction"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}

Step 2 : In AWS Console

Modify the Instance profile created for Databricks EC2 access . This is to allow Databricks to pass the instance profile you created in Step 1 to the EC2 instances for the Spark clusters.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1403287045000",
      "Effect": "Allow",
      "Action": [
        "ec2:AssociateDhcpOptions",
        "ec2:AssociateIamInstanceProfile",
        "ec2:AssociateRouteTable",
        "ec2:AttachInternetGateway",
        "ec2:AttachVolume",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateDhcpOptions",
        "ec2:CreateInternetGateway",
        "ec2:CreatePlacementGroup",
        "ec2:CreateRoute",
        "ec2:CreateSecurityGroup",
        "ec2:CreateSubnet",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateVpc",
        "ec2:CreateVpcPeeringConnection",
        "ec2:DeleteInternetGateway",
        "ec2:DeletePlacementGroup",
        "ec2:DeleteRoute",
        "ec2:DeleteRouteTable",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteSubnet",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DeleteVpc",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeIamInstanceProfileAssociations",
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:DescribePlacementGroups",
        "ec2:DescribePrefixLists",
        "ec2:DescribeReservedInstancesOfferings",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSpotInstanceRequests",
        "ec2:DescribeSpotPriceHistory",
        "ec2:DescribeSubnets",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpcs",
        "ec2:DetachInternetGateway",
        "ec2:DisassociateIamInstanceProfile",
        "ec2:ModifyVpcAttribute",
        "ec2:ReplaceIamInstanceProfileAssociation",
        "ec2:RequestSpotInstances",
        "ec2:RevokeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-glue-access>"
    }
  ]
}

Replace the arn in the above “PassRole” with your Databricks IAM role.

Step 3 : In Databricks Workspace

Add the Glue Catalog instance profile to a Databricks workspace

  1. Go to the admin console.
  2. Click the Instance Profiles tab.
  3. Click the Add Instance Profile button. A dialog displays.
  4. Paste in the Instance Profile ARN from Step 1.Instance profile ARN

Step 4: In Databricks Workspace

  1. Create a cluster.
  2. Click the Instances tab on the cluster creation page.
  3. In the Instance Profiles drop-down list, select the instance profile.Select instance profile
  4. Verify that you can access the Glue Catalog, using the following command in a notebook
show databases;

If the command succeeds, this Databricks Runtime cluster is configured to use Glue.

References

https://databricks.com/blog/2019/09/03/transform-your-aws-data-lake-using-databricks-delta-and-aws-glue-data-catalog-service.html

https://docs.databricks.com/data/metastores/aws-glue-metastore.html