Introduction to Databricks: Concepts and Architecture
Databricks is a cloud-based collaborative data science, data engineering, and data analytics platform that combines the best of data warehouses and data lakes into a lakehouse architecture.
Databricks has three key components:
- Databricks Data Science & Engineering – Databricks environment for collaboration among data scientists, data engineers, and data analysts.
- Databricks Machine Learning – An integrated end-to-end machine learning environment incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving.
- Databricks SQL – A simple experience for SQL users who want to run quick ad-hoc queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.
What is a Databricks Workspace?
A workspace is an environment for accessing all of your Databricks assets. A workspace organizes objects (notebooks, libraries, dashboards, and experiments) into folders and provides access to data objects and computational resources.
Your organization can choose to have multiple workspaces or just one, depending on your needs. The following objects are contained in a Databricks workspace.
Notebook – A web-based interface to documents that contain runnable commands, visualizations, and narrative text.
Dashboard – An interface that provides organized access to visualizations.
Library – A package of code available to the notebook or job running on your cluster. Databricks runtimes include many libraries and you can add your own.
Repo – A folder whose contents are co-versioned together by syncing them to a remote Git repository.
Experiment – A collection of MLflow runs for training a machine learning model.
What is a Databricks Cluster?
A Databricks cluster is a set of computation resources and configurations on which you run notebooks and jobs. There are two types of clusters: all-purpose and job clusters.
- You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
- The Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
- Pool – A set of idle, ready-to-use instances that reduce cluster start and auto-scaling times.
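As a sketch, an all-purpose cluster can be created through the Clusters REST API (`POST /api/2.0/clusters/create`). The payload below is a hypothetical configuration: the field names follow the Clusters API, but the concrete values (cluster name, node type, runtime version, worker counts) are illustrative assumptions you would replace with your own.

```python
import json

# Hypothetical cluster spec for POST /api/2.0/clusters/create.
# Field names follow the Databricks Clusters API; the concrete values
# (node type, Spark version, worker counts) are illustrative assumptions.
cluster_spec = {
    "cluster_name": "shared-analysis",      # all-purpose cluster shared by a team
    "spark_version": "11.3.x-scala2.12",    # a Databricks Runtime version string
    "node_type_id": "i3.xlarge",            # AWS instance type for the nodes
    "autoscale": {                          # let Databricks scale workers up and down
        "min_workers": 2,
        "max_workers": 8,
    },
    "autotermination_minutes": 60,          # terminate after an hour of inactivity
}

payload = json.dumps(cluster_spec)
# The payload would be sent with an authenticated request, for example:
#   requests.post(f"{workspace_url}/api/2.0/clusters/create",
#                 headers={"Authorization": f"Bearer {token}"},
#                 data=payload)
print(payload)
```

Because the cluster autoscales between 2 and 8 workers and auto-terminates when idle, multiple users can share it for interactive analysis without paying for idle capacity.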
What is Databricks Runtime?
The set of core components that run on the clusters managed by Databricks. Databricks offers several types of runtimes:
- Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
- Databricks Runtime for Machine Learning is built on Databricks Runtime and provides a ready-to-go environment for machine learning and data science. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
- Databricks Runtime for Genomics is a version of Databricks Runtime optimized for working with genomic and biomedical data.
- Databricks Light is the Databricks packaging of the open source Apache Spark runtime. It provides a runtime option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits provided by Databricks Runtime.
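The runtime a cluster uses is selected through the `spark_version` field of its configuration. The snippet below is a sketch of how the runtime variants differ in their version strings; the exact strings vary by release, so treat these as assumed examples rather than currently supported versions.

```python
# Illustrative spark_version strings (assumed examples; check your workspace
# for the runtime versions actually available):
runtime_examples = {
    "Databricks Runtime": "11.3.x-scala2.12",              # standard runtime
    "Runtime for ML": "11.3.x-cpu-ml-scala2.12",           # adds TensorFlow, PyTorch, etc.
    "Databricks Light": "apache-spark-2.4.x-scala2.11",    # open source Spark packaging
}

def is_ml_runtime(version: str) -> bool:
    """Heuristic: ML runtime version strings carry an '-ml-' marker."""
    return "-ml-" in version

print(is_ml_runtime(runtime_examples["Runtime for ML"]))   # True
print(is_ml_runtime(runtime_examples["Databricks Runtime"]))  # False
```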
What is Databricks SQL?
Databricks SQL is geared toward data analysts who work primarily with SQL queries and BI tools. It provides an intuitive environment for running ad-hoc queries and creating dashboards on data stored in your data lake.
SQL endpoint – A computation resource on which you execute SQL queries.
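An ad-hoc query can be sent to a SQL endpoint from Python using the `databricks-sql-connector` package. This is a sketch only: the hostname, HTTP path, access token, and the `sales` table are placeholders, and the connection call is shown in comments rather than executed.

```python
# Sketch of querying a SQL endpoint with the databricks-sql-connector package.
# The table name and connection details below are hypothetical placeholders.
query = """
    SELECT order_date, SUM(amount) AS daily_total
    FROM sales                 -- hypothetical table in the data lake
    GROUP BY order_date
    ORDER BY order_date
"""

# from databricks import sql
# with sql.connect(server_hostname="<workspace-host>",
#                  http_path="<endpoint-http-path>",
#                  access_token="<personal-access-token>") as conn:
#     with conn.cursor() as cur:
#         cur.execute(query)
#         rows = cur.fetchall()

print(query.strip().splitlines()[0])
```

The same query could back a visualization or dashboard in the Databricks SQL UI; the connector simply exposes the endpoint to external Python code.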
Databricks operates out of a control plane and a data plane.
Databricks Control Plane
The control plane includes the backend services that Databricks manages in its own AWS account. Notebook commands and many other workspace configurations are stored in the control plane and encrypted at rest.
Databricks Data Plane
The data plane is where your data is processed.
- For most Databricks computation, the compute resources are in your AWS account in what is called the Classic data plane. This is the type of data plane Databricks uses for notebooks, jobs, and for Classic Databricks SQL endpoints.
- If you enable Serverless compute for Databricks SQL, the compute resources for Databricks SQL are in a shared Serverless data plane.