What is Databricks?
Databricks is a unified, cloud-based data platform powered by Apache Spark. It specializes in collaboration and analytics for big data. Databricks provides a data science workspace with Collaborative Notebooks, Machine Learning Runtime, and Managed MLflow.
- Collaborative Notebooks support multiple data analytics languages, such as SQL, Scala, R, and Python. Data analysts can save time by working with their teammates in the same notebook and sharing insights through built-in visualizations and automatic versioning.
- Machine Learning Runtime (MLR) takes away the burden of managing the necessary libraries and keeping module versions up to date; instead, data scientists can connect to the most popular machine learning frameworks (TensorFlow, Keras, XGBoost, scikit-learn, etc.) with one click. MLR can also speed up model tuning with its built-in AutoML capabilities, which handle hyperparameter tuning and model search using Hyperopt and MLflow.
What are the Benefits?
Databricks’ ability to process and transform a massive amount of data makes it an industry-leading solution for data scientists and analysts. Some of its key benefits include:
Getting Started: Databricks supports the programming languages that data practitioners already use, namely Python, R, and SQL. This shortens the time spent getting familiar with the language and eases the learning curve for newcomers. When launched, users see a notebook in a format similar to the widely used Jupyter notebook.
Collaboration: Databricks also encourages multiple team members to work on the same project through interactive workspaces. All members can work in the same workspace without worrying about version control.
Production: After training and testing, data engineers can quickly deploy the model in Databricks. Deploying big data workloads tends to be messy and complex, and Databricks can give your team an edge here.
Ameex is a proud partner with Snowflake, and we are excited to deliver cloud data warehouses to our clients.
In this blog, we will walk through the steps to connect Databricks to Snowflake so that you can begin your data journey with best-in-class machine learning capabilities while your data is stored securely in a reliable cloud data warehouse.
To begin with, connecting Databricks to Snowflake will require the following:
- An up-to-date Databricks account, with secrets set up
- A Snowflake account, with the critical information below available:
- URL for your Snowflake account
- Login name and password for the user who connects to the account
- Default database and schema to use for the session after establishing the connection
- Default virtual warehouse to use for the session after establishing the connection
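As a rough sketch of how these prerequisites fit together, the snippet below collects them into a Python dictionary keyed by the option names the Snowflake Spark connector expects (sfUrl, sfUser, and so on). The helper name build_snowflake_options and all of the values are placeholders invented for illustration; in practice the login name and password would come from Databricks secrets rather than being typed into a notebook.

```python
def build_snowflake_options(url, user, password, database, schema, warehouse):
    """Map the prerequisites listed above to Snowflake Spark connector option names."""
    options = {
        "sfUrl": url,              # URL for your Snowflake account
        "sfUser": user,            # login name of the connecting user
        "sfPassword": password,    # password of the connecting user
        "sfDatabase": database,    # default database for the session
        "sfSchema": schema,        # default schema for the session
        "sfWarehouse": warehouse,  # default virtual warehouse for the session
    }
    # Fail early if any prerequisite was left empty.
    missing = [name for name, value in options.items() if not value]
    if missing:
        raise ValueError("Missing Snowflake settings: " + ", ".join(missing))
    return options


# Hypothetical placeholder values; replace them with your own account details.
example_options = build_snowflake_options(
    url="example-account.snowflakecomputing.com",
    user="example_user",
    password="example_password",
    database="EXAMPLE_DB",
    schema="PUBLIC",
    warehouse="EXAMPLE_WH",
)
```

Checking for empty values up front makes a missing prerequisite visible before any connection attempt is made.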
The connection process can be summarized as:
- Enable token-based authentication for Databricks workspace
- Install Databricks CLI
- Create Databricks Scope
- Create Databricks Secrets within the Scope
- Use the Secrets to connect Databricks to Snowflake
Step 1: Enable token-based authentication for your workspace
- Click on your User icon at the top right corner in your Databricks account and navigate to Admin Console
- Once in the Admin Console, select Access Control
- Find Personal Access Tokens and click Enable
- After a few minutes, personal access tokens will be available
- Click on your User icon at the top right corner in your Databricks account and navigate to User Settings
- Select Access Tokens
- Click the Generate New Token button
- Enter an optional description for the new token and specify the expiration period
- Click the Generate button, then copy the generated token and store it for the next step
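To illustrate what the token is for, here is a minimal sketch of how a personal access token authenticates a call to the Databricks REST API: it is sent as a Bearer token in the Authorization header. The host and token values are made-up placeholders, and the helper name is hypothetical. The request is only constructed here, not sent.

```python
import urllib.request


def build_list_scopes_request(host, token):
    """Build (but do not send) a request that lists secret scopes via the REST API."""
    url = host.rstrip("/") + "/api/2.0/secrets/scopes/list"
    # Personal access tokens are passed as a Bearer token.
    return urllib.request.Request(url, headers={"Authorization": "Bearer " + token})


# Placeholder host and token, for illustration only.
request = build_list_scopes_request(
    "https://example.cloud.databricks.com", "dapi-example-token"
)

# To send it against a real workspace with a real token:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

The Databricks CLI installed in the next step handles this header for you, so this is mainly useful for understanding why the token must be kept private.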
Step 2: Install Databricks CLI
The Databricks command-line interface (CLI) provides an easy-to-use interface to the Databricks platform. To install it, you will need one of the following:
- Python 2.7.9 and above or
- Python 3.6 and above
1. Install
Run pip install databricks-cli using the appropriate version of pip for your Python installation. If you are using Python 3, run pip3 install databricks-cli.
2. Set up
Run databricks configure --token
At the prompts below, enter your host and the token from the previous step:
Databricks Host (should begin with https://):
Token:
3. Access credential
Your access credentials should now be stored in the file ~/.databrickscfg:
[DEFAULT]
host = https://<databricks-instance>
token = <personal-access-token>
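If you want to confirm that the file was written as expected, a short Python check along these lines can help. It is only a sketch: it assumes the credentials sit under the [DEFAULT] profile section that the CLI writes, and the helper name read_databricks_profile is invented for this example. The sample text stands in for the real file so the snippet runs anywhere.

```python
import configparser


def read_databricks_profile(config_text, profile="DEFAULT"):
    """Return (host, token) from the text of a .databrickscfg file."""
    # Disable interpolation so a '%' in a token is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    section = parser[profile]
    return section["host"], section["token"]


# Stand-in for the real file contents, using the placeholders shown above.
sample_config = """[DEFAULT]
host = https://<databricks-instance>
token = <personal-access-token>
"""

host, token = read_databricks_profile(sample_config)

# To check your real file instead:
# from pathlib import Path
# host, token = read_databricks_profile((Path.home() / ".databrickscfg").read_text())
```

A host that does not begin with https:// is a common reason for the CLI failing to connect, so it is worth checking that value first.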
Step 3: Create Databricks scope
- Create a scope
Scope names are case insensitive
databricks secrets create-scope --scope <scope-name>
- By default, scopes are created with MANAGE permission for the user who created them. If your account is not on the Premium plan, you must override that default and grant the MANAGE permission to “users” (all users) when creating the scope:
databricks secrets create-scope --scope <scope-name> --initial-manage-principal users
- To double-check that the scope was created successfully:
databricks secrets list-scopes
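If you prefer to script this step, the commands above can be assembled in Python and handed to subprocess. The sketch below only builds and prints the command; the function names and the scope name snowflake-demo are hypothetical, and actually running the command assumes the Databricks CLI is installed and configured as in Step 2.

```python
import subprocess


def create_scope_command(scope_name, grant_all_users=False):
    """Build the CLI arguments for creating a secret scope."""
    command = ["databricks", "secrets", "create-scope", "--scope", scope_name]
    if grant_all_users:
        # Needed on non-Premium accounts: grant MANAGE to all users.
        command += ["--initial-manage-principal", "users"]
    return command


def run_cli(command):
    """Run a CLI command and return its standard output (requires the CLI)."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


# Show the command that would be executed for a hypothetical scope name.
print(" ".join(create_scope_command("snowflake-demo", grant_all_users=True)))

# With the CLI installed and configured, you could then run:
# run_cli(create_scope_command("snowflake-demo", grant_all_users=True))
# print(run_cli(["databricks", "secrets", "list-scopes"]))
```

Passing the arguments as a list rather than one string avoids shell quoting problems if a scope name contains unusual characters.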
Step 4: Create Databricks secrets within the scope
- Create a secret
Secret names are case insensitive
databricks secrets put --scope <scope-name> --key <key-name>
- Confirm that the secret was created
databricks secrets list --scope <scope-name>
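Once the secrets exist, a notebook can read them with dbutils.secrets.get(scope=..., key=...). The sketch below shows one way to wrap that call, under a few assumptions: the key names snowflake-user and snowflake-password are invented for illustration, and the lookup function is passed in as an argument so the example can also run outside Databricks with a stand-in.

```python
def read_snowflake_credentials(get_secret, scope,
                               user_key="snowflake-user",
                               password_key="snowflake-password"):
    """Fetch the Snowflake login name and password from a secret scope."""
    return {
        "user": get_secret(scope=scope, key=user_key),
        "password": get_secret(scope=scope, key=password_key),
    }


# Inside a Databricks notebook, pass the real lookup function:
# credentials = read_snowflake_credentials(dbutils.secrets.get, "<scope-name>")

# Outside Databricks, a stand-in lookup with made-up values shows the flow.
fake_store = {
    ("demo-scope", "snowflake-user"): "demo_user",
    ("demo-scope", "snowflake-password"): "demo_password",
}


def fake_get_secret(scope, key):
    return fake_store[(scope, key)]


credentials = read_snowflake_credentials(fake_get_secret, "demo-scope")
```

Reading the values at run time this way keeps the Snowflake password out of the notebook source itself.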