Databricks Integration with Snowflake / Blogs / Perficient

What is Databricks?

Databricks is a unified, cloud-based data platform powered by Apache Spark. It specializes in collaboration and analytics for big data. Databricks provides a data science workspace with Collaborative Notebooks, Machine Learning Runtime, and Managed MLflow.

Collaborative Notebooks support multiple data analytics languages, such as SQL, Scala, R, and Python. Data analysts will find it much easier and faster to work with their teammates and share insights, thanks to built-in visualization and automatic versioning.
Machine Learning Runtime (MLR) takes the burden of managing libraries and keeping module versions up to date off data scientists; instead, they can connect to the most popular machine learning frameworks (TensorFlow, Keras, XGBoost, scikit-learn, etc.) with one click. MLR can also speed up model tuning with its built-in AutoML functionality, which performs hyperparameter tuning and model search using Hyperopt and MLflow.

What are the Benefits?

Databricks’ ability to process and transform a massive amount of data makes it an industry-leading solution for data scientists and analysts. Some of its key benefits include:

Getting Started: Data practitioners can work in Databricks with programming languages they already use, namely Python, R, and SQL. This shortens the time spent getting familiar with a language and eases the learning curve for newcomers. When launched, Databricks presents notebooks in a format similar to Jupyter Notebook, which is widely used around the world.

Collaboration: Besides what’s mentioned above, Databricks encourages multiple team members to work on the same project through interactive workspaces. All members can work in the same workspace without worrying about version control.

Production: After training and testing, data engineers can quickly deploy the model in Databricks. Deployment for big data is prone to be messy and complex. But Databricks can give your team an edge.

Why Snowflake?

Ameex is a proud partner with Snowflake, and we are excited to deliver cloud data warehouses to our clients.

In this blog, we will walk through how to connect Databricks to Snowflake so that you can begin your data journey with best-in-class machine learning capabilities while your data is stored securely in a reliable cloud warehouse.

To begin with, connecting Databricks to Snowflake will require the following:

- An up-to-date Databricks account, with secrets set up
- A Snowflake account, with the following information available:
  - URL for your Snowflake account
  - Login name and password for the user who connects to the account
  - Default database and schema to use for the session after establishing the connection
  - Default virtual warehouse to use for the session after establishing the connection

The connection process can be summarized as:

1. Enable token-based authentication for the Databricks workspace
2. Install the Databricks CLI
3. Create a Databricks scope
4. Create Databricks secrets within the scope
5. Use the secrets to connect Databricks to Snowflake

Step 1: Enable token-based authentication for your workspace

Click on your User icon at the top right corner in your Databricks account and navigate to Admin Console

Once in the Admin Console, select Access Control
Find the Personal Access Tokens, and click Enable
Confirm

After a few minutes, Personal Access Tokens will be available.

Click on your User icon at the top right corner in your Databricks account and navigate to User Settings

Select Access Tokens
Click the Generate New Token button
Optionally, enter a description for the new token and specify its expiration period
Click the Generate button, then copy the generated token and store it for the next step
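
One way to confirm that the new token works is to send it as a bearer token to the Databricks REST API. The sketch below only builds such a request and does not send it. The host and token values are hypothetical placeholders, and using the `/api/2.0/secrets/scopes/list` endpoint as the check is a choice made for illustration, not something this walkthrough prescribes.

```python
import urllib.request


def build_list_scopes_request(host: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated request that lists secret scopes."""
    url = host.rstrip("/") + "/api/2.0/secrets/scopes/list"
    return urllib.request.Request(
        url,
        # Personal access tokens are passed as a bearer token.
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Hypothetical values; substitute your workspace URL and the token you generated.
request = build_list_scopes_request(
    "https://example.cloud.databricks.com", "dapi-example-token"
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` against a real workspace should return a JSON document if the token is valid; an authentication error suggests the token was copied incorrectly or has expired.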

Step 2: Install Databricks CLI

The Databricks command-line interface (CLI) provides a convenient way to work with the platform from a terminal. To install it, you need one of the following:

Python 2.7.9 and above or
Python 3.6 and above

1. Install

Run pip install databricks-cli using the appropriate version of pip for your Python installation. If you are using Python 3, run pip3 install databricks-cli.

2. Set up

Run databricks configure --token
At the prompts below, enter your workspace host and the token from the previous step

Databricks Host (should begin with https://):
Token:

3. Access credential

Your access credentials should be stored in the file ~/.databrickscfg, in the following form:

[DEFAULT]
host = https://<databricks-instance>
token = <personal-access-token>
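
To check that the CLI stored what you expect, the file can be read back with Python's standard `configparser`. This is a minimal sketch that assumes the file uses the INI layout with the credentials under a `[DEFAULT]` section; the function name `load_databricks_profile` is made up for this example.

```python
import configparser


def load_databricks_profile(path, profile="DEFAULT"):
    """Return the host and token stored under one profile of a .databrickscfg file."""
    parser = configparser.ConfigParser()
    # parser.read returns the list of files it managed to open.
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read {path}")
    section = parser[profile]
    host = section.get("host")
    token = section.get("token")
    if not host or not token:
        raise ValueError(f"Profile {profile!r} is missing a host or token")
    if not host.startswith("https://"):
        raise ValueError("The host should begin with https://")
    return {"host": host, "token": token}
```

For example, `load_databricks_profile(os.path.expanduser("~/.databrickscfg"))` (after `import os`) should return a dictionary with the two values entered in the previous step.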

Step 3: Create Databricks scope

Create a scope

Scope names are case insensitive.

databricks secrets create-scope --scope <scope-name>

By default, a scope is created with MANAGE permission for the user who created it. If your account is not on the Premium plan, you must override that default and explicitly grant the MANAGE permission to “users” (all users) when creating the scope:

databricks secrets create-scope --scope <scope-name> --initial-manage-principal users

To double-check that the scope was created successfully:

databricks secrets list-scopes
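
The same operation is also exposed through the Secrets REST API (`POST /api/2.0/secrets/scopes/create`). As a rough sketch, the JSON body for that call could be assembled as follows; the helper name and the `premium` flag are assumptions introduced for this example, and only the `scope` and `initial_manage_principal` fields come from the API.

```python
import json


def create_scope_payload(scope_name: str, premium: bool = True) -> str:
    """Build the JSON body for creating a secret scope."""
    body = {"scope": scope_name}
    if not premium:
        # Without the Premium plan, MANAGE has to be granted to all users.
        body["initial_manage_principal"] = "users"
    return json.dumps(body)


# "my-scope" is a placeholder scope name.
print(create_scope_payload("my-scope", premium=False))
```

The flag mirrors the two CLI variants above: leave `initial_manage_principal` out on a Premium account, and set it to `users` otherwise.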

Step 4: Create Databricks secrets within the scope

Create a secret

Secret names are case insensitive.

databricks secrets put --scope <scope-name> --key <key-name>

Confirm that the secret was created:

databricks secrets list --scope <scope-name>
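
The same check can be made from inside a notebook with `dbutils.secrets.list`, which returns metadata entries that each expose a `key` attribute. The sketch below reports which expected keys are still missing from a scope. The key names `snowflake-user` and `snowflake-password` are hypothetical, and `dbutils` is passed in as a parameter so the helper can also be exercised outside Databricks.

```python
def missing_secret_keys(dbutils, scope, required_keys):
    """Return the required keys that are not yet present in the given scope."""
    # Compare in lower case because secret names are case insensitive.
    existing = {meta.key.lower() for meta in dbutils.secrets.list(scope)}
    return [key for key in required_keys if key.lower() not in existing]


# In a Databricks notebook, where dbutils is predefined, a call might look like:
# missing_secret_keys(dbutils, "<scope-name>", ["snowflake-user", "snowflake-password"])
```

An empty list means every secret needed for the connection is in place.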

Step 5: Connect Databricks to Snowflake

For the last step, you can refer to the following documents: Python, Scala.
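
For orientation, a Python connection from a notebook might look roughly like the sketch below. It assumes the Snowflake login name and password were stored in Step 4 under the hypothetical keys `snowflake-user` and `snowflake-password`. The helper names are made up for this example, while the `sf*` and `dbtable` entries are option names of the Snowflake Spark connector.

```python
def build_snowflake_options(dbutils, scope, url, database, schema, warehouse):
    """Assemble connector options, pulling the credentials from a secret scope."""
    return {
        "sfUrl": url,
        # Hypothetical key names; use whatever keys you created in Step 4.
        "sfUser": dbutils.secrets.get(scope=scope, key="snowflake-user"),
        "sfPassword": dbutils.secrets.get(scope=scope, key="snowflake-password"),
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }


def read_snowflake_table(spark, options, table):
    """Load one Snowflake table as a Spark DataFrame."""
    return (
        spark.read.format("snowflake")  # short name available on Databricks
        .options(**options)
        .option("dbtable", table)
        .load()
    )
```

In a Databricks notebook, `spark` and `dbutils` are predefined, so the two helpers can be chained: build the options with `build_snowflake_options(dbutils, "<scope-name>", ...)` using the account URL, database, schema, and warehouse listed in the prerequisites, then pass the result to `read_snowflake_table(spark, options, "<table-name>")`.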

Databricks Integration with Snowflake

by Adarsh Srivastava on May 15th, 2020