I don’t like testing Databricks notebooks and that’s a problem. I like Databricks. I like Test Driven Development. Not in an evangelical, 100%-code-coverage-or-fail kind of way. I just find that a reasonable amount of code coverage gives me a reasonable amount of confidence. Databricks has documentation for unit testing. I tried to test one notebook using another notebook. Then I realized I also like CI/CD. I’ve been writing code for decades and there are just some best practices that I consider foundational. Separating functions and tests from notebooks seemed like familiar ground. Ultimately, I spent more time than I had expected getting a working process going. I worked through some issues (part one). I got worked by some issues (part two). By the time I was done I had forgotten half of it, so I decided to pull it together in a blog post that I know at least I’ll read.
Project
This was an unusually straightforward project. Projects like this are great candidates for getting the details right, since there weren’t the typical client-specific issues that never come up again. In fact, the client only had access to Databricks; the IT department controlled the underlying cloud. They just needed to pull JSON data from some APIs for analytics. I created a notebook with widgets that provided a few basic parameters like start and end date and some search terms. There are Python scripts that I call from the notebook that do the heavy lifting like calling the API, validating the request and response and persisting the data to Delta tables. There are also some Python utilities like custom exceptions and a logging mechanism that persists information to another Delta table. The idea is that the notebooks are client-facing and the Python files are the targets of the tests.
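To make the split concrete, here is a minimal sketch of the kind of function that lives in a Python file rather than in the notebook. The names `InvalidRequestError` and `validate_request` are hypothetical, not the project's actual code; the point is only that the notebook reads the widget values and hands plain strings to something a unit test can call directly.

```python
from datetime import date


class InvalidRequestError(ValueError):
    """Hypothetical custom exception for a bad set of notebook parameters."""


def validate_request(start_date: str, end_date: str, search_terms: str) -> dict:
    """Validate raw widget values and return them in a cleaned-up form."""
    try:
        start = date.fromisoformat(start_date)
        end = date.fromisoformat(end_date)
    except ValueError as exc:
        raise InvalidRequestError(f"dates must be ISO formatted: {exc}") from exc
    if start > end:
        raise InvalidRequestError("start_date must not be after end_date")
    # Widgets hand back one string, so split the comma-separated terms here.
    terms = [term.strip() for term in search_terms.split(",") if term.strip()]
    if not terms:
        raise InvalidRequestError("at least one search term is required")
    return {"start": start, "end": end, "terms": terms}
```

Nothing in this function touches Spark or `dbutils`, so it can be tested on a laptop without a cluster.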
Tools
I’m going to be specific in describing how I set things up. It’s not because it’s the best way; I’m just trying to be very clear. I’m using a Mac and I use the terminal a lot. I am using Databricks on Azure, specifically Databricks Runtime 14.3 LTS. I made a conscious decision to not do anything Azure-specific because I want to focus on Databricks. We’re working locally with Databricks, so download the Databricks CLI. You’ll need to request a service principal from your account or workspace admin. I’ll stick to Python, so make sure you have Python 3.x available on your machine. You will want to make sure you can pip install some packages, like pyspark, databricks_cli, coverage, etc. I’m using Bitbucket Cloud for my git-compliant source-code control and my CI/CD pipeline. I’m using Visual Studio Code as my IDE. My goal is to be able to work with VS Code full-time when I’m working with Databricks. (Spoiler alert: I don’t quite get there.)
VS Code Challenges
You should bring up the Databricks documentation for developer tools while you are configuring your own environment. They are very good. Specifically, start with the instructions for VS Code. You’ll install the Databricks extension and Databricks Connect. Following those instructions will get you far and I don’t see any need to repeat what has been explained so well. They don’t mention a couple of key pain points, though.
You need to use the Premium rather than the Standard edition to get all the features I talk about. At the very least, you can’t use a personal access token as your authentication mechanism. The issue is discussed a little bit here: but I don’t see it mentioned much otherwise. I am using a corporate instance with Unity Catalog, so it was not an issue. However, just trying the free version likely will not work end-to-end.
The Databricks UI for Python does not support linting. We had already started coding before I decided this would be a good idea. The code was stored in a repo in Databricks and that repo was connected to Bitbucket. Databricks has instructions for this, but the instructions seemed confusing and it was easier just to do it. It worked as expected. What wasn’t expected was all the linting issues I saw when I pulled into VS Code. My life got a lot easier when I updated the settings.json in the workspace .vscode to include "files.trimTrailingWhitespace": true. The Python scripts were pretty easy to deal with since they were all basically linting issues that I just didn’t consider because the IDE didn’t tell me. The notebooks were a different story.
Notebook Challenges
The first thing you notice when you externalize functions from your notebook into another file is that Databricks really didn’t think you were going to do that. When you open a notebook, it assumes that you are going to be running code in that notebook. The import operation only occurs once per session. So if you are importing from a python file, find a problem and change the python file, you still see the problem. Reloading the notebook is a hassle. This is a nice little code snippet to have around; just remove it before it goes to production.
import importlib
import your_module  # replace 'your_module' with the actual name of your module

# Reload the module
importlib.reload(your_module)
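If you want to see the caching behavior for yourself outside Databricks, the following is a small local sketch. It writes a throwaway module named `demo_helpers` (a made-up name) to a temporary directory, imports it, edits the file, and shows that a second `import` changes nothing while `importlib.reload` picks up the edit.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_helpers.py"
    module_path.write_text("GREETING = 'hello'\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()  # make sure the new file is visible to import

    import demo_helpers
    first = demo_helpers.GREETING

    # Edit the file on disk, as you would after spotting a bug.
    module_path.write_text("GREETING = 'hello from the edited file'\n")

    import demo_helpers  # no effect: the module is already in sys.modules
    cached = demo_helpers.GREETING

    importlib.reload(demo_helpers)  # re-executes the module from the file
    reloaded = demo_helpers.GREETING

    sys.path.remove(tmp)

print(first, "|", cached, "|", reloaded)
```

Note that names pulled in with `from your_module import some_function` keep pointing at the old object after a reload, so the `from` import has to be run again as well.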
I’ve been telling data scientists to give Databricks a try because they are already familiar with Jupyter notebooks, so I was surprised to see that the notebooks just have a .py extension and a line at the top that says # Databricks notebook source. It also took me a while to notice that under each cell there are tiny little text-only buttons that say Run Cell, Run Above and Debug Cell. And yes, you can debug a notebook from VS Code when you are connected to a running cluster using the Databricks extension and you see Databricks Connect enabled at the bottom of VS Code. I debugged some code on the server right away and it was honestly really cool. I felt good for almost five minutes.
Most of the notebook issues were just linting warnings. They show up in the Problems view, but they are not always problems; mostly blue circles that say Line too long in my markdown cells. I will never fix those. But there were Undefined variable 'dbutils' errors all over the place and these needed attention. As it turns out, all of my widget code uses dbutils.widgets, which works just fine in Databricks but not locally. There didn’t seem to be a lot of easy-to-implement solutions, but I eventually decided to just include an import with a comment: from databricks.sdk.runtime import dbutils # use when working in VS Code. Now I have a couple of things I need to remember to remove before it goes to production. I could probably write a pre-commit hook, but I’d rather not. (I probably will have to at some point, though.)
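For what it’s worth, the check such a hook would run could be as small as the sketch below. This is an assumption about what a hook might look like, not something I have wired into the project: `LOCAL_ONLY_MARKERS` and `find_local_only_lines` are hypothetical names, and the markers are simply the two snippets mentioned in this post.

```python
# Snippets that should only exist while working locally (assumed list).
LOCAL_ONLY_MARKERS = (
    "from databricks.sdk.runtime import dbutils",
    "importlib.reload(",
)


def find_local_only_lines(source: str) -> list[tuple[int, str]]:
    """Return (line number, text) for each line that contains a local-only marker."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # already commented out, nothing to flag
        if any(marker in stripped for marker in LOCAL_ONLY_MARKERS):
            hits.append((number, stripped))
    return hits
```

A hook script would read each staged notebook file, call this function on its contents, and exit with a non-zero status if any hits come back.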
At this point, I’ve got an IDE that helps me write better, or at least more pythonic, code. The Bitbucket integration works with no issues. I can spin up my cluster from my local machine. I can debug code from my local instance to the server. I had to add one line that’s only used on my local machine and I have a throw-away cell I use when I’m working with a notebook in the Databricks UI. I am not going to address linting issues in markdown cells because down that road lies madness. And I’m debugging on the server locally. Now I just need to start running some tests. No big deal; I just click on the Test Explorer and ….
Testing Challenges
Testing was the point of this entire exercise. I thought that once I got my Python code from Databricks into VS Code via a git repo I would just start testing. Running Python tests in VS Code is not hard. In hindsight, it isn’t hard. Step one is not unique to Databricks; just give the Test Explorer a Python interpreter to run against, tell it where to find the tests in the directory structure and tell it the pattern of the test file naming convention. Honestly, it takes longer to read that sentence than to execute the process. But no tests showed up and it kept asking for configuration details. Fine, I’ll just do it from the command line. Just run ./.venv/bin/python -m unittest discover -v -s /my_project/test -p '*_test.py' from the terminal and we’re good. Or not. Whatever. I’ll just edit .vscode/settings.json directly. But it already looked like this:
"python.testing.unittestArgs": [
    "-v",
    "-s", "./my_project/test",
    "-p", "*_test.py"
],
"python.testing.pytestEnabled": false,
"python.testing.unittestEnabled": true
Restart VS Code. Nothing. Check the path and permissions. Nothing. Look again for a missing comma or something. Nothing. Ask myself why isn’t this working like other python projects?
Because I started coding in Databricks.
I quietly add some empty __init__.py files like I would have if it were just a normal Python project. The tests appear. I try to run the tests. Relative import error. Add empty __init__.py files to all the directories. Run tests. All green. All good. The end …………
Or is it?
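For anyone who wants to reproduce the working layout away from a real project, here is a self-contained sketch. It builds a throwaway package in a temporary directory (the names `demo_project`, `dates.py` and `dates_test.py` are made up for illustration), puts an empty __init__.py in every directory, and runs the same kind of unittest discovery with the *_test.py pattern. The test file uses a relative import, which only resolves because both directories are packages.

```python
import tempfile
import textwrap
import unittest
from pathlib import Path


def build_demo_project(root: Path) -> Path:
    """Create a tiny package with a test subpackage; return the test directory."""
    package = root / "demo_project"
    test_dir = package / "test"
    test_dir.mkdir(parents=True)
    # Empty __init__.py files turn both directories into importable packages.
    (package / "__init__.py").write_text("")
    (test_dir / "__init__.py").write_text("")
    (package / "dates.py").write_text(textwrap.dedent('''
        def is_valid_range(start, end):
            return start <= end
    '''))
    (test_dir / "dates_test.py").write_text(textwrap.dedent('''
        import unittest
        from ..dates import is_valid_range

        class DateRangeTest(unittest.TestCase):
            def test_start_before_end(self):
                self.assertTrue(is_valid_range("2024-01-01", "2024-01-31"))
    '''))
    return test_dir


def run_discovery(root: Path, test_dir: Path) -> unittest.TestResult:
    """Discover *_test.py files under test_dir, importing relative to root."""
    loader = unittest.TestLoader()
    suite = loader.discover(str(test_dir), pattern="*_test.py", top_level_dir=str(root))
    result = unittest.TestResult()
    suite.run(result)
    return result


with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp)
    result = run_discovery(project_root, build_demo_project(project_root))

print("tests run:", result.testsRun, "successful:", result.wasSuccessful())
```

Deleting either __init__.py line from `build_demo_project` is a quick way to watch discovery or the relative import fall over.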
Conclusion
Tune in to the next post where we discuss why you need to capture Py4JJavaError if you want to understand why a piece of code that works on your local machine might fail when you run it on the server.