I’ve been writing about Test-Driven Development in Databricks and some of the interesting issues that you can run into with Python objects. It’s always been my opinion that code that is not testable is detestable. Admittedly, it’s been very difficult getting to where I wanted to be with Databricks and TDD. Unfortunately, it’s hard to […]
David Callaghan – Solutions Architect
As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating modern data products such as HBase, Cassandra and Neo4J as the off-blockchain repository.
Blogs from this Author
Understanding the role of Py4J in Databricks
I mentioned that my attempt to implement TDD with Databricks was not totally successful. Setting up the local environment was not a problem, and getting a service ID for the CI/CD component was more of an administrative problem than a technical one. Using mocks to test Python objects that are serialized to Spark is the real issue. […]
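The core of that serialization problem fits in a few lines. This is a minimal sketch, not code from the post: Spark pickles whatever a UDF or closure references before shipping it to the workers, and unittest.mock objects refuse to be pickled.

```python
# Minimal sketch of the serialization problem: Spark pickles everything a
# closure references before shipping it to the workers, and mock objects
# from unittest.mock are not picklable, so the test dies at that step.
import pickle
from unittest.mock import MagicMock

mock_service = MagicMock()
mock_service.lookup.return_value = "mocked"

try:
    pickle.dumps(mock_service)  # the same step Spark performs on a closure
except Exception as e:          # the exact exception varies by Python version
    print(f"mock is not serializable: {type(e).__name__}")
```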
Test Driven Development with Databricks
I don’t like testing Databricks notebooks and that’s a problem. I like Databricks. I like Test Driven Development. Not in an evangelical, 100%-code-coverage-or-fail kind of way. I just find that a reasonable amount of code coverage gives me a reasonable amount of confidence. Databricks has documentation for unit testing. I tried […]
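For context, this is the kind of test the post is aiming at: a plain pytest unit test for a PySpark transformation that runs against a local SparkSession, no cluster required. The function and schema here are illustrative, not from the post.

```python
# test_transforms.py -- a local unit test for a PySpark transformation,
# runnable with plain pytest (function and columns are illustrative).
import pytest
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def add_full_name(df):
    # Function under test: derive full_name from the first and last columns.
    return df.withColumn("full_name", F.concat_ws(" ", "first", "last"))

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_full_name(spark):
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first", "last"])
    assert add_full_name(df).first().full_name == "Ada Lovelace"
```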
LinkedIn open sources a control plane for lake houses
LinkedIn open sources a lot of code. Kafka, of course, but also Samza and Voldemort and a bunch of Hadoop tools like DataFu and Gobblin. Open-source projects tend to be created by developers to solve engineering problems while commercial products … Anyway, LinkedIn has a new open-source data offering called OpenHouse, which is billed as […]
Databricks Lakehouse Federation Public Preview
Sometimes, it’s nice to be able to skip a step. Most data projects involve data movement before data access. Usually this is not an issue; everyone agrees that the data must be moved before it can be accessed. There are use cases where the data movement part is a blocker because of time, cost, […]
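To make the step-skipping concrete, here is a hedged sketch of the federation DDL (connection, catalog, host and table names are all placeholders): register an external Postgres database as a foreign catalog and query it in place, with no ingestion pipeline.

```python
# Hedged Lakehouse Federation sketch; host, credentials and names are
# placeholders. The external database becomes a queryable foreign catalog.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS pg_conn TYPE postgresql
    OPTIONS (host 'pg.example.com', port '5432', user 'reader', password '<redacted>')
""")
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS pg_sales USING CONNECTION pg_conn
    OPTIONS (database 'sales')
""")
spark.sql("SELECT * FROM pg_sales.public.orders LIMIT 10").show()
```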
Data Lake Governance with Tagging in Databricks Unity Catalog
The goal of Databricks Unity Catalog is to provide centralized security and management for data and AI assets across the data lakehouse. Unity Catalog provides fine-grained access control for all the securable objects in the lakehouse: databases, tables, files and even models. Gone are the limitations of the Hive metastore. The Unity Catalog metastore […]
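As a small taste of tag-based governance (the catalog, schema and column names below are made up): tag a column as PII, then find every tagged column through the information_schema so access policies and audits can key off the tag.

```python
# Illustrative Unity Catalog tagging; main.hr.employees and the ssn column
# are assumptions, not objects from the post.
spark.sql("""
    ALTER TABLE main.hr.employees
    ALTER COLUMN ssn SET TAGS ('pii' = 'true', 'classification' = 'restricted')
""")
spark.sql("""
    SELECT table_name, column_name, tag_name, tag_value
    FROM main.information_schema.column_tags
    WHERE tag_name = 'pii'
""").show()
```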
Feature Engineering with Databricks and Unity Catalog
Feature Engineering is the preprocessing step used to make raw data usable as input to an ML model through transformation, aggregation, enrichment, joining, normalization and other processes. Sometimes feature engineering is used against the output of another model rather than the raw data (transfer learning). At a high level, feature engineering has a lot in […]
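For a flavor of what those processes look like in PySpark (the source table and columns are hypothetical): aggregate raw transactions into per-customer features, then z-score normalize the spend feature.

```python
# Illustrative feature-engineering step; main.sales.transactions and its
# columns are hypothetical. Aggregation followed by normalization.
import pyspark.sql.functions as F

raw = spark.table("main.sales.transactions")

features = (
    raw.groupBy("customer_id")
       .agg(F.count("*").alias("txn_count"),
            F.sum("amount").alias("total_spend"))
)

stats = features.agg(F.mean("total_spend").alias("mu"),
                     F.stddev("total_spend").alias("sigma")).first()

features = features.withColumn(
    "total_spend_z", (F.col("total_spend") - stats.mu) / stats.sigma
)
```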
Simulating Synchronous Operations with Asynchronous Code in Distributed Systems
Ensuring real-time status updates for end users in web applications can be challenging, particularly when working with Databricks, which lacks native support for synchronous updates. This means that changes made in Databricks may not be immediately reflected to end users, impacting the real-time nature of status updates. In this technical blog post, we will explore […]
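One common way to put a synchronous facade over those asynchronous updates is a polling loop against the Jobs 2.1 REST API; a sketch follows, with the host, token and terminal-state handling kept deliberately minimal.

```python
# Polling sketch over the Databricks Jobs 2.1 REST API; host and token are
# placeholders. Blocks the caller until the run reaches a terminal state.
import time
import requests

HOST = "https://example.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <token>"}

def wait_for_run(run_id, poll_seconds=10):
    while True:
        state = requests.get(
            f"{HOST}/api/2.1/jobs/runs/get",
            headers=HEADERS, params={"run_id": run_id},
        ).json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return state
        time.sleep(poll_seconds)
```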
Elastic Cloud Enterprise for Regulated Corporate Search
Regulated industries, such as financial and healthcare companies, often need to make hard choices when it comes to balancing innovation and compliance. Most technology companies are focused on cloud-first, if not entirely cloud-native, offerings, particularly in the search and data space. I was recently working with a large financial services company that wanted to consolidate […]
Integrating SAP Datasphere and Databricks Lakehouse for Unified Analytics
Integrating SAP and Databricks has typically required a lot of glue. Set up the SAP Data Hub environment, connect to the SAP data, set up a pipeline with Pipeline Modeler, configure the Streaming Analytics Service, set up Kafka or MQTT and receive the streaming data in Databricks with Spark Streaming. Most of these intermediate steps required […]
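The receiving end of that old pipeline, the only piece Databricks sees, looks roughly like this in Structured Streaming (broker, topic and target table are placeholders):

```python
# Sketch of the Databricks side of the legacy pipeline: read the SAP change
# feed from Kafka and land it in a Delta table. Names are placeholders.
stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "sap-orders")
         .load()
)

(stream.selectExpr("CAST(value AS STRING) AS payload")
       .writeStream.format("delta")
       .option("checkpointLocation", "/tmp/checkpoints/sap-orders")
       .toTable("main.sap.orders_raw"))
```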
Real-time Data Processing: Databricks vs Flink
Real-time data processing is a critical need for modern-day businesses. It involves processing data as soon as it is generated to derive insights and take immediate actions. Databricks Streaming and Apache Flink are two popular stream processing frameworks that enable developers to build real-time data pipelines, applications and services at scale. In this article, we […]
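As a self-contained illustration of the kind of stateful work both engines are built around (this uses Spark's built-in rate source, so nothing external is needed): count events per ten-second window and stream the results to the console.

```python
# Windowed streaming aggregation on the built-in rate source -- the bread
# and butter of both Databricks Structured Streaming and Flink.
import pyspark.sql.functions as F

events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = events.groupBy(F.window("timestamp", "10 seconds")).count()

query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
```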
Accelerate and Scale your Event Driven Architecture with GridGain
Are you looking for a way to accelerate and scale your Event Driven Architecture in the cloud? GridGain is here to help. GridGain, built on top of Apache Ignite, is a comprehensive in-memory computing platform that provides distributed caching, messaging, and compute capabilities, with enterprise-grade support. With its performance capabilities, it can increase the overall […]
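For a first taste of the caching side, here is a minimal sketch using the pyignite thin client, which works against GridGain as well as plain Apache Ignite (assumes a local node on the default thin-client port; host, port and cache name are assumptions):

```python
# Minimal distributed-cache sketch with the pyignite thin client
# (pip install pyignite). Host, port and cache name are assumptions.
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)  # default Ignite thin-client port

cache = client.get_or_create_cache("order_events")
cache.put("order-42", "SHIPPED")
print(cache.get("order-42"))        # -> SHIPPED

client.close()
```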