Some data formats are columnar. This means they store information in columns rather than rows. They are popular because certain types of queries run against them more easily than against row-based formats. Parquet supports parallel query processing, meaning it can split your data into several files so it can be read on multiple processors at […]
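A minimal sketch of that parallelism, assuming a local SparkSession and a hypothetical /tmp/events path: writing with several partitions produces several Parquet files, each of which can be scanned by a separate task.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Hypothetical dataset: a million synthetic event ids.
df = spark.range(1_000_000).withColumnRenamed("id", "event_id")

# Writing with multiple partitions yields multiple Parquet files.
df.repartition(8).write.mode("overwrite").parquet("/tmp/events")

# Reading back: each file can be handled by its own task, and only the
# projected column's chunks are scanned.
events = spark.read.parquet("/tmp/events").select("event_id")
print(events.rdd.getNumPartitions())  # typically > 1: parallel reads
```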
Blogs from this Author
Finding the right balance of nOps
There is a proliferation of acronyms with the Ops suffix for the software architect to choose from. It's reasonable to question whether so many are needed. All of these are, at their core, targeted expressions of foundational business management methodology. The end goal is continuous improvement in some business-critical metric. […]
Adopting a Risk-Based Strategy for Data
Ransomware attacks have been in the news lately, possibly because of the 225% increase in total losses from ransomware in the United States alone in 2020. An increase in sophistication by attackers is a major factor, and many of these ransomware attacks were enabled at least in part by insider negligence. As the level of […]
Deep Dive into Databricks Tempo for Time Series Analytics
Time-series data has typically been fit imperfectly into whatever database we were using at the time for other tasks. Time-series databases (TSDBs) are now coming to market. TSDBs are optimized to store and retrieve associated pairs of times and values. A TSDB's architecture focuses on time-stamped data storage and the compression, summarization, and life-cycle management […]
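A minimal sketch of the kind of time-stamp-centric summarization a TSDB optimizes for, written in plain PySpark rather than Tempo's own API; the sensor readings and column names are made up for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ts-demo").getOrCreate()

# Hypothetical sensor readings: (device, timestamp, value) pairs.
readings = spark.createDataFrame(
    [("sensor-1", "2021-06-01 00:00:05", 20.1),
     ("sensor-1", "2021-06-01 00:00:35", 20.4),
     ("sensor-2", "2021-06-01 00:00:10", 18.9)],
    ["device", "ts", "value"],
).withColumn("ts", F.to_timestamp("ts"))

# Summarize each device into one-minute buckets - the rollup pattern a
# time-series store is built around.
summary = (readings
           .groupBy("device", F.window("ts", "1 minute"))
           .agg(F.avg("value").alias("avg_value")))
summary.show(truncate=False)
```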
Koalas are better than Pandas (on Spark)
I help companies build out, manage, and hopefully get value from large data stores. Or at least, I try. In order to get value from these petabyte-scale datastores, I need the data scientists to be able to easily apply their statistical and domain knowledge. There's one fundamental problem: large datasets are always distributed and data […]
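A minimal sketch of what Koalas buys you, assuming the koalas package is installed and using a hypothetical transactions CSV: the code reads like pandas, but every operation runs distributed on Spark.

```python
import databricks.koalas as ks

# Hypothetical path; in practice this would be a petabyte-scale store.
df = ks.read_csv("/data/transactions.csv")

# Familiar pandas-style groupby, executed as distributed Spark jobs.
summary = (df.groupby("customer_id")["amount"]
             .sum()
             .sort_values(ascending=False))
print(summary.head(10))
```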
DataOps with IBM
DataOps seeks to deliver high-quality data fast in the same way that DevOps delivers high-quality code fast. The names are similar; the goals are similar; the implementation is very different. Code quality can be measured using similar tools across multiple projects. Data quality is a mission-critical, enterprise-wide effort. The effort has consistently proven […]
Trust models in distributed ledgers
Consensus, getting distributed processes to agree on a single value, is a fundamental problem in computer science. Distributed processing is difficult. In fact, there are logical proofs that show pretty conclusively that there won’t be a single perfect algorithm for handling consensus in an asynchronous system made of imperfect nodes. As long as there is […]
Understanding Performance in Blockchain Systems
Blockchain is an example of a distributed ledger system and as such shares the same performance concerns as any other distributed system. In order to measure the performance of a distributed system with an acceptable degree of accuracy, it's best to simplify as many of the variables under our control as possible. The size of the […]
Take advantage of windows in your Spark data science pipeline
Windows perform calculations across a configurable frame of records around the current record in your Spark data science pipeline. They are SQL functions that allow you to access data before and after the current record to perform calculations. They can be broken down into ranking, analytic, and aggregate functions. Spark provides the […]
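A minimal sketch of the idea, assuming a local SparkSession and a made-up sales DataFrame: one ranking function and one aggregate computed over a frame around the current row.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-demo").getOrCreate()

# Hypothetical daily sales per region.
sales = spark.createDataFrame(
    [("east", "2021-01-01", 100.0),
     ("east", "2021-01-02", 150.0),
     ("east", "2021-01-03", 120.0),
     ("west", "2021-01-01", 90.0)],
    ["region", "day", "amount"],
)

w = Window.partitionBy("region").orderBy("day")

result = (sales
          # Ranking function over the ordered window.
          .withColumn("rank", F.rank().over(w))
          # Aggregate over a frame of the previous, current, and next row.
          .withColumn("moving_avg",
                      F.avg("amount").over(w.rowsBetween(-1, 1))))
result.show()
```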
Bringing Informatica Intelligent Cloud Service into your Release Management Pipeline
Informatica Intelligent Cloud Services (IICS) now offers a free command line utility that can be used to integrate your ETL jobs into most enterprise release management pipelines. It’s called the Asset Management command line interface (CLI). Version two now allows you to extract an IICS job into a single compressed file. Moving a single standalone […]
Scale your data science practice formally
Frequently, the "crawl, walk, run, fly" metaphor is used when describing the path to implementing a scalable data science practice. There are a lot of problems with this concept, not the least of which is that there is already motion involved. People are already doing BI work, often complex work enabling high-value results. […]
Tune the dials to optimize your Spark machine learning pipeline
Tuning Spark for your machine learning pipeline can be a complex and time-consuming process. Storage and compute play different roles in your Spark cluster at different stages of your machine learning pipeline. Spark defaults are never the right way to go. It makes more sense to know what settings are most effective at […]
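A minimal sketch of overriding those defaults when building the session; the specific values here are illustrative assumptions for a shuffle-heavy, memory-bound pipeline, not recommendations.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ml-tuning-demo")
         # Shuffle-heavy feature engineering: raise shuffle parallelism
         # above the default of 200.
         .config("spark.sql.shuffle.partitions", "400")
         # Memory-bound model training: give executors more headroom.
         .config("spark.executor.memory", "8g")
         .config("spark.memory.fraction", "0.8")
         .getOrCreate())

# Confirm the override took effect.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```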