David Callaghan – Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain, and cloud, coding in Java, Scala, and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB, and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud, with an emphasis on integrating modern data products such as HBase, Cassandra, and Neo4j as the off-blockchain repository.

Blogs from this Author

Finding the right balance of nOps

There is a proliferation of acronyms with the Ops suffix for the software architect to choose from. It’s reasonable to question whether so many are needed. All of them are, at their core, targeted expressions of foundational business management methodology. The end goal is continuous improvement in some business-critical metric. […]

Adopting a Risk-Based Strategy for Data

Ransomware attacks have been in the news lately, possibly because of the 225% increase in total losses from ransomware in the United States alone in 2020. An increase in sophistication by attackers is a major factor, and many of these ransomware attacks were enabled at least in part by insider negligence. As the level of […]

Deep Dive into Databricks Tempo for Time Series Analytics

Time-series data has typically been fit imperfectly into whatever database we were using at the time for other tasks. Now, purpose-built time-series databases (TSDBs) are coming to market. TSDBs are optimized to store and retrieve associated pairs of times and values. A TSDB’s architecture focuses on time-stamped data storage and its compression, summarization, and life-cycle management […]
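The core idea the teaser above describes, associated pairs of timestamps and values, summarized over fixed time buckets, can be sketched in plain Python. This is an illustrative sketch of what a TSDB does under the hood, not the API of any particular product:

```python
# Illustrative time-series points: (epoch seconds, value) pairs --
# the associated pairs a TSDB is optimized to store and retrieve.
points = [(0, 10.0), (30, 12.0), (70, 11.0), (95, 15.0), (130, 9.0)]

def downsample(points, bucket_seconds):
    """Summarize raw points into fixed-width time buckets (min/max/avg),
    a simple form of the summarization a TSDB performs as part of
    life-cycle management."""
    buckets = {}
    for ts, value in points:
        key = ts - (ts % bucket_seconds)  # align to bucket start
        buckets.setdefault(key, []).append(value)
    return {
        key: {"min": min(vs), "max": max(vs), "avg": sum(vs) / len(vs)}
        for key, vs in sorted(buckets.items())
    }

summary = downsample(points, 60)
# Raw points roll up into one summary row per 60-second bucket.
```

A real TSDB would also compress the raw points and expire them on a retention schedule; the bucketed summary is what survives long term.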

Koalas are better than Pandas (on Spark)

I help companies build out, manage, and hopefully get value from large data stores. Or at least, I try. In order to get value from these petabyte-scale data stores, I need the data scientists to be able to easily apply their statistical and domain knowledge. There’s one fundamental problem: large datasets are always distributed, and data […]

DataOps with IBM

DataOps seeks to deliver high-quality data fast in the same way that DevOps delivers high-quality code fast. The names are similar; the goals are similar; the implementations are very different. Code quality can be measured using similar tools across multiple projects. Data quality is a mission-critical, enterprise-wide effort. The effort has consistently proven […]

Trust models in distributed ledgers

Consensus, getting distributed processes to agree on a single value, is a fundamental problem in computer science. Distributed processing is difficult. In fact, there are logical proofs that show pretty conclusively that there won’t be a single perfect algorithm for handling consensus in an asynchronous system made of imperfect nodes. As long as there is […]

Understanding Performance in Blockchain Systems

Blockchain is an example of a distributed ledger system and as such shares the same performance concerns as any other distributed system. To measure the performance of a distributed system with an acceptable degree of accuracy, it’s best to simplify as many of the variables under our control as possible. The size of the […]

Take advantage of windows in your Spark data science pipeline

Window functions perform calculations across a frame of rows around the current record in your Spark data science pipeline. They are SQL functions that let you access data before and after the current record. They can be broken down into ranking and analytic functions, and, like aggregate functions, Spark provides the […]
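Spark's window functions use standard SQL window syntax, so the frame-around-the-current-record idea can be shown without a Spark cluster. As a minimal sketch, SQLite (3.25+, which also supports the OVER clause) computes a moving average over the current row and its immediate neighbors; the table and column names are illustrative:

```python
import sqlite3

# Requires SQLite 3.25+ for window function support.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day INTEGER, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)])

# A frame around the current record: one row before and one after,
# ordered by day -- the same ROWS BETWEEN syntax Spark SQL accepts.
rows = conn.execute("""
    SELECT day,
           AVG(value) OVER (
               ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS moving_avg
    FROM readings
    ORDER BY day
""").fetchall()
# Each row's moving_avg blends its neighbors; edge rows use a
# smaller frame (e.g. day 1 averages only days 1 and 2).
```

In Spark the equivalent would be expressed with `Window.orderBy(...).rowsBetween(-1, 1)` from `pyspark.sql.window`, or the identical SQL through `spark.sql(...)`.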

Bringing Informatica Intelligent Cloud Service into your Release Management Pipeline

Informatica Intelligent Cloud Services (IICS) now offers a free command line utility that can be used to integrate your ETL jobs into most enterprise release management pipelines. It’s called the Asset Management command line interface (CLI). Version two now allows you to extract an IICS job into a single compressed file. Moving a single standalone […]

Scale your data science practice formally

Frequently, the “crawl, walk, run, fly” metaphor is used to describe the path to implementing a scalable data science practice. There are a lot of problems with this concept, not the least of which is the fact that there is already motion involved. People are already doing BI work, often complex work enabling high-value results. […]

Tune the dials to optimize your Spark machine learning pipeline

Tuning Spark for your machine learning pipeline can be a complex and time-consuming process. Storage and compute play different roles in your Spark cluster at different stages of your machine learning pipeline. Spark defaults are never the right way to go. It makes more sense to know what settings are most effective at […]
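A few of the dials the teaser above refers to live in `spark-defaults.conf` (or `SparkConf` at submit time). The settings below are real Spark configuration keys, but the values are illustrative starting points only, not recommendations; the right values depend on your cluster and on which pipeline stage is running:

```properties
# spark-defaults.conf -- illustrative values, not recommendations.

# Memory and cores per executor: feature engineering is often
# shuffle-heavy, while model training is compute-heavy.
spark.executor.memory        8g
spark.executor.cores         4

# The default of 200 shuffle partitions rarely matches your data volume.
spark.sql.shuffle.partitions 400

# Kryo is usually faster and more compact than Java serialization.
spark.serializer             org.apache.spark.serializer.KryoSerializer
```

The same keys can be set per job with `spark-submit --conf key=value`, which makes it easier to give each pipeline stage its own profile.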

Big Data Bootcamp by the Beach: Getting Started Smart

In the first post in this series, I talked about giving a Big Data Bootcamp in the Dominican Republic to a large group of very smart students. In this post, I’ll go over the basic tools and techniques that I think are most relevant in the job market. These are basic tools that most are […]
