Some data formats are columnar. This means they store information in columns rather than rows. They are popular because certain types of queries run against them more easily than against row-based formats. Parquet supports parallel query processing, meaning it can split your data into several files so it can be read on multiple processors at […]
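A minimal sketch of that parallelism, assuming a local SparkSession and a hypothetical /tmp/events path: writing with several partitions produces several Parquet files, each of which can be scanned by a separate task.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Hypothetical dataset: a million synthetic event ids.
df = spark.range(1_000_000).withColumnRenamed("id", "event_id")

# Writing with multiple partitions yields multiple Parquet files.
df.repartition(8).write.mode("overwrite").parquet("/tmp/events")

# Reading back: each file can be handled by its own task, and only the
# projected column's chunks are scanned.
events = spark.read.parquet("/tmp/events").select("event_id")
print(events.rdd.getNumPartitions())  # typically > 1: parallel reads
```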
Blogs from this Author
Finding the right balance of nOps
There is a proliferation of acronyms with the Ops suffix for the software architect to choose from. It's reasonable to question whether so many are needed. All of these are, at their core, targeted expressions of foundational business management methodology. The end goal is continuous improvement in some business-critical metric. […]
Adopting a Risk-Based Strategy for Data
Ransomware attacks have been in the news lately, possibly because of the 225% increase in total losses from ransomware in the United States alone in 2020. An increase in sophistication by attackers is a major factor, and many of these ransomware attacks were enabled at least in part by insider negligence. As the level of […]
Deep Dive into Databricks Tempo for Time Series Analytics
Time-series data has typically been fit imperfectly into whatever database we were using at the time for other tasks. Time-series databases (TSDBs) are now coming to market. TSDBs are optimized to store and retrieve associated pairs of times and values. A TSDB's architecture focuses on time-stamped data storage and the compression, summarization, and life-cycle management […]
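A minimal sketch of the kind of time-stamp-centric summarization a TSDB optimizes for, written in plain PySpark rather than Tempo's own API; the sensor readings and column names are made up for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ts-demo").getOrCreate()

# Hypothetical sensor readings: (device, timestamp, value) pairs.
readings = spark.createDataFrame(
    [("sensor-1", "2021-06-01 00:00:05", 20.1),
     ("sensor-1", "2021-06-01 00:00:35", 20.4),
     ("sensor-2", "2021-06-01 00:00:10", 18.9)],
    ["device", "ts", "value"],
).withColumn("ts", F.to_timestamp("ts"))

# Summarize each device into one-minute buckets - the rollup pattern a
# time-series store is built around.
summary = (readings
           .groupBy("device", F.window("ts", "1 minute"))
           .agg(F.avg("value").alias("avg_value")))
summary.show(truncate=False)
```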
Koalas are better than Pandas (on Spark)
I help companies build out, manage, and hopefully get value from large data stores. Or at least, I try. In order to get value from these petabyte-scale datastores, I need the data scientists to be able to easily apply their statistical and domain knowledge. There's one fundamental problem: large datasets are always distributed and data […]
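A minimal sketch of what Koalas buys you, assuming the koalas package is installed and using a hypothetical transactions CSV: the code reads like pandas, but every operation runs distributed on Spark.

```python
import databricks.koalas as ks

# Hypothetical path; in practice this would be a petabyte-scale store.
df = ks.read_csv("/data/transactions.csv")

# Familiar pandas-style groupby, executed as distributed Spark jobs.
summary = (df.groupby("customer_id")["amount"]
             .sum()
             .sort_values(ascending=False))
print(summary.head(10))
```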
DataOps with IBM
DataOps seeks to deliver high-quality data fast in the same way that DevOps delivers high-quality code fast. The names are similar; the goals are similar; the implementation is very different. Code quality can be measured using similar tools across multiple projects. Data quality is a mission-critical, enterprise-wide effort. The effort has consistently proven […]
Trust models in distributed ledgers
Consensus, getting distributed processes to agree on a single value, is a fundamental problem in computer science. Distributed processing is difficult. In fact, there are logical proofs that show pretty conclusively that there won’t be a single perfect algorithm for handling consensus in an asynchronous system made of imperfect nodes. As long as there is […]
Understanding Performance in Blockchain Systems
Blockchain is an example of a distributed ledger system and as such shares the same performance concerns as any other distributed system. In order to measure the performance of a distributed system with an acceptable degree of accuracy, it's best to simplify as many of the variables under our control as possible. The size of the […]
Take advantage of windows in your Spark data science pipeline
Windows perform calculations across a configurable frame of records around the current record in your Spark data science pipeline. They are SQL functions that allow you to access data before and after the current record to perform calculations. They can be broken down into ranking, analytic, and aggregate functions. Spark provides the […]
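A minimal sketch of the idea, assuming a local SparkSession and a made-up sales DataFrame: one ranking function and one aggregate computed over a frame around the current row.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-demo").getOrCreate()

# Hypothetical daily sales per region.
sales = spark.createDataFrame(
    [("east", "2021-01-01", 100.0),
     ("east", "2021-01-02", 150.0),
     ("east", "2021-01-03", 120.0),
     ("west", "2021-01-01", 90.0)],
    ["region", "day", "amount"],
)

w = Window.partitionBy("region").orderBy("day")

result = (sales
          # Ranking function over the ordered window.
          .withColumn("rank", F.rank().over(w))
          # Aggregate over a frame of the previous, current, and next row.
          .withColumn("moving_avg",
                      F.avg("amount").over(w.rowsBetween(-1, 1))))
result.show()
```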
Bringing Informatica Intelligent Cloud Service into your Release Management Pipeline
Informatica Intelligent Cloud Services (IICS) now offers a free command line utility that can be used to integrate your ETL jobs into most enterprise release management pipelines. It’s called the Asset Management command line interface (CLI). Version two now allows you to extract an IICS job into a single compressed file. Moving a single standalone […]
Scale your data science practice formally
Frequently, the "crawl, walk, run, fly" metaphor is used when describing the path to implementing a scalable data science practice. There are a lot of problems with this concept, not the least of which is that there is already motion involved. People are already doing BI work, often complex work enabling high-value results. […]
Tune the dials to optimize your Spark machine learning pipeline
Tuning Spark for your machine learning pipeline can be a complex and time-consuming process. Storage and compute play different roles in your Spark cluster at different stages of your machine learning pipeline. Spark defaults are never the right way to go. It makes more sense to know what settings are most effective at […]
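A minimal sketch of overriding those defaults when building the session; the specific values here are illustrative assumptions for a shuffle-heavy, memory-bound pipeline, not recommendations.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ml-tuning-demo")
         # Shuffle-heavy feature engineering: raise shuffle parallelism
         # above the default of 200.
         .config("spark.sql.shuffle.partitions", "400")
         # Memory-bound model training: give executors more headroom.
         .config("spark.executor.memory", "8g")
         .config("spark.memory.fraction", "0.8")
         .getOrCreate())

# Confirm the override took effect.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```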