Bill Busch, Author at Perficient Blogs

Change Your Mindset to Create a Better Data and Analytics Program

Bill Busch — Wed, 09 Sep 2020 11:45:15 +0000

Self-Service for data is not a new concept. Even in the early 2000s, companies have struggled with giving “power users” or “information workers” access to data to develop value-based insights. One of my past customers had a simple as a web page with hundreds of CSV formatted data extracts that people could “self-service data” based on their assigned role (security group). Although crude, it was one of the more popular pages on the corporate portal. The data extracts where indexed like a table of contents with descriptions that made data easy to locate. Data was downloaded with a point and click, so it was easy to access.

Challenges Facing Data Leaders

Although this previous example is 15 years old, it illustrates some of the critical challenges with which data leaders are struggling to provide data self-service or data as a service. Specifically, these challenges include how to:

Publish curated data
Organize data so it can be found
Provide easy access to data
Describe data in business terms

Today we have a bevy of tools that enable self-service data, data integration/preparation, analytics, BI, and machine learning capabilities. From the architecture perspective, containerization, microservices, DevOps, DataOps, and cloud services all provide infrastructure and processes to enable scalable and cost-effective self-service data and analytics. Weaving all these tools and technologies into an enterprise’s data ecosystem can be daunting, even for large, well-resourced companies.

Self-Service data and analytics (think AI, ML, and Model Building) and self-service Business Intelligence require different mindsets. With self-service BI, we had the luxury of buying a single tool like Microstrategy or Tableau and just enabling self-service during implementation. The success depended primarily on how well you or your consulting partner implemented the device and how well it was governed.

A Different Mindset

However, with Data and Analytics, we have a set of complexities that we did not have with self-service BI. These include enabling and governing direct data access, providing tools to transform, prep, and cleanse data, facilitating analytical models being deployed to production, creating sandboxes in the cloud, and helping users connect a wide variety of analytical and AI tools to enterprise data.

At Perficient, we had the opportunity to guide a large number of organizations through the process of specifying and implementing data and analytics architecture. Through this vast experience, we have observed that companies that gain a significant return on their analytics and data investments have changed their mindset from “let’s implement a tool” to “let’s provide a service.”

Whether we call this Data as a Service or Analytics as a Service or any other name du jour really does matter. But it was the mindset to define a set of enabling services (that involve data and analytics) then continually improve these critical services. This mindset revolves around approaching your program from your customer’s perspective. Talking to your data consumer and understanding how they access and use data, what their challenges that impede productivity are all items with which the leaders of the Data and Analytics program should be familiar.

Approaching the data and analytics program from the consumer point of view will undoubtedly change your perspective. Instead of looking at implementing tools that make IT happy, successful programs view tools as a way of making your users happy – and rarely will that happen without a service or capability driven implementation.

Data Architecture: 2.5 Types of Modern Data Integration Tools

Bill Busch — Mon, 10 Feb 2020 13:55:15 +0000

As we move into the modern cloud data architecture era, enterprises are deploying 2 primary classes of data integration tools to handle the traditional ETL and ELT use cases.

The first type of Data integration tool is GUI-Based Data Integration solutions.

Talend, Infosphere Datastage, Informatica, and Matillion are good examples. These tools leverage a UI to either configure a data integration engine or compile code for data integration. GUI Integration tools promise fast, friendly user interfaces to rapidly create new data pipelines. Also, GUI-based data integration tools have a proven record of increasing developer productivity. They are good for organizations that have:

Many data integration pipelines to manage.
Complex MDM requirements and business rules that need to integrate into data pipelines.
An ubiquitous relational database ecosystem.
Requirements to move data to and from cloud platforms (e.g. AWS, Azure, GCP)

The second type of Data Integration is the Script/Code-based Data Integration Solutions.

Script/Code-based data integration leverages a serious of tools to develop a data pipeline. This capability usually requires:

A programming language like Python or Scala
A data processing framework such as Spark
An orchestration tool similar to Apache Airflow.

Code/Scripts are constructed in vertices or nodes using a programming language and framework. These vertices then are structured in Directed Acyclic Graphs (DAGs) by the orchestration tool. DAGs can scale to handle very large (think 10s of Terabytes per day) data pipelines. DAGs are also extremely useful for handling customized or complex processing that one would see in Artificial Intelligence or Machine Learning use cases.

The 0.5: Cloud Native

When I was initially socializing the two types of Cloud ETL blog idea, a counterpart asked, “What about cloud-native?” Good question! The cloud-native options are just flavors of the two types of Data Integration. For instance, AWS Glue and Google DataProc have UIs that generate code (e.g. Python and Scala). Unlike their legacy counterparts with a rich UI functionality, these cloud-native tools still require editing the generated code (usually Python or Scala). The cloud-native tools are quickly catching up, but they still need to add significant functionality to their UIs to be able to garner the same productivity gains as traditional GUI-based solutions.

EDW in the Cloud TCO

Bill Busch — Thu, 18 Jul 2019 13:24:05 +0000

In 2016, when I did my first in-depth comparison, the resulting TCOs were usually very close. Usually, the OpEx was slightly higher for the cloud TCO versus the on-prem TCO required substantial capital investment.

However, our most recent estimate was eye-opening to our client. We were assessing a green-field implementation for a Data Warehouse at a mid-sized company. Part of our assessment was to compare TCO between the different deployment options, on-prem and cloud. We fully loaded all expenses for both options, including data center expenses, networking, data transfer, storage, administrative, software subscription fees, hardware and software maintenance, depreciation, and support.

The results were staggering. The cloud deployment TCO was over 30% less than the comparable on-prem deployment. Further, the on-prem deployment required a significant capital investment which was not required for the cloud deployment. It should be noted that in the cloud TCO we greatly over-estimated data transfer, processing, and storage costs.

Inspecting the TCO, there were three cloud features that greatly swung the:

Disaster Recovery cost was minimal, primarily data storage and data transfer costs.
Separating storage and compute with pay as you go compute minimized overall costs
Ability to right-size and scale the compute environment minimized initial costs of the program.

In the past, the cloud vs on-prem decision came down to a conversation around the speed of deployment, flexibility, elasticity – that is the normal cloud advantages. Now with the movement toward PaaS and serverless options that charge based only on resources used, the cloud has become the lowest TCO option in most cases.

Data Lakes, Not Just For Analytics Anymore

Bill Busch — Thu, 13 Dec 2018 22:57:12 +0000

Data Lakes have been around since the early part of this decade as most Fortune 500 companies have a Data Lake or are building a Data Lake. The drive to lake data has predominately been driven by analytical use cases where Data Scientists can wrangle and prepare data for their study or model building.

However, Perficient is seeing a significant shift from just deploying Hadoop to support analytics use case to using Data Lakes for operational data processing use cases. Companies are now able to move processing from expensive legacy mainframes and MPP data warehouses to Hadoop and Cloud-based systems. Although operational data processing has always been possible on Hadoop systems, the momentum has significantly accelerated due to a number of advances in in past few years.

These advances include:

SQL-based transformation included in Spark has made Big Data ETL accessible to many firms not wishing to invest in expensive ETL tools,
Cloud-based Data Warehouses that can offer similar scale and ease of use as traditional EDW systems at a fraction of the cost, and
Security advancements of Hadoop and Cloud-based Big Data Offerings has reassured companies that their data assets are protected in the new data ecosystem.

The movement to perform more operational data processing on Data Lakes brings its own set of challenges that companies need to address. In my next blog post, I will investigate these challenges that companies are facing as Big Data becomes operational.

Dorothy in the Land of Big Data

Bill Busch — Thu, 18 Jun 2015 03:15:30 +0000

Big Data is one of the enabling technologies for companies to digitally transform either their operations and/or customer interactions. However the open source world can be complicated, especially in the red hot Big Data arena. There are a myriad of technologies; some compete with one another, others overlap, some are complementary, and worse of all, some technologies both compete and are complementary (e.g. Cassandra can stand alone or run on top of Hadoop, thus it competes, and it is complementary.)

Like a Venn diagram, both Spark and MapReduce can coexist, but a number of their use cases do overlap. (Read more about Spark vs. Hadoop here). On Monday IBM, announced that it was throwing its weight behind Spark (see my blog post here) with little mention of Hadoop. It is easy to see how companies can one can feel like Dorothy in the Land of Oz. However, following a few simple recommendations a company can make good decisions in this seeming fantastical world (view my blog post here).

There is more good news. In this new digital, Big Data ecosystem the technology is pretty pluggable and is becoming more interchangeable. If the focus is on business value that data and analytics can bring to a company undergoing a digital transformation, then a good pragmatic business decision today will result in a good return on your technology investment.

Hadoop, Spark, Cassandra, Oh My!

Bill Busch — Thu, 18 Jun 2015 02:40:37 +0000

Previously, I reviewed why Spark will not by itself replace Hadoop, but Spark combined with other data storage and resource management technologies creates other options for managing Big Data. Today we will investigate how an enterprise should proceed in this new, “Hadoop is not the only option” world.

Hadoop, Spark, Cassandra, Oh My! Open source Hadoop and NoSQL are moving pretty fast. No wonder some companies might feel like Dorothy in the Land of Oz. To make sense and things and find the yellow brick road, first we need to understand Hadoop’s market position.
HDFS and YARN can support the storage of just about any type of data. With Microsoft’s help, now it is possible to run Hadoop on Windows and leverage .NET. Of course most are running on Linux and Java. HBASE, Hive, and Cassandra all can run on top of Hadoop. Hadoop and HIVE support is quickly becoming ubiquitous across data discovery, BI, and analytics tools sets. Hadoop is maturing fast from the security perspective. Thanks to HortonWorks, Apache Knox and Ranger have delivered enterprise security capabilities. Cloudera and IBM both have their own stories on enterprise security as well. WANdisco provides robust multi-site data replication and state of the art data protection. The bottom line is that Hadoop has and is continuing to mature AND there is an extensive amount of support from most vendors and related Apache open-source projects. Hadoop is not going anywhere but up! Hadoop is the defacto standard (aka safe bet) for Big Data management and processing. It will meet the requirements of most enterprises, and its ability to support many different execution frameworks like Storm, Mapreduce, TEZ, and Spark will assure support for most any application processing scenario.

Although Hadoop may be a safe choice it isn’t always the correct choice for an enterprise. There are other options. I worked at a mid-size company recently, and based on their requirements, they did not require a Hadoop to meet their 2-3 year needs. A solution build on RDBMS was sufficient because of the characteristics of their data. Meanwhile a large retailer has recently deployed a very large (100s of TBs) Cassandra system to replace a relational operational data store. Requirements, willingness to accept risk, budget, and in-house skill sets all play a part of the overall decision.

When making a platform decision for Big Data, enterprises should:

Make a decision based on the current market – do not “bet” on a technology that does not exist yet. If you have a business case today, the last thing you need to do is wait. That is, a good decision today is better than great decision tomorrow. Time is money.
Do not lock yourself into any single technology by investing a significant amount of time and expense in writing code that would need to be written if something better comes along. Rely on product vendors to isolate you from changes in technology. Using a third party tool like Snaplogic or Informatica for data transformation will help isolate you from underlying platform changes.
Stick to more mature open-source offerings. New open-source projects offer a lot of promise and excitement, however, they need to mature first (Note that IBM’s announcement implies that they believe Spark needs to mature more before it is enterprise ready).
Perform a proof of concept, but do not fall into the analysis paralysis trap of testing every different technology combination. Pick one or two deployment scenarios to prove. Time is money.

Lastly, consider partnering with an experience consulting firm with hands on experience in operationalizing Big Data (Perficient is an excellent choice by the way!). A good consultant will, in the long run, save you money by providing objective advice and speeding you through a solution and implementation by focusing your organization on those decisions that truly important.

Spark Gathers More Momentum

Bill Busch — Tue, 16 Jun 2015 19:54:48 +0000

Yesterday, IBM threw its weight behind Spark. This announcement is significant because it is a leading indicator of a transition from IT-focused Big Data efforts to business-driven analytics and Big Data investments. If you are interested in learning more about this announcement and what it means in the bigger picture, I wrote a blog entry on our IBM blog, which can be found here.

IBM’s Spark Investment is Evidence Big Data is Dead

Bill Busch — Tue, 16 Jun 2015 19:22:48 +0000

Right after I posted my blog on Spark and Hadoop, I came across this article. IBM made a big announcement that they are putting their weight behind Spark. They are committing more than 3,500 developers and programmers to help move Spark forward. This combined with significant support from the Big 3 Hadoop distributors (HortonWorks, Cloudera, and MapR) Spark seems to have a lot of momentum.

What is going on? Is Big Data is dead!? One can see Google articles proclaiming such. As discussed yesterday, some have said that Spark will replace Hadoop. Gartner a couple of years ago has proclaimed that Big Data has entered the Tough of Disillusionment. Now IBM has very publicly proclaimed massive support for Spark.

So, Big Data is dead, right? Well let’s investigate this a bit further. Just like any new transformational technology, at its onset Big Data has been driven by hype and promise. IT has jumped on the bandwagon because they realize that Hadoop, NoSQL, and other open source projects bring new gee-whiz technical capabilities. But technical capabilities alone do not create business value. Even Spark alone will not create business value. To create value, business focused solutions must be developed.

Up until recently, the focus of Big Data has been more on the technology side of things, especially around managing and storing this large influx of data we are seeing from mobile, Internet of Things, and social. The Big Data hype has been focused on how to ingest into Hadoop and leveraging an analytical tool, and magically you will garner high-value insights.

However real the value has been elusive to most companies. Depending on which analyst firm to whom you subscribe, only 20% of the companies experimenting with Hadoop have deployed projects to production. Why is this number so low? Simply put, IT was driving most of the Hadoop investment. However, in 2014 there seems to be a huge shift. Datameer published on their blog, that the executive interest in Big Data shifted from predominately IT to predominately business executives. To put it another way, overhyped IT driven Big Data is what is dead. However, the business opportunity that the new data ecosystem presents is real.

With business executives finally getting the message that they need to leverage analytics in the new data ecosystem, organizations will need to bring business centric solutions to the market. THIS, is where the IBM Spark announcement is significant. IBM is committing 3500 people Spark! That is at a minimum a $350Million dollar annual investment. A pretty huge undertaking even for IBM. What is IBM doing? In my opinion two things. One is to get Spark ready for prime time. The challenge with Spark has been its enterprise maturity. A lot of promise, Spark needs to mature. Once Spark is mature, IBM will be free to build high-value business oriented applications that leverage analytics and machine learning using Spark. Big Data is not Dead – it is just becoming business focused.

Will Spark Replace Hadoop?

Bill Busch — Mon, 15 Jun 2015 13:29:09 +0000

I have seen a number of articles asking the question of whether Apache Spark will replace Hadoop. This is the wrong question! It is like asking if your your DVD player will replace your entire home theater system, which is pretty absurd. Just like a home theatre system has many components, a TV or Projector, a Receiver, DVD player, and speakers, Hadoop has different components. Specifically, Hadoop has three main components:

File System – Hadoop File System (HDFS) – the scalable and reliable information storage layer
Resource Management – YARN – the resource manager which manages processing across the cluster (eg. HDFS).
Cluster Computing Framework – Originally this was MAPREDUCE now there are other options like TEZ , STORM and SPARK.

Spark’s core process framework provides a cluster computing framework and has a basic resource management layer. What it is missing it is missing is the data storage layer (e.g. HDFS). So, Spark, by itself, will not replace Hadoop (for any workload that requires data storage services).

The conversation does not end here, though. Spark’s main benefits are that it stores data for intermediate processing steps in memory versus MapReduce which stores data on disk. Mapreduce works well for batch processing where data can be read and written to disk in bulk, but for more streaming and many analytical types of workload, Spark’s in-memory centric architecture provides significant benefits. Spark’s in-memory architecture creates a framework that is very versatile supporting machine learning, graph analytics, SQL data access/management, and stream processing. Spark’s versatility is not just limited to the types of workloads. It supports Mesos, YARN, Ring Master (Cassandra’s resource manager), and its own resource management. For data storage, Spark is compatible with Hadoop, S3, Cassandra, and others. So, Spark alone will not replace Hadoop. However, Spark combined with other resource managers and data storage solutions can provide a holistic system that can replace Hadoop.

In summary, just like a DVD player can be utilized in a system that either a Plasma, LCD or DLP Projector for the display, Spark will open the door to storing and processing data residing in other systems other than Hadoop.

If you like today’s blog entry, please follow me @bigdata73.

Leveraging Your Oracle Resources for Big Data Value

Bill Busch — Tue, 07 Apr 2015 16:36:38 +0000

As companies transform their businesses to be data-driven and leverage the benefits of Big Data, they quickly realizing that lack of Big Data centric data scientists and wranglers is blocking their value attainment. One of the limiting factors in Big Data resources, are the skills that have typically been required to leverage Big Data. Java and MapReduce both have been required skills and created a barrier to leveraging Big Data. This barrier exists despite the fact most companies have a number of DBAs, data analysts, and developers that are familiar with more traditional SQL-based data access and integration tools. Providing tools that allow these untapped resources to access Big Data in the environments that they are already trained, will allow companies to quickly address, in part, the data scientist gap.

Oracle has realized this need for Oracle-centric big data tools, and developed a number of connectors that enable this Oracle army to access and manipulate Big Data using tools and languages familiar to Oracle professionals. Oracle has 5 such connectors/technologies, each targeted at its own niche within the Oracle ecosystem.

Oracle Loader for Hadoop – Allows DBAs and developers to easily bulk load data from Hadoop to Oracle without having to understand MapReduce, Yarn, or Java. This Hadoop-aware connector pushes-down Oracle data-type conversions to be processed on the Hadoop data nodes, thus, reducing the cpu utilization of the Oracle database during load-time.
Oracle R Advanced Analytics for Hadoop – This connector is for the data scientists that are used to leveraging Oracle Advance Analytics and R in the Oracle Database. It provides a single interface to combine data from HDFS, Hive, Oracle RDBMS, and other supported data sources into one analytic tasks or set of tasks.
Oracle SQL Connector for HDFS – SQL is the language most Oracle Professionals are familiar, and having a connector that enables HDFS data to be combined with data within an Oracle database using a single query greatly opens up access to Big Data.
Oracle XQuery for Hadoop – Many logs and machine data are stored in XML or Json structures. The Oracle XQuery connector extends Oracle’s XML Query search technology to HDFS and Oracle NoSQL databases allowing access to these popular structures.
Oracle Data Integrator for Big Data – Although not a connector, this product enables ODI professionals to leverage HDFS data as they would any other data source. Like other Big Data technology from Oracle, the ODI connector is Hadoop-aware and pushes transformational processing down to the Hadoop data nodes.

If you are an enterprise with significant investment in Oracle technologies and are pursuing Big Data, do not ignore the array of Big Data connectors from Oracle. These connectors enable your Oracle professionals to quickly utilize their Oracle and existing data knowledge to provide value from your Big Data investment.

Change is in the Air

Bill Busch — Mon, 30 Mar 2015 20:39:31 +0000

The strategy is complete, implementation of the mobile application and analytical system is finished, data scientists are providing useful analytical research. But is your enterprise getting the value out of your digital transformation investments?

A company’s culture, people, and business processes usually provide the largest barrier to realizing the value from digital investments. Yes, we talk about change management, however, most times that change management is involved in a one-time event like the implementation of ERP system or rolling out a new Salesforce application.

Once we have analytics in place providing insights into our digital endeavors, we are still not providing value to the company until a business process is changed. A few examples:

Retail – Yes, price optimization has been in the market for a couple of decades. However, once you know a set of prices should be altered, there is an approval process, and changes to the POS have to be made to update the products pricing, store labor has to physically, and change the price tag on the shelves.
Healthcare – One of the big trends in the healthcare today is early detection of the sepsis. Even if the predictive model can be automated, the resulting prediction (if it is positive), needs to be communicated down to the doctors and nurses that are caring for the patient.
Oil and Gas – Drilling can generate over a 1TB a day in log and sensor data. Based on analytics, companies can maximize their oil production by keeping horizontal bores in the middle of oil rich rock, knowing when to create a new bore, or determining when fracking is required. Again the information needs to get to platform/well crew for it to be actionable.

In these cases, the change management is not a one-time thing. Change is constant. The business processes must be built to enable the analytical changes. Only then, can an enterprise realize the full benefit of the digital transformation.

Data Quality – Don’t Fix It If It Ain’t Broke

Bill Busch — Tue, 27 Jan 2015 15:42:11 +0000

What is broke? If I drive a pickup truck around that has a small, unobtrusive crack in the windshield and a few dings in the paint, it will still pull a boat and haul a bunch of lumber from Home Depot. Is the pickup broke if it still meets my needs?

So, when is data broke? In our legacy data integration practices, we would profile data and identify all that is wrong with the data. Orphan keys, in-appropriate values, and incomplete data (to name a few) would be identified before data was moved. In the more stringent organizations data would need to near perfect for it to be used in a data warehouse. This ideal world or perfect data was strived after, but rarely obtained. It was too expensive, required too much business buy in, and lengthen BI and DW projects.

In the world of Big Data, things have changed. We move data first and some of it we may never fix. Why? To understand this we need to look at the analytical process. When performing an analytical project, data scientists will usually select a subset of data, split it into halves. One half of the data is used to build a model; the other is used to test the model. If the model tests OK, that is the standard error is within acceptable range, do we need to fix the data? Fixing the data would at this point not change the outcome, so it served its purpose.

With moving the data first and moving it into a data lake for processing this gives us a unique opportunity to test drive the data first. Data scientists and business users will be able to benefit from using the data to make better decisions. At a time that the data quality does not meet the needs, address the issues within the data. So, don’t fix it if it ain’t broke.

Follow Bill on Twitter @bigdata73

Connect with Perficient on LinkedIn here.