Bill Busch – Perficient Blogs Expert Insights Fri, 14 Feb 2020 22:18:44 +0000

Data Architecture: 2.5 Types of Modern Data Integration Tools Mon, 10 Feb 2020 13:55:15 +0000

As we move into the modern cloud data architecture era, enterprises are deploying two primary classes of data integration tools to handle the traditional ETL and ELT use cases.

The first type of data integration tool is the GUI-based data integration solution.

Talend, InfoSphere DataStage, Informatica, and Matillion are good examples. These tools leverage a UI to either configure a data integration engine or compile code for data integration. GUI integration tools promise fast, friendly user interfaces for rapidly creating new data pipelines. GUI-based data integration tools also have a proven record of increasing developer productivity. They are good for organizations that have:

  1. Many data integration pipelines to manage.
  2. Complex MDM requirements and business rules that need to integrate into data pipelines.
  3. A ubiquitous relational database ecosystem.
  4. Requirements to move data to and from cloud platforms (e.g., AWS, Azure, GCP).

The second type of data integration tool is the script/code-based data integration solution.

Script/code-based data integration leverages a series of tools to develop a data pipeline. This capability usually requires:

  1. A programming language like Python or Scala
  2. A data processing framework such as Spark
  3. An orchestration tool similar to Apache Airflow.

Code and scripts are constructed as vertices, or nodes, using a programming language and framework. The orchestration tool then arranges these vertices into Directed Acyclic Graphs (DAGs). DAGs can scale to handle very large data pipelines (think tens of terabytes per day). DAGs are also extremely useful for handling the customized or complex processing that one would see in Artificial Intelligence or Machine Learning use cases.
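To make the vertex-and-DAG idea concrete, here is a minimal sketch in plain Python. It uses the standard library's `graphlib` module (Python 3.9+) rather than a real orchestration tool such as Airflow, and the task names (`extract`, `transform`, `load`) and sample rows are hypothetical, chosen only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps; each function stands in for one vertex (node).
def extract():
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "4.5"}]

def transform(rows):
    # Cast the string amounts to floats.
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]

def load(rows):
    # Stand-in for writing to a target system: just total the amounts.
    return sum(r["amount"] for r in rows)

# Each task maps to (function, list of upstream tasks it depends on).
TASKS = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}

def run_pipeline(tasks):
    """Run every task after its dependencies, passing upstream results along."""
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        func, deps = tasks[name]
        results[name] = func(*(results[d] for d in deps))
    return order, results

if __name__ == "__main__":
    order, results = run_pipeline(TASKS)
    print(order)            # execution order derived from the dependencies
    print(results["load"])  # total of the transformed amounts
```

A production orchestrator adds scheduling, retries, and distributed execution on top of this, but the core contract is the same: declare the dependencies and let the tool derive a valid execution order.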

The 0.5: Cloud Native

When I was initially socializing the idea for this blog post on the two types of cloud ETL, a counterpart asked, “What about cloud-native?” Good question! The cloud-native options are just flavors of the two types of data integration. For instance, AWS Glue and Google Dataproc have UIs that generate code (e.g., Python and Scala). Unlike their legacy counterparts with rich UI functionality, these cloud-native tools still require editing the generated code. The cloud-native tools are quickly catching up, but they still need to add significant functionality to their UIs to garner the same productivity gains as traditional GUI-based solutions.

EDW in the Cloud TCO Thu, 18 Jul 2019 13:24:05 +0000

In 2016, when I did my first in-depth comparison, the resulting TCOs were usually very close. Typically, OpEx was slightly higher for the cloud option, whereas the on-prem option required a substantial capital investment.

However, our most recent estimate was eye-opening to our client.  We were assessing a green-field implementation for a Data Warehouse at a mid-sized company.  Part of our assessment was to compare TCO between the different deployment options, on-prem and cloud. We fully loaded all expenses for both options, including data center expenses, networking, data transfer, storage, administrative, software subscription fees, hardware and software maintenance, depreciation, and support.

The results were staggering. The cloud deployment TCO was over 30% less than the comparable on-prem deployment.  Further, the on-prem deployment required a significant capital investment which was not required for the cloud deployment.  It should be noted that in the cloud TCO we greatly over-estimated data transfer, processing, and storage costs.

Inspecting the TCO, there were three cloud features that greatly swung the result:

  1. Disaster Recovery cost was minimal, consisting primarily of data storage and data transfer costs.
  2. Separating storage and compute, with pay-as-you-go compute, minimized overall costs.
  3. The ability to right-size and scale the compute environment minimized the initial costs of the program.
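The shape of such a comparison can be sketched in a few lines of Python. Every figure below is a hypothetical placeholder picked for easy arithmetic, not a number from the client assessment, and a real TCO model would break capex and opex into the line items listed earlier (data center, networking, storage, support, and so on).

```python
def total_cost_of_ownership(capex, annual_opex, years):
    """Up-front capital plus recurring operating expense over the horizon."""
    return capex + annual_opex * years

def percent_savings(baseline, alternative):
    """How much cheaper the alternative is, as a percentage of the baseline."""
    return (baseline - alternative) / baseline * 100

# Hypothetical inputs: on-prem carries capital cost, cloud is opex only.
ON_PREM = total_cost_of_ownership(capex=500_000, annual_opex=300_000, years=3)
CLOUD = total_cost_of_ownership(capex=0, annual_opex=350_000, years=3)

if __name__ == "__main__":
    print(ON_PREM, CLOUD, percent_savings(ON_PREM, CLOUD))
```

Even with a higher annual operating cost, the cloud option in this toy example comes out ahead because it carries no capital outlay; the right-sizing and pay-as-you-go effects described above would show up as a lower `annual_opex`.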

In the past, the cloud vs. on-prem decision came down to a conversation around speed of deployment, flexibility, and elasticity; that is, the usual cloud advantages. Now, with the movement toward PaaS and serverless options that charge based only on resources used, the cloud has become the lowest TCO option in most cases.

Data Lakes, Not Just For Analytics Anymore Thu, 13 Dec 2018 22:57:12 +0000

Data Lakes have been around since the early part of this decade, and most Fortune 500 companies either have a Data Lake or are building one. The drive to lake data has predominantly come from analytical use cases, where Data Scientists can wrangle and prepare data for their study or model building.

However, Perficient is seeing a significant shift from deploying Hadoop solely to support analytics use cases to using Data Lakes for operational data processing use cases. Companies are now able to move processing from expensive legacy mainframes and MPP data warehouses to Hadoop and Cloud-based systems. Although operational data processing has always been possible on Hadoop systems, the momentum has significantly accelerated due to a number of advances in the past few years.

These advances include:

  1. SQL-based transformation included in Spark, which has made Big Data ETL accessible to many firms not wishing to invest in expensive ETL tools,
  2. Cloud-based Data Warehouses that can offer similar scale and ease of use as traditional EDW systems at a fraction of the cost, and
  3. Security advancements in Hadoop and Cloud-based Big Data offerings, which have reassured companies that their data assets are protected in the new data ecosystem.
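As an illustration of the first point, the sketch below expresses a small transformation entirely in SQL. To keep it self-contained it runs on Python's built-in `sqlite3` module as a stand-in for Spark; the table and column names (`raw_orders`, `daily_totals`, `order_date`, `amount`) are hypothetical. On a Spark cluster, a similar SELECT statement could be submitted through `SparkSession.sql`, which is not shown here.

```python
import sqlite3

def build_daily_totals(conn):
    """Transform raw rows into a summary table using nothing but SQL."""
    conn.execute(
        """
        CREATE TABLE daily_totals AS
        SELECT order_date,
               SUM(amount) AS total_amount,
               COUNT(*)    AS order_count
        FROM raw_orders
        WHERE amount IS NOT NULL      -- drop incomplete records
        GROUP BY order_date
        """
    )
    query = (
        "SELECT order_date, total_amount, order_count "
        "FROM daily_totals ORDER BY order_date"
    )
    return conn.execute(query).fetchall()

def demo():
    """Load a few hypothetical rows and run the transformation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_date TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [
            ("2018-12-01", 10.0),
            ("2018-12-01", 5.0),
            ("2018-12-02", 7.5),
            ("2018-12-02", None),  # incomplete record, filtered out above
        ],
    )
    rows = build_daily_totals(conn)
    conn.close()
    return rows

if __name__ == "__main__":
    for row in demo():
        print(row)
```

The point is less the engine than the skill set: anyone who can write the SELECT statement can define the transformation, which is what lowers the barrier to Big Data ETL.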

The movement to perform more operational data processing on Data Lakes brings its own set of challenges that companies need to address. In my next blog post, I will investigate these challenges that companies are facing as Big Data becomes operational.

Dorothy in the Land of Big Data Thu, 18 Jun 2015 03:15:30 +0000

Big Data is one of the enabling technologies for companies to digitally transform their operations and/or customer interactions. However, the open source world can be complicated, especially in the red-hot Big Data arena. There are a myriad of technologies; some compete with one another, others overlap, some are complementary, and worst of all, some technologies both compete and are complementary (e.g., Cassandra can stand alone or run on top of Hadoop; thus it competes, and it is complementary).

Like a Venn diagram, both Spark and MapReduce can coexist, but a number of their use cases do overlap. (Read more about Spark vs. Hadoop here). On Monday, IBM announced that it was throwing its weight behind Spark (see my blog post here) with little mention of Hadoop. It is easy to see how a company can feel like Dorothy in the Land of Oz. However, by following a few simple recommendations, a company can make good decisions in this seemingly fantastical world (view my blog post here).

There is more good news. In this new digital, Big Data ecosystem the technology is pretty pluggable and is becoming more interchangeable. If the focus is on business value that data and analytics can bring to a company undergoing a digital transformation, then a good pragmatic business decision today will result in a good return on your technology investment.

Hadoop, Spark, Cassandra, Oh My! Thu, 18 Jun 2015 02:40:37 +0000

Previously, I reviewed why Spark will not by itself replace Hadoop, but Spark combined with other data storage and resource management technologies creates other options for managing Big Data.  Today we will investigate how an enterprise should proceed in this new, “Hadoop is not the only option” world. 

Open source Hadoop and NoSQL are moving pretty fast. No wonder some companies might feel like Dorothy in the Land of Oz. To make sense of things and find the yellow brick road, first we need to understand Hadoop’s market position.

HDFS and YARN can support the storage of just about any type of data. With Microsoft’s help, it is now possible to run Hadoop on Windows and leverage .NET. Of course, most are running on Linux and Java. HBase, Hive, and Cassandra can all run on top of Hadoop. Hadoop and Hive support is quickly becoming ubiquitous across data discovery, BI, and analytics tool sets. Hadoop is maturing fast from the security perspective. Thanks to HortonWorks, Apache Knox and Ranger have delivered enterprise security capabilities. Cloudera and IBM both have their own stories on enterprise security as well. WANdisco provides robust multi-site data replication and state-of-the-art data protection.

The bottom line is that Hadoop has matured and is continuing to mature, AND there is an extensive amount of support from most vendors and related Apache open-source projects. Hadoop is not going anywhere but up! Hadoop is the de facto standard (aka safe bet) for Big Data management and processing. It will meet the requirements of most enterprises, and its ability to support many different execution frameworks like Storm, MapReduce, Tez, and Spark will assure support for almost any application processing scenario.

Although Hadoop may be a safe choice, it isn’t always the correct choice for an enterprise. There are other options. I worked with a mid-size company recently, and based on their requirements, they did not need Hadoop to meet their 2-3 year needs. A solution built on an RDBMS was sufficient because of the characteristics of their data. Meanwhile, a large retailer has recently deployed a very large (100s of TBs) Cassandra system to replace a relational operational data store. Requirements, willingness to accept risk, budget, and in-house skill sets all play a part in the overall decision.

When making a platform decision for Big Data, enterprises should:

  1. Make a decision based on the current market – do not “bet” on a technology that does not exist yet. If you have a business case today, the last thing you need to do is wait. That is, a good decision today is better than a great decision tomorrow. Time is money.
  2. Do not lock yourself into any single technology by investing a significant amount of time and expense in writing code that would need to be rewritten if something better comes along. Rely on product vendors to isolate you from changes in technology. Using a third-party tool like SnapLogic or Informatica for data transformation will help isolate you from underlying platform changes.
  3. Stick to more mature open-source offerings. New open-source projects offer a lot of promise and excitement; however, they need to mature first. (Note that IBM’s announcement implies that they believe Spark needs to mature more before it is enterprise ready.)
  4. Perform a proof of concept, but do not fall into the analysis paralysis trap of testing every different technology combination. Pick one or two deployment scenarios to prove.  Time is money.

Lastly, consider partnering with an experienced consulting firm with hands-on experience in operationalizing Big Data (Perficient is an excellent choice, by the way!). A good consultant will, in the long run, save you money by providing objective advice and speeding you through solution design and implementation by focusing your organization on those decisions that are truly important.


Spark Gathers More Momentum Tue, 16 Jun 2015 19:54:48 +0000

Yesterday, IBM threw its weight behind Spark. This announcement is significant because it is a leading indicator of a transition from IT-focused Big Data efforts to business-driven analytics and Big Data investments. If you are interested in learning more about this announcement and what it means in the bigger picture, I wrote a blog entry on our IBM blog, which can be found here.

IBM’s Spark Investment is Evidence Big Data is Dead Tue, 16 Jun 2015 19:22:48 +0000


Right after I posted my blog on Spark and Hadoop, I came across this article. IBM made a big announcement that they are putting their weight behind Spark. They are committing more than 3,500 developers and programmers to help move Spark forward. Combined with significant support from the Big 3 Hadoop distributors (HortonWorks, Cloudera, and MapR), this gives Spark a lot of momentum.

What is going on? Is Big Data dead?! One can find articles on Google proclaiming as much. As discussed yesterday, some have said that Spark will replace Hadoop. A couple of years ago, Gartner proclaimed that Big Data had entered the Trough of Disillusionment. Now IBM has very publicly proclaimed massive support for Spark.

So, Big Data is dead, right?  Well let’s investigate this a bit further. Just like any new transformational technology, at its onset Big Data has been driven by hype and promise.  IT has jumped on the bandwagon because they realize that Hadoop, NoSQL, and other open source projects bring new gee-whiz technical capabilities. But technical capabilities alone do not create business value. Even Spark alone will not create business value.  To create value, business focused solutions must be developed.

Up until recently, the focus of Big Data has been more on the technology side of things, especially around managing and storing the large influx of data we are seeing from mobile, the Internet of Things, and social. The Big Data hype has been focused on the idea that you ingest data into Hadoop, leverage an analytical tool, and magically garner high-value insights.

However, the real value has been elusive for most companies. Depending on which analyst firm you follow, only 20% of the companies experimenting with Hadoop have deployed projects to production. Why is this number so low? Simply put, IT was driving most of the Hadoop investment. However, in 2014 there seems to have been a huge shift. Datameer published on their blog that executive interest in Big Data shifted from predominantly IT to predominantly business executives. To put it another way, overhyped, IT-driven Big Data is what is dead. However, the business opportunity that the new data ecosystem presents is real.

With business executives finally getting the message that they need to leverage analytics in the new data ecosystem, organizations will need to bring business-centric solutions to the market. THIS is where the IBM Spark announcement is significant. IBM is committing 3,500 people to Spark! That is, at a minimum, a $350 million annual investment. A pretty huge undertaking, even for IBM. What is IBM doing? In my opinion, two things. One is to get Spark ready for prime time. The challenge with Spark has been its enterprise maturity; it shows a lot of promise, but it needs to mature. The other is that, once Spark is mature, IBM will be free to build high-value, business-oriented applications that leverage analytics and machine learning using Spark. Big Data is not dead – it is just becoming business focused.


Will Spark Replace Hadoop? Mon, 15 Jun 2015 13:29:09 +0000

I have seen a number of articles asking whether Apache Spark will replace Hadoop. This is the wrong question! It is like asking if your DVD player will replace your entire home theater system, which is pretty absurd. Just as a home theater system has many components (a TV or projector, a receiver, a DVD player, and speakers), Hadoop has different components. Specifically, Hadoop has three main components:

  1. File System – Hadoop Distributed File System (HDFS) – the scalable and reliable information storage layer.
  2. Resource Management – YARN – the resource manager, which manages processing across the cluster.
  3. Cluster Computing Framework – Originally this was MapReduce; now there are other options like Tez, Storm, and Spark.

Spark’s core processing framework provides a cluster computing framework and has a basic resource management layer. What it is missing is the data storage layer (e.g., HDFS). So, Spark, by itself, will not replace Hadoop (for any workload that requires data storage services).

The conversation does not end here, though. Spark’s main benefit is that it keeps data for intermediate processing steps in memory, whereas MapReduce writes that data to disk. MapReduce works well for batch processing, where data can be read from and written to disk in bulk, but for streaming and many analytical types of workloads, Spark’s in-memory-centric architecture provides significant benefits. Spark’s in-memory architecture creates a very versatile framework that supports machine learning, graph analytics, SQL data access/management, and stream processing. Spark’s versatility is not limited to the types of workloads. It supports Mesos, YARN, Ring Master (Cassandra’s resource manager), and its own resource management. For data storage, Spark is compatible with Hadoop, S3, Cassandra, and others. So, Spark alone will not replace Hadoop. However, Spark combined with other resource managers and data storage solutions can provide a holistic system that can replace Hadoop.

In summary, just as a DVD player can be used in a system that has either a plasma, LCD, or DLP projector for the display, Spark will open the door to storing and processing data residing in systems other than Hadoop.

If you like today’s blog entry, please follow me @bigdata73.

Leveraging Your Oracle Resources for Big Data Value Tue, 07 Apr 2015 16:36:38 +0000

As companies transform their businesses to be data-driven and leverage the benefits of Big Data, they are quickly realizing that a lack of Big Data-centric data scientists and wranglers is blocking their value attainment. One of the limiting factors in Big Data resources is the skill set that has typically been required to leverage Big Data. Java and MapReduce have both been required skills and have created a barrier to leveraging Big Data. This barrier exists despite the fact that most companies have a number of DBAs, data analysts, and developers who are familiar with more traditional SQL-based data access and integration tools. Providing tools that allow these untapped resources to access Big Data in the environments in which they are already trained will allow companies to quickly address, in part, the data scientist gap.

Oracle has recognized this need for Oracle-centric Big Data tools and developed a number of connectors that enable this Oracle army to access and manipulate Big Data using tools and languages familiar to Oracle professionals. Oracle has five such connectors/technologies, each targeted at its own niche within the Oracle ecosystem.

  • Oracle Loader for Hadoop – Allows DBAs and developers to easily bulk load data from Hadoop to Oracle without having to understand MapReduce, YARN, or Java. This Hadoop-aware connector pushes down Oracle data-type conversions to be processed on the Hadoop data nodes, thus reducing the CPU utilization of the Oracle database during load time.
  • Oracle R Advanced Analytics for Hadoop – This connector is for data scientists who are used to leveraging Oracle Advanced Analytics and R in the Oracle Database. It provides a single interface to combine data from HDFS, Hive, Oracle RDBMS, and other supported data sources into one analytic task or set of tasks.
  • Oracle SQL Connector for HDFS – SQL is the language with which most Oracle professionals are familiar, and having a connector that enables HDFS data to be combined with data within an Oracle database using a single query greatly opens up access to Big Data.
  • Oracle XQuery for Hadoop – Many logs and machine data are stored in XML or JSON structures. The Oracle XQuery connector extends Oracle’s XML Query search technology to HDFS and Oracle NoSQL databases, allowing access to these popular structures.
  • Oracle Data Integrator for Big Data – Although not a connector, this product enables ODI professionals to leverage HDFS data as they would any other data source. Like other Big Data technology from Oracle, the ODI connector is Hadoop-aware and pushes transformational processing down to the Hadoop data nodes.

If you are an enterprise with significant investment in Oracle technologies and are pursuing Big Data, do not ignore the array of Big Data connectors from Oracle.  These connectors enable your Oracle professionals to quickly utilize their Oracle and existing data knowledge to provide value from your Big Data investment.

Change is in the Air Mon, 30 Mar 2015 20:39:31 +0000

The strategy is complete, implementation of the mobile application and analytical system is finished, and data scientists are providing useful analytical research. But is your enterprise getting the value out of your digital transformation investments?

A company’s culture, people, and business processes usually provide the largest barrier to realizing the value from digital investments. Yes, we talk about change management; however, most times that change management is tied to a one-time event like the implementation of an ERP system or the rollout of a new Salesforce application.

Once we have analytics in place providing insights into our digital endeavors, we are still not providing value to the company until a business process is changed.  A few examples:

  1. Retail – Yes, price optimization has been in the market for a couple of decades. However, once you know a set of prices should be altered, there is an approval process, changes have to be made to the POS to update the product pricing, and store labor has to physically change the price tags on the shelves.
  2. Healthcare – One of the big trends in healthcare today is early detection of sepsis. Even if the predictive model can be automated, the resulting prediction (if it is positive) needs to be communicated down to the doctors and nurses who are caring for the patient.
  3. Oil and Gas – Drilling can generate over 1 TB a day in log and sensor data. Based on analytics, companies can maximize their oil production by keeping horizontal bores in the middle of oil-rich rock, knowing when to create a new bore, or determining when fracking is required. Again, the information needs to get to the platform/well crew for it to be actionable.

In these cases, the change management is not a one-time thing.  Change is constant.  The business processes must be built to enable the analytical changes.      Only then, can an enterprise realize the full benefit of the digital transformation.

Data Quality – Don’t Fix It If It Ain’t Broke Tue, 27 Jan 2015 15:42:11 +0000

What is broke?  If I drive a pickup truck around that has a small, unobtrusive crack in the windshield and a few dings in the paint, it will still pull a boat and haul a bunch of lumber from Home Depot. Is the pickup broke if it still meets my needs?

So, when is data broke? In our legacy data integration practices, we would profile data and identify all that is wrong with the data. Orphan keys, inappropriate values, and incomplete data (to name a few) would be identified before data was moved. In the more stringent organizations, data would need to be near perfect for it to be used in a data warehouse. This ideal world of perfect data was strived for but rarely attained. It was too expensive, required too much business buy-in, and lengthened BI and DW projects.

In the world of Big Data, things have changed. We move data first, and some of it we may never fix. Why? To understand this, we need to look at the analytical process. When performing an analytical project, data scientists will usually select a subset of data and split it into halves. One half of the data is used to build a model; the other is used to test the model. If the model tests OK (that is, the standard error is within an acceptable range), do we need to fix the data? Fixing the data at this point would not change the outcome, so the data has served its purpose.
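The split-and-test check described above can be sketched as follows. This is a minimal illustration, assuming a simple one-variable linear model and using root-mean-square error on the held-out half as a stand-in for the standard error; the function names and the `max_error` threshold are hypothetical.

```python
import math
import random

def split_in_half(rows, seed=0):
    """Shuffle a copy of the rows and split it into two halves."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def fit_line(rows):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def standard_error(rows, slope, intercept):
    """Root-mean-square prediction error of the line on the given rows."""
    squared = sum((y - (slope * x + intercept)) ** 2 for x, y in rows)
    return math.sqrt(squared / len(rows))

def data_is_good_enough(rows, max_error, seed=0):
    """Build the model on one half, test on the other, compare to a threshold."""
    train, test = split_in_half(rows, seed)
    slope, intercept = fit_line(train)
    return standard_error(test, slope, intercept) <= max_error

if __name__ == "__main__":
    # Hypothetical, slightly noisy data: y is roughly 2x + 1.
    rows = [(x, 2 * x + 1 + (0.1 if x % 2 == 0 else -0.1)) for x in range(20)]
    print(data_is_good_enough(rows, max_error=1.0))
```

If the check passes, the imperfections in the data did not matter for this use; if it fails, that is the signal to go back and address data quality.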

Moving the data first, into a data lake for processing, gives us a unique opportunity to test drive the data. Data scientists and business users will be able to benefit from using the data to make better decisions. When the data quality does not meet the need, address the issues within the data. So, don’t fix it if it ain’t broke.

Follow Bill on Twitter @bigdata73

Connect with Perficient on LinkedIn here

Big Data Changes Everything – Has Your Governance Changed? Thu, 22 Jan 2015 14:00:54 +0000

A few years ago, Big Data/Hadoop systems were generally a side project for either storing bulk data or for analytics. But now as companies  have pursued a data unification strategy, leveraging the Next Generation Data Architecture, Big Data and Hadoop systems are becoming a strategic necessity in the modern enterprise.

Big Data and Hadoop are technologies with so much promise and a very broad and deep value proposition. But why are enterprises struggling to see real-world results from their Big Data investments? Simply put, it is governance. In the rush to get Big Data systems deployed, enterprises are choosing to govern them with processes that were intended for legacy data warehouse systems (or even worse, transactional systems). Actually, if most organizations were honest with themselves, they would realize that their less-than-agile data delivery and the limited business intelligence value created from these legacy systems were a result of the governance processes, not the technology.

Now that Big Data and Hadoop have become mainstream, things are getting worse. Governance processes, frameworks, and approaches that were developed 20 years ago were not meant to manage the volume, variety, and velocity of data required to be successful in the new digital economy. Imagine trying to identify data stewards for 10,000 tables and a few million data elements.

Does your organization have governance for your Big Data capabilities? Can you answer “yes” to the following questions?

  • Do you have a data delivery process that has been tailored to your organization’s Big Data capabilities?
  • Have you implemented a pattern-based delivery process?
  • Have you changed your Data Governance process to govern data based on the data’s value and usage?
  • Have you established governance of citizen integrators and business content authors?
  • Have you adopted capability management approaches for maturing the supported use-cases and functions of your Big Data environment?

If you have answered “no” to most of these questions, your organization should probably assess the adequacy of its Big Data governance.

Follow Bill on Twitter @bigdata73 for other insights and industry news.
