Perficient Enterprise Information Solutions Blog

Archive for the ‘Emerging BI Trends’ Category

Hadoop, Spark, Cassandra, Oh My!

On Monday, I reviewed why Spark will not by itself replace Hadoop, but how Spark combined with other data storage and resource management technologies creates additional options for managing Big Data. Today, in this blog post, we will look at how an enterprise should proceed in this new, “Hadoop is not the only option” world.

Open-source Hadoop and NoSQL are moving pretty fast. No wonder some companies might feel like Dorothy in the Land of Oz. To make sense of things and find the yellow brick road, we first need to understand Hadoop’s market position.
HDFS and YARN can support the storage of just about any type of data. With Microsoft’s help, it is now possible to run Hadoop on Windows and leverage .NET, although most deployments run on Linux and Java. HBase, Hive, and Cassandra can all run on top of Hadoop, and Hadoop and Hive support is quickly becoming ubiquitous across data discovery, BI, and analytics tool sets. Hadoop is also maturing fast from a security perspective: thanks to Hortonworks, Apache Knox and Ranger have delivered enterprise security capabilities, and Cloudera and IBM both have their own enterprise security stories as well. WANdisco provides robust multi-site data replication and state-of-the-art data protection. The bottom line is that Hadoop has matured and continues to mature, and there is an extensive amount of support from most vendors and related Apache open-source projects.   Read the rest of this post »

Will Spark Replace Hadoop?

I have seen a number of articles asking whether Apache Spark will replace Hadoop. This is the wrong question! It is like asking whether your DVD player will replace your entire home theater system, which is pretty absurd. Just as a home theater system has many components (a TV or projector, a receiver, a DVD player, and speakers), Hadoop has different components. Specifically, Hadoop has three main components:

  1. File System – Hadoop Distributed File System (HDFS) – the scalable and reliable information storage layer
  2. Resource Management – YARN – the resource manager that schedules processing across the cluster
  3. Cluster Computing Framework – originally MapReduce; now there are other options such as Tez, Storm, and Spark (see the sketch that follows this list)
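
To make the separation of layers concrete, here is a minimal PySpark sketch of a word-count job that touches all three components: HDFS for storage, YARN for resource management, and Spark as the cluster computing framework in place of MapReduce. The file paths and application name are illustrative assumptions, not anything from the original post.

```python
# Minimal word-count sketch: Spark as the computing framework running on
# YARN (resource management) over HDFS (storage). Paths and names below
# are illustrative assumptions.
from pyspark.sql import SparkSession

# Typically launched with: spark-submit --master yarn wordcount.py
spark = SparkSession.builder.appName("wordcount-on-yarn").getOrCreate()

# Read from HDFS, the storage layer.
lines = spark.sparkContext.textFile("hdfs:///data/logs/events.txt")

# Process with Spark, the cluster computing framework (in place of MapReduce).
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

# Write the results back to HDFS.
counts.saveAsTextFile("hdfs:///data/output/wordcount")
spark.stop()
```

Submitted to YARN, this job relies on the same storage and resource management layers that a MapReduce job would use; only the computing framework differs.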

Read the rest of this post »

Analytical Talent Gap

As new companies embark on Digital Transformation initiatives leveraging Big Data, key concerns and challenges are amplified, especially in the near term, before the supply of technology and talent adjusts to demand. Looking at the earlier post Big Data Challenges, the top three concerns were:

  1. Identifying the Business value/Monetizing the Big Data
  2. Setting up the Governance to manage Big Data
  3. Availability of skills

    Source: McKinsey

Big Data skills can be broadly classified into four categories:

  • Business / Industry Knowledge
  • Analytical Expertise
  • Big Data Architecture
  • Big Data Tools (Infrastructure management, Development)

Value creation, or monetizing Big Data (see Architecture Needed to Monetize APIs), depends on business and analytical talent, and the talent gap is especially acute in the analytical area. Closing that shortage by educating staff and augmenting teams through partner companies is critical for niche, must-have technologies. As tools evolve, keeping up with the architecture also becomes very important, since past tool and platform shortcomings are often addressed with new complexities.

While businesses continue to search for Big Data gold, system integrators and product vendors are perfecting methods, best practices, and modern architectures to shrink time to market. How much of the gap can be closed depends on many factors on the part of both companies and their partners.

See also our webinar on: Creating a Next-Generation Big Data Architecture

IT (Operational) vs. BT (Business Technology) Investments

IT spending is primarily focused on technologies to run the business, mainly operations. With new ways of doing business, technology platforms decide the winners and losers; think of traditional brick-and-mortar retailers versus online stores. If you look at a typical CIO’s budget, more than 70% goes to operational systems, infrastructure, and keep-the-lights-on applications, and the rest is spent on customer-facing applications and systems.

With Digital Transformations happening at many enterprises, the IT budget is shifting to track the trend. Customer experience is one of the key strategies of successful companies, and with smartphones and other tools for accessing information, customers are one step ahead of traditional organizations. Investing in new technologies such as Big Data, fast analytics, and proactive customer experience strategies built on converging technologies is no longer futuristic; these capabilities have to be fully functional now.


CIOs are looking for ways to invest in new technologies that enhance customer experience and leverage data (internal and external) to deliver it accurately, not just to run operational systems. As more CIOs are invited to the business leadership table, business technology investment becomes a strategic asset to manage and leverage in delivering a greater customer experience (see the spending shift in CIOs Face the “Age of the Customer”).

Connect with us on LinkedIn here

Managing Data in the Digital Transformation Era

Managing data has been a challenge irrespective of the size of the company. Over the last couple of decades, most companies invested in leveraging their enterprise data through a variety of initiatives such as Enterprise Data Warehousing and Business Intelligence.


If we look at how enterprise information has been used over the last two decades, it falls primarily into the following categories:

  • Operational Efficiency
  • Direct marketing (Junk mail, email, telemarketing)
  • Customer experience (Satisfaction surveys and incentive programs)

Data usage is typically managed in silos, and customer touch-points are not coordinated or even understood.

Today, data does not reside within the enterprise alone. Customers are armed with smartphones, tablets, and other gadgets and engage through multiple channels. The amount of information available for immediate consumption, and the need to deliver appropriate responses to it, is a big challenge, especially if modernization (Digital Transformation) is not addressed.

Enterprise data continues to have its own challenges (security, growing volume, and the need for fast analytics), while customer interaction data is exploding across multiple channels. Managing all of these challenges means building a new architecture (tools, applications, platforms). Investing in Digital Transformation initiatives is key to building that modern architecture, which is vital to the survival of the enterprise. Approaching and understanding the nuances of Digital Transformation should be a top priority for 2015 for any modern enterprise.

Image source: IBM

Data Quality – Don’t Fix It If It Ain’t Broke

What is broke?  If I drive a pickup truck around that has a small, unobtrusive crack in the windshield and a few dings in the paint, it will still pull a boat and haul a bunch of lumber from Home Depot. Is the pickup broke if it still meets my needs?

So, when is data broke? In our legacy data integration practices, we would profile data and identify all that was wrong with it. Orphan keys, inappropriate values, and incomplete data (to name a few) would be identified before data was moved. In more stringent organizations, data would need to be near perfect before it could be used in a data warehouse. This ideal world of perfect data was strived for, but rarely attained. It was too expensive, required too much business buy-in, and lengthened BI and DW projects.   Read the rest of this post »
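
As a rough sketch of what that legacy profiling looks like in practice, here is a minimal PySpark example of the kinds of checks described above: orphan keys, inappropriate values, and incomplete data. The table paths and column names are hypothetical.

```python
# Minimal data-profiling sketch: orphan keys, out-of-range values, and
# incomplete records. Paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

orders = spark.read.parquet("hdfs:///staging/orders")
customers = spark.read.parquet("hdfs:///staging/customers")

# Orphan keys: orders whose customer_id has no matching customer record.
orphans = orders.join(customers, on="customer_id", how="left_anti")

# Inappropriate values: quantities outside a plausible range.
bad_values = orders.filter((F.col("quantity") <= 0) | (F.col("quantity") > 10000))

# Incomplete data: required fields that are missing.
incomplete = orders.filter(F.col("order_date").isNull() | F.col("customer_id").isNull())

print("orphan keys:", orphans.count())
print("inappropriate values:", bad_values.count())
print("incomplete rows:", incomplete.count())

spark.stop()
```

Whether any of these counts mean the data is “broke” depends on the use: a handful of orphan keys may be fine for a trend dashboard but unacceptable for financial reconciliation.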

Big Data, Big Confusion



Everyone wants a piece of the Big Data action, whether you are part of a product company, a solution provider, IT, or a business user. Like every new technology, Big Data is confusing, complex, and intimidating. Though the idea is intriguing, the confusion begins when the techies start taking sides and touting the underlying tools rather than the solution. But the fact is that picking the right architecture (tools, platforms) does matter. It involves several considerations, from understanding which technologies are appropriate for the organization to understanding the total cost of ownership.

When you look at organizations embarking on a Big Data initiative, most fall into one of the following three types.


Multiple Tools and Deployments

Have experimented with several tools, with multiple deployments done on multiple platforms by multiple business units or subsidiaries. They own several tool licenses and have built, or are currently experimenting with, several data applications. Many data management applications are in production.

Loosely Centralized /Mostly De-centralized

Has an enterprise focus, but business-unit and departmental data applications are in use, along with several tools purchased over the years across various business units and departments. Many data management applications are in production.

No major Data Applications

Yet to invest in major data applications; mostly rely on reports and spreadsheets.

In all of the above scenarios, IT leaders can make a big difference in shaping the vision for the Big Data journey. For many organizations, Big Data projects have been experimental, and the pressure to deliver tangible results is very high, so an optimal tools strategy and standards typically take a back seat. At some point, however, they become a priority. The opportunity to focus on vision and strategy is easier to sell when a leadership change occurs within the organization. If you are the new manager brought in to tackle Big Data, it is your chance to use your first 90 days to formulate the strategy rather than get sucked into business as usual. Using that window to formulate a strategy for platform and tools standardization is not only prudent but also has a greater chance of approval. This strategic focus is critical for continued success and for avoiding investments with low returns.

The options within Big Data are vast. Everyone from vendors with legacy products to startup companies offers solutions. Traversing the maze of products without the help of the right partners can lead to false starts and big project delays.


The New Data Integration Paradigm

Data integration has changed. The old way of extracting data, moving it to a new server, transforming it, and then loading it into a new system for reporting and analytics now looks quite arcane. It’s expensive, time consuming, and does not scale to handle the volumes we are now seeing in the digitally transformed enterprise.

We saw this coming, with pushdown optimization and the early incarnations of Extract, Load, and Transform (ELT). Both of these architectural solutions were used to address scalability.

Hadoop takes this to the next step: the whole basis of Hadoop is to process the data where it is stored. Actually, this is bigger than Hadoop. The move to cloud data integration will require processing to be completed where the data is stored as well.

To understand how a solution may scale in a Hadoop- or cloud-centric architecture, one needs to understand where processing happens in relation to where the data is stored. To do this, ask vendors three questions:

  1. When is data moved off of the cluster? — Clearly understand when data is required to be moved off of the cluster. In general, the only time we should be moving data is to deliver it to a downstream operational system to be consumed. Put another way, data should not be moved to an ETL or data integration server for processing and then moved back to the Hadoop cluster.
  2. When is data moved from the data node? — Evaluate which functions require data to be moved off of the data node to a name or resource manager node. Tools that utilize Hive are of particular concern, since anything pushed to Hive for processing inherits the limitations of Hive. Earlier versions of Hive required data to be moved through the name node for processing; although Hive has made great strides in pushing processing to the data nodes, there are still limitations.
  3. On the data node, when is data moved into memory? — Within a Hadoop data node, disk I/O is still a limiting factor. Technologies that require data to be written to disk after each task is completed can quickly become I/O bound on the data node, and solutions that load all data into memory before processing may not scale to higher volumes.

Of course, there is much more to be evaluated; however, choosing technologies that keep processing close to the data, instead of moving data to the processing, will smooth the transition to the next-generation architecture.   Follow Bill on Twitter @bigdata73
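
As a rough illustration of that principle, here is a minimal PySpark sketch in which the filtering and aggregation run on the cluster nodes where the data lives, and only the small finished result is written out for downstream consumption. The paths and column names are hypothetical, not from the post.

```python
# Keep processing close to the data: filter and aggregate on the cluster,
# export only the finished result. Paths and columns are illustrative
# assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("process-where-stored").getOrCreate()

# Source data stays on the cluster; executors read it on the data nodes
# instead of shipping it to an external ETL server.
orders = spark.read.parquet("hdfs:///warehouse/orders")

# Filtering and aggregation are distributed across the data nodes.
daily_revenue = (
    orders.filter(F.col("status") == "SHIPPED")
          .groupBy("order_date")
          .agg(F.sum("amount").alias("revenue"))
)

# Only the small aggregate leaves the cluster, for example to feed a
# downstream operational system or reporting database.
daily_revenue.write.mode("overwrite").parquet("hdfs:///serving/daily_revenue")

spark.stop()
```

The same questions in the list above apply to any tool that generates this kind of pipeline for you: where does each step execute, and when does data cross the cluster boundary?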

Analytics in the Digital Transformation Era

Successful enterprises compete on many capabilities, ranging from product excellence to customer service and marketing, to name a few. Increasingly, the back office / Information Technology (IT) is becoming a strategic player in the digital business model that supports these key capabilities. In other words, the back-office/IT capability itself is becoming a differentiator. All of the key strategies, such as customer excellence, product excellence, and market segmentation, depend on a successful digital business model.

Having more data, especially noisy data, is complex to deal with, and new platforms and tools are a must to make it manageable. Working with internally captured enterprise data to answer strategic questions like “Should there be a pricing difference between life, annuities, and long-term care?” or to set a benchmark for “servicing cost per policy for life, annuities, and long-term care” can only go so far. Ingesting and integrating external data, including machine data, will change the way pricing and segmentation are done today.

In the technology space, a wide variety of capabilities, from tools, platforms, and architectures that offer time-to-market advantages to leading-edge predictive and prescriptive models, enables the business to operate and execute efficiently. What this all means is that the business has to embrace a Digital Transformation that is happening faster than ever.

Traditional Analytics


Key strategies from IT should include two kinds of applications and platforms for dealing with new and old analytical methods. The first kind handles slow-moving, traditional enterprise data, which ends up in the warehouse and is made available for “what happened” questions, traditional reporting, and business intelligence and analytics.

Fast Analytics

The second kind is the real-time analytical response to the interactive customer, keeping in constant touch through multiple channels while providing a seamless interaction and user experience. The technologies, platforms, architectures, and applications are different for these two types of processing.

In the new world of information management, traditional enterprise applications and the data warehouse become just another source, rather than the complete source, of data. Even the absence of data is relevant information if the context is captured. Analytics is becoming more real-time, with adaptive algorithms influencing different outcomes based on contextual data. Building modern information platforms to address these two different needs of the enterprise is becoming the new standard.
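
As a rough illustration of the two styles, here is a minimal PySpark sketch: the first query answers a “what happened” question over warehoused data in batch, while the second keeps a continuously updated, near-real-time view of the same kind of events. The source paths, Kafka topic, and payload format are hypothetical.

```python
# Two styles of analytics on similar data: batch ("what happened") versus
# streaming (near real time). Paths, topic, and columns are illustrative
# assumptions; the streaming source requires the Spark Kafka connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-fast").getOrCreate()

# Traditional analytics: a batch query over historical, warehoused data.
history = spark.read.parquet("hdfs:///warehouse/interactions")
history.groupBy("channel").count().show()

# Fast analytics: the same aggregation kept up to date as events arrive.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "customer-interactions")
    .load()
)

by_channel = (
    stream.selectExpr("CAST(value AS STRING) AS channel")  # simplified payload
          .groupBy("channel")
          .count()
)

query = (
    by_channel.writeStream
    .outputMode("complete")
    .format("console")  # in practice, a dashboard or serving store
    .start()
)
query.awaitTermination()
```

The point is not the specific tools but that the batch and streaming paths have different latency, state, and delivery requirements, which is why the platforms and architectures behind them differ.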

The Industrialization of Advanced Analytics

Gartner recently released its predictions on this topic in a report entitled “Predicts 2015: A Step Change in the Industrialization of Advanced Analytics.” The report has interesting and important implications for all companies aspiring to become more of a digital business; it states that failing to industrialize advanced analytics impacts mission-critical activities such as acquiring new customers, cross-selling more, and predicting failures or demand.

Specifically, business, technology, and BI leaders must consider:

  • Developing new use cases using data as a hypothesis generator, data-driven innovation, and new approaches to governance
  • The emergence of analytics marketplaces, which Gartner predicts will be offered in a Platform as a Service (PaaS) model by 25% of solution vendors by 2016
  • Solutions based on the following parameters: optimum scalability, ease of deployment, micro- and macro-collaboration, and mechanisms for data optimization
  • Convergence of data discovery and predictive analytics tools
  • Expanding technologies advancing analytics solutions: cloud computing, parallel processing, and in-memory computing
  • “Ensemble learning” and “deep learning” (see the sketch after this list). The former is defined as synergistically combining predictive models through machine-learning algorithms to derive a more valuable single output from the ensemble; the latter achieves higher levels of classification and prediction accuracy through additional processing layers in neural networks.
  • Data lakes (raw, largely unfiltered data) vs. data warehouses, and solutions for enabling exploration of the former and improving business optimization for the latter
  • Tools that bring data science and analytics to “citizen data scientists”, who’ll soon outnumber skilled data scientists 5-to-1
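
As a rough illustration of the “ensemble learning” idea mentioned above, here is a minimal scikit-learn sketch that combines several predictive models into a single output; the dataset is synthetic and the model choices are arbitrary.

```python
# Minimal ensemble-learning sketch: combine several predictive models into
# one, usually more accurate, output. Synthetic data, arbitrary models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted probabilities across the models
)

ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```

Deep learning, by contrast, adds processing layers inside a single neural network rather than combining separate models.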

Leaders in the emerging analytics marketplace include:

  • Microsoft with its Azure Machine Learning offering
  • IBM with its Bluemix offering

Finally, strategy and process improvement, while fundamental and foundational, aren’t enough. The volume and complexity of big data, along with the convergence of data science and analytics, require technology-enabled business solutions to transform companies into effective digital businesses. Perficient’s broad portfolio of services, intellectual capital, and strategic vendor partnerships with emerging and leading big data, analytics, and BI solution providers can help.