by January 22nd, 2015
Everyone wants a piece of the Big Data action, whether you are part of a product company, a solution provider, IT, or the business. Like every new technology, Big Data is confusing, complex, and intimidating. Though the idea is intriguing, the confusion begins when the techies start taking sides and touting the underlying tools rather than the solution. But the fact is that picking the right architecture (tools and platforms) does matter. It involves several considerations, from understanding which technologies are appropriate for the organization to understanding the total cost of ownership.
When you look at organizations embarking on a Big Data initiative, most fall into one of the following three types.
Have experimented with several tools, with multiple deployments done on multiple platforms by multiple business units or subsidiaries. Own several tool licenses, have built several data applications or are currently experimenting, and have many data management applications in production.
Loosely Centralized / Mostly Decentralized
Has an enterprise focus, but business unit and departmental data applications are in use, along with several tools purchased over the years across various business units and departments. Many data management applications are in production.
No major Data Applications
Yet to invest in major data applications; mostly rely on reports and spreadsheets.
In all of the above scenarios, IT leaders can make a big difference in shaping the vision for embarking on a Big Data journey. For many organizations, Big Data projects have been experimental, and the pressure to deliver tangible results is very high, so an optimal tools strategy and standards typically take a back seat. At some point, however, they become a priority. The opportunity to focus on vision and strategy is easier to sell when a leadership change occurs within the organization. If you are the new manager tasked with tackling Big Data, it is your chance to use your first 90 days to formulate the strategy rather than get sucked into business as usual. Utilizing these moments to formulate a strategy for platform and tools standardization is not only prudent but also presents a greater opportunity for approval. This strategic focus is critical for continued success and for avoiding investments with low returns.
The options within Big Data are vast. Vendors ranging from those with legacy products to startups offer numerous solutions. Traversing the maze of products without the help of the right partners can lead to false starts and big project delays.
by January 20th, 2015
Data integration has changed. The old way of extracting data, moving it to a new server, transforming it, and then loading it into a new system for reporting and analytics now looks quite arcane. It's expensive, time-consuming, and does not scale to handle the volumes we are now seeing in the digitally transformed enterprise.
We saw this coming with pushdown optimization and the early incarnations of Extract, Load, and Transform (ELT). Both of these architectural solutions were used to address scalability.
Hadoop has taken this to the next step: the whole basis of Hadoop is to process the data where it is stored. Actually, this trend is bigger than Hadoop. The movement to cloud data integration will require processing to be completed where the data is stored as well.
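The ELT idea is easier to see with a toy example. Instead of extracting rows to an integration server, transforming them in application code, and loading the result back, the transformation is pushed down as SQL and executed where the data lives. A minimal sketch using SQLite as a stand-in for any data store (the table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 100.0, "east"), (2, 250.0, "west"), (3, 75.0, "east")])

# ETL style: extract every row to the integration server, transform in
# application code, then load the result back -- the data moves twice.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM orders"):
    totals[region] = totals.get(region, 0.0) + amount

# ELT / pushdown style: ship the transformation to the engine instead,
# so the data never leaves the store.
conn.execute("""CREATE TABLE region_totals AS
                SELECT region, SUM(amount) AS total
                FROM orders GROUP BY region""")
pushed = dict(conn.execute("SELECT region, total FROM region_totals"))

assert totals == pushed  # same result, far less data movement
```

The same principle scales up: the less data crosses the network to a processing tier, the better the architecture handles growing volumes.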
To understand how a solution may scale in a Hadoop- or cloud-centric architecture, one needs to understand where processing happens relative to where the data is stored. To do this, one needs to ask vendors three questions:
- When is data moved off of the cluster? – Clearly understand when data is required to be moved off of the cluster. In general, the only time we should be moving data is to feed a downstream operational system that will consume it. Put another way, data should not be moved to an ETL or data integration server for processing and then moved back to the Hadoop cluster.
- When is data moved from the data node? – Evaluate which functions require data to be moved off of the data node to a name or resource manager. Tools that utilize Hive are of particular concern since anything that is pushed to Hive for processing will inherit the limitations of Hive. Earlier versions of Hive required data to be moved through the name node for processing. Although Hive has made great strides pushing processing to the data nodes, there still are limitations.
- On the data node, when is data moved into memory? – Within a Hadoop data node, disk I/O is still a limiting factor. Technologies that require data to be written to disk after each task is completed can quickly become I/O bound on the data node. Other solutions that load all data into memory before processing may not scale to higher volumes.
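The memory concern behind that last question can be shown in miniature: a pipeline that materializes the whole dataset before processing holds everything in memory at once, while a streaming pipeline touches one record at a time. A toy sketch (the function names and data are illustrative, not any vendor's API):

```python
def load_all_then_sum(records):
    # Materializes the full dataset first -- memory grows with input size,
    # analogous to engines that stage all data in RAM before processing.
    data = list(records)
    return sum(data)

def streaming_sum(records):
    # Processes one record at a time -- memory use stays constant,
    # analogous to engines that pipeline work across the data nodes.
    total = 0
    for r in records:
        total += r
    return total

n = 1_000_000
# Both produce the same answer; only the memory profile differs.
assert load_all_then_sum(range(n)) == streaming_sum(range(n)) == n * (n - 1) // 2
```

Both approaches look identical on a small test set; the difference only appears when volumes grow, which is exactly why this question belongs in a vendor evaluation.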
Of course there is much more to be evaluated; however, choosing technologies that keep processing close to the data, instead of moving data to the processing, will smooth the transition to the next-generation architecture. Follow Bill on Twitter @bigdata73
by December 23rd, 2014
Successful enterprises compete on many capabilities: product excellence, customer service, and marketing, to name a few. Increasingly, the back office / Information Technology (IT) is becoming a strategic player in the Digital Business Model which supports these key capabilities. In other words, back office / IT capability itself is becoming a differentiator. All of the key strategies, like Customer Excellence, Product Excellence, and Market Segmentation, depend on a successful Digital Business Model.
Having more data, especially noisy data, is complex to deal with, and new platforms and tools are a must to make it manageable. Working with internally captured enterprise data to answer strategic questions like "Should there be a pricing difference between life, annuities, and long-term care?" or to set up a benchmark for "servicing cost per policy for life, annuities, and long-term care" can only go so far. Ingesting and integrating external data, including machine data, will change the way pricing and segmentation are done today.
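Mechanically, the "servicing cost per policy" benchmark above is a grouped average over policy servicing records. A minimal sketch with made-up numbers (the figures and field names are hypothetical, for illustration only):

```python
from collections import defaultdict

# Hypothetical servicing records: (product line, servicing cost in dollars)
records = [
    ("life", 120.0), ("life", 80.0),
    ("annuities", 200.0), ("annuities", 240.0),
    ("long-term care", 300.0),
]

# Accumulate total cost and policy count per product line.
totals = defaultdict(lambda: [0.0, 0])
for product, cost in records:
    totals[product][0] += cost
    totals[product][1] += 1

# Benchmark: average servicing cost per policy, by product line.
cost_per_policy = {p: round(t / n, 2) for p, (t, n) in totals.items()}
print(cost_per_policy)
```

The point of the paragraph is that this internal view is only a starting point; blending in external and machine data changes what can go into `records` in the first place.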
The technology space offers a wide variety of capabilities, from tools, platforms, and architectures that improve time to market to leading-edge predictive and prescriptive models that enable the business to operate and execute efficiently. What this all means is that the business has to embrace a digital transformation that is happening faster than ever.
Key strategies from IT should include two kinds of applications and platforms, covering both new and traditional analytical methods. The first kind deals with slow-moving or traditional enterprise data, which ends up in the warehouse and is made available for "What happened?" questions, traditional reporting, and business intelligence / analytics.
The second kind is the real-time analytical response to the interactive customer, keeping in constant touch through multiple channels while providing seamless interaction and user experience. Technologies, platforms, architecture and applications are different for these two types of processing.
In the new world of information management, traditional enterprise applications and the data warehouse become another source rather than the complete source of data. Even the absence of data is relevant information if the context is captured. Analytics is becoming more real-time, with adaptive algorithms influencing different outcomes based on the contextual data. Building modern information platforms to address these two different needs of the enterprise is becoming the new standard.
by December 16th, 2014
Gartner recently released its predictions on this topic in a report entitled, “Predicts 2015: A Step Change in the Industrialization of Advanced Analytics”. This has very interesting and important implications for all companies aspiring to become more of a digital business. The report states that failure to do so impacts mission-critical activities such as acquiring new customers, doing more cross-selling and predicting failures or demand.
Specifically, business, technology and BI leaders must consider:
- Developing new use cases using data as a hypothesis generator, data-driven innovation and new approaches to governance.
- Emergence of analytics marketplaces, which Gartner predicts will be more commonly offered in a Platform as a Service model (PaaS) by 25% of solution vendors by 2016
- Solutions based on the following parameters: optimum scalability, ease of deployment, micro-collaboration and macro-collaboration and mechanisms for data optimization
- Convergence of data discovery and predictive analytics tools
- Expanding technologies advancing analytics solutions: cloud computing, parallel processing and in-memory computing
- "Ensemble learning" and "deep learning". The former is defined as synergistically combining predictive models through machine-learning algorithms to derive a more valuable single output from the ensemble. In comparison, deep learning achieves higher levels of classification and prediction accuracy through the development of additional processing layers in neural networks.
- Data lakes (raw, largely unfiltered data) vs data warehouses and solutions for enabling exploration of the former and improving business optimization for the latter
- Tools that bring data science and analytics to “citizen data scientists”, who’ll soon outnumber skilled data scientists 5-to-1
Leaders in the emerging analytics marketplace include:
- Microsoft with its Azure Machine Learning offering
- For further info, check out: https://blogs.perficient.com/microsoft/2014/12/azure-ml-on-the-forefront-of-advanced-analytics/
- IBM with its Bluemix offering
Finally, strategy and process improvement, while being fundamental and foundational, aren’t enough. The volume and complexity of big data along with the convergence between data science and analytics requires technology-enabled business solutions to transform companies into effective digital businesses. Perficient’s broad portfolio of services, intellectual capital and strategic vendor partnerships with emerging and leading big data, analytics and BI solution providers can help.
by December 2nd, 2014
We are almost at the end of 2014. Time to check out the 2015 trends and compare with what has been the focus in 2014. Looking at the top 10 trends in Information Management, some things have changed and some have moved up or down the list.
However, the same old challenges pretty much remain. We saw a significant emphasis on Data Visualization and Big Data push in 2014 and this trend will continue.
Big Data remains in the top 10 in some shape or form, and virtualization and cloud management are getting complex, which is something organizations have to deal with. In particular, hybrid cloud is becoming a part of the Enterprise Architecture fabric.
The common theme in all these trends is complexity and the security / governance aspects. Data sources, creation, and management look a lot different than they did five years ago. Enterprise data is no longer confined to firewalls and corporate data centers. Data centers continue to evolve, and applications continue to reside outside the norm. Ownership, responsibility, quality, and trustworthiness are becoming truly complex. Knowing what to trust, and filtering the noise from the real information, is becoming part art and part science.
The new era of data centers includes cloud infrastructure (public and private), traditional enterprise data centers, cloud applications, and accessibility through a variety of devices, including personal devices. Forging a security framework and governing the data becomes a lot more critical and urgent.
Having a disciplined governance organization with the agility to respond to and manage business information becomes a critical component of successful information management. As complexity, vulnerability, and risk increase, forming and managing the policies to secure corporate data is vital. Governing the information goes beyond the responsibility of Information Technology. Gone are the days when the business could hand over a wish list and IT would build an application. Business and IT have to work closely to create governance policies and procedures to tackle this paradigm shift.
by December 2nd, 2014
With the advent of Splice Machine and the release of Hive 0.14 we are seeing Hadoop’s role in the data center continue to grow. Both of these technologies support limited transactions against data stored in HDFS.
Now, I would not suggest moving your mission-critical ERP systems to Hive or Splice Machine, but the support of transactions is opening up Hadoop to support more use cases, especially those use cases supported by RDBMS based data warehouses. With transaction support there is a more elegant way to handle slowly changing dimensions of all types in Hadoop now that records can be easily updated. Fact tables with late-arriving information can be updated in place. With transactional support, Master Data can be supported more efficiently. The writing is on the wall: more and more of the functionality that has been historically provided by the data warehouse is now moving to the Hadoop cluster.
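The slowly changing dimension point is easier to see concretely. Here is a minimal sketch of a Type 2 change, using SQLite as a stand-in for a transactional store (Hive 0.14's ACID tables enable the same UPDATE-then-INSERT pattern; the schema and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_id INTEGER, city TEXT, is_current INTEGER)""")
conn.execute("INSERT INTO dim_customer VALUES (42, 'Austin', 1)")

# Type 2 change: close out the current row, then insert the new version.
# Without UPDATE support this means rewriting whole files or partitions;
# with transaction support it is two statements in one atomic unit.
with conn:
    conn.execute("""UPDATE dim_customer SET is_current = 0
                    WHERE customer_id = 42 AND is_current = 1""")
    conn.execute("INSERT INTO dim_customer VALUES (42, 'Denver', 1)")

current = conn.execute(
    "SELECT city FROM dim_customer WHERE customer_id = 42 AND is_current = 1"
).fetchone()[0]
assert current == "Denver"  # history is preserved in the closed-out row
```

The same two-statement pattern applies to late-arriving fact updates: an in-place UPDATE replaces what used to be a full partition rewrite.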
To address this ever-changing environment, enterprises must have a clear strategy for evolving their Big Data capabilities within their enterprise architecture. This Thursday, I will be hosting a webinar, "Creating the Next-Generation Big Data Architecture," where we will discuss Hadoop's different roles within a modern enterprise's data architecture.
by November 13th, 2014
Big Data is a big deal. Every vendor has a strategy and a suite of products. Navigating the maze and picking the right Big Data platform and tools takes some level of planning and looking beyond a techie's dream product suite. Compounding the issue is the choice between pure open source and a vendor version of the open source. Like every other new technology, product shakedowns will happen sooner or later. Picking a suite now is like betting on the stock market: exercising caution and being conservative with a long-term outlook will pay off.
Organizations tend to follow the safe route of sticking with the big-vendor strategy, but the downside is securing the funding and enduring a procurement phase that can take forever to win approval. The hard part is knowing the product landscape, assessing the strengths of each type of solution, and prioritizing the short-term and long-term strategy.
I have seen smaller companies build their entire solution on an open source stack without paying a penny for software. Obviously, the risks and rewards play out. Training resources, and hiring trained resources from the marketplace, is a huge factor as well. Open source still has the same issues of versions, bugs, and compatibility, so having a knowledgeable team makes a big difference in managing the environment and the overall quality of the delivery.
But despite the confusion, there is good news. If you are in the process of figuring out how you want to play the Big Data game, big and small vendors alike will provide you with a sandbox or dev environment almost free or for a limited duration. Leveraging this option as part of the Big Data strategy will save not only money but also time on the learning curve. IBM Bluemix is one example; so are Cloudera and Datastax, and the list is growing.
To maximize the benefit, follow the basic portfolio management strategy.
- Take an inventory of tools already available within the organization
- Identify the products which will play better with the existing tools
- Figure out the business case and the types of tools needed to get a successful POC
- Match the product selection with resource knowledge base
- Get as much help from external sources (a lot of them can be free, if you have the time) from training to POC
- Start small and use it to get buy-in for the larger project
- Invest in developing the strategy with POC to uncover the benefits and to build strong business case
Combining this strategy with a little external help to narrow down the selection, and avoiding the pitfalls based on industry experience, will add tremendous value in navigating the complex selection process. Time to market can be drastically cut down, especially when you make use of a DevOps platform on the cloud.
The direct benefits in leveraging the try-before-buy options are:
- No Hardware / wait time or IT involvement for setting up the environment
- All the tools are available and ready to test
- Pricing and the product stack can be validated up front, rather than finding out later that you need to buy one more product that is not in the budget
- Time to market is drastically cut down
- Initial POC and Business Case can be built with solid proof
- Throwaway work can be minimized
Looking at all the benefits, it is worth taking this approach, especially if you are in the initial stages and want proof before asking for millions that are hard to justify.
by November 11th, 2014
In part 1 of this series, we discussed some of the most common assumptions associated with Big Data Proof of Concept (POC) projects. Today, we’re going to begin exploring the next stage in Big Data POC definition – “The What.”
The ‘What’ for Big Data has gotten much more complicated in recent years; and now involves several key considerations:
- What business goals are involved – this is perhaps the most important part of defining any POC yet strangely is often ignored in many POC efforts.
- What scope is involved – for our purposes this means how much of the potential solution architecture will be evaluated. This can be highly targeted (database layer only) or can be comprehensive (an entire multi-tiered stack).
- What technology is involved – this one is tricky because often times people view a POC only in the context of proving a specific technology (or technologies). However, our recommended approach involves aligning technologies and business expectations up front – thus the technology isn’t necessarily the main driver. Once the goals are better understood then selecting the right mix of technologies becomes supremely important. There are different types of Big Data databases and a growing list of BI platforms to choose from – these choices are not interchangeable – some are much better tailored for specific tasks than others.
- What platform is needed – this is one of the first big technical decisions associated with both Big Data and Data Warehouse projects these days. While Big Data evolved sitting atop commodity hardware, now there are a huge number of device options and even Cloud platform opportunities.
- What technical goals or metrics are required – this consideration is of course what allows us to determine whether we’ve achieved success or not. Often times, organizations think they’re evaluating technical goals but don’t develop sufficiently detailed metrics in advance. And of course this needs to be tied to specific business goals as well.
Big Data POC Architecture views
Once we get through those first five items, we’re very close to having a POC Solution Architecture. But how is this Architecture represented and maintained? Typically, for this type of Agile project, there will be three visualizations:
- A conceptual view that allows business stakeholders to understand the core business goals as well as technical choices (derived from the exploration above).
- A logical view which provides more detail on some of the data structure/design as well as specific interoperability considerations (such as login between DB and analytics platform if both are present). This could be done using UML or freeform. As most of these solutions will not include Third Normal Form (3NF) relational approaches, the data structure will not be presented using ERD diagram notation. We will discuss how to model Big Data in a future post.
- There is also often a need to represent the core technical architecture – server information, network information and specific interface descriptions. This isn't quite the same as a strict data model analogy (Conceptual, Logical, Physical). Rather, this latter representation is simply the last level of detail for the overall solution design (not merely the DBMS structure).
It is also not uncommon to represent one or more solution options in the conceptual or logical views – which helps stakeholders decide which approach to select. Usually, the last view or POC technical architecture is completed after the selection is made.
There is another dimension to “The What” that we need to consider as well – the project framework. This project framework will likely include the following considerations:
- Who will be involved – both from a technical and business perspective
- Access to the capability – the interface (in some cases there won’t be open access to this and then it becomes a demo and / or presentation)
- The processes involved – what this means essentially is that the POC is occurring in a larger context; one that likely mirrors existing processes that are either manual or handled in other systems
The POC project framework also includes identification of individual requirements, the overall timeline, and specific milestones. In other words, the POC ought to be managed as a real project. The project framework also serves as part of the "How" of the POC, but at first it represents the overall parameters of what will occur and when.
So, let’s step back a moment and take a closer look at some of the top level questions from the beginning. For example, how do you determine a Big Data POC scope? That will be my next topic in this series.
copyright 2014, Perficient Inc.
by November 7th, 2014
It seems as though every large organization these days is either conducting a Big Data Proof of Concept (POC) or considering doing one. Now, there are serious questions as to whether this is even the correct path towards adoption of Big Data technologies, but of course for some potential adopters it may very well be the best way to determine the real value associated with a Big Data solution.
This week, Bill Busch provided an excellent webinar on how organizations might go through the process of making that decision or business case. For this exploration, we will assume for the sake of argument that we’ve gotten past the ‘should we do it’ stage and are now contemplating what to do and how to do it.
Capability Evolution tends to follow a familiar path…
Big Data POC Assumptions:
Everything starts with assumptions – and there are a number of good ones that could be considered universal for Big Data POCs (applicable in most places). These include the following:
- When we say ‘Big Data’ what we really mean is multiple potential technologies and maybe even an entire technology stack. The days of Big Data just being entirely focused on Hadoop are long gone. The same premise still underlies the growing set of technologies but the diversity and complexity of options have increased almost exponentially.
- Big Data is now much more focused on Analytics. This is a key and very practical consideration – re-hosting your data is one thing – re-envisioning it is a much more pragmatic or perhaps more tangible goal.
- A Big Data POC is not just about the data or programming some application or even just the Analytics – it’s about a “Solution.” As such it ought to be viewed and managed the way your typical IT portfolio is managed – and it should be architected.
- The point of any POC should not be to prove that the technology works – the fact is that a lot of other people have already done that. The point is determining precisely how that new technology will help your enterprise. This means that the POC ought to be more specific and more tailored to what the eventual solution may look like. The value of having the POC is to identify any initial misconceptions so that when the transition to the operational solution occurs it will have a higher likelihood of success. This is of course the definition of an Agile approach and avoids having to re-define from scratch after ‘proof’ that the technology works has been obtained. If done properly, the POC architecture will largely mirror what the eventual solution architecture will evolve into.
- Last but not least, keep in mind that the Big Data solution will not (in 95% of cases now, anyway) replace your existing data solution ecosystem. The POC needs to take that into account up front – doing so will likely improve the value of the solution and radically reduce the possibility of running into unforeseen integration issues downstream.
Perhaps the most important consideration before launching into your Big Data POC is determining the success criteria up front. What does this mean? Essentially, it requires you to determine the key problems that the solution is targeted to solve and coming up with metrics that can be objectively obtained from the solution. Those metrics can be focused both on technical and business considerations:
- A technical metric might be the ability to update a very large data set based on rules within a specified timeframe (consistently).
- A Business metric might be the number of user-defined reports or dashboard visualizations supported.
- And of course both of these aspects (technical and business capability) would be governed as part of the solution.
Without the POC success criteria it would be very difficult to determine just what value adopting Big Data technology might add to your organization. This represents the ‘proof’ that either backs up or repudiates the initial business case ROI expectation.
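One way to keep success criteria objective is to encode each one as an explicit threshold that the measured POC result either passes or fails, agreed before the POC starts. A minimal sketch (the metric names and numbers are invented for illustration):

```python
# Each criterion pairs the value measured during the POC with the target
# agreed up front. "lower_is_better" distinguishes latency-style metrics
# from capacity-style metrics.
criteria = {
    "bulk_update_seconds": {"measured": 42.0, "target": 60.0, "lower_is_better": True},
    "user_defined_reports": {"measured": 25, "target": 20, "lower_is_better": False},
}

def evaluate(criteria):
    """Return a pass/fail verdict for every criterion."""
    results = {}
    for name, c in criteria.items():
        if c["lower_is_better"]:
            results[name] = c["measured"] <= c["target"]
        else:
            results[name] = c["measured"] >= c["target"]
    return results

results = evaluate(criteria)
assert all(results.values())  # in this example, every criterion is met
```

Writing the thresholds down in this form before the POC begins is what makes the "proof" objective rather than a matter of post hoc interpretation.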
In my next post, we will examine the process of choosing “What to select” for a Big Data POC…
by November 4th, 2014
Tomorrow I will be giving a webinar on creating business cases for Big Data. One of the reasons for the webinar is that there is very little information available on creating a Big Data business case. Most of what is available boils down to "trust me, Big Data will be of value." Most information available on the internet basically states:
More information, loaded into a central Hadoop repository, will enable better analytics, thus making our company more profitable.
Although this statement seems logically true, and most analytical companies have accepted it, it illustrates the three most common mistakes we see in creating a business case for Big Data.
The first mistake is not directly linking the business case to the corporate strategy. The corporate strategy is the overall approach the company is taking to create shareholder value. By linking the business case to the objectives in the corporate strategy, one will be able to illustrate the strategic nature of Big Data and how the initiative will support overall company goals.