Perficient Enterprise Information Solutions Blog



Myths & Realities of Self-Service BI


The popularity of data visualization tools and cloud BI offerings makes them new forces to reckon with. I find it interesting to compare the perception of these tools with how they are actually used. Traditionally, IT prefers control and centralized management, for the obvious reasons of accountability and information quality. However, self-service BI tools and cloud offerings are accelerating departmental BI development. Some of the misconceptions from the early hype cycle are wearing off, and the realities are becoming clearer.

Let’s look at some of the myths and realities…

Myth 1: Self-Service means everyone is a report writer!

Self-service BI is pitched as the solution for faster access to data. BI product vendors assume anyone can develop and use reports, but the truth is that analysts want to analyze, not create reports or dashboards. What they need are easy ways to analyze the data, ideally visually. Self-service does not mean everyone is on their own to create reports.

Myth 2: Self-Service BI means it is end of the traditional BI!

Almost every major BI vendor and data management software player offers a visualization / in-memory tool alongside its traditional BI tools. Every tool has advantages and disadvantages based on its capabilities and usage. Forging a framework for accessing, sharing, and securing data appropriately is the key to leveraging these new technologies. IT can also learn from departmental successes, primarily the ability to create solutions quickly in a business unit's own space, and apply those techniques in the traditional BI space as well.

Myth 3: Self-Service is new!

Well, Excel has always been the king of self-service BI. It was there before "self-service BI" was a buzzword, it is here now, and it will be here for the foreseeable future. Understanding self-service BI usage and its limits will help IT and the entire organization use this spectrum of tools efficiently.

Self-service has its place and its limitations. It is great for data discovery, and who could do data discovery better than the business folks? Self-service BI is all about getting data sooner rather than later to the business power user, not necessarily the end user. Use data discovery to validate the benefit, then integrate the data into the EDW or centralized corporate store once the application is proven.

In a nutshell, self-service BI is here to stay, as it always has been, but the key is to create a balanced governance structure to manage quality, reliability, and security.


Data Staging and Hadoop

Traditionally, our information architectures have included a number of staging or intermediate data storage areas and systems. These have taken different forms over the years: publish directories on source systems, staging areas in data warehouses, data vaults, or, most commonly, data file hubs. In general, these data staging solutions have suffered from two limitations:

  1. Because of storage costs, data retention was usually limited to a few months.
  2. Since these systems were intended to publish data for system integration purposes, end users generally did not have access to staging data for analytics or data discovery.

Hadoop reduces the cost per terabyte of storage by two orders of magnitude: data that once consumed $100 worth of storage now costs about $1 to store on a Hadoop system. This radical cost reduction enables enterprises to replace sourcing hubs with Hadoop-based data lakes, where a data lake can house years of data instead of only a few months.
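To make the economics concrete, here is a back-of-the-envelope sketch in Python. The dollar figures, budget, and monthly volume are illustrative assumptions, not vendor quotes; only the two-orders-of-magnitude ratio comes from the text above.

```python
# Back-of-the-envelope retention comparison (all figures are assumptions).
SAN_COST_PER_TB = 100.0     # assumed legacy storage cost, $/TB
HADOOP_COST_PER_TB = 1.0    # assumed Hadoop cost, $/TB (two orders of magnitude lower)

def affordable_months(budget, monthly_tb, cost_per_tb):
    """How many months of staged data can a fixed budget retain?"""
    return int(budget / (monthly_tb * cost_per_tb))

budget = 1200.0    # hypothetical storage budget in dollars
monthly_tb = 2.0   # hypothetical data volume staged per month

print(affordable_months(budget, monthly_tb, SAN_COST_PER_TB))     # 6  -> a few months
print(affordable_months(budget, monthly_tb, HADOOP_COST_PER_TB))  # 600 -> decades
```

The same budget that covered half a year of retention on traditional storage covers decades on the cheaper tier, which is what makes multi-year data lakes practical.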

Next, once in the Hadoop filesystem (HDFS), the data can be published either directly to tools that consume HDFS data or through Hive (or another SQL-like interface). This enables end users to leverage analytical, data discovery, and visualization tools to derive value from data within Hadoop.
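The end-user experience this enables is plain SQL over staged data. As a minimal sketch, the snippet below uses SQLite as a stand-in for a SQL-on-Hadoop engine such as Hive (which would require a running cluster); the table, columns, and rows are made up.

```python
import sqlite3

# sqlite3 stands in here for a SQL-on-Hadoop engine such as Hive;
# in practice the rows would come from files landed in HDFS.
staged_rows = [                       # hypothetical staged source extract
    ("2014-01-15", "ORD-1", 250.00),
    ("2014-01-16", "ORD-2", 125.50),
    ("2014-01-16", "ORD-3", 80.00),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staged_orders (order_date TEXT, order_id TEXT, amount REAL)")
conn.executemany("INSERT INTO staged_orders VALUES (?, ?, ?)", staged_rows)

# End users can now run analytical SQL directly against the staged data.
for row in conn.execute(
    "SELECT order_date, SUM(amount) FROM staged_orders GROUP BY order_date ORDER BY order_date"
):
    print(row)
```

The point is that once staged data is exposed through a SQL interface, existing analytics and visualization tools can query it without any bespoke integration code.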

The simple fact that these data lakes can retain historical data and provide scalable access for analytics also has a profound effect on the data warehouse. That effect will be the subject of my next few blog posts.

Seven Deadly Sins of Database Design

This is a summary of an article from Database Trends and Applications, in which the author addresses fundamental mistakes that we make, or live with, in our database systems.

1. Poor or missing documentation for databases in PRODUCTION

We may have descriptive table and column names to begin with, but as the workforce turns over and a database grows, we can lose essential knowledge about the system.
A suggested approach is to maintain the data model in a central repository, and to run validation and quality metrics regularly to improve the quality of the models over time.

2. Little or no normalization

Putting all information in one table may make data access easier, but it is rarely the best design. Understand normalization:
• 1st Normal Form – eliminate duplicate columns and repeating groups of values within a row
• 2nd Normal Form – move subsets of data that repeat across multiple rows into separate tables, linked by keys
• 3rd Normal Form – remove columns that do not depend directly on the primary key
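A minimal sketch of what normalization buys you, using made-up order data: in the flat table the customer's city repeats on every order row, so it can drift out of sync; the normalized split stores each customer attribute exactly once.

```python
# A flat, denormalized table: customer attributes repeat on every order row.
flat_orders = [
    {"order_id": 1, "customer": "Acme", "customer_city": "Dallas", "amount": 100},
    {"order_id": 2, "customer": "Acme", "customer_city": "Dallas", "amount": 50},
    {"order_id": 3, "customer": "Beta", "customer_city": "Austin", "amount": 75},
]

# Normalized split: customer attributes live in one place; orders reference them by key.
customers = {}
orders = []
for row in flat_orders:
    customers[row["customer"]] = {"city": row["customer_city"]}
    orders.append({"order_id": row["order_id"],
                   "customer": row["customer"],
                   "amount": row["amount"]})

print(customers)  # each customer's city is now stored exactly once
```

With the split in place, updating a customer's city is a single change instead of an update to every historical order row.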

3. Not treating the data model like a living breathing organism

Many people start with a good model when designing a database, and then throw it away as soon as the application is in production. The model should be updated as often as changes are applied to the database, so that those changes are communicated effectively.

4. Improper storage of reference data

Store reference data in the model, or keep a pointer in the model to where the reference data lives. Reference data is typically scattered across several places, or worse, hard-coded in application code, making it very difficult to change this information when the need arises.

5. Not using foreign keys or check constraints

Data quality is greatly increased by defining referential integrity and validation checks right at the database level.
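A minimal sketch of both ideas using Python's built-in SQLite (table names and figures are made up): a foreign key rejects an employee assigned to a nonexistent department, and a CHECK constraint rejects a negative salary, so bad data never reaches the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this pragma

conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        dept_id INTEGER NOT NULL REFERENCES department(dept_id),  -- referential integrity
        salary  REAL CHECK (salary > 0)                           -- validation check
    )
""")
conn.execute("INSERT INTO department VALUES (1, 'Finance')")
conn.execute("INSERT INTO employee VALUES (10, 1, 55000.0)")   # valid row

for bad_row in [(11, 99, 50000.0),   # unknown department -> FK violation
                (12, 1, -5.0)]:      # negative salary -> CHECK violation
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?)", bad_row)
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)
```

Catching these violations in the database means every application, script, and ad hoc load path gets the same protection for free.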

6. Not using domains and naming standards

Domains allow you to create reusable attributes so that users don't have to recreate them each time they are needed. Naming standards increase the readability of the database and make it easier for new users to adapt to it. It is recommended to use proper, descriptive names rather than abbreviations that a user has to decipher.

7. Not choosing primary keys properly

Choose a primary key wisely, because it is painful to correct it down the line. A simple principle, SUM, is suggested when picking a primary key: Static, Unique, Minimal. A social security number, for example, may not be the best primary key in some cases: it is not always unique, and not everyone has one. :)
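The "Unique" part of SUM is easy to screen for before committing to a key. A small sketch, with made-up person records, that flags a candidate column as unusable if it has duplicates or missing values:

```python
# Screen candidate primary keys for the "U" in SUM: Unique (and present for every row).
rows = [  # hypothetical person records
    {"ssn": "111-22-3333", "email": "a@x.com", "name": "Ann"},
    {"ssn": None,          "email": "b@x.com", "name": "Bob"},  # no SSN at all
    {"ssn": "111-22-3333", "email": "c@x.com", "name": "Cy"},   # duplicate SSN
]

def is_candidate_key(rows, column):
    """A column qualifies only if every row has a distinct, non-missing value."""
    values = [r[column] for r in rows]
    return None not in values and len(values) == len(set(values))

print(is_candidate_key(rows, "ssn"))    # False: SSN fails as a key for this data
print(is_candidate_key(rows, "email"))  # True: unique in this sample
```

A profiling check like this on real data often rules out "obvious" natural keys (like SSN above) before they cause pain in production.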

Happy database modeling!

Data Science = Synergistic Teamwork

Data science is a discipline combining elements from various fields: mathematics, machine learning, statistics, computer programming, data warehousing, pattern recognition, uncertainty modeling, computer science, high-performance computing, visualization, and others.

According to Cathy O’Neil and Rachel Schutt, two luminaries in the field of data science, there are about seven disciplines that even data scientists in training can easily identify as part of their tool set:

  • Statistics
  • Mathematics
  • Machine Learning
  • Computer Science
  • Data Visualization
  • Domain Expertise
  • Communication and Presentation Skills

Most data scientists, however, are experts in only a couple of these disciplines and proficient in another two or three – that’s why Data Science is a team sport.

I’ve definitely learned the importance of teamwork in this field over the last few months, while working with Perficient’s Data Science team on a Big Data Lab.

Ultimately, the goal of Data Science is to extract meaning from data and create products from the data itself. Data is the raw material used for the study of “the generalizable extraction of knowledge”.

With data scaling up by the day, it should not come as a surprise that Big Data would play an important role in a data scientist’s work – herein lies the importance of our Big Data Lab and our teamwork.

Our Big Data Lab is the place where Data Science’s many underlying disciplines come together to create something greater than the summation of our individual knowledge and expertise – synergistic teamwork.

Disruptive Scalability

The personal computer, the internet, digital music players (think iPods), smartphones, and tablets are just a few of the disruptive technologies that have become commonplace in our lifetime. What is consistent about these disruptions is that they have all changed the way we work, live, and play. Whole industries have grown up around these technologies. Can you imagine a major corporation being competitive in today’s world without personal computers?

Big Data is another disruptive technology. It is spawning its own industry, with hundreds of startups, and every major technology vendor seems to have a “Big Data offering.” Soon, companies will need to leverage Big Data to stay competitive. The disruption Big Data brings to an enterprise’s data architecture is significant: how we source, integrate, process, analyze, manage, and deliver data will all evolve and change. Big Data truly is changing everything! Over the next few weeks I will be focusing my blogging on how Big Data is changing our enterprise information architecture, covering its effect on MDM, data integration, analytics, and overall data architecture. Stay tuned!

Thoughts on Oracle Database In-Memory Option

Last month, Oracle announced the Oracle Database In-Memory option. The overall message is that once it is installed, you can turn this “option” on and Oracle becomes an in-memory database. I do not think it will be that simple; however, I believe Oracle is on the right track with this capability.

There are two main messages in Oracle In-Memory’s vision that I view as critical capabilities in a modern data architecture. The first is the ability to store and process data based on its temperature:
hot, highly accessed data should be kept in DRAM or as close to DRAM as possible; as the temperature decreases, data can be stored on flash; and cold, rarely accessed data can live on disk (either in the Oracle database or in Hadoop). Of course, we can store data at different temperatures today. It is the second feature, making this storage transparent to the application, that makes the capability so valuable. An application programmer, data scientist, or report developer should not have to know where the data is stored. Letting the Oracle database or a DBA optimize data placement based on the cost and performance of storage, without having to consider compatibility with the application, the cluster (RAC), or recoverability, is quite powerful and useful. Yes, Oracle has been moving this way for years, but now it has most, if not all, of the pieces.
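The tiering policy itself is simple to express. A toy sketch (the thresholds and tier names are made up, not Oracle's): the application asks only for data, and a policy like this decides placement behind the scenes.

```python
def storage_tier(days_since_last_access):
    """Toy temperature-based placement policy; thresholds are illustrative only."""
    if days_since_last_access <= 7:
        return "DRAM"         # hot: keep in memory
    if days_since_last_access <= 90:
        return "flash"        # warm: cheaper, still fast
    return "disk/Hadoop"      # cold: cheapest tier

# The application never sees this decision; it is an operational detail.
print(storage_tier(1), storage_tier(30), storage_tier(400))
```

The transparency argument in the paragraph above is exactly that this function (or its real equivalent inside the database) can change without any application code changing.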

Despite the fact that the In-Memory option leverages a lot of existing core code, most Oracle IT shops will need to remember that this is a version 1 product. Plan accordingly: understand the costs and architectural impacts, implement the In-Memory option on a few targeted applications, and then develop standards for its use. A well-planned, standards-based approach will ensure that your company maximizes the return on its Oracle In-Memory investment.

A little stuffed animal called Hadoop

Doug Cutting, Hadoop’s creator, is reported to have explained how the name for his Big Data technology came about:

“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.”

The term, of course, evolved over time and almost took on a life of its own… this little elephant kept growing and growing, to the point that, nowadays, the term Hadoop is often used to refer to a whole ecosystem of projects, such as:

  1. Common – components and interfaces for distributed filesystems and general I/O
  2. Avro – serialization system for RPC and persistent data storage
  3. MapReduce – distributed data processing model and execution environment running on large clusters of commodity machines
  4. HDFS – distributed filesystem running on large clusters of commodity machines
  5. Pig – data flow language / execution environment to explore huge datasets (running on HDFS and MapReduce clusters)
  6. Hive – distributed data warehouse, manages data stored in HDFS providing a query language based on SQL for querying the data
  7. HBase – distributed, column-oriented database that uses HDFS for its underlying storage, supporting both batch-style computations and random reads
  8. ZooKeeper – distributed, highly available coordination service, providing primitives to build distributed applications
  9. Sqoop – transfer bulk data between structured data stores and HDFS
  10. Oozie – service to run and schedule workflows for Hadoop jobs
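The MapReduce model listed above can be sketched in miniature with plain Python. A real job runs the same three phases distributed across a cluster of commodity machines; here everything happens in one process, and the input documents are made up.

```python
from collections import defaultdict
from itertools import chain

docs = ["big data big deal", "data lake"]   # stand-in for files stored in HDFS

# Map: emit (word, 1) pairs from each input record.
mapped = chain.from_iterable(((w, 1) for w in doc.split()) for doc in docs)

# Shuffle: group values by key (the framework does this between map and reduce).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'deal': 1, 'lake': 1}
```

Because map and reduce are both stateless per key, the framework can spread them across machines, which is what lets the model scale to the data volumes discussed in these posts.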

This is a sizable portion of the Big Data ecosystem… an ecosystem that keeps on growing almost by the day. In fact, we could spend a considerable amount of time describing additional technologies out there that play an important part in the Big Data symphony – DataStax, Sqrrl, Hortonworks, Cloudera, Accumulo, Apache, Ambari, Cassandra, Chukwa, Mahout, Spark, Tez, Flume, Fuse, YARN, Whirr, Grunt, HiveQL, Nutch, Java, Ruby, Python, Perl, R, NoSQL, PigLatin, Scala, etc.

Interestingly enough, most of the aforementioned technologies are used in the realm of Data Science as well, mostly due to the fact that the main goal of Data Science is to make sense out of and generate value from all data, in all of its many forms, shapes, structures and sizes.

In my next blog post, we’ll see how Big Data and Data Science are actually two sides of the same coin, and how whoever does Big Data, is actually doing Data Science as well to some extent – wittingly, or unwittingly.

Web analytics and Enterprise data…

I was looking at the market share of Google Analytics (GA), and it is definitely on the rise. I was curious to see its capabilities and what this tool can do. It is, of course, a great campaign management tool, and it has been a while since I worked on campaign management.

I wanted to know all the more about this tool, so off to YouTube I went and got myself up to speed on its capabilities. Right off the bat, I noticed campaign management has changed drastically compared to the days when we were sending email blasts, snail mail, and junk mail. I remember when we generated email lists, ran them through third-party campaign management tools, blasted them out to the world, and waited. Once we had enough data (mostly once customers purchased the product) to run the results through SAS, we could see the effectiveness. It took more than a month to gain any valuable insight.

Fast-forward to the social media era: GA provides instant results and intelligent click-stream data for tracking campaigns in real time. Check out the YouTube webinars to see what GA can do in 45 minutes.

On a very basic level, GA can track a new visitor, a micro conversion (downloading a newsletter or adding something to a shopping cart), a macro conversion (buying a product), or a returning customer. GA can track ad-word traffic (how visitors got to the website, and what triggered the visit). It also has a link-tagging feature, which is very useful for identifying the channel (email, referral website, etc.) and linking traffic to a specific campaign based on its origination. It has many other features, besides cool reports and analytical abilities, as well.
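Link tagging in GA is done with the familiar utm_* query parameters appended to a campaign URL. A minimal sketch with Python's standard library; the site URL and campaign names below are made up:

```python
from urllib.parse import urlencode, urlparse, parse_qs

def tag_link(url, source, medium, campaign):
    """Append GA-style utm_* link tags so inbound traffic can be attributed."""
    params = {"utm_source": source, "utm_medium": medium, "utm_campaign": campaign}
    return url + "?" + urlencode(params)

link = tag_link("http://example.com/offer", "newsletter", "email", "spring_sale")
print(link)

# GA reads the tags back off the inbound request to attribute the visit.
print(parse_qs(urlparse(link).query)["utm_source"])  # ['newsletter']
```

Every channel (email, referral site, banner) gets its own tagged variant of the same landing URL, which is how GA separates the traffic by origination.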

A great deal of information is collected whether or not the customer buys a product. How much of this web analytics data becomes part of the enterprise data? Does historical analysis include it? Is it used for predictive and prescriptive analytics? It is important to ask the following questions to assess what percentage of the gathered information is actually used at the enterprise level:

  • How well does the organization integrate this campaign data into enterprise data?
  • Does it collect and manage new prospect information at the enterprise level?
  • Does it use this tool to enhance its master data?

This may become a Big Data question, depending on the number of campaigns and hits and the number of micro-activities the site offers. Chances are that the data resides in a silo or at a third-party location, and the results are never stored in the enterprise data.

SAP HANA and Hadoop – complementary or competitive?

In my last blog post, we learned about SAP HANA… or, as I called it, “a database on steroids.” Here is what SAP CTO and Executive Board Member Vishal Sikka told InformationWeek:

“Hana is a full, ACID-compliant database, and not just a cache or accelerator. All the operations happen in memory, but every transaction is committed, stored, and persisted.”

In the same InformationWeek article, you can read about how SAP is committed to becoming the #2 database vendor by 2015.

So, even though HANA is a new technology, it looks like SAP has pretty much bet its future on it. Soon, SAP customers may have SAP ERP, SAP NetWeaver BW, and their entire SAP system landscape sitting on a HANA database.

But if HANA is such a great database, you may wonder, why would SAP HANA need a partnership with Hadoop, or be integrated with Hadoop at all? Can HANA really integrate with Hadoop seamlessly? And, most importantly, are HANA and Hadoop complementary or competitive?

Well, in October 2012, SAP announced the integration of Hadoop into its data warehousing family – why?

The composite answer, in brief, is:

  1. tighter integration – SAP, Hadoop, Cloudera, Hitachi Data Systems, HP, and IBM are all brought together in order to address the ever-growing demands in the Big Data space
  2. analytics scenarios – in order to build more complex and mature analytics scenarios, HANA can be integrated with Hadoop via SAP Sybase IQ, SAP Data Services, or R queries, and include structured AND unstructured Big Data with prior integration and consolidation by Hadoop
  3. in-memory capabilities – some organizations already have existing Hadoop strategies or solutions but cannot do in-memory Big Data without HANA
  4. tailored solutions – by bringing together speed, scale and flexibility, SAP enables customers to integrate Hadoop into their existing BI and Data Warehousing environments in multiple ways, so as to tailor the integration to their very specific needs
  5. transparency for end-users – SAP BusinessObjects Data Integrator allows organizations to read data from the Hadoop Distributed File System (HDFS) or Hive and load the desired data very rapidly into SAP HANA or SAP Sybase IQ, helping ensure that SAP BusinessObjects BI users can continue to use their existing reporting and analytics tools
  6. query federation – customers can federate queries across an SAP Sybase IQ MPP environment using built-in functionality
  7. direct exploration – SAP BusinessObjects BI users can query Hive environments giving business analysts the ability to directly explore Hadoop environments

In short, SAP is pursuing a coexistence strategy with Hadoop… NOT a competitive one.

In the next blog post, we’ll look at Hadoop and its position in the Big Data landscape… stay tuned.

SAP HANA – A ‘Big Data’ Enabler

Some interesting facts and figures for your consideration:

  • 90% – of the data stored in the world today was created in the past 2 years
  • 50% – annual data growth rate
  • 34,000 – tweets sent each minute
  • 9,000,000 – daily Amazon orders
  • 7,000,000,000 – daily Google page views
  • 2.5 exabytes – amount of data created every day (an exabyte is 1,000,000,000,000,000,000 bytes = 1,000 petabytes = 1 million terabytes = 1 billion gigabytes)

Looking at these numbers it is easy to see why more and more technology vendors want to provide solutions to ‘Big Data’ problems.

In my previous blog post, I mentioned how we will soon reach a point where it is more expensive for a company not to store data than to store it – some pundits claim we have already reached this pivotal point.

Either way, it would be greatly beneficial to become familiar with at least some of the technologies whose vendors have made a substantial investment in the Big Data space.

One such technology is SAP HANA – a Big Data enabler. I am sure that some of you have heard this name before… but what is SAP HANA exactly?

The acronym HANA in ‘SAP HANA’ stands for High-performance ANalytical Appliance. If I went beyond the acronym and described SAP HANA in one sentence, I would say that it is a database on steroids, perfectly capable of handling Big Data in-memory, and one of the few in-memory computing technologies that can be used as an enabler of Big Data solutions.

Dr. Berg and Ms. Silvia – both SAP HANA gurus – provide a comprehensive and accurate definition of SAP HANA:

“SAP HANA is a flexible, data-source-agnostic toolset (meaning it does not care where the data comes from) that allows you to hold and analyze massive volumes of data in real time, without the need to aggregate or create highly complex physical data models. The SAP HANA in-memory database solution is a combination of hardware and software that optimizes row-based, column-based, and object-based database technologies to exploit parallel processing capabilities. We want to say the key part again: SAP HANA is a database. The overall solution requires special hardware and includes software and applications – but at its heart, SAP HANA is a database”.

Or as I put it, SAP HANA is a database on steroids… but with no side-effects, of course. Most importantly though, SAP HANA is a ‘Big Data Enabler’, capable of:

  • Conducting Massive Parallel Processing (MPP), handling up to 100TB of data in-memory
  • Providing a 360 degree view of any organization
  • Safeguarding the integrity of the data by reducing, or eliminating data migrations, transformations, and extracts across a variety of environments
  • Ensuring overall governance of key system points, measures and metrics

All with very large amounts of data, in-memory and in real time… could this be a good fit for your company? If you are already using SAP HANA, I’d love to hear how you have implemented this great technology and what benefits you have seen from working with it.

My next blog post will focus on SAP HANA’s harmonious, or almost harmonious, co-existence with Hadoop…