Perficient Enterprise Information Solutions Blog


Archive for the ‘Emerging BI Trends’ Category

DevOps Considerations for Big Data

Big Data is on everyone’s mind these days. Creating an analytical environment involving Big Data technologies is exciting and complex: new technology, and new ways of looking at data that would otherwise remain dark or unavailable. The real test of implementing a Big Data solution, though, is making it production-ready.

Once the enterprise comes to rely on the solution, dealing with typical production issues is a must. Expanding data lakes, multiple applications accessing and changing data, and the deployment of new statistical learning solutions can all hit overall platform performance. In the end, user experience and trust will suffer if the environment is not managed properly. Models that used to run in minutes may take hours or days depending on the data changes and algorithm changes deployed. Having the right DevOps process framework is important to the success of Big Data solutions.

In many organizations the Data Scientist reports to the business and not to IT. Knowing both the business and the technological requirements, and setting up the DevOps process accordingly, is key to making the solutions production-ready.

Key DevOps measures for a Big Data environment (a simple monitoring sketch follows the list):

  • Data acquisition performance (ingestion to creating a useful data set)
  • Model execution performance (Analytics creation)
  • Modeling platform / Tool performance
  • Software change impacts (upgrades and patches)
  • Development-to-production deployment performance (application changes)
  • Service SLA Performance (incidents, outages)
  • Security robustness / compliance
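As a minimal illustration of the first two measures, the sketch below times a hypothetical ingestion step and a model run and logs the durations so they can be trended over time. The function names and thresholds are assumptions for illustration, not part of any specific platform.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("bigdata-devops")

@contextmanager
def measured(step_name, warn_after_seconds):
    """Time a pipeline step and warn when it exceeds its expected duration."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        log.info("%s took %.1f seconds", step_name, elapsed)
        if elapsed > warn_after_seconds:
            log.warning("%s exceeded its %.0f-second baseline", step_name, warn_after_seconds)

# Hypothetical pipeline steps; replace with real ingestion and model code.
def ingest_raw_files():
    time.sleep(0.1)   # stand-in for landing files into the data lake

def run_churn_model():
    time.sleep(0.2)   # stand-in for a statistical model run

with measured("data acquisition", warn_after_seconds=600):
    ingest_raw_files()

with measured("model execution", warn_after_seconds=1800):
    run_churn_model()
```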

 

One of the top issues is Big Data security. How secure is the data, and who has access to and oversight of it? Putting together a governance framework to manage the data is vital to the overall health and compliance of Big Data solutions. Big Data is just gaining traction, and many of the best practices for Big Data DevOps scenarios have yet to mature.

Virtualization – THE WHY?

 

The speed at which we receive information from multiple devices, and the ever-changing customer interactions that open up new kinds of customer experience, create DATA! Any company that knows how to harness that data and produce actionable information is going to make a big difference to its bottom line. So why virtualization? The simple answer is business agility.

As we build the new information infrastructure and the tools for modern Enterprise Information Management, we have to adapt and change. Over the last 15 years, the Enterprise Data Warehouse has matured to the point where proper ETL frameworks and dimensional models are well established.

With the ‘Internet of Things’ (IoT), a lot more data is created and consumed from external sources. Cloud applications create data that may not be readily available for analysis. Not having that data available for analysis significantly limits the critical insights that can be produced.

Major Benefits of Virtualization

(Figure: major benefits of virtualization)

Additional considerations

  • Address the performance impact of virtualization on the underlying applications, and the overall refresh delays, appropriately
  • It is not a replacement for data integration (ETL), but it is a quicker way to get data access in a controlled way (see the sketch after this list)
  • It may not include all the business rules, which means data quality may still be an issue
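To make the distinction with ETL concrete, here is a minimal Python sketch of the virtual pattern: two sources are queried at request time and joined in memory, with no ETL job persisting the combined result. This is not a data virtualization product, and the database file, export file, and column names are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# Source 1: a warehouse table (SQLite stands in for the EDW here).
warehouse = sqlite3.connect("warehouse.db")          # assumed local database file
customers = pd.read_sql("SELECT customer_id, segment FROM dim_customer", warehouse)

# Source 2: a cloud application export that never lands in the warehouse.
cloud_orders = pd.read_csv("cloud_app_orders.csv")   # assumed export with customer_id, order_total

# The "virtual" view: joined on demand, in memory, not persisted anywhere.
combined = customers.merge(cloud_orders, on="customer_id", how="inner")
print(combined.groupby("segment")["order_total"].sum())
```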

In conclusion, having a virtualization tool in the Enterprise Data Management portfolio of products will add more agility to data management. However, use virtualization appropriately, to solve the right kind of problem, and not as a replacement for traditional ETL.

Cloud BI use cases

Cloud BI comes in different shapes and forms, ranging from visualization alone to a full-blown EDW combined with visualization and predictive analytics. The truth of the matter is that every niche product vendor offers some unique feature that other product suites do not. In most cases you will need more than one BI suite to meet all the needs of the enterprise.

Decentralization definitely helps the business achieve agility and respond to market challenges quickly. By the same token, that is how companies end up with silos of information across the enterprise.

Let us look at some scenarios where a Cloud BI solution is very attractive for departmental use.

Time to Market

Getting the business case built and approved for big CapEx projects is a time-consuming proposition. Wait times for hardware, software, and IT involvement mean much longer delays in scheduling the project. Not to mention the push-back to use the existing reports, or to wait for the next release, which is allegedly just around the corner, forever.

 

Deployment Delays

Business users have an immediate need for analysis and decision-making. Typical turnaround for IT to bring in new sources of data is anywhere from 90 to 180 days. That is a killer for a business that wants the data for analysis now; spreadsheets are still the top BI tool for just this reason. With Cloud BI (not just the tool), business users get not only the visualization and other product features but also data that is not otherwise available. Customer analytics with social media analysis, for example, is available as a third-party BI solution. In the case of such value-added analytics there is a clear business reason to go with these solutions.

 

Tool Capabilities

Power users need ways to slice and dice the data, and to integrate non-traditional sources (Excel, departmental cloud applications) to produce a combined analysis. Many BI tools come with lightweight integration (mostly push integration) to make this a reality without much of an IT bottleneck.

So if we can add new capability without much delay and within a departmental budget, where is the rub?

The issue is not looking at enterprise information in a holistic way. Though speed is critical, it is equally important to engage governance and IT to secure the information and share it appropriately so it can be integrated into the enterprise data asset.

As we move into the future of cloud-based solutions, we will be able to solve many of these bottlenecks, but we will also have to deal with the security, compliance, and risk management implications of leaving data in the cloud. Forging a strategy to meet the various BI demands of the enterprise with proper governance will yield the optimum use of resources and solution mix.

Governing the Cloud Analytics…

The new trend in the analytics world, Cloud Analytics, is slowly becoming the norm. Except for the cloud tag, companies have used cloud or external analytics for a long time. Historically, campaign management has been partly outsourced and partly managed by Marketing, using external data alongside ‘enterprise data’. Traditional data vendors and credit score providers like D&B are expanding into Cloud BI offerings (the acquisition of Indicee) to diversify their offerings. IBM’s acquisition of Silverpop puts it on par with data vendors offering campaign management solutions, not to mention the pre-packaged analytics solutions it offers in this space.

All of these new tools and offerings make it easy for business users to adopt these solutions, bypassing IT. The challenge remains: how do we deal with this fast-changing norm for enterprise data? Artificially restricting and holding back the trend is not only impossible but will also put the company at a competitive disadvantage. Cloud Analytics is agile and reduces the time to information access, and cloud deployments are much faster than traditional EDW/BI, which mostly delivers ‘what happened?’ types of data. According to a report from Aberdeen, ‘Large Enterprises utilizing Cloud Analytics obtain pertinent information 13% more often than all other large Enterprises’, which is a significant capability compared to the competition. (See the Aberdeen report Cloud Analytics for the Large Enterprise: Fast Value, Pervasive Impact.)

Preparing to deal with this changing analytics trend is a must for IT and the enterprise as a whole. Data silos have been an issue in the past, are an issue at present, and will be an issue in the future. Creating an environment that can deal with the overall information explosion is key to the survival of the company. Partnership between IT and the business will strengthen information governance, both to manage information-based decisions and to leverage the offerings in the marketplace wisely. Adding more sources of data in silos will restrict overall information use, but not allowing experimentation will also put the company at a disadvantage.

Having visibility into overall enterprise information, including the cloud, is a critical factor for successful information governance. Companies should forge strategies that put in place a framework to manage information in a rapidly changing environment, with the right people, processes, and technologies at the regional and enterprise levels. Information governance should create a balanced environment by providing appropriate oversight while fostering innovation in leveraging new offerings.

Myths & Realities of Self-Service BI


The popularity of data visualization tools and Cloud BI offerings is a new force to reckon with, and I find it interesting to see how the perception of these tools compares with how they are actually used. Traditionally, IT likes control and centralized management, for obvious reasons of accountability and information quality. However, self-service BI tools and cloud offerings are accelerating departmental BI development. Some of the misconceptions from the early hype cycle are wearing off, and the realities are becoming clearer.

Let’s look at some of the myths and realities…

Myth 1: Self-Service means everyone is a report writer!

Self-Service BI is pitched as the solution for faster access to data. BI product vendors think anyone can develop reports and use them, but the truth is that analysts want to analyze, not create reports or dashboards. What they need are easy ways to analyze the data, ideally visually. Self-service does not mean everyone is on their own to create reports.

Myth 2: Self-Service BI means the end of traditional BI!

Almost every major BI vendor and major data management software player offers a visualization / in-memory tool along with traditional BI tools. Every tool has its advantages and disadvantages based on its capabilities and usage. Forging a framework for accessing, sharing, and securing data appropriately is the key to leveraging these new technologies. IT can also learn from the departmental successes, primarily the ability to create solutions in their own space and the way the tools are used to further their cause, and apply those techniques in the traditional BI space as well.

Myth 3: Self-Service is new!

Well, Excel has always been the king of self-service BI. It was there before self-service BI, it is there now, and it will be there for the foreseeable future. Understanding self-service BI usage and its limits will help IT and the entire organization use this spectrum of tools efficiently.

Self-service has its place and its limitations. It is great for data discovery, and who could do data discovery better than the business folks? Self-service BI is all about getting data sooner rather than later to the business power user, not necessarily the end user. Use data discovery to validate the benefit, and integrate into the EDW or corporate centralized data once the application is proven.

In a nutshell, self-service BI is here to stay, as it always has been, but the key is to create a balancing governance structure to manage quality, reliability, and security.

 

Data Staging and Hadoop

Traditionally, our information architectures have included a number of staging or intermediate data storage areas and systems. These have taken different forms over the years: publish directories on source systems, staging areas in data warehouses, data vaults, or, most commonly, data file hubs. In general, these data file staging solutions have suffered from two limitations:

  1. Because of storage costs, data retention was usually limited to a few months.
  2. Since these systems were intended to publish data for system integration purposes, end users generally did not have access to staging data for analytics or data discovery.

Hadoop reduces the cost per terabyte of storage by two orders of magnitude: data that consumed $100 worth of storage costs $1 to store on a Hadoop system. This radical cost reduction enables enterprises to replace sourcing hubs with data lakes based on Hadoop, where a data lake can now house years of data versus only a few months.

Next, once the data is in the Hadoop filesystem (HDFS), it can be published either directly to tools that consume HDFS data or through Hive (or another SQL-like interface). This enables end users to leverage analytical, data discovery, and visualization tools to derive value from data within Hadoop.
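With PySpark, for example, this publishing step can be as small as the sketch below, which reads raw files from HDFS and registers them as a Hive table that SQL-based discovery and visualization tools can query. The HDFS path, database, and table names are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Hive support makes the resulting table visible to SQL-on-Hadoop and BI tools.
spark = (SparkSession.builder
         .appName("publish-staged-data")
         .enableHiveSupport()
         .getOrCreate())

# Assumed landing path for raw source extracts in the data lake.
raw = spark.read.csv("hdfs:///data/lake/raw/orders/", header=True, inferSchema=True)

# Publish as a Hive table (assumes a "staging" database already exists),
# so analysts can query years of history directly.
raw.write.mode("overwrite").saveAsTable("staging.orders_raw")
```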

The simple fact that these data lakes can now retain historical data and provide scalable access for analytics also has a profound effect on the data warehouse. That effect will be the subject of my next few blogs.

Seven Deadly Sins of Database Design

This is a summary of an article from Database Trends and Applications (dbta.com). The author addresses fundamental mistakes that we make, or live with, in regard to our database systems.

1. Poor or missing documentation for databases in PRODUCTION

We may have descriptive table and column names to begin with, but as the workforce turns over and a database grows, we can lose essential knowledge about the systems.
A suggested approach is to maintain the data model in a central repository. This must be followed by running validation and quality metrics regularly to enhance the quality of the models over time.

2. Little or no normalization

Putting all information in one table may be easier for data access, but it is rarely the best option in terms of design. Understand normalization (a minimal sketch follows the list):
• 1st Normal Form – eliminate duplicate columns and repeating groups of values within rows
• 2nd Normal Form – move data that applies to multiple rows into separate tables, linked by foreign keys
• 3rd Normal Form – remove columns that do not depend on the primary key, so each record is identified by its primary key alone
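As a minimal sketch (using SQLite from Python, with invented table and column names), the denormalized order table below repeats customer details on every row; splitting it so that customer attributes depend only on the customer key is what the higher normal forms drive toward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized: customer name and city are repeated on every order row.
conn.execute("""
    CREATE TABLE orders_flat (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT,
        customer_city TEXT,
        order_total   REAL
    )""")

# Normalized: customer attributes live once, keyed by customer_id;
# orders reference the customer through a foreign key.
conn.executescript("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL,
        customer_city TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_total REAL
    );
""")
```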

3. Not treating the data model like a living breathing organism

Many people start with a good model when designing a database, and then throw it away as soon as the application is in production. The model should be updated as often as new changes are applied to the database, so those changes are communicated effectively.

4. Improper storage of reference data

Store reference data in the model, or have a reference in the model that points to the reference data. Reference data is typically stored in several places, or worse, in application code, making changes very difficult when the need arises.

5. Not using foreign keys or check constraints

Data quality is greatly improved by having referential integrity and validation checks defined right at the database level.
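A small sketch of the idea, again with SQLite and invented names: the foreign key stops orphaned rows, and the check constraints reject obviously invalid values before they ever reach a report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        unit_price REAL NOT NULL CHECK (unit_price >= 0)
    );
    CREATE TABLE order_line (
        order_line_id INTEGER PRIMARY KEY,
        product_id    INTEGER NOT NULL REFERENCES product(product_id),
        quantity      INTEGER NOT NULL CHECK (quantity > 0)
    );
""")

# Both inserts fail at the database level, before bad data spreads downstream.
try:
    conn.execute("INSERT INTO product (product_id, unit_price) VALUES (1, -5.0)")
except sqlite3.IntegrityError as e:
    print("check constraint:", e)

try:
    conn.execute("INSERT INTO order_line (order_line_id, product_id, quantity) VALUES (1, 999, 1)")
except sqlite3.IntegrityError as e:
    print("foreign key:", e)
```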

6. Not using domains and naming standards

Domains allow you to create reusable attributes so that users don’t have to define them each time they are needed. Naming standards increase the readability of the database and make it easier for new users to adapt to it. It is recommended to use descriptive names rather than abbreviations that a user has to puzzle out.

7. Not choosing primary keys properly

Choose a primary key wisely, because it is painful to correct it down the line. A simple principle is suggested when picking a primary key, SUM: Static, Unique, Minimal. A social security number, for example, may not be the best primary key because it may not always be unique and not everyone has one :)
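A sketch of the SUM idea, with invented columns: a surrogate key stays static, unique, and minimal, while the social security number, if stored at all, is kept as an ordinary attribute rather than the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,          -- surrogate key: static, unique, minimal
        ssn       TEXT UNIQUE,                  -- may be missing; NULLs are allowed here
        full_name TEXT NOT NULL
    )""")

# People without an SSN can still be recorded; the surrogate key never changes.
conn.execute("INSERT INTO person (ssn, full_name) VALUES (NULL, 'Pat Example')")
conn.execute("INSERT INTO person (ssn, full_name) VALUES ('123-45-6789', 'Sam Example')")
```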

Happy database modeling!

Data Science = Synergistic Teamwork

Data science is a discipline that combines elements from various fields such as mathematics, machine learning, statistics, computer programming, data warehousing, pattern recognition, uncertainty modeling, computer science, high performance computing, visualization and others.

According to Cathy O’Neil and Rachel Schutt, two luminaries in the field of Data Science, there are about seven disciplines that even data scientists in training can easily identify as part of their tool set:

  • Statistics
  • Mathematics
  • Machine Learning
  • Computer Science
  • Data Visualization
  • Domain Expertise
  • Communication and Presentation Skills

Most data scientists, however, are experts in only a couple of these disciplines and proficient in another two or three – that’s why Data Science is a team sport.

I’ve definitely learned the importance of teamwork in this field over the last few months, while working with the Perficient Data Science team on a Big Data Lab.

Ultimately, the goal of Data Science is to extract meaning from data and create products from the data itself. Data is the raw material used for the study of “the generalizable extraction of knowledge”.

With data scaling up by the day, it should not come as a surprise that Big Data would play an important role in a data scientist’s work – herein lies the importance of our Big Data Lab and our teamwork.

Our Big Data Lab is the place where Data Science’s many underlying disciplines come together to create something greater than the summation of our individual knowledge and expertise – synergistic teamwork.

Disruptive Scalability

The personal computer, the internet, digital music players (think iPods), smartphones, and tablets are just a few of the disruptive technologies that have become commonplace in our lifetime. What is consistent about these technology disruptions is that they have all changed the way we work, live, and play. Whole industries have grown up around these technologies. Can you imagine a major corporation being competitive in today’s world without personal computers?

Big Data is another disruptive technology. It is spawning its own industry, with hundreds of startups, and every major technology vendor seems to have a “Big Data offering.” Soon, companies will need to leverage Big Data to stay competitive. The Big Data disruption to an enterprise’s data architecture is significant: how we source, integrate, process, analyze, manage, and deliver data will evolve and change. Big Data truly is changing everything! Over the next few weeks I will be focusing my blogging on how Big Data is changing our enterprise information architecture. Big Data’s effect on MDM, data integration, analytics, and overall data architecture will be covered. Stay tuned!

Thoughts on Oracle Database In-Memory Option

Last month Oracle announced the Oracle Database In-Memory option. The overall message is that once it is installed, you can turn this “option” on and Oracle becomes an in-memory database. I do not think it will be that simple. However, I believe Oracle is on the right track with this capability.

There are two main messages in Oracle In-Memory’s vision, which I view as critical capabilities in a modern data architecture. The first is the ability to store and process data based on the temperature of the data.
That is, hot, highly accessed data should be kept in DRAM or as close to DRAM as possible. As the temperature decreases, data can be stored on flash, and cold, rarely accessed data on disk (either in the Oracle database or in Hadoop). Of course, we can store data of different temperatures today; it is the second feature, making this storage transparent to the application, that makes the capability so valuable. An application programmer, data scientist, or report developer should not have to know where the data is stored; it should be transparent. That the Oracle database or a DBA can optimize data storage based on the cost and performance of the storage, without having to consider compatibility with the application, the cluster (RAC), or recoverability, is quite powerful and useful. Yes, Oracle has been moving this way for years, but now it has most, if not all, of the pieces.
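As a rough sketch of how this tiering looks in practice (table names invented, connection details placeholders, and assuming the INMEMORY clause Oracle documents for this option), a DBA can mark the hot table for the in-memory column store and leave the cold history table on disk; application queries do not change.

```python
import oracledb  # python-oracledb driver; connection details below are placeholders

conn = oracledb.connect(user="dba_user", password="secret", dsn="dbhost/orclpdb")
cur = conn.cursor()

# Hot, heavily queried data: populate it into the in-memory column store.
cur.execute("ALTER TABLE sales_current INMEMORY PRIORITY HIGH")

# Cold history stays on disk (or in Hadoop); the application never needs to know.
cur.execute("ALTER TABLE sales_history NO INMEMORY")

# Reports keep issuing the same SQL regardless of where the data lives.
cur.execute("SELECT COUNT(*) FROM sales_current")
print(cur.fetchone()[0])
```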

Despite the fact that the In-Memory option leverages a lot of existing core code, most Oracle IT shops will need to remember that this is a version 1 product. Plan accordingly: understand the costs and architectural impacts, implement the Oracle In-Memory option on a few targeted applications, and then develop standards for its use. A well-planned, standards-based approach will ensure that your company maximizes the return on its Oracle In-Memory investment.