Perficient Enterprise Information Solutions Blog


Posts Tagged ‘data quality’

Data Quality – Don’t Fix It If It Ain’t Broke

 

What is broke? If I drive a pick-up truck around that has a small, unobtrusive crack in the windshield and a few dings in the paint, it will still pull a boat and haul a bunch of lumber from Home Depot. Is the pick-up broke if it still meets my needs?

So, when is data broke? In our legacy data integration practices, we would profile data and identify everything that was wrong with it. Orphan keys, inappropriate values, and incomplete data (to name a few) would be identified before data was moved. In the more stringent organizations, data would need to be near perfect before it could be used in a data warehouse. This ideal world of perfect data was strived for, but rarely attained. It was too expensive, required too much business buy-in, and lengthened BI and DW projects. Read the rest of this post »
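To make that profiling step concrete, here is a minimal Python/pandas sketch of the kinds of checks a legacy profiling pass might run – orphan keys, inappropriate values, and incomplete records. The tables and column names are hypothetical, invented only for illustration.

```python
import pandas as pd

# Hypothetical reference and transaction tables, invented for illustration.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 2, 9, None],                  # 9 is an orphan key; None is incomplete
    "status": ["SHIPPED", "SHIPPED", "??", "OPEN"],  # "??" is an inappropriate value
})

# Orphan keys: orders whose customer_id has no matching customer record.
orphans = orders[orders["customer_id"].notna()
                 & ~orders["customer_id"].isin(customers["customer_id"])]

# Incomplete data: required fields that are missing.
missing = orders[orders["customer_id"].isna()]

# Inappropriate values: anything outside the allowed domain.
valid_statuses = {"OPEN", "SHIPPED", "CANCELLED"}
invalid = orders[~orders["status"].isin(valid_statuses)]

print(f"orphan keys: {len(orphans)}, incomplete rows: {len(missing)}, invalid statuses: {len(invalid)}")
```

Running checks like these tells you what is "broke"; whether it is worth fixing is the business question the post is really about.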

Think Better Business Intelligence

Think First by jDevaun.Photography, on Flickr (Creative Commons Attribution-No Derivative Works 2.0 Generic License)

Everyone is guilty of falling into a rut and building reports the same way over and over again. This year, don’t just churn out the same old reports; resolve to deliver better business intelligence. Think about what business intelligence means. Resolve, at least in your world, to make business intelligence about helping organizations improve business outcomes by making informed decisions. When the next report request lands on your desk, leave the tool of choice alone (Cognos, in my case) and think for a while. This even applies to those of you building your own reports in a self-service BI world.

Think about the business value. How will the user make better business decisions? Is the user trying to understand how to allocate capital? Is the user trying to improve patient care? Is the user trying to stem the loss of customers to a competitor? Is the user trying to find the right price point for their product? No matter what the ultimate objective, this gets you thinking like the business person and makes you realize the goal is not a report.

Think about the obstacles to getting the information. Is the existing report or system too slow? Is the data dirty or incorrect? Is the data too slow to arrive or too old to use? Is the existing system too arcane to use? You know the type – when the moon is full, stand on your left leg, squint, hit O-H-Ctrl-R-Alt-P and the report comes out perfectly – if it doesn’t time out. Think about it: if there were no obstacles, there would be no report request in your hands.

Think about the usage. Who is going to use the analysis? Where will they be using it? How will they get access to the reports? Can everyone see all the data or is some of it restricted? Are users allowed to share the data with others? How will the users interact with the data and information? When do the users need the information in their hands? How current does the data need to be? How often does the data need to be refreshed? How does the data have to interact with other systems? Thinking through the usage gives you a perspective beyond the parochial limits of your BI tool.

Think like Edward Tufte. What should the structure of the report look like? How would it look in black and white? What form should the presentation take? How should the objects be laid out? What visualizations should be used? And those are never pie charts. What components can be taken away without reducing the amount of information presented? What components can be added, in the same real estate, without littering, to improve the information provided? How can you minimize the clutter and maximize the information? Think about the flaws of write-once-deliver-anywhere, and the garish palettes many BI tools provide.

Think about performance. Is the user expecting an instantaneous response? Is the user expecting get-a-cup-of-tea-and-come-back response time? Is the user okay kicking off a job and getting the results the next morning? If you find one of these, cherish them! They are hard to find these days. Will the user immediately select the next action, or do they require some think time? Is the data set a couple of structured transactional records, or is it a chunk of a big-data lake? Does the data set live in one homogenous source or across many heterogeneous sources? Thinking about performance early means you won’t fall into a trap of missed expectations or an impossible implementation.

Think about data quality. It is a fact of life. How do you deal with and present missing data? How do you deal with incorrect values? How do you deal with out-of-bounds data? What is the cost of a decision made on bad data? What are the consequences of a decision made on incorrect data? What is the cost of perfect data? What is the value of better data? Thinking about quality before you start coding lets you find a balance between cost and value.
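Those questions are easier to answer when suspect values are surfaced rather than silently dropped. Below is a small, hedged Python sketch of one way a reporting layer might flag missing and out-of-bounds values; the data set, column names, and thresholds are assumptions made up for the example, not a prescription.

```python
import pandas as pd

# Hypothetical report data set, invented for illustration.
sales = pd.DataFrame({
    "region": ["East", "West", "North", "South"],
    "revenue": [120000, None, -5, 98000],   # None = missing, -5 = out of bounds
})

def quality_flag(value, lower=0, upper=10_000_000):
    """Label each value so the report can surface the issue instead of hiding it."""
    if pd.isna(value):
        return "missing"
    if not (lower <= value <= upper):
        return "out of bounds"
    return "ok"

sales["quality"] = sales["revenue"].apply(quality_flag)
print(sales)
```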

Think about maintenance. Who is going to be responsible for modifications and changes? You know they are going to be needed; as good as you are, you won’t get everything right. Is it better to quickly replicate a report multiple times and change the filters, or is it better to spend some extra time and use parameters and conditional code to have a single report serve many purposes? Is it better to use platform-specific outputs, or is it better to use a “hybrid” solution and support every output format from a single build? Are the reports expected to be viable in 10 years, or will they be redone in 10 weeks? Thinking through the maintenance needs will let you invest your time in the right areas.

Think you are ready to build? Think again. Think through your tool set’s capabilities and match them to your needs. Think through your users’ skills and match them to the tools. Think about your support team and let them know what you need. Think through your design and make sure it is viable.

Here’s to thinking better Business Intelligence throughout the year.

 

Key strategies for Data Quality

We have witnessed in numerous client engagements that Data Quality (DQ) is a never-ending battle, and in many companies IT is stuck fixing and re-fixing the data rather than developing solutions and managing the applications. Data quality is not confined to IT; it is an effort that involves all users of the data, especially the business.

Building a company-wide initiative can be a hard sell, as the scope is large and complex. However, applying some proven key strategies will help create awareness and gain support for DQ. The key idea is not to fix the same problem over and over, but to understand and communicate the bigger picture so the problem gets solved as IT and business information management mature.

Metrics

If it is not measured, you will never know what you are dealing with. Setting up key quality measures and documenting impacts is the best way to get support. As part of any key initiative, include DQ measures that can be gathered and reported periodically. Having the key information metrics is a sure way of getting the needed attention. Some of the measures could be as simple as the ones below (a rough sketch of rolling them up follows the list):

  • Downtime caused by quality issues
  • Man-hours invested in repeat problems
  • Ownership of quality issues
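As promised above, here is a rough Python sketch of rolling such measures up from an incident log for periodic reporting. The log format, field names, and ownership convention are assumptions invented for the illustration.

```python
from collections import Counter

# Hypothetical incident log entries captured by support/ETL teams.
incidents = [
    {"issue": "duplicate customer keys", "downtime_hrs": 3.0, "effort_hrs": 6.0, "owner": "unassigned"},
    {"issue": "duplicate customer keys", "downtime_hrs": 1.5, "effort_hrs": 4.0, "owner": "unassigned"},
    {"issue": "late vendor feed",        "downtime_hrs": 0.0, "effort_hrs": 2.0, "owner": "Finance"},
]

issue_counts = Counter(i["issue"] for i in incidents)

downtime = sum(i["downtime_hrs"] for i in incidents)               # downtime caused by quality issues
repeat_effort = sum(i["effort_hrs"] for i in incidents
                    if issue_counts[i["issue"]] > 1)               # man-hours sunk into repeat problems
unowned = sum(1 for i in incidents if i["owner"] == "unassigned")  # issues with no clear ownership

print(f"Downtime from DQ issues: {downtime} hrs")
print(f"Man-hours on repeat problems: {repeat_effort} hrs")
print(f"Issues without an owner: {unowned} of {len(incidents)}")
```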

Process

Any data project should consider the trust aspects of the data. Introducing a data certification process for adding new data, especially large batch loads, will bring tremendous improvements in business participation and overall quality. Key quality measures such as missing/null values, invalid values, rejections, and warnings should be communicated and remedied within the expected timeframe. Many times we find the fix is in the upstream systems, which eliminates current and future data issues.
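As a hedged illustration of what such a certification step might report before a batch is accepted (not a formal framework), here is a minimal Python sketch; the field names, sentinel values, and thresholds are assumptions for the example.

```python
import pandas as pd

# Hypothetical incoming batch, invented for illustration.
batch = pd.DataFrame({
    "account_id": [101, 102, None, 104],
    "balance": [250.0, -99999.0, 410.0, 88.5],   # negative balance treated as invalid here
})

checks = {
    "row_count": len(batch),
    "missing_account_id": int(batch["account_id"].isna().sum()),
    "invalid_balance": int((batch["balance"] < 0).sum()),
}

# Simple accept / warn / reject decision based on an agreed error-rate threshold.
error_rate = (checks["missing_account_id"] + checks["invalid_balance"]) / checks["row_count"]
checks["certification"] = ("accepted" if error_rate == 0
                           else "warning" if error_rate < 0.05
                           else "rejected")

print(checks)
```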

Creating a data certification process is the best way to engage the business and gain their trust. The key idea is to make the business responsible for the data, not IT.

DQ as part of SDLC

Data quality should be part of the SDLC to guarantee acceptable quality. Development projects should include time for reporting the key quality measures. This will greatly improve the trustworthiness of the data, which in turn helps new applications get adopted faster.
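One lightweight way to bake this into the SDLC is to express the agreed quality measures as automated tests that run with every build. The pytest-style sketch below is only illustrative; the extract function, columns, and thresholds are assumptions for the example.

```python
import pandas as pd

def load_customer_extract() -> pd.DataFrame:
    # Stand-in for the project's real extract; a real test would read from the actual source.
    return pd.DataFrame({"customer_id": [1, 2, 3],
                         "email": ["a@example.com", None, "c@example.com"]})

def test_customer_ids_are_present_and_unique():
    df = load_customer_extract()
    assert df["customer_id"].notna().all()
    assert df["customer_id"].is_unique

def test_email_completeness_meets_agreed_threshold():
    df = load_customer_extract()
    completeness = df["email"].notna().mean()
    assert completeness >= 0.60, f"email completeness {completeness:.0%} is below the agreed 60%"
```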

Leverage Governance   

Governance is the best way to gain support for quality initiatives. Most of the time, when it comes to cutting cost, quality-related development time is cut because it is perceived as a nice-to-have. DG can mandate these requirements and secure the necessary support. Make sure there is a process for getting onto the DG agenda, and leverage it to bring about the DQ transformation.

In short, making incremental changes to existing processes so that quality is a key component will help build the case for the broader process changes and tools needed to manage overall quality.

Link to earlier DQ blog:

What is the worth of Data Quality to organizations?

Bootstrapping Data Governance – Part I

A lot has been said and written about Data Governance (DG) and the importance of having it. However, creating an effective DG program is still a mystery for many companies. Based on our experience, the majority of companies in the early stages of DG fall into one of these categories:

DG_Target

  1. Had too many false starts
  2. Made little impact, and the DG effort lost much of its support
  3. No clue, not even attempted

Why is it so difficult to set up a reasonably functioning Data Governance program?

The typical scenario is that IT leads the Data Governance initiative, as part of an IT overhaul or as part of a new initiative like rebuilding the Data Warehouse or launching a Master Data Management program. Too often, companies establish DG with limited vision, narrow scope, and minimal business involvement. The problem areas and possible pitfalls companies run into during the DG establishment phase can be broadly classified into three major areas: Vision, Preparation, and Sponsorship & Support.

Vision

Getting executive buy-in and setting the Data Governance vision is a process of evolution. Typically this takes 3–12 months of pre-work through casual meetings and by including the DG topic on strategy meeting agendas for discussion. Awareness through common education, such as attending industry seminars and conferences, is another dimension of setting the vision. If the DG concept has been discussed and socialized for some time, then leveraging that common understanding to launch the program is the next step.

Preparation

Being prepared is the best way to avoid false starts. The opportunity to launch DG often arrives when you are least prepared. It is not easy to devote time to DG incubation when you have burning issues around you, but those burning issues, especially catastrophic events, may escalate the urgency for DG and gain unprecedented executive support, or even a mandate from the top. You are definitely stuck if you are unprepared when that moment comes.

Sponsorship & Support

Once you get the go-ahead, approaching DG without a holistic vision and a complete picture will water down the momentum, and slowly the support will start to disappear. Keeping the executive team committed to DG means producing meaningful results and engaging the business from planning through execution of DG.

Execution

DG establishment is followed by the organization’s ability to successfully execute the DG mandates. Again, putting together a solid approach spanning people, process, and technology is what makes DG execution succeed.

In the next segment, let’s look at nimble and effective strategies to keep DG successful as an organization, from establishment through execution.

Is IT ready for Innovation in Information Management?

Information Technology (IT) has come a long way from being a delivery organization to being part of the business innovation strategy, though a lot has to change in the coming years. Depending on the industry and the company culture, most IT organizations fall on the operational end of the spectrum, and a lot of progressive ones are gravitating towards innovation. Typically, IT may be consulted on executing the strategic vision. It is not IT’s role to lead the business strategy, but data and information are another story. IT is uniquely positioned to innovate in Information Management because of its knowledge of the data; if it doesn’t take up that challenge, the business will look for outside innovation. Today’s marketplace offers tools and technologies directly to business users, and they are bypassing IT organizations that are not ready for the information challenge. A good example is business users trying out third-party (cloud) services and self-service BI tools for slicing and dicing data, cutting down the development cycle. The only way IT can play the strategic game is to get into the game.

It is almost impossible for IT not to pay attention to data and just bury its head in keeping-the-lights-on projects. So I took a stab at the types of products and technologies that have matured in the last 5 years in the Data Management space. This is by no means a complete list, but it captures the essence.

DM_tools_x

An interesting phenomenon is that many companies traditionally late to adopt a data-driven approach are now using analytical tools, as these tools have become visually appealing and are priced where they can buy them. Cloud adoption is another trend, enabling technology deployment and management without a huge IT bottleneck.

The question every IT organization, irrespective of company size, should ask is: are we ready to take on a strategic role in the enterprise? How well can we co-lead the business solution rather than just implementing an application after the fact? Data Management is one area where IT needs to take the lead, educating and driving innovation to solve business problems. Predictive analytics and Big Data are right on top, along with all the necessary supporting platforms including Data Quality, Master Data Management, and Governance.

It will be interesting to know how many IT organizations leverage the Information Management opportunity.

DM_tools_list

 

Primary Practices for Examining Data

SPSS Data Audit Node

z1


Once data is imported into SPSS Modeler, the next step is to explore the data and to become “thoroughly acquainted” with its characteristics. Most (if not all) data will contain problems or errors such as missing information and/or invalid values. Before any real work can be done using this data, you must assess its quality (higher quality = more accurate predictions).

Addressing issues of data quality

Fortunately, SPSS Modeler makes it (almost too) easy! Modeler provides several nodes that can be used for our integrity investigation. Here are a couple of things even a TM1 guy can do.

Auditing the data

After importing the data, do a preview to make sure the import worked and things “look okay”.

In my previous blog I talked about a college using predictive analytics to predict which students might or might not graduate on time, based upon their involvement in athletics or other activities.

From the Variable File Source node, it was easy to have a quick look at the imported file and verify that the import worked.

z2


Another useful option is to run a table. This will show whether field values make sense (for example, whether a field like age contains numeric values and no string values). The Table node is cool – after dropping it into my stream and connecting my source node to it, I can open it up and click run (to see all of my data nicely fit into a “database-like” table), or I can do some filtering using the real-time “expression builder”.

z3


The expression builder lets me see all of the fields in my file, along with their level of measurement (shown as Type) and their Storage (integer, real, string). It also gives me the ability to select from SPSS predefined functions and logical operators to create a query expression to run on my data. Here I wanted to highlight all students in the file that graduated “on time”:

z4
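For readers following along without Modeler, roughly the same selection can be expressed outside the tool. Here is a tiny pandas sketch; the field names and values are assumptions loosely modeled on the college example, not the actual data.

```python
import pandas as pd

# Hypothetical student records, invented for illustration.
students = pd.DataFrame({
    "ID": [1001, 1002, 1003],
    "Athlete": ["Yes", "No", "Yes"],
    "GraduateOnTime": ["Yes", "Yes", "No"],
})

# Equivalent of the expression-builder query: keep students who graduated on time.
on_time = students[students["GraduateOnTime"] == "Yes"]
print(on_time)
```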


You can see the possibilities that the Table node provides – but of course it is not practical to visually inspect thousands of records. A better alternative is the Data Audit node.

The Data Audit node is used to study the characteristics of each field. For continuous fields, minimum and maximum values are displayed. This makes it easy to detect out of range values.

Our old pal measurement level

Remember measurement level (a field’s “use” or “purpose”)? Well, the Data Audit node reports different statistics and graphs depending on the measurement level of the fields in your data.

For categorical fields, the data audit node reports the number of unique values (the number of categories).

For continuous fields, minimum, maximum, mean, standard deviation (indicating the spread in the distribution), and skewness (a measure of the asymmetry of a distribution; if a distribution is symmetric, it has a skewness value of 0) are reported.

For typeless fields, no statistics are produced.
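To ground those statistics, here is a hedged sketch of the equivalent calculations outside Modeler using pandas; the field names and values are assumptions loosely based on the college example.

```python
import pandas as pd

data = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],                      # categorical field
    "HouseholdIncome": [42000, 58000, 61000, 39000, 120000],  # continuous field
})

# Categorical field: number of unique values (categories).
print("Gender categories:", data["Gender"].nunique())

# Continuous field: minimum, maximum, mean, standard deviation, and skewness.
income = data["HouseholdIncome"]
print("min:", income.min(), "max:", income.max())
print("mean:", income.mean(), "std dev:", income.std())
print("skewness:", income.skew())   # 0 would indicate a symmetric distribution
```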

“Distribution” or “Histogram”?

The Data Audit node also produces a graph for each field in your file, again based upon the field’s level of measurement (no graphs are produced for typeless fields).

For a categorical field (like “gender”) the Data Audit Node will display a distribution graph and for a continuous field (for example “household income”) it will display a histogram graph.
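The same chart-by-measurement-level idea can be reproduced with any charting library; the matplotlib sketch below is only an illustration, reusing the made-up fields from the previous sketch.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "HouseholdIncome": [42000, 58000, 61000, 39000, 120000],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Categorical field -> distribution (bar chart of category counts).
data["Gender"].value_counts().plot(kind="bar", ax=ax1, title="Gender (distribution)")

# Continuous field -> histogram.
data["HouseholdIncome"].plot(kind="hist", bins=5, ax=ax2, title="Household income (histogram)")

plt.tight_layout()
plt.show()
```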

So back to my college’s example, I added an audit node to my stream and took a look at the results.

z5


First, I excluded the “ID” field (it is just a unique student identification number and has no real meaning for the audit node). Most of the fields in my example (gender, income category, athlete, activities and graduate on time) are qualified as “Categorical” so the audit node generated distribution graphs, but the field “household income” is a “Continuous” field, so a histogram was created for it (along with the meaningful statistics like Min, Max, Mean, etc.).

z6


Another awesome feature – if you click on the generated graphs, SPSS will give you a close up of the graph along with totals, values and labels.

Conclusion

I’ve talked before about the importance of understanding field measurement levels. The fact that the statistics and chart types the Data Audit node generates are derived from the measurement level is another illustration of Modeler’s approach: measurement level determines the output.

 

IBM Vision 2013 – 2 thumbs Up!

visionme

I just returned from the IBM Vision Conference in Orlando, Florida. I attended a session in every available timeslot from Monday morning to Wednesday afternoon and it was worth every single minute of my time!

Although there were too many sessions and presenters to mention, here are my “top picks”:

  • Designing Solutions with IBM Cognos TM1 Performance Modeler – Andy Neimens and Stephen Brook. This session took a case study approach to building a planning and analysis solution using the Performance Modeler tool. If you have been reading my blog posts, you know I am in love with this tool. If anyone out there still thinks it’s acceptable to develop TM1 solutions using only TM1 Architect and is not steadily building expertise with PM, you are going to be left behind!

 

  • Reducing Cost through Predictive Analytics by Integrating IBM SPSS and IBM Cognos BI – JBS International. This was a “deep-dive” discussion on sourcing data from disparate systems and files to use with SPSS Statistics and SPSS Modeler to uncover relationships between financial performance and business objectives. Again, as you know, SPSS is a passion of mine, and the PhDs at JBS demonstrated their expertise with the technology. I spent time on break talking to these guys and trying to absorb their every word.

 

  • Building GRC (governance, risk and compliance) Success with the Power of Customer Experience – Chris McClean of Forrester Research. This session explored how GRC programs can employ best practices from customer experience to create an environment where employees want to participate in the program and embed it into their standard operating procedures. This was an interesting presentation with real-world examples and a vision for improving your GRC programs. It refreshed and renewed my commitment to the GRC programs I’ve helped develop and support throughout my career.

 

  • Data Quality and Analytics – Which is the chicken and which is the egg? – Tony Petkovski, Commonwealth Bank of Australia. In this session, Tony demonstrated how his bank is using IBM’s OpenPages GRC platform to drive quality analytics and support better reporting and decision making, driving down the bank’s risks. Tony is such a passionate and charismatic guy that I left wanting to transfer all my money to the Commonwealth Bank!

 

  • Delivering Stronger Business Insight through a CFO Dashboard – Tony Levy, IBM. This was a demonstration of IBM’s Smarter Analytics “Signature Solution” that leverages TM1, Cognos BI and SPSS Modeler to deliver a CFO dashboard that visualizes KPIs and KRIs in NRT (near real time). Walking out of this session I thought, I now know what I want to be when I grow up! Tony presented this new “add-in product” for customers using these technologies – a configurable and customizable tool that will blow you away. Again, as you may or may not know, as a technology implementer I design and help build these kinds of solutions all the time. It’s always nice to see something like this. TM1, Cognos BI and SPSS Modeler? That has to be the “perfect storm”.

 

  • Vernice “fly-girl” Armour – America’s first African American female combat pilot. Last (but not least), I absolutely enjoyed attending the Tuesday keynote presentation by Vernice Armour. She is so compelling and inspiring. She’s written a book (which I plan to pick up this week), “Zero to Breakthrough: The 7-Step, Battle-Tested Method for Accomplishing Goals that Matter”. My favorite quote – “helicopters don’t need a runway”… Vernice: “Engage hot”!

 

Thank you, IBM, for another great conference, and I hope to see you again next year!

 

Special thanks to my friends at Perficient for my ticket!

 

Introduction to Data Quality Services (DQS) – Part I

I was recently introduced to SQL Server 2012 and discovered Data Quality Services (DQS), one of its new features. I wanted to use this blog as an introduction to DQS, define key terms, and present a simple example of the tool. According to MSDN,

The data-quality solution provided by Data Quality Services (DQS) enables a data steward or IT professional to maintain the quality of their data and ensure that the data is suited for its business usage. DQS is a knowledge-driven solution that provides both computer-assisted and interactive ways to manage the integrity and quality of your data sources. DQS enables you to discover, build, and manage knowledge about your data. You can then use that knowledge to perform data cleansing, matching, and profiling. You can also leverage the cloud-based services of reference data providers in a DQS data-quality project.
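DQS itself is configured through SQL Server tooling rather than written as code, but as a purely conceptual illustration (not the DQS API), the Python sketch below mimics the idea of a knowledge base: a domain of valid values used to cleanse and match incoming records. The domain values and inputs are made up.

```python
from difflib import get_close_matches

# A tiny stand-in for a DQS "knowledge base" domain of valid State values.
state_domain = ["California", "New York", "Texas", "Washington"]

incoming = ["california", "Nw York", "Texas", "Wash."]

def cleanse(value: str) -> str:
    """Map an incoming value to the closest domain value, or leave it for review."""
    match = get_close_matches(value.title(), state_domain, n=1, cutoff=0.6)
    return match[0] if match else value

for raw in incoming:
    print(f"{raw!r} -> {cleanse(raw)!r}")
```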

The below illustration displays the DQS process:

Read the rest of this post »

Teradata Talks Enterprise Data Integration

Teradata has long been known for its powerful data systems and its drive to push benchmarks for large data volumes. In fact, in 1992 Teradata built a first-of-its-kind system for Wal-Mart, capable of handling 1 terabyte of data. One of the main advantages of routinely working with very large data sizes is the exposure to integration, data quality (DQ) and master data management (MDM) techniques. From the experience gained over years of this type of work, Teradata has positioned itself as an expert on these topics. With that, here is a video that includes these buzzwords and more as Teradata describes how to achieve data integration at the enterprise level:

Back to the Basics: What is Big Data?

This video published by SAP provides a concise description of Big Data. Timo Elliott (SAP Evangelist) and Adrian Simpson (CTO, SAP UK & Ireland) describe the 4 major challenges that make up big data:

  • Volume – Amount of data
  • Velocity – Frequency of change in data
  • Variety – Both structured and unstructured data
  • Validity – Quality of the data

Before we can push into the details we must first understand the most simplified form of the topic as a platform to build from. Enjoy:

Read the rest of this post »