Perficient Enterprise Information Solutions Blog


Posts Tagged ‘data quality’

Key strategies for Data Quality

In numerous client engagements we have seen that Data Quality (DQ) is a never-ending battle; in many companies IT spends its time fixing and re-fixing data rather than developing solutions and managing applications. Data quality is not confined to IT: it is an effort that involves all users of the data, especially the business.

Building a company-wide initiative can be a hard sell because the scope is enormous and complex. However, applying some proven key strategies will help create awareness and gain support for DQ. The key idea is not to fix the same problem over and over, but to understand and communicate the bigger picture so that the problem is solved as IT and business information management mature.


If it is not measured, you will never know what you are dealing with. Setting up key quality measures and documenting their impact is the best way to get support. As part of any key initiative, include DQ measures that can be gathered and reported periodically. Having key information metrics is a sure way of getting the needed attention. Some of the measures could be as simple as:

  • Down time caused by quality issues
  • Man hours invested in repeat problems
  • Ownership of the Quality issues
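As a rough illustration of how these three measures could be gathered for periodic reporting, here is a minimal Python sketch. The `QualityIssue` record, its field names and the sample values are all hypothetical; a real issue log would come from whatever ticketing or incident system the organization already uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QualityIssue:
    # All fields are hypothetical names chosen for this sketch.
    owner: str             # team that owns the quality issue
    downtime_hours: float  # down time caused by the issue
    rework_hours: float    # man hours spent fixing it
    is_repeat: bool        # True if the same problem was fixed before


def summarize(issues):
    """Aggregate issue counts, down time and repeat-fix effort per owner."""
    summary = defaultdict(
        lambda: {"issues": 0, "downtime_hours": 0.0, "repeat_rework_hours": 0.0}
    )
    for issue in issues:
        row = summary[issue.owner]
        row["issues"] += 1
        row["downtime_hours"] += issue.downtime_hours
        if issue.is_repeat:
            # Only count effort that went into problems seen before.
            row["repeat_rework_hours"] += issue.rework_hours
    return dict(summary)


# Made-up sample log, for illustration only.
issues = [
    QualityIssue("Finance", 2.0, 5.0, True),
    QualityIssue("Finance", 0.5, 3.0, False),
    QualityIssue("Sales", 1.0, 4.0, True),
]
report = summarize(issues)
for owner, row in report.items():
    print(owner, row)
```

Grouping by owner keeps the ownership question visible next to the cost figures, which is usually what gets attention in a periodic report.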


Any data project should consider the trust aspects of the data. Introducing a data certification process for adding new data, especially large batches, will greatly improve business participation and overall quality. Key quality measures such as missing/null values, wrong information (invalid values), rejections and warnings should be communicated and remedied within the expected timeframe. Many times we find that the fix belongs in the upstream systems, where it eliminates both current and future data issues.

Creating a data certification process is the best way to engage the business and gain its trust. The key idea is to make the business, not IT, responsible for the data.
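To make the certification idea concrete, here is a minimal Python sketch of a check that could run on an incoming batch. Everything in it is an assumption for illustration: the function name `certify_batch`, the field names, the allowed values and the 5% rejection threshold are not taken from any specific tool or client process.

```python
def certify_batch(records, required_fields, valid_values, max_reject_rate=0.05):
    """Count missing and invalid values in a batch and decide whether it passes.

    records:         list of dicts, one per incoming row
    required_fields: fields that must not be null or empty
    valid_values:    mapping of field name -> set of allowed values
    max_reject_rate: hypothetical threshold agreed with the business
    """
    missing = {field: 0 for field in required_fields}
    invalid = {field: 0 for field in valid_values}
    rejected = 0

    for record in records:
        bad = False
        for field in required_fields:
            value = record.get(field)
            if value is None or value == "":
                missing[field] += 1
                bad = True
        for field, allowed in valid_values.items():
            value = record.get(field)
            # Missing values are already counted above, so only flag
            # values that are present but not in the allowed set.
            if value not in (None, "") and value not in allowed:
                invalid[field] += 1
                bad = True
        if bad:
            rejected += 1

    reject_rate = rejected / len(records) if records else 0.0
    return {
        "missing": missing,
        "invalid": invalid,
        "rejected": rejected,
        "reject_rate": reject_rate,
        "certified": reject_rate <= max_reject_rate,
    }


# Made-up sample batch, for illustration only.
batch = [
    {"id": 1, "gender": "F", "income": 52000},
    {"id": 2, "gender": "X", "income": 41000},  # invalid gender code
    {"id": 3, "gender": "M", "income": None},   # missing income
    {"id": 4, "gender": "M", "income": 38000},
]
result = certify_batch(batch, ["id", "gender", "income"], {"gender": {"F", "M"}})
print(result)
```

The returned counts are the kind of figures that can be sent back to the data owner, so that the fix can be made upstream rather than patched on every load.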

DQ as part of SDLC

Data quality should be part of the SDLC to guarantee acceptable quality. Development projects should include time for reporting the key quality measures. This will greatly improve the trustworthiness of the data, which in turn speeds up adoption of the new application.

Leverage Governance   

Governance is the best way to gain support for quality initiatives. Most of the time, when it comes to cutting costs, quality-related development time is cut because it is perceived as nice to have. Data Governance (DG) can mandate these requirements and get the necessary support. Make sure there is a process for getting onto the DG agenda, and leverage it to bring about the DQ transformation.

In short, making incremental changes to existing processes so that quality becomes a key component will help build the case for the broader process changes and tools needed to manage overall quality.

Link to earlier DQ blog:

What is the worth of Data Quality to organizations?

Bootstrapping Data Governance – Part I

A lot has been said and written about Data Governance (DG) and the importance of having it. However, creating an effective DG is still a mystery for many companies. Based on our experience, the majority of companies in the early stages of DG fall into one of these areas:


  1. Had too many false starts
  2. Not much impact and the DG lost much of the support
  3. No clue, not even attempted

Why is it so difficult to set up a reasonably functioning Data Governance?

The typical scenario is that IT leads the Data Governance initiative as part of an IT overhaul or a new initiative such as rebuilding the data warehouse or launching a Master Data Management program. Too often companies establish DG with limited vision, narrow scope and minimal business involvement. The problems and pitfalls companies run into during the DG establishment phase can be broadly classified under three major areas: Vision, Preparation, and Sponsorship & Support.


Getting executive buy-in and setting the Data Governance vision is a process of evolution. Typically this takes 3 to 12 months of pre-work through casual meetings and by including DG as a discussion topic on strategy meeting agendas. Building awareness through common education, such as attending industry seminars and conferences, is another dimension of setting the vision. If the DG concept has been discussed and socialized for some time, the next step is to leverage that common understanding to launch the program.


Being prepared is the best way to avoid false starts. The opportunity to launch DG tends to arrive when you are least prepared. It is not easy to devote time to DG incubation when you have burning issues around you. But those burning issues, especially catastrophic events, may escalate the urgency for DG and win unprecedented executive support or even a mandate from the top. If you are unprepared at that point, you are stuck.

Sponsorship & Support

Once you get the go-ahead, approaching DG without a holistic vision and a complete picture will water down the momentum, and support will slowly start to disappear. Keeping the executive team committed to DG means producing meaningful results and engaging the business from planning through execution.


DG establishment is followed by the organization’s ability to successfully execute the DG mandates. Again, putting together a solid approach to people, process and technology will guarantee the success of DG.

In the next segment, let’s look at nimble and effective strategies for keeping DG a successful organization from establishment through execution.

Is IT ready for Innovation in Information Management?

Information Technology (IT) has come a long way from being a delivery organization to being part of the business innovation strategy, though a lot still has to change in the coming years. Depending on the industry and the company culture, an IT organization will mostly fall on the operational end of the spectrum, though many progressive ones are gravitating towards innovation. Typically, IT may be consulted on executing the strategic vision. It is not IT’s role to lead the business strategy, but data and information are another story. IT is uniquely positioned to innovate in Information Management because of its knowledge of the data; if it does not take up that challenge, the business will look outside for innovation. Today’s marketplace offers tools and technologies directly to business users, and they will bypass IT organizations that are not ready for the information challenge. A good example is business users trying out third-party (cloud) services and self-service BI tools for slicing and dicing data, cutting down the development cycle. The only way IT can play the strategic game is to get into the game.

It is almost impossible for IT to ignore data and bury its head in keep-the-lights-on projects. So I took a stab at listing the types of products and technologies that have been maturing over the last 5 years in the Data Management space. By no means is this a complete list, but it captures the essence.

An interesting phenomenon is that many companies that were traditionally late to adopt a data-driven approach are now using analytical tools, as these become visually appealing and affordable. Cloud adoption is another trend, making it possible to deploy and manage technology without a huge IT bottleneck.

The question every IT organization, irrespective of company size, should ask is: are we ready to take on a strategic role in the enterprise? How well can IT co-lead the business solution rather than just implement an application after the fact? Data Management is one area where IT needs to take the lead in educating and in driving innovation to solve business problems. Predictive analytics and Big Data are right at the top, along with all the necessary supporting platforms including Data Quality, Master Data Management and Governance.

It will be interesting to know how many IT organizations leverage the Information Management opportunity.



Primary Practices for Examining Data

SPSS Data Audit Node





Once data is imported into SPSS Modeler, the next step is to explore the data and become “thoroughly acquainted” with its characteristics. Most (if not all) data will contain problems or errors such as missing information and/or invalid values. Before any real work can be done with this data, you must assess its quality (higher quality = more accurate predictions).

Addressing issues of data quality

Fortunately, SPSS Modeler makes it (almost too) easy! Modeler provides several nodes that can be used for our integrity investigation. Here are a couple of things even a TM1 guy can do.

Auditing the data

After importing the data, do a preview to make sure the import worked and things “look okay”.

In my previous blog I talked about a college using predictive analytics to predict which students might or might not graduate on time, based upon their involvement in athletics or other activities.

From the Variable File Source node, it was easy to have a quick look at the imported file and verify that the import worked.










Another useful option is to run a table. This will show whether field values make sense (for example, whether a field like age contains only numeric values and no string values). The Table node is cool: after dropping it into my stream and connecting my source node to it, I can open it up and click run (to see all of my data nicely fit into a “database-like” table), or I can do some filtering using the real-time “expression builder”.















The expression builder lets me see all of the fields in my file, along with their level of measurement (shown as Type) and their Storage (integer, real, string). It also gives me the ability to select from SPSS predefined functions and logical operators to create a query expression to run on my data. Here I wanted to highlight all students in the file that graduated “on time”:












You can see the possibilities that the Table node provides – but of course it is not practical to visually inspect thousands of records. A better alternative is the Data Audit node.

The Data Audit node is used to study the characteristics of each field. For continuous fields, minimum and maximum values are displayed. This makes it easy to detect out of range values.

Our old pal measurement level

Remember measurement level (a field’s “use” or “purpose”)? Well, the Data Audit node reports different statistics and graphs depending on the measurement level of the fields in your data.

For categorical fields, the Data Audit node reports the number of unique values (the number of categories).

For continuous fields, the minimum, maximum, mean, standard deviation (indicating the spread in the distribution), and skewness (a measure of the asymmetry of a distribution; a symmetric distribution has a skewness value of 0) are reported.

For typeless fields, no statistics are produced.
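To show what these per-field statistics amount to, here is a small Python sketch that mimics the idea outside of Modeler. It is not SPSS code and not how the Data Audit node is implemented: the function name `audit_field` is made up, and the population standard deviation and moment-based skewness used here are assumptions, so the numbers may differ slightly from what Modeler reports.

```python
import statistics


def audit_field(values, measurement):
    """Return summary statistics for one field, chosen by measurement level.

    measurement is "categorical", "continuous" or "typeless"
    (labels assumed for this sketch). Missing values (None) are ignored.
    """
    present = [v for v in values if v is not None]
    if measurement == "categorical":
        # Number of distinct categories.
        return {"unique": len(set(present))}
    if measurement == "continuous" and present:
        mean = statistics.fmean(present)
        std = statistics.pstdev(present)  # population standard deviation
        if std > 0:
            # Third standardized moment: 0 for a symmetric distribution.
            skew = sum((v - mean) ** 3 for v in present) / len(present) / std ** 3
        else:
            skew = 0.0
        return {
            "min": min(present),
            "max": max(present),
            "mean": mean,
            "std_dev": std,
            "skewness": skew,
        }
    # Typeless fields (or empty continuous fields): no statistics.
    return {}


# Made-up sample values, for illustration only.
gender_stats = audit_field(["F", "M", "F", None], "categorical")
income_stats = audit_field([1, 1, 1, 10], "continuous")
print(gender_stats)
print(income_stats)
```

The minimum and maximum are what make out-of-range values easy to spot, and a skewness well away from 0 is a hint that a few extreme values are pulling the distribution to one side.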

“Distribution” or “Histogram”?

The Data Audit node also produces a graph for each field in your file, again based on the field’s level of measurement (no graphs are produced for typeless fields).

For a categorical field (like “gender”) the Data Audit node will display a distribution graph, and for a continuous field (for example “household income”) it will display a histogram.

So, back to my college example: I added an audit node to my stream and took a look at the results.











First, I excluded the “ID” field (it is just a unique student identification number and has no real meaning for the audit node). Most of the fields in my example (gender, income category, athlete, activities and graduate on time) are qualified as “Categorical” so the audit node generated distribution graphs, but the field “household income” is a “Continuous” field, so a histogram was created for it (along with the meaningful statistics like Min, Max, Mean, etc.).














Another awesome feature – if you click on the generated graphs, SPSS will give you a close up of the graph along with totals, values and labels.


I’ve talked before about the importance of understanding field measurement levels. The fact that the statistics and chart types the Data Audit node generates are derived from the measurement level is another illustration of how Modeler follows the approach that measurement level determines the output.


IBM Vision 2013 – 2 thumbs Up!


I just returned from the IBM Vision Conference in Orlando, Florida. I attended a session in every available timeslot from Monday morning to Wednesday afternoon and it was worth every single minute of my time!

Although there were too many sessions and presenters to mention, here are my “top picks”:

  • Designing Solutions with IBM Cognos TM1 Performance Modeler – Andy Neimens and Stephen Brook. This session took a case study approach to building a planning and analysis solution using the Performance Modeler tool. If you have been reading my blog posts, you know I am in love with this tool. If anyone out there still thinks that it’s acceptable to develop TM1 solutions using only TM1 Architect, and is not steadily building expertise with PM, they are going to be left behind!


  • Reducing Cost through Predictive Analytics by integrating IBM SPSS and IBM Cognos BI – JBS International. This was a “deep-dive” discussion on sourcing data from disparate systems and files to use with SPSS Statistics and SPSS Modeler to uncover relationships between financial performance and business objectives. Again, as you know, SPSS is a passion of mine, and the PhDs at JBS demonstrated their expertise with the technology. I spent time on break talking to these guys and trying to absorb their every word.


  • Building GRC (governance, risk and compliance) Success with the Power of Customer Experience – Chris McClean of Forrester Research. This session explored how GRC programs can employ best practices from customer experience to create an environment where employees want to participate in the program and embed it into their standard operating procedures. This was an interesting presentation with real-world examples and a vision for improving your GRC programs. It refreshed and renewed my commitment to the GRC programs I’ve helped develop and support throughout my career.


  • Data Quality and Analytics – Which is the chicken and which is the egg? – Tony Petkovski, Commonwealth Bank of Australia. In this session, Tony demonstrated how his bank is using IBM’s OpenPages GRC platform to drive quality analytics and support better reporting and decision making, driving down the bank’s risks. Tony is such a passionate and charismatic guy that I left wanting to transfer all my money to the Commonwealth Bank!


  • Delivering Stronger Business Insight through a CFO Dashboard – Tony Levy, IBM. This was a demonstration of IBM’s Smarter Analytics “Signature Solution” that leverages TM1, Cognos BI and SPSS Modeler to deliver a CFO dashboard that visualizes KPIs and KRIs in NRT (near real time). Walking out of this session I thought, I now know what I want to be when I grow up! Tony presented this new “add-in product” for customers using these technologies: a configurable and customizable tool that will blow you away. Again, as you may or may not know, as a technology implementer I design and help build these kinds of solutions all of the time. It’s always nice to see something like this. TM1, Cognos BI and SPSS Modeler? That has to be the “perfect storm”.


  • Vernice “fly-girl” Armour – America’s first African American female combat pilot. Last (but not least), I absolutely enjoyed attending the Tuesday keynote presentation by Vernice Armour. She is so compelling and inspiring. She’s written a book (which I plan to pick up this week), “Zero to Breakthrough: The 7-Step, Battle-Tested Method for Accomplishing Goals that Matter”. My favorite quote: “helicopters don’t need a runway”… Vernice: “Engage hot”!


Thank you, IBM, for another great conference, and I hope to see you again next year!


Special thanks to my friends at Perficient for my ticket!


Introduction to Data Quality Services (DQS) – Part I

I was recently introduced to SQL Server 2012 and discovered Data Quality Services (DQS), a new feature of SQL Server 2012. I want to use this blog to introduce DQS, define key terms, and present a simple example of the tool. According to MSDN:

The data-quality solution provided by Data Quality Services (DQS) enables a data steward or IT professional to maintain the quality of their data and ensure that the data is suited for its business usage. DQS is a knowledge-driven solution that provides both computer-assisted and interactive ways to manage the integrity and quality of your data sources. DQS enables you to discover, build, and manage knowledge about your data. You can then use that knowledge to perform data cleansing, matching, and profiling. You can also leverage the cloud-based services of reference data providers in a DQS data-quality project.

The below illustration displays the DQS process:


Teradata Talks Enterprise Data Integration

Teradata has long been known for its powerful data systems and its drive to push benchmarks for large data volumes. In fact, in 1992 Teradata built a first-of-its-kind system for Wal-Mart, capable of handling 1 terabyte of data. One of the main advantages of routinely working with very large data sizes is exposure to integration, data quality (DQ) and master data management (MDM) techniques. After years of this type of work, Teradata has positioned itself as an expert on these topics. With that, here is a video that includes these buzzwords and more as Teradata describes how to achieve data integration at the enterprise level:

Back to the Basics: What is Big Data?

This video published by SAP provides a concise description of Big Data. Timo Elliott (SAP Evangelist) and Adrian Simpson (CTO, SAP UK & Ireland) describe the 4 major challenges that make up big data:

  • Volume – Amount of data
  • Velocity – Frequency of change in data
  • Variety – Both structured and unstructured data
  • Validity – Quality of the data

Before we can push into the details, we must first understand the topic in its simplest form as a platform to build from. Enjoy:


Data Governance – a must-have to ensure data quality – Part 2

In Part 1, we saw an overview of Data Governance and the initiatives firms need to take to incorporate governance. Let’s now look in a bit more detail at Data Quality Management, as it is a key step in Data Governance towards ensuring data quality.

Why is Data Quality Management necessary?

Data Quality Management is the process of establishing roles & responsibilities and the business rules that govern data by bringing the Business and IT to work together. Their task is two-fold: to address the problems that already exist and to prevent potential ones from occurring. Let’s focus on the roles & responsibilities, as these form the core of a Data Quality Management program.

Roles & Responsibilities

There are various roles involved in this process, and all of them have to be accountable for ensuring data quality. It’s vital that the roles are clearly defined upfront. The following are some of the commonly recognized roles:-

  • Data Governance Council – comprises an Information Management Head and Data Stewards from various units.
  • Information Management Head – is the one who is accountable to the Governance Council on all aspects of data quality. This role would typically be fulfilled by the CIO.
  • Data Stewards – are the unit heads who lay down the rules & policies to be adhered to by the rest of the team. This role would usually be fulfilled by a Program Manager.
  • Data Custodians – are responsible for the safe storage & maintenance of data within the technical environment. DBAs would normally be the data custodians in a firm.
  • Business Analysts – are the ones who convey the data quality requirements to the data analysts.
  • Data Analysts – are those who would reflect the requirements in the model before handing it over to the development team.


Some best practices to successful data governance 

This article talks about some of the best practices for successful data governance. The key steps include:-

  • Get a governor and the right people in place to govern
  • Survey your situation
  • Develop a data-governance strategy
  • Calculate the value of your data
  • Calculate the probability of risk
  • Monitor the efficacy of your controls

While it is quite difficult to implement a data governance program, there is little doubt about the value it adds. Companies often judge it only by the number of personnel involved and the immediate ROI, without looking at it from a broader perspective. Ultimately it is your own data that makes you stand out from your competitors. Ensuring data quality will result in better insights from your analysis. Technology will always be a valuable enabler when there is a strong data governance program tied to it!

Data Governance – a must-have to ensure data quality – Part 1

While one of my earlier posts on Quality Data being a pre-requisite for every BI technique is still generating both positive and negative responses, I felt it would be apt to delve into Data Governance and see why incorporating it is necessary to achieve & maintain better data quality.

First, let’s have a quick overview of data governance.

What is Data Governance?

Wikipedia defines Data Governance as a set of processes that ensure key data assets are formally managed throughout the enterprise so that the data can be trusted and people can be made accountable for any adverse event that occurs due to bad data. Data Governance is essentially a quality control discipline mainly meant to improve and maintain the data quality.

Data Governance Objectives

  • Improve decision-making of the management
  • Ensure data consistency
  • Build trust of data among everyone involved in the process
  • Adhere to compliance requirements
  • Eliminate risks related to data

Data Governance Pillars

It’s important to realize that data quality is just one of the pillars of governance. Typically Data Governance comprises the following pillars:-

  • Metadata Management – It involves storing information about your data by means of a metadata repository.
  • Master Data Management (MDM) – It is a process of collecting and aggregating all the data within the firm into a single master file (acts as a reference) to ensure consistency.
  • Data Quality Management – It involves setting up of roles, responsibilities & governing business rules by bringing the Business and IT together with the focus on data quality.
  • Data Security – As the name indicates, it provides data access only to authorized users and protects the data from unauthorized users and other threats.

Data Governance Initiatives

While there are quite a few data governance frameworks (like DMBOK, COBIT, etc.) out in the industry which firms can adopt, the following points could provide some first steps:-

  • Create a Data Governance vision statement
  • Analyze & define data quality levels to be able to monitor performance
  • Establish roles & responsibilities through collaboration between Business and IT
  • Set up a Stewardship model to ensure data ownership & eliminate risks

Though it can take a considerable amount of time and effort to set up Data Governance, there is no doubt that it is going to improve the overall process of running your business.

In Part 2, we’ll look specifically at Data Quality Management and some best practices for achieving data quality.