
Key insights on source data for healthcare analytics & exchanges

Providers and payers need to exchange or source a lot of data, and the rate of this activity will only increase with the implementation of Obamacare and other directives. Given the poor state of metadata management (which makes data sharing much more difficult), the decision to incorporate a new data set into an Enterprise Data Warehouse or data services can be fraught with confusion, making it very difficult to estimate the level of effort and deliver on time. It makes sense, therefore, to identify and document a list of “what we need to know about data” so that standard policies and data dictionary templates can be crafted to serve as the foundation for the data contract. This holds even if you’re exchanging XML with an XSD, unless the XSD is completely expressive, with restrictions on lengths, acceptable values, and so on – and even then it can be hard for business and data analysts to review an XSD.
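As an illustration, here is a minimal sketch (Python with the lxml package; the element names and restrictions are hypothetical, not from any real exchange) of what a “completely expressive” XSD looks like – lengths and acceptable values are enforced by the schema itself, so the contract is machine-checkable:

```python
# Hypothetical schema fragment: SubscriberId restricts min/max length,
# and Gender restricts the acceptable values via an enumeration.
from lxml import etree

XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Member">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="SubscriberId">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:minLength value="9"/>
              <xs:maxLength value="11"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element name="Gender">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:enumeration value="M"/>
              <xs:enumeration value="F"/>
              <xs:enumeration value="U"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.XML(XSD))
member = etree.XML(b"<Member><SubscriberId>123456789</SubscriberId><Gender>M</Gender></Member>")
print(schema.validate(member))  # True -- lengths and enumerations all honored
```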

The list of “what we need to know about data” must go far beyond bare-bones metadata such as field name, datatype, length, and description. Why? Because someone will have to gather the missing information, and this collection takes a significant amount of time and effort on the part of business analysts, data analysts, data stewards, and modelers. If the effort to collect this information isn’t made up front, then more time and money will be required during development, with the added risk of undermining confidence in the data.

By identifying the standard list of “what we need to know about data,” the source metadata can be assessed more rapidly to identify what is missing, making it easier to estimate the level of effort. For example, assume you’ve identified 15 things you need to know about every atomic data element/field/column in your standard data source data dictionary template, and for the sake of argument assume you have two data sources to integrate, each with 100 data elements/fields – one with 75% of the needed metadata available and the other with 40% available. Obviously the first can be integrated more rapidly (all other things being equal). This should also affect pricing if you’re bringing in data from an external customer, as the first dataset will be less expensive to integrate – you will spend much less time trying to gain a proper understanding of the data, so ETL and data quality checks can be developed with much less rework. A back-of-the-envelope version of this scoring is sketched below, followed by examples of metadata elements you might want to capture about source data.
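As a toy illustration only (plain Python; the counts are invented to match the example above), a completeness score per source might be computed like this:

```python
# Toy completeness scoring: for each field, count how many of the required
# metadata elements are populated. All numbers here are illustrative.
REQUIRED_ELEMENTS = 15  # the "what we need to know" items per field

def completeness(populated_counts):
    """Share of required metadata present, across all fields of a source."""
    return sum(populated_counts) / (len(populated_counts) * REQUIRED_ELEMENTS)

source_a = [11] * 100  # 100 fields, ~75% of the needed metadata each
source_b = [6] * 100   # 100 fields, ~40% each

for name, counts in (("A", source_a), ("B", source_b)):
    score = completeness(counts)
    gaps = len(counts) * REQUIRED_ELEMENTS - sum(counts)
    print(f"Source {name}: {score:.0%} complete, {gaps} metadata gaps to chase down")
```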

Container Metadata

| Metadata Element | Mandatory? | Comment |
| --- | --- | --- |
| Container Technical Name (table, file name, container XML element…) | Y | |
| Container Business Name | Y | |
| Container Definition | Y | |
| Container Relationships | Y | Ideally these will be identified in a data model you receive from the data source provider. Even if you’re getting a single flat file, don’t assume there aren’t relationships – the data may have been denormalized, in which case you want to know how data elements relate and the cardinality/optionality of those relationships. |

Attribute/Atomic Element/Column/Field Metadata

| Metadata Element | Mandatory? | Comment |
| --- | --- | --- |
| Element Technical Name | Y | Be sure you get descriptions for all abbreviations used. DON’T make assumptions! |
| Element Business Name | Y | Make sure this is something a business person would understand without having to be a source system expert. |
| Synonyms/Acronyms | N | Goes a long way toward bridging the semantic divide. |
| Description | Y | As detailed as possible, avoiding tautologies, e.g., defining PERSON_NM as “The name of a person” is a tautology. |
| Part of Primary Key? | Y | |
| Part of Natural Key? | Y | Extremely important – the primary key might just be a sequential number. You need to know what makes a record unique from a business perspective. If you know the natural key, you’ve saved yourself some trouble! |
| Datatype | Y | |
| Minimum Length | Y | Usually only the maximum length is identified, but for character fields there may be a requirement that the data be at least N characters. |
| Max Length | Y | |
| Mandatory | Y | Indicates whether the element allows nulls. |
| Valid Values / Domain | Y | Can be a set of limited values (if so, document these below), a range, or verbiage describing valid values. |
| Valid value | Y (if a limited set of valid values) | |
| Valid value description | Y (if a limited set of valid values) | If the valid values are X, Y, or Z – what do these mean? |
| Example values | Y (depending on Valid Values/Domain) | Not needed if the list of valid values can be expressed above; if the set of valid values is too large, provide examples. |
| Conditions / Exclusions | N | Special conditions or rules to consider, e.g., if the Claim Date year is < 2010 then there should not be an “X” value. |
| Format | Y (if not freeform text) | E.g., an SSN format might be NNN-NN-NNNN; an amount might be $N,NNN.NN. |
| Security / Privacy | N | You want to be sure to identify whether the element will contain PHI, needs special access restrictions, etc. |
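To put the template into practice, here is a minimal sketch (Python; the field names and the sample entry are illustrative, not a prescribed standard) of a data dictionary entry capturing the elements above:

```python
# A minimal data dictionary entry for one atomic element.
# Fields mirror the table above; all names and values are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ElementMetadata:
    technical_name: str                      # e.g., "MBR_GNDR_CD"
    business_name: str                       # readable without source-system expertise
    description: str                         # detailed, non-tautological
    datatype: str
    min_length: int
    max_length: int
    mandatory: bool                          # False means nulls are allowed
    part_of_primary_key: bool = False
    part_of_natural_key: bool = False
    valid_values: Optional[dict] = None      # value -> description, if a limited set
    example_values: list = field(default_factory=list)
    conditions: Optional[str] = None         # special rules / exclusions
    format: Optional[str] = None             # e.g., "NNN-NN-NNNN" for an SSN
    security_privacy: Optional[str] = None   # e.g., "PHI - restricted access"

gender = ElementMetadata(
    technical_name="MBR_GNDR_CD",
    business_name="Member Gender Code",
    description="Code identifying the member's gender as reported on enrollment.",
    datatype="CHAR",
    min_length=1,
    max_length=1,
    mandatory=True,
    valid_values={"M": "Male", "F": "Female", "U": "Unknown"},
)
```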


Of course, you will still need to perform data profiling on the data source, as discrepancies between the data and the metadata do happen. For example, if the data producer specifies that a certain data element may only have two values (e.g., Y or N), but through profiling you see nulls, X’s, 9’s, etc., then there is a data quality issue that should be addressed in the data source (if at all possible), or the metadata needs to be updated.
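A minimal profiling sketch (using pandas; the dataframe and column name are hypothetical) for exactly this case:

```python
# Compare the observed values in a supposedly Y/N field against the
# documented domain; anything else signals a data quality or metadata issue.
import pandas as pd

claims = pd.DataFrame({"auth_flag": ["Y", "N", "Y", None, "X", "9", "N"]})
documented_domain = {"Y", "N"}

observed = set(claims["auth_flag"].dropna().unique())
print("Nulls:", int(claims["auth_flag"].isna().sum()))        # 1
print("Out-of-domain values:", observed - documented_domain)  # {'X', '9'}
```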

So before you agree to bring in a new dataset, be sure to assess the source metadata. The degree of metadata completeness is as important as knowing when the data will be received, how it will be received, latency requirements, etc. – and assessing it up front will result in faster and more accurate integration.

I’m interested to hear your feedback. You can reach me at pete.stiglich@perficient.com, on Twitter at @pstiglich, or in the comments section below.
