Perficient Healthcare Solutions Blog


Pete Stiglich


Database inferencing to get to trusted healthcare data

A health insurance client of mine recently embarked on an initiative to truly have “trusted data” in its Enterprise Data Warehouse so that business leaders could make decisions based on accurate data.  However, how can one truly know whether data is trustworthy?  In addition to having solid controls in place (e.g., unique indexes on the primary AND natural key), it is also necessary to measure how the data compares to defined quality rules.  Without this measurement, trusted data is a hope – not an assured reality.

To enable this measurement, I designed a repository for storing

  • configurable data quality rules,
  • metadata about data structures to be measured,
  • and the results of data quality measurements.

I found I needed to perform a degree of “inferencing” in the relational database (DB2) used for this repository.  Normally one thinks of inferencing as the domain of semantic modeling and semantic web technologies like RDF, OWL, SPARQL, Pellet, etc. – and these are indeed very powerful technologies that I have written about elsewhere.  However, using semantic web technologies wasn’t a possibility for this system.
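A minimal sketch of the kind of inferencing a relational database can perform on its own: deriving that a quality rule attached to a parent data structure also governs every descendant structure, via a recursive common table expression. The table and column names here are illustrative, not the actual repository design, and SQLite is used for portability (DB2 supports the same construct with slightly different syntax).

```python
import sqlite3

# In-memory illustration: a quality rule attached to a parent data domain
# should be inferred to apply to every descendant structure.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_structure (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
    CREATE TABLE quality_rule (id INTEGER PRIMARY KEY, rule TEXT, structure_id INTEGER);
    INSERT INTO data_structure VALUES
        (1, 'Member Domain', NULL),
        (2, 'MEMBER table', 1),
        (3, 'MEMBER.birth_date column', 2);
    INSERT INTO quality_rule VALUES (10, 'No future dates', 1);
""")

# Recursive CTE: walk down from each structure a rule is attached to,
# inferring that the rule also governs all child structures.
rows = conn.execute("""
    WITH RECURSIVE governed(rule_id, structure_id) AS (
        SELECT id, structure_id FROM quality_rule
        UNION ALL
        SELECT g.rule_id, ds.id
        FROM governed g JOIN data_structure ds ON ds.parent_id = g.structure_id
    )
    SELECT rule_id, structure_id FROM governed ORDER BY structure_id
""").fetchall()
print(rows)  # rule 10 is inferred for structures 1, 2, and 3
```

The same pattern extends to any hierarchy stored in the repository, so a rule need only be recorded once at the highest level at which it applies.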


A Wish List for Data Modeling Technology

I was recently a panelist for a Dataversity webinar/discussion focused on the future of data modeling tools, functionality, and practice.  Given the holiday season, the panelists discussed their wish list for modeling tools – from currently practical (but maybe not economically viable) to futuristic (e.g., using a 3D printer to print models for model reviews, using Google Glass to move objects around on the model).

Of course, many modeling tools already support a vast array of functionality, and it can sometimes be difficult to use some of the non-core features without experiencing unintended consequences.  More intelligent guides and better semantics in the modeling tool would make these features easier to use – so modelers can focus more on modeling and less on the technology.

More important than the technology – as important and interesting as that is – is having solid processes and modeling standards in place to ensure better model quality, reuse, and understandability.  

Disruption caused by Data Governance?

Instituting Data Governance is a major initiative providing a significant opportunity to leverage enterprise data assets more effectively.  As such, there is sometimes concern about being seen as a roadblock, or about formulating new enterprise-level organizations such as a Data Governance Board and Data Stewardship committees.  To be sure, a proper balance between enforcing standards and retaining agility will be needed, and of course Data Governance organizations should not be assembled without careful thought and a phased approach.  However, Data Governance is all about making strategic decisions about enterprise data assets – and of course people have to meet to discuss and make these decisions.  Not just anyone should be making these decisions, and so Data Governance organizations need to be formed.

Forming Data Governance organizations should take place only after an assessment, strategy, and roadmap project. Part of this project should be identifying roles and executive stewards for these roles, after significant collaboration with the Executive Sponsor. The first organization to be formed should be the Data Governance Board, as other organizations need to be formed under the direction of the Data Governance Board.  Enterprise standards should be enforced after careful deliberation, approval, and promotion by Data Governance.

Some take a more surreptitious approach to Data Governance, applying heroic effort to best practices such as developing an enterprise data model, setting up a metadata repository, or undertaking data quality initiatives.  While these are worthwhile objectives, unless there are supporting Data Governance organizations to endorse, promote, and oversee them, these efforts tend to be difficult to sustain (due to lack of resources) and cause frustration and burnout.  More importantly, there may not be enterprise alignment.  For example, an Enterprise Conceptual Data Model not reviewed with and approved by the business may represent just the perspective of the data modeler.

Data Governance should be a disruptive agent for (good) change, and needs to be implemented after careful deliberation and executive sponsorship.  Forming Data Governance organizations needs to take place in a phased approach.

A Low Cost Big Data Integration Option?

With all of the interest in big data in healthcare, it’s easy to get drawn in by the excitement and not realize that it’s not a silver bullet that’s going to address all your data and infrastructure problems.   Unless you are able to understand and integrate your data, throwing all the data onto a platform like Hadoop or Cassandra probably won’t provide the benefit you’re looking for. Of course, there really is benefit to leveraging a big data platform for the right kinds of use cases, such as increased scalability, performance, potentially lower TCO, etc.

Of course, there are many integration tools out on the market that perform well.   However, I’d like to propose consideration of Semantic Web technologies as a low cost alternative to traditional data integration.  Many are open sourced and are based on approved standards from W3C such as RDF (Resource Description Framework) and OWL (Web Ontology Language).

Using Semantic Web technologies to enable integration – for example, the Linked Open Data initiative for integrating data across the internet – can (besides being less expensive) provide significant advantages through automated inferencing of new data, which would previously have required specialized programming to derive.  Indeed, your Semantic Web environment can serve as the knowledge base for artificial intelligence.
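To make the inferencing idea concrete, here is a toy forward-chaining inferencer over subject-predicate-object triples, illustrating in miniature what an OWL reasoner derives automatically for a transitive property such as subClassOf. A real deployment would use RDF/OWL tooling; the class names below are illustrative examples, not from any actual ontology design.

```python
# A toy forward-chaining inferencer over subject-predicate-object triples.
# Illustrative only: real Semantic Web stacks use RDF stores and reasoners.
triples = {
    ("ICD10:E11", "subClassOf", "DiabetesMellitus"),
    ("DiabetesMellitus", "subClassOf", "EndocrineDisorder"),
}

def infer_transitive(triples, predicate="subClassOf"):
    """Repeatedly apply the transitivity rule until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(inferred):
            for (b2, p2, c) in list(inferred):
                if p1 == p2 == predicate and b == b2:
                    new = (a, predicate, c)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred

result = infer_transitive(triples)
# The new fact ("ICD10:E11", "subClassOf", "EndocrineDisorder") is derived,
# without any specialized programming for that particular relationship.
print(("ICD10:E11", "subClassOf", "EndocrineDisorder") in result)  # True
```

The payoff is that a single declared rule (transitivity) derives facts across the whole knowledge base, which is exactly the kind of work that otherwise ends up as hand-written integration code.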


The Conceptual Data Model – Key to Integration

The importance of data integration, whether for analytics, mergers/acquisitions, outsourcing arrangements, third party interfaces, etc., is easy to understand but extremely difficult to realize.  The technical aspect of data integration is the (relatively) easy part.  The hardest part is bridging the semantic divide and agreeing on business rules – i.e., communication.  The Conceptual Data Model can be an invaluable tool to help bridge the gap.

A Conceptual Data Model is a technology, application, and (usually) business unit neutral view of key business objects (data entities) and their relationships, and should be the framework for information systems.  It is a business model, from a business object perspective, where we’re identifying what data is of interest to the business rather than how the data is used.  It is not a process model, and is stateless in nature (all states of a relationship must be expressed to provide a longitudinal perspective).   Relationships can express many business rules.  A Conceptual Data Model typically takes the form of an ERD, UML Class Diagram, or ORM diagram.


Key insights on source data for healthcare analytics & exchanges

Providers and payers need to exchange or source a lot of data, and the rate of this activity will only increase with the implementation of Obamacare and other directives.  Given the poor state of metadata management (which makes data sharing much more difficult), the decision to incorporate a new data set into an Enterprise Data Warehouse or data services can be fraught with confusion, making it very difficult to estimate the level of effort and deliver on time.  It makes sense, therefore, to identify and document a list of “what we need to know about data” so that standard policies and data dictionary templates can be crafted to serve as the foundation for the data contract.  This goes above and beyond an XSD if you’re using XML – unless the XSD is completely expressive, with restrictions on lengths, acceptable values, etc. – but even then it can be hard for business and data analysts to review an XSD.

The list of “what we need to know about data” must go far beyond bare-bones metadata such as the field name, datatype, length, and description.  Why?  Because someone is going to have to gather the missing information, and this collection takes a significant amount of time and effort on the part of business analysts, data analysts, data stewards, and modelers.  If the effort to collect this information isn’t made up front, then more time and money will be required during the development process, with an increased risk of lost confidence in the data.
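One way to pin down "what we need to know about data" is a structured data dictionary template. The sketch below is illustrative, not prescriptive: the field names and the example entry are assumptions meant to show the kinds of metadata, beyond name/datatype/length/description, that a data contract should capture.

```python
from dataclasses import dataclass, field
from typing import Optional

# An illustrative data dictionary template. The "extras" below are the
# pieces that are usually missing and must be gathered by analysts later.
@dataclass
class DataElementEntry:
    name: str
    datatype: str
    length: Optional[int]
    description: str
    # Beyond the bare-bones metadata:
    business_definition: str = ""
    acceptable_values: list = field(default_factory=list)
    nullable: bool = True
    source_system: str = ""
    data_steward: str = ""
    quality_rules: list = field(default_factory=list)

# Hypothetical example entry for a payer data set
entry = DataElementEntry(
    name="member_birth_date",
    datatype="DATE",
    length=None,
    description="Member's date of birth",
    business_definition="The date of birth as attested at enrollment",
    nullable=False,
    source_system="Enrollment",
    data_steward="Membership Data Steward",
    quality_rules=["must not be a future date"],
)
print(entry.nullable)  # False
```

A template like this doubles as a review checklist: any entry arriving with the "extras" still blank signals up-front work that would otherwise surface mid-project.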


Big Data as an Archival Platform

Operational systems frequently need to have data archived (i.e., backed up and purged from the database) in order to maintain performance SLAs.  Historical data is frequently maintained for a much longer period in data warehouses.  However, because a data warehouse is intended to store integrated, standardized, and historical (and perhaps current) data in order to support cross-functional analysis and reporting, a data warehouse can’t truly be considered an archival platform where the original data is stored as it looked in the operational source system.

Record retention policies and laws often specify that business records be retained in exactly the format in which the business record was originally recorded.  We can, of course, archive our data off to tape and then consider our archiving complete.

However, an alternative which may be considered is to use a Big Data cluster/grid to store all of our archived business records, whether structured or unstructured in nature (database records, contract images, etc.).  I will point out a couple of reasons for considering this:

1) Online retention of all data – if the data can be legally retained indefinitely (sometimes you can’t retain records past a certain point, e.g., juvenile court records), you can have rapid access to all of your historical data in its original format.  As soon as you purge the data from your system, someone will think of a reason why they want or need to look at or analyze that data.  Keeping it online means that you can minimize the amount of time required to make that information re-accessible.  For example, legal discovery can be performed more rapidly if all the historical data is retained.  Your data scientists and analysts will also be very happy to have all of this historical data available for data mining and investigation.

2) Big Data platforms such as Hadoop or Cassandra can store any kind of file in a distributed, redundant fashion, thereby having fault tolerance built into your cluster (homogeneous compute nodes) or grid (heterogeneous compute nodes).  By default in Hadoop, each data block is replicated onto three nodes, though this is configurable.  Cassandra can operate across multiple data centers to enable even greater fault tolerance.  In a Big Data platform, low-cost commodity servers and DAS (Direct Attached Storage) can drive down the cost of retaining all of this information.  As a result, archival off to tape might not be a mandatory step in the archive process.
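For reference, the default replication factor mentioned above is controlled by the `dfs.replication` property in Hadoop's `hdfs-site.xml`; a minimal sketch:

```xml
<!-- hdfs-site.xml: the cluster-wide default block replication factor.
     Three replicas is the Hadoop default; individual archive files can
     also be tuned after load with "hadoop fs -setrep". -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

For an archival workload, replication can be weighed against storage cost per retention tier rather than left at a single cluster-wide value.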


Healthcare Data Modeling Governance

I participated in a webinar/panel discussion last week hosted by Dataversity on Data Modeling Governance, which was well attended and lively. The focus was on governance of Data Models and the Data Modeling environment (e.g., tools, model repositories, standards).

Data Modeling Governance is supported by Data Governance – and Data Governance benefits significantly from Data Modeling Governance. I will describe what Data Modeling Governance is and how it relates with Data Governance.

Data Modeling Governance entails making strategic decisions about enterprise data model assets, data modeling practices, and the data modeling environment in order to support improved governance, management, quality, and definition of data assets. Data assets usually originate from a data model – the more our models are aligned with the enterprise (e.g., standards, nomenclature, and modeling practice) – the more our data assets will be aligned, reusable, sharable, and of higher quality.

Some examples of Data Modeling Governance in support of Data Governance:

  • Development of and adherence to a Data Modeling Standards document. By documenting these standards (rather than having them in people’s heads or in an email somewhere) a common modeling practice can be instituted, leading to less contention and confusion, more effective model reviews, more rapid modeling (not reinventing the wheel on every project), and more reusable and standardized model objects (resulting in more reusable, integratable, and shareable data – all good things that Data Governance loves).
  • Enabling a common model repository leads to more secure and findable models. Data models often express significant intellectual capital of the enterprise and so need to be properly secured. Much business metadata can and should be expressed in data models (e.g., business object names, conceptual relationships, definitions, etc.). This is of much value to Data Governance and especially important to Data Stewardship.

Data Governance in turn needs to support Data Modeling Governance. Often, Data Architects and Modelers struggle to implement best practices only to face resistance from project teams. By raising awareness of issues encountered to the Data Governance Board, Data Governance, working with Application Development leadership, can recommend updates to the SDLC to ensure those best practices are included and that adequate time for solid modeling is accommodated in projects.

I am interested to hear your feedback and experience of Data Modeling Governance. You can reach me on twitter at @pstiglich or in the comments section below.

Big Data Security

In the Big Data Stack below, security is a vertical activity touching upon all aspects of a Big Data architecture, and it must receive careful attention – especially in a healthcare environment where PHI may be involved.

NoSQL technologies, which many Big Data platforms are built upon, are still maturing technologies where security is not as robust as in a relational database.  Examples of these NoSQL technologies include MongoDB, HBase, Cassandra, and others. When designing Big Data architecture, it is easy to get excited about the power and flexibility that Big Data enables, but the more mundane non-functional requirements must be carefully considered. These requirements include security, data governance, data management, and metadata management.  The most advanced technology might not necessarily be the best fit if some of the critical non-functional requirements can’t be accommodated without too much trouble.

To protect PHI, you might consider specialized encryption software such as IBM InfoSphere Guardium Encryption Expert or Gazzang, which can perform encryption and decryption at the OS level as IO operations are performed.  These encryption technologies operate below the data layer and as such can be used regardless of how you store the data.  Using these technologies means that you can have a robust and highly secure Big Data architecture.

Figure 1 – Perficient’s Big Data Stack


I’m interested to hear your feedback and experience of security in Big Data platforms.  You can reach me on twitter at @pstiglich or in the comments section below.

Business Glossary – Sooner Rather than Later

If you are undertaking any significant data initiatives, such as MDM, Enterprise Data Warehousing, CRM, integration/migrations due to mergers or acquisitions, etc., where you have to work across business units and interface with numerous systems, one of the best investments you can make is to implement an Enterprise Business Glossary early on in your initiative.

An Enterprise Business Glossary is used to facilitate and store definitions and additional metadata about business terms, business data elements, KPI’s, and enterprise standard calculations (not every calculation will be a KPI, e.g., calculation for determining tax).

Why is an Enterprise Business Glossary so important?  Data is meant to represent a real-world business object and needs to be named and defined appropriately so it can be found, managed, analyzed, and reused.  Semantic and naming confusion leads to dis-integration.  Data entities in a Conceptual Data Model or a Semantic Model (e.g., using RDF/OWL) represent business objects such as Customer, Product, Vendor, and so on.  If you are trying to integrate data, you have to understand what these terms mean – to the business units, the source system, and the enterprise.  Otherwise, a lot of flailing will go on – data analysis, data modeling, and BI are all affected by lack of clarity in business terminology.  Executive requests for analyses which seem straightforward (e.g., How many members do we have?) are hindered by the semantic confusion and variability found in our enterprises.  Duplicate systems, data, and processes performing similar functionality require expensive resources.
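The "How many members do we have?" problem can be made concrete. The sketch below is purely illustrative – the records and the competing glossary definitions are invented – but it shows how the same executive question yields three different answers depending on which business definition of "member" applies.

```python
# Illustrative member records; statuses and types are invented examples.
members = [
    {"id": 1, "status": "active", "type": "subscriber"},
    {"id": 2, "status": "active", "type": "dependent"},
    {"id": 3, "status": "termed", "type": "subscriber"},
]

# Hypothetical glossary definitions, each expressed as a testable predicate
# over a member record. A real Business Glossary would hold these as agreed,
# governed definitions rather than code.
glossary = {
    "member (all ever enrolled)": lambda m: True,
    "member (currently active)": lambda m: m["status"] == "active",
    "member (active subscribers only)": lambda m: (m["status"] == "active"
                                                   and m["type"] == "subscriber"),
}

for term, rule in glossary.items():
    print(term, sum(1 for m in members if rule(m)))
# Three different counts for the "same" question – exactly the ambiguity
# an Enterprise Business Glossary exists to resolve.
```

Until the glossary settles which definition the enterprise means, every report author is free to pick one, and the numbers will never reconcile.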


Perficient’s Big Data Stack – Infrastructure Tier

In a previous blog posting, I introduced Perficient’s Big Data Stack in order to provide a framework for a Big Data architecture and environment.  This article is the second in a series of articles (click here for the first article on the Application tier) focusing on a component of the Big Data Stack.  The stack diagram below has two key vertical tiers – the application tier and the infrastructure tier.  I will discuss the infrastructure tier in this article.

Figure 1 – Perficient’s Big Data Stack


As mentioned previously, with any data project it is important to identify non-functional requirements (requirements for performance, scalability, availability, backups, etc.).  Requirements drive design, and to a large degree should drive infrastructure decisions.  You might need more than one Big Data environment to meet the needs of different applications.  For example, you might have a mission-critical environment requiring extensive controls or you might have a separate environment to support Data Scientists who are exploring hypotheses requiring more flexibility in the environment.

For the infrastructure tier, there is a dizzying array of technologies and architectures to choose from.  Not only do you need to determine what processing platform to use (Are we going to deploy on a public or private cloud? Are we going to utilize proprietary appliances or commodity hardware? How many servers do we need in our Big Data MPP (grid or cluster) environment?), you also have to choose how you’re going to persist the data (or you might not persist all the data, e.g., for stream processing where the data is only processed in memory and only the information needed long term is persisted to disk), and whether you are going to use a relational database or NoSQL technology.  If you want to go with a relational database, can it scale into the petabyte range?  Will it be able to handle unstructured or semi-structured data efficiently?  If you are going to use a NoSQL platform, does it have the security and data management controls necessary?  Will you have PHI or other sensitive data in the Big Data environment?  Security controls for NoSQL databases aren’t as robust as those for relational databases, so you might need to utilize an IO-level encryption technology, such as IBM InfoSphere Guardium Encryption Expert or Gazzang.

Are you going to use open source or COTS software?  Big Data technologies like Hadoop and Cassandra are open sourced, but you might want to buy a tool which masks a lot of the underlying complexity of tying open source software together, as well as providing analytics and data management capabilities to streamline Big Data efforts.  Because Hadoop and Cassandra can leverage commodity hardware, you can significantly drive down the cost of your Big Data environment – as long as you have the in-house skill sets to implement open source.

There are over 100 NoSQL databases to choose from for different types of data representation (wide column, key value pair, graph, XML, object, document).  Figure 2 below shows the different types of data representation intersected by the types of processing which might be a good fit for the representation type (not limited just to Big Data representation).

Figure 2 – Data Representation by Processing Type


These are just a few of the questions you will need to ask as you consider the infrastructure needed for your Big Data environment.  You will want to perform a strategy, roadmap, and readiness assessment for your Big Data program before undertaking infrastructure decisions and acquiring technologies.

I will delve into more detail on these components of Perficient’s Big Data Stack in future articles.  I’d be interested to hear your feedback on the Big Data Stack and how this compares with your experience.  You can reach me on twitter at @pstiglich or in the comments section below.

Perficient’s Big Data Stack – The Application Tier

In my last blog posting, I introduced Perficient’s Big Data Stack in order to provide a framework for a Big Data environment.  The stack diagram below has two key vertical tiers – the application tier and the infrastructure tier.  I will discuss the application tier in this article.

Figure 1 – Perficient’s Big Data Stack

As with any data project, it is of course important to understand how it will be used (the functional requirements), and equally important to identify non-functional requirements (requirements for performance, scalability, availability, etc.).  Requirements drive design, and to a large degree should drive infrastructure decisions. Your Data Scientists need to be actively engaged in identifying requirements for a Big Data application and platform, as they will usually be the primary users.

Most Big Data platforms are geared towards analytics – being able to process, store, and analyze massive amounts of data (hundreds of terabytes, petabytes, and beyond), but Big Data is being used for operations as well (such as patient monitoring, real-time medical device data integration, web intrusion detection, etc.).

Analytics in Big Data is typically not a replacement for your Business Intelligence capabilities.  A powerful analytics use case for Big Data is in supporting ad-hoc analyses which might not ever need to be repeated – Data Scientists formulate hypotheses and use a Big Data platform to investigate the hypothesis.  Some have the opinion that Big Data is unstructured data.  Unstructured data definitely works quite well in a Big Data platform, but if you have massive amounts of structured data (such as medical device, RFID data, or just regular tabular data) – these of course can take advantage of a Big Data platform where you can perform Data Mining inexpensively, even using open source data mining tools such as R and Mahout.

Most of the time you will be sourcing data for your Big Data application and so the Data Sourcing component is at the top of the application tier in the Big Data Stack.  There are many tools available for sourcing such as ETL tools, log scanners, streaming message queues, etc.  Due to the massive scalability and reliability of Big Data platforms such as Hadoop and Cassandra (NoSQL technologies), such platforms may be an ideal place to archive ALL of your data online.  With these technologies, each data block is automatically replicated on multiple machines (or even multiple data centers in the case of Cassandra).  Failure of the nodes in the cluster or grid is expected and so node failures can be handled gracefully and automatically.  Having all of your archived data available in a single environment online can provide a rich environment for your data scientists.

We’ve talked briefly about operational applications for a Big Data platform – but much more can be said.  Transaction processing can be supported on Big Data, but usually not ACID compliant transaction processing (HBase, built on top of Hadoop, touts that it can support ACID transactions).  Not all types of transactions require ACID compliance – e.g., if a user updates his/her Facebook status and the status is lost due to a failure, it’s not the end of the world.  Some operational applications might not persist all of the data – and might just distribute the data across the nodes to utilize computing capacity and distributed memory.

I will delve into more detail on these components of Perficient’s Big Data Stack in future articles.  I’d be interested to hear your feedback on the Big Data Stack and how this compares with your experience.  You can reach me by leaving a comment in the section below.