Perficient Healthcare Solutions Blog

Pete Stiglich

Mr. Stiglich is an experienced, international IT architect, project manager, and trainer with significant expertise in Data Architecture, Data Warehousing/Business Intelligence, Customer Relationship Management, Customer Data Integration, Enterprise Information Management, Data Modeling, Data Governance and Stewardship, Data Quality, Meta Data Management, Master Data Management, and the Semantic Web and ontologies. He has nearly 25 years of IT experience, primarily working at the enterprise level. Mr. Stiglich is a recognized expert on the SearchDataManagement website in the areas of Data Modeling and Data Warehousing, and his articles have been published in several venues (Real World Decision Support, EIMInstitute.org, Intl. Association for Information and Data Quality (IADQ.org), SemanticUniverse, and others). Mr. Stiglich has developed and taught courses on a variety of topics for large firms and government agencies, and has presented at conferences and professional associations. Mr. Stiglich is a thought leader on the topic of Conceptual Data Modeling, and uses this type of model to achieve outstanding results for effectively understanding the business, before developing solutions. He holds the Certified Computing Professional (CCP), Certified Data Management Professional (CDMP) and Certified Business Intelligence Professional (CBIP) certifications, all at the mastery level. Mr. Stiglich also serves as the president of DAMA Phoenix.

LinkedIn: Public Profile
Twitter: pstiglich

Posts by this author:

Helping Data Scientists Navigate Big Data with the Semantic Web

Posted on April 30th, 2012

The term “Big Data” is being thrown around a lot lately, but what is it, and what makes Big Data different from the other data that we work with in the healthcare industry? Big Data is differentiated by its volume, velocity, and variety. The last point, variety, covers not only the wide range of formats that data may reside in, but also the varying nomenclature and semantics of data assets.

Enter the Data Scientist

Big Data holds massive potential to uncover new insights and trends, and it has led to a new role: that of the Data Scientist. Data Scientists are (rare) people with statistical, data modeling, and programming backgrounds. They have to do more than crunch numbers; they need to be able to visualize the data and use it to tell a story to executives in order to change strategy, identify new market opportunities, determine how to better maintain population health for an ACO, or what have you. As you can imagine, these people will be highly paid.

So now you’ve embarked on a Big Data program – do you want these highly paid individuals to be spending their time trying to find and integrate data? While large enterprises may have petabytes of data, that data is typically spread across numerous silos, very often each with its own nomenclature and varying semantics, which makes the data very difficult to integrate.

Wouldn’t it be useful for Data Scientists (and a lot of other folks!) to be able to find information regardless of how it is named, query the data without having to install a lot of database clients, tie it together in a single query platform with globally unique identifiers assured, and have the computer draw as many inferences as possible (so that a lot of custom programming doesn’t have to be performed)? If you think this would be extremely useful, Semantic Web technologies are for you!

Mining the Web of Data

The Semantic Web is a “web of data” rather than a web of documents (which is what the WWW currently is). Semantic Web technologies can of course be used on your intranet to traverse the web of your enterprise data. The foundations of the Semantic Web are RDF and OWL. RDF triples give you a common logical model of subject, predicate, and object. Data can be converted to RDF triples ahead of time or on the fly, such as with D2R, which converts RDBMS data into RDF triples at query time. Ontologies and business rules, expressed using RDFS, OWL, SWRL, etc., help to bridge the semantic divide and enable machine inferencing/learning. For example, the Translational Medicine Ontology (TMO) asserts that (for its purposes) the class “diseases” is equivalent to the “condition” and “side_effects” classes, so that an analysis of diseases would automatically (with inferencing) include the members of the condition and side effects classes:

<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/diseases> owl:equivalentClass <http://data.linkedct.org/resource/linkedct/condition> .
<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/diseases> owl:equivalentClass <http://www4.wiwiss.fu-berlin.de/sider/resource/sider/side_effects> .
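To make the effect of these two statements concrete, here is a minimal sketch in plain Python of what a reasoner does with owl:equivalentClass. It is not how Pellet or Jena are implemented; the `ex:` instance identifiers and the helper names are made up for illustration, and only the three class URIs come from the triples above.

```python
from collections import defaultdict

EQUIVALENT_CLASS = "owl:equivalentClass"
RDF_TYPE = "rdf:type"

DISEASES = "http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/diseases"
CONDITION = "http://data.linkedct.org/resource/linkedct/condition"
SIDE_EFFECTS = "http://www4.wiwiss.fu-berlin.de/sider/resource/sider/side_effects"

triples = [
    (DISEASES, EQUIVALENT_CLASS, CONDITION),
    (DISEASES, EQUIVALENT_CLASS, SIDE_EFFECTS),
    # Hypothetical instance data, one member per source class.
    ("ex:disease1", RDF_TYPE, DISEASES),
    ("ex:condition1", RDF_TYPE, CONDITION),
    ("ex:sideEffect1", RDF_TYPE, SIDE_EFFECTS),
]


def equivalent_classes(cls, triples):
    """Return cls plus every class reachable through owl:equivalentClass."""
    neighbours = defaultdict(set)
    for s, p, o in triples:
        if p == EQUIVALENT_CLASS:
            neighbours[s].add(o)
            neighbours[o].add(s)  # equivalence is symmetric
    seen, stack = {cls}, [cls]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def members_of(cls, triples):
    """List instances of cls, including instances of its equivalent classes."""
    classes = equivalent_classes(cls, triples)
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o in classes)


print(members_of(DISEASES, triples))
# ['ex:condition1', 'ex:disease1', 'ex:sideEffect1']
```

Because equivalence is symmetric and transitive, asking for the members of the condition class returns the same three instances.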

Using Semantic Web technologies, Data Scientists can be more productive. With SPARQL (the query language for the Semantic Web), Data Scientists can formulate powerful federated queries informed by machine inferencing capabilities (using SPARQL 1.1 property paths or by accessing stored inferred data generated by a reasoner such as Pellet or Jena). SPARQL can also pull the varied and disparate data together into a denormalized format (since many Big Data platforms don’t have join capabilities), which can then be distributed in Hadoop or another Big Data platform for high-performance data mining if your Semantic Web platform isn’t able to achieve the performance required on massive amounts of data. The results of the data mining might be converted into RDF triples to enable a feedback loop. Semantic Web technology is beginning to scale dramatically: one triple store vendor has asserted the ability to store 1 trillion triples, so much more analysis will be able to be performed directly (and more easily) using your Semantic Web platform.
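As a rough illustration of the denormalization step, the sketch below pivots a handful of triples into one flat, tab-separated row per subject, the kind of join-free file that could be loaded into Hadoop. The triples, the `ex:` predicate names, and the column list are all hypothetical; in practice the rows would come from a SPARQL query result.

```python
import csv
import io
from collections import defaultdict

# Hypothetical triples, e.g. as returned by a SPARQL query.
triples = [
    ("ex:patient1", "ex:hasCondition", "ex:gout"),
    ("ex:patient1", "ex:birthYear", "1950"),
    ("ex:patient2", "ex:hasCondition", "ex:renalFailure"),
    ("ex:patient2", "ex:birthYear", "1962"),
]


def denormalize(triples, columns):
    """Pivot triples into one flat row per subject (no joins needed later)."""
    rows = defaultdict(dict)
    for s, p, o in triples:
        rows[s][p] = o  # a repeated predicate keeps only the last value
    return [[s] + [rows[s].get(c, "") for c in columns] for s in sorted(rows)]


columns = ["ex:hasCondition", "ex:birthYear"]
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["subject"] + columns)
writer.writerows(denormalize(triples, columns))
print(out.getvalue(), end="")
```

This simple pivot keeps only one value per predicate; a subject with several conditions would need either repeated rows or a delimited multi-value column.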

Semantic Web technologies can be a very powerful tool in your Big Data arsenal: they help to address the semantic variety of data and help Data Scientists to traverse the information silos.

Big Data – Data Management Challenges

Posted on April 2nd, 2012

Big Data presents a lot of new opportunities – from analytics on petabytes of data to Complex Event Processing (CEP) of many large streams of data in real time.  However, Big Data also presents big challenges for Data Management.  These challenges include security/privacy, governance, data modeling, and backups.

Security – security for many NoSQL-based Big Data platforms is not as mature as for their relational counterparts, with much of the security seeming to take place at the firewall level. Privacy, especially for PHI, is particularly problematic. Solutions such as IBM Guardium Encryption Expert or Gazzang look promising, as they perform encryption/decryption at the OS level, below the filesystem level, and so can be transparent to the application. Tokenization is another alternative, where referential integrity in the data can be maintained without exposing private information.
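The tokenization idea can be sketched with a keyed hash: the same identifier always produces the same token, so records still join, but the original value is not exposed. This is only one possible approach (vault-based tokenization is another), and the key, field names, and sample records below are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would live in a key management system.
SECRET_KEY = b"replace-with-a-managed-secret"


def tokenize(value: str) -> str:
    """Return a deterministic, non-reversible token for an identifier."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical source records keyed by medical record number (MRN).
patients = [{"mrn": "12345", "name": "Jane Doe"}]
claims = [{"mrn": "12345", "amount": 250.0}, {"mrn": "67890", "amount": 80.0}]

# Replace the MRN with its token and drop the name before loading.
safe_patients = [{"patient_token": tokenize(p["mrn"])} for p in patients]
safe_claims = [
    {"patient_token": tokenize(c["mrn"]), "amount": c["amount"]} for c in claims
]

# The same input always yields the same token, so records still join.
known = {p["patient_token"] for p in safe_patients}
matched = [c for c in safe_claims if c["patient_token"] in known]
print(len(matched))  # 1
```

The join between patients and claims works on the token alone, which is what preserving referential integrity means here.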

Data Governance – Another complex twist is that, with the massive amount of data available, privacy becomes more difficult, even beyond protecting PHI or credit card numbers. Intellectual property or potentially damaging information which might have been stored in unstructured data sources can be exposed more easily. With a massive number of data points, inferences can be made (e.g., using Semantic Web technologies) to uncover knowledge that was previously much more difficult to piece together. Data Governance and Data Stewardship must be involved when exploiting Big Data technologies to ensure that information is properly secured and used, according to business requirements.

Metadata Management – many metadata management solutions have not yet developed scanners or busses to import/exchange metadata in order to support data lineage or impact analysis to support Data Governance and change management.  However, more and more data management tools (e.g., Informatica for data integration, Composite for data virtualization) are starting to support Big Data technologies, such as Hadoop, and so metadata may be exposed using these tools, though probably not with the same ease as with relational databases given the “schema-less” nature of many NoSQL databases.

Data Modeling – given the “schema-less” nature of NoSQL databases, the data modeling task will morph, primarily for Physical Data Modeling (as DDL doesn’t need to be generated). However, Conceptual and Logical Data Modeling will remain important. With the advent of the Semantic Web, ontology modeling will become much more prevalent. I call ontology modeling Conceptual Data Modeling on steroids, as it can inform machine learning and inferencing.

Backups – performing backups on a Big Data platform can be problematic, as backup solutions are in their infancy for these types of platforms. One can easily imagine the problems of performing backups when many terabytes are loaded every hour or day. However, many Big Data platforms have built-in availability. With Hadoop HDFS, for example, each data block is replicated on three nodes (by default), and placement is rack aware so that a block is stored on at least two different racks. With Cassandra, data can be replicated across data centers. This does not, of course, preclude the possibility of data being deleted.
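The rack-aware replication described above can be illustrated with a toy placement function. It is loosely modeled on HDFS’s default policy as I understand it (first replica on the writing node, second on a node in a different rack, third on another node in that second rack) and is not HDFS code; the cluster layout and function names are invented for the example.

```python
import random

# Hypothetical cluster layout: rack name -> data nodes.
CLUSTER = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}


def rack_of(node):
    """Look up which rack a node belongs to."""
    return next(rack for rack, nodes in CLUSTER.items() if node in nodes)


def place_replicas(writer_node, rng=random):
    """Pick three distinct nodes for one block, spanning two racks."""
    first = writer_node
    local_rack = rack_of(first)
    remote_rack = rng.choice([r for r in CLUSTER if r != local_rack])
    second = rng.choice(CLUSTER[remote_rack])
    third = rng.choice([n for n in CLUSTER[remote_rack] if n != second])
    return [first, second, third]


replicas = place_replicas("node1", random.Random(0))
print(replicas, {rack_of(n) for n in replicas})
```

Losing a whole rack still leaves at least one copy of the block, but a delete issued through the filesystem removes all three, which is why replication is not a substitute for backups.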

Big Data presents a paradigm shift and will require adjustment in data management practices.  Not every new system will require a Big Data solution or NoSQL platform, but where required, careful consideration of data management requirements is necessary.

Big Data – Where can it be used in Healthcare?

Posted on March 7th, 2012

There is a lot of interest in Big Data these days. Big Data is commonly defined as data sets which are too large (e.g., hundreds of terabytes or into petabytes) to be handled by traditional means. Leveraging Big Data requires an extensive degree of parallelism of both data and computing, requiring a Massively Parallel Processing (MPP) architecture, and may take advantage of low-cost commodity hardware and open source software to help drive down costs. Healthcare is certainly an industry which generates massive volumes of data, which very often aren’t leveraged anywhere near their full potential.

Healthcare providers and payers often want to uncover insight into how to improve clinical care and operations by mining the data, but might not want to face the expense and delay involved in building an Enterprise Data Warehouse in order to quickly investigate a hypothesis. In such a situation, a Data Scientist might benefit from having all HL7 messages available in their raw form so as to have access to all the data without the restructuring that usually takes place to integrate HL7 data into systems. Personalized medicine will require massive amounts of patient-specific genomic, proteomic, metabolic and other data. Evidence-based medicine may benefit from intensive text mining of unstructured data (medical literature, physicians’ notes, etc.) in order to better align practices to established norms. Streaming HL7 messages and other real-time data can make it possible to proactively monitor and respond to conditions, such as potential illness outbreaks.
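To give a feel for what working with raw HL7 looks like, here is a small sketch that splits a simplified HL7 v2 message into segments and fields using only string operations. The message content is invented, and the parser ignores escape sequences, repetitions, and sub-components that a real HL7 library would handle.

```python
# Hypothetical, simplified HL7 v2 message; segments are separated by "\r".
RAW_MESSAGE = "\r".join([
    "MSH|^~\\&|SENDING_APP|SENDING_FAC|RECEIVING_APP|RECEIVING_FAC|201203070830||ADT^A01|MSG00001|P|2.3",
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19500101|F",
    "DG1|1||274.02^Chronic gouty arthropathy^I9",
])


def parse_hl7(raw):
    """Split a raw message into {segment_id: [field list, ...]}."""
    segments = {}
    for line in filter(None, raw.split("\r")):
        fields = line.split("|")
        segments.setdefault(fields[0], []).append(fields)
    return segments


msg = parse_hl7(RAW_MESSAGE)
message_type = msg["MSH"][0][8]  # MSH-9; MSH-1 is the "|" separator itself
patient_id = msg["PID"][0][3].split("^")[0]  # first component of PID-3
diagnosis_code = msg["DG1"][0][3].split("^")[0]  # first component of DG1-3
print(message_type, patient_id, diagnosis_code)
# ADT^A01 12345 274.02
```

Keeping the messages raw means every field stays available for later analysis, even those a downstream system would normally discard when it restructures the data.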

Big Data of course should not be thought of as a silver bullet. Issues of data governance, security, and privacy have to be addressed and require careful thought given the new paradigms that might be employed. Not all Big Data technology (e.g., NoSQL databases) will be applicable or suitable for every situation, and careful tradeoff analysis has to be performed. A learning curve is to be expected before Big Data capabilities can be fully exploited.

Blue Button – Is it time to get into the 21st century?

Posted on February 8th, 2012

The VA and Medicare recently launched the Blue Button initiative, where patients can “download their claims and medical information in a common format.” This format is a plain ASCII text file, the purpose being to allow the information to be read or printed on any device. A pretty slick Adobe AIR application can be installed on the patient’s computer to allow him/her to navigate through the information, along with some other nice features, e.g., showing a timeline of medication use.

According to the White House, Blue Button data “can be used to create portable medical histories that will facilitate dialog with Veterans’ and beneficiaries’ health care providers, caregivers, and other trusted individuals or entities.”[1] This is certainly laudable, but is an ASCII text file the ideal format for sharing this data? To the right is a small sample of a Blue Button text file.[2]

While the text file is relatively easy for a human to read, it is difficult for a computer to ingest and integrate this information in a meaningful way due to the lack of an easily parsable format. It would be nice if this information were also downloadable as an XML file (we are, after all, in the 21st century) – relying solely on a 1970s-style report for data sharing leaves much to be desired.

A reason given for not using XML or another more parsable format is the desire to avoid arguments over what standard to use. Why not just develop an XSD specifically for this Blue Button data? A standard, after all, is implied in the format of the report. This would have made the data more actionable for applications and systems that could use the data.
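As a sketch of how the structure implied by the report could be carried over to XML, the snippet below turns “Label: value” lines into elements. The sample lines, the root element name, and the tag-naming rule are all assumptions made for illustration; the real Blue Button layout has sections and fields that this does not attempt to cover.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; the real Blue Button layout may differ.
SAMPLE = """\
Name: JANE DOE
Date of Birth: 01/01/1950
Medication: EXAMPLE DRUG 10MG TAB
"""


def to_tag(label):
    """Turn a report label into an element name: 'Date of Birth' -> 'DateOfBirth'."""
    return "".join(word.capitalize() for word in label.split())


def report_to_xml(text):
    """Convert 'Label: value' lines into child elements of one record."""
    root = ET.Element("BlueButtonRecord")  # hypothetical root element name
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip():
            ET.SubElement(root, to_tag(label)).text = value.strip()
    return root


xml_text = ET.tostring(report_to_xml(SAMPLE), encoding="unicode")
print(xml_text)
```

Labels containing punctuation or starting with a digit would not make valid element names, which is one reason a purpose-built XSD would be preferable to deriving tags from the report.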

While improvements could be made to Blue Button for better data sharing, it certainly is a worthwhile initiative that can save lives. Having the data in an ASCII text format certainly has advantages, as no specialized software is required to view it. Providing the data in an XML format should be an addition to, not a replacement for, the text file.

What has been your experience with the Blue Button initiative? Have you experienced any limitations? 


[1] http://www.whitehouse.gov/blog/2010/10/07/blue-button-provides-access-downloadable-personal-health-data

[2] http://bluebuttondata.org/how_to.php

Data Governance and Data Stewardship – keys to successful enterprise data initiatives

Posted on January 9th, 2012

When embarking on enterprise data initiatives, such as Meaningful Use, BI, or supporting the conversion to ICD-10, success is very often correlated with the degree of business involvement. For enterprise data initiatives, there are three types of business (here meaning non-IT) involvement required:

  • Stakeholder involvement (these, of course, are the people for whom we are undertaking the initiative or who are approvers)
  • Data Governance
  • Data Stewardship

Our focus here is on the latter two – but one way we can think of these types of business involvements is that stakeholder involvement is vertical in nature (i.e., their involvement is often for a specific enterprise application e.g., an Enterprise Data Warehouse), while Data Governance and Data Stewardship are horizontal in nature (i.e., governing or stewarding cross functionally, business unit neutral, and independent of any specific system or application).

Let’s define what Data Governance and Data Stewardship are:

Data Governance – the practice of treating data and information as enterprise business assets and making strategic enterprise decisions about data and information, data management practices, and data/information environments.

Data Stewardship – the assignment of responsibility, authority, and accountability over a data subject area for the enterprise to ensure that data is well defined, of high quality, properly secured, and effectively utilized.

As a lot of things tend to get lumped into the term “Data Governance,” let’s contrast it with the term “Data Management” (or Information Management).  Here is the DAMA DMBOK definition:

Data Management – “is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets.”

Data Governance and Data Stewardship are components of Enterprise Information Management (EIM). Other components include structured data management (e.g., database management), unstructured data management (e.g., email, document and content management), data architecture (e.g., data modeling), metadata management, data quality management, enterprise data warehousing, master data management, etc.

Data Governance and Data Stewardship are key ways that the business is involved in EIM.  Data Governance focuses on making strategic enterprise decisions (such as, do we need an Enterprise Data Warehouse, what policies will be enacted for HIPAA compliance from a data perspective, etc.).

The strategic decisions that are made as part of Data Governance are made typically through an enterprise level board or committee formed specifically for Data Governance.  Participants in a Data Governance Board are primarily from the business, with IT playing a supporting role.  This helps to ensure that the data/information environment is in greater alignment with business needs and objectives.

Data Stewardship is more tactical in nature as it is focused on a specific data subject area (Patient, Practitioner, Claim, Encounter, etc.).   There are different types of data stewards, with the most critical being the Business Data Steward.  This role should be filled by someone from a business unit (non-IT) who has the skills to work across the enterprise to achieve consensus on data and term definitions, data quality measurement and improvement, monitoring compliance with policies for the subject area, etc.  The Business Data Steward is primarily a facilitation role – but of course the Business Data Steward must have subject area expertise.  Business Data Stewards are assisted by Technical Data Stewards, Data Custodians and others and might chair a subcommittee for a complex data subject area.

There is much more that can be written about Data Governance and Data Stewardship – but the most important take away is that these are primarily business undertakings and are critical ways to enable business involvement in enterprise data initiatives.

Does your enterprise have a data governance board or committee?  If not, what are some obstacles to forming one?

Have you served as a Data Steward? What challenges did you face? 

I’d like to hear about your experiences or questions!

Semantic Web Technologies for ICD-9 to ICD-10 Conversion

Posted on October 13th, 2011

ICD-10 code mapping is not for the faint of heart. With a tenfold increase in the sheer volume of codes, healthcare organizations will need a few tricks up their sleeves to pull this off.

The mapping between ICD-9 and ICD-10 in many cases is not a simple 1:1 mapping.  For example, ICD-9-CM code 274.02 “Chronic gouty arthropathy without mention of tophus (tophi)” can map to 97 ICD-10-CM codes for a greater level of specificity, e.g., M1A.3120 “Chronic gout due to renal impairment, left shoulder, without tophus (tophi)”. 

Going from a 4010 to 5010 transaction requires going from the ICD-9 code to a (usually) more specific ICD-10 code.  Based on the example above, assume we’ll need to map 274.02 to M1A.3120.  As the 274.02 doesn’t indicate that the gout was due to renal impairment or that it was in the left shoulder, we might need to find this information from multiple sources.
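The one-to-many problem can be pictured as filtering a candidate list with facts gathered from other sources. In the sketch below only the 274.02 to M1A.3120 pairing comes from the example above; the other two candidate rows, the attribute names, and the function are illustrative and would need to be checked against the official code set and mapping files.

```python
# Illustrative subset of candidate targets for ICD-9-CM 274.02.
# Only M1A.3120 is taken from the article; verify the rest before use.
CANDIDATES = {
    "274.02": [
        {"icd10": "M1A.3120", "cause": "renal impairment", "site": "left shoulder"},
        {"icd10": "M1A.3110", "cause": "renal impairment", "site": "right shoulder"},
        {"icd10": "M1A.0120", "cause": "idiopathic", "site": "left shoulder"},
    ],
}


def map_code(icd9, **known_facts):
    """Return the ICD-10 candidates consistent with what else is known."""
    return [
        c["icd10"]
        for c in CANDIDATES.get(icd9, [])
        if all(c.get(k) == v for k, v in known_facts.items())
    ]


print(map_code("274.02"))
# ['M1A.3120', 'M1A.3110', 'M1A.0120']
print(map_code("274.02", cause="renal impairment", site="left shoulder"))
# ['M1A.3120']
```

With no extra facts the mapping stays ambiguous; each additional attribute found in another source narrows the list, which is why the supporting data has to be located and integrated first.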

ICD-10: The Complex Logic Game

You can write custom code to pull this information together, but of course there may be thousands of conversions with complex logic required to perform all the various ICD-9 to ICD-10 mappings. An alternative to consider may be to expose data as RDF triples in order to leverage a common logical model (every triple has a subject, predicate, and object), which can facilitate more rapid federation and analysis of data across numerous heterogeneous data sources. For example, relational data can be exposed as RDF using tools such as D2R, which generates SQL and converts the data to RDF on the fly. Messages encoded in XML can be converted to RDF using GRDDL. RDF data is queried using SPARQL, and SPARQL can easily federate data from multiple SPARQL endpoints, RDF triplestores, or RDF files. Because a SPARQL endpoint makes the data available over HTTP, it is much easier to establish connectivity to the various data sources.

A major roadblock to any kind of data integration or federation project is the varying nomenclature and semantics found in our systems. Using semantic web standards, an ontology can be developed and used to help resolve these roadblocks at query time. For example, if in one system the term “kidney failure” is used while in another the term “renal failure” is used, in our ontology we can state that these terms are synonymous (e.g., systema:kidneyFailure owl:sameAs systemb:renalFailure). We can now find the data using the terminology we’re familiar with, regardless of the nomenclature used in the different systems.
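Here is a minimal sketch of that idea in plain Python: two sources expose their data as triples, the ontology contributes the owl:sameAs statement from the example, and a query phrased in one system’s terminology finds patients in both. The patient identifiers, the `ex:diagnosis` predicate, and the helper functions are hypothetical, and a real deployment would use a SPARQL engine with a reasoner rather than hand-written code.

```python
# Hypothetical records exposed as (subject, predicate, object) triples.
SYSTEM_A = [("systema:patient1", "ex:diagnosis", "systema:kidneyFailure")]
SYSTEM_B = [("systemb:patient9", "ex:diagnosis", "systemb:renalFailure")]
ONTOLOGY = [("systema:kidneyFailure", "owl:sameAs", "systemb:renalFailure")]


def canonical_terms(ontology):
    """Return a function mapping a term to one representative of its sameAs group."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for s, p, o in ontology:
        if p == "owl:sameAs":
            parent[find(s)] = find(o)  # merge the two groups
    return find


def patients_with(term, sources, ontology):
    """Find patients diagnosed with term or any term declared the same."""
    find = canonical_terms(ontology)
    target = find(term)
    return sorted(
        s
        for triples in sources
        for s, p, o in triples
        if p == "ex:diagnosis" and find(o) == target
    )


print(patients_with("systema:kidneyFailure", [SYSTEM_A, SYSTEM_B], ONTOLOGY))
# ['systema:patient1', 'systemb:patient9']
```

The query could equally be phrased with systemb:renalFailure and would return the same two patients, since both terms resolve to the same group.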

While these technologies are relatively new, they are gaining acceptance. Case in point: ICD-11 will be published in OWL (Web Ontology Language). The Query Health initiative is looking to develop the Linked Health Data Cloud using semantic technologies.

RDF follows a significantly different paradigm from the relational model.  SPARQL (the query language for RDF data) differs from SQL, and developing ontologies in RDF/RDFS/OWL and encoding business rules in SWRL will all require a learning curve. However, these technologies are being used today in the healthcare setting.   The Cleveland Clinic uses semantic technologies “to improve future patient care through outcomes-based and longitudinal clinical research”.  The Mayo Clinic is using these technologies for helping consumers find information without having to know all the underlying medical terminology with its Mayo Consumer Vocabularies initiative.  Wellpoint is looking to leverage IBM’s Watson (which leverages ontologies for question answering) in a clinical setting.   Providers and payers with mature and technologically savvy IT organizations might investigate leveraging semantic technologies to ease the transition to ICD-10.
