A health insurance client of mine recently embarked on an initiative to truly have “trusted data” in its Enterprise Data Warehouse so that business leaders could make decisions based on accurate data. However, how can one truly know if your data is trustable?? In addition to having solid controls in place (e.g., unique indexes on the primary AND natural key), it is also necessary to measure how the data compares to defined quality rules. Without this measurement, trusted data is a hope – not an assured reality.
To enable this measurement, I designed a repository for storing
- configurable data quality rules,
- metadata about data structures to be measured,
- and the results of data quality measurements.
I experienced the need to be able to perform a degree of “inferencing” in the relational database (DB2) being used for this repository. Normally one thinks of inferencing as the domain of semantic modeling and semantic web technologies like RDF, OWL, SPARQL, Pellet, etc. – and these are indeed very powerful technologies that I have written about elsewhere. However, using semantic web technologies wasn’t a possibility for this system.
The problem I was trying to address was to enable inheritance of data quality rules assigned to a dataset by lower level subsets (groups of records within the dataset). A canonical message type might have rules associated with it, and that message type may be used in many different services.
For purposes of tracking, each service was a separate subset. For example, a Party canonical message might be used for a service going from A to B, another service going from B to C, and so on. The records sent in a service were considered a subset.
The requirement was that rules associated with the high level canonical message type would be inherited by the child subsets, and reporting on the quality of the data in these services must be measured. Another example might be a database table that has some rules that apply to all records (a subset called “All” was created to enable this), while there may be more granular rules that apply only to a smaller subset of the data (e.g., patient date of birth >= 1/1/1970). These more granular subsets are children of the “All” subset, and in turn may have their own child subsets. Below is an example of the subset hierarchy.
Dataset : Member (a database table)
Subset 1 : All (representing all records in the table) Parent Subset: N/A
Subset 2: PATIENT_DOB >= 1/1/1970 Parent Subset: 1
Subset 3: STATE_CD = ‘NY’ Parent Subset: 2
As you can see subset 3 represents a set of records in the Member table where the PATIENT_DOB >= 1/1/1970 and the STATE_CD = ‘NY’.
In essence, there was a variable depth taxonomy of business rules with lower level subsets inheriting rules from higher level subsets, as well as the possibility of the lower level subsets having their own distinct rules. Additionally, we had to be able to calculate the trust score (percent of failed rules compared to assigned rules) for a dataset or a subset – which required traversing down the hierarchies (there were multiple hierarchies which made it even more complex) to include the trust score from lower level elements (e.g., in an XML message) or subsets. To traverse up and down these variable depth / ragged hierarchies (expressed as recursive relationships), we used the CONNECT BY syntax available in DB2 to significantly simplify this traversal. While use of this non-ANSI standard SQL syntax is complex in itself, the alternatives would be much more difficult e.g., having to use a stored procedure or program to loop through the data or write very complex queries with numerous sub-queries.
There is much written on the CONNECT BY syntax that we won’t be able to go into here, but the gist of it is to be able to start at (see START WITH clause) a certain point(s) in a hierarchy (in this context hierarchy refers to a group of hierarchically related rows[1] in the table which has a recursive relationship to itself), and then traverse up or down the hierarchy depending on how you structure your CONNECT BY clause.
While semantic technologies are able to handle taxonomies and ontologies as the core of their capability and enable very powerful machine reasoning and inferencing, the CONNECT BY clause can enable some basic inferencing capabilities in a relational database such as DB2 and Oracle. By utilizing this capability, we were able to simplify our architecture and drastically reduce the amount of space required.
I am interested to hear your feedback. You can reach me at pete.stiglich@perficient.com, on Twitter at @pstiglich, or in the comments section below.
[1] For example, in an employee table there might be a recursive relationship in the model (and possibly a foreign key from the table back to itself in the database) which identifies the manager of the employee – and of course the manager is also an employee.