Pete Stiglich, Author at Perficient Blogs
https://blogs.perficient.com/author/pstiglich/

Database inferencing to get to trusted healthcare data
https://blogs.perficient.com/2014/01/16/database-inferencing-to-get-to-trusted-healthcare-data/ (Thu, 16 Jan 2014)

A health insurance client of mine recently embarked on an initiative to truly have “trusted data” in its Enterprise Data Warehouse so that business leaders could make decisions based on accurate data. But how can you truly know whether your data is trustworthy? In addition to having solid controls in place (e.g., unique indexes on the primary AND natural keys), it is also necessary to measure how the data compares to defined quality rules. Without this measurement, trusted data is a hope – not an assured reality.

To enable this measurement, I designed a repository (sketched after the list below) for storing:

  • configurable data quality rules,
  • metadata about data structures to be measured,
  • and the results of data quality measurements.
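
To make the design more concrete, here is a minimal sketch of two of the core tables such a repository might contain. This is not the client's actual design – the table and column names (DQ_RULE, SUBSET, PARENT_SUBSET_ID, etc.) and the ibm_db connection details are hypothetical – but it illustrates a unique index on the natural key (one of the "solid controls" mentioned above) and the recursive parent/child relationship between subsets that the rest of this post relies on.

```python
# Hypothetical data quality repository tables (illustrative only).
import ibm_db

# Connection string values are placeholders.
conn = ibm_db.connect(
    "DATABASE=dqrepo;HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;"
    "UID=dbuser;PWD=secret", "", "")

ddl_statements = [
    # A configurable data quality rule.
    """CREATE TABLE DQ_RULE (
           RULE_ID   INTEGER       NOT NULL PRIMARY KEY,
           RULE_NM   VARCHAR(100)  NOT NULL,
           RULE_EXPR VARCHAR(1000) NOT NULL)""",
    # A subset of a dataset; PARENT_SUBSET_ID is the recursive relationship
    # that lets child subsets inherit rules from their parents.
    """CREATE TABLE SUBSET (
           SUBSET_ID        INTEGER       NOT NULL PRIMARY KEY,
           DATASET_NM       VARCHAR(100)  NOT NULL,
           SUBSET_CRITERIA  VARCHAR(1000) NOT NULL,
           PARENT_SUBSET_ID INTEGER REFERENCES SUBSET (SUBSET_ID))""",
    # Control on the natural key, not just the surrogate primary key.
    "CREATE UNIQUE INDEX SUBSET_NK ON SUBSET (DATASET_NM, SUBSET_CRITERIA)",
]

for stmt in ddl_statements:
    ibm_db.exec_immediate(conn, stmt)
```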

I found I needed to perform a degree of “inferencing” in the relational database (DB2) used for this repository. Normally one thinks of inferencing as the domain of semantic modeling and Semantic Web technologies like RDF, OWL, SPARQL, Pellet, etc. – and these are indeed very powerful technologies that I have written about elsewhere. However, using Semantic Web technologies wasn't an option for this system.

The problem I was trying to address was enabling lower-level subsets (groups of records within a dataset) to inherit the data quality rules assigned to the dataset. A canonical message type might have rules associated with it, and that message type may be used in many different services.

For purposes of tracking, each service was a separate subset. For example, a Party canonical message might be used for a service going from A to B, another service going from B to C, and so on. The records sent in a service were considered a subset.

The requirement was that rules associated with the high-level canonical message type be inherited by the child subsets, and that the quality of the data in these services be measured and reported. Another example might be a database table with some rules that apply to all records (a subset called “All” was created to enable this), while more granular rules apply only to a smaller subset of the data (e.g., patient date of birth >= 1/1/1970). These more granular subsets are children of the “All” subset, and in turn may have their own child subsets. Below is an example of the subset hierarchy.

Dataset: Member (a database table)

  • Subset 1: All (representing all records in the table) – Parent Subset: N/A
  • Subset 2: PATIENT_DOB >= 1/1/1970 – Parent Subset: 1
  • Subset 3: STATE_CD = ‘NY’ – Parent Subset: 2

As you can see, Subset 3 represents the set of records in the Member table where PATIENT_DOB >= 1/1/1970 and STATE_CD = ‘NY’.

In essence, there was a variable-depth taxonomy of business rules, with lower-level subsets inheriting rules from higher-level subsets as well as possibly having their own distinct rules. Additionally, we had to be able to calculate the trust score (the percentage of failed rules compared to assigned rules) for a dataset or a subset, which required traversing down the hierarchies (there were multiple hierarchies, which made it even more complex) to include the trust scores from lower-level elements (e.g., in an XML message) or subsets. To traverse up and down these variable-depth, ragged hierarchies (expressed as recursive relationships), we used the CONNECT BY syntax available in DB2, which significantly simplified the traversal. While this non-ANSI SQL syntax is complex in itself, the alternatives would be much more difficult – e.g., using a stored procedure or program to loop through the data, or writing very complex queries with numerous sub-queries.

There is much written on the CONNECT BY syntax that we won't be able to go into here, but the gist is that you start at a certain point (or points) in a hierarchy via the START WITH clause – in this context, a hierarchy is a group of hierarchically related rows[1] in a table that has a recursive relationship to itself – and then traverse up or down the hierarchy depending on how you structure your CONNECT BY clause.
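
As a hedged illustration of the technique (not the actual queries we used), the sketch below runs a CONNECT BY traversal against the hypothetical SUBSET table from the earlier sketch. It assumes DB2 with the Oracle compatibility features enabled (CONNECT BY is not available otherwise) and the ibm_db Python driver.

```python
# Traverse DOWN the subset hierarchy starting from the "All" subset (ID 1);
# every descendant returned here would contribute to the rolled-up trust score.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=dqrepo;HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;"
    "UID=dbuser;PWD=secret", "", "")

sql = """
SELECT SUBSET_ID, PARENT_SUBSET_ID, SUBSET_CRITERIA, LEVEL AS DEPTH
FROM   SUBSET
START WITH SUBSET_ID = 1                        -- start at the 'All' subset
CONNECT BY PRIOR SUBSET_ID = PARENT_SUBSET_ID   -- walk down to child subsets
"""

stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row["DEPTH"], row["SUBSET_ID"], row["SUBSET_CRITERIA"])
    row = ibm_db.fetch_assoc(stmt)
```

Reversing the direction of the PRIOR keyword (CONNECT BY SUBSET_ID = PRIOR PARENT_SUBSET_ID, starting with a leaf subset) walks up the hierarchy instead, which is how the rules a child subset inherits from its ancestors can be collected.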

While semantic technologies are able to handle taxonomies and ontologies as the core of their capability and enable very powerful machine reasoning and inferencing, the CONNECT BY clause can enable some basic inferencing capabilities in a relational database such as DB2 and Oracle. By utilizing this capability, we were able to simplify our architecture and drastically reduce the amount of space required.

I am interested to hear your feedback. You can reach me at pete.stiglich@perficient.com, on Twitter at @pstiglich, or in the comments section below.


[1] For example, in an employee table there might be a recursive relationship in the model (and possibly a foreign key from the table back to itself in the database) which identifies the manager of the employee – and of course the manager is also an employee.

 

A Wish List for Data Modeling Technology
https://blogs.perficient.com/2013/12/10/a-wish-list-for-data-modeling-technology/ (Tue, 10 Dec 2013)

I was recently a panelist for a Dataversity webinar/discussion focused on the future of data modeling tools, functionality, and practice. Given the holiday season, the panelists discussed their wish list for modeling tools – from currently practical (but maybe not economically viable) to futuristic (e.g., using a 3D printer to print models for model reviews, using Google Glass to move objects around on the model).

Of course, many modeling tools already support a vast array of functionality, and it can sometimes be difficult to use the non-core features without experiencing unintended consequences. More intelligent guides and better semantics in the modeling tool would make these features easier to use – so modelers can focus more on modeling and less on the technology.

More important than the technology – as important and interesting as that is – is having solid processes and modeling standards in place to ensure better model quality, reuse, and understandability.

Disruption caused by Data Governance?
https://blogs.perficient.com/2013/11/11/disruption-caused-by-data-governance/ (Mon, 11 Nov 2013)

Instituting Data Governance is a major initiative that provides a significant opportunity to leverage enterprise data assets more effectively. As such, there is sometimes concern about Data Governance being seen as a roadblock, or about forming new enterprise-level organizations such as a Data Governance Board and Data Stewardship committees. To be sure, a proper balance between enforcing standards and retaining agility will be needed, and of course Data Governance organizations should not be assembled without careful thought and a phased approach. However, Data Governance is all about making strategic decisions about enterprise data assets – and of course people have to meet to discuss and make these decisions. Not just anyone should be making these decisions, and so Data Governance organizations need to be formed.

Forming Data Governance organizations should take place only after an assessment, strategy, and roadmap project. Part of this project should be identifying roles and executive stewards for these roles, after significant collaboration with the Executive Sponsor. The first organization to be formed should be the Data Governance Board, as other organizations need to be formed under the direction of the Data Governance Board. Enterprise standards should be enforced after careful deliberation, approval, and promotion by Data Governance.

Some take a more surreptitious approach to data governance, applying heroic effort to best practices such as developing an enterprise data model, setting up a metadata repository, or undertaking data quality initiatives. While these are worthwhile objectives, unless there are supporting Data Governance organizations to endorse, promote, and oversee these initiatives, the efforts tend to be difficult to sustain (due to lack of resources) and cause frustration and burnout. More importantly, there may not be enterprise alignment. For example, an Enterprise Conceptual Data Model that has not been reviewed with and approved by the business may just reflect the perspective of the data modeler.

Data Governance should be a disruptive agent for (good) change, and needs to be implemented after careful deliberation and executive sponsorship. Forming Data Governance organizations needs to take place in a phased approach.

A Low Cost Big Data Integration Option?
https://blogs.perficient.com/2013/09/23/a-low-cost-big-data-integration-option/ (Mon, 23 Sep 2013)

With all of the interest in big data in healthcare, it’s easy to get drawn in by the excitement and not realize that it’s not a silver bullet that’s going to address all your data and infrastructure problems. Unless you are able to understand and integrate your data, throwing all the data onto a platform like Hadoop or Cassandra probably won’t provide the benefit you’re looking for. Of course, there really is benefit to leveraging a big data platform for the right kinds of use cases, such as increased scalability, performance, potentially lower TCO, etc.

Of course, there are many integration tools on the market that perform well. However, I'd like to propose consideration of Semantic Web technologies as a low cost alternative to traditional data integration. Many are open source and based on approved W3C standards such as RDF (Resource Description Framework) and OWL (Web Ontology Language).

Using Semantic Web technologies to enable integration – for example, the Linked Open Data initiative for integrating data across the internet – can (besides being less expensive) provide significant advantages through automated inferencing of new data that would previously have required specialized programming to derive. Indeed, your Semantic Web environment can serve as a knowledge base for artificial intelligence.

To enable enterprise integration, you probably won't start by converting all of your data into RDF/OWL (in fact, I wouldn't recommend it). Instead, you might leverage converters such as DB2RDF, which translate RDBMS data into RDF triples on the fly at query time.

Converting information to RDF/OWL (at query time or in a triple store) can bridge the semantic divide more easily. Different systems call the same thing by different names, which makes integration confusing. For example, in System A patients are identified using PAT_ID; in System B it's MPI_ID; and so on. Using an OWL equivalence property such as owl:sameAs, this mapping can be handled in a single triple (A:PAT_ID owl:sameAs B:MPI_ID) in a mapping ontology; the data can then be integrated at query time using SPARQL (the Semantic Web query language). Of course, this is a greatly simplified example, but given the hundreds or thousands of systems large healthcare organizations have, being able to assemble a comprehensive view of the patient can provide tremendous value.
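
Here is a minimal, self-contained sketch of that idea using Python and rdflib. The namespaces (sysa:, sysb:), identifiers, and data are all made up for illustration; the point is that one mapping triple plus an ordinary SPARQL query is enough to pull patient identifiers out of both systems without renaming anything.

```python
# Illustrative only: a one-triple owl:sameAs mapping queried with SPARQL.
from rdflib import Graph

data = """
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix sysa: <http://example.org/systemA/> .
@prefix sysb: <http://example.org/systemB/> .

# The mapping: System A's PAT_ID means the same thing as System B's MPI_ID.
sysa:PAT_ID owl:sameAs sysb:MPI_ID .

# Sample instance data from each system.
<http://example.org/patient/123> sysa:PAT_ID "A-123" .
<http://example.org/patient/456> sysb:MPI_ID "B-456" .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Ask for patient identifiers using only System A's vocabulary; the UNION's
# second branch follows the owl:sameAs mapping to find the equivalent property.
query = """
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX sysa: <http://example.org/systemA/>
SELECT ?patient ?id WHERE {
    { ?patient sysa:PAT_ID ?id . }
    UNION
    { sysa:PAT_ID owl:sameAs ?equivalent .
      ?patient ?equivalent ?id . }
}
"""
for patient, identifier in g.query(query):
    print(patient, identifier)
```

In a full Semantic Web stack the mapping would typically live in its own ontology and be applied by the query layer or a reasoner, but the principle is the same.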

Data integration is a good starting point for using Semantic Web technologies – your users don’t have to know what the data is called in all the various systems. They are able to do analysis by terminology they’re familiar with and don’t have to adjust their vocabulary; as long as the terminology & data is mapped (relatively simply as demonstrated above), they can find the information they need.

However, the real icing on the cake for Semantic Web technologies lies with machine reasoning and having new data inferred for you. Semantic Web technologies have been likened to “programming by example” and are very powerful for encapsulating data, meta-data, business rules, and transformation logic into a common standard format.

The Conceptual Data Model – Key to Integration
https://blogs.perficient.com/2013/08/20/the-conceptual-data-model-key-to-integration/ (Tue, 20 Aug 2013)

The importance of data integration, whether for analytics, mergers/acquisitions, outsourcing arrangements, third party interfaces, etc., is easy to understand but extremely difficult to realize. The technical aspect of data integration is the (relatively) easy part. The hardest part is bridging the semantic divide and agreeing on business rules – i.e., communication. The Conceptual Data Model can be an invaluable tool to help bridge the gap.

A Conceptual Data Model is a technology, application, and (usually) business unit neutral view of key business objects (data entities) and their relationships, and should be the framework for information systems. It is a business model, from a business object perspective, where we’re identifying what data is of interest to the business rather than how the data is used. It is not a process model, and is stateless in nature (all states of a relationship must be expressed to provide a longitudinal perspective). Relationships can express many business rules. A Conceptual Data Model typically takes the form of an ERD, UML Class Diagram, or ORM diagram.

The Conceptual Data Model provides a framework for information systems. However, in order to craft that framework, the Enterprise Data Architect and/or Data Modeler must meet with and understand the business. You wouldn't let a housing architect proceed to laying the foundation and framing the house without first meeting and coming to alignment on the design; similarly, you shouldn't build your information systems before you have the conceptual data model in place.

After the housing architect meets with the customer and develops the model, he or she meets with the customer again to review and fine-tune it. Likewise, after the data architect or modeler crafts the model, he or she must review it with the business. The amount of time and effort required to make the models presentable and understandable to the business should not be underestimated, as you want the business to sign off on:

  • Names – Do the names of entities resonate with the business and does the business agree to the proposed standard name? Be sure to identify synonyms and homonyms to provide context to the business.
  • Definitions – Do the entity definitions conform to business understanding for the term / entity name?
  • Relationships – Are the relationships at the right level of granularity (e.g., does an entity relate to the claim header or to the claim line)? Is the business rule expressed via cardinality and optionality reflecting business reality over time (rather than at a specific state)?

I like to present my models to the business using a PowerPoint slide deck. This is harder and more time consuming for the modeler, but easier to present and easier for the business to focus on. Focus on a small set of entities and relationships in a single slide (the rule of 7 +/- 2 works here). I export definition and relationship metadata onto separate slides. It is very important to write out the relationships in a formal sentence. Why?

  1. It removes the risk of business people getting confused or turned off by the notation
  2. It is much easier to get the business to review and approve the relationship
  3. It acts as a quality check for your modeling (often having to write out the relationships uncovers oversights when modeling)

Once you have your Conceptual Data Model, i.e., your information framework, you are ready to build upon that framework. Whether you're building an internal data warehouse, preparing for a merger or acquisition, or have other integration needs, you can be more confident that you understand much more of the business, and so integration can go more smoothly.

Key insights on source data for healthcare analytics & exchanges
https://blogs.perficient.com/2013/07/30/key-insights-on-source-data-for-healthcare-analytics-exchanges/ (Tue, 30 Jul 2013)

Providers and payers need to exchange or source a lot of data, and the rate of this activity will only increase with the implementation of Obamacare and other directives. Given the poor state of metadata management (which makes data sharing much more difficult), the decision to incorporate a new data set into an Enterprise Data Warehouse or data services can be fraught with confusion, making it very difficult to estimate the level of effort and deliver on time. It makes sense, therefore, to identify and document a list of “what we need to know about data” so that standard policies and data dictionary templates can be crafted to serve as the foundation for the data contract (above and beyond an XSD if you're using XML – unless the XSD is completely expressive, with restrictions on lengths, acceptable values, etc., though even then it can be hard for business and data analysts to review an XSD).

The list of “what we need to know about data” must go far beyond bare-bones metadata such as the field name, datatype, length, and description. Why? Because someone is going to have to gather the missing information, and this information collection takes a significant amount of time and effort on the part of business analysts, data analysts, data stewards, and modelers. If the effort to collect this information isn't made up front, then more time and money will be required during the development process, with an increased risk of a lack of confidence in the data.

By identifying the standard list of “what we need to know about data,” the source metadata can be assessed more rapidly to identify what is missing, and you will be better able to estimate the level of effort. For example, assume you've identified 15 things you need to know about every atomic data element/field/column in your standard data source data dictionary template, and for the sake of argument assume you have two data sources you want to integrate, each with 100 data elements/fields – one with 75% of the needed metadata available and the other with 40% available. Obviously the first can be integrated more rapidly (all other things being equal). This should also affect pricing if you're bringing in data from an external customer, as the first dataset will be less expensive to integrate – you will have to spend much less time trying to gain a proper understanding of the data, so ETL and data quality checks can be developed with much less rework. Below are examples of metadata elements about source data you might want to capture.

| Metadata Element | Mandatory? | Comment |
| --- | --- | --- |
| Container Metadata | | |
| Container Technical Name (table, file name, container XML element…) | Y | |
| Container Business Name | Y | |
| Container Definition | Y | |
| Container Relationships | Y | Ideally these will be identified in a data model you receive from the data source provider. Even if you're getting a single flat file, don't assume there aren't relationships – the data may have been denormalized, in which case you want to know how the data elements relate and the cardinality/optionality of those relationships. |
| Attribute/Atomic Element/Column/Field Metadata | | |
| Element Technical Name | Y | Be sure you get descriptions for all abbreviations used. DON'T make assumptions! |
| Element Business Name | Y | Make sure this is something a business person would understand without having to be a source system expert. |
| Synonyms/Acronyms | N | Goes a long way to help bridge the semantic divide. |
| Description | Y | As detailed as possible, avoiding tautologies whenever possible; e.g., defining PERSON_NM as "The name of a person" is a tautology. |
| Part of Primary Key? | Y | |
| Part of Natural Key? | Y | Extremely important – the primary key might just be a sequential number. You need to know what makes a record unique from a business perspective. If you know the natural key, you've saved yourself some trouble! |
| Datatype | Y | |
| Minimum Length | Y | Usually only the maximum length is identified, but for character fields there may be a requirement that the data be at least N characters. |
| Max Length | Y | |
| Mandatory | Y | Indicates whether the element allows nulls. |
| Valid Values / Domain | Y | Can be a set of limited values (if so, document these below), a range, or verbiage describing valid values. |
| Valid value | Y | |
| Valid value description | Y (if a limited set of valid values) | If the valid values are X, Y, or Z – what do these mean? |
| Example values | Y (depending on Valid Values / Domain) | If the list of valid values can be expressed above, this is not needed. But if the set of valid values is too large, provide examples. |
| Conditions / Exclusions | N | If there are special conditions or rules to consider, e.g., if the Claim Date year is < 2010 then there should not be an "X" value. |
| Format | Y (if not freeform text) | E.g., an SSN format might be NNN-NN-NNNN; an amount might be $N,NNN.NN. |
| Security / Privacy | N | You want to identify whether the element contains PHI, needs special access restrictions, etc. |

Of course, it should go without saying that you will still need to perform data profiling on the data source, as discrepancies between the data and the metadata do happen. For example, if the data producer specifies that a certain data element may only have two values (e.g., Y or N), but through profiling you see that there are some nulls, X's, 9's, etc., then there's a data quality issue that should be addressed in the data source (if at all possible), or the metadata needs to be updated.
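
A profiling check like that can be very lightweight. The sketch below is a minimal, hypothetical example (the file name members.csv, the column SMOKER_IND, and the documented Y/N domain are all invented) showing how a quick value-distribution scan surfaces exactly the kind of discrepancy described above.

```python
# Quick-and-dirty domain profiling for one column of a delimited extract.
import csv
from collections import Counter

DOCUMENTED_DOMAIN = {"Y", "N"}   # what the source data dictionary claims

counts = Counter()
with open("members.csv", newline="") as f:
    for row in csv.DictReader(f):
        value = (row.get("SMOKER_IND") or "").strip()
        counts[value if value else "<blank>"] += 1

print("Observed value distribution:", dict(counts))

violations = {v: n for v, n in counts.items() if v not in DOCUMENTED_DOMAIN}
if violations:
    # Raise these with the data producer, or update the metadata if the
    # values turn out to be legitimate.
    print("Values outside the documented domain:", violations)
```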

So before you agree to bring in a new dataset, be sure to assess the source metadata. The degree of metadata completeness is as important as knowing when the data will be received, how it will be received, latency requirements, etc., and assessing it up front will result in faster and more accurate integration.

I’m interested to hear your feedback. You can reach me at pete.stiglich@perficient.com, on Twitter at @pstiglich, or in the comments section below.

Big Data as an Archival Platform
https://blogs.perficient.com/2013/06/24/big-data-as-an-archival-platform/ (Mon, 24 Jun 2013)

Operational systems frequently have to have data archived (i.e., backed up and purged from the database) in order to maintain performance SLAs. Historical data is often maintained for a much longer period in data warehouses, but because a data warehouse is intended to store integrated, standardized, and historical (and perhaps current) data in support of cross-functional analysis and reporting, it can't truly be considered an archival platform where the original data is stored exactly as it looked in the operational source system.

Record retention policies and laws often specify that business records be retained in exactly the format in which the business record was originally recorded. We can, of course, archive our data off to tape and then consider our archiving complete.

However, an alternative which may be considered is to use a Big Data cluster/grid to store all of our archived business records, whether structured or unstructured in nature (database records, contract images, etc.). I will point out a couple of reasons for considering this:

1) Online retention of all data – if the data can be legally retained indefinitely (sometimes you can’t retain records past a certain point e.g., juvenile court records), you can have rapid access to all of your historical data in its original format. As soon as you purge the data from your system, someone will think of a reason why they want or need to look at or analyze that data. Keeping it on-line means that you can minimize the amount of time required to make that information re-accessible. For example, legal discovery can be performed more rapidly if all the historical data is retained. Your data scientists and analysts will also be very happy by having all of this historical data available for data mining and investigation.

2) Big Data platforms such as Hadoop or Cassandra can store any kind of file in a distributed, redundant fashion, thereby having fault tolerance built into your cluster (homogeneous compute nodes) or grid (heterogeneous compute nodes). By default in Hadoop, each data block is replicated onto three nodes, though this is configurable. Cassandra can operate across multiple data centers to enable even greater fault tolerance. In a Big Data platform, low-cost commodity servers and DAS (Direct Attached Storage) can drive down the cost of retaining all of this information. As a result, archival off to tape might not be a mandatory step in the archive process.

It should go without saying that retaining the metadata describing your business records is essential, and that it must account for structural changes over time. One option for representing structured business records from a database in the Big Data archival platform is to convert the transactions to XML, so that the structure and relationships are encapsulated and can be leveraged for the long term without having to reload the data back into the operational system. This will of course increase the amount of space required to store the data, but given the benefits it can be a reasonable tradeoff. And be sure that the XSDs describing your XML files are retained and secured!
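
As a rough illustration of the XML conversion idea, the sketch below turns one archived database row into a self-describing XML record. The CLAIM table, its columns, and the values are hypothetical; in practice the element names and structure would be governed by the XSDs you retain alongside the archive, and the resulting documents would be written to the Hadoop/Cassandra platform rather than printed.

```python
# Convert an archived row into XML so structure survives outside the source system.
import xml.etree.ElementTree as ET

claim_row = {
    "CLAIM_ID": "C-1001",
    "MEMBER_ID": "M-42",
    "CLAIM_DT": "2013-06-01",
    "TOTAL_AMT": "125.50",
}

def row_to_xml(table_name, row):
    record = ET.Element(table_name)
    for column, value in row.items():
        # One element per column; related records could be nested the same way.
        ET.SubElement(record, column).text = value
    return ET.tostring(record, encoding="unicode")

print(row_to_xml("CLAIM", claim_row))
# e.g. <CLAIM><CLAIM_ID>C-1001</CLAIM_ID><MEMBER_ID>M-42</MEMBER_ID>...</CLAIM>
```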

Data Archival is represented in the Big Data Stack within the Data Sourcing component (as existing archives might serve as a data source for the Big Data platform), though archival might also fall under the Operations component to support business record retention, legal discovery, and other types of applications.

Figure 1 – Perficient’s Big Data Stack


I am interested to hear your feedback and experience of using Big Data clusters/grids as an archival platform. You can reach me at pete.stiglich@perficient.com, on twitter at @pstiglich, or in the comments section below.

Healthcare Data Modeling Governance
https://blogs.perficient.com/2013/05/29/healthcare-data-modeling-governance/ (Wed, 29 May 2013)

I participated in a webinar/panel discussion last week hosted by Dataversity on Data Modeling Governance, which was well attended and lively. The focus was on governance of Data Models and the Data Modeling environment (e.g., tools, model repositories, standards).

Data Modeling Governance is supported by Data Governance – and Data Governance benefits significantly from Data Modeling Governance. I will describe what Data Modeling Governance is and how it relates to Data Governance.

Data Modeling Governance entails making strategic decisions about enterprise data model assets, data modeling practices, and the data modeling environment in order to support improved governance, management, quality, and definition of data assets. Data assets usually originate from a data model – the more our models are aligned with the enterprise (e.g., standards, nomenclature, and modeling practice), the more our data assets will be aligned, reusable, sharable, and of higher quality.

Some examples of Data Modeling Governance in support of Data Governance:

  • Development of and adherence to a Data Modeling Standards document. By documenting these standards (rather than having them in people’s heads or in an email somewhere) a common modeling practice can be instituted, leading to less contention and confusion, more effective model reviews, more rapid modeling (not reinventing the wheel on every project), and more reusable and standardized model objects (resulting in more reusable, integratable, and shareable data – all good things that Data Governance loves).
  • Enabling a common model repository leads to more secure and findable models. Data models often express significant intellectual capital of the enterprise and so need to be properly secured. Much business metadata can and should be expressed in data models (e.g., business object names, conceptual relationships, definitions, etc.). This is of much value to Data Governance and especially important to Data Stewardship.

Data Governance in turn needs to support Data Modeling Governance. Often, Data Architects and Modelers struggle to implement best practices, only to face resistance from project teams. By raising the issues encountered to a Data Governance Board, Data Governance – working with Application Development leadership – can recommend updates to the SDLC to ensure those best practices are included, so that adequate time for solid modeling is accommodated in projects.

I am interested to hear your feedback and experience of Data Modeling Governance. You can reach me at pete.stiglich@perficient.com, on twitter at @pstiglich, or in the comments section below.

Big Data Security
https://blogs.perficient.com/2013/05/01/big-data-security/ (Wed, 01 May 2013)

In the Big Data Stack below, security is a vertical activity that touches all aspects of a Big Data architecture and must receive careful attention, especially in a healthcare environment where PHI may be involved.

NoSQL technologies, which many Big Data platforms are built upon, are still maturing technologies where security is not as robust as in a relational database. Examples of these NoSQL technologies include MongoDB, HBase, Cassandra, and others. When designing Big Data architecture, it is easy to get excited about the power and flexibility that Big Data enables, but the more mundane non-functional requirements must be carefully considered. These requirements include security, data governance, data management, and metadata management. The most advanced technology might not necessarily be the best fit if some of the critical non-functional requirements can’t be accommodated without too much trouble.

To protect PHI, you might consider specialized encryption software such as IBM InfoSphere Guardium Encryption Expert or Gazzang, which can perform encryption and decryption at the OS level as IO operations are performed. These encryption technologies operate below the data layer and as such can be used regardless of how you store the data. Using these technologies means that you can have a robust and highly secure Big Data architecture.

Figure 1 – Perficient’s Big Data Stack


I’m interested to hear your feedback and experience of security in Big Data platforms. You can reach me at pete.stiglich@perficient.com, on twitter at @pstiglich, or in the comments section below.

Business Glossary – Sooner Rather than Later
https://blogs.perficient.com/2013/04/02/business-glossary-sooner-rather-than-later/ (Tue, 02 Apr 2013)

If you are undertaking any significant data initiatives, such as MDM, Enterprise Data Warehousing, CRM, integration/migrations due to mergers or acquisitions, etc., where you have to work across business units and interface with numerous systems, one of the best investments you can make is to implement an Enterprise Business Glossary early on in your initiative.

An Enterprise Business Glossary is used to facilitate and store definitions and additional metadata about business terms, business data elements, KPIs, and enterprise standard calculations (not every calculation will be a KPI – e.g., a calculation for determining tax).

Why is an Enterprise Business Glossary so important? Data is meant to represent a real world business object and needs to be named and defined appropriately so it can be found, managed, analyzed, and reused. Semantic and naming confusion leads to dis-integration. Data entities in a Conceptual Data Model or a Semantic Model (e.g., using RDF/OWL) represent business objects such as Customer, Product, Vendor and so on. If you are trying to integrate data you have to understand what these terms mean – to the business units, source system, and the enterprise. Otherwise, there is a lot of flailing that will go on – data analysis, data modeling, and BI are all affected by lack of clarity in business terminology. Executive requests for analyses which seem straightforward (e.g., How many members do we have?) are hindered by the semantic confusion and variability found in our enterprises. Duplicate systems, data, and processes performing similar functionality require expensive resources.

Is the Enterprise Business Glossary the answer to all of our integration problems? Of course not, but if you don't have infinite resources to throw at a problem, purchasing and implementing Business Glossary technology is a good way to be more agile and reduce costs. It is one of the least expensive and easiest enterprise data architecture components to implement and can have a very high ROI. It can facilitate many aspects of data governance and data stewardship, improve communication and collaboration between business and IT and between business units, lead to faster integration, provide for better knowledge transfer and retention (the definitions are stored in an enterprise repository rather than a spreadsheet floating around somewhere), and simplify workflow.

Poor metadata management is a leading cause of enterprise data failures – implementing an Enterprise Business Glossary should allow you to do some basic impact analysis as you can link glossary items to model objects, database tables, ETL processes, reports, cubes, and other implementation objects. This can enable your data stewards to measure where the data is used, where it comes from, and where it’s out of compliance.

I’m interested to hear your feedback or questions about Business Glossaries. For additional information about Business Glossaries, please download my whitepaper. You can reach me at pete.stiglich@perficient.com, on Twitter at @pstiglich, or in the comments section below.

Perficient’s Big Data Stack – Infrastructure Tier
https://blogs.perficient.com/2013/02/04/perficients-big-data-stack-application-tier/ (Mon, 04 Feb 2013)

In a previous blog posting, I introduced Perficient's Big Data Stack in order to provide a framework for a Big Data architecture and environment. This article is the second in a series focusing on the components of the Big Data Stack (the first article covered the Application tier). The stack diagram below has two key vertical tiers – the application tier and the infrastructure tier. I will discuss the infrastructure tier in this article.

Figure 1 – Perficient’s Big Data Stack


As mentioned previously, with any data project it is important to identify non-functional requirements (requirements for performance, scalability, availability, backups, etc.). Requirements drive design, and to a large degree should drive infrastructure decisions. You might need more than one Big Data environment to meet the needs of different applications. For example, you might have a mission-critical environment requiring extensive controls or you might have a separate environment to support Data Scientists who are exploring hypotheses requiring more flexibility in the environment.

For the infrastructure tier, there is a dizzying array of technologies and architectures to choose from. Not only do you need to determine which processing platform to use (Will we deploy on a public or private cloud? Will we utilize proprietary appliances or commodity hardware? How many servers do we need in our Big Data MPP grid or cluster?), you also have to decide how you're going to persist the data (or whether to persist all of it at all – e.g., for stream processing, data is processed in memory and only the information needed long term is persisted to disk), and whether you will use a relational database or NoSQL technology. If you want to go with a relational database, can it scale into the petabyte range? Will it handle unstructured or semi-structured data efficiently? If you are going to use a NoSQL platform, does it have the necessary security and data management controls? Will you have PHI or other sensitive data in the Big Data environment? Security controls for NoSQL databases aren't as robust as those of relational databases, so you might need an IO-level encryption technology such as IBM InfoSphere Guardium Encryption Expert or Gazzang.

Are you going to use open source or COTS software? Big Data technologies like Hadoop and Cassandra are open source, but you might want to buy a tool that masks a lot of the underlying complexity of tying open source software together, as well as providing analytics and data management capabilities to streamline Big Data efforts. Because Hadoop and Cassandra can leverage commodity hardware, you can significantly drive down the cost of your Big Data environment – as long as you have the skill sets in house to implement open source.

There are over 100 NoSQL databases to choose from for different types of data representation (wide column, key value pair, graph, XML, object, document). Figure 2 below shows the different types of data representation intersected by the types of processing which might be a good fit for the representation type (not limited just to Big Data representation).

Figure 2 – Data Representation by Processing Type


These are just a few of the questions you will need to ask as you consider the infrastructure needed for your Big Data environment. You will want to perform a strategy, roadmap, and readiness assessment for your Big Data program before undertaking infrastructure decisions and acquiring technologies.

I will delve into more detail into these components of Perficient’s Big Data stack in future articles. I’d be interested to hear your feedback on the Big Data Stack and how this compares with your experience. You can reach me at pete.stiglich@perficient.com, or on twitter at @pstiglich, or in the comments section below.

Perficient’s Big Data Stack – The Application Tier
https://blogs.perficient.com/2012/12/11/perficients-big-data-stack-the-application-tier/ (Tue, 11 Dec 2012)

In my last blog posting, I introduced Perficient’s Big Data Stack in order to provide a framework for a Big Data environment. The stack diagram below has two key vertical tiers – the application tier and the infrastructure tier. I will discuss the application tier in this article.

Figure 1 – Perficient’s Big Data Stack

As with any data project, it is of course important to understand how it will be used (the functional requirements), and equally important to identify non-functional requirements (requirements for performance, scalability, availability, etc.). Requirements drive design, and to a large degree should drive infrastructure decisions. Your Data Scientists need to be actively engaged in identifying requirements for a Big Data application and platform, as they will usually be the primary users.

Most Big Data platforms are geared towards analytics – being able to process, store, and analyze massive amounts of data (hundreds of terabytes, petabytes, and beyond), but Big Data is being used for operations as well (such as patient monitoring, real-time medical device data integration, web intrusion detection, etc.).

Analytics in Big Data is typically not a replacement for your Business Intelligence capabilities. A powerful analytics use case for Big Data is supporting ad-hoc analyses that might never need to be repeated – Data Scientists formulate hypotheses and use a Big Data platform to investigate them. Some hold the opinion that Big Data means unstructured data. Unstructured data definitely works quite well on a Big Data platform, but if you have massive amounts of structured data (such as medical device data, RFID data, or just regular tabular data), it can of course take advantage of a Big Data platform as well – you can perform Data Mining inexpensively, even using open source data mining tools such as R and Mahout.

Most of the time you will be sourcing data for your Big Data application and so the Data Sourcing component is at the top of the application tier in the Big Data Stack. There are many tools available for sourcing such as ETL tools, log scanners, streaming message queues, etc. Due to the massive scalability and reliability of Big Data platforms such as Hadoop and Cassandra (NoSQL technologies), such platforms may be an ideal place to archive ALL of your data online. With these technologies, each data block is automatically replicated on multiple machines (or even multiple data centers in the case of Cassandra). Failure of the nodes in the cluster or grid is expected and so node failures can be handled gracefully and automatically. Having all of your archived data available in a single environment online can provide a rich environment for your data scientists.

We’ve talked briefly about operational applications for a Big Data platform – but much more can be said. Transaction processing can be supported on Big Data, but usually not ACID compliant transaction processing (HBase, built on top of Hadoop, touts that it can support ACID transactions). Not all types of transactions require ACID compliance – e.g., if a user updates his/her Facebook status and the status is lost due to a failure, it’s not the end of the world. Some operational applications might not persist all of the data – and might just distribute the data across the nodes to utilize computing capacity and distributed memory.

I will delve into more detail into these components of Perficient’s Big Data stack in future articles. I’d be interested to hear your feedback on the Big Data Stack and how this compares with your experience. You can reach me at pete.stiglich@perficient.com or by leaving a comment in the section below.
