A Data Lake can be a highly valuable asset to any enterprise, and there are myriad technology solutions available for implementing the processes that feed, maintain and retrieve information from the Lake.
But all of this technology is significantly less valuable – if not worthless – if the environment is not well governed and managed. This is the primary Takeaway to keep in mind when a Data Lake solution is being considered by any organization – or is already in place but in need of improvement.
Another takeaway is the idea of positioning the Data Lake as an Aggregator of information – operating, analogously, like a warehouse store – positioned to serve Consumers, but ultimately responsible for determining how best to collect, store, and make available the information it houses. This takeaway significantly influences how the Governance of the environment is set up and run.
If you accept the above two statements – the criticality of Governance and the Operating Model of an Aggregator – some other observations can be made:
As mentioned in a previous entry, these observations may sound dictatorial. But for this to be successful, a highly collaborative environment – one where all parties are willing to compromise and reach consensus on the information assets housed in the Data Lake – must be an integral part of the culture of the enterprise.
So, this completes my journey into Data Lakes and the Information Governance they require. I hope you found it interesting and helpful. Feel free to reach out with any comments or observations you may have. Thanks so much for reading my blog.
In my last blog, you may recall that we discussed the value of, and need for, Standards and Templates to ensure consistent and efficient use of the Data Lake, both in its population (supplying) and in its retrieval (consuming) of information. Achieving this level of consistency and efficiency, as well as reliability, requires a robust Information Governance Program responsible for overseeing the environment. In this entry, I will provide an overview of what this means to me.
As I’ve referenced in previous blog entries, Information Governance can be defined as a strategic practice that defines the Rules (inclusive of policies, guidelines, laws, etc.) for interacting with Information, the Decision Rights and Responsibilities of all parties involved in these interactions, and the Processes and Controls to be followed when performing these interactions. To accomplish this, the IG Practice itself fulfills a set of oversight roles that can be compared to our (the U.S.) form of government, consisting of three branches – Executive, Legislative and Judicial.
| Branch | Description | Fulfilled By |
| --- | --- | --- |
| Executive | Provides overall strategy and guidance to the Program and how it serves (and benefits) the organization. Identifies and approves the needed Artifacts (Rules, Decision Rights, Processes) | Governance Committee/Board, Steering/Strategic Committee, etc. |
| Legislative | Creates, maintains and improves the artifacts at the behest of the Executive Branch; communicates and describes the artifacts to the enterprise | Governance SME-based Workgroups, Governance Analysts, etc. |
| Judicial | Enforces artifacts and identifies needs (along with the entire user community) for the creation, modification or removal of artifacts | Information Stewards, Owners, etc. |
As for the Rules, Decision Rights and Processes themselves, we need to consider the overall purpose and role of a Data Lake and craft them accordingly. If you accept that the Data Lake will house the Information Assets of the enterprise, the following are some examples of these artifacts consistent with that model.
Rules
As indicated, this is a broad category meant to capture the “enforceable” items with regard to the use of the Data Lake. Some categories of these rules include:
Decision Rights
Decision Rights bestow enforceable privileges (and the associated responsibility) upon parties involved in the program. These rights need to be defined for all governance and user roles. Using the Aggregator analogy we have been talking about, the following are examples of the Decision Rights bestowed upon the Supplier, Consumer and Aggregator.
Supplier Rights
Consumer Rights
Aggregator Rights
These decision rights may appear “dictatorial” and at cross-purposes, but that is not the case. The expectation is that decisions will be highly collaborative between the parties, but that, ultimately, each party has the right to make the decision best suited to them.
Processes
Processes essentially define how and when the Rules and Decision Rights are applied along a path of activities designed to achieve a usage goal of the Data Lake. These Processes must again be defined both for governing the information and for how user interactions are to take place. Some Processes that would be defined by the IG Program include:
As you can see, there is a lot of “infrastructure” that needs to be put in place for the effective and efficient use of a Data Lake. But if the enterprise wants a valuable and reliable Data Lake, it is an investment worth making.
The establishment and maintenance of this infrastructure is the duty and responsibility of an Information Governance practice area – which is why I consider IG an essential aspect of any Data Lake initiative.
In my next post I will provide some key takeaways to keep in mind when creating the business case for the establishment of an Information Governance Program for getting the most out of a Data Lake.
In my previous blog, I described the concept of an “Information Catalog” and how it plays a vital role in ensuring that communication between the Data Lake Aggregator and its Suppliers and Consumers is efficient and effective, thanks to the common language it provides.
I also included the following diagram as an example of how the Catalog is used to connect the artifacts built for describing the information assets:
I also mentioned that confusion can still reign if there are no standards in place to guide and control how the specification, requirement and design artifacts needed for these collaborations are presented. This post takes a look at some artifacts typically generated by Suppliers and Consumers, suggesting how these standards can be realized through the use of templates defined by the Aggregator – or, more specifically, the IG Program overseeing the Data Lake.
The Supplier needs to communicate not only what is being supplied, but also how it is being supplied, in sufficient detail that the Aggregator can take the information, get it “landed” into the Lake, and then find within it the relevant information needed to fulfill Consumers’ needs.
Using the example of a Supplier providing an “Extract File”, the following set of templates, or required artifacts, should be used to fully specify what is in the Extract File:
| Template | Description |
| --- | --- |
| Semantic Model | This represents the concepts, their characteristics and their relationships to one another. This is not so much a template as a set of standards for representing these aspects in a “boxes and lines” kind of view. These models must represent a subset of the Catalog’s Model (which may require an expansion of the Catalog if the Supplier is providing information not yet represented) |
| Glossary of Terms | This glossary contains not only the Semantic Model items, but also other terms that may describe information being provided that is derived from the semantic model (for example, Calculated or Summary Values present in the extract file). This template contains a set of standard “columns” for describing a term (definition, synonyms, term categorization, etc.) |
| Rules | This presents all constraints that the Supplier’s system enforces on the information being provided. For example, if the model identifies that a Person can have many Addresses, but the supplier system only allows one Address per Person, that would be documented in this Rulebook. Similar to the Glossary template, the rule template should contain typical “columns” for describing a rule |
| Translation Map | This is the heart of the specification in that it “connects” the information being provided (in this case the extract file’s records and fields) to the concepts as represented in the Semantic Model and Glossary of Terms. This template therefore consists of columns that describe the record/field being supplied and a matching set of columns that describe the concepts to which these items align, or map, as represented in the Model/Glossary (see the sketch following this table) |
| Field Definition Dictionary | Similar to the Glossary, this presents a description of every field in the extract file. This template consists of a set of columns typical for describing a field, but should also, like the glossary, offer guidance as to what constitutes a good definition |
| Field Valid Values | For any field whose contents are constrained by the supplying system, this lists the full set of valid values. This template consists of a set of columns for describing a value including, in the case of “codes” or other cryptic values, columns that allow for a full description of the meaning of each value |
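As a purely illustrative sketch of the Translation Map idea – the column names here are my own invention, not a prescribed standard, and an actual template would be defined by the IG Program for its own enterprise – a single row of such a map might be represented like this in Python:

```python
# Hypothetical Translation Map row. The two "sides" (physical field vs.
# catalog concept) come from the template description above; all field
# names and values are invented for illustration.
from dataclasses import dataclass

@dataclass
class TranslationMapRow:
    # Supplier side: the physical item in the extract file
    record_name: str      # e.g., the record type within the extract
    field_name: str       # the field being supplied
    field_format: str     # how the value is physically represented
    # Catalog side: the concept the field maps to
    concept: str          # concept from the Semantic Model
    attribute: str        # characteristic of that concept
    glossary_term: str    # the governed term in the Glossary of Terms
    mapping_notes: str = ""  # transformations, caveats, derivations

row = TranslationMapRow(
    record_name="MEMBER_RECORD",
    field_name="MBR_DOB",
    field_format="YYYYMMDD",
    concept="Person",
    attribute="Date of Birth",
    glossary_term="Member Date of Birth",
    mapping_notes="Supplied as text; converted to a date upon landing.",
)
print(row)
```

The point is the pairing: every physical item the Supplier sends is explicitly tied back to a governed concept, so the Aggregator never has to guess what a cryptic field name means.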
The Consumer needs to tell the Aggregator what they need, but needn’t, at least initially, worry about exactly how that information will be presented to them. This gives the Aggregator some flexibility in fulfilling the need which, in turn, improves efficiency of delivery: the Aggregator can offer “standard” packages of information that may serve the needs of multiple Consumers.
Given that, the set of required artifacts for a Consumer focuses simply upon describing what is needed:
| Template | Description |
| --- | --- |
| Semantic Model | As with the artifact used by the Supplier, this represents the concepts, their characteristics and their relationships to one another. This is not so much a template as a set of standards for representing these aspects in a “boxes and lines” kind of view. These models must represent a subset of the Catalog’s Model (which may require an expansion of the Catalog if the Consumer is requesting information not yet represented) |
| Glossary of Terms | This glossary contains not only the Semantic Model items, but also other terms that may describe information being requested that is derived from the semantic model (for example, Calculated or Summary Values needed by the Consumer). This template contains a set of standard “columns” for describing a term (definition, synonyms, term categorization, etc.) |
| Rules | This presents all constraints that the Consumer’s system will enforce on the information being provided. For example, if the model identifies that a Person can have many Addresses, but the consumer system only allows one Address per Person, that would be documented in this Rulebook. Similar to the Glossary template, the rule template should contain typical “columns” for describing a rule |
As you may have noticed, the Consumer’s artifacts are identical to the Supplier’s as far as templates and content – the difference is strictly in the PERSPECTIVE from which they are populated. This furthers the de-coupling of sources from targets in that the Supplier need focus only on what they are providing, and the Consumer can focus only on what they need.
This gives the Aggregator significant flexibility, both in accepting information coming in and in the ways it sends information out.
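To make the shared-template idea a bit more tangible, here is a purely illustrative sketch of a single Glossary of Terms entry in Python. The “columns” (definition, synonyms, categorization) come from the template descriptions above, while the specific term and values are invented for the example:

```python
# Hypothetical Glossary of Terms entry. The columns mirror the template
# description above; the term and its values are invented for illustration.
glossary_entry = {
    "term": "Member Date of Birth",
    "definition": (
        "The date on which the covered member was born, "
        "as attested on the enrollment application."
    ),
    "synonyms": ["DOB", "Birth Date"],
    "categorization": "Person / Demographics",
    "related_concept": "Person",  # ties the term back to the Semantic Model
}

# The same entry serves both perspectives: a Supplier populates it to describe
# what an extract file carries; a Consumer populates it to describe a need.
print(glossary_entry["term"], "-", glossary_entry["definition"])
```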
Beyond the illustrative sketches above, I realize I have not provided a lot of detail on what a full template would actually look like, but, to some degree, that is dependent upon a particular enterprise’s needs and maturity. Hopefully this gives you sufficient information to get started on defining your own templates, but feel free to leave a comment or reach out directly to me if you’d like further information (or to add details of your own).
Finally, all this talk of Master Catalogs, Standards and Templates leads me to my ultimate area of interest for making all this work – Information Governance. For this all to come to fruition, and be sustainable, a robust Information Governance Program is required, and it is this I will discuss in my next post.
My previous blog described a Data Lake using a Supplier-Aggregator-Consumer analogy and discussed the roles each of these parties plays. One factor critical to the success of this approach is the use of a common vocabulary that ensures efficiency and effectiveness in the interactions and collaborations between the parties.
The implication of the Aggregator analogy is that suppliers and consumers independently approach the aggregator, so it is imperative that there is a common language utilized by all for describing what is provided (the “specifications” of the supplier’s content), what is needed/desired (the “requirements” of the consumers) and what is actually contained in the Data Lake (the “catalog” of information published by the aggregator).
So, what does this Catalog look like? Given that this is information we are talking about, it is probably nothing you haven’t seen before – essentially, it consists of a representation of the information housed in the Data Lake using Information/Data Models and a Glossary of Terms. Together they fully describe the information relevant to the business being conducted by the enterprise.
Both the Models and the Glossary exclusively describe “what” information exists using the “language of the business” for which it exists. Both the terminology and the representation/notation used in the models must be accessible to all those involved – both business and technical – to ensure maximum understanding.
To be perfectly clear – what this is NOT is a physical representation of how and where all the information is stored, or its format, access mechanisms or any other physical aspect. Those are all critical and play a part in the actual receipt and delivery of information, but that “how” detail is addressed separately in order to keep the Catalog focused upon ensuring a common language that does not fluctuate with the use or advancement of technology.
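As a purely illustrative sketch – the concepts, relationships and definitions below are invented for the example, not drawn from any real catalog – a small fragment of such a model-plus-glossary might look like this in Python:

```python
# Hypothetical fragment of a Catalog: a semantic model (concepts and their
# relationships) plus glossary terms. All names are invented for illustration.
catalog = {
    "concepts": {
        "Person": {"attributes": ["Name", "Date of Birth"]},
        "Address": {"attributes": ["Street", "City", "Postal Code"]},
    },
    "relationships": [
        # A Person can have many Addresses ("one-to-many").
        ("Person", "has", "Address", "1:N"),
    ],
    "glossary": {
        "Person": "An individual about whom the enterprise keeps information.",
        "Address": "A physical location associated with a Person.",
    },
}

# Note what is deliberately absent: tables, file formats, access mechanisms.
# The Catalog describes WHAT information exists, never HOW it is stored.
for source, verb, target, cardinality in catalog["relationships"]:
    print(f"{source} {verb} {target} ({cardinality})")
```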
The following diagram provides an example of how the Catalog serves as the “connecting thread” between what the supplier provides and the consumer needs:
This diagram illustrates the use of the Catalog not only for describing the information from both parties’ perspectives, but also for ensuring consistency and traceability between the physical instantiation of the information in the Lake and the common concepts represented in the Catalog.
All of this collaboration, even with a common language, can still be inefficient if every individual party is left to their own devices for presenting their specifications or requirements to the Aggregator. The establishment of standards and templates can greatly reduce this inefficiency, and I will discuss those in my next entry.
As you may recall, in my last blog I introduced the analogy of the Aggregator to describe utilizing a Data Lake as a Consolidator of information, and I mentioned the three key roles in this model: the Supplier, the Aggregator and the Consumer.
In this post I will provide a little more detail on the responsibilities of each of these roles – responsibilities that, when carried out diligently, provide an effective environment for obtaining significant value from the Lake.
For this model to work effectively, there are a few key points to keep in mind at all times:
Keeping these underlying principles in mind, the following set of responsibilities can be defined for each role (note that the embedded examples are for a Healthcare Insurance Provider):
This last statement is key to the connection to Information Governance. As a matter of fact, all these responsibility descriptions are an aspect of the “decision rights” defined and controlled by a Governance Body.
The implication is that the “keepers” of the Data Lake must establish the Governance of the information housed in the lake – although it is recommended that the IG Program be created organizationally as a separate and distinct entity from the Data Lake solution owner.
You will also notice that a linchpin connecting all these roles is a Catalog utilized by all parties in their communications with one another. The creation and maintenance of this catalog is the responsibility of the IG Program – and I will talk more about this artifact, and its importance, in my next post.
In my last blog, I introduced the concept of the Data Lake as a Consolidator and the critical success factor of applying robust Information Governance to this environment. In this post, I want to introduce an analogy to help visualize this environment and the parties involved.
So, a Data Lake as Consolidator. What does that really mean? Well, for me it means obtaining information from multiple sources and making it available to multiple targets – with a key differentiator of ensuring the targets do not need to know which source provided what information.
In other words, de-coupling sources from targets so that the focus is on the actual information is a key characteristic of a powerful, and useful, Data Lake.
This de-coupling provides a level of flexibility in that the addition, removal – and even the alteration of the access mechanism – of an involved system becomes much simpler and more efficient, because you need only focus upon that single system, and not worry about how it may, or may not, interact with others.
Stated another way, the Data Lake Consolidator can be described using the following purpose and value statements:
The focus of the Data Lake is to provide a singular and common mechanism for the sharing of information across a wide variety of systems and solutions
The benefit of the Data Lake is in the de-coupling of systems and removing point-to-point integration solutions to improve efficiencies and lower maintenance costs, while allowing both the removal and introduction of solutions without impacting any other solution or incurring the cost of integrating or de-integrating solutions
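To see what this de-coupling looks like in practice, here is a minimal, purely illustrative sketch in Python – the class and method names are my own invention, and a real Data Lake would of course involve far more than an in-memory dictionary:

```python
# Minimal, purely illustrative sketch of source/target de-coupling.
# Class, method and concept names are invented for illustration only.
class DataLake:
    def __init__(self):
        self._store = {}  # information keyed by catalog concept, not by source

    def supply(self, concept, records):
        # Suppliers land information under a common catalog concept;
        # nothing about the supplier is exposed to consumers.
        self._store.setdefault(concept, []).extend(records)

    def consume(self, concept):
        # Consumers ask for a concept; they never reference a source system.
        return list(self._store.get(concept, []))

lake = DataLake()
lake.supply("Person", [{"name": "Jane Doe"}])    # e.g., from a claims system
lake.supply("Person", [{"name": "John Smith"}])  # e.g., from an enrollment system
print(lake.consume("Person"))  # a unified view; the sources stay hidden
```

Notice that suppliers and consumers only ever reference the common concept (“Person”), never each other – so adding or removing a system on either side touches nothing else.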
I like to use the analogy of an Aggregator – in that, the central repository (the Data Lake) pulls information from a variety of sources (suppliers), aggregates it (separates, combines, consolidates, repackages – or just leaves it as is) and presents this source independent view of the information to the targets (consumers). The following picture provides a diagrammatic representation of this analogy:
This real-world concept is applied in our day-to-day living all the time – and is the underlying model to all retail interactions. But, as indicated, the “warehouse” model is probably closest to the Data Lake concept because it also provides the “direct” access to the products of the supplier “as-delivered” (just sitting on a pallet) – which is one of the options in a Data Lake.
For the right consumer, sometimes it just makes sense to provide direct access, offering that option in concert with the “re-packaged” versions.
This model relies upon a couple of key concepts: one being the reference to a “common vocabulary”, which I’ll discuss in a later post, and the other being the roles of Supplier, Aggregator and Consumer.
It is critical to clearly define and articulate these roles and their responsibilities so that all parties are “on the same page” as far as knowing how they play a part and, equally important, where the lines of demarcation lie between these roles. I will delve a little more deeply into these roles and responsibilities in my next post.
Since my last post, I’ve been working for a client that is actively engaged in establishing a Data Lake to support their analytics efforts, but that is also looking to “re-architect” the way their systems collaborate by using this Data Lake environment to control and consolidate all information-sharing interactions.
I was most interested in whether and how Information Governance practices were being defined and applied to this new “centralized” view of information sharing. This will be the focus of my next few blog entries.
I’m sure by now most people are familiar with the Data Lake concept: the idea that all data entering the enterprise – regardless of content, format or source – is placed, or landed, into the “lake” for others to access. However, accessing this “raw” data efficiently and effectively requires some level of transformation, consolidation and standardization, so that there is a “common” view of the information that can serve multiple targets without each having to devise its own custom mechanism for obtaining what it needs from the lake.
It is this common view that requires Information Governance. By putting in place an appropriate set of decision rights, controls (policies, rules, guidelines, etc.) and processes, there is a much better chance that the Lake will not become polluted AND that the actual content of the lake remains not only useful but accessible – irrespective of the addition and subtraction of both sources and targets.
Over the next few months I will present my thoughts on how best to go about this. First, I’ll describe the “architecture” and concept of utilizing a Data Lake for the above-mentioned purposes, using the analogy of an Aggregator (not unlike the warehouse store model that presents its offerings sometimes just as received and other times “repackaged” based upon consumer demand). From there I will dive into the roles and responsibilities of the players involved, the critical role of a “catalog” for managing the lake’s content, the equally critical role of standards and templates, the absolutely essential requirement of a robust Information Governance Program, and finally, a summary with some of the key takeaways.
Note that this is NOT a technical discussion – so I will not be talking about Hadoop, NoSQL, RDBMS or any of the myriad other associated technologies – but will instead focus upon the concepts and usage of a Governed Data Lake for ensuring business value is truly obtained from this environment.
I hope you will join me in this journey and that you find this both informative and useful.
As I have been continuing to work in the information governance area as it relates to healthcare, I recently came across an interesting development.
Some of my previous blog posts have covered the difference between Information Governance and Data Governance and some of the players in the field – including the American Health Information Management Association (AHIMA), specifically its efforts in the information governance arena within the healthcare space.
Since those posts, I’ve had a few conversations with the IG Advisors arm of the association and learned that they have introduced a new tool for measuring an organization’s maturity with regard to its Information Governance (IG) Program. The tool is called IGHealthRate™, and it is fairly robust, determining not only the current maturity level of an organization but also providing insights into steps the organization could take to progress along the maturity curve.
I’ve always believed that before any change can occur, one should clearly define a Vision of where they would like to be, regardless of where they actually are today. An assessment tool like IGHealthRate™ is a great mechanism both for understanding where you are and for forming a solid picture of where you would like to be.
AHIMA’s assessment tool is reflective of most maturity models in that it uses five levels of maturity, identified as At Risk, Aware, Aspirational, Aligned and Actualized. It then uses AHIMA’s own framework, as described in the IG Toolkit, to evaluate an organization across all the “pieces” – what they call Competencies – that a fully robust IG Program possesses and, through surveys and interviews, “scores” the organization’s maturity against each of these dimensions. They call this the Information Governance Adoption Model (IGAM)™, and the Competencies it identifies are: IG Structure, Strategic Alignment, Enterprise Info Mgmt, Data Governance, IT Governance, Analytics, Privacy and Security, Legal and Regulatory, Awareness and Adherence, and IG Performance. From there, the organization can define a roadmap for how best to evolve within each of these dimensions to move closer to its Vision State.
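As a purely hypothetical sketch – the five levels and ten Competencies below are as AHIMA names them, but the scoring structure is my own invention and is NOT how IGHealthRate™ actually works – an assessment result might be recorded like this in Python:

```python
# Hypothetical recording of an IGAM-style assessment. The maturity levels
# and Competencies come from AHIMA's published model; the 1-5 scores and
# the rollup below are invented for illustration only.
LEVELS = ["At Risk", "Aware", "Aspirational", "Aligned", "Actualized"]

COMPETENCIES = [
    "IG Structure", "Strategic Alignment", "Enterprise Info Mgmt",
    "Data Governance", "IT Governance", "Analytics", "Privacy and Security",
    "Legal and Regulatory", "Awareness and Adherence", "IG Performance",
]

# Example scores (1-5) per competency, e.g., gathered from surveys/interviews.
scores = {c: 2 for c in COMPETENCIES}
scores["Privacy and Security"] = 4  # one area may be further along than others

for competency, score in scores.items():
    print(f"{competency}: {LEVELS[score - 1]}")
```

Even a simple per-dimension view like this makes the gap between the current state and the Vision State concrete enough to plan a roadmap against.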
If you are interested in establishing (or improving) an IG Program, AHIMA’s IGHealthRate™ is a good first step to consider. It requires a minimal investment, and its results can help build a business case for pursuing and maturing the IG Program.
The explosion of data is something that executives across industries are trying to wrap their heads around, and healthcare is no different. In fact, healthcare data is expected to grow by 99% – patient data, wearables, medical literature, scientific articles and more are all adding to the flood of healthcare information. This data deluge is a big challenge for healthcare organizations because it leaves them unable to leverage information to make timely and profitable business decisions.
To solve the data challenge, many organizations try:
Unfortunately, these approaches are not very effective. To tackle data challenges, healthcare organizations must turn to governance. Governance helps address information challenges by:
Information and data governance are quickly becoming imperative for a healthcare industry that is seeking to capitalize on the value of its information assets while remaining committed to ensuring the reliability and integrity of the information and data used to improve care quality, operations, and financial performance. After all, trust in health information and high-quality patient care depend on it.
To learn more about trends impacting healthcare governance, download our recent guide, Healthcare Governance, Trends to Watch.
Of all the governance trends, none is more foundational and critical to the success of the governance program – indeed the organization itself – than the need for accurate, consistent, and relevant models that communicate the meaning, use, and residency of the information assets of the enterprise.
Modeling not only addresses the integration and ingestion of data across and between information systems, but also aids communication both within a healthcare organization and in the organization’s interactions with patients, partners, vendors and consumers. The models provide a consistent basis for understanding and minimize miscommunications, thereby increasing organizational efficiencies.
Governance programs are adopting not just the classic business glossary, but information reference models that provide the necessary context for the information across business units, technologies, applications, and personnel changes. Governance is becoming the keeper of this common language in order to ensure the associated rules, policies, controls, decision rights, and processes defined to govern the information are both understandable and enforceable regardless of the area of the organization impacted.
To learn more about this trend and the other trends impacting healthcare governance, download our recent guide, Healthcare Governance, Trends to Watch.
Addressing the ongoing explosion of data sources and storage options requires establishing consistent metadata attributes and semantic models across the enterprise to effectively govern information as an enterprise asset.
The primary objective of any enterprise governance program is to ensure consistent and timely data, so reaching consensus and agreement on a common understanding of concepts and metadata attributes must be addressed and enforced by the program. Cloud applications, for example, continue to be adopted, and most have their own semantic and metadata models. Integration of these respective views is foundational to governance because, without it, the meaning and reusability of the information suffer.
A recognition is forming that, as information becomes a true enterprise asset, the need to cross silos and reach consensus on a consistent meaning for the semantics and metadata used to describe the business domains – and the information itself – is becoming critical. This common understanding is best facilitated and controlled through robust governance that is enforced company-wide.
To learn more about this trend and the other trends impacting healthcare governance, download our recent guide, Healthcare Governance, Trends to Watch.
The rise of Big Data, self-service, and more powerful and flexible end-user information visualization and preparation tools is impacting governance in a significant manner with regard to structure, decision rights, and accountabilities. End-users are gaining more control of data, including the ability to integrate and manipulate data for their own purposes, and are able to select data based on relevance criteria not necessarily codified in classic metadata or semantic models.
What this means is that governance responsibilities, such as adhering to access policies, are becoming the responsibility of practically any individual who needs or uses the data. Stewardship, therefore, is becoming democratized across the user community, directly impacting the centralized model in which clear stewards and owners are typically named along domain boundaries. This paradigm shift means that anyone who uses the data has a say in how it is governed, but also the responsibility to behave accordingly.
With self-service as one of the key drivers, the need to broaden responsibility for stewarding information to a larger community of interested parties is becoming more common. This is consistent with the move towards a business-centric approach, as it is the business users who have the need and are taking on this responsibility for the information critical to them. Better data preparation tools and governance stewardship applications are also contributing to and supporting this trend.
To learn more about this trend and the other trends impacting healthcare governance, download our recent guide, Healthcare Governance, Trends to Watch.