Working with the Data Lake Aggregator – Standards and Templates - Perficient Blogs
Working with the Data Lake Aggregator – Standards and Templates

In my previous blog, I described the concept of an “Information Catalog” and how it plays a vital role in ensuring communication between the Data Lake Aggregator and Suppliers and Consumers is efficient and effective due to the common language that it provides.

I also included the following diagram as an example of how the Catalog is used to connect the artifacts built for describing the information assets:

I also mentioned that confusion can still reign if there are no standards in place to guide and control how the specification, requirement, and design artifacts needed for these collaborations are presented. This post looks at some artifacts typically generated by Suppliers and Consumers and suggests how these standards can be realized through templates defined by the Aggregator or, more specifically, by the Information Governance (IG) Program overseeing the Data Lake.

Supplier Artifacts

The Supplier needs to communicate not only what is being supplied but also how it is being supplied, in sufficient detail that the Aggregator can take the information, get it “landed” in the Lake, and then find the relevant information within it to fulfill Consumers’ needs.

Using the example of a Supplier providing an “Extract File”, the following set of templates, or required artifacts, should be used to fully specify what is in the Extract File:

Semantic Model: This represents the concepts, their characteristics, and their relationships to one another. This is not so much a template as a set of standards for representing these aspects in a “boxes and lines” kind of view. These models must represent a subset of the Catalog’s Model (which may require an expansion of the Catalog if the Supplier is providing information not yet represented).
Glossary of Terms: This glossary contains not only the Semantic Model items, but also other terms that describe information being provided that is derived from the Semantic Model (for example, Calculated or Summary Values that are present in the extract file). This template contains a set of standard “columns” for describing a term (definition, synonyms, term categorization, etc.).
Rules: This presents all constraints that the Supplier’s system enforces on the information being provided. For example, if the model identifies that a Person can have many Addresses, but the Supplier’s system only allows one Address per Person, that would be documented in this Rulebook. Similar to the Glossary template, the rule template should contain typical “columns” for describing a rule.
Translation Map: This is the heart of the specification in that it “connects” the information being provided (in this case the extract file’s records and fields) to the concepts as represented in the Semantic Model and Glossary of Terms. This template therefore consists of columns that describe the record/field being supplied and a matching set of columns that describe the concepts to which these items align, or map, as represented in the Model/Glossary.
Field Definition Dictionary: Similar to the Glossary, this presents a description of every field in the extract file. This template consists of a set of columns typical for describing a field, but should also, like the Glossary, offer guidance as to what constitutes a good definition.
Field Valid Values: For any field whose contents are restricted by the supplying system, this presents the full set of valid values. This template consists of a set of columns for describing a value including, in the case of “codes” or other cryptic values, columns that allow for a full description of the meaning of each value.
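
To make the idea of template “columns” a little more concrete, here is a minimal sketch in Python of three of these templates as simple record types, together with a check that the Translation Map is complete. Only definition, synonyms and category come from the Glossary description above; every other column name, the class names, the sample data and the `check_translation_map` helper are assumptions made for illustration, and a real template would follow the enterprise’s own standards.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    name: str
    definition: str
    synonyms: list[str] = field(default_factory=list)
    category: str = ""


@dataclass
class FieldDefinition:
    record: str        # hypothetical column: record type in the extract file
    field_name: str
    definition: str


@dataclass
class TranslationMapRow:
    record: str
    field_name: str
    term: str          # name of the GlossaryTerm this field maps to


def check_translation_map(fields, terms, rows):
    """Return a list of problems; an empty list means the map is consistent."""
    problems = []
    known_terms = {t.name for t in terms}
    known_fields = {(f.record, f.field_name) for f in fields}
    mapped_fields = set()
    for row in rows:
        key = (row.record, row.field_name)
        mapped_fields.add(key)
        label = f"{row.record}.{row.field_name}"
        if key not in known_fields:
            problems.append(f"{label}: not in Field Definition Dictionary")
        if row.term not in known_terms:
            problems.append(f"{label}: unknown term '{row.term}'")
    # Every field in the dictionary should be mapped to some concept.
    for record, field_name in sorted(known_fields - mapped_fields):
        problems.append(f"{record}.{field_name}: no Translation Map entry")
    return problems


# Illustrative (made-up) extract file with one unmapped field.
terms = [GlossaryTerm("Person Name", "The full legal name of a Person.")]
fields = [
    FieldDefinition("CUST", "CUST_NM", "Customer name as entered."),
    FieldDefinition("CUST", "CUST_STAT", "Customer status code."),
]
rows = [TranslationMapRow("CUST", "CUST_NM", "Person Name")]

problems = check_translation_map(fields, terms, rows)
print(problems)  # ['CUST.CUST_STAT: no Translation Map entry']
```

A check along these lines is one way an Aggregator could reject an incomplete specification before any data is landed, rather than discovering the gap later.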

Consumer Artifacts

The Consumer needs to tell the Aggregator what they need, but needn’t, at least initially, worry about exactly how that information will be delivered to them. This gives the Aggregator some flexibility in fulfilling the need which, in turn, improves efficiency of delivery, because the Aggregator can offer “standard” packages of information that serve the needs of multiple Consumers.

Given that, the set of required artifacts for a Consumer focuses simply on describing what is needed:

Semantic Model: As with the artifact used by the Supplier, this represents the concepts, their characteristics, and their relationships to one another. This is not so much a template as a set of standards for representing these aspects in a “boxes and lines” kind of view. These models must represent a subset of the Catalog’s Model (which may require an expansion of the Catalog if the Consumer is requesting information not yet represented).
Glossary of Terms: This glossary contains not only the Semantic Model items, but also other terms that describe information being requested that is derived from the Semantic Model (for example, Calculated or Summary Values that are needed by the Consumer). This template contains a set of standard “columns” for describing a term (definition, synonyms, term categorization, etc.).
Rules: This presents all constraints that the Consumer’s system will enforce on the information being provided. For example, if the model identifies that a Person can have many Addresses, but the Consumer’s system only allows one Address per Person, that would be documented in this Rulebook. Similar to the Glossary template, the rule template should contain typical “columns” for describing a rule.
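
The requirement that a Semantic Model be a subset of the Catalog’s Model can be checked mechanically once both are written down. The sketch below is one hypothetical way to do that in Python: the `SemanticModel` and `Relationship` types, the `catalog_gaps` helper, and the sample concepts (including “Household”) are all assumptions for illustration, not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    source: str
    verb: str
    target: str


@dataclass
class SemanticModel:
    concepts: set = field(default_factory=set)
    relationships: set = field(default_factory=set)


def catalog_gaps(model, catalog):
    """Return the parts of `model` that the Catalog does not yet represent."""
    return SemanticModel(
        concepts=model.concepts - catalog.concepts,
        relationships=model.relationships - catalog.relationships,
    )


# Illustrative Catalog and Consumer models (made-up content).
catalog = SemanticModel(
    concepts={"Person", "Address"},
    relationships={Relationship("Person", "resides at", "Address")},
)
consumer_model = SemanticModel(
    concepts={"Person", "Address", "Household"},
    relationships={
        Relationship("Person", "resides at", "Address"),
        Relationship("Person", "belongs to", "Household"),
    },
)

gaps = catalog_gaps(consumer_model, catalog)
is_subset = not gaps.concepts and not gaps.relationships
print(is_subset)      # False
print(gaps.concepts)  # {'Household'}
```

Anything reported in `gaps` is either an error in the Consumer’s model or a candidate for expanding the Catalog, which is the decision the text above leaves to the Aggregator.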

As you may have noticed, the Consumer’s artifacts use the same templates and the same kinds of content as the first three Supplier artifacts; the difference is strictly in the PERSPECTIVE from which they are populated. This furthers the de-coupling of sources from targets, in that the Supplier need focus only on what they are providing and the Consumer only on what they need.

This gives the Aggregator significant flexibility both in accepting information coming in and in choosing how to send information out.
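
One way to picture this de-coupling: because both sides describe their information in Catalog terms, the Aggregator can match a Consumer’s requested terms to Supplier fields without either party knowing about the other. The following Python sketch assumes each Supplier’s Translation Map has been reduced to `(record, field, term)` tuples; the function names, supplier names, and field names are hypothetical.

```python
from collections import defaultdict


def index_supplier_maps(supplier_maps):
    """Build {term: [(supplier, record, field), ...]} from Translation Maps."""
    index = defaultdict(list)
    for supplier, rows in supplier_maps.items():
        for record, field_name, term in rows:
            index[term].append((supplier, record, field_name))
    return dict(index)


def resolve_request(requested_terms, index):
    """Split a Consumer's requested terms into fulfilled and unmet."""
    fulfilled = {t: index[t] for t in requested_terms if t in index}
    unmet = [t for t in requested_terms if t not in index]
    return fulfilled, unmet


# Illustrative Translation Maps from two made-up Suppliers.
supplier_maps = {
    "CRM": [
        ("CUST", "CUST_NM", "Person Name"),
        ("CUST", "CUST_ADDR", "Address Line"),
    ],
    "Billing": [("ACCT", "HOLDER", "Person Name")],
}
index = index_supplier_maps(supplier_maps)

fulfilled, unmet = resolve_request(["Person Name", "Household Income"], index)
print(fulfilled)  # {'Person Name': [('CRM', 'CUST', 'CUST_NM'), ('Billing', 'ACCT', 'HOLDER')]}
print(unmet)      # ['Household Income']
```

Here the Consumer never names a source system, and the unmet list tells the Aggregator which terms still need a Supplier.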

I realize I did not provide a lot of detail or specific examples of what a template would actually look like, but, to some degree, that is dependent upon a particular enterprise’s need and maturity. Hopefully this gives you sufficient information to get started on defining your own templates, but feel free to leave a comment or reach out directly to me if you’d like further information (or to add details of your own).

Finally, all this talk of Master Catalogs, Standards and Templates leads me to my ultimate area of interest for making all this work – Information Governance. For this all to come to fruition, and be sustainable, a robust Information Governance Program is required, and it is this I will discuss in my next post.
