Data Lake as Aggregator – The Critical Role of the Catalog

My previous post described a Data Lake using a Supplier-Aggregator-Consumer analogy and discussed the role each of these parties plays. One factor critical to the success of this approach is a common vocabulary that keeps the interactions and collaboration between the parties efficient and effective.

The implication of the Aggregator analogy is that suppliers and consumers independently approach the aggregator, so it is imperative that there is a common language utilized by all for describing what is provided (the “specifications” of the supplier’s content), what is needed/desired (the “requirements” of the consumers) and what is actually contained in the Data Lake (the “catalog” of information published by the aggregator).

So, what does this Catalog look like? Since we are talking about information, it is probably nothing you haven’t seen before: essentially, it is a representation of the information housed in the Data Lake, built from Information/Data Models and a Glossary of Terms. Together they fully describe the information that is relevant to the business being conducted by the enterprise.

Both the Models and the Glossary describe only “what” information exists, using the “language of the business” that the information serves. Both the terminology and the representation/notation used in the models must be accessible to everyone involved, business and technical alike, to ensure maximum understanding.

To be perfectly clear, the Catalog is NOT a physical representation of how and where the information is stored, its format, its access mechanisms or any other physical aspect. Those details are all critical and play a part in the actual receipt and delivery of information, but that “how” detail is addressed separately so that the Catalog stays focused on providing a common language that does not fluctuate with the use or advancement of technology.
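To make the separation of “what” from “how” concrete, here is a minimal sketch in Python. It is only an illustration under assumptions of my own: the class names (`GlossaryTerm`, `ModelEntity`, `Catalog`), the sample terms and the `hypothetical://` location are all invented for this example and do not describe any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GlossaryTerm:
    """A business term: describes *what* the information is, in business language."""
    name: str
    definition: str


@dataclass(frozen=True)
class ModelEntity:
    """An entity in the information model, expressed only through glossary terms."""
    name: str
    attributes: tuple[str, ...]  # each attribute is the name of a glossary term


@dataclass
class Catalog:
    """Hypothetical catalog: a glossary plus a model, with no physical detail."""
    glossary: dict[str, GlossaryTerm] = field(default_factory=dict)
    entities: dict[str, ModelEntity] = field(default_factory=dict)

    def add_term(self, term: GlossaryTerm) -> None:
        self.glossary[term.name] = term

    def add_entity(self, entity: ModelEntity) -> None:
        # The model may only use vocabulary that the glossary already defines.
        missing = [a for a in entity.attributes if a not in self.glossary]
        if missing:
            raise ValueError(f"undefined glossary terms: {missing}")
        self.entities[entity.name] = entity


catalog = Catalog()
catalog.add_term(GlossaryTerm("Customer Name", "The legal name of a customer."))
catalog.add_term(GlossaryTerm("Order Date", "The date on which an order was placed."))
catalog.add_entity(ModelEntity("Order", ("Customer Name", "Order Date")))

# The physical "how" lives outside the Catalog, so storage can change
# without touching the business vocabulary. The location is made up.
physical_locations = {
    ("Order", "Order Date"): "hypothetical://lake/raw/orders#order_dt",
}
```

The point of the design is that `Catalog` has no field for storage, format or access; moving the data to a different technology would change only `physical_locations`.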

The following diagram provides an example of how the Catalog serves as the “connecting thread” between what the supplier provides and the consumer needs:

This diagram illustrates that the Catalog is used not only to describe the information from both parties’ perspectives, but also to ensure that the physical instantiation of the information in the Lake remains consistent with, and traceable to, the common concepts represented in the Catalog.
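The “connecting thread” idea can also be sketched in a few lines of Python. This is a simplified illustration, not a real implementation: the party names, the terms, the physical column names and the helper functions `unknown_terms` and `coverage` are all hypothetical. It assumes that supplier specifications and consumer requirements are both stated purely in Catalog terms.

```python
# Terms published in the Catalog (the shared vocabulary).
catalog_terms = {"Customer Name", "Order Date", "Order Amount"}

# What each supplier provides and each consumer needs, stated in Catalog terms.
supplier_specs = {"Sales System": {"Customer Name", "Order Date"}}
consumer_requirements = {"Finance Report": {"Order Date", "Order Amount"}}

# Traceability: Catalog term -> physical instantiations in the Lake (made-up names).
physical_trace = {
    "Customer Name": ["sales_raw.cust_nm"],
    "Order Date": ["sales_raw.order_dt"],
}


def unknown_terms(declared):
    """Return, per party, any terms that are not part of the common vocabulary."""
    return {
        party: terms - catalog_terms
        for party, terms in declared.items()
        if terms - catalog_terms
    }


def coverage(requirements, specs):
    """For each consumer, split required terms into met and unmet by suppliers."""
    supplied = set().union(*specs.values())
    return {
        consumer: {
            "met": sorted(needed & supplied),
            "unmet": sorted(needed - supplied),
        }
        for consumer, needed in requirements.items()
    }


report = coverage(consumer_requirements, supplier_specs)
# Where the met terms physically live, found by following the trace.
sources = {term: physical_trace[term] for term in report["Finance Report"]["met"]}
```

Because both sides speak in Catalog terms, the aggregator can see that “Order Date” is already available and where it lives, while “Order Amount” is still waiting on a supplier, without either party knowing about the other.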

All of this collaboration, even with a common language, can still be inefficient if every individual party is left to their own devices for presenting their specifications or requirements to the aggregator. The establishment of standards and templates can greatly reduce this inefficiency and I will discuss those in my next entry.
