Since my last post I’ve been working for a client that is establishing a Data Lake to support their analytics efforts. They are also looking to “re-architect” the way their systems collaborate by using the Data Lake to control and consolidate all information-sharing interactions within their environment.
I was most interested in whether and how Information Governance practices were being defined and applied to this new “centralized” view of information sharing. This will be the focus of my next few blog entries.
I’m sure by now most people are familiar with the Data Lake concept: all data entering the enterprise, regardless of content, format or source, is placed, or landed, into the “lake” for others to access. However, accessing this “raw” data efficiently and effectively requires some level of transformation, consolidation and standardization, so that there is a “common” view of the information that can serve multiple targets without each of them having to devise its own custom mechanism for obtaining what it needs from the lake.
It is this common view that requires Information Governance. Putting in place an appropriate set of decision rights, controls (policies, rules, guidelines, etc.) and processes gives a much better chance that the lake will not become polluted, and that its content remains not only useful but also accessible, irrespective of the addition and removal of both sources and targets.
Over the next few months I will present my thoughts on how best to go about this. First I’ll describe the “architecture” and concept of using a Data Lake for the purposes described above, using the analogy of an Aggregator (not unlike the warehouse store model, which presents its offerings sometimes just as received and other times “repackaged” based upon consumer demand). From there I will dive into the roles and responsibilities of the players involved, the critical role of a “catalog” for managing the lake’s content, the equally critical role of standards and templates, and the absolutely essential requirement of a robust Information Governance Program. Finally, I will close with a summary of the key takeaways.
Note that this is NOT a technical discussion, so I will not be talking about Hadoop, NoSQL, RDBMS or any of the other myriad associated technologies. Instead I will focus on the concepts and usage of a Governed Data Lake for ensuring that business value is truly obtained from this environment.
I hope you will join me on this journey and that you find it both informative and useful.