Data lineage is the process of understanding, recording, and visualizing metadata as it flows from the originating data source through to consumption. Using a data lineage tool can expedite the process; however, it is a common misconception that such a tool can automatically generate data lineage across several applications at the push of a button.
Data lineage platforms offer significant benefits, but like any other artificial intelligence solution, they require human input to be successful. The technical capabilities of the data lineage tool are important, but the essential role of human knowledge should not be dismissed.
There are four major steps in a typical data lineage effort: (1) data acquisition/sourcing, (2) data scanning, (3) data stitching, and (4) data lineage validation.
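For example, the data stitching step can be pictured as joining per-application scan results into one end-to-end graph. The sketch below is a minimal illustration of that idea in Python; the application and dataset names are hypothetical and not tied to any particular lineage tool.

```python
# Illustrative sketch only: a toy lineage graph stitched from two application scans.
# Application and dataset names are hypothetical, not from a specific tool.

from collections import defaultdict

# Each scan yields edges of the form (source_dataset, target_dataset)
# discovered inside one application's boundary.
scan_app_a = [
    ("loan_origination.applications", "staging.loan_feed"),   # App A writes a feed
]
scan_app_b = [
    ("staging.loan_feed", "risk_mart.cecl_inputs"),           # App B loads the same feed
    ("risk_mart.cecl_inputs", "reporting.cecl_summary"),
]

def stitch(*scans):
    """Merge per-application scans into one graph keyed by source dataset."""
    graph = defaultdict(set)
    for scan in scans:
        for source, target in scan:
            graph[source].add(target)
    return graph

def trace(graph, start):
    """Walk downstream from a dataset to show end-to-end lineage."""
    seen, stack, order = set(), [start], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(graph.get(node, ()))
    return order

lineage = stitch(scan_app_a, scan_app_b)
print(" -> ".join(trace(lineage, "loan_origination.applications")))
# loan_origination.applications -> staging.loan_feed -> risk_mart.cecl_inputs -> reporting.cecl_summary
```

Commercial platforms do this at column level and at far greater scale, but the principle is the same: lineage only becomes end-to-end once independent scans are stitched together at their shared interfaces.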
Here are seven questions to consider during the data acquisition/sourcing stage:
- Who are the right subject matter experts with the technical knowledge about the target application?
- What is the general purpose and background of the application?
- What data sources does the target application receive data from?
- What data sources does the target application send data to?
- Which data flow(s) should be prioritized? Prioritization can be based on which data flows carry the majority of the critical data elements or regulatory data, such as data used for Basel or CECL reporting.
- What are the source code repositories that contain the programs and scripts involved in moving data from source to target applications?
- How is data received from or sent to the applications in the prioritized data flows? Common methods include databases, mainframe files, real-time messages, non-mainframe file transmissions, and application programming interfaces (APIs). Each method brings its own follow-up questions:
  - Database (a catalog-query sketch for these questions follows the list)
    - What are the hostname, port number, service name, and schema needed to request database access?
    - What are the schema and table names in which the data resides?
    - Are there any stored procedures involved?
  - Mainframe
    - What is the mainframe complex, and does the data lineage platform have the underlying technology to support this type of mainframe?
    - How is data loaded into the application?
    - What are the job/procedure names, JCL, library names, COBOL programs, and copybooks for the starting point, intermediary jobs, and final transmission?
  - Real-time Messages
    - What are the relevant data fields, their corresponding tag numbers, and the message queue (MQ) column names?
  - Non-mainframe File Transmission
    - What is the name of the file, and how is the file created and loaded?
    - Is there a copy of the file layout or sample files you can provide?
    - Is there any source code or stored procedure that creates or loads the file (e.g., a Java program)?
  - API
    - What are the relevant file names and source code?
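As a concrete example of the database questions above, the following sketch inventories the schemas, tables, and stored procedures a lineage scanner would need to register. It assumes a PostgreSQL source reached through the psycopg2 driver; the hostname, port, service name, and read-only scan account are placeholders, not real access details.

```python
# Illustrative sketch only: gathering the metadata a lineage scanner needs
# (schemas, tables, stored procedures) from a relational source via a standard
# DB-API connection. Connection details below are placeholders, and
# information_schema views vary slightly by database vendor.

def inventory_database(conn):
    """Return the schema/table pairs and stored procedures visible to the scan account."""
    cur = conn.cursor()

    cur.execute("""
        SELECT table_schema, table_name
        FROM information_schema.tables
        WHERE table_type = 'BASE TABLE'
        ORDER BY table_schema, table_name
    """)
    tables = cur.fetchall()

    cur.execute("""
        SELECT routine_schema, routine_name
        FROM information_schema.routines
        WHERE routine_type = 'PROCEDURE'
    """)
    procedures = cur.fetchall()

    return {"tables": tables, "stored_procedures": procedures}

if __name__ == "__main__":
    import psycopg2  # assumption: a PostgreSQL source; swap the driver for your database

    conn = psycopg2.connect(
        host="db-host.example.com",  # hostname from the access request (placeholder)
        port=5432,                   # port number (placeholder)
        dbname="loan_mart",          # service/database name (hypothetical)
        user="lineage_scan",         # read-only scan account (hypothetical)
        password="***",
    )
    for schema, table in inventory_database(conn)["tables"]:
        print(f"{schema}.{table}")
```

Other databases expose the same information through slightly different catalog views, which is why the access details above still need to be confirmed with the application's subject matter experts before scanning begins.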
Perficient’s financial services and data solutions teams have extensive experience building and supporting complex data governance and data lineage programs that ensure regulatory compliance (e.g., BCBS 239, CCAR, MiFID II, GDPR, CCPA, FRTB) and enable data democratization. In addition to understanding how to navigate financial institutions with many complex systems, we have experience with various platforms and tools in the ecosystem, including ASG, Collibra, and Informatica Enterprise Data Catalog (EDC).
Whether you need help with business and IT requirements, data acquisition/sourcing, data scanning, data linking and stitching, UAT and sign-off, or data analysis – we can help.
Reach out today to learn more about our experience and how we can support your efforts.