
IBM Cloud Pak for Data - Multicloud Data Integration and Data Governance:

As we all know, IBM Cloud Pak for Data is a cloud-native solution that enables you to put your data to work quickly and efficiently. Let's look at the following features of IBM Cloud Pak for Data. I'll also share the practical experience I gained while working with them, through detailed steps:

  • Multicloud Data Integration with DataStage as a part of Data Fabric Architecture
  • DataStage AVI (Address Verification Interface)
  • Watson Knowledge Catalog – Data Governance Processes and Data Privacy

Multicloud Data Integration with DataStage:

IBM DataStage on IBM Cloud Pak for Data is a modernized data integration solution to collect and deliver trusted data anywhere, at any scale and complexity, on and across multi-cloud and hybrid cloud environments.

This cloud-native insight platform — built on the Red Hat OpenShift container orchestration platform — integrates the tools needed to collect, organize and analyze data within a data fabric architecture. Data fabric is an architecture that facilitates the end-to-end integration of various data pipelines and cloud environments through intelligent and automated systems.

It dynamically and intelligently orchestrates data across a distributed landscape to create a network of instantly available information for data consumers. IBM Cloud Pak for Data can be deployed on-premises, as a service on the IBM Cloud, or on any vendor’s cloud.

 

[Figure: IBM DataStage architecture diagram. Source: IBM Documentation]

Prerequisites: A DataStage service instance must be provisioned to perform the required tasks.

Below are the tasks performed in DataStage (an illustrative Python sketch of the equivalent extract-and-filter logic follows the list):

  1. Created a project and added DB2 as a connection
  2. Added data to the project from a local sample file
  3. Created a DataStage flow that extracts information from the DB2 source system
  4. Transformed the data by applying filters on the customer columns
  5. Compiled and ran the DataStage job to transform the data
  6. Delivered the data to the target; the Customers data asset then appeared under the project's Assets tab
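The flow itself is built visually on the DataStage canvas, but as a rough illustration, the extract-and-filter logic is broadly equivalent to the Python sketch below. This is not the DataStage job itself; the connection string values and the CUSTOMERS table and COUNTRY column names are assumptions made only for the example.

```python
# Illustrative sketch of the DataStage extract-and-filter logic in Python.
# The DSN values and the CUSTOMERS table/column names are assumptions.
import ibm_db_dbi   # Db2 driver exposing a DB-API connection
import pandas as pd

dsn = (
    "DATABASE=BLUDB;"
    "HOSTNAME=<db2-host>;"
    "PORT=50000;"
    "PROTOCOL=TCPIP;"
    "UID=<user>;"
    "PWD=<password>;"
)
conn = ibm_db_dbi.connect(dsn, "", "")

# Extract the customer rows from the DB2 source system
customers = pd.read_sql("SELECT * FROM CUSTOMERS", conn)

# Filter on customer columns, mirroring the filter applied in the flow
filtered = customers[customers["COUNTRY"] == "US"]

# Deliver the result as a local data asset (here, a CSV file)
filtered.to_csv("customers_filtered.csv", index=False)
```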

 Prerequisites:

  • Signed up for Cloud Pak for Data as a Service
  • Added Data Stage Service Instance
  • Also added Watson Knowledge Catalog and Cloud Object Storage services

Below are the tasks performed in DataStage for multicloud data integration (an illustrative pandas sketch of the join, derived-column, and lookup logic follows the list):

  1. Created a sample project and associated it with a Cloud Object Storage instance
  2. Ran an existing DataStage flow that joins the two customer application data sets and creates a CSV file in the project
  3. Edited the DataStage flow, changed the Join node settings, and selected the Email Address column as the key
  4. Added a PostgreSQL database to the DataStage flow to get more customer-related information
  5. Added another Join stage to join the filtered application data
  6. Added a Transformer stage that creates a new column by summing two customer amount columns
  7. Added a MongoDB database to get more information related to the customer
  8. Added a Lookup stage and specified the range to get customer information
  9. Ran the DataStage flow to create the final customer output file
  10. Created a catalog so data engineers and analysts can access the relevant customer data
  11. Viewed the output file in the project and published it to the catalog
  12. In the project's Assets tab, you can now view the data
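As a mental model of what this flow does, the sketch below reproduces the join on the email address key, the derived total-amount column, and a range lookup using pandas. The file names, column names, and the risk-band lookup table are assumptions for illustration only, not the actual data sets used in the project.

```python
# Illustrative pandas sketch of the DataStage flow: join on email address,
# derive a total-amount column, then look up a value by range.
# All file and column names here are assumptions for the example.
import pandas as pd

apps_a = pd.read_csv("applications_a.csv")
apps_b = pd.read_csv("applications_b.csv")

# Join stage: key changed to the email address column
joined = apps_a.merge(apps_b, on="EMAIL_ADDRESS", how="inner")

# Transformer-style derivation: new column summing two amount columns
joined["TOTAL_AMOUNT"] = joined["LOAN_AMOUNT"] + joined["CREDIT_AMOUNT"]

# Lookup stage with a range condition, e.g. map the total amount to a band
bands = pd.DataFrame({
    "LOW": [0, 10_000, 50_000],
    "HIGH": [10_000, 50_000, 1_000_000],
    "RISK_BAND": ["LOW", "MEDIUM", "HIGH"],
})

def band(amount):
    row = bands[(bands["LOW"] <= amount) & (amount < bands["HIGH"])]
    return row["RISK_BAND"].iloc[0] if not row.empty else None

joined["RISK_BAND"] = joined["TOTAL_AMOUNT"].apply(band)
joined.to_csv("customer_output.csv", index=False)
```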

 

DataStage AVI (Address Verification Interface):

IBM QualityStage's Address Verification Interface (AVI) provides comprehensive address parsing, standardization, validation, geocoding, and reverse geocoding, available in selected packages, against reference files for over 245 countries and territories.

AVI's focus is to help solve challenges with location data across the enterprise, specifically addresses, geocodes, and reverse-geocode data attributes. Data quality and MDM have never been more critical as a foundation for any digital-minded business focused on cost and operational efficiency.

Quality addresses matter for avoiding negative customer experiences, preventing fraud, reducing the cost of undelivered and returned mail, and maintaining key customer demographic data attributes.

 

[Figure: AVI address quality diagram. Source: IBM Documentation]

Prerequisites:

  • Signed up for Cloud Pak for Data as a Service
  • Added Data Stage Service Instance

Below are the tasks performed with the DataStage AVI feature:

  1. Created an analytics project in IBM Cloud Pak for Data
  2. Added a connection to the project -> selected DB2 and provided all database and host details
  3. Added a DataStage flow to the project. The three primary categories below appear:
    1. Connectors (source and target access points)
    2. Stages (data aggregation, transformation, table lookup, etc.)
    3. Quality (data standardization and address verification)
  4. Added and configured connectors and stages in the DataStage flow:
    1. Added a source connector from the asset browser and selected the address data as input
    2. Added Address Verification from the Quality menu
    3. Added a Sequential File stage to generate the .csv output
    4. Connected the above three stages from left to right
    5. Provided the required details and inputs for Address Line 1 and Address Line 2
  5. Compiled and executed the AVI DataStage flow
  6. Go to Project -> Data Assets, and you will see that a .csv file has been created
  7. Open the .csv file and review the columns. Here you will see additional columns added by the address verification process
  8. Review the accuracy code string to distinguish verified from unverified addresses (a short pandas sketch for this review follows the list)
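One quick way to do that review is to load the generated CSV and summarize the accuracy code column. In the sketch below, the Accuracy_Code column name and the reading of its first character (for example, V for verified) are assumptions based on typical AVI output, so check them against the column names in your own file.

```python
# Quick review of AVI output: count addresses by the leading verification
# flag in the accuracy code string. Column name and code meanings are
# assumptions -- verify against your generated output file.
import pandas as pd

avi_output = pd.read_csv("avi_output.csv")

# First character of the accuracy code typically indicates verification level
avi_output["VERIFICATION_FLAG"] = avi_output["Accuracy_Code"].str[0]
print(avi_output["VERIFICATION_FLAG"].value_counts())

# Inspect a few rows that were not fully verified, for manual follow-up
print(avi_output[avi_output["VERIFICATION_FLAG"] != "V"].head())
```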

 

Watson Knowledge Catalog:

IBM Watson Knowledge Catalog on Cloud Pak for Data powers intelligent, self-service discovery of data, models, and more, activating them for artificial intelligence, machine learning, and deep learning. With WKC, users can access, curate, and share data, knowledge assets, and their relationships, wherever they reside.

The following WKC features were exercised and tested:

  • Data governance processes: role assignment, access control, business terms, and classifications
  • Creating a centralized data catalog for self-service access
  • Creating workflows to manage the business processes
  • Mapping business value to technical assets

[Figure: Data governance diagram. Source: IBM Documentation]

Prerequisites:

  • Signed-up for Cloud Pak for Data as an Admin

Below are the Tasks performed on Watson Knowledge Catalog:

  1. Clicked Administration -> Access Control -> Created a new user group
  2. Added users under the new user group:
    1. Quality Analyst
    2. Data Steward
  3. Assigned predefined roles – Administrator, Data Quality Analyst, Data Steward, and Report Administrator
  4. Go to Governance -> Categories -> Customer Information -> Customer Demographics subcategory to view the governance artifacts
  5. Here you can explore governance artifacts such as Address, Age, Date of Birth, Gender, etc.
  6. Go to Governance -> Business Terms -> Account Number. Here you can view details of the term such as description, primary category, secondary category, relationships, synonyms, classifications, tags, etc.
  7. Go to Governance -> Classifications -> Confidential classification. Here you can view details such as description, primary category, secondary category, parent/dependent classifications, tags, etc.
  8. Go to Administration -> Workflows -> Governance artifact management -> Template file. You will find different approval templates here, including publish and review steps.
  9. Selected automatic publishing and provided conditions (create, update, delete, import)
  10. Saved and activated it
  11. There is more you can do in WKC, such as (a hedged REST sketch for creating a business term programmatically follows this list):
    1. Creating governance artifacts for reference data to follow certain standards and procedures
    2. Creating policies and governance rules
    3. Creating business terms
    4. Creating reference data sets and hierarchies
    5. Creating data classes, for example for data fields or columns
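Governance artifacts such as business terms can also be created programmatically. The sketch below shows the general shape of such a call using Python's requests library; the /v3/glossary_terms endpoint path, the payload fields, and the token handling are assumptions drawn from the Watson Data API and should be verified against the documentation for your Cloud Pak for Data version.

```python
# Hedged sketch: creating a business term through a REST call.
# The endpoint path and payload fields are assumptions -- verify against
# the Watson Data API documentation for your Cloud Pak for Data version.
import requests

CPD_URL = "https://<your-cpd-host>"   # assumption: your cluster or service URL
TOKEN = "<bearer-token>"              # assumption: IAM or CPD access token

# Terms are typically created as drafts and then go through the
# publish/review workflow configured above.
payload = [{
    "name": "Account Number",
    "short_description": "Unique identifier of a customer account",
}]

resp = requests.post(
    f"{CPD_URL}/v3/glossary_terms",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json=payload,
    verify=False,  # self-signed certificates are common on trial clusters
)
resp.raise_for_status()
print(resp.json())
```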

 

Watson Knowledge Catalog – Data Privacy:

Here is what I learned:

  • How to prepare trusted data with the Data Governance and Privacy use case of the data fabric
  • How to create trusted data assets by enriching them with data quality analysis
  • How data consumers can easily find high-quality, protected data assets via a self-service catalog

Prerequisites:

  • Signed up for Cloud Pak for Data as a Service with the Watson Knowledge Catalog service

Below are the Tasks performed on Watson Knowledge Catalog:

  1. As a data steward, created a catalog from the Catalog menu with data policy enforcement enabled
  2. Created categories by going to Governance -> Categories. These contain the business terms that we import later.
  3. Went to Governance -> Business Terms and imported the .csv file
  4. Published the business terms
  5. Imported data into a project by going to Projects -> Data Governance and Privacy project -> Assets -> New asset -> Metadata import -> Next -> selected the project, scope, and connection
  6. Selected the Data Fabric Trial DB2 Warehouse connection so the data can be imported and viewed as a table
  7. Enriched the imported data by selecting Metadata Enrichment from the Assets tab. You can profile the data, analyze its quality, and assign terms, which helps end users find the data faster (a small pandas sketch of this kind of column profiling follows the list)
  8. Viewed the enriched metadata
  9. Published the enriched data assets to a data catalog
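Metadata enrichment does this profiling for you inside WKC. As a rough mental model of what the profiling step computes, here is a small pandas sketch that derives similar column statistics from a local file; the input file name is an assumption for the example.

```python
# Rough illustration of the column profiling that metadata enrichment
# performs: inferred type, null counts, and distinct counts per column.
# The input file name is an assumption for the example.
import pandas as pd

df = pd.read_csv("customers.csv")

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),          # inferred data type
    "nulls": df.isna().sum(),                # missing value count
    "null_pct": (df.isna().mean() * 100).round(2),
    "distinct": df.nunique(),                # distinct value count
})
print(profile)
```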

Conclusion: IBM Cloud Pak for Data is a robust Cloud Data, Analytics, and AI platform that provides a cost-effective, powerful MultiCloud Data Integration and Data Governance solution.


IBM Cloud Pak for Data - Data Science MLOps:

I have been learning, exploring, and working with IBM MLOps for data science, and I want to share my learning and experience here with IBM's cloud services and how they are integrated under one umbrella named IBM Cloud Pak for Data.

First, let’s understand what IBM Cloud Pak for Data is.

IBM Cloud Pak for Data is a cloud-native solution that enables you to put your data to work quickly and efficiently.

Your enterprise has lots of data. You need to use your data to generate meaningful insights that can help you avoid problems and reach your goals.

But your data is useless if you can’t trust it or access it. Cloud Pak for Data lets you do both by enabling you to connect to your data, govern it, find it, and use it for analysis. Cloud Pak for Data also enables all of your data users to collaborate from a single, unified interface that supports many services that are designed to work together.

Cloud Pak for Data fosters productivity by enabling users to find existing data or to request access to data. With modern tools that facilitate analytics and remove barriers to collaboration, users can spend less time finding data and more time using it effectively.

And with Cloud Pak for Data, your IT department doesn’t need to deploy multiple applications on disparate systems and then try to figure out how to get them to connect.

Data Science MLOps POC:

Before you begin an IBM data science MLOps POC, you need to have some prerequisites in place:

  1. You need a Cloud Pak for Data as a Service (CPDaaS) account (https://dataplatform.cloud.ibm.com) as well as an IBM Cloud account.
  2. If you don't have a CPDaaS account, you can sign up for a free trial account; the same is true for the IBM Cloud account.
  3. Provision all required services from your IBM Cloud account (https://cloud.ibm.com/login), such as Watson Studio, Watson Knowledge Catalog, Watson Machine Learning, Watson OpenScale, and the DB2 service.

The data science MLOps POC focused on the main capabilities and strengths of Watson Studio and related products. The three main themes of the POC were:

  • MLOps: End-to-End data science asset lifecycle
  • Low code data science: developing data science assets in visual tools
  • Trusted AI: an extension of MLOps with a focus on data/model governance and model monitoring

[Figure: IBM MLOps flow diagram. Source: IBM Documentation]

I learned about the MLOps phases and how to approach a data science POC:

  • Discovery – Identify the data, set up the data connection, and load the data. Build the data transformation and virtualization process.
  • Ingestion and Preparation – Ingest the data, validate it post-ingestion, and pre-process it.
  • Development – Develop the model and automate it. Version-control any code changes using Git. Store and maintain the model repository.
  • Deployment – Deploy the model, either manually or in an automated way. Score the model and manage the artifacts. Have change management controls.
  • Monitoring – Set up model monitoring and alerts.
  • Governance – Set up an end-to-end approval management process.

 

To build a data science POC, we perform the following activities/tasks:

 Data Access: This covers Discovery, Ingestion, and Preparation

  1. In the CPD (IBM Cloud Pak for Data) cluster, created an analytics project
  2. Added data as an asset to the project and uploaded a sample customer data file in the Assets tab
  3. Added a DB2 on Cloud connection to the project's data assets with all required database, hostname, and port details
  4. From Add to project -> Connected data, selected the source, schema, and table names; the Customer table was then created and displayed under the Data Assets tab
  5. From Add to project -> Notebook, created a notebook with a Python 3.9 and Spark 3.0 environment. This environment helps with data import and is mainly for code generation and importing data via pandas and Spark data frames
  6. In the notebook, you can select Insert to code for your data asset (the Customer table), execute it, and load the data into a data frame (a hedged sketch of what this generated code typically looks like follows the list)
  7. Similarly, you can write data to a database using notebook code
  8. You can also create a connection to IBM Cloud Object Storage and load the data
  9. Added the data assets created above to a catalog by creating a new catalog
  10. Promoted data assets to a deployment space
  11. Worked on a storage volume, where you can access files from a shared file system such as NFS
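When the data asset lives in IBM Cloud Object Storage, the generated Insert to code cell looks broadly like the sketch below. The credential values, endpoint, bucket, and object key are placeholders; the real cell is generated with your project's own values, so treat this as an assumption-laden illustration rather than the exact generated code.

```python
# Sketch of loading a project data asset from IBM Cloud Object Storage into
# a pandas DataFrame, similar to what "Insert to code" generates.
# All credential values, the endpoint, bucket, and object key are assumptions.
import ibm_boto3
import pandas as pd
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="<api-key>",
    ibm_service_instance_id="<cos-instance-crn>",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
)

# Read the object body straight into pandas
obj = cos.get_object(Bucket="<project-bucket>", Key="customer.csv")
df = pd.read_csv(obj["Body"])
print(df.head())
```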

 

Watson Studio - Open Source and Git with cpdctl for an automated CI/CD deployment process: This covers Development and Deployment

  1. Performed Watson Studio-Git integration by creating a new project in IBM Watson Studio. This is required for building and updating scripts using Python and JupyterLab
  2. Git can be used here for MLOps CI/CD and can be integrated with Jenkins/Travis
  3. In JupyterLab, created two notebooks with sample code and set up two user IDs in CPD, one as Editor and the other as Collaborator
  4. User 1 committed the code to the Git repo and pushed the changes
  5. The other user ID (the collaborator) pulled the changes, made some code modifications, and committed the changes to the Git repo
  6. User 1 pulled the changes and saw the updated/modified code
  7. Worked with cpdctl (the Cloud Pak for Data command-line interface) and moved JupyterLab notebook scripts to the project and to the deployment space. With cpdctl you can automate an end-to-end flow that includes training a model, saving it, creating a deployment space, and deploying the model
  8. Performed package management – installed libraries such as Anaconda packages in the notebook for quick testing
  9. Jobs in Watson Studio – created a job for the developed notebook and invoked the job
  10. Data science deployment (models, scripts, functions)
  11. There are mainly two types of deployment: batch and online.
    • Online: a real-time request/response deployment option. When this deployment option is used, models or functions are invoked with a REST API. A single row or multiple rows of data can be passed in with the REST request.
    • Batch: a deployment option that reads from and writes to a static data source. A batch deployment can be invoked with a REST API.

    a. In CPD, created a new deployment space (online), selected the Customer Data Predict notebook, executed it, and saved the model using the WML client object.

    b. From the project's Assets view, located the model and promoted it by selecting the deployment space.

    c. From the deployment space, selected the model and deployed it by clicking the Deploy button.

    d. Similarly, created a deployment space for batch, created a job, selected the customer data CSV as the source, and executed it.

    e. This is how the model is automatically deployed to a deployment space (a hedged Python sketch of the same store/deploy/score flow with the Watson Machine Learning client follows).
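The store/promote/deploy/score sequence described above can also be scripted with the Watson Machine Learning Python client. The sketch below trains a tiny placeholder model so it is self-contained; the API key, deployment space ID, software specification name, model type string, and scoring payload fields are assumptions for the example and should be replaced with the values from your own project.

```python
# Hedged sketch: store, deploy, and score a model with the WML Python client.
# The API key, space ID, software spec name, model type, and payload fields
# are assumptions -- substitute the values from your own project.
import numpy as np
from sklearn.linear_model import LogisticRegression
from ibm_watson_machine_learning import APIClient

# Placeholder model so the sketch is self-contained (not the POC model)
trained_model = LogisticRegression().fit(
    np.array([[25, 30000], [60, 90000]]), [0, 1]
)

wml_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": "<ibm-cloud-api-key>",
}
client = APIClient(wml_credentials)
client.set.default_space("<deployment-space-id>")

# Software specification name is an assumption; list available specs with
# client.software_specifications.list()
sw_spec_uid = client.software_specifications.get_uid_by_name("runtime-22.1-py3.9")

model_details = client.repository.store_model(
    model=trained_model,
    meta_props={
        client.repository.ModelMetaNames.NAME: "customer-predict",
        client.repository.ModelMetaNames.TYPE: "scikit-learn_1.0",
        client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: sw_spec_uid,
    },
)
model_id = client.repository.get_model_id(model_details)

# Create an online (real-time REST) deployment for the stored model
deployment = client.deployments.create(
    model_id,
    meta_props={
        client.deployments.ConfigurationMetaNames.NAME: "customer-predict-online",
        client.deployments.ConfigurationMetaNames.ONLINE: {},
    },
)
deployment_id = client.deployments.get_id(deployment)

# Online scoring: one or more rows passed in the REST-style payload
payload = {"input_data": [{"fields": ["AGE", "INCOME"], "values": [[42, 55000]]}]}
print(client.deployments.score(deployment_id, payload))
```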

Monitoring and Governance – IBM Watson OpenScale is used for monitoring the model in terms of fairness, quality, drift, and other details.
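Monitoring can also be inspected programmatically. A minimal sketch with the ibm-watson-openscale Python SDK might look like the following; the API key is a placeholder and the specific client calls are assumptions to be verified against the SDK documentation for your version.

```python
# Hedged sketch: connecting to Watson OpenScale and listing what is monitored.
# The API key is a placeholder; the client calls are assumptions to verify
# against the ibm-watson-openscale SDK documentation.
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient

authenticator = IAMAuthenticator(apikey="<ibm-cloud-api-key>")
wos_client = APIClient(authenticator=authenticator)

# List the monitored deployments (subscriptions) and their monitor instances,
# which include the fairness, quality, and drift monitors.
wos_client.subscriptions.show()
wos_client.monitor_instances.show()
```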

Conclusion: IBM Cloud Pak for Data is a powerful cloud data, analytics, and AI platform that provides end users with quick, governed data access, increased productivity, and cost savings.

Note: Some of the diagrams and details are taken from IBM (ibm.com/docs and other reference materials).

If you are interested in exploring and learning about IBM Cloud Pak for Data and its services, please go through the tutorials below:

IBM Cloud Pak for Data

Announcing hands-on tutorials for the IBM data fabric use cases
