ETL Articles / Blogs / Perficient (https://blogs.perficient.com/tag/etl/)

Azure SQL Server Performance Check Automation (April 11, 2024)
https://blogs.perficient.com/2024/04/11/azure-sql-server-performance-check-automation/

On operational projects that involve heavy daily data processing, there is a need to monitor DB performance. Over a period of time, the workload grows and can cause potential issues. While there are best practices to handle the processing by adopting DBA strategies (indexing, partitioning, collecting statistics, reorganizing tables/indexes, purging data, allocating bandwidth separately for ETL/DWH users, peak-time optimization, effective query rewrites by developers, etc.), it is still necessary to be aware of DB performance and monitor it consistently so further action can be taken.

If admin access is not available to validate performance on Azure directly, building automation can help monitor the database and prompt the necessary steps before the DB runs into performance issues or failures.

For DB performance monitoring, an Informatica IICS job can be created with a Data Task that queries the DB (SQL Server) metadata tables to check performance, and emails can be triggered once the monitored metrics (CPU and IO percent) exceed the threshold percentage (e.g., 80%).

The IICS mapping design is shown below (scheduled to run hourly). Email alerts contain the metric percentage values.

[Image: IICS mapping design for the SQL Server performance check automation]

Note: Email alerts are triggered only if the threshold limit is exceeded.

                                             

IICS ETL Design:

[Image: IICS ETL design for the SQL Server performance check automation]

IICS ETL Code Details : 

 

  1. A Data Task is used to get the SQL Server performance metrics (CPU and IO percent).

[Image: SQL Server performance check query]

A query checks whether the utilization exceeds 80%. If the utilization exceeds the threshold limit (the user can set this to a specific value such as 80%), an email alert is sent. A sketch of such a query is shown below.
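As a minimal sketch (assuming an Azure SQL Database target, where the sys.dm_db_resource_stats DMV is available; the 80% threshold and column aliases are illustrative), the check behind this Data Task could look like the following:

-- Recent CPU / IO utilization for the database; the DMV keeps roughly one hour of
-- history at 15-second intervals. A row is returned only when the threshold is crossed.
SELECT MAX(avg_cpu_percent)       AS max_cpu_percent,
       MAX(avg_data_io_percent)   AS max_data_io_percent,
       MAX(avg_log_write_percent) AS max_log_write_percent
FROM   sys.dm_db_resource_stats
WHERE  end_time >= DATEADD(MINUTE, -60, GETUTCDATE())
HAVING MAX(avg_cpu_percent) > 80
    OR MAX(avg_data_io_percent) > 80;

Any returned row can then be written to Azure_SQL_Server_Performance_Info.dat so the Decision task can trigger the email alert when the file is not empty.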

                                                            

[Image: SQL Server performance check query, threshold condition]

If Azure_SQL_Server_Performance_Info.dat contains data (it is populated when CPU/IO utilization exceeds 80%), the Decision task is activated and an email alert is triggered.

[Image: SQL Server performance check result output]

Email Alert:

[Image: sample SQL Server performance email alert]

Step by step guide to secure JDBC SSL connection with Postgres in AWS Glue (April 6, 2024)
https://blogs.perficient.com/2024/04/05/step-by-step-guide-to-secure-jdbc-ssl-connection-with-postgre-in-aws-glue/

Have you ever tried connecting a database to AWS Glue using a JDBC SSL encryption connection? It can be quite a puzzle. A few months ago, I faced this exact challenge. I thought it would be easy, but  I was wrong! When I searched for help online, I couldn’t find much useful guidance. So, I rolled up my sleeves and experimented until I finally figured it out.

Now, I am sharing my learnings with you. In this blog, I’ll break down the steps in a clear, easy-to-follow way. By the end, you’ll know exactly how to connect your database to AWS Glue with SSL encryption. Let’s make this complex task a little simpler together.

Before moving ahead, let's briefly discuss how SSL encryption works:

  1. The client sends a connection request (Client Hello).
  2. The server responds and chooses the encryption parameters (Server Hello).
  3. The client verifies the server's identity using the server's certificate and the root certificate.
  4. A key exchange establishes a shared encryption key.
  5. Encrypted data is exchanged securely.
  6. The client may also authenticate itself with its own certificate before the encrypted data exchange.
  7. The connection terminates when the session ends or times out.

[Image: SSL encryption handshake overview]

Now that you have a basic understanding of the SSL encryption process, let us discuss how to configure AWS Glue for it. Before we start the configuration, we need to convert the following into DER format, which is the format suitable for AWS Glue:

1) Client Certificate

2) Root Certificate

3) Certificate Key (to PKCS#8 DER)

DER (Distinguished Encoding Rules) is a binary encoding format used in cryptographic protocols like SSL/TLS to represent and exchange data structures defined by ASN.1. It ensures unambiguous and minimal-size encoding of cryptographic data such as certificates.

Here’s how you can do it for each component:

1. Client Certificate (PEM):

This certificate is used by the client (in this case, AWS Glue) to authenticate itself to the server (e.g., another Database) during the SSL handshake. It includes the public key of the client and is usually signed by a trusted Certificate Authority (CA) or an intermediate CA.

If your client certificate is not already in DER format, you can convert it using the OpenSSL command-line tool:

openssl x509 -in client_certificate.pem -outform der -out client_certificate.der

Replace client_certificate.pem with the filename of your client certificate in PEM format, and client_certificate.der with the desired filename for the converted DER-encoded client certificate.

 

2. Root Certificate (PEM):

The root certificate belongs to the Certificate Authority (CA) that signed the server's certificate (in this case, the PostgreSQL database). It's used by the client to verify the authenticity of the server's certificate during the SSL handshake.

Convert the root certificate to DER format using the following command:

openssl x509 -in root_certificate.pem -outform der -out root_certificate.der

Replace root_certificate.pem with the filename of your root certificate in PEM format, and root_certificate.der with the desired filename for the converted DER-encoded root certificate.

 

3. Certificate Key (PKCS#8 DER):

This is the private key corresponding to the client certificate. It’s used to prove the ownership of the client certificate during the SSL handshake.

Convert the certificate key to PKCS#8 DER format using the OpenSSL command-line tool:

openssl pkcs8 -topk8 -inform PEM -outform DER -in certificate_key.pem -out certificate_key.pk8 -nocrypt

Replace certificate_key.pem with the filename of your certificate key in PEM format, and certificate_key.pk8 with the desired filename for the converted PKCS#8 DER-encoded certificate key.

 

Store the above certificates and key in an S3 bucket. We will need these files while configuring the AWS Glue job.

 

[Image: converted certificate files stored in the S3 bucket]

 

To connect AWS Glue to a PostgreSQL database over SSL using PySpark, you’ll need to provide the necessary SSL certificates and configure the connection properly. Here’s an example PySpark script demonstrating how to achieve this:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql import SparkSession

# Initialize Spark and Glue contexts
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Define PostgreSQL connection properties
jdbc_url = "jdbc:postgresql://your_postgresql_host:5432/your_database"
connection_properties = {
    "user": "your_username",
    "password": "your_password",
    "ssl": "true",
    "sslmode": "verify-ca",  # SSL mode: verify-ca or verify-full
    "sslrootcert": "s3://etl-test-bucket1/root_certificate.der",  # S3 Path to root certificate
    "sslcert": "s3://etl-test-bucket1/client_certificate.der",     # S3 Path to client certificate
    "sslkey": "s3://etl-test-bucket1/certificate_key.pk8"         # S3 Path to client certificate key
}
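# Note (assumption): when the certificate files above are supplied through the Glue job's
# "Referenced files path", they are downloaded to the job's working directory and the
# PostgreSQL JDBC driver reads them from the local filesystem; in that case, plain file
# names (e.g., "root_certificate.der") may be needed instead of s3:// URIs.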

# Load data from PostgreSQL table
dataframe = spark.read.jdbc(url=jdbc_url, table="your_table_name", properties=connection_properties)

# Perform data processing or analysis
# For example:
dataframe.show()

# Stop Spark session
spark.stop()

 

Now, inside your Glue job, click on the Job details page and scroll down until you see the Dependent JARs path and Referenced files path options. Under Dependent JARs path, put the S3 location where you stored the PostgreSQL JDBC driver JAR file, and in Referenced files path add the S3 paths of the converted client certificate, root certificate, and key, separated by commas.

[Image: AWS Glue job details showing the Dependent JARs path and Referenced files path]

 

Now click on the Save option, and you are ready to go.

 

This concludes the steps to configure a secure JDBC connection with a database in AWS Glue. To summarize, in this blog we:

1) Explained how SSL encryption can be used for secure data exchange between AWS Glue and your database (here, PostgreSQL)

2) Covered the steps to configure SSL encryption in AWS Glue to secure the JDBC connection with a database

 

You can read my other blogs here.

Read more about AWS Glue.

 

 

 

 

 

 

Navigating Snaplogic Integration: A Beginner's Guide (March 5, 2024)
https://blogs.perficient.com/2024/03/05/navigating-snaplogic-integration-a-beginners-guide/

As businesses rapidly go digital, the need to develop scalable and reliable functionality to connect applications, cloud environments, and on-premises assets has grown. To resolve these complex scenarios, iPaaS is a strong solution.

For example, if a developer needs to connect and transfer huge data from an e-commerce platform to a CRM system, writing custom code to handle data transfer would be tedious. Instead, the developer can simply consume APIs deployed to iPaaS, significantly reducing development time and effort.

But What Exactly is iPaaS?

Integration Platform as a Service (iPaaS) is a cloud-based solution that makes integrating different applications, data sources and systems easier. It typically provides built-in connectors, reusable components, and tools for designing, executing, and monitoring integrations. This helps businesses enhance operational efficiency, reduce manual efforts, and quickly adapt to changing technology landscapes.

Today, we will talk about one of the iPaaS solutions that stands as a Visionary in Gartner's 2023 Magic Quadrant: SnapLogic.


What is SnapLogic?

SnapLogic is an iPaaS (Integration Platform as a Service) tool that allows organizations to connect various applications, data sources, and APIs to facilitate data integration, automation, and workflows.

It provides a visual interface for designing integration pipelines, making it easier for both technical and non-technical users to create and manage data integrations. SnapLogic supports hybrid cloud and on-premises deployments and is used for tasks such as data migration, ETL (Extract, Transform, and Load) processes, and application integration.

Getting Started with the Basics of SnapLogic

To kick-start your journey, spend 5-10 minutes on setup. Here are the steps to quickly set up your training environment.

  1. Sign Up for SnapLogic: You must sign up for an account. For training and better hands-on experience, SnapLogic provides a training account for 4 weeks. You can start with the training account to explore its features. Here is the link to get the training account: SnapLogic User Login.
  2. Access SnapLogic designer: SnapLogic designer is the heart of its integration capabilities. Once you have signed up, you can access it from your account.
  3. Course suitable for beginners: Click this link to enroll in the "SnapLogic Certified Enterprise Automation Professional" entry-level course to quickly get up to speed on SnapLogic.

Features of SnapLogic

SnapLogic is an integration platform that makes connecting different data sources and applications easier. Some key features include:

  1. Multi-cloud Integration: Supports integration across various cloud platforms.
  2. Low-Code Approach: Reduces the requirement for advanced coding knowledge.
  3. API Management: Helps manage APIs and create custom APIs between different applications.
  4. Real-time Integration: Supports real-time data integration.

Overview of Use Case

Done with sign-up and setup! Lessons that are theoretical are never easy to learn until you continue to do hands-on in parallel. Let’s look at a practical use case to simplify learning.

The customer must automatically insert the employee records from the Excel file in a shared directory to the salesforce CRM end system.

How Can We Achieve This Using SnapLogic?

SnapLogic provides pre-built snaps, such as file reader, CSV parser, mapper, salesforce create, and many more.

For achieving the below use case, we need to add the File Reader Snap to fetch the csv file, to parse the data use the CSV Parser, Mapper Snap to transform the data, and lastly, Salesforce Create to insert the data into it.

Creating the pipeline

  1. Upload your CSV file to the SnapLogic file system, as we need to read the CSV file.
  2. Create a pipeline; this is the first step in building an integration. Click the "+" sign at the top of the middle canvas, then fill in the pipeline name and parent project and click "Save".
  3. Add and configure the File Reader Snap: in the File field, select the file you uploaded in step 1. Because you are accessing the SnapLogic file system, no authentication information is needed.
  4. Add a CSV Parser Snap; you will use the default configuration.
  5. Add the Mapper Snap: it transforms the incoming data using the specified mappings and produces new output data.
  6. Add the Salesforce Create Snap: it creates the records in a Salesforce Account object using the REST API.
  7. After saving, SnapLogic will automatically validate the changes; you can click on the green document icon to view what your data looks like.
  8. Test the pipeline: after the build is done, we can test the pipeline. To do that, click on the "play" icon in the pipeline menu and wait for the pipeline to finish executing. Notice how the color of the Snaps turns yellow while executing, indicating they are currently running.
  9. Validate the results: once the execution finishes, the pipeline turns dark green. If there is an exception, the failing Snap turns red.
  10. Results: log in to the Salesforce account > Accounts > click on the recently viewed accounts. You will be able to see the records that were fetched from the Employee_Data.csv file.

Conclusion

Congratulations on completing your first SnapLogic integration! In this blog, we went through the basics of iPaaS and SnapLogic. We also went through a practical use case to gain confidence and a better understanding. Our journey in SnapLogic has just started, and we'll be exploring more in the future to expand on the knowledge we accumulated in this article.

Perficient and SnapLogic

At Perficient, we develop scalable and robust integrations within the SnapLogic Platform. With our expertise in SnapLogic, we resolve customers’ complex business problems, which helps them grow their business efficiently.

Contact us today to explore more options for elevating your business.

Data Virtualization with Oracle Enterprise Semantic Models (February 22, 2024)
https://blogs.perficient.com/2024/02/22/data-virtualization-with-oracle-enterprise-semantic-models/

A common symptom of organizations operating at suboptimal performance is when there is a prevalent challenge of dealing with data fragmentation. The fact that enterprise data is siloed within disparate business and operational systems is not the crux to resolve, since there will always be multiple systems. In fact, businesses must adapt to an ever-growing need for additional data sources. However, with this comes the challenge of mashing up data across systems to provide a holistic view of the business. This is the case for example for a customer 360 view that provides insight into all aspects of customer interactions, no matter where that information comes from, or whether it’s financial, operational or customer experience related. In addition, data movements are complex and costly. Organizations need the agility to adapt quickly to the additional sources, while maintaining a unified business view.

Data Virtualization As a Key Component Of a Data Fabric

That’s where the concept of data virtualization provides an adequate solution. Data stays where it is, but we report on it as if it’s stored together. This concept plays a key role in a data fabric architecture which aims at isolating the complexity of data management and minimizing disruption for data consumers. Besides data-intensive activities such as data storage management and data transformation, a robust data fabric requires a data virtualization layer as a sole interfacing logical layer that integrates all enterprise data across various source applications. While complex data management activities may be decentralized across various cloud and on-premises systems maintained by various teams, the virtual layer provides a centralized metadata layer with well-defined governance and security.

How Does This Relate To a Data Mesh?

What I'm describing here is also compatible with a data mesh approach whereby a central IT team is supplemented with product owners of diverse data assets that relate to various business domains. It's referred to as the hub-and-spoke model, where business domain owners are the spokes, but the data platforms and standards are maintained by a central IT hub team. Again, the data mesh decentralizes data assets across different subject matter experts but centralizes enterprise analytics standards. Typically, a data mesh is applicable for large-scale enterprises with several teams working on different data assets. In this case, an advanced common enterprise semantic layer is needed to support collaboration among the different teams while maintaining segregated ownership. For example, common dimensions are shared across all product owners, allowing them to report on the company's master data such as product hierarchies and organization rollups. But the various product owners are responsible for consuming these common dimensions and providing appropriate linkages within their domain-specific data assets, such as financial transactions or customer support requests.

Oracle Analytics for Data Virtualization

Data Virtualization is achieved with the Oracle Analytics Enterprise Semantic Model. Both the Cloud version, Oracle Analytics Cloud (OAC) and the on-premises version, Oracle Analytics Server (OAS), enable the deployment of the semantic model. The semantic model virtualizes underlying data stores to simplify data access by consumers. In addition, it defines metadata for linkages across the data sources and enterprise standards such as common dimensions, KPIs and attribute/metric definitions. Below is a schematic of how the Oracle semantic model works with its three layers.

Oracle Enterprise Semantic Model

Outcomes of Implementing the Oracle Semantic Model

Whether you have a focused data intelligence initiative or a wide-scale program covering multi-cloud and on-premises data sources, the common semantic model has benefits in all cases, for both business and IT.

  • Enhanced Business Experience

With Oracle data virtualization, business users tap into a single source of truth for their enterprise data. The information available out of the Presentation Layer is trusted and is reported on reliably, no matter what front end reporting tool is used: such as self-service data visualization, dashboards, MS Excel, Machine Learning prediction models, Generative AI, or MS Power BI.

Another value-add for the business is that they can access new data sources quicker and in real-time now that the semantic layer requires no data movement or replication. IT can leverage the semantic model to provide this access to the business quickly and cost-effectively.

  • Future Proof Investment

The three layers that constitute the Oracle semantic model provide an abstraction of source systems from the presentation layer accessible by data consumers. Consequently, as source systems undergo modernization initiatives, such as cloud migrations, upgrades and even replacement with totally new systems, data consuming artifacts, such as dashboards, alerts, and AI models remain unaffected. This is a great way for IT to ensure any analytics investment’s lifespan is prolonged beyond any source system.

  • Enterprise Level Standardization

The semantic model enables IT to enforce governance when it comes to enterprise data shared across several departments and entities within an organization. In addition, very fine-grained object and data levels security configurations are applied to cater for varying levels of access and different types of analytics personas.

Connect with us for consultation on your data intelligence and business analytics initiatives.

3 Key Takeaways from AWS re:Invent 2023 (December 11, 2023)
https://blogs.perficient.com/2023/12/11/three-key-takeaways-from-aws-reinvent-2023/

Now that the dust has settled, the team has had the chance to Re:flect on the events and announcements of AWS re:Invent 2023. Dominating the conversation was the advancement and capabilities of Generative AI across several AWS Services, while not losing sight on the importance of application modernization and cloud migration. Perficient walked away with 3 key takeaways: 1) Amazon Q 2) Serverless Innovation 3) The Zero ETL Future

1. Amazon Q

Generative AI was the talk of the conference, and no topic was discussed more than Amazon Q. The powerful new generative AI assistant can be tailored to your business and used to generate content and solve problems. When leveraged with Amazon Connect, whose new generative AI capabilities are powered through Amazon Bedrock, it can help your agents respond faster by suggesting actions or links to relevant articles. AI is here and it isn't going anywhere, but what might be most important is to ensure it is being used responsibly. "What's exciting here is that the path to responsibly enabling AI for enterprise is starting to light up…" said Steve Holstad, Principal of Cloud. "We know it's going to be an ongoing journey for years to come, but the time for a private pilot leveraging your data, based on your unique use cases, is here." At Perficient, we are at the forefront of the next generation of AI and ML. We're excited about the progress we've made and are looking forward to creating innovative solutions with Amazon Q.

Read Zachary Fischer’s, Senior Solutions Architect, blog about exploring the potential of Amazon Q and Perficient Handshake.

2. Serverless Innovation

Serverless computing isn’t new to AWS, as their wide variety of serverless data offerings have been helping customers take advantage of automated methods of setting up infrastructure, real time scaling, and dynamic pricing. Three new AWS serverless innovations for Amazon Aurora, Amazon Redshift, and Amazon ElastiCache build on the work AWS has already been doing for some time.

  1. Amazon Aurora Limitless Database: A new feature supporting automated horizontal scaling to process millions of transactions at a speed unlike any before and manage an excessive amount of data in a single Aurora database.
  2. Amazon Redshift Serverless: Gather insights in seconds without having to manage data warehouse infrastructure. Leverage its self-service analytics and autoscaling capabilities to better make sense of your data.
  3. Amazon ElastiCache Serverless: An innovative serverless solution enabling users to create a cache within a minute and dynamically adjust capacity in real-time according to application traffic trends.

Learn more by reading Shishir Meshram’s, Senior Technical Consultant, blog about Perficient’s ability to help achieve a serverless infrastructure.

3. The Zero ETL Future

Historically, to connect all your data sources to find new insights, you'd need to "extract, transform, and load" (ETL) information in a tedious manual effort. AWS announced several new integrations as part of their continued commitment to a "zero-ETL future," so users can access data when and where they need it. In his keynote presentation, Dr. Swami Sivasubramanian, Vice President of Data and AI at AWS, said, "In addition to having the right tool for the job, customers need to be able to integrate the data that is spread across their organizations to unlock more value for their business and innovate faster. That is why we are investing in a zero-ETL future, where data integration is no longer a tedious, manual effort, and customers can easily get their data where they need it."

Learn more about these integrations, and find out how like AWS, you can work your way toward a “zero ETL future.”

This was just the tip of the iceberg of what was discussed at AWS re:Invent and Perficient is excited to be in the thick of it! Join us on this journey of discovery. Let’s see what we can build together.

SQL Server Space Monitoring (November 29, 2023)
https://blogs.perficient.com/2023/11/28/sql-server-space-monitoring/

On operational projects that involve heavy daily data volume loads, there is a need to monitor DB disk space availability. Over a period of time, the database size grows and occupies the disk space. While there are best practices to manage the size, such as purging outdated data and adding buffer/temp/data/log space to address growing needs, it is necessary to be aware of the disk space and monitor it consistently so further action can be taken.

If admin access is not available to validate the available space, building automation can help monitor the space and prompt the necessary steps before the DB causes performance issues or failures.

Regarding DB space monitoring, an Informatica IICS job can be created with a Data Task that executes a query against the DB (SQL Server) metadata tables to check the available space, and emails can be triggered once free space goes below the threshold percentage (e.g., 20%).

The IICS mapping design is shown below (scheduled to run daily). Email alerts contain the metric percentage values.

 


 

Note: Email alerts are triggered only if the threshold limit is exceeded.

 

IICS ETL Code Details :

 

  1. A Data Task is used to get the used space of the SQL Server log and data files.


 

A query checks whether the used space exceeds 80%. If the used space exceeds the threshold limit (the user can set this to a specific value such as 80%), an email alert is sent. A sketch of such a query is shown below.
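As a minimal sketch (run in the context of the monitored database; the 80% threshold and column aliases are illustrative), the used space of the data and log files can be derived from sys.database_files and FILEPROPERTY:

-- Used vs. allocated space per database file, in MB (size is stored in 8-KB pages);
-- only files whose used space exceeds 80% of the allocated size are returned.
SELECT name                                                        AS file_name,
       size / 128.0                                                AS allocated_mb,
       CAST(FILEPROPERTY(name, 'SpaceUsed') AS int) / 128.0        AS used_mb,
       CAST(FILEPROPERTY(name, 'SpaceUsed') AS float) * 100 / size AS used_percent
FROM   sys.database_files
WHERE  CAST(FILEPROPERTY(name, 'SpaceUsed') AS float) * 100 / size > 80;

Any returned rows can be written to the output file used by the flow, which the Decision task then checks to trigger the email alert.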

 


 

If D:\Out_file.dat has data (it is populated when used space exceeds 80%), the Decision task is activated and an email alert is triggered.

 

 

Windows Folder/Drive Space Monitoring (November 29, 2023)
https://blogs.perficient.com/2023/11/28/windows-folder-drive-space-monitoring/

Often there is a need to monitor the OS disk drive space availability for the drive holding ETL operational files (log, cache, temp, bad files, etc.). Over a period of time, the number of files grows and occupies the disk space. While there are best practices to limit the number of operational files and clear them from the disk on a regular basis (via automation), it is recommended to be aware of the available space.

If admin access is not available to validate the available space, and if the ETL server is on a remote machine, building automation can help monitor the space and prompt the necessary steps before ETL causes performance issues or failures.

Regarding OS folder/drive space monitoring, an Informatica IICS job can be created with a Command Task that executes Windows commands via batch scripts to check the available space, and emails can be triggered once free space goes below the threshold percentage (e.g., 20%).

The IICS taskflow design is shown below (it can be scheduled bi-weekly or monthly according to the requirements). Email alerts contain the free space percentage value.

 


 

 

 

 

 

Note: Email alerts are triggered only if the threshold limit is exceeded.

 

IICS ETL Code Details :

 

  1. A Windows Command Task is used to get the free space of the OS drive/network drive/folder on which the ETL tool is installed and the log files are held.

 


 

 

D:\space_file_TGT.dat Content: (Drive Name, Free space, Overall Space)

D:,11940427776,549736935424

 

D:\Out_file.dat Content: (Drive Name, Free Space [GB], Overall Space [GB], Flag [set to ALERT if free space < 25], Used Space Percent)

D:,11940427776,549736935424,ALERT,98%

  2. An IICS Data Task is used to populate D:\Out_file.dat.

If D:\Out_file.dat has data (it is populated when free space < 25), the Decision task is activated and an email alert is triggered.

Email Alert :

 


 

An Introduction to ETL Testing (August 23, 2023)
https://blogs.perficient.com/2023/08/23/an-introduction-to-etl-testing/

ETL testing is a type of testing technique that requires human participation in order to test the extraction, transformation, and loading of data as it is transferred from source to target according to the given business requirements.

Take a look at the block below, where an ETL tool is being used to transfer data from Source to Target. Data accuracy and data completeness can be tested via ETL testing.


What Is ETL? (Extract, Transform, Load)

Data is loaded from the source system to the data warehouse using the Extract-Transform-Load (ETL) process.

Extraction refers to extracting data from the sources (the sources can be a legacy system, a database, or flat files).

Transformation refers to the cleaning, aggregation, or any other data alterations applied to the data in this step of the process.

Loading refers to loading the transformed data into the target systems, called destinations (the destinations can again be a legacy system, a database, or a flat file).


 

What is ETL testing?

Data is tested via ETL testing before being transferred to live data warehouse systems; it is also referred to as production reconciliation. ETL testing differs from database testing in terms of its scope and the procedures used to conduct the test. When data is loaded from a source to a destination after transformation, ETL testing is done to ensure the data is accurate. The data that moves between the source and the destination is verified at several points throughout the process.


In order to avoid duplicate records and data loss, ETL testing verifies, validates, and qualifies data. Throughout the ETL process, there are several points where data must be verified.

While testing, the tester confirms that the data that was extracted, transformed, and loaded has been extracted completely, transferred properly, and loaded into the new system in the correct format.

ETL testing helps to identify and prevent issues with data quality during the ETL process, such as duplicate data or data loss.

Test Scenarios of ETL Testing:

 1. Mapping Document Validation

Examining the mapping document for accuracy to make sure all the necessary data has been provided. The most crucial document for the ETL tester to design and construct the ETL jobs is the ETL mapping document, which comprises the source, target, and business rules information.

Example:  Consider the following real-world scenario: We receive a source file called “Employee_info” that contains employee information that needs to be put into the target’s EMP_DIM table.

The following table shows the information included in any mapping documents and how mapping documents will look.

Depending on your needs, you can add additional fields.


 2. DDL/Metadata Check

Validate the source and target table structures against the corresponding mapping document. The source data type and target data type should be identical, and the length of the data type in both the source and target should be equal. Verify that the data field types and formats are specified, and also validate the column names in the tables against the mapping document. An example metadata-comparison query is sketched below.
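As a minimal sketch (assuming both tables are visible to one connection and that the INFORMATION_SCHEMA views are available; the table names follow the example that comes next and are illustrative), a metadata comparison could look like this:

-- Columns missing in the target, or whose data type / length differs from the source
SELECT s.column_name,
       s.data_type                AS source_type,
       t.data_type                AS target_type,
       s.character_maximum_length AS source_length,
       t.character_maximum_length AS target_length
FROM  (SELECT * FROM information_schema.columns WHERE table_name = 'company_dtls_1') s
LEFT JOIN (SELECT * FROM information_schema.columns WHERE table_name = 'company_dtls_2') t
       ON t.column_name = s.column_name
WHERE t.column_name IS NULL
   OR s.data_type <> t.data_type
   OR COALESCE(s.character_maximum_length, -1) <> COALESCE(t.character_maximum_length, -1);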

Ex. Check the below table to verify the mentioned point of metadata check.

Source – company_dtls_1

Target – company_dtls_2


 3. Data Completeness Validation

Data completeness validation ensures that all expected data is loaded into the target table. It checks for any rejected records and applies boundary value analysis, compares record counts between the source and target, verifies that data is not truncated in the columns of the target tables, and compares the unique values of key fields between the data loaded into the warehouse and the source data.

Example:

You have a source table with five columns and five rows that contain company-related details, and a target table with the same five columns. After the successful completion of an ETL run, all 5 records of the source table (SQ_company_dtls_1) are loaded into the target table (TGT_company_dtls_2), as shown in the image below. If any error is encountered during ETL execution, its error code is displayed in the session statistics. A simple record-count comparison for this check is sketched below.
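As a minimal sketch (assuming both tables can be queried from the same connection; otherwise, run the counts separately and compare them in the test tool), the record-count comparison could be expressed as:

-- Source and target row counts should match after a full load; a non-zero difference flags a gap
SELECT (SELECT COUNT(*) FROM company_dtls_1) AS source_count,
       (SELECT COUNT(*) FROM company_dtls_2) AS target_count,
       (SELECT COUNT(*) FROM company_dtls_1)
     - (SELECT COUNT(*) FROM company_dtls_2) AS count_difference;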


 4. Constraint Validation

Make sure the key constraints are defined for specific tables as expected (example checks are sketched after this list):

    • Not Null & Null
    • Unique
    • Primary Key & Foreign Key
    • Default value check
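As a minimal sketch (the table and key column names company_dtls_2 and company_id are illustrative), the NOT NULL and uniqueness checks could be written as:

-- Key column should contain no NULL values
SELECT COUNT(*) AS null_key_count
FROM   company_dtls_2
WHERE  company_id IS NULL;

-- Key column should contain no duplicate values
SELECT company_id, COUNT(*) AS occurrences
FROM   company_dtls_2
GROUP  BY company_id
HAVING COUNT(*) > 1;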

5. Data Consistency Check

    • The data type and data length for particular attributes may vary in files or tables though the semantic definition is the same.
    • Validating the misuse of integrity constraints like Foreign Key

6. Data Correctness

    • Data that is misspelled or inaccurately recorded.
    • Null, non-unique, or out-of-range data

 

Why Perform ETL Testing?


Inaccurate data resulting from flaws in the ETL process can lead to data issues in reporting and poor strategic decision-making. According to analyst firm Gartner, bad data costs companies, on average, $14 million annually, with some companies losing as much as $100 million.

An example of the consequences of inaccurate data:

A large fast-food company depends on business intelligence reports to determine how much raw chicken to order every month, by sales region and time of year. If this data is inaccurate, the business may order too much or too little, which could result in wasted product or millions of dollars in lost sales.

When do we need ETL Testing?

Here are a few situations where it is essential to use ETL testing:

  • Following a data integration project.
  • Following a data migration project.
  • When the data has been loaded, during the initial setup of a data warehouse.
  • Following the addition of a new data source to your existing data warehouse.
  • When migrating data for any reason.
  • If there are any suspected problems with how well ETL operations work.
  • When any of the source systems or the target system has suspected data quality problems.

Required Skillset for ETL Tester:

  • Knowledge of BI, DW, DL, ETL, and data visualization process
  • Very good experience in analyzing the data and their SQL queries
  • Knowledge of Python, UNIX scripting
  • Knowledge of cloud technologies like AWS, Azure, Hadoop, Hive, Spark

Roles and responsibilities of ETL Tester:

To protect the data quality of the company, an ETL tester plays a crucial role.

ETL testing makes sure that all validity checks are met and that all transformation rules are strictly followed while transferring data from diverse sources to the central data warehouse. The main role of an ETL tester includes evaluating the data sources, data extraction, the application of transformation logic, and data loading into the destination tables. ETL testing is different from data reconciliation, which is used in database testing to acquire pertinent data for analytics and business intelligence; ETL testing is used by data warehouse systems.

Responsibilities of an ETL tester:

  • Understand the SRS document.
  • Create, design, and execute test cases, test plans, and test harnesses.
  • Test components of ETL data warehouse.
  • Execute backend data-driven test.
  • Identify the problem and provide solutions for potential issues.
  • Approve requirements and design specifications.
  • Test data transfers and flat files.
  • Constructing SQL queries for various scenarios, such as count tests.
  • Inform development teams, stakeholders, and other decision-makers of the testing results.
  • To enhance the ETL testing procedure over time, incorporate new knowledge and best practices.

In general, an ETL tester is the organization’s data quality guardian and ought to participate in all significant debates concerning the data used for business intelligence and other use cases.

Conclusion:

Here we learned what ETL is, what ETL testing is, why we perform ETL testing, when we need ETL testing, what skills are required for an ETL tester, and the roles and responsibilities of an ETL tester.

Happy Reading!

Basic Understanding of Full Load And Incremental Load In ETL (PART 2) (May 15, 2023)
https://blogs.perficient.com/2023/05/15/basic-understanding-of-full-load-and-incremental-load-in-etl-part-2/

In the last blog (Part 1), we discussed full load with the help of an example in SSIS (SQL Server Integration Services).

In this blog, we will discuss the concept of Incremental load with the help of the Talend Open Studio ETL Tool.

Incremental Load:

The ETL Incremental Loading technique is a fractional loading method. It reduces the amount of data that you add or change and that may need to be rectified in the event of any irregularity. Because less data is loaded and reviewed, it also takes less time to validate the data and review changes.

Let's elaborate on this with an example:

Suppose the file is very large, for example 200 to 500 million records. It is not possible to load this amount of data in a feasible time because we often do not have enough time to load the data during the day, so we have to update the data at night, which is limited in terms of hours. Hence there is a great possibility that the entire amount of data cannot be loaded.

In scenarios where the number of actually updated records is very small but the overall data size is very large, we go with the incremental load, or in other words the differential load.

In an incremental load, we figure out which records need to be updated in the destination table and which records in the source file or source table are new and need to be inserted into the destination table. Once this is determined, we simply update or insert into the destination table.

How to Perform Incremental Load in Talend ETL?

Incremental loading with Talend can be done as in any other ETL tool. In your job, you capture the necessary timestamps or sequence values, keep the highest value for the next run, and use that value in a query whose condition is to start reading all rows with a higher value.

Incremental loading is a way to update a data set with new data. It can be done by replacing or adding records in a table or partition of a database.

There are different ways to perform an incremental load in Talend ETL:

1) Incremental Load on New File: This method updates the existing data set with new data from an external file. This is done by importing the new data from the external file and overwriting the existing records.

2) Incremental Load on Existing File: This method updates the existing data set with new data from another source, such as a database table. In this case, records from both sources are merged and updated in one go.

3) The source database may have date time fields that may help us identify which source records got updated. Using the context variable and audit control table features, we can retrieve only the newly inserted or updated records from the source database.

 

Now that you all know what incremental load in ETL is, let's explore it using Talend Open Studio.

Source Table:

We have a source table Product_Details with created_on and modified_on columns. Also, we have  some existing data in the table.


ETL Control Table:

By using the etl_control table, we capture the last time the job was successful. When we have 100 jobs and tables, we don't want to keep this information in different places; it is always good practice to keep one etl_control table, in which we capture the job name, the table name, and the last success timestamp indicating when the table was last loaded.


Target Table:

Product_inc is our target table. In the ETL Control table, we will give a last success date older than the source table and we will give conditions on the basis of the created_on column to insert and update data in the target table Product_inc.


Now we will Explore our Talend job.


First, we will drag and drop a tDBConnection component for our PostgreSQL connection, so we can use this connection multiple times in the job. Then we will import all the tables.

Now we drag the etl_control table as input where we are saving the last success timestamp for a particular job.

Then we will drag and drop the tJavaRow component. With the help of this component, we will set the value for the last success timestamp. We write Java code as below.


To store those values, we will create two context variables: last_success (timestamp) and current_run (timestamp).

  1. last_success will be used to retrieve the data from the source.
  2. current_run will be used to update the etl_control table back when the job was successful.

Now we drag and drop the tPreJob component, which ensures that the steps attached to it are always executed before the sub-job execution.

Next, we add the actual source and target tables to create the sub-job. We also drag the etl_control table in as a tDBRow component to update the etl_control table afterwards.

It is connected to the source table with an OnSubjobOk trigger, so if the job fails for any reason it will not update the etl_control table; in the next run (or the next day's run), the same records will be processed again from the point where they were last processed successfully.

Input Component:

We change the existing query, which selects all the columns' data with no condition.

For incremental load, we provide filter conditions so it will select newly inserted rows and updated values from the last run of the job.

 

"select * from product_details
where created_on >= '" + context.last_success +
"' or modified_on >= '" + context.last_success + "'"


 

 

Output Component:

In the target table component, we change Action on data to "Insert or update". Since this works based on a key value, in the edit schema we mark product_id as the key column of the target table.


 

Control Component:

We will add an update command to update the etl_control table.

"Update etl_control set last_success = '"
+ context.current_run +
"' where job_name = '" + jobName + "'"


This update command will dynamically update the last_success timestamp with the timestamp of the job run time. If we have multiple jobs, then for a particular job we also provide a condition using the global variable jobName to update that particular job's last_success timestamp.

RUN1:

Now save the job and run it. We can see we read one record from the etl_control table and inserted 5 rows in the target table.


In the etl_control table, based on the job name, the last_success timestamp is updated with the job run timestamp.


If we rerun the job without any changes, it will not process any record in the sub-job present in the source table.

RUN2:

Now we will update one of the values in the source table and then run the job again.

It will capture only one record that is updated based on the last successful run time.


 


 

RUN3:

Now, we will insert one new record and update one of the existing values and then run the job again.


We can see two records from which one is a newly inserted record, and one is an updated record.


So, this is how incremental load works: based on the last successful run time at the start of the job, it picks up the inserted or updated records.

Please share your thoughts and suggestions in the space below, and I’ll do my best to respond to all of them as time allows.

For more such blogs click here

Happy Reading!

Informatica PowerCenter Overview: Part 1 (May 15, 2023)
https://blogs.perficient.com/2023/05/15/informatica-powercenter-overview-part-1/

What is ETL?

ETL is a process that extracts the data from different source systems, then transforms the data (like applying calculations, concatenations, etc.), and finally loads the data into the Data Warehouse system. The full form of ETL is Extract, Transform, and Load.


What is a data warehouse (DW)?

A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction processing. It includes historical data derived from transaction data from single and multiple sources. The purpose of a data warehouse is to connect and analyze business data from heterogeneous sources. Data warehouses are at the core of business intelligence solutions, which analyze and report on data.

What are the different types of ETL tools available?

There are multiple ETL tools available in the market such as Adeptia Connect, Alooma Platform, CData Driver Technologies, Fivetran, IBM InfoSphere Information Server, Informatica Intelligent Data Platform, Matillion ETL, SQL Server Integration Services (SSIS), Oracle Data Integration Cloud Service, Talend Open Studio, SAS Data Management, and others.

In this blog, we will be discussing Informatica PowerCenter:

What is Informatica?

Informatica is a data integration tool based on ETL architecture. Data integration across business applications, data warehousing, and business intelligence are its main applications. Informatica has built-in functionality to connect with various source systems like databases, file systems, or SaaS-based applications using configurations, adapters, and in-built connectors. Data is extracted from all systems through Informatica, transformed on the server, and fed into data warehouses.

Example: It is possible to connect to many database servers; for instance, Oracle and Microsoft SQL Server databases can both be connected, and their data can be combined with data from another system.

Why We Need Informatica?

  1. If we need to perform some operations on the data at the backend of a data system, then we need Informatica.
  2. To modify, update, or clean up the data based on some set of rules, we need Informatica.
  3. Informatica makes it possible to load bulk data from one system to another.

Components in Informatica.

Informatica consists of two types of components:

  • Server component: Repository service, integration service, Domain, Node
  • Client component: Designer, workflow, monitor, repository client

Informatica ETL tool has the below services/components, such as:

  • Repository Service: It is responsible for maintaining Informatica metadata and provides access to the same to other services. The PowerCenter Repository Service manages connections to the PowerCenter repository from repository clients. The Repository Service is a separate, multi-threaded process that retrieves, inserts, and updates metadata in the repository database tables. The Repository Service ensures the consistency of metadata in the repository.

 

  • Integration Service: This service helps in the movement of data from sources to targets. The PowerCenter Integration Service reads workflow information from the repository. The Integration Service connects to the repository through the Repository Service to fetch metadata from the repository. The Integration Service can combine data from different platforms and source types. For example, you can join data from a flat file and an Oracle source. The Integration Service can also load data to different platforms and target types.

 

  • Reporting Service: This service generates the reports. After you create a Reporting Service, you can configure it. Use the Administrator tool to view or edit the Reporting Service properties. To view and update properties, select the Reporting Service in the Navigator. In the Properties view, click Edit in the properties section that you want to edit. If you update any of the properties, restart the Reporting Service for the modifications to take effect.

 

  • Repository Manager: Repository Manager is a GUI-based administrative client component, which allows users to create new domains and used to organize the metadata stored in the Repository. The metadata in the repository is organized in folders, and the user can navigate through multiple folders and repositories as shown in the image below. 

Repository Manager

 

  • Informatica Designer: Informatica PowerCenter Designer is a graphical user interface (GUI) for creating and managing PowerCenter objects such as source, target, Mapplets, Mapping, and transformations. To develop ETL applications, it provides a set of tools known as “Mapping”. PowerCenter Designer creates mappings by importing source tables from the database with the Source analyzer, target tables from the database with the Target designer, and transforming these tables.

Designer

  • Workflow Manager: The Workflow Manager allows for the creation and completion of workflows and other tasks. You must first create tasks, such as a session containing the mapping you make in the Designer before you can establish a process. You then connect tasks with conditional links to specify the order of execution for the tasks you created.

The Workflow Manager consists of three tools to help you develop a workflow:

    1. Task Developer – Use the Task Developer to create tasks you want to run in the workflow.
    2. Workflow Designer – Use the Workflow Designer to create a workflow by connecting tasks with links. You can also create tasks in the Workflow Designer as you develop the workflow.
    3. Worklet Designer – Use the Worklet Designer to create a worklet. 

Workflow Manager

 

  • Workflow Monitor: Informatica Workflow Monitor makes it easy to track how tasks are being completed. Generally, Informatica Power Centre helps you to track or monitor the Event Log information, list of executed Workflow, and their execution time in detail.

The Workflow Monitor consists of following windows:

  • Navigator window – Displays monitored repositories, servers, and repositories objects.
  • Output window – Displays messages from the Integration Service and Repository Service.
  • Time window – Displays the progress of workflow runs.
  • Gantt Chart view – Displays details about workflow runs in chronological format.
  • Task view – Displays details about workflow runs in a report format.

 Workflow Monitor

So, in Part 1 we have seen an overview of Informatica PowerCenter and gained a basic understanding of all the available tools. In the next blog, we will discuss the various transformations available in Informatica PowerCenter.

Please share your thoughts and suggestions in the space below, and I’ll do my best to respond to all of them as time allows.

for more such blogs click here

Happy Reading!

Slowly Changing Dimension (SCD) Type 3 in Informatica PowerCenter (May 2, 2023)
https://blogs.perficient.com/2023/05/02/slowly-changing-dimensionscd-type-3-in-informatica-powercenter/


What is a Slowly Changing Dimension?

Slowly Changing Dimension (SCD) is a dimension that allows us to store and manage both current and previous data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records. There are three types of SCD Type 1, Type 2, and Type 3. In this blog, we will look at SCD type 3.

What is SCD type 3?

A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the selected attribute's previous and current values. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value. A small SQL sketch of this pattern is shown below.
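In SQL terms, the Type 3 pattern can be sketched as below (the emp_dim table and the prev_branch/curr_branch column names are hypothetical and only illustrate the idea; the Informatica mapping that follows implements the same logic with Lookup, Router, and Update Strategy transformations):

-- Type 3 change: push the current value into the "previous" column, then overwrite the current value
UPDATE emp_dim
SET    prev_branch = curr_branch,
       curr_branch = 'BHARAT'
WHERE  emp_id = 101
  AND  curr_branch <> 'BHARAT';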

Now let us start with how we can implement SCD Type 3 in Informatica PowerCenter.

In the sample mapping below, we will work with employee data in which the Emp_Branch column gets updated. If the value of Emp_Branch changes, the current value is stored as the old value in one column and the new value becomes the current value in the other column (EMP_BRANCH1 and EMP_BRANCH2 in the target).

Source and Target

1. Drag and drop the required source and target instance to the mapping workspace.


(Note: here we require two target instances, one for inserting data and another for updating data.)


2. Add the lookup to the mapping to check whether the incoming row/data exists in the target. Select the Lookup icon and click on the mapping workspace; a window will appear. Select Target table inside the window, and choose the target table on which you want to do the lookup to check whether the incoming row/data exists or not.


LOOKUP TRANSFORMATION

3. A Lookup transformation will be created with the same structure as the target instance. Drag and drop the required ports/columns (or all of them) from the Source Qualifier to the Lookup transformation.


4. Double-click on the Lookup transformation, then go to the Condition tab and select the condition columns.


ROUTER TRANSFORMATION

5. Add a router and create two groups (Insert Group and Update Group). Now drag and drop all columns coming from the source & Unique columns from the lookup.


6. From the Insert flow of the Router group, map the columns to the target definition (INSERT GROUP) as shown below. As these are new records, they will be inserted.


7. Drag and drop columns from the Update Flow of the router group mapping incoming ports to the Update Strategy.


8. Select the Update Strategy transformation, double-click it, and go to the Properties tab. Under the formula, specify DD_UPDATE, as these records are going to update the existing data.


Mapping

9. Save and validate mapping.


WORKFLOW

Now create a workflow and a session for the above mapping.

10. Connect to the Workflow Manager. From the menu, click Tools, then select Workflows –> Create.

Ss11

11. To create a session, click on the session icon highlighted in the red box in the below screenshot. A screen will then pop up with a list of mappings available in that folder. Select the mapping for which you want to create this session and click OK.

Ss12

 

12. Now connect your session to the Start icon in the workspace.

Ss21

13. Select the session, double-click on it, and click on the Mapping tab.

Ss13

 

14. Go to the Sources folder and select the SQ instance to define its connection. Click on the down arrow button highlighted below to select the required connection for the instance.

Ss14

15. Go to the Targets folder and select the target instance for the update flow to define its connection. Click on the down arrow button highlighted below to select the required connection for the instance. Under Properties, select "Update as Update" only, as here we are updating the existing records.

Ss15

16. Go to the Targets folder and select the target instance for the insert flow to define its connection. Click on the down arrow button highlighted below to select the required connection for the instance. Under Properties, select "Insert" only, as here we are inserting the new records.

Ss16

17. Go to the Transformations folder and select the Lookup instance to define its connection. Click on the down arrow button highlighted below to select the required connection for the instance.

Ss17

18. Click Apply and Ok.

19. Save and validate the workflow. Now you can run your job.

OUTPUT

Below is the screenshot of the source table. It contains three records.

Ss18

Below is the screenshot of the target table after running the job successfully.

EMP_BRANCH1 holds the previous (historical) value and EMP_BRANCH2 holds the current (updated) value.

Ss20

Here we can check that a new record is inserted for EMP_ID '104', and the EMP_BRANCH of EMP_ID '101' is updated to 'BHARAT'.

This is all about the implementation of the Slowly Changing Dimension (SCD) Type 3 mapping. Hope you enjoyed reading this blog and found it helpful.

Conclusion

I hope this 4-minute read has helped enthusiasts who want to learn about SCD Type 3 in Informatica and given them a broader view of how to create an SCD Type 3 mapping. Referring to this blog, users can learn how to build and execute an end-to-end flow using these steps.

Happy Reading!

]]>
https://blogs.perficient.com/2023/05/02/slowly-changing-dimensionscd-type-3-in-informatica-powercenter/feed/ 5 333211
Implementation of SCD type 1 in Informatica PowerCenter https://blogs.perficient.com/2023/04/19/implementation-of-scd-type-1-in-informatica-powercenter/ https://blogs.perficient.com/2023/04/19/implementation-of-scd-type-1-in-informatica-powercenter/#comments Wed, 19 Apr 2023 14:57:41 +0000 https://blogs.perficient.com/?p=332952

What is a Slowly Changing Dimension?

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

Type 1 SCDs – Overwriting

In a Type 1 SCD the new data overwrites the existing data. Thus, the existing data is lost as it is not stored anywhere else. This is the default type of dimension you create. You do not need to specify any additional information to create a Type 1 SCD.

Slowly Changing Dimension Type 1:

Slowly Changing Dimension Type 1 is used to maintain the latest data by comparing the incoming data with the existing data in the target. It inserts the new records and updates the changed records by overwriting the existing data. As a result, all the records contain current data only.

It is used to update the table when you do not need to keep any previous versions of the data for those records (no historical records are stored in the table).

Example:

This sample mapping showcases how SCD Type 1 works. In this exercise, we do not compare column to column to check whether there is any change in the existing record; we only check the primary key: if it exists, we update the record, else we insert it as new.
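As a point of reference, the key-only check described above corresponds roughly to the SQL Server-style MERGE sketched below. The table names DEPARTMENT_DETAILS (source) and DEPARTMENT_CURRENT (target) and the columns DEPT_ID, DEPT_NAME, DEPT_LOC, and DEPT_HEAD are borrowed from objects that appear later in this post; the exact table layout is an assumption made for illustration.

-- Key-only SCD Type 1 upsert: if the primary key already exists in the target,
-- overwrite the row; otherwise insert it as a new record.
MERGE INTO DEPARTMENT_CURRENT AS tgt
USING DEPARTMENT_DETAILS AS src
   ON tgt.DEPT_ID = src.DEPT_ID
WHEN MATCHED THEN
    UPDATE SET tgt.DEPT_NAME = src.DEPT_NAME,
               tgt.DEPT_LOC  = src.DEPT_LOC,
               tgt.DEPT_HEAD = src.DEPT_HEAD
WHEN NOT MATCHED THEN
    INSERT (DEPT_ID, DEPT_NAME, DEPT_LOC, DEPT_HEAD)
    VALUES (src.DEPT_ID, src.DEPT_NAME, src.DEPT_LOC, src.DEPT_HEAD);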

Please connect & open the Repository Folder where you want to create mapping and workflow.

1. Connect and Open the folder if not already opened.

2. Select Tools –> Mapping Designer

0

3. Select Mappings –> Create –> Enter the mapping name you want to create. Then click on “OK”.

4. Drag & drop the required source instance to mapping.

1

5. Drag and drop the target table to the mapping (take two instances: one for the insert flow and the other for the update flow).

4

6. Add a Lookup to the mapping. The lookup instance is built on the target table to check whether the incoming record exists or not; if it does not exist, insert it, else update it. (Here the lookup is a connected one.)

Target Table

Here we need to look up the target table, so select the location of the lookup table as "Target" and select the table from the list under the Targets folder, as shown below.

5

Then Click on “OK”

7. The lookup instance will be added to the mapping as shown below.

8. Now drag the required columns from the Source Qualifier to the Lookup transformation as shown below.

6

9. To define the lookup condition, double-click on the Lookup transformation and go to the Condition tab.

7

 

10. Drag the lookup primary key (from the Lookup) and all the other columns that were dragged from the Source Qualifier into the Lookup down to the Router transformation, so that records can be routed/separated for insert and update.

Output Variable Ports

And we will create two output variable ports for new records and updated records.

o_new_records = IIF(ISNULL(lkp_DEPT_ID), TRUE, FALSE)

o_updated_records = IIF((DEPT_ID = lkp_DEPT_ID) AND
                        ((DEPT_NAME != lkp_DEPT_NAME) OR
                         (DEPT_LOC  != lkp_DEPT_LOC)  OR
                         (DEPT_HEAD != lkp_DEPT_HEAD)), TRUE, FALSE)

(The attribute comparisons are grouped in parentheses so that a row is flagged as updated only when its DEPT_ID matched the lookup and at least one attribute has changed.)

9
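For readers who want to verify the expression logic, the same new-versus-changed classification can be sketched in SQL as below. This is only an illustration; the table names DEPARTMENT_DETAILS and DEPARTMENT_CURRENT and the exact layout are assumptions, and the inequality comparisons do not flag changes to or from NULL, much like the expressions above.

-- New rows: no matching DEPT_ID in the target (the lookup would return NULL).
-- Updated rows: DEPT_ID matches but at least one attribute differs.
SELECT src.DEPT_ID,
       CASE WHEN tgt.DEPT_ID IS NULL THEN 1 ELSE 0 END AS o_new_records,
       CASE WHEN tgt.DEPT_ID IS NOT NULL
             AND (src.DEPT_NAME <> tgt.DEPT_NAME
               OR src.DEPT_LOC  <> tgt.DEPT_LOC
               OR src.DEPT_HEAD <> tgt.DEPT_HEAD)
            THEN 1 ELSE 0 END AS o_updated_records
FROM   DEPARTMENT_DETAILS src
LEFT JOIN DEPARTMENT_CURRENT tgt
       ON tgt.DEPT_ID = src.DEPT_ID;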

11. Condition to separate records for Insert and Update.

Double-click on the Router transformation and go to the Groups tab to create two groups: one for the insert condition and the other for the update condition.

For the NEW_RECORD group: o_new_records
Note: if the lookup DEPT_ID is null, there is no matching record in the target, so the row goes for insert.

For the UPDATED_RECORDS group: o_updated_records
Note: if the lookup DEPT_ID is not null, there is a matching record in the target, so the row goes for update.

10

12. From the NEW_RECORD group of the Router transformation, map the columns to the target table instance taken for insert.

(Note: the default row type for incoming rows is insert, which is why we do not use an Update Strategy for the insert flow.)

8

13. Add an Update Strategy transformation to flag incoming records for update. Drag the required columns from the UPDATED_RECORDS group of the Router transformation, as shown above.

14. Double-click on the Update Strategy transformation and go to the Properties tab.

Under "Update Strategy Expression", write DD_UPDATE as shown below.

Update
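Flagging a row with DD_UPDATE tells the Integration Service to treat it as an update against the target. Conceptually, each flagged row results in an update keyed on the target's primary key, roughly like the sketch below (the table and column names are the same illustrative assumptions used earlier in this post).

-- Rough effect of a row flagged DD_UPDATE reaching the update target instance:
-- overwrite the non-key columns for the matching DEPT_ID.
-- The :src_* names are placeholders for the values arriving from the update flow.
UPDATE DEPARTMENT_CURRENT
SET    DEPT_NAME = :src_DEPT_NAME,
       DEPT_LOC  = :src_DEPT_LOC,
       DEPT_HEAD = :src_DEPT_HEAD
WHERE  DEPT_ID   = :src_DEPT_ID;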

15. Map the required columns from Update Strategy to Target Instance taken for Update flow.

12

3

16. Create the workflow for the above mapping.

17. Connect and open the folder under which you have created the mapping.

18. Select “Workflows” from the menu –> Click on “Create…” as shown below

13

19. It will pop up the below screen. Enter the name for the workflow.

14

Then Click on “OK”.

20. Create the session for the mapping by clicking the icon in the below screen. It will pop up the mappings list; from the list, select the mapping for which you want to create this session, as shown below.

15

Now that the session is created, link the session with the Start icon as shown below.

21. Double-click on the session, then go to the Properties tab:

23

By default, "Treat source rows as" is set to Insert, but whenever you add an Update Strategy to the mapping, "Treat source rows as" automatically changes to Data Driven.

22. Then go to the Mapping tab to assign the source, target, and lookup database connection information.

17

23. Go to the Sources folder in the left-side navigator, then select the source (SQ_DEPARTMENT_DETAILS) to assign a database connection. Click on the down arrow button to get the list of connections available for this repository and select the required one from the list. For example, "oracle" is the connection name pointing to the Oracle database in this example.

18

24. Similarly, go to the Targets folder in the left-side navigator, then select the target (DEPARTMENT_CURRENT_insert) to assign a database connection. Click on the down arrow button to get the list of connections available for this repository and select the required one from the list. For example, "Oracle" is the connection name pointing to the Practice database in this example.

In the session properties, select Insert to insert data into the target.

19

25. Then go to the Targets folder in the left-side navigator and select (DEPARTMENT_CURRENT_insert1) to assign a database connection. Click on the down arrow button to get the list of connections available for this repository and select the required one from the list. For example, "Oracle" is the connection name pointing to the HR database in this example.

In the session properties, select Update to update data in the target.

20

26. Check all the transformations, then click on "Apply" and "OK".

27. Save the session and the workflow, then run the workflow.

28. When you run the session for the first time, all the records will be inserted.

29. In the below screen, I have inserted one record and modified records 2 and 4 in the source; when you run the job a second time, those records will be updated.

21

30. In the below screen, the records highlighted with the red box are modified/updated records and the records highlighted with the green box are newly inserted records.

Last

31. The other records, which are not highlighted, are simply overwritten with the same values, as they have no changes.

This is all about the Implementation of SCD type 1 in Informatica PowerCenter. I hope you enjoyed reading this post and found it to be helpful.

For more blogs click here.

Happy Reading!

]]>
https://blogs.perficient.com/2023/04/19/implementation-of-scd-type-1-in-informatica-powercenter/feed/ 3 332952