Skip to main content

Data & Intelligence

Informatica: Removing Duplicate Source Records

Duplicate records are occasionally found in source data. Due to primary key constraints on a target database, only one version of a duplicate source record should be loaded into the target. The following methods demonstrate some of the most efficient ways to deal with duplicate source data.

1. Aggregator Transformation

When to use Any time an Aggregator transformation has no major impact on performance.Source is relational. Or source is file and output row order does not matter.
Pros Ease of implementation.Works with all types of sources.
Cons Limits in choice of output row.                                                                      Aggregator cache can slow performance of large loads.                                 Duplicates are silently removed and therefore no auditing.

Use AGG Transformation and group by the keys that u want to remove dup. To improve speed, sorted ports can be used for presorted data. Make sure the first row listed in the order by clause in source qualifier is the same as the Group By port in the Aggregator. If your source is not relational, add the Sorter transformation.

2. Rank Transformation

When to use When you want a specific row from each group and your source is a flat file.
Pros Ease of implementation.Sorts non-relational data for custom row output.
Cons Cache can slow performance of large loads.                                                           Limits in choice of output row.                                                                               Duplicates are silently removed and therefore no auditing.
   
Data Intelligence - The Future of Big Data
The Future of Big Data

With some guidance, you can craft a data platform that is right for your organization’s needs and gets the most return from your data capital.

Get the Guide

No modifications needed for Source Qualifier. This is most useful if the source is not relational. Set Number of Ranks to 1 and Top/Bottom property.

3. Sorter Transformation

Send all the data to a sorter and sort by all fields that you want to remove dup. In the properties’ tab, select Unique. This will select and send forward only Unique Data.

4. qualifier

At Source qualifier, you can enable ‘Select Distinct’ Option. Or, you can also write your own query in SQL so only distinct rows would be selected. However it works for only relational sources. For flat file sources, you can do it in as pre session command:

sort abc.txt | uniq

Thoughts on “Informatica: Removing Duplicate Source Records”

  1. You have listed out such a nice ways of removing duplicate records. But, duplicate data remove is become so easy by using deduplication software nowadays.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.