Duplicate records are occasionally found in source data. Due to primary key constraints on a target database, only one version of a duplicate source record should be loaded into the target. The following methods demonstrate some of the most efficient ways to deal with duplicate source data.
1. Aggregator Transformation
| When to use | Any time an Aggregator transformation has no major impact on performance; the source is relational, or the source is a flat file and output row order does not matter. |
| Pros | Easy to implement. Works with all types of sources. |
| Cons | Limited control over which row of each duplicate group is output. The Aggregator cache can slow performance on large loads. Duplicates are removed silently, so there is no audit trail. |
Use an Aggregator transformation and group by the key ports on which you want to deduplicate. To improve performance with presorted data, enable the Sorted Input option. Make sure the first column in the ORDER BY clause of the Source Qualifier matches the first Group By port in the Aggregator. If the source is not relational, add a Sorter transformation upstream of the Aggregator.
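Informatica mapping logic cannot be shown directly as code, but the effect of grouping by a key and emitting one row per group can be sketched in Python. The rows and the key column below are hypothetical; note how, as with the Aggregator, you get little control over which duplicate survives:

```python
# Hypothetical source rows: (customer_id, amount); customer_id is the dedup key.
rows = [(101, 50), (102, 75), (101, 60)]

# Group by the key and keep one row per group. Keeping the first row seen
# mirrors the Aggregator's behavior when no FIRST()/LAST() expression is used.
deduped = {}
for row in rows:
    deduped.setdefault(row[0], row)  # only the first row per key is stored

print(list(deduped.values()))  # [(101, 50), (102, 75)]
```

The row `(101, 60)` is dropped silently, which is exactly why this method provides no auditing of removed duplicates.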
2. Rank Transformation
| When to use | When you want a specific row from each group and the source is a flat file. |
| Pros | Easy to implement. Sorts non-relational data, giving some control over which row is output. |
| Cons | The Rank cache can slow performance on large loads. Control over the output row is limited to top/bottom by the rank port. Duplicates are removed silently, so there is no audit trail. |
No modifications are needed in the Source Qualifier, which makes this approach most useful when the source is not relational. Set Number of Ranks to 1 and use the Top/Bottom property to choose which row from each group is kept.
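As a sketch of what Rank does with Number of Ranks = 1, the Python below keeps the single top row per key, ranked on a second column. The rows, key, and timestamp column are hypothetical:

```python
# Hypothetical rows: (customer_id, last_updated). Keep the most recent row
# per key, mirroring Rank with Number of Ranks = 1 and Top/Bottom = Top.
rows = [(101, "2020-01-05"), (102, "2020-02-01"), (101, "2020-03-10")]

best = {}
for key, ts in rows:
    # Replace the stored row only if this one ranks higher (is more recent).
    if key not in best or ts > best[key][1]:
        best[key] = (key, ts)

print(sorted(best.values()))  # [(101, '2020-03-10'), (102, '2020-02-01')]
```

Unlike the plain group-by sketch above, ranking lets you state *which* duplicate wins, at the cost of maintaining a rank cache.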
3. Sorter Transformation
Send all the data to a Sorter transformation and sort by every port on which you want to deduplicate. On the Properties tab, select the Distinct option; only unique rows are passed downstream.
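The Sorter's behavior, sorting on all ports and discarding exact duplicates, can be sketched in Python (the sample rows are hypothetical):

```python
# Hypothetical rows; duplicates are exact copies across all columns.
rows = [("B", 2), ("A", 1), ("A", 1), ("B", 2), ("C", 3)]

# Sort by all ports and drop exact duplicates, like a Sorter transformation
# with the Distinct option enabled.
distinct = sorted(set(rows))

print(distinct)  # [('A', 1), ('B', 2), ('C', 3)]
```

Note this only removes rows that are identical in every column; rows that differ in any port are all kept.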
4. Select Distinct Option
At the Source Qualifier, you can enable the Select Distinct option, or write your own SQL override so that only distinct rows are selected. However, this works only for relational sources. For flat-file sources, you can deduplicate with a pre-session command:
sort abc.txt | uniq
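The SQL-override variant can be sketched with Python's built-in sqlite3 module; the table and columns below are hypothetical stand-ins for a real relational source:

```python
import sqlite3

# Hypothetical source table with duplicate rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (id INTEGER, name TEXT)")
con.executemany("INSERT INTO src VALUES (?, ?)",
                [(1, "ann"), (1, "ann"), (2, "bob")])

# A SELECT DISTINCT query, as you would write in a Source Qualifier override.
result = con.execute("SELECT DISTINCT id, name FROM src ORDER BY id").fetchall()

print(result)  # [(1, 'ann'), (2, 'bob')]
```

Like the Distinct option, this pushes deduplication down to the database, which is usually cheaper than caching rows in the mapping.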