

IBM SPSS Modeler and Duplicate Data

Transactional datasets (especially those originating from databases) may contain duplicate records that must be removed before any modeling can begin. There are two situations in which duplicate records occur:

  • The dataset IS erroneous (the same record appears multiple times by mistake)
  • The dataset IS NOT erroneous (records appear multiple times because information is collected at different moments in time, or for other valid reasons)

The truth is, identical records (errors in the data) should simply be removed; the real question is how to deal with the second type of duplicate records.

Your Objective

How you handle duplicates depends on your objective. If your interest lies in the most recently opened account (no matter the type of account), then only the record with the most recent opening date has to be retained; the accounts opened on earlier dates can be discarded.

Consider an example where the records are not duplicated in the sense that all values on all fields are identical, but are duplicated only on a subset of fields (the customer ID):

 

[Figure 1: sample data with records duplicated on customer_id]
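To make this concrete, here is a minimal Python/pandas sketch of data like the figure's. The field names (customer_id, account_type, open_date) and the values are hypothetical stand-ins for the article's example, not output from an actual Modeler stream. Dropping rows that are identical on every field handles the first (erroneous) situation:

    import pandas as pd

    # Hypothetical data mirroring the figure: customer 101 appears three
    # times, and one of those rows is an exact (erroneous) duplicate.
    df = pd.DataFrame({
        "customer_id":  [101, 101, 101, 102, 103],
        "account_type": ["savings", "checking", "checking", "savings", "savings"],
        "open_date":    pd.to_datetime(
            ["2020-01-15", "2021-06-30", "2021-06-30", "2019-11-02", "2022-03-08"]
        ),
    })

    # Erroneous duplicates are rows identical on ALL fields; dropping
    # them keeps exactly one copy of each fully identical record.
    cleansed = df.drop_duplicates()
    print(cleansed)

After this step customer 101 still appears twice, because the remaining records differ on account_type and open_date; they are duplicates only on customer_id.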

Distincting: Key Fields

To remove duplicate records, key fields need to be specified to define which records are identical. Records with the same values on the key fields are treated as duplicates. In the example above, this means that if all fields are specified as keys, then identical records will be identified and the data can be cleansed. To accommodate the second situation, retaining only the most recent record per customer, only customer_id should be specified as the key field (assuming the data are sorted on “open date” within customer_id).

[Figure 2]
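Continuing the hypothetical example, the second situation maps to using customer_id as the only key. The sketch below shows the logic in pandas, not Modeler's own scripting interface, and it assumes (as the paragraph above does) that records are sorted on open date within customer_id, most recent first:

    import pandas as pd

    # The cleansed hypothetical data from the earlier sketch.
    df = pd.DataFrame({
        "customer_id":  [101, 101, 102, 103],
        "account_type": ["savings", "checking", "savings", "savings"],
        "open_date":    pd.to_datetime(
            ["2020-01-15", "2021-06-30", "2019-11-02", "2022-03-08"]
        ),
    })

    # Sort descending on open_date within customer_id so the most
    # recent record is the first one in each group.
    df = df.sort_values(["customer_id", "open_date"], ascending=[True, False])

    # With customer_id as the only key field, keeping the first record
    # per group retains the most recently opened account per customer.
    most_recent = df.drop_duplicates(subset="customer_id", keep="first")
    print(most_recent)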

Distincting with the Distinct Node

The SPSS Modeler Distinct node (Record Ops palette) checks for duplicate records and passes either the first record in each group, or all records but the first, along the stream:

  1. Add a Distinct node to your stream downstream from the dataset that must have duplicates removed.
  2. Edit the Distinct node to set the options for duplicate removal.

The Distinct Dialog

You will see that the Distinct node has two tabs, Settings and Optimization. To set the options for duplicate removal, all you do is:

  1. Click the Settings tab.
  2. Mode controls whether only the first record (within the group of duplicates) is retained, or whether all records but the first are retained.
  3. Key fields for grouping should be set to the one or more fields that identify duplicates.
  4. Within groups, sort records by optionally sorts the records within each group so that a particular record (for example, the most recent one) is the first record in the group; in combination with the mode include only the first record, this retains the most recent record (see the sketch after Figure 3).

[Figure 3: the Distinct node dialog, Settings tab]
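As a rough pandas equivalent of these settings (again an illustrative sketch, not Modeler's scripting API), the two modes and the within-group sort look like this:

    import pandas as pd

    df = pd.DataFrame({
        "customer_id":  [101, 101, 102, 103],
        "account_type": ["savings", "checking", "savings", "savings"],
        "open_date":    pd.to_datetime(
            ["2020-01-15", "2021-06-30", "2019-11-02", "2022-03-08"]
        ),
    })

    keys = ["customer_id"]  # "Key fields for grouping"

    # "Within groups, sort records by": open_date descending puts the
    # most recent record first within each customer group.
    df = df.sort_values(keys + ["open_date"], ascending=[True, False])

    # Mode: include only the first record -> one record per group survives.
    first_per_group = df.drop_duplicates(subset=keys, keep="first")

    # Mode: discard only the first record -> everything except the first
    # record in each group passes (useful for inspecting the duplicates).
    all_but_first = df[df.duplicated(subset=keys, keep="first")]

    print(first_per_group)
    print(all_but_first)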

It really is that easy!


