Data Mining with IBM SPSS Modeler v15

Having recently completed the course “IBM SPSS Modeler & Data Mining” offered by Global Knowledge, I was looking for more opportunities to do some modeling with SPSS Modeler. So when I read in the news recently about college recruiters using predictive techniques to estimate the probability of a particular recruit graduating on time, I thought it would be interesting to explore that idea.

For Example

My college wants to determine whether a recruit will graduate on time. The institution can draw a sample from its historical data and use that sample to predict whether a particular recruit is likely to graduate on time. The sample below gives us an idea of such a historical dataset. Typically, a dataset will include a field that records the behavior of interest; here: has the student graduated on time, yes or no?
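
To make the idea of a target field concrete outside of Modeler, here is a minimal Python sketch. The records, field names and values are all hypothetical and invented for illustration; they are not the data used in this post.

```python
from collections import Counter

# Hypothetical records: field names and values are invented for illustration
# and are not taken from the dataset used in this post.
students = [
    {"athlete": "yes", "activities": "yes", "graduate_on_time": "yes"},
    {"athlete": "yes", "activities": "no", "graduate_on_time": "no"},
    {"athlete": "no", "activities": "yes", "graduate_on_time": "yes"},
    {"athlete": "no", "activities": "no", "graduate_on_time": "no"},
    {"athlete": "no", "activities": "no", "graduate_on_time": "no"},
]

TARGET = "graduate_on_time"  # the field that records the behavior of interest
PREDICTORS = [name for name in students[0] if name != TARGET]


def target_distribution(records, target=TARGET):
    """Count how often each outcome value occurs in the target field."""
    return Counter(row[target] for row in records)


print("predictors:", PREDICTORS)
print("target counts:", dict(target_distribution(students)))
```

Every other field is a candidate predictor; the target is the one field a model would try to explain.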

dm1

An Idea – and some cross tabulation!

The college recruiters have a hunch that graduation outcomes differ between students who are athletes and those who are not, and between students who participate in collegiate activities and those who do not. Based on this hunch, they check for differences in graduating-on-time statistics by cross-tabulating on “athlete”:

dm2

In IBM SPSS Modeler, it is very simple to cross-tabulate data using the Matrix node. You simply drop it into your stream, connect it to your source data and set some parameters. For example, I set “Rows” to the field in my file “graduate on time” and “Columns” to “athlete”. (I also went to the “Appearance” tab and selected “Counts” and “Percentage of column” for my “Cross-tabulation cell contents”.)
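
For readers who want to see what such a cross-tabulation computes, the following is a rough Python equivalent, not Modeler's own implementation. The `cross_tab` helper and the counts are hypothetical; each cell holds the count and the percentage of its column, mirroring the “Counts” and “Percentage of column” options.

```python
from collections import Counter


def cross_tab(records, row_field, col_field):
    """Return row labels, column labels and a {(row, col): (count, column %)} table."""
    counts = Counter((r[row_field], r[col_field]) for r in records)
    rows = sorted({key[0] for key in counts})
    cols = sorted({key[1] for key in counts})
    col_totals = {c: sum(counts[(r, c)] for r in rows) for c in cols}
    table = {}
    for r in rows:
        for c in cols:
            n = counts[(r, c)]
            table[(r, c)] = (n, 100.0 * n / col_totals[c])
    return rows, cols, table


def make(n, athlete, graduated):
    """Build n identical hypothetical records."""
    return [{"athlete": athlete, "graduate_on_time": graduated} for _ in range(n)]


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
records = make(6, "yes", "yes") + make(2, "yes", "no") + make(1, "no", "yes") + make(7, "no", "no")

rows, cols, table = cross_tab(records, "graduate_on_time", "athlete")
for r in rows:
    cells = ", ".join(
        f"athlete={c}: {table[(r, c)][0]} ({table[(r, c)][1]:.1f}%)" for c in cols
    )
    print(f"graduate_on_time={r}: {cells}")
```

Column percentages are the natural choice here because the question is how the outcome differs within each “athlete” group.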

dm3

After clicking “Run”, the output is ready for review:

dm4

dm5

Another Analysis Tool

SPSS Modeler also provides the Distribution node, which shows how often each symbolic (non-numeric) value occurs in the dataset; in this case, the values of “graduate on time” or “athlete”. A typical use of the Distribution node is to reveal imbalances in the data (which can be rectified by using a Balance node before creating a model). I used the node to plot “athlete” overlaid with “graduate on time” for an interesting perspective:
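
The imbalance check can be sketched in plain Python as well. The counts are hypothetical, and `reduce_majority` is an illustrative stand-in that randomly downsamples the larger groups; it is an assumption about one simple way to rebalance, not a description of how the Balance node works internally.

```python
import random
from collections import Counter


def distribution(records, field):
    """Return {value: (count, percent of all records)} for a symbolic field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: (n, 100.0 * n / total) for value, n in counts.items()}


def reduce_majority(records, field, seed=0):
    """Randomly drop records so that every value of `field` is equally frequent."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    groups = {}
    for r in records:
        groups.setdefault(r[field], []).append(r)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Hypothetical, deliberately imbalanced sample: 9 "no" versus 3 "yes".
records = [{"graduate_on_time": "no"} for _ in range(9)]
records += [{"graduate_on_time": "yes"} for _ in range(3)]

for value, (n, pct) in distribution(records, "graduate_on_time").items():
    print(f"{value:>3} {'#' * n} {n} ({pct:.0f}%)")

balanced = reduce_majority(records, "graduate_on_time")
print(Counter(r["graduate_on_time"] for r in balanced))
```

Downsampling throws data away, so on a small sample it may be better to leave the data as it is and simply keep the imbalance in mind when reading the model.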

dm6

Back to the Analysis

Looking at my cross-tabulation output, it appears that 93% of the non-athlete students did not graduate on time, while only 13% of the students who were athletes did not graduate on time. The question now is: can this difference be attributed to chance (because only a sample was drawn), or does the difference in the sample reflect a true difference in the population of all students?

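The Chi-Square test is a statistical test used to answer this question. It gives the probability of seeing a difference between athletes and non-athletes at least as large as the one in the sample if there were no true difference in the population, that is, if the difference were due to chance alone.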
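
As a rough illustration of the arithmetic behind the test, here is a Pearson chi-square for a 2x2 table in plain Python. The counts are hypothetical: they assume 100 athletes and 100 non-athletes so that the cells mirror the 93% and 13% quoted above, since the real sample size is not stated here. The p-value formula used is valid only for one degree of freedom, which is the 2x2 case.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts.

    Returns (statistic, p_value). The p-value uses the identity
    P(X > x) = erfc(sqrt(x / 2)) for a chi-square variable with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts (assumed group sizes of 100 each).
# Rows: graduated on time (yes, no); columns: athlete (no, yes).
observed = ((7, 87), (93, 13))
chi2, p_value = chi_square_2x2(observed)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3g}")

# A table with no difference between the groups gives a statistic of 0.
print(chi_square_2x2(((10, 10), (10, 10))))
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.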

dm7

The CHAID Node

CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits. Again, SPSS Modeler offers the “CHAID node” that can be dropped into a stream and configured. In my exercise, I set my (CHAID) target to “graduate on time” and my predictors to “activities” and “athlete”. My results are presented in the viewer, which shows a “tree” to present the data. The initial node shows the breakdown of on-time vs. not-on-time graduates, and Modeler then broke out the next level into students who did not participate in activities and those who did.
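
The core idea of picking a split by chi-square can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the CHAID algorithm as Modeler implements it: it assumes every predictor is binary (so the statistics share one degree of freedom and can be compared directly), and it skips the category merging and significance adjustments that full CHAID performs. The data and helper names are hypothetical.

```python
from collections import Counter


def chi_square(records, predictor, target):
    """Pearson chi-square statistic for the predictor-by-target contingency table."""
    n = len(records)
    cell_counts = Counter((r[predictor], r[target]) for r in records)
    predictor_totals = Counter(r[predictor] for r in records)
    target_totals = Counter(r[target] for r in records)
    chi2 = 0.0
    for p_value, p_total in predictor_totals.items():
        for t_value, t_total in target_totals.items():
            expected = p_total * t_total / n
            chi2 += (cell_counts[(p_value, t_value)] - expected) ** 2 / expected
    return chi2


def best_split(records, predictors, target):
    """Pick the predictor whose table deviates most from independence."""
    scores = {p: chi_square(records, p, target) for p in predictors}
    return max(scores, key=scores.get), scores


def make(n, athlete, activities, graduated):
    """Build n identical hypothetical records."""
    row = {"athlete": athlete, "activities": activities, "graduate_on_time": graduated}
    return [dict(row) for _ in range(n)]


# Hypothetical sample in which "activities" separates the outcome far better than "athlete".
records = (
    make(40, "yes", "yes", "yes") + make(2, "yes", "yes", "no")
    + make(38, "no", "yes", "yes") + make(2, "no", "yes", "no")
    + make(1, "yes", "no", "yes") + make(20, "yes", "no", "no")
    + make(2, "no", "no", "yes") + make(18, "no", "no", "no")
)

winner, scores = best_split(records, ["activities", "athlete"], "graduate_on_time")
print("first split on:", winner)
for name, score in scores.items():
    print(f"  {name}: chi-square = {score:.2f}")
```

A tree builder would then repeat the same selection inside each resulting group, which is how further levels of a tree come about.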

dm8

The exercise reported the probability (P-Value) as 0. A p-value this small (presumably rounded to zero for display) means it is extremely unlikely that the difference between students involved in activities and those who are not can be attributed to chance. In other words: there is a relationship between participating in activities and graduating on time!
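
A reported p-value of 0 is usually a display matter rather than a literal zero. The short Python sketch below shows this with a hypothetical chi-square statistic of 150 on one degree of freedom; the statistic is invented for illustration and is not the value from this exercise.

```python
import math


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


p = p_value_df1(150.0)  # hypothetical statistic, not taken from the post

print(f"{p:.3f}")  # three decimals, as a results viewer might show it: 0.000
print(f"{p:.2e}")  # scientific notation reveals a tiny but non-zero value
print(p > 0)       # True
```

So “0” is best read as “smaller than the display can show”.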

Looking at these results, I concluded that students who do not participate in activities during their college career have a much higher chance of NOT graduating on time (96%) than those who do participate in activities (3%).

The next step might be to zoom in on the students who do not participate in activities. Modeler broke down the “tree” into a third level:

dm9

Here, Modeler tells me that the students who do not participate in activities include both athletes and non-athletes. The non-athletes who do not participate in activities have a slightly better “on time” rate than the athletes who do not participate in activities.
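
The leaf-level comparison amounts to computing an on-time rate per combination of field values. A minimal Python sketch follows; the counts are hypothetical and only arranged so that the non-athletes come out slightly ahead, as described above.

```python
from collections import defaultdict


def on_time_rates(records, fields, target="graduate_on_time", positive="yes"):
    """Return {tuple of field values: share of records with the positive outcome}."""
    tallies = defaultdict(lambda: [0, 0])  # key -> [on-time count, total count]
    for r in records:
        key = tuple(r[f] for f in fields)
        tallies[key][1] += 1
        if r[target] == positive:
            tallies[key][0] += 1
    return {key: on_time / total for key, (on_time, total) in tallies.items()}


def make(n, activities, athlete, graduated):
    """Build n identical hypothetical records."""
    row = {"activities": activities, "athlete": athlete, "graduate_on_time": graduated}
    return [dict(row) for _ in range(n)]


# Hypothetical students who do not participate in activities.
records = (
    make(1, "no", "yes", "yes") + make(19, "no", "yes", "no")
    + make(2, "no", "no", "yes") + make(18, "no", "no", "no")
)

rates = on_time_rates(records, ["activities", "athlete"])
for (activities, athlete), rate in sorted(rates.items()):
    print(f"activities={activities}, athlete={athlete}: {rate:.0%} on time")
```

With groups this small, a difference of a few percentage points can easily come from chance, which is worth checking before acting on it.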

Conclusion

Of course there is more to a legitimate data mining project, but it is clear that IBM SPSS Modeler is a handy tool that “fits” data scientists from novice to expert level. More exploration to come!

About the Author

Mr. Miller is an IBM certified and accomplished Senior Project Leader and Application/System Architect-Developer with over 30 years of extensive applications and system design and development experience. His current role is National FPM Practice Leader. His experience includes BI, Web architecture & design, systems analysis, GUI design and testing, Database modeling and systems analysis, design, and development of Client/Server, Web and Mainframe applications and systems utilizing: Applix TM1 (including TM1 rules, TI, TM1Web and Planning Manager), dynaSight - ArcPlan, ASP, DHTML, XML, IIS, MS Visual Basic and VBA, Visual Studio, PERL, Websuite, MS SQL Server, ORACLE, SYBASE SQL Server, etc.

His responsibilities have included all aspects of Windows and SQL solution development and design including: analysis; GUI (and Web site) design; data modeling; table, screen/form and script development; SQL (and remote stored procedures and triggers) development and testing; test preparation and management and training of programming staff. Other experience includes development of ETL infrastructure such as data transfer automation between mainframe (DB2, Lawson, Great Plains, etc.) systems and client/server SQL server and Web based applications and integration of enterprise applications and data sources.

In addition, Mr. Miller has acted as Internet Applications Development Manager responsible for the design, development, QA and delivery of multiple Web Sites including online trading applications, warehouse process control and scheduling systems and administrative and control applications. Mr. Miller also was responsible for the design, development and administration of a Web based financial reporting system for a 450 million dollar organization, reporting directly to the CFO and his executive team. Mr. Miller has also been responsible for managing and directing multiple resources in various management roles including project and team leader, lead developer and applications development director.

Specialties include: Cognos/TM1 Design and Development, Cognos Planning, IBM SPSS and Modeler, OLAP, Visual Basic, SQL Server, Forecasting and Planning; International Application Development, Business Intelligence, Project Development. Certifications: IBM Certified Developer - Cognos TM1 (perfect score, 100% on exam); IBM Certified Business Analyst - Cognos TM1.
