Automated Data Preparation (ADP)
The seasoned data scientist knows that probably the single most important step in creating a predictive model is pinpointing the appropriate “data pond” and ensuring that it is properly “prepared”. I’ve written about the many “out of the box” tools that SPSS users can use to manage data, such as the ability to:
- List Cases,
- Identify and Replace Missing Values,
- Transform and Compute new variables,
- Recode,
- Select Cases,
- Sort Cases and even
- Merge Files.
These features are accessible in SPSS Statistics Base from pull-down menus. In addition, SPSS goes one step further and offers “Automated Data Preparation” or “ADP”.
Automated Data Preparation (ADP) automatically analyzes your data and identifies fixes, screening out fields that are a problem or not useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques. You can use ADP in “Automatic” mode (allowing it to choose and apply fixes), or in “Interactive” mode (previewing the changes before they are made and accepting or rejecting them as desired).
Using ADP enables you to make your data ready for model building quickly and easily, without needing prior knowledge of the statistical concepts involved. Models will tend to build and score more quickly; in addition, using ADP improves the robustness of automated modeling processes.
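To get a feel for what “screening out fields that are a problem or not useful” might mean in practice, here is a minimal Python sketch. It is not how SPSS implements ADP; the function name `screen_fields`, the two thresholds, and the sample rows are all made up for illustration. It simply drops fields that are mostly missing or that are nearly constant.

```python
from collections import Counter


def screen_fields(rows, max_missing=0.5, max_single_category=0.95):
    """Split field names into kept and dropped (hypothetical thresholds).

    rows is a list of dicts; None marks a missing value.
    Returns (kept, dropped) where dropped maps field name -> reason.
    """
    kept, dropped = [], {}
    fields = list(rows[0].keys()) if rows else []
    for field in fields:
        values = [row.get(field) for row in rows]
        present = [v for v in values if v is not None]
        missing_share = 1 - len(present) / len(rows)
        # Rule 1: too many missing values to be useful.
        if not present or missing_share > max_missing:
            dropped[field] = "too many missing values"
            continue
        # Rule 2: one value dominates, so the field carries little information.
        top_count = Counter(present).most_common(1)[0][1]
        if top_count / len(present) > max_single_category:
            dropped[field] = "nearly constant"
            continue
        kept.append(field)
    return kept, dropped


# Invented sample data, purely for demonstration.
rows = [
    {"age": 34, "region": "north", "fax": None, "country": "US"},
    {"age": 51, "region": "south", "fax": None, "country": "US"},
    {"age": 29, "region": "north", "fax": "yes", "country": "US"},
    {"age": 42, "region": "west", "fax": None, "country": "US"},
]

kept, dropped = screen_fields(rows)
print(kept)     # fields that survive screening
print(dropped)  # fields removed, with the reason for each
```

With this sample, “fax” is dropped for being mostly missing and “country” for being constant, while “age” and “region” are kept. Returning the reason alongside each dropped field loosely mirrors the idea of reviewing a recommendation before accepting it.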
To run ADP interactively, simply choose Transform, then Prepare Data for Modeling, then Interactive…
The “Interactive Data Preparation” dialog is displayed:
The first tab asks for an objective that controls the default settings. The options are:
- Balance speed & accuracy
- Optimize for speed
- Optimize for accuracy
- Customize
Each of the objectives will yield different results. It is recommended that each of the options be explored and understood before attempting to select the option that might be best for your data. The online help tells us:
- Balance speed & accuracy creates fields usable in modeling from dates, and may transform continuous fields like “reside” to make them more normally distributed.
- Optimize for accuracy creates some extra fields from dates. It also checks for outliers and, if the target is continuous, may transform it to make it more normally distributed.
- Optimize for speed does not prepare dates and does not rescale continuous fields, but does merge categories of categorical predictors and bin continuous predictors when the target is categorical (and performs feature selection and construction when the target is continuous).
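Two of the ideas in that help text, creating usable fields from dates and merging categories, are easy to picture with a small Python sketch. This is only an illustration under my own assumptions: the helper names `months_between` and `merge_sparse_categories`, the 10% cutoff, and the sample values are hypothetical, and simple frequency-based lumping stands in for whatever rules SPSS actually applies.

```python
from collections import Counter
from datetime import date


def months_between(start, reference):
    """Whole months elapsed from start to reference (a date-derived field)."""
    months = (reference.year - start.year) * 12 + (reference.month - start.month)
    # Do not count the final month if it has not been completed yet.
    if reference.day < start.day:
        months -= 1
    return months


def merge_sparse_categories(values, min_share=0.10, other_label="other"):
    """Lump categories rarer than min_share into a single catch-all label."""
    counts = Counter(values)
    total = len(values)
    return [v if counts[v] / total >= min_share else other_label for v in values]


# A raw date becomes a duration a model can use directly.
tenure = months_between(date(2010, 3, 15), date(2012, 1, 1))
print(tenure)

# Invented plan names: two rare categories collapse into "other".
plans = ["basic"] * 12 + ["plus"] * 6 + ["gold"] + ["trial"]
merged = merge_sparse_categories(plans)
print(Counter(merged))
```

The point of both helpers is the same: turn a field that a model cannot use well (a raw date, a category with a handful of cases) into one that it can.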
Of course, ADP runs its analysis using its “best guess” fields and settings based upon what it “sees” in your file, but as you become more experienced, you may want to “override” the default choices and select and set your own.
From the “Analysis” tab, you can review both tabular and graphical output that summarizes the processing of your data and displays recommendations as to how the data may be modified or improved for scoring. You can then review and either accept or reject those recommendations.
Finally (of course), all of the results of the analysis and automated data preparation can be saved to a PMML file!
Butch Cassidy: [to Sundance] Boy, I got vision, and the rest of the world wears bifocals.