The power of SPSS allows the data scientist or predictive modeler to consume large data volumes. This data may come in smaller, manageable subsets or in possibly huge “data ponds”. Depending upon the procedures you will be performing in your analysis, SPSS may reread the entire data set for each procedure. Of course, procedures that change data require a certain amount of temporary disk space to keep track of the changes, and some actions even require enough disk space for at least one entire copy of that data.
To Virtualize or Not
The “virtual active file” feature of IBM SPSS enables you to work with very large data files without requiring equally large (or larger) amounts of “working” disk space. You can also define a “data cache” which, in some instances, can improve performance. The key is to understand how SPSS will deal with the data for the procedure you wish to perform.
Actions that don’t require any temporary disk space include:
• Reading SPSS Statistics data files
• Merging two or more SPSS Statistics data files
• Reading database tables with the Database Wizard
• Merging SPSS Statistics data files with database tables
• Running procedures that read data (for example, Frequencies, Crosstabs, Explore)
Actions that create one or more columns of data in temporary disk space include:
• Computing new variables
• Recoding existing variables
• Running procedures that create or modify variables (for example, saving predicted values in Linear Regression)
Actions that create an entire copy of the data file in temporary disk space include:
• Reading Excel files
• Running procedures that sort data (for example, Sort Cases, Split File)
• Reading data with GET TRANSLATE or DATA LIST commands
• Using the Cache Data facility or the CACHE command
• Launching other applications from SPSS Statistics that read the data file (for example, AnswerTree, DecisionTime)
Actions that create an entire copy of the data file by default:
• Reading databases with the Database Wizard
• Reading text files with the Text Wizard
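The categories above can be turned into a rough back-of-the-envelope estimate. The following Python sketch is illustrative only: the function name `temp_space_bytes` and the figure of 8 bytes per value (an uncompressed numeric value) are assumptions made for the example, and the temporary space SPSS actually uses will vary with variable types and settings.

```python
# Rough estimate of temporary disk space, following the categories above.
# Assumption: every value occupies 8 bytes; real usage will differ.
BYTES_PER_VALUE = 8


def temp_space_bytes(n_cases, n_variables, new_columns=0, full_copy=False):
    """Return an approximate number of bytes of temporary disk space.

    new_columns: columns created or modified (compute, recode, saved predictions).
    full_copy:   True for actions that copy the whole file (sort, cache, Excel read).
    """
    if n_cases < 0 or n_variables < 0 or new_columns < 0:
        raise ValueError("counts must be non-negative")
    column_bytes = n_cases * new_columns * BYTES_PER_VALUE
    copy_bytes = n_cases * n_variables * BYTES_PER_VALUE if full_copy else 0
    return column_bytes + copy_bytes


if __name__ == "__main__":
    cases, variables = 10_000_000, 50
    print(temp_space_bytes(cases, variables))                  # read-only procedure
    print(temp_space_bytes(cases, variables, new_columns=2))   # two computed variables
    print(temp_space_bytes(cases, variables, full_copy=True))  # sort or cache
```

Under these assumptions, a read-only procedure needs no temporary space, each new column costs one value per case, and a full copy costs one value per case per variable.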
Creating your Cache
Although virtualizing a file can vastly reduce the amount of temporary space required for processing, the absence of a temporary copy of the “active” file does mean that the original data source must be reread for each procedure you perform in SPSS. For larger data files read from an external source, creating a temporary copy of the data may improve your performance. If you have the luxury of sufficient disk space (local or remote), you can eliminate multiple reads and improve processing time by creating a data cache of the file.
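To get a feel for when a cache pays off, the trade-off can be sketched as simple arithmetic. The Python example below is a simplified model, not a description of how SPSS schedules its reads: the function name `total_read_seconds`, the throughput figures, and the assumption that building the cache costs one read from the source plus one write to temporary disk are all hypothetical.

```python
def total_read_seconds(n_procedures, file_mb, source_mb_per_s, temp_mb_per_s, use_cache):
    """Estimate the total time spent moving data for a run of procedures.

    Without a cache, every procedure rereads the file from the original source.
    With a cache, the file is read from the source once, written to temporary
    disk once, and then every procedure reads the temporary copy.
    """
    source_pass = file_mb / source_mb_per_s  # one full read from the source
    temp_pass = file_mb / temp_mb_per_s      # one full read or write on temp disk
    if not use_cache:
        return n_procedures * source_pass
    build_cache = source_pass + temp_pass
    return build_cache + n_procedures * temp_pass


if __name__ == "__main__":
    # Hypothetical numbers: a 20 GB table on a slow remote source, fast local disk.
    for procedures in (1, 2, 5):
        without = total_read_seconds(procedures, 20000, 100, 500, use_cache=False)
        cached = total_read_seconds(procedures, 20000, 100, 500, use_cache=True)
        print(procedures, without, cached)
```

In this model the cache costs more for a single procedure and starts to win as the number of procedures grows. The gap widens the slower the original source is relative to temporary disk.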
To create your cache:
From the menus choose: File > Cache Data…
Click OK or Cache Now.
OK creates a data cache the next time the program reads the data; Cache Now creates the data cache immediately. This option is used if you want to “lock” your data so that it cannot be updated or changed until your processing is complete.
That’s all there is to it!
“From now on, we live in a world where man has walked on the moon. And it’s not a miracle, we just decided to go.” – Jim Lovell, Apollo 13