IBM SPSS Statistics – Data Management Toolset (DMS)
In a recent blog post I listed some of the more helpful “data management tools” offered within IBM SPSS Statistics version 20 (Case Summaries, Replace Missing Values, Transform and Compute, Recode, Select Cases, Sort Cases and Merge Files) and would like to review them today.
These tools can be used to support a best practice approach to data analysis:
- Identification (of the data to be used for analysis)
- Labeling (of the data variables)
- Verification (of the data – based upon label variable assumptions)
Case Summaries
The best statisticians (today’s data scientists) will find it valuable to visually examine their data and its defined variable assumptions. One of the most effective ways to do this is the SPSS Case Summaries option. The Case Summaries command [from the Statistics Viewer select Analyze and then Reports and then Case Summaries] allows you to list an entire data file or a subset of that file, either grouped or in the order of the original data. The “Summarize Cases” dialog lists the variables defined in the data, and you select the ones you wish to summarize. In selecting the variables you can also choose the order in which they appear in the generated output. Several other options let you select and format both the content and structure of the output: you can specify groupings, provide headings and captions for your summaries, and have SPSS report additional statistics for those case summaries, such as minimum, maximum, first, last, and mean.
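To make the idea of a grouped case listing concrete, here is a minimal Python sketch of the same logic. It is not SPSS syntax and not how SPSS implements the feature; the data, the variable names (`group`, `score`) and the `case_summaries` helper are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cases (rows); the variable names are illustrative only.
cases = [
    {"id": 1, "group": "A", "score": 10},
    {"id": 2, "group": "B", "score": 7},
    {"id": 3, "group": "A", "score": 14},
    {"id": 4, "group": "B", "score": 9},
]

def case_summaries(rows, group_var, value_var):
    """List the cases in each group and report a few summary statistics."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_var]].append(row)

    report = {}
    for key, members in grouped.items():
        values = [m[value_var] for m in members]
        report[key] = {
            "cases": members,      # the listed cases, in original file order
            "n": len(values),
            "min": min(values),
            "max": max(values),
            "first": values[0],
            "last": values[-1],
            "mean": mean(values),
        }
    return report

summary = case_summaries(cases, "group", "score")
for key in sorted(summary):
    stats = summary[key]
    print(key, [c["id"] for c in stats["cases"]], stats["n"], stats["mean"])
```

Each group keeps its cases in file order, which is what makes "first" and "last" meaningful as summary statistics.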
Replace Missing Values
In any data there is a very good chance that you will encounter missing values. Missing values can be a pain to deal with, especially in a larger data file. Moreover, missing values may influence the analysis of the data. To resolve this issue, SPSS offers Replace Missing Values [from the Statistics Viewer select Transform and then Replace Missing Values].
Note that SPSS distinguishes between system-missing values and user-missing values: system-missing values are simply omissions in your dataset, whereas a user-missing value is a value that the researcher has specified as missing.
The Replace Missing Values dialog box allows you to create new variables from existing ones, replacing missing values with estimates computed with one of several methods.
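As a rough illustration of what "replacing missing values with an estimate" means, the Python sketch below fills gaps with the mean of the observed values and writes the result to a new variable, leaving the original untouched. The mean is only one possible estimation method, and the data and names (`income`, `income_filled`, `replace_missing_with_mean`) are hypothetical.

```python
from statistics import mean

# Hypothetical variable; None stands in for a missing value.
income = [52000, None, 48000, 61000, None]

def replace_missing_with_mean(values):
    """Return a NEW list with each missing entry replaced by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to estimate from")
    estimate = mean(observed)
    return [estimate if v is None else v for v in values]

# The original variable is left as-is; the estimates go into a new one.
income_filled = replace_missing_with_mean(income)
```

Keeping the original variable alongside the filled one makes it easy to check later how many values were estimated rather than observed.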
Transform and Compute Variable
With most data files you will want SPSS to calculate new values, such as totals, from existing values. You can use Transform and Compute Variable Values [from the Statistics Viewer select Transform and then Compute Variable] to simplify this task. You can:
- Compute values for a variable based on numeric transformations of other variables.
- Compute values for numeric or string (alphanumeric) variables.
- Create new variables or replace the values of existing variables. For new variables, you can also specify the variable type and label.
- Compute values selectively for subsets of data based on logical conditions.
- Use a large variety of built-in functions, including arithmetic functions, statistical functions, distribution functions, and string functions.
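The capabilities listed above can be sketched in a few lines of Python. This is only an analogy for the Compute Variable logic, not SPSS syntax; the `compute` helper and the variables (`q1` to `q3`, `total`, `label`, `high`) are invented for the example.

```python
# Hypothetical cases; q1..q3 are numeric items, name is a string variable.
cases = [
    {"q1": 3, "q2": 4, "q3": 5, "name": "ann"},
    {"q1": 1, "q2": 2, "q3": 2, "name": "bob"},
]

def compute(rows, target, expression, condition=lambda row: True):
    """Set row[target] = expression(row) for every row where condition(row) holds."""
    for row in rows:
        if condition(row):
            row[target] = expression(row)
    return rows

# Numeric transformation of other variables: a simple total.
compute(cases, "total", lambda r: r["q1"] + r["q2"] + r["q3"])

# String variable computed with a built-in string function.
compute(cases, "label", lambda r: r["name"].upper())

# Selective computation: only cases meeting a logical condition get a value.
compute(cases, "high", lambda r: 1, condition=lambda r: r["total"] >= 10)
```

In this sketch, cases that fail the condition simply never receive the `high` key, which plays the role of a missing value for that variable.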
Recode
SPSS Recode can also generate new variables, not by calculating totals from existing values like Transform and Compute, but by dividing existing variables into new categories. Using Recode Values [from the Statistics Viewer select Transform and then Recode into Same or Recode into Different] you can reassign the values of existing variables or collapse ranges of existing values into new values, either in place (Recode into Same) or in a new variable (Recode into Different). There is also an “Automatic Recode” feature that can be used to convert string and numeric values into consecutive integers, as required by some procedures.
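Here is a small Python sketch of both ideas: collapsing ranges into categories in a new variable, and turning distinct values into consecutive integers. The helper names, the age bands and the choice to number values in sorted order are assumptions of this sketch, not a description of SPSS internals.

```python
# Hypothetical variables.
ages = [23, 35, 51, 67, 35]
regions = ["west", "east", "north", "east", "west"]

def recode_ranges(values, ranges):
    """Map each value to the code of the first (low, high, code) range containing it.

    Values that fall in no range are returned as None.
    """
    def code_for(value):
        for low, high, code in ranges:
            if low <= value <= high:
                return code
        return None
    return [code_for(v) for v in values]

def auto_recode(values):
    """Assign consecutive integers 1..k to the distinct values, in sorted order."""
    codes = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [codes[v] for v in values]

# "Recode into Different": the original ages list is untouched.
age_group = recode_ranges(ages, [(0, 29, 1), (30, 49, 2), (50, 120, 3)])

# String categories converted to consecutive integer codes.
region_code = auto_recode(regions)
```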
Select Cases
Select Cases [from the Statistics Viewer select Data and then Select Cases] lets you conduct your analysis on selected subsets of the data file. It provides several methods for selecting a subgroup of cases based on variables and expressions (you can also select a random sample of cases). The criteria used to define a subgroup can include:
• Variable values and ranges
• Date and time ranges
• Case (row) numbers
• Arithmetic expressions
• Logical expressions
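The selection criteria above boil down to keeping the rows for which some test is true. A minimal Python sketch, with made-up data and a hypothetical `select_cases` helper, might look like this:

```python
import random

# Hypothetical cases.
cases = [
    {"id": 1, "age": 23, "score": 10},
    {"id": 2, "age": 35, "score": 7},
    {"id": 3, "age": 51, "score": 14},
    {"id": 4, "age": 67, "score": 9},
]

def select_cases(rows, condition):
    """Return only the rows for which condition(row) is true."""
    return [row for row in rows if condition(row)]

# A value range combined with a logical expression.
selected = select_cases(cases, lambda r: 30 <= r["age"] <= 60 and r["score"] > 8)

# Selection by case (row) number: the first two cases.
first_two = cases[:2]

# A random sample of cases; the fixed seed only makes the sketch repeatable.
sample = random.Random(0).sample(cases, k=2)
```

Note that the original list is never modified, so the full data file stays available for later analyses.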
Sort Cases
A typical approach to data verification is to reorganize (or re-sort) the data. SPSS handles this for you with Sort Cases [from the Statistics Viewer select Data and then Sort Cases]. Using this feature you can sort cases (rows) of the data based on the values of one or more sorting variables you select.
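Sorting on more than one variable can be sketched in Python as follows. The data and the `sort_cases` helper are hypothetical; the sketch relies on Python's sort being stable, so sorting by the least significant variable first and the most significant last gives the combined order.

```python
from operator import itemgetter

# Hypothetical cases.
cases = [
    {"id": 1, "group": "B", "score": 7},
    {"id": 2, "group": "A", "score": 10},
    {"id": 3, "group": "B", "score": 9},
    {"id": 4, "group": "A", "score": 14},
]

def sort_cases(rows, keys):
    """Sort rows by several variables.

    keys is a list of (variable, descending) pairs, most significant first.
    """
    result = list(rows)  # work on a copy, leave the input order alone
    for variable, descending in reversed(keys):
        result.sort(key=itemgetter(variable), reverse=descending)
    return result

# Ascending by group, then descending by score within each group.
ordered = sort_cases(cases, [("group", False), ("score", True)])
```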
Merge Files
Sooner or later, you’ll need to combine files into one dataset for analysis. This can be a tedious chore. Fortunately, SPSS simplifies the process with the Merge Files [from the Statistics Viewer select Data and then Merge Files] feature if:
- Your files contain the same cases – just different variables or
- Your files contain the same variables – just different cases.
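The two merge situations can be sketched in Python as stacking rows (more cases with the same variables) and joining columns on a shared key (more variables for the same cases). The file contents, the `id` key and both helper functions are assumptions made for this illustration.

```python
# Hypothetical "files": two with the same variables, one with extra variables.
wave1 = [{"id": 1, "score": 10}, {"id": 2, "score": 7}]
wave2 = [{"id": 3, "score": 14}]
demographics = [{"id": 1, "age": 23}, {"id": 2, "age": 35}, {"id": 3, "age": 51}]

def add_cases(*files):
    """Stack the rows of files that share the same variables."""
    return [dict(row) for rows in files for row in rows]

def add_variables(left, right, key):
    """Attach the variables in `right` to the rows of `left` with the same key value."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup.get(row[key], {})} for row in left]

all_cases = add_cases(wave1, wave2)
merged = add_variables(all_cases, demographics, "id")
```

In the second helper, a case with no match in the other file simply keeps its own variables, so no rows are lost in the join.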
Okay, next time, I promise – into the analysis!