In my previous articles, we looked at all the basic concepts of creating a report in SSRS. We built a simple tabular report and made it presentable. However, tabular and matrix reports are textual reports. Today, we will build a visual report – a pie chart. There are several visual reports one can build using Report Builder:
1.) Column graphs
2.) Line graphs
3.) Shape graphs
4.) Bar graphs
5.) Area graphs
6.) Range graphs
7.) Scatter graphs
8.) Polar graphs
We will focus on building a pie chart report today. A pie chart looks like a pizza, i.e., a circle divided into smaller slices. Each slice represents a set of data belonging to a group. This allows business managers to view each group as a proportion of the whole set of data. Pie charts also allow business users and managers to compare one set of data with another in a single chart.
Let us dive into creating our first visual report – Pie Chart.
Before creating the Pie Chart, let us see how we can divide our data set into different categories. We have primarily three types of products:
2.) Hand sanitizer
3.) Shower gels
Great – the next step is to create a data source and a dataset.
CLEM (the Control Language for Expression Manipulation) is used to build expressions within SPSS Modeler streams. CLEM is used in a number of SPSS Modeler “nodes” (among these are the Select and Derive nodes); you can check the product documentation for the full list.
CLEM expressions are constructed from values, field names, operators, and functions.
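To give a flavor of the kind of logic a CLEM expression encodes, here is a minimal sketch in Python (CLEM has its own syntax, covered in the product documentation; the field names and threshold below are made up for illustration):

```python
# A Derive-node style calculation: create a flag for high-value customers.
# In CLEM this would be a one-line expression built from field names,
# operators, and functions; here is the same logic in Python (the field
# names and the 1000 threshold are hypothetical).
def derive_high_value(record):
    return record["total_purchases"] > 1000 and record["region"] == "East"

record = {"total_purchases": 1250, "region": "East"}
print(derive_high_value(record))  # True -> the value of the derived flag field
```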
CLEM should not be confused with the scripting capability that both SPSS Statistics and SPSS Modeler offer (known as Syntax). Syntax scripting is more of an automation tool, while CLEM is focused on specific data manipulations; however, a subset of the CLEM language can be used when scripting in the user interface, supporting many of the same data manipulations in an automated fashion.
With CLEM expressions, you can do many things, such as compare and evaluate conditions on record fields, derive values for new fields, and work with strings, numbers, dates, and times.
In nodes like Select and Derive, expressions and conditions can be entered directly (by typing them in) or built using the SPSS Expression Builder, which is invoked by clicking the Launch expression builder button in the respective dialog.
The Expression Builder is a powerful tool for constructing expressions; in addition, it can check the validity of your expression in real time.
More on the Expression Builder
Scripting and Expressions in Cognos TM1
In Cognos TM1, the closest thing to the SPSS Expression Builder is its rules editor, which started out fairly rudimentary in form.
In more recent versions, TM1 introduced the “advanced rules editor” (remember, you have to set the TM1p.ini AdvancedRulesEditor = T setting to access this editor) with its specific tools to help you create, manage, and verify TM1 rules. This Rules Editor has a full set of menus for creating, editing, and managing rules. Keyboard shortcuts are provided for the more commonly used menu options.
The advanced version of TM1’s rules editor is similar to the SPSS Expression Builder in that it allows the developer to either enter the expression (the rule) directly (type it in) or “build it” from dialog selections, such as the Insert Function button (which displays the Insert Function dialog, allowing you to select a function and even the function parameters from succeeding dialogs).
If you are a TM1 developer and have written your share of TM1 rules, you will feel comfortable with the concept of using CLEM and the Expression Builder within SPSS Modeler.
Another interesting feature of SPSS Modeler is its built-in ability to sample data. It is pretty typical to have (in one or more files) hundreds of thousands of records to process, and using complete sets of data during testing can take a huge amount of your time and is inefficient in terms of computer processing time and memory.
From a TM1 perspective, think of sampling as “the selection of a subset of records from within a statistical population (a file)”. Keep in mind that the objective isn’t simply to reduce the size of the file but more to create a smaller version of the file that still is representative of the characteristics of the whole population (the whole file).
SPSS Modeler offers two sampling options: simple sampling (for example, the first n records, 1-in-n selection, or a random percentage) and complex sampling (stratified, clustered, and weighted samples).
Sampling in TM1
For the TM1 developer, nothing exists “out of the box” to create samples of data, and you are (as usual) left with your only option: building it yourself using TurboIntegrator (TI) scripts – perhaps using functions such as ItemSkip and Rand. (Of course, to be fair, sampling is not something that TM1 is built for.) The per-record logic of such a script is sketched below.
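As a rough illustration (not actual TI code), here is that per-record sampling logic in Python; the 5% fraction and the record layout are assumptions:

```python
import random

def sample_rows(rows, fraction=0.1, seed=42):
    """Keep roughly `fraction` of the incoming rows, mimicking the
    per-record logic a TI script might apply with Rand/ItemSkip:
    draw a random number per record and skip the record when it
    falls outside the sampling fraction."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [row for row in rows if rng.random() < fraction]

# hypothetical usage: keep ~5% of 100,000 records
rows = list(range(100_000))
subset = sample_rows(rows, fraction=0.05)
print(len(subset))  # roughly 5,000
```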
The Sample Node
Of course, SPSS Modeler features the Sample node (found in the Record Ops palette), which offers various methods of sampling records without any programming or scripting.
The procedure to sample records is straightforward: add a Sample node to your stream and set its sampling options, as described below.
Sample node and the Settings Tab
Using the Settings tab in the Sample node, you can easily set the options for your data sampling, such as whether to include or discard the sampled records, the sampling method, and the random seed.
The Generate button, when clicked, will generate a new random seed value.
Cognos TM1 is a tool that encourages rapid prototyping. Dimensions and Cubes can be snapped together and presented to subject matter experts for review. It is highly recommended that realistic data samples be loaded into prototyped models for best results. Using “made up” data or entire sets of actual data can be ineffectual. A realistic sampling set of data – based on actual data files – would increase the probability that “what you show” is a realistic representation of “what you will ultimately deliver”.
Clearly, SPSS Modeler handles sampling very well.
Consider the scenario where you have to convert information held in a categorical field – found in a transactional source – into a collection of flag fields. For example, suppose you have a file of transactions that includes (among other fields) a customer identifier (“who”) and a product identifier (“what”), where the transactional data indicates purchases of various energy drinks by unique customer IDs.
Aggregation and Flag Fields
What I need is a single record per unique customer showing whether or not that customer purchased each of the energy drinks (True or False – not sales amounts) during a period of time – in effect, one row per customer, with a True/False column per product.
Doing it in TM1
In Cognos TM1, you’d utilize TurboIntegrator to read the file – record by record – aggregating the data (by customer ID) and updating each measure (energy drink) – a reasonably trivial process, but one that still requires lines of script to be written, along the lines of the logic sketched below (assuming you perhaps initialized the period being loaded to all False values).
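The original post showed a TI script at this point; as a stand-in, here is the same aggregate-and-flag logic sketched in Python (the customer IDs and energy-drink names are made up):

```python
from collections import defaultdict

# one transaction per row: (customer_id, product) -- made-up sample data
transactions = [
    ("C001", "RedBolt"), ("C001", "Voltage"),
    ("C002", "RedBolt"), ("C003", "Jolt"), ("C001", "RedBolt"),
]

products = sorted({product for _, product in transactions})

# initialize every customer's flags to False, then set one flag per purchase
flags = defaultdict(lambda: {p: False for p in products})
for customer_id, product in transactions:
    flags[customer_id][product] = True

for customer_id in sorted(flags):
    print(customer_id, flags[customer_id])
# C001 {'Jolt': False, 'RedBolt': True, 'Voltage': True}
# C002 {'Jolt': False, 'RedBolt': True, 'Voltage': False}
# C003 {'Jolt': True, 'RedBolt': False, 'Voltage': False}
```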
In SPSS Modeler, the SetToFlag node (found in the Field Ops palette) enables you to create the flags you need and to aggregate the data (based on grouping, or aggregate key, fields) at the same time – without writing any script.
In order to have the SetToFlag node populated with the values of the categorical field, the field has to be instantiated so that Modeler knows the values for which to create flag fields. (In practice, this means that your data has to be read through a Type node prior to the SetToFlag node.) The procedure to create flag fields from the file is described below.
Editing the SetToFlag
Once you have added your SetToFlag node, you use the Settings tab to select the categorical field, choose which of its values become flag fields, set the True and False values, and (optionally) specify the key fields to aggregate by.
If you go back to the TM1 example, you now have a cube loaded that you can view by customer and period to see which products each customer purchased within the period.
In SPSS Modeler, the output from the SetToFlag node is similar.
Next up – Data Sampling Patterns!
Transactional datasets (especially those originating from databases) may contain duplicate records that must be removed before any modeling can begin. There are two situations in which duplicate records typically occur: records that are fully identical on all fields (errors in the data), and records that are duplicated only on a subset of fields – for example, one record per account opened by the same customer.
The truth is, identical records (as errors in the data) should be removed. The question really is how to deal with the second type of duplicate records.
How you handle the duplicates depends on the objective. If your interest lies in the most recent account that is opened (no matter the type of the account) then only the record with the most recent date of opening has to be retained and the accounts opened on previous dates can be discarded.
In the second situation, the records are not duplicated in the sense that all values on all fields are identical; rather, they are duplicated only on a subset of fields (here, the customer ID).
Distincting: Key Fields
To remove duplicate records, key fields need to be specified to define which records are identical. Records with the same values on the key fields are treated as duplicate records. In the example above this means that if all fields are specified as key, then identical records will be identified and the data can be cleansed. To accommodate for the second situation, retaining only the most recent record for the customer, only customer_id should be specified as key field (assuming the data are sorted on “open date” within customer_id).
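A minimal Python sketch of this keyed de-duplication, assuming (as above) the data are sorted most-recent-first within customer_id; the field names and account records are hypothetical:

```python
def distinct(rows, key_fields):
    """Pass along only the first record seen for each key-field combination."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

# hypothetical account records, sorted most-recent-first within customer_id
rows = [
    {"customer_id": 1, "account": "savings", "open_date": "2015-06-01"},
    {"customer_id": 1, "account": "checking", "open_date": "2014-01-15"},
    {"customer_id": 2, "account": "checking", "open_date": "2015-03-10"},
]
print(distinct(rows, ["customer_id"]))  # keeps the most recent record per customer
```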
Distincting with the Distinct Node
The SPSS Modeler Distinct node (Record Ops palette) checks for duplicate records and passes either the first distinct record, or all but the first record, along the stream.
The Distinct Dialog
You will see that the Distinct node has two tabs, Settings and Optimization. To set the options for duplicate removal, all you do is choose whether to include or discard the first record in each group and specify the key fields that define which records count as duplicates.
It really is that easy!
In TM1, missing data usually means that there is a defect in the logic of your ETL script or you need to check your SQL. In SPSS Modeler, missing values arise for a variety of reasons and they must be considered carefully.
You might expect that missing values imply errors, or that those records should be thrown away, but that is not always the case. In fact, (in the SPSS world) analysts have found that missing values can be useful information.
What to do
Assuming you TM1 developers have your ETL and SQL under control, we can focus on SPSS. When your data contains missing values, the first step is to assess the type and amount of missing values (for each field) in your file; the second step is to determine how to handle them.
Modeler recognizes two types of missing values: undefined values (displayed as $null$) and empty values (such as an empty string or white space in a string field).
As a TM1 developer, you’ll know that there is a difference in how you handle numeric missing data as opposed to string missing data (but that’s another story for another time, as this discussion is focused on SPSS Modeler).
In Modeler, missing information is “dealt with” by declaring the relevant values as “blanks”. Note: you need to declare your missing values explicitly because, by default, neither of the two types of missing information mentioned above is defined as a blank value by Modeler. For example, if a field sex is checked against the values Male and Female, with the action Abort attached to it, then stream execution is aborted if an empty space is encountered. If you do not want stream execution interrupted in this situation, you need to declare these values as blank values.
Blanks for Strings
String fields with missing values require additional considerations.
Global Knowledge – Introduction to Data Mining: “It is important to distinguish between the normal use of the word ‘blank’ and Modeler’s use of it; to avoid misunderstanding, we use the wording ‘series of spaces’ instead of ‘series of blanks’ here – blanks are predefined codes representing missing data.”
Blanks in Your Analysis
It is important to understand that defining blanks does not necessarily have a direct effect on how Modeler treats that value for a field in all nodes.
For example, if the value 99 is defined as a blank value for a field children (number of children), a histogram will still include the blank value. One way to work around this is to replace blank values with the undefined value ($null$), because that value is excluded from graphical displays.
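A rough sketch of that workaround in Python with pandas, where 99 is a hypothetical missing-value code and NaN plays the role of $null$:

```python
import pandas as pd
import numpy as np

# hypothetical survey data: 99 is the code for "not answered"
df = pd.DataFrame({"children": [0, 2, 1, 99, 3, 99]})

# replace the blank code with an undefined value so that plots and
# summary statistics exclude it (pandas' NaN plays the role of $null$)
df["children"] = df["children"].replace(99, np.nan)

print(df["children"].mean())  # 1.5 -- the 99s no longer distort the mean
```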
It should be emphasized that how you handle your missing data when developing a model has some complications. If you use only valid records to develop a model to predict an outcome, then you cannot use the model successfully on new data unless all the fields in that file have non-missing data.
To define blanks in SPSS Modeler, you can use the Missing column on the Type tab of a Type node. The Type node is a general approach to defining blanks and can be inserted at any point in a stream.
(Once blanks have been defined for a field, an asterisk in the cell in the Missing column indicates that missing values have been defined for that field!)
Numeric v. String
Although the dialogs are the same for numeric and string fields, it is important to note that when we talk about numeric fields we use “blanks”, whereas with strings it’s “missing”.
Data loaded into a TM1 or SPSS model will, in most cases, include files consisting of thousands (or hundreds of thousands) of records. It is not reasonable, given the number of fields and records in files of this size, for you to visually inspect all fields in every record (of every file) for missing or invalid values.
In TM1, data is usually loaded from an EDW or GL system, so (once the initial testing phase has been completed) the probability that incoming data contains unexpected values (should be) somewhat small. However, the data will (most likely) need to be transformed into a format that the TM1 model can use, or is more optimal for the model’s use. Additionally, if data is being manually input to the model (for example a user entering a (sales) forecast value), then inspecting the data for invalid values is required.
With SPSS Modeler, data may come from “sketchier” sources – marketing surveys, multiple files merged into one, or even manually typed-in data. Obviously, the need for auditing and transforming is critical.
Types of Data Violations
Generally (with TM1 or SPSS), three types of data violations may be found during the loading process – for example, a value of the wrong storage type, a numeric value outside its legal range, or a categorical value that is not in the list of valid values.
When an invalid value (a value violating one of these rules) is found, one of five possible actions can be taken: nullify the value (replace it with $null$), coerce it to a legal value, discard the record, issue a warning, or abort the stream.
In TM1, this process is more typically a “remapping” from one value to another – for example, translating a product’s code used in one system to that product’s code used in another.
Handling Data Violations
In SPSS Modeler, the Type node enables data checking and transforming. (Checking can also be done using the Types tab in a data source node.) To check and transform, you need to specify 1) what the valid values are and 2) the action to be taken.
In TM1, checking for and handling (transforming) of invalid values is a little more work.
Using TurboIntegrator processes is the best approach for loading and transforming data in a TM1 model, and scripting logic (using predefined functions) is required to evaluate and take action on records being loaded into TM1 – along the lines of the logic sketched below.
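The original post showed a TI script here; as a rough stand-in, the following Python sketch illustrates the check-and-act pattern (the field names, valid values, and chosen actions are hypothetical):

```python
# A rough sketch of the per-record checks a TurboIntegrator script
# would perform (field names and the valid values are made up).
VALID_PRODUCTS = {"P100", "P200", "P300"}

def check_record(record):
    """Return a cleaned record, or None to discard it."""
    # rule 1: product code must be a known value -- otherwise discard
    if record["product"] not in VALID_PRODUCTS:
        return None                       # action: discard the record
    # rule 2: quantity must be numeric and within a legal range
    qty = record["quantity"]
    if not isinstance(qty, (int, float)):
        record["quantity"] = 0            # action: coerce to a legal value
    elif qty < 0:
        record["quantity"] = None         # action: nullify the bad value
    return record

rows = [{"product": "P100", "quantity": 5},
        {"product": "XXX",  "quantity": 2},
        {"product": "P200", "quantity": -1}]
cleaned = [r for r in (check_record(row) for row in rows) if r is not None]
print(cleaned)
```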
In both tools you have similar objectives – load data, ensure that the data is “usable” and, if it is not, perform an appropriate action (rather than break the model!). SPSS Modeler allows you to do some checking and transforming by selecting values in dialogs, while Cognos TM1 requires you to use TI scripting to accomplish these basic operations. Both can be straightforward or complex, and both can be automated and reused on future datasets.
SPSS Data Audit Node
Once data is imported into SPSS Modeler, the next step is to explore the data and to become “thoroughly acquainted” with its characteristics. Most (if not all) data will contain problems or errors such as missing information and/or invalid values. Before any real work can be done using this data, you must assess its quality (the higher the quality, the more accurate the predictions).
Addressing issues of data quality
Fortunately, SPSS Modeler makes it (almost too) easy! Modeler provides several nodes that can be used for our integrity investigation. Here are a couple of things even a TM1 guy can do.
Auditing the data
After importing the data, do a preview to make sure the import worked and things “look okay”.
In my previous blog I talked about a college using predictive analytics to predict which students might or might not graduate on time, based upon their involvement in athletics or other activities.
From the Variable File Source node, it was easy to have a quick look at the imported file and verify that the import worked.
Another useful option is to run a table. This will show whether field values make sense (for example, whether a field like age contains numeric values and no string values). The Table node is cool – after dropping it into my stream and connecting my source node to it, I can open it up and click Run (to see all of my data nicely fit into a “database-like” table), or I can do some filtering using the real-time “expression builder”.
The Expression Builder lets me see all of the fields in my file, along with their level of measurement (shown as Type) and their Storage (integer, real, string). It also gives me the ability to select from SPSS predefined functions and logical operators to create a query expression to run on my data. Here I wanted to highlight all students in the file that graduated “on time”.
You can see the possibilities that the Table node provides – but of course it is not practical to visually inspect thousands of records. A better alternative is the Data Audit node.
The Data Audit node is used to study the characteristics of each field. For continuous fields, minimum and maximum values are displayed. This makes it easy to detect out of range values.
Our old pal measurement level
Remember measurement level (a field’s “use” or “purpose”)? Well, the Data Audit node reports different statistics and graphs, depending on the measurement level of the fields in your data.
For categorical fields, the data audit node reports the number of unique values (the number of categories).
For continuous fields, the minimum, maximum, mean, standard deviation (indicating the spread in the distribution), and skewness (a measure of the asymmetry of a distribution; a symmetric distribution has a skewness of 0) are reported. (A quick sketch of these statistics appears after this list.)
For typeless fields, no statistics are produced.
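As a quick illustration of the statistics reported for a continuous field, here is a small Python sketch using made-up values (the skewness formula used is one common definition – the average cubed standardized deviation):

```python
import statistics

values = [23, 25, 31, 44, 52, 61, 19, 38]  # hypothetical "age" field

mean = statistics.mean(values)
stdev = statistics.stdev(values)
# skewness: average cubed standardized deviation (0 for a symmetric distribution)
skew = sum(((x - mean) / stdev) ** 3 for x in values) / len(values)

print(min(values), max(values), round(mean, 1), round(stdev, 1), round(skew, 2))
```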
“Distribution” or “Histogram”?
The Data Audit node also produces different graphs for each field in your file (again based upon the field’s level of measurement); no graphs are produced for typeless fields.
For a categorical field (like “gender”) the Data Audit Node will display a distribution graph and for a continuous field (for example “household income”) it will display a histogram graph.
So back to my college’s example, I added an audit node to my stream and took a look at the results.
First, I excluded the “ID” field (it is just a unique student identification number and has no real meaning for the audit node). Most of the fields in my example (gender, income category, athlete, activities and graduate on time) are qualified as “Categorical” so the audit node generated distribution graphs, but the field “household income” is a “Continuous” field, so a histogram was created for it (along with the meaningful statistics like Min, Max, Mean, etc.).
Another awesome feature – if you click on the generated graphs, SPSS will give you a close up of the graph along with totals, values and labels.
I’ve talked before about the importance of understanding field measurement levels. The fact that the statistics and chart types the Data Audit node generates are derived from the measurement level is another illustration of Modeler’s approach: measurement level determines the output.
In TM1, you may be used to “integer or string”; in SPSS Modeler, data gets much more interesting. In fact, you will need to be familiar with a concept known as “field measurement level” and the practice of “data instantiation”.
In TM1, data is transformed by aggregation, multiplication or division, concatenation or translation, and so on, all based on the “type” of the data (meaning the way it is stored). With SPSS, the storage of a field is one thing, but the use of the field (in data preparation and in modeling) is another. For example, take numeric fields such as “age” and “zip code”: age has “meaning”, and a statistic like mean age makes sense, while zip code is just a code representing a geographical area, so a mean makes no sense for that field.
So, considering the intended use of a field, one needs the concept of measurement level. In SPSS, the results absolutely depend on correctly setting a field’s measurement level.
Measurement Levels in Modeler
SPSS Modeler defines five measurement levels. They are:
Flag – describes a field with only 2 categories – for example, male/female.
Nominal – a field with more than 2 categories, where the categories cannot be ranked. A simple example might be “region”.
Ordinal – a field with more than 2 categories, where the categories represent ordered information – perhaps an “income category” (low, medium, or high).
Continuous – used to describe simple numeric values (integer or real) such as “age” or “years of employment”.
Typeless – for everything else; fields that do not conform to any of the other types, like a customer ID or account number.
Along with the idea of setting measurement levels for all fields in a data file, comes the notion of Instantiation.
In SPSS Modeler, the process of specifying information such as measurement level (and appropriate values) for a field is called instantiation.
Data consumed by SPSS Modeler qualifies all fields as one of three kinds:
Fields with a totally unknown measurement level are considered un-instantiated. Fields are referred to as partially instantiated if there is some information about how they are stored (string or numeric, or whether the fields are Categorical or Continuous), but we do not have all the information. When all the details about a field are known, including the measurement level and values, it is considered fully instantiated (and Flag, Nominal, Ordinal, or Continuous is displayed with the field by SPSS).
It’s a Setup
Just as TM1’s TurboIntegrator “guesses” a field’s (storage) type and use (contents, to TM1 developers) based upon a specified field’s value (of course, you can override these guesses), SPSS data source nodes will initially assign a measurement level to each field in the data source file for you – based upon its storage value (again, these can be overridden). Integer, real, and date fields will be assigned a measurement level of Continuous, while strings are assigned a measurement level of Categorical.
This is the easiest method for defining measurement levels – allowing Modeler to “autotype” by passing data through the source node and then manually reviewing and editing any incorrect measurement levels, resulting in a fully instantiated data file.
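A toy Python sketch of that autotyping rule – this mimics the simplified mapping described above, not Modeler’s actual implementation:

```python
from datetime import date

def autotype(value):
    """Guess an initial measurement level from a field's storage,
    following the simplified rule above: numeric/date -> Continuous,
    string -> Categorical (to be refined once the values are known)."""
    if isinstance(value, (int, float, date)):
        return "Continuous"
    return "Categorical"

sample = {"age": 34, "zip_code": "90210", "hired": date(2012, 5, 1)}
print({field: autotype(v) for field, v in sample.items()})
# {'age': 'Continuous', 'zip_code': 'Categorical', 'hired': 'Continuous'}
```

Note that a zip code stored as a number would be wrongly typed as Continuous by such a rule, which is exactly why the manual review step matters.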
If you have a TM1 background, it is a quick step to using SPSS Modeler – if you look for similarities in how the tools handle certain tasks, like, for example, importing data.
With TM1, source data is transformed and loaded into cube structures for consolidation, modeling and reporting using its ETL tool TurboIntegrator. In SPSS Modeler, source data is loaded and transformed through a “logic stream of nodes” for modeling and reporting.
Here is a closer look:
Cognos TM1 uses TurboIntegrator as its data import mechanism. TurboIntegrator (referred to as “TI”) is a programming or scripting tool that allows you to automate data importation into a TM1 application. Scripts built with TI can be saved, edited and, through the use of chores, set up to run at regular intervals.
Through the use of TI’s Data Source feature, you can import data or information external to TM1, such as comma-delimited text files, ODBC relational data sources, other cubes and views, and ODBO data sources.
Rather than using scripts, SPSS Modeler utilizes data import nodes, which are all found on the SPSS Sources palette.