Perficient Enterprise Information Solutions Blog

Archive for April, 2012

Dummy Coding with IBM SPSS

To understand what is meant by dummy coding, you need to understand 2 forms of data:

Qualitative or Quantitative?

“Qualitative data describes items in terms of some quality or categorization while Quantitative data are described in terms of quantity (and in which a range of numerical values are used without implying that a particular numerical value refers to a particular distinct category).” To better understand the differences, always remember that qualitative data is more of an observation, while quantitative data is measurable.

 

Your Morning Latte…

So!

If we consider a morning latte example, we might note the following:

 

 

Qualitative Examples

  • robust aroma
  • frothy appearance
  • strong taste
  • burgundy cup

Quantitative Examples

  • 12 ounces of latte
  • serving temperature of 150° F
  • serving cup 7 inches in height
  • cost of $4.95

Statistical analysis often includes variables in which the numbers represent qualitative categories (such as gender, ethnicity, or political affiliation).

Including these variables in an analytical model requires special steps to ensure the results can be interpreted properly. These steps involve coding a categorical variable into multiple dichotomous variables, each of which takes the value of “1” or zero.

For clarity, a dichotomous variable is defined as a variable that splits or groups data into 2 distinct categories. An example would be employed and unemployed.

This process is known as “dummy coding.” IBM SPSS makes dummy coding a straightforward exercise. Let’s walk through the steps!

  1. Select the categorical variable that you want to dummy code. (Note the number of categories, remembering that dummy coding transforms a variable with “n” categories into “n-1” dichotomous variables. For example, a categorical variable on political affiliation with three categories — Democrat, Republican and Independent — would be dummy coded into two dichotomous variables, such as Democrat and Republican. A person who identifies as one of these would be coded a “1” in the data set. A person with a zero in both of these variables would be counted as Independent.)
  2. Click the “Transform” menu at the top of the SPSS data sheet, then select “Recode Into Different Variable,” because you will transform the categorical variable into one or more dichotomous or dummy variables. This opens a window that displays the variables in your data set. Select the variable you want to recode, and then click the arrow, which moves the variable name into the box labeled “Numeric Variable.”
  3. Click the “Output Variable” name box and type a name for your new dichotomous variable. Click “Change.” Click “Old and New Values,” which opens a new display, showing old and new values for the variable you want to transform.
  4. Recode the values of the variable by coding one category as a “1” and the others as zero. Under “Old Value,” enter the category value to be recoded. Under “New Value,” type a “1,” then click “Add.” On the “Old Value” side, select the “All Other Values” button and type “0” as the new value. For example, the political affiliation example that codes Democrat as a “1,” Republican as a “2” and Independent as a “3” could be recoded into the dichotomous variable Republican, with all “2s” recoded as “1” and all other values coded as zero. Click “Continue” after entering the old and new values for your dummy codes, then click “OK.” SPSS will then recode the categorical variable as you have specified.
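
For readers who want to see the same idea outside of SPSS, here is a minimal sketch in Python using pandas (a rough, illustrative equivalent of the recode above, not the SPSS procedure itself; the data and column names are hypothetical):

    import pandas as pd

    # Hypothetical data: political affiliation coded 1 = Democrat, 2 = Republican, 3 = Independent
    df = pd.DataFrame({"affiliation": [1, 2, 3, 2, 1, 3]})

    # Equivalent of "Recode Into Different Variable": Republican (2) becomes 1, everything else 0
    df["republican"] = (df["affiliation"] == 2).astype(int)

    # Or let pandas build all n-1 dummy variables at once, dropping one reference category (Democrat)
    labels = df["affiliation"].map({1: "democrat", 2: "republican", 3: "independent"})
    dummies = pd.get_dummies(labels, prefix="is", drop_first=True).astype(int)
    print(pd.concat([df, dummies], axis=1))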

 

Done Deal!

Iterative BI – What’s the Difference?

Recently I was in a conversation where a PM declared “Agile’s just waterfall really fast – we can do that no problem!”  Uh oh.

Like (most) everything, delivery methodologies are subject to fashion and trend, and Agile/Scrum/Kanban and the like are en vogue. Collectively, I’ll refer to these highly cyclic methodologies as “iterative” or (little a) agile development. My interest being BI, I’ll take a little time to discuss how these iterative delivery methods impact your BI delivery processes.

Generally, iterative development does a number of things to your teams.  When operating effectively, it (among other things):

  • Brings your users much closer to the development process.
  • Multiplies the number of builds/deployments you do by a factor of LOTS (probably 10-20).
  • Multiplies the number of tests (esp. regressions) required.
  • Makes juggling project tasks more complex by putting many more “balls” in the air.
  • Eliminates the formality (and safety) of predefined scope and quality gates.

Read the rest of this post »

Cognos TM1 Attributes – what are they and what can they do for me?

In this post I’d like to talk about attributes – let me begin:

Elements have a type (numeric, consolidation, or string), and they can also have attributes defined and assigned to them. What is an attribute, you ask? Well, if elements identify data in your cube, then you can think of element attributes as describing the elements themselves; it’s that simple.

For example, let’s say that some of your users would like to display an account using the account name followed by the account number. Other users would like to display only the account name. You can define an alias attribute for each of these requirements. In fact, you can define as many alias attributes as you need.

 

I can have an account number of “01-0000-00001” and define multiple aliases so that the user can view the element as any of the following:

  • “01-0000-00001”
  • “01-0000-00001 – Long Surfboard”
  • “Long Surfboard”

Some interesting uses for attributes include:

  • To define features of elements. For example, an employee may have attributes that include “title”, “hire date”, or “department”.
  • To provide alternative or “friendly” names, or aliases. For example, an accounting code of “02-0000-00001” may have an alias of “salary and wages”.
  • To control the display format for numeric data.

An alias attribute may also be used to present data in different languages!

You can also select elements by attribute value in the Subset Editor and display element names in TM1 windows using their aliases.

Creating an Attribute Is Easy!

To create attributes and assign attribute values, you use the TM1 Attributes Editor.

When adding attributes using the Attributes Editor, you will notice that it “defaults” to an attribute type of string, so be careful to select the desired type before proceeding. In most cases, I’ve used TurboIntegrator processes to add, update, and delete my dimension attributes, using the following programming functions:

AttrInsert: To add a new attribute

AttrPutN or AttrPutS: To update an attribute value, which can be numeric (AttrPutN) or string (AttrPutS)

AttrDelete: To remove an existing attribute

A key point to know: if you try to view or update dimension attributes when there are a large number of elements in the dimension, you will receive a warning message when you open the Attributes Editor.

 

Do not continue!

You can potentially lock your TM1 session for a long time.

The alternative is to access the attributes of this dimension through the attributes cube instead, as it is much faster:

  1.  Select View | Display Control Objects.
  2.  Open the cube called }ElementAttributes_dimension.
  3.  Modify the required fields like in any cube!

This is another important point: the }ElementAttributes cubes are known as Cognos TM1 control cubes, and they are automatically generated by TM1. As a clever developer, you can either create a new (lookup) cube or use a control cube to look up data, depending upon your needs.

Some key attribute terminology and concepts you should be able to recognize are descriptive attributes, alias attributes, and when to use an attribute versus an additional element. Let’s briefly touch on these.

Descriptive attributes!

Descriptive attributes are simply data that describe the data. For example, consider the attributes you might use when selecting a surfboard, such as its length.

Alias attributes!

These attributes provide alternative names for elements, like the account aliases shown earlier.

It can be tempting to add many attributes to describe your elements, and in most cases this is fine, since you can filter your data by attribute value. However, you should consider how your data is going to be used and presented. Sometimes it is more appropriate to create elements rather than attributes, and sometimes even additional dimensions.

For example, board length is an attribute of surfboard models, and the 6.6 boards often outsell boards of other lengths. If you create one element per board and another dimension with elements for each length, you can use TM1 to track surfboard sales by the length of the board. If you combine sales across all board lengths, you might lose valuable detail.

Display format attributes!

An astute use for element attributes is formatting what is displayed in the Cube Viewer. When you create a dimension, an attribute named Format is created for you automatically by TM1. This attribute can be used to set a display format for each individual numeric element.

Display format attributes can also be set programmatically with a TurboIntegrator process using the AttrPutS function. Just remember to add the c: to the format string to indicate that it is a custom format:

AttrPutS('c:###,###.00', myDimensionName, myElementName, 'Format');   # assign a custom numeric display format to a single element

Referring to the IBM Cognos documentation, you will find the full list of formats in which numeric data can be displayed.

The Cube Viewer determines which display format to use in the following order:

  1. Elements in the column dimension are checked for display formats.
  2. Elements in the row dimension are checked for display formats.
  3. Elements in the title dimensions are checked for display formats (left to right).

If none of these elements has a display format, the current view formatting is used.

Well, I hope this post is useful to you.

 

“If you’re doing something for the right reasons, nothing can stop you” – Duncan Me.

Chi-Squared Challenging using SPSS

A Chi-Square Challenge (or Test) procedure organizes your data pond variables into groups and computes a chi-square statistic. Here is the specific definition:

“The chi-square (chi, the Greek letter pronounced “kye”) statistic is a statistical technique used to determine if a “distribution of observed frequencies” differs from the “theoretical expected frequencies”…

Okay, this is a pretty clever explanation; however, it really just means “does what I see match what I thought I’d see?”

For example, the chi-square test could be used to determine whether a box of crayons contains equal quantities of blue, brown, green, orange, red, and yellow.
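
Before turning to SPSS, here is a minimal sketch of that crayon example in Python using scipy (illustrative only; the observed counts are made up):

    from scipy.stats import chisquare

    # Hypothetical observed counts of crayons by color: blue, brown, green, orange, red, yellow
    observed = [18, 12, 15, 10, 14, 11]

    # Null hypothesis: all six colors are equally common (uniform expected frequencies by default)
    stat, p_value = chisquare(observed)
    print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
    # A small p-value would suggest the observed mix differs from an equal split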

Using IBM SPSS you can run your chi-square test by selecting from the menus:

Analyze > Nonparametric Tests > Legacy Dialogs > Chi-Square…

From there, you can select one or more test variables. Each variable will produce a separate test.

Using a previous blog post as an example, I might want to evaluate the results of a poll conducted on marriage, gender, and an individual’s overall satisfaction with their life. Using SPSS I can determine (for example) that there are 132 total “observations”: 121 male and 12 female. Does my expected ratio of male vs. female align with the actual? Does my assumption hold that married females are significantly more satisfied with their lives than males? And so on…

Once again, SPSS makes statistical analysis easy.

 

 

BI Maturity – Now What?

I spend a lot of time working with BI teams assessing how they’re doing and what they should do next. We often use a BI maturity model as a framework for these assessments. They’re great accelerators and help the teams understand where they fall between “getting started” and “guru”, but they’re not great (by themselves) at helping a development manager figure out what specific things to address next.

Development managers are (nearly) always faced with limited resources and must aggressively prioritize how to spend those resources – on people (FTEs and consulting/contract), on tools, and on “soft” expenses such as training and conferences. They are also faced with deciding how to allocate effort between new projects and maintenance/refactoring work.

Read the rest of this post »

IBM SPSS Split File

You can use the IBM SPSS Split File feature to split your data pond into separate groups for further analysis, based on the values of one or more grouping variables.

If you select multiple grouping variables, cases are grouped by each variable within categories of the preceding variable on the Groups Based On list.

For example, if you selected gender as the first grouping variable and minority as the second grouping variable, your cases will be grouped by minority classification within each gender category.

Good to know:

• You can specify up to eight grouping variables.

• Each eight bytes of a long string variable (string variables longer than eight bytes) counts as a variable toward the limit of eight grouping variables.

• Cases should be sorted by values of the grouping variables and in the same order that variables are listed in the Groups Based On list. If the data file isn’t already sorted, select Sort the file by grouping variables.

In my data pond, I have variables defined indicating marital status (a 0 or a 1) and a retired indicator (again, a 0 or a 1). Using SPSS’s “split file” I chose “Organize output by groups”, selected my two variables (Marital status and Retired) as “Groups Based on”, and then indicated “Sort the file by grouping variables”.

When I click the OK button, the result is a data set split into two groups – single people sorted by their retirement status and married people sorted by their retirement status.

Comparing Your Groups

These “split-files” are really groups (not physically separate files) and are presented together for comparison purposes. But all results from any procedures are displayed separately for each split-file group!

Pivot Tables

A single pivot table is created and each split-file variable can be moved between table dimensions.

 

Charts

A separate chart is created for each split-file group and the charts are displayed together in the Viewer.

Organizing Your Output Using Groups

To split a data pond for analysis, from the menus choose:

1. Data > Split File…

2. Select Compare groups or Organize output by groups.

3. Select one or more grouping variables.

Remember! If your data isn’t already sorted by values of the grouping variables you’ll need to select Sort the file by grouping variables!
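
For comparison only, here is a minimal sketch of the same “organize output by groups” idea in Python using pandas (hypothetical data and column names; this is not the SPSS Split File feature itself):

    import pandas as pd

    # Hypothetical data pond: marital status and a retired indicator, each coded 0 or 1
    df = pd.DataFrame({
        "married": [0, 1, 1, 0, 1, 0],
        "retired": [1, 0, 1, 0, 0, 1],
        "satisfaction": [7, 8, 6, 5, 9, 6],
    })

    # Sort by the grouping variables, then produce output separately for each group
    df = df.sort_values(["married", "retired"])
    for (married, retired), group in df.groupby(["married", "retired"]):
        print(f"married={married}, retired={retired}, "
              f"mean satisfaction={group['satisfaction'].mean():.1f}")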

Simple Inferential Statistics

“Inferential statistics” is a term used to describe the use of information regarding a sample of subjects to make:

(1) Assumptions about the population at large and/or

(2) Predictions about what might happen in the future

 

What’s your Batting Average?

You can calculate the mean (or average) batting average of a known sample of ball players by adding up their batting averages for last season and dividing by the number of players. The mean for this sample is therefore a known value. Determining the mean for a population of those players for next season requires the data scientist to make assumptions (because their numbers of hits are not yet known).
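
As a trivial sketch of the “known” half of that statement in Python (the numbers below are hypothetical):

    # Hypothetical batting averages for a known sample of five players last season
    averages = [0.285, 0.301, 0.267, 0.294, 0.310]
    sample_mean = sum(averages) / len(averages)   # a known, descriptive value
    print(round(sample_mean, 3))                  # next season's mean is the unknown we would infer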

Money ball

The goal of inferential statistics is to do just that:

To take what is known and make assumptions or an inference about what is not known.

The specific procedures used to make inferences about an unknown population or unknown score can vary (depending on the type of data used and the purpose of making the inference).

Basic Procedures

The five most basic inferential procedures include:

  1. T-test
  2. ANOVA
  3. Factor Analysis
  4. Regression Analysis, and
  5. Meta-Analysis.

T-Test

The purpose of a T-test is to determine if a difference exists between the averages of two groups, using the means, standard deviations, and number of subjects for each group.
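
A minimal sketch of that idea in Python using scipy (illustrative only, with hypothetical batting averages for two groups of players):

    from scipy.stats import ttest_ind

    # Hypothetical batting averages for two independent groups
    veterans = [0.285, 0.301, 0.267, 0.294, 0.310, 0.278]
    rookies = [0.252, 0.263, 0.249, 0.271, 0.258, 0.266]

    # Independent-samples t-test: is the difference between the two group means significant?
    stat, p_value = ttest_ind(veterans, rookies)
    print(f"t = {stat:.2f}, p = {p_value:.4f}")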

Factor Analysis

A factor analysis is used when an attempt is being made to break a large data pond down into different subgroups or factors, looking at each question within a group of questions to determine which questions cluster together.

Regression Analysis

When a correlation is used, you can determine the strength and direction of a relationship between two or more variables. For example, if the correlation between a midterm test and a final exam were determined to be +.95, we could say that these two tests are strongly and directly related to each other. (In other words, a student who scored high on one would likely score high on the other.)

When your data pond is much larger and the correlation less than perfect, making a prediction requires the use of statistical regression, which is basically a formula used to determine where a score falls on a straight line.
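
A minimal sketch of that midterm/final example in Python using scipy (the scores below are hypothetical):

    from scipy.stats import linregress

    # Hypothetical midterm and final exam scores for a handful of students
    midterm = [62, 71, 78, 84, 90, 95]
    final = [65, 70, 80, 86, 88, 97]

    # Fit the straight line: final = slope * midterm + intercept
    result = linregress(midterm, final)
    print(f"correlation r = {result.rvalue:.2f}")

    # Predict where a midterm score of 75 would fall on that line
    print(f"predicted final for a midterm of 75: {result.slope * 75 + result.intercept:.1f}")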

Meta-Analysis

“Meta-analysis” refers to the combining of numerous studies into one larger study. When this technique is used, each study becomes one subject in the new “meta study”. For instance, the combination of 12 studies on years in the league and batting averages would result in a Meta study with 12 subjects. “Meta-analysis” combines many studies together to determine if the results of all of them, when taken as a whole, are significant.

Play Ball!

Hugo Cabret: It’s like a puzzle. When you put it together, something’s going to happen.

 

BI Tools – Scheduling and System Automation

Way back at my first full-time job, maintaining the system automation program was one of my primary duties. It was written in C, could launch one thing at a time, and documented dependencies in a cryptic text format. And I loved it, because the alternatives were way worse.

Fast forward to today where system automation has become a non-issue in most ways.  The tools are robust, handle a laundry list of scenarios, and have all kinds of notification options including Twitter.  This one’s going to be short and sweet.

Here’s what I want in an automation tool:

  1. Execute jobs locally and remotely across all machine instances in my environment.
  2. Handle dependencies triggered by internal jobs or by any number of external factors including time, file availability, service availability, web service query result, phase of the moon, etc.
  3. Notify me in any way I can dream up, including at least email, issue/ticket, pager/phone, SMS,  and changing the stoplight icon on the BI dashboard.
  4. A nice visualization of the overall state of the system that can be included in some kind of operations dashboard.
  5. Be scriptable, so scheduling can be deployed in the same way as the remainder of the system.

Easy enough.  Actually, if you’re in an organization of any size, there’s probably an enterprise class tool already out there that you can (should) piggyback on – something like CA Autosys or IBM/Tivoli Workload Scheduler or Cisco Tidal.  If you’ve got one of these, great.

If not, there’s a few options:

  • Let each tool schedule itself. For instance, you might use SQL Agent (SQL Server) for ETL and other data manipulation and Business Objects’ internal scheduler for reports. This works, but doesn’t meet many of the above criteria.
  • Buy one of the above tools or something similar.
  • Go open source with something like TORQUE.
  • Patch together a scheduling system using your platform’s task scheduler (Task Scheduler, cron, etc.) and a build system such as make, msbuild, etc.

This is a tool class where I really hope you don’t have to spend much time or effort. If you can afford it, buy one of the commercial tools and move on to more interesting problems. Otherwise, you’ll end up sinking a lot of time and energy into this rather uninteresting (but nonetheless critical) aspect of your BI system.

BI Tools – Continuous Integration

Automate everything!  The mantra of the real developer – I’d rather spend 2 hours automating a task than 30 minutes of repetitive typing!  Continuous integration finally legitimizes that innate desire to automate during the development process.

Continuous Integration (CI) tools automate the application build process and run automated tests on a regular interval (or on demand as code is checked in). This allows for frequent feedback on some of the basic aspects of quality and can improve confidence in the overall platform’s state as it moves toward a release. In BI, continuous integration is especially helpful in detecting synchronization issues between data models, integration (ETL) code, semantic code (report models/universes), and reports.

Read the rest of this post »

Little Data

Here at the Gartner BI Conference in Los Angeles this week, and everywhere else that I turn for that matter, all I hear about is Big Data this and Big Data that. Even at breakfast this morning, those now all too familiar two words kept popping up in conversations all around me until I thought I was going out of my mind!

But, can I ask a question?  Has everyone forgotten about little data?  For a very long time, that was the only kind of data there was, and it was very important.  In the early days of computing, when there was almost no memory to work with and even the external drives were just tapes or small capacity disks, all anyone even thought about processing, much less storing, was little data.

Data came to you slowly and in small amounts, and it came in very limited forms and from very limited sources.  All of the data was on the old IBM cards that had been keypunched by rows of keypunch typists.  Since everyone knew that only a limited amount of data could be processed, and that it could not be done very quickly, it was generally only at critical points, like an end of a month, quarter or year, that they even expected a lot of data (other than for payroll processing, of course).  Executives and managers made their analysis and decisions based on these rather infrequent reports, and no one expected to know exactly what was going on right now, or yesterday, or maybe even last week.  That would have been impossible.  They would find out in their monthly reports, and that was still faster and more accurate than before computers, so that was still good.

If something happened to your data back then, which it often did, you simply went back to your hard copy paper backup for the information. That paper was all stored somewhere, because you could never really trust the computers to do everything right. And, one of the nicest things of all: the data was only in character format, letters and numbers! No pictures, sounds, videos, phone transmissions, internet downloads, or anything amorphous or not easily translatable. It was a character or a number, and that was the end of it.

Now, today, where is little data? Forgotten, unwanted. No one talks about it, no one cares about it. It doesn’t seem much worth caring about. If it doesn’t come in petabytes or terabytes, how can it be worth anything, anyway?

But, one thing Big Data always needs to remember: if it wasn’t for little data getting everything started, there wouldn’t be any Big Data, either.

By the way, a very happy belated April Fool’s Day!
