Data Mining, proof of concept / Blogs / Perficient

In this article I would like to share my experience exploring SQL 2005 Data Mining (DM). I used different mining algorithms against a test database, trying to predict some of the values in a test dataset using values from correlated fields.

I used Visual Studio 2005 (BI development studio) and Data Mining Add Ins (DM Add-Ins) for MS Office 2007. It turns out that you can accomplish almost all your Data Mining goals using DM Add-Ins, and it even has some tools, like “Explore Data”, “Clear Data”, “Analyze Key Influences”, which VS 2005 does not have.

Introduction

I started exploring Data Mining recently, trying to investigate how we can use it to help our clients to find patterns and correlations in their data and use them to predict the future and make the right decisions.

Data Mining is a part of Business Intelligence (BI) of SQL 2005, but it has nothing to do with the other BI concepts like cubes, dimensions, measures and so on.

You don’t need to understand the OLAP concepts to use the Data Mining; it is a stand alone product.

I don’t know why Microsoft decided to make the DM a part of the Analysis Services. In my opinion, it would make sense to make something like Data Mining services, because to use DM you have to install and start Analysis Services which contains a lot of functions that have nothing to do with DM. It is very possible that some of the users would want to use just Data Mining without creating OLAP cubes, and if you are not using DM but use OLAP cubes, you don’t need DM functions which are embedded in SSAS.

Unlike the relation databases and OLAP cubes the Data Mining algorithms contain some element of “mystic”. What I mean is that for OLAP cubes (and of course for the relational database) you can always predict the result of the aggregated values in any cell of a cube because you know how the data is aggregated, but for most of the Data Mining algorithms the inner works are not known to the user. In general, you choose one algorithm versus another according to your own instinct, intuition or experience; sometimes you may just like a name of the algorithm, for example “Neural Network” or “Logistic regression”. In most cases you will read something like “some very smart guys from Microsoft” built this or that algorithm based on some clever math theory. The bottom line is that the algorithms are very complex and it will take a lot of time to understand how they work. You just need to vaguely know in which cases each algorithm works best.

In all DM samples Microsoft uses Adventure Works DW database. In most cases the samples would use vTargetMail view to predict if a customer is a potential buyer based on customer’s personal information. For example: if you know the customer’s income, number of children, age, education and so on you can try to guess if she is going to buy a bike in the next year.

My task was to figure out if I can use DM in a different scenario: try different algorithms and see how they would work.

I attempted to predict different fields, using different algorithms and finally I managed to make it work pretty well for Adventure Work database. Adventure Work, most probably, is not a real life database, so I tried to find some real life data to prove the concept.

I found a free baseball database from http://www.baseball1.com/statistics/ which I also used to investigate different algorithms and features. I wanted to predict a player’s salary based on his awards, league he belongs to, pitching points and so on. I got some good experience working with this database but I can’t say that I was very satisfied with the results. After looking at the data thoroughly I think that the salary data in the database is not correct. I first found it when using “Explore data” and “Clean data” tools of DM Add in for MS Excel.

In this article most of the time I will use Adventure Work DW database to predict client yearly income. I tried different DM algorithms. I used Visual Studio 2005 (BI studio) and DM Add-Ins for MS Office 2007 which you can download and install for free from http://www.microsoft.com/sql/technologies/dm/addins.mspx .

Bellow I will use mostly DM Add-Ins for Office 2007 if you want to use VS 2005 (BI studio) you will have to perform similar steps and you will get similar results, but in my opinion DM Add-Ins are more user friendly.

Let’s get started

As I mentioned before I used DM Add-Ins and VS 2005 at the same time for the same goal. Both of the tools have similar functionality but VS 2005 is more developer-oriented and DM Add-Ins is more user friendly, and yet DM Add-Ins has some very useful functions which VS 2005 doesn’t have.

I would recommend starting any DM project with DM Add-Ins, which will help you explore and investigate the data and make the decision about which fields of the table to use for prediction.

Prepare the data

As advised in the data mining tutorial, to train and test your model you need to split your data into two parts. I created two views from vTargetMail view: vTargetMail_DM and vTargetMail_test using the WHERE clause. One of the views contains customers with the customerKey<20000 and another with customerKey>20000.

Connect to an analysis services database

To work with the Add-Ins you need to connect to an existing analysis services database. The database can be empty or contain cubes. As I mentioned before, the Data Mining service has nothing to do with the OLAP cubes. If you already have mining models in the database you will be able to see them using “Manage models” command.

For example on Figure 1 I’m connected to DataMining Test analysis services database. All of the data mining models you will create from the Excel will be created in this database.

Figure 1 Data Mining tools.

Import the data

Open a new Excel workbook and create a new worksheet; open a new or existing connection to a relational database or analysis services and fill the worksheet with the data.

Figure 2 Table created from a database connection

Explore and clear the data

After that you can use the features available only in Add-Ins (not in VS 2005), I would start with the “Explore Data” (see Figure 1, the first button).

This feature can be very helpful, for example here is the result of the data exploration of the baseball player salaries:

The IT Leader's Guide to Multicloud Readiness

This guide provides practical key insights and important factors to consider to make informed decisions in your multicloud journey.

Download the Guide

Figure 3 Explore Data result of the baseball player salaries.

As you can see, building a data mining model using such data doesn’t make sense, because the system is trying to use all salaries available including $50,000 and $22000000. The data may be wrong, but even if it is not, we need to delete just a few rows with extremely low and extremely high salaries, which are not typical for the majority of the players, to be able to create a pattern. To clear the data you can use “Clean Data” command (the second button on Figure 1).

Choose the input fields

After exploring and clearing your data you have to make a decision of what information you are going to use to predict the data, click “Analyze” tab (see Figure 1, the second tab strip from the top).

Figure 4 Analyze tabs.

By the way here is a good tip about the Add-Ins: some of the modal windows didn’t make sense, for example:

Figure 5 Analyze Key Influences window initially.

May be it was because of my screen resolution.

What I needed to do is to resize the window:

Figure 6 Analyze Key Influences window after resizing.

Key Influences analysis can give you a clue which fields to use to anticipate the YearlyIncome, after analyzing the results and using common sense I choose:

Age (it is probable that older people earn more than the younger ones),

Education (education should affect the income),

Occupation (definitely should affect the income),

Geography (income is different in different countries),

Gender (I know that it’s wrong, but usually males earn more).

Choose the data mining algorithm

Now is the time to make a decision about which algorithm you are going to use to predict the YearlyIncome of the customers.

In VS 2005 you have seven different algorithms: Decision Tree, Clustering, Association Rules, Logistic Regression, Linear Regression, Naïve Bayes, and Neural Network. As I mentioned before, most of the algorithms are “black boxes”, I tried all of them and Decision Tree worked the best, at least in case of anticipating the yearly income.

Looks like Microsoft encourages you to use the decision tree algorithm; DM Add-Ins has four choices for data modeling (see Figure 1): Classify, Estimate, Cluster, Associate, and Forecast. As you can see it doesn’t show the algorithm’s name directly but as I figured out that Classify and Estimate are both decision tree algorithms. When I used them for the YearlyIncome prediction it created decision tree mining models in both cases, the only difference was that in case of Classify the early income was discrete and in case of Estimate it was continuous.

Running the Data Modeling wizards is straight forward. Just click Classify or Estimate and if you have already made the decisions (see the previous steps) about which data table to use, which field to predict, and which fields to use for prediction, you won’t have any problems going through the steps. It is very important to remember the name of the DM model that will be created; you would better think in advance about some naming conventions, because you will most probably create many mining models for different algorithms and different parameters and it is very easy to get lost. The mining model will be created in the database you connected to in the step 1 (connect to the database). After completing the wizard steps, you will see the decision tree diagram:

Figure 7 Tree view browser, the decision tree diagram

You can try to analyze the diagram, play with it, or you can just close it. In the future if you want to see it again you can always click “Browse” and select the mining model you just created.

Accuracy Chart

To create the accuracy chart of the created model click “Accuracy Chart” button, follow the steps of the wizard.

The accuracy chart for the YearlyIncome looks pretty good:

Figure 8 Accuracy chart for the Yearly Income, Microsoft Tree model

For comparison, here is the accuracy chart for the baseball player salary, as you can see, the results are much worse, though it’s still better than a random guess:

Figure 9 Accuracy chart for baseball player salary, Microsoft Tree model, baseball database (for comparison).

Classification matrix

Here is the classification matrix for YearlyIncome. As you can see, the best results are calculated for the income range 39400-72026.

The prediction results

Click “Query” and follow the wizard’s steps. When asked for the mining model-select the mining model you just created. When asked for the source data-select the test data set you created on step 1 (Prepare the data). The model will generate the YearlyIncome for the test dataset.

Now you can compare the yearly income for the test table and the generated results.

On Figure 10 “Output 12” is the generated results, as you can see the results are pretty close to the actual Yearly Income (the next column on the left).

Figure 10 Generated Yearly Income and the actual Yearly Income in the test data set.

Modification of a mining model

The main disadvantage of DM Add-Ins is that once you create a mining model, you can’t modify it, however you can use VS 2005 to do that. As I mentioned, all mining models you created in DM Add-Ins are stored in the analysis services database you are connected to. In VS 2005 create a new Business Intelligence project from the template “Import Analysis Services 9.0 Database” using the database you are connected to and you will be able to see and modify all of the models you have created.

Conclusion

Comparing two Data Ming tools: VS 2005 (business intelligence studio) and DM Add-Ins, I think that DM Add-Ins has more advantages than VS 2005, and it has a much more clear and user friendly interface.

The main disadvantages of the DM Add-Ins are:

Smaller number of algorithms (but I think Add-In has all of the algorithms you need anyway).
Inability to modify created models.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Data Mining, proof of concept

by Dave Scheele on September 14th, 2007 | ~ minute read