Data Wrangling - Comparing Three Predictive Analytic Techniques

I spend a fair bit of time data wrangling. As I prepare data, I try to keep in mind what the predictive analytic technique needs from me and what it handles on its own. I come back to the same questions when interpreting results. They matter even more when I compare models or build an ensemble. So, I made this one-stop reference.

First, it helps to understand how different techniques handle data irregularities. This simple post collects some things to know. Let’s compare decision trees, linear regression, and neural networks.

Comparison of data wrangling

For each issue below, the entries give examples first, then how decision trees, linear regression, and neural networks each handle it.

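Data Types

  • Examples: categorical vs. continuous; units of measure.
  • Decision trees: continuous variables are binned at the split thresholds.
  • Linear regression: categorical variables must be made continuous (encoded as numbers). Transformations can lessen the sizing (scale) challenge.
  • Neural networks: categorical variables must be made continuous (encoded as numbers). Adaptive normalization can handle inputs that differ by orders of magnitude.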
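To make "categorical variables are made continuous" concrete, here is a minimal sketch using only the Python standard library. One-hot encoding and z-score standardization are just two common choices among several, and the column names and values are made up for illustration.

```python
from statistics import mean, pstdev

def one_hot(values):
    """Map each category to a 0/1 indicator vector (columns in sorted order)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical columns, for illustration only.
colors = ["red", "green", "red", "blue"]
incomes = [20_000, 40_000, 60_000, 80_000]

categories, encoded = one_hot(colors)
scaled = standardize(incomes)
```

After scaling, a column measured in dollars and one measured in years sit on comparable ranges, which matters for linear regression and neural networks. A decision tree splits on thresholds, so this rescaling does not change its splits.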

Missing Values

  • Examples: Missing at Random (MAR), Missing Completely at Random (MCAR), Not Missing at Random (NMAR).
  • Decision trees: tolerate missing values, though there are different ways to deal with them (e.g. sending them to the most popular node or keeping them as a separate bin). Surrogate splitting rules are another option.
  • Linear regression: cannot handle missing values, so you must drop or impute. Dropping NMAR values can create bias. There are many ways to impute.
  • Neural networks: cannot handle missing values, so you must drop or impute. Dropping NMAR values can create bias. There are many ways to impute.
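As one of the "many ways to impute", here is a minimal sketch of median imputation that also keeps a missingness flag. The flag is my own addition: it lets a model still see that a value was absent, which can help when values are not missing at random. The `ages` column is hypothetical.

```python
from statistics import median

def impute_median(values):
    """Replace None with the median of the observed values.

    Returns the imputed column plus a 0/1 flag marking which rows were missing.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

# Hypothetical column with two missing entries.
ages = [25, None, 40, 31, None]
imputed, was_missing = impute_median(ages)
```

The median is used here rather than the mean because it is less sensitive to outliers. Either way, this is a simple baseline and not a recommendation for every data set.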

Distributions

  • Examples: skewness, outliers, class imbalance or small disjuncts.
  • Decision trees: make no assumption about the distributions of inputs or targets. Skewness can still cause problems.
  • Linear regression: assumes multivariate normality. Outliers can cause problems. Transforms can bring variables closer to normal.
  • Neural networks: don’t assume any particular pattern, but problems can occur when the data is skewed more than lognormal.
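The idea of transforming a skewed variable toward normal can be sketched with a log transform. The values below are made up and deliberately span several orders of magnitude. The skewness function is the plain moment-based (population) version, which is an assumption on my part, since statistical packages often use bias-corrected variants.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

# Hypothetical right-skewed column spanning orders of magnitude.
raw = [1, 10, 100, 1000, 10000]
logged = [math.log10(v) for v in raw]

raw_skew = skewness(raw)        # strongly positive
logged_skew = skewness(logged)  # close to zero after the transform
```

A log transform only works for positive values. For data with zeros, `math.log1p` is a common alternative.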

Unbalanced Data (bias)

  • Examples: unrepresentative sample, faulty polling, awkward binning, cherry picking.
  • Decision trees: low bias (no assumption about the target) and high variance (a small input change can make a big difference). You can change the penalties for wrong classifications or limit tree depth.
  • Linear regression: can apply regularization (to prevent overly complex models). Can also add a weighting variable.
  • Neural networks: can add dropout layers (randomly deactivated neurons are temporarily left out of propagation). Can also apply regularization to prevent overly complex models.
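One way to build a weighting variable, or to change the penalty for misclassifying a rare class, is to weight each class inversely to its frequency. The sketch below uses the heuristic n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for `class_weight="balanced"`. The labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced target: 90 "ok" rows and 10 "fraud" rows.
labels = ["ok"] * 90 + ["fraud"] * 10
weights = balanced_class_weights(labels)

# Per-row weights that could be passed to a learner as a weighting variable.
sample_weights = [weights[y] for y in labels]
```

With these weights, each class contributes the same total weight (50 each here), so errors on the rare class are no longer drowned out by the common one.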

Variable Relationships

  • Examples: between variables; between predictors and targets.
  • Decision trees: no assumption (non-parametric). Tree depth lets the predictor-target relationship be non-linear. Likes one good variable for the first split.
  • Linear regression: assumes no correlation among predictors and a linear predictor-target relationship. Also makes assumptions about the residuals. Try combining variables or using PCA.
  • Neural networks: can find non-linear predictor-target relationships. Likes a good a priori starting point. Try combining variables or using PCA.
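Before combining variables, it helps to check how strongly two predictors are correlated. The sketch below computes a Pearson correlation by hand and, if it is high, averages the two standardized columns into one. This averaging is a simple stand-in for "combining variables", not PCA itself. The 0.9 threshold and the columns are assumptions for illustration.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def zscores(x):
    """Standardize a column to zero mean and unit standard deviation."""
    mu, sigma = mean(x), pstdev(x)
    return [(v - mu) / sigma for v in x]

# Hypothetical predictors: the same heights in two units of measure.
height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]

r = pearson(height_cm, height_in)
combined = None
if abs(r) > 0.9:  # arbitrary threshold, chosen for illustration
    # Average the standardized columns into a single predictor.
    combined = [(a + b) / 2 for a, b in zip(zscores(height_cm), zscores(height_in))]
```

With more than two correlated predictors, PCA does a similar job more generally by finding the directions of greatest shared variance.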

 

For more info about Perficient and predictive analytics: Data, Cloud, Analytics, Big Data

 
