I spend a fair amount of time wrangling data. While I do, I try to keep in mind what the predictive analytic technique needs from the data, and what the technique will do to the data on its own. I come back to the same questions when interpreting results. And when I compare models or build an ensemble, I really need to know how each technique treats its inputs. So I made this one-stop reference.
The first thing to understand is how different techniques handle irregularities in the data. This short post collects the main points for three of them: decision trees, linear regression, and neural networks.
Comparison of data wrangling
| Issue | Examples | Decision Trees | Linear Regression | Neural Networks |
| --- | --- | --- | --- | --- |
| Data Types | Categorical vs. continuous. Units of measure. | Continuous vars are binned. | Categorical vars are made continuous. Transformations can also lessen differences in scale. | Categorical vars are made continuous. Can also adaptively normalize across orders of magnitude. |
| Missing Values | Missing at Random (MAR). Missing Completely at Random (MCAR). Not Missing at Random (NMAR). | Doesn’t care, but there are different ways to deal with them (e.g., sending them to the most popular node or keeping them as a separate bin). Can also use surrogate splitting rules. | Cannot handle missing values, so must drop or impute. Dropping NMAR values can create bias. There are many ways to impute. | Cannot handle missing values, so must drop or impute. Dropping NMAR values can create bias. There are many ways to impute. |
| Distributions | Skewness. Outliers. Class imbalance or small disjuncts. | No assumption about the distributions of inputs or targets. Skewness can still cause problems. | Assumes multivariate normality (strictly, normally distributed residuals). Outliers can cause problems. Transforms can bring the data closer to normal. | Doesn’t assume any particular distribution. Problems can occur when data are skewed more than lognormal. |
| Unbalanced Data (bias) | Unrepresentative sample. Faulty polling. Awkward binning. Cherry picking. | Low bias (no assumption about the target) and high variance (a small input change can make a big difference). Can change the penalties for wrong classifications or limit tree depth. | Can apply regularization (to penalize more complex models). Can also add a weighting var. | Can add dropout layers (randomly deactivated neurons are temporarily left out of propagation). Can also apply regularization to penalize more complex models. |
| Variable Relationships | Among vars. Between predictors and targets. | No assumption (is non-parametric). Tree depth lets the predictors-target relationship be non-linear. Likes one good var for the first split. | Assumes no correlation among predictors and a linear predictors-target relationship. Makes assumptions about the residuals too. Try combining vars or using PCA. | Can find non-linear predictors-target relationships. Likes a good a priori starting point. Try combining vars or using PCA. |
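To make the Data Types, Missing Values, and Distributions rows a bit more concrete, here is a minimal sketch of the wrangling that linear regression and neural networks typically need before training. It uses only the Python standard library. The helper names (`one_hot`, `impute_median`, `log_transform`) and the toy columns are hypothetical, made up for illustration, and median imputation is just one of the many ways to impute.

```python
import math
from statistics import median


def one_hot(values):
    """Make a categorical column continuous: one 0/1 indicator list per level."""
    levels = sorted({v for v in values if v is not None})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def impute_median(values):
    """Replace missing entries (None) with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def log_transform(values):
    """Compress a right-skewed, non-negative column with log(1 + x)."""
    return [math.log1p(v) for v in values]


# Hypothetical toy data: one categorical column, one skewed numeric column with a gap.
color = ["red", "blue", "red", "green"]
income = [30_000, None, 45_000, 1_200_000]

color_encoded = one_hot(color)            # categorical -> continuous indicators
income_filled = impute_median(income)     # fill the missing value
income_clean = log_transform(income_filled)  # pull in the long right tail
```

A decision tree could take `color` and `income` more or less as they are, which is the contrast the table draws. For linear regression with an intercept, one of the indicator columns would normally be dropped so the indicators are not perfectly collinear.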