I spend a fair amount of time data wrangling. I try to keep in mind what each predictive analytics technique needs from the data, and what it can handle on its own. The same questions come up again when interpreting results, and even more so when comparing models or building an ensemble. So I made this one-stop reference.
First of all, it is important to understand how different techniques handle data irregularities. This is a simple post that aggregates some things to know, comparing decision trees, linear regression, and neural networks. A few minimal Python sketches after the table illustrate the common fixes.
Comparison of data wrangling

| Irregularity | Examples | Decision Trees | Linear Regression | Neural Networks |
|---|---|---|---|---|
| Data Types | Categorical vs. continuous; units of measure. | Continuous vars are binned. | Categorical vars must be made continuous. Transformations can lessen scaling challenges. | Categorical vars must be made continuous. Adaptive normalization can tame inputs spanning orders of magnitude. |
| Missing Values | Missing at Random (MAR); Missing Completely at Random (MCAR); Not Missing at Random (NMAR). | Handles them natively; several strategies exist (e.g., sending missing values to the most popular node or keeping them as a separate bin). Can also use surrogate splitting rules. | Cannot handle missing values, so you must drop or impute. Dropping NMAR data can create bias. Many imputation methods exist. | Cannot handle missing values, so you must drop or impute. Dropping NMAR data can create bias. Many imputation methods exist. |
| Distributions | Skewness; outliers; class imbalance or small disjuncts. | No assumptions about input or target distributions, though skewness can still cause problems. | Assumes multivariate normality; outliers can cause problems. Transforms can make variables more normal. | Assumes no particular distribution, but problems can occur when data is skewed more than lognormal. |
| Unbalanced Data (bias) | Unrepresentative sample; faulty polling; awkward binning; cherry picking. | Low bias (no assumptions about the target) but high variance (small input changes make a big difference). Can change penalties for wrong classifications or limit tree depth. | Can apply regularization (to penalize more complex models) or add a weighting variable. | Can add dropout layers (temporarily deactivated neurons are not propagated) or apply regularization to penalize more complex models. |
| Variable Relationships | Among predictors, and between predictors and target. | No assumptions (non-parametric); tree depth lets the predictor-target relationship be non-linear. Works best when one strong variable drives the first split. | Assumes predictors are uncorrelated and the predictor-target relationship is linear; also makes assumptions about residuals. Try combining variables or using PCA. | Can find non-linear predictor-target relationships. Benefits from a good a priori starting point. Try combining variables or using PCA. |
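First, data types: a minimal sketch of making a categorical variable continuous and normalizing a variable that spans orders of magnitude, for regression or neural network inputs. The DataFrame and column names here are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one categorical var, one continuous var that
# spans orders of magnitude.
df = pd.DataFrame({
    "region": ["north", "south", "south", "east"],
    "income": [42_000.0, 58_000.0, 1_200_000.0, 61_000.0],
})

# Regression and neural networks need categoricals made continuous:
encoded = pd.get_dummies(df, columns=["region"])

# Neural networks also benefit from normalizing inputs so no single
# variable dominates by sheer scale:
encoded[["income"]] = StandardScaler().fit_transform(encoded[["income"]])
```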
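For missing values, since linear regression and neural networks can't handle them directly, a common route is imputation. A minimal scikit-learn sketch; median imputation is just one of the many methods mentioned above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0]])

# Replace each missing entry with the column median; mean, most-frequent,
# and model-based imputers are alternatives.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
```

Keep in mind that if the data is NMAR, imputing (like dropping) can still bias the result.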
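For skewed distributions feeding a linear regression, a log transform is one simple way to make a variable more normal. The data below is made up to show the effect.

```python
import numpy as np

# A right-skewed variable: a few huge values dominate.
incomes = np.array([30_000.0, 45_000.0, 52_000.0, 2_000_000.0])

# log1p pulls in the long right tail, moving the distribution
# closer to normal.
log_incomes = np.log1p(incomes)
```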
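For unbalanced data, the decision-tree remedies from the table (changing misclassification penalties and limiting depth) and the regression remedy (regularization) each map to a one-liner in scikit-learn. The parameter values here are placeholders to tune, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Ridge

# Decision tree: weight the rare class more heavily and cap depth
# to rein in variance.
tree = DecisionTreeClassifier(class_weight="balanced", max_depth=5)

# Linear model: L2 regularization penalizes more complex fits.
ridge = Ridge(alpha=1.0)
```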
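Finally, for correlated predictors that violate the linear regression assumption, PCA produces uncorrelated components. A sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
X = np.column_stack([x1, x2])

# The principal components are uncorrelated by construction.
X_pca = PCA(n_components=2).fit_transform(X)
```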
For more info about Perficient and predictive analytics: Data, Cloud, Analytics, Big Data