Data Wrangling - Comparing Three Predictive Analytic Techniques

I spend a fair amount of time data wrangling. While doing it, I try to keep in mind what the predictive analytic technique needs from the data, and what the technique will do to the data on its own. I think about it again when interpreting results. It matters even more when I compare models or build an ensemble, because then I really need to know how each technique treated the data. So, I made this one-stop reference.

It is important to understand how different techniques handle irregularities in the data. This short post collects some things to know by comparing decision trees, linear regression, and neural networks.

Comparison of data wrangling

Issue	Examples	Decision Trees	Linear Regression	Neural Networks
Data Types	Categorical vs. continuous. Units of measure.	Continuous vars are binned.	Categorical vars must be made numeric (e.g. dummy variables). Transformations can lessen differences in scale.	Categorical vars must be made numeric. Inputs that span orders of magnitude can be adaptively normalized.
Missing Values	Missing at Random (MAR). Missing Completely at Random (MCAR). Not Missing at Random (NMAR).	Can tolerate missing values, though the handling varies (e.g. sending them to the most popular node or keeping them as a separate bin). Can also use surrogate splitting rules.	Cannot handle missing values, so you must drop or impute. Dropping NMAR values can create bias. There are many ways to impute.	Cannot handle missing values, so you must drop or impute. Dropping NMAR values can create bias. There are many ways to impute.
Distributions	Skewness. Outliers. Class imbalance or small disjuncts.	No assumption about the distributions of inputs or targets. Even so, skewness can cause problems.	Assumes multivariate normality (strictly, normally distributed residuals). Outliers can cause problems. Transforms can bring vars closer to normal.	Doesn’t assume any particular distribution. Problems can occur when inputs are skewed more than lognormal.
Unbalanced Data (bias)	Unrepresentative sample. Faulty polling. Awkward binning. Cherry picking.	Is low bias (no assumption about the target) and high variance (a small input change can make a big difference). Can change the penalties for wrong classification or limit tree depth.	Can apply regularization (to discourage overly complex models). Can also add a weighting var.	Can add dropout layers (deactivated neurons are temporarily left out of propagation). Can also apply regularization to discourage overly complex models.
Variable Relationships	Between vars. Between predictors and targets.	No assumption (is non-parametric). Tree depth lets the predictor-target relationship be non-linear. Likes one good var for the first split.	Assumes no correlation among predictors and a linear predictor-target relationship. Makes assumptions about residuals too. Try to combine vars or use PCA.	Can find non-linear predictor-target relationships. Likes a good a priori starting point. Try to combine vars or use PCA.
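
To make a few of the rows above concrete, here is a minimal sketch of three wrangling steps that linear regression and neural networks typically need before they will accept the data: imputing missing values, making a categorical var numeric, and taming a skewed var with a transform. It uses only the Python standard library. The column names, the toy values, and the helper names (`impute_median`, `one_hot`, `log_transform`) are my own assumptions for illustration, not part of any particular tool, and median imputation is just one of the many ways to impute.

```python
import math
import statistics


def impute_median(values):
    """Fill None entries with the median of the observed values.

    Also returns a 0/1 flag column so a model can still see which rows
    were missing (this does not by itself remove NMAR bias).
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


def one_hot(categories):
    """Turn a categorical column into 0/1 dummy columns, one per level."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows


def log_transform(values):
    """Compress a right-skewed, non-negative column with log(1 + x)."""
    return [math.log1p(v) for v in values]


# Hypothetical toy columns, invented for this example.
income = [30_000, None, 45_000, 1_200_000, 52_000]
region = ["east", "west", "east", "north", "west"]

income_filled, income_missing = impute_median(income)
levels, region_dummies = one_hot(region)
income_logged = log_transform(income_filled)

print(income_filled)      # missing entry replaced by the median
print(income_missing)     # flag marking the imputed row
print(levels)             # dummy column order
print(region_dummies)     # one row of 0/1 values per record
print(income_logged)      # outlier pulled much closer to the rest
```

A decision tree would usually not need the log transform or the dummy columns, which is one reason the same raw data can look quite different by the time each technique sees it.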

For more info about Perficient and predictive analytics: Data, Cloud, Analytics, Big Data

by Rick Kapalko on November 5th, 2019
