Xero Infinium

Various ways to transform raw data

When to do it

Numerical Data
Multiple variables
Variables measured at different scales do not contribute equally to an analysis. Standardizing puts all variables on a common scale.
In regression analysis, standardizing variables measured on different scales can help reduce multicollinearity in models that contain interaction terms.
Standardizing continuous predictor variables is extremely important for neural networks.
Standardizing data before cluster analysis is also critical.
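
As a minimal sketch of what standardization does, the z-score below rescales each variable to mean 0 and standard deviation 1. The names and sample values (`height_cm`, `income_usd`, `standardize`) are made up for illustration, and using the population standard deviation is an assumption; a library scaler may follow a different convention.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Two hypothetical variables on very different scales.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
income_usd = [20_000.0, 35_000.0, 50_000.0, 65_000.0, 80_000.0]

height_z = standardize(height_cm)
income_z = standardize(income_usd)

# After standardizing, both columns live on the same unitless scale.
print(height_z)
print(income_z)
```

Both columns here are evenly spaced, so their z-scores come out the same even though the raw numbers differ by orders of magnitude.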

When not to

Tree-based analyses are largely insensitive to outliers and do not require variable transformations.
Standardizing multi-scale data is not necessary for Decision Trees, Random Forest, or Gradient Boosting algorithms.
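
One way to see why trees do not need standardization: a split depends only on the ordering of the values, and standardization keeps that ordering. The sketch below uses a hypothetical one-split "stump" scored by Gini impurity (it is not any library's implementation) with made-up `income` and `bought` data, and checks that the best split separates the same rows before and after standardizing.

```python
from statistics import mean, pstdev

def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split_left(xs, ys):
    """Return the row indices sent to the left branch by the best single split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = float("inf"), None
    for cut in range(1, len(order)):
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue  # cannot split between equal values
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right])) / len(xs)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

income = [18_000.0, 22_000.0, 35_000.0, 41_000.0, 58_000.0, 120_000.0]
bought = [0, 0, 0, 1, 1, 1]

mu, sigma = mean(income), pstdev(income)
income_z = [(v - mu) / sigma for v in income]

# The same rows end up on the left branch in both cases.
same_partition = best_split_left(income, bought) == best_split_left(income_z, bought)
print(same_partition)
```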

When to do it

Numerical Data
Data is skewed or does not fit a normal bell curve.
If the data is roughly log-normal, a log transformation may bring it close to a normal curve.
This reduces the effect of extreme values that may have been "drowning out" other values that may be important.
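
A small sketch of the effect, assuming made-up right-skewed values and a hand-written population skewness helper (`skewness` is a name chosen here, not part of any library): the raw values are dominated by the largest one, while their logs are evenly spread.

```python
import math
from statistics import mean

def skewness(values):
    """Population skewness: 0 for symmetric data, positive for a long right tail."""
    mu = mean(values)
    m2 = mean([(v - mu) ** 2 for v in values])
    m3 = mean([(v - mu) ** 3 for v in values])
    return m3 / m2 ** 1.5

# Hypothetical data spanning several orders of magnitude.
raw = [1.0, 10.0, 100.0, 1000.0, 10000.0]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)        # strongly positive
skew_logged = skewness(logged)  # close to zero
print(skew_raw, skew_logged)
```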

When not to

A log transformation does not always make data normal or reduce the influence of extreme values, and it can have the opposite effect.
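
As a hedged illustration of the opposite effect, the sketch below starts from made-up data that is already symmetric. Taking logs stretches the low end and compresses the high end, so the result is left-skewed. The `skewness` helper is written here for the example and is not a library function.

```python
import math
from statistics import mean

def skewness(values):
    """Population skewness: 0 for symmetric data, negative for a long left tail."""
    mu = mean(values)
    m2 = mean([(v - mu) ** 2 for v in values])
    m3 = mean([(v - mu) ** 3 for v in values])
    return m3 / m2 ** 1.5

# Hypothetical symmetric data: no transformation is needed.
raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)        # zero: already symmetric
skew_logged = skewness(logged)  # negative: the log introduced skew
print(skew_raw, skew_logged)
```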

When to use it

For feature elimination
There are a lot of variables, and it is not known which are the most useful
Many variables are correlated with each other, so the information in one can summarize the others.
PCA turns all columns into "principal components" (PCs) that are uncorrelated with each other.
Each PC can be thought of as a single column that is a weighted combination of one or more columns from the raw data.
For example, if a data set has 2 variables for height (cm) and height (inches), they are perfectly correlated, and PCA would merge them into 1 component whose values are a weighted combination of the 2.
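
The height example can be sketched by hand for two columns. This is a minimal illustration, not a general PCA implementation: it uses the closed-form eigenvalues of a 2x2 covariance matrix, works on the raw (unstandardized) columns as a simplifying assumption, and the sample heights are made up.

```python
import math
from statistics import mean

def covariance(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [h / 2.54 for h in height_cm]  # same information, different unit

# 2x2 covariance matrix [[a, b], [b, c]] and its eigenvalues in closed form.
a = covariance(height_cm, height_cm)
b = covariance(height_cm, height_in)
c = covariance(height_in, height_in)
mid = (a + c) / 2
spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lambda_1, lambda_2 = mid + spread, mid - spread

# Share of the total variance captured by the first principal component.
explained_by_pc1 = lambda_1 / (lambda_1 + lambda_2)

# PC1 direction (unit eigenvector) and the single merged column of scores.
vx, vy = b, lambda_1 - a
norm = math.hypot(vx, vy)
w_cm, w_in = vx / norm, vy / norm
m_cm, m_in = mean(height_cm), mean(height_in)
pc1 = [w_cm * (x - m_cm) + w_in * (y - m_in)
       for x, y in zip(height_cm, height_in)]

print(explained_by_pc1)
print(pc1)
```

Because the two columns are exact multiples of each other, the first component carries essentially all of the variance and the second carries none, which is why one column is enough.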

When not to use

When to use it

When a model can only handle categorical data
When continuous numerical data has useful information that can be compressed into groups
This makes continuous numerical data discrete, such as turning age into age groups.
To find the best way to group the data, or whether it should be grouped at all, this class starts by trying 2 groups and running a chi^2 test against a target variable. It then tries 3 groups, 4, 5, and so on until it either hits a set limit on tries or each group holds a single value (the same as the ungrouped data). It chooses whichever grouping had the best chi^2 result.
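
The search described above can be sketched roughly as follows. This is not the class's actual code: the function names (`chi2_sf`, `chi2_for_bins`, `choose_bins`) are hypothetical, equal-width bins are an assumption, and "best" is taken here to mean the smallest chi^2 p-value, which the text does not spell out. The values are assumed not to be all identical.

```python
import math

def chi2_sf(x, df):
    """P-value (upper tail) of the chi-square distribution for integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    total, term = 0.0, math.sqrt(x)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 / math.pi) * math.exp(-half) * total

def chi2_for_bins(values, target, k):
    """Cut values into k equal-width bins; return (chi^2 statistic, df) vs. target."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    classes = sorted(set(target))
    counts = {}  # bin index -> {class: count}; empty bins never appear
    for v, t in zip(values, target):
        idx = min(int((v - lo) / width), k - 1)
        counts.setdefault(idx, dict.fromkeys(classes, 0))[t] += 1
    n = len(values)
    col_totals = {c: sum(row[c] for row in counts.values()) for c in classes}
    stat = 0.0
    for row in counts.values():
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * col_totals[c] / n
            stat += (row[c] - expected) ** 2 / expected
    return stat, (len(counts) - 1) * (len(classes) - 1)

def choose_bins(values, target, max_bins=10):
    """Try 2, 3, ... bins and keep the count with the smallest p-value."""
    best = None  # (p_value, number_of_bins)
    limit = min(max_bins, len(set(values)))  # stop at one value per group
    for k in range(2, limit + 1):
        stat, df = chi2_for_bins(values, target, k)
        if df == 0:
            continue  # a single bin or a single class: nothing to test
        p = chi2_sf(stat, df)
        if best is None or p < best[0]:
            best = (p, k)
    return best

# Made-up example: the target flips at age 40, so 2 groups should win.
ages = list(range(20, 60))
target = [0 if a < 40 else 1 for a in ages]
best_p, best_k = choose_bins(ages, target, max_bins=6)
print(best_k, best_p)
```

Comparing p-values rather than raw statistics is a deliberate choice in this sketch, since the statistic tends to grow as more groups are added.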

When not to use

The model is suited for numerical data.
Putting data into groups reduces the information it carries. If the search hits a limit where there are only a few values per group and something like 5000 groups, the data might as well have been kept numeric.