Home
Logan Collier edited this page Sep 9, 2019 · 1 revision
Various ways to transform raw data
When to standardize
- Numerical Data
- multiple variables
- Variables measured at different scales do not contribute equally to the analysis. Standardizing puts all variables on a common scale.
- When performing regression analysis, standardizing multi-scale variables can help reduce multicollinearity in models that contain interaction terms.
- Standardizing continuous predictor variables is especially important for neural networks.
- Standardizing your data prior to cluster analysis is also critical.
When not to standardize
- Tree-based analyses are not sensitive to outliers in the predictors and do not require variable transformations, because their splits depend only on the ordering of the values.
- Standardization of multi-scaled data is not necessary for Decision Trees, Random Forest, or Gradient Boosting algorithms.
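As a rough illustration of what standardizing does, here is a minimal sketch that converts two columns on very different scales to z-scores (subtract the mean, divide by the standard deviation). The `standardize` helper and the sample columns are made up for this example and are not part of this repository; in practice a library scaler such as scikit-learn's `StandardScaler` does the same job.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column has no spread, so map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical columns measured on very different scales.
income_usd = [32000, 48000, 51000, 75000, 120000]
age_years = [23, 35, 41, 52, 64]

income_z = standardize(income_usd)
age_z = standardize(age_years)

# After standardizing, both columns have mean ~0 and standard deviation ~1,
# so neither dominates a distance or gradient calculation by scale alone.
print([round(z, 2) for z in income_z])
print([round(z, 2) for z in age_z])
```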
When to log-transform
- Numerical Data
- Data is skewed or does not fit a normal bell curve
- A log transformation may bring the data closer to a normal curve. Data whose logarithm is normally distributed is called "log-normal".
- This reduces the influence of extreme values that may have been "drowning out" other, potentially important values.
When not to
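- A log transformation does not always make data normal or reduce extreme values, and it can have the opposite effect, for example by making already left-skewed data more skewed. It is also undefined for zero and negative values.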
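Both cases can be checked on toy data. The sketch below computes a simple moment-based skewness before and after taking the natural log, once on right-skewed values and once on left-skewed values. The `skewness` helper and both data sets are invented for illustration; a value near 0 means roughly symmetric, positive means a long right tail, and negative means a long left tail.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean([((x - mu) / sigma) ** 3 for x in values])


# Hypothetical right-skewed data: most values are small, one is very large.
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20, 150]
right_logged = [math.log(x) for x in right_skewed]

# Hypothetical left-skewed data: most values are near the maximum.
left_skewed = [2, 6, 8, 9, 9, 10, 10, 10, 10, 10]
left_logged = [math.log(x) for x in left_skewed]

# The log pulls in the long right tail, so skewness moves toward 0.
print("right-skewed:", skewness(right_skewed), "->", skewness(right_logged))

# On left-skewed data the log stretches the low end and makes things worse.
print("left-skewed: ", skewness(left_skewed), "->", skewness(left_logged))
```

Checking skewness (or a histogram) before and after the transformation is a cheap way to confirm that it helped rather than hurt.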
When to use PCA
- For feature elimination (reducing the number of columns)
- There are a lot of variables, and it is not known which are the most useful
- Many variables are correlated with each other, so the information from several of them can be summarized in just 1
- PCA turns all columns into "principal components" (PCs) that are uncorrelated with each other.
- Each PC can be thought of as a single column that is a linear combination of 1 or more columns from the raw data
- For example, if a data set has 2 variables for height, one in cm and one in inches, they are perfectly correlated, so the 2 columns effectively merge into 1 PC whose values are a weighted combination of the 2
When not to use
- PCA makes the data harder to interpret, because each component mixes several of the original variables
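The height example can be worked through by hand. For exactly two columns, the variances of the principal components are the eigenvalues of the 2x2 covariance matrix, which have a closed form. The sketch below uses that shortcut with invented height values. The helper names are hypothetical, and for real data with many columns a library implementation such as scikit-learn's `PCA` would be the usual choice.

```python
import math
from statistics import mean


def covariance(xs, ys):
    """Population covariance of two equally long columns."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pca_2d_variances(xs, ys):
    """Variances along PC1 and PC2 for two columns.

    These are the eigenvalues of the symmetric covariance matrix
    [[a, b], [b, c]], returned largest first.
    """
    a = covariance(xs, xs)
    b = covariance(xs, ys)
    c = covariance(ys, ys)
    mid = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mid + spread, mid - spread


# Hypothetical heights recorded twice, in inches and in centimetres.
height_in = [60.0, 63.0, 66.0, 69.0, 72.0, 75.0]
height_cm = [h * 2.54 for h in height_in]

var_pc1, var_pc2 = pca_2d_variances(height_cm, height_in)
explained_by_pc1 = var_pc1 / (var_pc1 + var_pc2)

# Because the two columns are perfectly correlated, PC1 carries essentially
# all of the variance and PC2 carries essentially none.
print("share of variance in PC1:", explained_by_pc1)
print("variance left in PC2:", var_pc2)
```

Dropping PC2 here loses nothing, which is the sense in which the two height columns merge into one.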
When to use binning
- When a model can only handle categorical data
- When continuous numerical data has useful information that can be compressed into groups
- This makes continuous numerical data discrete, such as taking age and making age groups
- To find the best way to group the data, or whether it should be grouped at all, this class starts by trying 2 groups and performing a chi^2 test against a target variable. It then continues with 3 groups, 4, 5, and so on until it either hits a set limit on the number of tries or each group holds a single value (the same as the ungrouped data). Whichever grouping gave the best chi^2 result is the one it chooses.
When not to use
- The model is suited to numerical data
- Putting data into groups reduces the information it carries. If the search ends with only a few values in each group and something like 5000 groups, the data might as well have stayed numeric.
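The search over group counts described above could look roughly like the sketch below. It is not the class from this repository. It assumes equal-width groups, a categorical target, and that the "best" chi^2 result means the smallest p-value; all function names and the sample data are hypothetical.

```python
import math


def chi2_statistic(table):
    """Pearson chi^2 for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def chi2_sf(stat, df):
    """Upper-tail p-value of a chi^2 distribution with integer df >= 1."""
    x = stat / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    # Recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / gamma(a + 1).
    while a < df / 2:
        q += x ** a * math.exp(-x) / math.gamma(a + 1)
        a += 1
    return q


def equal_width_bins(values, n_bins):
    """Assign each value a group index from 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def contingency(bin_ids, targets, n_bins):
    """Count targets per group, dropping groups that ended up empty."""
    classes = sorted(set(targets))
    table = [[0] * len(classes) for _ in range(n_bins)]
    for b, t in zip(bin_ids, targets):
        table[b][classes.index(t)] += 1
    return [row for row in table if sum(row) > 0]


def best_bin_count(values, targets, max_bins=10):
    """Try 2, 3, ... groups and keep the count with the smallest p-value."""
    best = None
    # Stop at max_bins or when each distinct value would get its own group.
    limit = min(max_bins, len(set(values)))
    for n_bins in range(2, limit + 1):
        table = contingency(equal_width_bins(values, n_bins), targets, n_bins)
        df = (len(table) - 1) * (len(table[0]) - 1)
        if df < 1:
            continue  # the test is undefined with one group or one class
        p_value = chi2_sf(chi2_statistic(table), df)
        if best is None or p_value < best[1]:
            best = (n_bins, p_value)
    return best


# Hypothetical data: age and whether the person bought a product.
ages = [18, 22, 25, 29, 33, 37, 41, 45, 52, 58, 63, 70]
bought = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

print(best_bin_count(ages, bought, max_bins=6))
```

Comparing p-values rather than raw chi^2 statistics matters here, because the statistic tends to grow as groups are added even when the extra groups carry no extra signal.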