Logan Collier edited this page Sep 9, 2019 · 1 revision

Xero Infinium


Transmute

Various ways to transform raw data

Standardize

When to do it

  • The data is numerical.
  • There are multiple variables.
  • Variables measured at different scales do not contribute equally to the analysis. Standardizing puts all variables on a common scale.
  • When performing regression analysis, standardizing multi-scale variables can help reduce multicollinearity issues in models that contain interaction terms.
  • Standardizing continuous predictor variables is extremely important for neural networks.
  • Standardizing your data prior to cluster analysis is also critical.
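
As a rough illustration of what standardizing does, the sketch below converts two columns on very different scales into z-scores (subtract the mean, divide by the standard deviation). The function name `standardize` and the sample columns are hypothetical and are not taken from this library.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical columns measured on very different scales.
income = [30000, 45000, 60000, 75000, 90000]
age = [25, 35, 45, 55, 65]

income_z = standardize(income)
age_z = standardize(age)

# After standardizing, both columns have mean 0 and standard deviation 1,
# so neither dominates a distance-based or gradient-based method.
print(income_z)
print(age_z)
```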

When not to

  • Tree-based analyses are not sensitive to outliers and do not require variable transformations.
  • Standardization of multi-scaled data is not necessary for Decision Tree, Random Forest, or Gradient Boosting algorithms.

LogTransform

When to do it

  • The data is numerical.
  • The data is skewed or does not fit a normal bell curve.
  • A log transformation may bring the data closer to a normal curve; data for which this works is called "log-normal".
  • This reduces the effect of extreme values that may have been "drowning out" other values that may be important.
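
A minimal sketch of the effect on right-skewed data, assuming strictly positive values (the logarithm is undefined otherwise). The helper names `skewness` and `log_transform` and the sample data are illustrative only.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean([((x - mu) / sigma) ** 3 for x in values])


def log_transform(values):
    """Natural log of each value; rejects zero and negative inputs."""
    if any(x <= 0 for x in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(x) for x in values]


# Hypothetical right-skewed column with one extreme value.
raw = [1, 2, 3, 5, 8, 13, 21, 400]
logged = log_transform(raw)

# The extreme value pulls the raw skewness far above that of the logged data.
print(skewness(raw), skewness(logged))
```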

When not to

  • A log transform does not always make data normal or reduce extreme values, and it can have the opposite effect.
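
One way to see the opposite effect: data that is already symmetric becomes left-skewed after a log transform, because the log stretches small values apart and squeezes large ones together. This is a small sketch with made-up data; `skewness` is a hypothetical helper.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean([((x - mu) / sigma) ** 3 for x in values])


# Hypothetical column that is already symmetric around its mean.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
logged_symmetric = [math.log(x) for x in symmetric]

# The raw skewness is zero; the logged values are clearly left-skewed.
print(skewness(symmetric), skewness(logged_symmetric))
```

Checking skewness before and after, as above, is a cheap way to decide whether the transform helped.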

Principal Component Analysis (PCA)

When to use it

  • For feature elimination.
  • There are a lot of variables, and it is not known which are the most useful.
  • Many variables are correlated with each other, so the information in several of them can be summarized by just one.
  • PCA turns all columns into "principal components" (PCs) that are uncorrelated with one another.
  • A PC can be thought of as a single column that is a combination of one or more columns from the raw data.
  • For example, if a data set has two variables, height (cm) and height (inches), they are perfectly correlated, and the two columns would be merged into one PC whose values are a weighted combination of the two.
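
The height example can be sketched for the two-variable case using the closed-form eigen-decomposition of a 2x2 covariance matrix. This is an illustration only, not this library's implementation; `pca_2d` and the sample heights are hypothetical, and real data with more columns would normally go through a linear algebra library.

```python
import math
from statistics import mean


def pca_2d(xs, ys):
    """PCA for exactly two variables.

    Returns ((lam1, lam2), scores): the two eigenvalues of the sample
    covariance matrix (largest first) and each row's score on the
    first principal component.
    """
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    cx = [x - mx for x in xs]  # centered columns
    cy = [y - my for y in ys]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(u * u for u in cx) / (n - 1)
    c = sum(v * v for v in cy) / (n - 1)
    b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    lam1, lam2 = mid + half_gap, mid - half_gap
    # Eigenvector belonging to lam1, scaled to unit length.
    if b != 0:
        vx, vy = lam1 - c, b
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    scores = [u * vx + v * vy for u, v in zip(cx, cy)]
    return (lam1, lam2), scores


# Hypothetical data: the same heights in centimeters and in inches.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [h / 2.54 for h in height_cm]

(lam1, lam2), pc1 = pca_2d(height_cm, height_in)
explained = lam1 / (lam1 + lam2)

# The second component carries essentially no variance, so the single
# column pc1 summarizes both original columns.
print(explained, lam2)
```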

When not to use

  • PCA makes the data harder to understand, because each PC is a mix of the original variables rather than a single measured quantity.

Ordinal Scale

When to use it

  • When a model can only handle categorical data.
  • When the useful information in continuous numerical data can be compressed into groups.
  • This makes continuous numerical data discrete, such as turning age into age groups.
  • To find the best way to group the data, or whether it should be grouped at all, this class starts by trying 2 groups and performing a chi^2 test against a target variable. It then continues with 3 groups, 4, 5, and so on, until it either hits a set limit on tries or each group holds a single value (the same as the original data). Whichever grouping had the best chi^2 result is the one it chooses.
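
The search described above might look roughly like the sketch below. It rests on several assumptions that the page does not confirm: groups are equal-width bins, "best" means the largest Pearson chi^2 statistic (comparing p-values instead would account for the differing degrees of freedom), and the names `equal_width_bins`, `chi_square`, `best_grouping`, and `max_bins` are all hypothetical rather than this class's actual API.

```python
from collections import Counter


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (assumed strategy)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # Clamp so the maximum value lands in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def chi_square(groups, targets):
    """Pearson chi-square statistic for the groups x targets table."""
    n = len(groups)
    observed = Counter(zip(groups, targets))
    group_totals = Counter(groups)
    target_totals = Counter(targets)
    stat = 0.0
    for g, g_total in group_totals.items():
        for t, t_total in target_totals.items():
            expected = g_total * t_total / n
            stat += (observed[(g, t)] - expected) ** 2 / expected
    return stat


def best_grouping(values, targets, max_bins=10):
    """Try 2, 3, ... bins and keep the grouping with the largest statistic."""
    # Stop at the set limit or when every distinct value could have its own group.
    limit = min(max_bins, len(set(values)))
    best = None
    for n_bins in range(2, limit + 1):
        groups = equal_width_bins(values, n_bins)
        stat = chi_square(groups, targets)
        if best is None or stat > best[1]:
            best = (n_bins, stat, groups)
    return best


# Hypothetical data: age against a binary target.
ages = [18, 22, 25, 31, 38, 44, 52, 59, 63, 70]
bought = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n_bins, stat, groups = best_grouping(ages, bought)
print(n_bins, stat, groups)
```

Ties are resolved in favor of the smaller number of groups, since a later grouping only replaces the current best when its statistic is strictly larger.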

When not to use

  • The model is suited to numerical data.
  • Putting data into groups decreases the information it carries. If the search ends with only a few values per group and something like 5000 groups, you might as well have kept the data numeric.