Data Preprocessing


Import Dataset

import pandas as pd

dataset = pd.read_csv("data.csv")

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes

Select Data

Using Index iloc

  • .iloc[] allowed inputs are:

    Selecting Rows

    • An integer, e.g. dataset.iloc[0]: returns row 0 as a Series (<class 'pandas.core.series.Series'>)
    Country      France
    Age              44
    Salary        72000
    Purchased        No
    • A list or array of integers, e.g. dataset.iloc[[0]]: returns row 0 as a DataFrame
      Country   Age   Salary Purchased
    0  France  44.0  72000.0        No
    • A slice object with ints, e.g. dataset.iloc[:3]: returns rows 0 to 2 as a DataFrame (the stop index 3 is excluded)
       Country   Age   Salary Purchased
    0   France  44.0  72000.0        No
    1    Spain  27.0  48000.0       Yes
    2  Germany  30.0  54000.0        No

    Selecting Rows & Columns

    • Select the first 3 rows and every column except the last one: X = dataset.iloc[:3, :-1]
       Country   Age   Salary
    0   France  44.0  72000.0
    1    Spain  27.0  48000.0
    2  Germany  30.0  54000.0
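
The selection rules above can be checked with a small, self-contained sketch. The inline CSV text is a hypothetical stand-in for the first rows of data.csv, and the variable names are illustrative only.

```python
import io

import pandas as pd

# Hypothetical inline copy of the first rows of data.csv, so the sketch runs on its own.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
"""
dataset = pd.read_csv(io.StringIO(csv_text))

row_as_series = dataset.iloc[0]    # single integer -> Series
row_as_frame = dataset.iloc[[0]]   # list of integers -> DataFrame
first_three = dataset.iloc[:3]     # slice -> rows 0, 1, 2 (row 3 is excluded)
features = dataset.iloc[:3, :-1]   # first 3 rows, every column except the last

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
print(first_three.index.tolist())
print(features.columns.tolist())
```

Wrapping the integer in a list is what switches the result from a Series to a one-row DataFrame.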

Numpy representation of DF

  • DataFrame.values: returns a NumPy representation of the DataFrame (only the values are returned; the axis labels are removed)
  • For example: X = dataset.iloc[:3, :-1].values
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]]
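
The later sections work with a feature matrix X and a label vector y. The README does not show how they are created, so the following is a minimal sketch under the assumption that X is every column except the last and y is the last column. The inline CSV text is a hypothetical stand-in for data.csv.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for data.csv (one Salary value is missing).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Germany,40,,Yes
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Assumption: features are all columns but the last, labels are the last column.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

print(X.shape, y.shape)
print(X.dtype)  # mixed text and numbers give an object array
```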


Handle Missing Data


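  • sklearn.impute.SimpleImputer(missing_values=np.nan, strategy=...): missing_values should be set to np.nan; strategy is one of "mean", "median", "most_frequent", ...
  • imputer.fit(X[:, 1:3]): Fit the imputer on the numerical columns of X (columns 1 and 2).
  • imputer.transform(X[:, 1:3]): Impute all missing values in those columns.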
import numpy as np
from sklearn.impute import SimpleImputer

# Create an instance of the SimpleImputer class: np.nan marks the empty values in the dataset
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit on the numerical columns: col 1 'Age' and col 2 'Salary'
imputer.fit(X[:, 1:3])

# transform replaces the missing values and returns the updated columns
X[:, 1:3] = imputer.transform(X[:, 1:3])
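
As a self-contained sketch of the imputation step, the example below uses a small hypothetical feature matrix (Country, Age, Salary) with one missing Age and one missing Salary. The values are chosen for illustration and are not the full dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: Country, Age, Salary (object dtype, like DataFrame.values).
X = np.array(
    [
        ["France", 44.0, 72000.0],
        ["Spain", 27.0, 48000.0],
        ["Germany", 40.0, np.nan],
        ["Spain", np.nan, 54000.0],
    ],
    dtype=object,
)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 1:3])                     # learns the column means, ignoring NaN
X[:, 1:3] = imputer.transform(X[:, 1:3])   # fills each NaN with its column mean

print(imputer.statistics_)  # learned means for Age and Salary
print(X)
```

The two calls can also be combined as imputer.fit_transform(X[:, 1:3]) when fitting and transforming the same data.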

Encode Categorical Data

Encode Independent Variables

  • Each category of an independent variable is converted into a vector of 0s and 1s
  • This is done with the ColumnTransformer class and
  • OneHotEncoder: an encoding technique for nominal features (features whose categories have no order)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
  • transformers: specifies which transformation to apply and to which columns
  • Each entry is a tuple: ('encoder' as the name of the transformation, an instance of the OneHotEncoder class, [columns to transform])
  • remainder="passthrough": keeps the columns that are not transformed. Otherwise, the remaining columns are not included in the output
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])] , remainder="passthrough" )
  • Fit and transform X with ct, the instance of the ColumnTransformer class
# fit and transform with input = X
# np.array: converts the output of fit_transform() to a NumPy array
X = np.array(ct.fit_transform(X))
  • Before converting the categorical column [0] Country (first 3 rows)
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
  • After converting, each country becomes a vector, e.g. France = [1.0, 0.0, 0.0] (first 3 rows)
[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 ...]
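
To see which position in the vector belongs to which country, the fitted encoder can be inspected. This is a minimal sketch on three hypothetical rows; the name X_encoded is illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: Country, Age, Salary.
X = np.array(
    [
        ["France", 44.0, 72000.0],
        ["Spain", 27.0, 48000.0],
        ["Germany", 30.0, 54000.0],
    ],
    dtype=object,
)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],  # one-hot encode column 0
    remainder="passthrough",                           # keep Age and Salary as they are
)
X_encoded = np.array(ct.fit_transform(X))

# The encoder sorts the categories, which fixes the order of the dummy columns.
categories = ct.named_transformers_["encoder"].categories_[0]
print(categories)
print(X_encoded)
```

The dummy columns come first in the output, followed by the passthrough columns, which is why Age and Salary end up at column index 3 onwards.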

Encode Dependent Variables

  • The dependent variable is the label, so we use LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# the output of LabelEncoder's fit_transform is already a NumPy array
y = le.fit_transform(y)

#y = [0 1 0 0 1 1 0 1 0 1]
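
A self-contained sketch of the label encoding, using the Purchased column from the table at the top. The name y_encoded is illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The Purchased column of the dataset shown at the top of this page.
y = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(le.classes_)                   # sorted labels; the index of each label is its code
print(y_encoded)
print(le.inverse_transform([0, 1]))  # map codes back to the original labels
```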

Splitting Training set and Test set

  • Use train_test_split from sklearn.model_selection
  • Recommended split: test_size = 0.2 (20% of the rows go to the test set)
  • random_state = 1: fixes the seed of the random number generator so that we get the same training & test sets on every run
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
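
The effect of test_size and random_state can be checked on placeholder data. In this sketch X and y are hypothetical arrays with 10 rows, matching the size of the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 10 rows, the same number of rows as the dataset.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(X_train.shape, X_test.shape)  # 8 training rows, 2 test rows

# Repeating the call with the same seed reproduces the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(np.array_equal(X_test, X_test_again))
```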


Feature Scaling

  • What? Feature Scaling (FS): put all the features on the same scale so that one feature with large values does not dominate the others, which would then be neglected by the ML model

  • Note #1: FS does not need to be applied to every ML model (for example, Multiple Regression models)

    • Why Multiple Regression needs no FS: in y = b0 + b1 * x1 + b2 * x2 + b3 * x3, the coefficients (b0, b1, b2, b3) adjust to compensate for the scale of each feature, so FS is not needed.
  • Note #2: Dummy variables produced by categorical feature encoding do not need FS

  • Note #3: FS MUST be done AFTER splitting into Training & Test sets

  • Why?

    • The Test Set is supposed to be brand-new data that plays no part in the work done on the Training Set
    • FS computes statistics of each feature (for standardisation, the mean & standard deviation) in order to scale it
    • If we apply FS before splitting into Training & Test sets, those statistics are computed from both the Training Set and the Test Set
    • FS MUST be done AFTER splitting => otherwise, we cause Information Leakage
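
The leakage argument can be made concrete with a small sketch. The salary values below are hypothetical, and the names leaky and clean are illustrative. The point is only that a scaler fitted before the split learns statistics that include the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical Salary column with 10 rows.
salaries = np.array(
    [[72000.0], [48000.0], [54000.0], [61000.0], [63000.0],
     [58000.0], [52000.0], [79000.0], [83000.0], [67000.0]]
)
train, test = train_test_split(salaries, test_size=0.2, random_state=1)

leaky = StandardScaler().fit(salaries)  # fitted BEFORE the split: sees the test rows
clean = StandardScaler().fit(train)     # fitted AFTER the split: training rows only

print("mean learned from all rows:     ", leaky.mean_[0])
print("mean learned from training rows:", clean.mean_[0])
```

Whenever the two means differ, the first scaler has carried information from the test rows into the scaling that the model is trained on.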

How?

  • There are 2 main Feature Scaling techniques: Standardisation & Normalisation
  • Standardisation: centres each feature at 0 (mean = 0) and rescales it to a standard deviation of 1.
    • Usage: applicable in all situations
  • Normalisation: rescales each feature to the range [0, 1]
    • Usage: apply when all the features in the dataset follow a normal distribution
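
For reference, the two techniques are usually written as follows, where x is a single feature column. This is the standard textbook form rather than something stated in the text above; the subscripts are illustrative notation.

```latex
x_{\mathrm{stand}} = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}
\qquad
x_{\mathrm{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}
```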


Standardisation Feature Scaling:

  • We will use StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
  • For X_train: fit the scaler and scale in one step with fit_transform (columns 3 onwards are the numerical features; the dummy variables in columns 0 to 2 are left unscaled)
X_train[:,3:] = sc.fit_transform(X_train[:,3:])
  • For X_test: use transform only, because we want to apply the SAME scale as X_train
# transform only: reuse the scaler fitted on the Training Set
X_test[:,3:] = sc.transform(X_test[:,3:])
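
A self-contained sketch of the scaling step follows. It assumes a hypothetical split in which columns 0 to 2 are the Country dummy variables and columns 3 and 4 are Age and Salary; the rows are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical split: 3 dummy columns followed by Age and Salary.
X_train = np.array(
    [
        [1.0, 0.0, 0.0, 44.0, 72000.0],
        [0.0, 0.0, 1.0, 27.0, 48000.0],
        [0.0, 1.0, 0.0, 30.0, 54000.0],
        [0.0, 0.0, 1.0, 38.0, 61000.0],
    ]
)
X_test = np.array([[1.0, 0.0, 0.0, 35.0, 58000.0]])

sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])  # learn mean/std on the training rows
X_test[:, 3:] = sc.transform(X_test[:, 3:])        # reuse the training mean/std

print(X_train[:, 3:].mean(axis=0))  # close to 0 for each scaled column
print(X_train[:, 3:].std(axis=0))   # close to 1 for each scaled column
print(X_test)
```

The scaled test row generally does not have mean 0 or standard deviation 1, which is expected: it is expressed on the training set's scale.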


