For THEORY PART, please refer to https://github.com/sandipanpaul21/Machine-Learning-Notes-Daywise
- What is the Dataset all about? What is the Problem Objective?
- Number of Rows and Columns. In which type is the data stored? Is it a DataFrame or a Dictionary?
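A quick pandas sketch for these first checks (the file name `data.csv` is only a placeholder for whichever dataset is being explored):

```python
import pandas as pd

df = pd.read_csv("data.csv")   # placeholder path; load whichever dataset is being analysed
print(type(df))                # confirms it is a pandas DataFrame, not a dictionary
print(df.shape)                # (number of rows, number of columns)
print(df.dtypes)               # type in which each column is stored
print(df.head())               # first few rows for a quick look
```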
- Uni means one, so Single Variable Analysis
- Mainly deals with Numerical Measures used in a dataset (Total 14 Numerical Measures)
- Measure of Central Tendency : 1. Mean, 2. Median, 3. Mode
- Measure of Data Spread : 4. Quartile, 5. Percentile, 6. Range, 7. IQR, 8. Boxplot, 9. Variance, 10. Standard Deviation
- Variation between Variables : 11. Covariance, 12. Correlation Coefficient (Pearson and Spearman)
- Measure of Distribution and Peakedness : 13. Skewness and 14. Kurtosis
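A minimal sketch of the 14 measures listed above using pandas (the two toy series are made up just to have something to compute on):

```python
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9, 30])   # toy data with one large value
t = pd.Series([1, 2, 2, 3, 3, 4, 5, 6, 20])   # second toy variable for covariance/correlation

# Central tendency
print(s.mean(), s.median(), s.mode().tolist())

# Data spread
q1, q3 = s.quantile(0.25), s.quantile(0.75)
print(q1, q3)                        # quartiles
print(s.quantile(0.90))              # 90th percentile
print(s.max() - s.min())             # range
print(q3 - q1)                       # IQR
print(s.var(), s.std())              # variance and standard deviation
s.plot(kind="box"); plt.show()       # boxplot: quartiles, IQR and outliers in one picture

# Variation between variables
print(s.cov(t))                                                    # covariance
print(s.corr(t, method="pearson"), s.corr(t, method="spearman"))   # correlation coefficients

# Distribution shape and peakedness
print(s.skew(), s.kurt())            # skewness and (excess) kurtosis
```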
- Bi means two, so Two Variable Analysis
- There are majorly two types of Data Variable: Continuous & Categorical Variable
- So 3 possible combinations for Bivariate Analysis
- Continuous vs Continuous : Correlation Coefficient
- Categorical vs Categorical : Chi Square Test
- Continuous vs Categorical : T Test (n < 30), Z Test (n > 30) and ANOVA Test
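A short sketch of all three cases with scipy.stats (the numbers are toy values, not taken from any particular dataset):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Continuous vs Continuous : correlation coefficient
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(stats.pearsonr(x, y))        # Pearson r and p-value
print(stats.spearmanr(x, y))       # Spearman rho and p-value

# Categorical vs Categorical : chi-square test of independence on a contingency table
gender = pd.Series(["M", "M", "F", "F", "F", "M"])
bought = pd.Series(["Yes", "No", "Yes", "Yes", "No", "No"])
chi2, p, dof, expected = stats.chi2_contingency(pd.crosstab(gender, bought))
print(chi2, p)

# Continuous vs Categorical : t-test for two groups, one-way ANOVA for three or more
group_a = [5.1, 4.9, 5.4, 5.0]
group_b = [6.2, 6.0, 5.8, 6.4]
group_c = [7.1, 6.9, 7.3, 7.0]
print(stats.ttest_ind(group_a, group_b))
print(stats.f_oneway(group_a, group_b, group_c))
```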
- Outliers are data points that differ significantly from other observations
- Techniques to Detect Outliers : 1. Box Plot and 2. Z-Score
- Techniques to Remove Outliers : Capping based on the Upper and Lower Range
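A minimal sketch of both detection rules and capping on a toy series (the usual 1.5 * IQR and |z| > 3 thresholds are assumed):

```python
import pandas as pd

s = pd.Series([12, 14, 15, 15, 16, 17, 18, 95])   # 95 is an obvious outlier

# 1. Box Plot / IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(s[(s < lower) | (s > upper)])               # points outside the whiskers

# 2. Z-Score rule
z = (s - s.mean()) / s.std()
print(s[z.abs() > 3])                             # |z| > 3 flags extreme points (may be empty for tiny samples)

# Removal by capping : clip values to the lower/upper range instead of dropping rows
s_capped = s.clip(lower=lower, upper=upper)
print(s_capped)
```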
- Missing Values in the Dataset are a concern for the Machine Learning Model
- Techniques for Imputing Missing Values
- Continuous Data : Median Imputation
- Categorical Data : Mode Imputation
- KNN Imputation (why better than Median and Mode imputation)
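A small sketch of the three options on a toy DataFrame with missing values (column names are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, 35],      # continuous
    "income": [40, 42, 50, np.nan, 47],      # continuous
    "city":   ["A", "B", "A", None, "A"],    # categorical
})

# Continuous : median imputation
df["age_median"] = df["age"].fillna(df["age"].median())

# Categorical : mode imputation
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

# KNN imputation : fills a missing value from the k most similar rows, so it uses
# relationships between features rather than a single global statistic
knn = KNNImputer(n_neighbors=2)
imputed = knn.fit_transform(df[["age", "income"]])
df["age_knn"], df["income_knn"] = imputed[:, 0], imputed[:, 1]
print(df)
```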
- Tweaking the features to increase the efficiency of the Model
- 3 Major Steps in Feature Engineering : 1. Transformation, 2. Scaling & 3. Construction
- Feature Transformation
- Feature transformation is performed to normalize the data
- Methods Used : 1. Log Transformation, 2. Square Root, 3. Cube Root & 4. Box-Cox Transformation
- Feature Scaling
- Feature scaling is conducted to standardize the independent features
- Method Used : Min-Max Scaler
- Feature Construction
- It is a process of creating features based on the original descriptors
- Methods Used : 1.Binning and 2.Encoding
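A compact sketch covering all three steps on toy data (column names and bin edges are just illustrative):

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"income": [20, 35, 50, 80, 200, 500],
                   "age":    [22, 25, 31, 40, 52, 60],
                   "city":   ["A", "B", "A", "C", "B", "A"]})

# 1. Feature Transformation : reduce skewness / normalize the shape
df["income_log"]  = np.log(df["income"])
df["income_sqrt"] = np.sqrt(df["income"])
df["income_cbrt"] = np.cbrt(df["income"])
df["income_boxcox"], lam = stats.boxcox(df["income"])    # Box-Cox needs strictly positive values

# 2. Feature Scaling : bring independent features onto the same 0-1 range
scaled = MinMaxScaler().fit_transform(df[["income", "age"]])
df["income_scaled"], df["age_scaled"] = scaled[:, 0], scaled[:, 1]

# 3. Feature Construction : binning a continuous feature and encoding a categorical one
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)   # one-hot encoding
print(df.head())
```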
- With the Boston Dataset, all 5 regression assumptions are checked (Why, What and How)
- Linearity between Target & Features : Plot Predicted & Target
- Normality of Error Term : Anderson-Darling Test & Skewness in Error Term
- Multicollinearity among Predictors : Correlation & VIF
- Autocorrelation among Error Term : Durbin-Watson Test
- Homoscedasticity, same variance within error terms : Residual Plot
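A sketch of the five checks with statsmodels; since the Boston dataset is no longer shipped with scikit-learn, synthetic X and y stand in for it here:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = 2 * X["f1"] - X["f2"] + rng.normal(scale=0.5, size=100)

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
pred, resid = model.fittedvalues, model.resid

# 1. Linearity : predicted vs target should roughly follow the diagonal
plt.scatter(pred, y); plt.xlabel("Predicted"); plt.ylabel("Target"); plt.show()

# 2. Normality of error term : Anderson-Darling test + skewness of residuals
print(stats.anderson(resid, dist="norm"))
print(pd.Series(resid).skew())

# 3. Multicollinearity : correlation matrix + VIF (VIF above roughly 5-10 is a red flag)
print(X.corr())
vif = [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])]
print(pd.Series(vif, index=X.columns))

# 4. Autocorrelation among error terms : Durbin-Watson (values near 2 mean none)
print(durbin_watson(resid))

# 5. Homoscedasticity : residuals vs predicted should show no funnel shape
plt.scatter(pred, resid); plt.axhline(0, color="k")
plt.xlabel("Predicted"); plt.ylabel("Residual"); plt.show()
```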
- Define the number of clusters, initialize centroids and measure distances
- Euclidean Distance : Measure distance between points
- Number of Clusters defined by Elbow Method
- Elbow Method : WCSS vs Number of Clusters
- Silhouette Score : Goodness of Clustering
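A minimal K-Means sketch on toy blobs showing the elbow curve (WCSS is `inertia_` in scikit-learn) and the silhouette score:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow Method : WCSS vs number of clusters
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters"); plt.ylabel("WCSS"); plt.show()

# Fit the chosen k (4 for these blobs) and check the goodness of clustering
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print(silhouette_score(X, km.labels_))    # closer to 1 means better separated clusters
```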
- Group similar objects together into clusters
- Types of HC (Hierarchical Clustering)
- Agglomerative : Bottom Up approach
- Divisive : Top Down approach
- Number of Clusters defined by the Dendrogram
- Dendrogram : Joining data points based on distance & creating clusters
- Linkage : To calculate distance between two points of two clusters
- Single linkage : Minimum Distance between two clusters
- Complete linkage : Maximum Distance between two clusters
- Average linkage : Average Distance between two clusters
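A small agglomerative (bottom-up) sketch with scipy, where the `method` argument picks the linkage described above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),     # two toy groups of points
               rng.normal(5, 0.5, (20, 2))])

# method="single" / "complete" / "average" chooses minimum / maximum / average distance
Z = linkage(X, method="average", metric="euclidean")

dendrogram(Z)          # large vertical gaps suggest where to cut, i.e. how many clusters
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```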
- No need to pre-define the number of clusters
- Distance metric is Euclidean Distance
- Need to give 2 parameters
- eps : Radius of the neighbourhood circle around a point
- min_samples : minimum number of data points inside that radius to form a cluster
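A minimal DBSCAN sketch; the eps and min_samples values are only illustrative for this toy dataset:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)    # eps = circle radius, min_samples = points needed inside it
labels = db.fit_predict(X)

print(set(labels))    # no number of clusters was pre-defined; label -1 marks noise points
```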
- Probabilistic Model
- Uses Expectation-Maximization (EM) steps:
- E Step : Estimate the probability of each data point belonging to each cluster
- M Step : For each cluster, revise the parameters based on those probabilities
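A small Gaussian Mixture sketch with scikit-learn, which runs the E and M steps internally when `fit` is called:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),      # two toy Gaussian blobs
               rng.normal(6, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM iterations happen inside fit()

print(gmm.predict_proba(X[:5]))   # E step view : probability of each point belonging to each cluster
print(gmm.means_)                 # M step view : revised cluster parameters (here, the means)
```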