Hanhan_Applied_DataScience

Applied data science recommendations and tutorials

Older Notes

Important Tips

  • Sampling strategies should be documented.
  • Split data into train and test sets first, then fit the data preprocessing on the training data only. After the model is trained, apply the same fitted preprocessing methods to the testing data.
  • Working Notes 🌺
  • Deep Learning tips

Research Resources

Applied Recommendations

  • Pipeline tools
  • Platforms
  • Core Components
    • Feature Store
    • HPO
    • etc.

MLOps

  • Relational databases
  • Big Data Systems
  • Big Data Cloud Platforms

Code Speed Up Methods

Ray

  • Ray vs Python multiprocessing vs Python Serial coding
    • Code example
    • My code example
    • Blog details
    • To sum up, Ray was faster than Python multiprocessing in this comparison. The basic steps appear to be: initialize the actors, then for each actor call the function you want to run in a distributed way.

AI Ethics

Federated Learning

  • In this paper, a simple application of federated learning
    • Shared & aggregated weights, instead of shared data
  • Federated Learning is a distributed machine learning approach which enables model training on a large corpus of decentralised data.
    • The aggregation algorithm used in federated learning mainly addresses data imbalance between clients and data non-representativeness. The global model parameters are the weighted sum of the model parameters from each selected client's model, where each client's weight is its share of the data records among all the selected clients' records.
  • FedML: A Research Library and Benchmark for Federated Machine Learning
    • It describes the system design, new programming interface, application examples, benchmark, dataset, and some experimental results.
  • Tensorflow built-in federated learning
  • How to Customize Federated Learning
    • Inspiration
      • It captures the core part of FL: a global model with global params (weights) and local models with local params (weights). The aggregated params (weights) are used to update the global model, which is then distributed to the local models for the next iteration of params aggregation. Repeat.
      • The FL aggregation method was built upon deep learning and is used to update the weights. If we want to apply FL to a classical ML method, we can borrow the idea and update the model params instead of the weights.
      • It also compared SGD with FL, and FL does provide some optimization effect.
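The weighted aggregation described above can be sketched in a few lines. This is a minimal illustration, not the code from the linked paper or library: `federated_average`, the client parameter vectors and the record counts are all made up for the example.

```python
def federated_average(client_params, client_sizes):
    """Weighted average of per-client parameter vectors.

    client_params: list of equal-length lists of floats (one list per client).
    client_sizes:  number of training records held by each client.
    """
    total = sum(client_sizes)
    global_params = [0.0] * len(client_params[0])
    for params, size in zip(client_params, client_sizes):
        weight = size / total  # this client's share of all selected records
        for j, value in enumerate(params):
            global_params[j] += weight * value
    return global_params


# Three hypothetical clients holding 10, 30 and 60 records.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [10, 30, 60]
global_params = federated_average(clients, sizes)
print(global_params)
```

In a full FL loop, `global_params` would be sent back to every client as the starting point for the next round of local training.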

Data Quality Check

When you get data from the client or from other teams, it's better to check the quality first.

Label Quality Check

  • Overall Check
    • Data Imbalance
  • Within Each Group
    • "Group" here can be each account/user/application/etc.
    • How are the different labels distributed within each group
    • Better to understand why
  • Label Quality
    • Confident Learning: Cleanlab is a Python package that provides insights into potentially mislabeled data

Data Drift Check

Data Exploration

Univariate Analysis

  • Check distribution for continuous, categorical variables
    • For continuous variables, in many cases, I just check the percentiles at min, 1%, 5%, 25%, 50%, 75%, 95%, 99% and max. This is easy to implement and makes it straightforward to find the outliers.
    • For categorical data, you can also find out whether there is a data inconsistency issue, and preprocess the data based on the cause
  • Check null percentage
  • Check distinct values count and percentage
    • Pay attention to features with 0 variance or low variance, think about the causes. Features with 0 variance should be removed before model training.
    • For categorical features, besides count of unique values, we can also use Simpson's Diversity Index
  • Use q-q plot to check whether the data distribution aligns with an assumed distribution
    • Example
    • When the distribution aligns, the q-q plot will show a roughly straight line
    • q-q plot can also be used to check skewness
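A small sketch of two of the univariate checks above: the percentile summary for a continuous feature and Simpson's Diversity Index for a categorical one. The helper names and the toy data are assumptions for illustration, and the index uses the common `1 - sum(p_i^2)` form (other variants exist).

```python
from collections import Counter


def percentile(values, p):
    """Percentile with linear interpolation, p in [0, 100]."""
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def simpson_diversity(values):
    """1 - sum(p_i^2): 0 for a constant column, closer to 1 for many even categories."""
    counts = Counter(values)
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical continuous feature with one extreme value.
amounts = list(range(1, 101)) + [10_000]
summary = {p: percentile(amounts, p) for p in (0, 1, 5, 25, 50, 75, 95, 99, 100)}
print(summary)  # the jump between the 99% value and the max exposes the outlier

balanced = simpson_diversity(["a"] * 5 + ["b"] * 5)
constant = simpson_diversity(["a"] * 10)
print(balanced, constant)
```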

Bivariate Analysis

  • Check correlation between every 2 continuous variables
    • More methods to check correlations
      • Point Biserial Correlation, Kendall Rank Correlation, Spearman’s Rank Correlation, Pearson Correlation and its limitation (linear only, can't be used for ordinal data)
    • Pearson vs Spearman
      • Pearson correlation evaluates the linear relationship between two continuous variables.
        • A relationship is linear when a change in one variable is associated with a proportional change in the other variable.
      • Spearman evaluates a monotonic relationship.
        • A monotonic relationship is one where the variables change together but not necessarily at a constant rate.
    • Check chi-square test between 2 categorical features (similar to correlation for continuous features)
      • The p-value is the probability of seeing a chi-square statistic at least this large if the 2 variables were independent. A small p-value is evidence that the 2 variables are dependent; a large p-value means the data are consistent with independence. The p-value does not measure how strong the dependence is.
        • H0 (Null Hypothesis) = The 2 variables to be compared are independent. Normally when p-value < 0.05 we reject H0 and conclude the 2 variables are NOT independent.
      • sklearn chi-square
    • NOTE: "Same" distribution doesn't mean same correlation, see this example!
    • Check ANOVA between categorical and continuous variables
  • Compared with correlation, Mutual Information (MI) also measures the non-linear dependency between 2 random variables
    • Larger MI, larger dependency
  • Using PPS (Predictive Power Score) to check asymmetrical correlation
    • Sometimes x can predict y but y can't predict x, this is asymmetrical
    • Besides using it as correlation between features, PPS can also be used to check stronger predictors
  • 2 way table or Stacked Column Chart - for 2 categorical variables, check the count and percentage of each group formed by the 2 variables
  • For features that are highly dependent on each other: in parametric algorithms these features should be removed to reduce error; in nonparametric algorithms I still recommend removing them, because algorithms such as trees tend to put highly dependent features at the same importance level, which makes the feature importance less informative.
  • Feature selection based on the dependency between features and the label
    • Select features that have stronger correlation with the label, because these features tend to contribute to the label values.
    • correlation: continuous features vs continuous label
    • chi-square: categorical features vs categorical label
      • Cramer’s V for Nominal Categorical Variable (categorical values in numerical format)
      • Mantel-Haenszel Chi-Square for ordinal categorical variable (categorical values in order)
    • ANOVA: categorical features vs continuous label; or vice versa
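To make the Pearson vs Spearman distinction above concrete, here is a minimal sketch on a relationship that is monotonic but not linear (`y = x^3`). The helper functions are written out by hand for illustration and assume there are no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Ranks 1..n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = list(range(1, 11))
y = [v ** 3 for v in x]  # monotonic, but not at a constant rate
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Spearman reaches 1 because the ordering is perfectly preserved, while Pearson stays below 1 because the relationship is not a straight line.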

Regression Coefficients

  • Unstandardized vs Standardized Regression Coefficients
    • Unstandardized coefficients α are used when the independent variables all keep their own units (such as kg, meters, years old, etc.). α means that when there is a one unit change in the independent variable, there is an α change in the dependent variable y.
      • However, because those independent variables are in different units, based on unstandardized coefficients, we cannot tell which feature is more important
    • So we need standardized coefficients β; a larger abs(β) indicates the feature is more important
      • Method 1 to calculate β is to convert each observation of an independent variable to 0 mean and 1 std
        • new_x_i = (x_i - Mean_x) / Std_x, do this for both independent and dependent variables
        • Then get β
      • Method 2
        • Linear Regression: β_x = α_x * (std_x / std_y)
        • Logistic Regression: β_x = (sqrt(3)/pi) * α_x * std_x
      • It means when there is 1 std increase in the independent variable, there is β std increase in the dependent variable

Deal With Missing Values

  • Check whether missing values appear in different values with different probability. This may help understand whether the missing value is missing completely at random (missing with same probability for different values) or whether there are certain values tend to have more missing values and why.
  • Deletion
    • List-wise deletion - remove the whole record (row) when any of its values is missing
    • Pair-wise deletion - remove missing values for each column separately, so each column may end up with a different number of records
    • Deletion is safer when the values are missing completely at random; even then it reduces the number of records and may reduce the prediction power
  • Impute with mean/median/mode
    • General Imputation - Replace missing values with the selected value computed over the whole column
    • Similar Case Imputation - For each group of values, impute with the selected value computed within that group
  • Impute with special values, such as "MISSING", -1, etc.
  • KNN Imputation
  • LGBM Imputation
    • It uses LGBM to impute missing values
  • Miss Forest, Mice Forest
    • Handle missing values with random forest
    • It explains how
    • Works for both numerical, categorical features
  • Model prediction to compare whether imputing missing values will help

Deal With Outliers

  • To check whether there are outliers, I normally check the distribution, even just by using percentiles, and only deal with a very small percentage of outliers.
    • It's recommended to first run the model on raw features without any preprocessing as the baseline result, then run the same model with preprocessed data and compare. This way you will know how much preprocessing affects your prediction in each project.
  • Sometimes I also check boxplots, but the data I am dealing with tends to have a large amount of outliers, and imputing data above & below 1.5*IQR would reduce the prediction power too much.
  • Better to know what caused the outliers, this may help you decide how to deal with them
  • Decide which are outliers
  • ML methods can be used for anomaly detection
  • To deal with outliers, I normally use:
    • Simply replace the outliers with NULL
    • Replace outliers with median/mean/mode, or a special value
    • Binning the feature
    • Just leave it there
    • Another suggested method is to build separate models for normal data and outliers when there is a large amount of outliers that are not caused by errors. In industry, you may not be allowed to build separate models, but it's still a method to consider.
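For reference, the 1.5*IQR boxplot rule and the "replace with median" option mentioned above look roughly like this. It is only a sketch on made-up numbers; `iqr_bounds` is a hypothetical helper, and the quartiles come from the standard library's default method, so other tools may give slightly different cut points.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower/upper fences at k * IQR below Q1 and above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


values = [10, 12, 11, 13, 12, 11, 10, 14, 13, 12, 300]
low, high = iqr_bounds(values)
outliers = [v for v in values if v < low or v > high]

# One of the options listed above: replace flagged values with the median.
median = statistics.median(values)
replaced = [median if (v < low or v > high) else v for v in values]
print(outliers, replaced)
```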

Dimensional Reduction

  • Check missing values, for those with high percentage of missing values, you may want to remove them.
  • Check variance
  • Deal with collinearity issues
    • Correlation matrix to check the correlation between pairs of features - Collinearity
    • VIF (Variance Inflation Factor) to check correlation that exists among 3+ features but cannot be found in any pair of features - Multicollinearity
      • To see how to use this method, check the code here, "Check Multicollinearity" section.
      • Normally when VIF is between 5 and 10, the feature may have a multicollinearity issue. When VIF > 10, it's too high and the feature should be removed.
      • 2 major methods to deal with features with high VIF
        • Remove it: Often start with removing features with highest VIF
        • Combine features with high VIF into 1 feature
      • The implementation of using VIF to drop features
      • Description of VIF
        • The Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. It is calculated as the ratio of the variance of a given coefficient (beta) in the full model to the variance of that coefficient if its variable were fit alone. The VIF score of an independent variable represents how well the variable is explained by the other independent variables.
        • VIF = 1/(1-R^2), where R^2 comes from regressing the variable on the other variables. Higher R^2 means higher correlation between the variable and the other variables, so higher VIF indicates higher multicollinearity.
  • Use tree models to find feature importance
    • Better to remove highly correlated features before doing this. Some tree models will rank all the highly correlated features as important if one of them is highly ranked
  • Dimensional Reduction Algorithms
    • Better to standardize the data before applying these methods; with the data on the same scale, distance calculation makes more sense.
    • sklearn decomposition
    • The intermediate result from autoencoder - encoder
    • sklearn manifold learning
    • How does PCA work: a great explanation of "eigenvalue", which also plots the explained variance of each principal component
      • Unsupervised, linear
      • Needs data standardization
    • How does LDA work
      • Linear, Supervised, as it's trying to separate classes as much as possible
      • Needs data standardization
    • How does t-SNE work
      • Nonlinear, Unsupervised, Non-parametric (no explicit data mapping function), therefore mainly used in visualization
      • "Similarity" & normal distribution in original dimension, "similarity" & t-distribution in lower dimension.
        • Both probability distributions are used to calculate similarity scores
        • t-distribution has "flatter" shape with longer tails, so that data points can be spread out in lower dimensions, to avoid high density of gathering
      • "Perplexity" is related to the number of nearest neighbors considered for each point, which determines the density
      • Needs data standardization
    • How does Isomap work
      • Unsupervised, Nonlinear, with the help of MDS (multidimensional scaling), it will try to keep both global and local (between-point distance) structure
        • Linear method like PCA focuses on keeping global structure more
    • How does LLE work
      • Nonlinear, Unsupervised
      • Focusing on local structure more, more efficient than Isomap as it only works with each point's local neighborhood instead of estimating distances between all pairs of points, but Isomap works better when you want to keep both global and local structure
    • How does UMAP work
      • Supports both supervised & unsupervised
      • Trying to maintain both local & global structures
  • Feature Selection Methods
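The `VIF = 1/(1-R^2)` formula from the collinearity notes above can be illustrated with the simplest possible case: two predictors, where the R^2 of regressing one on the other is just the squared Pearson correlation. The data and helper names are made up; with 3+ features the R^2 has to come from a multiple regression (statsmodels provides `variance_inflation_factor` for that).

```python
def r_squared_single(x, y):
    """R^2 from regressing y on one predictor x (equals squared Pearson r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def vif_from_r2(r2):
    """VIF = 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r2)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_related = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]   # roughly 2 * x1
x2_unrelated = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]  # uncorrelated with x1

vif_related = vif_from_r2(r_squared_single(x1, x2_related))
vif_unrelated = vif_from_r2(r_squared_single(x1, x2_unrelated))
print(round(vif_related, 1), vif_unrelated)
```

The nearly duplicated feature lands far above the VIF > 10 rule of thumb, while the uncorrelated one sits at the minimum value of 1.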

Feature Engineering

  • Scaling the features
    • Sometimes, you want all the features to be normalized into the same scale, this is especially helpful in parametric algorithms.
    • sklearn scaling methods on data with outliers
      • PowerTransformer is better in dealing with outliers
      • Sometimes, you just want to scale the values between 0 and 1, so MinMaxScaler is still popular
  • Transform nonlinear relationship to linear relationship, this is not only easier to comprehend, but also required for parametric algorithms
    • scatter plot to check the relationship; we can also use pandas cross tab method
    • Binning
    • Nonlinear transformations such as log; here is a list of methods for nonlinear transformation: https://people.revoledu.com/kardi/tutorial/Regression/nonlinear/NonLinearTransformation.htm
      • 🌺 Some data do need to use log, such as "income" in banking data. There is a common assumption that income data is log-normally distributed. Applying log to such data often works better.
      • This can be applied to features that have inevitable "outliers" that you should not remove but that make the overall distribution hard to see. When there are non-positive values, sometimes you might need a sign-preserving version such as np.sign(df) * np.log(np.abs(df) + 1), because np.log(df + 1) is undefined for values <= -1
    • Kernel functions
    • PCA - It can convert the whole feature set into normalized linear combination
  • To calculate skewness
  • Convert skewed distribution to symmetric distribution, this is also preferred for parametric algorithms. Skewed data could reduce the impact from lower values which may also be important to the prediction.
    • log to deal with right skewness
    • square root, cube root
    • exp to deal with left skewness
    • power
    • binning
  • derived features
  • one-hot features
  • decision tree paths as the feature
  • Methods to deal with categorical features
  • Normalize data into [0, 1] without min, max
    • We can use the sigmoid function exp(x)/(exp(x) + 1) to normalize any real value into the (0, 1) range
    • If we check the curve of the sigmoid function, we can see that for any real number x, y is always between 0 and 1.
  • Data Transformation
    • Transform non-normal distribution into normal distribution
    • In this example, using box-cox method can convert distributions with multiple peaks into normal distribution
      • The requirement of box-cox is, all the values have to be positive.

Deal With Imbalanced Data

Semi-supervised learning for imbalanced data

  • 2020 paper - Rethinking the Value of Labels for Improving Class-Imbalanced Learning
  • Inspirations
    • Train the model without label first to generate the predicted labels as new feature for the next model learning (semi-supervised learning)
    • More relevant new data input might reduce test error
    • A smaller data imbalance ratio in the unlabeled data also reduces the test error
    • The use of T-SNE for visualization to check class boundary
      • This is data dependent, not all the dataset could show obvious class boundary

Sampling Methods

Cost Sensitive Learning (class weights)

  • This method has become more and more popular recently. Mostly you just set the class weights based on the importance of false positives and false negatives.
  • In practice, it is worth learning more from the customers or the marketing team, to understand the cost of TP/TN or FP/FN.
  • Example-dependent Cost Sensitive method
    • Costcla - it's a package that does model prediction including a cost matrix
      • The cost matrix is example dependent. Each row has the cost of [FP, FN, TP, TN]. So when you are creating this cost matrix yourself, your training & testing data all records the costs, each row of your data is an example. Sample Model Prediction
      • The drawback is that the testing data also needs the cost matrix, which in practice you may not know. However, this method can still be used on a train-validation dataset, in order to find the optimal model.

Thresholding

  • When the prediction result is in probability format, we can change the threshold of class prediction. By default the threshold is 0.5. For the chosen evaluation metric, better to draw a curve with thresholds on the x-axis and the evaluation result on the y-axis, so that we will know which threshold is the better choice.
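A minimal sketch of the threshold curve described above, using F1 as the evaluation metric. The labels and predicted probabilities are invented, and `f1_at_threshold` is a helper written for this example; the (threshold, score) pairs in `curve` are what you would plot.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 of the positive class when predicting 1 for prob >= threshold."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced validation set: 7 negatives, 3 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_prob = [0.03, 0.08, 0.12, 0.22, 0.31, 0.36, 0.47, 0.42, 0.58, 0.81]

thresholds = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
curve = [(t, f1_at_threshold(y_true, y_prob, t)) for t in thresholds]
best_threshold, best_f1 = max(curve, key=lambda pair: pair[1])
default_f1 = f1_at_threshold(y_true, y_prob, 0.5)
print(best_threshold, round(best_f1, 3), round(default_f1, 3))
```

The threshold should be picked on validation data, not on the final test set.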

Given Partial Labels

  • When the labels are given by the client, without being able to work with any business expert, there can be many problems. Especially when they only gave you 1 class of the labels, and you might need to assume the rest of the data all belong to the other label, it can be more problematic.
  • Here're the suggestions
    • Better not to build the machine learning pipeline at the early stage, since the pipeline can create many more limitations when you want to try different modeling methods.
    • If you don't trust the label quality, try clustering first, check the pattern in each cluster, and compare with the given labels.
    • Communicate with the business experts frequently if possible, to understand how the labels were generated.
    • If your model is a supervised method and using all the data would cause a huge data imbalance issue, better to choose a representative sample of the data for the assumed label, reducing its size in order to reduce the data imbalance.

Other Methods

  • Clustering & Multiple Model Training
    • Cluster the majority class into multiple non-overlapped clusters. For each cluster, train them with the minority class and build a model. Average the final prediction
    • You don't need testing data here, but the drawback is that you always need the label, which won't be the case in practice. When there is more data, you also need to operate on all the data, which can be time-consuming and computationally expensive.

Reference

  • Data Exploration Guidance
  • Impact of Data Size on Model Performance
    • It's using deep learning, but can inspire multiple other models, especially those using stochastic methods
    • Using bootstrap methods + multiple seeds to do multiple runs, together with cross validation, is suggested. For stochastic methods, different seeds could lead to different directions and different optimal results. It could also help deal with the high variance issue in a model.
    • "As such, we refer to neural network models as having a low bias and a high variance. They have a low bias because the approach makes few assumptions about the mathematical functional form of the mapping function. They have a high variance because they are sensitive to the specific examples used to train the model."
    • A larger training set tends to lead to better prediction results, but the same may not hold for a smaller test set. So a 7:3 or 8:2 train-test ratio is usually a good start.
  • Guidance on dealing with imbalanced data 1, Guidance on dealing with imbalanced data 2

Models

Baseline Models for R&D

  • During R&D, there will be more and more fancy methods, especially those in deep learning. Are they going to perform better comparing with these easy-to-implement baselines?
  • Categorize based on words/characters
    • Naive Bayesian
    • Fisher Methods
    • You can do self-implementation, check chapter 6
  • Find themes of a piece of text
    • Non-Negative Matrix Factorization
    • You can do self-implementation, check chapter 10

Which Model to Choose

  • Linear or Nonlinear
    • In the code here, we can use residual plot to check whether there is linear/non-linear relationship between the feature set and the label, to decide to use linear/nonlinear model.
      • If the residual plot is showing funnel shape, it indicates non-constant variance in error terms (heteroscedasticity), which also tends to have residuals increase with the response value (Y). So we can also try to use a concave function (such as log, sqrt) to transform Y to a much smaller value, in order to reduce heteroscedasticity.
      • As we can see in the code, after transforming Y with log, the residual plot was showing a much more linear relationship (the residuals are having a more constant variance)
  • TPOT Automatic Machine Learning
    • TPOT for model selection
    • It's a very nice tool that helps you find the model with optimized param, and also export the python code for using the selected model TPOT Examples
    • TPOT Params for Estimators
      • 10-fold of cross validation, 5 CV iterations by default
      • TPOT will evaluate population_size + generations × offspring_size pipelines in total. So when the dataset is large, it can be slow.
    • All the classifiers that TPOT supports
    • All the regressors that TPOT supports
    • Different TPOT runs may result in different pipeline recommendations. TPOT's optimization algorithm is stochastic in nature, which means that it uses randomness (in part) to search the possible pipeline space.
    • The suggestion here is: Run TPOT multiple times with different random_state, narrow down to fewer models and do further evaluation (see below spot check pipeline). If the data size is large, try to reduce population_size + generations × offspring_size, cv and use subsample
    • Sample exported TPOT python file
  • Spot-check Model Evaluation
    • After you have a short list of models, you want to quickly check which one performs better. What I'm doing here is, for each model, use all the training data but with stratified k-fold cross validation. Finally it evaluates the average score and the score variance.
      • Some people think it's important to use bootstrap, which either runs the model on multiple resampled subsets of the data or runs the same model multiple times with different seeds. I'm not using bootstrap here, because the first approach is similar to stratified k-fold cross validation; the second approach I will use once I have finalized 1 model, to deliver the final evaluation results.
  • Tools other than TPOT

Notes for Evaluation Methods in Model Selection

  • On the training data, R-Square never decreases and RSS (residual sum of squares) never increases when more features are added, but the test error may not drop. Therefore, R-Square and RSS should not be used for selecting between models that have different numbers of features.
    • RSS = MSE*n
    • R-square = explained variation / total variation, it means the percentage of response variable variance explained by the model.
      • For a least-squares model with an intercept, evaluated on its training data, it's between 0% and 100%
      • 0% indicates that the model explains none of the variability of the response data around its mean.
      • 100% indicates that the model explains all the variability of the response data around its mean.

Cross Validation

  • When there is time order in the data
    • Solution - Forward Chaining
      • Step 1: training(1), testing(2)
      • Step 2: training(1,2), testing(3)
      • Step 3: training(1,2,3), testing(4)
      • ...
    • Application - sklearn time series split
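The forward chaining steps above can be generated with a tiny helper. This is only a sketch over block numbers (`forward_chaining_splits` is a name made up here); the sklearn time series split linked above does the same kind of expanding-window split on row indices.

```python
def forward_chaining_splits(n_blocks):
    """Yield (train_blocks, test_block) over time-ordered blocks 1..n_blocks."""
    for test_block in range(2, n_blocks + 1):
        # Train on everything strictly before the test block.
        yield list(range(1, test_block)), test_block


splits = list(forward_chaining_splits(4))
for train_blocks, test_block in splits:
    print(f"training{tuple(train_blocks)}, testing({test_block})")
```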

Multi-Tag Prediction

More about Ensembling

  • We know ensembling models tend to work better in many cases, such as Xgboost and LightGBM

Super Learner

  • Super Learner example
    • It trains on a list of base models, using the predictions as the input of a meta model to predict the targets, and finally might be better than a specific "best model"

Algorithms Details

RBF Kernel

  • Radial Basis Function kernel (RBF kernel) is used to determine edge weights in the affinity matrix, a matrix that defines the pairwise relationships between points.
    • Sklearn's implementation of the RBF kernel looks slightly different as it replaces 1/(2*sigma^2) with a hyperparameter gamma. The effect is the same as it allows you to control the smoothness of the function. Low gamma (large sigma) extends the influence of each individual point widely, hence creating a smooth transition in label probabilities. Meanwhile, high gamma (small sigma) leads to only the closest neighbors having influence over the label probabilities.
  • Reference
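To see how gamma changes the reach of each point, here is the kernel `exp(-gamma * ||x - y||^2)` evaluated by hand for a near and a far neighbor. The points and gamma values are arbitrary, and the functions are written for this example rather than taken from sklearn.

```python
import math


def rbf_kernel(x, y, gamma):
    """exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gamma_from_sigma(sigma):
    """gamma = 1 / (2 * sigma^2), linking the two parameterizations."""
    return 1.0 / (2.0 * sigma ** 2)


point, near, far = (0.0, 0.0), (0.5, 0.0), (3.0, 0.0)
for gamma in (0.1, 10.0):
    print(gamma, round(rbf_kernel(point, near, gamma), 4), rbf_kernel(point, far, gamma))
```

With gamma = 0.1 the far point still gets a sizeable weight; with gamma = 10 its weight is practically zero, so only close neighbors matter.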

Semi Supervised Learning

SVM

Decision Tree

Gradient Descent

  • The final comparison table is very helpful
    • weights vs gradient
    • tree depth
    • classifier weights
    • data variance capture

L1 vs L2

  • L1 regularization adds a penalty term to the cost function using the absolute value of the weight (Wj) parameters, while L2 regularization adds the squared value of the weights (Wj) to the cost function.
  • L1 pushes weight w towards 0 regardless of whether it's positive or negative; therefore, L1 tends to be used for feature selection
    • If w is positive, the L1 regularization parameter λ>0 will push w to be less positive, by subtracting λ from w. If w is negative, λ will be added to w, pushing it to be less negative. Hence, this has the effect of pushing w towards 0.
  • As for dealing with multicollinearity, L1, L2 and Elastic Net (uses both L1 and L2) could all help to some degree
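The "subtract λ if positive, add λ if negative" behaviour of L1 can be sketched as a shrink step. Two simplifications are assumed here: the L1 step stops at 0 instead of overshooting (soft-thresholding), and the L2 step is written as a plain proportional shrink. Neither function is a full optimizer.

```python
def l1_shrink(w, lam):
    """Move w toward 0 by lam, stopping at 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def l2_shrink(w, lam):
    """Shrink w proportionally to its size (simplified L2 step)."""
    return w * (1.0 - lam)


w_l1, w_l2 = 1.0, 1.0
for step in range(4):
    w_l1 = l1_shrink(w_l1, 0.3)
    w_l2 = l2_shrink(w_l2, 0.3)
    print(step + 1, round(w_l1, 4), round(w_l2, 4))
```

After a few steps the L1 weight is exactly 0, which is why L1 can drop features, while the L2 weight keeps getting smaller without reaching 0.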

Model Evaluation

Before Evaluation

  • There are things we can do to make the evaluation more reliable:
    • Hold-out Data

      • If the dataset is large enough, it's always better to have a piece of hold-out data, and always use this piece of data to evaluate the models
    • Run the same model multiple times

      • You can try different seeds with the whole dataset
      • Bootstrap - run the model on randomly selected samples, finally aggregate the results
        • Resample with replacement
        • UNIFORMLY random draw
        • The advantage of drawing with replacement is that the distribution of the data won't be changed while you are drawing
        • sklearn methods in data splitting
        • Statistical theory shows that bootstrap estimates get close to those from the true population once you have drawn enough samples (similar to the central limit theorem)
        • My code - Using Bootstrap to Estimate the Coefficients of Linear Regression
          • Applied on linear regression, and quadratic linear regression
          • It compares bootstrap results and standard formula results. Even though both got the same estimated coefficients, the standard formulas tend to show lower standard errors. However, since the standard formulas make assumptions while bootstrap is nonparametric, bootstrap tends to be more reliable
    • Cross Validation

      • We can calculate the average evaluation score and the score variance to compare model performance
      • In sklearn, we can just use cross_val_score, which accepts either an integer number of folds (stratified k-fold is used for classifiers) or a cross-validation splitter instance
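A minimal bootstrap sketch matching the description above (uniform random draws with replacement, then aggregating the results). It estimates the standard error of a mean and compares it with the textbook formula; the data is simulated and `bootstrap_standard_error` is a helper name introduced for this example.

```python
import random
import statistics


def bootstrap_standard_error(data, statistic, n_resamples=2000, seed=0):
    """Std of a statistic over resamples drawn uniformly with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)


# Simulated sample standing in for real observations.
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

boot_se = bootstrap_standard_error(data, statistics.fmean)
formula_se = statistics.stdev(data) / len(data) ** 0.5
print(round(boot_se, 4), round(formula_se, 4))
```

The same function works for statistics that have no simple standard error formula, such as a median or a model coefficient, by passing a different `statistic`.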

Evaluation Methods

  • There is a suggestion for model evaluation: choose and train several models on the same dataset and average the evaluation results. The idea here is, "combined models increase prediction accuracy"
  • "Probability" means predict continuous values (such as probability, or regression results), "Response" means predict specific classes

Evaluation for Time Series Data

  • The model doesn't have to be forecasting/time series models only, it can be classification model too with features generated using sliding window/tumbling window.
  • No matter which model you use, when there is a time series in the data, better to use walk-forward evaluation, which is to train on historical data but predict on future data.
    • Better to sort the input data in time order. Even if the model doesn't check historical records and the features already include time series values, if the data is not sorted in time order and later records appear before their historical records during training, it can still create bias
  • ROC Curve vs. Precision-Recall Curve
    • ROC curve can look overly optimistic when the 2 classes are imbalanced. Precision-Recall curve is better suited to imbalanced datasets.
    • Both of them use TPR (recall/sensitivity, the percentage of actual positive cases that are predicted as positive); the difference is in precision and FPR.
      • FPR = FP/(FP + TN); when the negative class is much larger than the positive class, FPR can stay small even when there are many false positives.
      • Precision = TP/(TP + FP), the percentage of predicted positives that are correct. It doesn't use TN, so a very large negative class doesn't dilute it the way it dilutes FPR in the ROC curve.
  • Logloss is used for probability output, therefore it's listed as a classification metric rather than a regression metric in sklearn
    • In deep learning, the loss function doesn't have to be logloss; it can simply be "mean_absolute_error"
  • MSE/RMSE works better for continuous target
  • R-square = explained variation / total variation, it means the percentage of response variable variance explained by the model.
    • Interpretation
      • 1 is the best, meaning no error
      • 0 means your regression is no better than taking the mean value
      • Negative value means you are doing worse than the mean value
    • For a least-squares model with an intercept, evaluated on its training data, it's between 0% and 100%
      • 0% indicates that the model explains none of the variability of the response data around its mean.
      • 100% indicates that the model explains all the variability of the response data around its mean.
    • R-square cannot reveal bias, so it has to be used together with a residual plot.
      • When the residual plot dots are randomly dispersed around the horizontal axis, a linear model is better for the data, otherwise a nonlinear model is better.
    • Lower R-square doesn't mean the model is bad.
      • For unpredictable things, lower R-square is expected. Or R-squared value is low but you have statistically significant predictors, it's also fine.
    • High R-square doesn't mean good model.
      • The residual plot may not be random and you would choose a nonlinear model rather than a linear model.
  • Residual Plot
    • In regression model, Response = (Constant + Predictors) + Error = Deterministic + Stochastic
      • Deterministic - The explanatory/predictive information, explained by the predictor variables.
      • Stochastic - The unpredictable part, random, the error. This also means the explanatory/predictive information should NOT be in the error.
    • Therefore, the residual plot should be random, which means the error only contains the Stochastic part.
    • So when the residual plot is not random, the predictor variables are not capturing some explanatory information, which has leaked into the residuals. Possibilities for how predictive info could leak into the residual plot:
      • Didn't capture a missing variable.
      • Didn't capture a missing higher-order term of a variable in the model to explain the curvature.
      • Didn't capture the interaction between terms already in the model.
      • The residual is correlated to one or more variables, and these variables didn't get captured in the model.
        • Better to check the correlation between residuals and variables.
      • Autocorrelation - Adjacent residuals are correlated to each other.
  • References
  • Still needs labels
  • Sometimes, you simply do not have the label at all. You can:
    • Compare with the datasets that have labels, at the risk that the comparison does not transfer
    • Check predicted results distribution, comparing between different rounds
    • Your customer will expect a percentage for each label, compare with that...
  • Suggestions to improve scores
    • Making the probabilities less sharp (less confident). This means adjusting the predicted probabilities away from the hard 0 and 1 bounds to limit the impact of penalties of being completely wrong.
    • Shift the distribution to the naive prediction (base rate). This means shifting the mean of the predicted probabilities to the probability of the base rate, such as 0.5 for a balanced prediction problem.
    • Reference
  • How to Compare 2 Models After Each has been Running Multiple Rounds
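To make the R-square interpretation listed above concrete (1 is perfect, 0 is no better than predicting the mean, negative is worse than the mean), here is a minimal sketch. It assumes the common "1 - residual sum of squares / total sum of squares" form of R-square; the helper name `r_squared` and the toy numbers are illustrative, not from the original notes.

```python
def r_squared(y_true, y_pred):
    """R-square computed as 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0, 5.0]

# Predicting every value exactly: no error at all.
perfect = r_squared(y, y)

# Always predicting the mean of y: no better than the mean by construction.
mean_only = r_squared(y, [3.0] * len(y))

# Predicting the values in reversed order: worse than predicting the mean.
worse = r_squared(y, list(reversed(y)))

print(perfect, mean_only, worse)
```

The third case shows why a negative value is possible once predictions are not the fitted values of a least-squares model with an intercept, for example when scoring on held-out data.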

After Evaluation - The confidence/reliability of prediction

  • Calibration
  • Concordant & Discordant
  • KS Test - The Kolmogorov-Smirnov (K-S) chart is a measure of the difference between the y_true and y_pred distributions for each class respectively. It's a method that can be used to compare 2 samples.
    • KS test vs t-test: Imagine that the population means were similar but the variances were very different. The Kolmogorov-Smirnov test could pick this difference up but the t-test cannot.
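The KS vs t-test point above can be sketched with two small samples that share the same mean but have very different spread. The two-sample KS statistic is the largest gap between the two empirical CDFs; the helper name `ks_statistic` and the sample values are illustrative assumptions. In practice a library routine such as `scipy.stats.ks_2samp` would also give a p-value.

```python
from bisect import bisect_right
from statistics import mean

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

narrow = [-1.0, -0.5, 0.0, 0.5, 1.0]    # small variance
wide = [-10.0, -5.0, 0.0, 5.0, 10.0]    # large variance, same mean

mean_gap = mean(narrow) - mean(wide)    # what a t-test would compare
ks = ks_statistic(narrow, wide)

print(mean_gap, ks)
```

The difference in means is zero, so a comparison of means has nothing to detect, while the KS statistic is clearly above zero because the two distributions have different shapes.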

Time Series Specific

  • Data Exploration
    • Lagged Features
      • When your features are created from lagged data, you can check the correlation between these lags. If the correlation is close to 0, it means current values have almost no linear correlation with the previous time values
  • If you want a quick forecast, such as hourly, daily, weekly, yearly forecast, try Facebook Prophet
  • My code - time series functions
  • 11 Forecasting methods
    • When to use which model
    • It seems that methods which only have predict (no forecast) will only predict the next value. With forecast we can predict the next period of values
  • My code with 7 methods
    • Some are not in above 11 methods
    • Simple, Exponential & Weighted Moving Average
      • Simple moving average (SMA) calculates an average of the last n prices
      • The exponential moving average (EMA) is a weighted average of the last n prices, where the weighting decreases exponentially with each previous price/period. In other words, the formula gives recent prices more weight than past prices.
      • The weighted moving average (WMA) gives you a weighted average of the last n prices, where the weighting decreases with each previous price. This works similarly to the EMA.
    • Grid Search to tune ARIMA params
  • Combining multiple models' forecasts
    • Combining multiple forecasts leads to increased forecast accuracy.
    • Choose and train models on the same time series and average the resulting forecasts
    • Reference
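The three moving averages described above (SMA, EMA, WMA) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code linked in the notes: the WMA uses linear weights 1..n with the newest price weighted most, and the EMA uses the common smoothing factor 2 / (n + 1), is seeded with the first price, and runs over the whole series. All function names are hypothetical.

```python
def simple_moving_average(prices, n):
    """Plain average of the last n prices."""
    window = prices[-n:]
    return sum(window) / len(window)

def weighted_moving_average(prices, n):
    """Average of the last n prices with linear weights (assumed scheme)."""
    window = prices[-n:]
    weights = range(1, len(window) + 1)  # oldest gets 1, newest gets n
    return sum(w * p for w, p in zip(weights, window)) / sum(weights)

def exponential_moving_average(prices, n):
    """Recursive EMA with alpha = 2 / (n + 1) (assumed convention)."""
    alpha = 2.0 / (n + 1)
    ema = prices[0]  # seed with the first price
    for price in prices[1:]:
        # Each step shrinks the weight of all older prices by (1 - alpha).
        ema = alpha * price + (1 - alpha) * ema
    return ema

prices = [10.0, 11.0, 12.0, 13.0, 14.0]
sma = simple_moving_average(prices, 3)
wma = weighted_moving_average(prices, 3)
ema = exponential_moving_average(prices, 3)

print(sma, wma, ema)
```

On this rising series both weighted variants sit above the SMA, because they give the most recent (higher) prices more weight.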

Optimization Problem

Linear Programming

  • Pareto Front
    • When you have multiple evaluation metrics (or objective functions), this method helps to find the records where at least 1 metric works better.
    • It helps remove more unnecessary records --> more efficient.
    • The idea can be modified for other use cases: instead of requiring at least 1 metric to work better, we can require all the metrics or certain metrics to work better, etc.
  • MADM (Multi-criteria decision analysis methods)
    • The methods listed in skcriteria are mainly using linear programming
  • Use Cases
    • Param Tuning - Pareto Front improves the efficiency, MADM finds the optimal option
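As a minimal sketch of the Pareto Front idea above, the code below keeps only the records that no other record beats. It assumes every metric is "higher is better" and uses the usual dominance rule: record A dominates record B when A is at least as good on every metric and strictly better on at least one. The function names and the example scores are hypothetical.

```python
def dominates(a, b):
    """True when a is no worse than b everywhere and better somewhere."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(records):
    """Keep the records that are not dominated by any other record."""
    return [r for r in records
            if not any(dominates(other, r) for other in records)]

# Hypothetical (metric_1, metric_2) scores for 5 parameter settings.
candidates = [
    (0.90, 0.60),
    (0.85, 0.70),
    (0.80, 0.65),  # dominated by (0.85, 0.70)
    (0.70, 0.90),
    (0.60, 0.50),  # dominated by every other record
]

front = pareto_front(candidates)
print(front)
```

For parameter tuning, the dominated settings can be dropped early, and a decision method such as MADM only has to rank what is left on the front.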

Model Explainability

Dashboard Tools

Plot Decision Tree Boundary

  • It has summarized different methods in ML model interpretations, such as NN interpretation, model-agnostic interpretation, etc.

Lime - Visualize feature importance for all machine learning models

  • Their GitHub, Examples and the Paper: https://github.com/marcotcr/lime
    • Interpretable: The explanation must be easy to understand depending on the target demographic
    • Local fidelity: The explanation should be able to explain how the model behaves for individual predictions
    • Model-agnostic: The method should be able to explain any model
    • Global perspective: The model, as a whole, should be considered while explaining it
  • The tool can be used for both classification and regression. The reason I put it here is that it can show feature importance even for blackbox models. In industry, interpretability often decides whether you can apply the more complex methods that bring higher accuracy. In too many situations the industry finally went with the simplest models or even just intuitive math models. This tool may help with better interpretation of those better models.
  • My code: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/Better4Industry/lime_interpretable_ML.ipynb
    • It seems that GitHub cannot show the visualizations I have created in IPython. But you can check the LIME GitHub Examples
    • LIME requires the data input to be a numpy array; it doesn't support pandas dataframes yet. That's why in my code I was converting the dataframe and lists all to numpy arrays.
  • NOTES
    • Currently the explain_instance() function has to use predicted probabilities
    • You also need to specify all the class names in LimeTabularExplainer, especially for classification problems; otherwise the visualization cannot show the classes well

SHAP

  • SHAP value is a type of feature importance value: https://christophm.github.io/interpretable-ml-book/shap.html#shap-feature-importance
  • It uses Shapley values at its core and is aimed at explaining each individual record.
    • "Shapley value for each feature is basically trying to find the correct weight such that the sum of all Shapley values is the difference between the predictions and average value of the model. In other words, Shapley values correspond to the contribution of each feature towards pushing the prediction away from the expected value."
      • So a higher Shapley value indicates the feature pushes the prediction more towards the positive class.
      • A lower Shapley value indicates the feature pushes the prediction more towards the negative class.
  • More details from my practice, including shap decision plot
  • It supports feature importance interpretation for the well-known ensemble models, as well as deep learning model output interpretation.
  • Tutorials
  • The visualizations generated from this library are more like sklearn style, basic but useful. It has multiple visualizers:
    • Feature Analysis, Target visualizer, Regression/Classification/Clustering visualizers, Model Selection visualizer, Text Modeling visualizers.
  • Find all its visualizers
  • It might be better for deep learning. So far I haven't seen how impressive this tool is~
  • Examples
    • Much more text than visualization
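Returning to the Shapley property quoted in the SHAP notes above (the values sum to the difference between the prediction and the expected value), the sketch below computes exact Shapley values for a tiny hand-written model. It is an illustration under simplifying assumptions and does not use the shap library: a feature that is "absent" is replaced by a single baseline value, and the baseline prediction stands in for the model's average output. `model`, `shapley_values` and all numbers are hypothetical.

```python
from itertools import permutations

def model(x):
    """Toy model with one interaction term (illustrative only)."""
    return 2.0 * x[0] + 3.0 * x[1] + x[0] * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values by averaging marginal gains over all orderings."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start with every feature "absent"
        previous = f(current)
        for i in order:
            current[i] = x[i]         # switch feature i to its real value
            value = f(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)

print(phi, sum(phi), model(x) - model(baseline))
```

The values sum to the gap between the prediction and the baseline prediction, and the interaction term `x[0] * x[2]` is split evenly between the two features involved. Enumerating all orderings is only feasible for a handful of features.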

Reference

My model evaluation previous detailed summary

Tools

R Tools

Data Manipulation Tools

  • dplyr - mainly for query-related operations, such as "select", "group by", "filter", "arrange", "mutate", etc.
  • data.table - fast, even more convenient than data.frame, can also do queries inside
  • ggplot2
  • reshape2 - reshape the data, such as "melt", "dcast", "acast" (reversed melt)
  • readr - read files faster, different functions support different files
  • tidyr - makes the data "tidy". "gather" is similar to the above "melt" in reshape2; "separate", "sperate_d" could help separate 1 column into multiple columns and vice versa, etc.
  • lubridate - deal with datetime
  • My code of R 5 packages for dealing with missing values
    • MICE - It assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on the observed values and can be predicted using them.
    • Amelia - It assumes that all variables in a data set have a Multivariate Normal Distribution (MVN). It uses means and covariances to summarize data.
    • missForest - It builds a random forest model for each variable. Then it uses the model to predict missing values in the variable with the help of observed values.
    • Hmisc - It automatically recognizes the variables types and uses bootstrap sample and predictive mean matching to impute missing values. You don’t need to separate or treat categorical variable. It assumes linearity in the variables being predicted.
    • mi - It allows graphical diagnostics of imputation models and convergence of imputation process. It uses bayesian version of regression models to handle issue of separation. Imputation model specification is similar to regression output in R. It automatically detects irregularities in data such as high collinearity among variables. Also, it adds noise to imputation process to solve the problem of additive constraints.
    • Recommend to start with missForest, Hmisc, MICE, and then try others
  • How does R output use p-value in regression
    • In R, after we have applied regression, we will see coefficient values as well as p-values.
    • The null hypothesis is that the coefficient is 0 for a variable (no effect on the model).
    • So when the p-value is lower than the significance level, reject the null hypothesis, which suggests the variable should be kept in the model. When the p-value is high, you fail to reject the null hypothesis, and the variable is a candidate for removal from the model.

Reference

Python Tools

Notes in Speed Up

  • Better not to use pandas iterrows(). Instead, store each record as a dictionary and keep the dictionaries in a list for iteration. Iterating over pandas rows can be much slower.