Applied data science recommendations and tutorials
- Sampling strategies should be documented.
- Split data into train and test first, then do data preprocessing on the training data. After the model is trained, apply the same final preprocessing steps to the testing data.
- Working Notes 🌺
- Deep Learning tips
- I especially like Papers with Code 🌺
- You can also find papers here under Methods groups: https://paperswithcode.com/methods
- Prototyping Toolkit
- When you need a big data environment, CloudxLab saves effort on environment setup
- Pipeline tools
- Platforms
- Core Components
- Feature Store
- HPO
- etc.
- 🌺 Made with ML
- Google ML Engineering Guides
- Docker
- Deploy deep learning web application with TensorFlow.js
- It includes image labeling, modeling with Google Colab and deployment on TensorFlow.js
- Build ML Web Application with Flask
- How to deploy flask app to Azure
- Flask vs FastAPI
- Personally, I didn't find FastAPI better: when I was using an HTML template to get user-typed input and generate the model output, Flask was simple, you just need to render the template. Using FastAPI with template rendering was much more complex.
- ML App with Streamlit
- The UI and the 3D visualization built
- Deploy ML service with BentoML
- About BentoML
- It allows you to deploy the trained model as a REST API (built directly for you), a Docker image or a Python-loadable file
- When your Docker image is running, you can use it as a REST API locally; just go to http://localhost:5000
- Correction
- For docker, the bento command should be
bentoml containerize IrisClassifier:latest -t iris-classifier:latest
- "iris-classifier" is the image name, "latest" is the tag name
- MLOps library
- Here's a simple example with code: you can write a pipeline in .py, then specify the DevOps workflow in .yaml to run not only the ML pipeline but also other DevOps work
- Might be Azure ML specific, check getting started here
- ZenML: https://github.com/zenml-io/zenml
- Inferrd: https://docs.inferrd.com/
- Easily deploy your model on GPUs
- More Open Source MLOps Tools
- DVC Studio, MLFlow
- DVC Studio: https://github.com/iterative/dvc
- MLFlow: https://github.com/mlflow/mlflow
- About Kubernetes
- Concepts
- Commands
- How to deploy your application using Kubernetes together with Docker
- More suggestions about Kubernetes in production
- More deployment walkthrough
- Relational databases
- Big Data Systems
- Big Data Cloud Platforms
- Ray vs Python multiprocessing vs Python Serial coding
- Code example
- My code example
- Blog details
- To sum up, Ray is faster than Python multiprocessing. The basic steps are to initialize actors and, for each actor, call the function you want processed in a distributed way; see the sketch below.
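- A minimal sketch (my own toy example, not from the linked blog) of the actor pattern described above, assuming `ray` is installed and a local cluster is enough:
```python
import ray

ray.init()  # start Ray locally

@ray.remote
class SumActor:
    """Each actor holds its own state and processes work in parallel."""
    def __init__(self):
        self.total = 0

    def add_batch(self, values):
        self.total += sum(values)
        return self.total

# Initialize a few actors, then call the function you want distributed on each of them
actors = [SumActor.remote() for _ in range(4)]
futures = [a.add_batch.remote(range(i * 1000, (i + 1) * 1000)) for i, a in enumerate(actors)]
print(ray.get(futures))  # block until all distributed calls finish
```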
- Different tools for measuring AI Ethics
- Questions to think when using these tools:
- If a model can't pass the AI Ethics measurement, can we get suggestions to be able to pass the measurement?
- Google AI Principles
- It suggests evaluating model performance for each segment/subgroup too, to see whether there could be bias
- Data exploration can also disclose some bias in the data, so Google introduced Facets for some data exploration
- In this paper, a simple application of federated learning
- Shared & aggregated weights, instead of shared data
- Federated Learning is a distributed machine learning approach which enables model training on a large corpus of decentralised data.
- The aggregation algorithm used in federated learning - it mainly addresses data imbalance between clients and data non-representativeness. The algorithm sums the weighted model parameters from each selected client's model to get the global model parameters, where each client's weight is the percentage of its data records among all the selected clients' records (see the sketch below).
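- A hedged numpy sketch (my own illustration, not code from the paper) of the weighted aggregation idea above: each client's parameters are weighted by its share of records among the selected clients.
```python
import numpy as np

def federated_average(client_params, client_record_counts):
    """client_params: list of 1-D parameter arrays, one per selected client."""
    weights = np.array(client_record_counts, dtype=float)
    weights /= weights.sum()                   # fraction of records per client
    stacked = np.stack(client_params)          # shape: (n_clients, n_params)
    return (weights[:, None] * stacked).sum(axis=0)  # weighted sum = global params

# toy example: 3 clients with different data sizes
global_params = federated_average(
    [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.1, 1.2])],
    client_record_counts=[100, 300, 600],
)
print(global_params)
```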
- FedML: A Research Library and Benchmark for Federated Machine Learning
- It describes the system design, new programming interface, application examples, benchmark, dataset, and some experimental results.
- Tensorflow built-in federated learning
- The way this example works is, it randomly chooses some clients' data (or uses an aggregation function to do that more efficiently), then sends the multi-client data to the remote server to train a model
- Detailed TensorFlow and Keras federated learning IPython notebook
- All the tensorflow federated tutorials
- How to Customize Federated Learning
- Inspiration
- It captures the core part of FL: the global model has global params (weights), each local model has local params (weights), and the aggregated params (weights) are used to update the global model and then distributed to the local models for the next iteration of param aggregation. Repeat.
- The FL aggregation method was built upon deep learning and used to update the weights. If we want to apply FL to classical ML methods, we can borrow the idea and, instead of updating weights, update params.
- It also compared SGD with FL, and FL does provide a certain optimization effect.
When you get the data from the client or from other teams, it's better to check the quality first.
- Overall Check
- Data Imbalance
- Within Each Group
- "Group" here can be each account/user/application/etc.
- How are different labels distributed within each group
- Better to understand why
- Label Quality
- Confident Learning: Cleanlab is a Python package that provides insights on potentially mislabeled data
- When to use which metrics to measure drift or compare distributions
- KS vs PSI vs WD
- It also covers KL and JS divergence, both of which can be applied to numerical or categorical distributions.
- Details on how I applied these methods can be found in: https://github.com/hanhanwu/Hanhan_Applied_DataScience/blob/master/data_exploration_functions.py
- Different methods to detect concept drift, covariate drift
- How to use K-S test to compare 2 numerical distributions: https://stackoverflow.com/questions/10884668/two-sample-kolmogorov-smirnov-test-in-python-scipy
- K-S test can be used even when the 2 distributions are in different length, and it's non-parametric
- But I found that, whether it's the K-S test or Wasserstein distance, even when 2 distributions look similar, the K-S null hypothesis can be rejected (indicating the 2 distributions are not identical) and the Wasserstein distance can be large...
- Chi-square is used for comparing categorical features' distributions; it's non-parametric too but requires the 2 distributions to have the same length...
- PSI is used when your data has normal fluctuations and you mainly care about significant drift in numerical data (see the sketch below)
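- A small sketch of the drift metrics above (scipy/numpy only; the quantile binning choice for PSI is my own assumption): K-S test, Wasserstein distance and a simple PSI for two numerical samples of different lengths.
```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(10)
expected = rng.normal(0, 1, 5000)   # e.g. training distribution
actual = rng.normal(0.1, 1, 3000)   # e.g. recent data; different length is fine

ks_stat, ks_p = ks_2samp(expected, actual)   # non-parametric, unequal lengths allowed
wd = wasserstein_distance(expected, actual)

def psi(expected, actual, n_bins=10):
    """Population Stability Index with quantile bins built on the expected sample."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return np.sum((a_pct - e_pct) * np.log(a_pct / e_pct))

print(ks_stat, ks_p, wd, psi(expected, actual))
```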
- Continual Learning with Ensembling models
- Add trained model's knowledge to the new data, new model
- My Code - IPython
- When you are using `chi2` or `f_classif`, the features should have no NULLs and no negative values.
- Data Exploration Code I often use
- Check distribution for continuous, categorical variables
- For continuous variables, in many cases, I just check percentile at min, 1%, 5%, 25%, 50%, 75%, 95%, 99% and max. This is easier to implement and even more straightforward to find the outliers.
- For categorical data, you can also find whether there is data inconsistency issue, and based on the cause to preprocess the data
- Check null percentage
- Check distinct values count and percentage
- Pay attention to features with 0 variance or low variance, think about the causes. Features with 0 variance should be removed before model training.
- For categorical features, besides count of unique values, we can also use Simpson's Diversity Index
- Use q-q plot to check whether the data distribution aligns with an assumed distribution
- Example
- When the distribution aligns, the q-q plot will show a roughly straight line
- q-q plot can also be used to check skewness
- Check correlation between every 2 continuous variables
- More methods to check correlations
- Point Biserial Correlation, Kendall Rank Correlation, Spearman’s Rank Correlation, Pearson Correlation and its limitation (linear only, can't be used for ordinal data)
- Pearson vs Spearman
- Pearson correlation evaluates the linear relationship between two continuous variables.
- A relationship is linear when a change in one variable is associated with a proportional change in the other variable.
- Spearman evaluates a monotonic relationship.
- A monotonic relationship is one where the variables change together but not necessarily at a constant rate.
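- A quick illustration (my own toy data) of the Pearson vs Spearman difference above: a monotonic but non-linear relationship keeps Spearman at 1 while Pearson drops.
```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(1, 10, 200)
y = np.exp(x)  # monotonic, clearly non-linear

pearson_corr, _ = pearsonr(x, y)
spearman_corr, _ = spearmanr(x, y)
print(f"Pearson: {pearson_corr:.3f}, Spearman: {spearman_corr:.3f}")  # Spearman = 1.0 here
```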
- Check chi-square test between 2 categorical features (similar to correlation for continuous features)
- (p-value) probability 0 means the 2 variables are dependent; 1 means independent; a value x in the [0,1] range means the dependence between the 2 variables is at `(1-x)*100%`
- H0 (Null Hypothesis) = The 2 variables to be compared are independent. Normally when p-value < 0.05 we reject H0 and the 2 variables are NOT independent.
- sklearn chi-square
- NOTE: "Same" distribution doesn't mean same correlation, see this example!
- Check ANOVA between categorical and continuous variables
- Example to check ANOVA f-value
- sklearn ANOVA
- A higher F-score indicates higher dependence between the variables
- Besides ANOVA, we could calculate t-score or z-score (less than 30 records)
- Comparing with correlation, Mutual Information (MI) measures the non-linear dependency between 2 random variables
- Larger MI, larger dependency
- Using PPS (predictive power score) to check asymmetrical correlation
- Sometimes x can predict y but y can't predict x, this is asymmetrical
- Besides using it as correlation between features, PPS can also be used to check stronger predictors
- 2-way table or Stacked Column Chart - for 2 variables, check the count and percentage grouped by the 2 variables
- For features that are highly dependent on each other, in parametric algorithms these features should be removed to reduce error; in nonparametric algorithms, I still recommend removing them, because algorithms such as trees tend to put highly dependent features at the same importance level, which won't contribute much to feature importance.
- Feature selection based on the dependency between features and the label
- Select features that have stronger correlation with the label, because these features tend to contribute to the label values.
- correlation: continuous features vs continuous label
- chi-square: categorical features vs categorical label
- Cramer’s V for Nominal Categorical Variable (categorical values in numerical format)
- Mantel-Haenszed Chi-Square for ordinal categorical variable (categorical values in order)
- ANOVA: categorical features vs continuous label; or vice versa
- Unstandardized vs Standardized Regression Coefficients
- Unstandardized coefficients α are used for independent variables when they all keep their own units (such as kg, meters, years old, etc.). It means when there is one unit change in the independent variable, there is an α change in the dependent variable y.
- However, because those independent variables are in different units, based on unstandardized coefficients, we cannot tell which feature is more important
- So we need standardized coefficient β, larger abs(β) indicates the feature is more important
- Method 1 to calculate β is to convert each observation of an independent variable to 0 mean and 1 std
- new_x_i = (x_i - Mean_x) / Std_x, do this for both independent and dependent variables
- Then get β
- Method 2
- Linear Regression: β_x = α_x * (std_x / std_y)
- Logistic Regression: β_x = (sqrt(3)/pi) * α_x * std_x
- It means when there is 1 std increase in the independent variable, there is β std increase in the dependent variable
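- A hedged sketch of Method 1 and Method 2 above for linear regression (variable names and toy data are mine); both routes give the same standardized β.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2)) * [10.0, 0.5]           # two features on very different scales
y = 3.0 * X[:, 0] + 40.0 * X[:, 1] + rng.normal(size=500)

# Method 2: fit on raw units, then rescale the unstandardized coefficients alpha
raw_model = LinearRegression().fit(X, y)
beta_from_alpha = raw_model.coef_ * X.std(axis=0) / y.std()

# Method 1: z-score X and y first, then the fitted coefficients are already beta
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()
beta_direct = LinearRegression().fit(Xz, yz).coef_

print(beta_from_alpha, beta_direct)  # the two agree; larger |beta| = more important feature
```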
- Check whether missing values appear across different values with different probability. This may help understand whether the value is missing completely at random (missing with the same probability for different values), or whether certain values tend to have more missing values and why.
- Deletion
- List wise deletion - remove the whole list
- Pair wise deletion - remove missing values for each column, but each column may end up with a different number of records
- It's safer to use deletion when the missing values are missing completely at random, since deletion will reduce the number of records and may reduce the prediction power
- Impute with mean/median/mode
- General Imputation - Replace missing values with selected value in the whole column
- Similar Case Imputation - For different group of values, impute with the selected value from that group
- Impute with special values, such as "MISSING", -1, etc.
- KNN Imputation
- LGBM Imputation
- It uses LGBM to impute missing values
- Miss Forest, Mice Forest
- Handle missing values with random forest
- It explains how
- Works for both numerical, categorical features
- Model prediction to compare whether imputing missing values will help
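- A minimal sketch of KNN imputation with sklearn (the toy frame is mine); the LGBM/missForest-style imputers mentioned above follow a similar fit/transform pattern.
```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 62, np.nan],
    "income": [40_000, 52_000, 61_000, np.nan, 88_000, 47_000],
})
imputer = KNNImputer(n_neighbors=2)  # impute from the 2 most similar rows
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```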
- To check whether there are outliers, I normally check the distribution, even just by using percentiles. And I only deal with a very small percentage of outliers.
- It's recommended to run the model on raw features without any preprocessing first, as the baseline result. Then run the same model with preprocessed data to compare, so that you will know how much preprocessing affects your prediction in each project.
- Sometimes I also check boxplot, but the data I am dealing with tend to have large amount of outliers, and imputing data above & below 1.5*IQR will reduce the prediction power too much.
- Better to know what caused the outliers, this may help you decide how to deal with them
- Decide which are outliers
- Distributions
- Boxplot, 1.5IQR
- Modified Z score
- Check the `MAD` value (see the sketch below)
- ML methods can be used for anomaly detection
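- A sketch of flagging outliers with the modified z-score based on MAD (the 0.6745 factor and 3.5 cutoff are the common rule of thumb, not from the original notes):
```python
import numpy as np

def modified_z_score(x):
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))          # median absolute deviation
    return 0.6745 * (x - median) / mad           # 0.6745 makes it comparable to a z-score

values = np.array([10, 12, 11, 13, 12, 95])      # 95 is the obvious outlier
scores = modified_z_score(values)
print(values[np.abs(scores) > 3.5])              # |score| > 3.5 is a common cutoff
```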
- To deal with outliers, I normally use:
- Simply replace the outliers with NULL
- Replace outliers with median/mean/mode, or a special value
- Binning the feature
- Just leave it there
- Another suggested method is to build separate models for normal data and outliers when there is a large amount of outliers not caused by errors. In industry, you may not be allowed to build separate models, but it's still a method to consider.
- Check missing values, for those with high percentage of missing values, you may want to remove them.
- Check variance
- Deal with collinearity issues
- Correlation matrix to check the correlation between pairs of features - Collinearity
- VIF (Variance Inflation Factor) to check the correlation exists between 3+ features but could not be found in any pair of features - Multicollinearity
- How to use this method, check the code here, "Check Multicollinearity" section.
- Normally when VIF is between 5 and 10, there could be a multicollinearity issue with the feature. When VIF > 10, it's too high and the feature should be removed.
- 2 major methods to deal with features with high VIF
- Remove it: Often start with removing features with highest VIF
- Combine features with high VIF into 1 feature
- The implementation of using VIF to drop features
- Description of VIF
- The Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. It is calculated by taking the ratio of the variance of all of a given model's betas divided by the variance of a single beta if it were fit alone. The VIF score of an independent variable represents how well the variable is explained by the other independent variables.
- `VIF = 1/(1-R^2)`; a higher R² means higher correlation between the variable and the other variables, so a higher VIF indicates higher multicollinearity (see the sketch below).
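- A short sketch using statsmodels' `variance_inflation_factor` (toy data of my own): compute VIF per feature, then drop the highest-VIF feature iteratively if needed.
```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=300)})
X["x2"] = X["x1"] * 0.9 + rng.normal(scale=0.1, size=300)   # nearly collinear with x1
X["x3"] = rng.normal(size=300)

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # x1 and x2 should show VIF >> 10, x3 should stay close to 1
```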
- Use tree models to find feature importance
- Better to remove highly correlated features before doing this. Some tree model will put highly correlated features all as important if one of them is highly ranked
- Dimensional Reduction Algorithms
- Better to standardize the data before applying these methods; with data on the same scale, distance calculation makes more sense.
- sklearn decomposition
- The intermediate result from autoencoder - encoder
- sklearn manifold learning
- How does PCA work - a great way to explain "eigenvalue"; it also plots the explained variance of each principal component
- Unsupervised, linear
- Needs data standardization
- How does LDA work
- Linear, Supervised, as it's trying to separate classes as much as possible
- Needs data standardization
- How does t-SNE work
- Nonlinear, Unsupervised, Non-parametric (no explicit data mapping function), therefore mainly used in visualization
- "Similarity" & normal distribution in original dimension, "similarity" & t-distribution in lower dimension.
- Both probability distributions are used to calculate similarity scores
- t-distribution has "flatter" shape with longer tails, so that data points can be spread out in lower dimensions, to avoid high density of gathering
- "Perplexity" determines the density
- Needs data standardization
- How does Isomap work
- Unsupervised, Nonlinear, with the help of MDS (multidimensional scaling), it will try to keep both global and local (between-point distance) structure
- Linear method like PCA focuses on keeping global structure more
- How does LLE work
- Nonlinear, Unsupervised
- Focusing on local structure more, more efficient than Isomap as it doesn't need to calculate pair-wise distance, but Isomap works better when you want to keep both global and local structure
- How does UMAP work
- Supports both supervised & unsupervised
- Trying to maintain both local & global structures
- Feature Selection Methods
- Scaling the features
- Sometimes, you want all the features to be normalized into the same scale, this is especially helpful in parametric algorithms.
- sklearn scaling methods on data with outliers
- PowerTransformer is better in dealing with outliers
- Sometimes, you just want to scale the values between 0 and 1 so MaxMinScaler is still popular
- Transform nonlinear relationship to linear relationship, this is not only easier to comprehend, but also required for parametric algorithms
- scatter plot to check the relationship; we can also use pandas cross tab method
- Binning
- Such as `log`; here is a list of methods for nonlinear transformation: https://people.revoledu.com/kardi/tutorial/Regression/nonlinear/NonLinearTransformation.htm
- 🌺 Some data do need to use log, such as "income" in banking data. There is a common assumption that income data is log-normally distributed, so applying `log` to it can be better.
- This can be applied to features that have inevitable "outliers" that you should not remove, but which could make the overall distribution hard to see. When there are non-positive values, sometimes you might need to use `np.log(df + 1) * np.sign(df)`
- Kernel functions
- PCA - It can convert the whole feature set into normalized linear combination
- How to use PCA for dimensional reduction/feature selection
- The components are independent from each other, and earlier components capture more variance. Each principal component is also a linear combination of the original features
- To calculate skewness
- The difference between the mean and mode, or mean and median, will tell you how far the distribution departs from symmetry.
- Pearson Mode skewness: `(mean - mode)/std`
- Pearson Mode skewness alternative: `3*(mean - median)/std`
- https://www.statisticshowto.datasciencecentral.com/pearson-mode-skewness/
- Convert a skewed distribution to a symmetric distribution; this is also preferred for parametric algorithms. Skewed data could reduce the impact of lower values, which may also be important to the prediction.
- `log` to deal with right skewness
- square root, cube root
- `exp` to deal with left skewness
- power
- binning
- derived features
- one-hot features
- decision tree paths as the feature
- Methods to deal with categorical features
- Some simple categorical encoding methods
- I often use label encoding to convert categorical target value to numerical format, and use ordinal encoding to convert categorical target value to ordinal numerical format
- LightGBM offers good accuracy with built-in integer-encoded categorical features. LightGBM applies Fisher (1958) to find the optimal split over categories. You can also have categorical features as direct input, but need to specify that as categorical type
- 10+ Built-in Categorical Encoding Methods
- Target Encoder is a popular method, and this paper describes the drawbacks of target encoding in terms of reducible bias, indicating that smoothing regularization can reduce such bias. We can do smoothing regularization through the param here (see the sketch below)
- Params for each encoding method
- More descriptions of some of the encoding methods
- Base N creates fewer dimensions while representing the data efficiently, if you choose the proper N
- "In target encoding, for numerical target, we calculate the mean of the target variable for each category and replace the category variable with the mean value; for categorical target, the posterior probability of the target replaces each category"
- Target encoding should only be applied to the training data to avoid target leakage
- 2 methods mentioned in the article to reduce target leakage/overfitting
- When the categories in training and testing data are distributed improperly, the categories may assume extreme value
- Better to remove highly correlated features after applying these encoding methods. Even if feature correlations affect tree models less, too many dimensions are not optimal for tree models either, and might lead to large trees.
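- A hedged sketch with the category_encoders package (column names and data are mine): fit the target encoder on training data only, then transform train and test, using the smoothing parameter mentioned above to regularize rare categories.
```python
import pandas as pd
import category_encoders as ce

train = pd.DataFrame({"city": ["a", "a", "b", "b", "c"], "y": [1, 0, 1, 1, 0]})
test = pd.DataFrame({"city": ["a", "c", "d"]})   # "d" is unseen

encoder = ce.TargetEncoder(cols=["city"], smoothing=1.0)
train_enc = encoder.fit_transform(train[["city"]], train["y"])   # fit on training data only
test_enc = encoder.transform(test[["city"]])                     # avoids target leakage
print(train_enc, test_enc, sep="\n")
```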
- Beta Target Encoding
- requires minimal updates and can be used for online learning
- Check its python implementation, supports numerical target
- Concat multiple categorical features as 1 feature
- Convert to value frequency or response rate, aggregated value. Also for categorical value, we can use part of the value and convert to numerical values
- Normalize data into [0, 1] without min, max
- We can use the sigmoid function `exp(x)/(exp(x) + 1)` to normalize any real value into the [0, 1] range
- If we check the curve of the sigmoid function, you can see that for any real x, y is always between 0 and 1.
- Data Transformation
- Transform non-normal distribution into normal distribution
- Besides minmax mentioned above, sklearn has multiple transform methods, some of which are even robust to outliers.
- In this example, using box-cox method can convert distributions with multiple peaks into normal distribution
- The requirement of box-cox is, all the values have to be positive.
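- A minimal sketch of the box-cox transform via sklearn's `PowerTransformer` (toy right-skewed data of my own); remember box-cox requires strictly positive values, otherwise use `yeo-johnson`.
```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))   # right-skewed, all positive

pt = PowerTransformer(method="box-cox")       # method="yeo-johnson" allows non-positive values
transformed = pt.fit_transform(skewed)        # standardized to ~0 mean, unit variance by default
print(skewed.std(), transformed.mean().round(3), transformed.std().round(3))
```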
- 2020 paper - Rethinking the Value of Labels for Improving Class-Imbalanced Learning
- Inspirations
- Train the model without label first to generate the predicted labels as new feature for the next model learning (semi-supervised learning)
- More relevant new data input might reduce test error
- Smaller data imbalance ratio for unlabeled data also reduce unlabeled data test error
- The use of T-SNE for visualization to check class boundary
- This is data dependent, not all the dataset could show obvious class boundary
- There are oversampling, undersampling, synthetic sampling (which is also oversampling) and combined sampling (oversampling + undersampling). In practice, I tried different methods in different projects; so far none of them worked well on both training and testing data.
- imbalanced-learn has multiple advanced sampling methods
- Compare sample feature and population feature distributions
- Numerical features: KS (Kolmogorov-Smirnov) test
- Categorical features: (Pearson’s) chi-square test
- Probability * Non-probability Sampling Methods
- This method is becoming more and more popular recently. Mainly you just set the class weights based on the importance of false positives and false negatives.
- In practice, it is worth learning more from the customers or the marketing team, trying to understand the cost of TP/TN or FP/FN.
- Example-dependent Cost Sensitive method
- Costcla - it's a package that do model prediction including cost matrix
- The cost matrix is example dependent. Each row has the cost of [FP, FN, TP, TN]. So when you are creating this cost matrix yourself, your training & testing data all record the costs; each row of your data is an example. Sample Model Prediction
- The drawback is that the testing data also needs the cost matrix, which in practice you may not know. However, this method can still be used on the train-validation dataset, in order to find the optimal model.
- When the prediction result is in probability format, we can change the threshold of class prediction. By default the threshold is 50-50. With the evaluation metric, better to draw a curve with thresholds as the x-axis and the evaluation result as the y-axis, so that we will know which threshold to choose (see the sketch below).
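- A sketch of the threshold-vs-metric curve described above (model, data and the F1 choice are my own placeholders): sweep thresholds over the predicted probabilities and pick the one with the best score.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_valid)[:, 1]

thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_valid, proba >= t) for t in thresholds]   # y-axis of the curve
best = thresholds[int(np.argmax(scores))]
print(f"best threshold = {best:.2f}, F1 = {max(scores):.3f}")
```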
- When the labels are given by the client, without being able to work with any business expert, there can be many problems. It is especially problematic when they only give you 1 class of labels and you have to assume the rest of the data all belongs to the other label.
- Here're the suggestions
- Better not to build the machine learning pipeline at an early stage, since the pipeline can create many more limitations when you want to try different modeling methods.
- If you don't trust the label quality, try clustering first, check the pattern in each cluster, and compare with the given labels.
- Communicate with the business experts frequently if possible, to understand how the labels were generated.
- If your model is a supervised method, and using all the data would create a huge data imbalance issue, better to choose representative sample data for the assumed label and try to reduce its size, in order to reduce the data imbalance.
- Clustering & Multiple Model Training
- Cluster the majority class into multiple non-overlapping clusters. For each cluster, train it together with the minority class and build a model. Average the final predictions
- You don't need testing data here, but the drawback is you always need the label, which won't always be the case in practice. When there is more data, you also need to operate on all of it, which can be time- and compute-consuming.
- Data Exploration Guidance
- Impact of Data Size on Model Performance
- It's using deep learning, but can inspire on multiple other models, especially those using stochastic methods
- Using bootstrap methods + multiple seeds to do multiple runs, and cross validation, are suggested. Because for stochastic methods, different seeds could lead to different directions and optimal results. It could also help deal with the high variance issue in a model.
- "As such, we refer to neural network models as having a low bias and a high variance. They have a low bias because the approach makes few assumptions about the mathematical functional form of the mapping function. They have a high variance because they are sensitive to the specific examples used to train the model."
- A larger training set tends to lead to better prediction results; a smaller test set may not always do so. So a 7:3 or 8:2 train-test ratio is always a good start.
- Guidance on dealing with imbalanced data 1, Guidance on dealing with imbalanced data 2
- During R&D, there will be more and more fancy methods, especially in deep learning. Are they going to perform better compared with these easy-to-implement baselines?
- Categorize based on words/characters
- Naive Bayesian
- Fisher Methods
- You can do self-implementation, check chapter 6
- Find themes of a piece of text
- Non-Negative Matrix Factorization
- You can do self-implementation, check chapter 10
- Linear or Nonlinear
- In the code here, we can use residual plot to check whether there is linear/non-linear relationship between the feature set and the label, to decide to use linear/nonlinear model.
- If the residual plot shows a funnel shape, it indicates non-constant variance in the error terms (heteroscedasticity), where residuals also tend to increase with the response value (Y). So we can try to use a concave function (such as log, sqrt) to transform `Y` to a much smaller value, in order to reduce heteroscedasticity.
- As we can see in the code, after transforming `Y` with `log`, the residual plot shows a much more linear relationship (the residuals have a more constant variance)
- TPOT Automatic Machine Learning
- TPOT for model selection
- It's a very nice tool that helps you find the model with optimized params, and it also exports the Python code for using the selected model. TPOT Examples
- TPOT Params for Estimators
- 10-fold of cross validation, 5 CV iterations by default
- TPOT will evaluate `population_size + generations × offspring_size` pipelines in total. So when the dataset is large, it can be slow.
- All the classifiers that TPOT supports
- All the regressors that TPOT supports
- Different TPOT runs may result in different pipeline recommendations. TPOT's optimization algorithm is stochastic in nature, which means that it uses randomness (in part) to search the possible pipeline space.
- The suggestion here is: run TPOT multiple times with different `random_state`, narrow down to fewer models and do further evaluation (see the spot-check pipeline below). If the data size is large, try to reduce `population_size + generations × offspring_size` and `cv`, and use `subsample`
- Sample exported TPOT python file
- Spot-check Model Evaluation
- After you have a short list of models, you want to quickly check which one performs better. What I'm doing here is, for each model, use all the training data with stratified k-fold cross validation. Finally it evaluates the average score and the score variance.
- Some people think it's important to use bootstrap, which splits the data into multiple folds and runs on each fold, or uses different seeds to run the same model multiple times. I'm not using bootstrap here, because the first option is similar to stratified k-fold cross validation; the second option I will use once I have finalized 1 model, using bootstrap to deliver the final evaluation results.
- Tools other than TPOT
- R-Square and RSS (residual sum of squares) will decrease when there are more features, but the test error may not drop. Therefore, R-Square and RSS should not be used for selecting models that have different numbers of features.
- `RSS = MSE*n`
- `R-square = explained variation / total variation`, it means the percentage of response variable variance explained by the model.
- It's between 0% and 100%
- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.
- When there is time order in the data
- Solution - Forward Chaining
- Step 1: training(1), testing(2)
- Step 2: training(1,2), testing(3)
- Step 3: training(1,2,3), testing(4)
- ...
- Application - sklearn time series split
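- Forward chaining with sklearn's `TimeSeriesSplit`, matching the steps listed above (training folds only ever come from the past, the test fold from the future); the toy array is my own.
```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)   # assume rows are already sorted in time order
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Step {fold}: train={train_idx.min()}..{train_idx.max()}, "
          f"test={test_idx.min()}..{test_idx.max()}")
```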
- We know ensembling models tend to work better in many cases, such as Xgboost and LightGBM
- Super Learner example
- It trains on a list of base models, using the predictions as the input of a meta model to predict the targets, and finally might be better than a specific "best model"
- Radial Basis Function kernel (RBF kernel) is used to determine edge weights in the affinity matrix, a matrix that defines pairwise relationships between points.
- sklearn's implementation of the RBF kernel looks slightly different as it replaces 1/(2*sigma^2) with a hyperparameter gamma. The effect is the same, as it allows you to control the smoothness of the function. Low gamma extends the influence of each individual point widely, hence creating a smooth transition in label probabilities. Meanwhile, high gamma leads to only the closest neighbors having influence over the label probabilities.
- Reference
- When you have labeled data for all classes:
- How does label propagation work
- How does label spreading work
- To be more accurate, I would say the difference between hard clamping and soft clamping is the weight (alpha) on original label is 0 or not, check sklearn's description
- How does self-training classifier work
- Only has positive & unlabeled data
- PU learning - How does E&N work
- How to evaluate PU learning without negative label
- How the margin is generated - the margin maximizes the orthogonal distance between the closest points in each category and the hyperplane; these closest points are the support vectors
- How does kernel SVM work for nonlinear data
- The Gaussian kernel is the `rbf` kernel in sklearn
- It's almost the oldest optimization method, trying to find a local minimum through iteration.
- An easy way to understand it
- How to implement gradient descent
- `X` is the feature set, `y` is the labels, `theta` is the (initial) weights; `alpha` is the learning rate; `m` is the number of records
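- A minimal batch gradient descent sketch for linear regression, following the notation above (X, y, theta, alpha, m); this is my own illustration, not the linked implementation.
```python
import numpy as np

def gradient_descent(X, y, theta, alpha=0.01, n_iters=1000):
    m = len(y)                                   # number of records
    for _ in range(n_iters):
        error = X @ theta - y                    # predictions minus labels
        gradient = (X.T @ error) / m             # gradient of the MSE cost
        theta = theta - alpha * gradient         # move against the gradient
    return theta

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=200)]    # bias column + one feature
y = 4.0 + 2.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
print(gradient_descent(X, y, theta=np.zeros(2)))  # should approach [4.0, 2.5]
```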
- The final comparison table is very helpful
- weights vs gradient
- tree depth
- classifier weights
- data variance capture
- L1 regularization adds the penalty term in cost function by adding the absolute value of weight(Wj) parameters, while L2 regularization adds the squared value of weights(Wj) in the cost function.
- L1 pushes the weight `w` towards 0 no matter whether it's positive or negative; therefore, L1 tends to be used for feature selection
- If w is positive, the L1 regularization parameter λ>0 will push w to be less positive by subtracting λ from w. If w is negative, λ will be added to w, pushing it to be less negative. Hence, this has the effect of pushing w towards 0.
- As for dealing with multicollinearity, L1, L2 and Elastic Net (which uses both L1 and L2) can all help
- There are things we can do to make the evaluation more reliable:
- Hold-out Data
- If the dataset is large enough, it's always better to have a piece of hold-out data, and always use this piece of data to evaluate the models
- Run the same model multiple times
- You can try different seeds with the whole dataset
- Bootstrap - run the model on randomly selected samples, finally aggregate the results
- Resample with replacement
- UNIFORMLY random draw
- The advantage of draw with replacement is, the distribution of the data won't be changed when you are drawing
- sklearn methods for data splitting
- Statistics has proven that bootstrap gives estimates close to using the true population, when you have selected a large enough number of samples (similar to the central limit theorem)
- My code - Using Bootstrap to Estimate the Coefficients of Linear Regression
- Applied on linear regression, and quadratic linear regression
- It compares bootstrap results and standard-formula results. Even though both got the same estimated coefficients, standard formulas tend to show lower standard errors. However, since the standard formulas make assumptions while bootstrap is nonparametric, bootstrap tends to be more reliable
- Cross Validation
- We can calculate the average evaluation score and the score variance to compare model performance
- In sklearn, we can just use cross_val_score, which allows us either to use integer as stratified kfold cv folds, or cross validation instances
- There is a suggestion for model evaluation, which is, choose and train models on the same dataset and averaging the evaluation results. The idea here is, "combined models increase prediction accuracy"
- "Probability" means predict continuous values (such as probability, or regression results), "Response" means predict specific classes
- The model doesn't have to be forecasting/time series models only, it can be classification model too with features generated using sliding window/tumbling window.
- No matter which model to use, when there is time series in the data, better to use walk-forward evaluation, which is to train on historical but predict on the future data.
- Better to sort the input data in time order. Even though the model doesn't check historical records and the features already incorporate time series values, if the data is not sorted in time order, later records appearing before their historical records during training can still create bias
- ROC Curve vs. Precision-Recall Curve
- ROC curves require the 2 classes to be balanced. The Precision-Recall curve is better for imbalanced datasets.
- Both of them have TPR (recall/sensitivity, the percentage of TP cases are described as positive), the difference is in precision and FPR.
- `FPR = FP/(FP + TN)`; when the negative class is much larger than the positive class, FPR can be small.
- `Precision = TP/(TP + FP)`; it indicates what percentage of predicted positives are correct. When the negative class is much larger than the positive class, this is affected less, compared with ROC.
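- A sketch comparing ROC AUC and PR AUC on an imbalanced toy problem (data and model are my own placeholders), to make the point above concrete: ROC can look optimistic when negatives dominate.
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:", round(roc_auc_score(y_te, proba), 3))
print("PR  AUC:", round(average_precision_score(y_te, proba), 3))  # usually much lower here
```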
- Logloss is used for probability output, therefore it's not used as a regression method in sklearn
- In deep learning loss function doesn't have to be logloss, it can simple be "mean_absolute_error"
- MSE/RMSE works better for continuous target
- `R-square = explained variation / total variation`, it means the percentage of response variable variance explained by the model.
- Interpretation
- 1 is the best, meaning no error
- 0 means your regression is no better than taking the mean value
- Negative value means you are doing worse than the mean value
- It's between 0% to 100%
- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.
- R-square cannot tell bias, so have to be used with residual plot.
- When the residual plot dots are randomly dispersed around the horizontal axis, a linear model is better for the data, otherwise a nonlinear model is better.
- Lower R-square doesn't mean the model is bad.
- For unpredictable things, lower R-square is expected. Or R-squared value is low but you have statistically significant predictors, it's also fine.
- High R-square doesn't mean good model.
- The residual plot may not be random and you would choose a nonlinear model rather than a linear model.
- Residual Plot
- In a regression model, `Response = (Constant + Predictors) + Error = Deterministic + Stochastic`
- Deterministic - The explanatory/predictive information, explained by the predictor variables.
- Stochastic - The unpredictable part, random, the error. This also means the explanatory/predictive information should not be in the error.
- Therefore, residual plot should be random, which means the error only contains the Stochastic part.
- So when the residual plot is not random, that means the predictor variables are not capturing some explanatory information that leaked into the residual plot. Possibilities when predictive info could leak into the residual plot:
- Didn't capture a missing variable.
- Didn't capture a missing higher-order term of a variable in the model to explain the curvature.
- Didn't capture the interaction between terms already in the model.
- The residual is correlated to one or more variables, and these variables didn't get captured in the model.
- Better to check the correlation between residuals and variables.
- Autocorrelation - Adjacent residuals are correlated to each other.
- Typically this situation appears in time series, when you can use one residual to predict the next residual.
- Test autocorrelation with ACF, PACF
- ACF (AutoCorrelation Function), PACF (Partial AutoCorrelation Function)
- More description about ACF, PACF
- When there are 1+ lags have values beyond the interval in ACF, or PACF, then there is autocorrelation.
- Test autocorrelation with python statsmodels
- What is Durbin Watson test for autocorrelation
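- A sketch with statsmodels (the residual series is synthetic): plot ACF/PACF of the residuals and run the Durbin-Watson test; values near 2 suggest no autocorrelation.
```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
residuals = np.zeros(300)
for t in range(1, 300):                         # AR(1)-style autocorrelated residuals
    residuals[t] = 0.7 * residuals[t - 1] + rng.normal()

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(residuals, lags=20, ax=axes[0])
plot_pacf(residuals, lags=20, ax=axes[1])
plt.show()

print("Durbin-Watson:", round(durbin_watson(residuals), 2))  # well below 2 here
```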
- References
- Still needs labels
- Sometimes, you simply do not have the label at all. You can:
- Compare with datasets that have labels, at the risk that the results may not be transferable
- Check predicted results distribution, comparing between different rounds
- Your customer will expect a percentage for each label, compare with that...
- Suggestions to improve scores
- Making the probabilities less sharp (less confident). This means adjusting the predicted probabilities away from the hard 0 and 1 bounds to limit the impact of penalties of being completely wrong.
- Shift the distribution to the naive prediction (base rate). This means shifting the mean of the predicted probabilities to the probability of the base rate, such as 0.5 for a balanced prediction problem.
- Reference
- How to Compare 2 Models After Each has been Running Multiple Rounds
- Calibration
- Concordant & Discordant
- KS Test - Kolmogorov-Smirnov (K-S) chart is a measure of the difference between the y_true and y_pred distributions for each class respectively. It's a method can be used to compare 2 samples.
- KS test vs t-test: Imagine that the population means were similar but the variances were very different. The Kolmogorov-Smirnov test could pick this difference up but the t-test cannot
- Data Exploration
- Lagged Features
- When your features are created from lagged data, you can check the correlation between these lags. If the correlation is close to 0, it means current values have no correlation with previous time values
- If you want a quick forecast, such as hourly, daily, weekly, yearly forecast, try Facebook Prophet
- My code - time series functions
- 11 Forecasting methods
- When to use which model
- It seems that methods which do not have `forecast` but only have `predict` will only predict the next value; with `forecast` we can predict the next period of values
- My code with 7 methods
- Some are not in above 11 methods
- Simple, Exponential & Weighted Moving Average
- Simple moving average (SMA) calculates an average of the last n prices
- The exponential moving average (EMA) is a weighted average of the last n prices, where the weighting decreases exponentially with each previous price/period. In other words, the formula gives recent prices more weight than past prices.
- The weighted moving average (WMA) gives you a weighted average of the last n prices, where the weighting decreases with each previous price. This works similarly to the EMA.
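- A quick pandas sketch of the three averages above (the window size and price series are arbitrary toy values):
```python
import numpy as np
import pandas as pd

prices = pd.Series([10, 11, 12, 13, 15, 14, 16, 18, 17, 19], dtype=float)
n = 3

sma = prices.rolling(window=n).mean()                     # simple moving average
ema = prices.ewm(span=n, adjust=False).mean()             # exponential moving average
weights = np.arange(1, n + 1)                             # 1, 2, 3 -> recent price weighs most
wma = prices.rolling(window=n).apply(lambda w: np.dot(w, weights) / weights.sum(), raw=True)

print(pd.DataFrame({"price": prices, "SMA": sma, "EMA": ema, "WMA": wma}))
```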
- Grid Search to tune ARIMA params
- Combining multiple models' forecasts
- Combining multiple forecasts leads to increased forecast accuracy.
- Choose and train models on the same time series and averaging the resulting forecasts
- Reference
- This is a common solution for industry problems, with multiple constraint functions and 1 objective function to optimize.
- My code - Linear Programming to Get Optimized List of Youtube Videos
- It only supports minimization problems, so you need to adjust the objective function if it's supposed to be maximized
- A detailed introduction of applied linear programming
- How does Simplex Method work
- Use this solution when 2D drawing is no longer enough
- Python Linear Programming - scipy.optimize.linprog
- How to use this library...
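- A `linprog` sketch (my own toy problem): maximize 3x + 2y subject to x + y <= 4, x <= 3, x, y >= 0; since linprog only minimizes, the objective is negated as noted above.
```python
from scipy.optimize import linprog

c = [-3, -2]                      # negate to turn maximization into minimization
A_ub = [[1, 1], [1, 0]]           # constraint coefficients (left-hand side)
b_ub = [4, 3]                     # constraint bounds (right-hand side)

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, -result.fun)      # optimal (x, y) and the maximized objective value
```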
- LP Modeler - PuLP
- Pulp example with concept description, concept includes:
- convex vs concave vs non-convex
- infeasible or unbounded solution
- Pareto Front
- When you have multiple evaluation metrics (or objective functions), this method helps to find the records where at least 1 metric works better.
- With this method, it helps remove more unnecessary records --> more efficient.
- The idea can be modified for other use cases; for example, instead of requiring at least 1 metric to work better, we can require all the metrics, or certain metrics, to work better, etc.
- MADM (Multi-criteria decision analysis methods)
- The methods listed in skcriteria are mainly using linear programming
- Use Cases
- Param Tuning - Pareto Front improves the efficiency, MADM find the optimal option
- Tableau
- Spotfire
- Redash
- ExplainerDashboard
- It has summarized different methods in ML model interpretations, such as NN interpretation, model-agnostic interpretation, etc.
- Their GitHub, Examples and the Paper: https://github.com/marcotcr/lime
- Interpretable: The explanation must be easy to understand depending on the target demographic
- Local fidelity: The explanation should be able to explain how the model behaves for individual predictions
- Model-agnostic: The method should be able to explain any model
- Global perspective: The model, as a whole, should be considered while explaining it
- The tool can be used for both classification and regression. The reason I put it here is because it can show feature importance even for blackbox models. In industry, interpretability can ultimately determine whether you can apply the more complex methods that bring higher accuracy. In too many situations, the industry finally went with the simplest models or even just intuitive math models. This tool may help provide better interpretation for those better models.
- My code: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/Better4Industry/lime_interpretable_ML.ipynb
- It seems that GitHub cannot show those visualization I have created in IPython. But you can check LIME GitHub Examples
- LIME requires the data input to be a numpy array; it doesn't support pandas dataframes yet. That's why you can see in my code, I was converting the dataframes and lists all to numpy arrays.
- NOTES
- Currently they have to use the predicted probability in the `explain_instance()` function
- You also need to specify all the class names in `LimeTabularExplainer`, especially in classification problems, otherwise the visualization cannot show classes well
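- A hedged LIME sketch (dataset and model are placeholders, not from my notebook): note the numpy inputs, the `class_names` argument, and passing `predict_proba` to `explain_instance()`.
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_iris()
X, y = data.data, data.target                      # already numpy arrays, as LIME expects
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=data.feature_names,
    class_names=list(data.target_names),           # specify all class names
    mode="classification",
)
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())                       # per-feature contributions for this record
```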
- SHAP value is a type of feature importance value: https://christophm.github.io/interpretable-ml-book/shap.html#shap-feature-importance
- It uses Shapley values at its core and is aimed at explaining each individual record.
- "Shapley value for each feature is basically trying to find the correct weight such that the sum of all Shapley values is the difference between the predictions and average value of the model. In other words, Shapley values correspond to the contribution of each feature towards pushing the prediction away from the expected value."
- So a higher Shapley value indicates the feature pushes the prediction towards the positive class more.
- So a lower Shapley value indicates the feature pushes the prediction towards the negative class more.
- "Shapley value for each feature is basically trying to find the correct weight such that the sum of all Shapley values is the difference between the predictions and average value of the model. In other words, Shapley values correspond to the contribution of each feature towards pushing the prediction away from the expected value."
- More details from my practice, including shap decision plot
- It supports feature importance interpretation for the famous ensembling models, as well as deep learning model output interpretation.
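- A short SHAP sketch for a tree ensemble (the data and model are placeholders of my own): `TreeExplainer` gives per-record Shapley values, and `summary_plot` shows global importance.
```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
model = GradientBoostingClassifier(random_state=0).fit(data.data, data.target)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data)     # shape: (n_records, n_features)

# positive values push the prediction towards the positive class, negative towards the other
shap.summary_plot(shap_values, data.data, feature_names=data.feature_names)
```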
- Tutorials
- The visualizations generated by this library are more like sklearn style, basic but useful. It has multiple visualizers:
- Feature Analysis, Target visualizer, Regression/Classification/Clustering visualizers, Model Selection visualizer, Text Modeling visualizers.
- Find all its visualizers
- You can visualize neural networks without any prior setup.
- Google Colab Notebooks
- It might be better for deep learning. So far I haven't seen how impressive this tool is~
- Examples
- Much more text than visualization
My model evaluation previous detailed summary
- `dplyr` - mainly for query-related operations, such as "select", "group by", "filter", "arrange", "mutate", etc.
- `data.table` - fast, even more convenient than data.frame, can also do queries inside
- `ggplot2`
- `reshape2` - reshape the data, such as "melt", "dcast", "acast" (reversed melt)
- `readr` - read files faster, different functions support different file types
- `tidyr` - makes the data "tidy". "gather" is similar to the above "melt" in `reshape2`; "separate", "separate_d" could help split 1 column into multiple columns and vice versa, etc.
- `lubridate` - deal with datetime
- My code of R 5 packages for dealing with missing values
- `MICE` - it assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on observed values and can be predicted using them.
- `Amelia` - it assumes that all variables in a data set have a Multivariate Normal Distribution (MVN). It uses means and covariances to summarize data.
- `missForest` - it builds a random forest model for each variable, then uses the model to predict missing values in the variable with the help of observed values.
- `Hmisc` - it automatically recognizes the variable types and uses bootstrap samples and predictive mean matching to impute missing values. You don't need to separate or treat categorical variables. It assumes linearity in the variables being predicted.
- `mi` - it allows graphical diagnostics of imputation models and convergence of the imputation process. It uses a Bayesian version of regression models to handle the issue of separation. Imputation model specification is similar to regression output in R. It automatically detects irregularities in data such as high collinearity among variables. Also, it adds noise to the imputation process to solve the problem of additive constraints.
- Recommended to start with missForest, Hmisc, MICE, and then try others
- How does R output use p-value in regression
- In R, after we have applied regression, we will see coefficient values as well as p-values.
- The null hypothesis is, coefficient is 0 for a variable (no effect to the model).
- So when the p-value is lower than the significance level, reject the null hypothesis, which means the variable should be included in the model. When the p-value is high, we fail to reject the null hypothesis, and the variable should be removed from the model.
- MLFlow - Machine Learning Life Cycle Platform
- Record and query experiments: code, data, config, and results.
- Packaging format for reproducible runs on any platform.
- General format for sending models to diverse deployment tools.
- My Code to try it
- Pymongo - Get access to MongoDB easier
- Better not to use pandas `iterrows()`; instead, store each record as a dictionary in a list and iterate over that. Iterating pandas rows can be much slower.
- How to apply Monte Carlo Simulation
- 3 types of t-test
- Regression plots, comparing model output with random output
- Promising New Methods Mentioned in My Garden