The machine learning process involves the following steps:
- 1- Data Preparation: Collect, clean, and preprocess data.
- 2- Data Visualization and Analysis: Visualize and analyze data to identify patterns and relationships.
- 3- Feature Engineering: Select and transform relevant variables in the data.
- 4- Model Selection: Choose the best model for the problem.
- 5- Model Training: Feed data into the model and adjust parameters to minimize error.
- 6- Hyperparameter Tuning: Set hyperparameters to optimize model performance.
- 7- Model Evaluation: Measure accuracy, precision, recall, and other performance metrics.
- 8- Model Deployment: Integrate the model into an application and set up a pipeline to feed new data.
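The steps above can be sketched end to end with Scikit-learn. This is a minimal illustration, not the repository's own code: the data is synthetic (generated with `make_classification`), and the model, the parameter grid, and all variable names are assumptions chosen for brevity. Visualization (step 2) and deployment (step 8) are left out.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: data preparation (synthetic stand-in for a real, cleaned dataset)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Steps 3-4: feature scaling plus a model choice, bundled in one pipeline
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Steps 5-6: training and hyperparameter tuning with cross-validated grid search
search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},  # assumed grid
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Step 7: evaluation on data the model has not seen
y_pred = search.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}
print(search.best_params_)
print(metrics)
```

Keeping the scaler inside the pipeline means it is refit on each cross-validation fold, so no statistics from the validation folds leak into training.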
This tutorial covers Machine Learning Basics using Python.
The repository includes Python notebooks, reference guides, and cheatsheets for the entire Machine Learning process:
- 1- Data preprocessing and analysis: clean and transform data into a format suitable for analysis using NumPy and Pandas.
- 2- Data visualization: understand and explore data visually using Matplotlib and Seaborn.
- 3- Machine learning: explore various algorithms in Scikit-learn such as regression, classification, and clustering.
- 4- Feature engineering: feature encoding, feature scaling, feature selection, etc.
- 5- Model selection: comparison of ML algorithms, how to choose a ML algorithm, etc.
- 6- Hyperparameter tuning: Grid Search, Random Search, and Bayesian Optimization.
- 7- Model evaluation: validation methods, evaluation metrics, etc.
- 8- Model explainability: feature importance, interpretable models, etc.
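As a rough illustration of two of the tuning strategies named in the list above, the sketch below runs Grid Search and Random Search over the same model. The synthetic data, the random forest, and the parameter ranges are all assumptions made for the example. Bayesian Optimization is omitted because it typically relies on an additional library.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data (assumption: stands in for a real training set)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=25, random_state=0)

# Grid Search: tries every combination (3 x 2 = 6 candidates)
grid = GridSearchCV(
    model,
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    cv=3,
)
grid.fit(X, y)

# Random Search: samples a fixed number of candidates from distributions
rand = RandomizedSearchCV(
    model,
    param_distributions={
        "max_depth": randint(2, 9),         # integers 2..8
        "min_samples_leaf": randint(1, 6),  # integers 1..5
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print("grid:", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

Grid Search cost grows with the product of the grid sizes, while Random Search cost is fixed by `n_iter`, which is why the latter is often preferred when many hyperparameters are tuned at once.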
The repository also includes two Python notebooks with popular examples for getting started with Machine Learning:
- Classification - Titanic Survival Prediction: Predict whether a passenger on the Titanic ship survived or not based on various features such as their age, gender, ticket class, and cabin location (notebook).
- Regression - Boston House Price Prediction: Predict the median value of houses in Boston neighborhoods based on various features such as crime rate, number of rooms, proximity to employment centers, and accessibility to highways (notebook).
The end of the GitHub repository lists resources and links for practicing and advancing in Machine Learning:
- The most popular ML dataset platforms.
- The most popular ML competition platforms.
- A guide to tackle ML competitions (PDF).
Tools:
- Python 3
- Jupyter Notebook: web-based interactive computing platform
- Google Colab: cloud-based Jupyter Notebook environment
Concepts:
- Mathematics (refresher)
- Python programming (refresher, notebook, guide GDSC)
- Data Structures (refresher)
Python libraries:
- NumPy: A library for efficient numerical operations and multidimensional arrays, widely used in scientific computing and data analysis.
- Pandas: A data manipulation and analysis library, providing data structures and functions to easily handle and process structured data.
- Matplotlib: A popular plotting library used for creating static, animated, and interactive visualizations.
- Seaborn: A data visualization library based on Matplotlib, providing high-level functions for creating attractive statistical graphics.
- Scikit-learn: A data analysis and modeling library, including ML algorithms for various tasks: classification, regression, clustering, etc.
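To make the roles of the first two libraries concrete, here is a small sketch: NumPy handles vectorized arithmetic on arrays, and Pandas handles labeled, tabular data. The values and column names are made up for illustration and do not come from the repository's notebooks.

```python
import numpy as np
import pandas as pd

# NumPy: element-wise arithmetic and reductions on a 2-D array
heights_cm = np.array([[170.0, 182.0],
                       [165.0, 190.0]])
heights_m = heights_cm / 100.0          # applied to every element at once
col_means = heights_m.mean(axis=0)      # mean of each column

# Pandas: a small table with a missing value (made-up data)
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "ticket_class": [3, 1, 3, 2],
})
df["age"] = df["age"].fillna(df["age"].median())   # impute with the median
mean_age_by_class = df.groupby("ticket_class")["age"].mean()

print(col_means)
print(mean_age_by_class)
```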
- 1- Machine learning basic concepts
- 2- Read input data in Python
- 3- Data preprocessing and analysis: NumPy and Pandas
- 4- Data visualization: Matplotlib and Seaborn
- 5- Machine learning: Scikit-learn
- 6- Feature engineering
- 7- Model selection and parameter tuning
- 8- Model evaluation and explainability
- 9- Practice: Machine learning datasets
- 10- Practice: Machine learning competitions
1- Machine learning basic concepts
- Presentation on Machine learning basic concepts (PDF)
2- Read input data in Python
- Tutorial to read various sources in a DataFrame (notebook)
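As a quick taste of what reading data into a DataFrame looks like, the sketch below loads the same two records from CSV text and from JSON text. The inline strings are made-up stand-ins for real files, so the example runs without any external data. With a file on disk, a path would be passed to `pd.read_csv` instead of the `StringIO` buffer.

```python
import io
import json

import pandas as pd

# CSV: read_csv accepts a path or any file-like object
csv_text = "name,age\nAda,36\nGrace,45\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records maps directly onto DataFrame rows
json_text = '[{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]'
df_json = pd.DataFrame(json.loads(json_text))

print(df_csv)
print(df_json)
```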
3- Data preprocessing and analysis: NumPy and Pandas
- NumPy cheatsheet (PDF)
- NumPy tutorial (notebook)
- Pandas cheatsheet (PDF)
- Pandas tutorial (notebook)
- Data preprocessing tutorial (notebook)
4- Data visualization: Matplotlib and Seaborn
- Chart chooser (PDF)
- Matplotlib cheatsheet (PDF)
- Matplotlib tutorial (WEB)
- Seaborn tutorial (WEB)
- Data visualization tutorial (notebook)
5- Machine learning: Scikit-learn
- Machine learning map (PDF)
- Scikit-learn cheatsheet (PDF)
- Scikit-learn tutorial (notebook)
- Machine learning tutorial (notebook)
- Classification: Titanic Survival Prediction (notebook)
- Regression: Boston House Price Prediction (notebook)
6- Feature engineering
- Data cleaning guide (PDF)
- Data preparation cheatsheet (PDF)
- Feature engineering (PDF)
- Feature engineering tutorial (notebook)
- Feature selection methods (IMG)
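The three feature engineering techniques this tutorial names (encoding, scaling, and selection) can be combined in a few lines of Scikit-learn. The sketch below is only one possible arrangement: the tiny table, its column names, and the choice of `k=2` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: two numeric columns, one categorical column, binary target
df = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "embarked": ["S", "C", "S", "S", "S", "C"],
})
y = [0, 1, 1, 1, 0, 1]

# Scaling for numeric columns, one-hot encoding for the categorical one
preprocess = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["age", "fare"]),
        ("encode", OneHotEncoder(), ["embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_encoded = preprocess.fit_transform(df)   # 2 scaled + 2 one-hot columns

# Selection: keep the 2 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_encoded, y)

print(X_encoded.shape, X_selected.shape)
```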
7- Model selection and parameter tuning
- Comparison of ML algorithms 1 (PDF)
- Comparison of ML algorithms 2 (IMG)
- Comparison of ML algorithms 3 (IMG)
- How to choose a ML algorithm (IMG)
- Hyperparameter tuning (WEB)
8- Model evaluation and explainability
- Evaluation metrics cheatsheet (PDF)
- Evaluation metrics in Python (WEB)
- Model explainability cheatsheet (PDF)
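For a concrete view of the common classification metrics, the sketch below computes them with Scikit-learn on a small set of hand-written labels and checks them against the confusion-matrix counts. The labels are invented for the example.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Invented ground truth and predictions for a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(cm)
print(accuracy, precision, recall, f1)
```

Here precision (0.8) is higher than recall (about 0.67): the classifier makes few false alarms but misses two of the six positives, a distinction that accuracy alone (0.7) does not show.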
9- Practice: Machine learning datasets
- UCI Machine Learning Repository: https://archive.ics.uci.edu/
- Kaggle datasets: https://www.kaggle.com/datasets
- Awesome Public Datasets: https://github.com/awesomedata/awesome-public-datasets
- Google Dataset Search: https://datasetsearch.research.google.com/
- OpenML Datasets: https://www.openml.org/
- Papers With Code: https://paperswithcode.com/datasets
10- Practice: Machine learning competitions
- Kaggle: https://www.kaggle.com/competitions
- DrivenData: https://www.drivendata.org
- Zindi Africa: https://zindi.africa/competitions
- Guide to tackle ML competitions (PDF)