Skip to content

Latest commit

 

History

History
252 lines (137 loc) · 16.5 KB

README.md

File metadata and controls

252 lines (137 loc) · 16.5 KB

Table of contents

Summary

This github repo hosts a tool called Machine Learning Made Easy (MLme). By integrating four essential functionalities, namely data exploration, AutoML, CustomML, and visualization, MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding efforts. MLme serves as a valuable resource that empowers researchers of all technical levels to leverage ML for insightful data analysis and enhance research outcomes. By simplifying and automating various stages of the ML workflow, it enables researchers to allocate more time to their core research tasks, thereby enhancing efficiency and productivity.

Demo Server

We have set up a demo server for MLme for demonstration purposes. Please Click Here to launch it. Please note that it may take a moment to load.

User can use the example input data to test it.

Graphical Abstract

Determining Suitability of MLme for Your Dataset

To understand if your data and scientific question fall into the category of a classification problem, there are a few important points to consider:

  • Categorized Data: Your data should consist of examples that are grouped into distinct classes or categories. For example, if you're studying different species of plants, each plant should be labeled with its corresponding species name, like "rose" or "tulip.

  • Prediction Goal: Your scientific question should involve predicting or assigning these labels to new instances based on their features. For instance, you might want to predict the species of a new plant based on its petal length, petal width, and other characteristics.

Installation (via Docker)

  1. Prerequisite: Before proceeding with the installation, ensure that Docker is installed and running. If you haven't installed Docker yet, you can follow the official Docker tutorial for installation instructions.

  2. To obtain the MLme docker image, you may open your terminal and run the provided command.

    docker pull 45474547/mlme:latest     
    
  3. To launch MLme, please run the given command in your terminal after performing the previous steps.

    docker run -p 8080:80 45474547/mlme                     
    
  4. Paste http://localhost:8080/ in your browser to access MLme.

Installation (from source code)

  • Prerequisite

    To use MLme, you must have Python version 3.9 and pip installed and that they are accessible from the terminal.

  • Download sourcecode

    Download the GitHub repository and unzip it.

  • Install dependencies

    1. Open your terminal and change your current working directory to MLme(e.g. cd path/to/MLme-main/).

    2. Please install the required packages using the following command:

      pip install -r requirements.txt

  • Launch MLme

    python -m main

Example Input Data

Tutorial

  • Data Exploration

    • Data Exploration feature allows you to upload your datasets and gain valuable insights through statistical visualizations. By analyzing data patterns, trends, and outliers, you'll be equipped to make informed decisions when developing machine learning pipelines.

    • Step 1: Input

      • Prepare your data in either .csv or .txt format (Example input data). Each row should represent a sample, and each column should represent a feature. The first column should contain the sample name, and the last column should contain the target classes. Make sure your file doesn't have any missing values (NaN). Here's an example of how your input data should look.

        Sample Feature 1 Feature 2 Target Class
        Sample 1 2.5 7.8 A
        Sample 2 1.3 6.7 B
        Sample 3 4.7 3.2 A

        Note: When uploading your file, remember to select the correct separator using the Sep dropdown menu to avoid any errors.

    • Step 2: Output

      • Once you've uploaded your dataset, you'll have access to various analysis options. You can explore your data in-depth using statistical summary tables and five different types of plots, including density and correlation matrix plots. Simply select the desired option from the Plot/Table Type dropdown menu.

      • Additionally, it provides you with the convenience of downloading the plots. Simply click on the camera button provided on each plot to save it for future reference or to share with others.

  • AutoML

    • The AutoML feature aims to provide accessibility to machine learning for all users, including those without technical expertise. It automates the machine learning pipeline, which includes preprocessing, feature selection, and training and evaluating multiple classification models. Additionally, it provides a default dummy classifier for comparison.

    • Step 1: Input

      • Prepare your data in either .csv or .txt format (Example input data). Each row should represent a sample, and each column should represent a feature. The first column should contain the sample name, and the last column should contain the target classes. Make sure your file doesn't have any missing values (NaN). Here's an example of how your input data should look.

        Sample Feature 1 Feature 2 Target Class
        Sample 1 2.5 7.8 A
        Sample 2 1.3 6.7 B
        Sample 3 4.7 3.2 A

        Note: When uploading your file, remember to select the correct separator using the Sep dropdown menu to avoid any errors.

    • Step 2: Configuration

      The AutoML feature offers a few configuration options to customize the analysis according to your requirements:

      • Variance Threshold: The ML pipeline includes a variance threshold feature that eliminates low-variance features. You can specify the threshold value to fine-tune the feature selection process.

      • No of Features to Select: Specify the percentage of features you want to select from the original set using the feature selection step. This allows you to focus on the most relevant features.

      • Tes Set: You have the option to allocate a separate test set, comprising 30% of the initial dataset, for evaluating the model's performance. This set is exclusively used for testing and not for training the models.

    • Step 3: Output

      Once the analysis is complete, the AutoML feature provides you with several outputs to assess and interpret the results:

      • Evaluation Metrics: You will receive a table displaying scores for 11 evaluation metrics for six ML algorithms, including SVM, KNN, AdaBoost, GaussianNB, and the dummy classifier. These metrics help you gauge the performance of each algorithm and make informed comparisons.

      • Selected Features: Another table will show the features selected from the original set. This allows you to identify the most important features that contribute to the model's performance.

      • Model Performance Visualization: The feature includes different visualization options, such as spider plots and heatmaps, to help you visualize and interpret the performance of the models. These plots provide a clear understanding of how each algorithm performs across different metrics.

      • Pipeline Display: For a more detailed view, you can explore the pipelines that were executed during the analysis. This includes information about the steps and parameters involved in each pipeline, giving you insights into the underlying processes.

      • Downloadable Results: You can download a zip file that contains a log file and all the results.pkl files. These files capture the detailed results of the analysis. You can upload the results.pkl file to the visualization tab for further analysis and interpretation.

  • CustomML

    • The CustomML feature is designed for intermediate to advanced machine learning users who want to create a tailored machine learning pipeline to meet their specific requirements. With its user-friendly interface, users can easily design their pipeline by selecting the desired preprocessing steps, classifiers, model evaluation methods, and evaluation metric scores, all through simple toggle buttons. This allows users to focus on selecting the most suitable options for their dataset without the need for programming.

    • Step 1: Designing the Pipeline

      The CustomML feature provides following straightforward interface for designing your pipeline:

      • Preprocessing Steps: You can choose from various preprocessing steps, such as scaling, data resampling, and feature selection. Simply click on the toggle button to include or exclude each algorithm/step in your pipeline.

      • Classifier: Select at least one classifier from the available options by toggling the corresponding button.

      • Model Evaluation Method: Choose the preferred model evaluation method by clicking on the toggle button. This determines how the performance of your model will be assessed.

      • Evaluation Metric Score: Select the desired evaluation metric score for evaluating your model's performance.

      • Customizing Algorithm Parameters: If you wish to customize the parameters of individual algorithms, click on the Parameter button. You will be presented with a list of corresponding parameters that you can adjust according to your preferences. If no changes are made, default parameters will be used.

      • Pipeline Download: Once the user has selected the desired algorithms/steps, they can obtain the designed pipeline by clicking on the submit tab. This will generate a compressed zip file (userInputData.zip) containing README.txt, inputParameter.pkl, and scriptTemplate.py.

      Note: While preprocessing steps are optional, you must select at least one classifier, a model evaluation method, and an evaluation metric score.

    • Step 2: Running the custom-designed pipeline

      To execute your custom-designed pipeline, follow these steps:

      • Open your terminal

      • Change your directory to the previously downloaded folder named "userInputData" from the CustomML feature.

      • Run the following command in the terminal to execute the pipeline:

        python scriptTemplate.py -i path/to/input.csv -p inputParameters.pkl -s tab -o .

        Replace "path/to/input.csv" with the actual path to your input file.

      -> Input file format: Prepare your data in either .csv or .txt format (Example input data). Each row should represent a sample, and each column should represent a feature. The first column should contain the sample name, and the last column should contain the target classes. Make sure your file doesn't have any missing values (NaN). You can refer to the example input data for guidance.

      -> Tags description: These tags provides the usage syntax for running the scriptTemplate.py file.

      usage: scriptTemplate.py [-h] [-i INPUT] [-s SEPARATOR] [-p PARAMETERS] [-o OUTPUT]

    • Step 3: Pipeline Output and Result Interpretation

      Once you have executed the pipeline, it will generate a compressed zip file as the output. This zip file will contain two important files:

      • log.txt: This file provides a detailed log of the pipeline execution. It includes information about each step performed during the process, any warnings or errors encountered, and other relevant details.

      • results.pkl: This file contains the results of your pipeline, including the model outputs, predictions, and evaluation metrics. It serves as a valuable resource for further analysis and interpretation of your ML model's performance.

      To interpret the results obtained from your pipeline, follow these steps:

      • Launch MLme and Navigate to the Visualization tab.

      • Upload the results.pkl file that was generated from your pipeline execution.

      • The MLme will process the results and provide visualizations, metrics, and insights to help you understand and analyze the performance of your ML model. You can explore various plots, charts, and summary statistics to gain deeper insights into the model's behavior and effectiveness.

  • Visualization

    • This feature enables users to effortlessly interpret and analyze their findings with the help of several interactive tables and plots.

    • Input:

    • Output:

      • A range of tables and plots are available for comparative analysis of model performance. Users can customize and download all of the plots in high quality, making them suitable for publication.

Citations information

Please cite MLme article published in GigaScience and bioRxiv.

Akshay Akshay#, Mitali Katoch#, Navid Shekarchizadeh, Masoud Abedi, Ankush Sharma, Fiona C. Burkhard, Rosalyn M. Adam, Katia Monastyrskaia, Ali Hashemi Gheinani. Machine Learning Made Easy (MLme): a comprehensive toolkit for machine learning–driven data analysis, GigaScience, Volume 13, 2024, giad111, https://doi.org/10.1093/gigascience/giad111

Machine Learning Made Easy (MLme): A Comprehensive Toolkit for Machine Learning-Driven Data Analysis. Akshay Akshay#, Mitali Katoch#, Navid Shekarchizadeh, Masoud Abedi, Ankush Sharma, Fiona C. Burkhard, Rosalyn M. Adam, Katia Monastyrskaia, Ali Hashemi Gheinani. bioRxiv 2023.07.04.546825; doi: https://doi.org/10.1101/2023.07.04.546825

# Contributed equally.