
HPA Densenet

This repository runs predictions using the winning densenet model from the HPA Kaggle Image Classification Challenge, together with the image preprocessing needed for the model to perform well. The package can also perform dimensionality reduction of the resulting features using the UMAP package.

The package is centered around using separate commands for different parts of the pipeline. Currently available commands are:

  • preprocess — Preprocess a set of images for later prediction.

  • predict — Run predictions using the HPA Densenet model.

  • dimred — Perform dimensionality reduction on the output from the HPA Densenet model.

  • umapNd — Generate a CSV file from the previous results that can be used to plot an N-dimensional UMAP.

If you have any questions about how to use the code in this repository, please open an issue here on GitHub or contact @cwinsnes.

Installation and Setup

Note
It is recommended to run the module in a separate virtual environment, such as an Anaconda environment or a venv, to avoid issues with package versions.

The required modules for this package can be installed through pip:

python -m pip install -r requirements.txt

Data

The model requires the input images to be stored in the following way:

  • All images should be in a single folder

  • Each image should be separated into 4 files, one per color channel

  • The names of the images should fit the pattern <FILENAME>_{red,blue,yellow,green}.<file_extension>

    • For example, file1_red.jpg or 2222_517_E3_1_yellow.png

    • The images should follow the format of the HPA image dataset:

      • The red images should contain a microtubule marker

      • The blue images should contain a DAPI marker

      • The yellow images should contain an endoplasmic reticulum marker

      • And the green images should contain the protein of interest.

An example data folder could look like this:

data/
    images/
        image1_red.jpg          image2_red.jpg
        image1_blue.jpg         image2_blue.jpg
        image1_yellow.jpg       image2_yellow.jpg
        image1_green.jpg        image2_green.jpg

and so on.
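
A quick way to check that a folder follows this naming convention is to look for all four channels per image, for example (a minimal sketch using only the Python standard library; the folder path is an assumption):

from pathlib import Path

image_dir = Path("data/images")  # hypothetical data folder
channels = ("red", "blue", "yellow", "green")

# Group files by their <FILENAME> prefix, i.e. everything before the final "_<channel>".
image_ids = {p.stem.rsplit("_", 1)[0] for p in image_dir.iterdir() if p.is_file()}

for image_id in sorted(image_ids):
    missing = [c for c in channels if not list(image_dir.glob(f"{image_id}_{c}.*"))]
    if missing:
        print(f"{image_id} is missing channels: {', '.join(missing)}")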

Commands

All commands are run from the main module; specifically, each command is invoked as python main.py <command> <arguments>. To access the help section for any specific command, run python main.py <command> --help.

The purpose and structure of each command is listed below:

preprocess

The preprocess command performs the preprocessing needed to make images usable by the machine learning model. At present, the only preprocessing performed is resizing of the images, and the output format is hardcoded to .jpg.
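
Conceptually, the preprocessing amounts to something along these lines (an illustrative sketch only, not the package's actual implementation; it assumes Pillow is installed and uses hypothetical paths):

from pathlib import Path
from PIL import Image  # Pillow

src_dir = Path("data/images")          # hypothetical source directory (-s)
dst_dir = Path("data/resized_images")  # hypothetical destination directory (-d)
size = 1536                            # same as the default --size

dst_dir.mkdir(parents=True, exist_ok=True)
for path in src_dir.iterdir():
    if not path.is_file():
        continue
    image = Image.open(path)
    if image.mode not in ("L", "RGB"):
        image = image.convert("RGB")   # make sure the image can be written as JPEG
    # Resize and save; the real command hardcodes the output format to .jpg.
    image.resize((size, size)).save(dst_dir / f"{path.stem}.jpg")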

preprocess options

The following arguments are allowed:

-h, --help            show help message and exit
-s SRC_DIR, --src-dir SRC_DIR
                        source directory, where images to process are located.
-d DST_DIR, --dst-dir DST_DIR
                        destination directory, where processed images end up.
--size SIZE           image size
                        The output size of the processed images. Default `1536`
-w NUM_WORKERS, --num-workers NUM_WORKERS
                        The number of multiprocessing workers used to perform the resizing.
                        Defaults to `10`.
--continue            Continue from a previously aborted run.
                        This should only be done if the `SRC_DIR` is unchanged in between runs.

Note that -s and -d are required arguments!
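
For example, to resize a folder of raw images to 1024x1024 using 4 worker processes (the paths here are illustrative):

$ python main.py preprocess -s data/images -d data/resized_images --size 1024 -w 4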

predict

The predict command runs the densenet model on the processed images. The output from the model is split into three parts: probabilities, meta information, and features. The probabilities are the model's prediction probabilities, while the features correspond to the model's latent-space feature representation. The meta information contains the name of each image that was predicted on. The three files are timestamped and stored in the output folder.

The probabilities consist of the logit output of the model, in the same class order as the original Kaggle challenge.

The output files are all compressed numpy storage files and can be loaded using the numpy.load function. Each file contains a python dict with the corresponding information. To see how to load the information, see the example presented in Example run.

At present, the only image format allowed in the input directory is .jpg.

predict options

The following arguments are allowed:

-h, --help            show help message and exit
-s SRC_DIR, --src-dir SRC_DIR
                    src image directory (preprocessed)
-d DST_DIR, --dst-dir DST_DIR
                    output directory
                    The output files will be stored in the compressed numpy
                    format '.npz'.
--size SIZE           image size
                        Defaults to 1536.
--gpu GPU             Which gpus to use for prediction.
                        Any string valid for the environment variable `CUDA_VISIBLE_DEVICES` is valid here.
                        If CPU-only calculation is desired, a value of 'cpu' is also allowed.
                        Defaults to `CUDA_VISIBLE_DEVICES`

Note that -s and -d are required arguments!
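
For example, to run prediction on CPU only (the paths here are illustrative):

$ python main.py predict -s data/resized_images -d data/predictions --gpu cpu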

dimred

The dimred command runs UMAP dimensionality reduction on the features from the predict command.

The output consists of an n-dimensional array stored in '.npz' format, where n corresponds to the number of dimensions asked for. To see how to load the data, see the example in Example run.

dimred options

The following arguments are allowed:

-h, --help            show help message and exit
-s SRC, --src SRC     Source feature file to reduce.
-d DST, --dst DST     File to store predictions in.
                        The prediction will be stored in the compressed
                        numpy format '.npz'.
-n NUM_DIM, --num-dim NUM_DIM
                    Number of dimensions to reduce to. Defaults to 2.

Note that -s and -d are required arguments!

umapNd

The umapNd command generates a simple CSV file from a previous dimensionality-reduction result file and meta-information result file.

The output consists of a CSV file with the columns "Id", "X", "Y", [...]. See the example in Example run.
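
Once generated, the CSV can be read back with any CSV reader, for instance (a minimal sketch using only the Python standard library; the output path is an assumption matching the example below):

import csv

# 'data/umap2d.csv' is an assumed output path from the umapNd command.
with open("data/umap2d.csv", newline="") as handle:
    rows = list(csv.DictReader(handle))

# Each row maps the column names ("Id", "X", "Y", ...) to their values.
print(rows[0]["Id"], rows[0]["X"], rows[0]["Y"])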

umapNd options

The following arguments are allowed:

-h, --help            show help message and exit
-sred, --sred         Source reduction file.
-n, --num-dim         Number of present reduced dimensions to add to the CSV
-smeta, --smeta       Source meta-information file.
-d, --dst             File to store the CSV values in.

Note that all arguments are required!

Example run

Assuming you have a data folder containing images in the format described above, a prediction can easily be made using the following commands:

Preprocess and predict

$ python main.py preprocess -s data/images -d data/resized_images
$ python main.py predict -s data/resized_images -d data/predictions

Dimensionality reduction

If you want to perform dimensionality reduction using UMAP, you can run the following commands:

$ python main.py dimred -s data/predictions/<FEATURE_FILE> -d data/umap/reduced.npz

UMAP Nd CSV generation

If you want to generate a CSV file containing the data to plot an N-dimensional UMAP, you can run one of the following commands:

$ python main.py umapNd -sred data/umap/<REDUCED_FILE> --num-dim 2 -smeta data/predictions/<METAINFORMATION_FILE> -sprob data/predictions/<PROBABILITIES_FILE> --dst data/umap2d.csv
OR
$ python main.py umapNd -sred data/umap/<REDUCED_FILE> --num-dim 3 -smeta data/predictions/<METAINFORMATION_FILE> -sprob data/predictions/<PROBABILITIES_FILE> --dst data/umap3d.csv

Loading the resulting data

To access the predicted data, use numpy to load the stored arrays:

import numpy as np

features = np.load('data/predictions/<FEATURE_FILE>')['feats']
probabilities = np.load('data/predictions/<PROBABILITY_FILE>')['probs']
image_ids = np.load('data/predictions/<META_INFORMATION_FILE>')['image_ids']

# If you performed dimensionality reduction, you load it in a similar vein.
reduced = np.load('data/umap/reduced.npz')['components']
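
If the reduction was run with the default 2 dimensions, the embedding can be plotted with, for example, matplotlib (a minimal sketch; matplotlib is assumed to be installed, and the file path matches the dimred example above):

import numpy as np
import matplotlib.pyplot as plt

reduced = np.load('data/umap/reduced.npz')['components']

# Scatter the first two UMAP components.
plt.scatter(reduced[:, 0], reduced[:, 1], s=2)
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.savefig('umap2d.png', dpi=150)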
