This is a repository for running predictions using the winning Densenet model from the HPA Kaggle Image Classification Challenge, together with the image preprocessing required for the model to work well. The package can also perform dimensionality reduction on the results using the UMAP package.
The package is centered around using separate commands for different parts of the pipeline. Currently available commands are:
- `preprocess` — Preprocess a set of images for future predictions.
- `predict` — Predict using the HPA Densenet model.
- `dimred` — Perform dimensionality reduction on the output from the HPA Densenet model.
- `umapNd` — Generate a CSV file with the previous results as a data source to plot an Nd UMAP.
If there are any questions on how to use the code in this repo, please ask them by opening an issue here on GitHub or by contacting @cwinsnes.
> **Note**: It is recommended to run the module in a separate virtual environment, such as Anaconda or venv, to avoid any issues with package versions.
Installing the required modules for this package can be done through `pip`:

`python -m pip install -r requirements.txt`
The model requires the input images to be stored in the following way:

- All images should be in a single folder.
- Each image should be split into 4 separate files based on its color channels.
- The names of the images should fit the pattern `<FILENAME>_{red,blue,yellow,green}.<file_extension>`, for example `file1_red.jpg` or `2222_517_E3_1_yellow.png`.
- The images should follow the format of the HPA image dataset:
  - The red images should contain a microtubule marker.
  - The blue images should contain a DAPI marker.
  - The yellow images should contain an endoplasmic reticulum marker.
  - The green images should contain the protein of interest.

An example data folder could look like:

- `data/`
  - `images/`
    - `image1_red.jpg`
    - `image1_blue.jpg`
    - `image1_yellow.jpg`
    - `image1_green.jpg`
    - `image2_red.jpg`
    - `image2_blue.jpg`
    - `image2_yellow.jpg`
    - `image2_green.jpg`
    - etc.
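To sanity-check that a folder follows this layout, you can group the files by their base name and verify that all four channel files are present. The helper below is not part of this package, just an illustrative sketch using only the Python standard library:

```python
from collections import defaultdict
from pathlib import Path

# The four channel suffixes the model expects for every image.
CHANNELS = {"red", "blue", "yellow", "green"}

def check_image_folder(folder):
    """Report any image that is missing one of its four channel files."""
    groups = defaultdict(set)
    for path in Path(folder).iterdir():
        base, sep, channel = path.stem.rpartition("_")
        if sep and channel in CHANNELS:
            groups[base].add(channel)
    for base, found in sorted(groups.items()):
        missing = CHANNELS - found
        if missing:
            print(f"{base} is missing channels: {sorted(missing)}")

check_image_folder("data/images")
```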
All commands are run from the `main` module. Specifically, the commands are run through the call `python main.py <command> <arguments>`.
To access the help section for any specific command, run `python main.py <command> --help`.
The purpose and structure of each command are listed below:
The `preprocess` command performs preprocessing on images to make them usable by the machine learning model. At present, the only preprocessing performed is resizing, and the output format is hardcoded to `.jpg`.
The following arguments are allowed:

- `-h, --help`: show help message and exit.
- `-s SRC_DIR, --src-dir SRC_DIR`: source directory, where the images to process are located.
- `-d DST_DIR, --dst-dir DST_DIR`: destination directory, where processed images end up.
- `--size SIZE`: the output size of the processed images. Defaults to `1536`.
- `-w NUM_WORKERS, --num-workers NUM_WORKERS`: the number of multiprocessing workers performing the resizing. Defaults to `10`.
- `--continue`: continue from a previously aborted run. This should only be done if `SRC_DIR` is unchanged between runs.
Note that `-s` and `-d` are required arguments!
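Conceptually, the preprocessing boils down to resizing each channel image and re-saving it as `.jpg`, roughly as in the sketch below (the resampling filter and JPEG settings are assumptions; the actual command additionally handles the multiprocessing workers and the `--continue` logic described above):

```python
from pathlib import Path
from PIL import Image

def resize_image(src_path, dst_dir, size=1536):
    """Resize a single channel image and save it as .jpg, the hardcoded output format."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    img = Image.open(src_path).convert("RGB")
    img = img.resize((size, size), Image.LANCZOS)
    img.save(dst_dir / f"{Path(src_path).stem}.jpg", "JPEG")

resize_image("data/images/image1_red.jpg", "data/resized_images")
```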
The `predict` command runs the Densenet model on the processed images.
The output from the model is split into three parts: probabilities, meta-information, and features.
The probabilities represent the model's prediction probabilities, while the features correspond to the latent space feature representation of the model.
The meta-information contains the name of each image that was predicted upon.
The three files are timestamped and stored in the output folder.
The probabilities consist of the logit output of the model, in the same class order as the original Kaggle challenge.
The output files are all compressed `numpy` storage files and can be loaded using the `numpy.load` function.
Each file contains a Python dict with the corresponding information. To see how to load the information, see the example presented in Example run.
At present, the only allowed format for images in the input directory is `.jpg`.
The following arguments are allowed:

- `-h, --help`: show help message and exit.
- `-s SRC_DIR, --src-dir SRC_DIR`: source image directory (preprocessed).
- `-d DST_DIR, --dst-dir DST_DIR`: output directory. The output files will be stored in the compressed numpy format `.npz`.
- `--size SIZE`: image size. Defaults to `1536`.
- `--gpu GPU`: which GPUs to use for prediction. Any string valid for the environment variable `CUDA_VISIBLE_DEVICES` is valid for this. If CPU-only calculation is desired, the value `cpu` is also allowed. Defaults to the current `CUDA_VISIBLE_DEVICES`.
Note that `-s` and `-d` are required arguments!
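Since all three outputs are ordinary compressed numpy archives, a quick way to inspect what a given file contains before using it (the file name is a placeholder for one of the timestamped outputs):

```python
import numpy as np

# List every array stored in a prediction output file, along with its shape.
with np.load('data/predictions/<FEATURE_FILE>') as data:
    for key in data.files:
        print(key, data[key].shape)
```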
The `dimred` command runs UMAP dimensionality reduction on the features from the `predict` command.
The output consists of an n-dimensional array stored in `.npz` format, where n corresponds to the number of dimensions asked for. To see how to easily load the data, see the example in Example run.
The following arguments are allowed:

- `-h, --help`: show help message and exit.
- `-s SRC, --src SRC`: source feature file to reduce.
- `-d DST, --dst DST`: file to store the reduced output in. The result will be stored in the compressed numpy format `.npz`.
- `-n NUM_DIM, --num-dim NUM_DIM`: number of dimensions to reduce to. Defaults to `2`.
Note that `-s` and `-d` are required arguments!
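Under the hood, this corresponds to a standard UMAP fit over the saved feature matrix. A minimal sketch of the equivalent operation, assuming the `umap-learn` package and the array keys shown in Example run:

```python
import numpy as np
import umap  # provided by the umap-learn package

# Load the latent features produced by the predict command.
feats = np.load('data/predictions/<FEATURE_FILE>')['feats']

# Fit a 2-dimensional embedding, matching the -n/--num-dim default.
components = umap.UMAP(n_components=2).fit_transform(feats)

# Store the result in the same compressed format, under the key loaded later in this README.
np.savez_compressed('data/umap/reduced.npz', components=components)
```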
The `umapNd` command generates a simple CSV file from a previous dimensionality reduction result file and a meta-information result file.
The output consists of a CSV file with the columns "Id", "X", "Y", […]. See the example in Example run.
The following arguments are allowed:

- `-h, --help`: show help message and exit.
- `-sred, --sred`: source reduction file.
- `-n, --num-dim`: number of present reduced dimensions to add to the CSV.
- `-smeta, --smeta`: source meta-information file.
- `-d, --dst`: file to store the CSV values in.
Note that all arguments are required!
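For reference, the 2-dimensional case amounts to roughly the following (a sketch assuming the array keys shown in Example run; the examples below additionally pass a probabilities file via `-sprob`, which is omitted here):

```python
import csv
import numpy as np

# Reduced components from dimred and image ids from the predict meta file.
components = np.load('data/umap/<REDUCED_FILE>')['components']
image_ids = np.load('data/predictions/<METAINFORMATION_FILE>')['image_ids']

with open('data/umap2d.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Id', 'X', 'Y'])
    for image_id, (x, y) in zip(image_ids, components):
        writer.writerow([image_id, x, y])
```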
Assuming you have a data folder containing images in the format described above, a prediction can easily be made using the following commands:
$ python main.py preprocess -s data/images -d data/resized_images
$ python main.py predict -s data/resized_images -d data/predictions
If you want to perform dimensionality reduction using UMAP, you can run the following commands:
$ python main.py dimred -s data/predictions/<FEATURE_FILE> -d data/umap/reduced.npz
If you want to generate a CSV file containing the data to plot an Nd UMAP, you can run one of the following commands:
$ python main.py umapNd -sred data/umap/<REDUCED_FILE> --num-dim 2 -smeta data/predictions/<METAINFORMATION_FILE> -sprob data/predictions/<PROBABILITIES_FILE> --dst data/umap2d.csv
OR
$ python main.py umapNd -sred data/umap/<REDUCED_FILE> --num-dim 3 -smeta data/predictions/<METAINFORMATION_FILE> -sprob data/predictions/<PROBABILITIES_FILE> --dst data/umap3d.csv
To access the predicted data, use numpy to load the stored arrays:
import numpy as np
features = np.load('data/predictions/<FEATURE_FILE>')['feats']
probabilities = np.load('data/predictions/<PROBABILITY_FILE>')['probs']
image_ids = np.load('data/predictions/<META_INFORMATION_FILE>')['image_ids']
# If you performed dimensionality reduction, you load it in a similar vein.
reduced = np.load('data/umap/reduced.npz')['components']
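If you want to visualize a 2D reduction, the loaded components can be plotted directly, for example with matplotlib (not a requirement of this package, so this is only a suggestion):

```python
import matplotlib.pyplot as plt
import numpy as np

# Scatter plot of the 2D UMAP embedding produced by dimred.
reduced = np.load('data/umap/reduced.npz')['components']
plt.scatter(reduced[:, 0], reduced[:, 1], s=2)
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.show()
```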