SSDIR Dataset
This is one of my older datasets, which I originally released on 04.09.2023 on the Enhance Everything Discord server and wanted to release in this repo as well.
First - why curate an existing dataset?
The LSDIR dataset contains 84'991 training images. Maintaining quality across such a large dataset is difficult. When I inspected it, I found low-quality images that will negatively affect model quality.
(In general, we want the high-resolution (HR) images to be of high quality: free of noise, blur, compression artifacts and so on. We can add those degradations to the low-resolution (LR) images if we want the model to learn how to deal with them, but for that the HR images need to be clean.)
Examples of images from the LSDIR HR training set:
Another issue is that the images are all of varying sizes. I like my SISR training datasets to be tiled, so that all images have the same size. Smaller tiles improve I/O speed during training, and the training process (randomly) crops the input anyway. So, to better see what is going on and what the model could actually receive during training, here are the same images again, but as crops:
So what follows was an attempt to curate a smaller, higher-quality version of this dataset. I also found when training models that a smaller dataset with high variety can produce similar quality to a very big dataset, so we do not need huge datasets.
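As a rough illustration of the random cropping mentioned above, here is a minimal sketch of how a training loader might pick a crop window. The function name `random_crop_box` and the crop size of 256 are assumptions for illustration only; actual training frameworks have their own implementations and settings.

```python
import random


def random_crop_box(width, height, crop=256, rng=random):
    """Return (left, top, right, bottom) of a random crop inside an image.

    `crop` is a hypothetical crop size; real training configs vary.
    """
    if width < crop or height < crop:
        raise ValueError("image is smaller than the crop size")
    # randint is inclusive on both ends, so the crop always fits.
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return (left, top, left + crop, top + crop)


# Example: one crop window from a 2048x1365 image, seeded for repeatability.
box = random_crop_box(2048, 1365, crop=256, rng=random.Random(0))
```

Because the model only ever sees such small windows, a blurry or noisy region that looks harmless in the full image can end up being the entire training sample.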
Original release text from 04.09.2023:
SSDIR - Small Scale Dataset for Image Restoration
A training dataset for SISR models, consisting of 10'000 512x512 images, based on LSDIR, achieving an average hyperiqa score of 0.8052392073214054.
SSDIR is based on LSDIR which has been processed/filtered to achieve better training results on SISR upscaling models with the following process:
- Scoring with HyperIQA
- Selecting the 30'000 top-scoring images
- Multiscale with scales 1, 0.75, 0.5 and 0.25 but only saving images >= 512x512px
- Sub-imaging multiscale to 512x512px images
- Score sub-images with HyperIQA
- Selecting the 10'000 top-scoring sub-images
- Rename
- Score with HyperIQA (Average hyperiqa score: 0.8052392073214054)
- Generate meta_info.txt
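The multiscale and sub-imaging steps above can be sketched in terms of image geometry alone. This is not the script that was used to build SSDIR. It assumes that ">= 512x512px" means both sides are at least 512, that scaled sizes are rounded, and that tiles do not overlap and leftover edges are discarded. The function names are hypothetical.

```python
def multiscale_sizes(width, height, scales=(1, 0.75, 0.5, 0.25), min_side=512):
    """Return (scale, w, h) for every scale whose result is still >= min_side."""
    sizes = []
    for s in scales:
        w, h = round(width * s), round(height * s)
        # Assumption: an image is kept only if BOTH sides reach min_side.
        if w >= min_side and h >= min_side:
            sizes.append((s, w, h))
    return sizes


def tile_boxes(width, height, tile=512):
    """Return non-overlapping (left, top, right, bottom) tile boxes.

    Assumption: the step equals the tile size, so partial tiles at the
    right and bottom edges are dropped.
    """
    boxes = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            boxes.append((left, top, left + tile, top + tile))
    return boxes


# Example: a 2048x1024 image keeps three scales; the 0.25 scale is too small.
sizes = multiscale_sizes(2048, 1024)
tiles_per_scale = {s: len(tile_boxes(w, h)) for s, w, h in sizes}
```

Each box can then be passed to an image library's crop function to write the actual 512x512 sub-images.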
Average hyperiqa score of SSDIR with 10000 images is: 0.8052392073214054
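The selection and averaging steps can be sketched as follows, assuming the HyperIQA scores have already been computed and stored as a mapping from filename to score. The scoring itself is not shown, and the function names and toy values are made up for illustration.

```python
def top_k_by_score(scores, k):
    """Return the k filenames with the highest score, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def average_score(scores, names):
    """Average score over the given filenames."""
    return sum(scores[name] for name in names) / len(names)


# Toy scores (not real HyperIQA values), keeping the best 2 of 3 images.
scores = {"a.png": 0.9, "b.png": 0.5, "c.png": 0.7}
kept = top_k_by_score(scores, k=2)
mean_kept = average_score(scores, kept)
```

In the process described above, the same selection would run twice: once with k = 30'000 on the full images and once with k = 10'000 on the sub-images.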
SSDIR_Sharp contains the same images, sharpened. It is meant as a better alternative to using USM (unsharp masking) during training for a sharper output, since USM seems to introduce haloing and overshoot artifacts.
- Apply Contrast Adaptive High Boost Filter with Filter Type: Normal, Amount: 2.0, Contrast Bias: 2.00
Average hyperiqa score of SSDIR_Sharp/ with 10000 images is: 0.829739721441269
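To show what overshoot from USM looks like, here is a simplified one-dimensional sketch. It uses a box blur instead of the usual Gaussian blur and is not the Contrast Adaptive High Boost Filter listed above. All names and values are illustrative assumptions.

```python
def box_blur_1d(signal, radius=1):
    """Blur a 1D signal with a simple moving average (edges use a shorter window)."""
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def unsharp_mask_1d(signal, amount=1.0, radius=1):
    """Classic unsharp mask: original + amount * (original - blurred)."""
    blurred = box_blur_1d(signal, radius)
    return [s + amount * (s - b) for s, b in zip(signal, blurred)]


# A clean step edge from dark (0.2) to bright (0.8).
edge = [0.2] * 4 + [0.8] * 4
sharpened = unsharp_mask_1d(edge)

# The sharpened values next to the edge leave the original 0.2..0.8 range:
# the dark side dips below 0.2 and the bright side rises above 0.8.
overshoots = min(sharpened) < min(edge) and max(sharpened) > max(edge)
```

In an image, those dips and peaks on either side of an edge show up as dark and bright halos.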
Only the non-sharpened version is released in this repo, because one needs to be careful when training upscaling models on the sharpened version: the models can pick up on sharpening artifacts.