
šŸ“ˆ Training a Model in traiNNer-redux


Introduction

This guide will assume that you have an Nvidia GPU (mandatory, and preferably 8GB+ VRAM) and that you're on Windows 10 or 11.

We will be using traiNNer-redux, which is the best option for training super-resolution models in 2024. Previously, neosr was the recommended option, but it now has notable stability issues as reported by multiple users.

Just as a further note, this guide was written from the perspective of Kim2091.

Prerequisites

Make sure you download each of the following, or have them installed on your system already. Installing each of them should just be a matter of selecting your system specifications, downloading the installer, and running it.

Datasets

Pre-made datasets

See this page: https://github.com/the-database/traiNNer-redux/wiki/Recommended-Datasets

Building your own Dataset

The dataset is in my opinion the single most important component that goes into your model. It determines all of what your model will do, from removing compression to sharpening lines. Dataset building is very time-consuming and requires skill (which you will learn), but it is also incredibly rewarding when you create a dataset that works.

There are two types of datasets:

  • Paired - This dataset type consists of two sets of images, an HR folder and an LR folder. HR = high quality, LR = low quality
  • OTF - This dataset type only requires an HR folder. The training software will degrade it as you train

If you're not too interested in building a dataset, you can also use pre-built datasets online. There is a huge collection in the Enhance Everything Discord, with options ranging from anime to video game datasets.

If you want additional information, you can take a quick look at Sirosky's dataset guide: https://github.com/Sirosky/Upscale-Hub/wiki/%F0%9F%93%9A-Model-Training-Principles

Image Selection

An often overlooked step of dataset construction is selecting a good variety of images that contain useful information for the model to learn. Images that are very simple, such as a picture of a featureless white wall, probably won't do much for your model (unless you're training a model on featureless white walls). Curated datasets, such as the pre-made ones linked above, will already have low-information images filtered out; otherwise, you'll have to do it yourself.

Paired Image Datasets

Paired Image Datasets are what this guide will focus on. They consist of a high resolution image [HR] and a low resolution image [LR]. Before you begin training, you should build your own paired image dataset (more on this shortly). Paired datasets also have the following additional requirements and considerations (a quick dimension-check sketch follows this list):

  • The HR should be either 1x, 2x or 4x the size of the LR, depending on the scale of the model you choose to train. 3x is a thing as well, though not all archs support 3x.
  • The LR should represent the source you plan the model to work on in terms of degradations represented (i.e., the problems the source has, such as blur, compression, etc.). The model will learn to correct these flaws, guided by the HRs.
  • The HR should represent what the sourceĀ should look like after going through the model (hence HRs also being referred to as ground truth).
  • The images must be closely aligned. Warping or one image being a few frames off will only serve to confuse the model and produce muddled results.
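If you want a quick programmatic check of the scale requirement, here is a minimal Python sketch using Pillow. The folder names and the scale value are illustrative assumptions, and pairs are assumed to share the same filename across the HR and LR folders (which is also how most training software matches them):

from pathlib import Path
from PIL import Image  # pip install pillow

# Illustrative layout: dataset/hr and dataset/lr hold the pairs, matched by filename
HR_DIR = Path("dataset/hr")
LR_DIR = Path("dataset/lr")
SCALE = 2  # set to the scale of the model you plan to train (1, 2, 3 or 4)

for lr_path in sorted(LR_DIR.iterdir()):
    if lr_path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
        continue
    hr_path = HR_DIR / lr_path.name
    if not hr_path.exists():
        print(f"No matching HR for {lr_path.name}")
        continue
    with Image.open(lr_path) as lr, Image.open(hr_path) as hr:
        expected = (lr.width * SCALE, lr.height * SCALE)
        if hr.size != expected:
            print(f"{lr_path.name}: HR is {hr.size}, expected {expected}")

This only catches size mismatches; warping and frame offsets still need a visual check in a tool like ImgAlign or Simple Image Compare, as described below.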

Paired Image Datasets can be built in three different ways. Note that these methods aren't mutually exclusive-- in many scenarios, you'll want to use a combination of the three.

Real Image Pairs

Real image pairs are typically composed of an LR from an old source, such as VHS or DVD, and an HR from a new source, such as a Bluray or WEB release. Real image pairs have the advantage of being the most realistic representation of the differences between a low resolution source and a modern release. Models trained purely on real image pairs can look more "natural", without the oversmoothing of details and artificial sharpness which may be present in models trained on poorly prepared synthetic datasets (more on this later). A potential downside is that using only real image pairs without enough source variety can lead to the model "overfitting" to the dataset. It's important to introduce variety if you want your model to generalize well across different sources-- whether through synthetic image pairs (as discussed later) or simply a wide range of real sources.

In addition, real image pairs can often be difficult to work with. Here are the typical steps one might take to create a real image pair dataset:

  • Find the LR and HR. This might be a VHS and a BD release for example.
  • Extract matching frames from the LR and HR. The frames must match as closely as possible, which can present a significant challenge.
    • Sirosky created Image Pearer to semi-automate the process, but this is still a labor-intensive feat that requires time, dedication and patience.
  • Align the pairs. After extracting the matching frames, the images generally won't be aligned properly. The HR likely won't be exactly 2x or 4x the size of the LR, and there may be warping which causes the pair to become misaligned, despite being the same frame.
    • ImgAlign comes to the rescue-- it helps resize and warp the images as necessary to make sure the pairs are aligned properly.
[Video: Simple_Image_Compare_1.1_AOuUj2L12h.mp4]

An example of image warping which will confuse the model.

  • Confirm that the images are aligned properly-- when dropped into something like img-ab or Simple Image Compare, the LR should scale to fit exactly into the HR.
[Video: Simple_Image_Compare_1.1_0oXbehxc2p.mp4]

Confirming that a LR and HR image pair are perfectly aligned in Simple Image Compare.

  • Congratulations, you now have your first paired image dataset! (Though you might want to collect more from other sources).

So yes, real image pairs can be a real pain to collect. But they often pay massive dividends, being "natural" representations of the differences between a low-res and a high-res source. If this process sounds unpalatable, fortunately, there are two other methods of creating paired datasets which are likely much easier.

Synthetic HRs

This process involves taking the LRs, and generating 2x or 4x versions of them as the HRs. Thus, you would be using "artificial" or "synthetic" HRs. Synthetic HRs are typically generated by using existing upscaling models (through chaiNNer's image iterator). Sure, you could do it manually, but at that point, it would take so much time that you might as well just make a real image pair dataset. While this approach can sound very simple, synthetic HRs have the downside of carrying over the faults of the model used to upscale the LR.

For example, if a model has poor detail retention, and you only use synthetic HRs based on that model, your model will also have poor detail retention. Thus, it is important to be very selective of the models you use to help generate HRs or find ways to mitigate the issues.

Generally, this isn't recommended.

[Video: Simple_Image_Compare_1.1_a1sGtsOzx2.mp4]

An example of a synthetic HR-- remember, don't overdo the alterations made to the image.

Synthetic LRs

This is the opposite of synthetic HRs-- you generate LRs from existing HRs. This is typically done by downscaling by 50% or 25% (depending on the scale of your model), then applying degradations. Fortunately, degradations can be easily applied through something such as my dataset destroyer or umzi's wtp dataset destroyer (a bit more complicated, but with more features). Some degradations can be applied through chaiNNer as well. Power users will often also leverage AviSynth or VapourSynth to assist with degradations. Typically, you'll want to ensure your synthetic LRs have at least blur and some form of compression applied. That way, the trained model will learn to deblur (sharpen, basically) and fix compression artifacting.

A common pitfall to avoid when using synthetic LRs is overzealous application of compression and blur. While it's important to make sure the model learns to deal with compression artifacting, applying too much compression will hurt the model's ability to retain details. Similarly, too much blur will hurt detail retention and confuse the model during the training process. It'll also cause oversharpening, which is rarely desirable. So in essence, keep a balance!

[Video: Simple_Image_Compare_1.1_C3wcoSb1FF.mp4]

An example of a synthetic LR with blur and JPEG degradations.
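To make the idea concrete, here is a minimal sketch of generating synthetic LRs with Pillow: downscale by the model's scale factor, apply a light Gaussian blur, then run the result through a round of JPEG compression. The folder names, blur radius and JPEG quality are illustrative assumptions only-- the dataset destroyer tools mentioned above offer far more control (randomized strengths, more degradation types) and are what you'd normally use:

import io
from pathlib import Path
from PIL import Image, ImageFilter  # pip install pillow

HR_DIR = Path("dataset/hr")   # illustrative paths -- adjust to your own dataset
LR_DIR = Path("dataset/lr")
LR_DIR.mkdir(parents=True, exist_ok=True)
SCALE = 2                     # 2x model: the LR is half the size of the HR

for hr_path in sorted(HR_DIR.glob("*.png")):
    with Image.open(hr_path) as hr:
        # Downscale to the LR resolution for the chosen model scale
        lr = hr.convert("RGB").resize((hr.width // SCALE, hr.height // SCALE), Image.LANCZOS)
        # Keep the blur light -- overdoing it hurts detail retention and causes oversharpening
        lr = lr.filter(ImageFilter.GaussianBlur(radius=0.8))
        # Moderate JPEG compression so the model learns to fix artifacting,
        # round-tripped in memory so the LR keeps the same filename as its HR
        buf = io.BytesIO()
        lr.save(buf, format="JPEG", quality=75)
        buf.seek(0)
        Image.open(buf).save(LR_DIR / hr_path.name)

Note that this applies one fixed degradation to every image; in practice you'll want to vary the strengths (and ideally the degradation types) across the dataset so the model generalizes.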

Concluding Thoughts on Paired Image Datasets

As mentioned earlier, you might often want to mix and match a combination of the three methods discussed above. If a source has a HD release, and you think it looks great, consider generating synthetic LRs for use in the dataset. On the other hand, a show might have never gotten a HD release, but you want to upscale it. You could generate synthetic HRs out of it. Then, there might be a similar show from that same era that also got a Bluray release. You might create a set of real image pairs using the original and the Bluray release. The possibilities are endless!

OTF

I'll only discuss OTF briefly, as I don't use OTF much myself. Unlike image pairs, OTF datasets do not have pairs. They're a collection of images (HRs) which get degraded in realtime (typically through the original Real-ESRGAN degradation pipeline) while training. This extends the training time, but can produce good results. That being said, OTF is really only suited for models focusing on real-life sources. If you're training an anime or a cartoon model, stick with image pairs. Anime and cartoons have their own specific considerations which OTF does not address.

Many models utilizing OTF do so excessively. If not carefully tuned, OTF generates overly strong degradations which models often can't handle properly without massive detail loss. @phhofm has a great writeup on OTF here, including recommended settings. I highly recommend checking it out if you choose to pursue the OTF route.

Validation Dataset [Recommended]

A validation dataset is not required, but it may be of interest to you. It allows for easy "validation" of your model's progress. A validation dataset is essentially a dataset consisting of either single images or image pairs (I'd recommend no more than 8-10 images or pairs). At preset points during training, the training software will generate images using the model's current checkpoint. If using image pairs and you turn on the relevant settings in the config, it will also compare them to the HRs from the validation dataset using PSNR, DISTS, and/or SSIM. Image pair validation datasets essentially represent the ideal of what your model should achieve, and the images generated during the validation process help you determine how close you are to that ideal.

As for what to include, for starters, you can include just images from your dataset if you have truly nothing else. But ideally, you'll want the validation dataset to be truly representative of the sources the model is intended for. Going one step further, you can even have individual images/pairs in your validation dataset represent the model's ability in different areas. For example, you can have one image or pair with large numbers of high frequency details serve to judge your model's detail retention ability. Or you could have a heavily degraded image or pair serve to judge your model's anti-compression abilities.

PSNR, DISTS, and SSIM

For PSNR and SSIM, you want the values as high as possible. For DISTS, you want the value as low as possible. Frankly, however, all three metrics should only be used very loosely as guidance on how the model is progressing. Oftentimes, after visually comparing the output, you may find that checkpoints with worse validation values actually look better than checkpoints with superior validation values. I find that validation metrics are most useful for flagging when the model is generating horrible artifacts.

You absolutely should validate the results visually rather than relying on these metrics.
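For intuition, PSNR is just a log-scaled view of the mean squared error between a validation output and its HR (SSIM and DISTS are more involved, perceptually motivated metrics). A minimal numpy sketch, with hypothetical filenames and assuming both images have identical dimensions:

import numpy as np
from PIL import Image

# Hypothetical filenames: a validation output and its ground-truth HR, same size
out = np.asarray(Image.open("val_output.png").convert("RGB"), dtype=np.float64)
gt = np.asarray(Image.open("val_hr.png").convert("RGB"), dtype=np.float64)

mse = np.mean((out - gt) ** 2)
if mse == 0:
    print("Images are identical (infinite PSNR)")
else:
    psnr = 20 * np.log10(255.0) - 10 * np.log10(mse)  # higher = closer to the ground truth
    print(f"PSNR: {psnr:.2f} dB")

Anything north of ~40 dB is already extremely close, which is part of why small differences in these numbers matter far less than what your eyes tell you.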

Other Dataset Considerations

There are a few other considerations to dataset creation. First off, dataset size. There's no specific rule on how big or how small your dataset should be. Even small, highly tailored datasets can work well for models that have very specific applications. With lighter-weight architectures, there has been some evidence that datasets larger than 5,000 images can be detrimental. Keep in mind: quality beats quantity!

On the topic of quality, source diversity is an important factor in many cases. You'll often want a variety of sources within your dataset, to make sure that your model has a diverse set of information to learn from. If you're training on a cartoon, for example, pull frames from multiple episodes rather than a single one. Spread them out, and your model will adapt itself to the other parts of the show you plan on upscaling.

Image Tiling

To maximize training efficiency and the useful information in your dataset, you should consider tiling your images. Tiling breaks down large images into smaller, more manageable pieces, which offers several benefits:

  • Training efficiency: Smaller tiles allow for better batch sizes and more efficient VRAM usage
  • Better feature learning: The model can focus on learning specific details rather than being overwhelmed by large images
  • Reduced redundancy: Instead of processing large areas of similar content (like a blank sky), tiling lets you extract the most informative parts of images
  • More diverse training samples: From a single large image, you can extract multiple tiles containing different features and details

You can use my Image Tiling script to quickly tile your dataset:

python TileImages.py /path/to/image/folder /path/to/output/folder -t 512 512
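If you're curious what tiling actually does (or want to roll your own), it's just cropping a grid of fixed-size squares out of each image. A rough Pillow sketch, not the script above, with illustrative folder names:

from pathlib import Path
from PIL import Image

SRC = Path("dataset/hr_full")    # illustrative paths
DST = Path("dataset/hr_tiles")
DST.mkdir(parents=True, exist_ok=True)
TILE = 512                       # tile size in pixels, matching -t 512 512 above

for img_path in sorted(SRC.glob("*.png")):
    with Image.open(img_path) as img:
        # Edge strips smaller than a full tile are simply skipped in this sketch
        for top in range(0, img.height - TILE + 1, TILE):
            for left in range(0, img.width - TILE + 1, TILE):
                tile = img.crop((left, top, left + TILE, top + TILE))
                tile.save(DST / f"{img_path.stem}_{top}_{left}.png")

For paired datasets, remember that the HRs and LRs have to be tiled on the same grid (with the LR tile size divided by the model scale), or the pairs will fall out of alignment.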

Picking an Architecture

Architectures, or archs, are essentially frameworks for super resolution work. For example, if you've heard of ESRGAN, that'd be an architecture. Each arch has its own attributes and its own quirks, with some being focused on speed at the expense of robustness, and others being very slow but capable of handling whatever you throw at it.

For simplicity's sake, we'll go with SRVGGNet for the purposes of this guide. SRVGGNet, aka Compact, is from the developers of ESRGAN. As a lighter arch, it is much faster than ESRGAN (think 10x+ in some scenarios). While it isn't quite as robust as some of its slower counterparts, it is still quite capable and a perfect starting point due to its inference speed (aka upscaling speed), training speed, stability while training (other archs might explode on you if you're not careful) and low VRAM requirements. If you look on OpenModelDB, you'll see that these attributes have rendered it an extremely popular arch for very good reason.

Installing traiNNer-redux

Now that you have a dataset, the next step is to install the actual training platform to train your models on. There are a few options out there, but traiNNer-redux is the newest and most actively updated. It's also the best documented, which makes life so much easier.

  1. Download traiNNer-redux per the installation instructions.
    1. If you don't know how the command line works, all you have to do is navigate to the folder where you want to install it.
    2. Press Win+R and type cmd. It'll open a prompt.
    3. Copy git clone https://github.com/the-database/traiNNer-redux, and press the right mouse button in command prompt to paste the command in.
    4. Press enter, and you should see the download begin.
  2. You should have installed all the prerequisites earlier, but install them now if not.

Creating a Config

Configs are how you prepare the training software to actually train the model on the dataset you prepared.

  1. Navigate to the options/train folder in your traiNNer-redux base folder. Find the Compact folder and look for Compact_finetune.yml.
  2. Open it up, and take a look at the comments in the config file. They explain everything going on. The default config is generally very good, but you'll want to at least fill in the paths to your datasets.

You should generally follow the default settings in the config, but here are some important settings to understand (a small config sanity-check sketch follows this list):

  • During training, the model learns by looking at small pieces (crops) of your images. Two key settings control this:

    • lq_size: This sets how big these pieces are when looking at your low quality images. The default of 96 is a good starting point.
    • scale: This is your upscaling factor (2x, 4x, etc). The model will automatically make the high quality crops bigger by this amount. For example:
      • If scale: 4 and lq_size: 96, it will look at 96x96 pieces of your LR images and 384x384 pieces of your HR images
      • If scale: 2 and lq_size: 96, it will look at 96x96 pieces of your LR images and 192x192 pieces of your HR images
  • For best performance, try to keep lq_size as a multiple of 8 (like 24, 32, 48, 64, 96, or 128)

  • batch_size_per_gpu: This controls how many pieces the model looks at simultaneously. The default of 8 is good for stability. Higher values can improve quality but use more VRAM.

If you run out of VRAM (graphics memory), try lowering lq_size or batch_size_per_gpu.

  • You should make sure to use a pretrain under the path section. Pretrains for compact can be found here. Pretrains serve as a starting point to your training, and will speed up the process substantially. Make sure to pick just the normal Compact version for your scale (don't pick UltraCompact for example). Without a pretrain, you also won't be able to combine your model with other Compact models that have compatible pretrains (aka interpolation).

  • Make sure to fill out the validation dataset paths under the val section in datasets and enable validation by setting val_enabled: true if you have a validation dataset.

  • Don't worry about anything in the train section of the config-- the default settings are optimal. Some more advanced users may want to tweak loss values for specific purposes, but the returns are dubious and likely minimal.

  • The config now supports Mix of Augmentations (MoA) which can help with model generalization. This is disabled by default (use_moa: false) but can be enabled if desired.
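If you'd like to double-check the handful of settings above before launching a run, here is a small sanity-check sketch using PyYAML. The key names (scale, lq_size, batch_size_per_gpu) are the ones discussed in this guide, but the exact nesting can differ between config versions, so treat the dictionary paths below as assumptions and adjust them to match your file:

import yaml  # pip install pyyaml

# Assumed path to the config you edited earlier
with open("options/train/Compact/Compact_finetune.yml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

scale = cfg.get("scale")
train_ds = cfg.get("datasets", {}).get("train", {})  # assumed location of the train dataset block
lq_size = train_ds.get("lq_size")
batch = train_ds.get("batch_size_per_gpu")

print(f"scale: {scale}, lq_size: {lq_size}, batch_size_per_gpu: {batch}")
if lq_size and lq_size % 8 != 0:
    print("warning: lq_size is not a multiple of 8")
if lq_size and scale:
    print(f"HR crops will be {lq_size * scale}x{lq_size * scale}")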

Set up Tensorboard [Optional]

Tensorboard provides a fancy interface with graphs to track your model training. If you like fancy graphs, you can install tensorboard as follows.

  1. Make sure use_tb_logger is set to true in your config-- it should be the default.
  2. Install tensorboard via pip install tensorboard.
  3. After you start training, or have trained a model, you can launch tensorboard via tensorboard --logdir traiNNer-redux\traiNNer-redux\experiments\tb_logger\MODELNAME. If it's working, you should see something like the below.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.14.0 at http://localhost:6006/ (Press CTRL+C to quit)
  4. By default, you should be able to access it at http://localhost:6006/ in your browser.

[Image: TensorBoard graphs]

You should see graphs like the above when opening the tensorboard page.

Start Training!

Now that you have a dataset created, traiNNer-redux installed and a config set up, it's finally time to begin training! Return to your traiNNer-redux base folder, launch cmd again, and paste in the following: python train.py -opt options/Compact/Compact.yml. Note that if you renamed the default Compact config or are using a different one (such as the Compact_finetune.yml you edited earlier), you'll have to update the path in the command accordingly.

  • If you want to stop training, press Ctrl+C.
  • If you want to resume training, you can set a resume state in the config file (found in the experiments > folder with your model name > training_states) or simply use --auto_resume, such as python train.py -opt options/Compact/Compact.yml --auto_resume, as your training command to start from the last stopped point.

If all is working as intended, you should see something like this:

2024-07-19 13:52:31,853 INFO: Building Dataset Train Dataset...
2024-07-19 13:52:31,903 INFO: Dataset [PairedImageDataset] - Train Dataset is built.
2024-07-19 13:52:31,903 INFO: Training statistics:
        Number of train images: 6590
        Dataset enlarge ratio: 1
        Batch size per gpu: 6
        World size (gpu number): 1
        Require iter number per epoch: 1099
        Total epochs: 455; iters: 500000.
2024-07-19 13:52:31,903 INFO: Validation is disabled, skip building val dataset Val Dataset.
2024-07-19 13:52:31,916 INFO: Network [SRVGGNetCompact] is created from spandrel v0.3.4.
2024-07-19 13:52:32,066 INFO: Using Automatic Mixed Precision (AMP) with fp32 and bf16.
2024-07-19 13:52:32,094 INFO: Network UNetDiscriminatorSN is created from traiNNer-redux.
2024-07-19 13:52:32,099 INFO: Using Exponential Moving Average (EMA) with decay: 0.999.
2024-07-19 13:52:32,104 INFO: Network [SRVGGNetCompact] is created from spandrel v0.3.4.
2024-07-19 13:52:32,216 INFO: Loss [MSSIMLoss] is created.
2024-07-19 13:52:33,122 INFO: Loss [PerceptualLoss] is created.
2024-07-19 13:52:33,136 INFO: Loss [HSLuvLoss] is created.
2024-07-19 13:52:33,136 INFO: Loss [GANLoss] is created.
2024-07-19 13:52:33,137 INFO: Model [SRModel] is created.
2024-07-19 13:52:59,375 INFO: Start training from epoch: 0, iter: 0

This means that the model has begun training. It might take a few minutes, but then you should see something like this show up.

2024-07-19 13:53:09,034 INFO: [4x_Co..][epoch:  0, iter:     100, lr:(1.000e-04,)] [performance: 10.355] [eta: 9:56:41] l_g_mssim: 1.9074e-01 l_g_percep: 2.5468e-01 l_g_hsluv: 5.6346e-02 l_g_gan: 7.5302e-02 l_g_total: 5.7707e-01 l_d_real: 6.8977e-01 out_d_real: 5.6849e-02 l_d_fake: 6.6484e-01 out_d_fake: -8.7937e-02
2024-07-19 13:53:16,334 INFO: [4x_Co..][epoch:  0, iter:     200, lr:(1.000e-04,)] [performance: 11.796] [eta: 10:02:19] l_g_mssim: 1.9983e-01 l_g_percep: 2.4514e-01 l_g_hsluv: 4.9399e-02 l_g_gan: 7.5838e-02 l_g_total: 5.7021e-01 l_d_real: 6.6357e-01 out_d_real: 1.6310e-01 l_d_fake: 6.7228e-01 out_d_fake: -8.6010e-02

This means the model is training properly. As for all the numbers:

  • performance shows the iterations per second. By default, the model will save every 1000 iterations, and run a validation check every 1000 iterations. I have mine set to save every 2,500 iterations, with validation every 5,000 iterations to avoid the spam.
  • the eta is determined by your config, and your current speed. I usually just ignore this. Models often complete learning well before the ETA.
  • the loss values such as l_g_mssim should generally be close to 0 (except for l_g_gan). That being said, don't worry about it too much for now.

Monitor Training

Now, all that's left to do is monitor the progress of your model. Ideally, you will have created a validation dataset, and you should see validation results in the command prompt periodically, like this:

2024-03-19 15:56:00,819 INFO: Validation Validation
         # psnr: 45.6487        Best: 45.9801 @ 5000 iter
         # ssim: 0.9978         Best: 0.9979 @ 5000 iter
         # dists: 0.0123        Best: 0.0120 @ 5000 iter

Here, you can see:

  • A PSNR value of over 45, which is extremely high
  • An SSIM value of 0.9978 (out of 1.0), also very good
  • A DISTS value of 0.0123 (lower is better)

But what's this? Why are our current values lower than the Best values at 5,000 iterations? This may indicate that the model actually peaked at 5,000 iterations. As mentioned before, it's time to bring out the big guns: the Mark 1 eyeball. Never just look at the metrics and take them at face value!

You can find the model's generated validation images under experiments > folder with your model name > visualization. Here, you'll see images generated from your validation dataset's LRs. You will then want to compare the generated images between themselves, and also with the validation dataset's HRs. With a visual check, you can determine whether the model is still training, stagnating or has gone FUBAR.

[Video: Simple_Image_Compare_1.1_BaQMEXFy9a.mp4]

An example of training instability-- these validation images were 5000 iterations apart. Ideally, these artifacts will go away with continued training.

If after several validation checks your model's validation scores slip or remain stagnant, it's a likely indication that your model is done training. At the risk of sounding like a broken record, please do confirm visually as well. I typically complete Compact models at less than 40K iterations. Some even complete at 5K or 10K iterations, if I use a very on-point pretrain.

Once you determine your model is complete, you can find it in the models folder next to the visualization folder. The file with _g is the actual model-- the _d is a discriminator used in the training process. You don't need it for inference purposes, but it's necessary if you ever want to resume training the same model.

With that, congratulations on your trained model!

Conclusion

If you've made it this far, congratulations on your first trained model. Model training has quite a steep learning curve, but I hope this guide made it a bit easier to decrypt the process. With that being said, if you do decide to continue training models, consider checking out my (much shorter) writeup on Model Training Principles. While much of it is anime-focused, there is still plenty applicable to models dedicated to real-life sources and others. Obviously, some of this will be up to subjective taste, but hopefully it'll provide a sense of what to look for as you prepare datasets and train your own models. Good luck, may the neural network gambling parlor be ever in your favor!

Feel free to reach out if you need help:

  • Discord: kim2091 (guide author)
  • Discord: the database (creator of traiNNer-redux)
  • Join the Enhance Everything Discord server for additional resources and a helpful community of trainers