Has anyone succeeded in training this model? #34
I tried to train this model for a few days, but the reconstruction results are always abnormal. If anyone has succeeded in training it, can you share some tips?

Comments
The reconstructed images look like solid-color images.
Can you show the reconstructed images after training?
@bridenmj How many epochs did you use? Are you working on the ImageNet pretraining?
Yes, I'm working on ImageNet pretraining; it has passed 12,000 steps, but the output image always looks the same. I tried LFQ in my own autoencoder and training works well there, so it looks like something is wrong in the magvit2 model architecture.
Actually, I reimplemented the model structure to align with the magvit2 paper, but I find that the LFQ loss is negative and the recon loss converges easily, with or without the GAN. The reconstructed images are blurry, but not a solid color. What about you? @Jihun999
Ok, I will reimplement the model first. Thank you for your comment.
Hey, is it possible to share the code modifications for the model architecture alignment? Thanks a lot!
Someone I know has trained it successfully.
Wow, could I know who did it?
@RobertLuo1 @Jihun999 @lucidrains If you have successfully trained this model, would you like to share the pretrained weights and the modified model code?
Hello there, I tried with only MSE and then also with the other losses, and with/without the attend_space layers. All work, but I did not try to tune hyperparameters.
Thank you for sharing this, Marina! I'll see if I can find the bug, and if worst comes to worst, I can always rewrite the training code in PyTorch Lightning.
Hi, we have recently devoted a lot of effort to training the tokenizer in Magvit2, and we have now open-sourced a tokenizer trained on ImageNet. Feel free to use it. The project page is https://github.com/TencentARC/Open-MAGVIT2. Thanks @lucidrains so much for your reference code and the discussions!
Hey @lucidrains, I trained a MAGVIT2 tokenizer without modifying your implementation of the accelerate framework. As others have experienced, I initially saw just a solid block in the results/sampled.x.gif files. However, upon loading the model weights from my most recent checkpoint, I was able to get pretty good reconstructions in a sample script I wrote that performs inference without the accelerate framework. Additionally, the reconstruction MSE scores were consistent with the ones observed in your training script. This means that whatever bug others are experiencing is not the result of flawed model training, but rather something going wrong with the gif rendering. *Note: the first file is the saved gif in the
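For anyone who wants to reproduce that check, here is a rough sketch of a standalone inference script. It assumes the `VideoTokenizer` API as I read it from this repo's README (`load`, `tokenize`, `decode_from_code_indices`); the config values, checkpoint path, and shapes are placeholders, so adapt them to your own training run.

```python
import torch
import torch.nn.functional as F
from magvit2_pytorch import VideoTokenizer

# must match the config used for training; these values are placeholders
tokenizer = VideoTokenizer(
    image_size = 128,
    init_dim = 64,
    max_dim = 512,
    codebook_size = 1024,
    layers = (
        'residual',
        'compress_space',
        'residual'
    )
)

tokenizer.load('./checkpoints/checkpoint.pt')  # hypothetical path
tokenizer.eval()

video = torch.randn(1, 3, 17, 128, 128)  # (batch, channels, frames, height, width)

with torch.no_grad():
    codes = tokenizer.tokenize(video)                  # video -> discrete codes
    recon = tokenizer.decode_from_code_indices(codes)  # codes -> reconstructed video

print('recon MSE:', F.mse_loss(recon, video).item())
```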
Please check Tencent's https://github.com/TencentARC/Open-MAGVIT2, which is based on this implementation but modifies some parts.
Thanks for your reply. I did come across the Open-MAGVIT2 repo but, correct me if I'm wrong, I don't think they've implemented the video tokenizer yet?
@vincentcartillier Oh yes, but they are developing a video tokenizer... They had the same problem when training the image tokenizer and finally fixed it. I guess @RobertLuo1 can answer your question ^^
Typically, if the recon loss is below 0.03, you will see an outline of the video. What you encountered may indicate that the architecture is difficult to converge; when I manually re-implemented magvit2, it would produce reconstructions within a few hundred steps. To debug, you can first skip quantization and use only the encoder's output as the decoder's input, in order to adjust the overall model structure; once that works, add quantization back, as it is hard to train. BTW, this repo uses a 2D GAN that takes sampled frames as input, which is not aligned with the paper; you can use a 3D VQGAN instead. From my point of view, though, the discriminator's training is not the most important part. The encoder's and decoder's structures are what matter.
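To make that recipe concrete, here is a minimal sketch of the two-stage debugging loop; `encoder`, `decoder`, and `quantizer` stand in for your own modules, and the quantizer is assumed to return an LFQ-style `(quantized, indices, aux_loss)` triple.

```python
import torch.nn.functional as F

def recon_loss(video, encoder, decoder, quantizer = None):
    # stage 1: pass quantizer = None, so only the enc/dec architecture is tested
    z = encoder(video)

    # stage 2: once stage 1 converges, pass the quantizer back in
    if quantizer is not None:
        z, indices, aux_loss = quantizer(z)

    recon = decoder(z)
    return F.mse_loss(recon, video)
```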
Got it. Thanks a lot for all the tips, I'll try these out. In the meantime, do you think you could share your re-implementation of magvit2? I'm assuming it is based on this repo.
Sorry, I can't, because it's an internal project that is still under development > <. The implementation is not based on this project; I followed Google's magvit-v1 JAX repo and modified it. The adjustments between v1 and v2 are minimal.
But I use vector-quantize-pytorch's LFQ, as is used here.
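For reference, using LFQ from vector-quantize-pytorch looks roughly like this; the hyperparameters are illustrative, following that repo's README.

```python
import torch
from vector_quantize_pytorch import LFQ

quantizer = LFQ(
    codebook_size = 65536,       # must be a power of 2
    dim = 16,                    # matches log2(codebook_size) here
    entropy_loss_weight = 0.1,   # auxiliary entropy loss, encourages codebook usage
    diversity_gamma = 1.
)

feats = torch.randn(1, 16, 32, 32)                       # (batch, dim, height, width)
quantized, indices, entropy_aux_loss = quantizer(feats)  # quantized keeps the input shape
```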
Got it. Totally understandable. Thanks a lot for all the tips, I'll give it a try!
@vincentcartillier I encountered the same convergence difficulty described by @Jason3900, but before that, I found that the learning rate was not set correctly. When I checked
Got it, thanks so much for the pointer. I think the reason you're seeing such a low learning rate is because of the use of

If you haven't used LFQ, are you using FSQ (finite scalar quantization) instead, or something else? Could you try running the same thing (same learning rate) with LFQ and see if it works? It would be great to know whether that's the source of the problem we're facing.
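In case it helps to compare the two quantizers: FSQ from the same vector-quantize-pytorch library is used roughly like this, per its README; the `levels` below follow the FSQ paper's suggested setting.

```python
import torch
from vector_quantize_pytorch import FSQ

levels = [8, 5, 5, 5]         # per-dimension quantization levels, from the FSQ paper
quantizer = FSQ(levels)

x = torch.randn(1, 1024, 4)   # feature dim must equal len(levels)
xhat, indices = quantizer(x)  # xhat is quantized but keeps the shape of x
```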
Yep, but it still shouldn't take that many steps to get a reconstructed result with only one video. It may indicate that the model is too hard to converge.
@vincentcartillier sure!
To test the convergence of this enc/dec arch, I just drop the quantizer and use a continuous representation with dim 512.
Yes, I ran the same setting with LFQ but couldn't get a converged result in 40k steps. So I re-implemented the enc/dec arch according to the paper (using the code in this repo with little modification), and this time I got a surprisingly good result. I still followed @Jason3900's suggestion and skipped quantization.
Amazing! Would you be comfortable sharing the code modifications you've made (maybe via a PR, or just by sharing your fork)?
I also got something kind of working. This is the same code (i.e., no modifications, same settings as in my initial post above), except I've changed the learning rate, or rather I've turned off the
Yes, this is the modified `CausalConv3d`:

```python
import torch.nn as nn
import torch.nn.functional as F

# cast_tuple and is_odd are small helpers from this repo (magvit2_pytorch)

class CausalConv3d(nn.Module):
    def __init__(
        self,
        chan_in,
        chan_out,
        kernel_size,
        pad_mode = 'constant',
        s_stride = 1,   # spatial stride
        t_stride = 1,   # temporal stride
        **kwargs
    ):
        super().__init__()
        kernel_size = cast_tuple(kernel_size, 3)
        time_kernel_size, height_kernel_size, width_kernel_size = kernel_size

        assert is_odd(height_kernel_size) and is_odd(width_kernel_size)

        self.pad_mode = pad_mode
        time_pad = time_kernel_size - 1
        height_pad = height_kernel_size // 2
        width_pad = width_kernel_size // 2

        self.time_pad = time_pad
        # pad only the left side of the time axis, so the convolution stays causal
        self.time_causal_padding = (width_pad, width_pad, height_pad, height_pad, time_pad, 0)

        stride = (t_stride, s_stride, s_stride)
        self.conv = nn.Conv3d(
            chan_in,
            chan_out,
            kernel_size,
            stride = stride,
            **kwargs
        )

    def forward(self, x):
        # fall back to zero padding when the clip is shorter than the temporal pad
        pad_mode = self.pad_mode if self.time_pad < x.shape[2] else 'constant'
        x = F.pad(x, self.time_causal_padding, mode = pad_mode)
        return self.conv(x)
```

Next is to implement
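As a quick sanity check of the causal padding arithmetic in `CausalConv3d` above (shapes are illustrative):

```python
import torch

conv = CausalConv3d(3, 64, kernel_size = (3, 3, 3))
x = torch.randn(1, 3, 17, 64, 64)  # (batch, channels, frames, height, width)
y = conv(x)
print(y.shape)                     # torch.Size([1, 64, 17, 64, 64]); frame count is preserved
```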
@JingwWu @vincentcartillier
@Jason3900 This is great open-source work! I will check it out in detail.
Thanks!