LTX-VideoQ8 is an 8-bit adaptation of LTX-Video (https://github.com/Lightricks/LTX-Video) with no loss of accuracy and up to a 3x speedup on NVIDIA Ada GPUs. Generate 720x480x121 videos in under a minute on an RTX 4060 Laptop GPU with 8 GB of VRAM. Training code coming soon! (8 GB of VRAM is MORE than enough to fully fine-tune the 2B transformer on an Ada GPU with precalculated latents.)
Benchmark conditions: 40 steps, RTX 4060 Laptop GPU, CUDA 12.6, PyTorch 2.5.1.
*121x720x1280: in diffusers, it/s drops as the run progresses; going by the it/s of the first 10 steps, this run was expected to take ~7 min rather than 9 min.
The codebase was tested with Python 3.10.12, CUDA version 12.6, and supports PyTorch >= 2.5.1.
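As a convenience (this check is ours, not part of the repo), you can sanity-check the environment from Python:

```python
# Sanity-check the local environment (assumes PyTorch is already installed).
import torch

print(torch.__version__)           # expect >= 2.5.1
print(torch.version.cuda)          # expect 12.x
print(torch.cuda.is_available())   # must be True for GPU inference
print(torch.cuda.get_device_name(0))
```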
1) Install q8_kernels (https://github.com/KONAKONA666/q8_kernels).
2) Clone this repository and install it:

```bash
git clone https://github.com/KONAKONA666/LTX-Video.git
cd LTX-Video
python -m pip install -e ".[inference-script]"
```
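Once both steps finish, a quick smoke test (ours, not the repo's) confirms the q8_kernels extension built correctly:

```python
# If this import succeeds, the CUDA kernels were built and are visible to Python.
import q8_kernels
print("q8_kernels imported successfully")
```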
Then download the text encoder and VAE from Hugging Face: either grab the pre-quantized Q8 checkpoint or convert the weights yourself with q8_kernels.convert_weights.
```python
from huggingface_hub import snapshot_download

model_path = 'PATH'  # local directory to save the downloaded checkpoint
snapshot_download(
    "konakona/ltxvideo_q8",
    local_dir=model_path,
    local_dir_use_symlinks=False,
    repo_type='model',
)
```
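Once the download finishes, pass the directory in model_path to --ckpt_dir in the inference commands below.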
Then follow the inference code in inference.py. For text-to-video:

```bash
python inference.py --low_vram --transformer_type=q8_kernels --ckpt_dir 'PATH' --prompt "PROMPT" --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED
```

For image-to-video:

```bash
python inference.py --ckpt_dir 'PATH' --low_vram --transformer_type=q8_kernels --prompt "PROMPT" --input_image_path IMAGE_PATH --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED
```
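As a concrete usage example, a run matching the 720x480x121 setting from above might look like the following; the checkpoint path, prompt, and seed are illustrative, and the width is rounded down to 704, the nearest multiple of 32:

```bash
python inference.py --low_vram --transformer_type=q8_kernels --ckpt_dir ./ltxvideo_q8 \
    --prompt "A red fox trots across a snowy meadow at dawn, camera tracking low to the ground" \
    --height 480 --width 704 --num_frames 121 --seed 42
```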
Left: 8-bit, right: 16-bit.
Side-by-side comparisons can be found in https://github.com/KONAKONA666/LTX-Video/tree/main/docs/_static.
When writing prompts, focus on detailed, chronological descriptions of actions and scenes. Include specific movements, appearances, camera angles, and environmental details - all in a single flowing paragraph. Start directly with the action, and keep descriptions literal and precise. Think like a cinematographer describing a shot list. Keep within 200 words. For best results, build your prompts using this structure:
- Start with main action in a single sentence
- Add specific details about movements and gestures
- Describe character/object appearances precisely
- Include background and environment details
- Specify camera angles and movements
- Describe lighting and colors
- Note any changes or sudden events
- See the example prompt below for more inspiration.
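An illustrative prompt following this structure (written for this guide, not taken from the model card): "A woman with shoulder-length black hair pours coffee into a white ceramic mug, steam rising as she sets the pot back on the counter. She wears a mustard-yellow sweater and moves deliberately, glancing out a rain-streaked window. The kitchen behind her is cluttered with copper pans and a bowl of oranges. The camera holds a static medium shot at counter height, then slowly pushes in on the mug. Soft gray morning light washes the scene in muted blues and warm yellows. A phone buzzes on the counter, and she pauses mid-pour."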
- Resolution Preset: Use higher resolutions for detailed scenes and lower ones for faster generation of simpler scenes. The model works on resolutions divisible by 32 and frame counts of the form 8k + 1 (e.g. 257). If the requested resolution or frame count does not satisfy these constraints, the input is padded with -1 and then cropped to the desired resolution and number of frames. The model works best at resolutions under 720x1280 and frame counts below 257; a sketch for snapping dimensions yourself follows this list.
- Seed: Save seed values to recreate specific styles or compositions you like
- Guidance Scale: 3-3.5 are the recommended values
- Inference Steps: More steps (40+) for quality, fewer steps (20-30) for speed
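A minimal sketch (a hypothetical helper, not part of the repo) of how to snap requested dimensions to these constraints yourself, rather than relying on the pad-and-crop behavior:

```python
# Hypothetical helper: round height/width down to multiples of 32
# and the frame count down to the nearest value of the form 8k + 1.
def snap_dimensions(height: int, width: int, num_frames: int) -> tuple[int, int, int]:
    height = max(32, (height // 32) * 32)
    width = max(32, (width // 32) * 32)
    num_frames = max(1, ((num_frames - 1) // 8) * 8 + 1)
    return height, width, num_frames

print(snap_dimensions(480, 720, 121))  # -> (480, 704, 121)
```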
We are grateful to the following awesome projects that informed the implementation of LTX-Video:
- DiT and PixArt-alpha: vision transformers for image generation.
- Lightricks: the original LTX-Video model.