June'23 Update: Hugging Face Spaces demo available here: vivlavida/generative-disco 🌷
Visuals are a core part of our experience of music. However, creating music visualizations is a complex, time-consuming, and resource-intensive process. We introduce Generative Disco, a generative AI system that helps users create music visualizations with large language models and text-to-image models. Users select intervals of music to visualize and then parameterize that visualization by defining start and end prompts. The system interpolates ("warps") between these prompts and generates frames to the beat of the music, producing audioreactive video. We introduce design patterns for improving generated videos: "transitions", which express shifts in color, time, subject, or style, and "holds", which encourage visual emphasis and consistency.
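The core idea of warping between a start and an end prompt can be illustrated with a minimal sketch. The real system relies on latent-space interpolation inside stable-diffusion-videos; the helper below is a simplified, hypothetical linear version that just blends two prompt embeddings across a fixed number of frames.

```python
import numpy as np

def interpolate_embeddings(start, end, n_frames):
    """Linearly blend two prompt embeddings, one weight per frame.

    A simplified stand-in for the (spherical) latent interpolation
    that stable-diffusion-videos performs between prompts.
    """
    ts = np.linspace(0.0, 1.0, n_frames)
    return [(1 - t) * start + t * end for t in ts]
```

Each blended embedding would then be decoded into one video frame, so longer intervals simply use more interpolation steps.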
The only way the demo differs from the video is that it includes fields for `OpenAI Key` and `Soundcloud URL to Music`. These fields are intended for users who would like to use Disco in a more dedicated way. Hugging Face Spaces allows many people to edit the same space at once, so duplicate/clone the space if you would like to persist your work.
The `OpenAI Key` field is only necessary if you want to use the `Brainstorming Describe Interval` function, which retrieves inspiration for different prompt subjects using GPT-3. We pass the key as an argument to an API call.
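To make the key-passing concrete, here is a hedged sketch of how such a brainstorming request could be assembled. The function name, route of delivery, model name, and prompt text are all hypothetical; the actual prompt Generative Disco sends is not reproduced here.

```python
def build_brainstorm_request(api_key, interval_description):
    """Assemble a hypothetical chat-completion payload for the
    'Describe Interval' brainstorming call. The user-supplied key
    is forwarded as a bearer token in the Authorization header."""
    return {
        "headers": {"Authorization": f"Bearer {api_key}"},
        "json": {
            "model": "gpt-3.5-turbo",  # placeholder model name
            "messages": [
                {
                    "role": "user",
                    "content": (
                        "Suggest visual subjects for this music "
                        f"interval: {interval_description}"
                    ),
                }
            ],
        },
    }
```

Because the key is only ever used as a request argument, it is not stored server-side between calls.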
The `Soundcloud URL to Music` field lets you change the music file.
- All fields in the form (Interval #) must be filled out for generation to begin.
While video intervals are generating, you can check progress in the Logs tab. Generating even a few seconds of content on the community GPU may take a while, so we recommend generating very short intervals (e.g., 0.5 seconds at a time).
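The cost of an interval scales with the number of frames it contains, which is why very short intervals are recommended on shared hardware. A small sketch (the 12 fps default here is an assumption, not the system's actual setting):

```python
def frames_for_interval(duration_s, fps=12):
    """Number of frames to render for an interval.

    fps is an assumed illustration value; generation time grows
    roughly linearly with the frame count. At least two frames are
    needed (one for the start prompt, one for the end prompt).
    """
    return max(2, round(duration_s * fps))
```

At these assumed settings, a 0.5-second interval needs only 6 frames, while a 10-second interval needs 120.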
Full Youtube video
arXiv Preprint
DiscoEx_1.mp4
Disco_Ex2.mov
Generative Disco's system design. Users begin by interacting with the waveform to create intervals within the music (#1). To find prompts that will define the start and end of intervals, users can brainstorm prompts using prompt suggestions from GPT-4 or videography domain knowledge (#4-6) and explore text-to-image generations (#7, #8). Results users like can be dragged and dropped into the start and end areas (#10, #11), after which an interval can be generated. Generated intervals populate the tracks area (#15) and can be stitched into a video that renders in the Video Area (#9).
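Interval creation in the UI happens interactively on the waveform (via wavesurfer.js), but the underlying beat alignment can be sketched as follows. This helper, with hypothetical names, computes beat timestamps for a constant-tempo track; real tracks may have variable tempo, which this ignores.

```python
def beat_times(bpm, duration_s):
    """Timestamps (seconds) of each beat for a constant-tempo track.

    Interval boundaries snapped to these times keep generated video
    transitions aligned with the music's beat.
    """
    period = 60.0 / bpm  # seconds per beat
    times, t = [], 0.0
    while t < duration_s:
        times.append(round(t, 6))
        t += period
    return times
```

For a 120 BPM track, beats fall every half second, so a one-beat interval at that tempo is exactly the kind of short generation unit recommended above.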
Docker Hub:

```shell
docker pull hellovivian/art-ai:disco_local
```
*Requires a GPU with a decent amount of VRAM.
Install the package and create the environment:

```shell
pip install -U stable_diffusion_videos
conda env create -f disco_environment.yml
```

Authenticate with Hugging Face:

```shell
huggingface-cli login
```

Run the app:

```shell
conda activate video
python flask_app.py
```
The Stable Diffusion checkpoint used was V1-4. The web application was written in Python, Javascript, and Flask. Images were generated with 50 iterations on an NVIDIA V100. The music, assets (images), stylesheets, and Javascript are collected in the `static` folder. Logic and routing are controlled by `flask_app.py`.
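To illustrate the Flask routing layer, here is a minimal, hypothetical sketch of what one generation endpoint might look like. The route name, payload fields, and response shape are all assumptions for illustration; the actual routes live in `flask_app.py`.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/generate_interval", methods=["POST"])
def generate_interval():
    """Hypothetical endpoint: accept an interval's start/end prompts
    and queue it for generation (generation itself omitted here)."""
    data = request.get_json(force=True)
    # ... here the real app would launch the text-to-video pipeline
    # for data["start_prompt"] and data["end_prompt"] ...
    return jsonify(status="queued", interval=data.get("interval"))
```

A client (the Javascript front end in `static`) would POST the interval number and its two prompts, then poll for the rendered clip.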
The system was built on top of two open-source repositories: a) stable-diffusion-videos from Hugging Face, written by Nate Raw and b) wavesurfer.js. stable-diffusion-videos also builds on a script shared by @karpathy. The script was modified to this gist, which was then updated/modified to this repo.
BibTex
@misc{liu2023generative,
title={Generative Disco: Text-to-Video Generation for Music Visualization},
author={Vivian Liu and Tao Long and Nathan Raw and Lydia Chilton},
year={2023},
eprint={2304.08551},
archivePrefix={arXiv},
primaryClass={cs.HC}
}
Feel free to file any issues or feature requests :)