🎰TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Intro Image

$$ TextSSR ~ Capability ~ Showcase. $$

📢News

[2024.12.05] - The training dataset and generative dataset are released!

[2024.12.04] - We released the latest model and an online demo; check them out on ModelScope.

[2024.12.03] - Our paper is available on arXiv (arXiv:2412.01137).

📝TODOs

  • Provide public checkpoints and Gradio demo
  • Release TextSSR-benchmark dataset and evaluation code
  • Release processed AnyWord-lmdb dataset
  • Release our scene text synthesis dataset, TextSSR-F
  • Release training and inference code

💎Visualization

Intro Model

$$ Model ~ Architecture ~ Display. $$

Intro Framework

$$ Data ~ Synthesis ~ Pipeline. $$

Results

$$ Results ~ Presentation. $$

🛠Installation

Environment Settings

  1. Clone the TextSSR Repository:

    git clone https://github.com/YesianRohn/TextSSR.git
    cd TextSSR
  2. Create a New Environment for TextSSR:

    conda create -n textssr python=3.10
    conda activate textssr
  3. Install Required Dependencies:

    • Install PyTorch, TorchVision, Torchaudio, and the necessary CUDA version:
    conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
    • Install the rest of the dependencies listed in the requirements.txt file:
    pip install -r requirements.txt
    • Install our modified diffusers:
    cd diffusers
    pip install -e .
    cd ..
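
After installation, it can be useful to confirm that the environment matches the versions pinned in the conda command above. The following is a minimal sketch, not part of the TextSSR repository: it uses only the standard library, the helper names `installed_version` and `check_pins` are hypothetical, and it assumes the packages expose standard installation metadata.

```python
from importlib import metadata

# Versions pinned by the conda command above.
PINNED = {"torch": "2.1.0", "torchvision": "0.16.0", "torchaudio": "2.1.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pinned=PINNED):
    """Map each package name to a tuple (expected, found, ok)."""
    report = {}
    for name, expected in pinned.items():
        found = installed_version(name)
        # Some builds append a local suffix such as "+cu118", so compare by prefix.
        ok = found is not None and found.startswith(expected)
        report[name] = (expected, found, ok)
    return report


if __name__ == "__main__":
    for name, (expected, found, ok) in check_pins().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, found {found} ({status})")
```

Run it inside the activated `textssr` environment; a `MISMATCH` line only indicates that the installed version differs from the pinned one, not that the code will necessarily fail.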

Checkpoints/Data Preparation

  1. Data Preparation:

    • You can use the AnyWord-3M dataset provided by AnyText. In that case, modify the data loading code to use AnyWordDataset instead of AnyWordLmdbDataset.
    • If you have obtained our AnyWord-lmdb dataset, simply place it in the TextSSR folder.
  2. Font File Preparation:

    • Either download the Alibaba PuHuiTi font (the file should be named AlibabaPuHuiTi-3-85-Bold.ttf) or use your own custom font file.
    • Place your font file in the TextSSR folder.
  3. Model Preparation:

    • If you want to train the model from scratch, first download the SD2-1 model from Hugging Face.
    • Place the downloaded model in the model folder.
    • During the training process, you will obtain several model checkpoints. These should be placed sequentially in the model folder as follows:
      • vae_ft (trained VAE model)
      • step1 (trained CDM after step 1)
      • step2 (trained CDM after step 2)

After the preparations outlined above, you will have the following file structure:

TextSSR/
├── model/
│   ├── stable-diffusion-v2-1
│   ├── vae_ft
│       ├── checkpoint-x/
│       	├── vae/
│       	└── ...
│   ├── step1
│       ├── checkpoint-x/
│       	├── unet/
│       	└── ...
│   ├── step2
│       ├── checkpoint-x/
│       	├── unet/
│       	└── ...
│   └── AnyWord-lmdb/                      
│       ├── step1_lmdb/
│       ├── step2-lmdb/
├── AlibabaPuHuiTi-3-85-Bold.ttf
├── ...(the same as the GitHub code)
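
Before training or inference, the layout above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions, not a script shipped with the repository: the function name `missing_entries` is hypothetical, the default list covers only the `model/` subfolders and the font file shown in the tree, and the dataset folder is left out so you can add it at whichever location your configuration expects.

```python
from pathlib import Path

# Entries taken from the tree above; trim the list to the stages you have completed.
EXPECTED = [
    "model/stable-diffusion-v2-1",
    "model/vae_ft",
    "model/step1",
    "model/step2",
    "AlibabaPuHuiTi-3-85-Bold.ttf",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected entries are present.")
```

When training from scratch, `vae_ft`, `step1`, and `step2` will be reported as missing until the corresponding training stage has produced them.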

🚂 Training

  1. Step 1: Fine-tune the VAE:

    accelerate launch --num_processes 8 train_vae.py --config configs/train_vae_cfg.py
  2. Step 2: First stage of CDM training:

    accelerate launch --num_processes 8 train_diff.py --config configs/train_diff_step1_cfg.py
  3. Step 3: Second stage of CDM training:

    accelerate launch --num_processes 8 train_diff.py --config configs/train_diff_step2_cfg.py
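
The three commands above can also be assembled from a small driver script. The sketch below is an illustration only: `build_command` and `run_all` are hypothetical helpers, and the script defaults to a dry run because each stage may require you to place the previous stage's checkpoint in the model folder (and update the config accordingly) before the next stage starts.

```python
import subprocess

# (script, config) pairs in the order given in the steps above.
STAGES = [
    ("train_vae.py", "configs/train_vae_cfg.py"),
    ("train_diff.py", "configs/train_diff_step1_cfg.py"),
    ("train_diff.py", "configs/train_diff_step2_cfg.py"),
]


def build_command(script, config, num_processes=8):
    """Build the accelerate launch command for one training stage."""
    return [
        "accelerate", "launch",
        "--num_processes", str(num_processes),
        script, "--config", config,
    ]


def run_all(num_processes=8, dry_run=True):
    """Print each stage's command; execute it only when dry_run is False."""
    for script, config in STAGES:
        cmd = build_command(script, config, num_processes)
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if a stage exits with an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Set `num_processes` to the number of GPUs you actually have; the value 8 simply mirrors the commands above.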

🔍 Inference

  • Ensure the benchmark path is correctly set in infer.py.
  • Run the inference process with:
    python infer.py

This will start the inference and generate the results.

📊Evaluation

TBD

🔗Citation

@article{ye2024textssr,
  title={TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition},
  author={Ye, Xingsong and Du, Yongkun and Tao, Yunbo and Chen, Zhineng},
  journal={arXiv preprint arXiv:2412.01137},
  year={2024}
}

🌟 Acknowledgements

Many thanks to these great projects for their contributions, which have influenced and supported our work in various ways: SynthText, TextOCR, DiffUTE, TextDiffuser & TextDiffuser-2, AnyText, UDiffText, SceneVTG, and SVTRv2.

Special thanks also go to the training frameworks: STR-Fewer-Labels and OpenOCR.