Fine Tuning with Arabic #33

Closed
Mahmuod1 opened this issue Aug 21, 2022 · 5 comments
Comments

@Mahmuod1

First, I would like to thank you for this repo.
I want to work with Arabic, which is a right-to-left (RTL) language.
Could you give me a brief overview of the changes I would need to make to add Arabic to SynthDoG in order to create an Arabic dataset,
and of the changes needed in the model creation?

@Mahmuod1 Mahmuod1 changed the title Fine Tuning with arabic Arabic Fine Tuning with Arabic Aug 21, 2022
@Mahmuod1
Author

@gwkrsrch, could you please help?

@gwkrsrch
Collaborator

Hi @Mahmuod1, there are several options you can take. You may modify the layout/textbox generation module to produce the desired RTL layout. Several lines of code would need to be modified, e.g., textbox, layouts.
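
As a rough illustration of the RTL layout change mentioned above, here is a minimal sketch that anchors each text line to the right margin instead of the left. It is not SynthDoG's actual code: `TextBox`, `layout_lines`, and all parameter names are hypothetical and only show the idea.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    x: int       # left edge of the box, in pixels
    y: int       # top edge of the box, in pixels
    width: int
    height: int


def layout_lines(line_widths, page_width, line_height, margin=20, rtl=True):
    """Place one text box per line (hypothetical helper, not SynthDoG's API).

    For RTL, every box ends at the right margin, so lines grow leftwards.
    For LTR, every box starts at the left margin.
    """
    usable = page_width - 2 * margin
    boxes = []
    y = margin
    for width in line_widths:
        width = min(width, usable)  # clip lines that would overflow the page
        x = page_width - margin - width if rtl else margin
        boxes.append(TextBox(x, y, width, line_height))
        y += line_height
    return boxes


boxes = layout_lines([300, 180, 240], page_width=400, line_height=32)
for box in boxes:
    print(box)
```

Box placement is only part of the problem. Rendering Arabic glyphs correctly also needs shaping and bidirectional reordering; for example, Pillow's `ImageDraw.text` accepts `direction="rtl"` when Pillow is built with libraqm.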
Another option is to generate the data with your own code based on SynthDoG. The following is the main flow of the preliminary version of SynthDoG. The first step is to draw text on a paper texture image. The following links would also be helpful to you.

Then, using a perspective transformation (or other transformations), you can embed the synthetic paper into a background. Although the idea is simple, it gives reasonable results. You may further improve the quality of the generated samples with various techniques, but this is optional. Hope this helps :) Feel free to reopen this or open another issue if you have anything new to share.
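
The perspective transformation step described above can be sketched in plain Python as follows. This is an illustrative assumption, not SynthDoG's implementation: it computes the 3x3 homography that maps the four corners of a paper image onto four chosen points in the background, and the corner coordinates and function names are made up for the example.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return the 3x3 matrix H (with H[2][2] = 1) mapping 4 src points to 4 dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, x, y):
    """Apply the homography to one point (projective division included)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


# Corners of a 400x600 paper image and where they should land in the background.
paper = [(0, 0), (400, 0), (400, 600), (0, 600)]
target = [(120, 80), (520, 110), (490, 700), (90, 650)]
H = homography(paper, target)
print(warp_point(H, 200, 300))  # where the paper centre ends up
```

For whole images, OpenCV's `cv2.getPerspectiveTransform` and `cv2.warpPerspective` perform the same computation and the pixel resampling.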

@Mahmuod1
Author

Thanks, @gwkrsrch, for your detailed instructions.
Could you please tell me which parts of the Donut model configuration need to change because they are language-specific?
I will be training for document parsing, so could you tell me what I should pay attention to in the training configuration?

@gwkrsrch
Collaborator

As a general tip, when training a model for a new language, you need to pay attention to the token vocabulary/tokenizer. #11 should be useful to you :)
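
One simple way to act on the vocabulary tip above is to check whether the characters in your Arabic ground-truth text appear anywhere in the tokenizer's vocabulary. The sketch below is an assumption-laden illustration: `uncovered_chars` is a hypothetical helper and `toy_vocab` is a made-up stand-in for a real vocabulary (with a Hugging Face tokenizer, the token list could come from `tokenizer.get_vocab()`).

```python
from collections import Counter


def uncovered_chars(texts, vocab):
    """Count non-space characters in `texts` that occur in no vocabulary token.

    Hypothetical helper: character coverage is a crude proxy for how well a
    tokenizer handles a language, not a full tokenization check.
    """
    covered = set("".join(vocab))
    missing = Counter()
    for text in texts:
        for ch in text:
            if not ch.isspace() and ch not in covered:
                missing[ch] += 1
    return missing


# Toy vocabulary and samples, invented for illustration only.
toy_vocab = ["▁the", "▁doc", "ument", "مر", "حبا"]
samples = ["مرحبا بالعالم", "the document"]

missing = uncovered_chars(samples, toy_vocab)
for ch, count in missing.most_common():
    print(f"U+{ord(ch):04X} {ch!r} appears {count} time(s) but is not in the vocabulary")
```

If many Arabic characters turn up as uncovered, the tokenizer likely needs to be replaced or extended before training.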

@akashlp27

akashlp27 commented Oct 5, 2023

Hi @Mahmuod1, @gwkrsrch, were you able to generate images using SynthDoG for RTL languages such as Arabic? Any suggestions would help a lot.
