Update ds_tool.py #52
* dataset_subset: always required
* dataset_split: if set, only process the specified split; otherwise, process all splits
* upload_subset: specify if you want to rename the subset
* upload_split: specify if you want to rename the split (dataset_split needs to be set, or dataset_subset must have only one split)
* format_text: call format_asr_text on format_fields if specified
* format_fields: the fields to format
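A minimal sketch of how the options above could interact, assuming a hypothetical `DatasetArgs` container and a hypothetical `resolve_upload_splits` helper (neither name is taken from ds_tool.py). It only illustrates the stated rule that upload_split requires a single split to be processed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatasetArgs:
    # Hypothetical container; field names follow the option list above.
    dataset_subset: str                   # always required
    dataset_split: Optional[str] = None   # None means "process all splits"
    upload_subset: Optional[str] = None   # None means "keep the subset name"
    upload_split: Optional[str] = None    # None means "keep the split name"
    format_fields: List[str] = field(default_factory=list)


def resolve_upload_splits(args: DatasetArgs, available_splits: List[str]) -> Dict[str, str]:
    """Map each processed split to the split name it would be uploaded under."""
    splits = [args.dataset_split] if args.dataset_split else list(available_splits)
    if args.upload_split is None:
        # No rename requested: every split keeps its own name.
        return {s: s for s in splits}
    if len(splits) != 1:
        # Renaming is ambiguous when more than one split is processed.
        raise ValueError("upload_split needs dataset_split or a single-split subset")
    return {splits[0]: args.upload_split}
```

For example, `resolve_upload_splits(DatasetArgs("default", dataset_split="train", upload_split="train_tts"), ["train", "validation"])` yields a single mapping from `train` to `train_tts`, while requesting a rename with two splits and no dataset_split raises an error.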
Including the nltk package can break multiprocessing on macOS and cause issues with the dataset.map operation when num_proc > 1, due to an issue with `tkinter`. See nltk/nltk#2949.
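A hedged sketch of the kind of workaround discussed in that issue, assuming it takes the form of blocking `tkinter` before nltk is imported. The helper name `block_tkinter` is hypothetical, and the `# type: ignore` comment is just one common way to silence a type checker on this line, not necessarily the form used in this PR.

```python
import sys


def block_tkinter() -> None:
    # A None entry in sys.modules makes any later `import tkinter` raise
    # ImportError, so libraries that import it optionally skip their GUI code.
    # Type checkers may flag this assignment because sys.modules expects modules.
    sys.modules["tkinter"] = None  # type: ignore[assignment]


block_tkinter()

try:
    import tkinter  # noqa: F401
    tkinter_blocked = False
except ImportError:
    tkinter_blocked = True
```

The call has to run before the first `import nltk` in the process, which is why such a hack usually sits at the very top of the module that needs nltk.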
--format_text is redundant given that --format_fields is empty by default.
By default, dataset.map() uses writer_batch_size=1000, which can lead to large memory consumption when --num_workers (num_proc) is large (needed for high TTS/textgen throughput). Set --write_batch_size to a smaller value when --num_workers is large. Also add --format_fields for the TTS task.
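A rough back-of-the-envelope model of why this matters, assuming each worker may buffer up to one writer batch of rows before flushing. The helper names and the `row_budget` value are hypothetical, and actual memory use depends on row size and on the internals of the datasets library.

```python
def buffered_rows(num_proc: int, writer_batch_size: int) -> int:
    """Upper bound on rows held in writer buffers across all workers."""
    return num_proc * writer_batch_size


def scaled_writer_batch_size(num_proc: int, row_budget: int = 8000, default: int = 1000) -> int:
    """Shrink the batch size so the total buffered rows stay within row_budget.

    row_budget is an illustrative assumption, not a value from the PR.
    """
    return max(1, min(default, row_budget // num_proc))


# With 64 workers and the default batch size, up to 64,000 rows may be buffered.
rows_at_default = buffered_rows(64, 1000)
# Scaling the batch size down keeps the total near the chosen budget.
suggested = scaled_writer_batch_size(64)
```

The resulting value would then be passed through to the map call, along the lines of `dataset.map(fn, num_proc=64, writer_batch_size=suggested)`.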
@farzadab check failed due to the hack solution for the nltk bug, borrowed from: nltk/nltk#2949
`ultravox/data/text_proc.py:4: error: Incompatible types in assignment (expression has type "None", target has type Module) [assignment]`
Any idea how to fix/ignore the check failure?
To ignore you can add a |
updated to ignore the error: |
Jinja2 templates can include `text_proc.format_asr_text(text)` for dynamic text formatting. Any text processing method defined in ultravox.data.text_proc is accessible through this mechanism.
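A minimal sketch of the mechanism, assuming the module is exposed to the template under the name `text_proc`. The stand-in namespace and its `format_asr_text` body below are hypothetical placeholders; the real function in ultravox.data.text_proc may behave differently.

```python
from types import SimpleNamespace

import jinja2


def _fake_format_asr_text(text: str) -> str:
    # Placeholder behaviour for illustration only: trim and lowercase.
    return text.strip().lower()


# Stand-in for the ultravox.data.text_proc module (assumption).
text_proc = SimpleNamespace(format_asr_text=_fake_format_asr_text)

template = jinja2.Template("{{ text_proc.format_asr_text(text) }}")

# Objects passed to render() are visible inside the template, so any
# callable attribute on text_proc can be invoked from the template body.
rendered = template.render(text_proc=text_proc, text="  Hello World  ")
```

Passing the whole module, rather than individual functions, is what makes every function defined in it available to templates without further registration.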
Looks good, but I think there's still an issue with the data-dict processing per the comments there.
Co-authored-by: Justin Uberti <justin@uberti.name>
LG with just some minor comments to resolve
This PR addresses several issues in ds_tool.py and provides better support for creating synthesized speech/text