Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ds_tool.py #52

Merged
merged 17 commits into from
Jul 25, 2024
Merged

Update ds_tool.py #52

merged 17 commits into from
Jul 25, 2024

Conversation

zqhuang211
Copy link
Contributor

This PR addresses several issues in ds_tool.py and provides better support for creating synthesized speech/text

  • Simplified the logic for consistent naming of subsets/splits for datasets (both input and output).
  • Fixed bug related to ntlk/multiprocessing on Mac.
  • Added option to format text fields before generation.
  • Added option to control memory consumption per process.
  • Added template for creating continuations.

Zhongqiang Huang added 6 commits July 19, 2024 08:31
* dataset_subset: always required
* dataset_split: if set, only process the specified split, otherwise, all splits
* upload_subset: specify if you want to rename the subset
* upload_split: specify if you want to rename the split (dataset_split needs to be set or dataset_subset only has one split)
* format_text: call format_asr_text on format_fields if specified
* format_fields: the fields to format
The inclusion of nltk package can break multiprocess on Mac and cause
issues with dataset.map operation when num_proc > 1, due to an issue
with `tkinter`.

See nltk/nltk#2949
--format_text is redundant given that --format_fields is empty by default.
By default dataset.map() has a write_batch_size=1000, which can lead
to large memory consumption when --num_workers (num_proc) is large
(needed for high tts/textgen throughput). Set --write_batch_size to a
smaller value when --num_workers is large.

Also add --format_fields for the TTS task.
@zqhuang211 zqhuang211 changed the title Update ds tool Update ds_tool.py Jul 19, 2024
@zqhuang211
Copy link
Contributor Author

@farzadab check failed due to the hack solution for the nltk bug, borrowed from: nltk/nltk#2949

ultravox/data/text_proc.py:4: error: Incompatible types in assignment (expression has type "None", target has type Module) [assignment]

sys.modules["tkinter"] = None

Any idea how to fix/ignore the check failure?

@farzadab
Copy link
Contributor

To ignore you can add a # type: ignore comment on the same line.
I haven't looked closely to know why the check is failing though.

@zqhuang211
Copy link
Contributor Author

updated to ignore the error:
sys.modules["tkinter"] = None # type: ignore

ultravox/data/text_proc.py Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
@zqhuang211 zqhuang211 requested a review from juberti July 22, 2024 19:34
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
Jinja2 template can include `text_proc.format_asr_text(text)` for
dynamic text formatting. Any text processing method defined in
ultravox.data.text_proc is accessible from this mechanism.
Copy link
Contributor

@juberti juberti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I think there's still an issue with the data-dict processing per the comments there.

ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
zqhuang211 and others added 2 commits July 22, 2024 16:40
Co-authored-by: Justin Uberti <justin@uberti.name>
Co-authored-by: Justin Uberti <justin@uberti.name>
Copy link
Contributor

@juberti juberti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG with just some minor comments to resolve

ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
ultravox/tools/ds_tool/ds_tool.py Outdated Show resolved Hide resolved
@zqhuang211 zqhuang211 merged commit 978b329 into fixie-ai:main Jul 25, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants