Update ds_tool.py #52

zqhuang211 · 2024-07-19T16:07:27Z

This PR addresses several issues in ds_tool.py and provides better support for creating synthesized speech/text.

Simplified the logic for consistent naming of subsets/splits for datasets (both input and output).
Fixed bug related to nltk/multiprocessing on Mac.
Added option to format text fields before generation.
Added option to control memory consumption per process.
Added template for creating continuations.

* dataset_subset: always required
* dataset_split: if set, only process the specified split; otherwise, process all splits
* upload_subset: specify if you want to rename the subset
* upload_split: specify if you want to rename the split (dataset_split needs to be set, or dataset_subset must have only one split)
* format_text: call format_asr_text on format_fields if specified
* format_fields: the fields to format
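The naming rules for splits can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in ds_tool.py: the function names `resolve_split_names` and `resolve_subset_name` are hypothetical, and only the option names come from the list above.

```python
from typing import Dict, List, Optional


def resolve_subset_name(dataset_subset: str, upload_subset: Optional[str] = None) -> str:
    """Return the output subset name (hypothetical helper).

    The input subset name is reused unless a rename is requested.
    """
    return upload_subset if upload_subset is not None else dataset_subset


def resolve_split_names(
    available_splits: List[str],
    dataset_split: Optional[str] = None,
    upload_split: Optional[str] = None,
) -> Dict[str, str]:
    """Map each input split to its output split name (hypothetical helper)."""
    if dataset_split is not None:
        if dataset_split not in available_splits:
            raise ValueError(f"Unknown split: {dataset_split}")
        selected = [dataset_split]
    else:
        # No split specified: process every split in the subset.
        selected = list(available_splits)

    if upload_split is None:
        # No rename requested: output names mirror input names.
        return {name: name for name in selected}

    # A rename is only unambiguous when exactly one split is processed.
    if len(selected) != 1:
        raise ValueError(
            "upload_split requires dataset_split to be set, "
            "or the subset to contain a single split"
        )
    return {selected[0]: upload_split}
```

For example, `resolve_split_names(["train", "validation"], "train", "train_tts")` maps `train` to `train_tts`, while passing `upload_split` without `dataset_split` on a two-split subset raises an error.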

The inclusion of nltk package can break multiprocess on Mac and cause issues with dataset.map operation when num_proc > 1, due to an issue with `tkinter`. See nltk/nltk#2949

--format_text is redundant given that --format_fields is empty by default.

By default dataset.map() uses writer_batch_size=1000, which can lead to large memory consumption when --num_workers (num_proc) is large (needed for high tts/textgen throughput). Set --writer_batch_size to a smaller value when --num_workers is large. Also add --format_fields for the TTS task.
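A rough sketch of how the two settings could be forwarded to `map()`. The wrapper `map_with_bounded_writer` is a hypothetical name for illustration; `num_proc` and `writer_batch_size` are the keyword arguments of the Hugging Face `datasets` `Dataset.map()` method, and the dataset object is assumed to expose that method.

```python
from typing import Any, Callable, Dict, Optional


def map_with_bounded_writer(
    dataset: Any,
    fn: Callable[[Dict[str, Any]], Dict[str, Any]],
    num_workers: int,
    writer_batch_size: Optional[int] = None,
) -> Any:
    """Call dataset.map() with an optional cap on the writer batch size.

    Hypothetical helper: each worker buffers up to writer_batch_size rows
    before writing, so a smaller value lowers per-process memory use.
    """
    kwargs: Dict[str, Any] = {"num_proc": num_workers}
    if writer_batch_size is not None:
        # Only override the library default when the caller asks for it.
        kwargs["writer_batch_size"] = writer_batch_size
    return dataset.map(fn, **kwargs)
```

With many workers and large rows (for example audio), a call such as `map_with_bounded_writer(ds, fn, num_workers=32, writer_batch_size=100)` trades some write efficiency for a smaller buffer per process.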

zqhuang211 · 2024-07-19T16:30:00Z

@farzadab check failed due to the hack solution for the nltk bug, borrowed from: nltk/nltk#2949

ultravox/data/text_proc.py:4: error: Incompatible types in assignment (expression has type "None", target has type Module) [assignment]

sys.modules["tkinter"] = None

Any idea how to fix/ignore the check failure?

farzadab · 2024-07-19T16:41:59Z

To ignore you can add a # type: ignore comment on the same line.
I haven't looked closely to know why the check is failing though.

zqhuang211 · 2024-07-19T17:14:09Z

updated to ignore the error:
sys.modules["tkinter"] = None # type: ignore
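For context, a small sketch of what this line does at the interpreter level. Placing `None` in `sys.modules` makes any later `import tkinter` raise `ImportError` immediately instead of loading the real module. The assumption here (not confirmed in this thread) is that nltk treats `tkinter` as optional and skips it when the import fails. The helper name `tkinter_is_blocked` is hypothetical.

```python
import sys

# A None entry tells the import system that this module is unavailable.
# mypy flags the assignment because sys.modules is typed as a dict of
# modules, hence the "type: ignore" comment.
sys.modules["tkinter"] = None  # type: ignore


def tkinter_is_blocked() -> bool:
    """Return True if importing tkinter now fails (hypothetical helper)."""
    try:
        import tkinter  # noqa: F401
    except ImportError:
        return True
    return False
```

The assignment has to run before nltk is imported for the first time; otherwise `tkinter` may already have been loaded.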

Jinja2 template can include `text_proc.format_asr_text(text)` for dynamic text formatting. Any text processing method defined in ultravox.data.text_proc is accessible from this mechanism.
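A minimal sketch of this mechanism, assuming the `jinja2` package is installed. The `format_asr_text` function below is a hypothetical stand-in (it lowercases and collapses whitespace) rather than the real implementation in ultravox.data.text_proc, and `render_sample` is an illustrative name. The point is only that an object passed into the render context can have its functions called from inside the template.

```python
import types
from typing import Any, Dict

import jinja2


def format_asr_text(text: str) -> str:
    """Hypothetical stand-in for the real formatter in text_proc."""
    return " ".join(text.lower().split())


# Stand-in for the ultravox.data.text_proc module (assumption for this sketch).
text_proc = types.SimpleNamespace(format_asr_text=format_asr_text)

# The template calls the formatter on a field of the sample at render time.
template = jinja2.Template("Transcript: {{ text_proc.format_asr_text(text) }}")


def render_sample(sample: Dict[str, Any]) -> str:
    """Render one dataset row, exposing text_proc to the template."""
    return template.render(**sample, text_proc=text_proc)
```

Because formatting happens inside the template, a separate flag that lists fields to format is no longer needed for this use case.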

juberti

Looks good, but I think there's still an issue with the data-dict processing per the comments there.

juberti

LG with just some minor comments to resolve

Zhongqiang Huang added 6 commits July 19, 2024 08:31

Add continuation template

2fff95f

Fix multiprocess bug related to nltk

dc0300a

The inclusion of nltk package can break multiprocess on Mac and cause issues with dataset.map operation when num_proc > 1, due to an issue with `tkinter`. See nltk/nltk#2949

Remove flag --format_text from ds_tool.py

3cc9d85

--format_text is redundant given that --format_fields is empty by default.

Add example for generating continuation for large dataset

7cc418c

zqhuang211 requested review from juberti and farzadab July 19, 2024 16:07

zqhuang211 changed the title ~~Update ds tool~~ Update ds_tool.py Jul 19, 2024

Minor format changes

8615cf7

zqhuang211 force-pushed the update-ds_tool branch from b09c709 to 8615cf7 Compare July 19, 2024 17:02

Make sure keys match in jinja2 template rendering

be12d21

juberti reviewed Jul 22, 2024

Zhongqiang Huang added 4 commits July 22, 2024 11:33

Follow google style for module import

917fb18

Move --writer_batch_size to top-level argument class

6576ce8

Make --dataset_subset optional

81f502f

Add context for temporary fix of nltk bug

dc940e0

zqhuang211 force-pushed the update-ds_tool branch from 3310546 to dc940e0 Compare July 22, 2024 15:57

farzadab reviewed Jul 22, 2024

zqhuang211 requested a review from juberti July 22, 2024 19:34

farzadab approved these changes Jul 22, 2024

zqhuang211 force-pushed the update-ds_tool branch from 52527e9 to 45e0b8c Compare July 22, 2024 20:46

Use Jinja2 template for flexible text formatting

5a554b1

Jinja2 template can include `text_proc.format_asr_text(text)` for dynamic text formatting. Any text processing method defined in ultravox.data.text_proc is accessible from this mechanism.

zqhuang211 force-pushed the update-ds_tool branch from 45e0b8c to 5a554b1 Compare July 22, 2024 20:56

juberti reviewed Jul 22, 2024

zqhuang211 and others added 2 commits July 22, 2024 16:40

Update ultravox/tools/ds_tool/ds_tool.py

d92e858

Co-authored-by: Justin Uberti <justin@uberti.name>

Update ultravox/tools/ds_tool/ds_tool.py

044b4ab

Co-authored-by: Justin Uberti <justin@uberti.name>

Remove redundant line due to use of jinja template

2189851

zqhuang211 force-pushed the update-ds_tool branch from 7f64bb1 to 94b86ff Compare July 24, 2024 19:57

juberti approved these changes Jul 24, 2024

Revert change to enable push_to_hub at the subset level

2334a56

zqhuang211 force-pushed the update-ds_tool branch from 94b86ff to 2334a56 Compare July 24, 2024 20:14

zqhuang211 merged commit 978b329 into fixie-ai:main Jul 25, 2024
1 check passed
