Add generic dataset #67

Merged: 21 commits into main from zhuang/add-generic-dataset on Aug 6, 2024

zqhuang211
Contributor

Add GenericVoiceDataset to support new datasets via YAML config, e.g.:

data_dicts:
  - path: "fixie-ai/librispeech_asr"
    name: "clean"
    splits:
      - "train.100"
      - "train.360"
    user_template: "Continue the following text using less than 50 words:\n\n<|audio|>"
    assistant_template: "{{ continuation }}"
    transcript_template: "{{ text }}"
    num_samples: 100_000
  - path: "fixie-ai/librispeech_asr"
    name: "other"
    splits:
      - "train.500"
    user_template: "Continue the following text using less than 50 words:\n\n<|audio|>"
    assistant_template: "{{ continuation }}"
    transcript_template: "{{ text }}"
    num_samples: 100_000

See ultravox.training.configs.llama3_whisper_kd.yaml for an example experiment configuration.
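
To make the role of the three template fields concrete, here is a minimal sketch of how one `data_dicts` entry might be turned into a training example. It is not the GenericVoiceDataset implementation: `DatasetEntry`, `render`, and `build_messages` are hypothetical names, the sample row is made up, and the small regex substitution only stands in for whatever template engine the real code uses for the `{{ ... }}` placeholders.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DatasetEntry:
    """Hypothetical mirror of one `data_dicts` entry from the YAML config."""
    path: str
    splits: List[str]
    user_template: str
    assistant_template: str
    transcript_template: str
    name: Optional[str] = None
    num_samples: Optional[int] = None


# Matches placeholders such as `{{ text }}` or `{{continuation}}`.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, sample: Dict[str, str]) -> str:
    """Replace each `{{ key }}` placeholder with the value from one sample row."""
    return _PLACEHOLDER.sub(lambda m: str(sample[m.group(1)]), template)


def build_messages(entry: DatasetEntry, sample: Dict[str, str]) -> Dict[str, str]:
    """Render the user prompt, assistant target, and transcript for one row."""
    return {
        "user": render(entry.user_template, sample),
        "assistant": render(entry.assistant_template, sample),
        "transcript": render(entry.transcript_template, sample),
    }


# Values copied from the first entry of the YAML example above.
entry = DatasetEntry(
    path="fixie-ai/librispeech_asr",
    name="clean",
    splits=["train.100", "train.360"],
    user_template="Continue the following text using less than 50 words:\n\n<|audio|>",
    assistant_template="{{ continuation }}",
    transcript_template="{{ text }}",
    num_samples=100_000,
)

# Made-up row; the column names are assumed to match the template placeholders.
sample = {"text": "the quick brown fox", "continuation": "jumps over the lazy dog"}
messages = build_messages(entry, sample)
```

In this sketch the `<|audio|>` marker in the user template passes through untouched, since it is not a `{{ ... }}` placeholder; only the assistant and transcript templates pull values from the row.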

@zqhuang211
Contributor Author

The previous PR was not merged properly. Submitting it one more time.

@zqhuang211 zqhuang211 requested a review from juberti August 5, 2024 23:20
@zqhuang211 zqhuang211 merged commit 66db567 into main Aug 6, 2024
1 check passed
@zqhuang211 zqhuang211 deleted the zhuang/add-generic-dataset branch August 13, 2024 01:50