Add Diar-az #39

Merged
merged 1 commit into from
Sep 20, 2024

Conversation

Contributor

@afk0901 afk0901 commented Aug 20, 2024

Diar-az creates files for a diarization corpus from Gecko and organizes, cleans, and corrects data for conversions from Kaldi to Gecko to Kaldi/corpus and back.

@wq2012
Owner

wq2012 commented Aug 20, 2024

I think this should fall into "Other software" instead of "Diarization dataset".

This is not a new dataset. It's just a format conversion tool, is that correct?

@judyfong
Contributor

It's a tool specifically for the ruv-di dataset.

@wq2012
Owner

wq2012 commented Aug 20, 2024

If so, we should add ruv as a dataset, and this repo as "Other Software".

@judyfong
Contributor

judyfong commented Aug 20, 2024 via email

@afk0901
Contributor Author

afk0901 commented Aug 20, 2024

Yes, I think "Other software" works and may be a better fit, as it's not really a dataset but a tool built to support the ruv-di dataset. To correct this, should this pull request be just updated or a new one created?

The dataset was never published, only the resulting models. Also, yes, that dataset should be added, but it was lost in a cybersecurity attack on Reykjavik University’s servers in January 2024. If you want, you could put placeholder text for the RÚV-DI dataset here in this repo and we could try to recreate the dataset. We have a license that lists all the shows and episodes contained in the dataset, so we could recreate it from that. Other software works in my opinion.

On Tuesday, August 20, 2024, Quan Wang wrote: If so, we should add ruv as a dataset, and this repo as "Other Software".

@wq2012
Owner

wq2012 commented Aug 23, 2024

> To correct this, should this pull request be just updated or a new one created?

I'm OK either way.

@afk0901
Contributor Author

afk0901 commented Aug 24, 2024

> > To correct this, should this pull request be just updated or a new one created?

> I'm OK either way.

Fixed, added to other software.

@judyfong
Contributor

@afk0901 I believe you also need to add the placeholder text for the dataset for this PR to be properly closed.

In terms of recreating the dataset, I believe it's actually best if @wq2012 recreates the dataset with daan and pet of Google, and @afk0901 finishes our writeup of this dataset creation. When we are both done, we compare notes on arXiv and write the dataset paper together for Interspeech, ICASSP, or sand2025, or wand in October.

@judyfong
Contributor

For continuity and clarity, I believe it's best if my second paragraph is dealt with separately, not in this PR. Thus I have created a new issue for it within this repo.

@wq2012
Owner

wq2012 commented Sep 20, 2024

> > > To correct this, should this pull request be just updated or a new one created?

> > I'm OK either way.

> Fixed, added to other software.

I didn't see the change.

@afk0901 afk0901 force-pushed the master branch 2 times, most recently from f32caa9 to 8d1a453 on September 20, 2024, 19:29
README.md Outdated
@@ -295,6 +296,7 @@ Team in the Inaugural DIHARD Challenge](https://www.isca-speech.org/archive/pdfs
| [VoxConverse](https://github.com/joonson/voxconverse) | TBD | TBD | Free | VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos |
| [MiniVox Benchmark](https://github.com/doerlbh/MiniVox) | [MiniVox Benchmark](https://github.com/doerlbh/MiniVox) | en | Free | MiniVox is an automatic framework to transform any speaker-labelled dataset into continuous speech datastream with episodically revealed label feedbacks. |
| [The AliMeeting Corpus](https://github.com/yufan-aslp/AliMeeting) | Together with audios | zh | Free | |
| RÚV-DI dataset | TBD | is | TBD | |
Owner

please remove this

Contributor Author

Removed.
