Gowajee corpus

Thai smart home corpus with "Gowajee" hotword

Corpus Description

The corpus was collected as a homework assignment in the Automatic Speech Recognition class offered at Chulalongkorn University. It comprises recordings from the Spring 2017-2023 course offerings. The students were asked to form groups of up to six people, and each group was asked to come up with an example smart home application. Every member of a group recorded the same set of sentences that the group came up with. More specifically, the students were instructed to:

Collect 101 utterances per person (the same sentences for everyone in a group)
The first utterance is "Gowajee". This is designated as the wakeword.
The second utterance must start with "Gowajee", followed by an accompanying command
Record at 16 kHz, 16-bit depth, mono
In the first two years (2017-2018), the students were encouraged to record with the provided uni-directional microphones. However, this was not enforced. In later years, the students were encouraged to record using the hardware they wished to demo on.
There are no other specifications about the recording environment
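
Since the microphone and hardware were not enforced, it can be worth checking that a recording actually matches the 16 kHz, 16-bit, mono target before using it. The following is a minimal sketch using Python's standard `wave` module. It assumes the recordings are plain PCM WAV files; the function name `wav_format_mismatches` is hypothetical and not part of the corpus.

```python
import wave

# Target format from the recording instructions: 16 kHz, 16-bit, mono.
EXPECTED = {"sample_rate": 16000, "sample_width_bytes": 2, "channels": 1}


def wav_format_mismatches(path):
    """Return {field: (expected, actual)} for every field that deviates.

    An empty dict means the file matches the target format.
    Assumes `path` points to an uncompressed PCM WAV file.
    """
    with wave.open(path, "rb") as wav:
        actual = {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }
    return {
        field: (EXPECTED[field], actual[field])
        for field in EXPECTED
        if actual[field] != EXPECTED[field]
    }
```

Calling `wav_format_mismatches("some_recording.wav")` (the path is a placeholder) returns an empty dict for a conforming file, and otherwise lists each deviating field with its expected and actual value.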

For the full instructions used for the collection, see here

Benchmarks

Using the voxforge training script, the tri3b (speaker-dependent) model got 14.91% on the dev set and 8.82% on the test set.

Directory structure

dataset
- audios
  - 2017
  - 2018
  - 2019
  - 2020
  - 2021
  - 2023
- dev
  - spk2utt
  - text
  - utt2spk
  - wav.scp
- lu
  - spk2utt
  - text
  - utt2spk
  - wav.scp
- train
  - spk2utt
  - text
  - utt2spk
  - wav.scp
- test
  - spk2utt
  - text
  - utt2spk
  - wav.scp

We set aside the recordings from three groups of students as the dev set and two groups as the test set. The remaining groups form the training set.

The train/dev/test splits are included in the provided files.
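
The file names in each split directory (`wav.scp`, `text`, `utt2spk`, `spk2utt`) match the usual Kaldi data directory layout. Assuming they also follow the Kaldi line format, where each line is a key followed by whitespace and a value, a split could be loaded roughly as follows. The helper names `read_table` and `load_split` are hypothetical, and the sketch does not resolve relative audio paths or piped commands in `wav.scp`.

```python
from pathlib import Path


def read_table(path):
    """Read a Kaldi-style file where each line is '<key> <value ...>'."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(maxsplit=1)
            if not parts:
                continue  # skip blank lines
            key = parts[0]
            table[key] = parts[1] if len(parts) > 1 else ""
    return table


def load_split(split_dir):
    """Join wav.scp, text, and utt2spk on the utterance ID."""
    split_dir = Path(split_dir)
    wavs = read_table(split_dir / "wav.scp")    # utterance ID -> audio location
    texts = read_table(split_dir / "text")      # utterance ID -> transcript
    spks = read_table(split_dir / "utt2spk")    # utterance ID -> speaker ID
    return [
        {
            "utt_id": utt_id,
            "audio": audio,
            "text": texts.get(utt_id, ""),
            "speaker": spks.get(utt_id),
        }
        for utt_id, audio in wavs.items()
    ]
```

For example, `load_split("dataset/train")` would return one dict per utterance in the training split, provided the directory sits at that path.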

Version 0.9.3

There are 20308 utterances collected from 188 speakers: 163 male and 25 female. The total length of the corpus is 17 hours and 11 minutes. The vocabulary contains 2257 distinct words, and the corpus contains 111931 words in total.
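
Word counts like these can be computed from a Kaldi-style `text` file. The sketch below assumes each line is an utterance ID followed by a transcript already segmented into space-separated words. That is an assumption about the file contents, and the function name `corpus_word_stats` is hypothetical. It is not necessarily how the figures above were produced.

```python
from collections import Counter


def corpus_word_stats(text_path):
    """Return (vocabulary_size, total_words) for a Kaldi-style text file.

    Assumes each line is '<utt_id> <word> <word> ...' with words
    separated by whitespace.
    """
    counts = Counter()
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            counts.update(tokens[1:])  # drop the leading utterance ID
    return len(counts), sum(counts.values())
```

For reference, the reported totals work out to roughly 5.5 words per utterance (111931 / 20308) and an average utterance length of about 3 seconds (61860 s / 20308).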

One group recorded "ภาษาลู", a teenage slang version of Thai. These recordings are kept separately in their own set (the lu directory).

Download

Version 0.9.3

Citing

Please cite the following, and be sure to include the version number of the corpus:

@techreport{gowajee,
     title = {{Gowajee Corpus}},
     author = {Ekapol Chuangsuwanich and Atiwong Suchato and Korrawe Karunratanakul and Burin Naowarat and Chompakorn CChaichot
and Penpicha Sangsa-nga and Thunyathon Anutarases and Nitchakran Chaipojjana and Yuatyong Chaichana},
     year = {2020},
     institution = {Chulalongkorn University, Faculty of Engineering, Computer Engineering Department},
     month = {12},
     Date-Added = {2023-07-30},
     url = {https://github.com/ekapolc/gowajee_corpus},
     note = {Version 0.9.3}
}
