GitHub - hugolpz/audio-cmn: Chinese (zh-cmn) open-data audio files for 8,596 HSK words and 1,707 syllables.

Audio-cmn aims to provide high-quality, easy-to-use audio recordings of Chinese words for modern web and mobile applications. Audio-cmn is all of the following:

an original work, through the recording of Chinese syllables
a curation work, reusing pre-existing audio files from SWAC Recorder
a post-processing work, providing light, optimised .mp3 files rather than the huge original .flac files. These audio files are thus suitable for mobile application development.

Voices

# of items	Naming	Set's specifics	Authorship
1707	`cmn-zi4.mp3`	syllables v.2	Chen Wang, CC-by-sa
+8000	`cmn-名字.mp3`	HSK_2000 list (words,zi)	Yue Tan, CC-by-sa

Structure

Type of data:

.../syllabs/cmn-{tonedPinyin}.mp3 : all 1,707 Chinese syllables
.../hsk/cmn-{hanzi}.mp3 : 8,596 HSK_2000 words and characters
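The naming scheme above can be sketched as a small path builder. This is only an illustration: `audio_path` is a hypothetical helper, not something shipped with the repository, and it assumes the quality folder names listed under Qualities.

```shell
# audio_path: hypothetical helper that builds the relative path of one recording.
#   $1 quality folder (96k, 64k, 24k-abr or 18k-abr)
#   $2 set            (syllabs or hsk)
#   $3 item           (toned pinyin such as zi4, or hanzi such as 名字)
audio_path() {
  printf '%s/%s/cmn-%s.mp3\n' "$1" "$2" "$3"
}

audio_path 64k syllabs zi4      # prints 64k/syllabs/cmn-zi4.mp3
audio_path 24k-abr hsk 名字     # prints 24k-abr/hsk/cmn-名字.mp3
```

A web or mobile application would typically prepend its own base URL or asset directory to the returned path.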

Qualities

/96k/ - best audio quality; the improvement over 64k is not perceptible.
- no syllabs folder
- /96k/hsk/
/64k/ - optimal audio quality for voice recordings.
- /64k/syllabs/
- /64k/hsk/
/24k-abr/ - brutally optimized: ~2 times lighter, for 80% of the audio quality.
- /24k-abr/syllabs/
- /24k-abr/hsk/
/18k-abr/ - brutally optimized: ~3 times lighter, for 60% of the audio quality.
- /18k-abr/syllabs/
- /18k-abr/hsk/
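Because /96k/ has no syllabs folder, a client that asks for the best quality needs a fallback for syllables. A minimal sketch of that choice, assuming the folder layout listed above; `quality_for` is a hypothetical name, not part of the repository:

```shell
# quality_for: hypothetical helper that returns a quality folder that actually
# contains the requested set.
#   $1 requested quality (96k, 64k, 24k-abr or 18k-abr)
#   $2 set               (syllabs or hsk)
quality_for() {
  quality=$1
  set_name=$2
  # /96k/ only ships hsk; fall back to /64k/, which the list above describes
  # as not perceptibly different.
  if [ "$quality" = "96k" ] && [ "$set_name" = "syllabs" ]; then
    quality=64k
  fi
  printf '%s\n' "$quality"
}

quality_for 96k syllabs   # prints 64k
quality_for 96k hsk       # prints 96k
```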

Dependencies

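sudo apt install libav-tools    # provides avconv
sudo apt install lame           # MP3 encoder
# Download the original .flac recordings (-L follows redirects, -C - resumes a partial download)
curl -L -C - 'http://download.shtooka.net/cmn-caen-tan_flac.tar' -o ./cmn-caen-tan_flac.tar
# Unpack without overwriting existing files; the download is a tar archive, which unrar cannot read
tar --skip-old-files -xf './cmn-caen-tan_flac.tar'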
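Before running any conversion, it can help to confirm that the tools are actually on the PATH. The check below is a minimal sketch; `missing_tools` is a hypothetical helper, and the tool names passed to it are an assumption based on the packages above (avconv is the converter provided by libav-tools).

```shell
# missing_tools: print each named command that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Prints nothing when everything is installed.
missing_tools avconv lame curl
```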

Missing audios?

The current HSK audio database was built on the official HSK word list published in 2000 (HSK 2000), which is thus almost fully covered (at least 8,596 out of ~8,800 items). A comparison with the more recent HSK 2012 word list is available and is run via:

bash ./hsk-missing-audios.bash HSK2012_all.txt    # List the audios missing for the words in the input list
bash ./hsk-missing-audios.bash --help             # Short manual

Current difference: 582 HSK2012 words still lack human-recorded audio. See the files in ./lists/.
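The idea behind such a comparison can be sketched in a few lines. This is not the repository's script, whose internals are not shown here; `list_missing` is a hypothetical function that assumes the word list holds one word per line and that files follow the cmn-{hanzi}.mp3 naming.

```shell
# list_missing: print every word of a list that has no matching audio file.
#   $1 word list, one word per line (assumed format)
#   $2 folder holding the cmn-{hanzi}.mp3 files
list_missing() {
  # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
  while IFS= read -r word || [ -n "$word" ]; do
    [ -n "$word" ] || continue                          # skip empty lines
    [ -f "$2/cmn-$word.mp3" ] || printf '%s\n' "$word"  # report the missing word
  done < "$1"
}

# Example call: list_missing HSK2012_all.txt ./64k/hsk
```

Piping the output through `wc -l` gives the number of missing words.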

Credits

Speakers -- see the table above
Hugo Lopez, PLIDAM, INALCO -- Project management, repository, audio compression, file renaming
Nicolas Vion -- recording software & technical support

Log

v.0.1.0: clean up data by deleting the cmn-*5.ext items, since they were copies of cmn-*1.mp3
v.0.1.1: add ./18k-abr (<40MB), an optimized version of ./64k with understandable sound quality
v.0.1.2: improve README.md; add ./lists/ and a script for comparison with HSK2012.
v.2.0.0: [BREAKING CHANGE] Merge the former /hskzi/ and /hsk/ back together. [Others]: fix a critical bug on some audios; add 24k and 96k; share the conversion commands via compress-raw.bash

License

CC-by-sa. See the table above for authors.
