Bulk Submissions
If you know of a public domain corpus with more than 100k sentences, you can submit a pull request to add it as a bulk dataset. However, you will need to perform manual QA (quality assurance) to make sure the sentences are valid and high-quality.
This Discourse post has a more detailed guide for how to do manual QA, but in brief:
- You need 2-3 native speakers to review a random sample of sentences to verify their correctness.
- The sentences should be spelt correctly.
- The sentences should be grammatically correct.
- The sentences should be speakable (also avoiding uncommon non-native words).
We're looking for an error rate below 5% on the random sample. You can use this tool with a confidence level of 99% and a margin of error of 2% to determine the sample size you need to review. Feel free to set up this QA however makes the most sense for you, but here's a sample Google Spreadsheets template from Mozilla and the one for Luganda. Once the review is complete, submit a pull request with the number of sentences submitted, a link to the manual QA results, and the error rate (%). Here's an example PR.
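As a rough cross-check on the sample-size tool, the sketch below estimates how many sentences to review. It assumes the tool uses the standard Cochran formula with a finite population correction and a worst-case proportion of 0.5, which this README does not confirm. The function names (`required_sample_size`, `draw_sample`, `error_rate`) are hypothetical helpers, not part of Common Voice.

```python
import math
import random
from statistics import NormalDist


def required_sample_size(population, confidence=0.99, margin=0.02, p=0.5):
    """Estimate the review sample size (assumed: Cochran's formula
    with finite population correction, worst-case proportion p=0.5)."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it for a finite corpus of `population` sentences.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def draw_sample(sentences, size, seed=0):
    """Pick a reproducible random sample of sentences for reviewers."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(size, len(sentences)))


def error_rate(num_rejected, num_reviewed):
    """Fraction of reviewed sentences that reviewers rejected."""
    return num_rejected / num_reviewed


# Example: a corpus of 100k sentences at 99% confidence, 2% margin.
print(required_sample_size(100_000))
# Example: 120 rejected out of 4000 reviewed is 3%, below the 5% target.
print(error_rate(120, 4000) < 0.05)
```

Fixing the seed in `draw_sample` means every reviewer can regenerate the same sample from the same input file, which makes the QA results easier to audit.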
Quick Steps
- Link to the Common Voice repository.
- Fork and clone the repository.
Before you make a PR, you need to perform a duplicate check. This verifies that the sentences in the new batch do not already exist on Common Voice.
- Here is the merged .txt file of all the sentences currently on CV.
- Run a Python compare script against the .txt file containing the new batch of sentences.
- Only upload the .txt file containing the new sentences that have passed the check.
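The compare script itself is not included here, so the following is only a minimal sketch of what the duplicate check could look like. It assumes both files hold one sentence per line and that two sentences count as duplicates when they match exactly after whitespace is collapsed. The function names and any file names you pass in are hypothetical.

```python
def normalize(sentence):
    """Collapse runs of whitespace and trim the ends (assumed rule)."""
    return " ".join(sentence.split())


def find_new_sentences(existing, candidates):
    """Return candidate sentences that are not in `existing`.

    Duplicates inside the candidate batch itself are dropped too,
    keeping the first occurrence and the original order.
    """
    seen = {normalize(s) for s in existing if s.strip()}
    fresh = []
    for line in candidates:
        sentence = normalize(line)
        if sentence and sentence not in seen:
            seen.add(sentence)
            fresh.append(sentence)
    return fresh


def filter_batch(existing_path, batch_path, output_path):
    """Write only the new sentences from the batch file to output_path."""
    with open(existing_path, encoding="utf-8") as f:
        existing = f.readlines()
    with open(batch_path, encoding="utf-8") as f:
        batch = f.readlines()
    fresh = find_new_sentences(existing, batch)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(fresh) + ("\n" if fresh else ""))
    return len(fresh)


# In-memory example with made-up sentences.
already_on_cv = ["Hello there.\n", "Good morning.\n"]
new_batch = ["Good  morning.\n", "See you soon.\n", "See you soon.\n"]
print(find_new_sentences(already_on_cv, new_batch))
```

To run it on real files, call `filter_batch` with the path to the merged file, the path to the new batch, and the path where the filtered batch should be written. A stricter check might also ignore case or punctuation; that choice is left open here.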
You can now go ahead and send in your PR as follows:
- git checkout -b add-batch-five-lug
- Add filename.txt to the folder server/data/lg/
- git add server/data/lg/filename.txt
- git status
- git commit -m "added new batch of 1000 sentences"
- git push origin add-batch-five-lug ---commit 8007 error rate
The link for storing manual QA results for Luganda is here.