
Bulk Submissions

Jonathan Mukiibi edited this page Jul 24, 2021 · 3 revisions

Bulk submission

If you know of a public-domain corpus of more than 100k sentences, you can submit a pull request to add it as a bulk dataset. However, you will need to perform manual QA (quality assurance) to make sure the sentences are valid and high-quality.

This Discourse post has a more detailed guide for how to do manual QA, but in brief:

  1. You need 2-3 native speakers to review a random sample of the sentences and verify their correctness.

  2. The sentences should be spelt correctly.

  3. The sentences should be grammatically correct.

  4. The sentences should be speakable, which also means avoiding uncommon non-native words.
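Drawing the random sample mentioned in step 1 can be sketched in a few lines of Python. This is only an illustration: the function name `draw_review_sample` and the fixed seed are assumptions, not part of the Common Voice tooling. The fixed seed is there so that every reviewer works from the same sample.

```python
import random


def draw_review_sample(sentences, sample_size, seed=0):
    """Return a reproducible random sample of sentences for manual review.

    `sentences` is a list of strings; `seed` is an arbitrary fixed value
    (an assumption of this sketch) so the same sample can be regenerated.
    """
    rng = random.Random(seed)
    # Never ask for more sentences than the corpus contains.
    k = min(sample_size, len(sentences))
    # random.Random.sample draws k distinct positions without replacement.
    return rng.sample(sentences, k)
```

The returned list can then be pasted into the review spreadsheet, one sentence per row.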

We're looking for an error rate of less than 5% on the random sample. You can use this tool with a confidence level of 99% and a margin of error of 2% to determine the sample size you need to review. Feel free to set up this QA however makes the most sense for you, but here's a sample Google Spreadsheets template from Mozilla and the one for Luganda. Once the review is complete, submit a pull request with the number of sentences submitted, a link to the manual QA results, and the error rate as a percentage. Here's an example PR.
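If the linked tool is unavailable, the sample size can be approximated locally. The sketch below assumes the standard Cochran formula with a finite population correction and a worst-case proportion of 0.5; the tool referred to above may compute it differently. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence=0.99, margin=0.02, p=0.5):
    """Estimate how many sentences to review (Cochran's formula, assumed).

    population: total number of sentences in the corpus
    confidence: confidence level, e.g. 0.99 for 99%
    margin:     margin of error, e.g. 0.02 for 2%
    p:          assumed proportion; 0.5 is the most conservative choice
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z * z * p * (1 - p) / (margin * margin)
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def error_rate(rejected, reviewed):
    """Fraction of reviewed sentences that the reviewers rejected."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive")
    return rejected / reviewed
```

Under these assumptions, a corpus of 100,000 sentences needs a sample of roughly 4,000 sentences, and a review passes the threshold above when `error_rate(rejected, reviewed) < 0.05`.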

Quick Steps

Link to the Common Voice repository.

Fork and clone the repository.

Before you make a PR, you need to perform a duplicate check to verify that the sentences in the new batch do not already exist on Common Voice.

  • Here is the merged .txt file of all the sentences currently on CV.
  • Run a Python compare script that checks the new batch .txt file of sentences against the merged file.
  • Only upload the .txt file containing the new sentences that passed the check.
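The duplicate check described above could look something like the following. This is a minimal sketch, not the script actually used for Luganda: it assumes one sentence per line, treats sentences as duplicates only when they match exactly after whitespace is normalized, and all function names are hypothetical.

```python
def normalize(sentence):
    """Collapse runs of whitespace and trim the ends of a sentence."""
    return " ".join(sentence.split())


def find_new_sentences(existing_lines, batch_lines):
    """Return batch sentences that are not already in the existing set.

    Duplicates inside the batch itself are also dropped, and the
    original order of the batch is preserved.
    """
    seen = {normalize(line) for line in existing_lines}
    new_sentences = []
    for line in batch_lines:
        sentence = normalize(line)
        if sentence and sentence not in seen:
            seen.add(sentence)
            new_sentences.append(sentence)
    return new_sentences


def filter_batch(existing_path, batch_path, output_path):
    """Write only the new sentences from batch_path to output_path."""
    with open(existing_path, encoding="utf-8") as f:
        existing = f.readlines()
    with open(batch_path, encoding="utf-8") as f:
        batch = f.readlines()
    new_sentences = find_new_sentences(existing, batch)
    with open(output_path, "w", encoding="utf-8") as f:
        for sentence in new_sentences:
            f.write(sentence + "\n")
    return len(new_sentences)
```

Calling `filter_batch` with the merged file, the new batch file, and an output path of your choosing writes the deduplicated batch and returns how many sentences survived the check.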

You can now go ahead and send in your PR as follows:

  • git checkout -b add-batch-five-lug
  • Add filename.txt to the folder server/data/lg/
  • git add filename.txt
  • git status
  • git commit -m "added new batch of 1000 sentences"
  • git push origin add-batch-five-lug ---commit 8007 error rate

The link for storing manual QA results for Luganda is here.
