Language detection accuracy measurements #246

bzz commented Oct 15, 2019

Enry currently consists of a sequence of matching strategies, each narrowing down the possible language options based on different available information (see the sketch after the list):

  • filename + extension
  • first line of the content
  • regexp heuristics over the raw content
  • naive Bayesian classifier over the tokenized content
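
Each of these strategies is exposed through enry's public API and can be invoked on its own. Below is a minimal sketch of chaining them by hand, assuming the `github.com/src-d/enry/v2` import path and the `GetLanguagesBy*` strategy functions as I understand the current API (names and signatures may differ):

```go
package main

import (
	"fmt"

	"github.com/src-d/enry/v2"
)

func main() {
	filename := "main.py"
	content := []byte("#!/usr/bin/env python\nprint('hello')\n")

	// Every strategy shares the same shape: it takes the filename, the file
	// content, and the candidate list produced by the previous strategy, and
	// returns a (possibly narrowed) list of candidate languages.
	candidates := enry.GetLanguagesByExtension(filename, content, nil)
	candidates = enry.GetLanguagesByShebang(filename, content, candidates)    // first line
	candidates = enry.GetLanguagesByContent(filename, content, candidates)    // regexp heuristics
	candidates = enry.GetLanguagesByClassifier(filename, content, candidates) // Bayesian classifier
	fmt.Println(candidates)
}
```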

As a user, since each strategy can be used independently, I would like to know how accurate the language detection will be for each of the distinct use cases.

Use cases

  • all strategies together (default)
  • filename-only language detection
  • content-only language detection
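
For reference, here is how these three use cases map onto the single-result convenience helpers; a sketch, assuming `GetLanguage`, `GetLanguageByExtension`, and `GetLanguageByClassifier` exist with the signatures shown (they may differ in the current API):

```go
package main

import (
	"fmt"

	"github.com/src-d/enry/v2"
)

func main() {
	filename := "hello.py"
	content := []byte("#!/usr/bin/env python\nprint('hello')\n")

	// All strategies together (default).
	fmt.Println(enry.GetLanguage(filename, content))

	// Filename-only: the content is never inspected.
	lang, safe := enry.GetLanguageByExtension(filename)
	fmt.Println("by extension:", lang, safe)

	// Content-only: no filename is available.
	lang, safe = enry.GetLanguageByClassifier(content, nil)
	fmt.Println("by classifier:", lang, safe)
}
```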

Evaluation

Right now, the only measure of overall accuracy of the language detection process we have is binary (similar to linguist): whether all of the linguist/examples/ are classified correctly or not.

This issue is about picking a better way of quantifying the prediction quality for the three use cases above.
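
As a strawman for a finer-grained measure, the sketch below walks a linguist-style samples tree and computes plain per-file accuracy. The directory layout, the checkout path, and the 0/1 scoring are all assumptions made for the sketch:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/src-d/enry/v2"
)

// measureAccuracy walks a linguist-style samples tree (<root>/<Language>/<file>)
// and reports the fraction of files whose detected language matches the name
// of the directory they live in.
func measureAccuracy(root string) (float64, error) {
	var total, correct int
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		content, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		expected := filepath.Base(filepath.Dir(path))
		total++
		if enry.GetLanguage(filepath.Base(path), content) == expected {
			correct++
		}
		return nil
	})
	if err != nil {
		return 0, err
	}
	if total == 0 {
		return 0, fmt.Errorf("no sample files found under %s", root)
	}
	return float64(correct) / float64(total), nil
}

func main() {
	acc, err := measureAccuracy(".linguist/samples") // hypothetical checkout path
	if err != nil {
		panic(err)
	}
	fmt.Printf("overall accuracy: %.2f%%\n", acc*100)
}
```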

Steps

The focus of this task is not to get the best possible evaluation, but rather to quickly kick off the automation of at least some evaluation, which can then be improved in subsequent work.
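
To make that concrete, one way to kick off the automation would be a plain Go test acting as a regression gate in CI. This sketch reuses the hypothetical measureAccuracy helper from the previous snippet, and the threshold is an arbitrary placeholder to be tuned once a baseline is measured:

```go
package main

import "testing"

// A minimal regression gate: fail the build when overall detection accuracy
// drops below a fixed threshold.
func TestDetectionAccuracy(t *testing.T) {
	acc, err := measureAccuracy(".linguist/samples")
	if err != nil {
		t.Fatal(err)
	}
	const threshold = 0.85 // placeholder baseline
	if acc < threshold {
		t.Errorf("detection accuracy %.3f is below threshold %.3f", acc, threshold)
	}
}
```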
