Language detection accuracy measurements #246

bzz commented Oct 15, 2019

Enry currently consists of a sequence of matching strategies, each narrowing down the possible language options based on different available information (see the sketch after the list):

  • filename + extension
  • first line of the content
  • regexp heuristics over the raw content
  • naive Bayesian classifier over the tokenized content
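
Each of these strategies is exposed through enry's public API and can be invoked on its own. Below is a minimal sketch of chaining them by hand, assuming the `github.com/src-d/enry/v2` import path and the `GetLanguagesBy*` strategy functions as I understand the current API (names and signatures may differ):

```go
package main

import (
	"fmt"

	"github.com/src-d/enry/v2"
)

func main() {
	filename := "main.py"
	content := []byte("#!/usr/bin/env python\nprint('hello')\n")

	// Every strategy shares the same shape: it takes the filename, the file
	// content, and the candidate list produced by the previous strategy, and
	// returns a (possibly narrowed) list of candidate languages.
	candidates := enry.GetLanguagesByExtension(filename, content, nil)
	candidates = enry.GetLanguagesByShebang(filename, content, candidates)    // first line
	candidates = enry.GetLanguagesByContent(filename, content, candidates)    // regexp heuristics
	candidates = enry.GetLanguagesByClassifier(filename, content, candidates) // Bayesian classifier
	fmt.Println(candidates)
}
```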

As a user, since each strategy can be used independently, I would like to know how accurate the language detection will be for each of the distinct use cases.

Use cases

  • all strategies together (default)
  • filename-only language detection
  • content-only language detection
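
For reference, here is how these three use cases map onto the single-result convenience helpers; a sketch, assuming `GetLanguage`, `GetLanguageByExtension`, and `GetLanguageByClassifier` exist with the signatures shown (they may differ in the current API):

```go
package main

import (
	"fmt"

	"github.com/src-d/enry/v2"
)

func main() {
	filename := "hello.py"
	content := []byte("#!/usr/bin/env python\nprint('hello')\n")

	// All strategies together (default).
	fmt.Println(enry.GetLanguage(filename, content))

	// Filename-only: the content is never inspected.
	lang, safe := enry.GetLanguageByExtension(filename)
	fmt.Println("by extension:", lang, safe)

	// Content-only: no filename is available.
	lang, safe = enry.GetLanguageByClassifier(content, nil)
	fmt.Println("by classifier:", lang, safe)
}
```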

Evaluation

Right now, the only measure of overall accuracy of the language detection process we have is binary (similar to linguist): whether all of the linguist/examples/ are classified correctly or not.

This issue is about picking a better way of quantifying the prediction quality for the three use cases above.
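
As a strawman for a finer-grained measure, the sketch below walks a linguist-style samples tree and computes plain per-file accuracy. The directory layout, the checkout path, and the 0/1 scoring are all assumptions made for the sketch:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/src-d/enry/v2"
)

// measureAccuracy walks a linguist-style samples tree (<root>/<Language>/<file>)
// and reports the fraction of files whose detected language matches the name
// of the directory they live in.
func measureAccuracy(root string) (float64, error) {
	var total, correct int
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		content, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		expected := filepath.Base(filepath.Dir(path))
		total++
		if enry.GetLanguage(filepath.Base(path), content) == expected {
			correct++
		}
		return nil
	})
	if err != nil {
		return 0, err
	}
	if total == 0 {
		return 0, fmt.Errorf("no sample files found under %s", root)
	}
	return float64(correct) / float64(total), nil
}

func main() {
	acc, err := measureAccuracy(".linguist/samples") // hypothetical checkout path
	if err != nil {
		panic(err)
	}
	fmt.Printf("overall accuracy: %.2f%%\n", acc*100)
}
```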

Steps

The focus of this task is not to get the best possible evaluation, but rather to quickly kick off the automation of at least some evaluation, which can then be improved in subsequent work.
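
To make that concrete, one way to kick off the automation would be a plain Go test acting as a regression gate in CI. This sketch reuses the hypothetical measureAccuracy helper from the previous snippet, and the threshold is an arbitrary placeholder to be tuned once a baseline is measured:

```go
package main

import "testing"

// A minimal regression gate: fail the build when overall detection accuracy
// drops below a fixed threshold.
func TestDetectionAccuracy(t *testing.T) {
	acc, err := measureAccuracy(".linguist/samples")
	if err != nil {
		t.Fatal(err)
	}
	const threshold = 0.85 // placeholder baseline
	if acc < threshold {
		t.Errorf("detection accuracy %.3f is below threshold %.3f", acc, threshold)
	}
}
```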
