This repository has been archived by the owner on Aug 9, 2024. It is now read-only.

Proposal: Multi-lingual feature sets #171

Closed
halfak opened this issue Aug 12, 2015 · 2 comments

Comments

@halfak
Member

halfak commented Aug 12, 2015

Right now, feature extraction is limited to a single language. For example, revscoring.features.diff.badwords_added depends on the language utility languages.is_badword. As a result, a feature list can only have a count of "badwords_added" as identified by one "language". The consequence is that our badwords sets mix words from several languages, and we're not poised to support multi-lingual wikis like Commons and Wikidata.

I propose that we convert the concept of a language from a context (in which feature extraction happens) to a feature set with the necessary context baked in. This would mean that we can use features from multiple languages in parallel. E.g.

badwords = [
    revision.bytes,
    diff.bytes_changed,
    english.diff.badwords_added,
    portuguese.diff.badwords_added,
    persian.diff.badwords_added,
    ...
]

This would also mean that we wouldn't need to associate a revscoring.languages.Language with a model -- just the set of features that were used to build the model. That would substantially reduce the complexity and the potential for mistakes involved in generating and using model files.
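
To make "context baked in" concrete, here is a minimal sketch of the idea. It is not revscoring's actual API: `Feature`, `badwords_added`, `extract_all`, and the toy word lists are all hypothetical names made up for illustration. Each feature closes over its own language's word list, so several languages can be counted over the same added text.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


# Hypothetical stand-in for a feature; not a revscoring class.
@dataclass(frozen=True)
class Feature:
    name: str
    extract: Callable[[str], int]  # maps the text added by a diff to a count


def badwords_added(language: str, badwords: FrozenSet[str]) -> Feature:
    """Build a feature with one language's word list baked in."""
    def count(added_text: str) -> int:
        # Naive tokenization; a real tokenizer would be language-aware.
        tokens = re.findall(r"\w+", added_text.lower())
        return sum(1 for token in tokens if token in badwords)
    return Feature(f"{language}.diff.badwords_added", count)


# Toy word lists, for illustration only.
english = badwords_added("english", frozenset({"stupid", "idiot"}))
portuguese = badwords_added("portuguese", frozenset({"burro", "idiota"}))

# No global "language" context: each feature carries its own.
features: List[Feature] = [english, portuguese]


def extract_all(features: List[Feature], added_text: str) -> Dict[str, int]:
    """Evaluate every feature against the same added text."""
    return {f.name: f.extract(added_text) for f in features}


values = extract_all(features, "You idiot, seu burro idiota")
```

Under these assumptions, a model only needs to remember its list of features; nothing language-specific has to be stored alongside it.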

@he7d3r
Contributor

he7d3r commented Aug 13, 2015

Looks like this will also be useful for keeping the badword/informal word lists from overlapping between languages.
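
Once the lists are separate per language, checking for overlaps could be as simple as the sketch below. It assumes each list is available as a plain Python set; the `word_lists` contents and the `find_overlaps` helper are hypothetical, not part of revscoring.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# Toy word lists, for illustration only.
word_lists: Dict[str, Set[str]] = {
    "english": {"stupid", "idiot"},
    "portuguese": {"burro", "idiota"},
    "spanish": {"tonto", "idiota"},
}


def find_overlaps(word_lists: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return the words shared by each pair of languages, skipping empty pairs."""
    overlaps = {}
    # Iterate over language names in sorted order so pair keys are deterministic.
    for a, b in combinations(sorted(word_lists), 2):
        shared = word_lists[a] & word_lists[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


shared = find_overlaps(word_lists)
```

Whether a shared word is a mistake or a legitimate entry in both languages still needs a human call; the check only surfaces candidates.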

@halfak
Member Author

halfak commented Sep 2, 2015

Resolved in #172

@halfak halfak closed this as completed Sep 2, 2015

2 participants