Proposal: Multi-lingual feature sets #171

halfak · 2015-08-12T15:27:39Z

Right now, feature extraction is limited to a single language. For example, revscoring.features.diff.badwords_added depends on the language utility languages.is_badword. As a result, a feature list can only have one count of "badwords_added", as identified by one "language". The upshot is that we have a lot of mixing in our badwords sets and we're not poised to support multilingual wikis like Commons and Wikidata.

I propose that we convert the concept of a language from a context (in which feature extraction happens) into a feature set with the necessary context baked in. This would mean that we can use features from multiple languages in parallel. E.g.:

badwords = [
    revision.bytes,
    diff.bytes_changed,
    english.diff.badwords_added,
    portuguese.diff.badwords_added,
    persian.diff.badwords_added,
    ...
]

This would also mean that we wouldn't need to associate a revscoring.languages.Language with a model, just the set of features that were used to build the model. That would substantially reduce the complexity of generating and using model files, and the potential for mistakes along the way.
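To make the idea concrete, here is a minimal sketch of what "a feature set with the context baked in" could look like. It is not revscoring's actual API: the `LanguageFeatures` class, the `extract` helper, and the word lists are all hypothetical and invented for illustration. The point is only that each feature carries its own language, so the feature list is all a model needs to remember.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageFeatures:
    """Hypothetical stand-in for a language-specific feature set."""
    name: str
    badwords: frozenset

    def badwords_added(self, added_text: str) -> int:
        # Count added tokens flagged by this language's own word list.
        tokens = re.findall(r"\w+", added_text.lower())
        return sum(1 for token in tokens if token in self.badwords)


# Placeholder word lists, made up for this example.
english = LanguageFeatures("english", frozenset({"stupid", "idiot"}))
portuguese = LanguageFeatures("portuguese", frozenset({"burro", "idiota"}))

# No separate Language object is passed around: each feature is already
# bound to the language it was built from.
features = [english.badwords_added, portuguese.badwords_added]


def extract(features, added_text):
    """Apply every feature to the same added text, in list order."""
    return [feature(added_text) for feature in features]
```

With this shape, `extract(features, some_added_text)` yields one count per language in parallel, and the two word lists never need to be merged.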


he7d3r · 2015-08-13T18:18:48Z

Looks like this will also be useful for keeping the badword/informal word lists of one language from overlapping with those of another.

halfak · 2015-09-02T21:32:23Z

Resolved in #172

he7d3r added the enhancement label Aug 15, 2015

halfak mentioned this issue Aug 23, 2015

Languages as feature sets #172

Merged

halfak closed this as completed Sep 2, 2015
