Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Wikicorpus custom filters #2089

Merged
merged 15 commits into from
Jul 13, 2018

Conversation

mattilyra
Copy link
Contributor

@mattilyra mattilyra commented Jun 13, 2018

I wanted to add a simple method for filtering out wiki articles based on arbitrary rules. The idea is to be able to easily produce datasets like the WikiText-103.

I am currently working on making a German equivalent of that corpus (call it DeWiki103 for now), but that requires filtering out articles based on regular expressions applied to the text of each XML element (short of parsing the Wikimedia format).

This PR adds a simple filter_articles method to WikiCorpus that gets called in extract_pages. Returning None will filter out that specific article.

Let me know if you think this might be a useful addition.

@@ -380,6 +380,13 @@ def extract_pages(f, filter_namespaces=False):
if ns not in filter_namespaces:
text = None

if callable(filter_articles) and text:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if filter_articles isn't a callable? If I understand correctly, it should be a callable always, no need to check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's just catching the default case where filter_articles=None - I didn't see the point of introducing a dummy callable that always returns true. I guess you could make the default filter_articles=False but that confuses the type semantics of filter_articles is it bool or a callable

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is not None preferred (since that's what we really want to test). But a dummy filter_articles=gensim.utils.identity is also OK.

@piskvorky
Copy link
Owner

piskvorky commented Jun 17, 2018

Definitely useful, thanks @mattilyra ! And I like the solution (callable injection): simple and flexible 👍

filter_articles: callable, optional
If set each XML article element will be passed to this callable before being processed. Only articles
where the callable returns an XML element are processed, returning None allows filtering out
some articles based on customised rules.
Copy link
Owner

@piskvorky piskvorky Jun 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any restrictions on the callable? Can it be an anonymous lambda function? (troublesome with parallelized multiprocessing -- lambdas cannot be pickled)

Copy link
Contributor Author

@mattilyra mattilyra Jun 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, for the parallel self.process to work filter_articles needs to be serialisable, what's the practice for checking this in gensim? Try to pickle the callable and raise an informative exception if that fails? Is pickle the standard serialiser in gensim?

On a related "fun" note, see https://stackoverflow.com/questions/50328386/python-typing-pickle-and-serialisation

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking a simple note in the docs. What does the error look like now, if a user passes something not pickleable? If it looks reasonable, I'd just keep that.

@menshikh-iv what is our policy here in general?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well there's no specific catch for this so it'll just be python's PicklingError - that plus a note in the docs I think should be enough, but I don't know how this is handled elsewhere in gensim

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so too. Someone advanced enough to be setting this parameter better be advanced enough to RTFM :)

@piskvorky piskvorky changed the title Wikicorpus custom filters (WIP) [MRG] Wikicorpus custom filters Jun 18, 2018
@piskvorky
Copy link
Owner

piskvorky commented Jun 18, 2018

If the tests pass, LGTM to me! Thanks Matti.

Let's wait for @menshikh-iv 's final verdict & merge.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work @mattilyra, please add a test for this case (check that filter function works as expected), you can use https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2 as wiki dump with this code

from gensim.test.utils import datapath
from gensim.corpora import WikiCorpus

path = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
corpus = WikiCorpus(path)

@@ -544,10 +555,15 @@ def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), diction
Maximal token length.
lower : bool, optional
If True - convert all text to lower case.
filter_articles: callable or None, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

callable, optional

filter_articles: callable or None, optional
If set, each XML article element will be passed to this callable before being processed. Only articles
where the callable returns an XML element are processed, returning None allows filtering out
some articles based on customised rules.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create an example of this function and link it in docstring like

:func:`gensim.corpora.wikicorpus.filter_example`

this can be useful for better understanding, how function works

  • input parameters
  • return values (probably True/False better that XML element/None)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean add an example filter_example to the wikicorpus module?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, only as example (no need to use it by default)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, very busy at work but will try to get it done this week

Copy link
Contributor

@menshikh-iv menshikh-iv Jun 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget to resolve merge conflict too

@mattilyra
Copy link
Contributor Author

@menshikh-iv ERROR: InvocationError for command '/home/travis/build/RaRe-Technologies/gensim/.tox/py36-linux/bin/pytest gensim/test' (exited with code 1) ????

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after the minor code style fix (ident).

@@ -365,6 +426,13 @@ def extract_pages(f, filter_namespaces=False):
if ns not in filter_namespaces:
text = None

if filter_articles is not None:
if not filter_articles(elem, namespace=namespace, title=title,
text=text, page_tag=page_tag,
Copy link
Owner

@piskvorky piskvorky Jul 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use hanging indent in Gensim (not vertical alignment): https://www.python.org/dev/peps/pep-0008/#id3

Why? Makes code easier to maintain (people forget to re-indent after modifying the opening line). Plus some people (me!) also find it more consistent and easier to read :)

@mattilyra
Copy link
Contributor Author

Have to respect properly enforced coding style 😄 👍

is there a gensim code style guide somewhere, I didn't see anything in the contribution guidelines and the linter does not complain about that.

@piskvorky
Copy link
Owner

piskvorky commented Jul 11, 2018

Awesome, thanks.

Both vertical and hanging indents are OK by Python standards (PEP8), so linters wouldn't necessarly complain. Consistent hanging indent is just our project choice.

We keep developer instructions here on Github: https://github.com/RaRe-Technologies/gensim/wiki/Developer-page#code-style

@piskvorky
Copy link
Owner

Awesome job, thanks @mattilyra !

@piskvorky piskvorky merged commit 46ccefb into piskvorky:develop Jul 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants