-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Wikicorpus custom filters #2089
[MRG] Wikicorpus custom filters #2089
Conversation
gensim/corpora/wikicorpus.py
Outdated
@@ -380,6 +380,13 @@ def extract_pages(f, filter_namespaces=False): | |||
if ns not in filter_namespaces: | |||
text = None | |||
|
|||
if callable(filter_articles) and text: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if filter_articles
isn't a callable? If I understand correctly, it should be a callable always, no need to check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's just catching the default case where filter_articles=None
- I didn't see the point of introducing a dummy callable that always returns true. I guess you could make the default filter_articles=False
but that confuses the type semantics of filter_articles
is it bool
or a callable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is not None
preferred (since that's what we really want to test). But a dummy filter_articles=gensim.utils.identity
is also OK.
Definitely useful, thanks @mattilyra ! And I like the solution (callable injection): simple and flexible 👍 |
gensim/corpora/wikicorpus.py
Outdated
filter_articles: callable, optional | ||
If set each XML article element will be passed to this callable before being processed. Only articles | ||
where the callable returns an XML element are processed, returning None allows filtering out | ||
some articles based on customised rules. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any restrictions on the callable? Can it be an anonymous lambda function? (troublesome with parallelized multiprocessing -- lambdas cannot be pickled)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, for the parallel self.process
to work filter_articles
needs to be serialisable, what's the practice for checking this in gensim
? Try to pickle
the callable and raise an informative exception if that fails? Is pickle
the standard serialiser in gensim
?
On a related "fun" note, see https://stackoverflow.com/questions/50328386/python-typing-pickle-and-serialisation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking a simple note in the docs. What does the error look like now, if a user passes something not pickleable? If it looks reasonable, I'd just keep that.
@menshikh-iv what is our policy here in general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well there's no specific catch for this so it'll just be python's PicklingError
- that plus a note in the docs I think should be enough, but I don't know how this is handled elsewhere in gensim
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so too. Someone advanced enough to be setting this parameter better be advanced enough to RTFM :)
…ed default value check to check against None not against callable().
If the tests pass, LGTM to me! Thanks Matti. Let's wait for @menshikh-iv 's final verdict & merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work @mattilyra, please add a test for this case (check that filter function works as expected), you can use https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2 as wiki dump with this code
from gensim.test.utils import datapath
from gensim.corpora import WikiCorpus
path = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
corpus = WikiCorpus(path)
gensim/corpora/wikicorpus.py
Outdated
@@ -544,10 +555,15 @@ def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), diction | |||
Maximal token length. | |||
lower : bool, optional | |||
If True - convert all text to lower case. | |||
filter_articles: callable or None, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
callable, optional
gensim/corpora/wikicorpus.py
Outdated
filter_articles: callable or None, optional | ||
If set, each XML article element will be passed to this callable before being processed. Only articles | ||
where the callable returns an XML element are processed, returning None allows filtering out | ||
some articles based on customised rules. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you create an example of this function and link it in docstring like
:func:`gensim.corpora.wikicorpus.filter_example`
this can be useful for better understanding, how function works
- input parameters
- return values (probably True/False better that XML element/None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean add an example filter_example
to the wikicorpus
module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, only as example (no need to use it by default)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, very busy at work but will try to get it done this week
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget to resolve merge conflict too
@menshikh-iv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after the minor code style fix (ident).
gensim/corpora/wikicorpus.py
Outdated
@@ -365,6 +426,13 @@ def extract_pages(f, filter_namespaces=False): | |||
if ns not in filter_namespaces: | |||
text = None | |||
|
|||
if filter_articles is not None: | |||
if not filter_articles(elem, namespace=namespace, title=title, | |||
text=text, page_tag=page_tag, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use hanging indent in Gensim (not vertical alignment): https://www.python.org/dev/peps/pep-0008/#id3
Why? Makes code easier to maintain (people forget to re-indent after modifying the opening line). Plus some people (me!) also find it more consistent and easier to read :)
Have to respect properly enforced coding style 😄 👍 is there a |
Awesome, thanks. Both vertical and hanging indents are OK by Python standards (PEP8), so linters wouldn't necessarly complain. Consistent hanging indent is just our project choice. We keep developer instructions here on Github: https://github.com/RaRe-Technologies/gensim/wiki/Developer-page#code-style |
Awesome job, thanks @mattilyra ! |
I wanted to add a simple method for filtering out wiki articles based on arbitrary rules. The idea is to be able to easily produce datasets like the WikiText-103.
I am currently working on making a German equivalent of that corpus (call it
DeWiki103
for now), but that requires filtering out articles based on regular expressions applied to the text of each XML element (short of parsing the Wikimedia format).This PR adds a simple
filter_articles
method toWikiCorpus
that gets called inextract_pages
. ReturningNone
will filter out that specific article.Let me know if you think this might be a useful addition.