Parser not correctly splitting sections on certain pages #66

Rua · 2014-04-06T00:29:01Z

To reproduce:

Retrieve the wikitext from https://en.wiktionary.org/w/index.php?title=á&oldid=25803937
Run the following code on it, where "text" is the retrieved text:

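import mwparserfromhell
import wikipedia  # pywikipedia, used here only for output

text = mwparserfromhell.parse(text)

# Print each level 2 section followed by a visible separator
for langsection in text.get_sections([2]):
    wikipedia.output(unicode(langsection) + "\n@@@@@@@@@@\n")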

I've used pywikipedia to output the text, but any output method works. The @ signs show where the parser has split the page into sections.

For some reason, on this particular page, it groups the ==Old Norse==, ==Old Portuguese== and ==Portuguese== sections together as one. This obviously shouldn't happen: they should be recognised as three separate level 2 sections.

earwig · 2014-04-06T09:22:38Z

Unfortunately, this is a side effect of #40: there's a stray '' on the line "Þeir eru á hólmi." under Old Norse, and it confuses the parser. This will be avoided in the future by making the parser smarter about when MediaWiki implicitly closes unclosed italics tags.
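To locate a stray '' like the one described above before parsing, a rough per-line check can help. The following is a minimal sketch using only the standard library. The function name `find_unbalanced_style_lines` and the sample text are hypothetical, and the counting rule is a simplified heuristic, not MediaWiki's actual apostrophe-resolution algorithm.

```python
import re

# Runs of two or more apostrophes are candidate italic/bold markers.
APOSTROPHE_RUN = re.compile(r"'{2,}")


def find_unbalanced_style_lines(wikitext):
    """Return (line_number, line) pairs whose '' or ''' markers look unpaired.

    Heuristic only: a run of 2 counts as italic, 3 or 4 as bold,
    and 5 or more as both.
    """
    suspects = []
    for number, line in enumerate(wikitext.splitlines(), start=1):
        italic = bold = 0
        for run in APOSTROPHE_RUN.findall(line):
            length = len(run)
            if length == 2 or length >= 5:
                italic += 1
            if length >= 3:
                bold += 1
        # An odd count means at least one marker has no partner on this line.
        if italic % 2 or bold % 2:
            suspects.append((number, line))
    return suspects


# Hypothetical sample, loosely modelled on the page in this report.
sample = (
    "==Old Norse==\n"
    "# ''Þeir eru á hólmi.\n"
    "==Old Portuguese==\n"
    "# ''a'' '''b'''\n"
)
suspects = find_unbalanced_style_lines(sample)
for number, line in suspects:
    print(number, line)
```

Lines flagged this way are only candidates for manual review, since markup that legitimately spans templates or links may also trip the check.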

For now, though, there is a workaround, assuming you aren't interested in working with italics or bold on the page. Instead of doing:

text = mwparserfromhell.parse(text)

...you can do:

text = mwparserfromhell.parser.Parser().parse(text, skip_style_tags=True)

This will cause the parser to treat the '' as plain text, and the rest of the page will be parsed more accurately. This is a temporary fix until the parser is more accurate.

I'll close this issue, since it falls under #40.

Rua · 2014-04-06T12:42:24Z

Unfortunately, that solution just gives an error:

TypeError: parse() got an unexpected keyword argument 'skip_style_tags'

It looks like the source code in the mwparserfromhell documentation doesn't include that parameter either.

earwig · 2014-04-06T17:10:22Z

Well, you're probably not using the develop branch then:

https://github.com/earwig/mwparserfromhell/blob/develop/mwparserfromhell/parser/__init__.py#L56

earwig added this to the version 0.4 milestone Apr 6, 2014

earwig self-assigned this Apr 6, 2014

earwig added the aspect: parser label Apr 6, 2014

earwig closed this as completed Apr 6, 2014

earwig added result: duplicate and removed priority: low labels May 25, 2014

earwig removed this from the version 0.4 milestone May 25, 2014

earwig removed their assignment Dec 30, 2016
