Parser not correctly splitting sections on certain pages #66

Closed
Rua opened this issue Apr 6, 2014 · 3 comments

Rua commented Apr 6, 2014

To reproduce:

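import mwparserfromhell
import wikipedia  # pywikipedia (Python 2)

# text holds the wikitext of the affected page
text = mwparserfromhell.parse(text)

for langsection in text.get_sections([2]):
    wikipedia.output(unicode(langsection) + "\n@@@@@@@@@@\n")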

I've used pywikipedia to output the text, but anything that prints will work. The row of @ signs marks each point where the parser has split the page into sections.

For some reason, on this particular page, it groups the ==Old Norse==, ==Old Portuguese== and ==Portuguese== sections together as one. Obviously this shouldn't happen: they should be recognised as three separate level 2 sections.
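As an independent check of where the splits ought to fall, a naive line-based scan for level 2 headings can be sketched with the standard library alone. This is not how mwparserfromhell works internally; `split_level2_sections` and the `sample` wikitext are hypothetical names and placeholder data made up for illustration, and the scan ignores templates, comments and nowiki blocks.

```python
import re

# Matches a line that is exactly a level-2 heading: "==Title==".
# The (?!=) and (?<!=) guards reject "===Title===" and deeper levels.
LEVEL2_HEADING = re.compile(r"^==(?!=)\s*(.+?)\s*(?<!=)==\s*$")


def split_level2_sections(wikitext):
    """Return (title, body) pairs, one per level-2 heading.

    Text before the first level-2 heading is dropped.
    """
    sections = []
    title, body = None, []
    for line in wikitext.splitlines():
        match = LEVEL2_HEADING.match(line)
        if match:
            if title is not None:
                sections.append((title, "\n".join(body)))
            title, body = match.group(1), []
        elif title is not None:
            body.append(line)
    if title is not None:
        sections.append((title, "\n".join(body)))
    return sections


# Placeholder wikitext (an assumption, not the real page content).
sample = "\n".join([
    "==Old Norse==",
    "''Þeir eru á hólmi.",  # stray '' with no closing pair
    "==Old Portuguese==",
    "placeholder",
    "==Portuguese==",
    "===Noun===",
    "placeholder",
])

for title, _body in split_level2_sections(sample):
    print(title)
```

Because the scan only looks at heading lines, an unclosed '' inside a section cannot affect the split, so the three language headings come out as three separate sections.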

earwig added this to the version 0.4 milestone Apr 6, 2014
earwig self-assigned this Apr 6, 2014
Owner

earwig commented Apr 6, 2014

Unfortunately, this is a side effect of #40: there's a stray '' on the line Þeir eru á hólmi. under Old Norse. That confuses the parser. The eventual fix is to make it smarter about when MediaWiki implicitly closes unclosed italics tags.

For now, though, you have a solution available, assuming you aren't interested in working with italics or bold on the page. Instead of doing:

text = mwparserfromhell.parse(text)

...you can do:

text = mwparserfromhell.parser.Parser().parse(text, skip_style_tags=True)

This makes the parser treat the '' as plain text, so the rest of the page is parsed more accurately. It's a temporary workaround until the parser handles unclosed italics properly.

I'll close this issue, since it falls under #40.

earwig closed this as completed Apr 6, 2014
Author

Rua commented Apr 6, 2014

Unfortunately, that solution just gives an error:

TypeError: parse() got an unexpected keyword argument 'skip_style_tags'

It looks like the source code in the mwparserfromhell documentation doesn't include that parameter either.
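One way to guard against this TypeError is to check whether the parse callable accepts the keyword before passing it. The sketch below uses only the standard library. `parse_skipping_style_tags`, `old_parse` and `new_parse` are hypothetical names, and the last two are stand-ins for a parser without and with the parameter, not real mwparserfromhell functions.

```python
import inspect


def parse_skipping_style_tags(parse_fn, text):
    """Call parse_fn with skip_style_tags=True only if it accepts that keyword."""
    try:
        params = inspect.signature(parse_fn).parameters
    except (TypeError, ValueError):
        # Signature not introspectable: assume the keyword is unsupported.
        params = {}
    if "skip_style_tags" in params:
        return parse_fn(text, skip_style_tags=True)
    # Fall back to a plain parse; style tags will still be parsed.
    return parse_fn(text)


# Hypothetical stand-ins for a parser without and with the parameter.
def old_parse(text):
    return ("old", text)


def new_parse(text, skip_style_tags=False):
    return ("new", text, skip_style_tags)


print(parse_skipping_style_tags(old_parse, "''stray"))
print(parse_skipping_style_tags(new_parse, "''stray"))
```

With the real library, the callable would be the one quoted above, mwparserfromhell.parser.Parser().parse. Whether the installed copy accepts the keyword depends on the installed version, which is what the signature check probes. On the fallback path the original section-splitting problem remains, since the stray '' is still parsed as an italics tag.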

Owner

earwig commented Apr 6, 2014

earwig removed this from the version 0.4 milestone May 25, 2014
earwig removed their assignment Dec 30, 2016