The tokenizer incorrectly handles some difficult tag-related markup #40

Open
earwig opened this issue Aug 19, 2013 · 14 comments

Comments

@earwig
Owner

earwig commented Aug 19, 2013

  1. Bold and italics that cross contexts are handled incorrectly, because the tree structure does not support overlapping nodes (for example, ''foo'''bar''baz''', or ''foo{{bar|baz''}}). Fixing this will probably be very difficult.
  2. Open tags that do not have a close tag before the parser reaches EOF are ignored, whereas some of them should be parsed (like bold and italics) and have some kind of "hidden close" flag set.
  3. MediaWiki counts the occurrences of ; in the block before any text and uses this as the maximum number of parsable :s after. The current implementation only allows one : regardless of how many ;s there are.
  4. MediaWiki prevents some tags from crossing certain contexts (italics and bold can't cross headings, for example) but this implementation has no such restriction.
  5. The parser only recognizes a space as the separator character between the URL and its link title in [ ] tags, but MediaWiki also accepts some other syntax (e.g. [http://example.com/''Example''] is valid).

1, 4, and 5 are high priority, whereas 2 is mid and 3 is low.
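Point 3 can be sketched in isolation. The helper below is a hypothetical illustration (`split_definition_line` is not part of mwparserfromhell or MediaWiki). It assumes that the "block before any text" is the leading run of `*`, `#`, `:` and `;` characters, and that each parsable `:` simply splits the remaining text once.

```python
from __future__ import annotations


def split_definition_line(line: str) -> tuple[str, list[str]]:
    """Split a list line into a term and its definitions.

    Hypothetical helper: the number of ';' characters in the leading
    marker block caps how many ':' separators are honoured.
    """
    # Length of the leading marker block (assumed to be any of * # : ;).
    prefix_len = len(line) - len(line.lstrip("*#:;"))
    prefix, rest = line[:prefix_len], line[prefix_len:]

    max_colons = prefix.count(";")
    if max_colons == 0:
        # No ';' in the prefix, so no ':' is treated as a separator.
        return rest.strip(), []

    # str.split with maxsplit leaves any further ':' inside the last part.
    parts = rest.split(":", max_colons)
    return parts[0].strip(), [part.strip() for part in parts[1:]]


if __name__ == "__main__":
    print(split_definition_line("; term : definition"))
    print(split_definition_line(";; a : b : c : d"))
```

With two `;` markers the third `:` stays inside the last definition, which is the behaviour the one-`:` limit described in point 3 cannot reproduce.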

@ghost ghost assigned earwig Aug 19, 2013
@earwig
Owner Author

earwig commented Aug 21, 2013

Regarding (1), a line from MediaWiki's source:

            # ''Something [http://www.cool.com cool''] -->
            # <i>Something</i><a href="http://www.cool.com"..><i>cool></i></a>

@ghost

ghost commented Oct 27, 2013

Also, this.

== Something ==
'' Hello, world!

== Something else ==
Lorem ipsum dolor sit amet.''

@earwig
Owner Author

earwig commented Oct 28, 2013

So it seems italics/bold can't cross links but can cross templates. I need to figure out exactly which nodes are restrictive.

@earwig
Owner Author

earwig commented Oct 28, 2013

1946cf6

@Prillan

Prillan commented Apr 18, 2014

Hi! There seems to be a case you've missed.

Bold (and italics I guess) are implicitly closed when wikitable cells end. E.g. http://wiki.teamliquid.net/starcraft2/index.php?title=2014_WCS_Season_1_Europe/Premier&oldid=687367

{| class="wikitable"
|width=190px bgcolor="{{RaceColor|p}}" align="center" | '''{{p}} Protoss ''(13)''
|width=190px bgcolor="{{RaceColor|t}}" align="center" | '''{{t}} Terran ''(8)''
|width=190px bgcolor="{{RaceColor|z}}" align="center" | '''{{z}} Zerg ''(11)''

gives

<table class="wikitable">
<tr>
<td width="190px" bgcolor="#B8F2B8" align="center"> <b><a href="/starcraft2/File:Picon_small.png" class="image" title="Protoss"><img alt="Protoss" src="/starcraft/images2/a/ab/Picon_small.png" width="17" height="15" /></a> Protoss <i>(13)</i></b>
</td>
<td width="190px" bgcolor="#B8B8F2" align="center"> <b><a href="/starcraft2/File:Ticon_small.png" class="image" title="Terran"><img alt="Terran" src="/starcraft/images2/9/9d/Ticon_small.png" width="17" height="15" /></a> Terran <i>(8)</i></b>
</td>
<td width="190px" bgcolor="#F2B8B8" align="center"> <b><a href="/starcraft2/File:Zicon_small.png" class="image" title="Zerg"><img alt="Zerg" src="/starcraft/images2/c/c9/Zicon_small.png" width="17" height="15" /></a> Zerg <i>(11)</i></b>
</td>

@earwig
Owner Author

earwig commented Apr 18, 2014

Hmm... yeah, that's tough because the parser doesn't understand tables yet. I'll need to add that before this is fixable.

@danvk

danvk commented Sep 7, 2014

Pulling in a workaround from #80: @earwig suggested passing skip_style_tags=True to mwparserfromhell.parse to work around @Prillan's issue. This worked perfectly.

To get this feature, I had to track the development version on GitHub rather than the released version on PyPI. Here's the line from my requirements.txt:

-e git+https://github.com/earwig/mwparserfromhell.git#egg=mwparserfromhell
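A minimal sketch of that workaround, assuming an installed mwparserfromhell that accepts the `skip_style_tags` keyword mentioned above. The helper name `templates_ignoring_style` and the sample cell text are illustrative only.

```python
from __future__ import annotations


def templates_ignoring_style(text: str) -> list[str]:
    """Return the templates found in `text`, leaving ''/''' as plain text."""
    # Imported lazily so this sketch loads even without the third-party package.
    import mwparserfromhell

    # skip_style_tags=True tells the parser not to turn ''/''' into tag nodes.
    code = mwparserfromhell.parse(text, skip_style_tags=True)
    return [str(template) for template in code.filter_templates()]


if __name__ == "__main__":
    # Cell content modelled on the table example earlier in the thread:
    # the bold markup is opened but never closed.
    print(templates_ignoring_style("'''{{p}} Protoss ''(13)''"))
```

The trade-off is that bold and italic markup is no longer available as nodes in the parsed tree, so this only helps when the caller cares about templates, links and other non-style nodes.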

@earwig
Owner Author

earwig commented May 23, 2015

Most of this is going to require an overhaul of how parsing is done (I finally have an idea how I'm going to do it, but it'll be a lot of work)... so pushing this back as the main task for v1.0.

@lahwaacz
Contributor

Consider this wikitext:

''foo
bar''

MediaWiki 1.26 parses this as

<i>foo</i>
bar

which suggests that style markup cannot span multiple lines. mwparserfromhell instead does this the hard (old?) way, as a single italic tag spanning both lines:

\n
<
      i
>
      foo\nbar
</
      i
>
\n
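The line-bounded behaviour can be sketched as a toy renderer. This is not MediaWiki's algorithm and not mwparserfromhell's: `render_italics_per_line` is a hypothetical helper that handles only `''` (not `'''`), closes any italics still open at the end of a line, and drops empty `<i></i>` pairs, the last step being an assumption made to match the output quoted above.

```python
def render_italics_per_line(text: str) -> str:
    """Render '' as <i>...</i>, never letting italics cross a newline."""
    rendered_lines = []
    for line in text.split("\n"):
        pieces = []
        italics_open = False
        # Every '' seen on this line toggles italics on or off.
        for index, part in enumerate(line.split("''")):
            if index > 0:
                pieces.append("</i>" if italics_open else "<i>")
                italics_open = not italics_open
            pieces.append(part)
        if italics_open:
            # Implicit close at end of line.
            pieces.append("</i>")
        # Assumption: empty italic elements are removed from the output.
        rendered_lines.append("".join(pieces).replace("<i></i>", ""))
    return "\n".join(rendered_lines)


if __name__ == "__main__":
    print(render_italics_per_line("''foo\nbar''"))
```

Treating each line as its own scope also keeps italics from crossing headings, since a heading always sits on its own line.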

@earwig
Owner Author

earwig commented Feb 23, 2016

Oh joy.

@mhsmith

mhsmith commented Apr 8, 2016

almond.txt

The attached file is a reduced version of https://en.wikipedia.org/w/index.php?title=Almond&oldid=706024513. I'd like to reduce it more, but any structural change anywhere in the text makes the problem disappear, so I don't know if this is actually an instance of this bug.

The initial table is parsed correctly, subject to point 2 above, i.e. the unclosed <small> and <center> tags are returned as plain text. But everything after the table is returned as plain text too, with the exception of headings and lists. For example:

===
       Almond flour and skins
===
\n[[Almond flour]] is often used as a [[gluten-free]] alternative to wheat flour

Replicating the initial line, like this:

{|
|-
| Production<small>(million tonnes)
|-
| Production<small>(million tonnes)
|-
| {{flag|USA}} || style="text-align:center;"|<center> 1.8
|-

Results in the rest of the table not being parsed either:

<
      table
>
      <
            tr
      >
            <
                  td
            >
                   Production<small>(million tonnes)\n
            </
                  td
            >
      </
            tr
      >
      |-\n| Production<small>(million tonnes)\n|-\n| {{flag|USA}} || style="text-align:center;"|<center> 1.8\n|-\n| {{flag|Australia}} || style="text-align:center;"|<center> 0.16\n|-\n| {{flag|Spain}} || style="text-align:center;" |<center> 0.15\n|-\n| {{flag|Morocco}} || style="text-align:center;"|<center> 0.1\n|-\n| {{flag|Iran}} || style="text-align:center;"|<center> 0.09\n|-\n!'''World''' !! style="text-align:center;"|<center> '''2.92'''\n
</
      table
>

@mhsmith

mhsmith commented Apr 9, 2016

Here's a really weird example from https://fr.wikipedia.org/w/index.php?title=Opposition_p%C3%A9rih%C3%A9lique&oldid=112493222 :

[[Image:Opposition périhélique.PNG|thumb|250px|Schéma présentant les oppositions périhélique et aphélique de la {{quoi|[[Terre]] et de [[Mars (planète)|Mars]]]]
On dit que deux corps célestes sont en '''opposition périhélique''' lorsque tous deux sont simultanément au [[périhélie]] de leur orbite en alignement parfait avec le [[Soleil]]. Il en résulte que la distance entre ces deux corps célestes est alors minimale.}}

With the template interrupted by the end of the image context, MediaWiki appears to actually invoke the template twice in order to achieve the author's (presumed) intention.

@vladiscripts

vladiscripts commented Apr 18, 2016

Replying to #148:
Perhaps... AWB marks many of the pages with this issue as "have unclosed tags", but not all of them. For example, it reports no tag errors in https://ru.wikipedia.org/w/index.php?title=%D0%9B%D0%B8%D0%BC%D0%BE%D0%BD&oldid=76351442, so that page counts as error-free too.

The tables sit in one section of the page, but the parser doesn't see the templates in the other sections. Could the parser recognize a "== ==" heading as a secondary end-of-table marker?
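The heading idea can be tried as a pre-processing step without touching the parser. The sketch below is a hypothetical heuristic (`close_tables_at_headings` and `is_heading` are made-up names). It assumes tables open with a line starting `{|`, close with a line starting `|}`, and that only level 2 and deeper headings count. Whether MediaWiki itself treats headings this way is not established in this thread.

```python
def is_heading(stripped: str) -> bool:
    """Rough test for a '== Title ==' style heading (level 2 or deeper)."""
    return (
        len(stripped) > 4
        and stripped.startswith("==")
        and stripped.endswith("==")
    )


def close_tables_at_headings(text: str) -> str:
    """Insert '|}' before any heading that appears inside an open table."""
    output = []
    depth = 0  # number of tables currently open
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.startswith("{|"):
            depth += 1
        elif stripped.startswith("|}"):
            depth = max(0, depth - 1)
        elif depth and is_heading(stripped):
            # Treat the heading as a secondary end-of-table marker.
            output.extend(["|}"] * depth)
            depth = 0
        output.append(line)
    return "\n".join(output)


if __name__ == "__main__":
    print(close_tables_at_headings("{|\n| cell\n== Next ==\n{{tpl}}"))
```

Because this rewrites the wikitext, the result no longer round-trips to the original page, so it suits read-only extraction better than editing bots.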

@bfontaine

bfontaine commented Jan 12, 2021

Other weird ones with malformed italics in templates:

mwparserfromhell.parse("{{foo|''bar}} {{foo|bar''}}").filter_templates()
# => ["{{foo|''bar}}", "{{foo|bar''}}"]

mwparserfromhell.parse("{{foo|''bar}} ''...'' {{foo|bar''}}").filter_templates()
# => ["{{foo|bar''}}"]

mwparserfromhell.parse("{{foo|''bar}} ''").filter_templates()
# => []

mwparserfromhell.parse("{{foo|''bar}} ''bar''").filter_templates()
# => []

mwparserfromhell.parse("{{foo|''bar}}").filter_templates()
# => ["{{foo|''bar}}"]

8 participants
@danvk @mhsmith @earwig @lahwaacz @bfontaine @Prillan @vladiscripts and others