strip_code() needs special case for [[File:...]] links and interwiki links #136
Comments
The problem here is that we are depending on wiki-specific namespace names, which are very unpredictable, so a way to configure that will be required. |
That's a good point. It seems very awkward to configure this per-wiki, since I assume the "File" namespace will be internationalized. Is there any way that wiki-install-specific params like this can be applied in the parser? |
What's the use case for such context-sensitive code stripping? There are many more similar problems: should category and interlanguage links be stripped too? And I don't even have to start with templates... |
Oh! We're working on extracting prose (sentences and paragraphs) from English Wikipedia and this was one of the issues that came up with using strip_code() to get rid of content that doesn't show up when rendered. We're not too concerned about handling what is rendered via templates. Generally, I agree -- it's a bummer that we use link syntax for images, but, we can't go back and fix that now. :/ |
I did this once to identify what kind of thing a link points to. The file for all Wikipedias is 260k uncompressed. If you only need this for English Wikipedia it's pretty straightforward. You can ask it to give you all its namespaces: However, there's some 200+ Wikipedias in total: The IDs are consistent across languages and most namespaces should exist everywhere. This is OK if you target the major Wikipedias that don't go anywhere soon. However, many of those only have a few pages and are sometimes removed. New ones get added from time to time. I personally wouldn't want to have to support this as it can lead to all sorts of annoyingly complex debugging with different versions floating around. It's a bit like those poor guys handling timezones in pytz. Now, here's the catch: You can also link to a Wikipedia thing in a different language. This includes images: Oh, the joy of working with Wikipedia, where every dirty hack ever imagined is actually the preferred way of doing things ;) Btw, templates can also generate text: |
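As a rough sketch of the "ask it to give you all its namespaces" step: the MediaWiki API's `meta=siteinfo` query can return namespaces and their aliases. The helper names below (`siteinfo_url`, `names_for_namespace`) are made up for illustration, and `SAMPLE` is a hand-written, abridged stand-in for a response in the default JSON format, not real API output.

```python
import json
from urllib.parse import urlencode


def siteinfo_url(host):
    """Build a siteinfo query URL for one wiki (hypothetical helper)."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces|namespacealiases",
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


def names_for_namespace(siteinfo, ns_id):
    """Collect the local name, canonical name and aliases of one namespace ID."""
    query = siteinfo["query"]
    names = set()
    info = query["namespaces"].get(str(ns_id), {})
    for key in ("*", "canonical"):  # "*" holds the localized name
        if info.get(key):
            names.add(info[key])
    for alias in query.get("namespacealiases", []):
        if alias.get("id") == ns_id:
            names.add(alias["*"])
    return names


# Hand-written, abridged sample shaped like a siteinfo response (assumption).
SAMPLE = json.loads("""
{"query": {
  "namespaces": {"6": {"id": 6, "*": "Datei", "canonical": "File"}},
  "namespacealiases": [{"id": 6, "*": "Bild"}, {"id": 6, "*": "Image"}]
}}
""")

print(siteinfo_url("de.wikipedia.org"))
print(sorted(names_for_namespace(SAMPLE, 6)))
```

Namespace 6 is the file namespace; since the IDs are consistent across languages, the same lookup should work per wiki once the real response is fetched.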
+1 for not merging siteinfo data into mwparserfromhell. Maybe it would be more appropriate if I could provide the parser with such assets when I ask it to parse a chunk of wikitext. Though, honestly, I'm starting to believe that this isn't something mwparserfromhell should handle. For my specific use-case, I can walk the syntax tree and check wikilink nodes as I go to make the decisions that work for my application. |
There's probably something you can do involving API functions like the one that expands templates... In EarwigBot's copyvio detector, I had to support a similar use case. It strips out images, references, etc. and leaves alone template parameter values that are longer than a certain length (which we interpret to be article content rather than technical stuff). It's rather hacky overall, but it works in the majority of cases. That code is here: https://github.com/earwig/earwigbot/blob/develop/earwigbot/wiki/copyvios/parsers.py#L140 , maybe it can give you some ideas. I am not sure if this is something we should support in the parser directly, though. |
|
That's sensible. |
IMO I vote for the caller providing a 'namespaces map' which can be used to indicate what namespace names map to which namespace numbers. |
Why namespace numbers? There are no namespace numbers in the language, only in the database, which is not touched at all by mwparserfromhell. Even though the common namespace numbers are constant across all wikis, there is no need to bring numbers to the parser. What would be needed and sufficient is a map of localized and aliased namespace names to the corresponding canonical (English) names. Also attributes like first-letter sensitiveness and other quirks I may be unaware of would have to be described. |
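A minimal sketch of what such a caller-supplied map could look like, with no namespace numbers involved. Everything here is hypothetical (`ALIAS_MAP`, `canonical_namespace`); the entries are only illustrative, and the sketch assumes namespace prefixes compare case-insensitively with underscores equal to spaces.

```python
# Illustrative map: localized or aliased namespace name -> canonical English name.
# Keys are stored case-folded; a real map would come from the target wiki.
ALIAS_MAP = {
    "file": "File",
    "image": "File",
    "datei": "File",
    "bild": "File",
    "画像": "File",
    "category": "Category",
    "kategorie": "Category",
}


def canonical_namespace(title, alias_map):
    """Split a link title into (canonical namespace or None, remainder).

    A leading colon, as in [[:File:X.png]], yields an empty prefix and is
    therefore reported as "no namespace", i.e. treated like a plain link.
    """
    prefix, sep, rest = title.partition(":")
    if not sep:
        return None, title
    key = prefix.strip().replace("_", " ").casefold()
    canonical = alias_map.get(key)
    if canonical is None:
        return None, title  # unknown prefix: a colon inside a normal title
    return canonical, rest.strip()


print(canonical_namespace("Datei:Beispiel.jpg", ALIAS_MAP))
print(canonical_namespace("Paris", ALIAS_MAP))
```

Other quirks mentioned above, such as first-letter case sensitivity of page titles, are not modelled here.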
mwparserfromhell should IMHO be kept minimal. Anyone doing serious wikitext expansion should use custom routines, perhaps with some knowledge of the wiki. |
I agree with what @ricordisamoa just said. |
Somehow the names of |
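I just had a similar case where I wanted to use strip_code() and remove all the images, so this is my solution:
import re
import mwparserfromhell
# Match titles in the file namespace; replace ALTERNATIVE_ALIASES with the wiki's own aliases.
img = re.compile('(File|Image|ALTERNATIVE_ALIASES):', re.I)
prs = mwparserfromhell.parser.Parser().parse(wikicode)
# Collect the matches first so the tree is not modified while it is being iterated.
remove_img = [f for f in prs.ifilter_wikilinks() if img.match(str(f.title))]
for f in remove_img:
    prs.remove(f)
And to get ALTERNATIVE_ALIASES you need to: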
Anyway, maybe parser should provide some utility function of "remove links which have target that match x" (where x is re/function) - this seems to be useful and general enough. |
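The suggested "remove links whose target matches x" utility might look roughly like this. It is not part of mwparserfromhell; `as_predicate` and `remove_links_matching` are hypothetical names, and the function only assumes an object with `ifilter_wikilinks()` and `remove()`, as used in the snippet above.

```python
import re


def as_predicate(x):
    """Turn a regex (string or compiled) or a callable into a title predicate."""
    if callable(x):
        return x
    pattern = re.compile(x) if isinstance(x, str) else x
    return lambda title: pattern.match(title) is not None


def remove_links_matching(wikicode, x):
    """Remove every wikilink whose title matches x; return how many were removed."""
    matches = as_predicate(x)
    # Build the list first so the tree is not changed during iteration.
    doomed = [
        link
        for link in wikicode.ifilter_wikilinks()
        if matches(str(link.title).strip())
    ]
    removed = 0
    for link in doomed:
        try:
            wikicode.remove(link)
            removed += 1
        except ValueError:
            # Assumed to mean the link already went away with an enclosing link.
            pass
    return removed
```

With mwparserfromhell installed, one would presumably call `remove_links_matching(mwparserfromhell.parse(text), re.compile(r"(File|Image):", re.I))` and then `strip_code()` on the result.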
Just use |
I've noticed problems with strip_code() and filtering HTML tags as well. Based on https://de.wikipedia.org/wiki/User:FNBot?action=raw:
I do:
As far as I know, that HTML spam should not be in there. Is that problem related to the problem documented in this Issue? Is this a different problem? Is this the intended behavior? |
What you see is the intended behaviour. The HTML tags are imbalanced in your code sample, so mwparserfromhell treats the offending segment as just text. A real library like lxml would simply fix the imbalanced tags for you, for comparison. But mwparser doesn't because we require parsing not to transform the text, i.e. you can take the output of |
I've implemented this. See #301. |
My solution leads to the following on the current wikitext for
$ python3 parse.py 西村博之.mediawiki|grep Wikilink|grep 画像:|column -t -s $'\t' -T4
$ python3 parse.py 西村博之.mediawiki|grep Wikilink|head -n 20
import mwparserfromhell
from mwparserfromhell.nodes import Wikilink
import argparse
import csv, sys
import json
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='parse wikitext for a Wikipedia page.')
    parser.add_argument('input', help='Input file')
    args = parser.parse_args()
    csvw = csv.writer(sys.stdout, delimiter="\t")
    with open(args.input) as f:
        inp = f.read()
    for tag in mwparserfromhell.parse(inp).nodes:
        csvw.writerow([type(tag).__name__] + ([tag.title, tag.args, tag.text] if isinstance(tag, Wikilink) else [str(tag)]))
|
It looks like strip_code() interprets File-links like they are regular wikilinks. Instead, it should remove everything except the caption/alt-text. In the example below, the size param is dumped into the text.