Infobox not included in template #217

Open
kireet opened this issue Apr 10, 2019 · 1 comment


kireet commented Apr 10, 2019

I am looking for a simple way to extract the first paragraph of the first section from Wikipedia pages. I tried getting the first section and processing its text/link nodes, but that doesn't seem to work reliably. E.g. (using the parse method from the readme):

p = parse('Arthur Jensen')
for n in p.get_sections()[0].filter_text()[:6]:
    print(n)

prints

about
the Danish actor
Arthur Jensen (actor)
the New Zealand musician and composer
Arthur Owen Jensen
{{Infobox scientist
|name=Arthur Jensen
   |birth_name              = Arthur Robert Jensen
   |image             = Arthur Jensen Vanderbilt 2002.jpg
   |image_size        = 200px
   |caption           = Arthur Jensen, 2002 at 

The infobox template also doesn't seem to be returned by filter_templates.
print([t.name for t in p.filter_templates()]) prints:

['about', 'Birth date', 'Death date and age', 'cite journal', 'cite web ', 'Webarchive', 'Says who', 'cite book ', 'cite book', 'cite book ', 'cite web', 'cite book ', 'cite journal ', 'cite book', 'cite book', 'quote', 'Cite journal', 'quote', 'cite web ', 'quote', 'cite journal ', 'cite web ', 'cite book ', 'cite book ', 'cite book ', 'cite web ', 'cite book ', 'cite journal', 'Cite news', 'quote', 'cite journal', 'cite journal', 'cite journal', 'cite web ', 'citation needed', 'webarchive ', 'quote', 'quote', 'quote', 'quote', 'cite book ', 'See also', 'Cite journal', 'cite journal ', 'cite journal ', 'cite journal ', 'Cite book', 'cite journal ', 'cite journal ', 'cite journal ', 'cite web ', 'Reflist', 'ISBN', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'Google Scholar id', 'Authority control', 'DEFAULTSORT:Jensen, Arthur']
Owner

earwig commented Jun 30, 2019

Sorry this took a while to get a response.

The template is missing in this case because there's a syntax inconsistency with a bold tag (see #40). You can work around this with parse(text, skip_style_tags=True).

To solve your original question, you can try something like this:

>>> code = parse(text, skip_style_tags=True)
>>> print(code.strip_code().splitlines()[0])
'''Arthur Robert Jensen''' (August 24, 1923 – October 22, 2012) was an American psychologist and author. He was a professor of educational psychology at the University of California, Berkeley.  Jensen was known for his work in psychometrics and differential psychology, the study of how and why individuals differ behaviorally from one another.

That gives the first paragraph as a string, with templates and links stripped. Because style tags were skipped, the bold markup (''') is still in the output; if you want pure text without it, you can reparse that string with the default skip_style_tags=False and call strip_code again.

If you want the actual nodes from the first paragraph, you could combine get_sections()[0] with a second step to remove any templates before the first non-whitespace text node.
