I am looking for a simple way to extract the first paragraph of the first section from Wikipedia pages. I tried to get the first section and process the text/link nodes of that section, but it doesn't seem to work reliably. For example (using the parse method from the readme):
p = parse('Arthur Jensen')
for n in p.get_sections()[0].filter_text()[:6]:
    print(n)
prints
about
the Danish actor
Arthur Jensen (actor)
the New Zealand musician and composer
Arthur Owen Jensen
{{Infobox scientist
|name=Arthur Jensen
|birth_name = Arthur Robert Jensen
|image = Arthur Jensen Vanderbilt 2002.jpg
|image_size = 200px
|caption = Arthur Jensen, 2002 at
The infobox template also doesn't seem to be returned by filter_templates. print([t.name for t in p.filter_templates()]) prints:
['about', 'Birth date', 'Death date and age', 'cite journal', 'cite web ', 'Webarchive', 'Says who', 'cite book ', 'cite book', 'cite book ', 'cite web', 'cite book ', 'cite journal ', 'cite book', 'cite book', 'quote', 'Cite journal', 'quote', 'cite web ', 'quote', 'cite journal ', 'cite web ', 'cite book ', 'cite book ', 'cite book ', 'cite web ', 'cite book ', 'cite journal', 'Cite news', 'quote', 'cite journal', 'cite journal', 'cite journal', 'cite web ', 'citation needed', 'webarchive ', 'quote', 'quote', 'quote', 'quote', 'cite book ', 'See also', 'Cite journal', 'cite journal ', 'cite journal ', 'cite journal ', 'Cite book', 'cite journal ', 'cite journal ', 'cite journal ', 'cite web ', 'Reflist', 'ISBN', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'cite journal ', 'Google Scholar id', 'Authority control', 'DEFAULTSORT:Jensen, Arthur']
The template is missing in this case because there's a syntax inconsistency with a bold tag (see #40). You can work around this with parse(text, skip_style_tags=True).
To solve your original question, you can try something like this:
>>> code = parse(text, skip_style_tags=True)
>>> print(code.strip_code().splitlines()[0])
'''Arthur Robert Jensen''' (August 24, 1923 – October 22, 2012) was an American psychologist and author. He was a professor of educational psychology at the University of California, Berkeley. Jensen was known for his work in psychometrics and differential psychology, the study of how and why individuals differ behaviorally from one another.
That would give the first paragraph as a string, with formatting removed. (If you want it as pure text, without the style tags either, you can reparse the text with skip_style_tags=False and call strip_code again...)
If you want the actual nodes from the first paragraph, you could combine get_sections()[0] with a second step to remove any templates before the first non-whitespace text node.