
Different Behavior for Parsing Two Similar Wikipedia Infoboxes #249

Open
john-cherre opened this issue Jul 28, 2020 · 1 comment

@john-cherre

I have two infoboxes that look nearly identical in structure to me, but mwparserfromhell parses them differently. For the first one I get what I expect: the entire infobox is captured as a single template object, followed by the templates nested inside it. For the second one the outer infobox is never returned, and only parts of it are extracted as separate templates. This is confusing given how similar the two infoboxes look, and I expected the entire infobox to be extracted in the second case as well.

This is the code I'm using:

import mwparserfromhell
mwparserfromhell.parse(text.strip().lower()).filter_templates()

Text 1 Input:

txt1 = """{{Infobox building
| name = 666 Fifth Avenue
| former_names = Tishman Building
| status = Complete
| image = 666 Fifth Avenue by David Shankbone.jpg
| image_size = 300px
| caption = 
| location = 666 Fifth Avenue<br>[[Manhattan]], [[New York (state)|New York]] 10103
| coordinates = {{coord|40.760163|-73.976204|format=dms}}
| start_date = 
| completion_date = 1957
| architect = [[Carson & Lundin]]
| owner = [[Brookfield Properties]]
| cost = $40 million
| floor_area = {{convert|1,463,892|sqft|m2|abbr=on}}
| top_floor = 
| floor_count = 41
| references = 
| map_type = 
| building_type = Office
| antenna_spire = 
| roof = {{convert|483|ft|m|abbr=on}}
| elevator_count = 24 (20 passenger, 4 freight)
| structural_engineer = 
| main_contractor = 
| opening = November 25, 1957
| developer = Tishman Realty and Construction
| management = 
}}"""

Text 1 Output:

['{{infobox building\n| name = 666 fifth avenue\n| former_names = tishman building\n| status = complete\n| image = 666 fifth avenue by david shankbone.jpg\n| image_size = 300px\n| caption = \n| location = 666 fifth avenue<br>[[manhattan]], [[new york (state)|new york]] 10103\n| coordinates = {{coord|40.760163|-73.976204|format=dms}}\n| start_date = \n| completion_date = 1957\n| architect = [[carson & lundin]]\n| owner = [[brookfield properties]]\n| cost = $40 million\n| floor_area = {{convert|1,463,892|sqft|m2|abbr=on}}\n| top_floor = \n| floor_count = 41\n| references = \n| map_type = \n| building_type = office\n| antenna_spire = \n| roof = {{convert|483|ft|m|abbr=on}}\n| elevator_count = 24 (20 passenger, 4 freight)\n| structural_engineer = \n| main_contractor = \n| opening = november 25, 1957\n| developer = tishman realty and construction\n| management = \n}}',
 '{{coord|40.760163|-73.976204|format=dms}}',
 '{{convert|1,463,892|sqft|m2|abbr=on}}',
 '{{convert|483|ft|m|abbr=on}}']

Text 2 Input:

txt2 = """{{Infobox building
| name = Central Park Tower
| alternate_names = Nordstrom Tower
| image = Central Park Tower April 2020.jpg
| caption = Central Park Tower on April 25, 2020
| location = 225 [[57th Street (Manhattan)|West 57th Street]]<br/>[[Manhattan]], [[New York City]], [[New York (state)|New York]], [[United States|U.S.]]
| coordinates = {{coord|40.7663|-73.9810|type:landmark_globe:earth_region:US-NY|display=inline,title}}
| status = Topped Out
| start_date = 2014
| est_completion = 2020<ref name=curbed>{{cite news |author=Amy Plitt |url=https://ny.curbed.com/2017/6/1/15714666/central-park-tower-offering-plan-approval-sales-launch |title=Central Park Tower is now one step closer to launching sales |date=June 1, 2017 |access-date=August 30, 2017 |work=Curbed}}</ref>
| building_type = [[Residential]], [[retail]]
| architectural_style = [[Modern architecture|Modern]]
| architectural = {{cvt|1550|ft|0}}
| floor_count = 131<ref>{{cite web |url=https://www.architecturaldigest.com/story/new-york-city-central-park-tower-worlds-tallest-residential-building </ref><ref>{{cite web |url=https://archpaper.com/2019/09/central-park-tower-tops-out/</ref> (98 habitable floors)<ref name="auto">{{Cite web |url=http://www.skyscrapercenter.com/building/central-park-tower/14269 |title=Central Park Tower - The Skyscraper Center |website=www.skyscrapercenter.com |access-date=October 10, 2018}}</ref>
| elevator_count = 11
| cost = $3 billion<ref name="Tase">{{cite news|url=https://commercialobserver.com/2019/04/all-in-good-tase-the-crisis-for-the-american-cohort-in-tel-aviv-is-essentially-over/|title=All in Good TASE: The Crisis for the American Cohort in Tel Aviv Is Essentially Over|date=April 4, 2019|work=Commercial Observer|last=Gourarie|first=Chava}}</ref>
| floor_area = {{convert|1,285,308|sqft|m2}}<ref name="auto" />
| architect = [[Adrian Smith + Gordon Gill Architecture]]
| structural_engineer = [[WSP Global]]
| main_contractor = [[Lendlease]]
| developer = [[Extell Development Company]]
}}"""

Text 2 Output:

['{{coord|40.7663|-73.9810|type:landmark_globe:earth_region:us-ny|display=inline,title}}',
 '{{cite news |author=amy plitt |url=https://ny.curbed.com/2017/6/1/15714666/central-park-tower-offering-plan-approval-sales-launch |title=central park tower is now one step closer to launching sales |date=june 1, 2017 |access-date=august 30, 2017 |work=curbed}}',
 '{{cvt|1550|ft|0}}',
 '{{cite web |url=https://archpaper.com/2019/09/central-park-tower-tops-out/</ref> (98 habitable floors)<ref name="auto">{{cite web |url=http://www.skyscrapercenter.com/building/central-park-tower/14269 |title=central park tower - the skyscraper center |website=www.skyscrapercenter.com |access-date=october 10, 2018}}</ref>\n| elevator_count = 11\n| cost = $3 billion<ref name="tase">{{cite news|url=https://commercialobserver.com/2019/04/all-in-good-tase-the-crisis-for-the-american-cohort-in-tel-aviv-is-essentially-over/|title=all in good tase: the crisis for the american cohort in tel aviv is essentially over|date=april 4, 2019|work=commercial observer|last=gourarie|first=chava}}</ref>\n| floor_area = {{convert|1,285,308|sqft|m2}}<ref name="auto" />\n| architect = [[adrian smith + gordon gill architecture]]\n| structural_engineer = [[wsp global]]\n| main_contractor = [[lendlease]]\n| developer = [[extell development company]]\n}}',
 '{{cite web |url=http://www.skyscrapercenter.com/building/central-park-tower/14269 |title=central park tower - the skyscraper center |website=www.skyscrapercenter.com |access-date=october 10, 2018}}',
 '{{cite news|url=https://commercialobserver.com/2019/04/all-in-good-tase-the-crisis-for-the-american-cohort-in-tel-aviv-is-essentially-over/|title=all in good tase: the crisis for the american cohort in tel aviv is essentially over|date=april 4, 2019|work=commercial observer|last=gourarie|first=chava}}',
 '{{convert|1,285,308|sqft|m2}}']

Also posted it here.

@lahwaacz
Contributor

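The second infobox is a mess: in the floor_count parameter, two <ref> tags are closed with </ref> before the {{cite web}} template opened inside them is terminated with }}. The unclosed template then swallows everything up to the next }}, so the outer infobox never parses as one template. You should probably fix the wikicode itself...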

For mwparserfromhell, this is a duplicate of #40.
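
To locate the offending references before parsing, a quick pre-check along these lines may help. This is only a rough sketch using the standard library, not part of mwparserfromhell: the helper name `unbalanced_refs` and the shortened `sample` string are made up for illustration, and the regex assumes plain `<ref>` or `<ref name="...">` opening tags with no `/` inside the attributes.

```python
import re

# Matches a paired <ref>...</ref> and captures its body.
# Self-closing tags such as <ref name="auto" /> are deliberately not matched.
REF_BODY = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def unbalanced_refs(wikitext):
    """Return the bodies of <ref> tags whose '{{' and '}}' counts differ."""
    return [
        body
        for body in REF_BODY.findall(wikitext)
        if body.count("{{") != body.count("}}")
    ]


# Shortened, hypothetical excerpt modelled on the floor_count parameter above.
sample = (
    "| floor_count = 131<ref>{{cite web "
    "|url=https://archpaper.com/2019/09/central-park-tower-tops-out/</ref>"
    ' (98 habitable floors)<ref name="auto">{{Cite web '
    "|url=http://www.skyscrapercenter.com/building/central-park-tower/14269 "
    "|title=Central Park Tower}}</ref>\n"
    '| floor_area = {{convert|1,285,308|sqft|m2}}<ref name="auto" />'
)

for body in unbalanced_refs(sample):
    # Each printed body is a reference whose template is never closed.
    print(body)
```

Any reference this reports is a candidate for a missing `}}` in the wikicode. The check only counts braces, so it will not catch every kind of malformed markup.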
