Read from multiple <tbody> within a <table> #20891

adamhooper · 2018-05-01T01:28:15Z

closes ENH: handle multiple tbody in read_html() #20690
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

refs pandas-dev#20690

WillAyd

Nice change

WillAyd · 2018-05-01T04:15:24Z

pandas/io/html.py

-        return self._parse_raw_data(res)
+        raw_data = []
+
+        if len(tbodies) > 0:


For PEP 8 compliance, rely on the truthiness of the sequence here, so just `if tbodies:`
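A minimal sketch of the suggestion, not the actual code in pandas/io/html.py. The function name `collect_rows` and the `fallback_rows` parameter are hypothetical; only `tbodies` and `raw_data` come from the diff above. An empty list is falsy, so `if tbodies:` behaves the same as `if len(tbodies) > 0:` for built-in sequences.

```python
def collect_rows(tbodies, fallback_rows):
    """Gather rows from every tbody-like group, or fall back if there are none.

    `tbodies` is assumed to be a plain list of row lists (an assumption
    made for this sketch, not the parser's real node type).
    """
    raw_data = []

    # Truthiness check: an empty list is falsy, a non-empty one is truthy.
    if tbodies:
        for tbody in tbodies:
            raw_data.extend(tbody)
    else:
        raw_data.extend(fallback_rows)

    return raw_data


print(collect_rows([[["1", "2"]], [["3", "4"]]], []))
print(collect_rows([], [["x", "y"]]))
```

The two calls exercise both branches: the first flattens two groups into one list of rows, the second returns the fallback rows because the list is empty.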

WillAyd · 2018-05-01T04:20:29Z

pandas/tests/io/test_html.py

+            </tbody>
+        </table>'''
+        expected = DataFrame({'A': [1, 3], 'B': [2, 4]})
+        result = self.read_html(StringIO(data))


Move the indexer to this line and make the subsequent line just tm.assert_frame_equal(result, expected). This doesn't change the result; it just reads better alongside the rest of the tests.
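The pattern being asked for can be sketched without pandas. Everything here is an assumption for illustration: `read_tables` is a hypothetical stand-in for read_html built on the standard library, the table markup is invented (the test's real markup is not shown in full above), and a plain `assert` stands in for tm.assert_frame_equal. Only the expected values mirror the test in the diff.

```python
import xml.etree.ElementTree as ET

# Invented markup with two <tbody> elements in one <table>.
DATA = """<table>
  <thead><tr><th>A</th><th>B</th></tr></thead>
  <tbody><tr><td>1</td><td>2</td></tr></tbody>
  <tbody><tr><td>3</td><td>4</td></tr></tbody>
</table>"""


def read_tables(markup):
    """Hypothetical stand-in for read_html.

    Returns a list with one entry per table; each entry maps a column
    name to its list of integer values.
    """
    table = ET.fromstring(markup)
    header = [th.text for th in table.findall("./thead/tr/th")]
    columns = {name: [] for name in header}
    # findall visits the rows of every <tbody>, not just the first one.
    for tr in table.findall("./tbody/tr"):
        for name, td in zip(header, tr.findall("td")):
            columns[name].append(int(td.text))
    return [columns]


def test_multiple_tbody():
    expected = {"A": [1, 3], "B": [2, 4]}
    # Index on the line that produces the result...
    result = read_tables(DATA)[0]
    # ...so the assertion compares two plain names.
    assert result == expected


test_multiple_tbody()
```

Indexing where the result is created keeps the final assertion in the same shape as the neighbouring tests, which is the readability point of the review comment.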

codecov · 2018-05-01T05:17:50Z

Codecov Report

Merging #20891 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20891      +/-   ##
==========================================
+ Coverage   91.78%   91.78%   +<.01%     
==========================================
  Files         153      153              
  Lines       49337    49338       +1     
==========================================
+ Hits        45285    45286       +1     
  Misses       4052     4052

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.17% <100%> (ø)` | ⬆️ |
| #single | `41.93% <0%> (-0.01%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/io/html.py | `88.82% <100%> (+0.03%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f799916...a881c60.

jreback

small changes. ping on green.

jreback · 2018-05-01T10:37:51Z

pandas/tests/io/test_html.py

@@ -396,6 +396,34 @@ def test_empty_tables(self):
        res2 = self.read_html(StringIO(data2))
        assert_framelist_equal(res1, res2)

+    def test_multiple_tbody(self):
+        """


Add the issue number here. You can use a # comment rather than a triple-quoted docstring for the description of the test.
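One way the requested shape might look. The issue number 20690 is the one this pull request closes; the wording of the description and the empty body are assumptions for this sketch.

```python
def test_multiple_tbody():
    # GH 20690
    # Read all <tbody> tags within a single <table>.
    pass  # test body omitted in this sketch
```

Because the description is a comment rather than a string literal, the function carries no docstring.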

TomAugspurger · 2018-05-01T11:20:43Z

@adamhooper do you know if / when you'll be able to update? I'm cutting the RC soon (hopefully a couple hours from now).

If you're busy now I can push the changes and then we'll get this merged before the RC.

TomAugspurger

Merging on green

TomAugspurger · 2018-05-01T15:20:32Z

Test failures from Travis were unrelated. Fixing in #20906.

TomAugspurger · 2018-05-01T17:47:13Z

Failures should be fixed in master.

TomAugspurger · 2018-05-01T17:47:54Z

Thanks @adamhooper!

adamhooper · 2018-05-01T19:05:42Z

Er ... finally I'm available to fix things. You beat me to it -- thanks!

TomAugspurger · 2018-05-01T19:08:35Z

Thank you! I'm glad we could get this in before the release.

@jowens

This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks:

* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.
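The idea that virtual cells created by rowspan/colspan are copies of the real cells' text can be sketched as follows. This is an illustration only, not the pandas implementation: the function name `expand_spans` and the `(text, rowspan, colspan)` tuple input are hypothetical, and the sketch assumes spans never overlap.

```python
def expand_spans(rows):
    """Expand rowspan/colspan into a rectangular grid of cell text.

    `rows` is a list of rows; each cell is a (text, rowspan, colspan)
    tuple. That input shape is an assumption made for this sketch.
    """
    grid = []
    # Cells carried down from earlier rows: {column: (text, rows_left)}.
    pending = {}
    for row in rows:
        out = []
        next_pending = {}
        col = 0
        cells = iter(row)
        cell = next(cells, None)
        while cell is not None or col in pending:
            if col in pending:
                # Virtual cell from an earlier rowspan: copy its text.
                text, rows_left = pending[col]
                out.append(text)
                if rows_left > 1:
                    next_pending[col] = (text, rows_left - 1)
                col += 1
                continue
            text, rowspan, colspan = cell
            for _ in range(colspan):
                # A colspan repeats the same text in each spanned column.
                out.append(text)
                if rowspan > 1:
                    next_pending[col] = (text, rowspan - 1)
                col += 1
            cell = next(cells, None)
        grid.append(out)
        pending = next_pending
    return grid


rows = [
    [("A", 2, 1), ("B", 1, 2)],
    [("c", 1, 1), ("d", 1, 1)],
]
print(expand_spans(rows))
# [['A', 'B', 'B'], ['A', 'c', 'd']]
```

Since every virtual cell holds the same text as its source cell, a header row expanded this way has no empty positions, which is consistent with the note above about avoiding "Unnamed: ..." column names.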

Read from multiple <tbody> within a <table>

54d47e4

refs pandas-dev#20690

WillAyd requested changes May 1, 2018

jreback added the Enhancement and IO HTML (read_html, to_html, Styler.apply, Styler.applymap) labels May 1, 2018

jreback added this to the 0.23.0 milestone May 1, 2018

jreback requested changes May 1, 2018

Updates

a881c60

TomAugspurger approved these changes May 1, 2018

WillAyd approved these changes May 1, 2018

TomAugspurger mentioned this pull request May 1, 2018

RLS: 0.23.0 #20531

Closed

71 tasks

TomAugspurger merged commit 926f241 into pandas-dev:master May 1, 2018

adamhooper deleted the issue-20690 branch May 1, 2018 19:05

adamhooper mentioned this pull request Jun 14, 2018

read_html: Handle colspan and rowspan #21487

Merged

5 tasks
