read_html infers wrong datatype #7032

Closed
ghost opened this issue May 4, 2014 · 10 comments · Fixed by #7851
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO HTML read_html, to_html, Styler.apply, Styler.applymap
Comments

@ghost

ghost commented May 4, 2014

As the code below shows, columns 3, 8, 9, and 10 were misinterpreted as datetime objects, and columns 1, 6, and 7 should be integers. Only columns 2, 4, 5, and 11 appear to have been read properly. How do I force the columns to be interpreted as the proper type? I suppose I can pass 'infer_types=False' and do the conversion manually afterwards, but since infer_types is going away, that won't keep working.

In [63]: import pandas as pd
In [64]: path = r"http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
In [65]: tables = pd.read_html(path)
In [66]: df = tables[1]

In [67]: df.head()
Out[67]:
        1           2          3         4         5        6        7   8   \
1  !000001  California        NaT  37253956  33871648  !000053  !000055 NaT
2  !000002       Texas        NaT  25145561  20851820  !000036  !000038 NaT
3  !000003    New York 1965-11-27  19378102  18976457  !000027  !000029 NaT
4  !000004     Florida        NaT  18801310  15982378  !000027  !000029 NaT
5  !000005    Illinois        NaT  12830632  12419293  !000018  !000020 NaT

   9   10      11
1 NaT NaT  11.91%
2 NaT NaT   8.04%
3 NaT NaT   6.19%
4 NaT NaT   6.01%
5 NaT NaT   4.10%

[5 rows x 11 columns]

dtype: object

In [68]: df.dtypes
Out[68]:
1             object
2             object
3     datetime64[ns]
4             object
5             object
6             object
7             object
8     datetime64[ns]
9     datetime64[ns]
10    datetime64[ns]
11            object
dtype: object
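
For reference, the manual conversion mentioned above could look roughly like the sketch below. It assumes the table has already been read as plain strings (for example with infer_types=False). The sample rows are copied from the output above, but the column names (rank, state, population, share) and the helper clean_population_table are made up for illustration and are not part of pandas or of the original table.

```python
import pandas as pd

# Hypothetical sample: a few cells from the output above, kept as raw strings.
# The column names are illustrative; the real table uses integer labels 1..11.
raw = pd.DataFrame({
    "rank": ["!000001", "!000002", "!000003"],
    "state": ["California", "Texas", "New York"],
    "population": ["37253956", "25145561", "19378102"],
    "share": ["11.91%", "8.04%", "6.19%"],
})


def clean_population_table(frame):
    """Convert the string columns to numeric types one by one."""
    out = frame.copy()
    # Drop the leading "!" sort-key marker, then parse the digits.
    out["rank"] = pd.to_numeric(out["rank"].str.lstrip("!"))
    out["population"] = pd.to_numeric(out["population"])
    # Drop the trailing "%" and parse the rest as a float.
    out["share"] = pd.to_numeric(out["share"].str.rstrip("%"))
    return out


cleaned = clean_population_table(raw)
print(cleaned.dtypes)
```

Converting each column explicitly avoids relying on any type guessing, at the cost of having to know the column layout in advance.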
@filmor
Contributor

filmor commented May 16, 2014

I have exactly the same problem, and I have to rely on infer_types=False in my code. I can't find the rationale behind removing the parameter. Why is it deprecated?

@cpcloud
Member

cpcloud commented May 16, 2014

The rationale is that it doesn't do anything except convert the result of the parse into strings, which will happen anyway if you have, e.g., numerical columns that have strange values in them.

@tui-rob
Contributor

tui-rob commented Jun 12, 2014

Same issue here. In the example below, read_html incorrectly converts the Firstname column from strings into timestamps because one row contains the Firstname value 'April'.

Original html table here.

import pandas as pd
url = 'https://www.raceplus.co.uk/raceplus_display_results_fixed.php?content=1&race=47530&show_type=All&event_id=BLEN14&event_id=BLEN14&page=28'
df = pd.read_html(url, skiprows=1, header=0, index_col=0)[0][0:-1]

df.head()
Out[4]: 
        Firstname Lastname     Cat           Swim Time             T1 Time  \
Race No
3862          NaT   Dawson  M30-34 2014-06-12 00:15:15 2014-06-12 00:05:40   
1688          NaT    Crown  M40-44 2014-06-12 00:18:15 2014-06-12 00:07:15   
2269          NaT     Lang  F30-34 2014-06-12 00:18:40 2014-06-12 00:06:28   
6321          NaT   Lawson  M45-49 2014-06-12 00:19:33 2014-06-12 00:08:27   
5208          NaT    Woods  F25-29 2014-06-12 00:17:11 2014-06-12 00:05:41   

                  Bike Time             T2 Time            Run Time  \
Race No
3862    2014-06-12 00:48:35 2014-06-12 00:02:30 2014-06-12 00:33:53   
1688    2014-06-12 00:47:37 2014-06-12 00:01:41 2014-06-12 00:31:07   
2269    2014-06-12 00:46:19 2014-06-12 00:02:08 2014-06-12 00:32:19   
6321    2014-06-12 00:44:37 2014-06-12 00:03:22 2014-06-12 00:29:58   
5208    2014-06-12 00:46:46 2014-06-12 00:02:08 2014-06-12 00:34:12   

                       Time   Pos  
Race No                            
3862    2014-06-12 01:45:52  2701  
1688    2014-06-12 01:45:52  2702  
2269    2014-06-12 01:45:52  2703  
6321    2014-06-12 01:45:56  2704  
5208    2014-06-12 01:45:56  2705  
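
One possible workaround for a table like this, sketched under the assumption that the cells are available as raw strings: leave the name columns alone and convert only the split-time columns, using pd.to_timedelta so that they become durations rather than timestamps on today's date. The two rows below are hypothetical (the first names are invented, apart from 'April'); the times are taken from the output above.

```python
import pandas as pd

# Hypothetical raw rows; only the time strings come from the output above.
raw = pd.DataFrame({
    "Firstname": ["April", "Sam"],
    "Swim Time": ["00:15:15", "00:18:15"],
    "T1 Time": ["00:05:40", "00:07:15"],
})

result = raw.copy()
# Convert only the columns known to hold HH:MM:SS durations.
for col in ["Swim Time", "T1 Time"]:
    result[col] = pd.to_timedelta(result[col])

# Durations can be added directly; names stay untouched strings.
swim_plus_t1 = result["Swim Time"] + result["T1 Time"]
print(result.dtypes)
print(swim_plus_t1)
```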

@clarkfitzg
Contributor

@cpcloud Has infer_types been deprecated? Here's what I'm talking about (using Python 3):

In [15]: country_url = 'http://en.wikipedia.org/wiki/ISO_3166-1'

In [17]: iso_df = pd.read_html(country_url, header=0)[0]

In [18]: iso_df.head()
Out[18]:
  English short name (upper/lower case) Alpha-2 code Alpha-3 code  \
0                           Afghanistan          NaT          NaT
1          Aland Islands !Åland Islands          NaT          NaT
2                               Albania          NaT          NaT
3                               Algeria          NaT          NaT
4                        American Samoa          NaT          NaT

   Numeric code ISO 3166-2 codes
0             4    ISO 3166-2:AF
1           248    ISO 3166-2:AX
2             8    ISO 3166-2:AL
3            12    ISO 3166-2:DZ
4            16    ISO 3166-2:AS

In [19]: iso_df2 = pd.read_html(country_url, header=0, infer_types=False)[0]

In [20]: iso_df2.head()
Out[20]:
  English short name (upper/lower case) Alpha-2 code Alpha-3 code  \
0                           Afghanistan           AF          AFG
1          Aland Islands !Åland Islands           AX          ALA
2                               Albania           AL          ALB
3                               Algeria           DZ          DZA
4                        American Samoa           AS          ASM

  Numeric code ISO 3166-2 codes
0            4    ISO 3166-2:AF
1          248    ISO 3166-2:AX
2            8    ISO 3166-2:AL
3           12    ISO 3166-2:DZ
4           16    ISO 3166-2:AS

In [21]: pd.__version__
Out[21]: '0.14.0'

@cpcloud
Member

cpcloud commented Jul 4, 2014

No, it's still there. This is actually a bug that slipped through the cracks: the NaTs are wrong and should be fixed. I'll see what I can do over the weekend.

@cpcloud
Member

cpcloud commented Jul 4, 2014

Only when this behavior is fixed can we consider deprecating infer_types. infer_types was originally there because the original implementation didn't use the CSV parser machinery. Now it does, but the date parsing is somehow being forced where it shouldn't be.

@clarkfitzg
Contributor

Cool man. read_html is actually one of my favorite features in Pandas, and it's going to be much nicer once this is cleaned up. I appreciate it!

@cpcloud
Member

cpcloud commented Jul 4, 2014

No problem dude, glad you like!

@cpcloud
Member

cpcloud commented Jul 26, 2014

@clarkfitzg check out the PR if you want ... it fixes this weird date issue. Turns out it was because I was "forcing" convert_objects.
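
To illustrate why forced date coercion produces columns full of NaT, here is a small sketch. It does not reproduce the read_html internals or call convert_objects; it uses pd.to_datetime with errors="coerce" as a stand-in for "force everything to dates", which is an assumption made for illustration only.

```python
import pandas as pd

# Plain text values that are clearly not dates.
names = pd.Series(["California", "Texas", "Dawson"])

# Forcing a datetime conversion and coercing failures turns every
# unparseable value into NaT instead of leaving the strings alone.
forced = pd.to_datetime(names, errors="coerce")

print(forced)
print(forced.isna().all())
```

Any value that cannot be parsed becomes NaT, which matches the all-NaT columns in the outputs earlier in this thread.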

@clarkfitzg
Contributor

@cpcloud nice!

5 participants