read_html infers wrong datatype #7032

ghost · 2014-05-04T20:38:00Z

As the code below shows, columns 3, 8, 9, and 10 were misinterpreted as datetime objects, and columns 1, 6, and 7 should be integers. Only columns 2, 4, 5, and 11 appear to have been read properly. How do I force the columns to be interpreted as the proper type? I suppose I could pass infer_types=False and convert manually afterwards, but since infer_types is going away, that won't work for long.

In [63]: import pandas as pd
In [64]: path = r"http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
In [65]: tables = pd.read_html(path)
In [66]: df = tables[1]

In [67]: df.head()
Out[67]:
        1           2          3         4         5        6        7   8   \
1  !000001  California        NaT  37253956  33871648  !000053  !000055 NaT
2  !000002       Texas        NaT  25145561  20851820  !000036  !000038 NaT
3  !000003    New York 1965-11-27  19378102  18976457  !000027  !000029 NaT
4  !000004     Florida        NaT  18801310  15982378  !000027  !000029 NaT
5  !000005    Illinois        NaT  12830632  12419293  !000018  !000020 NaT

   9   10      11
1 NaT NaT  11.91%
2 NaT NaT   8.04%
3 NaT NaT   6.19%
4 NaT NaT   6.01%
5 NaT NaT   4.10%

[5 rows x 11 columns]

dtype: object

In [68]: df.dtypes
Out[68]:
1             object
2             object
3     datetime64[ns]
4             object
5             object
6             object
7             object
8     datetime64[ns]
9     datetime64[ns]
10    datetime64[ns]
11            object
dtype: object
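
For reference, the manual conversion mentioned above could look roughly like the sketch below. It does not call read_html; the `raw` frame is a hypothetical stand-in for a few of the columns shown in the output, and the column names `rank`, `state`, and `share` as well as the helper `clean_columns` are made up for illustration. It only shows one way to strip the `!` prefix and the `%` suffix before converting to numbers.

```python
import pandas as pd

# Hypothetical stand-in for a few columns as they appear in the output above.
# The column names are invented for this sketch; the thread uses 1..11.
raw = pd.DataFrame(
    {
        "rank": ["!000001", "!000002", "!000003"],
        "state": ["California", "Texas", "New York"],
        "share": ["11.91%", "8.04%", "6.19%"],
    },
    dtype=object,
)


def clean_columns(df):
    """Return a copy with the string columns converted to numeric types."""
    out = df.copy()
    # Drop the leading "!" and let pandas parse the zero-padded digits.
    out["rank"] = pd.to_numeric(out["rank"].str.lstrip("!"))
    # Drop the trailing "%" and express the share as a fraction.
    out["share"] = pd.to_numeric(out["share"].str.rstrip("%")) / 100
    return out


cleaned = clean_columns(raw)
print(cleaned.dtypes)
```

The text column is left untouched, so only the columns that are explicitly listed change type.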

filmor · 2014-05-16T07:33:47Z

I have exactly the same problem and I have to rely on infer_types=False in my code. I can't find the rationale behind removing the parameter. Why is it deprecated?

cpcloud · 2014-05-16T10:48:10Z

The rationale is that it doesn't do anything except convert the result of the parse into strings, which will happen anyway if you have, e.g., numerical columns with strange values in them.
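
The point about strange values can be seen with the CSV parser directly. The sketch below is an illustration only: it uses `pd.read_csv` on a small made-up string (the `csv_text` contents and column names are assumptions, loosely modelled on the `!000001` values earlier in the thread) and checks which columns come back as numbers.

```python
import io

import pandas as pd

# Made-up input: "rank" carries a stray "!" prefix, "population" is clean.
csv_text = "rank,population\n!000001,37253956\n!000002,25145561\n"

df = pd.read_csv(io.StringIO(csv_text))

# A column with non-numeric characters is kept as text rather than numbers.
rank_is_numeric = pd.api.types.is_numeric_dtype(df["rank"])
# A clean column of digits is parsed as integers.
population_is_integer = pd.api.types.is_integer_dtype(df["population"])

print(df.dtypes)
print(rank_is_numeric, population_is_integer)
```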

tui-rob · 2014-06-12T06:22:03Z

Same issue here. In the example below, read_html incorrectly converts the Firstname column from strings into timestamps because one row contains the Firstname value 'April'.

Original html table here.

url = 'https://www.raceplus.co.uk/raceplus_display_results_fixed.php?content=1&race=47530&show_type=All&event_id=BLEN14&event_id=BLEN14&page=28'
df = pd.read_html(url, skiprows=1,header=0, index_col=0)[0][0:-1]

df.head()
Out[4]: 
        Firstname Lastname     Cat           Swim Time             T1 Time  Race No                                                                      
3862          NaT   Dawson  M30-34 2014-06-12 00:15:15 2014-06-12 00:05:40   
1688          NaT    Crown  M40-44 2014-06-12 00:18:15 2014-06-12 00:07:15   
2269          NaT     Lang  F30-34 2014-06-12 00:18:40 2014-06-12 00:06:28   
6321          NaT   Lawson  M45-49 2014-06-12 00:19:33 2014-06-12 00:08:27   
5208          NaT    Woods  F25-29 2014-06-12 00:17:11 2014-06-12 00:05:41   

                  Bike Time             T2 Time            Run Time  Race No                                                               
3862    2014-06-12 00:48:35 2014-06-12 00:02:30 2014-06-12 00:33:53   
1688    2014-06-12 00:47:37 2014-06-12 00:01:41 2014-06-12 00:31:07   
2269    2014-06-12 00:46:19 2014-06-12 00:02:08 2014-06-12 00:32:19   
6321    2014-06-12 00:44:37 2014-06-12 00:03:22 2014-06-12 00:29:58   
5208    2014-06-12 00:46:46 2014-06-12 00:02:08 2014-06-12 00:34:12   

                       Time   Pos  
Race No                            
3862    2014-06-12 01:45:52  2701  
1688    2014-06-12 01:45:52  2702  
2269    2014-06-12 01:45:52  2703  
6321    2014-06-12 01:45:56  2704  
5208    2014-06-12 01:45:56  2705
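
To illustrate why a name like 'April' is a problem for permissive date parsing, here is a small sketch using `dateutil`, the date parsing library pandas depends on. This is not the code path read_html takes internally, which this thread does not show; the helper `parses_as_date` is hypothetical, and apart from 'April' the sample strings are arbitrary.

```python
from dateutil import parser as date_parser


def parses_as_date(text):
    """Return True if dateutil accepts the string as a date."""
    try:
        date_parser.parse(text)
    except (ValueError, OverflowError):
        return False
    return True


# "April" comes from the report above; the other strings are arbitrary.
samples = ["April", "Dawson", "Crown"]
flags = {name: parses_as_date(name) for name in samples}
print(flags)
```

A month name on its own is enough for the parser to build a date, which is why a single such value can make a text column look date-like to a converter that coerces failures to NaT.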

clarkfitzg · 2014-07-01T02:50:45Z

@cpcloud Has infer_types been deprecated? Here's what I'm talking about (using Python 3):

In [15]: country_url = 'http://en.wikipedia.org/wiki/ISO_3166-1'

In [17]: iso_df = pd.read_html(country_url, header=0)[0]

In [18]: iso_df.head()
Out[18]:
  English short name (upper/lower case) Alpha-2 code Alpha-3 code  \
0                           Afghanistan          NaT          NaT
1          Aland Islands !Åland Islands          NaT          NaT
2                               Albania          NaT          NaT
3                               Algeria          NaT          NaT
4                        American Samoa          NaT          NaT

   Numeric code ISO 3166-2 codes
0             4    ISO 3166-2:AF
1           248    ISO 3166-2:AX
2             8    ISO 3166-2:AL
3            12    ISO 3166-2:DZ
4            16    ISO 3166-2:AS

In [19]: iso_df2 = pd.read_html(country_url, header=0, infer_types=False)[0]

In [20]: iso_df2.head()
Out[20]:
  English short name (upper/lower case) Alpha-2 code Alpha-3 code  \
0                           Afghanistan           AF          AFG
1          Aland Islands !Åland Islands           AX          ALA
2                               Albania           AL          ALB
3                               Algeria           DZ          DZA
4                        American Samoa           AS          ASM

  Numeric code ISO 3166-2 codes
0            4    ISO 3166-2:AF
1          248    ISO 3166-2:AX
2            8    ISO 3166-2:AL
3           12    ISO 3166-2:DZ
4           16    ISO 3166-2:AS

In [21]: pd.__version__
Out[21]: '0.14.0'

cpcloud · 2014-07-04T01:50:55Z

No, it's still there. This is actually a bug that slipped through the cracks: the NaTs are wrong and should be fixed. I'll see what I can do over the weekend.

cpcloud · 2014-07-04T01:53:55Z

Only when this behavior is fixed can we consider deprecating infer_types. infer_types was originally there because the original implementation didn't use the CSV parser machinery. Now it does, but the date parsing is somehow being forced where it shouldn't be.

clarkfitzg · 2014-07-04T01:56:29Z

Cool man. read_html is actually one of my favorite features in Pandas, and it's going to be much nicer once this is cleaned up. I appreciate it!

cpcloud · 2014-07-04T01:58:35Z

No problem dude, glad you like!

cpcloud · 2014-07-26T14:48:53Z

@clarkfitzg check out the PR if you want; it fixes this weird date issue. Turns out it was because I was "forcing" convert_objects.

clarkfitzg · 2014-07-27T18:43:42Z

@cpcloud nice!

cpcloud mentioned this issue May 5, 2014

fully deprecate read_html infer_types argument in 0.14 #7037

Closed

jreback added HTML labels May 5, 2014

jreback added this to the 0.15.0 milestone May 5, 2014

jsexauer mentioned this issue May 5, 2014

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

sinhrks mentioned this issue Jun 2, 2014

BUG: DatetimeIndex.insert doesnt preserve name and tz #7299

Merged

cpcloud mentioned this issue Jul 26, 2014

BUG: fix greedy date parsing in read_html #7851

Merged

cpcloud closed this as completed in #7851 Jul 28, 2014

jreback mentioned this issue Aug 23, 2015

DEPR: Bunch o deprecation removals part 2 #10892

Merged

jreback added a commit to jreback/pandas that referenced this issue Aug 24, 2015

DEPR: Remove infer_type keyword from pd.read_html as its unused, pand…

0fde3ba

…as-dev#4770, pandas-dev#7032

jreback mentioned this issue Jul 24, 2016

DEPR: deprecations log for removed issues #13777

Closed
