
SQL: error when reading postgres table with timezone #7139

Closed
jorisvandenbossche opened this issue May 15, 2014 · 12 comments
Labels
IO SQL to_sql, read_sql, read_sql_query

Comments

@jorisvandenbossche (Member)

Left over issue from #6292

I have a postgresql table with a column with type timestamp with time zone. Just reading this table (sql.read_sql_table("everyaware_by_passage", engine)) did not work, because of a datetime column: ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True.

I can import it with read_sql, and there the imported datetime column is of type datetime.datetime with tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120, name=None)

Full error traceback (and it seems an issue in pandas itself):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-13ac3c111575> in <module>()
----> 1 sql.read_sql_table("everyaware_by_passage", engine)

c:\users\vdbosscj\scipy\pandas-joris\pandas\io\sql.pyc in read_sql_table(table_name, con, meta, index_col, coerce_float, parse_dates, columns)
    268     table = pandas_sql.read_table(
    269         table_name, index_col=index_col, coerce_float=coerce_float,
--> 270         parse_dates=parse_dates, columns=columns)
    271 
    272     if table is not None:

c:\users\vdbosscj\scipy\pandas-joris\pandas\io\sql.pyc in read_table(self, table_name, index_col, coerce_float, parse_dates, columns)
    795         table = PandasSQLTable(table_name, self, index=index_col)
    796         return table.read(coerce_float=coerce_float,
--> 797                           parse_dates=parse_dates, columns=columns)
    798 
    799     def read_sql(self, sql, index_col=None, coerce_float=True,

c:\users\vdbosscj\scipy\pandas-joris\pandas\io\sql.pyc in read(self, coerce_float, parse_dates, columns)
    609             data, columns=column_names, coerce_float=coerce_float)
    610 
--> 611         self._harmonize_columns(parse_dates=parse_dates)
    612 
    613         if self.index is not None:

c:\users\vdbosscj\scipy\pandas-joris\pandas\io\sql.pyc in _harmonize_columns(self, parse_dates)
    691                 if col_type is datetime or col_type is date:
    692                     if not issubclass(df_col.dtype.type, np.datetime64):
--> 693                         self.frame[col_name] = _handle_date_column(df_col)
    694 
    695                 elif col_type is float:

c:\users\vdbosscj\scipy\pandas-joris\pandas\io\sql.pyc in _handle_date_column(col, format)
     52             return to_datetime(col, coerce=True, unit=format)
     53         else:
---> 54             return to_datetime(col, coerce=True, format=format)
     55 
     56 

c:\users\vdbosscj\scipy\pandas-joris\pandas\tseries\tools.pyc in to_datetime(arg, errors, dayfirst, utc, box, format, coerce, unit, infer_datetime_format)
    313         return arg
    314     elif isinstance(arg, Series):
--> 315         values = _convert_listlike(arg.values, False, format)
    316         return Series(values, index=arg.index, name=arg.name)
    317     elif com.is_list_like(arg):

c:\users\vdbosscj\scipy\pandas-joris\pandas\tseries\tools.pyc in _convert_listlike(arg, box, format)
    306                 return DatetimeIndex._simple_new(values, None, tz=tz)
    307             except (ValueError, TypeError):
--> 308                 raise e
    309 
    310     if arg is None:

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
@danbirken (Contributor)

I can't seem to reproduce this error. I can reproduce the aspect where a datetime with time zone column isn't parsed into a datetime64[ns] column, but on my version it is happily parsed into a column of datetime objects without error: https://gist.github.com/danbirken/cfd489282f54bc94d3ec

Fixing the underlying issue is actually really easy: you can just add utc=True to the to_datetime() calls in _handle_date_column(). But there is a slight catch: if you set utc=True, all of the datetimes are converted to UTC and you lose the timezone information. I think this is bad, since it would break the round-trip property of converting a table into a DataFrame and then converting that DataFrame back into the same table.

So basically, I think this is working as intended, but if you can provide a full sample code or sample SQL dump that reproduces the error I'll take a look at it.
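The utc=True behavior discussed above can be sketched in isolation (a minimal illustration with hard-coded offset strings standing in for what psycopg2 returns, not the actual pandas/io/sql.py code):

```python
import pandas as pd

# Two tz-aware timestamps with different fixed offsets, as a
# "timestamp with time zone" column spanning a DST change might yield.
s = pd.Series(["2014-01-15 12:00:00+01:00", "2014-07-15 12:00:00+02:00"])

# With utc=True everything is converted to UTC, so the result fits in a
# single tz-aware datetime64 column -- but the original offsets are lost.
out = pd.to_datetime(s, utc=True)
print(out.dt.tz)  # UTC
```

This is exactly the trade-off described above: the conversion succeeds, but the per-row offsets are normalized away.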

@danbirken (Contributor)

FYI I have dug into this and have finally replicated it! Error message was a little misleading.

The issue is that if you have a datetime with time zone column that has datetimes with different time zones (which will happen if you have DST and non-DST datetimes for example), then you'll get this error (triggered by: https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#L1051).

Simple gist to cause it: https://gist.github.com/danbirken/8f52541c453b386b8049

So this issue is very valid and I'll try to come up with a reasonable fix for it. I've discovered some related things while digging into this so maybe I can improve multiple things at once.
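The mixed-offset situation described above can be illustrated with the standard library alone (hypothetical fixed offsets standing in for CET/CEST):

```python
from datetime import datetime, timezone, timedelta

cet = timezone(timedelta(hours=1))   # winter offset (no DST)
cest = timezone(timedelta(hours=2))  # summer offset (DST)

# Rows from a "timestamp with time zone" column that spans a DST switch:
rows = [
    datetime(2014, 1, 15, 12, 0, tzinfo=cet),
    datetime(2014, 7, 15, 12, 0, tzinfo=cest),
]

# The offsets differ, so the column cannot be represented with a single
# fixed offset -- which is what triggers the ValueError in pandas.
assert rows[0].utcoffset() != rows[1].utcoffset()
```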

@jorisvandenbossche (Member, Author)

@danbirken Thanks a lot for looking into this!

I still have to look in more detail at what you have written, and see if I can reproduce it with the example you provided. But I can confirm that it was indeed a database table where the datetime column spanned both DST and non-DST periods.

@danbirken (Contributor)

Yeah, that is what I suspected, and given that this will be a common situation it is a problem. I have a few ideas for how to fix this; I'll get some code out this week.

@danbirken (Contributor)

Ignore this for a moment, after submitting I have had a change of heart on the best way to fix it.

@danbirken (Contributor)

Alright, my change of heart is that if errors='ignore', then to_datetime should not be raising this error in the first place.

So I moved the logic that prevents this error from being raised into to_datetime, and added the test for this specific case over there.

I also added an additional feature: a convert_dates_to_utc keyword for all of the read_xyz functions. This lets somebody who has a table with datetime with time zone fields allow pandas to convert them to UTC, which means they will be stored in a datetime64 array that uses less space and is faster. I think it makes a lot of sense for this to be True by default, but I made it False in this change in the interest of backwards compatibility.

So with this change, the only real functionality change is that this error won't be triggered anymore. However, if a person uses the new convert_dates_to_utc feature, they get an additional bonus.

I reconfigured test_sql.py a little bit because AFAIK this change only makes a difference on Postgres, and it didn't make sense (and caused problems) to even pretend it applied to either mysql or sqlite.

@danbirken (Contributor)

(Resubmitting same change again because travis tests broke on an intermittent network error and I like that green check mark)

@jorisvandenbossche modified the milestones: 0.15.0, 0.14.1 on Jul 3, 2014
@jreback modified the milestones: 0.16.0, Next Major Release on Mar 6, 2015
@mangecoeur (Contributor)

Just ran into this bug. Another minimal workaround would be to make the datetime handling for "known" datetime columns obey the parse_dates option. At the moment parse_dates only affects non-datetime columns that you might want to force to be interpreted as datetimes; setting it to False doesn't affect the handling of columns that already have a datetime type, so to_datetime is still called on them, triggering the bug. If setting parse_dates=False meant that to_datetime was never called, you would get an object column you'd have to handle manually, but you wouldn't hit this bug. Not as nice as it actually working, but at least it gives you the option of a workaround...

@mangecoeur (Contributor)

Running into this bug a lot right now when handling DST-affected datetimes; any idea when this will be merged?

@danbirken (Contributor)

There is no perfect solution to this bug which is why I suspect no solution has been merged.

I've updated the pull request with my 4th and final proposal to fix this bug, and it is the simplest and I believe the correct fix. Up to the maintainers if they want to merge it, but this is a classic "perfect is the enemy of the good" bug and I don't like those. Rationale for this final proposal will be in the pull request.

danbirken added a commit to danbirken/pandas that referenced this issue Mar 26, 2015


`read_sql_table()` will break if it reads a table with a `timestamp
with time zone` column if individual rows within that column have
different time zones. This is very common due to daylight savings time.

Pandas right now does not have good support for a Series containing
datetimes with different time zones (hence this bug).  So this change
simply converts a `timestamp with time zone` column into UTC during
import, which pandas has great support for.
@pilotstew

Until this is fixed an easy workaround might be to use read_sql_query and include AT TIME ZONE 'UTC' in your query. You can then localize to whatever timezone you need in pandas.
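A sketch of the second half of that workaround (the query itself is omitted; the sample timestamps below just stand in for what `AT TIME ZONE 'UTC'` would return, and `Europe/Brussels` is an assumed target zone):

```python
import pandas as pd

# After `SELECT ts AT TIME ZONE 'UTC' AS ts_utc, ...` the column comes back
# as naive timestamps that are actually UTC wall-clock times.
naive_utc = pd.to_datetime(
    pd.Series(["2014-01-15 11:00:00", "2014-07-15 10:00:00"])
)

# Localize to UTC, then convert to whatever zone you need; pandas picks the
# correct DST offset per row.
local = naive_utc.dt.tz_localize("UTC").dt.tz_convert("Europe/Brussels")
print(local.iloc[0])  # 2014-01-15 12:00:00+01:00
print(local.iloc[1])  # 2014-07-15 12:00:00+02:00
```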

jorisvandenbossche pushed a commit that referenced this issue Apr 13, 2015
(same commit message as above)
@jorisvandenbossche (Member, Author)

Closed by #7364

5 participants