Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Excel date format #977

Open
PyGeek03 opened this issue May 13, 2021 · 7 comments · May be fixed by #979
Open

Add support for Excel date format #977

PyGeek03 opened this issue May 13, 2021 · 7 comments · May be fixed by #979

Comments

@PyGeek03
Copy link

PyGeek03 commented May 13, 2021

Feature Request

In Excel files, date is represented as an integer that counts how many days have passed from either January 1st, 1900 (on Windows) or January 1st, 1904 (on Mac).

The function xlrd.xldate.xldate_as_datetime from the xlrd library is usually used for parsing this date format in Python. Considering Arrow's goal of providing a common API for date parsing, I believe this would be a useful feature for users who have to wrangle multiple date formats in the same column of an Excel file (because of non-standard data entry).

The only issue is, I'm not sure if this format should be supported as a token to use with arrow.get() and arrow.format() functions (and what that token should be), or if it should be its own function. I'm happy to work on this once we've decided on whether to implement this and how to solve this issue.

@anishnya
Copy link
Member

Hi @PyGeek03. Do you mind giving us some examples of inputs and expected outputs?

@PyGeek03
Copy link
Author

Hi @PyGeek03. Do you mind giving us some examples of inputs and expected outputs?

These examples are from the Microsoft documentation page:

  • For the 1900 date system (starts counting from January 1, 1900):
    Input: 0. Output: January 1, 1900
    Input: 34519. Output: July 4, 1994
    Input: 35981. Output: July 5, 1998

  • For the 1904 date system (starts counting from January 1, 1904):
    Input: 0. Output: January 1, 1904
    Input: 34519. Output: July 5, 1998
    Input: 35981. Output: July 6, 2002

An interesting issue is that for compatibility with Lotus 1-2-3, Excel considers 1900 to be a leap year, which is wrong. So the difference between the 2 date systems is not 4 years, but 4 years 1 day.

@anishnya
Copy link
Member

Hi @PyGeek03, I was looking a little more into the documentation. Is the input ever a fraction, or float? I just want to make sure there aren't any weird edge cases I'm missing before I propose a solution. Some of our dependencies don't play nicely with floats.

@PyGeek03
Copy link
Author

From the source code of xlrd.xldate, that seems possible. I reckon we can just write some wrapper functions around xlrd.xldate, and add it as a dependency?

@anishnya
Copy link
Member

anishnya commented May 21, 2021

Hey @PyGeek03. I was playing around with some possible solutions and it looks like we could just create a wrapper that uses our built-in shift function with a start date of either January 1, 1900 (or January 1, 1904). Both xlrd and Arrow use datetime as dependencies, so the behaviour from Arrow would be presumably the same as you are experiencing with xlrd. Also, this way avoids creating an additional dependency for Arrow.

Here is a pretty basic example of how this would work.

import arrow

base_date = arrow.get('1900-01-01')
print(base_date)

new_date = base_date.shift(days=1)
print(new_date)

@jadchaar what are your thoughts? The implementation of this would be very similar to how xlrd does it (https://github.com/python-excel/xlrd/blob/master/xlrd/xldate.py#L130).

Also, our built-in shift function does support fractions/float inputs, so this wouldn't be an issue.

@jadchaar
Copy link
Member

Hey @PyGeek03. I was playing around with some possible solutions and it looks like we could just create a wrapper that uses our built-in shift function with a start date of either January 1, 1900 (or January 1, 1904). Both xlrd and Arrow use datetime as dependencies, so the behaviour from Arrow would be presumably the same as you are experiencing with xlrd. Also, this way avoids creating an additional dependency for Arrow.

Here is a pretty basic example of how this would work.

import arrow

base_date = arrow.get('1900-01-01')
print(base_date)

new_date = base_date.shift(days=1)
print(new_date)

@jadchaar what are your thoughts? The implementation of this would be very similar to how xlrd does it (https://github.com/python-excel/xlrd/blob/master/xlrd/xldate.py#L130).

Also, our built-in shift function does support fractions/float inputs, so this wouldn't be an issue.

I think a native shift solution would work nicely. Since this is a niche use case, we probably want to avoid adding an additional dependency.

@PyGeek03
Copy link
Author

PyGeek03 commented May 22, 2021 via email

@anishnya anishnya linked a pull request May 26, 2021 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants