Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix!: use pandas.NaT for missing values in dbdate and dbtime dtypes #67

Merged
merged 3 commits into from
Feb 2, 2022

Conversation

tswast
Copy link
Collaborator

@tswast tswast commented Feb 2, 2022

This makes them consistent with other date/time dtypes, as well as internally
consistent with the advertised dtype.na_value.

BREAKING-CHANGE: dbdate and dbtime dtypes return NaT instead of None for missing values

Release-As: 0.4.0

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #66
🦕

This makes them consistent with other date/time dtypes, as well as internally
consistent with the advertised `dtype.na_value`.

BREAKING-CHANGE: dbdate and dbtime dtypes return NaT instead of None for missing values

Release-As: 0.4.0
@tswast tswast requested a review from a team February 2, 2022 21:05
@tswast tswast requested a review from a team as a code owner February 2, 2022 21:05
@tswast tswast requested a review from shollyman February 2, 2022 21:05
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-db-dtypes-pandas API. label Feb 2, 2022
Copy link

@shollyman shollyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear how/why this gets used in practice. Is this an attempt at exposing typed nulls in the pandas space?

@tswast
Copy link
Collaborator Author

tswast commented Feb 2, 2022

Numpy/Pandas have a special case for date/time arrays. The underlying c-buffer is of integers, rather than floating point values, so NaN can't be stored. Instead they pick a date/time that's unlikely to be used and add lots of checks throughout the code so that it behaves mostly like a NaN would.

https://github.com/numpy/numpy/blob/72a27ad2798575fa5dce9046815597b3a63109a2/numpy/core/include/numpy/ndarraytypes.h#L239

Some new Pandas types also have a pandas.NA value, but this is achieved via a bit mask for null/not-null. Since we're backing our arrays with a numpy datetime64 array, NaT is what we need to store missing values as. The inconsistency between what we were storing and what we were exposing to users was causing some edge cases to break, such as trying to see if an array contains a missing value.

@tswast tswast merged commit f903c2c into main Feb 2, 2022
@tswast tswast deleted the issue28-NaT branch February 2, 2022 22:17
gcf-merge-on-green bot pushed a commit that referenced this pull request Mar 24, 2022
🤖 I have created a release *beep* *boop*
---


## [0.4.0](v0.3.1...v0.4.0) (2022-03-24)


### ⚠ BREAKING CHANGES

* * fix: address failing compliance tests in DateArray and TimeArray
* * fix: address failing compliance tests in DateArray and TimeArray
* * fix: address failing compliance tests in DateArray and TimeArray
* * fix: address failing compliance tests in DateArray and TimeArray
* * fix: address failing compliance tests in DateArray and TimeArray
* * fix: address failing compliance tests in DateArray and TimeArray
* dbdate and dbtime dtypes return NaT instead of None for missing values

### Features

* dbdate and dbtime support numpy.datetime64 values in array constructor ([1db1357](1db1357))


### Bug Fixes

* address failing 2D array compliance tests  in DateArray ([#64](#64)) ([b771e05](b771e05))
* address failing tests with pandas 1.5.0 ([#82](#82)) ([38ac28d](38ac28d))
* allow comparison with scalar values ([#88](#88)) ([7495698](7495698))
* avoid TypeError when using sorted search ([#84](#84)) ([42bc2d9](42bc2d9))
* correct TypeError and comparison issues discovered in DateArray compliance tests ([#79](#79)) ([1e979cf](1e979cf))
* dbdate and dbtime support set item with null values ([#85](#85)) ([1db1357](1db1357))
* use `pandas.NaT` for missing values in dbdate and dbtime dtypes ([#67](#67)) ([f903c2c](f903c2c))
* use public pandas APIs where possible ([#60](#60)) ([e9d41d1](e9d41d1))


### Tests

* add dbtime compliance tests ([#90](#90)) ([f14fb2b](f14fb2b))
* add final dbdate compliance tests and sort ([#89](#89)) ([efe7e6d](efe7e6d))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-db-dtypes-pandas API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dtype.na_value is inconsistent with returned values
2 participants