`DataSet.to_numpy()` should use numpy dtypes whenever possible #182

JBGreisman · 2022-09-06T21:17:16Z

When DataFrame.to_numpy() is called on a pandas DataFrame that contains ExtensionDtypes, the output always defaults to object dtype. This is suboptimal for MTZ data, which by construction must be compatible with float32, and possibly int32.

This PR wraps the pandas call with DataSet.to_numpy() to assess whether a more sensible default (either float32 or int32) can be used based on the existing data. This should help to avoid cases where data is unnecessarily cast to an object array, which can lead to unexpected behavior downstream.
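A minimal sketch of the idea described above, not the code merged in this PR. It assumes plain pandas nullable dtypes (`Int32`, `Float32`) as stand-ins for the MTZ dtypes, and the helper name `to_numpy_with_numeric_default` is hypothetical. The rule it applies: use int32 when every column is integer-like and has no missing values, use float32 when every column is numeric, and otherwise fall back to the pandas default.

```python
import numpy as np
import pandas as pd


def to_numpy_with_numeric_default(df: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: prefer int32 or float32 over an object array."""
    dtypes = list(df.dtypes)

    # All-integer columns with no missing values fit in an int32 array.
    all_integer = all(pd.api.types.is_integer_dtype(dt) for dt in dtypes)
    if all_integer and not df.isna().any().any():
        return df.to_numpy(dtype=np.int32)

    # Otherwise, if every column is numeric, use float32 and map
    # missing values to NaN.
    if all(pd.api.types.is_numeric_dtype(dt) for dt in dtypes):
        return df.to_numpy(dtype=np.float32, na_value=np.nan)

    # Non-numeric columns: keep whatever pandas would return by default.
    return df.to_numpy()


# Stand-in data: Miller indices as nullable integers, an amplitude
# column as a nullable float with one missing entry.
hkl = pd.DataFrame(
    {
        "H": pd.array([1, 2, 3], dtype="Int32"),
        "K": pd.array([0, 1, 2], dtype="Int32"),
    }
)
mixed = hkl.assign(F=pd.array([10.5, None, 3.25], dtype="Float32"))

print(to_numpy_with_numeric_default(hkl).dtype)
print(to_numpy_with_numeric_default(mixed).dtype)
```

Checking for missing values before choosing int32 matters because an integer array has no representation for a missing entry, so any column with missing values is routed to float32 with NaN instead.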

codecov-commenter · 2022-09-06T21:29:52Z

Codecov Report

Merging #182 (7c0264f) into main (4222ffc) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #182      +/-   ##
==========================================
+ Coverage   98.36%   98.37%   +0.01%     
==========================================
  Files          45       45              
  Lines        1772     1783      +11     
==========================================
+ Hits         1743     1754      +11     
  Misses         29       29

Flag	Coverage Δ
unittests	`98.37% <100.00%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
reciprocalspaceship/dataset.py	`98.20% <100.00%> (+0.04%)`	⬆️

kmdalton

I like it. I wish we also handled the case where MTZDtype is mixed with Pandas float dtypes. I'm not sure that can be implemented in a way that doesn't cause nasty side effects.

reciprocalspaceship/dataset.py

tests/dtypes/test_dataset_to_numpy.py

Fix #33: DataSet.to_numpy() should use numpy dtypes whenever possible

d30d940

JBGreisman added the enhancement and MTZDtypes labels Sep 6, 2022

JBGreisman changed the title ~~Fix #33: DataSet.to_numpy() should use numpy dtypes whenever possible~~ DataSet.to_numpy() should use numpy dtypes whenever possible Sep 6, 2022

JBGreisman added 4 commits September 6, 2022 17:48

Remove unused import

cbd1629

Clean up DataSet.to_numpy() docstring formatting

f3a2515

Remove single-use variables from to_numpy()

6dbf990

Fix typo in test docstring

e0b141d

kmdalton requested changes Sep 6, 2022

reciprocalspaceship/dataset.py

reciprocalspaceship/dataset.py Outdated

tests/dtypes/test_dataset_to_numpy.py Outdated

tests/dtypes/test_dataset_to_numpy.py Outdated

JBGreisman added 2 commits September 6, 2022 19:58

Update to_numpy() docstring

11bb905

Update tests

7c0264f

JBGreisman requested a review from kmdalton September 7, 2022 00:37

kmdalton approved these changes Sep 7, 2022

kmdalton merged commit d70c5a3 into main Sep 7, 2022

kmdalton deleted the df2numpy branch September 7, 2022 00:57
