EHN: Improve from_items error message (#17312) #17881

Merged
merged 6 commits into from
Nov 26, 2017

Conversation

Contributor

@reidy-p reidy-p commented Oct 15, 2017

codecov bot commented Oct 15, 2017

Codecov Report

Merging #17881 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17881      +/-   ##
==========================================
+ Coverage   91.23%   91.24%   +<.01%     
==========================================
  Files         163      163              
  Lines       50102    50106       +4     
==========================================
+ Hits        45712    45719       +7     
+ Misses       4390     4387       -3

Flag        Coverage Δ
#multiple   89.05% <100%> (+0.02%) ⬆️
#single     40.31% <0%> (-0.07%) ⬇️

Impacted Files                  Coverage Δ
pandas/core/frame.py            97.75% <100%> (-0.1%) ⬇️
pandas/io/gbq.py                25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py   65.2% <0%> (+1.81%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aed9b92...a440e50.

codecov bot commented Oct 15, 2017

Codecov Report

Merging #17881 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17881      +/-   ##
==========================================
+ Coverage    91.3%   91.32%   +0.02%     
==========================================
  Files         163      163              
  Lines       49781    49789       +8     
==========================================
+ Hits        45451    45471      +20     
+ Misses       4330     4318      -12

Flag        Coverage Δ
#multiple   89.12% <100%> (+0.02%) ⬆️
#single     40.71% <0%> (-0.01%) ⬇️

Impacted Files                  Coverage Δ
pandas/core/frame.py            97.81% <100%> (ø) ⬆️
pandas/plotting/_converter.py   65.25% <0%> (+1.81%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 38f41e6...fc7fd26.

@gfyoung gfyoung added the Error Reporting Incorrect or improved errors from pandas label Oct 16, 2017
@@ -1258,6 +1258,13 @@ def from_items(cls, items, columns=None, orient='columns'):
"""
keys, values = lzip(*items)

import array
Member

@jreback : I feel like we must have some function checking if an object is array-like?

Member

isn't is_list_like enough in this case?

Contributor Author

is_list_like seems to work fine. I'll update the PR.
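For readers unfamiliar with this kind of check, here is a minimal stdlib-only sketch of the idea. `looks_list_like` is a hypothetical stand-in written for illustration, not pandas' own `is_list_like`; it assumes only that "list-like" means iterable but not a string or bytes.

```python
import array
from collections.abc import Iterable


def looks_list_like(obj):
    """Hypothetical stand-in: iterable, but not a plain string or bytes."""
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))


# array.array needs no special case: it is iterable like any other sequence.
samples = [[1, 2], (1, 2), {"a": 1}, array.array("i", [1, 2]), 1, 2.5, "abc"]
results = [looks_list_like(s) for s in samples]
print(results)
```

Under this assumption a generic iterable check already covers `array.array`, which would make the explicit `import array` in the hunk above unnecessary.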

@@ -1258,6 +1258,11 @@ def from_items(cls, items, columns=None, orient='columns'):
"""
keys, values = lzip(*items)

for val in values:
Contributor

NO, this is completely non-performant.

you need to catch the error and then do the check.
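The "catch the error, then check" pattern suggested here can be sketched roughly as follows. This is a stdlib-only illustration, not the pandas implementation: `build_table` and `_fast_build` are hypothetical names, and `len()` failing on a scalar stands in for whatever the real constructor raises.

```python
def _fast_build(keys, values):
    """Hypothetical fast path: fails with an unhelpful error on scalars."""
    lengths = {len(v) for v in values}  # len(1) raises TypeError
    if len(lengths) > 1:
        raise ValueError("arrays must all be same length")
    return dict(zip(keys, values))


def build_table(items):
    """Build a column mapping from (key, value) pairs."""
    keys, values = zip(*items)
    try:
        # Valid input pays no validation cost: no per-value loop up front.
        return _fast_build(keys, values)
    except (TypeError, ValueError):
        # Only on failure do we inspect the values to explain what went wrong.
        if any(not hasattr(v, "__len__") for v in values):
            raise ValueError(
                "The value in each (key, value) pair must be an array, "
                "Series, or dict"
            )
        raise  # some other problem: keep the original error


print(build_table([("a", [1, 2]), ("b", [3, 4])]))
try:
    build_table([("a", 1), ("b", 2)])
except ValueError as err:
    print(err)
```

The design point is that the per-value inspection runs only on the failure path, so well-formed input never pays for it.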

try:
    return cls._from_arrays(arrays, columns, None)

except ValueError:
Contributor

why are you not simply doing this inside ._from_arrays ? you only would need this once

Contributor Author

In the case where orient == 'index' we don't get to ._from_arrays because the previous line

data = [lib.maybe_convert_objects(v) for v in arr]

throws the error:

TypeError: Argument 'objects' has incorrect type (expected numpy.ndarray, got int)

Contributor

ok i see what you are doing. can you add a comment to each section about what you are guarding against here.
we normally don't like to try/except around multiple statements but this is really a 'more informative message' guard.

@@ -1205,6 +1205,18 @@ def test_constructor_from_items(self):
        columns=['one', 'two', 'three'])
    tm.assert_frame_equal(rs, xp)

    # GH 17312
Contributor

make this a new test, also test DataFrame(dict(...))

Contributor Author

Not sure what you mean by the second point here. From what I can tell from_items only works with a list of (key, value) tuples and not a dict? So how do I do a test with a dict?
Thanks.

Contributor

I was saying, split off these added cases into a new test.

also test DataFrame(dict(....)) with the same input, it's a dict built from the tuples (it will yield the same error messages)

e.g. on current master

In [1]: DataFrame.from_items([('A', 1), ('B', 4)])
ValueError: If using all scalar values, you must pass an index

In [3]: DataFrame(dict({'A': 1, 'B': 4}))
ValueError: If using all scalar values, you must pass an index
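A separate test for the scalar cases might be structured like the sketch below. It is stdlib-only and does not use pandas' testing helpers: `from_pairs` is a hypothetical stand-in for the constructor under test, and the message check relies on `unittest`'s `assertRaisesRegex`.

```python
import unittest


def from_pairs(items):
    """Hypothetical stand-in for the constructor being tested."""
    for _, value in items:
        if not isinstance(value, (list, tuple, dict)):
            raise ValueError("If using all scalar values, you must pass an index")
    return dict(items)


class TestFromPairsScalars(unittest.TestCase):
    # Scalar cases live in their own test, separate from the happy path.
    def test_scalars_from_pairs(self):
        with self.assertRaisesRegex(ValueError, "all scalar values"):
            from_pairs([("A", 1), ("B", 4)])

    def test_scalars_from_dict(self):
        # Same input expressed as a dict built from the tuples.
        data = dict([("A", 1), ("B", 4)])
        with self.assertRaisesRegex(ValueError, "all scalar values"):
            from_pairs(data.items())


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFromPairsScalars)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```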

Contributor

@jreback jreback left a comment

inline

elif orient == 'index':
    if columns is None:
        raise TypeError("Must pass columns with orient='index'")

    keys = _ensure_index(keys)
    try:
        keys = _ensure_index(keys)
Contributor

I don't think _ensure_index can raise here

Contributor

jreback commented Oct 28, 2017

can you add a note on 0.22

@reidy-p reidy-p force-pushed the from_items_error branch 2 times, most recently from 0d4ff3f to e57b142 October 31, 2017 21:59

Contributor Author

reidy-p commented Oct 31, 2017

The current behaviour of from_items on master when passed a (key, value) pair with a scalar value is:

In [1]: pd.DataFrame.from_items([('a', 1), ('b', 2)])
Out[1]: ValueError: If using all scalar values, you must pass an index 

In [2]: pd.DataFrame.from_items([('a', 1), ('b', 2)], columns=['col1'], orient='index')
Out[2]: TypeError: Argument 'objects' has incorrect type (expected numpy.ndarray, got int)

These error messages are not very helpful (from_items doesn't have an index parameter, for example). So I have tried to provide more informative error messages:

In [3]: pd.DataFrame.from_items([('a', 1), ('b', 2)])
Out[3]: TypeError: The value in each (key, value) pair must be an array, Series, or dict

In [4]: pd.DataFrame.from_items([('a', 1), ('b', 2)], columns=['col1'], orient='index')
Out[4]: TypeError: The value in each (key, value) pair must be an array, Series, or dict

On the current master pd.DataFrame(dict(..)) and pd.DataFrame.from_dict(dict(..)) raise the same error message as from_items when passed scalar values:

In [5]: pd.DataFrame({'A': 1, 'B': 2})
Out[5]: ValueError: If using all scalar values, you must pass an index 

In [6]: pd.DataFrame.from_dict({'A': 1, 'B': 2})
Out[6]: ValueError: If using all scalar values, you must pass an index 

However, changing these error messages in the same way as for from_items caused problems and seems to affect other functions such as df.agg({}):

In [7]: pd.DataFrame({'A': 1, 'B': 2})
Out[7]: TypeError: The value in each key:value pair must be an array, Series, or dict

In [8]: pd.DataFrame.from_dict({'A': 1, 'B': 2})
Out[8]: TypeError: The value in each key:value pair must be an array, Series, or dict

In [9]: df = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)})
In [10]: df.agg({'A': 'mean'})
Out[10]: RecursionError: maximum recursion depth exceeded

So in the updated pull request I have only changed the error message for from_items. I haven't changed the error message for pd.DataFrame(dict(..)) or from_dict but have modified the pd.DataFrame(dict(..)) tests to check for the current error message. Is this solution acceptable or, if not, does anyone have any suggestions on how to proceed? Or maybe pd.DataFrame(dict(..)) or from_dict could be discussed in a separate issue if needed?

Contributor

jreback commented Nov 25, 2017

can you rebase / update

@jreback jreback added this to the 0.22.0 milestone Nov 25, 2017
Contributor

jreback commented Nov 25, 2017

one more rebase and should be good


except ValueError:
    if not is_nested_list_like(values):
        raise TypeError('The value in each (key, value) pair must '
Contributor

make these ValueErrors to be consistent with the scalar error
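One practical reason to keep the exception type as `ValueError`: callers that already catch the existing scalar error keep working when only the message changes. A minimal sketch under that assumption, with `old_ctor`, `new_ctor`, and `safe_build` as hypothetical names:

```python
def old_ctor(items):
    """Hypothetical: the pre-existing behaviour for scalar input."""
    raise ValueError("If using all scalar values, you must pass an index")


def new_ctor(items):
    """Hypothetical: clearer message, same exception type."""
    raise ValueError(
        "The value in each (key, value) pair must be an array, Series, or dict"
    )


def safe_build(ctor, items):
    """Caller code written against the old behaviour."""
    try:
        return ctor(items)
    except ValueError:
        return None  # fall back; a TypeError would have escaped this handler


items = [("a", 1), ("b", 2)]
print(safe_build(old_ctor, items), safe_build(new_ctor, items))
```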

@@ -1204,6 +1205,19 @@ def test_constructor_from_items(self):
columns=['one', 'two', 'three'])
tm.assert_frame_equal(rs, xp)

def test_constructor_from_items_scalars(self):
    # GH 17312
    with tm.assert_raises_regex(TypeError,
Contributor

e.g. ValueError here

Contributor

jreback commented Nov 26, 2017

lgtm. ping on green.

Contributor Author

reidy-p commented Nov 26, 2017

@jreback thanks. It's green now.

@jreback jreback merged commit f6fe089 into pandas-dev:master Nov 26, 2017
Contributor

jreback commented Nov 26, 2017

thanks!

Contributor

jreback commented Nov 26, 2017

as an aside I think it's ok to deprecate .from_items, e.g. #18262, as it's trivially replaced by dict(...)
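To illustrate the replacement mentioned here: a list of (key, value) pairs converts directly to a mapping, which the dict-based constructors accept. The sketch below is stdlib-only and stops at building the mapping; handing it to `DataFrame.from_dict` is left as a comment, on the assumption that pandas is installed.

```python
from collections import OrderedDict

items = [("A", [1, 2, 3]), ("B", [4, 5, 6])]

# OrderedDict keeps the column order of the original pairs explicit.
data = OrderedDict(items)
print(list(data))

# With pandas available (assumption), the mapping can be passed on:
#   import pandas as pd
#   df = pd.DataFrame.from_dict(data)
```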

@reidy-p reidy-p deleted the from_items_error branch December 10, 2017 16:14
Successfully merging this pull request may close these issues.

ENH: Improve error message when using DataFrame.from_items instead of DataFrame.from_records
4 participants