BUG: Fix empty DataFrames to be JSON round-trippable back to DataFrames #21318
Conversation
Can you add a test to cover this?
Need test and whatsnew
Sure, I'll add both of those!
Codecov Report
@@ Coverage Diff @@
## master #21318 +/- ##
==========================================
- Coverage 91.89% 91.85% -0.04%
==========================================
Files 153 153
Lines 49596 49570 -26
==========================================
- Hits 45576 45533 -43
- Misses 4020 4037 +17
Continue to review full report at Codecov.
Some comments but nothing major - just need to standardize and clean up the code here a bit
@@ -86,6 +87,16 @@ def test_multiindex(self):
        assert result == expected


class TestParseSchema(object):
Shouldn't need a new class here - can you just move this to TestTableOrientReader?
@@ -86,6 +87,16 @@ def test_multiindex(self):
        assert result == expected


class TestParseSchema(object):

    def test_empty_json_data(self):
Rename this to test_empty_frame_roundtrip
        # GH21287
        df = pd.DataFrame([], columns=['a', 'b', 'c'])
        json = df.to_json(None, orient='table')
        df = parse_table_schema(json, True)
Let's use pd.read_json here instead.
doc/source/whatsnew/v0.23.1.txt (Outdated)
@@ -92,7 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
- Bug in IO JSON methods reading empty JSON schema back to DataFrame caused an error (:issue:`21287`)
Explicitly reference :func:`read_json` and make sure to qualify that this only applies when ``orient='table'``
Can also update reference to :class:`DataFrame`
        df = pd.DataFrame([], columns=['a', 'b', 'c'])
        json = df.to_json(None, orient='table')
        df = parse_table_schema(json, True)
        assert df.empty
To make sure that we've preserved the frame metadata we should use the tm.assert_frame_equal function here. Typically with tests we will:
- Create a variable called expected, which is very explicit about what we want (here that's just a copy of df)
- Assign to a variable called result, which here would be the result of pd.read_json
- Make sure result and expected match using tm.assert_frame_equal
Thanks for the comments! I'll look into this later today.
Hello @pyryjook! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on June 08, 2018 at 23:39 UTC
        expected = df.copy()
        out = df.to_json(None, orient='table')
        result = pd.read_json(out, orient='table')
        tm.assert_frame_equal(expected, result)
This raises an assertion error:

E   AssertionError: DataFrame.index are different
E
E   DataFrame.index classes are not equivalent
E   [left]:  Index([], dtype='object')
E   [right]: Float64Index([], dtype='float64')

That's something I need to dig deeper into. If there is something obvious that I'm missing, any pointers would be appreciated.
Thanks for doing this PR! Beat me to it :)
A bit weak, but what do we think of just pd.testing.assert_frame_equal(expected, actual, check_dtype=False)?
Otherwise I would guess we have to go down the road of including the dtypes in the JSON representation?
Good point! And actually it only works if I assert it like this:

    pd.testing.assert_frame_equal(expected, result, check_dtype=False, check_index_type=False)

So both check_dtype and check_index_type have to be set to False in order to get the assertion right.
Thoughts on this?
Ignoring the dtype difference is not the solution. The point of this format is to persist that metadata.
What I would do is check that the proper type information for the index is being written out (you can use an io.StringIO instance instead of writing to None). If that appears correct, then there would be an issue with the reader ignoring or casting the type of the index after the fact.
Yeah, you're right, ignoring the dtype and the index_type will just hide the problem.
Did some initial testing and it seems that on the reading side, empty data with data.dtype == 'object' gets coerced to float64 without any clear reason.
I'll push a commit with a fix proposal for comments.
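The coercion can be reproduced independently of the JSON reader. This is a minimal illustration (my own sketch, not code from the PR) of why "try a numeric cast first" inference misfires only for empty data: an empty object array casts to float64 without error because there are no values to fail the cast.

```python
import pandas as pd

# An empty object Series casts cleanly to float64...
empty = pd.Series([], dtype='object')
coerced = empty.astype('float64')
print(empty.dtype, '->', coerced.dtype)  # object -> float64

# ...while a non-empty object Series of strings raises a ValueError,
# so speculative numeric casting is only silently wrong in the empty case.
try:
    pd.Series(['a', 'b'], dtype='object').astype('float64')
except ValueError:
    print('non-empty object data refuses the cast')
```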
pandas/io/json/json.py (Outdated)
@@ -686,7 +686,7 @@ def _try_convert_data(self, name, data, use_dtypes=True,

        result = False

-       if data.dtype == 'object':
+       if len(data) and data.dtype == 'object':
This is the fix that seems to solve the error.
Any thoughts on this?
The point of the JSON table schema is that we can be explicit about the types of the columns, so we shouldn't need to infer anything. Any way to avoid a call to this method altogether?
> The point of the JSON table schema is that we can be explicit about the types of the columns, so we shouldn't need to infer anything.
@WillAyd : I'm confused how this is relevant to the patch. It looks fine to me.
My point is that since the datatypes are explicitly defined in the schema that we shouldn't need any type of inference, which I get the impression this method is doing
> My point is that since the datatypes are explicitly defined in the schema that we shouldn't need any type of inference, which I get the impression this method is doing

Sure, but I'm still not seeing the relevance to this particular PR. The patch is pretty straightforward from what I see.
My personal opinion: I don't think option 2 makes much sense. As a user, I would rather have consistent, though undesirable, behavior (and leave the test a bit weak, ignoring dtypes, or perhaps marking it as an expected failure?) than have empty and non-empty DataFrames behave differently.
If at least non-empty DataFrames behaved as expected, and the bad behavior were confined to the corner case of empty ones, I could better see the case for living with inconsistent behavior.
Option 3 is obviously the best choice, though it spontaneously seems a bit overkill to me. But, given #21140, I would be glad to take a look this weekend and circle back with my best shot.
I swiftly went through the past commits related to the method and tried to find out the reasoning behind the implementation that way. First of all, it's quite an old method (5 years) and it has seen only a few modifications during its lifetime. On top of that, its functionality seems to be quite intricately tied to multiple use cases.
In that light, and admitting my view of the matter is quite limited, re-thinking the purpose (or existence) of that method looks like quite a fundamental task for this PR.
I completely get the point that the fix, with or without the len() check, would only be a compromise when looking at the whole picture.
Clearly, my opinion of the complexity is affected by the fact that this is my first contribution to this library :)
I agree with @ludaavics that 2 is the least desirable. I would say if you don't see an apparent solution to number 3 above, then create a separate issue about the coercion of object types to float with read_json and orient='table'. After that you can parametrize the test for check_dtypes, xfailing the strict check and placing a reference via a TODO comment to the issue around coercion.
@WillAyd : Yes, that sounds like a good plan. Thanks for bearing with me. 😄
Great, sounds like a plan! I'll make the new issue and the changes to the code accordingly. Thanks guys!
doc/source/whatsnew/v0.23.1.txt (Outdated)
@@ -92,7 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
- Bug in :func:`read_json` reading empty JSON schema with ``orient='table'`` back to :class:`DataFrame` caused an error (:issue:`21287`)
@pyryjook : If you could address the merge conflicts, that would be great.
Sure, I’ll certainly do that when I’m on my laptop again!
        out = df.to_json(orient='table')
        result = pd.read_json(out, orient='table')
        # TODO: After DF coercion issue (GH 21345) is resolved, tighten type checks
        tm.assert_frame_equal(expected, result,
Minor nit, but can you parametrize this with a "strict_check" parameter whose values can be True and False, with the former being marked as an xfail? You can see an example of this below:

    None, "idx", pytest.param("index", marks=pytest.mark.xfail),

The explicit xfail gives more visibility to the issue (I'm being overly cautious here).
Sure, I’ll make the change. Have to say that I appreciate your pedantics on these! 😊
Changes lgtm. I'm OK with merge if tests pass - thanks!
Force-pushed 7faa509 to af52815, then af52815 to 0a26bf8.
Great, I just resolved the merge conflicts in the whatsnew file.
doc/source/whatsnew/v0.23.1.txt (Outdated)
@@ -84,4 +85,4 @@ Reshaping

Other

- Tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`)
Why is this showing up in the diff?
Yeah, just noticed the same. It should not be there, sorry!
No worries! Merging branches can surprise you sometimes 😄
There was a removed newline; fixed.
Great, thanks! I did not have a chance to fix it during the weekend, but great that it was resolved already.
thanks @pyryjook
(pandas-dev#21318) (cherry picked from commit 415012f)
- [x] closes #21287
- [x] tests added / passed
- [x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
- [x] whatsnew entry

Fixes the bug occurring when an empty DataFrame, previously saved to a JSON file, is read from JSON back to a DataFrame.