Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix pd.json_normalize to not skip the first element of a generator input #38698

Merged
merged 11 commits into from
Dec 30, 2020

Conversation

avinashpancham
Copy link
Contributor

@avinashpancham
Copy link
Contributor Author

avinashpancham commented Dec 25, 2020

Consuming the generator seemed the easiest way to me. Since we also went for this approach in other location of the codebase I used it here as well.

Please LMK if we want to solve this is another way, i.e. not consuming the whole generator at once.

@@ -267,6 +267,11 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
data = [data]

if record_path is None:
if np.ndim(data) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about inserting a call to next(data) to skip past the first element?

It's always good to be as lazy as possible but tbf I don't know how much that matters here

Copy link
Contributor Author

@avinashpancham avinashpancham Dec 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arw2019 wdym with skipping past the first line with "next(data)"? Cause we do not want to skip the first element.

Below I've written down my understanding and solution to the problem. But lmk if you see this differently!

The problem here is that in the old situation the first line of an input generator was consumed and not part of the output dataframe. This was caused by this line: https://github.com/pandas-dev/pandas/blob/master/pandas/io/json/_normalize.py#L270.

By consuming the whole generator and putting it in a list we don't have that issue anymore

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay right your solution looks good

(There's actually already a TODO there about expanding the generator with nested records, as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah great. Wrt to the TODO, which one do you mean? I see this one: TODO: handle record value which are lists, at least error reasonably, but IMO that would justify its own PR since it is related to a different issue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, that TODO is beyond the scope of this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just put a list(data) on L264 (you could even limit this to a Iterable type if you want)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this entire function needs to be split up into module level functions and cleaned up. I believe we have several open issues about this (but orthogonal to this PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put list(data) on L264, but limited it to type Iterator and not Iterable since we use different conversion logic for dicts

Wrt to the splitting of the modules. I will add it to my list, but first I want to address some other PRs for dtypes.

@moink
Copy link
Member

moink commented Dec 26, 2020

Failures in CI caused by #38703

@@ -267,6 +267,11 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
data = [data]

if record_path is None:
if np.ndim(data) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just put a list(data) on L264 (you could even limit this to a Iterable type if you want)

@@ -267,6 +267,11 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
data = [data]

if record_path is None:
if np.ndim(data) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this entire function needs to be split up into module level functions and cleaned up. I believe we have several open issues about this (but orthogonal to this PR).

@jreback jreback added the IO JSON read_json, to_json, json_normalize label Dec 27, 2020
@@ -262,6 +262,11 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
if isinstance(data, list) and not data:
return DataFrame()

if isinstance(data, abc.Iterator):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make these if/elif (all 3 conditions)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jreback jreback added the Bug label Dec 28, 2020
@jreback jreback added this to the 1.3 milestone Dec 28, 2020
elif isinstance(data, abc.Iterator):
# GH35923 Fix pd.json_normalize to not skip the first element of a
# generator input
data = list(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could have some big performance implications when dealing with large generators - is it not alternately possible to just store the first element for inspection and reuse as necessary while maintaining the state of the generator?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we barely support generators (its not even documented), so -1 if this adds any complexity.

@WillAyd
Copy link
Member

WillAyd commented Dec 29, 2020 via email

@jreback jreback merged commit 52bdfdc into pandas-dev:master Dec 30, 2020
@jreback
Copy link
Contributor

jreback commented Dec 30, 2020

thanks @avinashpancham

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Inconsistent results using pd.json_normalize() on a generator object versus list (off by one)
5 participants