BUG: Fix pd.json_normalize to not skip the first element of a generator input #38698

avinashpancham · 2020-12-25T14:57:06Z

closes BUG: Inconsistent results using pd.json_normalize() on a generator object versus list (off by one) #35923
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

avinashpancham · 2020-12-25T15:00:01Z

Consuming the generator seemed the easiest way to me. Since we also went for this approach in other locations of the codebase, I used it here as well.

Please LMK if we want to solve this in another way, i.e. without consuming the whole generator at once.

arw2019 · 2020-12-26T06:59:06Z

pandas/io/json/_normalize.py

@@ -267,6 +267,11 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
        data = [data]

    if record_path is None:
+        if np.ndim(data) == 0:


What about inserting a call to next(data) to skip past the first element?

It's always good to be as lazy as possible but tbf I don't know how much that matters here

@arw2019 wdym by skipping past the first element with "next(data)"? We do not want to skip the first element.

Below I've written down my understanding and solution to the problem. But lmk if you see this differently!

The problem here is that in the old situation the first element of an input generator was consumed and therefore not part of the output DataFrame. This was caused by this line: https://github.com/pandas-dev/pandas/blob/master/pandas/io/json/_normalize.py#L270.

By consuming the whole generator and putting it in a list we don't have that issue anymore.
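
A minimal sketch of the failure mode described above, in plain Python rather than the pandas internals. The names `records`, `peek_then_collect` and `collect_all` are hypothetical and only illustrate the pattern: any code that pulls an item from a generator in order to inspect it has consumed that item, whereas materializing the generator with list() first keeps every record.

```python
def records():
    """Hypothetical generator yielding JSON-like records."""
    yield {"a": 0}
    yield {"a": 1}
    yield {"a": 2}


def peek_then_collect(data):
    # Inspect the first item to decide how to process the input.
    # On a generator this advances it, so the item is gone afterwards.
    first = next(iter(data))
    return first, list(data)


def collect_all(data):
    # Materialize the generator first; later inspection is then harmless.
    data = list(data)
    first = next(iter(data))
    return first, data


_, lossy = peek_then_collect(records())
_, complete = collect_all(records())
print(len(lossy), len(complete))
```

The first helper returns one record fewer than the generator produced; the second returns all of them.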

Okay right your solution looks good

(There's actually already a TODO there about expanding the generator with nested records, as well)

Ah great. Wrt the TODO, which one do you mean? I see this one: "TODO: handle record value which are lists, at least error reasonably", but IMO that would justify its own PR since it relates to a different issue.

Agreed, that TODO is beyond the scope of this PR

you can just put a list(data) on L264 (you could even limit this to an Iterable type if you want)

this entire function needs to be split up into module level functions and cleaned up. I believe we have several open issues about this (but orthogonal to this PR).

Put list(data) on L264, but limited it to type Iterator rather than Iterable, since dicts are also Iterable and we use different conversion logic for them.
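
The distinction can be checked with the standard library's collections.abc: a dict is Iterable but not an Iterator, while a generator is both. A minimal sketch, assuming a hypothetical helper `materialize_iterator` that is not the pandas implementation:

```python
from collections import abc


def materialize_iterator(data):
    """Hypothetical helper: turn one-shot iterators into lists, leave the rest."""
    if isinstance(data, abc.Iterator):
        return list(data)
    return data


gen = ({"a": i} for i in range(3))
as_dict = {"a": 0}

# A dict is Iterable but not an Iterator; a generator is both.
print(isinstance(as_dict, abc.Iterable), isinstance(as_dict, abc.Iterator))
print(isinstance(gen, abc.Iterable), isinstance(gen, abc.Iterator))

# Only the generator is converted; the dict is passed through untouched.
print(materialize_iterator(gen))
print(materialize_iterator(as_dict))
```

Checking for Iterator therefore catches generators without also matching dicts, which need their own branch.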

Wrt splitting up the function: I will add it to my list, but first I want to address some other PRs for dtypes.

moink · 2020-12-26T14:10:26Z

Failures in CI caused by #38703

jreback · 2020-12-27T23:06:35Z

pandas/io/json/_normalize.py

@@ -267,6 +267,11 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
        data = [data]

    if record_path is None:
+        if np.ndim(data) == 0:


you can just put a list(data) on L264 (you could even limit this to an Iterable type if you want)

jreback · 2020-12-27T23:07:16Z

pandas/io/json/_normalize.py

@@ -267,6 +267,11 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
        data = [data]

    if record_path is None:
+        if np.ndim(data) == 0:


this entire function needs to be split up into module level functions and cleaned up. I believe we have several open issues about this (but orthogonal to this PR).

jreback · 2020-12-28T17:04:00Z

pandas/io/json/_normalize.py

@@ -262,6 +262,11 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
    if isinstance(data, list) and not data:
        return DataFrame()

+    if isinstance(data, abc.Iterator):


can you make these if/elif (all 3 conditions)?

WillAyd · 2020-12-29T18:36:28Z

pandas/io/json/_normalize.py

+    elif isinstance(data, abc.Iterator):
+        # GH35923 Fix pd.json_normalize to not skip the first element of a
+        # generator input
+        data = list(data)


This could have some big performance implications when dealing with large generators - is it not alternatively possible to just store the first element for inspection and reuse it as necessary while maintaining the state of the generator?
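
One common way to do what is suggested here, sketched with the standard library only. The helper name `peek` is hypothetical and this is not code from the PR: it takes the first element, then chains it back in front of the remaining iterator so that downstream code still sees every element lazily.

```python
import itertools


def peek(iterable):
    """Return (first_item, iterator_that_still_yields_everything).

    Hypothetical helper; raises StopIteration if the input is empty.
    """
    iterator = iter(iterable)
    first = next(iterator)
    # Put the inspected item back in front of the untouched remainder.
    return first, itertools.chain([first], iterator)


gen = ({"a": i} for i in range(3))
first, gen = peek(gen)
print(first)
print(list(gen))
```

Only the peeked element is held in memory ahead of the consumer, at the cost of one extra wrapper around the generator.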

we barely support generators (it's not even documented), so -1 if this adds any complexity.

WillAyd · 2020-12-29T19:23:30Z

Fair point. I might be misremembering but I feel like we do something similar for other IO methods, so a common way of doing that would be nice. But agreed, for the scope of this PR converting to a list is fine.


jreback · 2020-12-30T14:08:38Z

thanks @avinashpancham

BUG: Fix pd.json_normalize to not skip the first element of a generat…

898ccdf

…or input

arw2019 reviewed Dec 26, 2020

Merge remote-tracking branch 'upstream/master' into GH35923

66f5ba6

jreback requested changes Dec 27, 2020

jreback added the IO JSON read_json, to_json, json_normalize label Dec 27, 2020

avinashpancham added 2 commits December 28, 2020 14:37

Address comments

8537705

Merge remote-tracking branch 'upstream/master' into GH35923

e885c86

jreback requested changes Dec 28, 2020

jreback added the Bug label Dec 28, 2020

jreback added this to the 1.3 milestone Dec 28, 2020

avinashpancham added 2 commits December 28, 2020 18:06

Address comments

89546ab

Remove whitespace

bf25860

jreback reviewed Dec 28, 2020

WillAyd reviewed Dec 29, 2020

avinashpancham added 5 commits December 30, 2020 01:22

Raise NotImplementedError for an input that is not an Iterable

9f1f2f9

Rename test name

775f80e

Merge remote-tracking branch 'upstream/master' into GH35923

f9e5332

Merge remote-tracking branch 'upstream/master' into GH35923

cc9cfae

Merge remote-tracking branch 'upstream/master' into GH35923

2ca6be7

jreback approved these changes Dec 30, 2020

jreback merged commit 52bdfdc into pandas-dev:master Dec 30, 2020

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

BUG: Fix pd.json_normalize to not skip the first element of a generat…

e95dd5d

…or input (pandas-dev#38698)

simonjayhawkins mentioned this pull request Jun 3, 2022

json_normalize skips an entry in a pymongo cursor #30323

Closed
