BUG: pd.json_normalize on a column loses rows that have an empty list for that column #36245
Hi, thanks for your report. I reproduced this error on master. Could you please provide a minimal example? Your example has a lot of data, most of which does not seem relevant to the actual problem.
Hi @phofl - I updated my example to something simpler. Hopefully it is still clear that my list-of-dicts columns can contain lists of dicts themselves, so as I expand each one into rows (where required), the problem would affect any row with an empty list. Ideally, we could at least keep the rows and meta columns.
A workaround for this issue would be to insert an empty dict inside the list:

```python
df = pd.DataFrame({
    "id": [1, 2, 3],
    "sections": [
        [{}],
        [{"id": 1, "answers": [{"id": 1, "value": 10, "question_id": 1}]}],
        [{"id": 2, "answers": []}],
    ],
})
```
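For illustration, a minimal sketch (column names are assumptions mirroring the example above) showing why this workaround helps: seeding the list with an empty dict keeps the row, and the record columns simply come back as NaN:

```python
import pandas as pd

# Hypothetical data: an empty dict (rather than an empty list)
# stands in for "no sections" on id 1.
df = pd.DataFrame({
    "id": [1, 2],
    "sections": [[{}], [{"section_id": 10}]],
})

out = pd.json_normalize(
    df.to_dict(orient="records"),
    record_path="sections",
    meta=["id"],
)
# Both ids survive; id 1's record columns are NaN.
print(out)
```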
Nice - I think that works. I ran into another issue with meta columns when the column contains list types; converting those to strings seems to have done the trick, though.
This seems like a lot of code to accomplish my goal of exploding/melting nested data into separate rows while keeping all of the original rows. Please let me know if you see a glaring flaw or inefficiency.
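One possible approach to that goal (a sketch not taken from this thread; data and column names are illustrative) is to explode the list column first, since `DataFrame.explode` keeps rows whose list is empty, and only then flatten the dicts:

```python
import pandas as pd

# Hypothetical input: id 1 has an empty "sections" list.
df = pd.DataFrame({
    "id": [1, 2],
    "sections": [[], [{"section_id": 10, "answers": []}]],
})

# explode() preserves the empty-list row, turning its value into NaN.
exploded = df.explode("sections").reset_index(drop=True)

# Replace non-dict entries (the NaN from empty lists) with {} so
# json_normalize can flatten every row uniformly.
records = [s if isinstance(s, dict) else {} for s in exploded["sections"]]
details = pd.json_normalize(records).add_prefix("section_")

# Recombine: every original id survives; id 1's section columns are NaN.
result = pd.concat([exploded.drop(columns="sections"), details], axis=1)
print(result)
```

The key design point is that the row-preserving step (`explode`) is separated from the dict-flattening step (`json_normalize`), so empty lists never silently drop rows.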
Similar issue here:
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
We lose the first row when we normalize the "sections" field:
The problem then also affects the section_answers field if that column is normalized (we end up with only a single row, for ID 2):
```python
pd.json_normalize(
    df2.to_dict(orient="records"),
    meta=["id", "section_id"],
    record_path="section_answers",
    record_prefix="answer_",
)
```
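For reference, a minimal self-contained reproduction of the first step (the original frame construction was not included above, so the data values here are assumptions based on the workaround example):

```python
import pandas as pd

# id 1 has an empty "sections" list, which triggers the bug.
df = pd.DataFrame({
    "id": [1, 2],
    "sections": [[], [{"id": 1, "answers": [{"id": 1, "value": 10}]}]],
})

df2 = pd.json_normalize(
    df.to_dict(orient="records"),
    record_path="sections",
    meta=["id"],
    record_prefix="section_",
)
# id 1 is silently dropped because its "sections" list was empty.
print(df2)
```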
Problem description
I lose the first row when I normalize the "sections" field in my example, because that field contains an empty list.
The end result is a dataframe with fewer unique IDs than I started with, which is undesirable since the distinct count of these IDs is used as a metric.
Expected Output
I would expect at least one row per input record, carrying the meta columns I passed to pd.json_normalize. I do not think I should lose the row for ID 1 without at least some type of warning:
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : 1.0.6
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0