Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pd.json_normalize on a column loses rows that have an empty list for that column #36245

Closed
2 of 3 tasks
ldacey opened this issue Sep 9, 2020 · 5 comments
Closed
2 of 3 tasks
Labels
Bug IO JSON read_json, to_json, json_normalize

Comments

@ldacey
Copy link

ldacey commented Sep 9, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

We lose the first row when we normalize the "sections" field:

import pandas as pd

import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3],
                   'sections': [[], [{"id": 1, "answers": [{"id": 1, "value": 10, "question_id": 1}]}], [{"id": 2, "answers": []}]]
                   })
df2 = pd.json_normalize(df.to_dict(orient="records"), meta=["id"], record_path="sections", record_prefix="section_")
df2

The problem would then impact the section_answers field if that column gets normalized (we end up with only a single row for ID 2)

pd.json_normalize(df2.to_dict(orient="records"), meta=["id", "section_id"], record_path="section_answers", record_prefix="answer_")

Problem description

I am losing the first row when I normalize the "sections" field in my example. This is because the field contains an empty list.

The end result is that I have a dataframe with fewer unique IDs than I started out with which is not desirable since the distinct count of these IDs are used as a metric.

Expected Output

I would expect one at least one row per meta column that I passed to pd.json_normalize. I do not think that I should lose the row ID 1 without some type of warning at least:

image

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : 1.0.6
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0

@ldacey ldacey added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 9, 2020
@phofl
Copy link
Member

phofl commented Sep 9, 2020

Hi,

thanks for your report. I produced this error on master.

Could you please provide a minmal example? Your example has a lot of data, which do not seem relevant to the actual problem.
(see http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports)

@phofl phofl added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 9, 2020
@ldacey
Copy link
Author

ldacey commented Sep 9, 2020

Hi @phofl - I updated my example to something simpler.

Hopefully it is still clear that my list of dict columns can contain lists of dicts as well, so as I expand each one into rows (if required), the problem would impact any rows with an empty list. I think ideally we could at least keep the rows and meta columns.

@tiagohonorato
Copy link
Contributor

A workaround for this issue would be to insert an empty dict inside the list:

df = pd.DataFrame({'id': [1, 2, 3],
                   'sections': [[{}], [{"id": 1, "answers": [{"id": 1, "value": 10, "question_id": 1}]}], [{"id": 2, "answers": []}]]
                   })

@phofl phofl added IO JSON read_json, to_json, json_normalize and removed Needs Info Clarification about behavior needed to assess issue labels Sep 9, 2020
@ldacey
Copy link
Author

ldacey commented Sep 11, 2020

@tiagohonorato

Nice - I think that works.

I ran into another issue related to meta columns if the column contains list-types. Converting these to strings seems to have done the trick though.

  1. Make sure none of my meta columns are lists (convert to strings if they contain [])
  2. Fill the field I am normalizing with the string "[{}]", and then use literal_eval on it to convert it into a list
  3. Skip fields which are entirely empty lists

This seems to be a lot of code to accomplish my goal of exploding/melting nested data into separate rows while keeping all of the original rows. Let me know if you happen to see a glaring flaw or inefficiency please.

def convert_lists_to_strings(df):
    for col in df.select_dtypes(include="object").columns:
        if (df[col].apply(type) == list).any(0):
            df[col] = df[col].astype(str).replace("[]", np.nan).replace("nan", np.nan)
    return df


def unnest_json_column(
    df, field, remove_columns=[], prefix="", max_level=None, sep="_", errors="raise"
):
    from ast import literal_eval
    
    df = convert_lists_to_strings(df)
    if (df[field].apply(type) != list).any(0):
        df[field] = df[field].astype(str).str.replace("nan", "[{}]").apply(literal_eval)
    if df[field].astype(bool).any():
        if isinstance(remove_columns, list) and len(remove_columns) > 0:
            meta = [col for col in df.columns if col not in remove_columns]
        else:
            meta = None
        return pd.json_normalize(
            data=df.to_dict(orient="records"),
            record_path=field,
            meta=meta,
            record_prefix=prefix,
            max_level=max_level,
            sep=sep,
            errors=errors,
        ).rename(columns=lambda x: x.replace("__", "_"))
    else:
        print(f"{field} only contains empty lists, skipping")
        return df

@averhagen
Copy link

Similar issue here:
#21830

@ldacey ldacey closed this as completed Sep 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

No branches or pull requests

4 participants