BUG: pd.json_normalize on a column loses rows that have an empty list for that column #36245
Hi, thanks for your report. I reproduced this error on master. Could you please provide a minimal example? Your example has a lot of data, most of which does not seem relevant to the actual problem.
Hi @phofl - I updated my example to something simpler. Hopefully it is still clear that my list-of-dicts columns can contain lists of dicts themselves, so as I expand each one into rows (where required), the problem would affect any row with an empty list. Ideally, we could at least keep the rows and meta columns.
A workaround for this issue would be to insert an empty dict inside the list:

```python
df = pd.DataFrame({
    "id": [1, 2, 3],
    "sections": [
        [{}],
        [{"id": 1, "answers": [{"id": 1, "value": 10, "question_id": 1}]}],
        [{"id": 2, "answers": []}],
    ],
})
```
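For illustration, a minimal sketch (column names are assumptions mirroring the example above) showing why this workaround helps: seeding the list with an empty dict keeps the row, and the record columns simply come back as NaN:

```python
import pandas as pd

# Hypothetical data: an empty dict (rather than an empty list)
# stands in for "no sections" on id 1.
df = pd.DataFrame({
    "id": [1, 2],
    "sections": [[{}], [{"section_id": 10}]],
})

out = pd.json_normalize(
    df.to_dict(orient="records"),
    record_path="sections",
    meta=["id"],
)
# Both ids survive; id 1's record columns are NaN.
print(out)
```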
Nice - I think that works. I ran into another issue with meta columns when the column contains list types; converting those to strings seems to have done the trick, though.
This seems like a lot of code to accomplish my goal of exploding/melting nested data into separate rows while keeping all of the original rows. Please let me know if you see a glaring flaw or inefficiency.
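One possible approach to that goal (a sketch not taken from this thread; data and column names are illustrative) is to explode the list column first, since `DataFrame.explode` keeps rows whose list is empty, and only then flatten the dicts:

```python
import pandas as pd

# Hypothetical input: id 1 has an empty "sections" list.
df = pd.DataFrame({
    "id": [1, 2],
    "sections": [[], [{"section_id": 10, "answers": []}]],
})

# explode() preserves the empty-list row, turning its value into NaN.
exploded = df.explode("sections").reset_index(drop=True)

# Replace non-dict entries (the NaN from empty lists) with {} so
# json_normalize can flatten every row uniformly.
records = [s if isinstance(s, dict) else {} for s in exploded["sections"]]
details = pd.json_normalize(records).add_prefix("section_")

# Recombine: every original id survives; id 1's section columns are NaN.
result = pd.concat([exploded.drop(columns="sections"), details], axis=1)
print(result)
```

The key design point is that the row-preserving step (`explode`) is separated from the dict-flattening step (`json_normalize`), so empty lists never silently drop rows.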
Similar issue here:
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
We lose the first row when we normalize the "sections" field:
The problem then also affects the section_answers field if that column is normalized (we end up with only a single row, for ID 2):
```python
pd.json_normalize(
    df2.to_dict(orient="records"),
    meta=["id", "section_id"],
    record_path="section_answers",
    record_prefix="answer_",
)
```
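For reference, a minimal self-contained reproduction of the first step (the original frame construction was not included above, so the data values here are assumptions based on the workaround example):

```python
import pandas as pd

# id 1 has an empty "sections" list, which triggers the bug.
df = pd.DataFrame({
    "id": [1, 2],
    "sections": [[], [{"id": 1, "answers": [{"id": 1, "value": 10}]}]],
})

df2 = pd.json_normalize(
    df.to_dict(orient="records"),
    record_path="sections",
    meta=["id"],
    record_prefix="section_",
)
# id 1 is silently dropped because its "sections" list was empty.
print(df2)
```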
Problem description
I lose the first row when I normalize the "sections" field in my example, because that field contains an empty list.
The end result is a dataframe with fewer unique IDs than I started with, which is undesirable since the distinct count of these IDs is used as a metric.
Expected Output
I would expect at least one row per input record, carrying the meta columns I passed to pd.json_normalize. I do not think I should lose the row for ID 1 without at least some type of warning:
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : 1.0.6
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0