BUG ?: method .at[idx, "XXX"] generates InvalidIndexError in 1.4.0 or 1.4.1 but not in 1.3.5 #46036

Closed
2 of 3 tasks
thomas-lacroix opened this issue Feb 17, 2022 · 9 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Needs Info Clarification about behavior needed to assess issue

Comments

@thomas-lacroix

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

This bug occurs with my real-life data, but I am unable to reproduce it with mock data modeled on that data. The following code works fine:

import pandas as pd
print(pd.__version__)
df = pd.DataFrame([[1, 1, 129, 'WP_158394508.1', '+', 132906, 134099, 397, '-', 'Tyrosine integrase', 'Tyrosine integrase', 'Tyrosine integrase', 'WP_011835230', 357.0, 326.0, 25.46, 5.270e-15, 69.3, 72.29, 88.8, 'Integrase', 'ICE', 'Tn5252', 'pRS01', '-', '-', '-', '-', '-', 'validated', 'no', 'Tyrosine integrase', 'Tyrosine integrase', 'Phage_integrase', 172.0, 4.600e-40, 3.300e-40, 123.3, 0.6, 123.8, 0.6, 98.84, 43.58, 'yes']], index=[0], columns=['hit_blast', 'hit_HMM', 'CDS_num', 'CDS', 'CDS_strand', 'CDS_start', 'CDS_end', 'CDS_length', 'Is_pseudo', 'CDS_Protein_type', 'CDS_Protein_type_blast', 'Blast_description', 'Query_blast', 'Query_blast_length', 'Ali_length', 'Ali_Identity_perc', 'E-value_blast', 'Bitscore_blast', 'CDS_coverage_blast', 'Query_blast_coverage', 'Query_blast_Protein_type', 'Associated_element_type', 'ICE_superfamily', 'ICE_family', 'IME_family', 'Relaxase_family_domain', 'Relaxase_family_MOB', 'Coupling_type', 'False_positives', 'SP_blast_validation', 'Use_annotation', 'Profile_Protein_type', 'Profile_description', 'Profile_name', 'Profile_length', 'i-Evalue_hmm', 'E-value_hmm', 'Score_hmm', 'Bias_hmm', 'Global_score', 'Global_bias', 'HMM_coverage', 'CDS_coverage_hmm', 'Possible_SP'])
idx = 0
df.at[idx, "False_positives"] = "-"

Issue Description

Upgrading to pandas version 1.4.0 or 1.4.1 causes a call to .at[idx, "XXX"] to raise an InvalidIndexError:

Traceback (most recent call last):
  File "XXX.py", line XXX, in XXX
    data.at[idx, "False_positives"] = "-"
  File "lib/python3.9/site-packages/pandas/core/indexing.py", line 2274, in __setitem__
    return super().__setitem__(key, value)
  File "/python3.9/site-packages/pandas/core/indexing.py", line 2229, in __setitem__
    self.obj._set_value(*key, value=value, takeable=self._takeable)
  File "/python3.9/site-packages/pandas/core/frame.py", line 3869, in _set_value
    loc = self.index.get_loc(index)
  File "/python3.9/site-packages/pandas/core/indexes/range.py", line 388, in get_loc
    self._check_indexing_error(key)
  File "/python3.9/site-packages/pandas/core/indexes/base.py", line 5637, in _check_indexing_error
    raise InvalidIndexError(key)
pandas.errors.InvalidIndexError: Int64Index([0], dtype='int64')

This error does not occur with pandas version 1.3.5. Can you help figure this out?
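The last line of the traceback shows that the key reaching get_loc was Int64Index([0], dtype='int64'), a one-element Index rather than a scalar. The following is a minimal sketch of how that can happen, assuming idx was taken from the .index of a filtered frame (an assumption, since the real code is not shown). The small frame and the `outcome` variable are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_scalar

# Hypothetical stand-in for the real data (the real frame is not shown).
df = pd.DataFrame({"False_positives": ["x", "y"]})

# Assumption: idx comes from a filter, so it is an Index of length 1,
# not the scalar label 0.
idx = df[df["False_positives"] == "x"].index
print(type(idx).__name__, is_scalar(idx))

# .at is a scalar accessor; depending on the pandas version, a non-scalar
# row key is either accepted or rejected with InvalidIndexError.
try:
    df.at[idx, "False_positives"] = "-"
    outcome = "accepted"
except pd.errors.InvalidIndexError:
    outcome = "InvalidIndexError"
print(outcome)

# Unwrapping the single label gives .at the scalar it expects.
df.at[idx[0], "False_positives"] = "-"
print(df.at[0, "False_positives"])
```

If the real idx really is a one-element Index, unwrapping it with idx[0] or switching to .loc are the two obvious options.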

Expected Behavior

No exception should be raised.

Installed Versions

import pandas as pd
pd.show_versions()

INSTALLED VERSIONS

commit : 06d2301
python : 3.9.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-27-generic
Version : #29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 59.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

@thomas-lacroix thomas-lacroix added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 17, 2022
@phofl
Member

phofl commented Feb 18, 2022

Hi,

Thanks for your report. I am sorry, but we can't really help if you cannot provide a reproducible example.

@phofl phofl added Indexing Related to indexing on series/frames, not to indexes themselves Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 18, 2022
@thomas-lacroix
Author

I understand. I am not even sure whether my issue is due to a bug or whether I need to adapt my code somehow. I probably cannot make a short reproducible snippet because I am not a pandas expert. If you are interested, I could give you a set of command lines that reproduce the issue, but it would not be just a Python snippet: it would download and execute our tool.
I think it is not normal that a line of code that produced no warning with version 1.3.5 produces an error with version 1.4.0. The only change I make to trigger the error is conda install -c conda-forge pandas=1.4.0; if I revert with conda install -c conda-forge pandas=1.3.5, the error goes away on the same dataset. Looking at the list of changes for 1.4.0, I couldn't find anything related to the .at method that I would need to change in my code. I was hoping to find someone who knows the changes between those two versions of pandas and can understand what is going on by looking at the error stack. But I think you are right, maybe Stack Overflow is more appropriate for that.

@jreback jreback added this to the No action milestone Feb 18, 2022
@jreback jreback closed this as completed Feb 18, 2022
@phofl
Member

phofl commented Feb 18, 2022

This is hard to judge without knowing the data. Depending on the content of a DataFrame, the expected behavior might be different. Feel free to ping me if you are able to create an example. Otherwise, Stack Overflow might help, as you suggested.

@adamzev

adamzev commented Mar 21, 2022

We are having the same issue.

Here is a reproducible example:

import pandas as pd
data_df = pd.DataFrame(data={'name':['a','a', 'b','a'], 'combine_id': [None, None, None, None]})
target_id_dataframe = pd.DataFrame(data={'index':[0,1, 3], 'other_col':[1,2,3]})
data_df.at[target_id_dataframe['index'], "combine_id"]= 7

The code throws an error because the combine_id column already exists. There's no error if it doesn't exist yet.

The example completes without an error in 1.3.5 but throws the error in 1.4.1.

Here's the traceback

InvalidIndexError                         Traceback (most recent call last)
Input In [19], in <cell line: 1>()
----> 1 data_df.at[target_id_dataframe['index'], "combine_id"]= 7

File ~/.local/lib/python3.9/site-packages/pandas/core/indexing.py:2274, in _AtIndexer.__setitem__(self, key, value)
   2271     self.obj.loc[key] = value
   2272     return
-> 2274 return super().__setitem__(key, value)

File ~/.local/lib/python3.9/site-packages/pandas/core/indexing.py:2229, in _ScalarAccessIndexer.__setitem__(self, key, value)
   2226 if len(key) != self.ndim:
   2227     raise ValueError("Not enough indexers for scalar access (setting)!")
-> 2229 self.obj._set_value(*key, value=value, takeable=self._takeable)

File ~/.local/lib/python3.9/site-packages/pandas/core/frame.py:3869, in DataFrame._set_value(self, index, col, value, takeable)
   3867 else:
   3868     series = self._get_item_cache(col)
-> 3869     loc = self.index.get_loc(index)
   3871 # setitem_inplace will do validation that may raise TypeError
   3872 #  or ValueError
   3873 series._mgr.setitem_inplace(loc, value)

File ~/.local/lib/python3.9/site-packages/pandas/core/indexes/range.py:388, in RangeIndex.get_loc(self, key, method, tolerance)
    386         except ValueError as err:
    387             raise KeyError(key) from err
--> 388     self._check_indexing_error(key)
    389     raise KeyError(key)
    390 return super().get_loc(key, method=method, tolerance=tolerance)

File ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py:5637, in Index._check_indexing_error(self, key)
   5633 def _check_indexing_error(self, key):
   5634     if not is_scalar(key):
   5635         # if key is not a scalar, directly raise an error (the code below
   5636         # would convert to numpy arrays and raise later any way) - GH29926
-> 5637         raise InvalidIndexError(key)

Update:
Switching from .at to .loc fixes our issue, so perhaps the old behavior was a bug that has since been fixed: we had been using .at to update multiple values but shouldn't have been?
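For reference, here is a sketch of the same example rewritten with .loc, which accepts a list of row labels. The variable `rows` is introduced only for illustration.

```python
import pandas as pd

data_df = pd.DataFrame(
    data={"name": ["a", "a", "b", "a"], "combine_id": [None, None, None, None]}
)
target_id_dataframe = pd.DataFrame(data={"index": [0, 1, 3], "other_col": [1, 2, 3]})

# .loc is the label-based accessor for one or more rows;
# .at is reserved for a single (row, column) cell.
rows = target_id_dataframe["index"].tolist()
data_df.loc[rows, "combine_id"] = 7

print(data_df)
```

Rows 0, 1 and 3 receive the value 7 and row 2 is left untouched.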

Looking at pandas/core/indexes/range.py:388 in get_loc, _check_indexing_error throws an InvalidIndexError, which isn't handled. If a KeyError were allowed to be thrown instead (I'm not sure whether that was the old behavior), it would be handled and passed off to .loc (self.loc[index, col] = value).
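The control flow described above can be illustrated with a toy model in plain Python. This is not pandas source code: the class and functions below are hypothetical stand-ins. The sketch assumes that the fallback handler catches KeyError and that InvalidIndexError is not a KeyError subclass, which is what the traceback suggests.

```python
class InvalidIndexError(Exception):
    """Toy stand-in for pandas.errors.InvalidIndexError."""


def get_loc(labels, key):
    # Non-scalar keys are rejected outright; missing labels raise KeyError.
    if isinstance(key, (list, tuple)):
        raise InvalidIndexError(key)
    if key not in labels:
        raise KeyError(key)
    return labels.index(key)


def set_value(labels, values, key, value):
    try:
        values[get_loc(labels, key)] = value
        return "fast path"
    except KeyError:
        # Fallback playing the role of self.loc[index, col] = value:
        # here it simply appends the missing label.
        labels.append(key)
        values.append(value)
        return "fallback"


labels, values = [0, 1], ["a", "b"]
print(set_value(labels, values, 1, "z"))  # existing label
print(set_value(labels, values, 5, "q"))  # missing label, handled by the fallback
try:
    set_value(labels, values, [0, 1], "w")  # list key escapes the KeyError handler
except InvalidIndexError as err:
    print("unhandled:", err)
```

In this model only the KeyError branch reaches the fallback, so a list-like key surfaces as an error instead of being delegated.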

@Enterprise-D

Enterprise-D commented Jul 14, 2022

I have this problem, too. Indices from .index will trigger the same error when using df.at[indices,...]. After rolling back from 1.4.2 to 1.3.5, the problem disappeared.

@phofl
Member

phofl commented Jul 15, 2022

Can you give us a reproducible example?

@Enterprise-D

import pandas as pd
full_table_file = '../Data/protein_domain_data/cddid.tbl'
min_domain_length = 30

df = pd.read_csv(full_table_file, sep='\t', header=None, index_col=0)
print(df.shape)
###Filter tiny little families to make life easier
df = df[df[4]>min_domain_length]
print(df.shape)
df.head()

###Identify the domains related to the following search terms
case_insensitive_search_terms = ['integrase', 'excisionase', 'recombinase',
                                 'transposase', 'lysogen', 'temperate']
case_sensitive_search_terms = ['parA|ParA|parB|ParB']

for search_term in case_insensitive_search_terms:
    indices = df[df[3].str.contains(search_term, case=False)==True].index
    df[search_term] = 0
    df.at[indices, search_term] = 1

for search_term in case_sensitive_search_terms:
    indices = df[df[3].str.contains(search_term, case=True)==True].index
    df[search_term] = 0
    df.at[indices, search_term] = 1

python 3.8
pandas 1.4.2
macOS 12.4 on M1 Pro

you will also need the table file https://www.icloud.com/iclouddrive/02cY4kezLwkVj5mV620oXpj0g#cddid
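Since the table file is external, here is a sketch of the same flagging loop on a small made-up frame, using .loc for the multi-row assignment. The rows, index labels and the helper name `flag_rows` are hypothetical; only the column positions 3 and 4 and the search terms follow the snippet above.

```python
import pandas as pd

# Hypothetical stand-in for cddid.tbl: column 3 holds descriptions,
# column 4 holds domain lengths.
df = pd.DataFrame(
    {
        3: [
            "Phage integrase family",
            "ABC transporter",
            None,
            "ParB-like nuclease",
            "Site-specific Recombinase XerD",
        ],
        4: [120, 250, 80, 95, 300],
    },
    index=[101, 102, 103, 104, 105],
)


def flag_rows(frame, pattern, case):
    """Add a 0/1 column named after `pattern`, marking rows whose description matches."""
    # na=False treats missing descriptions as non-matches.
    mask = frame[3].str.contains(pattern, case=case, na=False)
    indices = frame[mask].index
    frame[pattern] = 0
    # .loc accepts a list-like of row labels; .at expects one scalar label.
    frame.loc[indices, pattern] = 1


for search_term in ["integrase", "recombinase"]:
    flag_rows(df, search_term, case=False)

for search_term in ["parA|ParA|parB|ParB"]:
    flag_rows(df, search_term, case=True)

print(df)
```

The boolean mask could also be written to the column directly with mask.astype(int), which avoids the intermediate index lookup.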

@phofl
Member

phofl commented Jul 18, 2022

@adamzev .at is only meant for single values, so there are no guarantees for multiple values.

@Enterprise-D: We need a minimal and reproducible example, see https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@thomas-lacroix
Author

From what I understand, for pandas version 1.4.0 and up, the .at method fails with an InvalidIndexError when it is given an index list, even one that holds a single value. Switching to the .loc method for index lists of size 1 or more should work. See the answer from Mark Greenwood at https://stackoverflow.com/questions/71293357/upgrading-to-pandas-version-1-4-0-or-1-4-1-causes-a-call-to-the-method-atidx/71545633?noredirect=1#comment126506658_71545633
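One way to apply this in code that sometimes has a single label and sometimes a list of labels is a small dispatch helper. The sketch below is hypothetical (the name `set_cells` is not a pandas API) and assumes a flat, non-MultiIndex row index.

```python
import pandas as pd
from pandas.api.types import is_scalar


def set_cells(frame, rows, col, value):
    """Hypothetical helper: use .at for one label, .loc for a list-like of labels."""
    if is_scalar(rows):
        frame.at[rows, col] = value
    else:
        # Covers lists, arrays and Index objects, including those of length 1.
        frame.loc[list(rows), col] = value


df = pd.DataFrame({"flag": ["a", "b", "c"]})
set_cells(df, 1, "flag", "-")              # scalar label
set_cells(df, df.index[[0]], "flag", "-")  # one-element Index
set_cells(df, [0, 2], "flag", "+")         # several labels
print(df)
```

Using .loc everywhere would also work; the helper only keeps the scalar fast path where it applies.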


5 participants