API: Warn about dups in names for read_csv #17346
Codecov Report
@@ Coverage Diff @@
## master #17346 +/- ##
==========================================
- Coverage 91.26% 91.24% -0.02%
==========================================
Files 163 163
Lines 49776 49783 +7
==========================================
- Hits 45426 45424 -2
- Misses 4350 4359 +9
pandas/io/parsers.py
Outdated
counts = {}
warn_dups = False

for name in names:
just use set intersection here
How so? This is a fail-early method, which is why I chose it.
simply check for len(names) != len(set(names)).
much more idiomatic
Fair enough. Done.
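The two approaches discussed in this thread can be sketched side by side. This is only an illustration, not the code from the PR: the function names `has_dups_loop` and `has_dups_set` are hypothetical, and both versions assume the names are hashable.

```python
def has_dups_loop(names):
    """Fail-early scan: return as soon as a repeated name is seen."""
    seen = set()
    for name in names:
        if name in seen:
            return True
        seen.add(name)
    return False


def has_dups_set(names):
    """Reviewer's suggestion: a list with duplicates is longer than its set."""
    return len(names) != len(set(names))
```

The loop can stop at the first repeat, while the set comparison always builds the full set; for typical column-name lists the difference is negligible and the one-liner is easier to read.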
pandas/io/parsers.py
Outdated
counts[name] = True

if warn_dups:
    msg = ("Duplicate names specified. This "
so are we deprecating this? then this should be a FutureWarning.
Fair enough. Done.
1497183 to b1a7a4a
pandas/io/parsers.py
Outdated
@@ -406,6 +438,10 @@ def _read(filepath_or_buffer, kwds):
    chunksize = _validate_integer('chunksize', kwds.get('chunksize', None), 1)
    nrows = _validate_integer('nrows', kwds.get('nrows', None))

    # Check for duplicates in names.
    names = kwds.get("names", None)
    _check_dup_names(names)
call this _validate_names and have it return names, so it's a similar pattern to the other validators
Done.
d75e1fb to 869e363
@jreback : All comments addressed, and tests are green. PTAL
@jreback @jorisvandenbossche : ping
typo in whatsnew, otherwise lgtm. make sure this is on the deprecation list as well
@@ -283,6 +283,7 @@ Other API Changes
- The Categorical constructor no longer accepts a scalar for the ``categories`` keyword. (:issue:`16022`)
- Accessing a non-existent attribute on a closed :class:`~pandas.HDFStore` will now
  raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
- :func:`read_csv` now issues a ``UserWarning`` if the ``names`` parameter contains duplicates (:issue:`17095`)
should be FutureWarning
Doh! My bad for not catching that. Fixed.
Changing back to UserWarning in light of later discussion.
Fixed typo and added to deprecation list. Will merge on green then unless told otherwise.
@gfyoung wait for @jorisvandenbossche's comment (not sure if he commented here). IIRC he made a comment that having duplicate names is OK.
Sure thing. FWIW, @jorisvandenbossche agreed with your suggestion, see his comment here.
@jorisvandenbossche : Any comments on this PR?
@jorisvandenbossche : ping if there are any additional comments
@jreback : It's been a week, and I haven't heard anything from @jorisvandenbossche. Still wait, or can we merge this PR?
I think Joris is off on holiday. I believe he's back next week.
@TomAugspurger : Ah! I had a feeling that that was the case (I remember seeing an email about that). I'll wait then until he gets back.
@jorisvandenbossche : friendly ping
Can you remind me of the rationale for deprecating this? Because I actually previously had a use case where this proved useful (I had a non-informative column every other column, gave it the same name in
@jorisvandenbossche : Here is what you said back in July here. Essentially, we are deprecating this behavior because
@jorisvandenbossche : Any further thoughts on this?
Sorry for the slow response. So maybe a more general question: is it our intention to once fix
@jorisvandenbossche : No worries! I think you meant the other way around. The reason for us discouraging duplicates in
@jreback : Thoughts?
yes, of course ... :-)
That's true. But there are many other ways to deliberately make a dataframe with duplicate columns which we don't disallow anyway. To be clear, in general I am all for a restricted scope of capabilities/possibilities. But in this case, limiting the abilities of
@jorisvandenbossche : True that we'll see them mangled anyhow, but why the need to add complexity to just handle them in the first place? I added the handling for duplicate If the user really wants to have duplicate names, they can set it themselves after reading in the file, but I don't know if we want to actively encourage setting duplicate names to a read-in
Ah, I assumed that the mangling of
@jorisvandenbossche : Any updates on this?
so the only reason we have So pretty much allow what is happening today but with a It's not an error to have duplicates in
@jreback : I don't recall you saying this before. In addition, I think there has been user interest in not mangling in cases when the CSV file itself contains dupe names. That being said, if we think making the warning less harsh is a good idea, I can do that.
6378850 to 2ada940
@jreback : I made it issue a
lgtm. you might want to note in the doc-string the same warning.
Sounds good. I'll quickly add that.
2ada940 to 3446a5f
@jreback : All is green. PTAL.
lgtm modulo small comment
pandas/io/parsers.py
Outdated
Check if the `names` parameter contains duplicates.

Currently, this function issues a warning if that is the case. In the
future, we will raise an error.
doc string needs updating
Good catch. Fixed.
3446a5f to 9fcffa7
@jreback : All is green. PTAL.
thanks @gfyoung I think fine for now, we can always revisit if needed.
Title is self-explanatory.
xref #17095.