API: Warn about dups in names for read_csv #17346

Merged: 1 commit, Sep 24, 2017


gfyoung
Member

@gfyoung gfyoung commented Aug 26, 2017

Title is self-explanatory.

xref #17095.

@gfyoung gfyoung added the API Design, Error Reporting (incorrect or improved errors from pandas), and IO CSV (read_csv, to_csv) labels Aug 26, 2017
@gfyoung gfyoung added this to the 0.21.0 milestone Aug 26, 2017
@codecov

codecov bot commented Aug 26, 2017

Codecov Report

Merging #17346 into master will decrease coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #17346      +/-   ##
==========================================
- Coverage   91.26%   91.24%   -0.02%     
==========================================
  Files         163      163              
  Lines       49776    49783       +7     
==========================================
- Hits        45426    45424       -2     
- Misses       4350     4359       +9
Flag        Coverage Δ
#multiple   89.04% <100%> (ø) ⬆️
#single     40.29% <57.14%> (-0.06%) ⬇️

Impacted Files          Coverage Δ
pandas/io/parsers.py    95.48% <100%> (+0.02%) ⬆️
pandas/io/gbq.py        25% <0%> (-58.34%) ⬇️
pandas/core/frame.py    97.77% <0%> (-0.1%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d43aba8...9fcffa7.

counts = {}
warn_dups = False

for name in names:
Contributor

just use set intersection here

Member Author

How so? This is a fail-early method, which is why I chose it.

Contributor

simply check for len(names) != len(set(names)). much more idiomatic

Member Author

Fair enough. Done.
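For readers comparing the two approaches discussed above, here is a minimal sketch. The function names are hypothetical and the loop is only a loose paraphrase of the `counts`-based code in the diff, not the code that was merged.

```python
def has_dups_fail_early(names):
    """Stop at the first repeated name (loop style, as in the original diff)."""
    seen = set()
    for name in names:
        if name in seen:
            return True  # bail out as soon as a duplicate shows up
        seen.add(name)
    return False


def has_dups_set_length(names):
    """Compare lengths after de-duplicating (the reviewer's suggestion)."""
    return len(names) != len(set(names))


sample = ["a", "b", "a"]
print(has_dups_fail_early(sample), has_dups_set_length(sample))
```

Both give the same answer. The loop can stop early on long inputs, while the length comparison is shorter to read. Since `names` is usually a short list, the difference is unlikely to matter in practice.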

counts[name] = True

if warn_dups:
msg = ("Duplicate names specified. This "
Contributor

so are we deprecating this? then this should be a FutureWarning.

Member Author

Fair enough. Done.

@gfyoung gfyoung force-pushed the dup-names-warn branch 2 times, most recently from 1497183 to b1a7a4a Compare August 30, 2017 08:00
@@ -406,6 +438,10 @@ def _read(filepath_or_buffer, kwds):
chunksize = _validate_integer('chunksize', kwds.get('chunksize', None), 1)
nrows = _validate_integer('nrows', kwds.get('nrows', None))

# Check for duplicates in names.
names = kwds.get("names", None)
_check_dup_names(names)
Contributor

call this _validate_names and have it return names, so it's a similar pattern to the other validators

Member Author

Done.

@gfyoung gfyoung force-pushed the dup-names-warn branch 2 times, most recently from d75e1fb to 869e363 Compare August 30, 2017 15:14
@gfyoung
Member Author

gfyoung commented Aug 31, 2017

@jreback : All comments addressed, and tests are green. PTAL

@gfyoung
Member Author

gfyoung commented Sep 1, 2017

@jreback @jorisvandenbossche : ping

Contributor

@jreback jreback left a comment

typo in whatsnew, otherwise lgtm. make sure this is on the deprecation list as well

@@ -283,6 +283,7 @@ Other API Changes
- The Categorical constructor no longer accepts a scalar for the ``categories`` keyword. (:issue:`16022`)
- Accessing a non-existent attribute on a closed :class:`~pandas.HDFStore` will now
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
- :func:`read_csv` now issues a ``UserWarning`` if the ``names`` parameter contains duplicates (:issue:`17095`)
Contributor

should be FutureWarning

Member Author

Doh! My bad for not catching that. Fixed.

Member Author

Changing back to UserWarning in light of later discussion.

@gfyoung
Member Author

gfyoung commented Sep 1, 2017

Fixed typo and added to deprecation list. Will merge on green then unless told otherwise.

@jreback
Contributor

jreback commented Sep 1, 2017

@gfyoung wait for @jorisvandenbossche's comment (not sure if he has commented here). IIRC he made a comment that having duplicate names is ok.

@gfyoung
Member Author

gfyoung commented Sep 1, 2017

Sure thing. FWIW, @jorisvandenbossche agreed with your suggestion, see his comment here

@jorisvandenbossche : Any comments on this PR?

@gfyoung
Member Author

gfyoung commented Sep 5, 2017

@jorisvandenbossche : ping if there are any additional comments

@gfyoung
Member Author

gfyoung commented Sep 8, 2017

@jreback : It's been a week, and I haven't heard anything from @jorisvandenbossche . Still wait, or can we merge this PR?

@TomAugspurger
Contributor

I think Joris is off on holiday. I believe he's back next week.

@gfyoung
Member Author

gfyoung commented Sep 8, 2017

@TomAugspurger : Ah! I had a feeling that that was the case (I remember seeing an email about that). I'll wait then until he gets back.

@gfyoung
Member Author

gfyoung commented Sep 13, 2017

@jorisvandenbossche : friendly ping

@jorisvandenbossche
Member

Can you remind me of the rationale for deprecating this?
Is it because we cannot actually handle it well?

Because I actually previously had a use case where this proved useful (every other column was non-informative, so I gave them all the same name in names and then dropped that single name. But this specific case can of course easily be solved differently, by giving names like 'dummy1', 'dummy2', .. and then removing all columns that start with 'dummy')
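The workaround mentioned in that comment can be sketched roughly as follows. The data and column names are made up for illustration, and only unique names are passed so the example does not depend on how a given pandas version treats duplicates in `names`.

```python
import io

import pandas as pd

# Made-up data: every other column is a non-informative filler.
raw = io.StringIO("1,x,2,x\n3,x,4,x\n")

# Give each filler column its own throwaway name instead of one shared name.
names = ["a", "dummy1", "b", "dummy2"]
df = pd.read_csv(raw, header=None, names=names)

# Drop everything whose name starts with the throwaway prefix.
keep = [col for col in df.columns if not col.startswith("dummy")]
df = df[keep]
print(df)
```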

@gfyoung
Member Author

gfyoung commented Sep 13, 2017

@jorisvandenbossche : Here is what you said back in July here.

Essentially, we are deprecating this behavior because names is a user-specified parameter, and passing in duplicate names deliberately only encourages buggy behavior.

@gfyoung
Member Author

gfyoung commented Sep 18, 2017

@jorisvandenbossche : Any further thoughts on this?

@jorisvandenbossche
Member

Sorry for the slow response.

So maybe a more general question: is it our intention to fix mangle_dupe_cols=True at some point? (as currently only the default False actually works). If we do, I see no good reason to disallow duplicates in passed names vs names in the csv file.

@gfyoung
Member Author

gfyoung commented Sep 18, 2017

@jorisvandenbossche : No worries! I think you meant the other way around. mangle_dupe_cols=True is fully supported at this point. It is mangle_dupe_cols=False that we disabled.

The reason for us discouraging duplicates in names is because duplicates are generally more error-prone. Contrary to duplicates in a file, names is a deliberate choice.

@jreback : Thoughts?
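To illustrate the mangling being discussed, here is a small sketch with a made-up file whose header repeats a label. The `mangle_dupe_cols` keyword is deliberately not passed, so this relies only on the default behaviour.

```python
import io

import pandas as pd

# Header row repeats the label "a"; the data is made up for illustration.
raw = io.StringIO("a,a,b\n1,2,3\n")
df = pd.read_csv(raw)

# With default settings the repeated header label is de-duplicated by
# appending a numeric suffix, so the second "a" comes back as "a.1".
print(list(df.columns))
```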

@jorisvandenbossche
Member

I think you meant the other way around

yes, of course ... :-)

Contrary to duplicates in a file, names is a deliberate choice.

That's true. But there are many other ways to deliberately make a dataframe with duplicate columns which we don't disallow anyway.

To be clear, in general I am all for a restricted scope of capabilities/possibilities. But in this case, limiting the abilities of names does not actually reduce code complexity, but increases it: the duplicates are already perfectly handled by the code, so we are introducing a special case. Therefore I was wondering whether this is actually needed (the user will see that the names are mangled anyhow).

@gfyoung
Member Author

gfyoung commented Sep 18, 2017

@jorisvandenbossche : True that we'll see them mangled anyhow, but why the need to add complexity to just handle them in the first place? I added the handling for duplicate names in an earlier PR because it was a bug, not to enhance support.

If the user really wants to have duplicate names, they can set them themselves after reading in the file, but I don't know if we want to actively encourage setting duplicate names on a read-in DataFrame.
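As a sketch of what setting duplicate labels by hand looks like, here are two ways to end up with duplicate column labels without going through `names` at all. The data is made up for illustration.

```python
import pandas as pd

# 1. Pass duplicate labels straight to the constructor.
built = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])

# 2. Start with unique labels, then overwrite them afterwards.
renamed = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
renamed.columns = ["a", "a"]

# Selecting a duplicated label returns every matching column,
# i.e. a DataFrame rather than a single Series.
print(built["a"].shape, renamed["a"].shape)
```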

@jorisvandenbossche
Member

Ah, I assumed that the mangling of the passed names and of the names from the header of the file was done using the same code path? Is that not the case?
(to state it another way: once we remove the deprecation in this PR, can we actually remove more code than is added in this PR?)

@gfyoung
Member Author

gfyoung commented Sep 18, 2017

See #17095 : it's not a ridiculous amount of new logic that I added, but new logic nonetheless 😄 Also, see @jreback comment in that PR here

@gfyoung
Member Author

gfyoung commented Sep 21, 2017

@jorisvandenbossche : Any updates on this?

@jreback
Contributor

jreback commented Sep 21, 2017

so the only reason we have mangle_dupe_cols is to support duplicates in the first place. I think I stated we should deprecate that argument entirely; then I would allow duplicates in names but show a UserWarning if names contains duplicates.

So pretty much allow what is happening today but with a UserWarning and reducing the path complexity a bit (removing mangle_dupe_cols).

It's not an error to have duplicates in names, but I guess we can't disallow it entirely.

@gfyoung
Member Author

gfyoung commented Sep 21, 2017

@jreback : I don't recall you saying this before. In addition, I think there has been user interest in not mangling in cases when the CSV file itself contains dupe names. That being said, if we think making the warning less harsh is a good idea, I can do that.

@gfyoung gfyoung force-pushed the dup-names-warn branch 2 times, most recently from 6378850 to 2ada940 Compare September 23, 2017 09:24
@gfyoung
Member Author

gfyoung commented Sep 23, 2017

@jreback : I made it issue a UserWarning instead. PTAL.

@jreback
Contributor

jreback commented Sep 23, 2017

lgtm. you might want to note in the doc-string the same warning.

@gfyoung
Member Author

gfyoung commented Sep 23, 2017

lgtm. you might want to note in the doc-string the same warning.

Sounds good. I'll quickly add that.

@gfyoung
Member Author

gfyoung commented Sep 23, 2017

@jreback : All is green. PTAL.

Contributor

@jreback jreback left a comment

lgtm modulo small comment

Check if the `names` parameter contains duplicates.

Currently, this function issues a warning if that is the case. In the
future, we will raise an error.
Contributor

doc string needs updating

Member Author

Good catch. Fixed.

@gfyoung
Member Author

gfyoung commented Sep 24, 2017

@jreback : All is green. PTAL.

@jreback jreback merged commit 1f51271 into pandas-dev:master Sep 24, 2017
@jreback
Contributor

jreback commented Sep 24, 2017

thanks @gfyoung I think fine for now, we can always revisit if needed.

@gfyoung gfyoung deleted the dup-names-warn branch September 25, 2017 00:59
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Labels: API Design, Error Reporting (incorrect or improved errors from pandas), IO CSV (read_csv, to_csv)
4 participants