BUG: df fails when columns arg is a list containing dupes #2079

Closed
ghost opened this issue Oct 17, 2012 · 4 comments

ghost commented Oct 17, 2012

In [1]: DataFrame(data,columns=["a","a"])

...
pandas/pandas/core/internals.pyc in _stack_dict(dct, ref_items, dtype)
1344 stacked = np.empty(shape, dtype=dtype)
1345 for i, item in enumerate(items):
-> 1346 stacked[i] = _asarray_compat(dct[item])
1347
1348 # stacked = np.vstack([_asarray_compat(dct[k]) for k in items])

IndexError: index out of bounds

Commit 5e6db32 adds a failing test for this.

It looks like _to_sdict threads down to a call to _convert_object_array, which builds a dict
keyed on column names, so duplicate columns get squashed and you end up with a mismatch
between the length of the columns arg to df.__init__ and the data.
_to_sdict is not used for ndarrays, so this doesn't happen there. I was able to reuse
_init_ndarray for the case of columns being a flat list and have things work as expected.
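
To illustrate the squashing described above, here is a minimal sketch in plain Python. It does not call pandas or reproduce its internals; the names `column_arrays`, `by_name`, and `by_position` are hypothetical and exist only for this example.

```python
# Hypothetical illustration; none of these names are pandas internals.
columns = ["a", "a"]
column_arrays = [[1, 2], [3, 4]]  # one list of values per requested column

# Keying on the column name lets the second "a" overwrite the first.
by_name = dict(zip(columns, column_arrays))

print(len(columns))  # number of columns requested
print(len(by_name))  # number of entries that survive in the dict

# Keeping (name, values) pairs in order preserves both columns.
by_position = list(zip(columns, column_arrays))
print(len(by_position))
```

Any code that later walks `columns` and expects one stored array per entry would find fewer arrays than names, which is the kind of length mismatch described above.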

Still, too much code touches this; it's better left to the core devs to decide how to handle it.

Member

wesm commented Nov 5, 2012

Fixing this is quite an undertaking, since a lot of existing constructor code assumes unique column names. I'm on it; I'll probably get it sorted out over the next day or so.

wesm closed this as completed in b1b85ae on Nov 5, 2012
Author

ghost commented Nov 5, 2012

Should this work?

pd.DataFrame.from_items([('a',['foo']),('a',['bar'])],columns=['a','a'])
Out[6]: 
     a    a
0  bar  bar
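
One plausible reading of that output, not confirmed anywhere in this thread, is that the values are still looked up by name somewhere, so both requested columns resolve to the last value stored under "a". A plain-Python sketch of the difference (the names `lookup`, `by_name`, and `by_position` are hypothetical, not pandas code):

```python
# Hypothetical illustration of name-based vs. position-based handling.
items = [("a", ["foo"]), ("a", ["bar"])]
columns = ["a", "a"]

# Name-based: the dict keeps only the last value stored under "a",
# so every requested "a" column gets the same data.
lookup = dict(items)
by_name = [lookup[name] for name in columns]

# Position-based: take the values in item order, ignoring name clashes.
by_position = [values for _, values in items]

print(by_name)
print(by_position)
```

Under this assumption, the position-based form is the one that would give a `foo` column followed by a `bar` column.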

Author

ghost commented Nov 5, 2012

Also, forgive the nitpick, but since sdict is now abandoned, it would be good to rename the methods
that still reference it, to improve readability:

def _list_to_sdict(data, columns, coerce_float=False):
def _list_of_series_to_sdict(data, columns, coerce_float=False):
def _list_of_dict_to_sdict(data, columns, coerce_float=False):

wesm reopened this on Nov 5, 2012
Member

wesm commented Nov 5, 2012

Yeah that should work.
