Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: concat unwantedly sorts DataFrame column names if they differ #4588

Closed
smcinerney opened this issue Aug 16, 2013 · 36 comments
Closed

BUG: concat unwantedly sorts DataFrame column names if they differ #4588

smcinerney opened this issue Aug 16, 2013 · 36 comments
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@smcinerney
Copy link

smcinerney commented Aug 16, 2013

When concat'ing DataFrames, the column names get alphanumerically sorted if there are any differences between them. If they're identical across DataFrames, they don't get sorted.
This sort is undocumented and unwanted. Certainly the default behavior should be no-sort. EDIT: the standard order as in SQL would be: columns from df1 (same order as in df1), columns (uniquely) from df2 (less the common columns) (same order as in df2). Example:

df4a = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df4b = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df5  = DataFrame(columns=['C','B','E','D','A'], data=np.random.randn(3,5))

print "Cols unsorted:", concat([df4a,df4b])
# Cols unsorted:           C         B         D         A

print "Cols sorted", concat([df4a,df5])
# Cols sorted           A         B         C         D         E
``'
@hayd
Copy link
Contributor

hayd commented Aug 18, 2013

Looking at this briefly I think this stems from Index.intersection, whose docstring states:

Form the intersection of two Index objects. Sortedness of the result is not guaranteed

Not sure in which cases they appear/are sorted, but the case when the columns are equal (in your first one) is special cased to return the same result...

@jtratner
Copy link
Contributor

@smcierney what order would you expect instead?

@superkeyor
Copy link

I found the auto sort was a bit annoying too (well, I should say depends on your purpose), because I was trying to concat a frame to an empty frame in a loop (like append an element to a list). Then I realized my column order changed. This change also applies to index, if you are concatenating along axis=1.

In a case similar to that of @smcinerney , I expect the final order of CBDAE. E shows up last because the order CBDA shows up first when concatenating.

Therefore I wrote a "hack" (kinda silly though)

sorted = pd.concat(frameList, axis=axis, join=join, join_axes=join_axes, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False)

if join_axes:
    return sorted
elif sort:
    return sorted
else:
    # expand all original orders in each frame
    sourceOrder = []
    for frame in frameList:
        sourceOrder.extend(frame.Columns()) if axis == 0 else sourceOrder.extend(frame.Indices())
    sortedOrder = sorted.Columns() if axis == 0 else sorted.Indices()

    positions = []
    positionsSorted = []
    for i in sortedOrder:
        positions.append(sourceOrder.index(i))
        positionsSorted.append(sourceOrder.index(i))
    positionsSorted.sort()

    unsortedOrder = []
    for i in positionsSorted:
        unsortedOrder.append(sortedOrder[positions.index(i)])

    return sorted.ReorderCols(unsortedOrder) if axis == 0 else sorted.ReorderRows(unsortedOrder)

The function is included in my personal module called kungfu! Anyone can adopt the above algorithm, or have a look at my module at https://github.com/jerryzhujian9/kungfu

Finally, I greatly appreciate the work of the development team for this great module!

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@asteppke
Copy link

This behavior is indeed quite unexpected and I also stumbled over it.

 >>> df = pd.DataFrame()

>>> df['b'] = [1,2,3]
>>> df['c'] = [1,2,3]
>>> df['a'] = [1,2,3]
>>> print(df)
   b  c  a
0  1  1  1
1  2  2  2
2  3  3  3

[3 rows x 3 columns]
>>> df2 = pd.DataFrame({'a':[4,5]})
>>> df3 = pd.concat([df, df2])

Naively one would expect that the order of columns is preserved. Instead the columns are sorted:

>>> print(df3)
   a   b   c
0  1   1   1
1  2   2   2
2  3   3   3
0  4 NaN NaN
1  5 NaN NaN

[5 rows x 3 columns]

This can be corrected by reindexing with the original columns as follows:

>>> df4 = df3.reindex_axis(df.columns, axis=1)
>>> print(df4)
    b   c  a
0   1   1  1
1   2   2  2
2   3   3  3
0 NaN NaN  4
1 NaN NaN  5

[5 rows x 3 columns]

Still it seems counter-intuitive that this automatic sorting takes place and cannot be disabled as far as I know.

@zadacka
Copy link

zadacka commented Nov 19, 2014

I've just come across this too.

new_data = pd.concat([churn_data, numerical_data])

Produced a DataFrame:

     churn  Var1  Var10  Var100  Var101 
0      -1   NaN    NaN     NaN     NaN     
1      -1   NaN    NaN     NaN     NaN

It would seem more natural for the numerical DataFrame to be concatenated without being sorted first!!

@jreback
Copy link
Contributor

jreback commented Nov 19, 2014

well, this is a bit of work to fix. but pull requests accepted!

@rasbt
Copy link

rasbt commented Jan 13, 2015

Just stumbled upon this same issue when I was concatenating DataFrames. It's a little bit annoying if you don't know about this issue, but actually there is a quick remedy:

say dfs is a list of DataFrames you want to concatenate, you can just take the the original column order and feed it back in:

df = pd.concat(dfs, axis=0)
df = df[dfs[0].columns]

@max-sixty
Copy link
Contributor

I believe append causes the same behavior, FYI

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@inbredtom
Copy link

It's the default behaviour across the board. For example, if you apply a function, f, to a groupby() that returns a varying number of columns, the concatenation taking place behind the scene also auto-sorts the columns.

df.groupby(some_ts).apply(f)

Likely because the known order of the columns is open to interpretation.

However, this also happens for MultiIndices and all hierarchies in MultiIndices. So you can concat dataframes that agree on level0 columns and all bar one level1 columns, and all levels of the MultiIndices will be autosorted because of one mismatch within one level0 column. I don't imagine that is desirable.

I'd love to help, but unfortunately fixing this issue is beyond my ability. Thanks for the hard work all.

@vitalyisaev2
Copy link

+1 for this feature

@ashishsingal1
Copy link
Contributor

Agreed, +1. Unexpected sorting happens all the time for me.

@scyllagist
Copy link

+1, this was an unpleasant surprise!

@Zenadix
Copy link

Zenadix commented Aug 27, 2015

+1, I hate having the columns sorted after every append.

@summerela
Copy link

+1 from me, as well.

Because even if I did want to manually re-order after a concat, when I try to print out the 60 + column names and positions in my dataframe:

 for id, value in enumerate(df.columns):
      print id, value

All 60+ columns are output in alphabetical order, not their actual position in the data frame.

So that means that after ever concat, I have to manually type out a list of 60 columns to reorder. Ouch.

While I'm here, does anyone have a way to print out column name and position that I'm missing?

@jmwoloso
Copy link

+1 for this feature, just ran across the same deal myself.

@summerela Get the column index and then re-index your new dataframe using the original column index

# assuming you have two dataframes, `df_train` & `df_test` (with the same columns) 
# that you want to concatenate

# get the columns from one of them
all_columns = df_train.columns

# concatenate them
df_concat = pd.concat([df_train,
                       df_test])

# finally, re-index the new dataframe using the original column index
df_concat = df_concat.ix[:, all_columns]

Conversely, if you need to re-index a smaller subset of columns, you could use this function I made. It can operate with relative indices as well. For example, if you wanted to move a column to the end of a dataframe, but you aren't sure how many columns may remain after prior processing steps in your script (maybe you're dropping zero-variance columns, for instance), you could pass a relative index position to new_indices --> new_indices = [-1] and it will take care of the rest.

def reindex_columns(dframe=None, columns=None, new_indices=None):
    """
    Reorders the columns of a dataframe as specified by
    `reorder_indices`. Values of `columns` should align with their
    respective values in `new_indices`.

    `dframe`: pandas dataframe.

    `columns`: list,pandas.core.index.Index, or numpy array; columns to
    reindex.

    `reorder_indices`: list of integers or numpy array; indices
    corresponding to where each column should be inserted during
    re-indexing.
    """
    print("Re-indexing columns.")
    try:
        df = dframe.copy()

        # ensure parameters are of correct type and length
        assert isinstance(columns, (pd.core.index.Index,
                                    list,
                                    np.array)),\
        "`columns` must be of type `pandas.core.index.Index` or `list`"

        assert isinstance(new_indices,
                          list),\
        "`reorder_indices` must be of type `list`"

        assert len(columns) == len(new_indices),\
        "Length of `columns` and `reorder_indices` must be equal"

        # check for negative values in `new_indices`
        if any(idx < 0 for idx in new_indices):

            # get a list of the negative values
            negatives = [value for value
                         in new_indices
                         if value < 0]

            # find the index location for each negative value in
            # `new_indices`
            negative_idx_locations = [new_indices.index(negative)
                                      for negative in negatives]

            # zip the lists
            negative_zipped = list(zip(negative_idx_locations,
                                       negatives))

            # replace the negatives in `new_indices` with their
            # absolute position in the index
            for idx, negative in negative_zipped:
                new_indices[idx] = df.columns.get_loc(df.columns[
                                                          negative])

        # re-order the index now
        # get all columns
        all_columns = df.columns

        # drop the columns that need to be re-indexed
        all_columns = all_columns.drop(columns)

        # now re-insert them at the specified locations
        zipped_columns = list(zip(new_indices,
                                  columns))

        for idx, column in zipped_columns:
            all_columns = all_columns.insert(idx,
                                             column)
        # re-index the dataframe
        df = df.ix[:, all_columns]

        print("Successfully re-indexed dataframe.")

    except Exception as e:
        print(e)
        print("Could not re-index columns. Something went wrong.")

    return df

Edit: Usage would look like the following:

# move 'Column_1' to the end, move 'Column_2' to the beginning
df = reindex_columns(dframe=df,
                     columns=['Column_1', 'Column_2'],
                     new_indices=[-1, 0])

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 12, 2016
@patricktokeeffe
Copy link
Contributor

I encountered this (with 0.13.1) from an edge case not mentioned: combining dataframes each containing unique columns. A naive re-assignment of column names didn't work:

dat = pd.concat([out_dust, in_dust, in_air, out_air])
dat.columns = [out_dust.columns + in_dust.columns + in_air.columns + out_air.columns]

The columns still get sorted. Using lists intermediately resolved things, though:

Edit: I spoke too soon..


Follow-up: fwiw, column order can be preserved with chained .join calls on singular objects:

df1.join([df2, df3]) # sorts columns
df1.join(df2).join(df3) # column order retained

@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016
@MikeTam1021
Copy link

I’m pretty sure that’s exactly what people in this thread have been discussing. I see lots of good solutions above that should work.

@h-vetinari
Copy link
Contributor

@MikeTam1021 It's not about turning pandas into SQL (heaven forbid!), but I couldn't agree more with:

There should never be an unwanted automatic sort. If the user wants to sort the column names, let them do that manually.

Concatenating DataFrames should have the same effect as "writing them next to each other", and that implicit sort definitely violates the principle of least astonishment.

@MikeTam1021
Copy link

I agree. It shouldn’t. It also assumes an order to the columns, which is SQLish, and not pure computer science. You should really know where you’re data is.

I hardly use pandas anymore after discovering this and many other issues. It has made me a better programmer.

@armant
Copy link

armant commented Apr 4, 2018

+1 on this

brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588
brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588
brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588
brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588
brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588
brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588
brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588
brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588
@jreback jreback modified the milestones: Next Major Release, 0.23.0 Apr 5, 2018
@bcucek
Copy link

bcucek commented Apr 18, 2018

This works for me:

cols = list(df1)+list(df2)
df1 = pd.concat([df1, df2])
df1 = df1.loc[:, cols]

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Apr 30, 2018
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588
TomAugspurger pushed a commit that referenced this issue May 1, 2018
…0613)

* Stop concat from attempting to sort mismatched columns by default

Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Closes #4588
@DavidEscott
Copy link

I have to bitch about how this patch is rolled out. You have simultaneously changed the function signature of concat AND introduced a warning about the usage. All within the same commit.

The problem with that is that we use pandas on multiple servers and cannot guarantee that all servers have the exact same version of pandas at all times. So now we have less technical users seeing warnings from programs they have never seen before, and are uncertain if the warning is a sign of a problem.

I can readily identify WHERE the warning is coming from, but I can't add either of the suggested options because that would break the program on any server running an older version of pandas.

It would have been preferable if you put the sorting capability in to 0.23, and added the warning to some later version. I know its a pain, but it's rather obnoxious to assume that the users can immediately update all deployments to the latest code.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 4, 2018 via email

@DavidEscott
Copy link

@TomAugspurger There are a multitude of ways that we on our side can deal with this. Certainly filtering warnings is one. Its not great because the mechanics of warnings filters are a bit ugly...

  1. I would have to add the filter to multiple programs
  2. Not a great way to specify a specific warning to filter:
  • I can filter by module and lineno, but that isn't a stable reference,
  • I can filter by module and FutureWarning but then I wouldn't get any warnings at all from pandas and would be surprised by other changes,
  • or I can filter by your long multi-line message
  1. And then remember to take that filter out when everything is upgraded and it no longer matters.

In any case the deficiencies in the warnings module are certainly not something I can put at the foot of the pandas team.

Nor is it your fault that we have an older server we can't easily upgrade, so that would be the other thing I can do (just upgrade all the damn deployments). Ultimately, I recognize that I have to do that and that it is my responsibility to try and keep our deployments close together.

It just seems a bit bizarre to me that you were so concerned about a possible change in user visible end behavior that you added this sort option to what was previously an underspecified API, and yet have simultaneously thrown a warning at the programmer... both the warning and the proposed change in sort behavior constitute "user visible behavior" in my book, just of different severities.

@SHi-ON
Copy link

SHi-ON commented Jun 24, 2019

I've answered a related question on SO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests