Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.shift with axis=1 shifts object columns to the next column with object dtype #26929

Closed
nchrisr opened this issue Jun 18, 2019 · 8 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Multi-Block Issues caused by the presence of multiple Blocks

Comments

@nchrisr
Copy link

nchrisr commented Jun 18, 2019

Code Sample

import pandas as pd
import StringIO

# Store the csv string in a variable 
gps_string = """
"2010-01-12 18:00:00","$GPGGA","180439","7249.2150","N","11754.4238","W","2","10","0.9","-8.1","M","-12.4","M","","*57","","",""
"2010-01-12 17:30:00","$GPGGA","173439","7249.2160","N","11754.4233","W","2","11","0.8","-4.5","M","-12.4","M","","*5B","","",""
"2010-01-12 17:00:00","$GPGGA","170439","7249.2152","N","11754.4235","W","2","11","0.8","-3.1","M","-12.4","M","","*5C","","",""
"2010-01-12 16:30:00","N","11754.4210","W","2","09","1.1","-13.1","M","-12.4","M","","*6C","","","","","",""
"2010-01-12 16:00:00","N","11754.4229","W","2","10","0.9","-2.9","M","-12.4","M","","*53","","","","","",""
"2010-01-12 15:30:00","N","11754.4269","W","2","09","0.8","-4.3","M","-12.4","M","","*54","","","","","",""
"2010-01-12 15:00:00","N","11754.4267","W","2","10","0.8","-1.6","M","-12.4","M","","*56","","","","","",""
"2010-01-12 14:30:00","$GPGGA","143439","7249.2152","N","11754.4253","W","2","11","0.7","-4.3","M","-12.4","M","","*56","","",""
"2010-01-12 14:00:00","N","11754.4245","W","2","10","0.9","-7.0","M","-12.4","M","","*50","","","","","",""
"2010-01-12 13:30:00","$GPGGA","133439","7249.2134","N","11754.4243","W","2","11","0.7","-10.7","M","-12.4","M","","*61","","",""
"2010-01-12 13:00:00","N","11754.4245","W","2","10","0.8","-5.5","M","-12.4","M","","*56","","","","","",""
"2010-01-12 12:30:00","N","11754.4226","W","2","10","0.9","-7.1","M","-12.4","M","","*59","","","","","",""
"2010-01-12 12:00:00","N","11754.4238","W","2","10","0.8","-6.5","M","-12.4","M","","*51","","","","","",""
"2010-01-12 11:30:00","N","11754.4227","W","2","10","0.8","0.1","M","-12.4","M","","*73","","","","","",""
"2010-01-12 11:00:00","-7.4","M","-12.4","M","","*5F","","","","","","","","","","","",""
"2010-01-12 10:30:00","N","11754.4271","W","2","08","1.1","-8.4","M","-12.4","M","","*5A","","","","","","" """

# Read the csv string into a dataframe.
gps_df = pd.read_csv(StringIO.StringIO(gps_string), header=None, index_col=0)
rows_to_shift = gps_df[gps_df[15].isnull()].index
gps_df.loc[rows_to_shift] = gps_df.loc[rows_to_shift].shift(periods=1, axis=1)
# create this file to see the outcome
gps_df.to_csv("f.csv") 

Problem description

I am trying to use the Dataframe. shift() function to move certain rows of data into their correct columns, and the Dataframe.shift() function is doing some weird things, it is creating an empty column of null(s) and moves one of the columns to the end of the dataframe.

Screenshots

Original data:

Screen Shot 2019-06-18 at 4 09 22 PM

Output after execution of the code

Screen Shot 2019-06-18 at 4 31 15 PM

As seen above, the data that was originally in column 10 has been move to column 15 for some reason. Also a column with null values has been created in column 6.

I expect that the data should be moved to the right by one step, that is the data in each column should move to the column to the left of it, and the current behaviour is confusing, based on the documentation of what this function should do.

My expected output:

Screen Shot 2019-06-18 at 4 16 40 PM

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 32
OS: Windows

@TomAugspurger
Copy link
Contributor

Haven't read through everything but is this example fully minimal? At a glance, it seems overly complex to demonstrate the issue. http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@nchrisr
Copy link
Author

nchrisr commented Jun 18, 2019

@TomAugspurger I have minimized the example as much as possible, but it is a problem that requires detailed description, that is why it seems complex. The code generates a dataframe from a string and this is why it looks long, in reality the first 15 lines of the code are assigning a dataset to a variable, so it can easily be copied and pasted. I have looked through and I believe every piece of information on there is necessary to demonstrate and recreate the issue.

@TomAugspurger
Copy link
Contributor

OK. Let us know if you figure anything out. I don't plan to look at this any more closely in the near term.

@nchrisr
Copy link
Author

nchrisr commented Jun 18, 2019

@TomAugspurger I Made some changes to the issue to make it more concise, I have been testing this out for the past couple of days and have not been able to find a solution to this issue.

@mroeschke
Copy link
Member

Generally, it looks like shifting object columns will automatically shift to the next column that had an object dtype.

In [43]: df = pd.DataFrame([[2, 9.0, 'a'], ['M', 8.0, 'b']])

In [44]: df
Out[44]:
   0    1  2
0  2  9.0  a
1  M  8.0  b

In [45]: df.shift(1, axis=1)
Out[45]:
     0   1  2
0  NaN NaN  2
1  NaN NaN  M

@mroeschke mroeschke added the Bug label Jun 18, 2019
@mroeschke mroeschke changed the title Potential Bug with Dataframe.shift when used with Dataframe.loc BUG: DataFrame.shift with axis=1 shifts object columns to the next column with object dtype Jun 18, 2019
@nchrisr
Copy link
Author

nchrisr commented Jun 19, 2019

I have found a work around for this problem while this bug is being fixed, just in case anyone had a similar issue.

I convert all the data in the fields to strings and then perform the shift, after which I convert the data back into a csv string, and then read this string into a dataframe again, in order to maintain their previous data types.

import pandas as pd
import StringIO

gps_string = """
"2010-01-12 18:00:00","$GPGGA","180439","7249.2150","N","11754.4238","W","2","10","0.9","-8.1","M","-12.4","M","","*57","","",""
"2010-01-12 17:30:00","$GPGGA","173439","7249.2160","N","11754.4233","W","2","11","0.8","-4.5","M","-12.4","M","","*5B","","",""
"2010-01-12 17:00:00","$GPGGA","170439","7249.2152","N","11754.4235","W","2","11","0.8","-3.1","M","-12.4","M","","*5C","","",""
"2010-01-12 16:30:00","N","11754.4210","W","2","09","1.1","-13.1","M","-12.4","M","","*6C","","","","","",""
"2010-01-12 16:00:00","N","11754.4229","W","2","10","0.9","-2.9","M","-12.4","M","","*53","","","","","",""
"2010-01-12 15:30:00","N","11754.4269","W","2","09","0.8","-4.3","M","-12.4","M","","*54","","","","","",""
"2010-01-12 15:00:00","N","11754.4267","W","2","10","0.8","-1.6","M","-12.4","M","","*56","","","","","",""
"2010-01-12 14:30:00","$GPGGA","143439","7249.2152","N","11754.4253","W","2","11","0.7","-4.3","M","-12.4","M","","*56","","",""
"2010-01-12 14:00:00","N","11754.4245","W","2","10","0.9","-7.0","M","-12.4","M","","*50","","","","","",""
"2010-01-12 13:30:00","$GPGGA","133439","7249.2134","N","11754.4243","W","2","11","0.7","-10.7","M","-12.4","M","","*61","","",""
"2010-01-12 13:00:00","N","11754.4245","W","2","10","0.8","-5.5","M","-12.4","M","","*56","","","","","",""
"2010-01-12 12:30:00","N","11754.4226","W","2","10","0.9","-7.1","M","-12.4","M","","*59","","","","","",""
"2010-01-12 12:00:00","N","11754.4238","W","2","10","0.8","-6.5","M","-12.4","M","","*51","","","","","",""
"2010-01-12 11:30:00","N","11754.4227","W","2","10","0.8","0.1","M","-12.4","M","","*73","","","","","",""
"2010-01-12 11:00:00","-7.4","M","-12.4","M","","*5F","","","","","","","","","","","",""
"2010-01-12 10:30:00","N","11754.4271","W","2","08","1.1","-8.4","M","-12.4","M","","*5A","","","","","",""

 """

gps_df = pd.read_csv(StringIO.StringIO(gps_string), header=None, index_col=0)
rows_to_shift = gps_df[gps_df[15].isnull()].index  # get the indexes to shift
gps_df_all_strings = gps_df.astype(str)  # Convert all the data to be of type str (string)

# Shift the data
gps_df_all_strings.loc[rows_to_shift] = gps_df_all_strings.loc[rows_to_shift].shift(periods=1, axis=1)
s = gps_df_all_strings.to_csv(header=None)  # Put shifted csv data into a string after shifting.
new_gps_df = pd.read_csv(StringIO.StringIO(s), header=None, index_col=0)  # re read csv data.

@And0k
Copy link

And0k commented Jun 11, 2020

string dtype dataframe does not respond to axis

df = pd.DataFrame([['2', '3', '4'], ['K', 'L', 'M']], dtype='string')
df.shift(1, axis=1)
Out: 
      0     1     2
0  <NA>  <NA>  <NA>
1     2     3     4

example workaround (that works for any positive shifts shift_positive):

shift_positive = 1
df.iloc[:, shift_positive:] = df.slice_shift(shift_positive, axis=1)
df.iloc[:, :shift_positive] = pd.NA
df
Out: 
      0  1  2
0  <NA>  2  3
1  <NA>  K  L

@mroeschke mroeschke added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jun 11, 2020
@jbrockmendel jbrockmendel added the Multi-Block Issues caused by the presence of multiple Blocks label Sep 21, 2020
@jbrockmendel
Copy link
Member

closed by #35578

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Multi-Block Issues caused by the presence of multiple Blocks
Projects
None yet
Development

No branches or pull requests

5 participants