Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sets data index if it's not None #16422

Closed
wants to merge 2 commits into from
Closed

Conversation

mogillies
Copy link

Fixed Bug #16342 read_stata ignores index parameter

@jorisvandenbossche
Copy link
Member

Can you add a test for this?
(that's always needed when fixing bus (or when adding new things as well), and typically you want to do first: write a test that currently fails, and then can check if with your changes the test passes)

@jorisvandenbossche jorisvandenbossche added the IO Stata read_stata, to_stata label May 22, 2017
@jorisvandenbossche
Copy link
Member

There is already some functionality to do this in the StataReader.read method (but this does not honor the global passed keyword), so it would be nice to combine the fix with that.

@jorisvandenbossche
Copy link
Member

Actually, have a look at this: https://github.com/pandas-dev/pandas/blob/master/pandas/io/stata.py#L1465-L1466
This is how it is done for columns, so you can do similar for index

@TomAugspurger
Copy link
Contributor

Tests can go in pandas/tests/io/test_stata.py

@mogillies
Copy link
Author

I'm on it.

@codecov
Copy link

codecov bot commented May 22, 2017

Codecov Report

Merging #16422 into master will decrease coverage by <.01%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16422      +/-   ##
==========================================
- Coverage   90.42%   90.42%   -0.01%     
==========================================
  Files         161      161              
  Lines       51023    51025       +2     
==========================================
  Hits        46138    46138              
- Misses       4885     4887       +2
Flag Coverage Δ
#multiple 88.26% <50%> (-0.01%) ⬇️
#single 40.17% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/stata.py 93.4% <50%> (-0.08%) ⬇️
pandas/core/common.py 91.05% <0%> (-0.34%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 49ec31b...87a2de3. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add the test from the issue & a whatsnew note.

@mogillies
Copy link
Author

mogillies commented May 23, 2017 via email

@mogillies
Copy link
Author

mogillies commented May 23, 2017

I looked at https://docs.google.com/spreadsheets/d/13Xdux8j966QK6QKQtQIyyuS-dOriJvf6uH8LoZyvwYQ/edit#gid=1307138735
It's checking in StataParser.read
if columns is None:
columns = self._columns

How does this help testing? I'm new to this code; please give me some idea how to test it. I was looking at test_stata.py and am confused about why "index" doesn't work anyway. Here are my changes:

def read_stata(filepath_or_buffer, convert_dates=True,
-               convert_categoricals=True, encoding=None, index=None,
+               convert_categoricals=True, encoding=None, index_col=None,
                convert_missing=False, preserve_dtypes=True, columns=None,
                order_categoricals=True, chunksize=None, iterator=False):
 
     reader = StataReader(filepath_or_buffer,
                          convert_dates=convert_dates,
                          convert_categoricals=convert_categoricals,
-                         index=index, convert_missing=convert_missing,
+                         index=index_col, convert_missing=convert_missing,
                          preserve_dtypes=preserve_dtypes,
                          columns=columns,
                          order_categoricals=order_categoricals,
@@ -178,8 +178,8 @@ def read_stata(filepath_or_buffer, convert_dates=True,
             data = reader.read()
         finally:
             reader.close()
-    if index is not None:
-        data.set_index(index)
+    if index_col is not None:
+        data.set_index(index_col)
     return data

@TomAugspurger
Copy link
Contributor

@mogillies I'll push a test to this branch in a few minutes.

@TomAugspurger
Copy link
Contributor

@mogillies ok just pushed a test. You can run it with pytest pandas/tests/io/test_stata.py -k test_read_index (it will fail right now)

@mogillies
Copy link
Author

I wrote this in test_stata.py and passed the test:
def test_index_col(self):
df = tm.makeDataFrame()
df.index.name = 'index_col'
reader = lambda x: read_stata(x).set_index('index_col')
result = tm.round_trip_localpath(df.to_stata, reader)
tm.assert_frame_equal(df, result)

@mogillies
Copy link
Author

in my branch.

@TomAugspurger
Copy link
Contributor

That test doesn't quite do what we want. We want the user to be able to just do

pd.read_stata(path, index_col="my_index"

In your test your reader first reads in the data with read_stata then calls set_index('index_col').

@mogillies
Copy link
Author

So I am ready to commit and the test passed; but realize if someone call
result = pd.read_stata(path, iterator=True, index_col='my_index')
^^^^^^^^^^
it will fail the test with the following error:

E AttributeError: 'StataReader' object has no attribute 'set_index'

Would you like me to commit and PR and is there an issue raised already?

@jorisvandenbossche
Copy link
Member

@mogillies You will have to try to fix that error here as well in this PR, as this is related to your current fix.

Repeating what I linked to above: I think you should take a look at https://github.com/pandas-dev/pandas/blob/master/pandas/io/stata.py#L1465-L1466. There in the read function, the columns keyword is overridden with the one passed through read_stata, you can do the same for index

@jreback
Copy link
Contributor

jreback commented Jun 10, 2017

can you rebase and update?

@jreback
Copy link
Contributor

jreback commented Aug 17, 2017

closing as stale

@jreback jreback closed this Aug 17, 2017
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Aug 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Stata read_stata, to_stata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants