Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix the excel reader: hours & header #4404

Merged
merged 0 commits into from
Aug 21, 2013
Merged

Conversation

timmie
Copy link
Contributor

@timmie timmie commented Jul 30, 2013

in the excel.py there is a fix enabling reading xlsx files with both
datemodes: (see #4332)
in the parsers.py there is the fix for readinh the header even if there
are additional rows (to be skipped) between a header and the data (see: #4340)

I hope this PR is adequate. If not, please let me know.
I can supply the sample file for testing proving that it works. See also:
https://github.com/timmie/example_code_data

line = self._next_line()
else:
#if there is just 1 header row to be parsed, this works.
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a syntax error here...not aligned with the except clause below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will fix that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I checked. The code works.
Please check with: https://github.com/timmie/example_code_data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, you were right ;-)

But now fixed: timmie@b7ef7a5

BTW, thanks for looking into this!

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2013

can u hook up travis ci?

https://github.com/pydata/pandas/wiki/Testing#hook-up-travis-ci

thanks

@timmie
Copy link
Contributor Author

timmie commented Jul 30, 2013

uff, never done this. I will see what I can do in the next days.

@jtratner
Copy link
Contributor

@timmie it's actually really easy to set up. All you need to do is log in to Travis with your github account and follow a few instructions. The upside is that Travis then tests your code against multiple versions of Python and multiple different environments, which can really help iron out bugs.

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2013

it's important for these kinds of prs especially since we don't want a bunch of failures simply because the underlying lib doesn't exist for, say, python 3.2

@timmie
Copy link
Contributor Author

timmie commented Jul 30, 2013

ok, I added the service hook in the GitHub settings.

What is the next step?

Do you also wanna have a test with the data from https://github.com/timmie/example_code_data?
Before, I woudl appreciate a comment if my changes are appropriate.

THX.

@jtratner
Copy link
Contributor

I find Travis really useful, TBH. You can kind of throw a bunch of things
at the wall and see what sticks without needing to go through the hassle of
compiling 5 times with dependencies.

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2013

we need that locally! i'm working on something to make detox more useful to pandas devs...just a better config really

@jtratner
Copy link
Contributor

git commit --amend -C HEAD
git push --force

@timmie
Copy link
Contributor Author

timmie commented Jul 30, 2013

so, no I see a lot of errors for code I never touched:
https://travis-ci.org/timmie/pandas/jobs/9669404
https://travis-ci.org/timmie/pandas

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2013

many of those are related to your change

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2013

actually ALL of them are

@timmie
Copy link
Contributor Author

timmie commented Jul 30, 2013

mmh. worked with my files...

How can I test this on my machine?

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2013

nosetests -w pandas/io/tests

line = self._next_line()
else:
#if there is just 1 header row to be parsed, this works.
line = self.data[header[0]]

this_columns = []
for i, c in enumerate(line):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also why did you change the permissions of this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clone to a FAT32? I do not know. Usually, I work with bzr which can ignore this.
I assume that this is not what breaks the tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no it's probably not

git tracks that information (in a somewhat reduced capacity) i think there are only 2 or 3 different sets of permissions that are possible with git

@timmie
Copy link
Contributor Author

timmie commented Jul 30, 2013

on my local I get:

 nosetests -w pandas/io/tests

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2013

you're probably not in the pandas directory

@timmie
Copy link
Contributor Author

timmie commented Jul 30, 2013

I am:
~/coding/virtualenvs/pandas/pandas_fork

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2013

try find . -name 'test_pytables.py' from that dir what does it return?

@timmie
Copy link
Contributor Author

timmie commented Jul 30, 2013

find . -name 'test_pytables.py'
./pandas/io/tests/test_pytables.py

@cpcloud
Copy link
Member

cpcloud commented Jul 31, 2013

not sure what the issue is

@jreback
Copy link
Contributor

jreback commented Jul 31, 2013

@timmie

make sure that you have these git options set

git config -l --global

so you don't change file modes and line endings are good (if you are on windows?)

core.filemode=False
core.autocrlf=input

@cpcloud maybe add to tip & tricks?

@cpcloud
Copy link
Member

cpcloud commented Jul 31, 2013

Sounds good

@timmie
Copy link
Contributor Author

timmie commented Jul 31, 2013

@cpcloud , @jreback
Thanks for the help.

I feel lost again. I found the error and tryed to fix it. But all these hurdles like git, travis etc. are too high for the occasional library contributor...

@cpcloud
Copy link
Member

cpcloud commented Jul 31, 2013

Ironically I didn't start to understand git until I started contributing more.. Just keep at it :)

@timmie
Copy link
Contributor Author

timmie commented Jul 31, 2013

what does this mean

nosetests -w pandas/tests/test_series.py 
Traceback (most recent call last):
  File "/usr/bin/nosetests", line 9, in <module>
    load_entry_point('nose==1.1.2', 'console_scripts', 'nosetests')()
  File "/usr/lib/python2.7/dist-packages/nose/core.py", line 118, in __init__
    **extra_args)
  File "/usr/lib/python2.7/unittest/main.py", line 94, in __init__
    self.parseArgs(argv)
  File "/usr/lib/python2.7/dist-packages/nose/core.py", line 135, in parseArgs
    self.config.configure(argv, doc=self.usage())
  File "/usr/lib/python2.7/dist-packages/nose/config.py", line 317, in configure
    self.configureWhere(options.where)
  File "/usr/lib/python2.7/dist-packages/nose/config.py", line 400, in configureWhere
    "not a directory" % path)
ValueError: Working directory pandas/tests/test_series.py not found, or not a directory

@cpcloud
Copy link
Member

cpcloud commented Jul 31, 2013

The w option only accepts directories

@cpcloud
Copy link
Member

cpcloud commented Jul 31, 2013

the -w option means "working directory"

@timmie
Copy link
Contributor Author

timmie commented Jul 31, 2013

I tried on anther machine.

nosetests -w pandas/io/tests
E
======================================================================
ERROR: Failure: ImportError (cannot import name hashtable)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\loader.py", line 413, in loadTestsFromName
    addr.filename, addr.module)
  File "C:\Python27\lib\site-packages\nose\importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "C:\Python27\lib\site-packages\nose\importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "D:\development\GitHub\pandas\pandas\__init__.py", line 6, in <module>
    from . import hashtable, tslib, lib
ImportError: cannot import name hashtable

----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (errors=1)

Apparently, one would need to complie befor that works ;-(

Sorry, this is really getting beyond my capabilities and time allowances.

@timmie
Copy link
Contributor Author

timmie commented Aug 2, 2013

just pull the upstream and then my changes on top.

@jtratner
Copy link
Contributor

jtratner commented Aug 2, 2013

@timmie why did you revert the doc change?

@timmie
Copy link
Contributor Author

timmie commented Aug 2, 2013

@jtratner
sorry, was a git handling mistake on my side.

@jtratner
Copy link
Contributor

jtratner commented Aug 2, 2013

@timmie Can you add a test case (and, potentially, test data) so that you can reproduce and test the error you noted? That way, we'll be sure not to break this again in future changes.

@timmie
Copy link
Contributor Author

timmie commented Aug 12, 2013

OK, a test is there and pass.
Let me commit a minor edit in the docs and then you may merge all.

@timmie
Copy link
Contributor Author

timmie commented Aug 14, 2013

@jtratner

May I ask for your help?

Travis sais all but one test job pass:
https://travis-ci.org/timmie/pandas

NOSE_ARGS="not slow" CLIPBOARD=xclip

with python 2.6

from xlrd import open_workbook
ImportError: No module named xlrd

https://s3.amazonaws.com/archive.travis-ci.org/jobs/10194623/log.txt

Why is xlrd not available there?

@timmie
Copy link
Contributor Author

timmie commented Aug 14, 2013

@jtratner

nevermind, let me troubleshoot my code a bit.

@jtratner
Copy link
Contributor

@timmie - I was at work and couldn't reply. It's pretty simple: not all the
builds have all the dependencies. Just skip the individual test if there is
no xlrd. (use "_skip_if_no_xlrd")

@timmie
Copy link
Contributor Author

timmie commented Aug 19, 2013

@cpcloud or @jtratner

I would be thankful if you could take a look at the code once you find time.

Why do all tests for python != 2.7 fail?

I am not sure how this relates to my changes.
On my local test it works.

Thanks in advance.

@jtratner
Copy link
Contributor

I'll take a look.
On Aug 18, 2013 8:27 PM, "timmie" notifications@github.com wrote:

@cpcloud https://github.com/cpcloud or @jtratnerhttps://github.com/jtratner

I would be thankful if you coudl take a look at the code.

Why do all tests for python != 2.7 fail?

I am not show how this relates to my changes.
On my local test it works.

Thanks in advance.


Reply to this email directly or view it on GitHubhttps://github.com//pull/4404#issuecomment-22844960
.

@cpcloud
Copy link
Member

cpcloud commented Aug 19, 2013

@jtratner Thanks.

@jtratner
Copy link
Contributor

Timmie, please look at the test case In io/tests/test_parsers around line 1974 (exact line is in the Travis test failure output). That's what's failing.

The test is actually failing on every version of Python (it doesn't cause the one 2.7 runner to fail because it only runs tests marked as 'slow').

I'm not totally clear on what exactly you are testing there. For example, testing the '!=' case without testing the case that does output that is dangerous (eg that would pass if another error message was thrown as well, etc.). Alternately, maybe better to test what the entire output should be.

Going to look at it a bit more in depth now.

2 other things:

  1. You need to rebase this on top of the current pandas master, you can run the following to do that:
git pull --rebase upstream master

If that command complains that upstream doesnt exist, run

git remote add upstream https://www.github.com/pydata/pandas.git

Next, you need to rebuild pandas so that everything is up to date. To do that, run:

make clean build

@jtratner
Copy link
Contributor

I posted the previous comment because these tests ought to be failing on your system too. That said, before I can remove this, please go back and remove all the print statements and extra if clauses you added in for testing. It makes it harder to review your PR with all that extra noise in it. Also, for the excel tests you added, instead of defining separate functions, you should be able to know in advance exactly what should result (eg a time delta of 10mins or a datetime) and test against that instead.

Finally, to get the pandas core team to accept your pull request, you need to make it clear what you are changing. From reviewing the associated PRs it looks like you want to:

  1. Allow better control of date parsing by column
  2. Allow for other Excel date modes.
  3. Change excel parser to allow skipping an arbitrary number of lines between header and first row.

Can you lay out how (3) will work? Are you passing some parameter to the function?

Finally, does this PR also add some kind of change to error reporting in read_excel? If it does, you also need to document those changes (and how they are enabled or disabled)

Thanks!

@jtratner
Copy link
Contributor

Sorry writing on my phone! I meant 'before I can review this' not remove this.

@timmie
Copy link
Contributor Author

timmie commented Aug 19, 2013

Timmie, please look at the test case In io/tests/test_parsers around line 1974 (exact line is in the Travis test >failure output). That's what's failing.

I know. But this test was there. I did not know how to modify it. for me, it seems that "verbose=True" does not produce results.

I'm not totally clear on what exactly you are testing there. For example, testing the '!=' case without testing the >case that does output that is dangerous (eg that would pass if another error message was thrown as well, etc.). >Alternately, maybe better to test what the entire output should be.

yes, as I did not understand the full and existing test, I tried to disable with "!=" to see if the others at least pass.

2 other things:

Yes, sure. thanks for the reminder.
My ideas was to see first of my code passes, then clean it (there is a lot of print statement letftovers) and the do the rebase etc.

actually, I wanted to provide you with some feedback about the difficulties I encountered as non-core dev working with this code. It too all of us so much time with this PR. Let's follow up on this after the PR is done.

That said, before I can remove this, please go back and remove all the print statements and extra if clauses you >added in for testing. It makes it harder to review your PR with all that extra noise in it.

as above, I am aware that there are leftovers from "experiements". They should be disabled. If tests go through, I will remove them.
I woudl like to keep some of the comments to make future changes easier. Is that OK?

Allow better control of date parsing by column

Yes.

Allow for other Excel date modes.

Yes/No. this was part of the inital problem. So I wanted to catch that, too.

Change excel parser to allow skipping an arbitrary number of lines between header and first row.

Yes.

Can you lay out how (3) will work? Are you passing some parameter to the function?

I defined a function to get the header from the data read in:
https://github.com/timmie/pandas/blob/aa8241f3f8665e9cec5e22dcce908fa2b3e75000/pandas/io/parsers.py#L1540

this is called in
https://github.com/timmie/pandas/blob/aa8241f3f8665e9cec5e22dcce908fa2b3e75000/pandas/io/parsers.py#L1447

if the current row is equal to the header row.

does this PR also add some kind of change to error reporting in read_excel? If it does, you also need to document those changes (and how they are enabled or disabled)

No, nothing at all. If you are referring to the "verbose=True" this was an existing test I wasn't able to debug.

Also, for the excel tests you added, instead of defining separate functions, you should be able to know in >advance exactly what should result (eg a time delta of 10mins or a datetime) and test against that instead.

you mean like hardcode the result?
I though reading as alternative by xlrd directly would prove that it works also in an alternative way...

@timmie
Copy link
Contributor Author

timmie commented Aug 19, 2013

@jtratner

Wow, it passes now! So great. I feel good...
After having recieved your comments/reply on:
#4404 (comment)
I will try my best to make it ready for acceptance into core.

Thanks again!

@jtratner
Copy link
Contributor

I'm glad to hear that it's working now! (and i'm not a core dev either, I
just contribute to pandas frequently). Definitely if you have comments
about what was difficult, that's useful to keep in mind. I'll try to take
a look after work today.

@timmie
Copy link
Contributor Author

timmie commented Aug 19, 2013

@jtratner

Nice. so this is also your after work thingy?

Comments are at:
https://groups.google.com/forum/#!topic/pydata/ZrBxuluPhHs

@jreback jreback merged commit 9ea0d44 into pandas-dev:master Aug 21, 2013
@timmie
Copy link
Contributor Author

timmie commented Aug 21, 2013

oh, you just merged and closed it!

I am in the middle of trying to clean the merging & conflict chaos and to clean up things.

What shall we do?

@jtratner
Copy link
Contributor

@jreback did you merge this? I don't think this should have been merged.

@jtratner
Copy link
Contributor

I'm actually unclear if this was merged at all - don't see the commit on pydata/master. Maybe it's a github issue?

@jreback
Copy link
Contributor

jreback commented Aug 21, 2013

@timmie I didn't merge this, I think this is a hash collision, nothing added nor deleted

@jreback
Copy link
Contributor

jreback commented Aug 21, 2013

@jtratner this happened once before...weird....

@timmie just rebase and force push I think you can reuse this PR if not, just open another one

@jreback
Copy link
Contributor

jreback commented Aug 21, 2013

This is what I merged, bet github confused it 9ea0d44

@jtratner
Copy link
Contributor

We should report this to Github as a bug.

@timmie sorry that you have to deal with yet another stumbling point here!

@jreback
Copy link
Contributor

jreback commented Aug 21, 2013

I'll give them a report

@cpcloud
Copy link
Member

cpcloud commented Aug 21, 2013

this happened once before when we were messing with remote branches for eval

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants