fix the excel reader: hours & header #4404

timmie · 2013-07-30T13:57:58Z

in the excel.py there is a fix enabling reading xlsx files with both
datemodes: (see #4332)
in the parsers.py there is the fix for readinh the header even if there
are additional rows (to be skipped) between a header and the data (see: #4340)

I hope this PR is adequate. If not, please let me know.
I can supply the sample file for testing proving that it works. See also:
https://github.com/timmie/example_code_data

cpcloud · 2013-07-30T14:04:42Z

pandas/io/parsers.py

+                        line = self._next_line()
+            else:
+                #if there is just 1 header row to be parsed, this works.
+				try:


looks like a syntax error here...not aligned with the except clause below

OK, I will fix that.

Sorry, I checked. The code works.
Please check with: https://github.com/timmie/example_code_data

OK, you were right ;-)

But now fixed: timmie@b7ef7a5

BTW, thanks for looking into this!

cpcloud · 2013-07-30T16:03:37Z

can u hook up travis ci?

https://github.com/pydata/pandas/wiki/Testing#hook-up-travis-ci

thanks

timmie · 2013-07-30T16:19:27Z

uff, never done this. I will see what I can do in the next days.

jtratner · 2013-07-30T21:10:32Z

@timmie it's actually really easy to set up. All you need to do is log in to Travis with your github account and follow a few instructions. The upside is that Travis then tests your code against multiple versions of Python and multiple different environments, which can really help iron out bugs.

cpcloud · 2013-07-30T21:19:27Z

it's important for these kinds of prs especially since we don't want a bunch of failures simply because the underlying lib doesn't exist for, say, python 3.2

timmie · 2013-07-30T21:28:36Z

ok, I added the service hook in the GitHub settings.

What is the next step?

Do you also wanna have a test with the data from https://github.com/timmie/example_code_data?
Before, I woudl appreciate a comment if my changes are appropriate.

THX.

jtratner · 2013-07-30T21:31:30Z

I find Travis really useful, TBH. You can kind of throw a bunch of things
at the wall and see what sticks without needing to go through the hassle of
compiling 5 times with dependencies.

cpcloud · 2013-07-30T21:35:05Z

we need that locally! i'm working on something to make detox more useful to pandas devs...just a better config really

jtratner · 2013-07-30T21:39:50Z

git commit --amend -C HEAD
git push --force

timmie · 2013-07-30T22:37:16Z

so, no I see a lot of errors for code I never touched:
https://travis-ci.org/timmie/pandas/jobs/9669404
https://travis-ci.org/timmie/pandas

cpcloud · 2013-07-30T22:49:57Z

many of those are related to your change

cpcloud · 2013-07-30T22:50:49Z

actually ALL of them are

timmie · 2013-07-30T22:53:24Z

mmh. worked with my files...

How can I test this on my machine?

cpcloud · 2013-07-30T22:54:20Z

nosetests -w pandas/io/tests

cpcloud · 2013-07-30T22:57:08Z

pandas/io/parsers.py

+                        line = self._next_line()
+            else:
+                #if there is just 1 header row to be parsed, this works.
+                line = self.data[header[0]]

                this_columns = []
                for i, c in enumerate(line):


also why did you change the permissions of this file?

clone to a FAT32? I do not know. Usually, I work with bzr which can ignore this.
I assume that this is not what breaks the tests.

no it's probably not

git tracks that information (in a somewhat reduced capacity) i think there are only 2 or 3 different sets of permissions that are possible with git

timmie · 2013-07-30T23:30:23Z

on my local I get:

 nosetests -w pandas/io/tests

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK

cpcloud · 2013-07-30T23:30:54Z

you're probably not in the pandas directory

timmie · 2013-07-30T23:32:21Z

I am:
~/coding/virtualenvs/pandas/pandas_fork

cpcloud · 2013-07-30T23:33:49Z

try find . -name 'test_pytables.py' from that dir what does it return?

timmie · 2013-07-30T23:34:54Z

find . -name 'test_pytables.py'
./pandas/io/tests/test_pytables.py

cpcloud · 2013-07-31T00:08:43Z

not sure what the issue is

jreback · 2013-07-31T00:21:51Z

@timmie

make sure that you have these git options set

git config -l --global

so you don't change file modes and line endings are good (if you are on windows?)

core.filemode=False
core.autocrlf=input

@cpcloud maybe add to tip & tricks?

cpcloud · 2013-07-31T00:33:28Z

Sounds good

timmie · 2013-07-31T00:34:31Z

@cpcloud , @jreback
Thanks for the help.

I feel lost again. I found the error and tryed to fix it. But all these hurdles like git, travis etc. are too high for the occasional library contributor...

cpcloud · 2013-07-31T00:37:06Z

Ironically I didn't start to understand git until I started contributing more.. Just keep at it :)

timmie · 2013-07-31T00:48:29Z

what does this mean

nosetests -w pandas/tests/test_series.py 
Traceback (most recent call last):
  File "/usr/bin/nosetests", line 9, in <module>
    load_entry_point('nose==1.1.2', 'console_scripts', 'nosetests')()
  File "/usr/lib/python2.7/dist-packages/nose/core.py", line 118, in __init__
    **extra_args)
  File "/usr/lib/python2.7/unittest/main.py", line 94, in __init__
    self.parseArgs(argv)
  File "/usr/lib/python2.7/dist-packages/nose/core.py", line 135, in parseArgs
    self.config.configure(argv, doc=self.usage())
  File "/usr/lib/python2.7/dist-packages/nose/config.py", line 317, in configure
    self.configureWhere(options.where)
  File "/usr/lib/python2.7/dist-packages/nose/config.py", line 400, in configureWhere
    "not a directory" % path)
ValueError: Working directory pandas/tests/test_series.py not found, or not a directory

cpcloud · 2013-07-31T00:52:01Z

The w option only accepts directories

cpcloud · 2013-07-31T02:29:12Z

the -w option means "working directory"

timmie · 2013-07-31T08:09:36Z

I tried on anther machine.

nosetests -w pandas/io/tests
E
======================================================================
ERROR: Failure: ImportError (cannot import name hashtable)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\loader.py", line 413, in loadTestsFromName
    addr.filename, addr.module)
  File "C:\Python27\lib\site-packages\nose\importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "C:\Python27\lib\site-packages\nose\importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "D:\development\GitHub\pandas\pandas\__init__.py", line 6, in <module>
    from . import hashtable, tslib, lib
ImportError: cannot import name hashtable

----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (errors=1)

Apparently, one would need to complie befor that works ;-(

Sorry, this is really getting beyond my capabilities and time allowances.

timmie · 2013-08-02T20:55:39Z

just pull the upstream and then my changes on top.

jtratner · 2013-08-02T20:59:49Z

@timmie why did you revert the doc change?

timmie · 2013-08-02T21:20:08Z

@jtratner
sorry, was a git handling mistake on my side.

jtratner · 2013-08-02T21:28:51Z

@timmie Can you add a test case (and, potentially, test data) so that you can reproduce and test the error you noted? That way, we'll be sure not to break this again in future changes.

timmie · 2013-08-12T16:01:58Z

OK, a test is there and pass.
Let me commit a minor edit in the docs and then you may merge all.

timmie · 2013-08-14T13:14:54Z

@jtratner

May I ask for your help?

Travis sais all but one test job pass:
https://travis-ci.org/timmie/pandas

NOSE_ARGS="not slow" CLIPBOARD=xclip

with python 2.6

from xlrd import open_workbook
ImportError: No module named xlrd

https://s3.amazonaws.com/archive.travis-ci.org/jobs/10194623/log.txt

Why is xlrd not available there?

timmie · 2013-08-14T14:29:08Z

@jtratner

nevermind, let me troubleshoot my code a bit.

jtratner · 2013-08-14T20:58:38Z

@timmie - I was at work and couldn't reply. It's pretty simple: not all the
builds have all the dependencies. Just skip the individual test if there is
no xlrd. (use "_skip_if_no_xlrd")

timmie · 2013-08-19T00:26:51Z

@cpcloud or @jtratner

I would be thankful if you could take a look at the code once you find time.

Why do all tests for python != 2.7 fail?

I am not sure how this relates to my changes.
On my local test it works.

Thanks in advance.

jtratner · 2013-08-19T00:36:18Z

I'll take a look.
On Aug 18, 2013 8:27 PM, "timmie" notifications@github.com wrote:

@cpcloud https://github.com/cpcloud or @jtratnerhttps://github.com/jtratner

I would be thankful if you coudl take a look at the code.

Why do all tests for python != 2.7 fail?

I am not show how this relates to my changes.
On my local test it works.

Thanks in advance.

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/4404#issuecomment-22844960
.

cpcloud · 2013-08-19T00:38:56Z

@jtratner Thanks.

jtratner · 2013-08-19T02:25:19Z

Timmie, please look at the test case In io/tests/test_parsers around line 1974 (exact line is in the Travis test failure output). That's what's failing.

The test is actually failing on every version of Python (it doesn't cause the one 2.7 runner to fail because it only runs tests marked as 'slow').

I'm not totally clear on what exactly you are testing there. For example, testing the '!=' case without testing the case that does output that is dangerous (eg that would pass if another error message was thrown as well, etc.). Alternately, maybe better to test what the entire output should be.

Going to look at it a bit more in depth now.

2 other things:

You need to rebase this on top of the current pandas master, you can run the following to do that:

git pull --rebase upstream master

If that command complains that upstream doesnt exist, run

git remote add upstream https://www.github.com/pydata/pandas.git

Next, you need to rebuild pandas so that everything is up to date. To do that, run:

make clean build

jtratner · 2013-08-19T02:52:14Z

I posted the previous comment because these tests ought to be failing on your system too. That said, before I can remove this, please go back and remove all the print statements and extra if clauses you added in for testing. It makes it harder to review your PR with all that extra noise in it. Also, for the excel tests you added, instead of defining separate functions, you should be able to know in advance exactly what should result (eg a time delta of 10mins or a datetime) and test against that instead.

Finally, to get the pandas core team to accept your pull request, you need to make it clear what you are changing. From reviewing the associated PRs it looks like you want to:

Allow better control of date parsing by column
Allow for other Excel date modes.
Change excel parser to allow skipping an arbitrary number of lines between header and first row.

Can you lay out how (3) will work? Are you passing some parameter to the function?

Finally, does this PR also add some kind of change to error reporting in read_excel? If it does, you also need to document those changes (and how they are enabled or disabled)

Thanks!

jtratner · 2013-08-19T02:53:01Z

Sorry writing on my phone! I meant 'before I can review this' not remove this.

timmie · 2013-08-19T06:53:29Z

Timmie, please look at the test case In io/tests/test_parsers around line 1974 (exact line is in the Travis test >failure output). That's what's failing.

I know. But this test was there. I did not know how to modify it. for me, it seems that "verbose=True" does not produce results.

I'm not totally clear on what exactly you are testing there. For example, testing the '!=' case without testing the >case that does output that is dangerous (eg that would pass if another error message was thrown as well, etc.). >Alternately, maybe better to test what the entire output should be.

yes, as I did not understand the full and existing test, I tried to disable with "!=" to see if the others at least pass.

2 other things:

Yes, sure. thanks for the reminder.
My ideas was to see first of my code passes, then clean it (there is a lot of print statement letftovers) and the do the rebase etc.

actually, I wanted to provide you with some feedback about the difficulties I encountered as non-core dev working with this code. It too all of us so much time with this PR. Let's follow up on this after the PR is done.

That said, before I can remove this, please go back and remove all the print statements and extra if clauses you >added in for testing. It makes it harder to review your PR with all that extra noise in it.

as above, I am aware that there are leftovers from "experiements". They should be disabled. If tests go through, I will remove them.
I woudl like to keep some of the comments to make future changes easier. Is that OK?

Allow better control of date parsing by column

Yes.

Allow for other Excel date modes.

Yes/No. this was part of the inital problem. So I wanted to catch that, too.

Change excel parser to allow skipping an arbitrary number of lines between header and first row.

Yes.

Can you lay out how (3) will work? Are you passing some parameter to the function?

I defined a function to get the header from the data read in:
https://github.com/timmie/pandas/blob/aa8241f3f8665e9cec5e22dcce908fa2b3e75000/pandas/io/parsers.py#L1540

this is called in
https://github.com/timmie/pandas/blob/aa8241f3f8665e9cec5e22dcce908fa2b3e75000/pandas/io/parsers.py#L1447

if the current row is equal to the header row.

does this PR also add some kind of change to error reporting in read_excel? If it does, you also need to document those changes (and how they are enabled or disabled)

No, nothing at all. If you are referring to the "verbose=True" this was an existing test I wasn't able to debug.

Also, for the excel tests you added, instead of defining separate functions, you should be able to know in >advance exactly what should result (eg a time delta of 10mins or a datetime) and test against that instead.

you mean like hardcode the result?
I though reading as alternative by xlrd directly would prove that it works also in an alternative way...

timmie · 2013-08-19T09:16:50Z

@jtratner

Wow, it passes now! So great. I feel good...
After having recieved your comments/reply on:
#4404 (comment)
I will try my best to make it ready for acceptance into core.

Thanks again!

jtratner · 2013-08-19T11:08:02Z

I'm glad to hear that it's working now! (and i'm not a core dev either, I
just contribute to pandas frequently). Definitely if you have comments
about what was difficult, that's useful to keep in mind. I'll try to take
a look after work today.

timmie · 2013-08-19T12:55:24Z

@jtratner

Nice. so this is also your after work thingy?

Comments are at:
https://groups.google.com/forum/#!topic/pydata/ZrBxuluPhHs

timmie · 2013-08-21T23:17:34Z

oh, you just merged and closed it!

I am in the middle of trying to clean the merging & conflict chaos and to clean up things.

What shall we do?

jtratner · 2013-08-21T23:22:16Z

@jreback did you merge this? I don't think this should have been merged.

jtratner · 2013-08-21T23:23:20Z

I'm actually unclear if this was merged at all - don't see the commit on pydata/master. Maybe it's a github issue?

jreback · 2013-08-21T23:23:31Z

@timmie I didn't merge this, I think this is a hash collision, nothing added nor deleted

jreback · 2013-08-21T23:24:13Z

@jtratner this happened once before...weird....

@timmie just rebase and force push I think you can reuse this PR if not, just open another one

jreback · 2013-08-21T23:25:44Z

This is what I merged, bet github confused it 9ea0d44

jtratner · 2013-08-21T23:34:03Z

We should report this to Github as a bug.

@timmie sorry that you have to deal with yet another stumbling point here!

jreback · 2013-08-21T23:39:29Z

I'll give them a report

cpcloud · 2013-08-21T23:50:10Z

this happened once before when we were messing with remote branches for eval

cpcloud reviewed Jul 30, 2013
View reviewed changes

jreback merged commit 9ea0d44 into pandas-dev:master Aug 21, 2013

timmie mentioned this pull request Aug 21, 2013

excel reader & skip row between data & header & docs #4631

Merged

timmie mentioned this pull request Aug 29, 2015

ENH: Open Document Format ODS support in read_excel() #9070

Closed

fix the excel reader: hours & header #4404

fix the excel reader: hours & header #4404

Conversation

timmie commented Jul 30, 2013

cpcloud Jul 30, 2013

Choose a reason for hiding this comment

timmie Jul 30, 2013

Choose a reason for hiding this comment

timmie Jul 30, 2013

Choose a reason for hiding this comment

timmie Jul 30, 2013

Choose a reason for hiding this comment

cpcloud commented Jul 30, 2013

timmie commented Jul 30, 2013

jtratner commented Jul 30, 2013

cpcloud commented Jul 30, 2013

timmie commented Jul 30, 2013

jtratner commented Jul 30, 2013

cpcloud commented Jul 30, 2013

jtratner commented Jul 30, 2013

timmie commented Jul 30, 2013

cpcloud commented Jul 30, 2013

cpcloud commented Jul 30, 2013

timmie commented Jul 30, 2013

cpcloud commented Jul 30, 2013

cpcloud Jul 30, 2013

Choose a reason for hiding this comment

timmie Jul 30, 2013

Choose a reason for hiding this comment

cpcloud Jul 30, 2013

Choose a reason for hiding this comment

timmie commented Jul 30, 2013

cpcloud commented Jul 30, 2013

timmie commented Jul 30, 2013

cpcloud commented Jul 30, 2013

timmie commented Jul 30, 2013

cpcloud commented Jul 31, 2013

jreback commented Jul 31, 2013

cpcloud commented Jul 31, 2013

timmie commented Jul 31, 2013

cpcloud commented Jul 31, 2013

timmie commented Jul 31, 2013

cpcloud commented Jul 31, 2013

cpcloud commented Jul 31, 2013

timmie commented Jul 31, 2013

timmie commented Aug 2, 2013

jtratner commented Aug 2, 2013

timmie commented Aug 2, 2013

jtratner commented Aug 2, 2013

timmie commented Aug 12, 2013

timmie commented Aug 14, 2013

timmie commented Aug 14, 2013

jtratner commented Aug 14, 2013

timmie commented Aug 19, 2013

jtratner commented Aug 19, 2013

cpcloud commented Aug 19, 2013

jtratner commented Aug 19, 2013

jtratner commented Aug 19, 2013

jtratner commented Aug 19, 2013

timmie commented Aug 19, 2013

timmie commented Aug 19, 2013

jtratner commented Aug 19, 2013

timmie commented Aug 19, 2013

timmie commented Aug 21, 2013

jtratner commented Aug 21, 2013

jtratner commented Aug 21, 2013

jreback commented Aug 21, 2013

jreback commented Aug 21, 2013

jreback commented Aug 21, 2013

jtratner commented Aug 21, 2013

jreback commented Aug 21, 2013

cpcloud commented Aug 21, 2013