Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for streamed parsing of very large excelx sheets #69

Merged
merged 37 commits into from
Jan 5, 2015

Conversation

rlburkes
Copy link
Contributor

Changes should address the following issues:

Add interface to support streamed reading of an arbitrarily large xlsx document:

my_doc = Roo::Excelx.new('foo/bar/mydoc.xlsx')
my_doc.each_row_streaming(max_rows: 10) do |row|
  # each yielded row is an array of Roo::Excelx::Cell objects. 
  # max_rows allows you to limit how many rows you take which makes 
  # previewing a very large file fairly performant. 
  ...
end

Roo::Utils has a generic interface for yielding elements (streamed) of a given name to a caller.
Roo::Excelx::Cell has a coordinate that can provide row, column information to the consumer.

Added a bunch of basic test coverage for the Roo::Excelx and Roo::Utils classes to make sure I didnt break anything while doing minimal refactoring.

Hopefully you don't mind, I cleaned up a couple inconsistencies with parens, formatting, etc.

@christopherchiu
Copy link

@rlburkes I tried using your fork but I got an exception:

undefined method `present?' for #<Nokogiri::XML::Element:0x007fe29b04e590> (NoMethodError)

@rlburkes
Copy link
Contributor Author

@christopherchiu I updated with a few changes, if you continue to see issue can you provide more details and open an issue?

@ephekt
Copy link
Contributor

ephekt commented Oct 22, 2013

@christopherchiu I fixed this in my latest merge referenced above.

@rlburkes thanks for the important work here.

@Empact We'd greatly appreciate this fork merging back into master. I'll contribute to maintaining it with @rlburkes if needed. Our company ( @christopherchiu is on my team ) will also contribute.

Long live roo :)

@rlburkes
Copy link
Contributor Author

@fareesh thanks, fixed. Shouldn't happened since options is defaulted to and empty hash, but it doesn't hurt to be consistent and err on side of caution.

@fareesh
Copy link

fareesh commented Oct 23, 2013

I have some code like so:

spreadsheet = Roo::Excelx.new(file,{},:ignore)
(3..spreadsheet.last_row).each do |row_index,index|
  # make [[:col1,:col2,:col3],["a","b","c"]] become {:col1 => "a", :col2 => "b", :col3 => "c"}
  row = Hash[[spreadsheet.row(1),spreadsheet.row(row_index)].transpose] 
  ....
end

I want to do the same thing using the each_row_streaming enumerator, which I have managed partially, but unfortunately if I have a row where only the first M out of N columns are filled, then the remaining N-M cells for that row aren't padded at the end, since each_row_streaming doesn't pad cells past the last filled cell.

How would I go about adding a parameter that makes each_row_streaming add padding everywhere, including at the end? Or is there a better way to do this?

@rlburkes
Copy link
Contributor Author

@fareesh The current behavior of padding will pad from column 1(A) up to the last detected column. If you wish to pad after the last detected column (to a specified width) you can do so by padding the yielded row within your application.

(desired_size-row.size).times { row << nil } # or whatever you prefer to pad end of array

@@ -63,16 +63,21 @@ def to_type(format)
module_function :to_type
end

class ExceedsMaxError < Exception; end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not descend from Exception, which is for low-level exceptions. Better to descend from StandardError or below. http://blog.nicksieger.com/articles/2006/09/06/rubys-exception-hierarchy/

ephekt and others added 3 commits October 24, 2013 16:10
@ausangshukla
Copy link

Pl can someone shed light on how to use the new each_row_streaming call to parse large XL files.
Thanks

@ephekt
Copy link
Contributor

ephekt commented Nov 28, 2013

You should be able to see sample code in the readme.

On Thursday, November 28, 2013, Mohith Thimmaiah wrote:

Pl can someone shed light on how to use the new each_row_streaming call to
parse large XL files.
Thanks


Reply to this email directly or view it on GitHubhttps://github.com//pull/69#issuecomment-29468888
.

@Empact
Copy link
Contributor

Empact commented Dec 5, 2013

Thanks again @rlburkes, I'd love to get this in, but it needs some cleanup to start. @ephekt, could you open a PR with your changes?

@ephekt
Copy link
Contributor

ephekt commented Dec 5, 2013

Done. Opened.

On Thu, Dec 5, 2013 at 5:37 AM, Ben Woosley notifications@gh.neting.ccwrote:

Thanks again @rlburkes https://github.com/rlburkes, I'd love to get
this in, but it needs some cleanup to start. @ephekthttps://github.com/ephekt,
could you open a PR with your changes?


Reply to this email directly or view it on GitHubhttps://github.com//pull/69#issuecomment-29897716
.

@lappi-lynx
Copy link

@ephekt, I'm using 1.13.2 version of 'roo' but I can't find any minimal_load or each_row_streaming in the source. Can anyone explain e.g. how I can get only header from large xlsx file?

@rlburkes
Copy link
Contributor Author

rlburkes commented Apr 9, 2014

@extazystas You will need to use the fork this pull request originated from. @Empact any update on getting this merged?

@lappi-lynx
Copy link

@rlburkes, thanks - now I can use this methods. But could you please provide an example how to get only document header?

With origin gem I did it that way:

@headers = Hash.new
@document.row(1).each_with_index do |header,i|
  @headers[header] = i
end

@rlburkes
Copy link
Contributor Author

@extazystas, make sure and check out the readme, @ephekt provided some nice documentation. This should help get you started.

Roo::Excelx.new('xlsx_file.xlsx', minimal_load: true).each_row_streaming(max_rows: 1) do |row_cells|
    # row_cells is an array of the first row (because we are limiting to 1 row) with data in your workbook.
    # If you want the excel row index, row_cells.first.coordinate.y should == excel workbook row.  
end

@lappi-lynx
Copy link

@rlburkes great, thank you!

@rlburkes
Copy link
Contributor Author

@Empact any chance we can work to get this merged? We would love to get off our fork of Roo.

@Empact
Copy link
Contributor

Empact commented Nov 11, 2014

@rlburkes Could you rebase the PR to start? Or point out the relevant PR if it's not this one?

@Empact
Copy link
Contributor

Empact commented Nov 22, 2014

Hey @rlburkes I just pushed a major refactor of Roo::Excelx that makes all the file access lazy-loaded, e.g. sheets, comments, relationships files are only read as they are accessed.

There's still a good amount in your PR I would like to get in (e.g. the each_row_streaming could still be useful). Would you mind porting your changes over to the current codebase? The minimal_load option should no longer be necessary.

@rlburkes
Copy link
Contributor Author

@Empact I went ahead and merged master, thanks for all your work refactoring it is definitely headed in a good direction. No longer need minimal_load flag as you noticed. Added some test coverage, and made some very minor code cleanups.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.69%) when pulling ab1b32e on rlburkes:master into a3c0724 on roo-rb:master.

simonoff added a commit that referenced this pull request Jan 5, 2015
Support for streamed parsing of very large excelx sheets
@simonoff simonoff merged commit 12f3f7b into roo-rb:master Jan 5, 2015
@simonoff
Copy link
Member

simonoff commented Jan 5, 2015

@rlburkes Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants