
SASDataResult object #30

Closed · tk3369 opened this issue Feb 11, 2018 · 12 comments

tk3369 (Owner) commented Feb 11, 2018

SASLib currently returns a Dict object as the result of reading a file. That was a decision made during the early stage of development to make iterating easier; if I remember correctly, Revise.jl does not pick up changes whenever a struct definition changes.

I think it would be beneficial to migrate to a struct so the result is saner and we can write additional functions that operate on it.
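To make the contrast concrete, here is a minimal sketch; the Dict keys and the struct fields below are illustrative only, not the actual API:

```julia
# Today: everything comes back in a Dict, so data and metadata share
# one namespace and nothing is documented by the type itself.
rs = readsas("test1.sas7bdat")    # currently returns a Dict
col = rs[:ACTUAL]                 # hypothetical key layout, for illustration

# Proposed: a struct with a fixed set of fields, which we could then
# extend with functions like columns(rs), names(rs), etc.
struct ResultSet
    columns::Vector{AbstractVector}
    names::Vector{Symbol}
    size::Tuple{Int,Int}
end
```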

tk3369 (Owner) commented Feb 19, 2018

This involves breaking changes. Do you have time to review the new design? I'd appreciate your help.

Cc: @tbeason @davidanthoff @xiaodaigh

tbeason (Contributor) commented Feb 19, 2018

Thanks for the ping. In general I like it, but I have some questions.

  1. I'm uncertain about including the PerfStat struct in the ResultSet. If read is used to read the file incrementally, it makes sense that you might then just append the ResultSets, but it isn't clear what you would do with the PerfStat from each of them (add them, maybe?). If this information is only used for benchmarking purposes, perhaps it would be better to switch to a setup that uses BenchmarkTools or PkgBenchmark. I don't know what your intended use of the timings is, though, so maybe there is a bigger-picture thing here.

  2. What is your sense of how expensive the metadata function will be? Is gathering that information from the file a cheap operation? Does it scale with the file size, or is it mostly constant?

  3. Are you intending for the ResultSet to be mutable or immutable? If immutable, would we benefit from parametric typing? I'm thinking the type of the columns could be parametric, which might give us some direct ideas of proper ways to append these things. For example, if you read one chunk of the file and do not encounter missing, but the next chunk does have missing values, you would need to widen the column element type from T to Union{T, Missing} (see the sketch below). I think some of the text-file parsers now do something like this, so it would be appropriate to look there for inspiration.

I think moving that direction would definitely address some of the current shortcomings of the package, so I'm in favor of it.
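For reference on point 3, a minimal sketch of the widening step, with plain vectors standing in for ResultSet columns:

```julia
# The first chunk had no missings, so the column is a Vector{Float64}.
chunk1 = [1.0, 2.0, 3.0]
# A later chunk does contain `missing`.
chunk2 = [4.0, missing]                 # Vector{Union{Missing, Float64}}

# Appending requires widening the element type from T to Union{T, Missing}.
widened = convert(Vector{Union{Missing, Float64}}, chunk1)
append!(widened, chunk2)                # eltype is now Union{Missing, Float64}
```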

tk3369 (Owner) commented Feb 19, 2018

Thanks for the detailed feedback.

  1. Indeed, I wasn't sure whether adding PerfStat was a good idea. Currently, the code tracks the performance of the two most time-consuming operations (reading the data from disk and converting it to the proper types). It's a good tool for debugging but probably not as useful for general users. Let's just get rid of it and instead have an option to print the timings as a debug log.

  2. The metadata call should be very fast since it only reads the header portion of the file. It does not depend on the size of the file, although it does depend on the number of columns.

  3. I am expecting ResultSet to be immutable, although the immutability only applies to the references in the struct, not to the data within (see the sketch below). Currently, only date and datetime columns allow missing, and that is hardcoded into the element type whether or not there are any missing values. Probably not the best approach, but that's a different issue.
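To illustrate what I mean by that, a minimal sketch (the field names are assumptions, not the final design):

```julia
struct ResultSet                 # immutable: fields cannot be reassigned
    columns::Vector{AbstractVector}
    names::Vector{Symbol}
end

rs = ResultSet(AbstractVector[[1.0, 2.0]], [:x])
# rs.names = [:y]                # error: cannot reassign a field
push!(rs.columns[1], 3.0)        # fine: the data inside is still mutable
```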

davidanthoff (Contributor) commented
I think this looks great.

At some point it might be neat to be able to stream the data out as an iterator of named tuples without the overhead of ever allocating an array. That can be quite convenient in combination with Query.jl. But that will require quite a different API, so maybe it is just an addition for a future iteration?
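A rough sketch of that shape, assuming a hypothetical row-level reader; `open_rows`, `next_row`, and `close_rows` are made up for illustration and do not exist in SASLib today:

```julia
# Yields one NamedTuple per row; no full-column arrays are allocated.
rows(path) = Channel{NamedTuple}() do ch
    reader = open_rows(path)                        # hypothetical reader
    try
        while (row = next_row(reader)) !== nothing
            put!(ch, row)                           # e.g. (colA = 1.0, colB = "x")
        end
    finally
        close_rows(reader)
    end
end
```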

tk3369 (Owner) commented Feb 20, 2018

Yes, I think that could be a separate enhancement. Are you thinking about integrating with IterableTables, implementing the DataStreams API, or something different?

davidanthoff (Contributor) commented
IterableTables. Or rather, I'm thinking that StatFiles.jl might use this package instead of ReadStat.jl for the file formats supported. StatFiles.jl is just an iterable tables implementation.

tk3369 (Owner) commented Feb 20, 2018

That's alright. I know you already have tons of packages to worry about. Let me know if you need help integrating with StatFiles.jl.

However, StatFiles.jl does not seem to have a streaming iterator. Please correct me if I'm wrong. Given that SASLib can read the file incrementally, it would be possible to build a buffered DataStreams source. As Query.jl can take data streams, that would directly support #26.

davidanthoff (Contributor) commented
StatFiles.jl does produce an iterator of named tuples; the iterable tables interface is by definition streaming (it is just an iterator of named tuples, more or less). BUT the implementation is pretty stupid: it first reads everything into arrays and then streams things from those arrays. The reason for that is simply that ReadStat.jl returns full arrays (and also that this was the quickest way to implement things).

I think one could probably change StatFiles.jl into that kind of buffered stream pretty easily with the implementation of SASLib.jl as it stands right now. But one could probably go one step further: the iterator in StatFiles.jl could literally read every row on demand from disc, i.e. we could even get rid of the buffer.

All of these strategies, though, would read everything from disc, even if a user only wants a small part of it. If we query later (e.g. load("myfile.sas7bdat") |> @map({_.colA, _.colB}) |> DataFrame), it would still read every column from disc.

The really cool thing for that scenario would be to write a custom Query.jl backend for StatFiles.jl. Query.jl is all prepared for that; I just have never used those features. Essentially, something like load("myfile.sas7bdat") |> @map({_.colA, _.colB}) would create an in-memory AST-like representation of the query, and the StatFiles.jl backend could then analyse that query, figure out that only colA and colB need to be loaded to run it, and skip all other columns entirely, never even reading them from disc.

Now, such an implementation would be a fair bit of work right now. I have a slowly simmering project running where I hope to make writing that kind of backend a lot simpler, so I think the best strategy is to wait until that is further along. But then it would be really neat to integrate.
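As a stopgap before any of that, the buffered-stream variant might look roughly like this; it assumes the incremental open/read/close API mentioned in this thread, and the size and per-row indexing calls on the result are assumptions as well:

```julia
function buffered_rows(path; chunksize = 1_000)
    Channel{Any}() do ch
        handler = SASLib.open(path)              # assumed incremental API
        try
            while true
                rs = SASLib.read(handler, chunksize)
                nrows = size(rs, 1)              # assumes size() on the result
                for i in 1:nrows
                    put!(ch, rs[i])              # assumes per-row indexing
                end
                nrows < chunksize && break       # short chunk ⇒ end of file
            end
        finally
            SASLib.close(handler)
        end
    end
end
```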

tbeason (Contributor) commented Feb 20, 2018

@tk3369 I do suppose that whether or not to bring in missing as an element type is a separate decision, but if a major refactoring is in the works, perhaps now is the best time to discuss it. I'll open a separate issue for that. As for everything else, I think it looks good. I agree about dropping the timings from the struct; maybe the verbose option could control whether the times are printed.

@davidanthoff The plans to stream right from disc sound great, but I agree with your earlier sentiment that this is perhaps best left for a future iteration. The integration with StatFiles.jl as it currently stands would be a big step forward, though.

tk3369 self-assigned this on Feb 21, 2018
tbeason (Contributor) commented Feb 22, 2018

When you start working on this, will you let us know what branch to look at?

tk3369 (Owner) commented Feb 23, 2018

See the tk/objects branch. I also updated the wiki. Let me know what you think :-)

tk3369 (Owner) commented Feb 24, 2018

Just updated the user guide (README.md) in the same branch.
