
SASDataResult object #30

Closed · tk3369 opened this issue Feb 11, 2018 · 12 comments

tk3369 (Owner) commented Feb 11, 2018

SASLib currently returns a Dict object as the result of reading a file. That was a decision made during the early stage of development to make iterating easier; if I remember correctly, Revise.jl does not pick up changes whenever a struct definition changes.

I think it would be beneficial to migrate to a struct so the result is saner and we can write additional functions that operate on it.
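To make the contrast concrete, here is a minimal sketch; the Dict keys and the struct fields below are illustrative only, not the actual API:

```julia
# Today: everything comes back in a Dict, so data and metadata share
# one namespace and nothing is documented by the type itself.
rs = readsas("test1.sas7bdat")    # currently returns a Dict
col = rs[:ACTUAL]                 # hypothetical key layout, for illustration

# Proposed: a struct with a fixed set of fields, which we could then
# extend with functions like columns(rs), names(rs), etc.
struct ResultSet
    columns::Vector{AbstractVector}
    names::Vector{Symbol}
    size::Tuple{Int,Int}
end
```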

tk3369 (Owner) commented Feb 19, 2018

This involves breaking changes. Do you have time to review the new design? I'd appreciate your help.

Cc: @tbeason @davidanthoff @xiaodaigh

tbeason (Contributor) commented Feb 19, 2018

Thanks for the ping. In general I like it, but I have some questions.

  1. I'm uncertain about including the PerfStat struct in the ResultSet. If read is used to read the file incrementally, it makes sense that you might then just append the ResultSets, but it isn't clear what you would do with the PerfStat from each of them (add them, maybe?). If this information is only used for benchmarking purposes, perhaps it would be better to switch to a setup that uses BenchmarkTools or PkgBenchmark. I don't know what your intended use of the timings is, though, so maybe there is a bigger-picture thing here.

  2. What is your sense of how expensive the metadata function will be? Is gathering that information from the file a cheap operation? Does it scale with the file size, or is it mostly constant?

  3. Are you intending for the ResultSet to be mutable or immutable? If immutable, would we benefit from parametric typing? I'm thinking the type of the columns could be parametric, which might give us some direct ideas of proper ways to append these things. For example, if you read one chunk of the file and do not encounter missing, but the next chunk does have missing values, you would need to widen the column element type from T to Union{T, Missing} (see the sketch below). I think some of the text-file parsers now do something like this, so it would be appropriate to look there for inspiration.

I think moving that direction would definitely address some of the current shortcomings of the package, so I'm in favor of it.
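For reference on point 3, a minimal sketch of the widening step, with plain vectors standing in for ResultSet columns:

```julia
# The first chunk had no missings, so the column is a Vector{Float64}.
chunk1 = [1.0, 2.0, 3.0]
# A later chunk does contain `missing`.
chunk2 = [4.0, missing]                 # Vector{Union{Missing, Float64}}

# Appending requires widening the element type from T to Union{T, Missing}.
widened = convert(Vector{Union{Missing, Float64}}, chunk1)
append!(widened, chunk2)                # eltype is now Union{Missing, Float64}
```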

tk3369 (Owner) commented Feb 19, 2018

Thanks for the detailed feedback.

  1. Indeed, I wasn't sure whether adding PerfStat was a good idea. Currently, the code tracks the performance of the two most time-consuming operations (reading the data from disk and converting it to the proper types). It's a good tool for debugging but probably not as useful for general users. Let's just get rid of it and instead have an option to print the timings as a debug log.

  2. The metadata call should be very fast since it only reads the header portion of the file. It does not depend on the size of the file, although it does depend on the number of columns.

  3. I am expecting ResultSet to be immutable, although the immutability only applies to the references in the struct, not to the data within (see the sketch below). Currently, only date and datetime columns allow missing, and that is hardcoded into the element type whether or not there are any missing values. Probably not the best approach, but that's a different issue.
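To illustrate what I mean by that, a minimal sketch (the field names are assumptions, not the final design):

```julia
struct ResultSet                 # immutable: fields cannot be reassigned
    columns::Vector{AbstractVector}
    names::Vector{Symbol}
end

rs = ResultSet(AbstractVector[[1.0, 2.0]], [:x])
# rs.names = [:y]                # error: cannot reassign a field
push!(rs.columns[1], 3.0)        # fine: the data inside is still mutable
```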

davidanthoff (Contributor) commented
I think this looks great.

At some point it might be neat to be able to stream the data out as an iterator of named tuples without the overhead of ever allocating an array. That can be quite convenient in combination with Query.jl. But that will require quite a different API, so maybe it is just an addition for a future iteration?
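A rough sketch of that shape, assuming a hypothetical row-level reader; `open_rows`, `next_row`, and `close_rows` are made up for illustration and do not exist in SASLib today:

```julia
# Yields one NamedTuple per row; no full-column arrays are allocated.
rows(path) = Channel{NamedTuple}() do ch
    reader = open_rows(path)                        # hypothetical reader
    try
        while (row = next_row(reader)) !== nothing
            put!(ch, row)                           # e.g. (colA = 1.0, colB = "x")
        end
    finally
        close_rows(reader)
    end
end
```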

tk3369 (Owner) commented Feb 20, 2018

Yes, I think that could be a separate enhancement. Are you thinking about integrating with IterableTables, implementing the DataStreams API, or something different?

davidanthoff (Contributor) commented
IterableTables. Or rather, I'm thinking that StatFiles.jl might use this package instead of ReadStat.jl for the file formats supported. StatFiles.jl is just an iterable tables implementation.

tk3369 (Owner) commented Feb 20, 2018

That's alright. I know you already have tons of packages to worry about. Let me know if you need help integrating with StatFiles.jl.

However, StatFiles.jl does not seem to have a streaming iterator. Please correct me if I'm wrong. Given that SASLib can read the file incrementally, it would be possible to build a buffered DataStreams source. As Query.jl can take data streams, that would directly support #26.

davidanthoff (Contributor) commented
StatFiles.jl does produce an iterator of named tuples; the iterable tables interface is by definition streaming (it is just an iterator of named tuples, more or less). BUT the implementation is pretty stupid: it first reads everything into arrays and then streams things from those arrays. The reason for that is simply that ReadStat.jl returns full arrays (and also that this was the quickest way to implement things).

I think one could probably change StatFiles.jl into that kind of buffered stream pretty easily with the implementation of SASLib.jl as it stands right now. But one could probably go one step further: the iterator in StatFiles.jl could literally read every row on demand from disc, i.e. we could even get rid of the buffer.

All of these strategies, though, would read everything from disc, even if a user only wants a small part of it. If we query later (e.g. load("myfile.sas7bdat") |> @map({_.colA, _.colB}) |> DataFrame), it would still read every column from disc.

The really cool thing for that scenario would be to write a custom Query.jl backend for StatFiles.jl. Query.jl is all prepared for that; I just have never used those features. Essentially, something like load("myfile.sas7bdat") |> @map({_.colA, _.colB}) would create an in-memory AST-like representation of the query, and the StatFiles.jl backend could then analyse that query, figure out that only colA and colB need to be loaded to run it, and skip all other columns entirely, never even reading them from disc.

Now, such an implementation would be a fair bit of work right now. I have a slowly simmering project running where I hope to make writing that kind of backend a lot simpler, so I think the best strategy is to wait until that is further along. But then it would be really neat to integrate.
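As a stopgap before any of that, the buffered-stream variant might look roughly like this; it assumes the incremental open/read/close API mentioned in this thread, and the size and per-row indexing calls on the result are assumptions as well:

```julia
function buffered_rows(path; chunksize = 1_000)
    Channel{Any}() do ch
        handler = SASLib.open(path)              # assumed incremental API
        try
            while true
                rs = SASLib.read(handler, chunksize)
                nrows = size(rs, 1)              # assumes size() on the result
                for i in 1:nrows
                    put!(ch, rs[i])              # assumes per-row indexing
                end
                nrows < chunksize && break       # short chunk ⇒ end of file
            end
        finally
            SASLib.close(handler)
        end
    end
end
```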

tbeason (Contributor) commented Feb 20, 2018

@tk3369 I do suppose that whether or not to bring in missing as an element type is a separate decision, but if a major refactoring is in the works, perhaps now is the best time to discuss it. I'll open a separate issue for that. As for everything else, I think it looks good. I agree about dropping the timings from the struct; maybe the verbose option could control whether the times are printed.

@davidanthoff The plans to stream right from disc sound great, but I agree with your earlier sentiment that this is perhaps best left for a future iteration. The integration with StatFiles.jl as it currently stands would be a big step forward, though.

tk3369 self-assigned this on Feb 21, 2018
tbeason (Contributor) commented Feb 22, 2018

When you start working on this, will you let us know what branch to look at?

tk3369 (Owner) commented Feb 23, 2018

See the tk/objects branch. I also updated the wiki. Let me know what you think :-)

tk3369 (Owner) commented Feb 24, 2018

Just updated the user guide (README.md) in the same branch.
