Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

detailed and weighted summary stats #138

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

detailed and weighted summary stats #138

wants to merge 1 commit into from

Conversation

matthieugomez
Copy link
Contributor

This pull request implements detailed and weighted summary stats.
scipy has a a similar describefunction for detailed stats, Stata has a similar summarize command with the options weight and details.
I'll write a similar pull request in DataFrames if if this gets merged here.
As an aside, this pull request is a nice example where a type such as :Ones (issue #135) would make the code simpler.

Additionally, two minor changes:

  • Rename the function summarystats to describe
  • Rename percentile field from q?? to p??

@andreasnoack
Copy link
Member

LGTM.

@nalimilan
Copy link
Member

@andreasnoack Should we merge this then? I even wonder if we should tag the release after it rather than before.

@andreasnoack
Copy link
Member

I just had the same thoughts, but I got in doubt about this PR when looking through it again. The overall idea is fine, but there are some details that I'd like to discuss. Tagging a new version is cheap so I think we should go ahead with the version I've tagged and then just tag another version when this is ready.

@nalimilan
Copy link
Member

Sure.


immutable SummaryStats{T<:AbstractFloat}
function describe{T<:Real}(a::AbstractArray{T}, ::Type{Val{false}})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthieugomez After having used Vals for a while in the linear algebra code, I've come to the conclusion that I don't like them. Do you have any ideas about alternatives? The three I can come up with are

  1. Define two different functions
  2. Use the detailed type for both versions, but allow uninitialized fields.
  3. Ignore the type instability

@andreasnoack
Copy link
Member

@matthieugomez What is the typical use case for the summary statistics? My guess is that it is mainly for exploratory data analysis and reports. What other methods would you define than show? I'm wondering if it worth the trouble to have the immutables at all and instead let describe write to some output device similarly to show and print. That would save us all the type stability considerations and avoid four immutables.

@nalimilan
Copy link
Member

I think I'd prefer to keep the immutables, as they can always be useful to store or retrieve some of the information they compute. I don't like functions which print their output directly and don't allow accessing it.

OTOH, type-stability isn't likely a concern here. We could also merge the detailed and non-detailed versions, with an additional boolean field indicating whether the detail-only field are filled.

@matthieugomez
Copy link
Contributor Author

Yes, I think the only case is exploratory data analysis. That being said, like @nalimilan I think it's best to keep the information retrievable.
describe does not need to be type stable, so I could just replace Val by if
I think it's fine to keep 4 immutables. The objects should print differently so it seems natural that they have different types.

@andreasnoack
Copy link
Member

Since the describe function won't be used in performance critical calculations and since the summary statistics are cheap to compute, the values can just be recomputed with mean, std, quantile etc. if necessary. Storing values is mainly a concern for expensive calculations.

Avoiding the types would also make the code shorter and reduce compilation so I only think defining them makes sense if we can come up with other functions than show that could benefit from being overloaded with these types.

@nalimilan
Copy link
Member

Actually, I think the best solution would have been to return a table-like object, with row names giving the indicator type, and a column giving its value. This would allow printing it as a table in rich outputs like IJulia or HTML/LaTeX, instead of raw text. Unfortunately, for now NamedArrays are a separate package StatsBase cannot depend on.

@andreasnoack
Copy link
Member

Maybe we could mimic the parts of NamedArrays we need in a single simpler summary statistics type that can contain an arbitrary number of statistics together with a name for each of them. Maybe just a vector of a key, label, and value. In that way, we'd have a single type and a single show method here and at the same time, the type could be used for fancier LaTeX and HTML tables.

@nalimilan
Copy link
Member

Yeah, but this is a very common need which is not limited to printing statistics. For example, CoefTable is somewhat similar (though richer and more complex). Maybe we should create something as general as possible which could be used by other packages? It could be much simpler than NamedArrays since it would always be a matrix with string row/column names, with only simple support for indexing (using the key).

@matthieugomez
Copy link
Contributor Author

I think DataFrames may be the right object then
https://github.com/dgrtwo/broom
It's the easiest type to impose across different packages

@nalimilan
Copy link
Member

No, I think that DataFrames in Julia are quite different than in R and that they shouldn't be used in the same way. They should be considered as databases, to be used for relatively large data sets. What we need here is very different: a very lightweight wrapper around a matrix or around a set of values. Rows (and possibly columns) should have string names (including spaces if needed), not numbers or identifiers, and the output shouldn't be abbreviated as with DataFrames (which are typically too large to fit on screen; actually I'd prefer if DataFrames printed a summary of columns instead of the actual contents, but that's another story).

This would also allow using this simple type without depending on DataFrames and DataArrays, which many packages won't accept even if they need this kind of object.

@nalimilan
Copy link
Member

FWIW, somebody just requested a way to access the different values computed by describe from DataFrames.jl at JuliaData/DataFrames.jl#926. I think this illustrates that it can be useful, even if it's easy to compute these manually.

@ashleylid
Copy link

@nalimilan thanks 👍

@andreasnoack
Copy link
Member

@kleinash Why do you prefer to access fields instead of computing them with something like

len = length(x)
t = typeof(x)
xu = unique(x)

?

@ashleylid
Copy link

@andreasnoack I'll use that then. Just seemed as though describe could work for looping through aspects of the df.

@andreasnoack
Copy link
Member

If you think it would be easier to use a describe function and then extract the values from a type returned from such a function, I'd like to hear why. There might be good reasons that we haven't thought of yet, but to me, it seems convoluted to use the temporary type to store the result. You might also end up computing more statistics than you actually need.

I still think it would be useful to have a type for pretty specialized HTML and LaTeX printing but that seems to be a different usecase from the one you are considering here.

@ashleylid
Copy link

I appreciate your replies. Thank you for your direction and comments. I will do as suggested.

@nalimilan
Copy link
Member

I still think it would be useful to have a type for pretty specialized HTML and LaTeX printing but that seems to be a different usecase from the one you are considering here.

If you want the results of describe to benefit from pretty-printing, they have to be stored in a type which supports this. That's why I think it's a related issue. Printing raw text to the output should really be reserved to actual text, i.e. not structured objects like tables or series of values.

@ashleylid
Copy link

I just run
$ julia program.jl > output.txt

@andreasnoack
Copy link
Member

@nalimilan I think we agree on most of this now. My main points are that

  • the type should be a single type flexible enough to allow different sets of univariate statistics
  • the type is mainly for printing. We shouldn't encourage using describe for computing intermediate quantities like means and standard deviations. It is simpler and probably more efficient just to use the functions for the specific quantity, mean, std etc.

@nalimilan
Copy link
Member

Fully agreed.

@matthieugomez
Copy link
Contributor Author

To sum up: my understanding is that you want all the summary stats to be stored in the same type, and that this type would print something like a list of keys : values format in a specific order.

I think it would limit the way we can print things. Even for the (very simple) summary statistics case, I'd like the key "sum of weights stat" to be printed after a new line. The situation for Coefficient Tables is even more complicated. Restricting the number of types will restrict the way we can print information, and I don't think it's a good trade off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants