detailed and weighted summary stats #138

matthieugomez · 2015-10-06T23:02:39Z

This pull request implements detailed and weighted summary stats.
scipy has a a similar describefunction for detailed stats, Stata has a similar summarize command with the options weight and details.
I'll write a similar pull request in DataFrames if if this gets merged here.
As an aside, this pull request is a nice example where a type such as :Ones (issue #135) would make the code simpler.

Additionally, two minor changes:

Rename the function summarystats to describe
Rename percentile field from q?? to p??

andreasnoack · 2015-12-08T00:45:27Z

~~LGTM.~~

nalimilan · 2016-03-01T14:28:32Z

@andreasnoack Should we merge this then? I even wonder if we should tag the release after it rather than before.

andreasnoack · 2016-03-01T14:37:22Z

I just had the same thoughts, but I got in doubt about this PR when looking through it again. The overall idea is fine, but there are some details that I'd like to discuss. Tagging a new version is cheap so I think we should go ahead with the version I've tagged and then just tag another version when this is ready.

nalimilan · 2016-03-01T14:41:59Z

Sure.

andreasnoack · 2016-03-01T14:43:26Z

src/scalarstats.jl


-immutable SummaryStats{T<:AbstractFloat}
+function describe{T<:Real}(a::AbstractArray{T}, ::Type{Val{false}})


@matthieugomez After having used Vals for a while in the linear algebra code, I've come to the conclusion that I don't like them. Do you have any ideas about alternatives? The three I can come up with are

Define two different functions

Use the detailed type for both versions, but allow uninitialized fields.

Ignore the type instability

andreasnoack · 2016-03-01T14:48:35Z

@matthieugomez What is the typical use case for the summary statistics? My guess is that it is mainly for exploratory data analysis and reports. What other methods would you define than show? I'm wondering if it worth the trouble to have the immutables at all and instead let describe write to some output device similarly to show and print. That would save us all the type stability considerations and avoid four immutables.

nalimilan · 2016-03-01T15:02:06Z

I think I'd prefer to keep the immutables, as they can always be useful to store or retrieve some of the information they compute. I don't like functions which print their output directly and don't allow accessing it.

OTOH, type-stability isn't likely a concern here. We could also merge the detailed and non-detailed versions, with an additional boolean field indicating whether the detail-only field are filled.

matthieugomez · 2016-03-01T15:19:35Z

Yes, I think the only case is exploratory data analysis. That being said, like @nalimilan I think it's best to keep the information retrievable.
describe does not need to be type stable, so I could just replace Val by if
I think it's fine to keep 4 immutables. The objects should print differently so it seems natural that they have different types.

andreasnoack · 2016-03-01T15:44:30Z

Since the describe function won't be used in performance critical calculations and since the summary statistics are cheap to compute, the values can just be recomputed with mean, std, quantile etc. if necessary. Storing values is mainly a concern for expensive calculations.

Avoiding the types would also make the code shorter and reduce compilation so I only think defining them makes sense if we can come up with other functions than show that could benefit from being overloaded with these types.

nalimilan · 2016-03-01T16:01:37Z

Actually, I think the best solution would have been to return a table-like object, with row names giving the indicator type, and a column giving its value. This would allow printing it as a table in rich outputs like IJulia or HTML/LaTeX, instead of raw text. Unfortunately, for now NamedArrays are a separate package StatsBase cannot depend on.

andreasnoack · 2016-03-01T16:28:30Z

Maybe we could mimic the parts of NamedArrays we need in a single simpler summary statistics type that can contain an arbitrary number of statistics together with a name for each of them. Maybe just a vector of a key, label, and value. In that way, we'd have a single type and a single show method here and at the same time, the type could be used for fancier LaTeX and HTML tables.

nalimilan · 2016-03-01T16:33:22Z

Yeah, but this is a very common need which is not limited to printing statistics. For example, CoefTable is somewhat similar (though richer and more complex). Maybe we should create something as general as possible which could be used by other packages? It could be much simpler than NamedArrays since it would always be a matrix with string row/column names, with only simple support for indexing (using the key).

matthieugomez · 2016-03-01T17:29:19Z

I think DataFrames may be the right object then
https://github.com/dgrtwo/broom
It's the easiest type to impose across different packages

nalimilan · 2016-03-01T17:49:40Z

No, I think that DataFrames in Julia are quite different than in R and that they shouldn't be used in the same way. They should be considered as databases, to be used for relatively large data sets. What we need here is very different: a very lightweight wrapper around a matrix or around a set of values. Rows (and possibly columns) should have string names (including spaces if needed), not numbers or identifiers, and the output shouldn't be abbreviated as with DataFrames (which are typically too large to fit on screen; actually I'd prefer if DataFrames printed a summary of columns instead of the actual contents, but that's another story).

This would also allow using this simple type without depending on DataFrames and DataArrays, which many packages won't accept even if they need this kind of object.

nalimilan · 2016-03-10T13:54:26Z

FWIW, somebody just requested a way to access the different values computed by describe from DataFrames.jl at JuliaData/DataFrames.jl#926. I think this illustrates that it can be useful, even if it's easy to compute these manually.

ashleylid · 2016-03-10T13:57:05Z

@nalimilan thanks 👍

andreasnoack · 2016-03-10T14:01:26Z

@kleinash Why do you prefer to access fields instead of computing them with something like

len = length(x)
t = typeof(x)
xu = unique(x)

?

ashleylid · 2016-03-10T14:58:28Z

@andreasnoack I'll use that then. Just seemed as though describe could work for looping through aspects of the df.

andreasnoack · 2016-03-10T21:25:34Z

If you think it would be easier to use a describe function and then extract the values from a type returned from such a function, I'd like to hear why. There might be good reasons that we haven't thought of yet, but to me, it seems convoluted to use the temporary type to store the result. You might also end up computing more statistics than you actually need.

I still think it would be useful to have a type for pretty specialized HTML and LaTeX printing but that seems to be a different usecase from the one you are considering here.

ashleylid · 2016-03-11T06:28:57Z

I appreciate your replies. Thank you for your direction and comments. I will do as suggested.

nalimilan · 2016-03-11T09:17:11Z

I still think it would be useful to have a type for pretty specialized HTML and LaTeX printing but that seems to be a different usecase from the one you are considering here.

If you want the results of describe to benefit from pretty-printing, they have to be stored in a type which supports this. That's why I think it's a related issue. Printing raw text to the output should really be reserved to actual text, i.e. not structured objects like tables or series of values.

ashleylid · 2016-03-11T09:31:38Z

I just run
$ julia program.jl > output.txt

andreasnoack · 2016-03-11T14:47:27Z

@nalimilan I think we agree on most of this now. My main points are that

the type should be a single type flexible enough to allow different sets of univariate statistics
the type is mainly for printing. We shouldn't encourage using describe for computing intermediate quantities like means and standard deviations. It is simpler and probably more efficient just to use the functions for the specific quantity, mean, std etc.

nalimilan · 2016-03-11T16:18:30Z

Fully agreed.

matthieugomez · 2016-03-11T16:38:23Z

To sum up: my understanding is that you want all the summary stats to be stored in the same type, and that this type would print something like a list of keys : values format in a specific order.

I think it would limit the way we can print things. Even for the (very simple) summary statistics case, I'd like the key "sum of weights stat" to be printed after a new line. The situation for Coefficient Tables is even more complicated. Restricting the number of types will restrict the way we can print information, and I don't think it's a good trade off.

detailed or/and weighted summary stats

bb65c53

andreasnoack reviewed Mar 1, 2016
View reviewed changes

nalimilan mentioned this pull request Apr 19, 2016

Support Integer Printing in CoefTable #155

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

detailed and weighted summary stats #138

detailed and weighted summary stats #138

matthieugomez commented Oct 6, 2015

andreasnoack commented Dec 8, 2015

nalimilan commented Mar 1, 2016

andreasnoack commented Mar 1, 2016

nalimilan commented Mar 1, 2016

andreasnoack Mar 1, 2016

andreasnoack commented Mar 1, 2016

nalimilan commented Mar 1, 2016

matthieugomez commented Mar 1, 2016

andreasnoack commented Mar 1, 2016

nalimilan commented Mar 1, 2016

andreasnoack commented Mar 1, 2016

nalimilan commented Mar 1, 2016

matthieugomez commented Mar 1, 2016

nalimilan commented Mar 1, 2016

nalimilan commented Mar 10, 2016

ashleylid commented Mar 10, 2016

andreasnoack commented Mar 10, 2016

ashleylid commented Mar 10, 2016

andreasnoack commented Mar 10, 2016

ashleylid commented Mar 11, 2016

nalimilan commented Mar 11, 2016

ashleylid commented Mar 11, 2016

andreasnoack commented Mar 11, 2016

nalimilan commented Mar 11, 2016

matthieugomez commented Mar 11, 2016


		immutable SummaryStats{T<:AbstractFloat}
		function describe{T<:Real}(a::AbstractArray{T}, ::Type{Val{false}})

detailed and weighted summary stats #138

Are you sure you want to change the base?

detailed and weighted summary stats #138

Conversation

matthieugomez commented Oct 6, 2015

andreasnoack commented Dec 8, 2015

nalimilan commented Mar 1, 2016

andreasnoack commented Mar 1, 2016

nalimilan commented Mar 1, 2016

andreasnoack Mar 1, 2016

Choose a reason for hiding this comment

andreasnoack commented Mar 1, 2016

nalimilan commented Mar 1, 2016

matthieugomez commented Mar 1, 2016

andreasnoack commented Mar 1, 2016

nalimilan commented Mar 1, 2016

andreasnoack commented Mar 1, 2016

nalimilan commented Mar 1, 2016

matthieugomez commented Mar 1, 2016

nalimilan commented Mar 1, 2016

nalimilan commented Mar 10, 2016

ashleylid commented Mar 10, 2016

andreasnoack commented Mar 10, 2016

ashleylid commented Mar 10, 2016

andreasnoack commented Mar 10, 2016

ashleylid commented Mar 11, 2016

nalimilan commented Mar 11, 2016

ashleylid commented Mar 11, 2016

andreasnoack commented Mar 11, 2016

nalimilan commented Mar 11, 2016

matthieugomez commented Mar 11, 2016