detailed and weighted summary stats #138
base: master
@andreasnoack Should we merge this then? I even wonder if we should tag the release after it rather than before.
I just had the same thoughts, but I started having doubts about this PR when looking through it again. The overall idea is fine, but there are some details that I'd like to discuss. Tagging a new version is cheap, so I think we should go ahead with the version I've tagged and then just tag another version when this is ready.
Sure.
immutable SummaryStats{T<:AbstractFloat}
function describe{T<:Real}(a::AbstractArray{T}, ::Type{Val{false}})
@matthieugomez After having used Vals for a while in the linear algebra code, I've come to the conclusion that I don't like them. Do you have any ideas about alternatives? The three I can come up with are
- Define two different functions
- Use the detailed type for both versions, but allow uninitialized fields.
- Ignore the type instability
@matthieugomez What is the typical use case for the summary statistics? My guess is that it is mainly for exploratory data analysis and reports. What other methods would you define than
I think I'd prefer to keep the immutables, as they can always be useful to store or retrieve some of the information they compute. I don't like functions which print their output directly and don't allow accessing it. OTOH, type-stability isn't likely a concern here. We could also merge the detailed and non-detailed versions, with an additional boolean field indicating whether the detail-only fields are filled.
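For illustration, the merged-type idea could look roughly like the sketch below. It is written in current Julia syntax rather than the syntax used in this PR, and every name in it (MergedSummaryStats, describe_sketch, the choice of fields, and the use of NaN for unfilled fields) is a hypothetical assumption, not code from StatsBase.

```julia
using Statistics

# Hypothetical merged type: `detailed` records whether the
# detail-only fields (`std`, `median`) were actually computed.
struct MergedSummaryStats{T<:AbstractFloat}
    mean::T
    min::T
    max::T
    detailed::Bool
    std::T      # NaN when detailed == false (assumed convention)
    median::T   # NaN when detailed == false (assumed convention)
end

# Hypothetical constructor function; returns the same concrete type
# whether or not the detailed statistics are requested.
function describe_sketch(a::AbstractArray{<:Real}; detailed::Bool=false)
    T = float(eltype(a))
    m = T(mean(a))
    lo, hi = T.(extrema(a))
    if detailed
        return MergedSummaryStats{T}(m, lo, hi, true, T(std(a)), T(median(a)))
    end
    return MergedSummaryStats{T}(m, lo, hi, false, T(NaN), T(NaN))
end
```

Because both branches return the same concrete type, this variant would need neither Val dispatch nor two separate result types, at the cost of carrying placeholder values in the non-detailed case.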
Yes, I think the only case is exploratory data analysis. That being said, like @nalimilan I think it's best to keep the information retrievable.
Since the Avoiding the types would also make the code shorter and reduce compilation so I only think defining them makes sense if we can come up with other functions than
Actually, I think the best solution would have been to return a table-like object, with row names giving the indicator type, and a column giving its value. This would allow printing it as a table in rich outputs like IJulia or HTML/LaTeX, instead of raw text. Unfortunately, for now NamedArrays are a separate package StatsBase cannot depend on.
Maybe we could mimic the parts of
Yeah, but this is a very common need which is not limited to printing statistics. For example,
I think DataFrames may be the right object then
No, I think that DataFrames in Julia are quite different from those in R and that they shouldn't be used in the same way. They should be considered as databases, to be used for relatively large data sets. What we need here is very different: a very lightweight wrapper around a matrix or around a set of values. Rows (and possibly columns) should have string names (including spaces if needed), not numbers or identifiers, and the output shouldn't be abbreviated as with DataFrames (which are typically too large to fit on screen; actually I'd prefer if DataFrames printed a summary of columns instead of the actual contents, but that's another story). This would also allow using this simple type without depending on DataFrames and DataArrays, which many packages won't accept even if they need this kind of object.
FWIW, somebody just requested a way to access the different values computed by
@nalimilan thanks 👍
@kleinash Why do you prefer to access fields instead of computing them with something like
len = length(x)
t = typeof(x)
xu = unique(x)
?
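As a rough sketch of how the directly computed values could still be looped over, they can be collected in a named tuple. This uses current Julia syntax; the sample vector and the name stats are assumptions made up for illustration.

```julia
# Hypothetical sample data standing in for a column of the df.
x = [1, 2, 2, 3]

# Compute the values directly and keep them retrievable by name.
stats = (len = length(x), t = typeof(x), xu = unique(x))

# Loop through the computed aspects, one name/value pair at a time.
for (name, value) in pairs(stats)
    println(name, ": ", value)
end
```

Individual values remain accessible as stats.len, stats.t and stats.xu.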
@andreasnoack I'll use that then. Just seemed as though describe could work for looping through aspects of the df.
If you think it would be easier to use a I still think it would be useful to have a type for pretty specialized HTML and LaTeX printing but that seems to be a different use case from the one you are considering here.
I appreciate your replies. Thank you for your direction and comments. I will do as suggested.
If you want the results of
I just run
@nalimilan I think we agree on most of this now. My main points are that
Fully agreed.
To sum up: my understanding is that you want all the summary stats to be stored in the same type, and that this type would print as a list in "key: value" format in a specific order. I think it would limit the way we can print things. Even for the (very simple) summary statistics case, I'd like the key "sum of weights stat" to be printed after a new line. The situation for Coefficient Tables is even more complicated. Restricting the number of types will restrict the way we can print information, and I don't think it's a good trade-off.
This pull request implements detailed and weighted summary stats. scipy has a similar describe function for detailed stats, Stata has a similar summarize command with the options weight and details.
I'll write a similar pull request in DataFrames if this gets merged here.
As an aside, this pull request is a nice example where a type such as :Ones (issue #135) would make the code simpler.
Additionally, two minor changes:
summarystats to describe