Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unit weights #515

Merged
merged 11 commits into from
Sep 19, 2019
Merged

Add unit weights #515

merged 11 commits into from
Sep 19, 2019

Conversation

lbittarello
Copy link
Contributor

@lbittarello lbittarello commented Aug 22, 2019

Closes #135, #358 and #516

src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Show resolved Hide resolved
src/weights.jl Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Aug 22, 2019

Codecov Report

Merging #515 into master will decrease coverage by 1.4%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #515      +/-   ##
==========================================
- Coverage    90.1%   88.69%   -1.41%     
==========================================
  Files          21       21              
  Lines        2021     2053      +32     
==========================================
  Hits         1821     1821              
- Misses        200      232      +32
Impacted Files Coverage Δ
src/StatsBase.jl 100% <ø> (ø) ⬆️
src/weights.jl 71.69% <0%> (-12.75%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 416af55...de8f5fa. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 22, 2019

Codecov Report

Merging #515 into master will increase coverage by 0.13%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #515      +/-   ##
==========================================
+ Coverage   90.14%   90.27%   +0.13%     
==========================================
  Files          21       21              
  Lines        2029     2057      +28     
==========================================
+ Hits         1829     1857      +28     
  Misses        200      200
Impacted Files Coverage Δ
src/StatsBase.jl 100% <ø> (ø) ⬆️
src/weights.jl 86.53% <100%> (+2.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fde6209...bcd4bf7. Read the comment docs.

src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated
one(T)
end

@propagate_inbounds function Base.getindex(wv::UnitWeights{S, T}, i::AbstractArray{<:Int}) where {S, T}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I realize this definition isn't the most natural: we'd better return a new weights vector of the same type. But that's a separate issue since other types do the same thing.

src/weights.jl Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Show resolved Hide resolved
src/weights.jl Show resolved Hide resolved
docs/src/weights.md Outdated Show resolved Hide resolved
@nalimilan
Copy link
Member

Can you also add tests?

@lbittarello
Copy link
Contributor Author

Tests added!

test/weights.jl Show resolved Hide resolved
src/weights.jl Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
@nalimilan
Copy link
Member

Thanks. A few methods don't seem to be covered by tests: see dark red in the margin at https://codecov.io/gh/JuliaStats/StatsBase.jl/compare/416af5561d3efd11c78c82c269a9d7c7d26c681d...009b9d5b893bd9235dfd96895acfe1a3c1df81ea/diff.

src/weights.jl Outdated Show resolved Hide resolved
@lbittarello
Copy link
Contributor Author

I added more tests.

As it turns out, we need eltype for unit weights or show throws an error (due to the absence of the values field), so I added it back.

test/weights.jl Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
@nalimilan
Copy link
Member

As it turns out, we need eltype for unit weights or show throws an error (due to the absence of the values field), so I added it back.

What if you also remove the eltype methods for AbstractWeights?

@lbittarello
Copy link
Contributor Author

You're right. I'd messed up my tests. I've amended the last commit to eliminate the reintroduction of eltype and include additional revisions.

test/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
@eval begin
function Base.sum(v::$v, w::UnitWeights)
if length(v) != length(w)
throw(DimensionMismatch("Inconsistent array dimension."))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does UnitWeight have length?
And if it has to have length then why do we need to actually check it?
Is this a useful feature vs it just always being considered the right size?
Like LinearAlgebra.I

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting point. Here's an example. In computing the variance, we obtain the normalization factor through varcorrection. It depends on the number of observations, so we need the length of the weight vector. We should also ensure that the user has not passed the wrong length lest we use the wrong normalization factor. There are ways around it (for instance, a method for variance that bypasses the weights), but we want to help developers avoid defining separate methods. More generally, I think that boundary checks at different points help users catch mistakes before they snowball.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it doesn't hurt to check lengths, and you can avoid it using @inbounds. I'd say that in general it's quite easy to get the appropriate length since it's the length of another input argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculating the length isn't free for general iterators.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but in most cases we take arrays, right? Also you can always create an object of length typemax if you don't care about the checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly isn't always. Isn't dropmissing after all is one of the cases of iterators that are not free to take then length of?

These checks just don't feel great to me.
What error are we catching?
And how does that error relate to weighting?

If one is checking one got the amount of data that was expected, then isn't that better done with an explicit check on the data gathering code? So a good error can be thrown?
I feel like this just adds work for the user of weights, and the best case is they get a confusing error message.

I'm all for defensive programming, but that means checking the function you are defining can work. Not just checking arbitrary things in the calling code.

Also you can always create an object of length typemax if you don't care about the checks.

That doesn't work, does it? This check is checking the lengths are equal, not checking a >.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't support weights with dropmissing currently, the main reason being that in general we would need to skip weights corresponding to missing entries (instead of naively taking weights in their order of appearance).

That doesn't work, does it? This check is checking the lengths are equal, not checking a >.

Right, it doesn't work here (I was thinking about custom functions which want to take a UnitWeights as default weights for simplicity of implementation). I guess we could allow something like UnitWeights(nothing) (or maybe UnitWeights) to create an object with a undefined length. But I still think it's nice to take a length by default, for checking, but also because that's required to be an actual AbstractVector (otherwise we need to error when calling size).

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lbittarello! Do you want to prepare another PR to apply the values cleanup we talked about?

@lbittarello
Copy link
Contributor Author

Happy to! It'll take me a couple of weeks though, as I'll be out of town.

@nalimilan nalimilan merged commit 8f422bd into JuliaStats:master Sep 19, 2019
@nalimilan
Copy link
Member

Gentle bump about value: it would be nice to have that before we make a release to avoid changing (slightly) the API.

@matthieugomez
Copy link
Contributor

matthieugomez commented Oct 27, 2019

I know I am a bit late but why not import FillArrays and define

using FillArrays
UnitWeights{T} = AbstractWeights{Int, T, Ones{T}}
uweights(T, len) = Ones{T}(len)

FillArrays already solved the problem of representing an array of one number. This would make the code simpler and more efficient. For instance, with this definition sqrt.(uweights(Float64, 10) would return a FillArray (instead of a full vector as in this pull request).

@nalimilan
Copy link
Member

The issue with that approach is that UnitWeights wouldn't be a concrete type, and objects created by uweights wouldn't be <:AbstractWeights.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

A Ones type
5 participants