Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug or Feature: fit(Histogram,...) and DataFrames #266

Open
DrFlamingoo opened this issue May 18, 2017 · 14 comments
Open

Bug or Feature: fit(Histogram,...) and DataFrames #266

DrFlamingoo opened this issue May 18, 2017 · 14 comments

Comments

@DrFlamingoo
Copy link

Hi all,
I am relying quite a bit on Histograms in my scripts for my PhD, and I often calculate intensities of 2D detectors by supplying coordinates x,y and the signal intensity I as weight to the
fit(Histogram,(x,y),weights(I). This is immideatly plottable via PyPlots plt.imshow(), etc

Since I get my data from DataFrames, a typical command looks like:

dataTuple = (df[:colX], df[:colY])
weights   = WeightVec(df_test[:I])

hist      = fit(Histogram,dataTuple,weights)

This stopped working, instead I have to convert my DataFrames:

dataTuple = convert(Tuple{Array{Float64,1},Array{Float64,1}},(df[:colX],df[:colY]))
weights   = WeightVec(convert(Array{Float64,1},df[:I]))

was that intentional?

@ararslan
Copy link
Member

ararslan commented May 18, 2017

This was an intentional change introduced in #248. After a bit of discussion we decided that it doesn't make sense to have a weight vector that allows contains missing values, since the sum of the weights can would be missing. I think we were under the impression that it wasn't a common use case to have DataArrays passed to the WeightVec constructor, though I hadn't considered when data is coming from a DataFrame. You'll now need to convert any columns you had intended to be used for weighting into normal Vectors.

Do you have anything to add here, @nalimilan?

@rofinn
Copy link
Member

rofinn commented May 18, 2017

@ararslan Would it make sense to add automatic conversion for DataArrays and I guess NullableArrays?

@ararslan
Copy link
Member

We don't want this package to depend on DataArrays or NullableArrays, so the best we could do would be to add a conversion to Vector inside the constructor, which will fail if there are any null values.

@andreasnoack
Copy link
Member

@ararslan I don't see why #248 would exclude the use of DataVectors. Am I missing something?

@ararslan
Copy link
Member

Oh, no, you're right. I'm actually not sure why this is happening.

@rofinn
Copy link
Member

rofinn commented May 18, 2017

@0kto Are there missing values in df_test[:I]?

julia> d = @data([1, 2, NA, 4])
4-element DataArrays.DataArray{Int64,1}:
 1
 2
  NA
 4

julia> weights(d)
ERROR: MethodError: no method matching StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}(::DataArrays.DataArray{Int64,1}, ::DataArrays.NAtype)
Closest candidates are:
  StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}{V<:AbstractArray{T<:Real,1}}(::V<:AbstractArray{T<:Real,1}) at /Users/rory/.julia/v0.5/StatsBase/src/weights.jl:63
  StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}{S<:Real,V<:AbstractArray{T<:Real,1}}(::V<:AbstractArray{T<:Real,1}, ::S<:Real) at /Users/rory/.julia/v0.5/StatsBase/src/weights.jl:63
  StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}{T}(::Any) at sysimg.jl:53
 in weights(::DataArrays.DataArray{Int64,1}) at /Users/rory/.julia/v0.5/StatsBase/src/weights.jl:71

vs

julia> d = @data([1, 2, 4])
3-element DataArrays.DataArray{Int64,1}:
 1
 2
 4

julia> weights(d)
3-element StatsBase.Weights{Int64,Int64,DataArrays.DataArray{Int64,1}}:
 1
 2
 4

@DrFlamingoo
Copy link
Author

Yes, the not all coordinates carry a signal

@rofinn
Copy link
Member

rofinn commented May 18, 2017

Ahh, okay. Yeah, that behaviour is intentional, but I could be convinced otherwise.

@ararslan
Copy link
Member

I maintain that it doesn't really make sense to allow the case where the sum of the weights is missing. What behavior would you expect that to exhibit?

Since DataArrays depends on StatsBase, perhaps we should add an explicit WeightVec constructor for DataArray input in that package, though I'm not sure what such a constructor should do exactly. I guess we could have it convert NAs to 0 weights.

@DrFlamingoo
Copy link
Author

DrFlamingoo commented May 18, 2017

I am home, so I can write a bit more detailed.

After reading your comments I identify two problems for my use case:

The first is that even for dataTuple a DataFrame column is not allowed. That means that
fit(Histogram,dataTuple) will cause an error if dataTuple is constructed directly from DataFrames.

The second issue is the one with the nullable arrays as weights.
I use fit(Histogram,...)because it is really fast. I had written some routines that would basically perform calculations for each pixel (I should rather say voxel, the coordinate space is 4D) like summation of intensities of several datasets, normalisation by monitorcounts, errors, etc. My implementation was rather slow.
I suddenly realized that fit(Histogram,...) could do that much faster as it already implements optimized routines. Then, it is only a matter of matrix operations, and you could even start think about interactive plotting since everything is already neatly sorted in a matrix representation that is accepted by matplotlib's imshowor similar 2D maps.

@DrFlamingoo
Copy link
Author

Well, I guess that is a hack to get things to work :) not having to deal with NAs at that level makes things easier, but I guess I could convert everything from NA to 0, and apply a mask to the endresults.

@rofinn
Copy link
Member

rofinn commented May 18, 2017

I guess we could have it convert NAs to 0 weights.

Seems like that should be a separate function though (e.g., WeightVec(replace!(df[:A], 0.0))). FWIW, we do enough of that stuff at my work that we have a package called Impute.jl... not sure if anyone else would find that useful though.

@DrFlamingoo
Copy link
Author

DrFlamingoo commented May 18, 2017

Actually purging every row with NA signal before handing everything to Histogram is the best option for my aims.
In that case would you mind noting in the docs that weights must not be NAs?
Thanks for you help in this matter!

@nalimilan
Copy link
Member

We can't really mention NA in the docs since StatsBase does not depend on DataArrays at all. Hopefully we will soon have a common concept of missingness in Julia Base, which will also replace DataArrays, at which point it will be possible to explain this, and maybe throw an explicit error.

FWIW, replace! doesn't work on DataArray but you can do WeightVec(convert(Vector, df[:A], 0)).

Also see JuliaStats/DataArrays.jl#249.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants