Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Better copyto #172

Closed
wants to merge 9 commits into from
Closed

WIP: Better copyto #172

wants to merge 9 commits into from

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Nov 26, 2018

PR following the discussion in #170.

What does it change:

  • append! uses resize! and copyto! which will make it faster, but when it fails it does not clean up
  • copy! will allow src to be non-categorical; dest is array or a view; in this implementation levels are retained and also if dest is ordered copy! might fail
  • copyto!:
    1. if dest is ordered do not allow adding new levels
    2. if dest is not ordered only add levels that are present in src in copied range (not all levels in src - this makes sure that if src is categorical and not categorical holding the same values as src we get the same result)

EDIT
The general idea of the design is that if x is categorical and y is a non-categorical then:

copyto!(x,a,y,b,n)

should be exactly the same as

copyto!(x,a,categorical(y),b,n)

@bkamins bkamins changed the title Better copyto WIP: Better copyto Nov 26, 2018
@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2018

@nalimilan this is WIP, so let me work on it a bit (I have to dig through all the tests as they require significant changes because if ordered is true we change a lot here).

The implementation I proposed for copyto! now seems to follow the rules we have discussed and is faster than old copyto! (at least on some limited benchmark tests I had time to perform).

@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2018

I am going through the tests (hopefully I will update all today and push relevant changes) and here are two additional comments:

  • copy! indeed should have a different implementation - indeed it should copy levels and moreover it should have a totally separate implementation (I will add it) as Future.copy!([1,2,3],[5,6]) == [5,6] and we do not resize catarrays with copy! right now;
  • copyto! behavior that I propose is inconsistent with vcat in terms of ordering of levels (we could change vcat implementation to use copyto! internally to make it consistent); this should be discussed again I think; also the usage of copyto! in DataFrames.jl assumes that level ordering is retained so if we use what copyto! now proposes the pattern to populate a bigger vector using copyto! should be:
    1. create an empty categorical vector
    2. populate its levels as needed
    3. populate the values after levels are populated

@nalimilan
Copy link
Member

Thanks! That inconsistency between copyto! and vcat is a bit annoying. I'd say the essential property that should hold for copyto! is that it's equivalent to calling setindex! repeatedly. Currently, setindex!(::CategoricalArray, ::Any, ::CatValue) doesn't respect the ordering of levels (i.e. it adds a new level at the end if needed), which means we need copyto! to do the same.

But maybe we could change that? It would make sense for setindex! to insert the level in the position in which it would have in the result of mergelevels, i.e. in the levels of vcat. In general, the principle I've put a lot of effort into implementing is that if the two vectors have compatible orderings of levels, we try very hard to merge the levels into a common order (i.e. one in which the relative order of levels is preserved). I think that's right since the order of appearance is really an arbitrary criterion which shouldn't play a role for categorical arrays.

src/array.jl Outdated Show resolved Hide resolved
end
else
if seen[s]
drefs[dstart+i] = remap[s]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you just use remap[s] == -1 as a sentinel?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I have done first, but allocating seen should be cheap, and the problem is that remap, in order to be fast, should have the same eltype as drefs and they are unsigned.

At least this is what I thought. Now I see that we can use remap[s] == 0 as sentinel, as we use 0 for missing which is handled in other way anyway. I will change this

src/array.jl Outdated Show resolved Hide resolved
end
end
end
else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we need to duplicate all of this code for performance? Branch prediction is usually quite efficient, so even if you have an if inside the loop I suspect it won't make a big difference. And often LLVM will do the hoisting automatically.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have to do isordered check in the innermost part of the loop. I can check if compiler can optimize this out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's already a branch I don't think it will matter, even if the compiler doesn't hoist it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we have to do isordered test in every step of the loop. If it is not optimized out then it will be much slower. I will check.

@static if VERSION >= v"0.7.0-DEV.3208"
using Future
Future.copy!(dest::CatArrOrSub, src::AbstractArray) =
copyto!(dest, 1, src, 1, length(src))
Future.copy!(dest::CatArrOrSub, src::CatArrOrSub) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method isn't needed now AFAICT.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you want to happen if we want to copy! a normal array to a categorical array? Throw an error?

for i = 1:len2
A[len + i] = B[i]
end
# As in Base, A will be left modified if it is not possible to copy B to A
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why Base doesn't call resize! in case of failure. That sounds trivial, undoing the operation is not as hard as in copyto!. But better stick to what Base does, and maybe file an issue in Julia.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the reason is that you would have to do try-catch which will slow down the function? I will open an issue with Base.

x = categorical([1:255;1:255], true)
y = categorical([1, 1000], true)
@test copyto!(x, 1, y, 1, 1)[1] == y[1]
@test_throws ErrorException copyto!(x, 1, y, 1, 1)[2]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better check the length? Actually, better check the whole contents of the result (same for the line above).

@test_throws ErrorException copyto!(x, 1, y, 1, 1)[2]

x = categorical([1,2,3])
y = categorical([5,6,7])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use unsorted values to really test that the order is preserved.

copyto!(x,y)
@test levels(x) = [1,2,3,5,6,7]

@test copy!(x, [1,1,1]) == [1,1,1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not also check levels? Also with different unsorted values? Same below.

nalimilan and others added 2 commits November 26, 2018 14:43
Co-Authored-By: bkamins <bkamins@sgh.waw.pl>
Co-Authored-By: bkamins <bkamins@sgh.waw.pl>
@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2018

I've put a lot of effort into implementing is that if the two vectors have compatible orderings of levels, we try very hard to merge the levels into a common order

This is what I see in the code. And that is why I am asking, because the issue is complex (in general - working on those things taught me that things that I initially see as simple are not).

So we have two options:

  1. dynamically add levels as needed (what I currently implemented in copyto!)
  2. always merge levels

I can implement both of them with equal speed in copyto! (of course option 2 with setindex! will be slower)

Option 2 is very tempting conceptually. The problem with it is that what we merge might have conflicting orderings (but I guess we should treat it as a corner case and just go with your implementation)?

@nalimilan
Copy link
Member

If you have the time and motivation, it would be interesting to investigate whether it would be possible to have setindex! preserve ordering using mergelevels. When orderings conflicts, the fallback is to add levels at the end.

@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2018

This is what I intended. I really like to have things clean.
However, what I will do, given the raising complexity of this PR is to handle append! separately as this is a simple change that will stay anyway.

@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2018

@nalimilan - sorry to bother you again. I have started working on the order-preserving setindex! which led me to the analysis of mergelevels, because I have to reimplement it to level updating version (not creating a completely new level).

Can you please explain me the logic behind the following output?

julia> CategoricalArrays.mergelevels(false, [1,2,3], [4,2,5,1])
([5, 4, 1, 2, 3], false)

I do not understand why you put 4 after 5 in the result.

@nalimilan
Copy link
Member

No idea. I admit I would have expected [1, 2, 3, 4, 5] (due to the order of levels, not to integer sorting). I guess this has to do with the particular solution I found to insert levels while preserving order when possible: AFAIK it works when it's possible to preserve order, but when that's not possible it might not return an optimal choice. Feel free to suggest better solutions if you can find ones, I must say I don't remember all the details now.

@bkamins
Copy link
Member Author

bkamins commented Nov 27, 2018

I will design such an algorithm, but this is a longer process to make it efficient and consistent for: setindex!, vcat and copyto!. But this will be mildly breaking.

Therefore, because of the complexity we face, I will close this PR and close #170 and open issues/PRs that are split into smaller chunks of changed functionality.

@bkamins
Copy link
Member Author

bkamins commented Dec 2, 2018

I am thinking about this issue for some time. The first conclusion is that we have to have a list of methods we should hande. This is what I have collected:

  • setindex!
  • copyto!
  • copy!
  • vcat (we need to implement a proper method here for CatValue and CatString as they are missing)
  • append!
  • push!
  • hcat (we still have to implement a custom method here as it is missing)
  • hvcat (we still have to implement a custom method here as it is missing)
  • pushfirst! (we still have to implement a custom method here as it is missing)
  • insert! (we still have to implement a custom method here as it is missing)

The key issue is that in general it is impossible to implement mergelevels in a consistent way. What I mean by this is that the result must depend on the order of arguments and this is unavoidable.

The general logic for all methods should be:

  • check if the operation allows adding new levels
  • if it does then add levels
  • if it does not then throw an error if the level would have to be added
  • perform an operation on a modified target (we know that it allows adding levels)

There are then multiple corner cases but the main is:

in cases those methods add levels to a vector should they add all levels at once or only levels present in what is added.

Currently with mergelevels we add all levels, not only those present. I think it is important to have one guiding princple here that is commonly respected by all the functions, as otherwise users will have a hard time learning the API.

The difference in the approach manifests itself e.g. when doing setindex!, append!, copyto! to an ordered catvector. First of all - they all should be consistent. Now if we have add all levels rule then those methods should fail if what we add has some levels that are not supported by the catvector we write to even if the value itself would be accepted. If we have add needed levels rule then such operation will work if what we want to add is a level that is accepted.

The thing is that if we do not have common guiding principles then e.g. vcat(x, y) and append!(x, y) for some data might both work but produce a different result and I think that vcat(x,y) and append!(x,y) should pass equality of results and should pass equality of levels.

To understand what I mean have a look at this example (this is one of many possible examples):

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> y = categorical([4], ordered=true)
1-element CategoricalArray{Int64,1,UInt32}:
 4

julia> levels!(y, [1,4,2])
1-element CategoricalArray{Int64,1,UInt32}:
 4

julia> levels(hcat(x[1], y))
2-element Array{CategoricalValue{Int64,UInt32},1}:
 1
 4

julia> levels(hcat(x[1:1], y))
2-element Array{Int64,1}:
 1
 4

julia> levels(vcat(x[1], y))
2-element Array{CategoricalValue{Int64,UInt32},1}:
 1
 4

julia> levels(vcat(x[1:1], y))
4-element Array{Int64,1}:
 1
 4
 2
 3

julia> levels(append!(x,y))
4-element Array{Int64,1}:
 1
 2
 3
 4

So in short - I think we first need a principle we want to follow and then reimplement all the methods from grounds-up to follow it. Unfortunately it seems to be a large task, but - if we want to provide a solid base for the future - I would try to do it.

@nalimilan
Copy link
Member

Unfortunately I don't think we can fix cat at this point, since we would introduce ambiguities with other array types. See JuliaLang/julia#20815. So all the situations where you end up with an Array{CategoricalValue} should not exist -- we can't handle this kind of array very well anyway and they don't make a lot of sense.

Regarding whether we should (try to) add all levels or only those actually used, I think we should do the former, because it's not always possible to efficiently check which levels are used (you basically need to make an additional pass over the data). That's one of the reasons why we keep unused levels in the first place (the other being that levels are supposed to have a meaning by themselves, independent from the data).

@bkamins
Copy link
Member Author

bkamins commented Dec 15, 2018

Then I would leave mergelevels you have implemented and just say that:

  • when there is a conflict between level orderings in merged levels the order is undefined;
  • that the resulting order depends on the order of arguments (even without a conflict - as this is the case).

I will go back to this when I am more or less finished with DataFrames.jl stuff (setindex! and broadcasting) as this is probably less pressing.

@bkamins
Copy link
Member Author

bkamins commented Dec 19, 2018

TODO (from Slack):
also consider this case:

using DataFrames
v=DataFrame(:a=>categorical([nothing,"a"]))
append!(v,v)

which currently throws an error

@nalimilan
Copy link
Member

@bkamins Do you know what is still relevant in this PR after we merge #211 and given what we have decided at #99?

@bkamins
Copy link
Member Author

bkamins commented Sep 25, 2019

I will test it after #211 is merged and tagged.

@bkamins
Copy link
Member Author

bkamins commented Sep 26, 2019

There is still a problem that:

using DataFrames
v=DataFrame(:a=>categorical([nothing,"a"]))
append!(v,v)

throws an error.

Also before we close this please confirm that this is intended:

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> push!(x, 10)
ERROR: cannot add new level 10 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

julia> append!(x, [10])
ERROR: MethodError: no method matching append!

julia> append!(x, categorical([10]))
4-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
 10

@nalimilan
Copy link
Member

There is still a problem that:

using DataFrames
v=DataFrame(:a=>categorical([nothing,"a"]))
append!(v,v)

throws an error.

I don't think CategoricalArrays handle Union eltypes very well. Probably worth filing a separate issue.

Also before we close this please confirm that this is intended:

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> push!(x, 10)
ERROR: cannot add new level 10 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

julia> append!(x, [10])
ERROR: MethodError: no method matching append!

julia> append!(x, categorical([10]))
4-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
 10

The push! error is intended (we can discuss that, though). The append! error is probably an oversight worth filing another issue (I wish Base would define a fallback based on resize! and setindex!, but currently it doesn't).

Nothing to change in copyto! itself?

@bkamins
Copy link
Member Author

bkamins commented Sep 26, 2019

If this:

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> y = categorical([1,2,1000], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 1000

julia> y[3] = 1
1

julia> y
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 1

julia> copyto!(x, y)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 1

julia> levels(x)
4-element Array{Int64,1}:
    1
    2
    3
 1000

is intended then I think we are OK and this can be closed.

Can I leave opening of the other specific issues you have indicated to you as I feel you understand better what is the core problem with them?

@bkamins
Copy link
Member Author

bkamins commented Sep 26, 2019

just for a reference continuing the last example, we have:

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> z = [x; y]
6-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
 1
 2
 1

julia> levels(z)
4-element Array{Int64,1}:
    1
    2
    3
 1000

so we are consistent.

@nalimilan
Copy link
Member

is intended then I think we are OK and this can be closed.
Yes that's intended. At least that's what we agreed upon AFAICT. Good to know we're consistent now!

Can I leave opening of the other specific issues you have indicated to you as I feel you understand better what is the core problem with them?

Actually I've realized it's covered by #170.

@bkamins
Copy link
Member Author

bkamins commented Sep 26, 2019

OK - so this can be closed then. Right?

@nalimilan
Copy link
Member

I guess.

I've filed #213 to fix the append! problem. There are probably many other issues like that, though.

@nalimilan nalimilan closed this Sep 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants