-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Common API design #1
Comments
The API recommendations should follow (in the order of importance):
|
I like very much your suggestion of
Very Julian indeed :) I typically find that the datastructure |
I agree; however, I am looking forward to JuliaLang/julia#13942 as an elegant solution to this. Not aware if it is type stable |
I have concerns on dispatching on
and dispatch on that. |
I disagree simply because virtually all search packages implement this approach - so, it is a popular and maybe more natural choice . Assuming that there is a minimal degree of documentation associated to the |
One problem with dispatching on I would prefer two different methods for range and nearest neighbor search. The return values would be fine for the current implementation of VPTrees, however I could imagine there being a datastructure where it is not necessary to compute all distances to the returned values. How is it for example with range search in a KDTree? I suppose it depends on the application if they can profit from precalculated distances. |
Separating range and nn search does make sense as well.
I do not quite understand this. Does it mean returning only indices without the distances (contrived situation: fetching a full cluster of points based on centroid distance of sorts?) In this case, distance values could be |
I'm currently working on reimplementing There's a usage pattern that follows similarly to what's proposed here: graph = nndescent(::Type{<:ApproximateKNNGraph}, data, n_neighbors, metric)
indices, distances = search(graph, data, queries, n_neighbors, metric) However, there's a lot of other functionality that's implemented / planned that goes beyond just index construction and searching (some of that is described in these dev notes). The subtypes of |
To add my thoughts on
I think these are semantically different enough to warrant being separate functions. Furthermore, even if they are combined, you can't distinguish dispatch based on whether the third argument is an |
This looks interesting. I wonder if a peer-to-peer approach to ann search can become feasible with this. If so, a julia search engine similar to yacy that has both online updates and distributed capabilities could become a killer app. |
Good point.
Isn't |
(side comment: as it is not yet clear for me whether other authors want to have all NN-related packages under the same umbrella org, I have compiled this list: #2 that will help us have a summarized picture of what is out there) |
Re: on the search type specification. Both Jonas and Altre have good points, and simply dispatchin on We can dispatch on
(as these structs are already part of a NN ecosystem, no need to add the extra |
@dillondaudert I must say I strongly disagree that inrange and knn searches should be different functions. Here is why: At least in my line of work, the search itself is never the final outcome of the work I have to do. After I find the neighbhors I have to do stuff with them, as for example use them to estimate the Lyapunov exponent. Therefore, I want to be able to have as an argument the possibility to change the way I find the neighbors. If it is a function instead of an argument, then I have to change my code, and not my argument, to see how my result will change with different distance Besides, your very own comment already shows why these two things should be one function:
is what you paste, and I can clearly see that the only difference here is the third argument, which more often than not also happens to have a different type. Julia's multiple dispatch comes to the rescue and reduces the multiplicity of 2 functions into 1. (as @altre already said, the |
I see now that another wonderful benefit of having this common API is that users can easily benchmark each package for their specific application and thus pick the most suitable one!!! :) |
This is not really true as it is easily solvable through: search(index, queries, k::FixedMass, metric) = search(index, queries, k.k, metric)
search(index, queries, r::FixedRadius, metric) = inrange(index, queries, r.ε, metric) which presents the advantage of other packages not having to support/import/care |
I am confused, or I may not have understood your comment, because I think you are contradicting yourself... :S You claim that supporting/caring about
In my eyes different methods mean the same function with different argument types, i.e. multiple dispatch, i.e. what I suggested with
Yeap, that is true, but it is also on the user side. Why make the user do this, if the original API can be what the user would do, since it has less function names and takes advantage of Multiple Dispatch? |
My bad, 2 functions ;)
The user would not do it unless it needed a single method (bringing the user as an argument is highly debatable here). Just to summarize, there are two options:
In my view, the method approach is more problematic because puts an emphasis on all packages supporting I am obviously taking the effort and consideration put into the work i.e. packages so far as a stronger indicator regarding where things should go rather than new choices. |
? We are discussing a common api so that people can use nn searches in a common manner. The usage of this API should be the primary concern, no?
There is no difference whatsoever in what you say. in 1. you support 2 functions, in 2 you support 2 methods. There is no extra "weight" or "extra work" from the package's side whatsoever. It is literally the same amount of work. I don't get this argument at all. |
@Datseris As I previously explained, you cannot distinguish the third argument based on numeric type - My original suggestion was for different functions, but I do kind of like the idea of |
I was claiming that there's enough of a semantic difference between the two operations that different functions were warranted. You may or may not agree, but it isn't a counter-argument to say that they must be the same function since their signatures resemble one another - that's like saying
I agree, this would be one great benefit of a unified API |
I included it just to be explicit. I haven't decided if |
I fully agree, and at the moment we should only consider |
Ofcourse, this is implicit. However, my remark tried to hint that you substituted personal choice (of having two methods) with user choice:
At this point, most user usage patterns and needs are unknown and cannot be discussed as no factual elements on which to base decisions exist (hence the above mentioned substitution). I may be wrong here so if you have any evidence worth mentioning, please do share it. Please note that this discussion should concentrate on subjective aspects of the API that we can agree on. This is not one of them. Thanks. |
@zgornel I apologize if I made you think that I personally think my personal suggestion is better than yours because it is mine. That is not at all the case nor my intention. It is just plain old multiple dispatch. Allow me to clarify: Here is why I believe the Multiple Dispatch approach is subjectively better for a user perspective, or at least for the use case I describe. Imagine the scenario: function lyapunovs(ref, data, structure, method)
λ = 0.0
for point in data
neighbors = search(point, structure, method)
rates = [expansion_rates(ref, n) for n in neighbors]
λ += sum(rates)
end
λ/length(data)
end This is a reasonable function that uses the neighbors in other manners once they are found. I truly hope this seems reasonable usage, and not as personal as you initially perceived. Now, in the rest of my script I call Now, I want to compare my results for different types of finding neighbors.
For me it is crystal clear why 1 is better than 2. If you think otherwise, please explain why. The more perspectives we have on this the better. @zgornel since we are discussing an API for many people to use, please, just like you asked me, I now ask you: provide concrete example on why the 2-functions approach is more suitable. In the example above, for me the situation is clear. But once you provide an example which 2-functions should be better then I of course will reconsider. I have already seen that you have different use cases, so this will be interesting to see. This also adds on your argument:
I don't think this is entirely true. Every person in this discussion is a potential user. Combining all our uses cases already gives a very good estimate I believe, especially since our use cases are so diverse. I am thinking about the argument "its more work for the developer" and how it stands. I still don't see how extending 2 functions is less work than extending 2 methods. In the end of the day you just write 2 Julia functions either way. |
@Datseris no apologies needeed, no offense was taken; I have already stated my arguments - mostly centered upon forced sharing of types and familiarity/consistency of existing signatures - and cannot contribute anything more substantial to the discussion, at least in this direction. |
Uh, you gave me quite a bit of reading material. Multiple dispatch is cool but it brings with it the expectation that methods sharing the same function name have essentially the same behaviour.
I dare to assume that quite a few algorithms/applications rely on getting at least some neighbors returned or at least need to do some special-casing to avoid errors for empty arrays. If your application happens to be fine with a variable amount of neighbors (that can be 0!)
(as we have done so far). |
Well, I can safely say that we were being stupid so far, not realizing how trivial it is to accommodate both options. Here is one way that solves literally all problems, and keeps everyone satisfied: Neighborhood.jl (or however we'll call this at the end) exports these names:
where internally in Neighborhood.jl we just define
The packages that accommodate the common API extend @zgornel @JonasIsensee @altre @dillondaudert Can we get a vote/critique on the suggested API? |
@Datseris LGTM. insert!(ds, index, new_point) # index is some unique key/index in ds, similar to Base.insert!
delete!(ds, index) # again index uniquely identifies a point; deleteat! may also be a name option In both cases a partial update of Potentially faster methods are |
@Datseris I think that looks like a good first proposal (except I would keep search(ds, query, FixedRange; ...)
search(ds, query, FixedAmount; ...) and to have in Neighborhood.jl the convenience functions inrange(ds, query, r; ...) = search(ds, query, FixedRange(r); ...)
knn(ds, query, k; ...) = search(ds, query, FixedAmount(k); ...) Also I'm wondering if anyone has thoughts on which functions should be mandatory to implement (if any). For NearestNeighborDescent.jl, the knn graphs aren't mutable so adding/removing points won't be implemented. I also wasn't planning on supporting an |
I would say knn searches are the minimum requirement. But I think its worth it to consider having an inrange version as well, if it is simple enough to go about. As you noted mutability is only possible in a specific subset of the packages, so it can't be implemented for all. |
I've updated my proposed API because I realized that the absolute minimum for a common API is two functions:
Optionally one should extend in range, and I strongly recommend doing so :P (this is a personal opinion by the way) |
The problem is, how do the individual packages extend |
Okay, I think the best way forward is to see some code. In the near future I'll make a PR here that implements this interface and invite everyone to review, but most importantly to give feedback on how to "make the datastructure" thing. |
Can't we just use the constructor like in NearestNeighbors.jl: |
Is the Levenstein distance mentioned here and on discourse also a Yes, your suggestion seems to be totally valid for me. |
No, in StringDistances.jl it isn't, but I think that is an error, might be just a simple PR away. |
But at the moment I don't see a reason to enforce types, i.e. why limit |
Well, it might stop people from shooting themselves in the foot. Would you suggest M <: PreMetric or just pass a function? |
I would suggest to not add any type limitation to allow for extendability. At the moment I'll do it without limitations on the PR, and we can talk there about adding successive limitations. |
I'm wondering if this is in conflict with the idea of If you want to compare different algs in an application then you would need to pass the function name as a symbol of so and then eval? |
I have thought about that, which is why I have implemented """
searchstructure(ST, data, metric; kwargs...) → st
Create a search structure `st` of type `ST` (e.g. `KDTree, BallTree` etc.) based on the
given `data` and `metric`. The data
"""
function searchstructure end |
You could also just use a list of constructors, similar to: |
I wonder if it's really that onerous to have And btw, |
@dillondaudert Its probably better to ask the opposite question: do you gain anything whatsoever in limiting all metrics to a subtype of |
@JonasIsensee had a good plan with:
but then there were 2 noteworthy comments.
search
should have 3 arguments, because searchinginrange
and searchingknn
is at a fundamental level different enough of an operation that I wouldn’t use a keyword for it.In my eyes the return should be the same for all packages, and it should be either indices AND distances, or points AND distances. The package must know the distances to find the neighbhors, so why not return them anyway.
The text was updated successfully, but these errors were encountered: