Re-implement scikit-learn's search trees with numba #9
Hey, since you mentioned xoak in NOAA-ORR-ERD/gridded#55: I did some looking around earlier and came across this repository: (MIT licensed, so good to go). It looks like an almost perfect match for what you're proposing here.

It uses a jitclass, but only lightly, which is arguably the right approach in my opinion. If you don't want to pass every array as a separate argument, you could pass the tree data as a namedtuple instead.

Some query methods are still missing, but they shouldn't be difficult to implement, although I'm not sure dynamic allocation can be done as efficiently. (Numba could use something like C++'s std::vector. Or is a typed List already that? It felt significantly slower to me.) Parallelisation is also extremely simple with numba's prange.

JIT compilation does indeed allow very flexible metric functions. I believe the best way to introduce them is through closures, which avoids the function call overhead: https://numba.pydata.org/numba-doc/latest/user/faq.html#can-i-pass-a-function-as-an-argument-to-a-jitted-function

I've also noticed that performance can benefit significantly from aggressive inlining, although this increases compile cost.
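To make the namedtuple and prange suggestions concrete, here is a minimal sketch. It is not taken from the repository mentioned above: `TreeData`, its field names, and `nearest_brute` are hypothetical, and the query is a plain brute-force scan rather than a tree traversal. The fields are unpacked before the parallel loop, so the loop body only touches arrays. If numba is not installed, the sketch falls back to plain Python.

```python
from collections import namedtuple

import numpy as np

try:
    from numba import njit, prange
except ImportError:  # assumption: plain-Python fallback so the sketch still runs
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func
    prange = range

# Hypothetical container for the tree arrays; field names are illustrative only.
TreeData = namedtuple("TreeData", ["points", "idx_array"])


@njit(parallel=True)
def nearest_brute(tree, queries):
    """Return, for each query row, the index of the closest stored point."""
    points = tree.points        # unpack once, outside the parallel loop
    idx_array = tree.idx_array
    n_queries = queries.shape[0]
    out = np.empty(n_queries, dtype=np.int64)
    for i in prange(n_queries):  # iterations are independent of each other
        best = np.inf
        best_j = -1
        for j in range(idx_array.shape[0]):
            p = idx_array[j]
            d = 0.0
            for k in range(queries.shape[1]):
                diff = queries[i, k] - points[p, k]
                d += diff * diff  # squared Euclidean distance
            if d < best:
                best = d
                best_j = p
        out[i] = best_j
    return out
```

Usage would look like `nearest_brute(TreeData(points, np.arange(len(points))), queries)`, with `points` and `queries` as 2-D float arrays.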
Good to know about the
This could be done at a later stage, if we choose to go down this way.
The implementation approach used in scikit-learn is interesting in several aspects:
kd-tree and ball tree are built as thin layers on top of a common binary tree implementation
all tree data is pre-allocated, which could make the re-implementation with numba easier and might facilitate experimenting with those structures and dask.
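As an illustration of what "pre-allocated tree data" can look like, here is a rough sketch of a binary tree stored entirely in flat arrays, with the children of node `i` at `2*i + 1` and `2*i + 2`. It is loosely modelled on the layout described above and is not scikit-learn's actual code: `build_flat_tree`, the array names, the level formula, and the split rule (median along the widest dimension) are all assumptions made for the example.

```python
import numpy as np


def build_flat_tree(points, leaf_size=2):
    """Build a kd-tree-like partition stored in pre-allocated flat arrays."""
    n = points.shape[0]
    # Assumed sizing rule: enough levels that leaves hold roughly
    # leaf_size to 2 * leaf_size points.
    n_levels = int(np.log2(max(1, (n - 1) / leaf_size))) + 1
    n_nodes = 2 ** n_levels - 1

    # Everything is allocated up front; nothing grows during the build.
    idx_array = np.arange(n)                       # permutation of point indices
    idx_start = np.zeros(n_nodes, dtype=np.int64)  # node covers idx_array[start:end]
    idx_end = np.zeros(n_nodes, dtype=np.int64)
    is_leaf = np.zeros(n_nodes, dtype=np.bool_)

    idx_start[0], idx_end[0] = 0, n
    for node in range(n_nodes):
        start, end = idx_start[node], idx_end[node]
        left = 2 * node + 1
        if left >= n_nodes:  # bottom level: no children
            is_leaf[node] = True
            continue
        # Sort this node's slice along its widest dimension, then split in half.
        sub = idx_array[start:end]
        pts = points[sub]
        dim = np.argmax(pts.max(axis=0) - pts.min(axis=0))
        order = np.argsort(pts[:, dim], kind="stable")
        idx_array[start:end] = sub[order]
        mid = start + (end - start) // 2
        idx_start[left], idx_end[left] = start, mid
        idx_start[left + 1], idx_end[left + 1] = mid, end
    return idx_array, idx_start, idx_end, is_leaf
```

Because the result is just a handful of plain arrays, it can be handed to a jitted function directly, without a jitted class.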
I think numba is now mature enough, and supported in enough distributions, that we can use it as a dependency. I'm not sure whether numba's jitted classes are mature enough, though, or whether we could avoid using them here.
The biggest advantage of using numba is just-in-time compilation, which allows very flexible metric functions.
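One possible way to get such flexible metrics is the closure pattern from the numba FAQ: a factory captures a jitted metric and returns a jitted query function specialised for it. The sketch below is only an illustration under that assumption. `make_nearest`, `euclidean` and `manhattan` are hypothetical names, the search is a linear scan rather than a tree query, and the code falls back to plain Python when numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: plain-Python fallback so the sketch still runs
    def njit(func):
        return func


@njit
def euclidean(a, b):
    s = 0.0
    for k in range(a.shape[0]):
        s += (a[k] - b[k]) ** 2
    return np.sqrt(s)


@njit
def manhattan(a, b):
    s = 0.0
    for k in range(a.shape[0]):
        s += abs(a[k] - b[k])
    return s


def make_nearest(metric):
    """Return a jitted nearest-point search specialised for `metric`."""
    @njit
    def nearest(points, query):
        best = np.inf
        best_i = -1
        for i in range(points.shape[0]):
            d = metric(points[i], query)  # `metric` is captured by the closure
            if d < best:
                best = d
                best_i = i
        return best_i
    return nearest


nearest_euclidean = make_nearest(euclidean)
nearest_manhattan = make_nearest(manhattan)
```

Each call to `make_nearest` produces a separate compiled function, so a user-supplied metric only needs to be a jitted function with the same two-array signature.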