Tree confusion #44
It's been a while, so I'm afraid I don't really remember how this code works. I'll be a little less busy shortly (my final exam is in an hour 😓). I wouldn't mind delving back into this and making sure it's correct.
Congratulations on finishing the last exam. I hope it went well. It would be great if you could take a look at this again and see if things are right.
I think the main source of confusion is that I didn't cite the right paper. This implementation is actually based on this paper (which describes the netlib dloess implementation):
The code looks ok to me. The kd-tree is unnecessarily complicated for the one-dimensional regression that this package supports. It becomes genuinely useful in higher dimensions: first for selecting the vertices at which to perform the fit, then, during evaluation, for finding adjacent vertices to blend between. With a 1d response, sorting plus binary search would be fine. I used kd-trees here anyway, expecting that I'd eventually support higher-dimensional data.
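For illustration, here is a minimal sketch of the sort-plus-binary-search alternative mentioned above for the 1d case. The helper name `k_nearest_1d` is hypothetical, and Python is used purely as pseudocode; this is not the package's actual implementation:

```python
import bisect

def k_nearest_1d(xs_sorted, x0, k):
    """Return the k values in the sorted list xs_sorted closest to x0.

    Binary search locates x0, then a window expands outward, taking
    the nearer of the two frontier candidates at each step.
    O(log n + k) per query, with no tree structure needed.
    """
    n = len(xs_sorted)
    i = bisect.bisect_left(xs_sorted, x0)
    lo, hi = i - 1, i  # candidates just below / at-or-above x0
    out = []
    for _ in range(min(k, n)):
        if lo < 0:
            out.append(xs_sorted[hi]); hi += 1
        elif hi >= n:
            out.append(xs_sorted[lo]); lo -= 1
        elif x0 - xs_sorted[lo] <= xs_sorted[hi] - x0:
            out.append(xs_sorted[lo]); lo -= 1
        else:
            out.append(xs_sorted[hi]); hi += 1
    return sorted(out)

xs = sorted([0.1, 0.4, 0.5, 0.9, 1.3, 2.0])
print(k_nearest_1d(xs, 0.6, 3))  # -> [0.4, 0.5, 0.9]
```

This kind of window expansion from a binary-search insertion point is all that neighborhood selection requires when the predictor is one-dimensional.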
Indeed. That paper clarifies what's going on. It looks like they anticipated the possible confusion.
I've added the paper to the list of references in a4ae2f0 so this issue can be closed.
While digging into the source here, I was confused by the use of the KDTree. I'm not experienced in working with trees, so I'd appreciate some feedback from experienced tree traversers.
What I find confusing about the current implementation is that the KDTree doesn't contain the elements of the set used to construct it, i.e. the rows of the model matrix x. I would have thought that the purpose of the tree was to look up rows of x efficiently. However, the tree consists only of the splitting medians for each dimension. Hence, it seems that the purpose of the tree is more to decide which neighborhoods to use for computing the local regression than to serve as a structure for organizing the rows of x.

I've browsed the two papers by Cleveland, but I was unable to extract any details on the tree structure. I also tried to read a bit of the Fortran source code, but, to put it mildly, it isn't easy to follow. @dcjones, in case you are still reading GitHub notifications, do you recall what your implementation is based on? I feel like I'm missing something here, so if there is a different reference I'd like to take a look.
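To make the "tree of splitting medians" idea concrete, here is a minimal sketch, in Python and under my own assumptions rather than the package's actual code, of a kd-tree whose internal nodes store only the split axis and median. The points themselves matter only insofar as they fall into leaf cells, which define the neighborhoods where local fits would be computed:

```python
import statistics

def build_tree(points, depth=0, leaf_size=2):
    """Build a kd-tree storing only splitting medians at internal nodes.

    Internal node: ("node", axis, median, left_subtree, right_subtree).
    Leaf: ("leaf", points_in_cell). The tree partitions space into
    cells; it is a decision structure, not a container of the data.
    """
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])       # cycle through dimensions
    median = statistics.median(p[axis] for p in points)
    left = [p for p in points if p[axis] <= median]
    right = [p for p in points if p[axis] > median]
    if not left or not right:           # degenerate split: stop here
        return ("leaf", points)
    return ("node", axis, median,
            build_tree(left, depth + 1, leaf_size),
            build_tree(right, depth + 1, leaf_size))

def locate(tree, q):
    """Descend by comparing q to the stored medians to find q's cell."""
    while tree[0] == "node":
        _, axis, median, left, right = tree
        tree = left if q[axis] <= median else right
    return tree[1]  # the points defining this cell's neighborhood

pts = [(0.1,), (0.4,), (0.5,), (0.9,), (1.3,), (2.0,)]
tree = build_tree(pts)
print(locate(tree, (0.6,)))  # -> [(0.5,)]
```

Under this reading, a query point is routed through the medians to a cell, and the cell (not the tree) determines which local fit applies, which matches the observation that the tree serves neighborhood selection rather than row lookup.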