
Tree confusion #44

Closed

andreasnoack opened this issue Dec 15, 2020 · 4 comments

Comments

andreasnoack (Member) commented Dec 15, 2020

While digging into the source here, I got confused by the use of the KDTree. I'm not experienced in working with trees, so I'd appreciate some feedback from experienced tree traversers.

What I find confusing about the current implementation is that the KDTree doesn't contain the elements of the set used to construct it, i.e. the rows of the model matrix x. I would have thought the purpose of the tree was to look up each of the rows of x efficiently. However, the tree consists only of the splitting medians for each dimension. Hence, it seems that the purpose of the tree is more to decide which neighborhoods to use for computing the local regression than to organize the rows of x.
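To illustrate my reading of the structure, here is a rough sketch (with hypothetical types, not the ones actually defined in this package): each interior node stores only a split dimension and a splitting median, and descending the tree tells you which cell, i.e. which neighborhood, a query point falls in; the rows of x themselves are never stored in the tree.

```julia
# Hypothetical sketch of the structure as I understand it; these are not
# the actual types in this package. The tree records only where each cell
# was split, never the rows of x.
abstract type AbstractCell end

struct Leaf <: AbstractCell
    lo::Vector{Float64}   # lower corner of the cell
    hi::Vector{Float64}   # upper corner of the cell
end

struct Split <: AbstractCell
    dim::Int              # dimension the cell was split along
    med::Float64          # splitting median in that dimension
    left::AbstractCell    # subcell with q[dim] <= med
    right::AbstractCell   # subcell with q[dim] > med
end

# Descend to the leaf cell containing a query point q. This tells us which
# neighborhood q falls in, without ever looking up a row of x.
function cellof(node::AbstractCell, q::Vector{Float64})
    while node isa Split
        node = q[node.dim] <= node.med ? node.left : node.right
    end
    return node
end

# Example: one split at 0.0 along dimension 1, giving two leaf cells.
tree = Split(1, 0.0, Leaf([-1.0], [0.0]), Leaf([0.0], [1.0]))
cellof(tree, [0.3])   # returns the right leaf, the cell [0.0, 1.0]
```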

I've browsed the two papers by Cleveland, but I was unable to extract any details on the tree structure. I also tried to read a bit of the Fortran source code, but, to put it mildly, the code isn't easy to follow. @dcjones, in case you are still reading GitHub notifications, do you recall what your implementation is based on? I feel like I'm missing something here, so if there is a different reference, I'd like to take a look.

dcjones (Contributor) commented Dec 15, 2020

It's been a while, so I'm afraid I don't really remember how this code works. I'll be a little less busy shortly (my final exam is in an hour 😓). I wouldn't mind delving back into this and making sure it's correct.

andreasnoack (Member, Author)

Congratulations on finishing the last exam. I hope it went well. It would be great if you could take another look at this and check whether things are right.

dcjones (Contributor) commented Jan 19, 2021

I think the main source of confusion is that I didn't cite the right paper. This implementation is actually based on the following paper, which describes the netlib dloess implementation:

Cleveland, William S., and Eric Grosse. "Computational methods for local regression." Statistics and Computing 1.1 (1991): 47–62.

The code looks OK to me. The kd-tree is unnecessarily complicated for the one-dimensional regression that this package supports. It only becomes genuinely useful in higher dimensions, first for selecting the vertices at which to perform the fit, and then for evaluation, by finding adjacent vertices to blend between. With a 1-d response, just sorting plus binary search would be fine (see the sketch below). I used kd-trees anyway with the expectation that I'd eventually support higher-dimensional data.
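To make the 1-d remark concrete, the alternative would look roughly like this (an illustrative sketch with made-up names, assuming distinct sorted vertex locations and a simple linear blend; the actual interpolation in the paper is smoother):

```julia
# Illustrative 1-d sketch, not code from this package: fit local regressions
# at a sorted set of vertices, then evaluate with binary search plus a linear
# blend between the two adjacent vertices. Assumes vs is sorted and distinct.

# vs: sorted vertex locations; fv: fitted value at each vertex.
function evaluate1d(vs::Vector{Float64}, fv::Vector{Float64}, q::Float64)
    q <= vs[1]   && return fv[1]
    q >= vs[end] && return fv[end]
    i = searchsortedlast(vs, q)          # last vertex at or below q
    t = (q - vs[i]) / (vs[i+1] - vs[i])  # relative position between vertices
    return (1 - t) * fv[i] + t * fv[i+1] # blend the two adjacent fits
end

# Example: three vertices with fitted values; evaluate halfway between two.
evaluate1d([0.0, 1.0, 2.0], [0.0, 10.0, 4.0], 1.5)  # 7.0
```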

andreasnoack (Member, Author)

Indeed. That paper clarifies what's going on. It looks like they anticipated the possible confusion:

"It [the k-d tree] was originally designed for answering nearest-neighbor queries but we use it in a different way."

I've added the paper to the list of references in a4ae2f0 so this issue can be closed.
