
Tree confusion #44

Closed

andreasnoack opened this issue Dec 15, 2020 · 4 comments

Comments

andreasnoack (Member) commented Dec 15, 2020

While digging into the source here, I got confused by the use of the KDTree. I'm not experienced in working with trees, so I'd appreciate some feedback from experienced tree traversers.

What I find confusing about the current implementation is that the KDTree doesn't contain the elements of the set used to construct it, i.e. the rows of the model matrix x. I would have thought the purpose of the tree was to look up each of the rows of x efficiently. However, the tree consists only of the splitting medians for each dimension. Hence, it seems that the purpose of the tree is more to decide which neighborhoods to use for computing the local regression than to organize the rows of x.
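To illustrate my reading of the structure, here is a rough sketch (with hypothetical types, not the ones actually defined in this package): each interior node stores only a split dimension and a splitting median, and descending the tree tells you which cell, i.e. which neighborhood, a query point falls in; the rows of x themselves are never stored in the tree.

```julia
# Hypothetical sketch of the structure as I understand it; these are not
# the actual types in this package. The tree records only where each cell
# was split, never the rows of x.
abstract type AbstractCell end

struct Leaf <: AbstractCell
    lo::Vector{Float64}   # lower corner of the cell
    hi::Vector{Float64}   # upper corner of the cell
end

struct Split <: AbstractCell
    dim::Int              # dimension the cell was split along
    med::Float64          # splitting median in that dimension
    left::AbstractCell    # subcell with q[dim] <= med
    right::AbstractCell   # subcell with q[dim] > med
end

# Descend to the leaf cell containing a query point q. This tells us which
# neighborhood q falls in, without ever looking up a row of x.
function cellof(node::AbstractCell, q::Vector{Float64})
    while node isa Split
        node = q[node.dim] <= node.med ? node.left : node.right
    end
    return node
end

# Example: one split at 0.0 along dimension 1, giving two leaf cells.
tree = Split(1, 0.0, Leaf([-1.0], [0.0]), Leaf([0.0], [1.0]))
cellof(tree, [0.3])   # returns the right leaf, the cell [0.0, 1.0]
```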

I've browsed the two papers by Cleveland, but I was unable to extract any details on the tree structure. I also tried to read a bit of the Fortran source code, but, to put it mildly, the code isn't easy to follow. @dcjones, in case you are still reading GitHub notifications, do you recall what your implementation is based on? I feel like I'm missing something here, so if there is a different reference, I'd like to take a look.

dcjones (Contributor) commented Dec 15, 2020

It's been a while, so I'm afraid I don't really remember how this code works. I'll be a little less busy shortly (my final exam is in an hour 😓). I wouldn't mind delving back into this and making sure it's correct.

andreasnoack (Member, Author)

Congratulations on finishing the last exam. I hope it went well. It would be great if you could take another look at this and check whether things are right.

dcjones (Contributor) commented Jan 19, 2021

I think the main source of confusion is that I didn't cite the right paper. This implementation is actually based on the following paper, which describes the netlib dloess implementation:

Cleveland, William S., and Eric Grosse. "Computational methods for local regression." Statistics and Computing 1.1 (1991): 47–62.

The code looks OK to me. The kd-tree is unnecessarily complicated for the one-dimensional regression that this package supports. It only becomes genuinely useful in higher dimensions, first for selecting the vertices at which to perform the fit, and then for evaluation, by finding adjacent vertices to blend between. With a 1-d response, just sorting plus binary search would be fine (see the sketch below). I used kd-trees anyway with the expectation that I'd eventually support higher-dimensional data.
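To make the 1-d remark concrete, the alternative would look roughly like this (an illustrative sketch with made-up names, assuming distinct sorted vertex locations and a simple linear blend; the actual interpolation in the paper is smoother):

```julia
# Illustrative 1-d sketch, not code from this package: fit local regressions
# at a sorted set of vertices, then evaluate with binary search plus a linear
# blend between the two adjacent vertices. Assumes vs is sorted and distinct.

# vs: sorted vertex locations; fv: fitted value at each vertex.
function evaluate1d(vs::Vector{Float64}, fv::Vector{Float64}, q::Float64)
    q <= vs[1]   && return fv[1]
    q >= vs[end] && return fv[end]
    i = searchsortedlast(vs, q)          # last vertex at or below q
    t = (q - vs[i]) / (vs[i+1] - vs[i])  # relative position between vertices
    return (1 - t) * fv[i] + t * fv[i+1] # blend the two adjacent fits
end

# Example: three vertices with fitted values; evaluate halfway between two.
evaluate1d([0.0, 1.0, 2.0], [0.0, 10.0, 4.0], 1.5)  # 7.0
```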

andreasnoack (Member, Author)

Indeed. That paper clarifies what's going on. It looks like they anticipated the possible confusion:

"It [the k-d tree] was originally designed for answering nearest-neighbor queries but we use it in a different way."

I've added the paper to the list of references in a4ae2f0 so this issue can be closed.
