-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RTree
index results in significantly slower execution
#444
Comments
Hello!
Alright, so the first thing to understand is that DuckDB indexes don't store any actual data themselves. in the case of the R-Tree the index itself only contains bounding boxes and In comparison, when the index is not used and DuckDB performs a "full table scan", the lookup is just a single phase. There is no "jumping around" needed, because we know we're going to have to scan the whole table we just have to do a single pass across all table pages. It's also very easy to parallelize. Alright, so why use the index at all? Well, to avoid having to scan as much. there's two scenarios where that's preferable:
Regarding 2. that's usually the case when dealing with geospatial predicates, however, in this case a simple "point in small polygon" is comparatively cheap, and DuckDB is just really good at brute-force computations. So a selectivity of about ~20% (as in your example) might not actually be enough. For the adaptive-radix-trie we don't recommend creating one unless you have highly selective filters (where you expect to filter out everything except about 0.1% of rows), although Id probably say the R-Tree would still be beneficial for much higher percentages, particularly if you have more complex larger geometries.
Even if the whole R-Tree index fits in RAM, the indirection required to look up individual row-ids is still slower on disk than in memory because you may have to load a disk page just to conclude that the row-id is not present and continue on.
Unfortunately this is not really possible with the current extension capabilities in DuckDB. Extension types can't provide custom statistics (such as spatial extent or distribution) to influence query planning, and deciding when to use an index or not will more or less always come down to heuristics. |
I load a dataset of ~11.3 M points into DuckDB using the following sql:
To reproduce, you can download the
input_data.csv
from here.Then, I set the threads to one and run the following sql:
here is the query profile:
Next I make an
RTree index
using:and I run the same query again. here is the query profile with the index:
I am wondering if this a bug in RTree, or I am using the spatial extension wrong.
The other strange thing that I observed is that if I run the same sql when the DuckDB session that is persisted (i.e., DuckDB started with a database file), then the non-indexed query performs the same, but the query that uses the index is even slower. here is a query profile for the same query when the database is persisted:
As I have enough RAM that the index can totally fit in the RAM, I don't understand why it's slower when I run it when database is persisted.
If this is a specific scenario where an RTree index is not preferred, I think it might be best to detect it in query planning and don't use an index for that query.
Version of DuckDB: v1.1.2 f680b7d08f
p.s.: I have 128GB of RAM in my computer and I set the threads specifically to 1, otherwise the non-indexed query is even faster cause it processes the data in parallel while the query plan that uses the RTree index is being executed sequentially.
The text was updated successfully, but these errors were encountered: