-
So with DuckDB 1.1 there's GeoParquet output, with bbox struct (awesome), and r-tree indexes (also awesome). But I loaded some geospatial data that wasn't ordered and then wrote out the geoparquet and it's clearly not indexed / ordered at all. Is there a way to tell DuckDB to order the data by the r-tree? I tried creating an r-tree index and then doing 'order by geometry', but it didn't seem to do anything. My default way would be to create a quadkey and order by that - but it seems like the native r-tree should be able to help me and hopefully remove a couple steps. Happy to contribute to the docs / share broadly if there's a way to do it. |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 2 replies
-
Hello! Unfortunately, DuckDB does not support overriding the comparison operators (used to sort) for custom types, meaning Perhaps there could be a "dummy" scalar function the optimizer could detect to do this if the column is r-tree indexed, or otherwise error out, e.g. There are also some other issues related to the preservation (or lack thereof) of row-order when scanning from a table in parallel or writing to parquet. I think the only way to guarantee the output order of the rows within parquet files is to execute a ORDER BY, which will emit chunks with a "batch index", that the parquet writer can use to reorganize incoming chunks in flight and ensure they are written in sorted order, even when executing with multiple threads. So uh, yeah, we could look into adding some sort of dummy marker function + optimizer rule + index-batch-scan operator, but I don't think its going to be much faster or produce much better results than just doing a full table scan + explicit sort, e.g. |
Beta Was this translation helpful? Give feedback.
Hello! Unfortunately, DuckDB does not support overriding the comparison operators (used to sort) for custom types, meaning
ORDER BY <geometry>
will just order based on the lexical order of the bytes in the internal binary representation. The r-tree index is completely separate from ordering. I know that other databases (like Postgres) sometimes can perform an index scan instead of explicitly sorting when executing an order-by, but this requires the ordering criteria to map to the physical order of the index. In the case of R-Trees, I can't think of any way that would work.Perhaps there could be a "dummy" scalar function the optimizer could detect to do this if the column is r-tree indexe…