Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create interface for indexible tables in IndexedTableAccess #1938

Merged
merged 24 commits into from
Aug 21, 2023

Conversation

nicktobey
Copy link
Contributor

Currently, only ResolvedTables are allowed to have indexes. There exists an interface, sql.IndexAddressable, which any node or table can implement in order to be a candidate for index-based optimization. But in practice, implementing that interface won't actually do anything because the IndexedTableAccess struct explicitly requires a ResolvedTable.

This PR replaces the ResolvedTable field in IndexedTableAccess with a new interface tentatively called TableNode, although a more specific name would probably be better.

In order for a node to be used for index-based optimization, it must implement this interface, and the table returned by the UnderlyingTable method must implement sql.IndexAddressable

@nicktobey nicktobey force-pushed the nicktobey/table-functions branch 2 times, most recently from 0e965aa to 2cfc612 Compare August 14, 2023 20:42
Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the generalization, I am not sure how UnresolvedTable fits in, but those only exist on non-execution paths now. I think it would be a mistake not to add a small test indexable table function in GMS as part of these changes, but I think it is equally good to get the first set of changes in and then follow-up with the test. TestTableFunctions and memory/sequence_table.go are a reference for how to do this.

sql/plan/ddl.go Outdated Show resolved Hide resolved
Copy link
Member

@zachmu zachmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New interface seems fine to me, TableNode is a fine name

sql/analyzer/tables.go Outdated Show resolved Hide resolved
sql/plan/procedure_resolved_table.go Outdated Show resolved Hide resolved
sql/plan/resolved_table.go Show resolved Hide resolved
@nicktobey
Copy link
Contributor Author

Prior to this PR, it was not possible for system table functions to have indexes. (You could make the table function implement sql.IndexAddressable, but it wouldn’t actually do anything.)

It’s now possible to write a system table function that can be optimized. The workflow looks like this:

  1. Create a sql.Partition implementation that encodes all information necessary for the table function logic to only generate rows that match the filter. For example, include a field in the partition object for each field being filtered on.

Partitions were a feature inherited from GMS, designed to facilitate parallelism: a table is divided into multiple partitions that can each be iterated over separately. Dolt doesn’t support parallelism, but we still use partitions because indexes are implemented via partitions: every node that is indexable converts index lookups into an PartitionIterator.

  1. Implement a Partitions method, which returns a partition iterator that will contain every row in the table exactly once. This method is called in situations where the caller wants to get every row in the table.

  2. Implement a PartitionRows method, which takes a partition and returns a RowIter. Most likely, either this or the existing RowIter function will end up calling the other one.

  3. Implement GetIndexes, which provides the list of indexes that the table supports.

These don’t need to be full implementations of the sql.Index interface. Most index functionality won’t actually be used. In the case of table functions, it’s not even possible for a query to inspect these indexes. There doesn't need to be an underlying table.

  1. Implement LookupPartitions, which takes in IndexLookup and generates a PartitionIter.

The process of implementing these methods is cumbersome, but straightforward, so long as you’re only exposing a single index on a single column. Optimizing on multiple columns becomes more complicated.

In addition, there may be optimizations that can’t be expressed as an Index. For instance, imagine a hypothetical system table function that has N columns and any combination of these columns can be filtered on efficiently. In order to make sure that every combination of filters can be done efficiently in the current framework, the node would need to provide N! different indexes.

A better solution would be to allow system tables (and system table functions) to be aware of filters when generating their rows. Conceptually, this could be done by having system tables implement an interface which consumes a filter expression and produces a new system table node. In cases where no optimization can be performed, the interface returns the original system table unchanged. Then we add an optimization that runs after we push filters down the tree, that pattern matches for filters whose child nodes implement this interface.

If we decide to optimize more system table functions in the future, we should strongly consider this better solution, since it will result in cleaner, more readable, more maintainable code that’s faster to write.

@nicktobey nicktobey merged commit dcdb9ae into main Aug 21, 2023
6 checks passed
@nicktobey nicktobey deleted the nicktobey/table-functions branch August 21, 2023 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants