feat: add an Arrow-based, columnar binlog buffer #91

Merged 22 commits on Sep 29, 2024

@fanyang01 (Collaborator) commented Sep 20, 2024

This PR introduces a columnar delta buffer to store the DML change logs replicated from the primary server. The change logs are transmitted via ROWS_EVENT binlog events, which cover INSERT, UPDATE, and DELETE operations.

This PR is the first step towards resolving issues #55 and #56. Currently, the buffer is flushed to DuckDB for each binlog event, which is not the intended usage. The primary goal of this PR is to ensure that the new components function as expected and pass the existing tests. I will improve the scheduling of buffer flushing in forthcoming PRs.

To enhance the performance of the columnar delta buffer, the following changes have been made:

  • Modified the Vitess binlog parser to write directly to the Arrow array builder (binlog/rbr.go). This modification minimizes copying and allocation, which is crucial for high performance.
  • Flushed the buffered changes to DuckDB in a columnar manner through Arrow IPC (replica/controller.go). This is key to avoiding issue #55 (Row-by-Row INSERTs are very slow in DuckDB). This implementation can be further optimized to a zero-copy approach once PR #283 is merged.
  • Switched from dolthub/vitess to the official vitess.io/vitess for binlog handling. The official repository has been refactored since the DoltHub fork and thus handles JSON data better and is easier to use. Since our project depends on vitess.io/vitess and dolthub/go-mysql-server, and the latter depends on dolthub/vitess, I have forked go-mysql-server and dolthub/vitess to our organization to resolve the conflicts.
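To illustrate the idea of buffering row events column by column, here is a minimal sketch. It is not the code in binlog/rbr.go: plain Go slices stand in for Arrow array builders, the names `DeltaBuffer`, `Action`, `Append`, and `AppendUpdate` are hypothetical, and recording an UPDATE as a delete image followed by an insert image is an assumption made for the example.

```go
package main

import "fmt"

// Action tags each buffered row with the DML operation that produced it.
type Action int8

const (
	Insert Action = iota
	Delete
)

// DeltaBuffer is a hypothetical columnar buffer: one slice per column
// plus a parallel action column. The slices are a stand-in for Arrow
// array builders, which the PR writes to directly.
type DeltaBuffer struct {
	names   []string
	actions []Action
	columns [][]any // columns[i][r] is the value of column i in buffered row r
}

func NewDeltaBuffer(names ...string) *DeltaBuffer {
	return &DeltaBuffer{names: names, columns: make([][]any, len(names))}
}

// Append scatters one row image into the per-column slices.
func (b *DeltaBuffer) Append(a Action, row ...any) error {
	if len(row) != len(b.columns) {
		return fmt.Errorf("expected %d values, got %d", len(b.columns), len(row))
	}
	b.actions = append(b.actions, a)
	for i, v := range row {
		b.columns[i] = append(b.columns[i], v)
	}
	return nil
}

// AppendUpdate records an UPDATE as its before image (Delete) followed
// by its after image (Insert). This encoding is an assumption.
func (b *DeltaBuffer) AppendUpdate(before, after []any) error {
	if len(after) != len(b.columns) {
		return fmt.Errorf("expected %d values, got %d", len(b.columns), len(after))
	}
	if err := b.Append(Delete, before...); err != nil {
		return err
	}
	return b.Append(Insert, after...)
}

// Len returns the number of buffered row images.
func (b *DeltaBuffer) Len() int { return len(b.actions) }

// Column returns the buffered values of column i.
func (b *DeltaBuffer) Column(i int) []any { return b.columns[i] }

func main() {
	buf := NewDeltaBuffer("id", "name")
	_ = buf.Append(Insert, 1, "a")
	_ = buf.AppendUpdate([]any{1, "a"}, []any{1, "b"})
	fmt.Println(buf.Len(), buf.Column(1))
}
```

Because each column is contiguous, a flush can hand whole columns to the target database in one batch instead of issuing one INSERT per row.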

This PR has passed all existing binlog replication tests.

@fanyang01 marked this pull request as ready for review September 29, 2024 06:56
@fanyang01 changed the title from "[WIP] feat: delta service" to "feat: add Arrow-based delta buffer" Sep 29, 2024
@fanyang01 changed the title from "feat: add Arrow-based delta buffer" to "feat: add Arrow-based columnar delta buffer" Sep 29, 2024
@fanyang01 changed the title from "feat: add Arrow-based columnar delta buffer" to "feat: add Arrow-based columnar binlog buffer" Sep 29, 2024
@fanyang01 changed the title from "feat: add Arrow-based columnar binlog buffer" to "feat: add an Arrow-based, columnar binlog buffer" Sep 29, 2024

@GaoYusong (Contributor) left a comment

LGTM! This PR lays the foundation for seamless OLTP synchronization, making MyDuckServer as easy to use as Apple products. Cheers to Delta!

@fanyang01 (Collaborator, Author) commented

Thanks for your detailed review! The next step would be letting the delta appender buffer as much data as possible until 1) the memory usage becomes too big; 2) some of the tables are queried; 3) a pre-defined time ticker (e.g., every 1 minute) is fired.
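The three flush triggers listed above could be combined along the following lines. This is only a sketch of the proposal, not code from this PR: the names `FlushPolicy`, `BufferState`, `Check`, and `DefaultPolicy` are hypothetical, and the 64 MiB memory limit is an assumed value (only the 1-minute interval comes from the comment).

```go
package main

import (
	"fmt"
	"time"
)

// FlushReason says why the buffer should be flushed ("" means keep buffering).
type FlushReason string

const (
	NoFlush        FlushReason = ""
	MemoryPressure FlushReason = "memory"
	TableQueried   FlushReason = "queried"
	TimerFired     FlushReason = "timer"
)

// FlushPolicy holds the thresholds for the memory and timer triggers.
type FlushPolicy struct {
	MaxBytes int64         // flush once the buffer holds at least this many bytes
	MaxAge   time.Duration // flush once this much time has passed since the last flush
}

// DefaultPolicy returns example thresholds; the 64 MiB limit is an assumption.
func DefaultPolicy() FlushPolicy {
	return FlushPolicy{MaxBytes: 64 << 20, MaxAge: time.Minute}
}

// BufferState is a snapshot of the delta buffer at decision time.
type BufferState struct {
	Bytes int64           // approximate memory held by buffered changes
	Age   time.Duration   // time elapsed since the last flush
	Dirty map[string]bool // tables with buffered, unflushed changes
}

// Check applies the three triggers in order: memory, query, timer.
// queried lists the tables referenced by an incoming query, if any.
func (p FlushPolicy) Check(s BufferState, queried ...string) FlushReason {
	if len(s.Dirty) == 0 {
		return NoFlush // nothing buffered, nothing to flush
	}
	if s.Bytes >= p.MaxBytes {
		return MemoryPressure
	}
	for _, t := range queried {
		if s.Dirty[t] {
			return TableQueried
		}
	}
	if s.Age >= p.MaxAge {
		return TimerFired
	}
	return NoFlush
}

func main() {
	p := DefaultPolicy()
	s := BufferState{Bytes: 1 << 10, Age: 5 * time.Second, Dirty: map[string]bool{"orders": true}}
	fmt.Printf("idle: %q, query on orders: %q\n", p.Check(s), p.Check(s, "orders"))
}
```

Checking the query trigger only against tables with buffered changes keeps reads of untouched tables from forcing a flush.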

@fanyang01 merged commit 28ed15f into main Sep 29, 2024
1 check passed

@GaoYusong (Contributor) commented

> Thanks for your detailed review! The next step would be letting the delta appender buffer as much data as possible until 1) the memory usage becomes too big; 2) some of the tables are queried; 3) a pre-defined time ticker (e.g., every 1 minute) is fired.

sounds great!
