Introduce the ability to show estimated progress of copy #146
Conversation
This PR introduces a new `log_progress` function that estimates progress during the data copy. It compares the sizes of the parent table and the shadow table using `pg_table_size`. Since we can't count rows directly while the copy transaction is in flight, `pg_table_size` provides the closest estimate. Before the copy, we run a vacuum and analyze on the parent table to improve the accuracy of the size measurement. Note that the `pg_table_size` of the shadow table may not exactly match that of the parent table, especially for larger tables; this discrepancy is expected, since the logging is a best-effort measure. The function terminates once `copy_data` completes, which it detects by watching an instance variable, `@copy_finished`. Progress is logged at 60-second intervals.
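The mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the gem's actual implementation: the class name is hypothetical, and the injected `size_of` callable stands in for a `SELECT pg_table_size(...)` query (e.g. via the pg gem) so the sketch stays self-contained.

```ruby
# Hypothetical sketch of a pg_table_size-based progress estimator.
# A background thread compares shadow vs. parent table size on an
# interval until the copy signals completion via @copy_finished.
class CopyProgressLogger
  def initialize(parent_table:, shadow_table:, size_of:, interval: 60)
    @parent_table = parent_table
    @shadow_table = shadow_table
    @size_of = size_of          # stand-in for a pg_table_size query
    @interval = interval
    @copy_finished = false
  end

  # Rough completion percentage: shadow size relative to parent size.
  # Sizes won't match exactly on large tables; this is best effort.
  def estimate
    parent = @size_of.call(@parent_table)
    shadow = @size_of.call(@shadow_table)
    return 0.0 if parent.zero?
    ((shadow.to_f / parent) * 100).round(2)
  end

  # Log every @interval seconds until the copy finishes.
  def start
    @thread = Thread.new do
      until @copy_finished
        puts "Estimated copy progress: #{estimate}%"
        sleep @interval
      end
    end
  end

  # Called when copy_data completes; flips the flag the thread watches.
  def finish!
    @copy_finished = true
    @thread&.join
  end
end
```

In real use the logger would query `pg_table_size` over a live connection; the callable is only an injection seam for illustration.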
Just a heads up / FYI @jfrost
```ruby
logger.info("Setting up shadow table", { shadow_table: shadow_table })
Query.run(
  client.connection,
  "SELECT create_table_all('#{client.table_name}', '#{shadow_table}');",
)
```
Noting: this bit. I believe it's OK to create the shadow table outside of the serializable transaction. Everything that follows (running the alter statement, deleting rows from the audit table, and then copying the data from the audit table into the shadow table) happens in a single transaction. That's the important bit that ensures no duplicates are copied over.
And by doing this, it allows us to show estimated progress as well.
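To illustrate why this ordering matters: a table created inside an uncommitted transaction is invisible to other sessions, so a separate monitoring connection could not even resolve its name, let alone size it. A hedged sketch of the sequence described in the comment above, where `exec` stands in for something like `PG::Connection#exec` and the audit-table naming and SQL bodies are assumptions:

```ruby
# Hedged sketch of the ordering from the comment above: the shadow table
# is created (and committed) on its own, so other sessions such as a
# progress logger can see it; everything that must be atomic to avoid
# duplicate rows shares one serializable transaction.
def prepare_and_copy(conn, table: "users", shadow: "users_shadow")
  # Outside the transaction: immediately visible to other connections.
  conn.exec("SELECT create_table_all('#{table}', '#{shadow}')")

  # Single serializable transaction for the duplicate-sensitive steps.
  conn.exec("BEGIN ISOLATION LEVEL SERIALIZABLE")
  conn.exec("DELETE FROM #{table}_audit")                    # audit-table name assumed
  conn.exec("INSERT INTO #{shadow} SELECT * FROM #{table}")  # the bulk copy
  conn.exec("COMMIT")
end
```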
Force-pushed from 4876fe3 to cfd8b8e
@shayonj I wasn't able to get this to work.
Oh interesting! Yeah, if you wouldn't mind opening an issue 🙏🏾
Looks like the progress reporting does not work if a custom copy statement is used, per this return statement here. It seems there might not really be a technical reason preventing it?
Oh interesting, thanks for catching that. Feel free to open a PR and I can help do some testing as well.
I tried some options, but the copy progress does not really work at all, as stated earlier in this thread. I think the transaction isolation level on the copy is preventing reads, which is likely what the table-size check does under the hood in some form. Either way, on the large operations where I've tested it (think 24h runtimes or more) it does not report progress in any way, just lock timeout errors.
Yeah, you're right. I think a refactor is in order here: use two connections sharing a single snapshot ID, one that manages the copy and another that reports progress on it. Let me attempt that in the next few weeks and get back.
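For reference, the two-connection idea above maps onto Postgres snapshot export: the copy connection calls `pg_export_snapshot()` and the progress connection imports it with `SET TRANSACTION SNAPSHOT`, so both see consistent data while the progress connection can still run its own reads. A rough sketch of that handshake; `exec`/`exec_value` are hypothetical stand-ins for pg gem calls:

```ruby
# Hedged sketch of snapshot sharing between two connections.
# The exporting transaction must stay open for the snapshot to remain
# valid, and the importer must be REPEATABLE READ (or SERIALIZABLE)
# and import the snapshot before running any other query.
def share_snapshot(copy_conn, progress_conn)
  copy_conn.exec("BEGIN ISOLATION LEVEL REPEATABLE READ")
  snapshot_id = copy_conn.exec_value("SELECT pg_export_snapshot()")

  progress_conn.exec("BEGIN ISOLATION LEVEL REPEATABLE READ")
  progress_conn.exec("SET TRANSACTION SNAPSHOT '#{snapshot_id}'")
  snapshot_id
end
```

With this in place, the progress connection could size the shadow table against the same snapshot the copy sees, rather than being blocked by the copy's transaction.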
Lastly, this change also moves the creation of the shadow table outside of the serializable transaction so other connections can see it.
Towards: #102