
Introduce the ability to show estimate progress of copy #146

Merged · 7 commits · Jan 15, 2024
Conversation

shayonj (Owner) commented Jan 15, 2024

This PR introduces a new log_progress function to estimate progress during data operations. The function compares the sizes of the parent table and the shadow table using pg_table_size. Since we can't count the rows directly due to the ongoing transaction, pg_table_size provides the closest available estimate. Before this, we run a VACUUM and ANALYZE on the parent table to improve the accuracy of the size measurement.

It's important to note that the pg_table_size of the shadow table may not exactly match that of the parent table, especially for larger tables. This discrepancy is expected, since the logging is a best-effort measure. The function terminates once copy_data completes, which it detects by watching an instance variable, @copy_finished.

Progress is logged at 60-second intervals.

Lastly, this change also moves the creation of the shadow table outside of the serializable transaction so other connections can see it.

Towards: #102
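The loop described above can be sketched roughly as follows. This is a hypothetical illustration, not pg-osc's actual implementation: `conn` is assumed to be a PG::Connection-like object responding to `exec_params`, and `@copy_finished` is assumed to be flipped by the thread running `copy_data`.

```ruby
# Size-based progress estimate; capped at 100 because the shadow table's
# measured size can temporarily overshoot the parent's.
def estimate_progress(parent_bytes, shadow_bytes)
  return 0.0 if parent_bytes.zero?
  [(shadow_bytes.to_f / parent_bytes) * 100, 100.0].min.round(1)
end

# Polls pg_table_size for both tables every `interval` seconds until
# the copy thread sets @copy_finished.
def log_progress(conn, parent_table, shadow_table, interval: 60)
  until @copy_finished
    row = conn.exec_params(
      "SELECT pg_table_size($1::regclass) AS parent_size,
              pg_table_size($2::regclass) AS shadow_size",
      [parent_table, shadow_table]
    ).first
    pct = estimate_progress(row["parent_size"].to_i, row["shadow_size"].to_i)
    puts "Estimated copy progress: #{pct}%"
    sleep interval
  end
end
```

The estimate is intentionally coarse: it tracks on-disk size rather than row counts, which is why it stays best-effort.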

shayonj (Owner, Author) commented Jan 15, 2024

Just a heads up / FYI @jfrost


Comment on lines +167 to +172
logger.info("Setting up shadow table", { shadow_table: shadow_table })
Query.run(
  client.connection,
  "SELECT create_table_all('#{client.table_name}', '#{shadow_table}');",
)

shayonj (Owner, Author) commented:
Noting: this bit. I believe it's OK to create the shadow table outside the serializable transaction. Everything that follows (running the ALTER statement, deleting rows from the audit table, and then copying the data from the audit table into the shadow table) still happens in a single transaction, and that's the important part for ensuring no duplicates are copied over.

And by doing this, it allows us to show estimated progress as well.
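As a hedged illustration of that ordering (table names and statement text are schematic placeholders, not pg-osc's exact SQL), the sequence of statements looks roughly like this, with only the first one issued outside the serializable transaction:

```ruby
# Illustrative only: the shadow table is created before the serializable
# transaction opens, so other sessions (like the progress logger) can see it,
# while the duplicate-sensitive work stays inside one transaction.
def copy_plan(parent, shadow, audit)
  [
    "SELECT create_table_all('#{parent}', '#{shadow}');",  # outside the txn, visible to others
    "BEGIN;",
    "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;",
    "ALTER TABLE #{shadow} ...;",                          # placeholder: the schema change
    "DELETE FROM #{audit} WHERE ...;",                     # placeholder: drop pre-copy audit rows
    "INSERT INTO #{shadow} SELECT * FROM #{parent};",      # placeholder: the bulk copy
    "COMMIT;"
  ]
end
```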

@shayonj shayonj force-pushed the s/progress branch 2 times, most recently from 4876fe3 to cfd8b8e Compare January 15, 2024 18:37
@shayonj shayonj merged commit a6e60be into main Jan 15, 2024
0 of 12 checks passed
@shayonj shayonj deleted the s/progress branch January 15, 2024 18:42
brycethornton (Contributor) commented:
@shayonj I wasn't able to get this to work. The SELECT pg_table_size call seemed to be blocked by the INSERT; I also tried running it manually and it just hung. pg-osc did eventually finish after about 10 hours, so the attempt to show progress didn't seem to negatively affect the process, it just didn't work as intended. I can open an issue if you'd like.


shayonj (Owner, Author) commented Feb 11, 2024

Oh interesting! Yeah, if you wouldn't mind opening an issue 🙏🏾

tanelsuurhans (Contributor) commented:
It looks like progress reporting doesn't work when a custom copy statement is used, because of this return statement here. There doesn't seem to be a technical reason that prevents supporting it?

shayonj (Owner, Author) commented Aug 20, 2024

Oh interesting, thanks for catching that. Feel free to open a PR and I can help with some testing as well.

tanelsuurhans (Contributor) commented:
I tried some options, but the copy progress doesn't really work at all, as stated earlier in this thread. I think the transaction isolation level on the copy is preventing the reads that the table-size query likely performs under the hood in some form. Either way, on the large operations where I've tested it (think 24-hour runtimes or more), it doesn't report progress in any way, just lock timeout errors.

shayonj (Owner, Author) commented Aug 24, 2024

Yeah, you're right. I think a refactor is in order here: use two connections with a single snapshot ID, one that manages the copy and another that reports progress on it. Let me attempt that in the next few weeks and get back.
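One way to sketch that two-connection idea uses PostgreSQL's snapshot-synchronization functions (`pg_export_snapshot()` and `SET TRANSACTION SNAPSHOT`). This is a hypothetical sketch of the proposed refactor, not code that exists in pg-osc; `copy_conn` and `progress_conn` are assumed to be PG::Connection-like objects.

```ruby
# SET TRANSACTION SNAPSHOT must be the first statement of a
# REPEATABLE READ (or SERIALIZABLE) transaction.
def import_snapshot_sql(snapshot_id)
  "SET TRANSACTION SNAPSHOT '#{snapshot_id}';"
end

# The copy connection exports its snapshot; the progress connection imports
# it so both sessions observe the same data, letting progress queries run
# on a separate connection instead of queuing behind the copy transaction.
def with_shared_snapshot(copy_conn, progress_conn)
  copy_conn.exec("BEGIN ISOLATION LEVEL REPEATABLE READ;")
  snapshot_id = copy_conn.exec("SELECT pg_export_snapshot();").getvalue(0, 0)

  progress_conn.exec("BEGIN ISOLATION LEVEL REPEATABLE READ;")
  progress_conn.exec(import_snapshot_sql(snapshot_id))
  # progress_conn can now, in principle, poll size/progress queries
  # while copy_conn performs the copy.
  yield(copy_conn, progress_conn)
ensure
  progress_conn.exec("COMMIT;")
  copy_conn.exec("COMMIT;")
end
```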
