Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Platform] Old backups are not deleted by a schedule #7780

Closed
SergeyPotachev opened this issue Mar 23, 2021 · 22 comments
Closed

[Platform] Old backups are not deleted by a schedule #7780

SergeyPotachev opened this issue Mar 23, 2021 · 22 comments
Assignees
Labels
area/platform Yugabyte Platform priority/high High Priority
Milestone

Comments

@SergeyPotachev
Copy link
Contributor

SergeyPotachev commented Mar 23, 2021

Backups records in DB (table backup) have empty/null values for schedule_uuid. But old backups are selected for further deletion using the next finder:

  public static List<Backup> getExpiredBackups(UUID scheduleUUID) {
    // Get current timestamp.
    Date now = new Date();
    return Backup.find.query().where()
      .eq("schedule_uuid", scheduleUUID)
      .lt("expiry", now)
      .eq("state", BackupState.Completed)
      .findList();
  }

So no backups could be selected if they don't have the schedule_uuid specified. Also this explains why the customer is able to delete old backups manually.
Possible reason is that the schedule in task_params field has value:
"scheduleUUID":null

If a task, which creates backups, takes schedule_uuid from these parameters then this could be the root cause.
Additional information is inside the ticket 661 (application.log and YW db dump).

@SergeyPotachev SergeyPotachev added the area/platform Yugabyte Platform label Mar 23, 2021
@streddy-yb streddy-yb added this to the 2.7.x milestone Mar 24, 2021
@CharlotteRose
Copy link
Contributor

From the environment:

yugaware=# select count(*) from public.backup;
count
-------
3665 

yugaware=# select count(*) from public.backup where schedule_uuid is not null;
count
-------
2479

yugaware=# select schedule_uuid from public.backup where schedule_uuid is not null;
schedule_uuid
--------------------------------------
e08c9386-3bea-4f43-b08c-54db696f274c
8f69c9d4-b43b-48fc-8505-fe792f9684b6
3f3b8df9-f6a4-4823-84dd-8562cdedac75
a0e1858c-6ecc-4191-8945-af3bcdb32784
7887f209-b352-4feb-b7a2-fbb1bf40c3dd
3f3b8df9-f6a4-4823-84dd-8562cdedac75
a0e1858c-6ecc-4191-8945-af3bcdb32784
a0e1858c-6ecc-4191-8945-af3bcdb32784
a7a5fd2f-36eb-4260-8b6a-d62e729d02d1
a7a5fd2f-36eb-4260-8b6a-d62e729d02d1
e08c9386-3bea-4f43-b08c-54db696f274c
a0e1858c-6ecc-4191-8945-af3bcdb32784
99f650c5-6d0c-4657-a103-24c476ae4d54
0236b9ef-c3e5-4a8f-b8c5-2fa6a9964599
a0e1858c-6ecc-4191-8945-af3bcdb32784
3f3b8df9-f6a4-4823-84dd-8562cdedac75
0236b9ef-c3e5-4a8f-b8c5-2fa6a9964599
a7a5fd2f-36eb-4260-8b6a-d62e729d02d1
3f3b8df9-f6a4-4823-84dd-8562cdedac75
a0e1858c-6ecc-4191-8945-af3bcdb32784
<output truncated> 

@SergeyPotachev - need to know what is the best method to fabricate a UUID for the ~1200 entries where the schedule_UUID is not populated.
Also need guidance on how to validate the cleanup for those with a UUID value.

@sb-yb
Copy link
Contributor

sb-yb commented Mar 26, 2021

@CharlotteRose This is now assigned to me. But I have not had chance to look at it yet. But best way is to set these entries to nil uuid. Something like:

yugaware=# create extension if not exists "uuid-ossp";
yugaware=# select uuid_nil();
               uuid_nil
--------------------------------------
 00000000-0000-0000-0000-000000000000
(1 row)

yugaware=# select uuid_generate_v4();
           uuid_generate_v4
--------------------------------------
 acddb095-76d0-4351-9641-3484f7c71646
(1 row)

So just insert uuid_nil so you can tell these apart.
You may have to insert random uuid if there is some uniqueness constraint that wont allow for multiple entries with same (i.e. nil) uuid.

@sb-yb
Copy link
Contributor

sb-yb commented Mar 26, 2021

I can see that there is no constraint on schedule_uuid - so you can just insert nil and see if that helps. I have not looked at the code yet to answer your other question.

yugaware=# \d backup
                            Table "public.backup"
    Column     |            Type             | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
 backup_uuid   | uuid                        |           | not null |
 customer_uuid | uuid                        |           | not null |
 backup_info   | json_alias                  |           | not null |
 state         | character varying(50)       |           | not null |
 create_time   | timestamp without time zone |           | not null |
 update_time   | timestamp without time zone |           |          |
 task_uuid     | uuid                        |           |          |
 schedule_uuid | uuid                        |           |          |
 expiry        | timestamp without time zone |           |          |
Indexes:
    "pk_backup" PRIMARY KEY, btree (backup_uuid)
    "ix_backup_task_uuid" UNIQUE, btree (task_uuid)

@SergeyPotachev
Copy link
Contributor Author

SergeyPotachev commented Mar 26, 2021

@CharlotteRose
The problem here is that a schedule deletes old backups according to own schedule_uuid. For the schedule and for existing backups they should be the same. So it is not enough to set it to any random or nil value. At first, we need to understand which schedule is active for the selected universe:

SQL> select schedule_uuid from schedule where task_params->>'universeUUID'='96d42355-0d5d-4210-a47a-28d017533a10' and status='Active';

This gives you a required schedule_uuid for the universe with uuid '96d42355-0d5d-4210-a47a-28d017533a10' (it is not from your DB).

The small hint:
For your DB dump (made on March 16th) we have next active schedules:

            schedule_uuid             |             universeuuid             
--------------------------------------+--------------------------------------
 1eaee567-763d-4286-a753-af1914e625c8 | f4122a4b-f19b-466a-9861-b6a1fed6a90e
 53888b6f-f194-4d73-a367-5f61ca949f28 | 4f558902-684a-436f-bef4-886abbbc7b74
 24a43cf0-90cb-4de6-a89d-d3b20278eda7 | 33be1e22-7e3f-4919-b2e7-fb5ac2494f8b
 fb05952e-97b7-439d-ad2f-b377e3706a7a | a63c7ec9-5b27-4507-a81c-24d2c89fb701

If the schedule wasn't changed since March 16th, you can find uuid of the required universe in the second column and to get schedule_uuid from the first column of the same row.

@hkandala
Copy link
Contributor

hkandala commented Mar 26, 2021

Seems like this open PR #7162 already has the fix where the schedule_uuid is removed from the query params for getExpiredBackups().

@SergeyPotachev
Copy link
Contributor Author

SergeyPotachev commented Mar 26, 2021

@hkandala
Hm... I'm not sure that this is a completely correct decision. We need to understand why this line exists. If we remove it then the schedule will be able to remove all old backups created not only by other schedules but manually as well (not sure about the last, but I don't remember that I saw any additional checks for this in other places). If my assumption is correct (and it will delete manual backups after this line removal) then this should be discussed with PMs as it could affect our customers.

@hkandala
Copy link
Contributor

@SergeyPotachev, I agree we need to understand more on why that line exists.
But if I understand it correctly #7162 PR is to introduce an option of retention period for manual backups too.
And also, getExpiredBackups() is only being called in scheduleRunner() in Scheduler.java independent of the schedules.
Removing schedule_uuid directly wouldn't have made sense if this request (#6555) wasn't there. With manual backup retention option available, I don't see a need to have check for schedule_uuid when calling from scheduleRunner() at least.
I am not quite sure about manual deletion part though.

@hkandala
Copy link
Contributor

hkandala commented Mar 26, 2021

Just realized this needs a backport, in that case I think we might need a different fix as manual backup retention option will not be available there

@SergeyPotachev
Copy link
Contributor Author

SergeyPotachev commented Mar 26, 2021

@hkandala
Yes, if a removal of this line lets a schedule automatically delete old manual backups without any additional flags, settings etc. then it is an incomplete/incorrect fix.
And another thing here - the problem is why all backups now don't have schedule_uuid? You can't distinguish whether a backup was created by a schedule or manually.

@hkandala
Copy link
Contributor

@SergeyPotachev, I tried to reproduce that in my local. But I am seeing schedule_uuid is being present when creating scheduled backup. Not sure if there is any specific flow where it stays null.

@sb-yb
Copy link
Contributor

sb-yb commented Mar 29, 2021

So it seems that expiry was added to 2.4 in
#4493
https://phabricator.dev.yugabyte.com/D9361
I see no integration test to check if that works but was tested manually as per the checkin comments.
So as long as backup is expired (expiry < now) and completed it is safe to delete.
Depending on the priority and what customer wants we can tackle in multiple steps.
1.> We can manually run delete on database to delete these expired backups to get rid of bloat quickly.
2.> We can give them patch similar to PR #7162 - that will get rid of scheduleUUID check
3.> None of these answer why these backups are having scheduleUUID = null. Investigate and fix that separately.

Reason for proposing this multistep approach is that #1 is lot easier and if removing db bloat is immediate concern then it achieves that. #2 is also easy to backport to 2.4 so keeping it separate.
For #3 the code needs lot of love and tests need to be added to test actual functionality and not just - some json field was set.
Also #3 can be an artifact of some old buggy entity in database for a bug that is fixed now. We can fix #3 in mainline and not bother with backport.

@sb-yb
Copy link
Contributor

sb-yb commented Mar 30, 2021

These are the schedules that have such backups

             scheduleuuid              | num_completed_expired
----------------------------------------+-----------------------
 (1eaee567-763d-4286-a753-af1914e625c8) |                   138
 (24a43cf0-90cb-4de6-a89d-d3b20278eda7) |                   124
 (53888b6f-f194-4d73-a367-5f61ca949f28) |                   171
 (6124fa26-7114-481f-93cc-2b2084db4b0c) |                    46
 (fb05952e-97b7-439d-ad2f-b377e3706a7a) |                    49
(5 rows)

@sb-yb
Copy link
Contributor

sb-yb commented Mar 30, 2021

Okay figured out what's going on. BackupParams captures Multitable backup as well as single table backup.
scheduleUUID gets populated in task backup params in backupList but its never populated in top level backup params.
The backup gets the scheduleUUID from this top level param (and it is null)

So we need to write migration to lift this scheduleUUID from first entry in backup list or we can just say that it is expected (which is consistent with many other null values and dangling references in backup schema ;)) and go on to delete it with steps 1,2 listed previously.

sb-yb added a commit that referenced this issue Apr 1, 2021
Summary:
Bulk of this change is subset of PR/7162 (Thanks to Mahendra!) but added few mods to that.
Do not filter by scheduleUUID when generating list of expired backups.
This diff also adds filtering out backups if universeUUID is no longer valid (because deleted) for a given customer.

This change can be backported to solve the backup deletion issue 7780.

Test Plan: Added unit test to check that backups with invalid univ do not show up

Reviewers: spotachev, arnav

Reviewed By: arnav

Subscribers: hkandala, jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D11102
sb-yb added a commit that referenced this issue Apr 1, 2021
…chedule

Summary:
Do not filter by scheduleUUID when generating list of expired backups.
This diff also adds filtering out backups if universeUUID is no longer valid (because deleted) for a given customer.

Original commit: D11102 / 5dbd189

Test Plan: Added unit test to check that backups with invalid univ do not show up

Reviewers: arnav

Reviewed By: arnav

Subscribers: jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D11113
@CharlotteRose
Copy link
Contributor

@sb-yb - I need to ask for further clarification on step 1. The customer wants to see all outdated backups removed. Doing so manually from Platform removes from disk and DB - however I am unclear on the instruction related to "running delete" on the database.

Please advise.

@sb-yb
Copy link
Contributor

sb-yb commented Apr 2, 2021

Sorry. Bad instructions. So idea was to update yw database to fix the missing scheduleUUID manually and that would lead to deletion from the database on next scheduler run.

FWIW I already have a fix merged yesterday so we should not have to do anything manual.
Manual steps are only needed of you cannot wait till you get that fix deployed.

LMK and I can give you exact sql to update db manually if we must.

@CharlotteRose
Copy link
Contributor

@sb-yb - That is great to hear! I will let them know that it will be fixed in future release. However, in the meantime I will require the steps for manual intervention as it has been expressed by the customer numerous times that they desire a fix for the immediate time frame.
Thank you for your help!

@sb-yb
Copy link
Contributor

sb-yb commented Apr 6, 2021

@CharlotteRose Here goes: you will be doing COMMIT; instead of ROLLBACK - once you are comfortable. May be try just one:

bak_bug=# select scheduleuuid, count(*) as num_completed_expired from ( select backup_info->'backupList'->0->>'scheduleUUID' from public.backup where expiry < now()  AND schedule_uuid is null AND state = 'Completed') as scheduleuuid group by scheduleuuid;
              scheduleuuid              | num_completed_expired
----------------------------------------+-----------------------
 (1eaee567-763d-4286-a753-af1914e625c8) |                   138
 (24a43cf0-90cb-4de6-a89d-d3b20278eda7) |                   124
 (53888b6f-f194-4d73-a367-5f61ca949f28) |                   171
 (6124fa26-7114-481f-93cc-2b2084db4b0c) |                    46
 (fb05952e-97b7-439d-ad2f-b377e3706a7a) |                    49
(5 rows)

bak_bug=# select count(*) as num_completed_expired from ( select backup_info->'backupList'->0->>'scheduleUUID' from public.backup where expiry < now()  AND schedule_uuid is null AND state = 'Completed') as schedul
euuid;
 num_completed_expired
-----------------------
                   528
(1 row)

bak_bug=# begin;
BEGIN
bak_bug=*# update public.backup set schedule_uuid = (backup_info->'backupList'->0->>'scheduleUUID')::uuid  where expiry < now()  AND schedule_uuid is null AND state = 'Completed';
UPDATE 528
bak_bug=*# select count(*) as num_completed_expired from ( select backup_info->'backupList'->0->>'scheduleUUID' from public.backup where expiry < now()  AND schedule_uuid is null AND state = 'Completed') as scheduleuuid;
 num_completed_expired
-----------------------
                     0
(1 row)

bak_bug=*# rollback;
ROLLBACK

Above will update all the backup rows that have expired and have null schedule to scheduleUUID burried deep in the backupInfo json.

Use this page as reference: https://www.postgresql.org/docs/9.5/functions-json.html

@tylarb
Copy link
Contributor

tylarb commented Apr 6, 2021

@sb-yb to clarify on the above - does this need to be done on a recurring schedule ? That is, does it fix that backups are not getting deleted for some jobs permanently (for this release), or does it just remove current, today expired backups?

@sb-yb
Copy link
Contributor

sb-yb commented Apr 6, 2021

Later. It is one time manual step that removes current expired backups.
As mentioned before; this manual step is only needed if you cannot wait till you get the code fix deployed - i.e. they are running out of storage etc.
The code fix should delete old expired backups even if this step is omitted.
Also note that code fix has yet to go through QA. So this can help everyone sleep better given that there are lot of old backups that have piled up :)

@tylarb
Copy link
Contributor

tylarb commented Apr 7, 2021

@sb-yb if this is not a workaround that resolves the problem for current backup schedules, it seems very much like editing the DB is much more hazardous than simply removing old backups from disk (as is currently being done).

@sb-yb
Copy link
Contributor

sb-yb commented Apr 7, 2021

So it was never mentioned that removing directly from disk is already being done.
Only negative with that approach is database state thinks the backup is still around but it is no longer the case.
So once the fix is deployed; we will see lot of "failed to delete" errors. But I think, that should be okay - especially given that; that ship has already sailed.
So I am okay with continuing to delete directly from disk either manually or thru cron job.

@sb-yb
Copy link
Contributor

sb-yb commented Apr 22, 2021

Steps to validate:

  1. Create a whole universe backup schedule with 5 minutes frequency and 10 minutes timeBeforeDelete
  2. Check that backups are getting created every 5 minutes
  3. Check that each created backup then gets auto deleted 10 minutes after creation.

For deletion retry bug:

  1. Delete the backup from disk (not from YW) immediate after creation (before it expires)
  2. Make sure that backup goes to FailedToDelete state after 10 minutes

YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
Bulk of this change is subset of PR/7162 (Thanks to Mahendra!) but added few mods to that.
Do not filter by scheduleUUID when generating list of expired backups.
This diff also adds filtering out backups if universeUUID is no longer valid (because deleted) for a given customer.

This change can be backported to solve the backup deletion issue 7780.

Test Plan: Added unit test to check that backups with invalid univ do not show up

Reviewers: spotachev, arnav

Reviewed By: arnav

Subscribers: hkandala, jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D11102
@yugabyte yugabyte deleted a comment from Tita1979 Nov 11, 2022
@yugabyte yugabyte deleted a comment from Tita1979 Nov 11, 2022
@yugabyte yugabyte deleted a comment from Tita1979 Nov 11, 2022
@yugabyte yugabyte deleted a comment from Tita1979 Nov 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform Yugabyte Platform priority/high High Priority
Projects
None yet
Development

No branches or pull requests

6 participants