
Retry deleting tiles on S3. #225

Merged: zerebubuth merged 1 commit into master from zerebubuth/retry-s3-delete on Jul 26, 2017

Conversation

zerebubuth (Member)

The `delete_keys` method used to delete tiles from S3 returns an object containing both successes and failures, and does not raise an exception for failures. This means that rare or intermittent errors can result in some tiles not being deleted. Adding a retry mechanism should make sure that is not the case. Although the loop is infinite, the assumption is that errors from S3 are rare and that dealing with them this way is easier than adapting the API of the object to handle returning failure coordinates back to the main process.
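
As a minimal sketch of the retry mechanism described above (assuming a boto 2-style `Bucket.delete_keys` that returns a result object with `deleted` and `errors` lists; the helper name and the `retry_delay` parameter are illustrative, not taken from this PR's diff):

```python
import time


def delete_keys_until_empty(bucket, key_names, retry_delay=60):
    """Retry S3 multi-deletes until every key has actually been removed."""
    remaining = list(key_names)
    while remaining:
        result = bucket.delete_keys(remaining)
        # delete_keys does not raise on partial failure, so collect the
        # keys S3 reported as failed and try them again after a delay.
        remaining = [err.key for err in result.errors]
        if remaining:
            time.sleep(retry_delay)
```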
@iandees (Member) commented Jul 25, 2017

This seems good. I think I remember copying/pasting a stack trace from an S3 delete that failed when I was working on the pruner but I can't seem to find it.

@nvkelso (Member) commented Jul 25, 2017

Although the loop is infinite, the assumption is that errors from S3 are rare and that dealing with them this way is easier than adapting the API of the object to handle returning failure coordinates back to the main process.

We have seen S3 blow its brains out before (down for > 8 hours); how would an infinite loop interact with that?

An alternative is what @iandees set up for S3 retries in marblecutter:

https://github.com/mojodna/marblecutter/blob/9786b0bac96eb5596be4f24c0a66eb27364a0878/examples/render_pyramid.py#L51-L78
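
(The linked code isn't reproduced here; purely as an illustration of a capped retry with back-off, and with all names hypothetical, such a helper might look like this.)

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0):
    """Call operation() until it succeeds or attempts run out.

    Illustrative sketch only; not the marblecutter code linked above.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```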

@zerebubuth (Member, Author)

We have seen S3 blow its brains out before (down for > 8 hours); how would an infinite loop interact with that?

It retries the previously failed tiles every minute (by default - it's configurable), so it wouldn't be spamming S3 with retries. The behaviour is to wait until all deletes are successful and, in the case where S3 is down, this would mean waiting until it comes back up.

Some options for the delete behaviour:

  1. Make it impossible for delete_keys to fail by retrying potentially an infinite number of times with some back-off / retry delay.
  2. Pretend that failures don't exist, possibly after some finite number of retries. This means we may get some "stuck" tiles in S3 occasionally, which wouldn't be updated when the data changed.
  3. Allow delete_keys to fail by exception if there are any failures (or any failures remaining after a finite number of retries). The exception (possibly an assertion) would halt the process.
  4. Change the signature of delete_keys to return the failures (or those remaining after a finite number of retries) in addition to the number of successes; see the sketch after this list. The higher-level code could then choose to keep those coords in the TOI, and deal with them on a later run.
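
A minimal sketch of option 4, with hypothetical names and a boto 2-style result object (with `deleted` and `errors` lists) assumed:

```python
import time


def delete_keys(bucket, coord_to_key, max_retries=3, retry_delay=60):
    """Delete tiles, retrying a finite number of times, and return both
    the number of successful deletes and the coords that still failed."""
    remaining = dict(coord_to_key)  # coord -> S3 key name
    num_deleted = 0
    for attempt in range(max_retries):
        result = bucket.delete_keys(list(remaining.values()))
        num_deleted += len(result.deleted)
        failed_keys = set(err.key for err in result.errors)
        remaining = {c: k for c, k in remaining.items() if k in failed_keys}
        if not remaining:
            break
        if attempt < max_retries - 1:
            time.sleep(retry_delay)
    return num_deleted, list(remaining.keys())
```

The higher-level code would then merge the returned coords back into the TOI and retry them on a later run, rather than treating the failure as fatal.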

On the assumption that S3 being totally down is incredibly rare, although it does happen, we should aim to do something which is "safe", will recover when S3 comes back up, and is additionally safe from the occasional intermittent 500 error.

  1. Intermittent failures would be retried after some delay. S3 being down would pause the process until S3 came back up.
  2. Intermittent failures would cause "stuck stale" tiles, which never get updated. S3 being down would cause some unknown set of errors, depending on whether any tilequeue instance were still able to write tiles, or the gardener process were still able to write the TOI set to S3. One possible band-aid for this is to run a "sweeper" process occasionally to ensure that all tiles in S3 are also in the TOI. This would need to be done while the gardener isn't running, so that the two processes don't race.
  3. Intermittent failures would cause the whole process to halt, possibly having deleted some tiles from S3 which would cause increased load on tileserver until they were re-rendered on data update. S3 being down would effectively pause the process until S3 came back up.
  4. Intermittent failures would cause temporary TOI bloat until the tile could be deleted. S3 being down would potentially cause "stuck stale" tiles if the gardener emits jobs to the tilequeue queue, but those tiles aren't in the final TOI set written to S3 when it comes back up (e.g. they were added and then removed while S3 was down).

I was assuming that:

  • Intermittent S3 failures (500 errors) will be fixed by a small, finite number of retries. Therefore, looping will repair these within an acceptable runtime for a "background" process such as the gardener.
  • S3 being down means that most of the system is unable to do useful work, such as rendering tiles. Therefore, waiting in a loop is as (in)effective as any other course of (in)action.

There are likely additional options that I haven't thought of. Which, if any, matches the behaviour that we want?

@rmarianski
Copy link
Member

I think the current behavior is reasonable given the types of failures we can expect and the recovery options available.

Some options for the delete behaviour:

I like option 3 or 4 the most, where we cap the number of retries somewhere and then are just loud about the failure. Realistically though, since the toi is also on s3, I'd be surprised if we couldn't delete some objects but then were able to subsequently update the toi set.

Retrying indefinitely sounds fine to me too, like you have it now, with my pedantic concern being that a programming error could cause the process to hang indefinitely. But it looks like you have the loop guarded specifically for s3 internal errors, so I don't think there's much risk here.
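
(Illustrative only, not this PR's actual diff: one way to guard such a loop so that only errors S3 reports as internal are retried, while anything else surfaces immediately; the `code` attribute on each error is a boto 2-style assumption.)

```python
import time


def delete_keys_retrying_internal_errors(bucket, key_names, retry_delay=60):
    remaining = list(key_names)
    while remaining:
        result = bucket.delete_keys(remaining)
        # Only retry failures that look like S3-side internal errors;
        # anything else is treated as a programming or config error.
        unexpected = [e for e in result.errors if e.code != 'InternalError']
        if unexpected:
            raise RuntimeError('unexpected S3 delete failures: %r'
                               % [(e.key, e.code) for e in unexpected])
        remaining = [e.key for e in result.errors]
        if remaining:
            time.sleep(retry_delay)
```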

On the assumption that S3 being totally down is incredibly rare, although it does happen, we should aim to do something which is "safe", will recover when S3 comes back up, and is additionally safe from the occasional intermittent 500 error.

In general, in terms of inconsistent situations I think we're better off having the tile not exist in s3 but still have it in the toi rather than the other way around.

That being said, don't we have a hole in our system where a tile can get enqueued for processing in one prune run, and then removed from the toi in the next run while it still hasn't gotten processed yet and remains in the queue? It would then get rendered eventually, but no longer exist in the toi, and we would end up with a stale tile.

@nvkelso (Member) commented Jul 26, 2017

Thanks for the explanation. Since you've guarded against the basic cases I'm fine with the code change.

That being said, don't we have a hole in our system where a tile can get enqueued for processing in one prune run, and then removed from the toi in the next run while it still hasn't gotten processed yet and remains in the queue? It would then get rendered eventually, but no longer exist in the toi, and we would end up with a stale tile.

@rmarianski can you elaborate more on this in a new ticket, please?

zerebubuth merged commit 17950be into master on Jul 26, 2017
zerebubuth deleted the zerebubuth/retry-s3-delete branch on July 26, 2017 at 16:06
@rmarianski (Member)

Follow-up issue for the race condition: #226
