
Write build artifacts to (cloud) storage from build servers #5549

Merged — 9 commits into master on May 9, 2019

Conversation

@davidfischer (Contributor) commented Mar 29, 2019

Goals

  • Store all build artifacts (HTML, PDFs, ePubs, zips, search JSON) in "storage", where storage could be local filesystem storage -- probably just for development -- or cloud blob storage (S3, Azure Storage, etc.). Since we aren't selectively copying some files and not others, the code should actually be simpler.
  • Since #4947 (Store ePubs and PDFs in media storage), we store PDFs, ePubs, and zips in blob storage (we actually check that it isn't local). This is done from the web servers in production; instead, this PR proposes to do it from the build server where the build happened.

Considerations

  • On the community side, I'm envisioning this as an Azure storage container; it doesn't really matter whether it is public or private, as no permissions need to be checked before serving the HTML/PDF/etc.
  • On the corporate side, I'm envisioning this as a private S3 bucket where permissions are checked before reverse proxying. That may have to come down the road.

Down the road

For this PR, I think we'll keep the code that "syncs" files to the webs while also adding code to write to "storage". Once we are happy with serving things from blob storage via reverse proxy, we will no longer need move_files or sync_files at all, and the syncers/pullers could go away completely. The web servers won't need much in the way of attached disks (simpler and cheaper web servers). They could be completely ephemeral and potentially autoscaled. update_app_instances could then simply be the API call to the webs to update the database that the build finished.

@davidfischer added the "PR: work in progress" (Pull request is not ready for full review) label on Mar 29, 2019
@davidfischer davidfischer requested a review from a team March 29, 2019 21:16
@ericholscher (Member) left a comment:

Seems simple enough at first glance. I'm more worried about figuring out how to get azure to actually accept all our uploaded files, which has seemed tricky thus far :/

localmedia=bool(outcomes['localmedia']),
pdf=bool(outcomes['pdf']),
epub=bool(outcomes['epub']),
)
else:
Member:

Is this else on the try instead of the if now? I can never remember what try/else does.

Contributor (author):

That's correct. An else on a try is taken if there is no exception. I refactored that slightly because we defined build_id in a try block and then use it later. The only reason it wasn't ever a NameError is that we reraise the error in both except blocks.
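For reference, the semantics can be sketched in a few lines: the `else` branch runs only when the `try` body raised nothing, so names bound in the `try` block are guaranteed to exist there (the function and values below are illustrative, not from this PR):

```python
def parse_and_double(text):
    """Return twice the parsed int, or None when parsing fails."""
    try:
        value = int(text)
    except ValueError:
        return None
    else:
        # Runs only when the try block raised no exception,
        # so `value` is guaranteed to be bound here.
        return value * 2


print(parse_and_double("21"))    # 42
print(parse_and_double("oops"))  # None
```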

@davidfischer (Contributor, author):

> Seems simple enough at first glance. I'm more worried about figuring out how to get azure to actually accept all our uploaded files, which has seemed tricky thus far :/

I have a couple ideas here which I would like to try. Firstly, when overwriting a file, we can use storage.open and write to the file rather than storage.save. Secondly, we should probably just have retries.

@davidfischer changed the title from "Store build artifacts in storage from build servers proposal" to "Store build artifacts in storage from build servers" on Apr 6, 2019
@davidfischer removed the "PR: work in progress" label on Apr 6, 2019
@davidfischer (Contributor, author):

I made a first pass at writing everything to storage.

Testing this out

To try this out you need to have a setting:

BUILD_MEDIA_STORAGE = 'path.to.BuildMediaStorage'

In the above example, the BuildMediaStorage class should mix in readthedocs.builds.storage.BuildMediaStorageMixin. At its simplest, BuildMediaStorage could just add the mixin to FileSystemStorage to write build media to a specific directory. You could do something fancier with Azure Storage, though.

Notes

> Firstly, when overwriting a file, we can use storage.open and write to the file rather than storage.save

After digging deeper, we want to use storage.save but we want to ensure that overwriting is the default. It wasn't obvious how to specify this but the answer is to override get_available_name to not find an alternative name.
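The effect of that override is easiest to see with a toy backend. This is a sketch only — the real class lives in readthedocs.builds.storage and subclasses Django's Storage; both classes below are illustrative stand-ins:

```python
class InMemoryStorage:
    """Toy stand-in for a Django storage backend (illustrative only)."""

    def __init__(self):
        self.files = {}

    def get_available_name(self, name, max_length=None):
        # Django-like default: dodge collisions by picking an
        # alternative name instead of overwriting.
        candidate, n = name, 1
        while candidate in self.files:
            candidate = '%s.%d' % (name, n)
            n += 1
        return candidate

    def save(self, name, content):
        name = self.get_available_name(name)
        self.files[name] = content
        return name


class OverwriteMixin:
    """The fix described above: never search for an alternative name,
    so save() always writes to the requested path."""

    def get_available_name(self, name, max_length=None):
        return name


class OverwritingStorage(OverwriteMixin, InMemoryStorage):
    pass


plain = InMemoryStorage()
plain.save('pdf/latest.pdf', b'v1')
assert plain.save('pdf/latest.pdf', b'v2') == 'pdf/latest.pdf.1'  # renamed

overwriting = OverwritingStorage()
overwriting.save('pdf/latest.pdf', b'v1')
assert overwriting.save('pdf/latest.pdf', b'v2') == 'pdf/latest.pdf'  # overwritten
assert overwriting.files['pdf/latest.pdf'] == b'v2'
```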

> Secondly, we should probably just have retries.

I didn't implement this yet although it should be rather easy in the new setup. I suspect that this problem might have been us calling delete followed by save in short succession. That has been removed so this issue might just go away. Retries may still be a good idea though.

@davidfischer changed the title from "Store build artifacts in storage from build servers" to "Write build artifacts to (cloud) storage from build servers" on Apr 6, 2019
- Useful for writing build artifacts under MEDIA_ROOT in dev
@davidfischer (Contributor, author):

I added a FileSystemStorage subclass as well. So this can be tested by setting:

BUILD_MEDIA_STORAGE = 'readthedocs.builds.storage.BuildMediaFileSystemStorage'

With that setting, all build artifacts will be written under MEDIA_ROOT (e.g. media/html, media/pdf, etc.).

@ericholscher (Member) left a comment:

These changes look good (and simpler). Definitely a cleaner approach, and with much less copying between our own machines :)

@ericholscher (Member):

Tested locally with BUILD_MEDIA_STORAGE = 'readthedocs.builds.storage.BuildMediaFileSystemStorage' and it worked like a charm. 👍

@davidfischer davidfischer requested a review from a team April 8, 2019 16:08
@ericholscher (Member) left a comment:

I think I'm +1 on shipping this with it turned on in -ops, so that we're writing everything to Azure. It will continue to write to the builders, which is good.

We probably need to find a way to configure the .org to not copy the media files but continue to have the .com do it properly. I'd be 👍 with turning off copying to the webs of pdf/epub/htmlzip.

@davidfischer (Contributor, author):

> We probably need to find a way to configure the .org to not copy the media files but continue to have the .com do it properly. I'd be 👍 with turning off copying to the webs of pdf/epub/htmlzip.

I looked into this briefly and it may be a little tricky, because the code to index files (ImportedFile) relies on things being on local disk. That will probably need to be rewritten to use "storage".

- Use the RTD_ prefix
- Assume that settings.RTD_BUILD_MEDIA_STORAGE is set (defined in base)
@davidfischer (Contributor, author):

I believe this PR is mergeable although we can't completely disable syncers yet (locally or in prod).

  • In production and dev, we need a syncer to write the HTML into the right directory and update any necessary symlinks for subprojects, translations, custom domains, etc.
  • Setting RTD_BUILD_MEDIA_STORAGE in production will write all build artifacts to storage (HTML, PDFs, etc.) but we still need a syncer. This will result in PDFs being written to the webs again but this is a temporary state until we proxy HTML requests to storage.
  • On the corporate side, build servers will not write artifacts since RTD_BUILD_MEDIA_STORAGE won't be defined. To enable this on the corporate side, we would need a storage backend for AWS.

@ericholscher (Member) left a comment:

Excited to ship this. I do think we need to be a bit more defensive here, to make sure we aren't letting exceptions break the prod code paths.

localmedia=bool(outcomes['localmedia']),
pdf=bool(outcomes['pdf']),
epub=bool(outcomes['epub']),
)
Member:

We likely need to try/except this, so that if it fails, we still run the old syncers.

Contributor (author):

That's fair, but I think I'll handle this try/except block in the function around the actual storage writes which could throw errors.

localmedia=bool(outcomes['localmedia']),
pdf=bool(outcomes['pdf']),
epub=bool(outcomes['epub']),
)
Member:

Same here, this should probably be a try/except as well.

Contributor (author):

This method, which only had code removed, shouldn't really throw an exception randomly. Do you really think we need a try block for it?

msg=f'Writing {media_type} to media storage - {to_path}',
),
)
storage.copy_directory(from_path, to_path)
Member:

This likely should be try/except as well, so when it fails it doesn't stop all the other formats from syncing.

Contributor (author):

I added a try block around the writes/deletes to storage just in case.
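The defensive pattern discussed here can be sketched as follows — a hypothetical helper, not the PR's actual code: each format is copied independently, so a failure uploading one (say, the PDF) is logged but doesn't stop the others.

```python
import logging

log = logging.getLogger(__name__)


def copy_artifacts(storage, artifacts):
    """Copy each (media_type, from_path, to_path) tuple independently.

    Illustrative sketch of the try/except-per-format idea: one failed
    upload is logged and recorded, the rest still go through.
    """
    failed = []
    for media_type, from_path, to_path in artifacts:
        try:
            storage.copy_directory(from_path, to_path)
        except Exception:
            # Broad on purpose: storage backends raise more than IOError.
            log.exception('Error copying %s to media storage', media_type)
            failed.append(media_type)
    return failed


class FlakyStorage:
    """Toy backend whose PDF upload always fails (for demonstration)."""

    def __init__(self):
        self.copied = []

    def copy_directory(self, from_path, to_path):
        if 'pdf' in from_path:
            raise IOError('simulated upload failure')
        self.copied.append(to_path)


storage = FlakyStorage()
failed = copy_artifacts(storage, [
    ('pdf', '_build/pdf', 'pdf/myproject/latest'),
    ('html', '_build/html', 'html/myproject/latest'),
])
assert failed == ['pdf']
assert storage.copied == ['html/myproject/latest']
```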

@humitos (Member) left a comment:

I like these changes. Left some small comments to consider.

> Setting RTD_BUILD_MEDIA_STORAGE in production will write all build artifacts to storage (HTML, PDFs, etc.) but we still need a syncer. This will result in PDFs being written to the webs again but this is a temporary state until we proxy HTML requests to storage.

I'm not sure I follow here. Why will PDFs be written to the webs again? Because of this line?

Depending on when we want to shrink our disks, I'd say that this is OK. Although, if we want to shrink them soon, we could add a setting to disable copying PDF/ePub into the storage in production. Actually, there's no need to define a new setting: we could just check BUILD_MEDIA_STORAGE and, if it's Azure, not copy them to the webs.

else:
types_to_delete.append('epub')

for media_type, build_type in types_to_copy:
Member:

nit:

This is a little confusing to me. What are media_type and build_type? I suppose the first one is the path name where we want to save it, and the latter is the path name from the builder.

If so, where do they come from? Isn't it possible to use a constant, or get the name from the related class where this is defined?

Contributor (author):

I tend to agree, but I believe this is a larger refactor and should be handled in a separate PR. These constants are not defined somewhere convenient (like readthedocs/doc_builder/constants.py) and they are used in quite a few places outside the scope of this PR, like the doc builder backends themselves.

except Exception:
# Ideally this should just be an IOError
# but some storage backends unfortunately throw other errors
log.warning(
Member:

Now that we are duplicating the copy, a warning could be enough. Although, I think we want to log an .exception here so we can see it in Sentry and take care of it. Otherwise, if the sync fails for any reason we won't know, and we will see random issues on webpages.

Member:

Anyway, .exception could generate a flood of notifications here.

Contributor (author):

Maybe a log.warning with exc_info=True?

Member:

I'm not sure what I want :)

My concern is that if this fails we won't be alarmed/notified at all. Although, alarming us for every single file that failed to copy is crazy.

Contributor (author):

I think we just do a log.exception and Sentry can sort it out. It looks like Sentry is smart enough to find similar issues even when the text isn't exactly the same.
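The two options being weighed behave identically apart from the level: log.exception() is just log.error(..., exc_info=True), so both attach the traceback to the record. A runnable sketch (the helper function and logger name are hypothetical):

```python
import logging


def log_copy_failure(log, use_exception_level):
    """Illustrates the trade-off discussed above.

    Both branches attach the traceback; the only real difference is
    the level the record carries (ERROR vs WARNING), which is what
    Sentry keys its alerting on.
    """
    try:
        raise IOError('simulated upload failure')
    except IOError:
        if use_exception_level:
            log.exception('Copy to media storage failed')               # ERROR + traceback
        else:
            log.warning('Copy to media storage failed', exc_info=True)  # WARNING + traceback
```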

@@ -97,5 +97,8 @@ django-cors-middleware==1.3.1
# User agent parsing - used for analytics purposes
user-agents<1.2.0

# Utilities used to upload build media to cloud storage
django-storages>=1.7,<1.8
Member:

I prefer to pin our dependencies to exact versions so the environment is reproducible. You can use something like the following to allow weekly updates via our bot while forcing <1.8:

 django-storages==1.7.1  # pyup: <1.8

@davidfischer (Contributor, author):

> I'm not sure I follow here. Why will PDFs be written to the webs again? Because of this line?

RTD_BUILD_MEDIA_STORAGE controls whether artifacts are written to storage. The syncer controls whether they are synced to the webs. We can change FILE_SYNCER to be the NullSyncer if we don't want PDFs/HTML/ePubs to be synced back to the webs. Before a proxy is in place to proxy HTML to storage, we can't do that. One of the keys of this PR is to not handle PDFs differently from HTML. There's no reason to check RTD_BUILD_MEDIA_STORAGE from a syncer since the syncer can be overridden if needed.
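As a settings sketch, the two knobs stay orthogonal. The class paths below are illustrative assumptions about where these classes lived in the codebase at the time; only RTD_BUILD_MEDIA_STORAGE and FILE_SYNCER as setting names, and NullSyncer and BuildMediaFileSystemStorage as class names, come from this thread.

```python
# Settings sketch -- module paths are assumptions, not verified.
# Write build artifacts to storage from the build servers:
RTD_BUILD_MEDIA_STORAGE = 'readthedocs.builds.storage.BuildMediaFileSystemStorage'

# Independently, stop syncing artifacts back to the web servers
# (only safe once a proxy serves HTML from storage):
FILE_SYNCER = 'readthedocs.builds.syncers.NullSyncer'
```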

@davidfischer davidfischer merged commit d97217c into master May 9, 2019
@davidfischer davidfischer deleted the davidfischer/build-media-to-storage branch May 9, 2019 22:38
@davidfischer davidfischer mentioned this pull request May 14, 2019