
Write build artifacts to (cloud) storage from build servers #5549

Merged — 9 commits into master on May 9, 2019

Conversation

@davidfischer (Contributor) commented Mar 29, 2019

Goals

  • Store all build artifacts (HTML, PDFs, ePubs, zips, search JSON) in "storage", where storage could be local filesystem storage -- probably just for development -- or cloud blob storage (S3, Azure Storage, etc.). Since we aren't selectively copying some files and not others, the code should actually be simpler.
  • Since #4947 (Store ePubs and PDFs in media storage), we store PDFs, ePubs, and zips in blob storage (we actually check that it isn't local). This is done from the web servers in production; instead, this PR proposes to do it from the build server where the build happened.

Considerations

  • On the community side, I'm envisioning this as an Azure storage container; it doesn't really matter whether it is public or private, as no permissions need to be checked before serving the HTML/PDF/etc.
  • On the corporate side, I'm envisioning this as a private S3 bucket where permissions are checked before reverse proxying. That may have to come down the road.

Down the road

For this PR, I think we'll keep the code that "syncs" files to the webs while also adding code to write to "storage". Once we are happy with serving things from blob storage via reverse proxy, we will no longer need move_files or sync_files at all, and the syncers/pullers could go away completely. The web servers won't need much in the way of attached disks (simpler and cheaper web servers). They could be completely ephemeral and potentially autoscaled. update_app_instances could then simply be the API call to the webs to update the database that the build finished.

@davidfischer added the "PR: work in progress" (Pull request is not ready for full review) label on Mar 29, 2019
@davidfischer davidfischer requested a review from a team March 29, 2019 21:16
@ericholscher (Member) left a comment:

Seems simple enough at first glance. I'm more worried about figuring out how to get azure to actually accept all our uploaded files, which has seemed tricky thus far :/

localmedia=bool(outcomes['localmedia']),
pdf=bool(outcomes['pdf']),
epub=bool(outcomes['epub']),
)
else:
Member:

Is this else on the try instead of the if now? I can never remember what try/else does.

Contributor (author):

That's correct. An else on a try is taken if there is no exception. I refactored that slightly because we defined build_id in a try block and then use it later. The only reason it wasn't ever a NameError is that we reraise the error in both except blocks.
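For reference, the semantics can be sketched in a few lines: the `else` branch runs only when the `try` body raised nothing, so names bound in the `try` block are guaranteed to exist there (the function and values below are illustrative, not from this PR):

```python
def parse_and_double(text):
    """Return twice the parsed int, or None when parsing fails."""
    try:
        value = int(text)
    except ValueError:
        return None
    else:
        # Runs only when the try block raised no exception,
        # so `value` is guaranteed to be bound here.
        return value * 2


print(parse_and_double("21"))    # 42
print(parse_and_double("oops"))  # None
```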

@davidfischer (Contributor, author):

> Seems simple enough at first glance. I'm more worried about figuring out how to get azure to actually accept all our uploaded files, which has seemed tricky thus far :/

I have a couple ideas here which I would like to try. Firstly, when overwriting a file, we can use storage.open and write to the file rather than storage.save. Secondly, we should probably just have retries.

@davidfischer changed the title from "Store build artifacts in storage from build servers proposal" to "Store build artifacts in storage from build servers" on Apr 6, 2019
@davidfischer removed the "PR: work in progress" label on Apr 6, 2019
@davidfischer (Contributor, author):

I made a first pass at writing everything to storage.

Testing this out

To try this out you need to have a setting:

BUILD_MEDIA_STORAGE = 'path.to.BuildMediaStorage'

In the above example, the BuildMediaStorage class should mix in readthedocs.builds.storage.BuildMediaStorageMixin. At its simplest, BuildMediaStorage could just add the mixin to FileSystemStorage to write build media to a specific directory. You could do something fancier with Azure Storage, though.

Notes

> Firstly, when overwriting a file, we can use storage.open and write to the file rather than storage.save

After digging deeper, we want to use storage.save but we want to ensure that overwriting is the default. It wasn't obvious how to specify this but the answer is to override get_available_name to not find an alternative name.
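The effect of that override is easiest to see with a toy backend. This is a sketch only — the real class lives in readthedocs.builds.storage and subclasses Django's Storage; both classes below are illustrative stand-ins:

```python
class InMemoryStorage:
    """Toy stand-in for a Django storage backend (illustrative only)."""

    def __init__(self):
        self.files = {}

    def get_available_name(self, name, max_length=None):
        # Django-like default: dodge collisions by picking an
        # alternative name instead of overwriting.
        candidate, n = name, 1
        while candidate in self.files:
            candidate = '%s.%d' % (name, n)
            n += 1
        return candidate

    def save(self, name, content):
        name = self.get_available_name(name)
        self.files[name] = content
        return name


class OverwriteMixin:
    """The fix described above: never search for an alternative name,
    so save() always writes to the requested path."""

    def get_available_name(self, name, max_length=None):
        return name


class OverwritingStorage(OverwriteMixin, InMemoryStorage):
    pass


plain = InMemoryStorage()
plain.save('pdf/latest.pdf', b'v1')
assert plain.save('pdf/latest.pdf', b'v2') == 'pdf/latest.pdf.1'  # renamed

overwriting = OverwritingStorage()
overwriting.save('pdf/latest.pdf', b'v1')
assert overwriting.save('pdf/latest.pdf', b'v2') == 'pdf/latest.pdf'  # overwritten
assert overwriting.files['pdf/latest.pdf'] == b'v2'
```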

> Secondly, we should probably just have retries.

I didn't implement this yet although it should be rather easy in the new setup. I suspect that this problem might have been us calling delete followed by save in short succession. That has been removed so this issue might just go away. Retries may still be a good idea though.

@davidfischer changed the title from "Store build artifacts in storage from build servers" to "Write build artifacts to (cloud) storage from build servers" on Apr 6, 2019
- Useful for writing build artifacts under MEDIA_ROOT in dev
@davidfischer (Contributor, author):

I added a FileSystemStorage subclass as well. So this can be tested by setting:

BUILD_MEDIA_STORAGE = 'readthedocs.builds.storage.BuildMediaFileSystemStorage'

With that setting, all build artifacts will be written under MEDIA_ROOT (e.g. media/html, media/pdf, etc.).

@ericholscher (Member) left a comment:

These changes look good (and simpler). Definitely a cleaner approach, and with much less copying between our own machines :)

@ericholscher (Member):

Tested locally with BUILD_MEDIA_STORAGE = 'readthedocs.builds.storage.BuildMediaFileSystemStorage' and it worked like a charm. 👍

@davidfischer davidfischer requested a review from a team April 8, 2019 16:08
@ericholscher (Member) left a comment:

I think I'm +1 on shipping this with it turned on in -ops, so that we're writing everything to Azure. It will continue to write to the builders, which is good.

We probably need to find a way to configure the .org to not copy the media files but continue to have the .com do it properly. I'd be 👍 with turning off copying to the webs of pdf/epub/htmlzip.

@davidfischer (Contributor, author):

> We probably need to find a way to configure the .org to not copy the media files but continue to have the .com do it properly. I'd be 👍 with turning off copying to the webs of pdf/epub/htmlzip.

I looked into this briefly and it may be a little tricky, because the code to index files (ImportedFile) relies on things being on local disk. That will probably need to be rewritten to use "storage".

- Use the RTD_ prefix
- Assume that settings.RTD_BUILD_MEDIA_STORAGE is set (defined in base)
@davidfischer (Contributor, author):

I believe this PR is mergeable although we can't completely disable syncers yet (locally or in prod).

  • In production and dev, we need a syncer to write the HTML into the right directory and update any necessary symlinks for subprojects, translations, custom domains, etc.
  • Setting RTD_BUILD_MEDIA_STORAGE in production will write all build artifacts to storage (HTML, PDFs, etc.) but we still need a syncer. This will result in PDFs being written to the webs again but this is a temporary state until we proxy HTML requests to storage.
  • On the corporate side, build servers will not write artifacts since RTD_BUILD_MEDIA_STORAGE won't be defined. To enable this on the corporate side, we would need a storage backend for AWS.

@ericholscher (Member) left a comment:

Excited to ship this. I do think we need to be a bit more defensive here, to make sure we aren't letting exceptions break the prod code paths.

localmedia=bool(outcomes['localmedia']),
pdf=bool(outcomes['pdf']),
epub=bool(outcomes['epub']),
)
Member:

We likely need to try/except this, so that if it fails, we still run the old syncers.

Contributor (author):

That's fair, but I think I'll handle this try/except block in the function around the actual storage writes which could throw errors.

localmedia=bool(outcomes['localmedia']),
pdf=bool(outcomes['pdf']),
epub=bool(outcomes['epub']),
)
Member:

Same here, this should probably be a try/except as well.

Contributor (author):

This method, which only had code removed, shouldn't really throw an exception randomly. Do you really think we need a try block for it?

msg=f'Writing {media_type} to media storage - {to_path}',
),
)
storage.copy_directory(from_path, to_path)
Member:

This likely should be try/except as well, so when it fails it doesn't stop all the other formats from syncing.

Contributor (author):

I added a try block around the writes/deletes to storage just in case.
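The defensive pattern discussed here can be sketched as follows — a hypothetical helper, not the PR's actual code: each format is copied independently, so a failure uploading one (say, the PDF) is logged but doesn't stop the others.

```python
import logging

log = logging.getLogger(__name__)


def copy_artifacts(storage, artifacts):
    """Copy each (media_type, from_path, to_path) tuple independently.

    Illustrative sketch of the try/except-per-format idea: one failed
    upload is logged and recorded, the rest still go through.
    """
    failed = []
    for media_type, from_path, to_path in artifacts:
        try:
            storage.copy_directory(from_path, to_path)
        except Exception:
            # Broad on purpose: storage backends raise more than IOError.
            log.exception('Error copying %s to media storage', media_type)
            failed.append(media_type)
    return failed


class FlakyStorage:
    """Toy backend whose PDF upload always fails (for demonstration)."""

    def __init__(self):
        self.copied = []

    def copy_directory(self, from_path, to_path):
        if 'pdf' in from_path:
            raise IOError('simulated upload failure')
        self.copied.append(to_path)


storage = FlakyStorage()
failed = copy_artifacts(storage, [
    ('pdf', '_build/pdf', 'pdf/myproject/latest'),
    ('html', '_build/html', 'html/myproject/latest'),
])
assert failed == ['pdf']
assert storage.copied == ['html/myproject/latest']
```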

@humitos (Member) left a comment:

I like these changes. Left some small comments to consider.

> Setting RTD_BUILD_MEDIA_STORAGE in production will write all build artifacts to storage (HTML, PDFs, etc.) but we still need a syncer. This will result in PDFs being written to the webs again but this is a temporary state until we proxy HTML requests to storage.

I'm not sure I follow here. Why will PDFs be written to the webs again? Because of this line?

Depending on when we want to shrink our disks, I'd say that this is OK. Although, if we want to shrink them soon, we could add a setting to disable copying PDF/ePub into the storage in production. Actually, there's no need to define a new setting: we could just check BUILD_MEDIA_STORAGE and, if it's Azure, not copy them to the webs.

else:
types_to_delete.append('epub')

for media_type, build_type in types_to_copy:
Member:

nit:

This is a little confusing to me. What are media_type and build_type? I suppose the first one is the path name where we want to save it, and the latter is the path name from the builder.

If so, where do they come from? Isn't it possible to use a constant, or get the name from the related class where this is defined?

Contributor (author):

I tend to agree, but I believe this is a larger refactor and should be handled in a separate PR. These constants are not defined somewhere convenient (like readthedocs/doc_builder/constants.py) and they are used in quite a few places outside the scope of this PR, like the doc builder backends themselves.

except Exception:
# Ideally this should just be an IOError
# but some storage backends unfortunately throw other errors
log.warning(
Member:

Now that we are duplicating the copy, a warning could be enough. Although, I think we want to log an .exception here so we can see it in Sentry and take care of it. Otherwise, if the sync fails for any reason we won't know, and we will see random issues on webpages.

Member:

Anyway, .exception could generate a flood of notifications here.

Contributor (author):

Maybe a log.warning with exc_info=True?

Member:

I'm not sure what I want :)

My concern is that if this fails we won't be alarmed/notified at all. Although, alarming us for every single file that failed to copy is crazy.

Contributor (author):

I think we just do a log.exception and Sentry can sort it out. It looks like Sentry is smart enough to find similar issues even when the text isn't exactly the same.
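The two options being weighed behave identically apart from the level: log.exception() is just log.error(..., exc_info=True), so both attach the traceback to the record. A runnable sketch (the helper function and logger name are hypothetical):

```python
import logging


def log_copy_failure(log, use_exception_level):
    """Illustrates the trade-off discussed above.

    Both branches attach the traceback; the only real difference is
    the level the record carries (ERROR vs WARNING), which is what
    Sentry keys its alerting on.
    """
    try:
        raise IOError('simulated upload failure')
    except IOError:
        if use_exception_level:
            log.exception('Copy to media storage failed')               # ERROR + traceback
        else:
            log.warning('Copy to media storage failed', exc_info=True)  # WARNING + traceback
```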

@@ -97,5 +97,8 @@ django-cors-middleware==1.3.1
# User agent parsing - used for analytics purposes
user-agents<1.2.0

# Utilities used to upload build media to cloud storage
django-storages>=1.7,<1.8
Member:

I prefer to pin our dependencies to exact versions so the environment is reproducible. You can use something like the following to allow weekly updates via our bot while forcing <1.8:

 django-storages==1.7.1  # pyup: <1.8

@davidfischer (Contributor, author):

> I'm not sure I follow here. Why will PDFs be written to the webs again? Because of this line?

RTD_BUILD_MEDIA_STORAGE controls whether artifacts are written to storage. The syncer controls whether they are synced to the webs. We can change FILE_SYNCER to be the NullSyncer if we don't want PDFs/HTML/ePubs to be synced back to the webs. Before a proxy is in place to proxy HTML to storage, we can't do that. One of the keys of this PR is to not handle PDFs differently from HTML. There's no reason to check RTD_BUILD_MEDIA_STORAGE from a syncer since the syncer can be overridden if needed.
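As a settings sketch, the two knobs stay orthogonal. The class paths below are illustrative assumptions about where these classes lived in the codebase at the time; only RTD_BUILD_MEDIA_STORAGE and FILE_SYNCER as setting names, and NullSyncer and BuildMediaFileSystemStorage as class names, come from this thread.

```python
# Settings sketch -- module paths are assumptions, not verified.
# Write build artifacts to storage from the build servers:
RTD_BUILD_MEDIA_STORAGE = 'readthedocs.builds.storage.BuildMediaFileSystemStorage'

# Independently, stop syncing artifacts back to the web servers
# (only safe once a proxy serves HTML from storage):
FILE_SYNCER = 'readthedocs.builds.syncers.NullSyncer'
```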

@davidfischer davidfischer merged commit d97217c into master May 9, 2019
@davidfischer davidfischer deleted the davidfischer/build-media-to-storage branch May 9, 2019 22:38
@davidfischer davidfischer mentioned this pull request May 14, 2019