[DEPR]: Blockstore #238
Yes, this is a surprising announcement! But your thorough explanations really help to understand the situation. I just have a question on that specific item:
At the time Blockstore was designed I made the following comment:
To which Dave made a very sensible answer that S3 is cheap while databases are expensive -- totally reasonable, especially in the context of a large site like edX.org. So, I'm wondering how the new Learning Core handles storage costs? Also, it has to provide search capabilities. How do we achieve that? Is data automatically synced with Elasticsearch? If yes, how? I guess I could read the existing ADRs and the openedx-learning source code to find the answer... but that's a lot of content to parse. If someone could give me a few pointers, I would really appreciate it.
@kdmccormick: Thank you for writing this up. Especially since I probably should have done it weeks ago. 😛
@regisb: That's a great question, and something that we spend a lot of time hashing out with every feature. I'm going to give a rather long answer here because while we have lots of discussion around individual pieces of this puzzle, I don't think we have any docs that pull it all together into a summary on disk storage. Also, while looking through the GitHub Issue discussions can be very useful, we've had a whole lot of terminology churn, so it will likely also cause a lot of confusion. I'll do my best to give a comprehensive overview here, but the next place to look would probably be the data model files in openedx-learning:
Storage Costs

To start off, I want to be totally clear that scalability and performance are a huge priority for us–I think we have an incredible opportunity to lower costs for both the storage and the processing of content. We can break up our storage requirements for authored content into the following broad buckets:
Versioning

In ModuleStore, both versioning and structure are handled through the structure documents stored in MongoDB (though active structure lookup happens in the Django ORM now). Each time we make any sort of change to a course, we create a snapshot of the structure of the entire course that captures all Section/Subsection/Unit XBlock relationships, as well as all XBlock settings-scoped fields. In terms of storage space used, ModuleStore's approach is enormously wasteful, as the storage cost of a new version scales with the total size of the course, instead of with the size of the edit being made. This can be as much as 2 MB per version on some of the largest courses we've seen people make. Part of this owes to the fact that far more stuff was put into settings scope than was originally intended, and part of this came about because it was simpler to implement than less wasteful alternatives. Blockstore mostly did the same thing that ModuleStore did, but cheaper–MongoDB is already treated like a key-value store by ModuleStore (we explicitly call it that), so Blockstore substituted it with a more affordable object/file backend. In practice, the vastly higher and less predictable latency that this introduced was a source of operational pain, especially as libraries grew very large. This is the biggest single area of change with Learning Core. We want to be able to readily support libraries with tens of thousands of items in them, each separately shareable and addressable. We absolutely can't have the storage cost of a new version scale with the number of items in the library, as it does with ModuleStore and Blockstore today. That's why versioning happens on a per-Component basis in Learning Core. Instead of having a snapshot for every version and top-level pointers to the draft and published versions of a library, we have separate models that track the current Draft and Published versions of each Component.
There is also a PublishLog model that tracks which things were published and at what time. The latest PublishLog entry identifier can serve as a library-wide version identifier where that's still needed for compatibility reasons. So we keep fast lookups of the current Studio (Draft) and LMS (Published) versions, as well as being able to quickly grab the contents of any specific version. An edit to a particular Component now only generates metadata for that new Component Version, so new version storage costs are vastly cheaper. We also gain a much lower-overhead way to find the draft or published version of a specific Component, without having to read a big snapshot file. The PublishLog lets us still group and track many Components being published at the same time. What we lose is the ability to easily rewind to a particular version by pointing to an older snapshot, or to easily say, "this was the version of everything that was live at this particular time." It's still possible to reconstruct that data from the PublishLog, but doing so is significantly slower and more complex.

Structure

It's great for libraries that Components will be versioned separately, but how does that translate into courses? After all, courses need to be able to stack Components into Units, Subsections, and Sections. Are we still chaining new versions of all these containers when we make edits to a Component they contain? The current plan is "no". We expect to model a Unit so that a reference to its child captures both the Component as well as the Component Version. If the Component Version in that reference is set to null, it means "always use the latest draft/published version of this Component". A Unit only creates a new version if it alters its own metadata or changes its structure by adding, removing, rearranging, or specifically version-pinning its child Components.
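The "unpinned child" resolution rule can be sketched in a few lines. This is a hypothetical illustration (the function name, data shapes, and keys are invented for this example, not taken from openedx-learning):

```python
# Hypothetical sketch of resolving a Unit's child reference, where a pinned
# version of None means "always use the latest draft/published version".

def resolve_child(component_versions, child_ref, use_draft=True):
    """Resolve a (component_key, pinned_version) child reference to a
    concrete version number.

    `component_versions` maps component_key -> {"draft": int, "published": int}.
    """
    component_key, pinned_version = child_ref
    if pinned_version is not None:
        # Fixed-version inclusion, e.g. borrowed library content.
        return pinned_version
    current = component_versions[component_key]
    return current["draft"] if use_draft else current["published"]

versions = {"problem1": {"draft": 5, "published": 4}}
assert resolve_child(versions, ("problem1", None)) == 5                   # unpinned -> latest draft
assert resolve_child(versions, ("problem1", None), use_draft=False) == 4  # unpinned -> latest published
assert resolve_child(versions, ("problem1", 2)) == 2                      # pinned stays pinned
```

Note that editing `problem1` never creates a new Unit version here: the unpinned reference simply keeps pointing at whatever version is current.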
I hope this flexibility will allow us to efficiently model both the common Studio authoring scenario today (always use the latest draft), as well as certain library content borrowing use cases that require fixed-version inclusion. This same pattern could be extrapolated to higher-level containers like Subsections and Sections. This feature has not been implemented yet because it's not a requirement for the MVP of the new content libraries experience; it will make its way into later releases of libraries and courses. We discussed this general idea in this issue.

XBlock Field Data for Components

ModuleStore and Blockstore both store a lot of XBlock field data in Definition documents (in MongoDB and file storage, respectively). Learning Core does something similar, storing these in a text field on a Content model, with a number of differences:
I think that we're probably going to end up generating approximately the same number of Content entries for these as we had Definitions in the old system. The Content entries will be larger, but should also be much easier to prune. We'll need to watch usage patterns carefully here.

Static Assets (images, PDFs, etc.)

Learning Core still follows Blockstore's lead here and stores these in a file backend using Django's Storage API, so that they can be stored either on a filesystem or in S3. This should reduce overall costs as compared to storing them in GridFS on MongoDB, as we do for courses today. One of the major operational lessons taken from Blockstore is that passing around direct links to these assets using signed URLs led to issues around cache invalidation, while running assets through the app server itself led to performance issues. The approach we're going to try with Learning Core is covered in more detail in ADR: Serving Course Team Authored Static Assets (also see the comments in the pull request). There are also two broad categories of storing static assets:
In both scenarios, we have a normalized storage of static assets that de-duplicates raw file data across different versions. If Problem A uses an image, and Problem B uses the exact same image, the image will only be stored once–even if they use different names to refer to it. When it comes to storing metadata about a static asset's relationship to a Component, we chose a fairly simple representation that makes it fast to find the assets for a given Component Version and has indexes to prevent invalid states. For instance, the database will prevent us from creating multiple static assets assigned to the same file path for the same Component Version. The downside is that for scenario (1), the storage cost for the static asset metadata of a new version ("this version has these assets using these names") scales with the number of assets associated with that Component Version. This was deemed acceptable because the expectation is that there are relatively few assets associated with a particular Component, but it's an area we have to watch as usage patterns evolve. For scenario (2), we know there exist situations with potentially thousands of files and uploads, so the representation we use for (1) would not be acceptable. That's the entire subject of this issue thread, and this comment has my most recent thoughts on it. This is not explicitly needed for the upcoming content libraries revamp, but will be necessary before we tackle courses.

Reducing related model data and index size

One non-obvious thing that moving so much into MySQL helps us with is reducing the size of some of our massive tables and indexes that reference content. We currently have many large tables with CourseKeys and UsageKeys written and indexed as varchar fields, such as:

Search

The search functionality hasn't been built yet. The tagging support currently uses MySQL, and I think we're likely to ride that as far as we can. We still need to do more discovery work. Hopefully, though, the fact that we're publishing individual Components will make the indexing process a lot simpler than today's "something's changed, let's re-index everything" approach.
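Going back to static assets for a moment, the raw-file de-duplication described in this comment can be sketched with content hashing. This is the standard technique for content-addressed storage, but the exact scheme below is an assumption for illustration, not the actual openedx-learning implementation:

```python
import hashlib

# Hypothetical sketch of content-addressed static asset storage: raw bytes are
# stored once per unique hash, while per-version metadata maps a file path to
# that hash. (Illustrative only; not the actual openedx-learning schema.)

raw_store = {}        # content hash -> file bytes (each unique file stored once)
version_assets = {}   # (component_version_id, path) -> content hash

def add_asset(component_version_id, path, data: bytes):
    digest = hashlib.sha256(data).hexdigest()
    raw_store.setdefault(digest, data)  # de-dupe: identical bytes stored once
    key = (component_version_id, path)
    if key in version_assets:
        # Mirrors the database unique constraint: one asset per path per version.
        raise ValueError(f"{path!r} already exists for version {component_version_id}")
    version_assets[key] = digest

image = b"\x89PNG...fake image bytes"
add_asset("problemA:v1", "static/cat.png", image)
add_asset("problemB:v3", "static/feline.png", image)  # same bytes, different name
assert len(raw_store) == 1      # raw file data stored only once
assert len(version_assets) == 2  # but two metadata rows, one per version/path
```

Note how the per-version metadata rows, not the raw bytes, are what scale with the number of assets per Component Version, which is the cost trade-off discussed for scenario (1).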
Looking back at a high level, I believe I made the following missteps with Blockstore:
So I tried to optimize a pattern that was already familiar to me from Modulestore–snapshots of content, with versions only tracked at the highest level (a full course or library). We would build on that by tracking the dependencies between those top-level entities in the database. But a number of things came up as we were building on the data model:

We kept finding requirements that forced us to keep multiple concurrent versions in the LMS (such as previewing, or using a fixed version of library content). Once Learning Core had to hold many versions anyway, it lost a lot of the hoped-for data model simplicity.

Blockstore ended up serving content directly to users anyway. This was something that we explicitly wanted to avoid (and even called out in the original design goals/non-goals)... but it happened anyway, and performance was poor. That being said, we had authoring performance issues for larger libraries anyway, due to worse-than-expected S3 latency.

We figured out a more efficient representation for versions and parent-child relationships. Shifting versioning to happen at the Component level and allowing container types like Units to be defined with "unpinned" Component children is what makes this work. If we had kept the Modulestore/Blockstore pattern of snapshots, it would have exploded the storage requirements and made a shift to MySQL prohibitively costly. Figuring this out took a lot of iteration though, dating back to openedx/openedx-learning#1.
This is now Accepted. Code removal is in progress.
This moves the Content Libraries V2 backend from Blockstore [1] over to Learning Core [2]. For a high-level overview and rationale of this move, see the Blockstore DEPR [3]. There are several follow-up tasks [4], most notably adding support for static assets in libraries. BREAKING CHANGE: Existing V2 libraries, backed by Blockstore, will stop working. They will continue to be listed in Studio, but their content will be unavailable. They need to be deleted (via Django admin) or manually migrated to Learning Core. We do not expect production sites to be in this situation, as the feature has never left "experimental" status. [1] https://github.com/openedx-unsupported/blockstore [2] https://github.com/openedx/openedx-learning/ [3] openedx/public-engineering#238 [4] #34283
@Yagnesh1998 will be helping with the remaining clean-up items.
I will start work soon.
Yagnesh is on leave right now. He will resume work when he is back, but in the meantime, this is open to be worked on. In particular, it'd be nice if we could remove the Blockstore package dependency from edx-platform before the Redwood cut on May 9th.
Blockstore and all of its (experimental) functionality has been replaced with openedx-learning, aka "Learning Core". This commit uninstalls the now-unused openedx-blockstore package and removes all dangling references to it. Note: This also removes the `copy_library_from_v1_to_v2` management command, which has been broken ever since we switched from Blockstore to Learning Core. Part of this DEPR: openedx/public-engineering#238
@kdmccormick: Updating this to Removed. Please reopen if you disagree.
opaque-keys still has blockstore key types that need to be removed |
@kdmccormick and I talked about this a bit afterwards, but I'm not sure if we reached a conclusion. I support keeping the Blockstore-related opaque keys so that we don't break analytics and possibly other long-tail code that will need to parse those key types–even if those keys are no longer actively being served in the platform. @kdmccormick: Does that sound right to you? Can we close this ticket now?
* Trim down BundleVersionLocator docstring to take up less space * Emit warning when constructing a BundleVersionLocator * Update other "Blockstore" references to "Learning Core" openedx/public-engineering#238
@ormsbee that makes sense. Here's a PR just to add a warning and update docstrings: openedx/opaque-keys#330. Once that merges I'm good to close this.
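A construction-time deprecation warning of the kind that PR describes could look something like this. This is a hedged sketch with a simplified constructor, not the actual opaque-keys code; see openedx/opaque-keys#330 for the real change:

```python
import warnings

# Hypothetical sketch: warn when a Blockstore-era key type is constructed,
# while still allowing old serialized keys (e.g. in analytics data) to parse.
# Simplified stand-in for the real class in openedx/opaque-keys.

class BundleVersionLocator:
    """Deprecated Blockstore bundle version key, kept only for parsing
    previously serialized keys."""

    def __init__(self, bundle_uuid, version):
        warnings.warn(
            "BundleVersionLocator is deprecated along with Blockstore "
            "(see openedx/public-engineering#238).",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this __init__
        )
        self.bundle_uuid = bundle_uuid
        self.version = version

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    key = BundleVersionLocator("0000-1111", 3)

assert any(issubclass(w.category, DeprecationWarning) for w in caught)
assert key.version == 3  # the key still works; it just warns on construction
```

Keeping the class but warning on construction is the pattern that lets long-tail consumers keep parsing historical keys while signaling that new code should not create them.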
Removal is complete. |
Proposal Date
2024-02-01
Target Ticket Acceptance Date
2024-02-15
Earliest Open edX Named Release Without This Functionality
Redwood - 2024-04
Rationale
If you’ve been following Open edX core development for a while, you might be surprised to see a deprecation notice for Blockstore, which we’d long thought of as the future of Open edX content storage. Rest assured, we’re still actively working on the original goals of Blockstore, but they’re now part of a more fleshed-out package: the “Learning Core”.
Some Context: Blockstore was developed in ~2018 as a content storage service, enabling courses.edx.org to serve content for the LabXchange project. The broader vision was to replace the Open edX platform’s current storage backend (Modulestore) with something simpler and more flexible, supporting a paradigm shift towards modular, non-linear, and adaptive learning content. The key tenets of Blockstore were that:
You can read more in Blockstore’s DESIGN doc. Blockstore was selected as the basis for the “Content Libraries V2” initiative, which aimed to rebuild the legacy Content Libraries feature to be more robust and broadly useful. It was envisioned that all Open edX content, including traditional courses, would eventually be migrated to Blockstore, and that Modulestore would be deprecated.
However, in the intervening years, LabXchange moved off of the Open edX platform, and the Content Libraries V2 project was delayed several times due to organizational shifts and competing priorities. As of 2023, we are not aware of any production users of Blockstore (if you are aware of any, please comment).
Since 2018, and especially since the creation of Axim in 2021, we have had time to learn lessons from Blockstore’s original design and think deeply about the needs of the Open edX project going forward. We still believe in Blockstore’s original key tenets, but we've also learned more:
Removal
Replacement
We've developed the Learning Core, which replaces Blockstore and incorporates what we've learned over the past few years. The insights of Blockstore live on mostly in the openedx_learning.core.publishing sub-package. You can read the various decisions we’ve made and are making for this new system.

Deprecation
We will add a deprecation notice to Blockstore's README. We'll also push a final Blockstore release to PyPI so that the deprecation notice shows up there.
Migration
Because Blockstore isn’t currently deployed to production anywhere (again, please comment if you disagree), we have the great opportunity to jump right to Learning Core rather than foisting a two-step Modulestore->Blockstore->LC migration upon site operators. So, we are currently migrating the Content Libraries Relaunch project from Blockstore to the Learning Core, which we are aiming to make experimentally available as early as Redwood (June 2024) and properly available by Sumac (December 2024). We plan to remove Blockstore from the release starting with Redwood. Going forward, we plan to incrementally migrate parts of edx-platform to the Learning Core, with the long-term goal of either replacing Modulestore, or reducing Modulestore to a compatibility layer resting on top of a backfilled Learning Core.
Additional Info
Removal Task List