
[DEPR]: Blockstore #238

Closed · 11 tasks done · Tracked by #34283
kdmccormick opened this issue Feb 1, 2024 · 12 comments
Labels: depr (Proposal for deprecation & removal per OEP-21)

kdmccormick (Member) commented Feb 1, 2024

Proposal Date

2024-02-01

Target Ticket Acceptance Date

2024-02-15

Earliest Open edX Named Release Without This Functionality

Redwood - 2024-04

Rationale

If you’ve been following Open edX core development for a while, you might be surprised to see a deprecation notice for Blockstore, which we’d long thought of as the future of Open edX content storage. Rest assured, we’re still actively working on the original goals of Blockstore, but they’re now part of a more fleshed-out package: the “Learning Core”.

Some Context: Blockstore was developed in ~2018 as a content storage service, enabling courses.edx.org to serve content for the LabXchange project. The broader vision was to replace the Open edX platform’s current storage backend (Modulestore) with something simpler and more flexible, supporting a paradigm shift towards modular, non-linear, and adaptive learning content. The key tenets of Blockstore were that:

  • It was isolated from the complexities of edx-platform. This made the system easier to comprehend and maintain.
  • Its data model was authoring-first: its primitives were “bundles” and “assets”, which were versioned and published together. In contrast, Modulestore focuses on content blocks and structures.
  • It was agnostic to the structure of the content it managed: bundles might be libraries, units, or individual components; they might contain XBlocks, or something else entirely. In contrast, Modulestore assumes that everything is an XBlock and that everything is part of a Course.

You can read more in Blockstore’s DESIGN doc. Blockstore was selected as the basis for the “Content Libraries V2” initiative, which aimed to rebuild the legacy Content Libraries feature to be more robust and broadly useful. It was envisioned that all Open edX content, including traditional courses, would eventually be migrated to Blockstore, and that Modulestore would be deprecated.

However, in the intervening years, LabXchange moved off of the Open edX platform, and the Content Libraries V2 project was delayed several times due to organizational shifts and competing priorities. As of 2023, we are not aware of any production users of Blockstore (if you are aware of any, please comment).

Since 2018, and especially since the creation of Axim in 2021, we have had time to learn lessons from Blockstore’s original design and think deeply about the needs of the Open edX project going forward. We still believe in Blockstore’s original key tenets, but we've also learned more:

  • Whereas Blockstore stores almost all of its data on object storage (e.g. S3), we should store data in a relational database like MySQL and connect it to the LMS/CMS via an in-process library. One of the assumptions in the Blockstore design was that the Open edX CMS (running in the same deployment) would be able to query this data very quickly, but in practice we found that reading data from S3 had significantly higher latency than anticipated, causing XBlocks to load very slowly and making additional complex layers of caching absolutely necessary. The vast majority of production issues with Blockstore were either object storage errors (expired signed URLs) or cache invalidation errors.
  • Whereas Blockstore (being authoring-first) assumed that some other system would consume and cache its content for efficient loading/filtering/searching/sorting in LMS, we’re better off just building the basis of that “some other system” into the same package as the storage backend. edx-platform developers need a good, consistent pattern for representing learning content; there is guidance for doing so today, but it’s better for everyone if that pattern is provided as a core capability.
  • Whereas Blockstore managed versions at the abstract “bundle” level, we’re better off managing versions for individual learning components. That way, edx-platform and other clients do not need to “choose” what a bundle means in any given context: everything is versioned, and clients just need to choose how to present that helpfully to authors. This still lets us be agnostic about the shape and structure of the content, since we’re not making assumptions about how the components fit together. It also lets us support extremely large libraries with many thousands of components, which, under Blockstore’s bundle-versioning system, introduced performance issues during writes.

Removal

  • Blockstore references need to be removed from edx-platform.
  • Docs referencing Blockstore concepts should be updated or archived.
  • Blockstore itself needs to be archived.

Replacement

We've developed the Learning Core, which replaces Blockstore and incorporates what we've learned over the past few years. The insights of Blockstore live on mostly in the openedx_learning.core.publishing sub-package. You can read the various decisions we’ve made and are making for this new system.

Deprecation

We will add a deprecation notice to Blockstore's README. We'll also push a final Blockstore release to PyPI so that the deprecation notice shows up there.

Migration

Because Blockstore isn’t currently deployed to production anywhere (again, please comment if you disagree), we have the great opportunity to jump right to Learning Core rather than foisting a two-step Modulestore->Blockstore->LC migration upon site operators. So, we are currently migrating the Content Libraries Relaunch project from Blockstore to the Learning Core, which we are aiming to make experimentally available as early as Redwood (June 2024) and properly available by Sumac (December 2024). We plan to remove Blockstore from the release starting with Redwood. Going forward, we plan to incrementally migrate parts of edx-platform to the Learning Core, with the long-term goal of either replacing Modulestore, or reducing Modulestore to a compatibility layer resting on top of a backfilled Learning Core.

Additional Info

Removal Task List

(An 11-item removal checklist, assigned to kdmccormick and ormsbee, is tracked as this issue's task list.)

regisb commented Feb 2, 2024

Yes, this is a surprising announcement! But your thorough explanations really help to understand the situation.

I just have a question on that specific item:

Whereas Blockstore stores almost all of its data on object storage (e.g. S3), we should store data in a relational database like MySQL and connect it to the LMS/CMS via an in-process library.

At the time Blockstore was designed I made the following comment:

I'm a bit surprised by the choice of files for storing content. My 2 cents:

  1. The data structure should be chosen by taking into consideration what kind of read/write access will be required. For instance, file systems are not good at answering the question "what is the most recent file in this folder" (they don't have an index on dates). And, it seems to me that we are frequently going to have to make such queries, for instance to get the latest version of a content element.
  2. Filesystems are bad at searching: will we have to rely on a grep-like tool (i.e: slow) when searching for content?
  3. Separating data in two different storage systems (filesystem and SQL db) requires some synchronization, which is a hard problem.
  4. ...
  5. Files don't have schema: one of the current major issues with xblocks is that they are extremely difficult to migrate, whenever their definition changes. Backward compatibility becomes very hard to maintain. Files have the same problem.

To which Dave made a very sensible answer that S3 is cheap while databases are expensive -- totally reasonable answer, especially in the context of a large site like edX.org.

So, I'm wondering how the new Learning Core handles storage costs? Also, it has to provide search capabilities. How do we achieve that? Is data automatically synced with Elasticsearch? If yes, how?

I guess I could read the existing ADRs and the openedx-learning source code to find the answer... but that's a lot of content to parse. If someone could give me a few pointers, I would really appreciate it.

ormsbee commented Feb 2, 2024

@kdmccormick: Thank you for writing this up. Especially since I probably should have done it weeks ago. 😛

So, I'm wondering how the new Learning Core handles storage costs?

@regisb: That's a great question, and something that we spend a lot of time hashing out with every feature. I'm going to give a rather long answer here because while we have lots of discussion around individual pieces of this puzzle, I don't think we have any docs that pull it all together into a summary on disk storage. Also, while looking through the GitHub Issue discussions can be very useful, we've had a whole lot of terminology churn, so it will likely also cause a lot of confusion.

I'll do my best to give a comprehensive overview here, but the next place to look would probably be the data model files in openedx-learning.

Storage Costs

To start off, I want to be totally clear that scalability and performance are a huge priority for us. I think we have an incredible opportunity to lower costs for both the storage and processing of content.

We can break up our storage requirements for authored content into the following broad buckets:

  1. historical versioning data
  2. structural data connecting courses to sections to subsections to units to components
  3. the XBlock field data for components–e.g. Problems, Videos, HTML
  4. static asset uploads like images and PDFs
  5. other app models that reference content

Versioning

In ModuleStore, both versioning and structure are handled through the structure documents stored in MongoDB (though active structure lookup happens in the Django ORM now). Each time we make any sort of change to a course, we create a snapshot of the structure of the entire course that captures all Section/Subsection/Unit XBlock relationships, as well as all XBlock settings-scoped fields.

In terms of storage space used, ModuleStore's approach is enormously wasteful, as the storage cost of new versions scales with the total size of the course, instead of the size of the edit being made. This can be as much as 2 MB per version on some of the largest courses we've seen people make. Part of this owes to the fact that far more stuff was put into settings than was originally intended, and part of this came about because it was simpler to implement than less wasteful alternatives.

Blockstore mostly did the same thing that ModuleStore did, but more cheaply: MongoDB is already treated like a key-value store by ModuleStore (we explicitly call it that), so the thinking was to substitute a more affordable object/file backend for MongoDB. In practice, the vastly higher and less predictable latency that this introduced was a source of operational pain, especially as libraries grew very large.

This is the biggest single area of change with Learning Core. We want to be able to readily support libraries with tens of thousands of items in them, each separately shareable and addressable. We absolutely can't have the storage cost of a new version scale with the number of items in the library, as it does with ModuleStore and Blockstore today.

That's why versioning happens on a per-Component basis in Learning Core.

Instead of having a snapshot for every version and top level pointers to the draft and published versions of a library, we have separate models that track the current Draft and Published versions of a Component. There is also a PublishLog model that tracks which things were published and at what time. The latest PublishLog entry identifier can serve as a library-wide version identifier where that's still needed for compatibility reasons.

So we keep fast lookups of the current Studio (Draft) and LMS (Published) versions, and we can still quickly grab the contents of any specific version. An edit to a particular Component now only generates metadata for that new Component Version, so new version storage costs are vastly cheaper. We also gain a much lower-overhead way to find the draft or published version of a specific Component, without having to read a big snapshot file. The PublishLog lets us still group and track many Components being published at the same time.

What we lose is the ability to easily rewind to a particular version by pointing to an older snapshot, or easily say, "this was the version of everything that was live at this particular time." It's still possible to reconstruct that data from the PublishLog, but doing so is significantly slower and more complex.
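
To make the shape of this concrete, here's a rough sketch of per-Component versioning as Django models. The names and fields are hypothetical and heavily simplified; the real definitions live in openedx-learning's data model files.

```python
# Hypothetical, simplified sketch of per-Component versioning (illustrative
# names only; not the actual openedx-learning models).
from django.db import models


class Component(models.Model):
    """One addressable piece of content, e.g. a single problem or video."""
    key = models.CharField(max_length=255, unique=True)


class ComponentVersion(models.Model):
    """An immutable snapshot of one Component's data at one point in time."""
    component = models.ForeignKey(Component, related_name="versions",
                                  on_delete=models.CASCADE)
    version_num = models.PositiveIntegerField()

    class Meta:
        unique_together = [("component", "version_num")]


class Draft(models.Model):
    """Fast pointer to the version currently being edited in Studio."""
    component = models.OneToOneField(Component, on_delete=models.CASCADE)
    version = models.ForeignKey(ComponentVersion, null=True,
                                on_delete=models.SET_NULL)


class PublishLog(models.Model):
    """One publish action; its id can double as a library-wide version."""
    published_at = models.DateTimeField(auto_now_add=True)


class Published(models.Model):
    """Fast pointer to the version currently served by the LMS."""
    component = models.OneToOneField(Component, on_delete=models.CASCADE)
    version = models.ForeignKey(ComponentVersion, null=True,
                                on_delete=models.SET_NULL)
    publish_log = models.ForeignKey(PublishLog, on_delete=models.PROTECT)
```

Under a scheme like this, editing a Component only adds one ComponentVersion row and moves the Draft pointer, and publishing adds a PublishLog entry and moves the Published pointers for whatever was included in that publish.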

Structure

It's great for libraries that Components will be versioned separately, but how does that translate to courses? After all, courses need to be able to stack Components into Units, Subsections, and Sections. Are we still creating a chain of new versions of all these containers when we edit a Component that they contain?

The current plan is "no". We expect to model a Unit so that a reference to its child captures both the Component as well as the Component Version. If the Component Version in that reference is set to be null, it means "always use the latest draft/published version of this Component". A Unit only creates a new version if it alters its own metadata or changes its structure by adding, removing, rearranging, or specifically version-pinning its child Components.

I hope this flexibility will allow us to efficiently model both the common Studio authoring scenario today (always use the latest draft), as well as certain library content borrowing use cases that require fixed-version inclusion. This same pattern could be extrapolated to higher level containers like Subsections and Sections.

This feature has not been implemented yet because it's not a requirement for the MVP of the new content libraries experience. It will make its way into later releases of libraries and courses in the future. We discussed this general idea in this issue.
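
To make the idea concrete, here's a rough sketch of what "unpinned" children might look like. Everything here is hypothetical (nothing is implemented yet), and it assumes the Component/ComponentVersion models sketched above.

```python
# Hypothetical sketch of the planned "unpinned children" idea (not implemented
# yet; names are illustrative). Assumes Component/ComponentVersion as above.
from django.db import models


class Unit(models.Model):
    key = models.CharField(max_length=255, unique=True)


class UnitVersion(models.Model):
    """Created only when the Unit's own metadata or child list changes."""
    unit = models.ForeignKey(Unit, related_name="versions",
                             on_delete=models.CASCADE)
    version_num = models.PositiveIntegerField()

    class Meta:
        unique_together = [("unit", "version_num")]


class UnitVersionChild(models.Model):
    """One ordered child slot within one version of a Unit."""
    unit_version = models.ForeignKey(UnitVersion, related_name="children",
                                     on_delete=models.CASCADE)
    order = models.PositiveIntegerField()
    component = models.ForeignKey("Component", on_delete=models.PROTECT)
    # NULL means "always use the latest draft/published version of this
    # Component"; non-NULL pins this slot to one specific Component Version.
    pinned_version = models.ForeignKey("ComponentVersion", null=True,
                                       blank=True, on_delete=models.PROTECT)

    class Meta:
        ordering = ["order"]
        unique_together = [("unit_version", "order")]
```

The point of the sketch: a routine edit to a child problem never creates a new UnitVersion; only reordering, adding or removing children, pinning a version, or changing the Unit's own metadata does.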

XBlock Field Data for Components

ModuleStore and Blockstore both store a lot of XBlock field data in Definition documents (in MongoDB and file storage, respectively). Learning Core does something similar in storing these in a text field on a Content model, with a number of differences:

  1. Content models are de-duplicated within a Library/Course, so there are some savings there.
  2. Definition documents were difficult to prune because it was unclear where they were used. Content models are easier to look up in this respect, making it more practical to write pruning code for Content that is not currently used and has never been published.
  3. Content models have more text in them than the equivalent Definition document, because they don't split the scopes.
  4. We will encourage extending Content with 1:1 models (e.g. ImageContent, VideoBlockContent), in a way that was never done with Definition documents.

I think that we're probably going to end up generating approximately the same number of Content entries for these as we had Definitions in the old system. The Content entries will be larger, but should also be much easier to prune. We'll need to watch usage patterns carefully here.
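
As an illustration of the de-duplication point (hypothetical names again, not the actual openedx-learning API), Content rows can be keyed by a hash of their data within a library/course, so storing the same text twice reuses a single row:

```python
# Illustrative sketch of Content de-duplication by hash (hypothetical names).
import hashlib

from django.db import models


class Content(models.Model):
    """A de-duplicated blob of text, e.g. one Component Version's OLX."""
    learning_package_id = models.IntegerField()  # the owning library/course
    text = models.TextField()
    hash_digest = models.CharField(max_length=64)

    class Meta:
        # The same bytes are stored only once per library/course.
        unique_together = [("learning_package_id", "hash_digest")]


def get_or_create_content(learning_package_id: int, text: str) -> Content:
    """Reuse an existing Content row if this exact text has been seen before."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    content, _created = Content.objects.get_or_create(
        learning_package_id=learning_package_id,
        hash_digest=digest,
        defaults={"text": text},
    )
    return content
```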

Static Assets (images, PDFs, etc.)

Learning Core still follows Blockstore's lead here and stores these in a file backend using Django's Storage API, so that assets can be stored either on a filesystem or in S3. This should reduce overall costs as compared to storing them in GridFS on MongoDB, as we do for courses today.

One of the major operational lessons taken from Blockstore is that passing around direct links to these assets using signed URLs led to issues around cache invalidation, and running assets through the app server itself led to performance issues. The approach we're going to try with Learning Core is to take advantage of the X-Accel-Redirect header to move more of this work to the caddy/nginx layer.

This is covered in more detail at ADR: Serving Course Team Authored Static Assets (also see the comments in the pull request).
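
For anyone unfamiliar with the header, here is a minimal sketch of the general X-Accel-Redirect pattern, not the actual edx-platform view: Django does the permission check, then hands the byte-serving off to nginx/caddy via an internal-only location.

```python
# Minimal sketch of the X-Accel-Redirect pattern (illustrative only; not the
# actual edx-platform implementation).
from django.http import HttpResponse, HttpResponseForbidden


def serve_component_asset(request, component_version_uuid, file_path):
    # Stand-in for the real access check (enrollment, role, etc.).
    if not request.user.is_authenticated:
        return HttpResponseForbidden()

    response = HttpResponse()
    # nginx/caddy must be configured to map this internal-only location onto
    # the real storage backend (filesystem path or object store).
    response["X-Accel-Redirect"] = (
        f"/internal-assets/{component_version_uuid}/{file_path}"
    )
    # Let the web server determine Content-Type rather than Django's default.
    del response["Content-Type"]
    return response
```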

There are also two broad categories of storing static assets:

  1. Small sets of assets that are stored for an individual Component, e.g. "these two images in this problem". The idea of static assets being associated with a particular Component and getting released in sync with the problem's versioning is something new to the libraries revamp targeted for Redwood (today you can upload an image within the editor, but the image is put into the general Files and Uploads).
  2. Large sets of files, e.g. Files and Uploads. Also potentially smaller groups of sharable resources in the future, like python_lib.zip source files.

In both scenarios, we have a normalized storage of static assets that de-duplicates raw file data across different versions. If Problem A uses an image, and Problem B uses the exact same image, the image will only be stored once–even if they are using different names to refer to the image.

When it comes to storing metadata about the static asset's relationship to a Component, we chose a fairly simple representation that makes it fast to find the assets for a given Component Version and has indexes to prevent invalid states. For instance, the database will prevent us from creating multiple static assets assigned to the same file path for the same Component Version.
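
Sketched out (again with hypothetical names, building on the models above), that mapping is basically a three-column table with a uniqueness constraint on the (Component Version, path) pair:

```python
# Hypothetical sketch of the per-version static asset mapping and its
# database-level uniqueness guarantee.
from django.db import models


class ComponentVersionAsset(models.Model):
    component_version = models.ForeignKey("ComponentVersion",
                                          on_delete=models.CASCADE)
    # De-duplicated raw file data, shared across versions and components.
    content = models.ForeignKey("Content", on_delete=models.PROTECT)
    # The name this particular version uses for the file, e.g. "img/plot.png".
    file_path = models.CharField(max_length=500)

    class Meta:
        constraints = [
            models.UniqueConstraint(
                fields=["component_version", "file_path"],
                name="one_asset_per_path_per_version",
            ),
        ]
```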

The downside of this is that for scenario (1), the storage cost for the static asset metadata of a new version ("this version has these assets using these names") scales with the number of assets associated with that Component Version. This was deemed acceptable because the expectation is that there are relatively few assets associated with a particular Component, but it's an area we have to watch for as usage patterns evolve.

For scenario (2), we know there are situations with potentially thousands of files and uploads, so the representation we use for (1) would not be acceptable. That's the entire subject of this issue thread, and this comment has my most recent thoughts on it. This is something that is not explicitly needed for the upcoming content libraries revamp, but it will be necessary before we tackle courses.

Reducing related model data and index size

One non-obvious thing that moving so much into MySQL helps us with is to reduce the size of some of our massive tables and indexes that reference content. We currently have many large tables with CourseKeys and UsageKeys written and indexed as varchar fields, such as courseware_studentmodule. On a site like edX, this can total up to hundreds of GBs that could be saved by using 4-byte or 8-byte foreign key references instead.
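
To illustrate the scale of that difference with hypothetical models: a row in a table like courseware_studentmodule currently repeats a full key string, and that string is duplicated again in every index that includes it, whereas an integer foreign key into a small lookup table costs 4 or 8 bytes per row.

```python
# Illustration only (hypothetical models): replacing repeated opaque-key
# strings with integer foreign keys into a small lookup table.
from django.db import models


class StudentRecordToday(models.Model):
    # Dozens to hundreds of bytes per row, repeated verbatim in every index
    # that includes this column.
    module_state_key = models.CharField(max_length=255, db_index=True)
    state = models.TextField()


class ComponentLookup(models.Model):
    # The long key string is stored exactly once here.
    key = models.CharField(max_length=255, unique=True)


class StudentRecordWithFK(models.Model):
    # 4 or 8 bytes per row (and per index entry) instead of the full string.
    component = models.ForeignKey(ComponentLookup, on_delete=models.PROTECT)
    state = models.TextField()
```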

Search

Also, it has to provide search capabilities. How do we achieve that? Is data automatically synced with Elasticsearch? If yes, how?

The search functionality hasn't been built yet. The tagging support currently uses MySQL, and I think we're likely to ride that as far as we can. We still need to do more discovery work. Though hopefully, the fact that we're publishing individual Components will make the indexing process a lot simpler than today's "something's changed, let's re-index everything" approach.

ormsbee commented Feb 2, 2024

Looking back at a high level, I believe I made the following missteps with Blockstore:

  • I assumed Blockstore would be authoring-only, and we would make a simpler published-only data model in the LMS. Learning Core started with the goal of implementing that LMS-oriented published-only data model.
  • I couldn't think of an efficient-enough representation that would support our fully granular, versioned content in a relational database.

So I tried to optimize a pattern that was already familiar to me with Modulestore–snapshots of content, with versions only tracked at the highest level (a full course or library). We would build on that by tracking the dependencies between those top level entities in the database. But a number of things came up as we were building on the data model:

We kept finding requirements that forced us to keep multiple concurrent versions in the LMS (such as previewing, or using a fixed version of Library content). Once Learning Core had to hold many versions anyway, it lost a lot of the hoped-for data model simplicity.

Blockstore ended up serving content directly to users anyway. This was something that we explicitly wanted to avoid (and even called out in the original design goals/non-goals)... but it happened anyway, and performance was poor. That being said, we had authoring performance issues for larger libraries anyway, due to worse-than-expected S3 latency issues.

We figured out a more efficient representation for versions and parent-child relationships. Shifting versioning to happen at the Component level and allowing container types like Units to be defined with "unpinned" Component children is what makes this work. If we had kept the Modulestore/Blockstore pattern of snapshots, it would have exploded the storage requirements and made a shift to MySQL prohibitively costly. Figuring this out took a lot of iteration though, dating back to openedx/openedx-learning#1.

kdmccormick (Member, Author) commented:

This is now Accepted. Code removal is in progress.

@kdmccormick kdmccormick transferred this issue from openedx-unsupported/blockstore Feb 15, 2024
@kdmccormick kdmccormick moved this from Accepted to Removing in DEPR: Deprecation & Removal Feb 15, 2024
kdmccormick pushed a commit to openedx/edx-platform that referenced this issue Feb 22, 2024
This moves the Content Libraries V2 backend from Blockstore [1] over to
Learning Core [2]. For a high-level overview and rationale of this move, see
the Blockstore DEPR [3]. There are several follow-up tasks [4], most notably
adding support for static assets in libraries.

BREAKING CHANGE: Existing V2 libraries, backed by Blockstore, will stop
working. They will continue to be listed in Studio, but their content
will be unavailable. They need to be deleted (via Django admin) or manually
migrated to Learning Core. We do not expect production sites to be in
this situation, as the feature has never left "experimental" status.

[1] https://github.com/openedx-unsupported/blockstore
[2] https://github.com/openedx/openedx-learning/
[3] openedx/public-engineering#238
[4] #34283
kdmccormick (Member, Author) commented:

@Yagnesh1998 will be helping with the remaining clean-up items.

Yagnesh1998 commented:

I will start work soon.

kdmccormick (Member, Author) commented:

Yagnesh is on leave right now. He will resume work when he is back, but in the meantime, this is open to be worked on.

In particular, it'd be nice if we could remove the Blockstore package dependency from edx-platform before the Redwood cut on May 9th.

kdmccormick added a commit to openedx/edx-platform that referenced this issue May 13, 2024
Blockstore and all of its (experimental) functionality has been replaced with
openedx-learning, aka "Learning Core". This commit uninstalls the now-unused
openedx-blockstore package and removes all dangling references to it.

Note: This also removes the `copy_library_from_v1_to_v2` management command,
which has been broken ever since we switched from Blockstore to Learning Core.

Part of this DEPR: openedx/public-engineering#238
ormsbee commented Jul 16, 2024

@kdmccormick: Updating this to Removed. Please reopen if you disagree.

@ormsbee ormsbee closed this as completed Jul 16, 2024
kdmccormick (Member, Author) commented:

opaque-keys still has Blockstore key types that need to be removed.

@kdmccormick kdmccormick reopened this Jul 16, 2024
@github-project-automation github-project-automation bot moved this from Removed to Proposed in DEPR: Deprecation & Removal Jul 16, 2024
@kdmccormick kdmccormick moved this from Proposed to Removing in DEPR: Deprecation & Removal Jul 16, 2024
ormsbee commented Aug 5, 2024

@kdmccormick and I talked about this a bit afterwards, but I'm not sure if we reached a conclusion. I support keeping the Blockstore-related opaque keys so that we don't break analytics and possibly other long-tail code that will need to parse those key types, even if those keys are no longer actively being served by the platform. @kdmccormick: Does that sound right to you? Can we close this ticket now?

kdmccormick added a commit to openedx/opaque-keys that referenced this issue Aug 5, 2024
kdmccormick added a commit to openedx/opaque-keys that referenced this issue Aug 5, 2024
* Trim down BundleVersionLocator docstring to take up less space
* Emit warning when constructing a BundleVersionLocator
* Update other "Blockstore" references to "Learning Core"

openedx/public-engineering#238
kdmccormick (Member, Author) commented Aug 5, 2024

@ormsbee that makes sense. Here's a PR just to add a warning and update docstrings: openedx/opaque-keys#330 . Once that merges I'm good to close this.

kdmccormick (Member, Author) commented:

Removal is complete.

@github-project-automation github-project-automation bot moved this from Removing to Removed in DEPR: Deprecation & Removal Aug 5, 2024