feat!: Drop the block_structure.storage_backing_for_cache WaffleSwitch #35185
base: master
Conversation
@ormsbee per our conversation last week, I dug a bit deeper into the query count differences, and it looks like a lot of the extra queries are savepoints around updating a block structure and then setting the modified date on the parent block when a block structure changes. I've collected the before and after queries for the tests that changed by more than one in this file.
Some minor questions and requests.
In terms of the deployment risk for this: I agree that operators should run the backfill command when upgrading. Honestly, operators should probably run a `simulate_publish` on all their courses every time they upgrade, because not all devs will be as conscientious as you are about thinking through their upgrade process.
I suspect that whether to lazily re-generate the data on read or to just fail the request has to do with the failure mode when something goes wrong with the publish process. The `collect` process is so expensive for large courses that it's probably better for a high-volume site to fail the request altogether than to try to dynamically regenerate on the fly.
Unclear but, as a part of the `get` method on the `BlockStructureStore`, if the structure is not found in the cache/storage, it will be retrieved from the backing store and added to the cache. But currently the same check is not happening for getting the structure from the model. Should we update this?
I don't think we should update it. Grabbing a completely generated blob of data from S3/filestore and stuffing it into memcached/redis is relatively cheap. Regenerating the collected data is much more expensive.
At edX, the publish process would sometimes break block structure collection for a certain course because of some content edge case. The result would be that some data would be unavailable or out of date (e.g. the outline not showing a newly added subsection). It's frustrating for users, but it's better to show that stale data quickly than to constantly hammer the site to break on the same errors over and over again.
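The tradeoff being described (a cheap cache fill from already-collected storage vs. refusing to regenerate on read) can be sketched roughly like this; the class and exception names are illustrative stand-ins, not the actual `BlockStructureStore` API:

```python
# Hypothetical sketch of the read path described above. StructureStore and
# StructureNotFound are illustrative names, not real edx-platform classes.

class StructureNotFound(Exception):
    """Raised on a total miss instead of re-running the expensive collect."""


class StructureStore:
    def __init__(self, backing_store):
        self.cache = {}               # stands in for memcached/redis
        self.backing = backing_store  # stands in for S3/filestore blobs

    def get(self, key):
        # Fast path: already in the cache.
        if key in self.cache:
            return self.cache[key]
        # Cheap path: copy an already-collected blob into the cache.
        if key in self.backing:
            self.cache[key] = self.backing[key]
            return self.cache[key]
        # Expensive path deliberately avoided: fail fast rather than
        # regenerating the collected data on every request.
        raise StructureNotFound(key)
```

The key point is that both happy paths only move bytes around; neither one re-runs collection, so a broken publish surfaces as a fast failure instead of a repeated expensive rebuild.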
```python
@@ -73,6 +73,10 @@ def handle(self, *args, **options):

        for course in course_items:
            course_id = course.id
            # Importing is an act of publishing so send the course published signal.
            SignalHandler.course_published.send_robust(sender=self, course_key=course_id)
```
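For context on what `send_robust` buys here: unlike a plain `send`, Django's `send_robust` catches each receiver's exception and returns it in the result list, so one broken signal handler doesn't abort the import. A rough Django-free stand-in for that behavior (illustrative, not Django's implementation):

```python
# Minimal sketch of send_robust semantics: invoke every receiver, capturing
# per-receiver exceptions instead of letting the first failure propagate.

class RobustSignal:
    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)

    def send_robust(self, sender, **kwargs):
        results = []
        for receiver in self.receivers:
            try:
                results.append((receiver, receiver(sender=sender, **kwargs)))
            except Exception as exc:  # captured, not raised
                results.append((receiver, exc))
        return results
```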
I'm really surprised that this didn't come up as an issue before. I guess because this management command is so little-used, and the vast majority of imports happen through the Studio UI?
Yea, that was my guess as well.
```python
@@ -62,8 +61,6 @@ def _update_course_in_cache(self, **kwargs):
        """
        Updates the course blocks (mongo -> BlockStructure) for the specified course.
        """
-       if kwargs.get('with_storage'):
-           enable_storage_backing_for_cache_in_request()
```
If we're no longer doing anything with the parameter, we should remove the reference to it in the docstring for `update_course_in_cache_v2` in this module.
```python
@@ -93,8 +90,6 @@ def _get_course_in_cache(self, **kwargs):
        """
        Gets the course blocks for the specified course, updating the cache if needed.
        """
-       if kwargs.get('with_storage'):
-           enable_storage_backing_for_cache_in_request()
        _call_and_retry_if_needed(self, api.get_course_in_cache, **kwargs)
```
Likewise, please remove the reference to `with_storage` in the docstring for `get_course_in_cache_v2`.
```python
# Review Question: Should we be doing this here? Does it make sense to do
# this in the xmodule/modulestore/tests/factories.py BlockFactory class
# as a part of the publish there?
bs_api.update_course_in_cache(self.course_data.course_key)
```
I think doing it here manually is okay. I'd probably just enable the `published` signal for this test case and let it build everything, to test part of the machinery as well, since the triggers for that may change over time. It might involve splitting off this test from the rest of the test case though.
I wouldn't put it in `xmodule/modulestore/tests/factories.py` though, since that's called by a lot of things that may not require this functionality.
Makes sense, I think I'll leave it as is then.
```python
@@ -76,44 +77,45 @@ def setUp(self):
        CourseEnrollment.enroll(self.student, self.course.id)
        self.instructor = UserFactory.create(is_staff=True, username='test_instructor', password=self.TEST_PASSWORD)
        self.refresh_course()
        # Since this doesn't happen automatically and we don't want to run all the publish signal handlers,
        # just make sure we have the latest version of the course in cache before we test the problem.
        bs_api.update_course_in_cache(self.course.id)
```
Again, I'd probably emit the signal and let everything run (I think this is the most expensive part of it anyhow), but I think either way is okay.
So this class already has the `course_published` signal enabled, so I thought the issue might be that the `CourseFactory` needed the `emit_signals` parameter set to true, but even with that this did not work. How important is it that this be figured out for this PR? I can dig into it further if needed.
```python
@@ -166,7 +166,7 @@ def test_gated(self, gated_block_ref, gating_block_ref, expected_blocks_before_c
        self.course.enable_subsection_gating = True
        self.setup_gated_section(self.blocks[gated_block_ref], self.blocks[gating_block_ref])

-       with self.assertNumQueries(5):
+       with self.assertNumQueries(14):
```
Question: Looking at the queries files that you generated (which are awesome, BTW), my understanding is that this increase is a result of building the block structures entry for the course, and all the corresponding transaction savepoints. Does this go down to 6 queries if you run the block structure collection/caching before this assert queries block?

It might be worth noting in a comment that we don't ever expect it to actually do these queries in a production setting, since these records would have been written on-publish.
Good call, I published before running this assert and the numbers went down for the actual lookup, I'll update the other place where this is happening as well.
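A toy illustration (not edx-platform code; the fake DB and query strings are invented for the example) of why warming the cache first lowers the count: the cold read pays for the insert plus savepoint bookkeeping, while the warm read is a single lookup:

```python
from contextlib import contextmanager


class FakeDB:
    """Stands in for Django's connection plus assertNumQueries bookkeeping."""

    def __init__(self):
        self.queries = []
        self.rows = {}

    def execute(self, sql):
        self.queries.append(sql)


@contextmanager
def assert_num_queries(db, expected):
    # Count only the queries issued inside the with-block.
    start = len(db.queries)
    yield
    actual = len(db.queries) - start
    assert actual == expected, f"expected {expected} queries, ran {actual}"


def get_structure(db, key):
    if key not in db.rows:
        # Cold path: the first read triggers a write wrapped in a savepoint.
        db.execute("SAVEPOINT sp1")
        db.execute(f"INSERT INTO block_structure VALUES ('{key}')")
        db.execute("RELEASE SAVEPOINT sp1")
        db.rows[key] = "structure"
    db.execute(f"SELECT * FROM block_structure WHERE key = '{key}'")
    return db.rows[key]
```

Running `get_structure` once before the counted block leaves only the single `SELECT` inside it, which mirrors publishing the course before the `assertNumQueries` context.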
```python
# _add_relation and blocks
for parent, children in enumerate(children_map):
    if isinstance(block_structure, BlockStructureBlockData):
        block_structure._get_or_create_block(self.block_key_factory(parent))  # pylint: disable=protected-access
    for child in children:
        block_structure._add_relation(self.block_key_factory(parent), self.block_key_factory(child))  # pylint: disable=protected-access
        if isinstance(block_structure, BlockStructureBlockData):
            block_structure._get_or_create_block(self.block_key_factory(child))  # pylint: disable=protected-access
```
Why was this necessary (as opposed to invoking block_structure app code to generate this data)?
Can you say more? I don't know what you mean by that. Previously, the test was calling this `create_block_structure` function to create a block structure but not creating the underlying blocks. This caused an issue when the newly default caching system tried to cache those blocks, because they didn't exist even though they were named in the block structure object. I was wondering if there was a different way to generate the whole tree in the first place that would be more "real"; is that what you're suggesting?
Any places where the storage backing for the block structure cache was previously enabled conditionally will now have it ON by default.
Since the cache is now always on, remove test cases that test the case when it's disabled.
Remove the `enable_storage_backing_for_cache_in_request` function and its uses in the platform. The function is no longer needed because the storage backing for the block_structure cache will be ON by default moving forward. BREAKING CHANGE: This `enable_storage_backing_for_cache_in_request` function no longer exists and any calls to it should be removed. The cache it enables is now always ON.
This work is part of DEPR openedx/public-engineering#32. Now that we've removed all uses of this switch, remove the declaration as well. BREAKING CHANGE: The `block_structure.storage_backing_for_cache` switch will no longer exist and its value will be ignored. If you have this switch set in your instance you can remove it. The backing cache is now always ON.
Previously, we were not caching BlockStructures to the database by default when adding them to the store. Now that we are, the BlockStructureFactory test failed because it was taking a shortcut that no longer works: it was creating a block structure that had relations but not any block information. Since we now always persist updates to the database, and persisting the structure requires looking up the block information from the block structure, this broke. This change updates the test mixin to add more data so that the content can be persisted to the database successfully as a part of this test.
We don't call the modulestore or update the cache here now that we are backed by the database model. Previously the cache key would change because the `_encode_root_cache_key` function in the `BlockStructureStore` class used `BlockStructureBlockData.VERSION` as a part of the cache key when the data was not being cached in a DB model.
Now that the model backed cache is on by default, we don't need to keep the StubModel object around.
With the new block_structure model cache enabled by default, we're doing queries to it as a part of other tests.
Because signals are disabled by default for performance reasons, this doesn't happen automatically. So we manually refresh the course in the cache after all the changes have been made so that the course in the cache matches the latest version in the modulestore.
This test had a redundant call to get the course data from the store, because that already happens at the end of the setup function. Also, because the expected call structure was being built inside the assert, it was harder to inspect when debugging. Make the code a little easier to debug in case we're back here in the future.
Provide a link to more docs from the code in case we need more context in the future.
This work is part of DEPR openedx/public-engineering#32
See the above link for the rationale. This PR makes it so that the course structure cache is always backed by persistent storage.
BREAKING CHANGE: The `block_structure.storage_backing_for_cache` switch will no longer exist and its value will be ignored. If you have this switch set in your instance you can remove it. The backing cache is now always ON.
Unresolved Questions
Q: Do we need to run any commands to generate the initial cache for all existing operators? Are there docs we can link to or create for this?
A: Unclear but, as a part of the `get` method on the `BlockStructureStore`, if the structure is not found in the cache/storage, it will be retrieved from the backing store and added to the cache. But currently the same check is not happening for getting the structure from the model. Should we update this? I think operators will need to run the `generate_course_blocks` management command before they deploy this code or they will see errors in their deployment.
e.g. `./manage.py lms generate_course_blocks --all_courses --settings=devstack`