Determine LMS Content Data Model Design #1

Closed
5 tasks
Tracked by #67
ormsbee opened this issue Feb 6, 2022 · 67 comments

Comments

@ormsbee
Contributor

ormsbee commented Feb 6, 2022

This is a discovery ticket that would end in multiple ADRs in this repo. The goal is to determine the basic data structures and relationships needed to store content data for the LMS with the following goals:

  1. Handle all existing course use cases.
  2. Handle v2 content libraries, and library resources shared across multiple courses, with different policies/overrides.
  3. Handle potentially non-XBlock content.
  4. Allow for fast publishing.
  5. Allow for atomic publishing.
  6. Allow for easy export.
  7. Allow third party applications to build their own advanced data structures on top of it.
  8. Allow for efficient querying.
  9. Minimize the space wasted on nearly identical versions produced by minor edits.

There will likely be some prototyping involved in this, as well as a lot of discussion.

These ADRs would include:

  • Determine high level approach to splitting reusable content and learning context specific policy.
  • Determine high level approach to versioning and incremental publishing.
  • Determine leaf XBlock-level modeling.
  • Determine unit-level modeling.
  • Determine sequence-level modeling.

Open edX's ModuleStore has explicit versioning capabilities (though they're not really used from LMS most of the time). Blockstore has versioning deeply baked into the design, but we've generally avoided encoding multi-version support into the post-publish data stores we build during course publish (e.g. CourseOverview, Block Transformers/Course Blocks API, etc.)

I think we've come to a point where we really do need to cross that divide and start introducing content versioning concepts to the LMS more generally. Some motivating reasons:

  1. It's difficult to preview changes before publishing if the LMS can only display the published version.
  2. The design of content libraries assumes that multiple versions are available simultaneously.
  3. Building all these separate stores of published data takes time, and parts may fail. When this happens, our course is in an inconsistent half-published state. For example, the rendered course content may be updated but the course outline generation may have failed so that nobody can see the new content. Ideally, we'd want to give these systems time to build up the data that they need, and then change everything atomically at once.
  4. Other systems like scoring and state storage would benefit from being able to store the actual version it was created against.

Storage Scaling Issues

SplitMongo ModuleStore wasted a lot of space with old version data, eventually leading us to create a separate cleanup script for it. There were a number of reasons why disk usage got so bad with this system:

  1. Structural data was stored inefficiently.
  2. New versions were being published for tiny changes.
  3. There was no cleanup.
  4. Historical course data had very little use.

Every time there was even a small change (e.g. the title of a Unit), we ended up writing a document with all settings-scoped data for the course. This happened all the time in Studio, so that for every one version that is of interest to us for viewing or preview purposes, there would be dozens of almost-identical intermediate versions. We ended up in a place where the majority of course storage was wasted in this way.

There are a few ways we can make this much better:

  • Isolate changes better
    Most content doesn't change very much from one version to the next, so we should break up the course into more granular pieces and track their changes individually.
  • Support simple cleanup policies
    There will be intermediate versions that can be almost immediately discarded (like the last state of the Studio draft). We should have obvious cleanup facilities for getting rid of those as soon as they are not needed.
  • Support a simplified data model for clients.
    Thinking about versions is hard, and unintuitive. We should have a set of primitives that help people model version-awareness into their content without having to overly complicate their data models.

Modeling Versioning in a way that scales (i.e. both features and users)

The following is a disorganized set of thoughts:

Entities (the names are bad, I'm just trying to get down the ideas)

  • LearningObject
  • LearningObjectVersion (with sub-types made with joined tables that represent things like Units, Blocks).
  • Bundles (?) of LearningObjectVersions. I wanted to resist adding a separate layer here, but I realized that without this, we'd have to echo out all the (LearningObjectVersion/LearningContextVersion) entries with each new version, even if the libraries that a course is using aren't changing at all, which would be really bad for encouraging library use. I really don't like the reuse of a Blockstore term and concept though.
  • LearningContextVersion can contain multiple Bundles (e.g. a Course consists of versioned bundles for its "own" resources, as well as the Bundles of various other things like Libraries).
  • LearningContextBranch enforces that at any given point, there is one live version per branch (e.g. "draft", "live")
  • Some sort of registry where any app that has data related to a version has a chance to put the status of it (i.e. is it ready?) (How to handle the case where a process dies?)
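
A minimal sketch of how the branch and registry ideas above might fit together, assuming a fixed set of apps that must each report readiness before a version goes live. The names `Branch`, `ReadinessRegistry`, and `try_publish` are hypothetical, and the sketch does not attempt to answer the open question of what happens when a process dies partway through.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Set


@dataclass
class Branch:
    """Hypothetical stand-in for LearningContextBranch: one live version per branch."""
    name: str
    live_version: Optional[int] = None


@dataclass
class ReadinessRegistry:
    """Hypothetical registry where each app reports that its data for a version is built."""
    required_apps: FrozenSet[str]
    ready: Dict[int, Set[str]] = field(default_factory=dict)

    def mark_ready(self, version: int, app: str) -> None:
        if app not in self.required_apps:
            raise ValueError(f"unknown app: {app}")
        self.ready.setdefault(version, set()).add(app)

    def is_ready(self, version: int) -> bool:
        # True only when every required app has reported for this version.
        return self.ready.get(version, set()) >= self.required_apps


def try_publish(branch: Branch, registry: ReadinessRegistry, version: int) -> bool:
    """Flip the branch to `version` only if every required app is ready."""
    if not registry.is_ready(version):
        return False
    # A single assignment stands in for one atomic database update.
    branch.live_version = version
    return True


registry = ReadinessRegistry(required_apps=frozenset({"outline", "blocks"}))
live = Branch(name="live", live_version=1)
registry.mark_ready(2, "outline")
first_attempt = try_publish(live, registry, 2)   # "blocks" has not reported yet
registry.mark_ready(2, "blocks")
second_attempt = try_publish(live, registry, 2)
```

The point of the gate is that the live pointer never moves while any downstream store is still building, which is the half-published state described earlier in the ticket.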

I'm fiddling with some of these ideas in the learning_publishing app's models.py file at the moment.

Concern: How is it different than Blockstore? A: It's going to be a much more relational data model that you hang other relational models (e.g. XBlock content, scheduling information) off of. Also, it's going to have zero intelligence about cycle detection or dependencies. It's also going to have explicit measures for cleanup of unused versions, which is not a thing in Blockstore.

@ormsbee
Contributor Author

ormsbee commented Feb 7, 2022

Okay, I think I'm overloading LearningContext with concepts of content ownership and student state association, and I'm not sure how to reconcile that cleanly. I feel like there's something missing that I don't have the right vocabulary for.

Some rambling thoughts:

  • The same library content consumed by the same student in different courses may or may not have shared state, depending on the desired use case.
  • My implicit assumption with courses is that there is one course-controlled learning context, and then a bunch of libraries. But is it necessary that there's a central "course" learning context? If I had four sections of a "course" (or "pathway") with four differently versioned pieces of content that were strung together, isn't that fine? We just need a way to navigate it.

So it's almost like there's a PublishingContext (each library, the course, some part of a course, some part of a pathway, etc.), which is a group of content that can be versioned and published together. And then there's a LearningContextVersion, which has one or more PublishingContextVersions batched together.

So maybe Learning Contexts have:

  • Publishing Contexts
  • Navigation
  • Student State

That could match up to a Course Run in the LMS. But is that over-complicating things to introduce a distinction between a publishing context and a learning one? If we want to shift how courses are authored to something more granular, it might be worthwhile.

@ormsbee
Contributor Author

ormsbee commented Feb 7, 2022

FYI @kdmccormick, @feanil, @Carlos-Muniz, @arbrandes, @bradenmacdonald, @doctoryes, @saksham115, @jristau1984: Some thoughts I've been having lately about data modeling content in the LMS, with possible implications for content libraries v2 and unit composition in the LMS. It's not very well organized, and no action is required–there's a lot to unpack here that's beyond the current scope of BD-14. I just wanted to be transparent about where my head was at.

@ormsbee
Contributor Author

ormsbee commented Feb 7, 2022

Thinking on this a bit more, this might be a viable way to model something like CCX, where the LearningContext becomes the CCX course key, and it has the PublishingContext for the base course. Policy things like dates would be joins between LearningContexts and PublishingContexts.

This would also allow for how we model default behavior in libraries. The default values become the policy tables that join the library's LearningContext and the library's PublishingContext. The overrides that are specific to courses using the library become joins between the course's LearningContext and the library's PublishingContext.
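
To make the defaults-versus-overrides idea concrete, here is a rough sketch that keys policy rows on a (LearningContext, PublishingContext) pair. The identifiers, the policy field names, and the key-by-key merge rule (overrides win) are all assumptions for illustration, not something the thread has settled.

```python
from typing import Dict, Tuple

# (learning_context, publishing_context) -> policy values.
# Every key and value below is made up for illustration.
PolicyTable = Dict[Tuple[str, str], Dict[str, object]]


def resolve_policy(
    table: PolicyTable,
    learning_context: str,
    publishing_context: str,
    owner_context: str,
) -> Dict[str, object]:
    """Merge the owner's defaults with the consuming context's overrides."""
    # Defaults: the row joining the library's own LearningContext to its PublishingContext.
    defaults = table.get((owner_context, publishing_context), {})
    # Overrides: the row joining the consuming course (or CCX) to the same PublishingContext.
    overrides = table.get((learning_context, publishing_context), {})
    return {**defaults, **overrides}


policies: PolicyTable = {
    ("lib:physics", "lib:physics"): {"max_attempts": 3, "show_answer": "never"},
    ("course:phys101-2022", "lib:physics"): {"max_attempts": 1},
}

course_policy = resolve_policy(policies, "course:phys101-2022", "lib:physics", "lib:physics")
library_policy = resolve_policy(policies, "lib:physics", "lib:physics", "lib:physics")
```

A CCX would use the same lookup with its own key as `learning_context`, which is what lets one mechanism cover both CCX and library overrides.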

@ormsbee
Contributor Author

ormsbee commented Feb 7, 2022

That's actually pretty exciting to me, because it gives us a place to more cleanly apply policy-level separation in a way that decouples content authoring and administration–and does so in a way that unifies CCX and libraries override handling.

@arbrandes

I'm not going to pretend I have the time to dive into this in the near future, but I certainly like it that you're giving it serious thought, Dave. :)

Nevertheless, here's a use case I want to bring up for consideration.

  1. Historical course data had very little use.

This is because Studio never offered a wiki-like history (including grouping changes, who made them, etc) that a course team could navigate. I'm positive authors would love this: I know because I went as far as versioning OLX on git just so I could have all this stuff. I've seen this pop up in the forum on occasion, and if memory serves, MIT used to do this a lot.

I realize what you're discussing here is at a deeper level than exposing diffs on a frontend. But, again, it might be a use case worth considering as you model the data.

@ormsbee
Contributor Author

ormsbee commented Feb 8, 2022

@arbrandes: Right. I guess it's more accurate to say that historical course data had very little use in the LMS. The main use there would be for fast rollback. Studio could benefit from it, though we didn't manage to prioritize any of that work at edX.

@ormsbee
Contributor Author

ormsbee commented Feb 8, 2022

In this view of the world, we'd conceptually have:

LearningObject

  • Example: ProblemBlock, Unit, Sequence
  • Some addressable piece of content
  • This model holds almost no data, it's just an immutable ID to build joins off of for more specific models.
  • A snapshot of this is a LearningObjectVersion

PublishingContext

  • Example: Content Library
  • Content that is owned and versioned/released together.
  • A snapshot of this is a PublishingContextVersion.
  • PublishingContextVersions are M:M with LearningObjectVersions.

LearningContext

  • Example: Course Run, Content Library
  • A grouping to store content-related student state against: enrollments, grades, progress, etc.
  • A grouping to store content-related policy against: schedule, default values
  • A snapshot of this is a LearningContextVersion.

Relationship between LearningContexts and PublishingContexts

  • A LearningContextVersion has one or more PublishingContextVersions.
  • A LearningContextVersion may contain multiple versions of the same PublishingContext (e.g. two problems from two different versions of the same content library).
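
The relationships listed above could be sketched with plain dataclasses as follows. This is only an in-memory illustration, not the models in learning_publishing; the field names and the string identifiers are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple
from uuid import UUID, uuid4


@dataclass(frozen=True)
class LearningObject:
    uuid: UUID  # immutable ID to join against; almost no other data lives here


@dataclass(frozen=True)
class LearningObjectVersion:
    learning_object: LearningObject
    version_num: int


@dataclass(frozen=True)
class PublishingContextVersion:
    publishing_context: str  # e.g. a library identifier (hypothetical format)
    version_num: int
    # M:M: the same LearningObjectVersion may appear in many snapshots.
    object_versions: FrozenSet[LearningObjectVersion]


@dataclass(frozen=True)
class LearningContextVersion:
    learning_context: str  # e.g. a course run identifier (hypothetical format)
    version_num: int
    publishing_context_versions: Tuple[PublishingContextVersion, ...]

    def __post_init__(self) -> None:
        if not self.publishing_context_versions:
            raise ValueError("needs at least one PublishingContextVersion")


intro, quiz = LearningObject(uuid4()), LearningObject(uuid4())
intro_v1 = LearningObjectVersion(intro, 1)
quiz_v1, quiz_v2 = LearningObjectVersion(quiz, 1), LearningObjectVersion(quiz, 2)

# intro_v1 is unchanged between library versions, so both snapshots point at it.
lib_v1 = PublishingContextVersion("lib:physics", 1, frozenset({intro_v1, quiz_v1}))
lib_v2 = PublishingContextVersion("lib:physics", 2, frozenset({intro_v1, quiz_v2}))

# One course version drawing on two versions of the same library.
course_v5 = LearningContextVersion("course:phys101-2022", 5, (lib_v1, lib_v2))
```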

Next question: How is policy (e.g. defaults, scheduling, partitioning) stored against these data models?

@kdmccormick
Member

@ormsbee I found your latest comment (the one directly above the one I'm typing) to be illuminating. Your distinction between LearningContext and PublishingContext is something that I've wondered about but hadn't been able to crystallize.

I agree with the overall direction you are taking in this app. I haven't taken a deep look at the other two apps yet but will soon.

I think we've come to a point where we really do need to cross that divide and start introducing content versioning concepts to the LMS more generally.

Yeah :/ I was really hoping we could hold onto "LMS doesn't do versions" but I admit I don't have any robust solutions for the pitfalls you listed.


Now some more in-the-weeds reactions/ramblings:

The same library content consumed by the same student in different courses may or may not have shared state, depending on the desired use case.

Hm, this sounds complicated. I had always thought that student state would remain isolated between LearningContexts, and by consequence, Content Libraries would carry no student state other than maybe author-preview state in Studio.

I guess I view Content Libraries as a more generalized type of LearningContext, one that should hold policy but not state. Would that function as a helpful or even sensical simplifying assumption? Would it lock Content Libraries out of an interesting use case?

Going another direction, what if policy itself were a LearningObject within the PublishingContext? At that point, could we say that a Content Library is a PublishingContext but not a LearningContext in the eyes of the LMS?

LearningObject, PublishingContext, LearningContext

I know you said the names aren't final. I'm still going to poke at them because it helps me think about the architecture, and also I can't help myself :)

  • "Learning object" is an industry term that is unfortunately closer to our Sequences or Sections.
  • In terms of "context", I think we get to use that term once ("learning context") before folks start getting confused about what "context" means in different contexts.
  • How about "Versioned{Entity}" instead of "{Entity}Version"? "PublishingContextVersions" makes me think of multiple versions of the same PublishingContext, whereas "VersionedPublishingContexts" doesn't.
  • Does the LearningContext model belong in this package? Seems like a very core model that isn't publishing-specific.

So, maybe:

  • a ContentObject is an addressable piece of content, with each snapshot a VersionedContentObject.
  • a ContentPackage is content that is owned and versioned/released together as a VersionedContentPackage.
  • a VersionedLearningContext joins ContentPackage(s) and student state.

keeping in mind that these models are all namespaced under openedx_learning.publishing.

@ormsbee
Contributor Author

ormsbee commented Feb 9, 2022

@kdmccormick: I'm going to chew on the first half of your reactions for a while, but w.r.t. the naming suggestions:

I know you said the names aren't final. I'm still going to poke at them because it helps me think about the architecture, and also I can't help myself :)

FWIW, in case it wasn't clear, my naming attempt was a desperate cry for help. 😛 Thank you for suggesting new ones.

"Learning object" is an industry term that is unfortunately closer to our Sequences or Sections.
a ContentObject is an addressable piece of content, with each snapshot a VersionedContentObject.

I had originally dabbled with the idea of intentionally using LO since some of them will be LOs in the general industry sense, but I like your framing of ContentObject better. I would still prefer ContentObjectVersion over VersionedContentObject because to me "VersionedContentObject" implies that there is a separate model with versions–i.e. I should be able to call versioned_content_obj.versions and get back a list of other models with version information. I think that ContentObjectVersion more clearly conveys that this is the data for a single version.

I also like ContentPackage as a way to not use the word "context" so much, and I do think it's simpler and clearer.

I'm in the middle of writing a model for encoding Units, and I'll try to do so in a way that incorporates some of this new language and addresses your other questions/suggestions above. Thank you for looking at this!

@ormsbee
Contributor Author

ormsbee commented Feb 10, 2022

Implementing Unit Composition

Let's imagine what Unit Composition would look like if it was built on top of a data model like this. In this framing, all the ContentObject and ContentPackage stuff mentioned above would be part of the learning_publishing app, and most of this would be a layer above that as the learning_composition app. Given the data model I'm suggesting, even the LearningContext model might move up to this level.


Idea: ContentPackages represent raw pools of ContentObjects, and composition of those COs only exists within a LearningContext.

This sounds a bit weird, since moving a Block from one Unit to another should surely count as a change in content. But a few thoughts I had around this:

  1. The LMS has a sort of "compiled" version of this content. Studio/Blockstore will be aware of the parent/child hierarchy and can provide it when necessary.
  2. We still have LearningContextVersion to encode when these things change.
  3. We need to do composition at the LearningContext level anyway, to string together content from multiple ContentPackages (e.g. a Course Run referencing Library content). Doing it at the ContentPackage level would give us two semi-redundant ways to specify hierarchy.

These entities live at the LearningContext level.

Model that joins LearningContextVersion and ContentObjectVersion: "Block"?

  • This creates the instance of this piece of content as it exists in this version of the context, and gives it an identifier.
  • This is where UsageKeys would be mapped onto content.
  • This would allow for multiple instances of the same ContentObjectVersion to be brought into a course with different usage keys.
  • This needs a better name than LearningContextVersionContentObjectVersion 😛
  • Maybe we just call this "Block"? That would be more familiar terminology for people used to working with edx-platform today.
  • If usage keys are only established here, is it meaningful to have history in ContentPackages? Are ContentObjects just anonymous, hashed data buckets, like split modulestore definitions are today?

Unit

  • we can either try to be smart about collapsing duplicates by having a value hash for this, or we can be really good about cleanup by tying them to a LearningContextVersion and deleting them when that LearningContextVersion is itself deleted.

UnitBlocks

  • join between Unit and Block, with a column for ordering.

There's a weird sort of language flip with these names in that we're taking things that are versioned but talking about them


Edge Case: Nested blocks in Units

Units are flat lists. There are two scenarios I know of in which we have nesting within units right now:

  1. Blocks that render different children to different users.
    These include the LibraryContentBlock, SplitTestBlock, and ConditionalBlock. I think we can add these inline as part of the list, along with their content group settings, so that we can efficiently select the subset that our student cares about.
  2. The ProblemBuilder block that uses a container block and different problem-type sub-blocks.
    We might be able to just collapse this into a single Block as far as this layer of abstraction is concerned, and let the XBlock runtime deal with the nesting.

Right now, when you make a SplitTestBlock, with two things in it, you’re actually ending up with a hierarchy in ModuleStore that looks like this:

Unit (Vertical)
- Video
- HTML
- SplitTest
  - Vertical
    - Problem 1a
    - Problem 2a
  - Vertical
    - Problem 1b
    - Problem 2b

So the hierarchy looks like: Course -> Section -> Sequence -> Vertical -> SplitTest -> Vertical -> ProblemBlock. We've had a number of bugs along the lines of "nobody actually thinks this kind of structure is a thing on our platform", because they don't realize that extra level of hierarchy is allowed to exist.

But this kind of nesting for SplitTest only exists to generate what is a conceptually flat list–what we’re really trying to do is render Problem 1a+2a for one group and Problem 1b+2b for another group. The nesting is just because that’s how XBlocks work, and implementing them as children lets the SplitTestBlock control which one to show when rendering. But the mechanism SplitTestBlock uses underneath is user partitioning.

So Studio can keep this view of the world, but there’s no reason we need to encode it like that for the LMS.

The LMS can have a version of the data that just looks like:

Unit
- Video
- HTML
- Problem 1a (Partition Foo, Group A)
- Problem 2a (Partition Foo, Group A)
- Problem 1b (Partition Foo, Group B)
- Problem 2b (Partition Foo, Group B)
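
The flattening described above can be sketched as a small tree walk. `Node` is a hypothetical stand-in for the Studio-side hierarchy, not a real XBlock class, and the sketch assumes each child Vertical of a SplitTest maps to exactly one group named after that Vertical (the real block keeps its own group-to-child mapping).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical Studio-side tree node."""
    block_type: str
    name: str
    children: List["Node"] = field(default_factory=list)
    partition: str = ""  # only meaningful on split_test nodes


# (block name, (partition, group)); the second item is None for content shown to everyone.
FlatRow = Tuple[str, Optional[Tuple[str, str]]]


def flatten_unit(unit: Node) -> List[FlatRow]:
    rows: List[FlatRow] = []
    for child in unit.children:
        if child.block_type == "split_test":
            # Each child vertical corresponds to one group in the partition.
            for group_vertical in child.children:
                group = (child.partition, group_vertical.name)
                rows.extend((leaf.name, group) for leaf in group_vertical.children)
        else:
            rows.append((child.name, None))
    return rows


unit = Node("vertical", "Unit", [
    Node("video", "Video"),
    Node("html", "HTML"),
    Node("split_test", "SplitTest", partition="Foo", children=[
        Node("vertical", "A", [Node("problem", "Problem 1a"), Node("problem", "Problem 2a")]),
        Node("vertical", "B", [Node("problem", "Problem 1b"), Node("problem", "Problem 2b")]),
    ]),
])
flat = flatten_unit(unit)
```

With rows shaped like this, selecting what a given student sees is a filter on the group column rather than a recursive descent through intermediate Verticals.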

@ormsbee
Contributor Author

ormsbee commented Feb 11, 2022

Okay, I've rambled a lot here, but I'm going to actually try to code this up in models and see how it holds together.

@ormsbee
Contributor Author

ormsbee commented Feb 11, 2022

Side note: Explicit branches (draft vs. published) vs. having entirely different LearningContexts to encode drafts for previewing (or even sending explicit versions during preview)?

@ormsbee
Contributor Author

ormsbee commented Feb 14, 2022

Latest wrinkle is: How important is it to be able to change the identifier for a piece of content over time? So in other words, can I change the UsageKey for something and still keep student state/history associated with it?

The data model implication if we wanted that kind of capability would be:

  1. We'd have a sort of abstract representation of a piece of content across different versions of a LearningContext, using a UUID.
  2. We'd associate customized identifiers (e.g. UsageKeys) at the level of an individual version of content.

It's a bit wasteful of space, since the identifiers are getting repeated with every revision, though it's not that big a deal–we'd use 8 bytes for the foreign key anyway, and identifiers would compress well if we use the right row type.

The bigger thing is how intuitive it is, and how awkward it would be to work with. Certain queries get potentially much more expensive, like "did this identifier ever exist in this course?". There's also potential for more confusion between when you're working with this abstract, version-less Block, and a concrete, version-of-a-Block.
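
As a rough illustration of that query cost, here is a sketch using SQLite with a versionless table holding the stable UUID and a versioned table holding the identifier. The table names, columns, and identifier strings are hypothetical. The "did this identifier ever exist in this course?" question now needs a join across every version row rather than a lookup in one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Versionless block: a stable UUID per learning context (hypothetical schema).
    CREATE TABLE block (
        uuid TEXT PRIMARY KEY,
        learning_context TEXT NOT NULL
    );
    -- Versioned block: the identifier is repeated on every version row.
    CREATE TABLE block_version (
        block_uuid TEXT NOT NULL REFERENCES block(uuid),
        context_version INTEGER NOT NULL,
        identifier TEXT NOT NULL,
        PRIMARY KEY (block_uuid, context_version)
    );
""")
conn.execute("INSERT INTO block VALUES ('b-1', 'course:demo')")
conn.executemany(
    "INSERT INTO block_version VALUES (?, ?, ?)",
    # The same block is renamed between version 1 and version 2.
    [("b-1", 1, "problem@old_name"), ("b-1", 2, "problem@new_name")],
)


def identifier_ever_existed(learning_context: str, identifier: str) -> bool:
    """Check all versions of all blocks in the context for the identifier."""
    row = conn.execute(
        """
        SELECT 1 FROM block_version bv
        JOIN block b ON b.uuid = bv.block_uuid
        WHERE b.learning_context = ? AND bv.identifier = ?
        LIMIT 1
        """,
        (learning_context, identifier),
    ).fetchone()
    return row is not None
```

Student state keyed on `block.uuid` would survive the rename, which is the capability being weighed against the extra join.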

@ormsbee
Contributor Author

ormsbee commented Feb 14, 2022

Another weird wrinkle with having an abstract, versionless Block in this scenario is that it makes it harder to enforce uniqueness of UsageKey-style identifier within a LearningContextVersion (since the identifiers would be in the versioned-block table and the f-key to LearningContextVersion would be in the versionless-block table).

@ormsbee
Contributor Author

ormsbee commented Feb 18, 2022

Additional wrinkle: Does it make sense to allow for multiple identifiers for a piece of content? This might be too obscure a use case, but we've had requests in the past for something like this. The scenario was that a university had their own content library, and that content was converted into capa problems (among other formats). Those problems had UsageKeys, but they also wanted to associate it with their original content IDs when they analyzed the usage data later. This could be done with some sort of namespaced identifier scheme, e.g. ('xblock', <usage_key>), ('uni-lib', <uni-identifier>)
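
One possible shape for such a namespaced scheme, sketched as an in-memory index. The class name, the block UUID strings, and the example identifier values are all invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

Identifier = Tuple[str, str]  # (namespace, value)


class IdentifierIndex:
    """Hypothetical index mapping namespaced identifiers to one internal block UUID."""

    def __init__(self) -> None:
        self._by_identifier: Dict[Identifier, str] = {}

    def add(self, block_uuid: str, namespace: str, value: str) -> None:
        key = (namespace, value)
        existing = self._by_identifier.get(key)
        # An identifier must stay unique within its namespace.
        if existing is not None and existing != block_uuid:
            raise ValueError(f"{key} already refers to {existing}")
        self._by_identifier[key] = block_uuid

    def lookup(self, namespace: str, value: str) -> Optional[str]:
        return self._by_identifier.get((namespace, value))

    def identifiers_for(self, block_uuid: str) -> List[Identifier]:
        return sorted(k for k, v in self._by_identifier.items() if v == block_uuid)


index = IdentifierIndex()
index.add("b-1", "xblock", "problem@capa_42")
index.add("b-1", "uni-lib", "PHYS-0042")
```

Analytics could then join usage data on either namespace and arrive at the same block.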

@kdmccormick
Member

Good stuff. Some clarifying questions and reactions, although I'm still chewing on this:

Idea: ContentPackages represent raw pools of ContentObjects, and composition of those COs only exists within a LearningContext.

You said earlier that ContentObjects include Sequences (and I assume Sections too?). Under your idea, the hierarchy of those Sections and Sequences only exists within the LearningContext, right? This would make sense to me, given that in learning_sequences, the LearningSequence model has a foreign key to LearningContext.

(I am assuming that learning_sequences will be the basis for openedx-learning's modeling of the "above-the-unit" part of the hierarchy--but please correct me if I'm wrong on that.)

We still have LearningContextVersion to encode when these things change.

Okay, so conceptually COs are composed within a LearningContext, but in the data model COVersions are actually composed within LearningContextVersions, right? That is, there's nothing in the schema tying version-agnostic COs and LearningContexts directly?

Maybe we just call this "Block"?

Hah! This makes me smile. It almost feels like the completion of a long derivation of a cool math equation, where we've abruptly descended from dizzying abstractness by simplifying (LearningContextVersion X ContentObjectVersion) into the familiar Block.

Part of me thinks it'd be better to use a new term to avoid ambiguity with XBlock. The other part of me thinks that the ideas are close enough that introducing a new term would be even worse, and instead we should hope that this new meaning of "Block" naturally meshes with folks' current understanding of XBlock and the usage of block in our code.

UnitBlocks: join between Unit and Block, with a column for ordering.

This calls back to my previous question about whether Sections, Sequences and Units are themselves ContentObjects. If so, then it sounds like we can't assume that Block is a component. It feels weird that we could make a UnitBlock(unit=u, block=b) where b refers to an xblock of type vertical, sequence, or chapter.

There's a weird sort of language flip with these names in that we're taking things that are versioned but talking about them

Did you omit some text here?

Edge Case: Nested blocks in Units

All of this sounds right to me 👍🏻 Good writeup of the issue.

@kdmccormick
Member

Oh, one more thought:

UsageKey is conceptually defined as (DefinitionKey X LearningContextKey).

I see this as parallel with your idea that Block is defined as (ContentObjectVersion X LearningContextVersion).

More generally, Usage = Definition X Context. (ooh, maybe another name for ContentObject is "ContentDefinition"?)

Does that jibe with your thinking?

If so, do you think each ContentObject would have a DefinitionKey? Then, a Block's UsageKey would be formed by combining the Block's ContentObject's key and the Block's LearningContext's key.

@ashultz0

ashultz0 commented Mar 2, 2022

Years ago I had a versioning system in Django with two IDs, so each object had its own UUID and the UUID of its lineage. There were some weird wrinkles because I wanted the latest link to be stable, so the latest always had the same UUID for both, and historical versions were spun off with new IDs as they were archived. You could make a stable view of the system by linking together historical versions.

@ormsbee
Contributor Author

ormsbee commented Mar 8, 2022

@kdmccormick:

There's a weird sort of language flip with these names in that we're taking things that are versioned but talking about them

Did you omit some text here?

Yeah, I just meant that the language flips around, where "Block" is a versioned thing, while in all the other definitions we call out versioned entities with "Version". I think it's okay because of how commonly it's going to be used–Blocks are likely the thing you interact with at higher layers when you don't care about all this versioning stuff at all but just want to have information about the course as it is–but I wanted to call it out.

More generally, Usage = Definition X Context. (ooh, maybe another name for ContentObject is "ContentDefinition"?)

It's funny, that would line us back to Modulestore terminology of definitions, which might not be so bad. I was worried that it might be confusing because we wouldn't necessarily be stuffing things that are definition-scoped in there–more like "all the things that aren't already covered by a specialized system like grading or scheduling", which is conceptually similar but doesn't match up in code.

@ashultz0: I think I get what you mean. We do something similar for structure documents in SplitMongo, to be able to more easily determine shared ancestry between two structures. The wrinkle with this system is that in addition to history, it's also trying to represent a conceptual split between something as it exists (and evolves) in a library vs. its usage in a particular course.

@kdmccormick: I wonder if we could do some optimization based on the fact that a branch's history will be very linear. I know we've discussed this bit about having uniquely identified versions with hashing, and I still believe that has value. But one of the things I keep coming back to is that we currently pay an enormously high cost for very minor changes. Even this schema proposal would rewrite a join table of published blocks for a learning context that may have thousands of entries for each version. But we might be able to make a schema that encodes things much more efficiently by having publish version ranges rather than specific versions. So something like:

class Block:
    uuid  # auto-generated
    identifier  # UsageKey would go here.
    learning_context_branch  # some foreign key?
    start_version_num
    end_version_num  # null or maybe maxint to indicate what's published now
    content_object  # foreign key

I don't know if the extra complexity is worth it. It makes both querying and cleanup more complicated, but it may remove the need for a lot of that cleanup in the first place. I don't think it's as intuitively obvious as a schema that maps new blocks for each version, but it would make new additions really cheap, because we'd only be making new entries for the things that change. Getting the current state of a branch means looking for all the things where end_version_num is null. Or maybe the end version is a special maxint value for that column, since that may make constraints easier.
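
Here is a small runnable sketch of the version-range idea using SQLite, taking the maxint-sentinel variant for `end_version_num`. It drops the `uuid` column and uses plain strings for the branch and content object, so treat the schema and the `publish` / `blocks_at` helpers as assumptions rather than the proposed models. Block deletion is not handled.

```python
import sqlite3

MAX_VERSION = 2**31 - 1  # sentinel meaning "still published" (the maxint variant)

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE block (
        identifier TEXT NOT NULL,
        branch TEXT NOT NULL,
        start_version_num INTEGER NOT NULL,
        end_version_num INTEGER NOT NULL,
        content_object TEXT NOT NULL,
        -- With a sentinel instead of NULL, "one open row per identifier" is a plain constraint.
        UNIQUE (branch, identifier, end_version_num)
    )
""")


def publish(branch: str, version: int, changes: dict) -> None:
    """Write rows only for changed blocks; unchanged rows are not touched."""
    with conn:  # one transaction: all changes land together or not at all
        for identifier, content_object in changes.items():
            # Close the currently open range for this block, if there is one.
            conn.execute(
                "UPDATE block SET end_version_num = ? "
                "WHERE branch = ? AND identifier = ? AND end_version_num = ?",
                (version - 1, branch, identifier, MAX_VERSION),
            )
            conn.execute(
                "INSERT INTO block VALUES (?, ?, ?, ?, ?)",
                (identifier, branch, version, MAX_VERSION, content_object),
            )


def blocks_at(branch: str, version: int) -> dict:
    """Return identifier -> content_object as of the given version."""
    rows = conn.execute(
        "SELECT identifier, content_object FROM block "
        "WHERE branch = ? AND start_version_num <= ? AND end_version_num >= ?",
        (branch, version, version),
    )
    return dict(rows)


publish("live", 1, {"unit1": "co-a", "problem1": "co-b"})
publish("live", 2, {"problem1": "co-c"})  # only one block changed
row_count = conn.execute("SELECT COUNT(*) FROM block").fetchone()[0]
```

After two publishes of a two-block course there are three rows rather than four, and the saving grows with course size since each publish costs rows in proportion to what changed.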

@ormsbee
Contributor Author

ormsbee commented Mar 8, 2022

More thoughts after a night's rest:

  1. There are more ways for the incremental-versioning thing to get in a state where there are concurrent edits. This could be good or bad, depending on our use case.
  2. Having the incremental representation means that prepping a fully formed "version" representing the whole state of the course and then flipping a switch is a little less straightforward. There would be some work to match up the branch data to be equivalent.

@bradenmacdonald
Contributor

Late to the party. I've been asked to think about what we need to do to back courses by Blockstore, so I'm trying to catch up on this and adding some thoughts.


There are two scenarios I know of in which we have nesting within units right now:

Blocks that render different children to different users.

I do feel that that should be controlled at some other level, not via XBlocks.

The ProblemBuilder block that uses a container block and different problem-type sub-blocks.

While it's true that this is the architecture of Problem Builder, I don't think it's a great architecture, nor does it justify adding a ton of implementation complexity in order to support one specific XBlock. We implemented problem builder originally without support for XBlock children, then we converted it to use xblock children assuming that other blocks would do similar things, but in the end it's only really problem builder that did this. Though it would be a lot of work, problem builder could be re-architected to use "fake" XBlock children again as it originally did, essentially managing and rendering children within the XBlock code rather than relying on the platform to do so. This would probably end up with a much nicer authoring experience than currently exists. However it would be a ton of work.


ContentPackages represent raw pools of ContentObjects, and composition of those COs only exists within a LearningContext.

I like this idea a lot, because...

We need to do composition at the LearningContext level anyway, to string together content from multiple ContentPackages (e.g. a Course Run referencing Library content). Doing it at the ContentPackage level would give us two semi-redundant ways to specify hierarchy.

Indeed. This was a giant pain when implementing content libraries on top of blockstore. In fact, dealing with different composition/IDs at different levels of the system was the only part of that implementation that I have bad memories of, and it also produced the part of the code that I'm least happy with and that I think is the most confusing.

The central problem is that at the blockstore level, content composition is done using "links" to other bundles and using the actual XML filenames of XBlock OLX that you want to include. But within the LMS, we need to use usage keys which contain only the learning context ID (which gives the bundle ID) and a unique "usage ID". In the case of a child block seen in the LMS (e.g. if you go directly to the "mobile view" URL for a specific child), knowing the usage ID doesn't tell you what the parent block is, so in the worst case you have to scan every OLX file in the bundle to see which one included a child with the given usage ID. It also requires that authors who write an xblock-include statement specify a usage key for that particular include, even though usage keys really shouldn't belong in bundles/OLX, just definition keys.

Example, for a unit XBlock that includes an HTML child and a Video child

  graph TD;
    subgraph b1[Bundle 1]
    o1[OLX File 1 unit/main-unit/definition.xml]
    o2[OLX File 2 html/introduction/definition.xml]
    o1-- xblock-include -->o2
    end
    subgraph b2[Video Bundle]
    o3[OLX File 3 video/intro-video/definition.xml]
    end
    o1-- xblock-include via bundle links -->o3
    BDL1[BundleDefinitionLocator 1]-->o1;
    BDL2[BundleDefinitionLocator 2]-->o2;
    BDL3[BundleDefinitionLocator 3]-->o3;
    subgraph LearningContext
        u1[UsageKey 1]
        u2[UsageKey 2]
        u3[UsageKey 3]
    end
    u1-->BDL1
    u2-->BDL2
    u3-->BDL3
    subgraph lc2[Some other LearningContext]
        u4[UsageKey 4]
    end
    u4-->BDL3

Mapping from usage keys to OLX files or vice versa can both be complex/hacky/expensive depending on the situation.

So, if it's possible that at the Blockstore / ContentPackages level we only deal with atomic units, the implementation would get much cleaner and simpler. In fact, I would say we definitely need to find a better approach to this before we can consider using Blockstore for courses.

The challenge I see is that authors really want to combine things often. e.g. a common use case is putting an HTML caption together with a video, or associating an HTML image and intro text with a CAPA problem. That image should always be followed by that CAPA problem, and course authors don't want to have to specify that every time they use that problem in a new course/context.


How important is it to be able to change the identifier for a piece of content over time? So in other words, can I change the UsageKey for something and still keep student state/history associated with it?

If you need this functionality, it's good to design for it early on. I like the approach of JIRA, where every issue has an internal permanent ID that you almost never see and a set of external IDs like DEPR-1, DEPR-2 etc. which can be changed, and where old IDs redirect to the current ID.
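A minimal sketch of that permanent-ID-plus-aliases pattern, assuming an in-memory registry. The class name `IdentifierRegistry` and its methods are hypothetical and not part of any existing Open edX API; the identifiers used in the example are made up.

```python
import uuid


class IdentifierRegistry:
    """Maps changeable external IDs onto a permanent internal ID."""

    def __init__(self):
        self._alias_to_permanent = {}  # every ID ever used -> permanent ID
        self._current_alias = {}       # permanent ID -> current external ID

    def create(self, external_id):
        if external_id in self._alias_to_permanent:
            raise ValueError(f"{external_id!r} is already in use")
        permanent_id = uuid.uuid4()
        self._alias_to_permanent[external_id] = permanent_id
        self._current_alias[permanent_id] = external_id
        return permanent_id

    def rename(self, old_external_id, new_external_id):
        if new_external_id in self._alias_to_permanent:
            raise ValueError(f"{new_external_id!r} is already in use")
        permanent_id = self._alias_to_permanent[old_external_id]
        # The old ID stays in the alias table, so it keeps resolving.
        self._alias_to_permanent[new_external_id] = permanent_id
        self._current_alias[permanent_id] = new_external_id

    def resolve(self, external_id):
        """Return the permanent ID for any current or former external ID."""
        return self._alias_to_permanent[external_id]

    def current_id(self, external_id):
        """Follow the 'redirect' from an old ID to the current one."""
        return self._current_alias[self.resolve(external_id)]


registry = IdentifierRegistry()
permanent = registry.create("problem-1")
student_state = {permanent: {"score": 1.0}}  # state hangs off the permanent ID
registry.rename("problem-1", "intro-problem")
```

Because student state is keyed on the permanent ID, renaming the external identifier does not orphan it.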

Does it make sense to allow for multiple identifiers for a piece of content? This might be too obscure a use case, but we've had requests in the past for something like this. The scenario was that a university had their own content library, and that content was converted into capa problems (among other formats). Those problems had UsageKeys, but they also wanted to associate it with their original content IDs when they analyzed the usage data later. This could be done with some sort of namespaced identifier scheme, e.g. ('xblock', <usage_key>), ('uni-lib', )

This sounds like a use case for custom tags, where you could tag individual content items with the original content ID?

@ormsbee
Contributor Author

ormsbee commented Mar 11, 2022

Another thought that might simplify things: Maybe we can get rid of the idea of branches altogether, and just have a purely linear history, with the caveat that the "published" pointer doesn't automatically advance. So there are new "versions" being made for preview purposes that aren't "live".
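A rough sketch of a purely linear history with a separately tracked "published" pointer. The class and method names are made up for illustration, and each version is reduced to a plain string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LinearHistory:
    """A single stream of versions plus one pointer that says 'published'."""

    versions: List[str] = field(default_factory=list)
    published_index: Optional[int] = None

    def save(self, snapshot: str) -> int:
        """Append a new version. The published pointer does not move."""
        self.versions.append(snapshot)
        return len(self.versions) - 1

    def publish(self, index: Optional[int] = None) -> None:
        """Point 'published' at a version (the newest one by default)."""
        if not self.versions:
            raise ValueError("nothing to publish")
        if index is None:
            index = len(self.versions) - 1
        if not 0 <= index < len(self.versions):
            raise IndexError(index)
        self.published_index = index

    @property
    def draft(self) -> Optional[str]:
        """The newest version, which is what preview would show."""
        return self.versions[-1] if self.versions else None

    @property
    def published(self) -> Optional[str]:
        """The live version, or None if nothing was ever published."""
        if self.published_index is None:
            return None
        return self.versions[self.published_index]


history = LinearHistory()
history.save("v0: first draft")
history.publish()
history.save("v1: typo fix, preview only")
```

After the second `save`, `history.draft` is the new version while `history.published` still returns the first one.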

@ashultz0

That makes a huge amount of sense to me. Developers that have git as a major part of their job have enough trouble with branches. Normal people just have a bunch of spreadsheets named "presentation-final-final-FINAL". A single linear stream with one pointer that says "published" feels plausible, branching really does not for people who do not have version control as a main part of their job.

@ormsbee
Contributor Author

ormsbee commented Mar 12, 2022

+@doctoryes, @marcotuts, @jristau1984, @jmakowski1123: (looping you in because this relates to content libraries as well as some long standing issues around content modeling)

@bradenmacdonald:

The challenge I see is that authors really want to combine things often. e.g. a common use case is putting an HTML caption together with a video, or associating an HTML image and intro text with a CAPA problem. That image should always be followed by that CAPA problem, and course authors don't want to have to specify that every time they use that problem in a new course/context.

This has been bothering me a lot lately, to the point where I think we need some explicit terminology and modeling around it. Right now, our language for content is largely about where it is in the tiered navigation hierarchy, or how it gets implemented in XBlocks. We don't really have a concept around this thing where sometimes these two pieces should just always be treated as one thing for the purposes of composition. So for now I'm going to steal terminology from a UX article I read somewhere and call the smallest bits ContentAtoms and the slightly bigger logical grouping of 2-3 of these as a ContentMolecule. (ContentElement? I am not at all attached to these names.)

From the perspective of the LMS doing composition, it would be great if XBlock properly supported the ability to nest in ways that weren't strictly parent->child. So in other words, if we could do this:

    <problem>
        <video>...</video>
        <prompt>What techniques are demonstrated by this video?</prompt>
        <p>Some description...</p>
        <!--  etc. -->
    </problem>

We've always needed this for things like content libraries. We kludge it in the v1 implementation by wrapping it in an extra vertical/unit. But ContentMolecules aren't really the same thing as Units–they come in different places in the hierarchy for one, and you wouldn't "compose" the internals of a ContentMolecule–it's just a fixed structure. But they have overlap, in the sense that both ContentMolecules and Units represent things that are externally addressable. It would make no sense to do an LTI launch of a ContentAtom that is an HTML prompt before a problem and not give you the problem itself.

Then there's the question of how we would introduce a concept to the system in a way that doesn't break everything. Wrapping it in a new XBlock/tag would introduce another level of hierarchy that would likely break a lot of things downstream. Not only would code break, but exporting OLX from a version of Open edX that supported such grouping to a version that didn't would make those parts of the course completely inaccessible (because they'd be hidden under a new XBlock type that doesn't exist in the old system).

But how about this?

  1. For backwards compatibility, we make it so that by default, every individual block today is its own ContentMolecule.
  2. We add a new, optional field that allows you to group together consecutive blocks into a ContentMolecule (really needs a better name). Studio can have some niftier UI around this.
  3. Content Libraries v2 are the only thing out of the gate that make use of ContentMolecules.
  4. Somebody (please, please) suggests better names for this than ContentAtom/ContentMolecule.

I think that would give us a bare structure for grouping small things in a primitive, static, non-hierarchical way (ContentMolecules), in contrast to the more dynamic grouping mechanisms we'd be looking at for Unit Composition. It would be nice if Unit composition only had to deal with things at the ContentMolecule level, but there are certain types of composition that currently rely on ContentAtoms, like Feature Based Enrollment (FBE) that will disable certain XBlock types from displaying their contents for certain enrollment modes. That could be addressed by either applying FBE exclusions to the whole ContentMolecule, or pushing some of that decision down to the XBlock runtime layer.

So we'd have:

  1. Units are dynamically composable, flat lists of one or more ContentMolecules.
  2. An individual ContentMolecule is static.
  3. ContentMolecule definitions exist at the ContentPackage layer.
  4. All composition of different ContentMolecules into Units exists at the LearningContext level.
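To make the grouping rule concrete, here is a small sketch in which consecutive blocks that share an optional group value form one ContentMolecule and every ungrouped block is its own molecule. The field name `molecule_group` and the function `to_molecules` are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional


@dataclass(frozen=True)
class Atom:
    usage_id: str
    # Hypothetical optional field. None means "not grouped with anything".
    molecule_group: Optional[str] = None


def to_molecules(atoms: List[Atom]) -> List[List[Atom]]:
    """Split a flat, ordered list of atoms into molecules."""
    molecules = []
    # groupby only merges *consecutive* items with equal keys, which matches
    # the "group together consecutive blocks" rule.
    for group, run in groupby(atoms, key=lambda atom: atom.molecule_group):
        run = list(run)
        if group is None:
            # Backwards-compatible default: each block is its own molecule.
            molecules.extend([atom] for atom in run)
        else:
            molecules.append(run)
    return molecules


unit_atoms = [
    Atom("html-prompt", molecule_group="m1"),
    Atom("capa-problem", molecule_group="m1"),
    Atom("video"),
    Atom("html-summary"),
]
unit = to_molecules(unit_atoms)

# A Unit is a flat list of molecules, and flattening it recovers the atoms.
flattened = [atom for molecule in unit for atom in molecule]
```

The grouping adds no extra level of hierarchy to the stored blocks, so code that only understands the flat list keeps working.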

I'll poke at this some more shortly...

@ormsbee
Contributor Author

ormsbee commented Mar 13, 2022

😩 Version-awareness makes everything harder. 😞

@ashultz0

words for content atom/molecule

music words: phrase/verse

lego words: brick/model

if definitions live at the ContentPackage layer that makes the molecule a pack so we could do brick/pack instead which has the parallel sounds that are easy to remember together block, brick, pack

geometry words: point/line or point/shape

atom is actually pretty good for the lowest level, it's just molecule that is a problem. Maybe atom/aggregate or atom/subunit or atom/phrase to mix metaphors completely

reading what the new thing is it's kinda an Xblock but where an Xblock has to nest, this is on another axis... another axis from X makes it a YBlock ;)

@bradenmacdonald
Contributor

@ormsbee That makes a lot of sense, and I think it's a nice simplification that covers most use cases without introducing too much complexity. I actually don't mind the name ContentAtom/ContentMolecule; it seems pretty clear to me what you mean by it.

@ormsbee
Contributor Author

ormsbee commented Mar 15, 2022

Some follow on thoughts to the relationship between these:

Atoms are the analog to individual pieces of XBlock content. They need an identifier, title, and data. We will store things like user state and scores for things at this level of granularity, so we can't just hide it behind Molecules. They have a type, possibly major and minor types (e.g. (xblock, problem))

Molecules are logical groupings to aid composition. They have an identifier. They probably don't have titles, or if necessary derive a title from one of their children. We probably want to keep this layer as thin as possible. We shouldn't attach data to this where we also have to associate data with Atoms, as it will lead to further confusion.

So in this scenario, Molecules are a convenience grouping mechanism, but Unit composition still returns a list of Atoms. At an object level, I guess we'd want to allow you to call methods to iterate over either type when reading the contents of a Unit, but we'd probably default to Atoms. For the far more common read-only use case for Unit traversal, Atom traversal is going to do the right thing most of the time.

At the same time, Molecules would have to exist at the ContentPackage level, i.e. in the authoring vs. policy divide, they fall on the authoring side. Conceptually, a LearningContext can have many ContentPackages (one representing each library being used + one or more representing course-specific content). ContentPackages are an interesting tool because they potentially give us a path to more neatly bridge course-centric authoring and more library-centric models by giving them a sort of common denominator format.

Storing this is kind of weird. Storing Atoms is mostly straightforward. Storing Molecules almost seems a little silly when almost all of them are going to be single-Atom Molecules to start. It'd be tempting to use something implicit, except that Molecules are going to be an actual entity in our system that will be stored and manipulated in Unit composition, so we need a real data model backing it. The IDs here are almost certainly going to be UUIDs of some sort though.

But then we get to the problem of versioned storage...

@ormsbee
Contributor Author

ormsbee commented Mar 15, 2022

There are a number of issues that versioning raises with this scenario, but the biggest one that comes to mind is how entities in our system get multiplied out. We want stable references that survive a course publish–we don't want to invalidate all the score information or policy related metadata for a piece of content every time someone fixes a typo. At the same time, having references to the exact piece of content at the exact version would help for some scenarios (e.g. re-grading, rubric validation, course team debugging of student issues, etc.). But the more capable and flexible our content storage mechanism is, the more complex and confusing it becomes.

Identifiers like the UsageKey have to be assigned at the LearningContext layer. A course run could use the same problem from a content library in multiple places, and there's also no way to guarantee identifier uniqueness across multiple ContentPackages (courses may use dozens).

So taking all those together, we have:

Top level: LearningContext, LearningContextVersion, ContentPackage, ContentPackageVersion

A LearningContextVersion has multiple ContentPackageVersions (possibly even two different versions of the same ContentPackage).

A LearningContext will have many version-independent Blocks. These have a permanent UUID as well as an identifier like a UsageKey. This is the thing that systems external to publishing would make a foreign key to when they need to store things like student state that persists across new versions of the content.

A LearningContextVersion has BlockVersions, which are joins between versioned Atom data and versionless Blocks. BlockVersions are the thing you'd put a foreign key against when you want to capture the state of the content at some point in time, e.g. submission/completion/grading. The mapping between LearningContextVersion and BlockVersion determines what's "live" in that version.

Blocks don't go away, even when the content is deleted. That would be helpful for people writing code that builds off of course data–even if the content no longer exists, your app would still be able to find the last version of the content that corresponded to that Block.

Blocks aren't just Course concepts, but exist in Content Libraries as well, where they serve the same function.
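A sketch of how the Block / BlockVersion split might look as tables, using in-memory SQLite. All table, column, and function names are assumptions, the LearningContextVersion is reduced to a bare integer, and the versioned Atom data is reduced to a text column.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Version-independent: what student state would point at.
    CREATE TABLE block (
        uuid TEXT PRIMARY KEY,
        learning_context TEXT NOT NULL,
        usage_key TEXT NOT NULL,
        UNIQUE (learning_context, usage_key)
    );
    -- Join between a Block and one version of its data.
    CREATE TABLE block_version (
        id INTEGER PRIMARY KEY,
        block_uuid TEXT NOT NULL REFERENCES block(uuid),
        atom_data TEXT NOT NULL
    );
    -- Which BlockVersions are "live" in a LearningContextVersion.
    CREATE TABLE context_version_block (
        learning_context_version INTEGER NOT NULL,
        block_version_id INTEGER NOT NULL REFERENCES block_version(id),
        PRIMARY KEY (learning_context_version, block_version_id)
    );
    """
)


def add_block(learning_context, usage_key):
    block_uuid = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO block VALUES (?, ?, ?)",
        (block_uuid, learning_context, usage_key),
    )
    return block_uuid


def add_block_version(block_uuid, atom_data, context_versions):
    cursor = conn.execute(
        "INSERT INTO block_version (block_uuid, atom_data) VALUES (?, ?)",
        (block_uuid, atom_data),
    )
    block_version_id = cursor.lastrowid
    for context_version in context_versions:
        conn.execute(
            "INSERT INTO context_version_block VALUES (?, ?)",
            (context_version, block_version_id),
        )
    return block_version_id


def live_content(context_version):
    """Return {usage_key: atom_data} for one LearningContextVersion."""
    rows = conn.execute(
        "SELECT b.usage_key, bv.atom_data "
        "FROM context_version_block cvb "
        "JOIN block_version bv ON bv.id = cvb.block_version_id "
        "JOIN block b ON b.uuid = bv.block_uuid "
        "WHERE cvb.learning_context_version = ?",
        (context_version,),
    )
    return dict(rows)


def last_known_content(block_uuid):
    """Find the newest BlockVersion, even if the block is no longer live."""
    row = conn.execute(
        "SELECT atom_data FROM block_version "
        "WHERE block_uuid = ? ORDER BY id DESC LIMIT 1",
        (block_uuid,),
    ).fetchone()
    return row[0] if row else None


problem = add_block("course-v1:demo", "problem1")
add_block_version(problem, "2 + 2 = ?", context_versions=[1])
add_block_version(problem, "What is 2 + 2?", context_versions=[2])
# Context version 3 deletes the problem: no mapping row is written for it,
# but the Block and its BlockVersions stay behind.
```

Student state would take a foreign key to `block`, while a submission or grade that needs the exact content would take one to `block_version`.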

Next: How we'd model Units in this mix of versioned and un-versioned models...

@ormsbee
Contributor Author

ormsbee commented Apr 28, 2022

@bradenmacdonald: My thinking has been more data-oriented than interface-oriented at this level, probably because it's that aspect that's pained me the most with XBlock/ModuleStore. I'm imagining that pluggability would happen one layer above this, in models that have foreign keys to ContentObjects, but that ContentObjects themselves have only the basic primitives necessary for import/export and to be placeholders for composition.

That being said, could you please describe a concrete use case, and I can see how these models might work with it?

@bradenmacdonald
Contributor

@ormsbee OK, don't worry too much about what I said, it might be asking for more complexity than we need, and I don't think what you're describing precludes any use cases.

Perhaps there are some that it makes less efficient, like the idea of a remote learning content repository, or a remote blockstore - can remote objects be used without first importing them? Or is it good that they must be imported first?

I "felt" like I had other examples but on reflection I couldn't think of any good ones, so I'll let it go :p

@ormsbee
Contributor Author

ormsbee commented Apr 28, 2022

Perhaps there are some that it makes less efficient, like the idea of a remote learning content repository, or a remote blockstore - can remote objects be used without first importing them? Or is it good that they must be imported first?

I think some minimal set of metadata about the resource would have to be imported in some way or another, even if it's mostly pointers to where the thing actually lives. But the content itself wouldn't have to be.

@ormsbee
Contributor Author

ormsbee commented May 1, 2022

I've simplified the ContentAtom a bit, so that it only stores:

  • id
  • content_package_id
  • hash_digest (40 char hash digest of a blake2b hash)
  • mime_type
  • size (in bytes)
  • data (as binary blob)
  • created_at (UTC timestamp)

There is a unique constraint on (content_package_id, mime_type, hash_digest).

Some thoughts:

  1. Everything imported into this table must have a mime_type.
  2. XBlock OLX is given the MIME type of application/vnd.openedx.xblock.{block_type}+xml.
  3. Other apps are able to make foreign keys against this model.
  4. The only ways for insertions into this table to fail are (a) duplicate constraint or (b) size is too large. There is no deeper error checking of what it means for something to be "correct" at this level.
  5. Apps are built over this layer, with models that are foreign keyed to ContentAtom. If there are validation errors, those are created at that layer, but that happens after the raw data is imported.
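A sketch of this table and its two failure modes, using in-memory SQLite. The size cap `MAX_SIZE`, the function name `insert_atom`, and the choice of `digest_size=20` (which yields a 40-character hex digest) are assumptions; the column list and the unique constraint follow the description above.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

MAX_SIZE = 10_000_000  # hypothetical cap, in bytes

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE content_atom (
        id INTEGER PRIMARY KEY,
        content_package_id INTEGER NOT NULL,
        hash_digest TEXT NOT NULL,
        mime_type TEXT NOT NULL,
        size INTEGER NOT NULL,
        data BLOB NOT NULL,
        created_at TEXT NOT NULL,
        UNIQUE (content_package_id, mime_type, hash_digest)
    )
    """
)


def insert_atom(content_package_id, mime_type, data):
    """Insert raw bytes. Fails only on oversize data or a duplicate row."""
    if len(data) > MAX_SIZE:
        raise ValueError("data is too large")
    # A 20-byte blake2b digest renders as 40 hex characters.
    digest = hashlib.blake2b(data, digest_size=20).hexdigest()
    cursor = conn.execute(
        "INSERT INTO content_atom "
        "(content_package_id, hash_digest, mime_type, size, data, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            content_package_id,
            digest,
            mime_type,
            len(data),
            data,
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    return cursor.lastrowid


PROBLEM_MIME = "application/vnd.openedx.xblock.problem+xml"
first_id = insert_atom(1, PROBLEM_MIME, b"<problem/>")
# The same bytes in a different package are a separate row.
second_id = insert_atom(2, PROBLEM_MIME, b"<problem/>")
```

Inserting the same bytes again for the same package and MIME type raises `sqlite3.IntegrityError`, so a caller that wants deduplication would look the row up by digest first. Nothing here parses the OLX, which matches the idea that validation belongs to the apps layered on top.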

@ormsbee
Contributor Author

ormsbee commented May 2, 2022

Random idea: define abstract model classes that plugins and other apps can extend for key extension types, for instance having one that's 1:1 with ContentAtoms.

Misc. thought: Content lives longer than code. One of our pain points has been breaking the export process entirely when certain apps/XBlocks go away. The design should be done in a way that prevents this.

@ormsbee
Contributor Author

ormsbee commented May 5, 2022

Okay, I've been reworking this some more, and trying to focus on separation of tasks by layer, and where we expect apps to hang their own models off of. There are a few high level considerations:

  1. There is a ContentPackage/LearningContext boundary, where policy/settings lives in the LearningContext layer.
  2. ContentSegment groupings must live in the ContentPackage layer, so that they can be shared from libraries to courses.
  3. This design is focused mostly on providing versioning information to hang more sophisticated, app-specific data off of.

Raw Content Layer (in ContentPackage)

ContentAtom

  • belongs to a ContentPackage
  • Has bytes, a mime_type, a hash digest, size, creation datetime, etc.
  • No identity, no versions.

Shared Content Layer (in ContentPackage)

ContentObject

  • belongs to a ContentPackage
  • has an immutable UUID
  • has a "type"
  • has creation datetime
  • no other data
  • also no versioning
  • exists as the generic thing to implement versioning on top of

SimpleContentObject

  • OneToOneField to ContentObject, i.e. it is a type of ContentObject
  • foreign key to ContentAtom, so it describes a type of content that maps to a single ContentAtom
  • this is mostly an empty shell to hang other models off of–models that would describe metadata that is not available in the ContentAtom. Example: an ImageContentObject might have ALT text stored here, which is something that's not derivable from the raw image data available in the ContentAtom.

ContentSegment

  • OneToOneField to ContentObject, i.e. it is a type of ContentObject
  • has supporting child model that associates an ordered list of SimpleContentObjects with it
  • this is how we would represent content that should always be shown together, needed for content libraries

Versioned Content Layer (LearningContext + LearningContextVersion)

LearningItem

  • has mutable identifier + immutable UUID
  • long-lived, has boolean to indicate whether it's currently published, pointer to last published version, etc.
  • represents any instance of an addressable piece of content that has ever existed.
  • this helps to enforce identifier uniqueness within a LearningContext–every LearningContextVersion's contents is going to have a unique constraint on (learning_context_version_id, learning_item_id).

LearningItemVersion

  • has a foreign key to LearningItem, which is how it establishes its identifier (e.g. usage key equivalent)
  • has a foreign key to ContentObject, which is how it establishes its payload
  • has its own UUID
  • has a title

A few notes about this arrangement:

  1. I don't like the number of layers it has, because I think it could confuse people who want to extend these models, but I can't think of a way to get rid of any layers cleanly.
  2. ContentSegment in this case is actually movable to a separate composition layer.
  3. We'd have to set some rules and tracking around when publishing/version creation is fully baked, since apps would be building their own data on top of these models, and we'd have to basically trust that they won't mutate that data later.
  4. Cleanup (and not taking related models by surprise deletion) would be an issue we'd have to be careful about.
  5. Using OneToOneFields as the primary key for models may make it simpler to piece together things like different effective subclasses of ContentObject. For example, if there were a StaticAssetContentObject (there to associate which assets are associated with which SimpleContentObjects), and an ImageContentObject, then you could create an image that has the same primary key id in all three tables.
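The shared-primary-key idea from the last note can be sketched at the table level with in-memory SQLite. In Django this would roughly correspond to a `OneToOneField(..., primary_key=True)` on each extension model. The table and column names below are assumptions, and the sketch uses a base table plus two extension tables.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE content_object (
        id INTEGER PRIMARY KEY,
        uuid TEXT NOT NULL UNIQUE,
        type TEXT NOT NULL
    );
    -- Each extension table reuses the base row's id as its own primary key.
    CREATE TABLE simple_content_object (
        content_object_id INTEGER PRIMARY KEY REFERENCES content_object(id),
        content_atom_id INTEGER NOT NULL
    );
    CREATE TABLE image_content_object (
        content_object_id INTEGER PRIMARY KEY REFERENCES content_object(id),
        alt_text TEXT NOT NULL
    );
    """
)


def create_image(content_atom_id, alt_text):
    """Create one logical image spread across three tables with one id."""
    cursor = conn.execute(
        "INSERT INTO content_object (uuid, type) VALUES (?, ?)",
        (str(uuid.uuid4()), "image"),
    )
    object_id = cursor.lastrowid
    conn.execute(
        "INSERT INTO simple_content_object VALUES (?, ?)",
        (object_id, content_atom_id),
    )
    conn.execute(
        "INSERT INTO image_content_object VALUES (?, ?)",
        (object_id, alt_text),
    )
    return object_id


def load_image(object_id):
    """Join the three tables on the shared primary key."""
    return conn.execute(
        "SELECT co.type, sco.content_atom_id, ico.alt_text "
        "FROM content_object co "
        "JOIN simple_content_object sco ON sco.content_object_id = co.id "
        "JOIN image_content_object ico ON ico.content_object_id = co.id "
        "WHERE co.id = ?",
        (object_id,),
    ).fetchone()


image_id = create_image(content_atom_id=42, alt_text="A labelled cell diagram")
```

Because every extension row carries the same id, an app that only knows about `content_object` can still find, export, or clean up the row even if it has never heard of `image_content_object`.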

@feanil
Contributor

feanil commented May 5, 2022

Some questions as I try to understand this:

  • What are possible values for the "type" of the ContentObject?
  • The goal of SimpleContentObject is to be the parent class for any object types that want to add more data on top of the base object in a way that's easy to query and not just in the blob of the ContentAtom, is that right?
    • And SimpleContentObject would also serve as an active table that keeps track of essentially the leaf nodes of content. ie. If it's in this table, it's a leaf node is a guarantee we can make.
  • For ContentSegment, we're limiting this to just SimpleContentObject to ensure that it's only a list of leaf nodes?

@kdmccormick
Member

I'd thought I saw a potential issue with this schema, but @feanil and I talked through it and I believe resolved it. @ormsbee, let me know if this jibes with your thinking.

  • Priors
    • A ContentObject can be a ContentSegment or SimpleContentObject.
    • ContentSegments themselves are composed of SimpleContentObjects.
    • If a ContentObject is part of a ContentSegment, then that individual ContentObject is not intended for re-use. Instead, the ContentSegment is the reusable piece.
      • Contrapositively, if a ContentObject is not part of a ContentSegment, then that individual ContentObject is intended for re-use.
  • Now, imagine: Within ContentPackage X, we have problem P and videos V1 and V2. We have one ContentSegment S which contains P and V1.
  • Issue:
    • The reusable objects in X include S and V2 (but do not include P and V1), which is easy for us to reason about. But, determining this for any arbitrary ContentPackage would involve a pretty complicated database query.
  • Resolution?:
    • Since repositories of reusable items (such as ContentLibraries) would actually be LearningContexts (that is, not just raw ContentPackages), some ContentLibrary L containing X would have a list of reusable LearningItems, which would include S and V2.
    • Thus, it does not matter that a ContentPackage's reusable items are difficult to query for, because the containing ContentLibrary/LearningContext would expose that information.

@kdmccormick
Member

And, on another note, I want to record a suggestion that @feanil mentioned in our meeting (@Carlos-Muniz @ormsbee):

  • Merge the SimpleContentObject and ContentAtom layers (we'll just call it "ContentAtom" for now)
  • So, "subtypes" of ContentObjects would include:
    • ContentSegment
    • ContentAtom
      • ImageAtom
      • VideoAtom
      • etc...

where <X>Atom includes a hash, content type, contents, and metadata.

@ormsbee
Contributor Author

ormsbee commented May 5, 2022

Will reply to all of these later, but a quick one:

  • Merge the SimpleContentObject and ContentAtom layers (we'll just call it "ContentAtom" for now)
  • So, "subtypes" of ContentObjects would include:
  • ContentSegment
  • ContentAtom
    • ImageAtom
    • VideoAtom
    • etc...

where <X>Atom includes a hash, content type, contents, and metadata.

I agree with the idea that people shouldn't make foreign keys to ContentAtom, and that it should be to something like the SimpleContentObject instead. But there were a couple of reasons I wanted the separation between ContentAtom (raw data) and SimpleContentObject (metadata):

Storage Optimization

Metadata changes should be cheap. If we have a large image and change the ALT description text a few times, we shouldn't be making whole copies of the image data. Putting them together in the same table would require a whole new copy of the image data in that situation.
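A toy illustration of the storage argument, with plain Python containers standing in for the two tables. The names `atoms`, `image_metadata`, `store_atom`, and `set_alt_text` are made up for this sketch.

```python
import hashlib

atoms = {}           # digest -> raw bytes (stands in for the ContentAtom table)
image_metadata = []  # (digest, alt_text) rows (stands in for the metadata table)


def store_atom(data):
    """Store raw bytes once per distinct content and return the digest."""
    digest = hashlib.blake2b(data, digest_size=20).hexdigest()
    atoms.setdefault(digest, data)
    return digest


def set_alt_text(digest, alt_text):
    """Record a new metadata row that points at existing raw data."""
    image_metadata.append((digest, alt_text))


image_bytes = b"\x89PNG" + bytes(1_000_000)  # stand-in for a large image
digest = store_atom(image_bytes)

for text in ("A cell", "A plant cell", "A labelled plant cell diagram"):
    set_alt_text(digest, text)

stored_bytes = sum(len(data) for data in atoms.values())
```

Three ALT text edits produce three small metadata rows while `stored_bytes` stays at the size of a single copy of the image.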

Content Outlives Applications

This design is leaning heavily on supplemental apps to provide rich data models. But at some point, new apps are going to replace them and want to migrate content data over. At some point, apps are going to be removed entirely, and their tables will be dropped. What can we meaningfully export from an old course after that has happened?

So the hope I have in these models is that we can have something that is grounded by a core model that will last over time, always be exportable, and will have the minimal data needed for a thing "of that type", while at the same time being supplemented by richer data.


(Edit: there's actually another aspect on error handling, dirty data, and async processes, which I'll try to write up after I pick my daughter up from school.)

@kdmccormick
Member

All great points. You have me convinced on this point alone:

This design is leaning heavily on supplemental apps to provide rich data models. But at some point, new apps are going to replace them and want to migrate content data over. At some point, apps are going to be removed entirely, and their tables will be dropped. What can we meaningfully export from an old course after that has happened?

I would love if we could store our images, videos, etc in a dead-simple way so that we never have to migrate it again.

Just for the purpose of distinguishing it from a ContentObject, would it be fair to think of a ContentAtom along the lines of a "file"? For example: pixel data and dimensions are things that usually live in an image file, and thus go in the ImageAtom; alt text is externally-applied metadata, and thus would exist at the ContentObject level. (I don't know whether this metaphor holds for other content types, I'm struggling a bit to think of examples)

@feanil
Contributor

feanil commented May 6, 2022

Assumption check: While trying to work through how this would work with videos, I realized that probably the blob in the content atom is actually going to need to use something like the FileField(backed by whatever storage mechanism you want) so that we don't have to store large amounts of data in the database.

In which case, making the ContentAtom be akin to a file makes sense to me. We want to have the files be immutable but then have the metadata related to files(in the case of video, title, language, etc) be easy to change even if you're not modifying the video.

Going down this line of reasoning a bit further, is the reason that you're thinking we might want multiple types of content atoms(e.g. VideoContentAtom) that we may want to extract some of the data in the atom for ease of querying? I'm imagining things like the length of a video which is going to be part of the video data but that we might want to be able to quickly access without re-analyzing the whole video.

@ormsbee
Contributor Author

ormsbee commented May 6, 2022

Assumption check: While trying to work through how this would work with videos, I realized that probably the blob in the content atom is actually going to need to use something like the FileField(backed by whatever storage mechanism you want) so that we don't have to store large amounts of data in the database.

Agreed. Unfortunately. Video's always the edge case that breaks everything by being 1000X the size of everything else. 😞

Going down this line of reasoning a bit further, is the reason that you're thinking we might want multiple types of content atoms(e.g. VideoContentAtom) that we may want to extract some of the data in the atom for ease of querying? I'm imagining things like the length of a video which is going to be part of the video data but that we might want to be able to quickly access without re-analyzing the whole video.

Originally, yes. I thought that it made sense to have a model for metadata that was intrinsic to the data of the raw bytes themselves, separate from other metadata about said bytes (like the author name, or a short description). But after our conversation yesterday, I came around to the idea that small levels of duplication of metadata are fine, and that it's preferable to do that than to confuse developers by giving them two places to hang this kind of data off of (which I think was one of the points you and @kdmccormick were making?).

In the case of a video, that might mean that a VideoContentObject would hold both the length (intrinsic to the file itself) + a short description (layered metadata).

I think that extension at the ContentAtom layer then makes sense primarily when data requires significant transformation. This may be on a small scale, like stripping out all the policy/due date information, or removing deprecated fields. Or it may be on a large scale, if we're converting formats entirely–like QTI to Capa, or OLX into some more optimized representation. In that case, the original content-as-imported might stay in the ContentAtom, but a new derived ProblemBlockAtom might store the transformed version.

Granted, a lot of this conversion can (and probably should) be done before it gets published into the LMS at all. But having this side-loadable Atom gives other apps the opportunity to process and make derived content as necessary, which I think will be useful for querying, experimentation, and data migration. ContentAtoms aren't "owned" by any app, really. Any app can create a ContentObject that points to one, and any app can make a MySpecialAtom that derives its data from a ContentAtom.

@ormsbee
Contributor Author

ormsbee commented May 7, 2022

Random testing thought: In addition to having standard abstract models to inherit from for certain needs (e.g. something with the right way to subclass and query ContentObject), it would be good to have some standard test classes that apps could use, that would run common scenarios like, "Hey, what happens when the LearningContext gets deleted?"

@ormsbee
Contributor Author

ormsbee commented May 9, 2022

Okay, I've done some prototype hacking, and I think this is feasible. I'll make a new ADR PR for data model and post the link here shortly. It'll cover the basic high level concepts/models for block and unit level content types, and the extensibility story.

Follow-on ADRs would cover error handling and potentially a more efficient versioning representation. That last one is tricky because I can think of a model that would do it, but it would make it more difficult to have the database enforce certain correctness constraints.

@ormsbee
Contributor Author

ormsbee commented May 23, 2022

As I was writing up the ADR for the data modeling basics, I came to the conclusion that the ContentPackage/LearningContext distinction was more trouble than it's worth, especially as more and more things were moved up to LearningContext anyway (e.g. Unit/Sequence composition). I think that what we really got from the split of those concepts were a couple of ideas:

  1. Each LearningContext should be able to borrow from other LearningContexts.
  2. Each LearningContext controls its own namespace of identifiers, so it needs to copy certain things (e.g. versioned metadata about a block), while it can still reference some things directly (e.g. the raw data).

But having the ContentPackage/LearningContext split was giving some painful duplication, and I ran into particular issues in figuring out how to model Segments in a way that tied it to the ContentPackage.

So with Content/Learning prefixes gone, I think the models simplify out to something like this:

[Diagram: simplified_models]

I've also renamed the primitive, versionless content-package layer classes from ContentObject/ContentAtom to ItemInfo/ItemRaw, which brings them more into line with the Item/ItemVersion terminology used elsewhere.

@ormsbee
Contributor Author

ormsbee commented May 23, 2022

Side Note on Re-Use: To facilitate easier querying in the future, some of the models should have nullable fields that can point to the original thing they're copying from. So if my ItemVersion was a copy of a Library's ItemVersion, it should have a fkey to that Library's ItemVersion.

At the same time, if we want to be resilient to content deletion, we can't rely on the fkey to always exist.
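
A minimal sketch of that combination (nullable pointer to the original, tolerant of the original being deleted), using SQLite from the Python standard library so it runs on its own. The table and column names (`item_version`, `copied_from_id`, `title`) are assumptions, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE item_version (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    -- Nullable pointer to the version this row was copied from.
    -- If the original is deleted, the pointer is cleared, not the copy.
    copied_from_id INTEGER NULL
        REFERENCES item_version(id) ON DELETE SET NULL
);
""")

# A library's version, and a course's copy that remembers where it came from.
conn.execute("INSERT INTO item_version (id, title) VALUES (1, 'Library problem')")
conn.execute(
    "INSERT INTO item_version (id, title, copied_from_id) VALUES (2, 'Course copy', 1)"
)

# While the original exists, "where did this come from?" is a cheap lookup.
before = conn.execute(
    "SELECT copied_from_id FROM item_version WHERE id = 2"
).fetchone()[0]

# Deleting the library version leaves the course copy intact.
conn.execute("DELETE FROM item_version WHERE id = 1")
after = conn.execute(
    "SELECT title, copied_from_id FROM item_version WHERE id = 2"
).fetchone()
```

Here `before` is `1` and `after` is `('Course copy', None)`: the copy survives, and any code reading `copied_from_id` has to treat `NULL` as "origin unknown or gone".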

@ormsbee
Contributor Author

ormsbee commented May 27, 2022

I'm going to close this for now, but may re-open if there is further discussion.

@ormsbee ormsbee closed this as completed May 27, 2022
Repository owner moved this from In Progress to Done in Axim Engineering Tasks May 27, 2022
@ormsbee ormsbee reopened this Jul 9, 2022
Repository owner moved this from Done to In Progress in Axim Engineering Tasks Jul 9, 2022
@ormsbee
Contributor Author

ormsbee commented Jul 9, 2022

Okay, one adjustment that I'm making to this is to allow the association of multiple ItemRaw objects with a given ItemVersion. I think that will give some flexibility around associating multiple assets (e.g. for Video data + transcripts, problems with graders, etc.) without adding too much complexity.

@ormsbee
Contributor Author

ormsbee commented Jul 18, 2022

Two more tweaks:

ItemRaw is now Content

This makes the hierarchy look like:

graph TD;
    CI[Item]-->CIV[ItemVersion]
    CIV[ItemVersion]-->CC1[Content]

It also gives the M:M through-model between ItemVersion and Content the clearer sounding name of ItemVersionContent.
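
To make the hierarchy and the through-model concrete, here is a sketch in SQLite via Python's standard library. All column names (`identifier`, `version_num`, `media_type`, `data`) are assumptions for illustration. It shows one video Item with two versions: both versions share the same video Content row, and only the transcript Content differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE item (
    id INTEGER PRIMARY KEY,
    identifier TEXT NOT NULL UNIQUE
);
CREATE TABLE item_version (
    id INTEGER PRIMARY KEY,
    item_id INTEGER NOT NULL REFERENCES item(id),
    version_num INTEGER NOT NULL,
    UNIQUE (item_id, version_num)
);
CREATE TABLE content (
    id INTEGER PRIMARY KEY,
    media_type TEXT NOT NULL,
    data BLOB NOT NULL
);
-- M:M through-model between ItemVersion and Content.
CREATE TABLE item_version_content (
    item_version_id INTEGER NOT NULL REFERENCES item_version(id),
    content_id INTEGER NOT NULL REFERENCES content(id),
    PRIMARY KEY (item_version_id, content_id)
);
""")

conn.execute("INSERT INTO item (id, identifier) VALUES (1, 'intro-video')")
conn.executemany(
    "INSERT INTO item_version (id, item_id, version_num) VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 1, 2)],
)
conn.executemany(
    "INSERT INTO content (id, media_type, data) VALUES (?, ?, ?)",
    [
        (1, "video/mp4", b"<video bytes>"),
        (2, "text/vtt", b"WEBVTT (first transcript)"),
        (3, "text/vtt", b"WEBVTT (corrected transcript)"),
    ],
)
# Version 1 uses the video plus the first transcript; version 2 keeps the
# same video row and swaps in the corrected transcript.
conn.executemany(
    "INSERT INTO item_version_content (item_version_id, content_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 3)],
)


def content_ids_for(item_version_id):
    """Return the ids of all Content rows attached to one ItemVersion."""
    rows = conn.execute(
        "SELECT content_id FROM item_version_content "
        "WHERE item_version_id = ? ORDER BY content_id",
        (item_version_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Fixing the transcript costs one new `content` row and two new through-table rows; the video bytes are stored once and referenced by both versions.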

Mapping Data to ItemVersion

I was creating models for serving static assets, and in particular I was thinking about images vs. other downloadable assets like zip files or PDFs. There are certain commonalities, like download permissions. But there is a lot of variation as well–images get alt text, resolutions, possibly multiple files for different sizing, etc.

One option is to make the relationship hierarchical by having a StaticAssetVersion model and then hanging yet more models off of that one. But one of the frustrating things about that arrangement is that we'd be repeating ourselves a lot. Say something image-specific gets changed: all the generic static asset metadata gets re-generated each time (a new ItemVersion means a new OneToOne StaticAssetVersion, plus the ImageAssetVersion that actually changed). Extrapolate that across anything that might ever want to attach any kind of metadata, and it could be a mess.

Instead, I'm now thinking about treating these different aspects of the data (Downloadable, Image, etc.) as separate data models that are mapped to ItemVersions via a 1:M table that is locked 1:1 on the item_version_id. So:

graph LR
subgraph itemstore
  ItemVersion
end
subgraph staticassets
  direction TB
  Image
  DownloadableAsset
  subgraph "(autogenerated)"
    direction TB
    ItemVersionImage
    ItemVersionDownloadableAsset
  end
end
subgraph staticassets
  ItemVersionImage --"M:1"--> Image
  ItemVersionImage --"1:1"--> ItemVersion
  ItemVersionDownloadableAsset--"M:1"--> DownloadableAsset
  ItemVersionDownloadableAsset --"1:1"--> ItemVersion
end

Things I think are promising about this approach:

  1. We keep the ability to do the queries somewhat efficiently since the tables are 1:1 to the ItemVersion.
  2. Making changes to one aspect (e.g. grading policy) doesn't force all the other aspects to get re-written–only the mapping table row entries.
  3. If we can get that join table to be auto-created somehow (some mixin?), then the app developers might be insulated from more of the versioning machinery. In the above scenario, the app creator is dealing more with "this is image data", "this is generic downloadable asset data", and less with versioning directly. I feel like the mindset is more "this is data that I'm associating with this version of the content" and less "this is my versioned subclass of a generic versioned item".
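
A sketch of what the mapping-table arrangement in the diagram might look like as a schema, again using SQLite from the Python standard library. The aspect columns (`alt_text`, `is_public`) are assumptions for illustration, and the mapping tables are written by hand here rather than auto-generated. Making `item_version_id` the primary key of each mapping table is what locks it 1:1 to the ItemVersion, while many mapping rows may point at the same aspect row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE item_version (id INTEGER PRIMARY KEY);

-- Aspect tables: plain, unversioned data owned by the staticassets app.
CREATE TABLE image (id INTEGER PRIMARY KEY, alt_text TEXT NOT NULL);
CREATE TABLE downloadable_asset (id INTEGER PRIMARY KEY, is_public INTEGER NOT NULL);

-- Mapping tables: 1:1 with ItemVersion, M:1 toward the aspect row.
CREATE TABLE item_version_image (
    item_version_id INTEGER PRIMARY KEY REFERENCES item_version(id),
    image_id INTEGER NOT NULL REFERENCES image(id)
);
CREATE TABLE item_version_downloadable_asset (
    item_version_id INTEGER PRIMARY KEY REFERENCES item_version(id),
    downloadable_asset_id INTEGER NOT NULL REFERENCES downloadable_asset(id)
);
""")

# Version 1: an image that is also a public download.
conn.execute("INSERT INTO item_version (id) VALUES (1)")
conn.execute("INSERT INTO image (id, alt_text) VALUES (1, 'A chart')")
conn.execute("INSERT INTO downloadable_asset (id, is_public) VALUES (1, 1)")
conn.execute("INSERT INTO item_version_image VALUES (1, 1)")
conn.execute("INSERT INTO item_version_downloadable_asset VALUES (1, 1)")

# Version 2: only the alt text changed. The image aspect gets a new row;
# the download aspect is reused by writing just a new mapping row.
conn.execute("INSERT INTO item_version (id) VALUES (2)")
conn.execute("INSERT INTO image (id, alt_text) VALUES (2, 'A bar chart of enrollments')")
conn.execute("INSERT INTO item_version_image VALUES (2, 2)")
conn.execute("INSERT INTO item_version_downloadable_asset VALUES (2, 1)")


def aspects_for(item_version_id):
    """Fetch both aspects for one ItemVersion in a single joined query."""
    return conn.execute(
        """
        SELECT image.alt_text, downloadable_asset.is_public
        FROM item_version
        JOIN item_version_image AS ivi ON ivi.item_version_id = item_version.id
        JOIN image ON image.id = ivi.image_id
        JOIN item_version_downloadable_asset AS ivd
            ON ivd.item_version_id = item_version.id
        JOIN downloadable_asset ON downloadable_asset.id = ivd.downloadable_asset_id
        WHERE item_version.id = ?
        """,
        (item_version_id,),
    ).fetchone()
```

After both versions are written, `downloadable_asset` still holds a single row shared by versions 1 and 2, and attempting to attach a second image mapping to the same version fails with `sqlite3.IntegrityError` because of the primary key on `item_version_id`.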
