Determine LMS Content Data Model Design #1
Okay, I think I'm overloading LearningContext with concepts of content ownership and student state association, and I'm not sure how to reconcile that cleanly. I feel like there's something missing that I don't have the right vocabulary for. Some rambling thoughts:
So it's almost like there's a PublishingContext (each library, the course, some part of a course, some part of a pathway, etc.), which is a group of content that can be versioned and published together. And then there's a LearningContextVersion, which has one or more PublishingContextVersions batched together. So maybe Learning Contexts have:
That could match up to a Course Run in the LMS. But is that over-complicating things to introduce a distinction between a publishing context and a learning one? If we want to shift how courses are authored to something more granular, it might be worthwhile.
FYI @kdmccormick, @feanil, @Carlos-Muniz, @arbrandes, @bradenmacdonald, @doctoryes, @saksham115, @jristau1984: Some thoughts I've been having lately about data modeling content in the LMS, with possible implications for content libraries v2 and unit composition in the LMS. It's not very well organized, and no action is required–there's a lot to unpack here that's beyond the current scope of BD-14. I just wanted to be transparent about where my head was at.
Thinking on this a bit more, this might be a viable way to model something like CCX, where the LearningContext becomes the CCX course key, and it has the PublishingContext for the base course. Policy things like dates would be joins between LearningContexts and PublishingContexts. This would also allow for how we model default behavior in libraries. The default values become the policy tables that join the library's LearningContext and the library's PublishingContext. The overrides that are specific to courses using the library become joins between the course's LearningContext and the library's PublishingContext.
That's actually pretty exciting to me, because it gives us a place to more cleanly apply policy-level separation in a way that decouples content authoring and administration–and does so in a way that unifies CCX and libraries override handling.
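A rough Django sketch of that policy-join idea might look like this (all model and field names here are hypothetical placeholders, not settled names):

```python
from django.db import models

class LearningContext(models.Model):
    key = models.CharField(max_length=255, unique=True)

class PublishingContext(models.Model):
    key = models.CharField(max_length=255, unique=True)

class DatePolicy(models.Model):
    # Library defaults: the library's LearningContext joined to its own
    # PublishingContext. Course-specific overrides: the course's
    # LearningContext joined to the library's PublishingContext.
    learning_context = models.ForeignKey(LearningContext, on_delete=models.CASCADE)
    publishing_context = models.ForeignKey(PublishingContext, on_delete=models.CASCADE)
    due = models.DateTimeField(null=True, blank=True)

    class Meta:
        unique_together = [("learning_context", "publishing_context")]
```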
I'm not going to pretend I have the time to dive into this in the near future, but I certainly like that you're giving it serious thought, Dave. :) Nevertheless, here's a use case I want to bring up for consideration.
This is because Studio never offered a wiki-like history (including grouping changes, who made them, etc.) that a course team could navigate. I'm positive authors would love this: I know because I went as far as versioning OLX in git just so I could have all this stuff. I've seen this pop up in the forum on occasion, and if memory serves, MIT used to do this a lot. I realize what you're discussing here is at a deeper level than exposing diffs on a frontend. But, again, it might be a use case worth considering as you model the data.
@arbrandes: Right. I guess it's more accurate to say that historical course data had very little use in the LMS. The main use there would be for fast rollback. Studio could benefit from it, though we didn't manage to prioritize any of that work at edX.
In this view of the world, we'd conceptually have:

- LearningObject
- PublishingContext
- LearningContext
- Relationship between LearningContexts and PublishingContexts

Next question: How is policy (e.g. defaults, scheduling, partitioning) stored against these data models?
@ormsbee I found your latest comment (the one directly above the one I'm typing) to be illuminating. Your distinction between LearningContext and PublishingContext is something that I've wondered about but hadn't been able to crystallize. I agree with the overall direction you are taking in this app. I haven't taken a deep look at the other two apps yet but will soon.
Yeah :/ I was really hoping we could hold onto "LMS doesn't do versions" but I admit I don't have any robust solutions for the pitfalls you listed. Now some more in-the-weeds reactions/ramblings:
Hm, this sounds complicated. I had always thought that student state would remain isolated between LearningContexts, and by consequence, Content Libraries would carry no student state other than maybe author-preview state in Studio. I guess I view Content Libraries as a more generalized type of LearningContext, one that should hold policy but not state. Would that function as a helpful or even sensical simplifying assumption? Would it lock Content Libraries out of an interesting use case? Going another direction, what if policy itself were a LearningObject within the PublishingContext? At that point, could we say that a Content Library is a PublishingContext but not a LearningContext in the eyes of the LMS?
I know you said the names aren't final. I'm still going to poke at them because it helps me think about the architecture, and also I can't help myself :)
So, maybe:
keeping in mind that these models are all namespaced under …
@kdmccormick: I'm going to chew on the first half of your reactions for a while, but w.r.t. the naming suggestions:
FWIW, in case it wasn't clear, my naming attempt was a desperate cry for help. 😛 Thank you for suggesting new ones.
I had originally dabbled with the idea of intentionally using LO since some of them will be LOs in the general industry sense, but I like your framing of ContentObject better. I would still prefer ContentObjectVersion over VersionedContentObject because to me "VersionedContentObject" implies that there is a separate model with versions–i.e. I should be able to call … on it.

I also like ContentPackage as a way to not use the word "context" so much, and I do think it's simpler and clearer.

I'm in the middle of writing a model for encoding Units, and I'll try to do so in a way that incorporates some of this new language and addresses your other questions/suggestions above. Thank you for looking at this!
Implementing Unit Composition

Let's imagine what Unit Composition would look like if it was built on top of a data model like this. In this framing, all the ContentObject and ContentPackage stuff mentioned above would be part of the …

Idea: ContentPackages represent raw pools of ContentObjects, and composition of those COs only exists within a LearningContext.

This sounds a bit weird, since moving a Block from one Unit to another should surely count as a change in content. But a few thoughts I had around this:
These entities live at the LearningContext level:

- Model that joins LearningContextVersion and ContentObjectVersion: "Block"?
- Unit
- UnitBlocks
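A minimal sketch of how those three entities might look as Django models (field details are assumptions, not settled design):

```python
from django.db import models

class Block(models.Model):
    # The join of a LearningContextVersion and a ContentObjectVersion.
    learning_context_version = models.ForeignKey(
        "LearningContextVersion", on_delete=models.CASCADE)
    content_object_version = models.ForeignKey(
        "ContentObjectVersion", on_delete=models.CASCADE)

class Unit(models.Model):
    learning_context_version = models.ForeignKey(
        "LearningContextVersion", on_delete=models.CASCADE)

class UnitBlocks(models.Model):
    # Ordered membership of Blocks within a Unit.
    unit = models.ForeignKey(Unit, on_delete=models.CASCADE)
    block = models.ForeignKey(Block, on_delete=models.CASCADE)
    order = models.PositiveIntegerField()

    class Meta:
        ordering = ["order"]
```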
There's a weird sort of language flip with these names, in that we're taking things that are versioned but talking about them without "Version" in the name.

Edge Case: Nested blocks in Units

Units are flat lists. There are two scenarios I know of in which we have nesting within units right now:
Right now, when you make a SplitTestBlock with two things in it, you’re actually ending up with a hierarchy in ModuleStore that looks like this:
So the hierarchy looks like: Course -> Section -> Sequence -> Vertical -> SplitTest -> Vertical -> ProblemBlock. We've had a number of bugs along the lines of “nobody actually thinks this kind of structure is a thing on our platform", because they don't realize that extra level of hierarchy is allowed to exist. But this kind of nesting for SplitTest only exists to generate what is a conceptually flat list–what we’re really trying to do is render Problem 1a+2a for one group and Problem 1b+2b for another group. The nesting is just because that’s how XBlocks work, and implementing them as children lets the SplitTestBlock control which one to show when rendering. But the mechanism SplitTestBlock uses underneath is user partitioning. So Studio can keep this view of the world, but there’s no reason we need to encode it like that for the LMS. The LMS can have a version of the data that just looks like:
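A hedged sketch of that flat shape, with hypothetical identifiers (the point is just that there are no nested verticals, only Blocks tagged with the user partition group that should see them):

```python
unit_blocks = [
    {"block": "problem_1a", "partition_group": "group_a"},
    {"block": "problem_2a", "partition_group": "group_a"},
    {"block": "problem_1b", "partition_group": "group_b"},
    {"block": "problem_2b", "partition_group": "group_b"},
]
```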
Okay, I've rambled a lot here, but I'm going to actually try to code this up in models and see how it holds together.
Side note: Explicit branches (draft vs. published) vs. having entirely different LearningContexts to encode drafts for previewing (or even sending explicit versions during preview)?
Latest wrinkle: How important is it to be able to change the identifier for a piece of content over time? In other words, can I change the UsageKey for something and still keep student state/history associated with it? The data model implication, if we wanted that kind of capability, would be:
It's a bit wasteful of space, since the identifiers are getting repeated with every revision, though it's not that big a deal–we'd use 8 bytes for the foreign key anyway, and identifiers would compress well if we use the right row type. The bigger thing is how intuitive it is, and how awkward it would be to work with. Certain queries get potentially much more expensive, like "did this identifier ever exist in this course?". There's also potential for more confusion between when you're working with this abstract, version-less Block, and a concrete, version-of-a-Block.
Another weird wrinkle with having an abstract, versionless Block in this scenario is that it makes it harder to enforce uniqueness of UsageKey-style identifiers within a LearningContextVersion (since the identifiers would be in the versioned-block table and the f-key to LearningContextVersion would be in the versionless-block table).
Additional wrinkle: Does it make sense to allow for multiple identifiers for a piece of content? This might be too obscure a use case, but we've had requests in the past for something like this. The scenario was that a university had their own content library, and that content was converted into capa problems (among other formats). Those problems had UsageKeys, but they also wanted to associate them with their original content IDs when they analyzed the usage data later. This could be done with some sort of namespaced identifier scheme, e.g. …
Good stuff. Some clarifying questions and reactions, although I'm still chewing on this:
You said earlier that ContentObjects include Sequences (and I assume Sections too?). Under your idea, the hierarchy of those Sections and Sequences only exists within the LearningContext, right? This would make sense to me, given that in learning_sequences, the LearningSequence model has a foreign key to LearningContext. (I am assuming that learning_sequences will be the basis for openedx-learning's modeling of the "above-the-unit" part of the hierarchy--but please correct me if I'm wrong on that.)
Okay, so conceptually COs are composed within a LearningContext, but in the data model COVersions are actually composed within LearningContextVersions, right? That is, there's nothing in the schema tying version-agnostic COs and LearningContexts directly?
Hah! This makes me smile. It almost feels like the completion of a long derivation of a cool math equation, where we've abruptly descended from dizzying abstractness by simplifying (LearningContextVersion X ContentObjectVersion) into the familiar Block. Part of me thinks it'd be better to use a new term to avoid ambiguity with XBlock. The other part of me thinks that the ideas are close enough that introducing a new term would be even worse, and instead we should hope that this new meaning of "Block" naturally meshes with folks' current understanding of XBlock and the usage of …
This calls back to my previous question about whether Sections, Sequences and Units are themselves ContentObjects. If so, then it sounds like we can't assume that Block is a component. It feels weird that we could make a …
Did you omit some text here?
All of this sounds right to me 👍🏻 Good writeup of the issue.
Oh, one more thought: UsageKey is conceptually defined as (DefinitionKey X LearningContextKey). I see this as parallel with your idea that Block is defined as (ContentObjectVersion X LearningContextVersion). More generally, Usage = Definition X Context. (Ooh, maybe another name for ContentObject is "ContentDefinition"?) Does that jibe with your thinking? If so, do you think each ContentObject would have a DefinitionKey? Then, a Block's UsageKey would be formed by combining the Block's ContentObject's key and the Block's LearningContext's key.
Years ago I had a versioning system in Django with two IDs, so each object had its own UUID and the UUID of its lineage. There were some weird wrinkles because I wanted the "latest" link to be stable: the latest always had the same UUID for both, historical versions were spun off with new IDs as they were archived, and you could make a stable view of the system by linking together historical versions.
Yeah, I just meant that the language flips around, where "Block" is a versioned thing, while in all the other definitions we call out versioned entities with "Version". I think it's okay because of how commonly it's going to be used–Blocks are likely the thing you interact with at higher layers when you don't care about all this versioning stuff at all but just want to have information about the course as it is–but I wanted to call it out.
It's funny, that would take us back to ModuleStore terminology of definitions, which might not be so bad. I was worried that it might be confusing because we wouldn't necessarily be stuffing things that are definition-scoped in there–more like "all the things that aren't already covered by a specialized system like grading or scheduling", which is conceptually similar but doesn't match up in code.

@ashultz0: I think I get what you mean. We do something similar for structure documents in SplitMongo, to be able to more easily determine shared ancestry between two structures. The wrinkle with this system is that in addition to history, it's also trying to represent a conceptual split between something as it exists (and evolves) in a library vs. its usage in a particular course.

@kdmccormick: I wonder if we could do some optimization based on the fact that a branch's history will be very linear. I know we've discussed this bit about having uniquely identified versions with hashing, and I still believe that has value. But one of the things I keep coming back to is that we currently pay an enormously high cost for very minor changes. Even this schema proposal would rewrite a join table of published blocks for a learning context that may have thousands of entries for each version. But we might be able to make a schema that encodes things much more efficiently by having publish version ranges rather than specific versions. So something like:

```python
class Block:
    uuid                     # auto-generated
    identifier               # UsageKey would go here
    learning_context_branch  # some foreign key?
    start_version_num
    end_version_num          # null or maybe maxint to indicate what's published now
    content_object           # foreign key
```

I don't know if the extra complexity is worth it. It makes both querying and cleanup more complicated, but it may remove the need for a lot of that cleanup in the first place. I don't think it's as intuitively obvious as a schema that maps new blocks for each version, but it would make new additions really cheap, because we'd only be making new entries for the things that change. Getting the current state of a branch means looking for all the things where `end_version_num` is null (or maxint).
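A sketch of the range queries that implies, treating the pseudocode above as a real Django model (assuming null `end_version_num` means "still published"):

```python
from django.db.models import Q

def blocks_at_version(branch, version_num):
    """Blocks live in `branch` as of `version_num` (hedged sketch)."""
    return Block.objects.filter(
        learning_context_branch=branch,
        start_version_num__lte=version_num,
    ).filter(
        Q(end_version_num__isnull=True) | Q(end_version_num__gte=version_num)
    )

def current_blocks(branch):
    """Currently published content: only the open-ended ranges."""
    return Block.objects.filter(
        learning_context_branch=branch, end_version_num__isnull=True)
```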
More thoughts after a night's rest:
Late to the party. I've been asked to think about what we need to do to back courses with Blockstore, so I'm trying to catch up on this and adding some thoughts.
I do feel that that should be controlled at some other level, not via XBlocks.
While it's true that this is the architecture of Problem Builder, I don't think it's a great architecture, nor does it justify adding a ton of implementation complexity in order to support one specific XBlock. We implemented Problem Builder originally without support for XBlock children, then we converted it to use XBlock children assuming that other blocks would do similar things, but in the end it's only really Problem Builder that did this. Though it would be a lot of work, Problem Builder could be re-architected to use "fake" XBlock children again as it originally did, essentially managing and rendering children within the XBlock code rather than relying on the platform to do so. This would probably end up with a much nicer authoring experience than currently exists. However, it would be a ton of work.
I like this idea a lot, because...
Indeed. This was a giant pain when implementing content libraries on top of blockstore. In fact, dealing with different composition/IDs at different levels of the system was the only part of that implementation that I have bad memories of, and it also produced the part of the code that I'm least happy with and that I think is the most confusing. The central problem is that at the blockstore level, content composition is done using "links" to other bundles and using the actual XML filenames of XBlock OLX that you want to include. But within the LMS, we need to use usage keys which contain only the learning context ID (which gives the bundle ID) and a unique "usage ID". In the case of a child block seen in the LMS (e.g. if you go directly to the "mobile view" URL for a specific child), knowing the usage ID doesn't tell you what the parent block is, so in the worst case you have to scan every OLX file in the bundle to see which one included a child with the given usage ID. It also requires authors that write an …

Example, for a unit XBlock that includes an HTML child and a Video child:

```mermaid
graph TD;
    subgraph b1[Bundle 1]
        o1[OLX File 1 unit/main-unit/definition.xml]
        o2[OLX File 2 html/introduction/definition.xml]
        o1-- xblock-include -->o2
    end
    subgraph b2[Video Bundle]
        o3[OLX File 3 video/intro-video/definition.xml]
    end
    o1-- xblock-include via bundle links -->o3
    BDL1[BundleDefinitionLocator 1]-->o1;
    BDL2[BundleDefinitionLocator 2]-->o2;
    BDL3[BundleDefinitionLocator 3]-->o3;
    subgraph LearningContext
        u1[UsageKey 1]
        u2[UsageKey 2]
        u3[UsageKey 3]
    end
    u1-->BDL1
    u2-->BDL2
    u3-->BDL3
    subgraph lc2[Some other LearningContext]
        u4[UsageKey 4]
    end
    u4-->BDL3
```
Mapping from usage keys to OLX files or vice versa can be complex/hacky/expensive depending on the situation. So, if it's possible that at the Blockstore / ContentPackages level we only deal with atomic units, the implementation would get much cleaner and simpler. In fact, I would say we definitely need to find a better approach to this before we can consider using Blockstore for courses. The challenge I see is that authors really want to combine things often; e.g. a common use case is putting an HTML caption together with a video, or associating an HTML image and intro text with a CAPA problem. That image should always be followed by that CAPA problem, and course authors don't want to have to specify that every time they use that problem in a new course/context.
If you need this functionality, it's good to design for it early on. I like the approach of JIRA, where every issue has an internal permanent ID that you almost never see and a set of external IDs like DEPR-1, DEPR-2 etc. which can be changed, and where old IDs redirect to the current ID.
This sounds like a use case for custom tags, where you could tag individual content items with the original content ID?
Another thought that might simplify things: Maybe we can get rid of the idea of branches altogether, and just have a purely linear history, with the caveat that the "published" pointer doesn't automatically advance. So there are new "versions" being made for preview purposes that aren't "live". |
That makes a huge amount of sense to me. Developers that have git as a major part of their job have enough trouble with branches. Normal people just have a bunch of spreadsheets named "presentation-final-final-FINAL". A single linear stream with one pointer that says "published" feels plausible; branching really does not for people who do not have version control as a main part of their job.
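A hedged sketch of that branchless arrangement in Django terms (names assumed): versions form one linear sequence, and publishing just moves a pointer.

```python
from django.db import models

class LearningContext(models.Model):
    key = models.CharField(max_length=255, unique=True)
    # The one pointer that matters to learners; it never auto-advances.
    published_version = models.ForeignKey(
        "LearningContextVersion", null=True, blank=True,
        on_delete=models.SET_NULL, related_name="+")

class LearningContextVersion(models.Model):
    learning_context = models.ForeignKey(
        LearningContext, on_delete=models.CASCADE, related_name="versions")
    version_num = models.PositiveIntegerField()  # strictly increasing

    class Meta:
        unique_together = [("learning_context", "version_num")]
```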
+@doctoryes, @marcotuts, @jristau1984, @jmakowski1123: (looping you in because this relates to content libraries as well as some long-standing issues around content modeling)
This has been bothering me a lot lately, to the point where I think we need some explicit terminology and modeling around it. Right now, our language for content is largely about where it is in the tiered navigation hierarchy, or how it gets implemented in XBlocks. We don't really have a concept around this thing where sometimes these two pieces should just always be treated as one thing for the purposes of composition. So for now I'm going to steal terminology from a UX article I read somewhere and call the smallest bits ContentAtoms and the slightly bigger logical grouping of 2-3 of these a ContentMolecule. (ContentElement? I am not at all attached to these names.)

From the perspective of the LMS doing composition, it would be great if XBlock properly supported the ability to nest in ways that weren't strictly parent->child. In other words, if we could do this:

```xml
<problem>
  <video>...</video>
  <prompt>What techniques are demonstrated by this video?</prompt>
  <p>Some description...</p>
  <!-- etc. -->
</problem>
```

We've always needed this for things like content libraries. We kludge it in the v1 implementation by wrapping it in an extra vertical/unit. But ContentMolecules aren't really the same thing as Units–they come in different places in the hierarchy for one, and you wouldn't "compose" the internals of a ContentMolecule–it's just a fixed structure. But they have overlap, in the sense that both ContentMolecules and Units represent things that are externally addressable. It would make no sense to do an LTI launch of a ContentAtom that is an HTML prompt before a problem and not give you the problem itself.

Then there's the question of how we would introduce such a concept to the system in a way that doesn't break everything. Wrapping it in a new XBlock/tag would introduce another level of hierarchy that would likely break a lot of things downstream. Not only would code break, but exporting OLX from a version of Open edX that supported such grouping to a version that didn't would make those parts of the course completely inaccessible (because they'd be hidden under a new XBlock type that doesn't exist in the old system). But how about this?
I think that would give us a bare structure for grouping small things in a primitive, static, non-hierarchical way (ContentMolecules), in contrast to the more dynamic grouping mechanisms we'd be looking at for Unit Composition. It would be nice if Unit composition only had to deal with things at the ContentMolecule level, but there are certain types of composition that currently rely on ContentAtoms, like Feature Based Enrollment (FBE), which will disable certain XBlock types from displaying their contents for certain enrollment modes. That could be addressed by either applying FBE exclusions to the whole ContentMolecule, or pushing some of that decision down to the XBlock runtime layer. So we'd have:
I'll poke at this some more shortly...
😩 Version-awareness makes everything harder. 😞
Words for content atom/molecule:

- music words: phrase/verse
- lego words: brick/model. If definitions live at the ContentPackage layer, that makes the molecule a pack, so we could do brick/pack instead, which has the parallel sounds that are easy to remember together: block, brick, pack.
- geometry words: point/line or point/shape

Atom is actually pretty good for the lowest level; it's just molecule that is a problem. Maybe atom/aggregate or atom/subunit or atom/phrase to mix metaphors completely.

Reading what the new thing is, it's kinda an XBlock, but where an XBlock has to nest, this is on another axis... another axis from X makes it a YBlock ;)
@ormsbee That makes a lot of sense, and I think it's a nice simplification that covers most use cases without introducing too much complexity. I actually don't mind the name ContentAtom/ContentMolecule; it seems pretty clear to me what you mean by it. |
Some follow-on thoughts on the relationship between these:

Atoms are the analog to individual pieces of XBlock content. They need an identifier, title, and data. We will store things like user state and scores for things at this level of granularity, so we can't just hide it behind Molecules. They have a type, possibly major and minor types (e.g. …).

Molecules are logical groupings to aid composition. They have an identifier. They probably don't have titles, or if necessary derive a title from one of their children. We probably want to keep this layer as thin as possible. We shouldn't attach data to this where we also have to associate data with Atoms, as it will lead to further confusion.

So in this scenario, Molecules are a convenience grouping mechanism, but Unit composition still returns a list of Atoms. At an object level, I guess we'd want to allow you to call methods to iterate over either type when reading the contents of a Unit, but we'd probably default to Atoms. For the far more common read-only use case for Unit traversal, Atom traversal is going to do the right thing most of the time.

At the same time, Molecules would have to exist at the ContentPackage level, i.e. in the authoring vs. policy divide, they fall on the authoring side. Conceptually, a LearningContext can have many ContentPackages (one representing each library being used + one or more representing course-specific content). ContentPackages are an interesting tool because they potentially give us a path to more neatly bridge course-centric authoring and more library-centric models by giving them a sort of common denominator format.

Storing this is kind of weird. Storing Atoms is mostly straightforward. Storing Molecules almost seems a little silly when almost all of them are going to be single-Atom Molecules to start. It'd be tempting to use something implicit, except that Molecules are going to be an actual entity in our system that will be stored and manipulated in Unit composition, so we need a real data model backing it. The IDs here are almost certainly going to be UUIDs of some sort though.

But then we get to the problem of versioned storage...
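A compact sketch of that Atom/Molecule split as Django models (field details are guesses from the description above):

```python
import uuid
from django.db import models

class ContentAtom(models.Model):
    # The granularity at which user state and scores are stored.
    uuid = models.UUIDField(default=uuid.uuid4, unique=True)
    title = models.CharField(max_length=255)
    type_major = models.CharField(max_length=100)
    type_minor = models.CharField(max_length=100, blank=True)
    data = models.BinaryField()

class ContentMolecule(models.Model):
    # Deliberately thin: an identifier and an ordered list of Atoms,
    # with no title or data of its own.
    uuid = models.UUIDField(default=uuid.uuid4, unique=True)
    atoms = models.ManyToManyField(ContentAtom, through="MoleculeAtom")

class MoleculeAtom(models.Model):
    molecule = models.ForeignKey(ContentMolecule, on_delete=models.CASCADE)
    atom = models.ForeignKey(ContentAtom, on_delete=models.CASCADE)
    order = models.PositiveIntegerField()
```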
There are a number of issues that versioning raises with this scenario, but the biggest one that comes to mind is how entities in our system get multiplied out. We want stable references that survive a course publish–we don't want to invalidate all the score information or policy related metadata for a piece of content every time someone fixes a typo. At the same time, having references to the exact piece of content at the exact version would help for some scenarios (e.g. re-grading, rubric validation, course team debugging of student issues, etc.). But the more capable and flexible our content storage mechanism is, the more complex and confusing it becomes.

Identifiers like the UsageKey have to be assigned at the LearningContext layer. A course run could use the same problem from a content library in multiple places, and there's also no way to guarantee identifier uniqueness across multiple ContentPackages (courses may use dozens).

So taking all those together, we have:

Top level: LearningContext, LearningContextVersion; ContentPackage, ContentPackageVersion. A LearningContextVersion has multiple ContentPackageVersions (possibly even two different versions of the same ContentPackage).

A LearningContext will have many version-independent Blocks. These have a permanent UUID as well as an identifier like a UsageKey. This is the thing that systems external to publishing would make a foreign key to when they need to store things like student state that persists across new versions of the content.

A LearningContextVersion has BlockVersions, which are joins between versioned Atom data and versionless Blocks. BlockVersions are the thing you'd put a foreign key against when you want to capture the state of the content at some point in time, e.g. submission/completion/grading. The mapping between LearningContextVersion and BlockVersion determines what's "live" in that version.

Blocks don't go away, even when the content is deleted. That would be helpful for people writing code that builds off of course data–even if the content no longer exists, your app would still be able to find the last version of the content that corresponded to that Block. Blocks aren't just Course concepts, but exist in Content Libraries as well, where they serve the same function.

Next: How we'd model Units in this mix of versioned and un-versioned models...
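Pulling that summary into one hedged Django sketch (names from the comment above; field details and relationship cardinalities are assumptions):

```python
import uuid
from django.db import models

class Block(models.Model):
    # Version-independent. External systems (student state, etc.) make
    # their foreign keys to this; it survives publishes and deletion.
    uuid = models.UUIDField(default=uuid.uuid4, unique=True)
    learning_context = models.ForeignKey("LearningContext", on_delete=models.CASCADE)
    identifier = models.CharField(max_length=255)  # UsageKey-style

class BlockVersion(models.Model):
    # Joins versioned Atom data to a versionless Block. FK here to pin
    # content at a point in time (submissions, completion, grading).
    block = models.ForeignKey(Block, on_delete=models.CASCADE)
    atom = models.ForeignKey("ContentAtom", on_delete=models.CASCADE)
    # Membership in a LearningContextVersion determines what's "live".
    learning_context_versions = models.ManyToManyField("LearningContextVersion")
```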
@bradenmacdonald: My thinking has been more data-oriented than interface-oriented at this level, probably because it's that aspect that's pained me the most with XBlock/ModuleStore. I'm imagining that pluggability would happen one layer above this, in models that have foreign keys to ContentObjects, but that ContentObjects themselves have only the basic primitives necessary for import/export and to be placeholders for composition. That being said, could you please describe a concrete use case, and I can see how these models might work with it?
@ormsbee OK, don't worry too much about what I said, it might be asking for more complexity than we need, and I don't think what you're describing precludes any use cases. Perhaps there are some that it makes less efficient, like the idea of a remote learning content repository, or a remote blockstore - can remote objects be used without first importing them? Or is it good that they must be imported first? I "felt" like I had other examples but on reflection I couldn't think of any good ones, so I'll let it go :p
I think some minimal set of metadata about the resource would have to be imported in some way or another, even if it's mostly pointers to where the thing actually lives. But the content itself wouldn't have to be.
I've simplified the ContentAtom a bit, so that it only stores:
There is a unique constraint on ….

Some thoughts:
Random idea: define abstract model classes that plugins and other apps can extend for key extension types, for instance having one that's 1:1 with ContentAtoms.

Misc. thought: Content lives longer than code. One of our pain points has been breaking the export process entirely when certain apps/XBlocks go away. The design should be done in a way that prevents this.
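A sketch of what one of those abstract base classes could look like (names are hypothetical):

```python
from django.db import models

class ContentAtomExtension(models.Model):
    """Abstract base for app-owned models that hang 1:1 off a ContentAtom."""
    content_atom = models.OneToOneField(
        "ContentAtom", on_delete=models.CASCADE, primary_key=True)

    class Meta:
        abstract = True

# A plugin app would then subclass it, e.g.:
# class MySpecialAtomData(ContentAtomExtension):
#     extra_field = models.JSONField()
```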
Okay, I've been reworking this some more, and trying to focus on separation of tasks by layer, and where we expect apps to hang their own models off of. There are a few high level considerations:
Raw Content Layer (in ContentPackage)

- ContentAtom

Shared Content Layer (in ContentPackage)

- ContentObject
- SimpleContentObject
- ContentSegment

Versioned Content Layer (LearningContext + LearningContextVersion)

- LearningItem
- LearningItemVersion
A few notes about this arrangement:
Some questions as I try to understand this:
I thought I saw a potential issue with this schema, but @feanil and I talked through it and I believe we resolved it. @ormsbee, let me know if this jibes with your thinking.
And, on another note, I want to record a suggestion that @feanil mentioned in our meeting (@Carlos-Muniz @ormsbee):
where …
Will reply to all of these later, but a quick one:
I agree with the idea that people shouldn't make foreign keys to ContentAtom, and that it should be to something like the SimpleContentObject instead. But there were a couple of reasons I wanted the separation between ContentAtom (raw data) and SimpleContentObject (metadata):

Storage Optimization

Metadata changes should be cheap. If we have a large image and change the ALT description text a few times, we shouldn't be making whole copies of the image data. Putting them together in the same table would require a whole new copy of the image data in that situation.

Content Outlives Applications

This design is leaning heavily on supplemental apps to provide rich data models. At some point, new apps are going to replace them and want to migrate content data over. At some point, apps are going to be removed entirely, and their tables will be dropped. What can we meaningfully export from an old course after that has happened?

So the hope I have in these models is that we can have something that is grounded by a core model that will last over time, always be exportable, and will have the minimal data needed for a thing "of that type", while at the same time being supplemented by richer data.

(Edit: there's actually another aspect on error handling, dirty data, and async processes, which I'll try to write up after I pick my daughter up from school.)
All great points. You have me convinced on this point alone:
I would love it if we could store our images, videos, etc. in a dead-simple way so that we never have to migrate them again. Just for the purpose of distinguishing it from a ContentObject, would it be fair to think of a ContentAtom along the lines of a "file"? For example: pixel data and dimensions are things that usually live in an image file, and thus go in the ImageAtom; alt text is externally-applied metadata, and thus would exist at the ContentObject level. (I don't know whether this metaphor holds for other content types; I'm struggling a bit to think of examples.)
Assumption check: While trying to work through how this would work with videos, I realized that the blob in the content atom is probably going to need to use something like a FileField (backed by whatever storage mechanism you want) so that we don't have to store large amounts of data in the database. In which case, making the ContentAtom be akin to a file makes sense to me. We want to have the files be immutable, but then have the metadata related to files (in the case of video: title, language, etc.) be easy to change even if you're not modifying the video. Going down this line of reasoning a bit further, is the reason that you're thinking we might want multiple types of content atoms (e.g. VideoContentAtom) that we may want to extract some of the data in the atom for ease of querying? I'm imagining things like the length of a video, which is going to be part of the video data but that we might want to be able to quickly access without re-analyzing the whole video.
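A sketch of that file-like ContentAtom (model and field names are guesses, not settled design):

```python
from django.db import models

class ContentAtom(models.Model):
    # Immutable, file-like raw data. Large payloads live in file storage
    # (S3, local disk, etc.) behind a FileField, not in the database.
    hash_digest = models.CharField(max_length=64, unique=True)
    mime_type = models.CharField(max_length=127)
    size = models.PositiveBigIntegerField()
    data = models.FileField(upload_to="content_atoms/")
```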
Agreed. Unfortunately. Video's always the edge case that breaks everything by being 1000X the size of everything else. 😞
Originally, yes. I thought that it made sense to have a model for metadata that was intrinsic to the data of the raw bytes themselves, separate from other metadata about said bytes (like the author name, or a short description). But after our conversation yesterday, I came around to the idea that small levels of duplication of metadata is fine, and that it's preferable to do that than to confuse developers by giving them two places to hang this kind of data off of (which I think was one of the points you and @kdmccormick were making?). In the case of a video, that might mean that a VideoContentObject would hold both the length (intrinsic to the file itself) + a short description (layered metadata).

I think that extension at the ContentAtom layer then makes sense primarily when data requires significant transformation. This may be on a small scale, like stripping out all the policy/due date information, or removing deprecated fields. Or it may be on a large scale, if we're converting formats entirely–like QTI to Capa, or OLX into some more optimized representation. In that case, the original content-as-imported might stay in the ContentAtom, but a new derived ProblemBlockAtom might store the transformed version.

Granted, a lot of this conversion can (and probably should) be done before it gets published into the LMS at all. But having this side-loadable Atom gives other apps the opportunity to process and make derived content as necessary, which I think will be useful for querying, experimentation, and data migration. ContentAtoms aren't "owned" by any app, really. Any app can create a ContentObject that points to one, and any app can make a MySpecialAtom that derives its data from a ContentAtom.
Random testing thought: In addition to having standard abstract models to inherit from for certain needs (e.g. something with the right way to subclass and query ContentObject), it would be good to have some standard test classes that apps could use, that would run common scenarios like, "Hey, what happens when the LearningContext gets deleted?"
Okay, I've done some prototype hacking, and I think this is feasible. I'll make a new ADR PR for the data model and post the link here shortly. It'll cover the basic high level concepts/models for block and unit level content types, and the extensibility story. Follow-on ADRs would cover error handling and potentially a more efficient versioning representation. That last one is tricky because I can think of a model that would do it, but it would make it more difficult to have the database enforce certain correctness constraints.
As I was writing up the ADR for the data modeling basics, I came to the conclusion that the ContentPackage/LearningContext distinction was more trouble than it's worth, especially as more and more things were moved up to LearningContext anyway (e.g. Unit/Sequence composition). I think that what we really got from the split of those concepts was a couple of ideas:
But having the ContentPackage/LearningContext split was giving some painful duplication, and I ran into particular issues in figuring out how to model Segments in a way that tied them to the ContentPackage. So with the Content/Learning prefixes gone, I think the models simplify out to something like this: …

I've also renamed the primitive, versionless content-package layer classes from ContentObject/ContentAtom to ItemInfo/ItemRaw, which brings it more into line with the Item/ItemVersion terminology used elsewhere.
Side Note on Re-Use: To facilitate easier querying in the future, some of the models should have nullable fields that can point to the original thing they're copying from. So if my ItemVersion was a copy of a Library's ItemVersion, it should have a fkey to that Library's ItemVersion. At the same time, if we want to be resilient to content deletion, we can't rely on the fkey to always exist.
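A sketch of that nullable provenance pointer, assuming the Item/ItemVersion models named above:

```python
from django.db import models

class ItemVersion(models.Model):
    item = models.ForeignKey("Item", on_delete=models.CASCADE)
    # Points back at the Library ItemVersion this was copied from, but is
    # allowed to go null so we stay resilient to content deletion.
    copied_from = models.ForeignKey(
        "self", null=True, blank=True,
        on_delete=models.SET_NULL, related_name="copies")
```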
I'm going to close this for now, but may re-open if there is further discussion.
Okay, one adjustment that I'm making to this is to allow the association of multiple ItemRaw objects with a given ItemVersion. I think that will give some flexibility around associating multiple assets (e.g. for Video data + transcripts, problems with graders, etc.) without adding too much complexity.
Two more tweaks:

ItemRaw is now Content

This makes the hierarchy look like:

```mermaid
graph TD;
    CI[Item]-->CIV[ItemVersion]
    CIV[ItemVersion]-->CC1[Content]
```
It also gives the M:M through-model between ….

Mapping Data to ItemVersion

I was creating models for serving static assets, and in particular I was thinking about images vs. other downloadable assets like zip files or PDFs. There are certain commonalities, like download permissions. But there is a lot of variation as well–images get alt text, resolutions, possibly multiple files for different sizing, etc.

One option is to make the relationship hierarchical by having a StaticAssetVersion model and then hanging yet more models off of that one. But one of the frustrating things about that arrangement is that we'd be repeating ourselves a lot. Say something that is image-specific gets changed–all the static asset-generic metadata gets re-generated each time (a new ItemVersion, so a new OneToOne StaticAssetVersion + the ImageAssetVersion that actually changed). Extrapolate that across anything that might ever want to attach any kind of metadata, and it could be a mess.

Instead, I'm now thinking about treating these different aspects of the data (Downloadable, Image, etc.) as separate data models that are mapped to ItemVersions via a 1:M table that is locked 1:1 on the …:

```mermaid
graph LR
    subgraph itemstore
        ItemVersion
    end
    subgraph staticassets
        direction TB
        Image
        DownloadableAsset
        subgraph "(autogenerated)"
            direction TB
            ItemVersionImage
            ItemVersionDownloadableAsset
        end
    end
    subgraph staticassets
        ItemVersionImage --"M:1"--> Image
        ItemVersionImage --"1:1"--> ItemVersion
        ItemVersionDownloadableAsset--"M:1"--> DownloadableAsset
        ItemVersionDownloadableAsset --"1:1"--> ItemVersion
    end
```
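A sketch of one of those auto-generated mapping rows in Django terms (names taken from the diagram; field details assumed): the 1:1 lock on ItemVersion means each version gets at most one Image aspect, while many versions can share the same Image.

```python
from django.db import models

class ItemVersionImage(models.Model):
    item_version = models.OneToOneField("ItemVersion", on_delete=models.CASCADE)
    image = models.ForeignKey("Image", on_delete=models.RESTRICT)
```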
Things I think are promising about this approach:
This is a discovery ticket that would end in multiple ADRs in this repo. The goal is to determine the basic data structures and relationships needed to store content data for the LMS, with the following aims:
There will likely be some prototyping involved in this, as well as a lot of discussion.
These ADRs would include:
Open edX's ModuleStore has explicit versioning capabilities (though they're not really used from the LMS most of the time). Blockstore has versioning deeply baked into the design, but we've generally avoided encoding multi-version support into the post-publish data stores we build during course publish (e.g. CourseOverview, Block Transformers/Course Blocks API, etc.).
I think we've come to a point where we really do need to cross that divide and start introducing content versioning concepts to the LMS more generally. Some motivating reasons:
Storage Scaling Issues
SplitMongo ModuleStore wasted a lot of space with old version data, eventually leading us to create a separate cleanup script for it. There were a number of reasons why disk usage got so bad with this system:
Every time there was even a small change (e.g. the title of a Unit), we ended up writing a document with all settings-scoped data for the course. This happened all the time in Studio, so that for every one version that is of interest to us for viewing or preview purposes, there would be dozens of almost-identical intermediate versions. We ended up in a place where the majority of course storage was wasted in this way.
There are a few ways we can make this much better:

- Most content doesn't change very much from one version to the next, so we should break up the course into more granular pieces and track their changes individually.
- There will be intermediate versions that can be almost immediately discarded (like the last state of the Studio draft). We should have obvious cleanup facilities for getting rid of those as soon as they are not needed.
- Thinking about versions is hard and unintuitive. We should have a set of primitives that help people model version-awareness into their content without having to overly complicate their data models.
Modeling Versioning in a way that scales (i.e. both features and users)
The following is a disorganized set of thoughts:
Entities (the names are bad, I'm just trying to get down the ideas)
I'm fiddling with some of these ideas in the learning_publishing app's models.py file at the moment.
Concern: How is it different than Blockstore? A: It's going to be a much more relational data model that you hang other relational models (e.g. XBlock content, scheduling information) off of. Also, it's going to have zero intelligence about cycle detection or dependencies. It's also going to have explicit measures for cleanup of unused versions, which is not a thing in Blockstore.