
Separating abstract data model from syntaxes #103

Closed
talltree opened this issue Nov 12, 2019 · 18 comments
@talltree
Contributor

The first public working draft (FPWD) currently defines DID documents in two sections.

  1. Section 5 defines DID Documents.
  2. Section 6 defines DID Document Syntax.

However Section 5 does not actually define DID documents abstractly, but rather as a collection of JSON properties. This may be because the first sentence of Section 6 currently says:

A DID document MUST be a single JSON object conforming to [RFC8259].

However this directly contradicts the second paragraph of Section 6, which says:

Although syntactic mappings are provided for JSON and JSON-LD only, applications and services can use any other data representation syntax, such as JXD (JSON XDI Data, a serialization format for the XDI graph model), XML, YAML, or CBOR, that is capable of expressing the data model.

Besides resolving these obvious conflicts in the FPWD, a number of WG members have asserted that, because DIDs and DID documents operate at such a low level of Internet infrastructure—and are effectively protocol elements in DID resolution—the following design principles should apply:

  1. DID document structure should be defined abstractly, using a language designed for abstract data modeling such as UML.
  2. Syntaxes for expressing DID documents should be defined separately from the abstract data model.
  3. No syntax should have special status, i.e., each should define exactly how it implements the abstract data model in its own separate section of the spec.

If there is rough consensus on these design principles, then it would make sense to revise the current structure of the spec as follows:

  • One section for defining the abstract data model, with subsections for each property.
  • One section for each syntax that is defined in the main spec (other syntaxes can be defined in separate specs or subsequent versions).

Note that this issue is not orthogonal to #95 (Document Structure), so a decision on this issue may affect that one.
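To make the proposed structure concrete, here is a minimal sketch (not spec text; the field names and the third-party cbor2 package are assumptions for illustration) of one abstract model with two separate syntax mappings:

```python
# Hypothetical sketch only: field names and the cbor2 dependency are
# assumptions for illustration, not text from the proposal.
import json
from dataclasses import dataclass, field

import cbor2  # third-party CBOR codec: pip install cbor2


@dataclass
class DidDocument:
    """Abstract data model: syntax-independent fields."""
    id: str
    public_keys: list = field(default_factory=list)
    services: list = field(default_factory=list)

    def to_abstract(self) -> dict:
        # The neutral in-memory form each syntax section maps to and from.
        return {"id": self.id, "publicKey": self.public_keys,
                "service": self.services}


def to_json(doc: DidDocument) -> str:
    """JSON syntax section: its own mapping of the abstract model."""
    return json.dumps(doc.to_abstract())


def to_cbor(doc: DidDocument) -> bytes:
    """CBOR syntax section: same abstract model, different wire form."""
    return cbor2.dumps(doc.to_abstract())
```

Under this layout, each syntax section of the spec would own only its mapping functions, while the abstract section would own the field definitions.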

@talltree talltree added the discuss Needs further discussion before a pull request can be created label Nov 12, 2019
@talltree talltree self-assigned this Nov 12, 2019
@SmithSamuelM

Using an abstract data model for the DID document syntax specification enables tooling that validates DID documents in modeled form, making validation generic: step one, convert the DID doc to the abstract data model; step two, validate in model space. The mapping between the abstract data model and each format-specific syntax should be round-trippable, and the syntax mapping should be defined in both directions. As is well known, smart contracts expressed in Solidity, the Ethereum smart contract language, are difficult to validate; the best validation approaches convert Solidity contracts into abstract data models that can be validated. A UML State Chart format would allow the DID method operations on a DID doc to be modeled in a formal way.

Beyond the universal-validation advantages, an abstract data model will clear up and prevent many of the conflicts over items and syntax in the DID doc, since the discussions often devolve into clarifications between the JSON and JSON-LD syntaxes rather than functional changes to the spec. Making the core spec abstract, and then letting each syntax decide the best (two-way, round-trippable) way to implement the data model in that syntax, will let discussions focus where they need to be. Generic functionality will be syntax independent, removing any discussion of syntax from those discussions; syntax-specific discussion will be attended by those impacted by, and knowledgeable in, that syntax, thereby focusing those discussions. This would also be an advantage for DID resolution.
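As an illustration of the round-trip requirement (a sketch under assumed names, not a normative algorithm), a generic validator would convert to model space, validate there, and confirm the syntax mapping is lossless:

```python
# Sketch of the round-trip property described above; the validation rule
# shown is a made-up placeholder, and the "model" is just a Python dict.
import json


def validate_round_trip(json_text: str) -> bool:
    # Step 1: convert the DID doc from syntax into the abstract model.
    model = json.loads(json_text)
    # Step 2: validate in model space (hypothetical generic rule).
    if "id" not in model:
        return False
    # Step 3: check the mapping is two-way round-trippable.
    return json.loads(json.dumps(model)) == model
```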

@SmithSamuelM

Using an abstract data model for DID methods will enable more universal DID resolvers, and DID method implementations in a given language may be auto-generated from the abstract data model (such as UML State Charts). This makes code validation against a DID method spec possible.

@dhuseby

dhuseby commented Nov 12, 2019

@talltree thank you for taking the time to write this up. I couldn't agree with you more. I spent this summer implementing a Rust crate for parsing DIDs and DID documents. It is now the primary DID handling code in Hyperledger Aries. After getting a mostly working crate together, I was so incensed at the state of the spec that I drank three beers and then wrote out a rant from an implementer's perspective.

The part of my rant that is applicable to this issue is Section 4.1: The DID document spec should be encoding agnostic. I completely agree with @talltree and @SmithSamuelM that we should concentrate more on what must be in a DID document than on how it is encoded. Everything else we want to specify around DID operations (e.g. controller binding, key management, service point linking, etc.) is independent of the encoding of the DID documents.

So why specify that DID docs must be JSON-LD? More importantly, what are the reasons why we wouldn't want to bless one encoding? It turns out that many different industries have settled on specific encodings as standard, and if we want SSI to penetrate those markets, DID documents will need to be expressible in those encodings.

I'm not talking about insignificant industries either. The entire legal profession has settled on PDF documents, for better or worse. Adobe and several other vendors (e.g. Docusign) are building out identity-related products meant to handle cryptographic credential distribution and management inside of PDF documents. Those products sound suspiciously like SSI, and the PDF metadata sounds suspiciously like DID document data. If the DID spec outlined the data that should be in a DID doc and then formalized ways that data can be encoded in different formats, then support for PDFs would be possible for the legal industry.

The same goes for the financial industry. They recently settled on the ISO 20022 encoding for financial records and messaging. If we want to bring DID docs to the financial industry we'll need to support that. Same goes for national ID/drivers' license standards. Birth certificate encodings. Health record encodings. Hell, even climate scientists have settled on the HDF5 standard for all weather and geosat data sets. I think it would be a nice improvement if weather and climate instruments were digitally signing all of their data using SSI/DID credentials and enabling the provenance tracking of the data. But that would require the DID credentials to be encoded in HDF5.

To put a period at the end of my point, here are the recommendations that I made that are in full agreement with @talltree and @SmithSamuelM :

  1. Focus on what data must be in a DID document.
  2. Create a registry of encoding methods.
  3. Define the process by which a new encoding method can be included in the registry of standard encoding methods.
  4. Add JSON-LD as the first encoding method in the registry and move all of the encoding details out of the existing FPWD and into the JSON-LD encoding spec (i.e. the canonicalization rules, etc).

This also future-proofs the DID spec because we can incorporate new encodings on the fly.
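A sketch of what the registry in recommendations 2-4 could look like in code; the registry shape and the entries here are invented for illustration:

```python
# Hypothetical encoding registry; names and entries are illustrative only.
import json
from typing import Callable, Dict, Tuple

Encoder = Callable[[dict], bytes]
Decoder = Callable[[bytes], dict]

# encoding name -> (encode: model -> bytes, decode: bytes -> model)
REGISTRY: Dict[str, Tuple[Encoder, Decoder]] = {}


def register(name: str, encode: Encoder, decode: Decoder) -> None:
    # The spec-defined inclusion process (recommendation 3) gates this.
    REGISTRY[name] = (encode, decode)


# JSON-LD as the first registered encoding (recommendation 4); its
# canonicalization rules would live with this entry, not in the core spec.
register("json-ld",
         lambda model: json.dumps(model).encode(),
         lambda data: json.loads(data))
```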

@dhuseby

dhuseby commented Nov 12, 2019

In theory, if we follow the pattern of specifying the minimum data and then also specifying a registry of encoding methods, there's nothing keeping us from saying that X.509v3 certificates are a valid encoding method for a publicKey data unit, as long as they contain the necessary fields. Based on my survey of the different methods of representing cryptographic key material, the only thing missing from the X.509v3 spec would be the "id", unless we abuse the "common name" (CN) member of the "Subject" field, as specified in the latest PKIX RFCs, to contain a DID URI instead of a URL.

This approach would certainly address issue #69.
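A sketch of that mapping using the third-party cryptography package; the output record's field names, and the CN-carries-a-DID convention, are the assumptions described above:

```python
# Hypothetical X.509v3 -> abstract publicKey record mapping. The output
# field names are assumptions; the CN is abused to carry a DID, as above.
from cryptography import x509
from cryptography.hazmat.primitives import serialization
from cryptography.x509.oid import NameOID


def public_key_record_from_cert(pem_bytes: bytes) -> dict:
    cert = x509.load_pem_x509_certificate(pem_bytes)
    # The missing "id" comes from the Subject CN repurposed to hold a DID.
    did = cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)[0].value
    pem = cert.public_key().public_bytes(
        serialization.Encoding.PEM,
        serialization.PublicFormat.SubjectPublicKeyInfo,
    )
    return {"id": did, "publicKeyPem": pem.decode()}
```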

@dhuseby

dhuseby commented Nov 12, 2019

But given my grok of DID, I don't think the existing certificate formats contain enough data to fully support DID. They certainly don't include any data to support the standard key management functions we seek to include in the standard.

@dhuseby

dhuseby commented Nov 12, 2019

But how cool would it be to make the set of DID specs able to incorporate the existing CA system? Credentials from a CA in the form of an EV certificate could work just as well as KYC credentials from your bank or DMV. We would just need to wrap them in the DID context so that they are actually useful for SSI.

@dhuseby

dhuseby commented Nov 12, 2019

I am now laughing to myself at the thought of the did:ca: method.

In all seriousness though, I think we're getting somewhere if we are able to boil down the suite of DID specs to the core tenets of cryptographic identity such that the CA system turns out to be one narrow implementation of the standard.

@selfissued
Contributor

I have worked on specs* where there was an abstract data model spec and a companion concrete binding spec, and do you know what? Developers hated it!

They said that it was overly confusing to have to read both specs in parallel to try to piece together what they actually needed to implement. In the end, overwhelming feedback from developers caused us to abandon that approach. We folded all the normative requirements in the abstract spec into the concrete spec, and do you know what? The result was much clearer and easier for developers to use.

I'll also add that, as an editor, maintaining the two parallel specs, keeping them in sync, and figuring out which statements belonged in the abstract spec and which belonged in the concrete spec was a special kind of hell. I wouldn't wish it on anyone.

Please, let's not go down this rathole. Let's create a great JSON DID spec. If we later want to create a parallel concrete representation in another data format, we can do that. But let each be usable on its own, just as JWT [RFC 7519] and CWT [RFC 8392] are parallel but usable on their own without reference to one another.

* These specs were OpenID Connect Messages (the abstract form) and OpenID Connect Standard (the HTTP/OAuth binding of the abstract form). Developers revolted and insisted that we merge them to create OpenID Connect Core before we made OpenID Connect final.

@SmithSamuelM

@selfissued It's interesting to hear about your experience with an abstract data model spec (what spec was it?). One of my concerns is that we are currently supporting two specs: a JSON version and a JSON-LD version. Only it's not two encodings, it's some frankensteinian combination in one encoding. Having been an editor on other specs (CEA 852, 852.1, 709.X), I find that focusing on the data model, even when there is a preferred implementation language, is a good thing. If there is a strong use case for more than one encoding (which is already the case for DIDs, given both JSON and JSON-LD), then IMHO tracking multiple encodings becomes very difficult indeed if there is not an abstract data model. Alternatively, if there is only one encoding, then having both an abstract data model and an encoding is more complicated, and I would be on your side in saying let's not have an abstract data model. So if you don't mind responding: for the spec you mentioned, was there ever more than one encoding? If not, then it's not a convincing case against having one.

As a developer I like having examples for implementation purposes, but if I have to support more than one implementation then I want a canonical specification that is the source of truth; otherwise I spend way too much time trying to determine what is implementation detail and what is canonical detail. That seems to be what has been happening with the DID spec: way too much time spent bike-shedding JSON-LD vs JSON, and not having a clean spec for implementing either of them. As cryptographic primitives, DIDs should have more than one encoding. We want universality and portability, and the price of that is creating an unambiguous spec. There are two paths to clarity: a single canonical implementation encoding, or a single canonical data model. If we anticipate multiple encodings, then the latter is better in the long run.

@dhuseby

dhuseby commented Nov 12, 2019

@selfissued we're already doing the split between abstract and concrete with the DID method specs, and it seems to work just fine. There's a template for what a DID method spec needs to define, and implementors define those for their spec. I wrote/maintain the Git DID method spec and it works just fine.

BTW, I'm also proposing in other issues that the existing DID spec be broken up into: DID URI spec, DID key material spec, DID service endpoint spec, and the overall DID doc spec. I think it will be pretty easy to specify what the data model is given the very specific nature of what we are trying to accomplish.

@selfissued
Contributor

Answering @SmithSamuelM's question, the last version of the abstract spec was https://openid.net/specs/openid-connect-messages-1_0-20.html, the last version of the concrete binding spec was https://openid.net/specs/openid-connect-standard-1_0-21.html, and they were replaced by the combined specification https://openid.net/specs/openid-connect-core-1_0.html, which became a standard.

@SmithSamuelM

SmithSamuelM commented Nov 13, 2019

@selfissued It looks like all three versions of the spec are essentially the same; all use HTTP for non-normative examples, so maintaining two versions would be confusing. What I believe is being proposed here is not to follow the example of OpenID, but instead to have only one full spec, with appendices that are normative annotated code examples. So the normative binding spec would be simplified with normative examples, not a complete duplicate of the spec. A developer would have one spec that defines the data model, and then a set of normative examples for each specific encoding, annotated with clarifying comments. This is in contradistinction to essentially copying and pasting the full text of the spec for each encoding.

I would not want the encodings to be standalone specifications, as appears to be the case for the OpenID examples you gave. That would be a nightmare. =) Instead, the encodings would be annotated normative (or compliant) code examples. Spec maintenance then becomes generating the annotations, such as field and block type representation, and making sure that the encodings pass compliance tests, as opposed to the editors being forced to synchronize multiple standalone copies of the same spec that differ only in the normative code examples. One could then use the simplest, most common encoding (JSON in this case, or pseudocode) to provide illustrative but non-normative examples throughout the body of the spec.

@talltree
Contributor Author

@selfissued (which, I just have to say, in the context of the work we are doing, is about the coolest handle ever): thank you for the concrete examples of the problem of doing entirely separate specs for abstract data model and concrete encoding model. I agree that could become a nightmare.

What I had in mind when I raised this issue is exactly the model @SmithSamuelM describes: one spec that defines the abstract data model in one section and then uses subsections or appendices for defining each encoding. So the end result is one spec regardless. If someone wants a different encoding after we are done, they can either write a separate spec or convince us to version the main spec.

I also believe this could help us nicely partition the work. Everyone who cares about the abstract data model can collaborate on that, and then those who care about a particular encoding can collaborate on that encoding. Encoding-specific issues stay with the encoding teams, and only issues with the abstract data model are handled by the abstract data model team.

Note that this approach will also address issue #92. And I think it will help us decide about #95.

@dhuseby

dhuseby commented Nov 13, 2019

I fully agree with the above two statements. I also want to clarify my 2p on this since I look at these problems from an implementor's perspective. When it comes to encodings there are exactly two things we care about:

  1. The representation of constants (e.g. algorithm identifiers, key usage restrictions, service endpoint types, dictionary key names in encodings like JSON that store key names).
  2. The canonicalization algorithm used to encode a data object into a form that can be digitally signed/verified.

So what I would like to see is lists of constants used to identify all of the blessed algorithm types, key usage restrictions, service endpoint types, key management function types, and the constants from the data model, such as the name of each part of a key record.

Then for each encoding appendix, there would be a section that maps the constants to their values in the encoding (e.g. "RsaSignature2018" and "keyEncoding" in JSON).

After that I expect there to be a short section on the canonicalization algorithm used for that particular encoding. So the existing stuff about the JSON encoding would move to the JSON encoding appendix.
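As a sketch of what an encoding appendix would then boil down to (constant names and values invented for illustration; the sorted-key canonicalization is one possible choice, in the spirit of JCS, RFC 8785):

```python
# Hypothetical content of a JSON encoding appendix: a constants mapping
# plus a canonicalization function. Nothing here is normative.
import json

# Abstract constant -> its representation in this encoding.
JSON_CONSTANTS = {
    "SIGNATURE_ALG_RSA_2018": "RsaSignature2018",
    "KEY_ENCODING_FIELD": "keyEncoding",
}


def canonicalize(model: dict) -> bytes:
    # Deterministic bytes to sign/verify: sorted keys, no whitespace.
    return json.dumps(model, sort_keys=True,
                      separators=(",", ":")).encode()
```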

Then the only other thing we care about is extensibility and how the lists of constants can be expanded to include experimental features that may get included in future revisions of the spec. For instance, HTTP used to recommend using "X-my-header" for new experimental or private headers. We should address how a vendor would extend this spec to cover a novel key encoding or novel algorithm that isn't already accounted for. It may be as simple as including a URI/URL to the canonical reference on the non-standard data.

What you did in the OpenID spec is not at all what I was thinking about. My approach is informed by my years in video game programming, where we had an overall data model and then specific encodings for each target video game hardware (e.g. PC, Xbox, Playstation, etc.). All of our tools were built around assumptions about the core data model (i.e. renderable objects always have a mesh ID referencing the mesh data), and all of the load/save functions were smart enough to detect a specific encoding and do the translations on the fly when needed.

Having a core data model allows implementors to detect malformed DID docs regardless of encoding. That's where we ultimately want to be. I shouldn't care if a DID doc comes in the form of JSON over an HTTPS GET, or is loaded from the metadata in a PDF, or is scanned from the back of a physical driver's license. I should be able to write code that validates the core data model and can work with DID documents from any source. What gets me excited is the thought that I could have just one implementation that loads DID documents by scanning driver's licenses and can immediately forward them to the universal resolver over an HTTP request as JSON-encoded data. Or loading a DID from a HIPAA-encoded medical record and storing it in a PDF that is a medical bill. If we don't do this separation, we'll be endlessly hand-coding hacks to stuff JSON-LD data into all of these non-JSON-LD-aware data systems.

@jricher
Contributor

jricher commented Nov 27, 2019

I think the real answer is what the current document is attempting: to have an abstractable data model, but have that model expressed as a known concrete data format. Any other serializations are into and out of that format. This puts hard limits on representation of values, composition of objects, and other items like that. If it can't be represented in the JSON serialization, then it can't be represented. That way you get out of problems like XML attributes and comments, which have no simple JSON equivalent, and instead get to have one data format that can be used across all things. We tried to do this with the VC Data Model spec, with language added towards the end of its lifetime declaring that all serialization and encoding needed to be lossless, bidirectional, and deterministic across any format. I think this could have been helped by having the VC Data model fundamentally expressed as a JSON document explicitly, instead of the implied JSON-LD that's there today.

Regardless, while there's definitely value in an abstract model, there's more value in what @selfissued says above about concrete bindings to real representations. And if you can make your concrete representation translatable to different formats, then you've essentially won for both sides.
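The difference between the two designs fits in a few lines. In the concrete-pivot approach sketched below (hypothetical helper names; cbor2 assumed as the CBOR codec), the JSON form itself is the model, and every other serialization converts through it, so nothing unrepresentable in JSON can exist in any format:

```python
# Sketch of the "concrete pivot" design: the JSON form is authoritative
# and other serializations are defined as mappings into and out of it.
import json

import cbor2  # assumed third-party CBOR codec


def cbor_to_pivot(data: bytes) -> str:
    # Whatever the CBOR carries must survive the trip through JSON.
    return json.dumps(cbor2.loads(data))


def pivot_to_cbor(json_text: str) -> bytes:
    return cbor2.dumps(json.loads(json_text))
```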

@selfissued
Contributor

To clarify my position on this following a phone conversation with @talltree, I'm fine with us working on multiple DID encodings as needed by community use cases, provided each is in a separate specification and that the first encoding we work on is JSON. If the JSON encoding also can be used as an abstract or prototype encoding that the others normatively reference, all the better.

@SmithSamuelM

SmithSamuelM commented Dec 6, 2019 via email

@msporny
Member

msporny commented Feb 13, 2020

The data model has been separated from the syntaxes in PR #186, which was merged yesterday.

You can view the new layout in the latest published spec:

https://www.w3.org/TR/did-core/

Closing this issue unless there are objections.

@msporny msporny added pending close Issue will be closed shortly if no objections and removed discuss Needs further discussion before a pull request can be created labels Feb 13, 2020
@msporny msporny closed this as completed Feb 21, 2020