Referencing sections of a document #200

Open
simleo opened this issue Apr 27, 2022 · 3 comments

@simleo
Contributor

simleo commented Apr 27, 2022

While converting a cwltool --provenance RO to a Workflow Run RO-Crate, I'm faced with the problem of referring to individual workflow steps. The workflow is stored in "packed" form, meaning that the tools that implement each step are stored in the same packed.cwl document as the workflow. For the packed form, CWL uses the URI fragment syntax to assign IDs to the steps and the workflow itself; in this case, they are:

  • Workflow: #main
  • First step ("rev"): #main/rev
  • Second step ("sorted"): #main/sorted

The workflow appears in the crate as a data entity with an @id of packed.cwl, so I decided to add the tools as SoftwareApplication entities with @id packed.cwl#rev and packed.cwl#sorted (whether this is correct is another matter: should they be packed.cwl#main/rev and packed.cwl#main/sorted instead?). Using fragments here seems quite reasonable, since the secondary resource is certainly "some portion or subset of the primary resource". However, should the tools be considered contextual entities or data entities? At first I tried to add them as contextual entities:

crate.add(SoftwareApplication(crate, instrument_id, properties={
    "name": instrument_id,
}))

Leading to:

{
    "@id": "packed.cwl",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    "hasPart": [
        {"@id": "#packed.cwl#rev"},
        {"@id": "#packed.cwl#sorted"}
    ],
    ...
},
...

This does not really seem to work because of the leading # in the tool IDs. ro-crate-py automatically adds a leading hash mark to contextual entity IDs if they're not full URIs (I'm not sure this is a MUST in the RO-Crate spec, but it's at least implied). So I tried adding them as data entities instead:

crate.add(DataEntity(crate, instrument_id, properties={
    "@type": "SoftwareApplication",
    "name": instrument_id,
}))

Leading to:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "packed.cwl"},
        {"@id": "packed.cwl#rev"},
        {"@id": "packed.cwl#sorted"},
        ...
    ],
    ...
},
{
    "@id": "packed.cwl",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    "hasPart": [
        {"@id": "packed.cwl#rev"},
        {"@id": "packed.cwl#sorted"}
    ],
    ...
},
...

I think this is more correct since section IDs have a document_id "#" fragment structure. However, having packed.cwl#rev and packed.cwl#sorted listed in the crate's hasPart seems a bit weird. The current spec says "where files and folders are represented as Data Entities in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the hasPart property". However, these are not files, but file sections, and would still be linked indirectly (via packed.cwl) if removed from the crate's hasPart. Therefore, I think the spec should say that such "sections" MAY be listed.
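To make the "linked indirectly" point concrete, here is a minimal sketch that walks hasPart from the Root Data Entity and collects every @id it can reach. The function name `linked_from_root` and the trimmed-down graph are only illustrative assumptions; this is not how ro-crate-py or any validator checks the requirement.

```python
def linked_from_root(graph, root_id="./"):
    """Return the set of @ids reachable from root_id by following hasPart.

    `graph` is a list of JSON-LD entity dicts (a simplified @graph).
    Hypothetical helper, written only to illustrate indirect linking.
    """
    by_id = {entity["@id"]: entity for entity in graph}
    seen = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parts = by_id.get(current, {}).get("hasPart", [])
        if isinstance(parts, dict):  # a single reference instead of a list
            parts = [parts]
        stack.extend(part["@id"] for part in parts)
    return seen


# Simplified graph: the sections are NOT listed in the root's hasPart.
graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "packed.cwl"}]},
    {
        "@id": "packed.cwl",
        "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
        "hasPart": [{"@id": "packed.cwl#rev"}, {"@id": "packed.cwl#sorted"}],
    },
    {"@id": "packed.cwl#rev", "@type": "SoftwareApplication"},
    {"@id": "packed.cwl#sorted", "@type": "SoftwareApplication"},
]

reachable = linked_from_root(graph)
```

With this graph, the two section entities are still reachable through packed.cwl even though the root's hasPart lists only the file itself.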

I've made use of the workflow step example throughout the above discussion, but it actually generalizes to referencing sections of a document of any kind, when the document is part of the crate.

@mr-c

mr-c commented Apr 28, 2022

so I decided to add the tools as SoftwareApplication entities with @id packed.cwl#rev and packed.cwl#sorted (whether this is correct is another matter: should they be packed.cwl#main/rev and packed.cwl#main/sorted instead?)

If used, it should be packed.cwl#main/rev and packed.cwl#main/sorted; there is neither a #rev nor #sorted in that document
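As a rough sketch of that rule, the entity @id can be built by appending the CWL ID, unchanged, to the path of the packed document. The helper name `step_entity_id` is hypothetical and not part of cwltool or ro-crate-py.

```python
def step_entity_id(document_path, cwl_id):
    """Build a crate @id for a step from the packed document's path
    and the step's CWL ID (which already starts with '#').

    Hypothetical helper for illustration only.
    """
    if not cwl_id.startswith("#"):
        raise ValueError("expected a CWL fragment ID such as '#main/rev'")
    # Keep the full fragment, including the '#main/' prefix, so the
    # resulting reference points at an ID that exists in the document.
    return document_path + cwl_id


rev_id = step_entity_id("packed.cwl", "#main/rev")
sorted_id = step_entity_id("packed.cwl", "#main/sorted")
```

Here `rev_id` is packed.cwl#main/rev and `sorted_id` is packed.cwl#main/sorted.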

@simleo
Contributor Author

simleo commented Apr 28, 2022

Discussed at today's RO-Crate meeting:

  • Add them as Contextual entities
  • Python Library: don't add a leading hash if there's already one in the id
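A minimal sketch of what the second point could look like, assuming "already one in the id" means a # anywhere in the identifier. The function name `contextual_id` is hypothetical, and this is not the actual ro-crate-py code.

```python
from urllib.parse import urlsplit


def contextual_id(identifier):
    """Normalize a contextual entity identifier (illustrative sketch).

    - Full URIs (anything with a scheme) are left untouched.
    - Identifiers that already contain '#' are left untouched.
    - Anything else gets a leading '#'.
    """
    if urlsplit(identifier).scheme:
        return identifier
    if "#" in identifier:
        return identifier
    return "#" + identifier


ids = [contextual_id(i) for i in ("packed.cwl#main/rev", "rev", "#rev")]
```

Under this assumption, packed.cwl#main/rev and #rev pass through unchanged, while a bare rev becomes #rev.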

@stain
Contributor

stain commented May 12, 2022

Right, packed.cwl#main/rev would be the way to refer to #main/rev within packed.cwl - CWL is unusual in that it has slash-based fragments, but this is also possible with XPath selectors for XML docs.
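For what it's worth, generic URI tooling handles the slash-based fragment without special treatment. A small sketch using Python's standard `urllib.parse` (the base URL below is a made-up example, not a real crate location):

```python
from urllib.parse import urldefrag, urljoin

# Split the reference into the document part and the fragment part.
document, fragment = urldefrag("packed.cwl#main/rev")

# Resolve the relative reference against a hypothetical crate base URL.
base = "https://example.org/crate/"  # assumption, for illustration only
absolute = urljoin(base, "packed.cwl#main/rev")
```

The slash stays inside the fragment: `document` is packed.cwl, `fragment` is main/rev, and resolving against the base keeps the fragment intact.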

We could still add a section about referencing parts of other documents (which may even be contextual entities in another RO-Crate, some other Linked Data document, or just a section in a HTML/PDF), to clarify that you can use any URI/URI Reference with # in identifiers of contextual entities.
