
Inputhons

Piotr Banski edited this page Jun 14, 2024 · 26 revisions

"Inputhon" is our super-fancy name for a type of hackathon where the people responsible for a centre's recommendations on data deposition formats meet for (say) an hour in order to prepare or update their centre's content for the SIS.

Please note: the content of this document is still being formed. As of May 2024, several inputhons have been held at several CLARIN centres, and we will be very grateful for feedback, including PRs made against this wiki, or simply a new GitHub issue to let us know what in the guidelines below needs improving.

0. TL;DR

The goal is to (ideally) end the event with a submission of a pull request against one of the files in https://github.com/clarin-eric/standards/tree/formats/SIS/clarin/data/recommendations (note that it's not the master branch).

Post-event, the centre can either

  • point its users to the SIS (recommended, because of the data aggregation that happens there), or else
  • re-use the same data (you don't want to maintain two copies of recommendations, do you) by pulling them out of the SIS via its API (an example is supplied; essentially, you just need to style the data according to your site's make-up).

1. Motivation

For CLARIN B-centres which need to undergo (re-)certification,

  • storing format recommendations in the SIS satisfies the relevant CoreTrustSeal recommendation (see section 8 (R08, "Deposit & Appraisal") of the Extended Guidance), which checks, among other things, whether the repository offers a list of preferred formats.
  • Incidentally, two bullets down, R08 asks about info on "the approach towards digital objects that are deposited in non-preferred formats" -- that information can also be provided by the SIS, both in the general section describing the centre and/or in comments on formats, especially those labelled as "discouraged" (="non-preferred", in CTS lingo).

For other centres/repositories, storing the information is a way to:

  • get that done in a uniform format, and along a tested route;
  • be able to use a clean template and/or examples provided in the recommendations by other centres;
  • obtain statistics based on the aggregated information;
  • not bother about displaying the information...
    • at all, if the centre/repository points to the SIS for that purpose, or
    • much, if the centre takes the data via the SIS API and applies its own (or provided) CSS styling to it.

2. Preparation

These steps are optional but advisable. If they seem like too much of a time investment, skip them. But we would appreciate it if you could go via pull requests, also for the sake of keeping track of the project's history and in order to give you proper credit.

2.1. Give us a heads-up

Tell us about the intention to hold an inputhon, so that we can make sure that the centre is represented in the system, and that at least a skeletal recommendations file for it exists. We can then also at least try to make ourselves available for consultation over Zoom, etc.

2.2. Get the SIS

(Reminder: this is optional; perhaps a 'techy' member of your team can perform this step, and be around for the inputhon?)

  • fork the SIS, clone your own repo instance, install eXist and the SIS
  • optionally, you might want to integrate that new DB instance with your oXygen editor (yes, there are a lot of assumptions here), because then you will be able to visualise your changes just by dragging the recommendations file from oXygen's project panel to the DB connection panel (and refreshing the local SIS instance in the browser). Please do not feel discouraged if this paragraph is not clear to you.

2.3. Locate the XML document describing the recommendations for your centre

The native GitHub way, if you've forked/cloned

The recommended way is to look at the SIS /clarin/data/recommendations/ directory and locate your centre's data. For example, for the IDS, the document is IDS-recommendation.xml. Please bear in mind that the same centre may use different names across different RIs, so search for the alternatives as well. We're not yet sure how to handle that kind of variation, and your opinion on this matter may help.

If you can't locate your centre, please let us know, either by e-mail (see the "About" page of the SIS) or (better) by posting an issue.

The workaround by exporting centre data

If you don't want to bother with cloning the SIS repository (oh please, do bother...) then locate your centre in the list of centres supported by the SIS. If you can't locate the centre, use the link above to post a GitHub issue.

Once you have located your centre, click on "download template" (if the page is empty) or "export table to XML" (if the table has already been populated). In the latter case, please note that, as a centre representative, you should not feel obligated to keep the existing recommendations if the centre page shows a red notice saying "Warning: The recommendations have not been curated yet". In most cases, that notice means that we populated the recommendations ourselves at the testing stage. We either used information obtained directly from the centre by one of the Standards Committee members, or we (superficially and quickly) interpreted the recommendations posted by your centre, squeezing them into, and smearing them across, the functional domain system that the SIS uses, and more or less straightforwardly taking the recommendation levels (recommended, acceptable, discouraged) from your centre's documentation. You may want to thoroughly re-examine our choices -- we were only seeding the system.

Getting centre recommendations this way (by export from the SIS) equips you with the XML file, but that file lacks the associated document grammar. You can get it separately and associate it by hand, but that requires a bit of tinkering. Perhaps it is easier, altogether, to just fork/clone the repository and edit the source right in your GitHub account? Do not be afraid that you might accidentally break something if you try -- that is, fortunately, extremely difficult.

2.4. Mind the functional domains

Reserve a few minutes to take a look at the data domains, see which of them correspond to the functions of the data that your centre is ready to receive. Please read through the descriptions of the particular domains. Treat the domains, together with the three levels of recommendation, as a scaffolding upon which your centre's recommendations will be placed.

3. Execution

3.1. General procedure

  • In case you haven't done that in the previous step, have a look at the data domains, see which of them correspond to the functions of the data that your centre is ready to receive.
  • For each of the selected domains, decide which formats to list and at which level of recommendation:
    • if the centre wishes to receive data in that format and it is going to be easy to curate, archive, etc. -- choose "recommended";
    • if it's an "if you really must" format -- choose "acceptable";
    • you might also want to discourage submissions in some format -- choose "discouraged" in such cases, and do consider briefly explaining why the centre discourages the format, or naming the preferred alternative, if there is one. The place for that explanation is the <comment> element (see below for some examples).

In other words, you'd be going in a loop, where the number of iterations would be the number of data domains that you have pre-selected. You may be wondering if it's better to be very comprehensive or maybe start small. There is no best answer to that, but in order to make the task manageable (especially within that single magical hour), we suggest that

  • you first identify the kinds of formats that your centre has simple workflows for, and consider marking them as "recommended" for the given domains;
  • and then consider those for which your workflows are not very simple, perhaps not fully automated, and mark those as "acceptable";
  • if there are formats that you get questions about but for some reason you'd rather not touch them, mark those as "discouraged" and provide a comment as to why, and perhaps hints at possible alternatives.

If you take the path of editing the (forked) source with an XML editor, you will benefit from XML Schema and Schematron -- both are used to constrain the XML you're going to produce, and they often provide suggestions on the valid values and structures. You will then also be able to use the template provided in each empty recommendations document.

Once you are done with creating/editing the recommendations file of your centre, please create a pull request against the formats branch. See: Creating a pull request from a fork

3.2. Finding out about format IDs

In the examples provided in the section that follows, you can see that we reference formats in two places: one is the @id attribute of the <format> element, and another, optional, is the @ref attribute of <formatRef>.

A format ID is made up of an f followed by a mnemonic identifier, hence, e.g. fWave or fTextPlain. You can copy the SIS IDs from the page listing formats, by clicking on the button next to the format name. Sometimes, the format that a centre recommends (or discourages, etc.) will not (yet) be separately described by the SIS. A list of such formats, which have no information pages of their own but are nevertheless mentioned by recommendations, is to be found in our Sanity Checker, at the top (in the "List of missing formats by ID").

If you don't see your chosen format in that list, please make the ID up and let us know about that.
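If you do need to make an ID up, a quick shape check may help catch typos before you submit. The sketch below only encodes the "f plus mnemonic" convention described above; the set of characters allowed in the mnemonic is our assumption (based on IDs such as fWave and fCHAT-XML), and both function names are hypothetical. The SIS schema remains the authority on what is valid.

```python
import re

# Assumption: a mnemonic consists of letters, digits, dots, underscores
# or hyphens. The actual SIS schema may be stricter or looser than this.
FORMAT_ID = re.compile(r"f[A-Za-z0-9][A-Za-z0-9._-]*")


def looks_like_format_id(candidate: str) -> bool:
    """Return True if the string has the 'f + mnemonic' shape."""
    return FORMAT_ID.fullmatch(candidate) is not None


def suggest_format_id(name: str) -> str:
    """Derive a candidate ID from a free-form format name.

    Characters outside the assumed set are dropped and an 'f' is prepended.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "", name)
    if not cleaned:
        raise ValueError("no usable characters in format name")
    return "f" + cleaned


if __name__ == "__main__":
    for candidate in ("fWave", "fCHAT-XML", "Wave"):
        print(candidate, looks_like_format_id(candidate))
```

A passing check says nothing about whether the ID already exists in the SIS; for that, consult the page listing formats or the Sanity Checker.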

3.3. Comments on the individual recommendations

You may want to comment on the recommendations, e.g. to restrict the range of acceptable options (to e.g. mention the kind of a/v encoding that you would be most happy with, etc.) or to point users to alternative formats if you choose to label some format as "discouraged". Use the <comment> element for that.

You can also use language tags (in the optional xml:lang attribute) to provide information in the native language of your users. Comments without a language tag are going to be treated as comments in English, and the system will fall back to English whenever in doubt. Note that if your RI is Text+, German text is going to be prioritised over English.

If you want to reference another format, use the <formatRef> element with the ref attribute containing the ID of the format as defined in the SIS. Section 3.2 describes how to find the IDs and what to do when your chosen format is not listed yet.

A few examples follow:

      <format id="fCHAT-XML">
         <domain>Audiovisual Annotation</domain>
         <level>discouraged</level>
         <comment>Consider using <formatRef ref="fTEISpoken"/> instead.</comment>
      </format>

Note: below, we use the same ID twice, and the <comment> element for fine-grained distinctions.

      <format id="fWave">
         <domain>Audiovisual Source Language Data</domain>
         <level>recommended</level>
         <comment>PCM-WAV, 48 kHz, 16 bit</comment>
      </format>
      <format id="fWave">
         <domain>Audiovisual Source Language Data</domain>
         <level>acceptable</level>
         <comment>PCM-WAV with non-recommended parameters (not 48 kHz, 16 bit)</comment>
      </format>

Below, note the differing data domains as well as language-tagged comments:

      <format id="fTextPlain">
         <domain>Audiovisual Annotation</domain>
         <level>discouraged</level>
      </format>
      <format id="fTextPlain">
         <domain>Documentation</domain>
         <level>recommended</level>
      </format>
      <format id="fTextPlain">
         <domain>Text Annotation</domain>
         <level>discouraged</level>
      </format>
      <format id="fTextPlain">
         <domain>Textual Source Language Data</domain>
         <level>recommended</level>
         <comment>without markup</comment>
         <comment xml:lang="de">ohne Mark-up</comment>
      </format>
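To double-check what you have entered, it can help to tabulate the <format> entries by domain. The following Python sketch does that for a small fragment in the style of the examples above. Several things are assumptions on our part: the <formats> wrapper element is hypothetical, the fragment uses no XML namespace (a real recommendations file may well use one), and the language handling is a plain "requested language, else English" fallback that does not try to reproduce the SIS's own prioritisation rules.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical wrapper element; the <format> entries mirror the examples above.
SAMPLE = """
<formats>
  <format id="fWave">
    <domain>Audiovisual Source Language Data</domain>
    <level>recommended</level>
    <comment>PCM-WAV, 48 kHz, 16 bit</comment>
  </format>
  <format id="fTextPlain">
    <domain>Textual Source Language Data</domain>
    <level>recommended</level>
    <comment>without markup</comment>
    <comment xml:lang="de">ohne Mark-up</comment>
  </format>
  <format id="fTextPlain">
    <domain>Text Annotation</domain>
    <level>discouraged</level>
  </format>
</formats>
"""


def pick_comment(fmt, lang="en"):
    """Return the comment in `lang`, else the English one, else None.

    Untagged comments are treated as English.
    """
    by_lang = {}
    for comment in fmt.findall("comment"):
        text = "".join(comment.itertext()).strip()
        by_lang.setdefault(comment.get(XML_LANG, "en"), text)
    return by_lang.get(lang, by_lang.get("en"))


def summarise(xml_text, lang="en"):
    """Group (format ID, level, comment) triples by data domain."""
    root = ET.fromstring(xml_text.strip())
    table = defaultdict(list)
    for fmt in root.iter("format"):
        entry = (fmt.get("id"), fmt.findtext("level"), pick_comment(fmt, lang))
        table[fmt.findtext("domain")].append(entry)
    return dict(table)


if __name__ == "__main__":
    for domain, entries in summarise(SAMPLE, lang="de").items():
        print(domain, entries)
```

Note that collecting the comment text this way drops any embedded <formatRef> elements, since they carry no text of their own.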

3.4. Overall information about the centre

At the top of the recommendations document is the <info> element, which you can use to provide information on the centre, but also information about "the approach towards digital objects that are deposited in non-preferred formats", to quote the CTS requirements. Put the text into HTML <p> elements; you can also use the HTML list elements (ol, ul, li) as well as <a> for links.

If you work in the source, the associated schemas should make the editing easier. One way or another, you are welcome to consult a separate wiki page on the Detailed syntax of information elements in the SIS.

4. Using the data

Do not forget to let the world know about the work you've put into preparing the recommendations. Point to the listing from the pages of your centre.

The SIS is set up in such a way that you shouldn't need to maintain two instances of data, one for the local pages, and one for the SIS. Note that such an approach would increase the maintenance costs (person-hours). The idea is: you do it once, use the data you've input, and revisit only if the centre/repository policy changes or when, as a B-centre, you need to get re-certified.

Pointing to the data in the SIS

You can simply point your users to your centre's data by using a direct link. For the IDS, you would use https://clarin.ids-mannheim.de/standards/views/view-centre.xq?id=IDS (or the aliased https://standards.clarin.eu/sis/views/view-centre.xq?id=IDS). Consult the list of centres to find out what the ID of yours is.

Using the data input into SIS to populate the centre's local pages

You can also retrieve your data via the REST API offered by the SIS. Again, for the IDS, you would use, e.g. curl 'https://clarin.ids-mannheim.de/standards/rest/views/recommended-formats-with-search.xq?centre=IDS&export=yes' -- have a look at the API documentation to see which parameters are available.
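If you would rather script the retrieval than call curl by hand, the same request can be sketched in Python with the standard library only. The endpoint and the centre and export parameters are taken from the curl example above; the function names and the timeout value are hypothetical, and the sketch makes no assumption about the layout of the returned document, which it simply hands back as raw bytes.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as used in the curl example above.
BASE_URL = (
    "https://clarin.ids-mannheim.de/standards/rest/views/"
    "recommended-formats-with-search.xq"
)


def export_url(centre_id: str) -> str:
    """Build the export URL for one centre, e.g. 'IDS'."""
    return BASE_URL + "?" + urlencode({"centre": centre_id, "export": "yes"})


def fetch_recommendations(centre_id: str, timeout: float = 30.0) -> bytes:
    """Download the exported recommendations as raw bytes.

    Requires network access; the timeout is an arbitrary choice.
    """
    with urlopen(export_url(centre_id), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    print(export_url("IDS"))
```

From there, a centre's build script could store the bytes or transform them for its own pages; consult the API documentation for further parameters before relying on any of them.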

You can see an example of querying the data with jQuery at https://github.com/IDS-Mannheim/IDS-Mannheim.github.io, and the corresponding simple webpage is available for viewing at https://ids-mannheim.github.io/standards/ (many browsers will let you view the source with Ctrl+U). If you would like to contribute a CSS (or XSL) stylesheet to render the info in a nicer way, please feel welcome to contact us and we will set up a directory for such contributions.

See also