<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns:rng="http://relaxng.org/ns/structure/1.0"
xmlns="http://www.tei-c.org/ns/1.0"
xmlns:sch="http://purl.oclc.org/dsdl/schematron"
xmlns:eg="http://www.tei-c.org/ns/Examples"
xmlns:egXML="http://www.tei-c.org/ns/Examples"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:tei="http://www.tei-c.org/ns/1.0"
xml:lang="en" n="tei_clarin">
<teiHeader>
<fileDesc>
<titleStmt>
<title>CLARIN.SI TEI schema for language corpora</title>
<author>Tomaž Erjavec, tomaz.erjavec@ijs.si</author>
</titleStmt>
<publicationStmt>
<publisher>CLARIN.SI</publisher>
<date>2021-10-31</date>
<availability status="free">
<p>This file is freely available and you are hereby authorised to copy, modify, and redistribute it in any way without further reference or permissions.</p>
</availability>
<pubPlace>
<ref target="https://github.com/clarinsi/TEI-schema">https://github.com/clarinsi/TEI-schema</ref>
</pubPlace>
</publicationStmt>
<sourceDesc>
<p>Made from scratch.</p>
</sourceDesc>
</fileDesc>
<encodingDesc>
<projectDesc>
<p>Slovenian Research Infrastructure for Language Resources and Tools <ref target="http://www.clarin.si/">CLARIN.SI</ref>.</p>
</projectDesc>
</encodingDesc>
<revisionDesc>
<change when="2021-10-31">Tomaž Erjavec: change example document a lot.</change>
<change when="2021-09-29">Tomaž Erjavec: reduce to corpora, work on text.</change>
<change when="2021-08-24">Tomaž Erjavec: start adding text of recommendations.</change>
<change when="2020-12-16">Tomaž Erjavec: change representation of whitespace.</change>
<change when="2019-09-06">Tomaž Erjavec: added module for figures, so we can code tables; newly generated schemas.</change>
<change when="2018-12-31">Tomaž Erjavec: added module for names and dates so we can code e.g. parliamentary corpora; newly generated schemas.</change>
<change when="2018-12-28">Tomaž Erjavec: added module for spoken text so we can code e.g. parliamentary corpora; newly generated schemas.</change>
<change when="2018-10-10">Tomaž Erjavec: added transcription module so we can use facsimile; newly generated schemas.</change>
<change when="2018-04-09">Tomaž Erjavec: added dictionaries module and newly generated schemas as TEI has changed (added @msd et al).</change>
</revisionDesc>
</teiHeader>
<text>
<front>
<titlePage>
<docTitle>
<titlePart type="main"><ref target="https://github.com/clarinsi/TEI-schema/">CLARIN.SI TEI schema for language corpora</ref></titlePart>
</docTitle>
<docDate>2021-10-31</docDate>
<docEdition>0.3</docEdition>
</titlePage>
<p></p>
<divGen type="toc"/>
</front>
<body>
<div xml:id="sec-intro">
<head>Introduction</head>
<p>This document gives recommendations on the preferred
<ref target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html">TEI</ref>
XML encoding of language
corpora in the <ref target="https://www.clarin.si/">CLARIN.SI</ref> repository.</p>
<p>The TEI customisation for CLARIN.SI supports the encoding of language corpora and makes
explicit recommendations on the manner of encoding various phenomena.</p>
<p>These recommendations are written as a TEI ODD document (i.e. prose recommendations
and a formal schema), from which it is possible to derive an XML schema
expressed as a RelaxNG schema, a DTD or a W3C schema; such schemas are
also part of this Git repository.</p>
<p>When using these recommendations, the following points should be taken into consideration:
<list>
<item>The TEI Guidelines are a large and complex set of specifications, and the
CLARIN.SI recommendations do not attempt to give an introduction to them. Rather,
they concentrate only on aspects of the Guidelines that are most likely to be of use
in encoding linguistically annotated language corpora which are likely to be deposited in
the CLARIN.SI repository.</item>
<item>It is difficult if not impossible to determine in advance what kinds of corpora
and their mark-up will be deposited in the repository. While we tried to cater for the
more obvious encoding needs, it is possible that new users will have to use TEI elements
or attributes that are not mentioned here.</item>
<item>Since these recommendations are (always) work in progress, we have left the formal
specification (i.e. the XML schema) very unconstrained, so it can accommodate encoding
practices that we have not (yet) foreseen. The downside of this is that the XML schemas
will allow constructs that are at odds with those that we propose in the prose of the
recommendations. Therefore, the prose should be taken as the definitive way of encoding
the phenomena under discussion, i.e. even if a corpus validates against the
schema, it might still not be encoded according to these recommendations.</item>
<item>In the text of these recommendations, every mention of an element is linked to its
definition, where examples of use are also given. The text also makes frequent reference
to the text of the TEI Guidelines. However, the TEI Guidelines give generic examples and
explanations, which can be at odds with particular recommendations that are made here, so
the ones from the TEI Guidelines should be taken with a grain of salt.</item>
</list>
</p>
<p>The rest of these recommendations are structured as follows:
<list>
<item>the rest of this section details the <ref target="#sec-scope">scope and
purpose</ref> of the recommendations;</item>
<item><ref target="#sec-general">Section 2</ref> gives the general requirements that
a CLARIN.SI corpus has to meet;</item>
<item><ref target="#sec-overall">Section 3</ref> explains the overall document
structure of a CLARIN.SI corpus;</item>
<item><ref target="#sec-metadata">Section 4</ref> concentrates on encoding the corpus
metadata;</item>
<!--item><ref target="#sec-transcript">Section 5</ref> gives the encoding of the corpus body;</item-->
<item><ref target="#sec-linguistic">Section 6</ref> details linguistic annotations;</item>
<item><ref target="#sec-multimedia">Section 7</ref> covers multimedia information;</item>
<item><ref target="#sec-conversion">Section 9</ref> discusses conversions to and from
the CLARIN.SI format;</item>
<item><ref target="#examplar">Appendix A</ref> gives a complete example document that illustrates
the encoding according to the CLARIN.SI schema;</item>
<item><ref target="#schema">Appendix B</ref> gives the formal specification of
the CLARIN.SI schema.</item>
</list>
</p>
<div xml:id="sec-scope">
<head>Scope and purpose</head>
<p>These recommendations consist of readable guidelines and a formal TEI ODD schema with
derived XML schemas in various schema languages. They are intended for the encoding of
linguistically annotated corpora, regardless of the language or country of origin, for
the purposes of scholarly investigations, be they from the field of linguistics,
political science, history or other humanities and social sciences disciplines.</p>
<p>In developing a schema for structuring data, two approaches can be adopted: a
descriptive one, where as much as possible of the original data distinctions are kept in
the target encoding; or a prescriptive one, where the target encoding is severely
constrained, to enable seamless data interchange and esp. interoperability with software
tools. These recommendations adopt the <emph>descriptive</emph> approach, as the source
data, the time and effort devoted to converting it, and the intended applications will differ
considerably, and it is likely that any prescriptive schema would soon turn out to be too
restrictive. Nevertheless, the recommendations do try to limit the plethora of encoding
options otherwise available in TEI to those that could be sensibly applied to language
corpora, and where more than one option is available in TEI to encode a given phenomenon,
the schema and especially the text guidelines attempt to recommend only one option.</p>
</div>
</div>
<div xml:id="sec-general">
<head>General requirements</head>
<p>A CLARIN.SI corpus should, in general, capture as much of the text from the source as
possible, while the presence of graphical items or other elements that could not or were not
transcribed should be indicated by markup, in particular with the use of <gi>gap</gi>.</p>
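<p>As a minimal sketch of this practice, an omitted graphic could be marked as shown
below. The value of the <att>reason</att> attribute and the content of <gi>desc</gi> are
illustrative assumptions, not values prescribed by these recommendations:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <p>The results are summarised in the following chart.
    <gap reason="editorial"><desc>chart omitted</desc></gap>
    As can be seen, the trend is rising.</p>
</egXML>
</p>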
<div xml:id="sec-chars">
<head>Characters</head>
<p>The corpus should be encoded in Unicode, using the UTF-8 character encoding, at least
for European languages. In cases where the original contains characters from the Unicode
Private Use Area, these should be given their closest Unicode equivalents.<note>TEI
supports preserving the original Private Use Area codepoints by using the <gi>g</gi>
element, the use of which is further explained in the Section on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/WD.html">Characters, Glyphs,
and Writing Modes</ref> of the TEI Guidelines. However, such characters will rarely if
ever be used in corpora, so the current proposal does not include the gaiji module which
allows the use of this element - if they are needed, then the CLARIN.SI ODD needs to
be changed.</note></p>
<p>End-of-line hyphens can be removed, and the split words joined in order to simplify
linguistic processing. It is recommended that this practice is documented in the TEI
header of the corpus, in the <gi>hyphenation</gi> element.</p>
<p>The following characters, esp. prevalent when the source documents are in Word or
HTML, deserve special mention:
<list>
<item>NO-BREAK SPACE (U+00A0) prevents, with some applications, an automatic line break
at its position and collapsing consecutive white space characters into a single
space. As this recommendation is not interested in preserving the details of the
layout, and the use of this character complicates (or breaks) further processing
esp. linguistic annotation, it is recommended that this character is substituted by the
normal space character (U+0020). The same holds for other variants of spaces (U+2000 -
U+200A), which are, however, used much less frequently.</item>
<item>NON-BREAKING HYPHEN (U+2011), similarly to NO-BREAK SPACE, prevents a
line break, in this case, following its position. With a similar reasoning
as above, it is recommended that this character is substituted by the normal
hyphen character ('-', U+002D).</item>
<item>SOFT HYPHEN (U+00AD) indicates that a word can be hyphenated at that
point. Occurrences of this character should be removed from the corpus,
because, again, they only complicate or break further processing.</item>
</list>
</p>
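<p>The three substitutions above could, for instance, be carried out with a small XSLT 1.0
stylesheet. The following is only a sketch under stated assumptions, not a tool distributed
with these recommendations: it assumes well-formed XML input, changes text nodes only
(attribute values are copied unchanged), and handles just the three characters named in
the list:
<eg>
<![CDATA[<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy all attributes and nodes not matched below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Text nodes: NO-BREAK SPACE becomes a space, NON-BREAKING HYPHEN becomes '-',
       SOFT HYPHEN is deleted; the explicit priority avoids a conflict with the
       identity template -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of select="translate(., '&#xA0;&#x2011;&#xAD;', ' -')"/>
  </xsl:template>
</xsl:stylesheet>]]>
</eg>
The <code>translate()</code> function maps each character of its second argument to the
character at the same position in its third argument; as SOFT HYPHEN has no counterpart
there, it is removed.</p>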
<p>While not required, it is sensible to also normalise sequences of whitespace
characters (such as tabulators, end-of-line characters and spaces) into a single
space or end-of-line character. Again, this simplifies further (esp. linguistic)
processing.</p>
</div>
<div xml:id="sec-document">
<head>Documenting the encoding process</head>
<p>Difficult encoding situations that are not covered by the TEI Guidelines can be
documented in the <gi>editorialDecl</gi> of the corpus TEI header. In particular, if the
source texts have been changed (e.g. by omitting or normalising figures, text, EOL hyphens,
quotes, special characters, etc. as discussed above) this practice can be documented in
the <gi>correction</gi>, <gi>normalization</gi>, <gi>quotation</gi>, and, as mentioned,
in the <gi>hyphenation</gi> element of the editorial declaration. Two further elements,
namely <gi>segmentation</gi> and <gi>interpretation</gi> can also be used to document
these aspects of the encoding process. The example below illustrates the use of these
elements:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-editorial">
<editorialDecl>
<correction>
<p>Found typos in the source have been silently corrected.</p>
</correction>
<normalization>
<p>Tables have been omitted from the corpus. Spacing has been normalised
to single space. Soft hyphens have been removed.</p>
</normalization>
<hyphenation>
<p>End-of-line hyphens have been silently removed.</p>
</hyphenation>
<quotation>
<p>Quotation marks have been left in the text and are not explicitly
marked up.</p>
</quotation>
<segmentation>
<p>The texts are segmented into paragraphs, sentences, words and
punctuation.</p>
</segmentation>
<interpretation>
<p>Word-level linguistic annotation comprises the lemma of a word and its
morphosyntactic description, which follow the
<ref target="http://nl.ijs.si/ME/V6/msd/">MULTEXT-East morphosyntactic
specification Version 6</ref> for Slovene.</p>
</interpretation>
</editorialDecl>
</egXML>
</p>
<p>When automatic procedures have been used to encode the texts (most prominently, to
add linguistic markup, as discussed in the Section on <ref
target="#sec-linguistic">Linguistic annotation</ref>) this should be documented in
the <gi>appInfo</gi> element of the <gi>encodingDesc</gi>, as shown in the example
below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-appinfo">
<appInfo>
<application version="1.0" ident="reldi-tagger">
<label>ReLDI morphosyntactic tagger and lemmatiser</label>
<desc>MSD tagging and lemmatisation performed with ReLDI Tagger trained for
Slovene and available from
<ref target="https://github.com/clarinsi/reldi-tagger">GitHub</ref>.</desc>
</application>
</appInfo>
</egXML>
</p>
</div>
<div xml:id="sec-langs">
<head>Languages</head>
<p>The language of an element's text content is in TEI, as in XML, signaled by the
value of its <att>xml:lang</att> attribute. The CLARIN.SI recommendations require
that each element that contains text is either marked by this attribute, or one of
its ancestors is; in particular, the root element of the corpus should have an
<att>xml:lang</att> attribute. For multilingual documents (excluding cases where only
a minor part of the text is in another language), the language code of the root
element should be <q>mul</q> for <q>multiple languages</q>. Note that if, going by
the ancestor axis, the values of two <att>xml:lang</att> are in conflict, the one
closer to the context node is the relevant one.
</p>
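<p>The inheritance of <att>xml:lang</att> can be sketched as follows; the choice of
elements and the English sentence are assumptions made for the purpose of illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <text xml:lang="sl">
    <body>
      <p>
        <!-- Inherits xml:lang="sl" from the ancestor text element -->
        <s>Ali je to slaba stvar?</s>
        <!-- The closer xml:lang overrides the inherited value -->
        <s xml:lang="en">Is this a bad thing?</s>
      </p>
    </body>
  </text>
</egXML>
</p>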
<p>The values of <att>xml:lang</att> should follow <ref
target="https://tools.ietf.org/html/bcp47">BCP 47</ref>, cf. also <q><ref
target="https://www.w3.org/International/questions/qa-when-xmllang">xml:lang in XML
document schemas</ref></q>.</p>
<p>It is good practice to document the languages used in the <gi>langUsage</gi>
element of the TEI header.</p>
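<p>A language usage declaration could look roughly as follows. The languages and the
percentages in the <att>usage</att> attribute are invented for illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <langUsage>
    <language ident="sl" usage="80">Slovenian</language>
    <language ident="en" usage="20">English</language>
  </langUsage>
</egXML>
</p>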
<p>Apart from the above considerations, a related question is where to draw the line
between the object and meta languages, i.e. the language of the corpus and the
language of the mark-up. The TEI defines the names of the elements and attributes in
English, while the language of the corpus will, of course, depend on its source
texts. It is less straightforward to decide in which language the attribute
values (such as the values of the <att>type</att> attribute) should be. CLARIN.SI
recommends that these should also be in English.</p>
</div>
<div xml:id="sec-idents">
<head>Identifiers and referencing</head>
<p>To make it easy to refer to elements of a TEI document (i.e. a CLARIN.SI
corpus), elements can be marked with an ID, i.e. given the <att>xml:id</att> attribute
with a unique value, obeying certain format requirements as defined by
<ref target="https://www.w3.org/TR/xml-id/">W3C</ref>.</p>
<p>CLARIN.SI requires an <att>xml:id</att> attribute on the root element of each corpus
file, which should, furthermore, be identical to the filename (modulo the file
extension). CLARIN.SI also recommends that the divisions of the document (element
<gi>div</gi>), if any, should also be given identifiers. While any element can be given
an <att>xml:id</att>, this is, in general, not a good idea; rather, only those elements
that will or could be referenced should be marked with this attribute.</p>
<p>TEI offers a number of attributes that contain (URI) pointers. Where the reference
is to an element inside the document, the value of the <att>xml:id</att> being
referred to should be preceded by a hash (#), as mandated by the XML standard. If the
ID pointed to is from another document, then the full URI needs to be used.</p>
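<p>The two kinds of pointers can be sketched as follows. The identifiers and the
<code>example.org</code> address are hypothetical and serve only to show the form of the
attribute values:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <!-- Pointer to an element inside the same document -->
  <ptr target="#doc1.div2"/>
  <!-- Pointer to an element in another document, given as a full URI -->
  <ptr target="https://example.org/corpus/doc2.xml#doc2.div1"/>
</egXML>
</p>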
<p>However, as such URIs can be very long, TEI also offers another way of pointing, which
can be used to shorten such long URIs, and this is defined by the <gi>prefixDef</gi>
element in the TEI header, as illustrated below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-prefixDef">
<prefixDef ident="mte" matchPattern="(.+)"
replacementPattern="http://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#$1">
<p xml:lang="en">Private URIs with this prefix point to feature-structure elements defining the Slovenian MULTEXT-East Version 6 MSDs.</p>
</prefixDef>
</egXML>
With such a definition, we can use the much shorter pointers in the mark-up of words,
such as <val>mte:Pd-nsg</val>, which are then, via a regular expression mapping in the
prefix definition, converted to the full URI <val>http://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#Pd-nsg</val>.
</p>
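<p>As a sketch of how the pieces fit together, the prefix definition is placed in a
<gi>listPrefixDef</gi> inside the <gi>encodingDesc</gi> of the TEI header, and the prefixed
pointer is then used on a word. The use of the <att>ana</att> attribute to carry the
pointer, as well as the word form and lemma, are assumptions made for illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <!-- In the TEI header, inside encodingDesc: -->
  <listPrefixDef>
    <prefixDef ident="mte" matchPattern="(.+)"
               replacementPattern="http://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#$1"/>
  </listPrefixDef>
  <!-- In the text: -->
  <w lemma="ta" ana="mte:Pd-nsg">tega</w>
</egXML>
</p>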
</div>
<div xml:id="sec-temporal">
<head>Temporal information</head>
<p>Corpora can contain time-related information, e.g. the date and time of a tweet, the
birth (and death) dates of an author, etc. In general, such information in TEI is stored in the
attributes of the pertinent element, which take as their values a date and possibly time,
according to the ISO 8601 Date and Time Formats, and specified in the <ref
target="https://www.w3.org/TR/xmlschema-2/">XML Schema Part 2: Datatypes Second
Edition</ref>. TEI offers a very rich set of attributes and ancillary elements to specify
time-related information, which are discussed in the Section on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CONADA">Dates and
Times</ref> of the TEI Guidelines.</p>
<p>CLARIN.SI corpora can use any of the TEI temporal attributes and elements; however,
for most purposes, the following five attributes will suffice:
<list>
<item><att>when</att>: when a certain event happened;</item>
<item><att>from</att>, <att>to</att>: the start and end of an event or state;</item>
<item><att>notBefore</att>, <att>notAfter</att>: the earliest and latest known
time that an event or state took place, used in cases where the exact time
is not known.</item>
</list>
</p>
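<p>The use of these five attributes can be sketched as follows. All dates and the choice
of elements carrying them are invented for illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <!-- Exact date and time of an event -->
  <date when="2021-10-31T14:30:00">31 October 2021, 14:30</date>
  <!-- A state with a known start and end -->
  <affiliation from="1990-05-07" to="1992-12-23">Member of parliament</affiliation>
  <!-- A birth date known only to lie within a range of years -->
  <birth notBefore="1940" notAfter="1945"/>
</egXML>
</p>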
</div>
<div xml:id="sec-files">
<head>Files</head>
<p>While these recommendations make the assumption that a complete corpus is one TEI XML
document, this does not mean that it also has to be stored in one file, as the file
structure is distinct from the concept of XML documents. To enable one XML document to be
composed of many files, the <ref target="https://www.w3.org/TR/xinclude/">XInclude</ref>
mechanism should be used. Typically, a corpus will then be composed of a file containing
the root XML element <gi>teiCorpus</gi>, which contains the corpus header, while
individual <gi>TEI</gi>-rooted text files will be included in the corpus using the
<gi>include</gi> element from the XInclude namespace, as illustrated by the following
example:
<eg xml:id="exa-xinclude">
<![CDATA[<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="Sk-11/SI-1990-05-07-01.xml"/>]]>
</eg>
</p>
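<p>Put together, the corpus root file could look roughly like the sketch below. The
<att>xml:id</att> value and the second <att>href</att> are hypothetical; following the
file naming recommendation, such a root file would be stored as
<code>my-corpus.xml</code>:
<eg>
<![CDATA[<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
           xmlns:xi="http://www.w3.org/2001/XInclude"
           xml:id="my-corpus" xml:lang="sl">
  <teiHeader>
    <!-- Common corpus metadata -->
  </teiHeader>
  <!-- Each included file is rooted in a TEI element -->
  <xi:include href="Sk-11/SI-1990-05-07-01.xml"/>
  <xi:include href="Sk-11/SI-1990-05-07-02.xml"/>
</teiCorpus>]]>
</eg>
To obtain the complete document, the root file has to be read with a processor that
supports XInclude (for example, <code>xmllint</code> with its <code>--xinclude</code>
option).</p>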
<p>As mentioned, we recommend that the file has the same name as the value of the
<att>xml:id</att> attribute of the root element of the file. This e.g. guarantees
that each file of the corpus has a unique name.</p>
</div>
</div>
<div xml:id="sec-overall">
<head>Overall document structure</head>
<div xml:id="sec-corpstruct">
<head>Corpus structure</head>
<p>As illustrated below, a CLARIN.SI corpus is rooted in a <gi>teiCorpus</gi>
element. The <gi>teiHeader</gi> of the corpus contains the metadata for the complete
corpus, including the metadata that is marked with the <att>xml:id</att> attribute and
referred to by the subordinate <gi>TEI</gi> elements, such as the defined taxonomies for
text types.</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-docstructure">
<teiCorpus xml:lang="xx">
<teiHeader>
<!-- Common corpus metadata -->
</teiHeader>
<TEI xml:id="id.1">
<teiHeader>
<!-- Document metadata -->
</teiHeader>
<text>
<body>
<!-- Document text -->
</body>
</text>
</TEI>
<!-- More TEI elements here -->
</teiCorpus>
</egXML>
<p>An individual <gi>TEI</gi> element is referred to as a <hi>corpus element</hi>. In
cases where such a category is simply defined (e.g. a corpus of books) it contains one
text from the corpus; however, if corpus "texts" are very short (tweets, samples) or
very long or complex (encyclopedias) one corpus element can contain a collection or
part of a "text" according to some well-specified criterion.</p>
<p>In cases of smaller corpora, the top level <gi>teiCorpus</gi> can also be omitted,
so the complete corpus is rooted simply in a <gi>TEI</gi> element, and the individual (short)
texts are encoded as <gi>div</gi> elements.</p>
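<p>Such a small corpus could be sketched as follows; the identifiers are hypothetical:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <TEI xml:id="tweets" xml:lang="xx">
    <teiHeader>
      <!-- Corpus metadata -->
    </teiHeader>
    <text>
      <body>
        <div xml:id="tweets.1">
          <!-- First short text -->
        </div>
        <div xml:id="tweets.2">
          <!-- Second short text -->
        </div>
      </body>
    </text>
  </TEI>
</egXML>
</p>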
<p>The <gi>text</gi> element can, in general, apart from the obligatory <gi>body</gi>,
also contain front matter in <gi>front</gi> and back matter in <gi>back</gi>.
However, front and back matter are seldom used in computer corpora.</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-textstructure">
<TEI xml:id="id_1">
<teiHeader>
<!-- Document metadata -->
</teiHeader>
<text>
<front>
<!-- Front matter -->
</front>
<body>
<!-- Transcription text -->
</body>
<back>
<!-- Back matter -->
</back>
</text>
</TEI>
</egXML>
</div>
<div xml:id="sec-textstruct">
<head>Text divisions</head>
<p>The <gi>div</gi> element can be used to further divide the texts.
The divisions can be nested, as shown in the example below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-divsimple">
<body>
<div>
<head>Part I.</head>
...
</div>
<div>
<head>Chapter 1</head>
...
</div>
<div>
<head>Chapter 2</head>
...
</div>
...
</body>
</egXML>
As with corpus elements, there is no hard and fast rule what should constitute a
division, except that they typically have a heading.
The divisions can be further characterised by their <att>type</att> and, possibly,
<att>subtype</att> attributes. They can be used when the digital source of the texts
either explicitly (e.g. via its structure, as in up-conversion from Word documents) or
implicitly (e.g. via pattern matching the content of the headings) indicates what kind of
a division it is. For example:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-divtype">
<body>
<div type="part">
<head>Part I.</head>
...
</div>
<div type="chapter">
<head>Chapter 1</head>
...
</div>
<div type="chapter">
<head>Chapter 2</head>
...
</div>
...
</body>
</egXML>
</p>
<p>If used, the values of the <att>type</att> and <att>subtype</att> attributes will
depend on the structure of the source texts, on the need to distinguish the types of
divisions, as well as on the ability to automatically recognise them or the available
effort to manually add them. The CLARIN.SI specification does therefore not enforce the
use of these attributes nor does it restrict their values. Below we give an example of a
relatively complex structure made on the basis of a corpus of parliamentary proceedings:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-divakn">
<body>
<div type="prayers">
<head>Prayers</head>
...
</div>
<div type="oralStatements">
<head>Speaker’s Statement</head>
...
</div>
<div type="questions">
<head>Oral Answers to Questions</head>
<div type="debateSection" subtype="topic">
<head>Health</head>
<div type="debateSection" subtype="askedPerson">
<head>The Secretary of State was asked—</head>
<div type="debateSection" subtype="questionAnswer">
<head>Ambulance Waiting Times</head>
...
</div>
</div>
</div>
</div>
<div type="pointOfOrder">
<head>Points of Order</head>
...
</div>
</body>
</egXML>
</p>
</div>
<div xml:id="sec-docvariants">
<head>Document variants</head>
<p>Corpora can exist in two or more versions, e.g. the original and its translation(s) into
another language(s).</p>
<p>TEI offers a number of options on how to encode <q>variant</q> texts, most of them
discussed in the <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html">Chapter on Linking,
Segmentation, and Alignment</ref>. We here present the simplest option, where it is
assumed that the text of each language exists in a separate TEI document and that the
sentences should be aligned between the original and translation. As shown in the
example below, which gives one sentence from the file <code>text-orig.xml</code> and one
from the file <code>text-trans.xml</code> it is in this case enough to specify the
<att>xml:id</att> on both elements and use the <att>corresp</att> attribute to point to
the aligned sentence(s):
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-variants">
<!-- From text-orig.xml: -->
<s xml:id="orig.1" corresp="text-trans.xml#trans.1">Ali je to slaba stvar?</s>
<!-- From text-trans.xml: -->
<s xml:id="trans.1" corresp="text-orig.xml#orig.1">キレるってそんなに悪いことでしょうか?</s>
</egXML>
It should be noted that the relation between the aligned elements does not need
to be 1-1: if the relation is 0-1 or 1-0, then the non-aligned element is simply
not given a <att>corresp</att> attribute; if the relation is n-1 or 1-n, then several
IDs are given as values of the <att>corresp</att> attribute,
e.g. <code>corresp="text-orig.xml#orig.3 text-orig.xml#orig.4"</code>.</p>
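<p>A 2-1 alignment and a non-aligned sentence could be sketched as follows. The
identifiers are hypothetical and the sentence content is elided:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <!-- From text-orig.xml: two sentences translated as one -->
  <s xml:id="orig.3" corresp="text-trans.xml#trans.3">...</s>
  <s xml:id="orig.4" corresp="text-trans.xml#trans.3">...</s>
  <!-- From text-orig.xml: a sentence without a translation has no corresp -->
  <s xml:id="orig.5">...</s>
  <!-- From text-trans.xml: one sentence pointing to both originals -->
  <s xml:id="trans.3" corresp="text-orig.xml#orig.3 text-orig.xml#orig.4">...</s>
</egXML>
</p>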
</div>
</div>
<div xml:id="sec-metadata">
<head>Corpus metadata</head>
<p>TEI allows significant metadata to be added to a document. The metadata is
contained in the <gi>teiHeader</gi> element, which in corpora can appear at two
levels:
<list>
<item>the overall corpus teiHeader, i.e. as part of the <gi>teiCorpus</gi> element;</item>
<item>the teiHeader of individual corpus texts, i.e. as part of a <gi>TEI</gi> element.</item>
</list>
It is recommended that the metadata that is common to the whole corpus is stored in
the corpus TEI header, whereas the text-specific metadata is in the corpus text TEI
header.</p>
<p>It is outside the scope of this specification to give all the details of a
<gi>teiHeader</gi> element; for this, the user is referred to the Section on the <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html">TEI header</ref>
of the TEI Guidelines, and, of course, to the example corpora that are part of the
CLARIN.SI Git repository. Here we do, however, give some examples that are useful for
a variety of corpora.</p>
<div xml:id="sec-speakers">
<head>Speaker metadata</head>
<p>Speech corpora typically contain information about speakers, which is given in the
corpus TEI header, in particular in the <gi>listPerson</gi> element, itself a part of the
participant description, i.e. the <gi>particDesc</gi> element.</p>
<p>A <gi>listPerson</gi> typically contains <gi>person</gi> elements, which give
information on an individual person, as the example below illustrates.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-speakers">
<person xml:id="KucanMilan1941">
<persName>
<surname>Kučan</surname>
<forename>Milan</forename>
</persName>
<sex value="M">male</sex>
<birth when="1941-01-14">
<placeName ref="http://www.geonames.org/3197229">Križevci</placeName>
</birth>
</person>
</egXML>
Each <gi>person</gi> must have an <att>xml:id</att> attribute, so that it can be referred
to from the transcription. Apart from that, the only required element is
<gi>persName</gi>, giving the name of the person. This can be contained directly in the
element, or, as the preferred option, further decomposed into the person surname(s) and
forename(s) or even other elements, as explained in the Section on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ND.html#NDPER">Personal
Names</ref> of the TEI Guidelines.</p>
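<p>A reference from the transcription to a person could then be sketched as follows. The
use of the <att>who</att> attribute on an utterance (<gi>u</gi>) is an assumption about
how the transcription is encoded, and the content of the utterance is elided:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <!-- The value of who is the xml:id of the person, preceded by a hash -->
  <u who="#KucanMilan1941">...</u>
</egXML>
</p>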
<p>As illustrated above, further person metadata can contain the sex of the person and
their birth date and place. Other potentially useful elements are the person's
<gi>death</gi> date and place, as well as (possibly time stamped) <gi>education</gi>,
<gi>occupation</gi>, and <gi>affiliation</gi>.</p>
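<p>These further elements could be used as in the following sketch, which describes an
invented person; all names, dates and descriptions are hypothetical:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <person xml:id="NovakAna1950">
    <persName>
      <surname>Novak</surname>
      <forename>Ana</forename>
    </persName>
    <sex value="F">female</sex>
    <birth when="1950-03-12"/>
    <death when="2015-11-02"/>
    <education>University degree in law</education>
    <occupation>Lawyer</occupation>
    <!-- Time-stamped affiliation -->
    <affiliation from="1996" to="2000">Member of parliament</affiliation>
  </person>
</egXML>
</p>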
<p>Persons can have further attributes, and TEI offers various elements (typically typed)
to express them; they are introduced in the Section on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ND.html#NDPERSEpc">Personal
Characteristics</ref> of the TEI Guidelines. The two more general ones are
<gi>state</gi>, which contains the description of some status or quality attributed to a
person (or organization), often at some specific time or for a specific date range, and
<gi>trait</gi>, which differs from <gi>state</gi> in that it is independent of the volition
or action of the holder and is usually not tied to a specific time or date
range. The former could, for example, be used to encode the fact that a person was jailed
for a given period of time, while the latter would e.g. be used for the information that
a person is handicapped.</p>
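<p>A minimal sketch of the two elements is given below, assuming an invented person. The
<att>type</att> values <val>imprisonment</val> and <val>physical</val> are illustrative
choices, not values prescribed by these recommendations.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-state-trait">
  <!-- Hypothetical person; dates and descriptions are placeholders -->
  <person xml:id="SpeakerExample3">
    <persName>Example Person</persName>
    <!-- state: holds for a limited, dated period -->
    <state type="imprisonment" from="1958" to="1960">
      <desc>Imprisoned for two years.</desc>
    </state>
    <!-- trait: not tied to a particular period -->
    <trait type="physical">
      <desc>Uses a wheelchair.</desc>
    </trait>
  </person>
</egXML>
</p>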
<p>It can be advantageous to refer to external knowledge sources about a person, such as
Wikipedia or VIAF. This is encoded using the <gi>idno</gi> element, whose content is
typically a URI, while the <att>type</att> attribute denotes the kind of knowledge
source referred to.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-external">
<person xml:id="Kucan_Milan1941">
<persName>
<surname>Kučan</surname>
<forename>Milan</forename>
</persName>
<idno type="wikimedia" xml:lang="sl">https://sl.wikipedia.org/wiki/Milan_Ku%C4%8Dan</idno>
<idno type="wikimedia" xml:lang="en">https://en.wikipedia.org/wiki/Milan_Ku%C4%8Dan</idno>
<idno type="viaf">https://viaf.org/viaf/68121580/</idno>
</person>
</egXML>
</p>
</div>
</div>
<!--div xml:id="sec-transcript">
<head>Encoding of texts</head>
<p>In this section we illustrate the encoding of various types of texts from corpora in the
CLARIN.SI repository.
<p>
<bibl>Erjavec, Tomaž; et al., 2021, Multilingual comparable corpora of parliamentary
debates ParlaMint 2.1, Slovenian language resource repository CLARIN.SI, <ref
target="http://hdl.handle.net/11356/1432">http://hdl.handle.net/11356/1432</ref>.
</bibl>
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-relations">
<text ana="#reference">
<body>
<div type="debateSection">
<head>REPUBLIKA SLOVENIJA DRŽAVNI ZBOR</head>
<head type="session">1. izredna seja</head>
<head type="chairman">Sejo sta vodila predsednik Državnega zbora dr. Milan Brglez in podpredsednik Janko Veber.</head>
<note type="time">Seja se je začela ob 10. uri.</note>
<note type="speaker">PREDSEDNIK DR. MILAN BRGLEZ:</note>
<u who="#BrglezMilan"
xml:id="ParlaMint-SI_2014-08-25-SDZ7-Izredna-01.u1"
ana="#chair">
<seg xml:id="ParlaMint-SI_2014-08-25-SDZ7-Izredna-01.seg1">Spoštovane kolegice poslanke in kolegi poslanci, gospe in gospodje!</seg>
<seg xml:id="ParlaMint-SI_2014-08-25-SDZ7-Izredna-01.seg2">Začenjam 1. izredno sejo Državnega zbora, ki sem jo sklical na podlagi drugega odstavka 58. člena in drugega odstavka 60. člena Poslovnika Državnega zbora. Obveščen sem, da se današnje seje ne more udeležiti poslanec Marijan Pojbič. Vse prisotne lepo pozdravljam.</seg>
<seg xml:id="ParlaMint-SI_2014-08-25-SDZ7-Izredna-01.seg3">Preden preidemo na določitev dnevnega reda seje, dovolite, da nagovorim Državni zbor v zvezi s spominom na žrtve vseh totalitarnih in avtoritarnih režimov.</seg>
...
</u>
</div>
</body>
</text>
</egXML>
</p>
<p>
<bibl>Verdonik, Darinka; et al., 2021, Spoken corpus Gos VideoLectures 4.2
(transcription), Slovenian language resource repository CLARIN.SI,
<ref target="http://hdl.handle.net/11356/1444">http://hdl.handle.net/11356/1444</ref>.</bibl>
<egXML>
</egXML>
</p>
-->
<!-- From ParlaMint:
<p>The transcriptions of the parliamentary debates are the central part of these
recommendations, and this section explains how to encode the transcriptions of speeches
proper, the commentary inserted by the transcribers, and the encoding of interruption
of speeches and various verbal and non-verbal incidents in the parliament. Most of
these elements are explained in the Chapter on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/TS.html">Transcription of
Speech</ref> of the TEI Guidelines.</p>
<div xml:id="sec-uterrance">
<head>Utterances and commentary</head>
<p>In the transcriptions the main distinction to be made is between the
transcriptions of the utterances of the speakers and the commentary inserted by
the transcriber, such as the titles of the divisions, results of voting, comments on
what is happening in the chamber etc. The former should be encoded using the
utterance element, <gi>u</gi>, while the latter are encoded using a variety of elements,
such as <gi>head</gi> or <gi>note</gi>, and possibly others, as further discussed
below. Below we give an example of a rather straightforward start of a division:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-relations">
<div>
<head>REPUBLIKA SLOVENIJA DRŽAVNI ZBOR</head>
<head type="session">nadaljevanje 39. seje</head>
<note type="chairman">Sejo so vodili predsednik Državnega zbora dr. Milan Brglez
in podpredsednika Primož Hainz ter Matjaž Nemec.</note>
<note type="time">Seja se je začela ob 10.03.</note>
<note type="speaker">PREDSEDNIK DR. MILAN BRGLEZ:</note>
<u who="#SDZ7.BrglezMilan" ana="#chair">
<seg xml:id="SDZ7-Redna-39-2018-03-27.seg1">Spoštovani kolegice poslanke in
kolegi poslanci, gospe in gospodje!</seg>
<seg xml:id="SDZ7-Redna-39-2018-03-27.seg2">Začenjam z nadaljevanjem 39. seje
Državnega zbora.</seg>
...
</u>
...
</div>
</egXML>
The example starts with the division heading, saying that this is the National Assembly
of the Republic of Slovenia, with the second heading specifying that this is the
continuation of the 39th session. Next come three notes, first one specifying who
chaired the session, the second when the session started and the third the name of the
first speaker. It should be noted that these are not formal specifications, rather,
they are simply parts of the transcript that have been wrapped in certain elements.</p>
<p>After these preliminary notes comes the transcript of the speech proper, which, as
mentioned, is encoded using the <gi>u</gi> element. Its main attribute is
<att>who</att>, giving the pointer to the <gi>person</gi> element containing the
metadata of the speaker. The <gi>u</gi> element can also have the <att>ana</att>
attribute giving one or more pointers to a typology of types of speakers. In our case,
it would point to a category that specifies that the speaker is the chair of the
session.</p>
<p>The utterances can (but are not required to) be segmented using the generic
TEI element for segments, <gi>seg</gi>, encoding the paragraphs of the source
transcription.<note>The reason why the TEI element for paragraphs (<gi>p</gi>) is not
used is that utterances, being essentially (transcriptions of) spoken text, do not
allow for internal paragraphs, a concept pertinent to written text.</note></p>
</div>
</div-->
<div xml:id="sec-linguistic">
<head>Linguistic annotation</head>
<p>This section introduces common types of linguistic annotation that can be added
to the corpus texts; the examples should be sufficient for users to be able to add
further types of linguistic annotation to their own corpora.</p>
<p>The TEI Guidelines discuss basic linguistic annotation in their Chapter on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/AI.html">Simple Analytic
Mechanisms</ref> and we follow one particular option given there. In particular, it is
recommended that (where possible) the annotation is in-line (as opposed to stand-off),
i.e. that the linguistic annotation is given in the main document, and therefore mixed
with the other annotations, rather than in a separate document with pointers into the
base text.</p>
<div xml:id="sec-anawords">
<head>Basic word-level annotation</head>
<p>Basic linguistic annotation comprises sentence segmentation, tokenisation,
part-of-speech tagging and lemmatisation. The CLARIN.SI recommendations
specialise the recommendations given in the Section on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/AI.html#AILALW">Lightweight
Linguistic Annotation</ref> of the TEI Guidelines. The following example shows the
basic principles of the annotation:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-ud">
<s>
<w msd="UPosTag=DET|Case=Gen|Gender=Neut|Number=Sing|PronType=Dem" lemma="ta">Tega</w>
<w msd="UPosTag=PRON|PronType=Prs|Reflex=Yes|Variant=Short" lemma="se">se</w>
<w msd="UPosTag=PART" lemma="sploh">sploh</w>
<w msd="UPosTag=AUX|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin" lemma="biti">nisem</w>
<w msd="UPosTag=VERB|Aspect=Perf|Gender=Masc|Number=Sing|VerbForm=Part" lemma="zavesti" join="right">zavedel</w>
<pc msd="UPosTag=PUNCT">.</pc>
</s>
</egXML>
Sentences are marked up using the <gi>s</gi> element, words with the <gi>w</gi>
element and punctuation symbols with the <gi>pc</gi> element. To retain the
linguistically significant whitespace, the <att>join</att> attribute is used, with the
possible values being <val>no</val> (assumed to be the default), <val>right</val>
(no whitespace to the right of the token), <val>left</val> (no whitespace to the
left of the token) and <val>both</val> (no whitespace to either side of the
token). While, in the preceding example, it would be more intuitive to have the
value <val>left</val> marked on the full-stop, we recommend that only the value
<val>right</val> is used on the preceding token, as this simplifies processing.</p>
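<p>To illustrate the recommendation with a slightly more involved case, the sketch below
encodes the invented English sentence <q>Yes (indeed).</q> using only the value
<val>right</val>. Linguistic attributes are omitted for brevity.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-join">
  <s>
    <!-- "Yes" is followed by a space, so it carries no join attribute -->
    <w>Yes</w>
    <!-- No whitespace after the opening bracket -->
    <pc join="right">(</pc>
    <!-- No whitespace between the word and the closing bracket -->
    <w join="right">indeed</w>
    <!-- No whitespace between the closing bracket and the full stop -->
    <pc join="right">)</pc>
    <pc>.</pc>
  </s>
</egXML>
The original spacing can then be restored by emitting a space after every token that
does not carry <att>join</att> with the value <val>right</val>.</p>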
<p>The base form of a word is given in the <att>lemma</att> attribute,<note>Note
that punctuation characters, <gi>pc</gi>, do not have a <att>lemma</att> attribute,
as they cannot sensibly be said to have lemmas.</note> while the situation with the
part-of-speech tags is somewhat more complicated. For analytic tagsets, where a
"part-of-speech tag" is actually a set of attribute-value pairs, as in the example above,
the <att>msd</att> attribute should be used. For synthetic tagsets, such as the <ref
target="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn
Treebank tagset</ref>, which have atomic tags that cannot always be decomposed into
attribute-value pairs (e.g. "TO" for the word "to"), a better alternative is to use
the <att>pos</att> attribute.</p>
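<p>As a sketch of the second case, the invented English sentence below is annotated with
Penn Treebank tags in the <att>pos</att> attribute; the sentence and its lemmas are
illustrative only.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-pos">
  <s>
    <!-- Atomic Penn Treebank tags go into pos rather than msd -->
    <w pos="PRP" lemma="I">I</w>
    <w pos="VBP" lemma="want">want</w>
    <w pos="TO" lemma="to">to</w>
    <w pos="VB" lemma="go" join="right">go</w>
    <pc pos=".">.</pc>
  </s>
</egXML>
</p>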
<p>There is also a third option, for tags that look like atomic strings but are in fact
meant as a shorthand for a feature-structure representation, as is the case with
the <ref target="http://nl.ijs.si/ME/V6/msd/html/">MULTEXT-East tagset</ref>. For
these, it is best to use the generic <att>ana</att> attribute, whose value is a
pointer, as shown in the following example:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-msd">
<s>
<w ana="#Pd-nsg" lemma="ta">Tega</w>
<w ana="#Px------y" lemma="se">se</w>
<w ana="#Q" lemma="sploh">sploh</w>
<w ana="#Va-r1s-y" lemma="biti">nisem</w>
<w ana="#Vmep-sm" lemma="zavesti" join="right">zavedel</w>
<pc ana="#Z">.</pc>
</s>
</egXML>
Here, the tags are pointers to identifiers, and the elements bearing these
identifiers define the corresponding feature structures, i.e. sets of
attribute-value pairs, as in the example below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-fs">
<fs xml:id="Pd-nsg" xml:lang="en">
<f name="CATEGORY"><symbol value="Pronoun"/></f>
<f name="Type"><symbol value="demonstrative"/></f>
<f name="Gender"><symbol value="neuter"/></f>
<f name="Number"><symbol value="singular"/></f>
<f name="Case"><symbol value="genitive"/></f>
</fs>
</egXML>
</p>
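<p>Definitions for the other tags follow the same pattern. As a sketch, the two shortest
tags used above, <code>#Q</code> and <code>#Z</code>, might be defined as follows, here
already wrapped in a feature-value library element, <gi>fvLib</gi>. The symbol values
<val>Particle</val> and <val>Punctuation</val> and the <att>n</att> label are
assumptions made for illustration and should be checked against the actual MULTEXT-East
feature library.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-fvlib">
  <fvLib n="Example MSD library">
    <!-- Assumed definitions; tags without further features carry only CATEGORY -->
    <fs xml:id="Q" xml:lang="en">
      <f name="CATEGORY"><symbol value="Particle"/></f>
    </fs>
    <fs xml:id="Z" xml:lang="en">
      <f name="CATEGORY"><symbol value="Punctuation"/></f>
    </fs>
  </fvLib>
</egXML>
</p>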
<p>Such feature structures are grouped together in the feature-value library
(<gi>fvLib</gi>) element, which can be contained in its own <gi>TEI</gi> element of
the corpus. As <att>ana</att> is a pointer, it can also contain complete URLs
(e.g. <code>http://nl.ijs.si/ME/V6/msd/tables/msd-fslib2-sl.xml#Pd-nsg</code>) which
enables the feature-structure definitions to be stored externally to the
corpus. However, prefixing such PoS tags for each token by the complete URL would
lead to very large files. This is why the TEI offers a mechanism to shorten
references to URLs. This mechanism is explained in the Section <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SAPU">Using
Abbreviated Pointers</ref> of the TEI Guidelines, and we give an example below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-prefix">
<s>
<w ana="mte:Pd-nsg" lemma="ta">Tega</w>
<w ana="mte:Px------y" lemma="se">se</w>
<w ana="mte:Q" lemma="sploh">sploh</w>
<w ana="mte:Va-r1s-y" lemma="biti">nisem</w>
<w ana="mte:Vmep-sm" lemma="zavesti" join="right">zavedel</w>
<pc ana="mte:Z">.</pc>
</s>
</egXML>
As can be seen, the only difference from the preceding example is that the values (IDs) of
the tags are preceded by <code>mte:</code> rather than <code>#</code>. This prefix should
then be expanded by the processing software to whatever the <gi>prefixDef</gi> element,
defined in the TEI header, specifies, as shown in the example below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-prefixDef">
<prefixDef ident="mte"
matchPattern="(.+)"
replacementPattern="http://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#$1">
<p xml:lang="en">Private MSD URIs with the prefix "mte" point to fs elements
defining the Slovene MULTEXT-East Version 6 MSDs, cf. <ref
target="http://nl.ijs.si/ME/V6/">http://nl.ijs.si/ME/V6/</ref> and <ref
target="https://github.com/clarinsi/mte-msd">https://github.com/clarinsi/mte-msd</ref>.</p>
</prefixDef>
</egXML>
</p>
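<p>As a sketch of where this definition sits, the <gi>prefixDef</gi> element is placed
inside a <gi>listPrefixDef</gi> element in the encoding description
(<gi>encodingDesc</gi>) of the TEI header; the descriptive paragraph is elided here.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-listPrefixDef">
  <encodingDesc>
    <listPrefixDef>
      <!-- Same definition as above, shown in its place in the header -->
      <prefixDef ident="mte"
                 matchPattern="(.+)"
                 replacementPattern="http://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#$1">
        <p>...</p>
      </prefixDef>
    </listPrefixDef>
  </encodingDesc>
</egXML>
With this definition, the pointer <code>mte:Pd-nsg</code> matches the pattern as a whole,
so that <code>$1</code> is <code>Pd-nsg</code> and the pointer expands to
<code>http://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#Pd-nsg</code>.</p>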
</div>
<div xml:id="sec-ananorm">
<head>Normalised and syntactic words</head>
<p>In certain contexts a word (or, in general, a token) in the text needs to be
normalised in a certain way. This can happen with historical
texts, which contain archaic wordforms and where we wish to annotate them with
their modernised forms, or when the text is linguistically annotated, and the
annotation framework distinguishes original words from syntactic words (i.e. has the
concept of <q>multiwords</q>), as is the case in the <ref
target="https://universaldependencies.org/format.html#words-tokens-and-empty-nodes">Universal
Dependencies framework</ref>.</p>
<p>For simple normalisation, where one word token is normalised into another word token,
the <att>norm</att> attribute on word or punctuation tokens should be used, as explained
at the end of the Section <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/AI.html#AILALW">Lightweight
Linguistic Annotation</ref> of the TEI Guidelines.</p>
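<p>As a minimal sketch of this simple case, an archaic spelling can be paired with its
modernised form as shown below. The word pair <q>kteri</q> and <q>kateri</q> is given
purely as an illustration and is not taken from a particular corpus.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-1-1">
  <!-- Element content is the original form; norm gives the normalised form -->
  <w norm="kateri">kteri</w>
</egXML>
</p>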
<p>More challenging is the case where one original word token must be represented as
several normalised words, either in the context of historical corpora or, as mentioned
above, in the context of multiword units. For this we use embedded empty words with
associated <att>norm</att> attributes, and possibly other attributes with linguistic
annotation. For example, Czech has the word <q>abyste</q> which is decomposed into two
syntactic words, <q>aby</q> and <q>byste</q>. This should be encoded as in the following
example:<note>Note that the example is rendered in three lines; however, the correct
encoding in the corpus is on a single line, without any spaces between the
elements, as otherwise the new line and indenting spaces would become part of the word
<q>abyste</q>.</note>
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-1-2">
<w>abyste<w norm="aby" lemma="aby"/><w norm="byste" lemma="být"/></w>
</egXML>
</p>
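<p>The embedded syntactic words can carry further linguistic annotation. The sketch below
adds <att>msd</att> attributes to the example above; the feature values shown are
assumed, modelled on Universal Dependencies conventions, and should be verified against
the tagset actually used. As above, the element is kept on a single line.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-1-2-msd">
<w>abyste<w norm="aby" lemma="aby" msd="UPosTag=SCONJ"/><w norm="byste" lemma="být" msd="UPosTag=AUX|Mood=Cnd|Number=Plur|Person=2|VerbForm=Fin"/></w>
</egXML>
</p>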
<p>There are also cases where two (or more) original words correspond to one normalised
word. Here, it is the outer word that carries the <att>norm</att> and possibly other
linguistic attributes, while the inner words are the original ones. For example, Slovene
used to form the superlative form of adjectives with the word <q>naj</q> written
separately (and often as <q>nar</q>), while in contemporary Slovene the <q>naj</q> is a
prefix of the adjective. This case should be encoded as follows:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-2-1">
<w norm="najlepši" lemma="lep"><w>nar</w> <w>lepši</w></w>
</egXML>
</p>
</div>
<div xml:id="sec-anasegment">
<head>Segmental annotation</head>
<p>A common annotation type, used e.g. for marking named entities or terms, is
segmental annotation, where a stretch of text or tokens is simply enclosed in XML
tags, as the following example illustrates:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-names">
<s>
<name type="person">
<w>John</w>
<w>Malkovič</w>
</name>
<w>went</w>
<w>to</w>
<name type="location">
<w>New</w>
<w>York</w>
</name>
<pc>.</pc>
</s>
</egXML>
TEI offers a number of elements that can be used for such annotations,
e.g.:
<list>
<item><gi>term</gi> for marking up terms, discussed in the Section on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COHQU">Terms,
Glosses, Equivalents, and Descriptions</ref> of the TEI Guidelines;</item>
<item><gi>name</gi> for various types of names, or, the more general <gi>rs</gi>
for <q>referring string</q>, e.g. <tag>rs type="person"</tag> her
husband<tag>/rs</tag>, discussed in the Section on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CONARS">Referring
Strings</ref> of the TEI Guidelines;</item>
<item><gi>num</gi> for numbers and <gi>measure</gi>, usually comprising a number, a
unit, and a commodity name (e.g. <tag>measure type="weight" quantity="5000"
unit="ton" commodity="coal"</tag>five thousand tons of coal<tag>/measure</tag>),
discussed in the Section on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CONANU">Numbers
and Measures</ref> of the TEI Guidelines;</item>
<item><gi>date</gi> and <gi>time</gi>, discussed in the Section on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CONADA">Dates
and Times</ref> of the TEI Guidelines;</item>
<item><gi>seg</gi> for cases where TEI does not have a specific element for some
type of segmental markup, e.g. <tag>seg type="swearword"
subtype="religious"</tag>Damn<tag>/seg</tag>; this element is discussed in the
Section on <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SASE">Blocks,
Segments, and Anchors</ref> of the TEI Guidelines.</item>
</list>
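As a brief sketch with invented content, two of these elements might be combined within
one tokenised sentence as follows:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-segmental-mix">
  <s>
    <w>On</w>
    <!-- The date is normalised in the when attribute -->
    <date when="2021-03-05">
      <w>5</w>
      <w>March</w>
      <w>2021</w>
    </date>
    <w>they</w>
    <w>sold</w>
    <!-- Attribute values repeat those of the measure example in the list above -->
    <measure type="weight" quantity="5000" unit="ton" commodity="coal">
      <w>five</w>
      <w>thousand</w>
      <w>tons</w>
      <w>of</w>
      <w join="right">coal</w>
    </measure>
    <pc>.</pc>
  </s>
</egXML>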
It should be noted that for cases of discontinuity of the segment, the <att>prev</att>
and <att>next</att> attributes can be used to link its parts together. Furthermore,