encoding of "Numeral symbols other than decimal digits" (EGD 4.2.2) #39

Open
arlogriffiths opened this issue Jun 26, 2024 · 10 comments
Labels
help wanted Extra attention is needed


@arlogriffiths
Collaborator

@michaelnmmeyer — in tfc-khmer-epigraphy, there is a massive number of <num> elements whose contents are made up of symbols other than decimal digits that have not been wrapped in <g type="numeral"> by the responsible encoder(s) as EGD 4.2.2 prescribes. Examples:

  • <num value="1">I</num> should be <num value="1"><g type="numeral">I</g></num>
  • <num value="4">IIII</num> should be <num value="4"><g type="numeral">IIII</g></num>
  • <num value="123">100 20III</num> should be <num value="123"><g type="numeral">100</g><g type="numeral">20</g><g type="numeral">III</g></num>

There are also cases like <num value="80">80</num> which look like they contain decimal digits but where the transliteration is probably a representation of a non-decimal notation system, and so ought to be <num value="80"><g type="numeral">80</g></num> (as in the 123 example above). But there is no way for a machine to tell that these are not decimal units.
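A minimal sketch of the wrapping shown in the examples above, assuming that a run of Arabic digits or a run of capital `I` bars counts as one numeral sign. The function name `wrap_numeral_signs` and the tokenising pattern are hypothetical, not part of EGD or of any project script, and whitespace between signs is dropped as in the 123 example:

```python
import re

# Assumption: one numeral sign is either a run of ASCII digits or a run of "I" bars.
SIGN = re.compile(r"[0-9]+|I+")

def wrap_numeral_signs(text: str) -> str:
    """Wrap each numeral sign found in `text` in <g type="numeral">."""
    leftover = SIGN.sub("", text)
    if leftover.strip():
        # Anything else is left for a human encoder to look at.
        raise ValueError(f"unexpected characters in numeral: {leftover!r}")
    return "".join(f'<g type="numeral">{sign}</g>' for sign in SIGN.findall(text))

for contents in ["I", "IIII", "100 20III", "80"]:
    print(contents, "->", wrap_numeral_signs(contents))
```

The sketch cannot decide whether a string such as 80 is decimal or not; it only shows what the wrapped form would look like once that decision has been made.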

@chhomkunthea : do we ever have numbers noted with the decimal system outside of dates in the Khmer corpus? If we do not, then all such cases can automatically be converted to the encoding with <g>. You seem to have ignored EGD 4.2.2 so far. Please re-read it carefully.

Can you process the XML files and apply <g> wherever an algorithm can determine that the contents of <num> are not (exclusively) a series of decimal digits?

@danbalogh : please correct me if I have made any mistake in my representation of our encoding rules.

@chloechollet and @salomepichon: please take note of the above if you weren't aware of the rules yet.

@chhomkunthea
Contributor

Dear Arlo,

As far as I know, the numerals in the Khmer corpus are not written with the decimal system, except in dates. Salomé and Chloé may confirm this.
Thank you for finalising the encoding of numerals, especially the number I. I will check the EGD again before encoding the next inscriptions with numerals.

Best,
Kunthea

@danbalogh
Collaborator

Yes, the above notes conform to our encoding guidelines.

@arlogriffiths
Collaborator Author

In that case, @michaelnmmeyer, please wrap in <g type="numeral"> all contents of <num> other than strings of 3 or 4 Arabic numerals (as such strings are liable to be dates in the first or second millennium of the Śaka era and, as Kunthea comments, Śaka dates are normally expressed with decimal digits).
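The rule requested here could be sketched roughly as follows. This is an illustration under stated assumptions, not the script behind any actual commit: the function name `wrap_plain_nums` is hypothetical, the sample XML is invented, and the snippet ignores the TEI namespace (`http://www.tei-c.org/ns/1.0`) that real files would need in the tag names. Only `<num>` elements containing plain text are touched; anything with child elements is returned for manual review.

```python
import re
import xml.etree.ElementTree as ET

LIKELY_DATE = re.compile(r"[0-9]{3,4}")  # 3 or 4 Arabic digits: probably a Śaka year
SIGN = re.compile(r"[0-9]+|I+")          # assumption: digit runs and bar runs are single signs

def wrap_plain_nums(root):
    """Wrap text-only <num> contents in <g type="numeral">.

    Returns the <num> elements that were left alone for manual review.
    """
    skipped = []
    for num in list(root.iter("num")):
        text = (num.text or "").strip()
        if len(num) > 0 or not text:
            skipped.append(num)      # child elements (unclear, gap, ...) or empty
            continue
        if LIKELY_DATE.fullmatch(text):
            continue                 # liable to be a decimal date: keep as is
        signs = SIGN.findall(text)
        if "".join(signs) != "".join(text.split()):
            skipped.append(num)      # characters the pattern does not cover
            continue
        num.text = None
        for sign in signs:
            ET.SubElement(num, "g", {"type": "numeral"}).text = sign
    return skipped

# Invented sample, for illustration only.
sample = ET.fromstring(
    '<ab><num value="1">I</num> <num value="123">100 20III</num> '
    '<num value="966">966</num> '
    '<num value="801">8<unclear>0</unclear>1</num></ab>'
)
skipped = wrap_plain_nums(sample)
print(ET.tostring(sample, encoding="unicode"))
print(len(skipped), "element(s) left for manual review")
```

Returning the skipped elements rather than guessing at them keeps the automatic pass conservative, which seems in the spirit of the request.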

@michaelnmmeyer
Member

michaelnmmeyer commented Jun 30, 2024

This is addressed in e71eaed. There remain a number of occurrences to check and correct manually, to wit:

  • <num atLeast="11" atMost="19">10<gap reason="lost" quantity="1" unit="character"/></num>
  • <num atLeast="2" atMost="3"><choice><unclear>2</unclear><unclear>3</unclear></choice></num>
  • <num atLeast="880" atMost="888">88<gap reason="lost" quantity="1" unit="character"/></num>
  • <num atLeast="900" atMost="909">90<gap reason="lost" quantity="1" unit="character"/></num>
  • <num><choice><unclear>2</unclear><unclear>3</unclear></choice></num>
  • <num value="10"><g type="numeral">10</g></num>
  • <num value="14">10 1 <unclear>III</unclear></num>
  • <num value="17"><g type="numeral">10</g> <unclear><g type="numeral">7</g></unclear></num>
  • <num value="1"><unclear>I</unclear></num>
  • <num value="2"><unclear>II</unclear></num>
  • <num value="4"><unclear>4</unclear></num>
  • <num value="546"><supplied reason="lost">54</supplied>6</num>
  • <num value="60"><g type="numeral">60</g></num>
  • <num value="665" cert="low">66<supplied reason="lost" cert="low">5</supplied></num>
  • <num value="801">8<unclear>0</unclear>1</num>
  • <num value="860"><choice><sic>9</sic><corr>8</corr></choice>60</num>
  • <num value="902">90<choice><sic><num value="2">2</num></sic><corr><num value="3">3</num></corr></choice></num>
  • <num value="9"><supplied reason="lost">9</supplied></num>
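A list of distinct patterns like the one above could be produced along these lines. This is only a sketch: the helper name `tally_nums_with_children` is hypothetical, the sample is invented, the TEI namespace is ignored for brevity, and "needs checking" is approximated as "has at least one child element", which may not match the criterion actually used.

```python
import copy
from collections import Counter
import xml.etree.ElementTree as ET

def tally_nums_with_children(root):
    """Count each distinct serialised <num> that contains child elements."""
    counts = Counter()
    for num in root.iter("num"):
        if len(num) == 0:
            continue                  # plain-text <num>: nothing to review here
        clone = copy.deepcopy(num)
        clone.tail = None             # do not serialise the text following </num>
        counts[ET.tostring(clone, encoding="unicode")] += 1
    return counts

# Invented sample, for illustration only.
sample = ET.fromstring(
    '<ab><num value="1"><unclear>I</unclear></num> and '
    '<num value="1"><unclear>I</unclear></num> '
    '<num value="2">II</num> '
    '<num value="4"><unclear>4</unclear></num></ab>'
)
counts = tally_nums_with_children(sample)
for pattern, n in counts.most_common():
    print(n, pattern)
```

Counting identical patterns makes it easy to see which cases can be fixed in bulk and which need to be examined one inscription at a time.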

@arlogriffiths
Collaborator Author

Thanks. I have converted the above into a task list and will take care of it.

@arlogriffiths
Collaborator Author

arlogriffiths commented Jul 1, 2024

@chhomkunthea : I don't understand the cases

  • <num value="14">10 1 <unclear>III</unclear></num>: should the 1 be I and should we have <num value="14"><g type="numeral">10</g> I<unclear>III</unclear></num>?
  • <num value="17"><g type="numeral">10</g> <unclear><g type="numeral">7</g></unclear></num>: should it be <num value="17"><g type="numeral">10</g> <unclear>7</unclear></num> or <num value="17"><g type="numeral">10</g> <unclear><g type="numeral">IIIIIII</g></unclear></num>?

@arlogriffiths
Collaborator Author

@michaelnmmeyer:

  • all 28 cases of <num value="1"><unclear>I</unclear></num> should be changed to <num value="1"><g type="numeral"><unclear>I</unclear></g></num>.

@chhomkunthea
Contributor

chhomkunthea commented Jul 1, 2024

Dear Arlo,

In the case of K. 915, I would like to propose below:

<num value="14"><g type="numeral">10</g> <g type="numeral">I</g><unclear><g type="numeral">III</g></unclear></num>

And for K. 1017, it should be:

<num value="17"><g type="numeral">10</g> <unclear>7</unclear></num>

@arlogriffiths
Collaborator Author

@chhomkunthea : thanks. I have implemented your suggestion in K. 915 (or rather cleaned up the file which had some conflicts after you had implemented your suggestions).
@danbalogh : do you approve of Kunthea's solution to avoid the problem that <unclear> cannot be used inside <g>?

@danbalogh
Collaborator

I think I would prefer <num value="14"><g type="numeral">10</g> <unclear><g type="numeral">IIII</g></unclear></num> because if it were clear, the encoding of the latter part would be IIII. So that tells me that "IIII" is interpreted as a single numeral glyph, and if part of that is unclear, then the whole glyph is unclear. I hope you understand what I'm trying to say here; it's a bit difficult to express. It's analogous to how an Indian numeral 3 might be written as three lines one below the other, ≡, and if one or two of those lines were unclear, you would put the 3 in unclear tags without trying to indicate that in fact the top bar is clear and the other two are not.
Viewed the other way round, if we removed only the <unclear> from <num value="14"><g type="numeral">10</g> <g type="numeral">I</g><unclear><g type="numeral">III</g></unclear></num>, we'd be left with <num value="14"><g type="numeral">10</g> <g type="numeral">I</g><g type="numeral">III</g></num>, which I believe does not really make sense.
That said, I do understand that Kunthea's rationale in choosing the above encoding was to show that the first bar is clear and the other three are not, and I don't have a strong objection to that. So if you are happy with that solution, I think it can stay. I don't suppose we want, at this stage, to revise the encoding of these numeral bars to say that only one I can ever be wrapped in g, and I must be iterated for every single bar. That would be the only way I see that would allow us to encode that only the latter 3 bars are unclear, but simultaneously also to keep the encoding rigorous (so that the unclear tag can be removed without requiring that the remaining text be rewritten).
Thanks for bearing with me. I've been thinking aloud. Bottom line: out of 3 alternatives [1, use unclear around a g with four bars; 2, keep Kunthea's way; 3, revise the encoding method] my order of preference is 1-2-3.
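The rigour criterion used in this comment (removing the unclear tag should leave the encoding one would have written for a clear reading) can be checked mechanically. The sketch below is only an illustration: `without_unclear` is a hypothetical helper that strips the tags with a regular expression, and `clear_reading` is the fully legible encoding as described in the comment above.

```python
import re

UNCLEAR_TAGS = re.compile(r"</?unclear>")

def without_unclear(xml: str) -> str:
    """Drop <unclear> start and end tags, keeping their contents."""
    return UNCLEAR_TAGS.sub("", xml)

# Alternative 1: <unclear> around a single <g> holding all four bars.
option_1 = ('<num value="14"><g type="numeral">10</g> '
            '<unclear><g type="numeral">IIII</g></unclear></num>')
# Alternative 2: the first bar clear, the other three unclear.
option_2 = ('<num value="14"><g type="numeral">10</g> '
            '<g type="numeral">I</g><unclear><g type="numeral">III</g></unclear></num>')
# What the encoding would be if every bar were clearly legible.
clear_reading = ('<num value="14"><g type="numeral">10</g> '
                 '<g type="numeral">IIII</g></num>')

for name, encoding in [("alternative 1", option_1), ("alternative 2", option_2)]:
    same = without_unclear(encoding) == clear_reading
    print(name, "reduces to the clear reading:", same)
```

Alternative 1 reduces to the clear reading, while alternative 2 leaves two adjacent `<g>` elements (`I` followed by `III`), which is the mismatch pointed out above.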


4 participants