Name		Name	Last commit message	Last commit date
parent directory ..
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
macros.tera.xml		macros.tera.xml
readme.md		readme.md
template.tera.xml		template.tera.xml

readme.md

Document Encoder

This program encodes spreadsheets hosted on Google Drive annotating source documents with linguistic information into equivalent database entries.

Table of Contents

Document Encoder

Development

To run this program, you need to have rust and it's associated tooling installed. Then, simply navigate to this directory in a terminal and run the following:

cargo run

Spreadsheet Standards

All of the current markings are placed on the first row of each annotation line, which corresponds with the original source text in Cherokee syllabary. This leaves other rows untouched for capturing source document formatting.

Blocks

While phrases are too granular to correspond directly to a section of our loose translations, blocks are larger units of text that do this correspondence. We contain a block of text with curly brackets {} in both the source text and the translation. These occur in the syllabary row rather than between lines because a block may not fall exactly on line boundaries.

These will be matched up in the encoding process by pairing blocks in order, such that the fourth block in the annotation spreadsheet corresponds to the fourth block in the translation.

The following example represents a block that starts with the first word of line 2, and ends with the first word of line 3.


Line 2	Syllabary	{ᎩᎶᎢ	...
	Simple Phonetics	⁠kiloʔi
	...
Line 3	Syllabary	ᎨᏎᏍᏗ}	...
	Simple Phonetics	⁠gesesdi
	...

Here is a corresponding loose translation:

He was called Switch Striker. {One who heals others must take care of themselves.} You must not go around eating different women's cooking.

Generally, blocks will be closer to paragraph length as opposed to this sentence length example. If there is no clear way to break up the document into blocks, then simply enclose the whole thing in one block; the first word starts with { and the last word ends with }.

NOTE: All line breaks (explicit and implicit) must be contained within a block to be included in the TEI output.

Phrases

Square brackets [] delimit phrases and can be nested to any degree. For example, the following table produces a three word phrase with a two word phrase nested inside of it:


Line 1	Syllabary	[ᏚᏙᎥᎢ	[ᎦᎾᏍᏓ	ᏗᏥᏯ]]
	Simple Phonetics	dudoʔvʔi	gansda	di⁠ji⁠ya
	...

Line Breaks

Generally, line breaks are implicit in the division of our spreadsheets into sets of rows marked as "Line 1", "Line 2", etc. However, when there are line breaks in the source document that occur in the middle of a word, we annotate the whole word in the row that it starts on. To mark the position of the line break in these cases, place a backslash \ within the syllabary text where the break occurs.

For example:


Line 2	Syllabary	ᎤᏓᎦᏎ\ᏍᏗᏳ
	Simple Phonetics	⁠u⁠da⁠k⁠se⁠sdiyu
	...

Page Breaks

If a document consists of multiple pages, the annotation of each page exists on a separate tab of the same spreadsheet. To all be detected correctly, these tabs must be named exactly like this: Page 1, Page 2, Page 3, etc. Page breaks may also occur in the middle of a word. Taking inspiration from the last section about line breaks, we can use a backslash \ to represent such a page break, like so:


Line 2	Syllabary	ᎤᏓᎦᏎ\ᏍᏗᏳ
	Simple Phonetics	udaksesdiyu
	...

Metadata

The document metadata must be listed in a sub-sheet titled exactly Metadata. The order of fields should always stay the same based on our spreadsheet template. So far, that ordering is as follows:

Document ID: DAILP-defined unique identifier for this document.
Genre
Source Text
Title
Page # in Source Text: the number of the page this document starts with, within a larger source (like the Willie Jumper stories).
Page Count: total number of pages contained within
Translation Document ID: Identifier of a Google Doc containing the translation, pulled from the share url of that document.
Source Image URLs: a link directly to an image for each page, where each page is a column here.
Date: date-time in ISO format like YYYY-MM-DD hh:mm:ss that indicates when this document was created.
Contributor Names: Names of each contributor this document, whether as creator/author, translator, annotator, collector, etc. Each name goes in a column.
Contributor Role: Short description of the person above's role in the document, i.e. Annotator. The role should match with the person name in the same column one row above.
Source Attribution: Where this manuscript was sourced from, i.e. Beineke Rare Book & Manuscript Library
Source Attribution Link: A link to the organization or specific record this document was source from. Each entry should correspond with the source just above it.

Translations

A translation for a document should consist of paragraphs that correspond directly to blocks in the annotated spreadsheet. There are a few important rules to keep in mind:

Paragraphs should be separated by one empty line to correspond to different annotation blocks.
All text placed below a horizontal line will be ignored by the document processor.
If there are multiple translations, each one should be placed into its own Google Document.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

migration

migration

readme.md

Document Encoder

Development

Spreadsheet Standards

Blocks

Phrases

Line Breaks

Page Breaks

Metadata

Translations

Files

migration

Directory actions

More options

Directory actions

More options

Latest commit

History

migration

Folders and files

parent directory

readme.md

Document Encoder

Development

Spreadsheet Standards

Blocks

Phrases

Line Breaks

Page Breaks

Metadata

Translations