
rework text processors #793

Merged: 12 commits, Apr 21, 2024

Conversation

StefanVukovic99
Member

Part of #787; the changes necessary to support Korean.

  1. Text preprocessing happens before deinflection; text postprocessing, which runs after deinflection, is added here. This is primarily to support jamo decomposition/recomposition for Hangul, but could have other uses.

  2. Changes the order of operations in the translator so that substrings of the looked-up text are preprocessed and then deinflected (instead of deinflecting substrings of the preprocessed text). As a consequence, the (complicated) TextSourceMap is no longer used. This too is primarily motivated by Hangul, but it also fixes an issue where text processors that change the number of characters caused the wrong text to be selected on scan (this affected non-Japanese languages). There is a slight change in behavior for some textReplacements.
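The loop interchange described in point 2 can be sketched as follows. This is illustrative code, not the actual translator implementation; `preprocess` and `deinflect` stand in for the real processor and deinflector pipelines:

```javascript
// Before: preprocess the whole text once, then deinflect its substrings.
// A substring of the processed text may not line up with the original
// characters, which is why a source map was needed.
function lookupOld(text, preprocess, deinflect) {
    const processed = preprocess(text);
    const results = [];
    for (let i = processed.length; i > 0; --i) {
        results.push(...deinflect(processed.substring(0, i)));
    }
    return results;
}

// After: take substrings of the original text first, then preprocess each
// one. The substring length is measured in original characters, so the
// right span can be highlighted on scan without any source map.
function lookupNew(text, preprocess, deinflect) {
    const results = [];
    for (let i = text.length; i > 0; --i) {
        results.push(...deinflect(preprocess(text.substring(0, i))));
    }
    return results;
}
```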

@StefanVukovic99 StefanVukovic99 requested a review from a team as a code owner March 29, 2024 20:22

github-actions bot commented Mar 29, 2024

✔️ No visual differences introduced by this PR.

View Playwright Report (note: open the "playwright-report" artifact)

@StefanVukovic99 StefanVukovic99 added kind/meta The issue or PR is meta area/linguistics The issue or PR is related to linguistics labels Mar 29, 2024
@jamesmaa
Collaborator

Dumb question, but what is a source map?

@StefanVukovic99
Member Author

StefanVukovic99 commented Apr 18, 2024

what is a source map

An object for keeping track of which input characters were replaced by more or fewer than one character during preprocessing (apparently coined by nutbread; it is not a common term). It tracks the length of the original scanned text so the right number of characters can be highlighted on scan.

The current code preprocesses the text, then deinflects its substrings starting from the longest. This PR performs a loop interchange: now we take the substring first, then preprocess it (necessary for Hangul). This makes tracking the original text's length simpler and makes the TextSourceMap redundant.
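For illustration, here is a toy version of the source-map idea; the function names and the collapsing rule are mine, not the real TextSourceMap. During preprocessing, record how many source characters each output character consumed, so an output span can be mapped back to a length in the original text:

```javascript
// Toy preprocessor that collapses a doubled letter into one output
// character, recording how many source characters each output character
// consumed (mapping[i] = source chars behind output[i]).
function preprocessTracked(text) {
    let output = '';
    const mapping = [];
    for (let i = 0; i < text.length; ++i) {
        if (text[i] === text[i + 1]) {
            output += text[i];
            mapping.push(2); // two source chars became one output char
            ++i;
        } else {
            output += text[i];
            mapping.push(1);
        }
    }
    return {output, mapping};
}

// Map a span of the processed output back to the number of original
// characters, i.e. how many characters to highlight on scan.
function getSourceLength(mapping, outputLength) {
    let length = 0;
    for (let i = 0; i < outputLength; ++i) { length += mapping[i]; }
    return length;
}
```

For example, preprocessing "aabc" yields "abc" with mapping [2, 1, 1], so a two-character match against the output corresponds to three original characters.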

@jamesmaa (Collaborator) left a comment

Ngl, this PR feels a bit out of my league. Asking some questions for now for my own edification, and I'll take another look tomorrow.

@@ -2548,8 +2548,8 @@
 "audio": "",
 "clipboard-image": "",
 "clipboard-text": "",
-"cloze-body": "打",
 "cloze-body-kana": "だ",
+"cloze-body": "打(う)",
Collaborator

What is this change?

Member Author

The test inputs include two cases with some fancy text replacements for parentheses.

{
    "name": "Ignore text inside parentheses",
    "func": "findTerms",
    "mode": "split",
    "text": "打(う)ち込(こ)む",
    "options": [
        "default",
        {
            "type": "terms",
            "removeNonJapaneseCharacters": false,
            "textReplacements": [
                null,
                [
                    {
                        "pattern": "\\(([^)]*)(?:\\)|$)",
                        "flags": "g",
                        "replacement": ""
                    }
                ]
            ]
        }
    ]
},
{
    "name": "Remove parentheses around text",
    "func": "findTerms",
    "mode": "split",
    "text": "(打)(ち)(込)(む)",
    "options": [
        "default",
        {
            "type": "terms",
            "removeNonJapaneseCharacters": false,
            "textReplacements": [
                null,
                [
                    {
                        "pattern": "\\(([^)]*)(?:\\)|$)",
                        "flags": "g",
                        "replacement": "$1"
                    }
                ]
            ]
        }
    ]
},

This slightly changes the behavior of the {cloze-body} handlebars marker in these cases.
I think no one will be affected: the set of users who use these handlebars markers, who use the obscure text replacements at all, and who use them in this particular way is probably empty.
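For reference, the two fixture patterns above behave like plain String.prototype.replace calls:

```javascript
// The regex from both fixtures: a "(", any non-")" characters (captured),
// then either ")" or end of string.
const pattern = /\(([^)]*)(?:\)|$)/g;

// "Ignore text inside parentheses": drop the parenthesized furigana.
const ignored = '打(う)ち込(こ)む'.replace(pattern, '');
// → '打ち込む'

// "Remove parentheses around text": keep the content, drop the parens.
const unwrapped = '(打)(ち)(込)(む)'.replace(pattern, '$1');
// → '打ち込む'
```

Both inputs reduce to the same lookup text, which is why the two fixtures exercise the same dictionary entry through different replacements.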

ext/js/language/translator.js
-        for (const [key, value] of textPreprocessorOptionsSpace) {
-            variantSpace.set(key, value);
-        }
+        const preprocessorVariantSpace = new Map(preprocessorOptionsSpace);
Collaborator

For my own edification: what is variant space here?

Member Author

[screenshot]

Each processor has an array of its possible options, e.g. [false, true], ['off', 'direct', 'inverse'], or [[false, false], [true, false], [true, true]]. Since we need to run all combinations of the processors, each processor's options act like a dimension of a matrix (or an axis of a coordinate system / vector space), and we get all combinations by traversing every cell of the matrix / point in the space.
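A minimal sketch of traversing such an option space; the function name echoes _getArrayVariants but this is my simplification, not the translator's code. It folds a Cartesian product over a Map of option arrays:

```javascript
// Expand a Map of {processorId -> array of possible settings} into the
// list of all setting combinations, one Map per combination.
function getArrayVariants(optionsSpace) {
    let variants = [new Map()];
    for (const [key, values] of optionsSpace) {
        const next = [];
        for (const variant of variants) {
            for (const value of values) {
                // Copy the partial variant and extend it with this setting.
                next.push(new Map([...variant, [key, value]]));
            }
        }
        variants = next;
    }
    return variants;
}
```

For example, a space of two processors with two settings each expands to four variants, one per point in the 2x2 grid.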

StefanVukovic99 and others added 5 commits April 19, 2024 09:27
Fixes FooSoft#775. Note that this behavior gets overridden if backspace is set
as a shortcut action.

_isKeyCharacterInput only worked when not using an IME, as inside of an
IME when a keydown event is fired, the key is reported as "Process",
which does not have a key.length equal to 1. This resulted in hotkeys
being triggered while typing, which this commit fixes.
@jamesmaa (Collaborator) left a comment

Starting to understand the high-level logic, and it looks fine to me. Just a few questions for my understanding.

const preprocessorVariantSpace = new Map(preprocessorOptionsSpace);
preprocessorVariantSpace.set('textReplacements', this._getTextReplacementsVariants(options));
const preprocessorVariants = this._getArrayVariants(preprocessorVariantSpace);
const postprocessorVariants = this._getArrayVariants(postprocessorOptionsSpace);

/** @type {import('translation-internal').DatabaseDeinflection[]} */
const deinflections = [];
const used = new Set();
Collaborator

What does used here mean semantically?

Member Author

It's the set of strings the deinflector has already been run on; there is no need to run it again on those, since their results are already in the deinflections array. It's a performance optimization.
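A sketch of that dedup, under the assumption that candidate strings can repeat across preprocessor variants (the names are illustrative, not the PR's actual code):

```javascript
// Run the deinflector over candidate strings, skipping any string it has
// already been run on. `used` records processed strings so duplicate
// candidates cost nothing.
function collectDeinflections(candidates, deinflect) {
    const deinflections = [];
    const used = new Set();
    for (const text of candidates) {
        if (used.has(text)) { continue; } // already deinflected this string
        used.add(text);
        deinflections.push(...deinflect(text));
    }
    return deinflections;
}
```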

_applyTextProcessors(textProcessors, processorVariant, text, textCache) {
for (const {id, textProcessor: {process}} of textProcessors) {
const setting = processorVariant.get(id);
let level1 = textCache.get(text);
Collaborator

This textCache logic seems to be what sourceMap was doing originally. Is that correct?

Member Author

No, the sourceMap had a functional purpose: it was necessary for the algorithm. The textCache is there just to improve performance by minimizing calls to textProcessor.process.
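A simplified sketch of that cache, using nested Maps keyed by text, processor id, and setting (illustrative, not the exact translator code):

```javascript
// Memoize process(text, setting) per (text, processor id, setting) so the
// same call is never evaluated twice across variants.
function processCached(textCache, id, process, setting, text) {
    let level1 = textCache.get(text);
    if (typeof level1 === 'undefined') {
        level1 = new Map();
        textCache.set(text, level1);
    }
    let level2 = level1.get(id);
    if (typeof level2 === 'undefined') {
        level2 = new Map();
        level1.set(id, level2);
    }
    if (!level2.has(setting)) {
        level2.set(setting, process(text, setting)); // only call on a miss
    }
    return level2.get(setting);
}
```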

@jamesmaa
Collaborator

Oh and please resolve conflicts

jamesmaa previously approved these changes Apr 21, 2024
@jamesmaa jamesmaa added this pull request to the merge queue Apr 21, 2024
Merged via the queue into themoeway:master with commit 07258ec Apr 21, 2024
10 checks passed