
rework text processors #793

Merged: 12 commits, Apr 21, 2024

Conversation

StefanVukovic99
Member

Part of #787; the changes necessary to support Korean.

  1. Text preprocessing happens before deinflection; text postprocessing, which runs after deinflection, is added here. This is primarily to support jamo decomposition/recomposition for Hangul, but could have other uses.

  2. Changes the order of operations in the translator so that substrings of the looked-up text are preprocessed and then deinflected (instead of deinflecting substrings of the preprocessed text). As a consequence, the (complicated) TextSourceMap is no longer used. This too is primarily motivated by Hangul, but it also fixes an issue where text processors that change the number of characters caused the wrong text to be selected on scan (this affected non-Japanese languages). There is a slight change in behavior for some textReplacements.
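The loop interchange described in point 2 can be sketched as follows. This is illustrative code, not the actual translator implementation; `preprocess` and `deinflect` stand in for the real processor and deinflector pipelines:

```javascript
// Before: preprocess the whole text once, then deinflect its substrings.
// A substring of the processed text may not line up with the original
// characters, which is why a source map was needed.
function lookupOld(text, preprocess, deinflect) {
    const processed = preprocess(text);
    const results = [];
    for (let i = processed.length; i > 0; --i) {
        results.push(...deinflect(processed.substring(0, i)));
    }
    return results;
}

// After: take substrings of the original text first, then preprocess each
// one. The substring length is measured in original characters, so the
// right span can be highlighted on scan without any source map.
function lookupNew(text, preprocess, deinflect) {
    const results = [];
    for (let i = text.length; i > 0; --i) {
        results.push(...deinflect(preprocess(text.substring(0, i))));
    }
    return results;
}
```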

@StefanVukovic99 StefanVukovic99 requested a review from a team as a code owner March 29, 2024 20:22

github-actions bot commented Mar 29, 2024

✔️ No visual differences introduced by this PR.

View Playwright Report (note: open the "playwright-report" artifact)

@StefanVukovic99 StefanVukovic99 added kind/meta The issue or PR is meta area/linguistics The issue or PR is related to linguistics labels Mar 29, 2024
@jamesmaa
Collaborator

Dumb question, but what is a source map?

@StefanVukovic99
Member Author

StefanVukovic99 commented Apr 18, 2024

what is a source map

An object for keeping track of which input characters were replaced by more or fewer than one character during preprocessing (apparently coined by nutbread; it is not a common term). It tracks the length of the original scanned text so the right number of characters can be highlighted on scan.

The current code preprocesses the text, then deinflects its substrings starting from the longest. This PR performs a loop interchange: now we take the substring first, then preprocess it (necessary for Hangul). This makes tracking the original text's length simpler and makes the TextSourceMap redundant.
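For illustration, here is a toy version of the source-map idea; the function names and the collapsing rule are mine, not the real TextSourceMap. During preprocessing, record how many source characters each output character consumed, so an output span can be mapped back to a length in the original text:

```javascript
// Toy preprocessor that collapses a doubled letter into one output
// character, recording how many source characters each output character
// consumed (mapping[i] = source chars behind output[i]).
function preprocessTracked(text) {
    let output = '';
    const mapping = [];
    for (let i = 0; i < text.length; ++i) {
        if (text[i] === text[i + 1]) {
            output += text[i];
            mapping.push(2); // two source chars became one output char
            ++i;
        } else {
            output += text[i];
            mapping.push(1);
        }
    }
    return {output, mapping};
}

// Map a span of the processed output back to the number of original
// characters, i.e. how many characters to highlight on scan.
function getSourceLength(mapping, outputLength) {
    let length = 0;
    for (let i = 0; i < outputLength; ++i) { length += mapping[i]; }
    return length;
}
```

For example, preprocessing "aabc" yields "abc" with mapping [2, 1, 1], so a two-character match against the output corresponds to three original characters.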

@jamesmaa (Collaborator) left a comment

Ngl, this PR feels a bit out of my league. Asking some questions for now for my own edification, and I'll take another look tomorrow.

@@ -2548,8 +2548,8 @@
 "audio": "",
 "clipboard-image": "",
 "clipboard-text": "",
-"cloze-body": "打",
 "cloze-body-kana": "だ",
+"cloze-body": "打(う)",
Collaborator

What is this change?

Member Author

The test inputs include two cases with some fancy text replacements for parentheses.

{
    "name": "Ignore text inside parentheses",
    "func": "findTerms",
    "mode": "split",
    "text": "打(う)ち込(こ)む",
    "options": [
        "default",
        {
            "type": "terms",
            "removeNonJapaneseCharacters": false,
            "textReplacements": [
                null,
                [
                    {
                        "pattern": "\\(([^)]*)(?:\\)|$)",
                        "flags": "g",
                        "replacement": ""
                    }
                ]
            ]
        }
    ]
},
{
    "name": "Remove parentheses around text",
    "func": "findTerms",
    "mode": "split",
    "text": "(打)(ち)(込)(む)",
    "options": [
        "default",
        {
            "type": "terms",
            "removeNonJapaneseCharacters": false,
            "textReplacements": [
                null,
                [
                    {
                        "pattern": "\\(([^)]*)(?:\\)|$)",
                        "flags": "g",
                        "replacement": "$1"
                    }
                ]
            ]
        }
    ]
},

This slightly changes the behavior of the {cloze-body} handlebars marker in these cases.
I think no one will be affected: the set of users who use these handlebars markers, who use the obscure text replacements at all, and who use them in this particular way is probably empty.
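For reference, the two fixture patterns above behave like plain String.prototype.replace calls:

```javascript
// The regex from both fixtures: a "(", any non-")" characters (captured),
// then either ")" or end of string.
const pattern = /\(([^)]*)(?:\)|$)/g;

// "Ignore text inside parentheses": drop the parenthesized furigana.
const ignored = '打(う)ち込(こ)む'.replace(pattern, '');
// → '打ち込む'

// "Remove parentheses around text": keep the content, drop the parens.
const unwrapped = '(打)(ち)(込)(む)'.replace(pattern, '$1');
// → '打ち込む'
```

Both inputs reduce to the same lookup text, which is why the two fixtures exercise the same dictionary entry through different replacements.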

ext/js/language/translator.js
-        for (const [key, value] of textPreprocessorOptionsSpace) {
-            variantSpace.set(key, value);
-        }
+        const preprocessorVariantSpace = new Map(preprocessorOptionsSpace);
Collaborator

For my own edification: what is variant space here?

Member Author

[screenshot]

Each processor has an array of its possible options, e.g. [false, true], ['off', 'direct', 'inverse'], or [[false, false], [true, false], [true, true]]. Since we need to run all combinations of the processors, each processor's options act like a dimension of a matrix (or an axis of a coordinate system / vector space), and we get all combinations by traversing every cell of the matrix / point in the space.
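A minimal sketch of traversing such an option space; the function name echoes _getArrayVariants but this is my simplification, not the translator's code. It folds a Cartesian product over a Map of option arrays:

```javascript
// Expand a Map of {processorId -> array of possible settings} into the
// list of all setting combinations, one Map per combination.
function getArrayVariants(optionsSpace) {
    let variants = [new Map()];
    for (const [key, values] of optionsSpace) {
        const next = [];
        for (const variant of variants) {
            for (const value of values) {
                // Copy the partial variant and extend it with this setting.
                next.push(new Map([...variant, [key, value]]));
            }
        }
        variants = next;
    }
    return variants;
}
```

For example, a space of two processors with two settings each expands to four variants, one per point in the 2x2 grid.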

StefanVukovic99 and others added 5 commits April 19, 2024 09:27
Fixes FooSoft#775. Note that this behavior gets overridden if backspace is set
as a shortcut action.

_isKeyCharacterInput only worked when not using an IME, as inside of an
IME when a keydown event is fired, the key is reported as "Process",
which does not have a key.length equal to 1. This resulted in hotkeys
being triggered while typing, which this commit fixes.
@jamesmaa (Collaborator) left a comment

Starting to understand the high-level logic, and it looks fine to me. Just a few questions for my understanding.

const preprocessorVariantSpace = new Map(preprocessorOptionsSpace);
preprocessorVariantSpace.set('textReplacements', this._getTextReplacementsVariants(options));
const preprocessorVariants = this._getArrayVariants(preprocessorVariantSpace);
const postprocessorVariants = this._getArrayVariants(postprocessorOptionsSpace);

/** @type {import('translation-internal').DatabaseDeinflection[]} */
const deinflections = [];
const used = new Set();
Collaborator

What does used here mean semantically?

Member Author

It's the set of strings the deinflector has already been run on; there is no need to run it again on those, since their results are already in the deinflections array. It's a performance optimization.
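A sketch of that dedup, under the assumption that candidate strings can repeat across preprocessor variants (the names are illustrative, not the PR's actual code):

```javascript
// Run the deinflector over candidate strings, skipping any string it has
// already been run on. `used` records processed strings so duplicate
// candidates cost nothing.
function collectDeinflections(candidates, deinflect) {
    const deinflections = [];
    const used = new Set();
    for (const text of candidates) {
        if (used.has(text)) { continue; } // already deinflected this string
        used.add(text);
        deinflections.push(...deinflect(text));
    }
    return deinflections;
}
```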

_applyTextProcessors(textProcessors, processorVariant, text, textCache) {
for (const {id, textProcessor: {process}} of textProcessors) {
const setting = processorVariant.get(id);
let level1 = textCache.get(text);
Collaborator

This textCache logic seems to be what sourceMap was doing originally. Is that correct?

Member Author

No, the sourceMap had a functional purpose: it was necessary for the algorithm. The textCache is there just to improve performance by minimizing calls to textProcessor.process.
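A simplified sketch of that cache, using nested Maps keyed by text, processor id, and setting (illustrative, not the exact translator code):

```javascript
// Memoize process(text, setting) per (text, processor id, setting) so the
// same call is never evaluated twice across variants.
function processCached(textCache, id, process, setting, text) {
    let level1 = textCache.get(text);
    if (typeof level1 === 'undefined') {
        level1 = new Map();
        textCache.set(text, level1);
    }
    let level2 = level1.get(id);
    if (typeof level2 === 'undefined') {
        level2 = new Map();
        level1.set(id, level2);
    }
    if (!level2.has(setting)) {
        level2.set(setting, process(text, setting)); // only call on a miss
    }
    return level2.get(setting);
}
```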

@jamesmaa
Collaborator

Oh and please resolve conflicts

jamesmaa previously approved these changes Apr 21, 2024
@jamesmaa jamesmaa added this pull request to the merge queue Apr 21, 2024
Merged via the queue into themoeway:master with commit 07258ec Apr 21, 2024
10 checks passed