Unicode normalization follow up, adding character navigation and several fixes #16622

LeonarddeR · 2024-05-28T16:26:04Z

Link to issue number:

Summary of the issue:

It has been discussed that normalization would also be helpful for character navigation. There was also an issue where character descriptions and symbol pronunciation didn't work correctly because normalization took place before symbol processing. Furthermore, for the UnicodeNormalizationOffsetConverter used for braille, diffing turned out not to be accurate enough.

Description of user facing changes

When normalization is enabled, there's an extra option, "Report normalized when navigating by character", in the speech settings.
Added global commands for speech and braille normalization (without assigned gestures).
When normalization is enabled, characters will now always be normalized as well.
Braille Unicode normalization is more reliable now.

Description of development approach

Normalization is now always applied to speech, rather than only for object and text info speech. I also changed some helper functions to be able to report normalized when navigating by character.
Added a SuppressUnicodeNormalizationCommand that allows you to suppress global normalization within a speech sequence. This command is used when creating a spelling sequence, because spelling has its own normalization logic now. It also ensures that when spelling a character or providing a character description (i.e. NVDA+. double press), normalization does not occur. It can even be used to disable character normalization altogether if necessary.
Rewrote UnicodeNormalizationOffsetConverter to use a new function in NVDAHelper local that uses Uniscribe to give a list of offsets for character boundaries. This allows us to split a string into glyphs and then apply normalization to every glyph, which results in more reliable offset calculation and less complex code.
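
The split-then-normalize idea described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `split_clusters` and `normalize_with_offsets` are hypothetical names, the cluster boundaries are approximated with `unicodedata.combining` instead of being obtained from Uniscribe, NFKC is assumed as the normalization form, and offsets are Python code point indices.

```python
import unicodedata


def split_clusters(text: str) -> list[int]:
    """Return the start offset of each (approximate) character cluster.

    Assumption: a cluster is a base character plus any combining marks
    that follow it. The PR obtains real boundaries from Uniscribe instead.
    """
    return [
        i for i, ch in enumerate(text)
        if i == 0 or unicodedata.combining(ch) == 0
    ]


def normalize_with_offsets(text: str, form: str = "NFKC") -> tuple[str, list[int]]:
    """Normalize each cluster separately and record an offset map.

    The returned list has one entry per original offset, giving the offset
    in the normalized string where that original cluster starts.
    """
    starts = split_clusters(text)
    ends = starts[1:] + [len(text)]
    parts: list[str] = []
    orig_to_norm: list[int] = []
    norm_pos = 0
    for start, end in zip(starts, ends):
        part = unicodedata.normalize(form, text[start:end])
        # Every original offset inside this cluster maps to the cluster's
        # start in the normalized string.
        orig_to_norm.extend([norm_pos] * (end - start))
        parts.append(part)
        norm_pos += len(part)
    return "".join(parts), orig_to_norm


normalized, offsets = normalize_with_offsets("a\u0301b\u0132")
print(normalized, offsets)
```

Because each cluster is normalized on its own, the offset map falls out of simple length bookkeeping and no diffing between the original and normalized strings is needed.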

Testing strategy:

Test that á and Ĳ are announced in their normalized form when Unicode normalization is on, and that the word "normalized" is added to the character announcement when enabled. Note that for á, this doesn't always work, as several textInfo implementations (such as Mozilla) are slightly broken when navigating across compositions. I should probably report that.
Test that, even when Unicode normalization is off, á is now announced as "a acute" when navigating by character, thereby improving that behavior as well.
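
For anyone reproducing these tests outside NVDA, the expected normalization results can be checked with a small sketch like the one below. It assumes NFKC as the normalization form and that the á under test is the decomposed sequence (a followed by U+0301). `describe_character` is a hypothetical helper, not an NVDA function, and the position of the word "normalized" in the output is only illustrative.

```python
import unicodedata


def describe_character(char: str, report_normalized: bool = True, form: str = "NFKC") -> str:
    """Return the normalized character, optionally flagged as normalized.

    Hypothetical helper: NVDA's real announcement logic lives in its
    speech module and is more involved.
    """
    normalized = unicodedata.normalize(form, char)
    if report_normalized and normalized != char:
        return f"{normalized} normalized"
    return normalized


# Decomposed a + combining acute accent composes to a single code point.
print(describe_character("a\u0301"))
# The IJ ligature (U+0132) is replaced by the two letters I and J.
print(describe_character("\u0132"))
```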

Known issues with pull request:

None known

Code Review Checklist:

Documentation:
- Change log entry
- User Documentation
- Developer / Technical Documentation
- Context sensitive help for GUI changes
Testing:
- Unit tests
- System (end to end) tests
- Manual testing
UX of all users considered:
- Speech
- Braille
- Low Vision
- Different web browsers
- Localization in other languages / culture than English
API is compatible with existing add-ons.
Security precautions taken.

Summary by CodeRabbit

New Features
- Added an option to report normalized characters during character navigation.
- Introduced commands to cycle through Unicode normalization states for braille and speech.
- Added a checkbox in the speech settings panel for reporting normalized characters.
Enhancements
- Improved speech synthesis with a new command to suppress Unicode normalization.
- Enhanced text processing functions to support Unicode normalization.
Tests
- Added unit tests for notifications and speech normalization, including tests for ligatures and decomposed characters.
Documentation
- Updated user guide with information on toggling Unicode normalization and reporting normalized characters.

seanbudd · 2024-05-29T00:07:25Z

Hi @LeonarddeR - just noting that #16616 hasn't been triaged yet.
We're waiting on further discussion and testing in 2024.3 before considering changing the default for this in 2024.4

LeonarddeR · 2024-05-29T05:54:49Z

Thanks for pointing that out. Note that I was aware of the standpoint to delay this PR to 2024.4, but others might not have been.

Adriani90 · 2024-05-30T07:45:35Z

@seanbudd this seems quite stable. Which points do you think still need clarification? If you need more community feedback, this needs to be merged into alpha. I don't think there is anything open regarding speech in #16616. You can also keep this behavior in alpha until you think it is stable enough to bring it into beta. At least this is how we dealt with new features in the past as well, see e.g. cancellable speech, which was enabled only in alpha for a longer period of time until it finally reached the stable version.

CyrilleB79 · 2024-05-30T13:57:27Z

@seanbudd would you accept milestone 2024.3 for this PR if we change the default value to disabled (see #16624 (comment))? This would at least avoid shipping 2024.3 with a buggy feature, closing #16624.

A subsequent PR for 2024.4 could then switch the default value to enabled.

seanbudd · 2024-05-31T00:05:25Z

@CyrilleB79 - yes as requested in #16624 (comment), however @LeonarddeR seems to suggest in #16624 (comment) that this is not possible for whatever reason

LeonarddeR · 2024-05-31T05:05:53Z

@seanbudd I must have misunderstood you then. I thought you were asking in #16624 (comment) whether it would be possible to fix #16624 without changing character navigation behavior. That would be very difficult. If you agree to changing this PR to disable normalization by default again while still changing character navigation to normalize as well, that is certainly possible, and I can change the PR as requested by @CyrilleB79.

seanbudd · 2024-05-31T05:48:56Z

That would be great, yes. The normalization fix can go into 2024.3, but we don't want to turn it on by default until it has had some wider testing in a release.

AppVeyorBot · 2024-05-31T09:03:28Z

FAIL: Lint check. See test results for more information.
PASS: Translation comments check.
Build (for testing PR): https://ci.appveyor.com/api/buildjobs/xpx9om5592xwryp7/artifacts/output/nvda_snapshot_pr16622-32209,8e0d3c7f.exe
PASS: System tests (tags: installer NVDA).
PASS: Unit tests.
CI timing (mins):
INIT 0.0,
INSTALL_START 1.1,
INSTALL_END 0.9,
BUILD_START 0.0,
BUILD_END 12.0,
TESTSETUP_START 0.0,
TESTSETUP_END 0.4,
TEST_START 0.0,
TEST_END 2.4,
FINISH_END 0.2

See test results for failed build of commit 8e0d3c7fdb

AppVeyorBot · 2024-05-31T09:22:40Z

PASS: Translation comments check.
PASS: Unit tests.
PASS: Lint check.
FAIL: System tests (tags: installer NVDA). See test results for more information.
Build (for testing PR): https://ci.appveyor.com/api/buildjobs/b78s5cpjpirs2eyj/artifacts/output/nvda_snapshot_pr16622-32210,de6705d0.exe
CI timing (mins):
INIT 0.0,
INSTALL_START 1.1,
INSTALL_END 0.9,
BUILD_START 0.0,
BUILD_END 11.4,
TESTSETUP_START 0.0,
TESTSETUP_END 0.4,
TEST_START 0.0,
TEST_END 2.3,
FINISH_END 0.2

See test results for failed build of commit de6705d08d

coderabbitai · 2024-06-06T19:58:41Z

Tip

For best results, initiate chat on the files or code changes.

@LeonarddeR, after reviewing the textUtils.cpp file, I noticed that the functions calculateWordOffsets and calculateCharacterOffsets are essentially wrappers around the _calculateUniscribeOffsets function. They differ only in the UNIT parameter (UNIT_WORD and UNIT_CHARACTER, respectively).

To reduce code duplication, you could consider merging these two functions into a single function that accepts a UNIT parameter. This would centralize the logic and make the codebase cleaner. Here's a proposed refactoring:

bool calculateOffsets(UNIT unit, wchar_t* text, int textLength, int offset, int* startOffset, int* endOffset) {
    return _calculateUniscribeOffsets(unit, text, textLength, offset, startOffset, endOffset);
}

You can then replace calls to calculateWordOffsets and calculateCharacterOffsets with calls to calculateOffsets, passing the appropriate UNIT as an argument. This change should maintain all current functionalities while reducing the redundancy in your codebase.

Let me know if you need further assistance with this refactoring or any other part of your code!

Analysis chain

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Locate the textUtils.cpp file in the repository
fd textUtils.cpp

Length of output: 49

Script:

#!/bin/bash
# Display the contents of the textUtils.cpp file
cat nvdaHelper/local/textUtils.cpp

Length of output: 4852

LeonarddeR · 2024-06-06T20:24:19Z

@coderabbitai I'd rather fix the duplication between calculateCharacterBoundaries and _calculateUniscribeOffsets. Note that the former swaps some C-style code for C++, but a major part of both is very similar.

coderabbitai · 2024-06-06T20:28:41Z

@LeonarddeR, after reviewing the implementations of calculateCharacterBoundaries and _calculateUniscribeOffsets in textUtils.cpp, here are some observations and a proposed refactoring strategy to reduce code duplication:

Observations:

Common Functionality: Both functions utilize the ScriptItemize and ScriptBreak functions from the Uniscribe API to analyze text. The main difference lies in how they handle the results from these functions.
Error Handling and Resource Management: Both functions handle errors similarly and have similar resource management patterns, especially with dynamic memory allocation for SCRIPT_ITEM and SCRIPT_LOGATTR.

Proposed Refactoring:

To reduce duplication, you can abstract the common parts of these functions into helper functions. Here’s a proposed refactoring:

Common Script Itemization:
Create a function to handle the common itemization logic.

bool itemizeScript(const wchar_t* text, int textLength, SCRIPT_ITEM** items, int* numItems) {
    *items = new SCRIPT_ITEM[textLength + 1];
    return ScriptItemize(text, textLength, textLength, nullptr, nullptr, *items, numItems) == S_OK && *numItems > 0;
}

Common Script Breaking:
Create a function to handle the common breaking logic.

bool breakScript(const wchar_t* text, int textLength, SCRIPT_ITEM* items, int numItems, SCRIPT_LOGATTR** logAttrArray) {
    *logAttrArray = new SCRIPT_LOGATTR[textLength];
    int nextICharPos = textLength;
    // Walk the items backwards so each item's length is the distance to the next item's start.
    for (int itemIndex = numItems - 1; itemIndex >= 0; --itemIndex) {
        int iCharPos = items[itemIndex].iCharPos;
        int iCharLength = nextICharPos - iCharPos;
        if (ScriptBreak(text + iCharPos, iCharLength, &(items[itemIndex].a), *logAttrArray + iCharPos) != S_OK) {
            return false;
        }
        nextICharPos = iCharPos;  // The current item's start bounds the preceding item.
    }
    return true;
}

Refactor Existing Functions:
Utilize these helper functions in calculateCharacterBoundaries and _calculateUniscribeOffsets to reduce duplication and centralize the common logic.

This approach not only reduces duplication but also enhances maintainability by centralizing the error handling and memory management, making the code cleaner and easier to manage.

Let me know if this refactoring aligns with your expectations or if you need further modifications!

Analysis chain

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Extract the implementations of calculateCharacterBoundaries and _calculateUniscribeOffsets from textUtils.cpp
ast-grep --lang cpp --pattern $'$_ $_(calculateCharacterBoundaries, _calculateUniscribeOffsets)($$$) {
  $$$
}' --file textUtils.cpp

Length of output: 327

Script:

#!/bin/bash
# Correctly extract the implementations of calculateCharacterBoundaries and _calculateUniscribeOffsets from textUtils.cpp
ast-grep --lang cpp --pattern $'$_ $_(calculateCharacterBoundaries, _calculateUniscribeOffsets)($$$) {
  $$$
}'

Length of output: 112

Script:

#!/bin/bash
# Fetch the entire content of textUtils.cpp
cat textUtils.cpp

Length of output: 65

Script:

#!/bin/bash
# Search for the textUtils.cpp file in the repository to find the correct path
fd textUtils.cpp

Length of output: 49

Script:

#!/bin/bash
# Fetch the entire content of textUtils.cpp from the correct path
cat nvdaHelper/local/textUtils.cpp

Length of output: 4852

LeonarddeR · 2024-06-06T20:50:05Z

@coderabbitai I didn't like your suggestions much. I refactored it in the last commit.

coderabbitai · 2024-06-06T20:52:48Z

@LeonarddeR, I've reviewed the latest changes in textUtils.cpp from your last commit. It looks like you've made significant improvements by abstracting the common logic into the _getLogAttrArray function, which simplifies the handling of script itemization and breaking. This refactoring enhances the modularity and readability of the code.

Using std::vector for managing SCRIPT_ITEM and SCRIPT_LOGATTR arrays is a robust choice as it automatically manages memory and makes the code cleaner and safer by avoiding manual memory management.

If there's anything specific you'd like feedback on or further improvements you're considering, feel free to let me know! Great work on these refinements.

Analysis chain

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Fetch the latest changes made to the textUtils.cpp file in the last commit
git diff HEAD~1 -- nvdaHelper/local/textUtils.cpp

Length of output: 3463

seanbudd · 2024-06-06T23:57:41Z

@LeonarddeR - is this ready for re-review?

I'd encourage any change of policy/usage of feature flags to be moved to a separate discussion.
Right now they are encouraged for general settings where appropriate, and are used throughout the braille settings, as well as in paragraph nav and other general settings.

LeonarddeR · 2024-06-07T05:24:12Z

@seanbudd Yes, I think the unit tests, including some edge cases, seem to prove that this works as advertised.

Qchristensen

UserGuide looks good.

source/gui/settingsDialogs.py

source/speech/speech.py

source/textUtils/__init__.py

source/textUtils/uniscribe.py

Adriani90 · 2024-06-08T00:22:49Z

@seanbudd

I'd encourage any change of policy/usage of feature flags to be moved to a separate discussion.
Right now they are encouraged for general settings where appropriate, and are used throughout the braille settings, as well as in paragraph nav and other general settings.

This is because people didn't pay attention to the actual purpose of feature flags when reviewing the corresponding PRs. I think it makes sense to clearly stop encouraging the use of these feature flags outside of the advanced settings panel, and to state their actual purpose in the policy docs.
Many users find this feature flag approach outside of the advanced settings to be quite bad UX.

seanbudd · 2024-06-11T02:58:57Z

Is this ready for re-review?

LeonarddeR · 2024-06-11T05:23:30Z

Yes, sorry. Forgot to mark it as ready

…symbol definition, the symbol replacement is spoken (#16950)

Fixup for #16622.

Summary of the issue: When Unicode normalization of a character (e.g. ·) resulted in a character that had a symbol definition (e.g. ·, middle dot), the symbol definition wasn't applied to the normalized character. This resulted in NVDA speaking nothing or only the word "normalized".

Description of user facing changes: NVDA will now properly speak the · character (Greek Ano Teleia) as "middle dot" when normalizing. This also applies to other characters where normalization results in a character that's part of the symbol dictionary.

Description of development approach: When normalizing a character, ensure it is passed through characterProcessing.processSpeechSymbol.
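
The behavior this fixup describes can be approximated with the sketch below. It is not NVDA's implementation: `SYMBOLS` is a hypothetical one-entry stand-in for the symbol dictionary, `speak_normalized_char` is an invented helper rather than `characterProcessing.processSpeechSymbol`, NFKC is assumed as the normalization form, and the placement of the word "normalized" is illustrative.

```python
import unicodedata

# Hypothetical stand-in for NVDA's symbol dictionary.
SYMBOLS = {"\u00b7": "middle dot"}


def speak_normalized_char(char: str, report_normalized: bool = True) -> str:
    """Normalize a character, then apply the symbol replacement to the result."""
    normalized = unicodedata.normalize("NFKC", char)
    # Look up the *normalized* character, so that Greek Ano Teleia (U+0387),
    # which normalizes to middle dot (U+00B7), picks up the symbol's name.
    spoken = SYMBOLS.get(normalized, normalized)
    if report_normalized and normalized != char:
        return f"{spoken} normalized"
    return spoken


print(speak_normalized_char("\u0387"))
```

Looking up the original character instead of the normalized one would miss the dictionary entry, which matches the symptom described in the summary above.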

LeonarddeR added 5 commits May 28, 2024 18:18

SPeech: Enable unicode normalization by default

bcdf4c0

Add input gestures

82e2bd7

Speak normalized on character nav

b7844cb

Fix unit test

9d6d799

Lint

7d09e8d

LeonarddeR mentioned this pull request May 28, 2024

Normalization of unicode cahracter: allow excluding the symbols in the symbols.dic file from the normalization #16624

Closed

LeonarddeR changed the title ~~SPeech Unicode normalization: Enable by default and normalize character navigation~~ Speech Unicode normalization: Enable by default and normalize character navigation May 29, 2024

Merge remote-tracking branch 'origin/master' into normalizationFollowUp

b4d74d3

LeonarddeR added this to the 2024.4 milestone May 29, 2024

LeonarddeR added 4 commits May 31, 2024 09:23

Merge remote-tracking branch 'origin/master' into normalizationFollowUp

438d217

Update user guide and change default

50f6087

Fix announcing ligatures

f94c55f

Updates to speech

b35ebde

Fix suppression

19f698a

LeonarddeR changed the title ~~Speech Unicode normalization: Enable by default and normalize character navigation~~ Speech Unicode normalization: normalize character navigation May 31, 2024

LeonarddeR added 3 commits June 1, 2024 09:43

Last fixups, opefully

13c2168

Use walrus

5f54f77

Slightly expand mixed test

b56c369

LeonarddeR changed the title ~~Speech Unicode normalization: normalize character navigation~~ Unicode normalization follow up, adding character navigation and several fixes Jun 1, 2024

LeonarddeR and others added 2 commits June 1, 2024 11:22

Add a SequenceMatcher monkey patch

e9faa2b

Better assertions

b854b76

Even more compact code

1ae779a

Get rid of duplicated code

d25042b

LeonarddeR marked this pull request as ready for review June 7, 2024 05:22

Qchristensen approved these changes Jun 7, 2024

seanbudd reviewed Jun 7, 2024

seanbudd marked this pull request as draft June 7, 2024 06:58

Apply suggestions from code review

6d801d1

LeonarddeR added 4 commits June 8, 2024 13:38

Add processText doc string

afd2aec

Add _getSpellingSpeechWithoutCharMode docstring

50e35bf

Add return type

7c3fd6f

Import Generator

5bf722f

LeonarddeR marked this pull request as ready for review June 11, 2024 05:22

seanbudd approved these changes Jun 11, 2024

seanbudd merged commit 6b366fe into nvaccess:master Jun 11, 2024
1 check passed

LeonarddeR mentioned this pull request Aug 3, 2024

Ensure that when character normalization results in a character with symbol definition, the symbol replacement is spoken #16950

Merged

5 tasks

This was referenced Sep 28, 2024

Set default input and translation tables according to NVDA's language #17222

Merged

Update translations from Crowdin #17233

Merged

coderabbitai bot mentioned this pull request Oct 8, 2024

Update translations from Crowdin #17261

Merged

coderabbitai bot mentioned this pull request Oct 15, 2024

When spelling by line, ensure that only normalized characters report as such #17295

Merged

5 tasks
