Unicode normalization follow up, adding character navigation and several fixes #16622

Merged
merged 31 commits into from
Jun 11, 2024

LeonarddeR
Collaborator

@LeonarddeR LeonarddeR commented May 28, 2024

Link to issue number:

Fixes #16622
fixes #16640

Summary of the issue:

It has been discussed that normalization would also be helpful for character navigation. There was also an issue where character descriptions and symbol pronunciation didn't work correctly because normalization took place before symbol processing. Furthermore, for the UnicodeNormalizationOffsetConverter used for braille, diffing turned out not to be accurate enough.

Description of user facing changes

  1. When normalization is enabled, there's an extra option, "Report normalized when navigating by character", in the speech settings.
  2. Added global commands for speech and braille normalization (without an assigned gesture).
  3. When normalization is enabled, characters will now always be normalized as well.
  4. Braille Unicode normalization is more reliable now.

Description of development approach

  1. Normalization is now always applied to speech, rather than only for object and text info speech. I also changed some helper functions to be able to report normalized when navigating by character.
  2. Added a SuppressUnicodeNormalizationCommand that allows you to suppress global normalization within a speech sequence. This command is used when creating a spelling sequence, because spelling has its own normalization logic now. It also ensures that when spelling a character or providing a character description (i.e. NVDA+. double press), normalization does not occur. It can even be used to disable character normalization altogether if necessary.
  3. Rewrote UnicodeNormalizationOffsetConverter to use a new function in NVDAHelper local that uses Uniscribe to return a list of offsets for character boundaries. This allows us to split a string into glyphs and then apply normalization to every glyph, which results in more reliable offset calculation and less complex code.
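
The per-glyph approach in point 3 can be pictured with a minimal Python sketch. It is not the code from this PR: `split_clusters` is a hypothetical, simplified stand-in for the Uniscribe-based boundary function (it only groups combining marks with their base character), `normalize_with_offsets` is an illustrative name, and NFKC is assumed as the normalization form.

```python
import unicodedata


def split_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Assumption: a simplified stand-in for Uniscribe character boundaries.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def normalize_with_offsets(text: str, form: str = "NFKC") -> tuple[str, list[int]]:
    """Normalize cluster by cluster and track where each result came from.

    Returns the normalized string plus, for every normalized code point,
    the offset in the original string where its source cluster starts.
    """
    parts: list[str] = []
    offsets: list[int] = []
    original_offset = 0
    for cluster in split_clusters(text):
        normalized = unicodedata.normalize(form, cluster)
        parts.append(normalized)
        offsets.extend([original_offset] * len(normalized))
        original_offset += len(cluster)
    return "".join(parts), offsets


# Decomposed "a" + combining acute, then "b", then the ij ligature (U+0133).
normalized, offsets = normalize_with_offsets("a\u0301b\u0133")
print(normalized, offsets)
```

Because every cluster is normalized on its own, the mapping never has to be inferred by diffing the two strings. Note that the offsets here count Python code points, which may differ from the units a real converter would use.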

Testing strategy:

  • Test that á and IJ are announced in their normalized form when Unicode normalization is on, and that the word "normalized" is added to the character announcement when enabled. Note that for á, this doesn't always work, as several textInfo implementations (such as Mozilla) are slightly broken when navigating across compositions. I should probably report that.
  • Test that, even when Unicode normalization is off, á is now announced as "a acute" when navigating by character, thereby improving that behavior as well.
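
For readers who want to see what the two test characters look like before and after normalization, here is a small Python sketch using the standard `unicodedata` module. It assumes the NFKC form, which this thread does not name explicitly, and `normalize_for_speech` is a hypothetical helper, not a function from the PR.

```python
import unicodedata


def normalize_for_speech(text: str) -> str:
    """Assumption: NFKC is the normalization form being applied."""
    return unicodedata.normalize("NFKC", text)


decomposed_a_acute = "a\u0301"  # "a" followed by COMBINING ACUTE ACCENT
ij_ligature = "\u0133"  # LATIN SMALL LIGATURE IJ

for original in (decomposed_a_acute, ij_ligature):
    normalized = normalize_for_speech(original)
    # A changed result is the case where "normalized" could be reported.
    changed = normalized != original
    print(
        [hex(ord(c)) for c in original],
        "->",
        [hex(ord(c)) for c in normalized],
        "changed:",
        changed,
    )
```

The decomposed á collapses into the single code point U+00E1, while the ligature expands into the two letters "ij", so both directions (fewer and more code points) are covered by these two cases.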

Known issues with pull request:

None known

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

Summary by CodeRabbit

  • New Features

    • Added an option to report normalized characters during character navigation.
    • Introduced commands to cycle through Unicode normalization states for braille and speech.
    • Added a checkbox in the speech settings panel for reporting normalized characters.
  • Enhancements

    • Improved speech synthesis with a new command to suppress Unicode normalization.
    • Enhanced text processing functions to support Unicode normalization.
  • Tests

    • Added unit tests for notifications and speech normalization, including tests for ligatures and decomposed characters.
  • Documentation

    • Updated user guide with information on toggling Unicode normalization and reporting normalized characters.
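
As a rough illustration of the suppression command listed under Enhancements, the following Python sketch shows how a marker object inside a sequence could switch normalization off for the items that follow it. `SuppressNormalization` and `process_sequence` are hypothetical names invented for this example, and NFKC is an assumed form; none of this is NVDA's actual implementation.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class SuppressNormalization:
    """Hypothetical marker: toggles suppression for subsequent items."""

    enabled: bool = True


def process_sequence(sequence: list) -> list:
    """Normalize string items unless a marker has suppressed normalization."""
    suppressed = False
    result = []
    for item in sequence:
        if isinstance(item, SuppressNormalization):
            # Markers only change state; they are not passed on.
            suppressed = item.enabled
            continue
        if isinstance(item, str) and not suppressed:
            item = unicodedata.normalize("NFKC", item)
        result.append(item)
    return result


sequence = [
    "\u0133",  # normalized to "ij"
    SuppressNormalization(True),
    "\u0133",  # left untouched, e.g. while spelling
    SuppressNormalization(False),
    "\u0133",  # normalized again
]
print(process_sequence(sequence))
```

A marker in the sequence keeps the decision local to one utterance, so code that builds a spelling sequence can opt out without touching the global setting.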

@seanbudd
Member

Hi @LeonarddeR - just noting that #16616 hasn't been triaged yet.
We're waiting on further discussion and testing in 2024.3 before considering changing the default for this in 2024.4

@LeonarddeR LeonarddeR changed the title SPeech Unicode normalization: Enable by default and normalize character navigation Speech Unicode normalization: Enable by default and normalize character navigation May 29, 2024
@LeonarddeR
Collaborator Author

Thanks for pointing that out. Note that I was aware of the standpoint to delay this pr to 2024.4, but others might not have been.

@LeonarddeR LeonarddeR added this to the 2024.4 milestone May 29, 2024
@Adriani90
Collaborator

@seanbudd this seems quite stable. Which points do you think still need clarification? If you need more community feedback, this needs to be merged into alpha. I don't think there is anything open regarding speech in #16616. You can also keep this behavior in alpha until you think it is stable enough to bring into beta. At least this is how we dealt with new features in the past, e.g. cancellable speech, which was enabled only in alpha for a longer period before it finally reached the stable version.

@CyrilleB79
Collaborator

@seanbudd would you accept milestone 2024.3 for this PR if we change the default value to disabled (see #16624 (comment))? This would at least avoid shipping 2024.3 with a buggy feature, closing #16624.

A subsequent PR for 2024.4 could then switch the default value to enabled.

@seanbudd
Member

@CyrilleB79 - yes as requested in #16624 (comment), however @LeonarddeR seems to suggest in #16624 (comment) that this is not possible for whatever reason

@LeonarddeR
Collaborator Author

@seanbudd I must have misunderstood you then. I thought you were asking in #16624 (comment) whether it would be possible to fix #16624 without changing character navigation behavior. That would be very difficult. If you agree to changing this PR so that normalization is disabled by default again while character navigation still normalizes as well, that is certainly possible and I can change the PR as requested by @CyrilleB79.

@seanbudd
Member

That would be great yes. The normalization fix can go into 2024.3, but we don't want to turn it on by default until it has had some wider testing in a release

@AppVeyorBot

See test results for failed build of commit 8e0d3c7fdb

@LeonarddeR LeonarddeR changed the title Speech Unicode normalization: Enable by default and normalize character navigation Speech Unicode normalization: normalize character navigation May 31, 2024
@AppVeyorBot

See test results for failed build of commit de6705d08d

@LeonarddeR LeonarddeR changed the title Speech Unicode normalization: normalize character navigation Unicode normalization follow up, adding character navigation and several fixes Jun 1, 2024
Contributor

coderabbitai bot commented Jun 6, 2024

Tip

For best results, initiate chat on the files or code changes.

@LeonarddeR, after reviewing the textUtils.cpp file, I noticed that the functions calculateWordOffsets and calculateCharacterOffsets are essentially wrappers around the _calculateUniscribeOffsets function. They differ only in the UNIT parameter (UNIT_WORD and UNIT_CHARACTER, respectively).

To reduce code duplication, you could consider merging these two functions into a single function that accepts a UNIT parameter. This would centralize the logic and make the codebase cleaner. Here's a proposed refactoring:

// Single entry point; the unit (UNIT_WORD or UNIT_CHARACTER) selects the boundary type.
bool calculateOffsets(UNIT unit, wchar_t* text, int textLength, int offset, int* startOffset, int* endOffset) {
    return _calculateUniscribeOffsets(unit, text, textLength, offset, startOffset, endOffset);
}

You can then replace calls to calculateWordOffsets and calculateCharacterOffsets with calls to calculateOffsets, passing the appropriate UNIT as an argument. This change should maintain all current functionalities while reducing the redundancy in your codebase.

Let me know if you need further assistance with this refactoring or any other part of your code!

Analysis chain
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Locate the textUtils.cpp file in the repository
fd textUtils.cpp

Length of output: 49



Script:

#!/bin/bash
# Display the contents of the textUtils.cpp file
cat nvdaHelper/local/textUtils.cpp

Length of output: 4852

@LeonarddeR
Collaborator Author

@coderabbitai I'd rather fix the duplication between calculateCharacterBoundaries and _calculateUniscribeOffsets. Note that the former swaps some C-style code for C++, but a major part of both is very similar.

Contributor

coderabbitai bot commented Jun 6, 2024

@LeonarddeR, after reviewing the implementations of calculateCharacterBoundaries and _calculateUniscribeOffsets in textUtils.cpp, here are some observations and a proposed refactoring strategy to reduce code duplication:

Observations:

  1. Common Functionality: Both functions utilize the ScriptItemize and ScriptBreak functions from the Uniscribe API to analyze text. The main difference lies in how they handle the results from these functions.
  2. Error Handling and Resource Management: Both functions handle errors similarly and have similar resource management patterns, especially with dynamic memory allocation for SCRIPT_ITEM and SCRIPT_LOGATTR.

Proposed Refactoring:

To reduce duplication, you can abstract the common parts of these functions into helper functions. Here’s a proposed refactoring:

  1. Common Script Itemization:
    Create a function to handle the common itemization logic.

    bool itemizeScript(const wchar_t* text, int textLength, SCRIPT_ITEM** items, int* numItems) {
        *items = new SCRIPT_ITEM[textLength + 1];
        return ScriptItemize(text, textLength, textLength, nullptr, nullptr, *items, numItems) == S_OK && *numItems > 0;
    }
  2. Common Script Breaking:
    Create a function to handle the common breaking logic.

    bool breakScript(const wchar_t* text, int textLength, SCRIPT_ITEM* items, int numItems, SCRIPT_LOGATTR** logAttrArray) {
        *logAttrArray = new SCRIPT_LOGATTR[textLength];
        // Walk the items backwards so each item's length follows from the start of the item after it.
        int nextICharPos = textLength;
        for (int itemIndex = numItems - 1; itemIndex >= 0; --itemIndex) {
            int iCharPos = items[itemIndex].iCharPos;
            int iCharLength = nextICharPos - iCharPos;
            if (ScriptBreak(text + iCharPos, iCharLength, &(items[itemIndex].a), *logAttrArray + iCharPos) != S_OK) {
                return false;
            }
            // The start of this item is the end of the preceding one.
            nextICharPos = iCharPos;
        }
        return true;
    }
  3. Refactor Existing Functions:
    Utilize these helper functions in calculateCharacterBoundaries and _calculateUniscribeOffsets to reduce duplication and centralize the common logic.

This approach not only reduces duplication but also enhances maintainability by centralizing the error handling and memory management, making the code cleaner and easier to manage.

Let me know if this refactoring aligns with your expectations or if you need further modifications!

Analysis chain



Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Extract the implementations of calculateCharacterBoundaries and _calculateUniscribeOffsets from textUtils.cpp
ast-grep --lang cpp --pattern $'$_ $_(calculateCharacterBoundaries, _calculateUniscribeOffsets)($$$) {
  $$$
}' --file textUtils.cpp

Length of output: 327



Script:

#!/bin/bash
# Correctly extract the implementations of calculateCharacterBoundaries and _calculateUniscribeOffsets from textUtils.cpp
ast-grep --lang cpp --pattern $'$_ $_(calculateCharacterBoundaries, _calculateUniscribeOffsets)($$$) {
  $$$
}'

Length of output: 112



Script:

#!/bin/bash
# Fetch the entire content of textUtils.cpp
cat textUtils.cpp

Length of output: 65



Script:

#!/bin/bash
# Search for the textUtils.cpp file in the repository to find the correct path
fd textUtils.cpp

Length of output: 49



Script:

#!/bin/bash
# Fetch the entire content of textUtils.cpp from the correct path
cat nvdaHelper/local/textUtils.cpp

Length of output: 4852

@LeonarddeR
Collaborator Author

@coderabbitai I didn't like your suggestions much. I refactored it in the last commit.

Contributor

coderabbitai bot commented Jun 6, 2024

@LeonarddeR, I've reviewed the latest changes in textUtils.cpp from your last commit. It looks like you've made significant improvements by abstracting the common logic into the _getLogAttrArray function, which simplifies the handling of script itemization and breaking. This refactoring enhances the modularity and readability of the code.

Using std::vector for managing SCRIPT_ITEM and SCRIPT_LOGATTR arrays is a robust choice as it automatically manages memory and makes the code cleaner and safer by avoiding manual memory management.

If there's anything specific you'd like feedback on or further improvements you're considering, feel free to let me know! Great work on these refinements.

Analysis chain
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Fetch the latest changes made to the textUtils.cpp file in the last commit
git diff HEAD~1 -- nvdaHelper/local/textUtils.cpp

Length of output: 3463

@seanbudd
Member

seanbudd commented Jun 6, 2024

@LeonarddeR - is this ready for re-review?

I'd encourage any change of policy/usage of feature flags to be moved to a separate discussion.
Right now they are encouraged for general settings where appropriate, and are used throughout the braille settings, as well as in paragraph nav and other general settings.

@LeonarddeR LeonarddeR marked this pull request as ready for review June 7, 2024 05:22
@LeonarddeR
Collaborator Author

@seanbudd Yes, I think the unit tests, including some edge cases, seem to prove that this works as advertised.

Member

@Qchristensen Qchristensen left a comment

UserGuide looks good.

source/gui/settingsDialogs.py (outdated, resolved)
source/speech/speech.py (resolved)
source/speech/speech.py (outdated, resolved)
source/speech/speech.py (outdated, resolved)
source/speech/speech.py (outdated, resolved)
source/speech/speech.py (resolved)
source/textUtils/__init__.py (outdated, resolved)
source/textUtils/uniscribe.py (outdated, resolved)
source/textUtils/uniscribe.py (resolved)
source/textUtils/uniscribe.py (resolved)
@seanbudd seanbudd marked this pull request as draft June 7, 2024 06:58
@Adriani90
Collaborator

@seanbudd

I'd encourage any change of policy/usage of feature flags to be moved to a separate discussion.
Right now they are encouraged for general settings where appropriate, and are used throughout the braille settings, as well as in paragraph nav and other general settings.

This is because people didn't pay attention to the actual purpose of feature flags when reviewing the corresponding PRs. I think it makes sense to clearly stop encouraging the use of feature flags outside the advanced settings panel, and to state their actual purpose in the policy docs.
Many users find this feature flag approach outside the advanced settings to be quite bad UX.

@seanbudd
Member

is this ready for re-review?

@LeonarddeR LeonarddeR marked this pull request as ready for review June 11, 2024 05:22
@LeonarddeR
Collaborator Author

Yes, sorry. Forgot to mark it as ready

@seanbudd seanbudd merged commit 6b366fe into nvaccess:master Jun 11, 2024
1 check passed
seanbudd pushed a commit that referenced this pull request Aug 5, 2024
…symbol definition, the symbol replacement is spoken (#16950)

Fixup for #16622

Summary of the issue:
When Unicode normalization of a character (e.g. ·) resulted in a character that had a symbol definition (e.g. ·, middle dot), the symbol definition wasn't applied to the normalized character. This resulted in NVDA speaking nothing or only the word "normalized".

Description of user facing changes
NVDA will now properly speak the · character (Greek Ano Teleia) as middle dot when normalizing. This also applies to other characters where normalization results in a character that's part of the symbol dictionary.

Description of development approach
When normalizing a character, ensure the result is passed through characterProcessing.processSpeechSymbol.
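
The order of operations described in this fixup can be sketched in Python as follows. This is an illustration only: `SYMBOL_NAMES` is a hypothetical one-entry stand-in for a symbol dictionary, `speak_character` is an invented helper rather than NVDA's `processSpeechSymbol`, NFKC is an assumed form, and appending the word "normalized" at the end is an assumption about the announcement order.

```python
import unicodedata

# Hypothetical stand-in for a symbol dictionary; not NVDA's actual data.
SYMBOL_NAMES = {"\u00b7": "middle dot"}


def speak_character(char: str) -> str:
    """Normalize first, then look the result up in the symbol table."""
    normalized = unicodedata.normalize("NFKC", char)
    # The lookup uses the normalized character, which is the point of the fix.
    spoken = SYMBOL_NAMES.get(normalized, normalized)
    if normalized != char:
        spoken += " normalized"
    return spoken


# U+0387 GREEK ANO TELEIA normalizes to U+00B7 MIDDLE DOT.
print(speak_character("\u0387"))
print(speak_character("\u00b7"))
```

Looking up the symbol after normalization means that a character such as U+0387, which has no entry of its own in this toy table, still gets the replacement text of the character it normalizes to.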
Labels
conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review.
7 participants