Implement full Unicode 16.0.0 extended grapheme breaking. #719

Merged
merged 17 commits into from
Nov 20, 2024

lrhn (Member) commented Nov 7, 2024

Includes rule GB9c (the Indic Conjunct Break rule).

This change has a significant cost in size, since the information needed per character no longer fits in 4 bits. The base table is therefore twice as big (one byte per entry rather than half a byte).
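To illustrate the size trade-off, here is a minimal sketch in Dart. It is not the package's actual table code: `lookupPacked`, `lookupByte`, and `packNibbles` are hypothetical names, and the nibble order (even index in the low four bits) is an assumption made for the example.

```dart
import 'dart:typed_data';

/// Reads a 4-bit value from a table that packs two entries per byte.
/// Assumed layout: even indices in the low nibble, odd indices in the high.
int lookupPacked(Uint8List table, int index) {
  final byte = table[index >> 1];
  return (index & 1) == 0 ? byte & 0x0F : byte >> 4;
}

/// Reads a value from a table that stores one entry per byte.
int lookupByte(Uint8List table, int index) => table[index];

/// Packs 4-bit values two per byte.
/// Throws if any value needs more than 4 bits, which is the situation
/// that forces a switch to one byte per entry.
Uint8List packNibbles(List<int> values) {
  final packed = Uint8List((values.length + 1) >> 1);
  for (var i = 0; i < values.length; i++) {
    final value = values[i];
    if (value < 0 || value > 0x0F) {
      throw ArgumentError.value(value, 'values[$i]', 'does not fit in 4 bits');
    }
    packed[i >> 1] |= (i & 1) == 0 ? value : value << 4;
  }
  return packed;
}
```

With at most 16 distinct values per character the packed form needs half a byte per entry. Once a 17th value is required, `packNibbles` can no longer represent the data and the byte-per-entry layout read by `lookupByte` doubles the table size.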

The number of states in the state automatons has also increased slightly, but in comparison that's a negligible change.

Tests have been made more thorough. They now cover not only the tests provided by the Unicode Consortium, but also variants of those tests that use representative characters for each character category, both inside and outside the BMP, to check that surrogate pair decoding works correctly.
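For readers unfamiliar with why characters outside the BMP need separate test variants, the following Dart sketch shows the surrogate pair decoding step that such variants exercise. The function names are hypothetical and this is not the package's decoder, only an illustration of the arithmetic involved.

```dart
/// True if [unit] is a UTF-16 lead (high) surrogate, 0xD800..0xDBFF.
bool isLeadSurrogate(int unit) => (unit & 0xFC00) == 0xD800;

/// True if [unit] is a UTF-16 tail (low) surrogate, 0xDC00..0xDFFF.
bool isTailSurrogate(int unit) => (unit & 0xFC00) == 0xDC00;

/// Combines a lead and a tail surrogate into a supplementary code point.
int combineSurrogates(int lead, int tail) =>
    0x10000 + ((lead & 0x3FF) << 10) + (tail & 0x3FF);

/// Decodes UTF-16 code units to code points, pairing surrogates where valid.
/// Unpaired surrogates are passed through unchanged.
List<int> decodeCodePoints(String text) {
  final result = <int>[];
  for (var i = 0; i < text.length; i++) {
    final unit = text.codeUnitAt(i);
    if (isLeadSurrogate(unit) && i + 1 < text.length) {
      final next = text.codeUnitAt(i + 1);
      if (isTailSurrogate(next)) {
        result.add(combineSurrogates(unit, next));
        i++; // Skip the tail surrogate that was just consumed.
        continue;
      }
    }
    result.add(unit);
  }
  return result;
}
```

A character in the BMP occupies one code unit and takes the simple path, while a character outside the BMP occupies two and takes the pairing path. A test that only uses BMP characters for some category would never run the pairing path for that category.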

Tests also check that the created automatons are minimal, in that no state is unreachable and no two states are indistinguishable.
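The two minimality conditions can be sketched as follows. This is an assumption-laden illustration, not the test code from this PR: it models the automaton as a plain DFA with a transition table and an `accepting` flag per state, whereas the package's automatons may encode break decisions differently. The name `isMinimal` is hypothetical.

```dart
/// Returns true if every state is reachable from [start] and no two states
/// are indistinguishable.
///
/// `transitions[s][c]` is the state reached from state `s` on input class
/// `c`; `accepting[s]` is the observable output of state `s`.
bool isMinimal(List<List<int>> transitions, List<bool> accepting, int start) {
  final n = transitions.length;

  // Reachability: depth-first search from the start state.
  final reachable = List<bool>.filled(n, false);
  final stack = <int>[start];
  reachable[start] = true;
  while (stack.isNotEmpty) {
    final state = stack.removeLast();
    for (final next in transitions[state]) {
      if (!reachable[next]) {
        reachable[next] = true;
        stack.add(next);
      }
    }
  }
  if (reachable.contains(false)) return false;

  // Distinguishability: table filling, iterated to a fixed point.
  // Two states start out distinct if their outputs differ.
  final distinct = List.generate(n, (_) => List<bool>.filled(n, false));
  for (var a = 0; a < n; a++) {
    for (var b = 0; b < n; b++) {
      distinct[a][b] = accepting[a] != accepting[b];
    }
  }
  var changed = true;
  while (changed) {
    changed = false;
    for (var a = 0; a < n; a++) {
      for (var b = a + 1; b < n; b++) {
        if (distinct[a][b]) continue;
        // States become distinct if some input leads to distinct states.
        for (var c = 0; c < transitions[a].length; c++) {
          if (distinct[transitions[a][c]][transitions[b][c]]) {
            distinct[a][b] = distinct[b][a] = true;
            changed = true;
            break;
          }
        }
      }
    }
  }

  // Minimal only if every pair of different states is distinct.
  for (var a = 0; a < n; a++) {
    for (var b = a + 1; b < n; b++) {
      if (!distinct[a][b]) return false;
    }
  }
  return true;
}
```

The quadratic table-filling approach is chosen here for readability. For automatons with only a handful of states, as suggested by the description above, its cost is unlikely to matter.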

github-actions bot commented Nov 7, 2024

Package publishing

| Package | Version | Status | Publish tag (post-merge) |
| --- | --- | --- | --- |
| package:characters | 1.4.0 | ready to publish | characters-v1.4.0 |
| package:args | 2.6.1-wip | WIP (no publish necessary) | |
| package:async | 2.12.0 | already published at pub.dev | |
| package:collection | 1.19.1-wip | WIP (no publish necessary) | |
| package:convert | 3.1.2 | already published at pub.dev | |
| package:crypto | 3.0.6 | already published at pub.dev | |
| package:fixnum | 1.1.1 | already published at pub.dev | |
| package:logging | 1.3.0 | already published at pub.dev | |
| package:os_detect | 2.0.3-wip | WIP (no publish necessary) | |
| package:path | 1.9.1 | already published at pub.dev | |
| package:platform | 3.1.6 | already published at pub.dev | |
| package:typed_data | 1.4.0 | already published at pub.dev | |

Documentation at https://github.com/dart-lang/ecosystem/wiki/Publishing-automation.

github-actions bot commented Nov 7, 2024

PR Health

Breaking changes ✔️
| Package | Change | Current Version | New Version | Needed Version | Looking good? |
| --- | --- | --- | --- | --- | --- |
| characters | None | 1.4.0 | 1.4.0 | 1.4.0 | ✔️ |
Coverage ⚠️
| File | Coverage |
| --- | --- |
| pkgs/characters/benchmark/benchmark.dart | 💔 Not covered |
| pkgs/characters/lib/characters.dart | 💔 Not covered |
| pkgs/characters/lib/src/characters.dart | 💚 100 % |
| pkgs/characters/lib/src/characters_impl.dart | 💚 89 % |
| pkgs/characters/lib/src/grapheme_clusters/breaks.dart | 💚 97 % |
| pkgs/characters/lib/src/grapheme_clusters/constants.dart | 💔 Not covered |
| pkgs/characters/lib/src/grapheme_clusters/table.dart | 💚 100 % |
| pkgs/characters/tool/benchmark.dart | 💔 Not covered |
| pkgs/characters/tool/bin/generate_tables.dart | 💔 Not covered |
| pkgs/characters/tool/bin/generate_tests.dart | 💔 Not covered |
| pkgs/characters/tool/generate.dart | 💔 Not covered |
| pkgs/characters/tool/src/args.dart | 💔 Not covered |
| pkgs/characters/tool/src/atsp.dart | 💔 Not covered |
| pkgs/characters/tool/src/automaton_builder.dart | 💔 Not covered |
| pkgs/characters/tool/src/data_files.dart | 💔 Not covered |
| pkgs/characters/tool/src/debug_names.dart | 💚 12 % |
| pkgs/characters/tool/src/graph.dart | 💔 Not covered |
| pkgs/characters/tool/src/grapheme_category_loader.dart | 💔 Not covered |
| pkgs/characters/tool/src/indirect_table.dart | 💔 Not covered |
| pkgs/characters/tool/src/list_overlap.dart | 💔 Not covered |
| pkgs/characters/tool/src/shared.dart | 💔 Not covered |
| pkgs/characters/tool/src/string_literal_writer.dart | 💔 Not covered |
| pkgs/characters/tool/src/table_builder.dart | 💔 Not covered |

This check for test coverage is informational (issues shown here will not fail the PR).

This check can be disabled by tagging the PR with skip-coverage-check.

API leaks ✔️

The following packages contain symbols visible in the public API, but not exported by the library. Export these symbols or remove them from your publicly visible API.

Package Leaked API symbols
License Headers ⚠️
// Copyright (c) 2024, the Dart project authors. Please see the AUTHORS file
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.
Files
pkgs/characters/lib/src/grapheme_clusters/breaks.dart

All source files should start with a license header.

This check can be disabled by tagging the PR with skip-license-check.

Until `// dart format off` starts working.
lrhn (Member, Author) commented Nov 7, 2024

The health check is wrong. The changelog is correct: the version wasn't changed, and the existing changelog didn't list the missing part that is now implemented.

natebosch (Member) left a comment

Out of these comments, my primary concern is the `return` statements in loops in tests.

Should some of those be `continue` instead of `return`?

Resolved review threads (outdated):
pkgs/characters/test/breaks_test.dart (3 threads)
pkgs/characters/tool/src/graph.dart (3 threads)
lrhn (Member, Author) commented Nov 8, 2024

I think I broke `isGraphemeClusterBoundary`. Have to fix that too.
... and fixed. That was a silly bug.

lrhn added 2 commits November 8, 2024 16:12
Add direct test for `isGraphemeClusterBoundary`.
lrhn force-pushed the indic-conjunct-break branch from 6158e75 to e382ab4 on November 11, 2024 17:30
lrhn (Member, Author) commented Nov 13, 2024

> Out of these comments, my primary concern is the `return` statements in loops in tests.
>
> Should some of those be `continue` instead of `return`?

They could be `break`, and now they are.
(They should break to an `expect` that cannot fail, but that still means every test run goes through that `expect`, one way or another.)
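The pattern described in this reply might look roughly like the sketch below. It is a guess at the shape of the change, not the code from `breaks_test.dart`: `expectTrue` is a hypothetical stand-in for a test framework's `expect`, and `firstBoundaryAtOrAfter` is an invented helper used only to give the loop something to do.

```dart
/// Counts calls so the example can show that every run reaches the check.
int expectCalls = 0;

/// Hypothetical stand-in for a test framework's `expect`.
void expectTrue(bool condition, String reason) {
  expectCalls++;
  if (!condition) throw StateError(reason);
}

/// Finds the first entry of the sorted [boundaries] that is at or after
/// [from], or -1 if there is none.
int firstBoundaryAtOrAfter(List<int> boundaries, int from) {
  var found = -1;
  for (final boundary in boundaries) {
    if (boundary >= from) {
      found = boundary;
      break; // Not `return`: fall through to the check after the loop.
    }
  }
  // This check cannot fail given the loop above, but every call passes
  // through it, whether the loop ended early or ran to completion.
  expectTrue(found == -1 || found >= from, 'boundary before $from');
  return found;
}
```

With a `return` inside the loop, an early exit would skip the final check entirely, so a run that matched early would finish without having asserted anything.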

lrhn (Member, Author) commented Nov 13, 2024

Think I'm done now. Much cleaned up (and nicer, IMO).

@lrhn lrhn merged commit 1de8372 into main Nov 20, 2024
17 checks passed
@lrhn lrhn deleted the indic-conjunct-break branch November 20, 2024 18:41
copybara-service bot pushed a commit to dart-lang/sdk that referenced this pull request Nov 21, 2024
Revisions updated by `dart tools/rev_sdk_deps.dart`.

core (https://github.com/dart-lang/core/compare/6af0821..1de8372):
  1de83727  2024-11-20  Lasse R.H. Nielsen  Implement full Unicode 16.0.0 extended grapheme breaking. (dart-lang/core#719)

dartdoc (https://github.com/dart-lang/dartdoc/compare/f8a55e4..c7f1160):
  c7f11603  2024-11-20  Sam Rawlins  Fix sidebars via correct web API for anchor href values (dart-lang/dartdoc#3934)

http (https://github.com/dart-lang/http/compare/e37093f..79470d0):
  79470d0  2024-11-19  Brian Quinlan  Include names in argument lists (dart-lang/http#1408)

shelf (https://github.com/dart-lang/shelf/compare/0bb44cb..a2708cd):
  a2708cd  2024-11-21  Devon Carew  shorten the issue badges (dart-lang/shelf#456)

Change-Id: Iee20d2300d1bf0e43a57b352b73235ae24fa5e51
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/396960
Auto-Submit: Devon Carew <devoncarew@google.com>
Commit-Queue: Konstantin Shcheglov <scheglov@google.com>
Reviewed-by: Konstantin Shcheglov <scheglov@google.com>