Remove ruby compat hacks (#259)
* Fix two minor bugs from the Ruby code

First, `category` rather than `code` was used in constructing the
`control_boundary` property as related to the characters U+200C and
U+200D. A category string never equals a code point, so the check for
those two characters had no effect. This is now fixed, and could be an
observable bugfix for any C code which inspects the `control_boundary`
property.
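
As a minimal sketch of why the old check was a no-op: the `CharInfo` struct below is a hypothetical stand-in for the generator's per-character record, not the real type, and it only assumes that `category` is a string and `code` an integer, as the diff suggests.

```julia
# Hypothetical stand-in for the generator's per-character record.
struct CharInfo
    code::UInt32
    category::String
end

zwnj = CharInfo(0x200C, "Cf")  # U+200C ZERO WIDTH NON-JOINER, category Cf

# Old check: a String is tested against integers, so `in` is always false
# and the negation is always true.
old_flag = zwnj.category in ("Zl", "Zp", "Cc", "Cf") &&
           !(zwnj.category in (0x200C, 0x200D))

# Fixed check: the code point itself is tested.
fixed_flag = zwnj.category in ("Zl", "Zp", "Cc", "Cf") &&
             !(zwnj.code in (0x200C, 0x200D))

println((old_flag, fixed_flag))
```

With the old expression U+200C is flagged as a control boundary; with the fixed expression it is not.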

Second, when reading composition exclusions, Ruby's `String#hex` method
returns zero rather than nil if no number is found. For example:

    $ ruby -e 'puts "# blah".hex'
    0

This led to the character `'\0'` being included in the `exclusions`
and `excl_version` sets, which is incorrect. However, this appears to be
asymptomatic because `'\0'` is never part of a composition. (In terms of
the C code, the use of `comp_exclusion` is guarded by the `comb_index`
property which is `UINT16_MAX` for `'\0'`.)
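
For contrast, a Julia parser built on `tryparse` does not produce a spurious zero for a comment line. This is only a sketch: `parse_code` is a hypothetical helper, and the real `read_composition_exclusions` may parse its input differently.

```julia
# Hypothetical helper: parse the leading hex field of a data-file line.
# `tryparse` returns `nothing` when the field is not a number.
parse_code(line) = tryparse(UInt32, first(split(line)); base=16)

exclusions = Set{UInt32}()
for line in ("0958 # DEVANAGARI LETTER QA", "# blah")
    code = parse_code(line)
    # Only real code points are added; comment lines are skipped.
    code === nothing || push!(exclusions, code)
end

println(exclusions)
```

Here the set ends up holding only `0x0958`, so no `push!(exclusions, 0)` workaround is needed to match the data file.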

* Cleanup: Remove sequence ordering hack

This hack changed the ordering of sequences encoded in the sequences
table and was added so we could easily prove equivalence to the Ruby
data generator code.

However, it's no longer needed and removing it shouldn't result in any
functional change.
c42f authored Jan 4, 2024
1 parent a78bee9 commit 1fe43f5
Showing 2 changed files with 13,800 additions and 13,812 deletions.
14 changes: 1 addition & 13 deletions data/data_generator.jl
@@ -95,10 +95,6 @@ end
exclusions = Set(read_composition_exclusions(r"# \(1\) Script Specifics.*?# Total code points:"s))
excl_version = Set(read_composition_exclusions(r"# \(2\) Post Composition Version precomposed characters.*?# Total code points:"s))

-# FIXME: Replicate a bug in the ruby code
-push!(exclusions, 0)
-push!(excl_version, 0)
-
#-------------------------------------------------------------------------------
function read_case_folding(filename)
case_folding = Dict{UInt32,Vector{UInt32}}()
@@ -396,8 +392,7 @@ function char_table_properties!(sequences, char)
comp_exclusion = code in exclusions || code in excl_version,
ignorable = code in ignorable,
control_boundary = char.category in ("Zl", "Zp", "Cc", "Cf") &&
-# FIXME: Ruby bug compat - should be `code in (0x200C, 0x200D)`
-!(char.category in (0x200C, 0x200D)),
+!(char.code in (0x200C, 0x200D)),
charwidth = derive_char_width(code, char.category),
boundclass = get_grapheme_boundclass(code),
indic_conjunct_break = get_indic_conjunct_break(code),
@@ -407,13 +402,6 @@ end
# Many character properties are duplicates. Deduplicate them, constructing a
# per-character array of indicies into the properties array
sequences = UTF16Sequences()
-
-# FIXME: Hack to force ordering compat with Ruby code
-for c in char_props
-encode_sequence!(sequences, c.decomp_mapping)
-encode_sequence!(sequences, get_case_folding(c.code))
-end
-
char_table_props = [char_table_properties!(sequences, cp) for cp in char_props]

deduplicated_props = Origin(0)(Vector{eltype(char_table_props)}())