Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes a symbol group lookup table issue #14561

Merged

Conversation

elstehle
Copy link
Contributor

@elstehle elstehle commented Dec 4, 2023

Description

This PR fixes an issue in the finite-state transducer's (FST) lookup table that is used to map an input character to a symbol group. A symbol group is a an integer that's subsequently used to select a row from the transition table.

The FST uses a OTHER symbol group, to which all symbols are mapped that are not explicitly mapped to a symbol group.

E.g., say, we have two symbol groups, one that contains braces ({,}) and one that contains brackets ([,]).

const std::vector<std::string> symbol_groups = {"{}", "[]"};

// symbol (ASCII value) -> symbol group
// { (123)              -> 0
// } (125)              -> 0
// [ (91)               -> 1
// ] (93)               -> 1
// <anything else>      -> 2 ('OTHER')

So the lookup table will look something like this:

// lut[0] -> 2
// lut[1] -> 2
// lut[2] -> 2
// ...
// lut[91] -> 1
// lut[92] -> 2
// lut[93] -> 1
// ...
// lut[123] -> 0
// lut[124] -> 2
// lut[125] -> 0
// lut[126] -> 2

Now, when running the FST, we want to limit the range of lookups that we have to perform, so we bound the character to lookup to one-past-the-last index that was explicitly provided, because anything that comes after that index maps to the OTHER symbol group anyways. In the above example, the highest provided index is 125 (}) and one past it is index 126. We clamp any character value above 126 to 126. The number of valid items is 126+1.

So the lookup at runtime becomes:

return sym_to_sgid[min(static_cast<SymbolGroupIdT>(symbol), num_valid_entries - 1U)];

Previously, we were computing number of valid items wrongly. And the issue didn't surface because most of our FST usage included }, which is only succeeded by ~ and DEL, which are actually anyways only valid as part of string values, and hence wouldn't have changed semantics there.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle elstehle requested a review from a team as a code owner December 4, 2023 17:24
@elstehle elstehle added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change labels Dec 4, 2023
@elstehle elstehle changed the title Fixes a symbol group lookup table corner case Fixes a symbol group lookup table issue Dec 4, 2023
init_data.num_valid_entries = max_base_match_val + 1;
// The number of valid entries in the table (including the entry for the out-of-bounds symbol
// group id)
init_data.num_valid_entries = oob_match_index + 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In essence, this is the change that was done here. Previously, we were setting

init_data.num_valid_entries = oob_match_index;

Which, means that lookups larger than the highest valid index were getting mapped to to the symbol group of that highest entry, instead of the OTHER symbol group which would come just one after.

@elstehle
Copy link
Contributor Author

elstehle commented Dec 7, 2023

/merge

@rapids-bot rapids-bot bot merged commit f5dca59 into rapidsai:branch-24.02 Dec 7, 2023
67 checks passed
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
This PR fixes an issue in the finite-state transducer's (FST) lookup table that is used to map an input character to a symbol group. A symbol group is a an integer that's subsequently used to select a row from the transition table.


The FST uses a `OTHER` symbol group, to which all symbols are mapped that are not explicitly mapped to a symbol group.

E.g., say, we have two symbol groups, one that contains braces (`{`,`}`) and one that contains brackets (`[`,`]`).
```
const std::vector<std::string> symbol_groups = {"{}", "[]"};

// symbol (ASCII value) -> symbol group
// { (123)              -> 0
// } (125)              -> 0
// [ (91)               -> 1
// ] (93)               -> 1
// <anything else>      -> 2 ('OTHER')

So the lookup table will look something like this:

// lut[0] -> 2
// lut[1] -> 2
// lut[2] -> 2
// ...
// lut[91] -> 1
// lut[92] -> 2
// lut[93] -> 1
// ...
// lut[123] -> 0
// lut[124] -> 2
// lut[125] -> 0
// lut[126] -> 2
```

Now, when running the FST, we want to limit the range of lookups that we have to perform, so we bound the character to lookup to one-past-the-last index that was explicitly provided, because anything that comes after that index maps to the `OTHER` symbol group anyways. In the above example, the highest provided index is `125` (`}`) and one past it is index `126`. We clamp any character value above `126` to `126`. The _number_ of valid items is `126+1`. 

So the lookup at runtime becomes:
```
return sym_to_sgid[min(static_cast<SymbolGroupIdT>(symbol), num_valid_entries - 1U)];
```

Previously, we were computing number of valid items wrongly. And the issue didn't surface because most of our FST usage included `}`, which is only succeeded by `~` and `DEL`, which are actually anyways only valid as part of string values, and hence wouldn't have changed semantics there.

Authors:
  - Elias Stehle (https://github.com/elstehle)
  - Ray Douglass (https://github.com/raydouglass)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#14561
@elstehle elstehle mentioned this pull request Dec 12, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants