Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add is_potential_mixed_script_confusable_char function #13

Merged

Conversation

crlf0710
Copy link
Collaborator

@crlf0710 crlf0710 commented May 4, 2020

This is a prototype of the data required from mixed_script_confusable lint of rustc. I'm not really sure whether we need to give more detailed data to rustc for diagnostics. (e.g. This code point is potentially confusable with which code point or which script?)

Putting those aside, maybe we can have a early review first.

Implements is_potential_mixed_script_confusable_char function.

A few other issues raised up during adding the actual lint - #15 #16

@crlf0710 crlf0710 force-pushed the rustc_mixed_script_confusable branch 2 times, most recently from 35d49f3 to 1fb9c04 Compare May 4, 2020 18:43
@crlf0710
Copy link
Collaborator Author

crlf0710 commented May 6, 2020

There's a debug boolean value in the python script (search "debug = False", https://github.com/unicode-rs/unicode-security/pull/13/files#diff-c87c196441d88317a5ea3bf97e9fde0aR536), if anyone's curious why these codepoints are confusable, you can toggle it and regenerate the table file, which will include comments on why these code points are considered mixed script confusable.

@crlf0710
Copy link
Collaborator Author

crlf0710 commented May 7, 2020

The python calculation part is a little complex there. So i'll leave a brief description. The main idea is creating equivalence classes from confusables.txt, with the prototype as each equivalence class's representative element. Then compare each element pair within each equivalence class. If they're from different scripts, then mark each of them potentially mixed script confusable. And within that there're some special handling when the prototype has multiple code points.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really hard to review the python stuff, it would be nice if each function had a long comment explaining what its purpose is. currently i have to kinda figure that out from the code, and it's hard to review things when you don't know what they're doing

confusables.append((d_input, d_outputs))

return confusables

def aliases():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps link to the original source in unicode-script so things can be manually updated if necessary

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments below this line.

@crlf0710
Copy link
Collaborator Author

@Manishearth Will add some comments, thanks!

@crlf0710
Copy link
Collaborator Author

Added some comments. Also cc @pyfisch here.

@crlf0710
Copy link
Collaborator Author

Adjusted the APIs and added a very simple test.

@crlf0710 crlf0710 requested a review from Manishearth May 30, 2020 05:22
@crlf0710 crlf0710 changed the title Implement data tables for rustc_mixed_script_confusable_detection. Add is_potential_mixed_script_confusable_char function May 30, 2020
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, would appreciate docs on each function

def is_script_ignored_in_mixedscript(source):
return source == 'Zinh' or source == 'Zyyy' or source == 'Zzzz'

def process_mixedscript_single_to_multi(item_i, script_i, proto_lst, scripts):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this function trying to do? what does it operate on? where did those rules come from?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to explain in comments.

return True
return False

def load_potential_mixedscript_confusables(f, identifier_allowed, scripts):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what this function returns, and what each argument is, should be documented.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i will add more comments.

@crlf0710
Copy link
Collaborator Author

crlf0710 commented Jun 9, 2020

Added more comments to document the details.

@crlf0710 crlf0710 requested a review from Manishearth June 9, 2020 14:32
@Manishearth Manishearth merged commit fd03564 into unicode-rs:master Jun 9, 2020
@crlf0710 crlf0710 deleted the rustc_mixed_script_confusable branch June 10, 2020 01:20
Manishearth added a commit to Manishearth/rust that referenced this pull request Jun 26, 2020
…=Manishearth

Implement mixed script confusable lint.

This implements the mixed script confusable lint defined in RFC 2457.
This is blocked on rust-lang#72069 and unicode-rs/unicode-security#13, and will need a Cargo.toml version bump after those are resolved.

The lint message warning is sub-optimal for now. We'll need a mechanism to properly output  `AugmentScriptSet` to screen, this is to be added in `unicode-security` crate.

r? @Manishearth
Manishearth added a commit to Manishearth/rust that referenced this pull request Jun 26, 2020
…=Manishearth

Implement mixed script confusable lint.

This implements the mixed script confusable lint defined in RFC 2457.
This is blocked on rust-lang#72069 and unicode-rs/unicode-security#13, and will need a Cargo.toml version bump after those are resolved.

The lint message warning is sub-optimal for now. We'll need a mechanism to properly output  `AugmentScriptSet` to screen, this is to be added in `unicode-security` crate.

r? @Manishearth
Manishearth added a commit to Manishearth/rust that referenced this pull request Jun 26, 2020
…=Manishearth

Implement mixed script confusable lint.

This implements the mixed script confusable lint defined in RFC 2457.
This is blocked on rust-lang#72069 and unicode-rs/unicode-security#13, and will need a Cargo.toml version bump after those are resolved.

The lint message warning is sub-optimal for now. We'll need a mechanism to properly output  `AugmentScriptSet` to screen, this is to be added in `unicode-security` crate.

r? @Manishearth
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants