Tokenize Unicode identifiers #2284

Merged
merged 2 commits into from
Jun 20, 2023

Conversation

Contributor

@tamaroning tamaroning commented Jun 9, 2023

Addresses #2287 #418

  • add a function to libcpp that checks XID_Start and XID_Continue
  • tokenize Unicode identifiers
  • tokenize Unicode lifetime tokens

gcc/rust/ChangeLog:

	* lex/rust-lex.cc (is_whitespace): add all missing codepoints that are valid whitespace
	(is_identifier_start): new function to check XID_Start and underscore
	(is_identifier_continue): new function to check XID_Continue
	(Lexer::build_token): tokenize Unicode identifiers
	(Lexer::parse_partial_string_continue): add comments
	(Lexer::parse_partial_unicode_escape): add comments
	(Lexer::parse_raw_identifier): change to use `is_identifier_continue`
	(Lexer::parse_identifier_or_keyword): change to use `is_identifier_continue`
	(Lexer::parse_char_or_lifetime): change to use `is_identifier_start` and `is_identifier_continue`
	(Lexer::skip_codepoint_input): do not attempt to skip input when bumping EOF
	* lex/rust-lex.h: add `current_char32` field

libcpp/ChangeLog:

	* charset.cc (check_xid_property): new function to check XID_Start and XID_Continue
	* include/cpplib.h (check_xid_property): add forward declaration of `check_xid_property`

@tamaroning
Contributor Author

I found this code and am not sure how we handle the width of characters:
https://github.com/tamaroning/gccrs/blob/3c0f3ca4e4d8ddeecead05e109ace1917e021321/gcc/rust/lex/rust-lex.cc#L2462-L2463

@CohenArthur CohenArthur marked this pull request as draft June 12, 2023 08:32
@tamaroning tamaroning mentioned this pull request Jun 12, 2023
15 tasks
Comment on lines 1259 to 1270
/* Returns 1 if C has the XID_Start property, 2 if C has the XID_Continue property, 3 if C has both properties, 0 otherwise. */
int check_xid_property (cppchar_t c) {
// fast path for ASCII
if (c < 0x80) {
if (('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z'))
return 3;
if (('0' <= c && c <= '9') || c == '_')
return 2;
}
Contributor Author

@tamaroning tamaroning Jun 12, 2023

I think this function should return bit flags, but what is the best way to do that?
(i.e. None: 0, XID_Start: 1, XID_Continue: 1 << 1)
Would it be fine to just use int (or unsigned int) and export constants such as XID_START and XID_CONTINUE?

Contributor Author

@tamaroning tamaroning Jun 16, 2023

Fixed: it now uses unsigned int and exports the constants via an enum.
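
For readers following the thread, the bit-flag idea discussed here could look roughly like the sketch below. The enumerator names `XID_START` and `XID_CONTINUE` come from the comment above; the enum name `xid_flags`, the `XID_NONE` enumerator, the function name `check_xid_ascii`, and the ASCII-only coverage are assumptions made for illustration, and this is not the code that was merged into libcpp.

```cpp
#include <cstdint>

// Hypothetical enum: each property gets its own bit so values can be OR-ed.
enum xid_flags : unsigned int
{
  XID_NONE = 0,
  XID_START = 1u << 0,
  XID_CONTINUE = 1u << 1
};

// ASCII-only sketch. A real implementation would consult Unicode tables
// for code points >= 0x80; this one simply reports XID_NONE for them.
constexpr unsigned int
check_xid_ascii (std::uint32_t c)
{
  return (('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z'))
           ? static_cast<unsigned int> (XID_START | XID_CONTINUE)
         : (('0' <= c && c <= '9') || c == '_')
           ? static_cast<unsigned int> (XID_CONTINUE)
           : static_cast<unsigned int> (XID_NONE);
}

// Callers test individual bits instead of comparing against 1, 2 or 3.
static_assert ((check_xid_ascii ('x') & XID_START) != 0, "letters can start");
static_assert ((check_xid_ascii ('7') & XID_START) == 0, "digits cannot start");
```

With flags like these, a caller asks "is the XID_Start bit set?" rather than remembering what the magic return values 1, 2 and 3 mean.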

@tamaroning tamaroning force-pushed the uc-lexer branch 3 times, most recently from 1a03be4 to 41e4ea7 Compare June 16, 2023 01:09
Comment on lines 107 to +116
 bool
-is_whitespace (char character)
+is_whitespace (int character)
 {
-return ISSPACE (character);
+// https://doc.rust-lang.org/reference/whitespace.html
+return character == '\t' || character == '\n' || character == '\v'
+|| character == '\f' || character == '\r' || character == ' '
+|| character == 0x0085 // next line
+|| character == 0x200e // left-to-right mark
+|| character == 0x200f // right-to-left mark
+|| character == 0x2028 // line separator
Contributor Author

Added the missing whitespace codepoints.
Note that non-ASCII whitespace is not actually checked during tokenization, because this function is called with an argument of type char (1 byte).
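
The limitation described above can be illustrated with a small sketch. The list of code points is copied from the diff in this thread; the `std::uint32_t` parameter type is an assumption (the diff uses `int`), and the example is only meant to show why a single byte can never match a non-ASCII entry.

```cpp
#include <cstdint>

// Whitespace code points from the Rust reference, as listed in the diff.
constexpr bool
is_whitespace (std::uint32_t c)
{
  return c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r'
         || c == ' '
         || c == 0x0085  // next line
         || c == 0x200e  // left-to-right mark
         || c == 0x200f  // right-to-left mark
         || c == 0x2028  // line separator
         || c == 0x2029; // paragraph separator
}

// U+2028 is encoded in UTF-8 as the bytes E2 80 A8. A caller that only
// holds the first byte passes 0xE2, which is not in the list.
static_assert (is_whitespace (0x2028), "decoded code point matches");
static_assert (!is_whitespace (0xE2), "its first UTF-8 byte does not");
```

So the wider parameter type only pays off once the call sites pass a decoded code point instead of a single byte of input.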

@tamaroning tamaroning changed the title [WIP] Tokenize Unicode identifiers Tokenize Unicode identifiers Jun 16, 2023
@tamaroning tamaroning marked this pull request as ready for review June 16, 2023 03:19
Member

@CohenArthur CohenArthur left a comment

That's great work, thank you!

Comment on lines +113 to +117
|| character == 0x0085 // next line
|| character == 0x200e // left-to-right mark
|| character == 0x200f // right-to-left mark
|| character == 0x2028 // line separator
|| character == 0x2029; // paragraph separator
Member

Are all of those characters accepted by rustc as whitespace?

Contributor Author

Yes, all of these values are defined in the Rust Reference.
The URL is in the comment just before the selected lines.

Member

Oh, I missed this! Sorry! Thanks for pointing it out haha

}

current_char = peek_input ();
current_char32 = peek_codepoint_input ();
skip_codepoint_input ();
Member

why are we skipping the codepoint input here but not the char?

Contributor Author

If we skipped one byte here, the lexer would skip only the first byte of the current UTF-8 character, which is not what we want.
For example, when the lexer tokenizes the identifier あああ, it should skip the whole first UTF-8 character あ, not just its first byte.
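
To make the byte-versus-character distinction concrete, here is a minimal sketch of how the length of a UTF-8 sequence can be read off its first byte. The helper name `utf8_sequence_length` is hypothetical and is not the lexer's `skip_codepoint_input`; it also does no validation of continuation bytes or overlong encodings.

```cpp
#include <cstdint>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with LEAD, or 0 if LEAD cannot start a sequence (e.g. a continuation byte).
constexpr int
utf8_sequence_length (std::uint8_t lead)
{
  return lead < 0x80             ? 1
         : (lead & 0xE0) == 0xC0 ? 2
         : (lead & 0xF0) == 0xE0 ? 3
         : (lead & 0xF8) == 0xF0 ? 4
                                 : 0;
}

// U+3042 (あ) is encoded as the three bytes E3 81 82, so skipping the
// character means advancing by three bytes, not one.
static_assert (utf8_sequence_length (0xE3) == 3, "three-byte sequence");
static_assert (utf8_sequence_length (0x81) == 0, "continuation byte");
```

Skipping a single byte after reading あ would leave the lexer positioned on the continuation byte 0x81, in the middle of the character.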

Member

I see, thank you!

Comment on lines 2490 to 2491
// TODO some keywords cannot be used for a lifetime label
// https://doc.rust-lang.org/reference/tokens.html
Member

That's good - please open an issue for this and put the issue number in the comment :)

Contributor Author

Opened #2306

@tamaroning
Contributor Author

Oops, I pushed the last commit by mistake. I will undo it later.

@tamaroning
Contributor Author

Done

Member

@CohenArthur CohenArthur left a comment

Very minor nit :) This looks great to me, thank you @tamaroning!

length++;
}

current_column += length;

loc += length - 1;

// TODO some keywords cannot be used for a lifetime label
Member

Suggested change
-// TODO some keywords cannot be used for a lifetime label
+// TODO some keywords cannot be used for a lifetime label #2306

@CohenArthur CohenArthur requested a review from P-E-P June 19, 2023 07:40
@CohenArthur
Member

@tamaroning I noticed that your commit does not include a commit title - please add a short description of the commit as title. It should look something like this:

    gccrs: we can't check the bounds involving empty placeholder types

    We use placeholders for associated types on traits but if we are unifying
    types against a placeholder it's not possible to check the bounds as the
    placeholder does not have enough information yet at this point to determine
    if bounds will or won't be satisfied. That check will occur when associated
    types and generics are setup.

    Fixes #2036

    gcc/rust/ChangeLog:

            * typecheck/rust-unify.cc (UnifyRules::go): dont check bounds on placeholders

    gcc/testsuite/ChangeLog:

            * rust/compile/issue-2036.rs: New test.

    Signed-off-by: Philip Herron <herron.philip@googlemail.com>

You can see the commit has a title, which can be anything as long as it is a short description of the commit. Then comes the commit description, which is not always necessary, then the ChangeLog, and finally the Signed-off-by line.

@tamaroning
Contributor Author

@CohenArthur Thanks for your review! I fixed comments and commit messages.

Member

@philberty philberty left a comment

This patch looks great, though I think it would really help to split it into two commits: put all the libcpp changes into one commit and the gccrs changes into another. The only reason is that the libcpp changes will need to be reviewed by other people in the GCC community, and separating them out will make that easier.

Great work!

libcpp/ChangeLog:

	* charset.cc (check_xid_property): new function to check XID_Start and XID_Continue
	* include/cpplib.h (check_xid_property): add enum representing XID properties

Signed-off-by: Raiki Tamura <tamaron1203@gmail.com>
gcc/rust/ChangeLog:

	* lex/rust-lex.cc (is_whitespace): add all missing codepoints that are valid whitespace
	(is_identifier_start): new function to check XID_Start and underscore
	(is_identifier_continue): new function to check XID_Continue
	(Lexer::build_token): tokenize Unicode identifiers
	(Lexer::parse_partial_string_continue): add comments
	(Lexer::parse_partial_unicode_escape): add comments
	(Lexer::parse_raw_identifier): change to use `is_identifier_continue`
	(Lexer::parse_identifier_or_keyword): change to use `is_identifier_continue`
	(Lexer::parse_char_or_lifetime): change to use `is_identifier_start/continue`
	(Lexer::skip_codepoint_input): do not attempt to skip input when bumping EOF
	* lex/rust-lex.h: add `current_char32` field

Signed-off-by: Raiki Tamura <tamaron1203@gmail.com>
@tamaroning
Contributor Author

This patch looks great, though I think it would really help to split it into two commits: put all the libcpp changes into one commit and the gccrs changes into another. The only reason is that the libcpp changes will need to be reviewed by other people in the GCC community, and separating them out will make that easier.

Sounds good :) Done.

Member

@P-E-P P-E-P left a comment

I've not much to say, that's good work.

@@ -116,6 +123,18 @@ is_non_decimal_int_literal_separator (char character)
return character == 'x' || character == 'o' || character == 'b';
}

bool
is_identifier_start (int codepoint)
Member

Do we want to unify this with the Codepoint alias in rust-codepoint.h? Also, what about specifying an explicit size (e.g. std::uint32_t, maybe even wchar_t)?

Contributor Author

Yes, I think uint32_t is better. If we unify the parameter types of such functions, several other functions should also use the same type,
e.g. is_x_digit, is_octal_digit, etc.

bool
is_x_digit (char number)
{
return ISXDIGIT (number);
}
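
As a rough illustration of what unifying the parameter type might look like, the sketch below gives both digit predicates the same fixed-width type. The alias name `codepoint_t` is hypothetical (the thread mentions a Codepoint alias in rust-codepoint.h but does not show it), and the hand-written range checks stand in for `ISXDIGIT`, so this is only one possible shape, not the change that was eventually made.

```cpp
#include <cstdint>

// Hypothetical alias used only in this sketch.
using codepoint_t = std::uint32_t;

constexpr bool
is_x_digit (codepoint_t c)
{
  return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
         || (c >= 'A' && c <= 'F');
}

constexpr bool
is_octal_digit (codepoint_t c)
{
  return c >= '0' && c <= '7';
}

// U+0141 is not a hex digit, but its low byte 0x41 is 'A'. With a 32-bit
// parameter the full value is compared; a value cut down to one byte
// would be accepted by mistake.
static_assert (!is_x_digit (0x0141), "full code point is rejected");
static_assert (is_x_digit (0x0141 & 0xFF), "truncated value is accepted");
```

The point of the explicit width is that every predicate sees the same, complete code point, so no call site can silently narrow it.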

Member

@philberty philberty left a comment

LGTM

@philberty philberty added this pull request to the merge queue Jun 20, 2023
Merged via the queue into Rust-GCC:master with commit 1931d49 Jun 20, 2023