Tokenize Unicode identifiers #2284

tamaroning · 2023-06-09T02:34:08Z

add function which checks XID_Start and XID_Continue to libcpp
identifiers
lifetime token

gcc/rust/ChangeLog:

	* lex/rust-lex.cc (is_whitespace): add all missing codepoints that are valid whitespace
	(is_identifier_start): new function to check XID_Start and underscore
	(is_identifier_continue): new function to check XID_Continue
	(Lexer::build_token): tokenize Unicode identifiers
	(Lexer::parse_partial_string_continue): add comments
	(Lexer::parse_partial_unicode_escape): add comments
	(Lexer::parse_raw_identifier): change to use `is_identifier_continue`
	(Lexer::parse_identifier_or_keyword): change to use `is_identifier_continue`
	(Lexer::parse_char_or_lifetime): change to use `is_identifier_start` and `is_identifier_continue`
	(Lexer::skip_codepoint_input): do not attempt to skip input when bumping EOF
	* lex/rust-lex.h: add `current_char32` field

libcpp/ChangeLog:

	* charset.cc (check_xid_property): new function to check XID_Start and XID_Continue
	* include/cpplib.h (check_xid_property): add forward declaration of `check_xid_property`

tamaroning · 2023-06-11T18:56:45Z

I found this code and am not sure how we handle the width of characters:
https://github.com/tamaroning/gccrs/blob/3c0f3ca4e4d8ddeecead05e109ace1917e021321/gcc/rust/lex/rust-lex.cc#L2462-L2463

tamaroning · 2023-06-12T10:53:22Z

libcpp/charset.cc

+/* Returns 1 if C has the XID_Start property, 2 if C has the XID_Continue property, 3 if C has both properties, 0 otherwise. */
+int check_xid_property (cppchar_t c) {
+  // fast path for ASCII
+  if (c < 0x80) {
+    if (('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z'))
+      return 3;
+    if (('0' <= c && c <= '9') || c == '_')
+      return 2;
+  }


I think this function should return bit flags, but how should I do that?
(i.e. None: 0, XID_Start: 1, XID_Continue: 1 << 1)
Is it enough to use int (or unsigned int) and export constants such as XID_START and XID_CONTINUE?

Fixed to use unsigned int and export constants via an enum.
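
For readers following the bit-flag question, here is a minimal sketch of what "unsigned int plus an enum of flags" could look like. All names below (`xid_property`, `check_xid_property_ascii`) are hypothetical and chosen for illustration; they are not the identifiers that ended up in libcpp, and the lookup only covers the ASCII fast path shown in the hunk above.

```cpp
#include <cstdint>

// Hypothetical flag names for illustration; not the libcpp identifiers.
enum xid_property : unsigned int
{
  XID_NONE = 0,
  XID_START = 1u << 0,
  XID_CONTINUE = 1u << 1,
};

// ASCII-only sketch. A real implementation would consult Unicode tables
// for codepoints >= 0x80; here they simply yield XID_NONE.
constexpr unsigned int
check_xid_property_ascii (std::uint32_t c)
{
  return (('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z'))
           ? static_cast<unsigned int> (XID_START | XID_CONTINUE)
         : (('0' <= c && c <= '9') || c == '_')
           ? static_cast<unsigned int> (XID_CONTINUE)
           : static_cast<unsigned int> (XID_NONE);
}

// Rust identifiers may start with XID_Start or an underscore.
constexpr bool
is_identifier_start (std::uint32_t c)
{
  return c == '_' || (check_xid_property_ascii (c) & XID_START) != 0;
}

constexpr bool
is_identifier_continue (std::uint32_t c)
{
  return (check_xid_property_ascii (c) & XID_CONTINUE) != 0;
}
```

With flags, callers test a single property with `&` instead of comparing against the magic values 1, 2 and 3.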

tamaroning · 2023-06-16T01:15:03Z

gcc/rust/lex/rust-lex.cc

 bool
-is_whitespace (char character)
+is_whitespace (int character)
 {
-  return ISSPACE (character);
+  // https://doc.rust-lang.org/reference/whitespace.html
+  return character == '\t' || character == '\n' || character == '\v'
+	 || character == '\f' || character == '\r' || character == ' '
+	 || character == 0x0085	 // next line
+	 || character == 0x200e	 // left-to-right mark
+	 || character == 0x200f	 // right-to-left mark
+	 || character == 0x2028	 // line separator


Added the missing whitespace codepoints.
However, non-ASCII whitespace is not actually checked during tokenization, because this function is called with an argument of type char (1 byte).
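
To illustrate the limitation described above, here is a small sketch, assuming the check is given a full codepoint rather than a single byte. The function mirrors the whitespace list from the Rust Reference page linked in the hunk; the `std::uint32_t` parameter type is an assumption for this example, not the signature in the patch.

```cpp
#include <cstdint>

// Whitespace set from https://doc.rust-lang.org/reference/whitespace.html
constexpr bool
is_whitespace (std::uint32_t codepoint)
{
  return codepoint == 0x0009    // horizontal tab
         || codepoint == 0x000A // line feed
         || codepoint == 0x000B // vertical tab
         || codepoint == 0x000C // form feed
         || codepoint == 0x000D // carriage return
         || codepoint == 0x0020 // space
         || codepoint == 0x0085 // next line
         || codepoint == 0x200E // left-to-right mark
         || codepoint == 0x200F // right-to-left mark
         || codepoint == 0x2028 // line separator
         || codepoint == 0x2029; // paragraph separator
}
```

U+2028 is encoded in UTF-8 as the three bytes E2 80 A8. Passing the decoded codepoint 0x2028 matches, while passing any one of those bytes on its own does not, which is why a caller that only has a `char` never reaches the non-ASCII branches.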

CohenArthur

That's great work, thank you!

CohenArthur · 2023-06-16T09:41:33Z

gcc/rust/lex/rust-lex.cc

+	 || character == 0x0085	 // next line
+	 || character == 0x200e	 // left-to-right mark
+	 || character == 0x200f	 // right-to-left mark
+	 || character == 0x2028	 // line separator
+	 || character == 0x2029; // paragraph separator


Are all of those characters accepted by rustc as whitespace?

Yes. All of these values are defined in the Rust Reference.
The URL is in the comment just above the selected lines.

Oh, I missed this! Sorry! Thanks for pointing it out haha

CohenArthur · 2023-06-16T09:44:46Z

gcc/rust/lex/rust-lex.cc

 	}

+      current_char = peek_input ();
+      current_char32 = peek_codepoint_input ();
+      skip_codepoint_input ();


why are we skipping the codepoint input here but not the char?

If we skipped one byte here, the lexer would skip only the first byte of the current UTF-8 character, which is not what we want.
For example, when the lexer tokenizes the identifier あああ, it should skip the first UTF-8 character あ, not just its first byte.

I see, thank you!
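
The byte-versus-codepoint distinction in this thread can be sketched as follows. `utf8_sequence_length` is a hypothetical helper written for this example and is not a function from the gccrs lexer; it only inspects the lead byte and does no validation of the following bytes.

```cpp
#include <cstddef>

// Number of bytes in the UTF-8 sequence introduced by LEAD.
// Returns 0 if LEAD is a continuation byte (10xxxxxx) or has the
// 11111xxx pattern. Overlong or out-of-range lead bytes are not
// rejected in this sketch.
constexpr std::size_t
utf8_sequence_length (unsigned char lead)
{
  return lead < 0x80             ? 1
         : (lead & 0xE0) == 0xC0 ? 2
         : (lead & 0xF0) == 0xE0 ? 3
         : (lead & 0xF8) == 0xF0 ? 4
                                 : 0;
}
```

あ is U+3042, encoded as E3 81 82, so skipping one codepoint means advancing three bytes. Advancing a single byte would leave the lexer positioned on the continuation byte 0x81.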

CohenArthur · 2023-06-16T09:47:03Z

gcc/rust/lex/rust-lex.cc

+	  // TODO some keywords cannot be used for a lifetime label
+	  // https://doc.rust-lang.org/reference/tokens.html


That's good - please open an issue for this and put the issue number in the comment :)

Opened #2306

tamaroning · 2023-06-19T04:24:39Z

Oops, I mistakenly made the last commit. I will undo it later.

tamaroning · 2023-06-19T04:38:02Z

Done

CohenArthur

Very minor nit :) This looks great to me, thank you @tamaroning!

CohenArthur · 2023-06-19T07:39:59Z

gcc/rust/lex/rust-lex.cc

 	      length++;
 	    }

 	  current_column += length;

 	  loc += length - 1;

+	  // TODO some keywords cannot be used for a lifetime label


Suggested change

-	  // TODO some keywords cannot be used for a lifetime label
+	  // TODO some keywords cannot be used for a lifetime label #2306

CohenArthur · 2023-06-19T08:00:00Z

@tamaroning I noticed that your commit does not include a commit title - please add a short description of the commit as title. It should look something like this:

    gccrs: we can't check the bounds involving empty placeholder types

    We use placeholders for associated types on traits but if we are unifying
    types against a placeholder it's not possible to check the bounds as the
    placeholder does not have enough information yet at this point to determine
    if bounds will or won't be satisfied. That check will occur when associated
    types and generics are setup.

    Fixes #2036

    gcc/rust/ChangeLog:

            * typecheck/rust-unify.cc (UnifyRules::go): dont check bounds on placeholders

    gcc/testsuite/ChangeLog:

            * rust/compile/issue-2036.rs: New test.

    Signed-off-by: Philip Herron <herron.philip@googlemail.com>

You can see the commit has a title, which can be anything as long as it is a short description of the commit. Then comes the commit description, which is not always necessary, followed by the ChangeLog and finally the Signed-off-by line.

tamaroning · 2023-06-19T08:16:42Z

@CohenArthur Thanks for your review! I fixed comments and commit messages.

philberty

This patch looks great, though I think what will really help is to split this into two commits. So you put all the libcpp changes into one commit and the gccrs changes in another. The only reason for this is that the libcpp changes will need to be reviewed by other people in the GCC community and it will be easier to do this by separating them out.

Great work!

libcpp/ChangeLog:

	* charset.cc (check_xid_property): new function to check XID_Start and XID_Continue
	* include/cpplib.h (check_xid_property): add enum representing XID properties

Signed-off-by: Raiki Tamura <tamaron1203@gmail.com>

gcc/rust/ChangeLog:

	* lex/rust-lex.cc (is_whitespace): add all missing codepoints that are valid whitespace
	(is_identifier_start): new function to check XID_Start and underscore
	(is_identifier_continue): new function to check XID_Continue
	(Lexer::build_token): tokenize Unicode identifiers
	(Lexer::parse_partial_string_continue): add comments
	(Lexer::parse_partial_unicode_escape): add comments
	(Lexer::parse_raw_identifier): change to use `is_identifier_continue`
	(Lexer::parse_identifier_or_keyword): change to use `is_identifier_continue`
	(Lexer::parse_char_or_lifetime): change to use `is_identifier_start/continue`
	(Lexer::skip_codepoint_input): do not attempt to skip input when bumping EOF
	* lex/rust-lex.h: add `current_char32` field

Signed-off-by: Raiki Tamura <tamaron1203@gmail.com>

tamaroning · 2023-06-19T09:36:58Z

This patch looks great, though I think what will really help is to split this into two commits. So you put all the libcpp changes into one commit and the gccrs changes in another. The only reason for this is that the libcpp changes will need to be reviewed by other people in the GCC community and it will be easier to do this by separating them out.

Sounds good :) Done.

P-E-P

I've not much to say, that's good work.

P-E-P · 2023-06-19T11:03:47Z

gcc/rust/lex/rust-lex.cc

@@ -116,6 +123,18 @@ is_non_decimal_int_literal_separator (char character)
  return character == 'x' || character == 'o' || character == 'b';
 }

+bool
+is_identifier_start (int codepoint)


Do we want to unify with the Codepoint alias in rust-codepoint.h? Also, what about specifying an explicit size (e.g. std::uint32_t, maybe even wchar_t)?

Yes, I think uint32_t is better. If we unify the parameter types of such functions, several other functions should also use the same type,
e.g. is_x_digit, is_octal_digit, etc.

gccrs/gcc/rust/lex/rust-lex.cc

Lines 83 to 87 in d535c82

bool
is_x_digit (char number)
{
  return ISXDIGIT (number);
}
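
As a rough sketch of the unification discussed here, the digit helpers could take a fixed-width codepoint instead of `char`. The alias name `codepoint_t` is hypothetical and merely stands in for whatever type the project settles on, and the range checks replace `ISXDIGIT` only to keep the example self-contained.

```cpp
#include <cstdint>

// Hypothetical alias for illustration; stands in for the project's
// own codepoint type.
using codepoint_t = std::uint32_t;

constexpr bool
is_x_digit (codepoint_t c)
{
  return ('0' <= c && c <= '9') || ('a' <= c && c <= 'f')
         || ('A' <= c && c <= 'F');
}

constexpr bool
is_octal_digit (codepoint_t c)
{
  return '0' <= c && c <= '7';
}
```

With a 32-bit parameter, a non-ASCII codepoint such as U+3042 is passed through intact and simply fails the range checks, rather than being truncated to one byte at the call site.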

philberty

LGTM


CohenArthur marked this pull request as draft June 12, 2023 08:32

tamaroning mentioned this pull request Jun 12, 2023

Unicode support #2287

Open

15 tasks

tamaroning commented Jun 12, 2023


tamaroning force-pushed the uc-lexer branch 3 times, most recently from 1a03be4 to 41e4ea7 Compare June 16, 2023 01:09

tamaroning commented Jun 16, 2023


tamaroning changed the title ~~[WIP] Tokenize Unicode identifiers~~ Tokenize Unicode identifiers Jun 16, 2023

tamaroning force-pushed the uc-lexer branch from 41e4ea7 to d7b3752 Compare June 16, 2023 01:24

tamaroning marked this pull request as ready for review June 16, 2023 03:19

CohenArthur requested a review from philberty June 16, 2023 09:40

CohenArthur added the enhancement label Jun 16, 2023

CohenArthur reviewed Jun 16, 2023


tamaroning force-pushed the uc-lexer branch 2 times, most recently from 44ce808 to b77a471 Compare June 17, 2023 16:27

tamaroning mentioned this pull request Jun 18, 2023

Refactor lexer to treat all input characters as UTF-8 #2307

Merged

tamaroning force-pushed the uc-lexer branch from e466722 to b77a471 Compare June 19, 2023 04:36

CohenArthur approved these changes Jun 19, 2023


CohenArthur requested a review from P-E-P June 19, 2023 07:40

tamaroning force-pushed the uc-lexer branch from b77a471 to cb12be8 Compare June 19, 2023 08:15

philberty requested changes Jun 19, 2023


tamaroning force-pushed the uc-lexer branch from cb12be8 to 355e737 Compare June 19, 2023 09:09

tamaroning added 2 commits June 19, 2023 18:26

libcpp: add function to check XID properties

eb60755


tamaroning force-pushed the uc-lexer branch from 355e737 to d535c82 Compare June 19, 2023 09:29

P-E-P approved these changes Jun 19, 2023


philberty approved these changes Jun 20, 2023


philberty added this to the AST Pipeline for libcore 1.49 Complete milestone Jun 20, 2023

philberty added this pull request to the merge queue Jun 20, 2023

Merged via the queue into Rust-GCC:master with commit 1931d49 Jun 20, 2023

tamaroning mentioned this pull request Jun 29, 2023

Remove unnecessary methods/fields of Rust::Lexer #2347

Merged

tschwinge added a commit that referenced this pull request Mar 24, 2024

Merge commit '00dea7e8c41b672730d6e2c891b6012a83d8842c' into HEAD [#2284]

59bc3e7
