sys/shell: refactor tokenizer code. #12085

jcarrano · 2019-08-26T15:03:21Z

Contribution description

The tokenizer (the code that breaks up the line given to the shell into strings to create argv) was quite a messy piece of code. This commit refactors it into a more traditional state-machine based parser.

This fixes the issues with quote handling exposed by the recently introduced test.

Not only is the code clearer now, it is also smaller:

In the SAMR-21, before:

   text	   data	    bss	    dec	    hex	filename
  10292	    140	   2612	  13044	   32f4	tests/shell/bin/samr21-xpro/tests_shell.elf

After 1st refactor commit:

   text    data     bss     dec     hex filename
  10268	    140	   2612	  13020	   32dc	tests/shell/bin/samr21-xpro/tests_shell.elf

After 2nd change:

   text	   data	    bss	    dec	    hex	filename
  10256	    140	   2612	  13008	   32d0	tests/shell/bin/samr21-xpro/tests_shell.elf

Testing procedure

I added test vectors to the automated test. The old code could not handle this string:

> echo "t\e st" "\"" '\'' a\ b
shell: incorrect quoting
> echo "\""
shell: incorrect quoting

The new code handles it correctly.

Issues/PRs references

To be continued....

Check that single and doule quotes work, along with backslash escaping and that malformed strings are rejected. Right now the test is failing. Then next commit will replace the tokenizer with one that works.

The tokenizer (the code that breaks up the line given to the shell into strings to create argv) was quite a messy piece of code. This commit refactors it into a more traditional state-machine based parser. This fixes the issues with quote handling exposed by the recently introduced test. Not only is the code clearer now, it is also smaller: In the SAMR-21, before: ``` text data bss dec hex filename 10292 140 2612 13044 32f4 tests/shell/bin/samr21-xpro/tests_shell.elf ``` After: ``` text data bss dec hex filename 10268 140 2612 13020 32dc tests/shell/bin/samr21-xpro/tests_shell.elf ``` Further commits can bring this number down even further, but for now the focus in on clarity.

Factor out common code for quoted and unquoted tokens. This makes the code slighly less clear, but it also eliminates repetition (which may improve clarity). The binary size is decreased again (for samr21): Before: ``` text data bss dec hex filename 10268 140 2612 13020 32dc tests/shell/bin/samr21-xpro/tests_shell.elf ``` After: ``` text data bss dec hex filename 10256 140 2612 13008 32d0 tests/shell/bin/samr21-xpro/tests_shell.elf ```

jia200x · 2019-08-27T08:30:11Z

I will check it today during Hack'n Ack :)

jcarrano · 2019-08-27T09:49:31Z

@cladmi pointed out that I was not building with RIOT_CI_BUILD=1, so my results above are inacurate. Here are the correct results with RIOT_CI_BUILD=1 BOARD=samr21-xpro make all at the first, second, and third commmit:

   text	   data	    bss	    dec	    hex	filename
  10212	    140	   2612	  12964	   32a4	tests/shell/bin/samr21-xpro/tests_shell.elf

   text	   data	    bss	    dec	    hex	filename
  10188	    140	   2612	  12940	   328c	tests/shell/bin/samr21-xpro/tests_shell.elf

   text	   data	    bss	    dec	    hex	filename
  10176	    140	   2612	  12928	   3280	tests/shell/bin/samr21-xpro/tests_shell.elf

The code for traversing arrays of shell commands (used to print help messages and to search for commmand handlers) was needlessly complex. The new code is both cleaner and results in smaller binaries (at least for samr21). A substantial part of the improvement in size is due to the following change ``` - printf("%-20s %s\n", "Command", "Description"); - puts("---------------------------------------"); + puts("Command Description\n---------------------------------------"); ``` The way the code is written now means that ifdefs in the middle of the code can be eliminated, further enhancing clarity. Size before: ``` text data bss dec hex filename 10176 140 2612 12928 3280 tests/shell/bin/samr21-xpro/tests_shell.elf ``` After: ``` text data bss dec hex filename 10164 140 2612 12916 3274 tests/shell/bin/samr21-xpro/tests_shell.elf ```

jcarrano · 2019-08-27T11:45:57Z

Fourth commit:

   text	   data	    bss	    dec	    hex	filename
  10164	    140	   2612	  12916	   3274	tests/shell/bin/samr21-xpro/tests_shell.elf

MrKevinWeiss · 2019-08-27T16:05:38Z

Tested with a nucleo board. Also just dicked around with it a bit and could not find too much problems.

I would have expected that the access to _shell_command_list would be elided when the branch was inaccessible, but it did not happen.

jia200x · 2019-08-28T09:13:45Z

sys/shell/shell.c

+    char *writepos = readpos;
+    enum PARSE_STATE pstate = PARSE_SPACE;
+
+    while (*readpos != '\0') {


I find this part too hard to read. Maybe something like?

Suggested change

while (*readpos != '\0') {

while (*readpos != '\0') {

char wordbreak;

switch (pstate) {

case PARSE_SPACE:

if (*readpos != BLANK) {

argc++;

}

if (*readpos == SQUOTE) {

pstate = PARSE_SINGLEQUOTE;

} else if (*readpos == DQUOTE) {

pstate = PARSE_DOUBLEQUOTE;

} else if (*readpos == ESCAPECHAR) {

pstate = PARSE_UNQUOTED_ESC;

} else if (*readpos != BLANK) {

pstate = PARSE_UNQUOTED;

goto write;

}

readpos++;

continue;

case PARSE_UNQUOTED:

wordbreak = BLANK;

goto break_word;

case PARSE_SINGLEQUOTE:

wordbreak = SQUOTE;

goto break_word;

case PARSE_DOUBLEQUOTE:

wordbreak = DQUOTE;

goto break_word;

default: /* QUOTED state */

pstate = escape_toggle(pstate);

goto write;

}

break_word:

if (*readpos == wordbreak) {

pstate = PARSE_SPACE;

*writepos++ = '\0';

goto parse_end;

} else if (*readpos == ESCAPECHAR) {

pstate = escape_toggle(pstate);

goto parse_end;

}

write:

*writepos++ = *readpos;

parse_end:

readpos++;

}

It's a similar goto logic but the labels are not inside the switch.
Also, it seems to reduce the size (native) from:

Building application "default" for "native" with MCU "native". text data bss dec hex filename 90573 1104 72088 163765 27fb5 /home/jialamos/Development/RIOT/examples/default/bin/native/default.elf

to

Building application "default" for "native" with MCU "native". text data bss dec hex filename 90557 1104 72088 163749 27fa5 /home/jialamos/Development/RIOT/examples/default/bin/native/default.elf

On samr21-xpro, the build size keeps the same. No idea about others too.

Here's a non-goto version:

Suggested change

while (*readpos != '\0') {

for(readpos = line; *readpos != '\0'; readpos++) {

char wordbreak;

switch (pstate) {

case PARSE_SPACE:

if (*readpos != BLANK) {

argc++;

}

if (*readpos == SQUOTE) {

pstate = PARSE_SINGLEQUOTE;

} else if (*readpos == DQUOTE) {

pstate = PARSE_DOUBLEQUOTE;

} else if (*readpos == ESCAPECHAR) {

pstate = PARSE_UNQUOTED_ESC;

} else if (*readpos != BLANK) {

pstate = PARSE_UNQUOTED;

*writepos++ = *readpos;

}

continue;

case PARSE_UNQUOTED:

wordbreak = BLANK;

break;

case PARSE_SINGLEQUOTE:

wordbreak = SQUOTE;

break;

case PARSE_DOUBLEQUOTE:

wordbreak = DQUOTE;

break;

default: /* QUOTED state */

pstate = escape_toggle(pstate);

*writepos++ = *readpos;

continue;

}

if (*readpos == wordbreak) {

pstate = PARSE_SPACE;

*writepos++ = '\0';

} else if (*readpos == ESCAPECHAR) {

pstate = escape_toggle(pstate);

}

}

It's 4 bytes bigger on samr21-xpro

I made a similar modification (I'm about to push the fixup). The size does not change in samr21 because the compiler is smart enough to move the code around, so the generated machine code is the same in both cases.

mmm, hold on, I don't think it improves so much in terms of clarity:

while (*readpos != '\0') { char wordbreak; switch (pstate) { case PARSE_SPACE: if (*readpos != BLANK) { argc++; } if (*readpos == SQUOTE) { pstate = PARSE_SINGLEQUOTE; } else if (*readpos == DQUOTE) { pstate = PARSE_DOUBLEQUOTE; } else if (*readpos == ESCAPECHAR) { pstate = PARSE_UNQUOTED_ESC; } else if (*readpos != BLANK) { pstate = PARSE_UNQUOTED; goto write_word; } goto parse_end; case PARSE_UNQUOTED: wordbreak = BLANK; break; case PARSE_SINGLEQUOTE: wordbreak = SQUOTE; break; case PARSE_DOUBLEQUOTE: wordbreak = DQUOTE; break; default: /* QUOTED state */ pstate = escape_toggle(pstate); goto write_word; } if (*readpos == wordbreak) { pstate = PARSE_SPACE; *writepos++ = '\0'; } else if (*readpos == ESCAPECHAR) { pstate = escape_toggle(pstate); } else { write_word: *writepos++ = *readpos; } parse_end: readpos++; }

I could draw the state diagram as a comment before the function.

the state diagram would also help

MrKevinWeiss · 2019-08-28T09:22:40Z

Generally I think that it would be better no to use the goto like that, even if it saves a few bytes. I also don't know if we need to be that concerned about saving a few bytes on shell... Of course if it is free lets do it but if it sacrifices some readability then maybe just go with the old version.

jcarrano · 2019-08-28T12:22:36Z

sys/shell/shell.c

+
+           ┌───[\]────┐   ┌─────["]────┐   ┌───[']─────┐  ┌───[\]────┐
+           ↓          │   ↓            │   │           ↓  │          ↓
+  ┏━━━━━━━━━━┓      ┏━┷━━━━━┓        ┏━┷━━━┷━┓       ┏━━━━┷━━┓     ┏━━━━━━━━━━┓


This displays badly because github's font is defective. In your editor or terminal it should display fine.

HendrikVE · 2020-01-24T13:12:07Z

Can be closed due to #13197

fjmolinas · 2020-06-24T12:31:21Z

Closing since #13197 is in.

jcarrano added 3 commits August 26, 2019 17:00

tests/shell: Test quoting and escaping.

b01426f

Check that single and doule quotes work, along with backslash escaping and that malformed strings are rejected. Right now the test is failing. Then next commit will replace the tokenizer with one that works.

jcarrano requested review from miri64 and jia200x August 26, 2019 15:03

jcarrano mentioned this pull request Aug 26, 2019

sys/shell: Make echoing and prompt runtime-configurable. #12087

Closed

jcarrano requested a review from MrKevinWeiss August 26, 2019 15:19

jcarrano mentioned this pull request Aug 27, 2019

sys/shell: refactor readline function. #12099

Closed

2 tasks

jcarrano added the CI: needs squashing Commits in this PR need to be squashed; If set, CI systems will mark this PR as unmergable label Aug 27, 2019

fixup! llvm is not that smart in native.

eb38d01

I would have expected that the access to _shell_command_list would be elided when the branch was inaccessible, but it did not happen.

jia200x reviewed Aug 28, 2019

View reviewed changes

jia200x added the Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines label Aug 28, 2019

jcarrano commented Aug 28, 2019

View reviewed changes

fixup! include state diagram

1be6a13

jcarrano mentioned this pull request Aug 28, 2019

[TRACKING] sys/shell refactoring. #12105

Open

8 tasks

HendrikVE mentioned this pull request Jan 24, 2020

sys/shell: refactor tokenizer code #13197

Merged

1 task

fjmolinas closed this Jun 24, 2020

Enoch247 mentioned this pull request Dec 21, 2023

sys/shell: fix getopt() support #20209

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sys/shell: refactor tokenizer code. #12085

sys/shell: refactor tokenizer code. #12085

jcarrano commented Aug 26, 2019

jia200x commented Aug 27, 2019

jcarrano commented Aug 27, 2019

jcarrano commented Aug 27, 2019

MrKevinWeiss commented Aug 27, 2019

jia200x Aug 28, 2019 •

edited

Loading

jia200x Aug 28, 2019 •

edited

Loading

jcarrano Aug 28, 2019

jcarrano Aug 28, 2019

jcarrano Aug 28, 2019

jia200x Aug 28, 2019

MrKevinWeiss commented Aug 28, 2019

jcarrano Aug 28, 2019

HendrikVE commented Jan 24, 2020

fjmolinas commented Jun 24, 2020

-    while (*readpos != '\0') {
+    while (*readpos != '\0') {
+        char wordbreak;
+        switch (pstate) {
+            case PARSE_SPACE:
+                if (*readpos != BLANK) {
+                    argc++;
+                }
+                if (*readpos == SQUOTE) {
+                    pstate = PARSE_SINGLEQUOTE;
+                } else if (*readpos == DQUOTE) {
+                    pstate = PARSE_DOUBLEQUOTE;
+                } else if (*readpos == ESCAPECHAR) {
+                    pstate = PARSE_UNQUOTED_ESC;
+                } else if (*readpos != BLANK) {
+                    pstate = PARSE_UNQUOTED;
+                    goto write;
+                }
+                readpos++;
+                continue;
+            case PARSE_UNQUOTED:
+                wordbreak = BLANK;
+                goto break_word;
+            case PARSE_SINGLEQUOTE:
+                wordbreak = SQUOTE;
+                goto break_word;
+            case PARSE_DOUBLEQUOTE:
+                wordbreak = DQUOTE;
+                goto break_word;
+            default: /* QUOTED state */
+                pstate = escape_toggle(pstate);
+                goto write;
+        }
+break_word:
+    if (*readpos == wordbreak) {
+        pstate = PARSE_SPACE;
+        *writepos++ = '\0';
+        goto parse_end;
+    } else if (*readpos == ESCAPECHAR) {
+        pstate = escape_toggle(pstate);
+        goto parse_end;
+    }
+write:
+        *writepos++ = *readpos;
+parse_end:
+        readpos++;
+    }

sys/shell: refactor tokenizer code. #12085

sys/shell: refactor tokenizer code. #12085

Conversation

jcarrano commented Aug 26, 2019

Contribution description

Testing procedure

Issues/PRs references

jia200x commented Aug 27, 2019

jcarrano commented Aug 27, 2019

jcarrano commented Aug 27, 2019

MrKevinWeiss commented Aug 27, 2019

jia200x Aug 28, 2019 • edited Loading

Choose a reason for hiding this comment

jia200x Aug 28, 2019 • edited Loading

Choose a reason for hiding this comment

jcarrano Aug 28, 2019

Choose a reason for hiding this comment

jcarrano Aug 28, 2019

Choose a reason for hiding this comment

jcarrano Aug 28, 2019

Choose a reason for hiding this comment

jia200x Aug 28, 2019

Choose a reason for hiding this comment

MrKevinWeiss commented Aug 28, 2019

jcarrano Aug 28, 2019

Choose a reason for hiding this comment

HendrikVE commented Jan 24, 2020

fjmolinas commented Jun 24, 2020

jia200x Aug 28, 2019 •

edited

Loading

jia200x Aug 28, 2019 •

edited

Loading