Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sys/shell: refactor tokenizer code. #12085

Closed
wants to merge 6 commits into from

Conversation

jcarrano
Copy link
Contributor

Contribution description

The tokenizer (the code that breaks up the line given to the shell into strings to create argv) was quite a messy piece of code. This commit refactors it into a more traditional state-machine based parser.

This fixes the issues with quote handling exposed by the recently introduced test.

Not only is the code clearer now, it is also smaller:

In the SAMR-21, before:

   text	   data	    bss	    dec	    hex	filename
  10292	    140	   2612	  13044	   32f4	tests/shell/bin/samr21-xpro/tests_shell.elf

After 1st refactor commit:

   text    data     bss     dec     hex filename
  10268	    140	   2612	  13020	   32dc	tests/shell/bin/samr21-xpro/tests_shell.elf

After 2nd change:

   text	   data	    bss	    dec	    hex	filename
  10256	    140	   2612	  13008	   32d0	tests/shell/bin/samr21-xpro/tests_shell.elf

Testing procedure

I added test vectors to the automated test. The old code could not handle this string:

> echo "t\e st" "\"" '\'' a\ b
shell: incorrect quoting
> echo "\""
shell: incorrect quoting

The new code handles it correctly.

Issues/PRs references

To be continued....

Check that single and doule quotes work, along with backslash escaping
and that malformed strings are rejected.

Right now the test is failing. Then next commit will replace the tokenizer
with one that works.
The tokenizer (the code that breaks up the line given to the shell into
strings to create argv) was quite a messy piece of code. This commit
refactors it into a more traditional state-machine based parser.

This fixes the issues with quote handling exposed by the recently
introduced test.

Not only is the code clearer now, it is also smaller:

In the SAMR-21, before:

```
   text	   data	    bss	    dec	    hex	filename
  10292	    140	   2612	  13044	   32f4	tests/shell/bin/samr21-xpro/tests_shell.elf
```

After:

```
   text    data     bss     dec     hex filename
  10268	    140	   2612	  13020	   32dc	tests/shell/bin/samr21-xpro/tests_shell.elf
```

Further commits can bring this number down even further, but for now the
focus in on clarity.
Factor out common code for quoted and unquoted tokens. This makes the code
slighly less clear, but it also eliminates repetition (which may improve
clarity).

The binary size is decreased again (for samr21):

Before:

```
   text    data     bss     dec     hex filename
  10268     140    2612   13020    32dc tests/shell/bin/samr21-xpro/tests_shell.elf
```

After:

```
   text	   data	    bss	    dec	    hex	filename
  10256	    140	   2612	  13008	   32d0	tests/shell/bin/samr21-xpro/tests_shell.elf
```
@jcarrano jcarrano added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: tests Area: tests and testing framework Type: cleanup The issue proposes a clean-up / The PR cleans-up parts of the codebase / documentation CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Area: sys Area: System labels Aug 26, 2019
@jcarrano jcarrano requested review from miri64 and jia200x August 26, 2019 15:03
@jcarrano jcarrano requested a review from MrKevinWeiss August 26, 2019 15:19
@jia200x
Copy link
Member

jia200x commented Aug 27, 2019

I will check it today during Hack'n Ack :)

@jcarrano
Copy link
Contributor Author

@cladmi pointed out that I was not building with RIOT_CI_BUILD=1, so my results above are inacurate. Here are the correct results with RIOT_CI_BUILD=1 BOARD=samr21-xpro make all at the first, second, and third commmit:

   text	   data	    bss	    dec	    hex	filename
  10212	    140	   2612	  12964	   32a4	tests/shell/bin/samr21-xpro/tests_shell.elf
   text	   data	    bss	    dec	    hex	filename
  10188	    140	   2612	  12940	   328c	tests/shell/bin/samr21-xpro/tests_shell.elf
   text	   data	    bss	    dec	    hex	filename
  10176	    140	   2612	  12928	   3280	tests/shell/bin/samr21-xpro/tests_shell.elf

The code for traversing arrays of shell commands (used to print help messages
and to search for commmand handlers) was needlessly complex.

The new code is both cleaner and results in smaller binaries (at least for
samr21). A substantial part of the improvement in size is due to the following
change

```
-    printf("%-20s %s\n", "Command", "Description");
-    puts("---------------------------------------");
+    puts("Command              Description\n---------------------------------------");
```

The way the code is written now means that ifdefs in the middle of the code
can be eliminated, further enhancing clarity.

Size before:

```
   text	   data	    bss	    dec	    hex	filename
  10176	    140	   2612	  12928	   3280	tests/shell/bin/samr21-xpro/tests_shell.elf
```

After:

```
   text	   data	    bss	    dec	    hex	filename
  10164	    140	   2612	  12916	   3274	tests/shell/bin/samr21-xpro/tests_shell.elf
```
@jcarrano
Copy link
Contributor Author

Fourth commit:

   text	   data	    bss	    dec	    hex	filename
  10164	    140	   2612	  12916	   3274	tests/shell/bin/samr21-xpro/tests_shell.elf

@MrKevinWeiss
Copy link
Contributor

Tested with a nucleo board. Also just dicked around with it a bit and could not find too much problems.

@jcarrano jcarrano added the CI: needs squashing Commits in this PR need to be squashed; If set, CI systems will mark this PR as unmergable label Aug 27, 2019
I would have expected that the access to _shell_command_list would be elided
when the branch was inaccessible, but it did not happen.
char *writepos = readpos;
enum PARSE_STATE pstate = PARSE_SPACE;

while (*readpos != '\0') {
Copy link
Member

@jia200x jia200x Aug 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this part too hard to read. Maybe something like?

Suggested change
while (*readpos != '\0') {
while (*readpos != '\0') {
char wordbreak;
switch (pstate) {
case PARSE_SPACE:
if (*readpos != BLANK) {
argc++;
}
if (*readpos == SQUOTE) {
pstate = PARSE_SINGLEQUOTE;
} else if (*readpos == DQUOTE) {
pstate = PARSE_DOUBLEQUOTE;
} else if (*readpos == ESCAPECHAR) {
pstate = PARSE_UNQUOTED_ESC;
} else if (*readpos != BLANK) {
pstate = PARSE_UNQUOTED;
goto write;
}
readpos++;
continue;
case PARSE_UNQUOTED:
wordbreak = BLANK;
goto break_word;
case PARSE_SINGLEQUOTE:
wordbreak = SQUOTE;
goto break_word;
case PARSE_DOUBLEQUOTE:
wordbreak = DQUOTE;
goto break_word;
default: /* QUOTED state */
pstate = escape_toggle(pstate);
goto write;
}
break_word:
if (*readpos == wordbreak) {
pstate = PARSE_SPACE;
*writepos++ = '\0';
goto parse_end;
} else if (*readpos == ESCAPECHAR) {
pstate = escape_toggle(pstate);
goto parse_end;
}
write:
*writepos++ = *readpos;
parse_end:
readpos++;
}

It's a similar goto logic but the labels are not inside the switch.
Also, it seems to reduce the size (native) from:

Building application "default" for "native" with MCU "native".
   text    data     bss     dec     hex filename
  90573    1104   72088  163765   27fb5 /home/jialamos/Development/RIOT/examples/default/bin/native/default.elf

to

Building application "default" for "native" with MCU "native".

   text    data     bss     dec     hex filename
  90557    1104   72088  163749   27fa5 /home/jialamos/Development/RIOT/examples/default/bin/native/default.elf

On samr21-xpro, the build size keeps the same. No idea about others too.

Copy link
Member

@jia200x jia200x Aug 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a non-goto version:

Suggested change
while (*readpos != '\0') {
for(readpos = line; *readpos != '\0'; readpos++) {
char wordbreak;
switch (pstate) {
case PARSE_SPACE:
if (*readpos != BLANK) {
argc++;
}
if (*readpos == SQUOTE) {
pstate = PARSE_SINGLEQUOTE;
} else if (*readpos == DQUOTE) {
pstate = PARSE_DOUBLEQUOTE;
} else if (*readpos == ESCAPECHAR) {
pstate = PARSE_UNQUOTED_ESC;
} else if (*readpos != BLANK) {
pstate = PARSE_UNQUOTED;
*writepos++ = *readpos;
}
continue;
case PARSE_UNQUOTED:
wordbreak = BLANK;
break;
case PARSE_SINGLEQUOTE:
wordbreak = SQUOTE;
break;
case PARSE_DOUBLEQUOTE:
wordbreak = DQUOTE;
break;
default: /* QUOTED state */
pstate = escape_toggle(pstate);
*writepos++ = *readpos;
continue;
}
if (*readpos == wordbreak) {
pstate = PARSE_SPACE;
*writepos++ = '\0';
} else if (*readpos == ESCAPECHAR) {
pstate = escape_toggle(pstate);
}
}

It's 4 bytes bigger on samr21-xpro

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a similar modification (I'm about to push the fixup). The size does not change in samr21 because the compiler is smart enough to move the code around, so the generated machine code is the same in both cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm, hold on, I don't think it improves so much in terms of clarity:

    while (*readpos != '\0') {
        char wordbreak;

        switch (pstate) {
            case PARSE_SPACE:
                if (*readpos != BLANK) {
                    argc++;
                }
                if (*readpos == SQUOTE) {
                    pstate = PARSE_SINGLEQUOTE;
                } else if (*readpos == DQUOTE) {
                    pstate = PARSE_DOUBLEQUOTE;
                } else if (*readpos == ESCAPECHAR) {
                    pstate = PARSE_UNQUOTED_ESC;
                } else if (*readpos != BLANK) {
                    pstate = PARSE_UNQUOTED;
                    goto write_word;
                }
                goto parse_end;
            case PARSE_UNQUOTED:
                wordbreak = BLANK;
                break;
            case PARSE_SINGLEQUOTE:
                wordbreak = SQUOTE;
                break;
            case PARSE_DOUBLEQUOTE:
                wordbreak = DQUOTE;
                break;
            default: /* QUOTED state */
                pstate = escape_toggle(pstate);
                goto write_word;
        }

        if (*readpos == wordbreak) {
            pstate = PARSE_SPACE;
            *writepos++ = '\0';
        } else if (*readpos == ESCAPECHAR) {
            pstate = escape_toggle(pstate);
        } else {
write_word:
            *writepos++ = *readpos;
        }
parse_end:
        readpos++;
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could draw the state diagram as a comment before the function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the state diagram would also help

@MrKevinWeiss
Copy link
Contributor

Generally I think that it would be better no to use the goto like that, even if it saves a few bytes. I also don't know if we need to be that concerned about saving a few bytes on shell... Of course if it is free lets do it but if it sacrifices some readability then maybe just go with the old version.

@jia200x jia200x added the Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines label Aug 28, 2019

┌───[\]────┐ ┌─────["]────┐ ┌───[']─────┐ ┌───[\]────┐
↓ │ ↓ │ │ ↓ │ ↓
┏━━━━━━━━━━┓ ┏━┷━━━━━┓ ┏━┷━━━┷━┓ ┏━━━━┷━━┓ ┏━━━━━━━━━━┓
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This displays badly because github's font is defective. In your editor or terminal it should display fine.

@HendrikVE
Copy link
Contributor

Can be closed due to #13197

@fjmolinas
Copy link
Contributor

Closing since #13197 is in.

@fjmolinas fjmolinas closed this Jun 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: sys Area: System Area: tests Area: tests and testing framework CI: needs squashing Commits in this PR need to be squashed; If set, CI systems will mark this PR as unmergable CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Type: cleanup The issue proposes a clean-up / The PR cleans-up parts of the codebase / documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants