-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sys/shell: refactor tokenizer code. #12085
Conversation
Check that single and doule quotes work, along with backslash escaping and that malformed strings are rejected. Right now the test is failing. Then next commit will replace the tokenizer with one that works.
The tokenizer (the code that breaks up the line given to the shell into strings to create argv) was quite a messy piece of code. This commit refactors it into a more traditional state-machine based parser. This fixes the issues with quote handling exposed by the recently introduced test. Not only is the code clearer now, it is also smaller: In the SAMR-21, before: ``` text data bss dec hex filename 10292 140 2612 13044 32f4 tests/shell/bin/samr21-xpro/tests_shell.elf ``` After: ``` text data bss dec hex filename 10268 140 2612 13020 32dc tests/shell/bin/samr21-xpro/tests_shell.elf ``` Further commits can bring this number down even further, but for now the focus in on clarity.
Factor out common code for quoted and unquoted tokens. This makes the code slighly less clear, but it also eliminates repetition (which may improve clarity). The binary size is decreased again (for samr21): Before: ``` text data bss dec hex filename 10268 140 2612 13020 32dc tests/shell/bin/samr21-xpro/tests_shell.elf ``` After: ``` text data bss dec hex filename 10256 140 2612 13008 32d0 tests/shell/bin/samr21-xpro/tests_shell.elf ```
I will check it today during Hack'n Ack :) |
@cladmi pointed out that I was not building with
|
The code for traversing arrays of shell commands (used to print help messages and to search for commmand handlers) was needlessly complex. The new code is both cleaner and results in smaller binaries (at least for samr21). A substantial part of the improvement in size is due to the following change ``` - printf("%-20s %s\n", "Command", "Description"); - puts("---------------------------------------"); + puts("Command Description\n---------------------------------------"); ``` The way the code is written now means that ifdefs in the middle of the code can be eliminated, further enhancing clarity. Size before: ``` text data bss dec hex filename 10176 140 2612 12928 3280 tests/shell/bin/samr21-xpro/tests_shell.elf ``` After: ``` text data bss dec hex filename 10164 140 2612 12916 3274 tests/shell/bin/samr21-xpro/tests_shell.elf ```
Fourth commit:
|
Tested with a nucleo board. Also just dicked around with it a bit and could not find too much problems. |
I would have expected that the access to _shell_command_list would be elided when the branch was inaccessible, but it did not happen.
char *writepos = readpos; | ||
enum PARSE_STATE pstate = PARSE_SPACE; | ||
|
||
while (*readpos != '\0') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this part too hard to read. Maybe something like?
while (*readpos != '\0') { | |
while (*readpos != '\0') { | |
char wordbreak; | |
switch (pstate) { | |
case PARSE_SPACE: | |
if (*readpos != BLANK) { | |
argc++; | |
} | |
if (*readpos == SQUOTE) { | |
pstate = PARSE_SINGLEQUOTE; | |
} else if (*readpos == DQUOTE) { | |
pstate = PARSE_DOUBLEQUOTE; | |
} else if (*readpos == ESCAPECHAR) { | |
pstate = PARSE_UNQUOTED_ESC; | |
} else if (*readpos != BLANK) { | |
pstate = PARSE_UNQUOTED; | |
goto write; | |
} | |
readpos++; | |
continue; | |
case PARSE_UNQUOTED: | |
wordbreak = BLANK; | |
goto break_word; | |
case PARSE_SINGLEQUOTE: | |
wordbreak = SQUOTE; | |
goto break_word; | |
case PARSE_DOUBLEQUOTE: | |
wordbreak = DQUOTE; | |
goto break_word; | |
default: /* QUOTED state */ | |
pstate = escape_toggle(pstate); | |
goto write; | |
} | |
break_word: | |
if (*readpos == wordbreak) { | |
pstate = PARSE_SPACE; | |
*writepos++ = '\0'; | |
goto parse_end; | |
} else if (*readpos == ESCAPECHAR) { | |
pstate = escape_toggle(pstate); | |
goto parse_end; | |
} | |
write: | |
*writepos++ = *readpos; | |
parse_end: | |
readpos++; | |
} |
It's a similar goto
logic but the labels are not inside the switch.
Also, it seems to reduce the size (native) from:
Building application "default" for "native" with MCU "native".
text data bss dec hex filename
90573 1104 72088 163765 27fb5 /home/jialamos/Development/RIOT/examples/default/bin/native/default.elf
to
Building application "default" for "native" with MCU "native".
text data bss dec hex filename
90557 1104 72088 163749 27fa5 /home/jialamos/Development/RIOT/examples/default/bin/native/default.elf
On samr21-xpro, the build size keeps the same. No idea about others too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a non-goto version:
while (*readpos != '\0') { | |
for(readpos = line; *readpos != '\0'; readpos++) { | |
char wordbreak; | |
switch (pstate) { | |
case PARSE_SPACE: | |
if (*readpos != BLANK) { | |
argc++; | |
} | |
if (*readpos == SQUOTE) { | |
pstate = PARSE_SINGLEQUOTE; | |
} else if (*readpos == DQUOTE) { | |
pstate = PARSE_DOUBLEQUOTE; | |
} else if (*readpos == ESCAPECHAR) { | |
pstate = PARSE_UNQUOTED_ESC; | |
} else if (*readpos != BLANK) { | |
pstate = PARSE_UNQUOTED; | |
*writepos++ = *readpos; | |
} | |
continue; | |
case PARSE_UNQUOTED: | |
wordbreak = BLANK; | |
break; | |
case PARSE_SINGLEQUOTE: | |
wordbreak = SQUOTE; | |
break; | |
case PARSE_DOUBLEQUOTE: | |
wordbreak = DQUOTE; | |
break; | |
default: /* QUOTED state */ | |
pstate = escape_toggle(pstate); | |
*writepos++ = *readpos; | |
continue; | |
} | |
if (*readpos == wordbreak) { | |
pstate = PARSE_SPACE; | |
*writepos++ = '\0'; | |
} else if (*readpos == ESCAPECHAR) { | |
pstate = escape_toggle(pstate); | |
} | |
} |
It's 4 bytes bigger on samr21-xpro
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a similar modification (I'm about to push the fixup). The size does not change in samr21 because the compiler is smart enough to move the code around, so the generated machine code is the same in both cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm, hold on, I don't think it improves so much in terms of clarity:
while (*readpos != '\0') {
char wordbreak;
switch (pstate) {
case PARSE_SPACE:
if (*readpos != BLANK) {
argc++;
}
if (*readpos == SQUOTE) {
pstate = PARSE_SINGLEQUOTE;
} else if (*readpos == DQUOTE) {
pstate = PARSE_DOUBLEQUOTE;
} else if (*readpos == ESCAPECHAR) {
pstate = PARSE_UNQUOTED_ESC;
} else if (*readpos != BLANK) {
pstate = PARSE_UNQUOTED;
goto write_word;
}
goto parse_end;
case PARSE_UNQUOTED:
wordbreak = BLANK;
break;
case PARSE_SINGLEQUOTE:
wordbreak = SQUOTE;
break;
case PARSE_DOUBLEQUOTE:
wordbreak = DQUOTE;
break;
default: /* QUOTED state */
pstate = escape_toggle(pstate);
goto write_word;
}
if (*readpos == wordbreak) {
pstate = PARSE_SPACE;
*writepos++ = '\0';
} else if (*readpos == ESCAPECHAR) {
pstate = escape_toggle(pstate);
} else {
write_word:
*writepos++ = *readpos;
}
parse_end:
readpos++;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could draw the state diagram as a comment before the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the state diagram would also help
Generally I think that it would be better no to use the goto like that, even if it saves a few bytes. I also don't know if we need to be that concerned about saving a few bytes on shell... Of course if it is free lets do it but if it sacrifices some readability then maybe just go with the old version. |
|
||
┌───[\]────┐ ┌─────["]────┐ ┌───[']─────┐ ┌───[\]────┐ | ||
↓ │ ↓ │ │ ↓ │ ↓ | ||
┏━━━━━━━━━━┓ ┏━┷━━━━━┓ ┏━┷━━━┷━┓ ┏━━━━┷━━┓ ┏━━━━━━━━━━┓ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This displays badly because github's font is defective. In your editor or terminal it should display fine.
Can be closed due to #13197 |
Closing since #13197 is in. |
Contribution description
The tokenizer (the code that breaks up the line given to the shell into strings to create argv) was quite a messy piece of code. This commit refactors it into a more traditional state-machine based parser.
This fixes the issues with quote handling exposed by the recently introduced test.
Not only is the code clearer now, it is also smaller:
In the SAMR-21, before:
After 1st refactor commit:
After 2nd change:
Testing procedure
I added test vectors to the automated test. The old code could not handle this string:
The new code handles it correctly.
Issues/PRs references
To be continued....