[core] Fix delay_ns cycle calculation #642

Merged
merged 7 commits into from
Jun 16, 2021

Conversation

salkinium
Member

@salkinium salkinium commented Jun 12, 2021

There's a massive error in the algorithm since we multiply the nanoseconds with modm::platform::delay_ns_per_loop first for some reason. I'm honestly shocked at just how wrong this is.

  • Fix loop count for all STM32
  • Fix overhead computation for <1000ns
  • Fix inlining of delay_ns function
  • Some automatic or semiautomatic testing of cycles via DWT->CYCCNT or SysTick->VAL
  • Tested in hardware
  • Enable DWT for Cortex-M7 (missing unlock key)
  • Extend CYCCNT use to 32-bit instead of just 31-bit.
  • Better accuracy for modm::delay_us by using binary scaling for modm::platform::delay_fcpu_MHz
  • Smaller linkerscripts with fewer copy table entries by reusing sections
  • Simplify .fastcode section placement to always be in RAM or instruction cache
  • Add Cortex-M0 unit test into CI to check delay on systems without DWT->CYCCNT.

Fixes #641.
cc @XDjackieXD
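
To make the order of operations concrete, here is a minimal host-side sketch, not the modm implementation: the nanoseconds are converted to CPU cycles first, and only then divided by the cycles that one loop iteration takes. The function names and the 64-bit intermediate are assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: convert a delay in nanoseconds to CPU cycles.
// The 64-bit intermediate avoids overflow of ns * fcpu_hz.
constexpr uint32_t ns_to_cycles(uint32_t ns, uint32_t fcpu_hz)
{
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(ns) * fcpu_hz) / 1000000000ull);
}

// Hypothetical helper: loop iterations that fit into a cycle budget.
// The cycle count is divided by the cycles per loop, never multiplied.
constexpr uint32_t cycles_to_loops(uint32_t cycles, uint32_t cycles_per_loop)
{
    return cycles / cycles_per_loop;
}
```

At 64 MHz, `ns_to_cycles(1000, 64000000)` yields 64 cycles, and a 3-cycle loop then needs 21 iterations.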

@salkinium
Member Author

salkinium commented Jun 12, 2021

Ah, here's what introduced the bug: #540.

@salkinium salkinium force-pushed the fix/delay_ns branch 2 times, most recently from 3025794 to 72087a5 Compare June 13, 2021 00:16
@salkinium
Member Author

salkinium commented Jun 13, 2021

I wrote a measurement example to see the difference between

  • non-inlined 3 cycle loop in CCM RAM (jumps through a veneer)
  • non-inlined 6 cycle loop in Flash (jumps directly)
  • inlined 6 cycle loop in Flash (no jump, depends on surrounding code)

delay_ns in RAM, 3 cycles per loop, overhead 10 loops

    expected     |    measured
     ns | cycles | cycles |      ns
      1 |      0 |     30 |     468      x
      5 |      0 |     30 |     468      x
     10 |      0 |     30 |     468      x
     50 |      3 |     30 |     468      x
    100 |      6 |     30 |     468      x
    200 |     12 |     30 |     468      x
    400 |     25 |     30 |     468      x
    500 |     32 |     30 |     468
    550 |     35 |     33 |     515
    600 |     38 |     36 |     562
    650 |     41 |     39 |     609
    700 |     44 |     42 |     656
    750 |     48 |     45 |     703
    800 |     51 |     51 |     796
    850 |     54 |     54 |     843
    900 |     57 |     57 |     890
    950 |     60 |     60 |     937
   1000 |     64 |     63 |     984
   1500 |     96 |     93 |    1453
   2000 |    128 |    126 |    1968
  10000 |    640 |    636 |    9937
 100000 |   6400 |   6381 |   99703
1000000 |  64000 |  63828 |  997312


delay_ns non-inlined in Flash, 6 cycles per loop, overhead 3 loops

    expected     |    measured
     ns | cycles | cycles |      ns
      1 |      0 |     22 |     343      x
      5 |      0 |     22 |     343      x
     10 |      0 |     22 |     343      x
     50 |      3 |     22 |     343      x
    100 |      6 |     22 |     343      x
    200 |     12 |     22 |     343      x
    250 |     16 |     22 |     343      x
    300 |     19 |     22 |     343      x
    350 |     22 |     22 |     343
    400 |     25 |     25 |     390
    450 |     28 |     25 |     390
    500 |     32 |     31 |     484
    550 |     35 |     31 |     484
    600 |     38 |     37 |     578
    650 |     41 |     37 |     578
    700 |     44 |     43 |     671
    750 |     48 |     49 |     765
    800 |     51 |     49 |     765
    850 |     54 |     55 |     859
    900 |     57 |     55 |     859
    950 |     60 |     61 |     953
   1000 |     64 |     61 |     953
   1500 |     96 |     97 |    1515
   2000 |    128 |    127 |    1984
  10000 |    640 |    643 |   10046
 100000 |   6400 |   6451 |  100796
1000000 |  64000 |  64513 | 1008015


delay_ns inlined in Flash, 6 cycles per loop, overhead 1 loop

    expected     |    measured
     ns | cycles | cycles |      ns
      1 |      0 |     11 |     171      x
      5 |      0 |     11 |     171      x
     10 |      0 |     11 |     171      x
     50 |      3 |     11 |     171      x
    100 |      6 |     11 |     171      x
    200 |     12 |     14 |     218      x
    250 |     16 |     14 |     218
    300 |     19 |     19 |     296
    350 |     22 |     19 |     296
    400 |     25 |     25 |     390
    450 |     28 |     25 |     390
    500 |     32 |     31 |     484
    550 |     35 |     31 |     484
    600 |     38 |     37 |     578
    650 |     41 |     37 |     578
    700 |     44 |     43 |     671
    750 |     48 |     49 |     765
    800 |     51 |     49 |     765
    850 |     54 |     55 |     859
    900 |     57 |     55 |     859
    950 |     60 |     61 |     953
   1000 |     64 |     61 |     953
   1500 |     96 |     97 |    1515
   2000 |    128 |    127 |    1984
  10000 |    640 |    643 |   10046
 100000 |   6400 |   6451 |  100796
1000000 |  64000 |  64513 | 1008015
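
For reference, a measurement like the ones in these tables can be sketched as follows. This is a host-testable outline under stated assumptions, not the example added in this PR: `read_counter` stands in for a read of a free-running 32-bit up-counter such as DWT->CYCCNT, and all function names are hypothetical.

```cpp
#include <cstdint>

// Elapsed cycles between two reads of a free-running 32-bit up-counter.
// Unsigned subtraction stays correct across one wrap-around, so the
// full 32-bit range of the counter is usable.
constexpr uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

// Convert a cycle count to nanoseconds, truncating the fraction.
constexpr uint32_t cycles_to_ns(uint32_t cycles, uint32_t fcpu_hz)
{
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(cycles) * 1000000000ull) / fcpu_hz);
}

// Hypothetical wrapper: read_counter() returns the current counter value,
// delay() is the code under test.
template<typename ReadCounter, typename Delay>
uint32_t measure_cycles(ReadCounter read_counter, Delay delay)
{
    const uint32_t start = read_counter();
    delay();
    const uint32_t end = read_counter();
    return elapsed_cycles(start, end);
}
```

With a 64 MHz clock, `cycles_to_ns(30, 64000000)` gives 468, which is how the 30-cycle rows map to 468 ns.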

@salkinium
Member Author

Hm, another problem is that jumping into CCM RAM is done through a veneer, which incurs 5-7 cycles for loading and pipeline stall. This is a little disappointing of GNU ld tbh, but I can see the issue with the limited jumping range of the blx instruction.

080029b0 <___ZN4modm8delay_usEm_veneer>:
___ZN4modm8delay_usEm_veneer():
 80029b0:	f85f f000 	ldr.w	pc, [pc]	; 80029b4 <___ZN4modm8delay_usEm_veneer+0x4>
 80029b4:	10000189 	.word	0x10000189

080029b8 <___ZN4modm8delay_nsEm_veneer>:
___ZN4modm8delay_nsEm_veneer():
 80029b8:	f85f f000 	ldr.w	pc, [pc]	; 80029bc <___ZN4modm8delay_nsEm_veneer+0x4>
 80029bc:	100001a9 	.word	0x100001a9

@salkinium
Member Author

salkinium commented Jun 13, 2021

Well, the inlining depends heavily on the surrounding code and optimization opportunities and may yield something worse even:

delay_ns non-inlined in Flash, 6 cycles per loop, overhead 3 loops

    expected     |    measured
     ns | cycles | cycles |      ns
      1 |      0 |     22 |     343      x
      5 |      0 |     21 |     328      x
     10 |      0 |     21 |     328      x
     50 |      3 |     21 |     328      x
    100 |      6 |     21 |     328      x
    200 |     12 |     21 |     328      x
    250 |     16 |     21 |     328      x
    300 |     19 |     21 |     328      x
    350 |     22 |     21 |     328
    400 |     25 |     24 |     375
    450 |     28 |     24 |     375
    500 |     32 |     30 |     468
    550 |     35 |     30 |     468
    600 |     38 |     36 |     562
    650 |     41 |     36 |     562
    700 |     44 |     42 |     656
    750 |     48 |     48 |     750
    800 |     51 |     48 |     750
    850 |     54 |     54 |     843
    900 |     57 |     54 |     843
    950 |     60 |     60 |     937
   1000 |     64 |     60 |     937
   1500 |     96 |     97 |    1515
   2000 |    128 |    126 |    1968
  10000 |    640 |    643 |   10046
 100000 |   6400 |   6450 |  100781
1000000 |  64000 |  64512 | 1008000


delay_ns inlined in Flash, 6 cycles per loop, overhead 1 loop

    expected     |    measured
     ns | cycles | cycles |      ns
      1 |      0 |     25 |     390      x
      5 |      0 |     25 |     390      x
     10 |      0 |     25 |     390      x
     50 |      3 |     26 |     406      x
    100 |      6 |     31 |     484      x
    200 |     12 |     26 |     406      x
    250 |     16 |     26 |     406      x
    300 |     19 |     26 |     406      x
    350 |     22 |     26 |     406      x
    400 |     25 |     29 |     453      x
    450 |     28 |     29 |     453
    500 |     32 |     34 |     531
    550 |     35 |     34 |     531
    600 |     38 |     40 |     625
    650 |     41 |     40 |     625
    700 |     44 |     46 |     718
    750 |     48 |     52 |     812
    800 |     51 |     52 |     812
    850 |     54 |     58 |     906
    900 |     57 |     58 |     906
    950 |     60 |     64 |    1000
   1000 |     64 |     64 |    1000
   1500 |     96 |     99 |    1546
   2000 |    128 |    130 |    2031
  10000 |    640 |    645 |   10078
 100000 |   6400 |   6457 |  100890
1000000 |  64000 |  64515 | 1008046

@salkinium salkinium force-pushed the fix/delay_ns branch 3 times, most recently from 2f7fe02 to e2234e1 Compare June 15, 2021 02:31
@salkinium
Member Author

salkinium commented Jun 15, 2021

Ok, so here are my findings:

  1. The time the loop takes depends on the wait states and cache configuration of Flash, which would require updating the delay_ns_per_loop variable depending on flash latency. In my experimentation this varies quite a lot per STM32 family. It would be unmaintainable to measure and store this (rather useless) information.
  2. Calling a function in SRAM jumps through a veneer, which is located in Flash, requiring one additional load and branch. Flash is very slow when a jump is too large to fit into cache!
  3. Execution from SRAM without cache is very fast, ≤2 cycles per 32-bit instruction fetch. However there may be contention with fetching data on the same bus interface. (not an issue during the loop)

I've therefore decided on the following:

  1. I've simplified the placement of the .fastcode section to always be in SRAM, CCM or instruction cache. This mirrors the .fastdata section, which is always in SRAM, CCM or data cache. (The .ramcode section is aliased to .fastcode now).
  2. The modm::platform::delay_ns(uint32_t) function is placed in the .fastcode, therefore always residing in RAM.
  3. The modm::platform::delay_ns(uint32_t) function is naked and implemented as carefully written inline assembly. The loop takes either 4 cycles (placed in SRAM with S-Bus access), 3 cycles (placed in SRAM or CCM with I-Bus access) or 1 cycle (placed in ITCM on Cortex-M7). This is as fast as Flash at boot config (0 wait states), but significantly faster than Flash at high wait states.
  4. The modm::delay_ns(uint32_t) function is a modm_always_inline inline assembly that blx's to the modm::platform::delay_ns function. This eliminates the use of the veneer, thus jumping as quickly as possible into RAM.

The result is a minimum cycle count of <20 for any Flash config, at any speed, regardless of calling environment (i.e. it does not depend on inlining quality). I've tested this in hardware for most STM32 (I have no L0, F2, G4 hardware, but made educated guesses) and SAMD as well.
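
To illustrate why every request below a certain threshold takes the same time, here is a rough cost model of a subtract-and-branch delay loop. It is an assumption-laden sketch, not the assembly from this PR: the struct, the function name and the split into a fixed cost plus whole loop iterations are hypothetical, and the model is not meant to reproduce the measured tables exactly.

```cpp
#include <cstdint>

// Hypothetical cost model of a subtract-and-branch delay loop.
struct DelayModel
{
    uint32_t fixed_cycles;     // call, setup and return cost (assumed value)
    uint32_t cycles_per_loop;  // e.g. 3 or 4, depending on the bus
};

// Cycles the model spends for a requested cycle budget.
inline uint32_t modeled_cycles(DelayModel m, uint32_t requested)
{
    // Budget left after the fixed overhead, clamped at zero.
    const uint32_t remaining =
        (requested > m.fixed_cycles) ? requested - m.fixed_cycles : 0;
    // The loop body runs at least once, like a do-while.
    const uint32_t iterations = remaining / m.cycles_per_loop + 1;
    return m.fixed_cycles + iterations * m.cycles_per_loop;
}
```

With made-up values of 12 fixed cycles and 3 cycles per loop, any request up to 12 cycles costs 15 cycles in this model, and longer requests grow in steps of 3.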

Here are the delay_ns measurements for STM32F334K8, with a minimum cycle count of 15 or ~250ns.
modm::delay_ns for system clock = 64000000
    expected       |      measured
      ns |  cycles |  cycles |       ns
       1 |       0 |      15 |      234      >
       5 |       0 |      15 |      234      >
      10 |       0 |      15 |      234      >
      50 |       3 |      15 |      234      >
     100 |       6 |      15 |      234      >
     150 |       9 |      15 |      234      >
     200 |      12 |      15 |      234      >
     250 |      16 |      18 |      281
     300 |      19 |      21 |      328
     350 |      22 |      24 |      375
     400 |      25 |      27 |      421
     450 |      28 |      30 |      468
     500 |      32 |      33 |      515
     550 |      35 |      36 |      562
     600 |      38 |      39 |      609
     650 |      41 |      42 |      656
     700 |      44 |      45 |      703
     750 |      48 |      48 |      750
     800 |      51 |      54 |      843
     850 |      54 |      57 |      890
     900 |      57 |      60 |      937
     950 |      60 |      63 |      984
    1000 |      64 |      66 |     1031
    1100 |      70 |      72 |     1125
    1200 |      76 |      78 |     1218
    1300 |      83 |      84 |     1312
    1400 |      89 |      90 |     1406
    1500 |      96 |      96 |     1500
    1600 |     102 |     105 |     1640
    1700 |     108 |     111 |     1734
    1800 |     115 |     117 |     1828
    1900 |     121 |     123 |     1921
    2000 |     128 |     129 |     2015
    3000 |     192 |     192 |     3000
    4000 |     256 |     258 |     4031
    5000 |     320 |     321 |     5015
    6000 |     384 |     384 |     6000
    7000 |     448 |     447 |     6984
    8000 |     512 |     513 |     8015
    9000 |     576 |     576 |     9000
   10000 |     640 |     639 |     9984
  100000 |    6400 |    6384 |    99750
 1000000 |   64000 |   63831 |   997359
10000000 |  640000 |  638298 |  9973406

I am very, very happy with this solution. I added an example for testing a bunch of settings.
What do you think? @XDjackieXD

(cc @rleh Do you have L0, F2 or G4 hardware and could you test the example on that?)

@salkinium salkinium force-pushed the fix/delay_ns branch 3 times, most recently from e06083f to 0ca2d1b Compare June 15, 2021 04:21
@salkinium salkinium added advanced 🤯 ci:hal Triggers the exhaustive HAL compile CI jobs labels Jun 15, 2021
@chris-durand
Member

Very nice, I have tested L0 and G4. I don't have F2 hardware.

The boot clock for L0 is wrong, it boots on 2.097 MHz MSI, not HSI. Could you cherry-pick the fix?

Results for G4:
modm::delay_ns for system clock = 16000000
    expected       |      measured
      ns |  cycles |  cycles |       ns
     100 |       1 |      17 |     1062      >
    1000 |      16 |      20 |     1250      >
   10000 |     160 |     164 |    10250       
  100000 |    1600 |    1598 |    99875       
 1000000 |   16000 |   15962 |   997625       
10000000 |  160000 |  159578 |  9973625       

modm::delay_us for boot clock = 16000000
     expected       |       measured
      us |  cycles  |  cycles  |       us
       1 |       16 |       28 |        1      >
       5 |       80 |       94 |        5       
      10 |      160 |      172 |       10       
      20 |      320 |      334 |       20       
      30 |      480 |      496 |       31       
      40 |      640 |      652 |       40       
      50 |      800 |      814 |       50       
      60 |      960 |      976 |       61       
      70 |     1120 |     1132 |       70       
      80 |     1280 |     1294 |       80       
      90 |     1440 |     1456 |       91       
    1000 |    16000 |    16012 |     1000

modm::delay_ns for system clock = 170000000
    expected       |      measured
      ns |  cycles |  cycles |       ns
       1 |       0 |      21 |      123      >
       5 |       0 |      21 |      123      >
      10 |       1 |      21 |      123      >
      50 |       8 |      21 |      123      >
     100 |      17 |      24 |      141      >
     150 |      25 |      33 |      194      >
     200 |      34 |      42 |      247      >
     250 |      42 |      48 |      282       
     300 |      51 |      57 |      335       
     350 |      59 |      66 |      388       
     400 |      68 |      75 |      441       
     450 |      76 |      84 |      494       
     500 |      85 |      90 |      529       
     550 |      93 |      99 |      582       
     600 |     102 |     108 |      635       
     650 |     110 |     117 |      688       
     700 |     119 |     123 |      723       
     750 |     127 |     132 |      776       
     800 |     136 |     141 |      829       
     850 |     144 |     150 |      882       
     900 |     153 |     159 |      935       
     950 |     161 |     165 |      970       
    1000 |     170 |     174 |     1023       
    1100 |     187 |     192 |     1129       
    1200 |     204 |     207 |     1217       
    1300 |     221 |     225 |     1323       
    1400 |     238 |     240 |     1411       
    1500 |     255 |     258 |     1517       
    1600 |     272 |     273 |     1605       
    1700 |     289 |     291 |     1711       
    1800 |     306 |     309 |     1817       
    1900 |     323 |     324 |     1905       
    2000 |     340 |     342 |     2011       
    3000 |     510 |     507 |     2982       
    4000 |     680 |     675 |     3970       
    5000 |     850 |     840 |     4941       
    6000 |    1020 |    1008 |     5929       
    7000 |    1190 |    1173 |     6900       
    8000 |    1360 |    1341 |     7888       
    9000 |    1530 |    1509 |     8876       
   10000 |    1700 |    1674 |     9847       
  100000 |   17000 |   16674 |    98082       
 1000000 |  170000 |  166674 |   980435       
10000000 | 1700000 | 1666674 |  9803964       

modm::delay_us for boot clock = 170000000
     expected       |       measured
      us |  cycles  |  cycles  |       us
       1 |      170 |      210 |        1      >
       5 |      850 |      894 |        5       
      10 |     1700 |     1740 |       10       
     100 |    17000 |    17040 |      100       
    1000 |   170000 |   170040 |     1000       
   10000 |  1700000 |  1700040 |    10000       
  100000 | 17000000 | 17000040 |   100000

Results for L0:
modm::delay_ns for system clock = 2097000
    expected       |      measured
      ns |  cycles |  cycles |       ns
     100 |       0 |      16 |     7629      >
    1000 |       2 |      16 |     7629      >
   10000 |      20 |      22 |    10491       
  100000 |     209 |     211 |   100619       
 1000000 |    2097 |    2098 |  1000476       
10000000 |   20970 |   20968 |  9999046       

modm::delay_us for boot clock = 2097000
     expected       |       measured
      us |  cycles  |  cycles  |       us
       1 |        2 |       27 |       12      >
       5 |       10 |       36 |       17      >
      10 |       20 |       48 |       22      >
      20 |       41 |       69 |       32      >
      30 |       62 |       90 |       42      >
      40 |       83 |      111 |       52      >
      50 |      104 |      132 |       62      >
      60 |      125 |      153 |       72      >
      70 |      146 |      174 |       82       
      80 |      167 |      195 |       92       
      90 |      188 |      216 |      103       
    1000 |     2097 |     2151 |     1025       

modm::delay_ns for system clock = 32000000
    expected       |      measured
      ns |  cycles |  cycles |       ns
       1 |       0 |      17 |      531      >
       5 |       0 |      17 |      531      >
      10 |       0 |      17 |      531      >
      50 |       1 |      17 |      531      >
     100 |       3 |      17 |      531      >
     150 |       4 |      17 |      531      >
     200 |       6 |      17 |      531      >
     250 |       8 |      17 |      531      >
     300 |       9 |      17 |      531      >
     350 |      11 |      17 |      531      >
     400 |      12 |      17 |      531      >
     450 |      14 |      17 |      531      >
     500 |      16 |      20 |      625      >
     550 |      17 |      20 |      625       
     600 |      19 |      23 |      718      >
     650 |      20 |      23 |      718       
     700 |      22 |      26 |      812       
     750 |      24 |      26 |      812       
     800 |      25 |      29 |      906       
     850 |      27 |      32 |     1000       
     900 |      28 |      32 |     1000       
     950 |      30 |      35 |     1093       
    1000 |      32 |      35 |     1093       
    1100 |      35 |      38 |     1187       
    1200 |      38 |      41 |     1281       
    1300 |      41 |      44 |     1375       
    1400 |      44 |      47 |     1468       
    1500 |      48 |      50 |     1562       
    1600 |      51 |      56 |     1750       
    1700 |      54 |      59 |     1843       
    1800 |      57 |      62 |     1937       
    1900 |      60 |      65 |     2031       
    2000 |      64 |      68 |     2125       
    3000 |      96 |      98 |     3062       
    4000 |     128 |     131 |     4093       
    5000 |     160 |     164 |     5125       
    6000 |     192 |     194 |     6062       
    7000 |     224 |     227 |     7093       
    8000 |     256 |     260 |     8125       
    9000 |     288 |     290 |     9062       
   10000 |     320 |     323 |    10093       
  100000 |    3200 |    3194 |    99812       
 1000000 |   32000 |   31919 |   997468       
10000000 |  320000 |  319151 |  9973468       

modm::delay_us for boot clock = 32000000
     expected       |       measured
      us |  cycles  |  cycles  |       us
       1 |       32 |       60 |        1      >
       5 |      160 |      189 |        5       
      10 |      320 |      348 |       10       
     100 |     3200 |     3228 |      100       
    1000 |    32000 |    32028 |     1000       
   10000 |   320000 |   320028 |    10000       
  100000 |  3200000 |  3200028 |   100000       
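
The 2.097 MHz L0 results above show why an integer MHz value is too coarse for delay_us. The following sketch compares it with a binary-scaled (fixed-point) frequency. It only illustrates the idea under stated assumptions: the shift of 10 bits and all function names are hypothetical and not taken from modm.

```cpp
#include <cstdint>

// Hypothetical number of fractional bits for the scaled frequency.
constexpr uint32_t kShift = 10;

// CPU frequency in MHz as a fixed-point value with kShift fractional bits.
constexpr uint32_t scaled_mhz(uint32_t fcpu_hz)
{
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(fcpu_hz) << kShift) / 1000000u);
}

// Microseconds to cycles using the scaled frequency.
constexpr uint32_t us_to_cycles_scaled(uint32_t us, uint32_t scaled)
{
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(us) * scaled) >> kShift);
}

// Microseconds to cycles using a truncated integer MHz value.
constexpr uint32_t us_to_cycles_integer(uint32_t us, uint32_t fcpu_hz)
{
    return us * (fcpu_hz / 1000000u);
}
```

For 1000 us at 2097000 Hz the integer version yields 2000 cycles, while the scaled version yields 2096, much closer to the ideal 2097.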

@salkinium
Member Author

Thank you very much, @chris-durand! I've fixed the boot_frequency on the Python side now.
I don't think anyone has or uses F2 devices, it's only a few dozen anyways… it'll be fine.

@salkinium
Member Author

I added a NUCLEO-F091RC BSP, since I had that lying around. I've then used it to add Cortex-M0 compilation of the unit tests to the CI which is important for the delay, since I use SysTick->VAL instead of DWT->CYCCNT.
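
For readers unfamiliar with the difference: SysTick->VAL is a 24-bit down-counter, so the elapsed time is computed the other way around than with DWT->CYCCNT. A minimal sketch, assuming the reload value is 0xFFFFFF and at most one wrap between the two reads (the function name is hypothetical):

```cpp
#include <cstdint>

// SysTick->VAL only implements the lower 24 bits.
constexpr uint32_t kSysTickMask = 0x00FFFFFFu;

// Elapsed ticks between two reads of a 24-bit down-counter.
// Assumes a reload value of 0xFFFFFF and at most one wrap-around.
constexpr uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    // The counter counts down, so start is the larger value unless it wrapped.
    return (start - end) & kSysTickMask;
}
```

The mask takes care of the wrap: reading 0x000010 first and 0xFFFFF0 afterwards still yields 32 elapsed ticks.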

@salkinium salkinium added ci:hal Triggers the exhaustive HAL compile CI jobs and removed ci:hal Triggers the exhaustive HAL compile CI jobs labels Jun 15, 2021
@XDjackieXD
Contributor

Amazing :D
As far as my understanding of ARM ASM goes this looks nice and it is more than accurate enough for my use-case.
Thank you very much for the quick fix!

@salkinium
Member Author

FYI: I'm going to merge this tonight if there are no further reviews.

@rleh rleh self-requested a review June 16, 2021 11:05
examples/nucleo_f091rc/blink/main.cpp (outdated, resolved)
Member

@chris-durand chris-durand left a comment

Very nice!

test/modm/platform/delay/delay_test.hpp (resolved)
@salkinium salkinium added ci:hal Triggers the exhaustive HAL compile CI jobs and removed ci:hal Triggers the exhaustive HAL compile CI jobs labels Jun 16, 2021
@salkinium
Member Author

salkinium commented Jun 16, 2021

Btw @rleh all jobs from travis-ci.org have been moved to travis-ci.com and it broke something in our ARM64 job. I assume it's just some missing setting, it's not super important for now.

@salkinium salkinium merged commit cc15b1a into modm-io:develop Jun 16, 2021
This was linked to issues Jun 16, 2021
@salkinium salkinium deleted the fix/delay_ns branch June 18, 2021 20:17
@salkinium salkinium added this to the 2021q2 milestone Jun 20, 2021
@XDjackieXD
Contributor

XDjackieXD commented Jun 25, 2021

Hm. I just tested this with cmake in our real project (I was testing with scons earlier to get as close to the examples as possible...) and I get linker errors like these:

undefined reference to `modm::platform::delay_ns(unsigned long)'

wherever I call modm::delay_ns (stm32f303 platform).

Is this something I'm doing wrong or is this a bug introduced with these changes? (shall I open a new issue for it?)

My mistake. Something in the cmake handling has changed and the cmake file wasn't updated by running lbuild build.

Labels
advanced 🤯 ci:hal Triggers the exhaustive HAL compile CI jobs fix 💎
Development

Successfully merging this pull request may close these issues.

  • delay_ns off by about factor 10^2
  • delay_ns off by a factor of two
  • modm::delay and FreeRTOS
4 participants