
delay_ns off by a factor of two #495

Closed
mcbridejc opened this issue Oct 13, 2020 · 6 comments · Fixed by #642

@mcbridejc (Contributor)

I think this must be a bug, though I haven't yet put my finger on what's going on.

Board: Nucleo-F303K8 (I'm using the modm board config and haven't touched the clock setup, so it's running from the HSI at 64 MHz)

My test:

{
   modm::atomic::Lock lck;   // disable interrupts around the pulse
   GpioA0::setOutput(true);
   modm::delay_ns(pulse_width);
   GpioA0::setOutput(false);
   modm::delay_ns(delay);
}

I'm observing with a scope that the delay is double the requested value. This is consistent for requested values from 1500 to 15000 ns; the delay always runs twice as long as requested.

The delay_ns_per_loop appears to have been calculated correctly for 64MHz:

(gdb) p modm::platform::delay_ns_per_loop
$1 = 47

I have a timer PWM output that has the expected timing, so I believe that I am actually running with the correct clock frequency.

I've tried stepping through the delay loop in the debugger while monitoring DWT.CYCCNT. It only increments by 2 per loop iteration, not the expected 3, which is an error in the wrong direction to explain this problem. I'm not sure, though, whether the debugger is affecting branch timing.

Disassembled function (from gdb) looks good to me:

0x8000498 <modm::delay_ns(unsigned long)>       ldr     r3, [pc, #12]   ; (0x80004a8 <modm::delay_ns(unsigned long)+16>)         
0x800049a <modm::delay_ns(unsigned long)+2>     movs    r2, #10                                                                  
0x800049c <modm::delay_ns(unsigned long)+4>     ldrh    r3, [r3, #0]                                                             
0x800049e <modm::delay_ns(unsigned long)+6>     muls    r2, r3                                                                   
0x80004a0 <modm::delay_ns(unsigned long)+8>     subs    r0, r0, r2                                                               
0x80004a2 <modm::delay_ns(unsigned long)+10>    subs    r0, r0, r3                                                               
0x80004a4 <modm::delay_ns(unsigned long)+12>    bpl.n   0x80004a2 <modm::delay_ns(unsigned long)+10>                             
0x80004a6 <modm::delay_ns(unsigned long)+14>    bx      lr                                                                       
0x80004a8 <modm::delay_ns(unsigned long)+16>    lsls    r0, r5, #6                                                               
0x80004aa <modm::delay_ns(unsigned long)+18>    asrs    r0, r0, #32                                                              

According to the Cortex-M4 instruction timing table (https://developer.arm.com/documentation/ddi0439/b/Programmers-Model/Instruction-set-summary/Cortex-M4-instructions), a taken branch should cost 1+P cycles, where P is "the number of cycles required for a pipeline refill. This ranges from 1 to 3 depending on the alignment and width of the target instruction, and whether the processor manages to speculate the address early." Even at the maximum P of 3, the loop should take only 5 cycles, and I'm seeing 6 pretty clearly.

Any ideas?

@mcbridejc (Contributor, Author)

On the hunch that this has something to do with instruction fetching and may depend on instruction alignment, I ran a small experiment: I inserted nop instructions into the delay_ns assembly to shift the alignment of the bpl.n and subs r0, r0, r3 instructions.

Here, with six nops:

0x800049a <modm::delay_ns(unsigned long)+2>     movs    r2, #10                                                             
0x800049c <modm::delay_ns(unsigned long)+4>     ldrh    r3, [r3, #0]                                                        
0x800049e <modm::delay_ns(unsigned long)+6>     nop                                                                         
0x80004a0 <modm::delay_ns(unsigned long)+8>     nop                                                                         
0x80004a2 <modm::delay_ns(unsigned long)+10>    nop                                                                         
0x80004a4 <modm::delay_ns(unsigned long)+12>    nop                                                                         
0x80004a6 <modm::delay_ns(unsigned long)+14>    nop                                                                         
0x80004a8 <modm::delay_ns(unsigned long)+16>    nop                                                                         
0x80004aa <modm::delay_ns(unsigned long)+18>    muls    r2, r3                                                              
0x80004ac <modm::delay_ns(unsigned long)+20>    subs    r0, r0, r2                                                          
0x80004ae <modm::delay_ns(unsigned long)+22>    subs    r0, r0, r3                                                          
0x80004b0 <modm::delay_ns(unsigned long)+24>    bpl.n   0x80004ae <modm::delay_ns(unsigned long)+22>                        
0x80004b2 <modm::delay_ns(unsigned long)+26>    bx      lr                                                                  
0x80004b4 <modm::delay_ns(unsigned long)+28>    lsls    r0, r5, #6                                                          
0x80004b6 <modm::delay_ns(unsigned long)+30>    asrs    r0, r0, #32                                                         

With the alignment above, the delay loop appears to take 7 cycles. If I add or remove one nop, it goes back to taking 6 cycles. It seems likely that having the branch instruction and its target on separate 64-bit words costs the extra cycle. That doesn't explain why the delay is double the expected value, but it does suggest that the delay will vary with the alignment of the delay_ns function.

@mcbridejc (Contributor, Author)

Adding the modm_fastcode attribute to delay_ns brings it back to 3 cycles per loop, and with the code in RAM I no longer see any dependence on alignment. So... should delay_ns be moved to the fastcode section?

@salkinium changed the title from "delay_ns off by two" to "delay_ns off by a factor of two" on Oct 14, 2020
@salkinium (Member)

Nanosecond delay is a difficult topic… I left the current implementation a little frustrated by exactly the limitations you discovered.

Adding modm_fastcode attribute to delay_ns brings back to 3 cycles per loop

It won't be inline-able anymore, which means it will be significantly worse for very small delays in very fast bit-bangs (400 kHz I2C, for example). We could add an alignment restriction (.align 8) to the ASM and fix the loop count, but then the overhead may still be comparable to jumping to a RAM routine.

So...should delay_ns be moved to .fastcode section?

I think we should; the loop/latency correctness is worth more than small-delay optimizations. Want to open a PR?

However, .fastcode is only placed into RAM if a fast instruction RAM exists. The F303 has a Core-Coupled Memory specifically for instruction-bus access, so code there sees no additional access delay; the word-alignment effect you discovered comes from the flash cache, a.k.a. the ST ART Accelerator.

We should probably move the .fastcode section into RAM at all times; it may not always be faster, but the latency should be more consistent. I can do this on your PR, since it's a bit involved.

@mcbridejc (Contributor, Author)

mcbridejc commented Oct 14, 2020 via email

@salkinium (Member)

Why not use the DWT cycle counter?

It's not available on the Cortex-M0 (and I haven't been able to get it to work on the Cortex-M7 yet), and you have to load its address into a register and poll it periodically, which costs even more cycles per loop. So you don't get better resolution, only better overall latency. It might be interesting to have an inline check that jumps to a fast inline implementation for very small (but inaccurate) delays, and otherwise to a RAM routine that has a bit more overhead but is more accurate?

mcbridejc added a commit to mcbridejc/modm that referenced this issue Nov 14, 2020
@mcbridejc (Contributor, Author)

Sorry, I didn't come back to this project for a while. The PR above is the simple fix that has been working for me on the F303.

mcbridejc added a commit to mcbridejc/modm that referenced this issue Nov 17, 2020
@salkinium salkinium linked a pull request Jun 16, 2021 that will close this issue