
delay_ns off by a factor of two #495

Closed
mcbridejc opened this issue Oct 13, 2020 · 6 comments · Fixed by #642

@mcbridejc (Contributor)

I think this must be a bug, though I haven't yet put my finger on what's going on.

Board: Nucleo-F303K8 (I'm using the modm board config and haven't touched the clock setup, so it's running from the HSI at 64 MHz)

My test:

{
   modm::atomic::Lock lck;   // disable interrupts around the pulse
   GpioA0::setOutput(true);
   modm::delay_ns(pulse_width);
   GpioA0::setOutput(false);
   modm::delay_ns(delay);
}

I'm observing with a scope that the delay is double the requested value. This is consistent for requested values from 1500 to 15000 ns; the delay always runs twice as long as requested.

The delay_ns_per_loop appears to have been calculated correctly for 64MHz:

(gdb) p modm::platform::delay_ns_per_loop
$1 = 47

I have a timer PWM output that has the expected timing, so I believe that I am actually running with the correct clock frequency.

I've tried stepping through the delay loop in the debugger while monitoring DWT.CYCCNT. It only increments by 2 per loop iteration, not the expected 3, which is an error in the wrong direction to explain this problem. I'm not sure, though, whether the debugger is affecting branch timing.

Disassembled function (from gdb) looks good to me:

0x8000498 <modm::delay_ns(unsigned long)>       ldr     r3, [pc, #12]   ; (0x80004a8 <modm::delay_ns(unsigned long)+16>)         
0x800049a <modm::delay_ns(unsigned long)+2>     movs    r2, #10                                                                  
0x800049c <modm::delay_ns(unsigned long)+4>     ldrh    r3, [r3, #0]                                                             
0x800049e <modm::delay_ns(unsigned long)+6>     muls    r2, r3                                                                   
0x80004a0 <modm::delay_ns(unsigned long)+8>     subs    r0, r0, r2                                                               
0x80004a2 <modm::delay_ns(unsigned long)+10>    subs    r0, r0, r3                                                               
0x80004a4 <modm::delay_ns(unsigned long)+12>    bpl.n   0x80004a2 <modm::delay_ns(unsigned long)+10>                             
0x80004a6 <modm::delay_ns(unsigned long)+14>    bx      lr                                                                       
0x80004a8 <modm::delay_ns(unsigned long)+16>    lsls    r0, r5, #6                                                               
0x80004aa <modm::delay_ns(unsigned long)+18>    asrs    r0, r0, #32                                                              

According to the Cortex-M4 instruction timing table (https://developer.arm.com/documentation/ddi0439/b/Programmers-Model/Instruction-set-summary/Cortex-M4-instructions), a taken branch should cost 1+P cycles, where P is "the number of cycles required for a pipeline refill. This ranges from 1 to 3 depending on the alignment and width of the target instruction, and whether the processor manages to speculate the address early." Even at the maximum P of 3, the loop should take only 5 cycles, and I'm seeing 6 pretty clearly.

Any ideas?

@mcbridejc (Contributor, Author)

On the hunch that this has something to do with instruction fetching and may depend on instruction alignment, I ran a small experiment: I inserted nop instructions into the delay_ns assembly to shift the alignment of the bpl.n and subs r0, r0, r3 instructions.

Here, with six nops:

0x800049a <modm::delay_ns(unsigned long)+2>     movs    r2, #10                                                             
0x800049c <modm::delay_ns(unsigned long)+4>     ldrh    r3, [r3, #0]                                                        
0x800049e <modm::delay_ns(unsigned long)+6>     nop                                                                         
0x80004a0 <modm::delay_ns(unsigned long)+8>     nop                                                                         
0x80004a2 <modm::delay_ns(unsigned long)+10>    nop                                                                         
0x80004a4 <modm::delay_ns(unsigned long)+12>    nop                                                                         
0x80004a6 <modm::delay_ns(unsigned long)+14>    nop                                                                         
0x80004a8 <modm::delay_ns(unsigned long)+16>    nop                                                                         
0x80004aa <modm::delay_ns(unsigned long)+18>    muls    r2, r3                                                              
0x80004ac <modm::delay_ns(unsigned long)+20>    subs    r0, r0, r2                                                          
0x80004ae <modm::delay_ns(unsigned long)+22>    subs    r0, r0, r3                                                          
0x80004b0 <modm::delay_ns(unsigned long)+24>    bpl.n   0x80004ae <modm::delay_ns(unsigned long)+22>                        
0x80004b2 <modm::delay_ns(unsigned long)+26>    bx      lr                                                                  
0x80004b4 <modm::delay_ns(unsigned long)+28>    lsls    r0, r5, #6                                                          
0x80004b6 <modm::delay_ns(unsigned long)+30>    asrs    r0, r0, #32                                                         

With the alignment above, the delay loop appears to take 7 cycles. If I add or remove one nop, it goes back to taking 6 cycles. It seems likely that having the branch instruction and its target on separate 64-bit words costs the extra cycle. That doesn't explain why the delay is double the expected value, but it does suggest that the delay will vary with the alignment of the delay_ns function.

@mcbridejc (Contributor, Author)

Adding the modm_fastcode attribute to delay_ns brings it back to 3 cycles per loop, and with the code in RAM I no longer see any dependence on alignment. So... should delay_ns be moved to the fastcode section?

@salkinium changed the title from "delay_ns off by two" to "delay_ns off by a factor of two" on Oct 14, 2020
@salkinium (Member)

Nanosecond delay is a difficult topic… I left the current implementation a little frustrated by exactly the limitations you discovered.

Adding modm_fastcode attribute to delay_ns brings back to 3 cycles per loop

It won't be inline-able anymore, which means it will be significantly worse for very small delays in very fast bit-bangs (400 kHz I2C, for example). We could add an alignment restriction (.align 8) to the ASM and fix the loop count, but then the overhead may still be comparable to jumping to a RAM routine.

So...should delay_ns be moved to .fastcode section?

I think we should; the loop/latency correctness is worth more than small-delay optimizations. Want to open a PR?

However, .fastcode is only placed into RAM if a fast instruction RAM exists. The F303 has a Core-Coupled Memory specifically for instruction-bus access, so code there sees no additional access delay; the word-alignment effect you discovered comes from the flash cache, a.k.a. the ST ART Accelerator.

We should probably move the .fastcode section into RAM at all times; it may not always be faster, but the latency should be more consistent. I can do this on your PR, since it's a bit involved.

@mcbridejc (Contributor, Author)

mcbridejc commented Oct 14, 2020 via email

@salkinium (Member)

Why not use the DWT cycle counter?

It's not available on the Cortex-M0 (and I haven't been able to get it to work on the Cortex-M7 yet), and you have to load its address into a register and poll it periodically, which costs even more cycles per loop. So you don't get better resolution, only better overall latency. It might be interesting to have an inline check that jumps to a fast inline implementation for very small (but inaccurate) delays, and otherwise to a RAM routine that has a bit more overhead but is more accurate?

mcbridejc added a commit to mcbridejc/modm that referenced this issue Nov 14, 2020
@mcbridejc (Contributor, Author)

Sorry, I didn't come back to this project for a while. The PR above is the simple fix that has been working for me on the F303.

mcbridejc added a commit to mcbridejc/modm that referenced this issue Nov 17, 2020
@salkinium salkinium linked a pull request Jun 16, 2021 that will close this issue