-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrote library to fix several issues #17
Conversation
Marked as draft because I haven't had a chance to test this yet. |
I decided it wasn't a good idea to make a breaking API change on such an inactive repository, so I put back the |
Inspecting the assembly output, both long and short delays work well. Constant delays (known at compile time) compile with negligible overhead. Tested with both speed ( |
Hi, thanks for this merge request - I've managed to run the code, but seems like on my atmega328p the delays are 10x as long as they are supposed to be: e.g. I'm using |
I double-checked the math and it looks right to me, and it seemed to work correctly on my 328p. I'm not sure what could be causing this issue. Can you post your code? |
Sure - my code is nothing more than: #![feature(llvm_asm)]
#![no_std]
#![no_main]
use ruduino::cores::current::port;
use ruduino::Pin;
#[no_mangle]
pub extern "C" fn main() {
port::B0::set_output();
loop {
port::B0::set_high();
avr_delay::delay_ms(1000); // in this case it's set to 1s, but in reality takes 10s
port::B0::set_low();
avr_delay::delay_ms(1000);
}
} ... compiled using:
... with [build]
target = "avr-atmega328p.json"
[unstable]
build-std = ["core"] Maybe 20 MHz is too much for |
20MHz shouldn't be causing any overflow. Maybe this is related to the issue with |
Yeah, I've read about that issue and I've done like a dozen of |
For what it worth:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewing as requested.
src/lib.rs
Outdated
// microseconds | ||
let us = ms * 1000; | ||
delay_us(us); | ||
let ticks: u64 = (u64::from(avr_config::CPU_FREQUENCY_HZ) * u64::from(ms)) / 4_000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u64 is very costly on AVR. The hardware can only do 8 bits addition, subtraction and multiplication. It can also do 8 bits divide par two. Everything else will be implemented in software with a tiny bit of help from the hardware for carry. So here, we have a u64 multiplication and division. This means that it will require a lot instructions, taking precious flash space. And it will be very slow at runtime.
A nice summary of the instruction set: https://en.wikipedia.org/wiki/Atmel_AVR_instruction_set
(The official doc is the absolute reference, but from a cursory look, it seems to be almost all on the wikipedia page).
avr_config::CPU_FREQUENCY_HZ is a const u32
.
So the challenge is to find a way to compute as much as possible in constant expression at compile time. And end up with the least amount of arithmetic at runtime. And most of it around 8 bits.
In my code, I have basically adapted the reference implementation in C. Which cost 16 bits comparison, subtractions and multiplications. The code follows.
/// Works for 4 KHz >= CPU freqency <= 262.14 MHz.
/// Maximum 65_535 ms.
pub fn busy_loop_ms(ms: u16) {
const ITER_PER_MS: u16 = (crate::CPU_FREQUENCY_HZ / 4000) as u16;
const MAX_MS_PER_LOOP: u16 = u16::MAX / ITER_PER_MS;
const ITERS_PER_MAX_MS: u16 = MAX_MS_PER_LOOP * ITER_PER_MS;
let mut ms: u16 = ms;
while ms > MAX_MS_PER_LOOP {
busy_loop_4_cycles(ITERS_PER_MAX_MS);
ms -= MAX_MS_PER_LOOP;
}
if ms > 0 {
busy_loop_4_cycles(ms * ITER_PER_MS);
}
}
It "feels" like it might be possible to do better still. Note that if the delay value is known at compile time (like delay(150)
), and the code is successfully inlined, then most of the operations will be done at compile time. But you cannot count on this without looking at the assembly output and careful tuning of the application code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was counting on everything being inlined at compile time. The AVR C implementation that I was looking at uses floats, which are ridiculously slow when emulated at runtime. Since that implementation was depending on compile time evaluation, I felt that it was fine to also do that.
But I agree your implementation is better. The only potential problem I see is that it can only delay for a max of about one minute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if it depends on the compile time evaluation, more like it hopes it does it? My naive understanding is compile time evaluation wouldn't be guaranteed in the C version.
The instruction set does have some float multiplication, but not float divisions. I don't know how costly floats are maybe its not that bad? This was written by Atmel, I would hope they know a thing or two about their CPUs.
Float would have no minute limit while retaining high precision for small values. I will investigate and report here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bombela
This is from <util/delay.h>
(the file defining _delay_ms
and _delay_us
) included with Microchip Studio:
The functions in this header file are wrappers around the basic busy-wait functions from <util/delay_basic.h>. They are meant as convenience functions where actual time values can be specified rather than a number of cycles to wait for. The idea behind is that compile-time constant expressions will be eliminated by compiler optimization so floating-point expressions can be used to calculate the number of delay cycles needed based on the CPU frequency passed by the macro F_CPU.
In order for these functions to work as intended, compiler optimizations must be enabled, and the delay time must be an expression that is a known constant at compile-time. If these requirements are not met, the resulting delay will be much longer (and basically unpredictable), and applications that otherwise do not use floating-point calculations will experience severe code bloat by the floating-point library routines linked into the application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since AVR doesn't have hardware division at all, I think it's basically impossible to make delay_ms and delay_us work with values that aren't known at compile time without a ton of overhead. At the very least, I'd say it isn't a priority for this crate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The division is not needed at runtime in my example. But it still requires 16bit substraction and multiplication.
But note that even if you call delay with a constant value. The compiler might still move the computation at runtime to reduce code bloat. I encountered this on my last project (attiny24a, 2048 bytes of flash). I used delays to output patterns on a pin, which helped me synchronize debugging on an oscilloscope. The trick was to use only two different delay constants. The compiler was happy to output only two functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
example, with opt-level=z:
#[no_mangle]
pub unsafe extern "C" fn main() {
busy_loop_ms(66);
busy_loop_ms(67);
busy_loop_ms(68);
loop {}
}
Produces:
main:
ldi r24, 66
ldi r25, 0
rcall _ZN12attiny24a_uc3mcu5delay12busy_loop_ms17hdb07dc8da5567d03E
ldi r24, 67
ldi r25, 0
rcall _ZN12attiny24a_uc3mcu5delay12busy_loop_ms17hdb07dc8da5567d03E
ldi r24, 68
ldi r25, 0
rcall _ZN12attiny24a_uc3mcu5delay12busy_loop_ms17hdb07dc8da5567d03E
.LBB3_1:
rjmp .LBB3_1
_ZN12attiny24a_uc3mcu5delay12busy_loop_ms17hdb07dc8da5567d03E:
ldi r18, 0
ldi r19, 250
.LBB0_1:
cpi r24, 33
cpc r25, r1
brlo .LBB0_3
movw r30, r18
;APP
.Ltmp0:
sbiw r30, 1
brne .Ltmp0
;NO_APP
sbiw r24, 32
rjmp .LBB0_1
.LBB0_3:
cpi r24, 0
cpc r25, r1
breq .LBB0_5
ldi r22, 208
ldi r23, 7
rcall __mulhi3
;APP
.Ltmp1:
sbiw r24, 1
brne .Ltmp1
;NO_APP
.LBB0_5:
ret
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are you getting that assembly output? I've been using avr-objdump on the .elf output file, and the formatting is much less nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cargo rustc $(FLAGS) --target=avr-attiny24a.json -- --emit asm
echo target/avr-attiny24a/release/deps/*.s
src/lib.rs
Outdated
/// * 'count' - a u16, the number of times to cycle the loop. | ||
#[inline(always)] | ||
#[allow(unused_variables, unused_mut, unused_assignments)] | ||
fn delay_loop_4_cycles(mut count: u16) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to adapt my own code to the new asm
macro recently. I found all the magical register naming in the Rust compiler source code at https://github.com/rust-lang/rust/blob/9ad5d82f822b3cb67637f11be2e65c5662b66ec0/compiler/rustc_target/src/asm/avr.rs.
The documentation of the asm!
macro itself can be found there: https://doc.rust-lang.org/nightly/reference/inline-assembly.html.
Volatile is now the default. So I believe the following code should be the correct translation.
Even though we do not care of the final value of _count
, we still must declare the register pair as both input and output. The hardware is mutating the register pair, and the compiler should know about this.
/// 4 cycles per iteration.
#[inline(always)]
pub fn busy_loop_4_cycles(count: u16) {
let mut _count = count; // Quiet the linter.
unsafe {
asm!(
"1: sbiw {regpair}, 1",
"brne 1b",
regpair = inout(reg_iw) _count,
)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bombela please make a merge request for it.
That is no longer true due other "progress". The so called progress is a mix of
I hope that there will be smaller Merge Requests that replace this large MR. |
Okay, this is ready for review again. I basically just switched the It only compiles in release mode for some reason (so does the current main branch of the main repo), but that's an issue with compiling @bombela's point about delays that aren't known at compile time still stands, but I think it isn't a priority right now (AVR libc doesn't do it either), we first need to get the crate in a working state. |
I might try something with magic-number division (https://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html) to work with values not known at compile time. But that will definitely be a separate pull request EDIT: It looks like this wouldn't be that helpful, since we would end up with 64x64->128 but multiplications, which would be very slow anyway. |
This is fascinating. The math goes over my head sadly. On my side, I am trying to produce something similar to the intrinsic delay_cycles from gcc-avr. With |
I actually got it working. Here is a gist (https://gist.github.com/bombela/cc90e5c29f3e7667326de4c087c1e148) with the code for reference. I tested all the corner cases I could think of, and it appears to work beautifully. It requires that the delay functions take their parameters via const generic ( |
That looks really good. A couple minor points:
|
On Thu, Jun 09, 2022 at 01:34:07AM -0700, François-Xavier Bourlet wrote:
I actually got it working. Here is a gist
(https://gist.github.com/bombela/cc90e5c29f3e7667326de4c087c1e148) with
the code for reference. I tested all the corner cases I could think of,
and it appears to work beautifully. It requires that the delay functions
take their parameters via const generic (`delay_ms::<42>()`). I will
try to assemble a PR together someday.
Request / idea
Idea / request
In directory examples in the rewrite branch
Groeten
Geert Stappers
--
Silence is hard to parse
|
That is not for this rewrite. I closing this merge request. That might be stupid. The idea is getting a fresh start. Use branch |
Rewrite: Prevent overflow from public functions Hopefully are the underlaying commits preserved. Yes, I do fear that "github.com" does `git merge --squash`. What "lord-ne" wrote: This is the rewrite discussed in #17. Its main purpose is to ensure that public functions cannot overflow, by having them delegate to private functions that accept larger parameters. Some work was also put in to mitigating the effects of the delays not being inlined at compile time. Additional a link to #22 are is being asked for good commit messages.
Fixes the issues #13 and #15. Supersedes the pull requests #12, #14, and #16.