Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrote library to fix several issues #17

Closed
wants to merge 5 commits into from

Conversation

lord-ne
Copy link

@lord-ne lord-ne commented Aug 25, 2021

Fixes the issues #13 and #15. Supersedes the pull requests #12, #14, and #16.

@lord-ne
Copy link
Author

lord-ne commented Aug 25, 2021

Marked as draft because I haven't had a chance to test this yet.

@lord-ne
Copy link
Author

lord-ne commented Aug 25, 2021

I decided it wasn't a good idea to make a breaking API change on such an inactive repository, so I put back the u32 delay function, and marked it as deprecated.

@lord-ne lord-ne marked this pull request as ready for review August 25, 2021 15:11
@lord-ne
Copy link
Author

lord-ne commented Aug 25, 2021

Inspecting the assembly output, both long and short delays work well. Constant delays (known at compile time) compile with negligible overhead. Tested with both speed (opt-level = 3) and size (opt-level = "z") optimizations.

@Patryk27
Copy link

Hi, thanks for this merge request - I've managed to run the code, but seems like on my atmega328p the delays are 10x as long as they are supposed to be: e.g. delay_ms(100); waits for one second.

I'm using AVR_CPU_FREQUENCY_HZ=20000000 and dropping one zero seems to help, but it kinda feels the value is alright and the calculation is off somewhere.

@lord-ne
Copy link
Author

lord-ne commented Sep 17, 2021

I double-checked the math and it looks right to me, and it seemed to work correctly on my 328p. I'm not sure what could be causing this issue. Can you post your code?

@Patryk27
Copy link

Patryk27 commented Sep 17, 2021

Sure - my code is nothing more than:

#![feature(llvm_asm)]
#![no_std]
#![no_main]

use ruduino::cores::current::port;
use ruduino::Pin;

#[no_mangle]
pub extern "C" fn main() {
    port::B0::set_output();

    loop {
        port::B0::set_high();
        avr_delay::delay_ms(1000); // in this case it's set to 1s, but in reality takes 10s 
        port::B0::set_low();
        avr_delay::delay_ms(1000);
    }
}

... compiled using:

AVR_CPU_FREQUENCY_HZ=20000000 cargo build --release

... with .cargo/config:

[build]
target = "avr-atmega328p.json"

[unstable]
build-std = ["core"]

Maybe 20 MHz is too much for AVR_CPU_FREQUENCY_HZ? For reference, atmega-hal's Delay seems to be working correctly (also on 20 MHz).

@lord-ne
Copy link
Author

lord-ne commented Sep 17, 2021

20MHz shouldn't be causing any overflow. Maybe this is related to the issue with avr_config where it doesn't rebuild when the AVR_CPU_FREQUENCY_HZ environment variable changes?

@Patryk27
Copy link

Yeah, I've read about that issue and I've done like a dozen of cargo clean-s to be sure I'm not running stale code 😄 Anyways, I don't have my 328p at hand now - I'll try to re-check when I have some free time; if everything works correctly on your side, then I might've just fused my processor incorrectly (I guess accidentally setting it to use its internal 8 MHz clock could yield similar results).

@stappersg
Copy link
Member

Copy link

@bombela bombela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing as requested.

src/lib.rs Outdated
// microseconds
let us = ms * 1000;
delay_us(us);
let ticks: u64 = (u64::from(avr_config::CPU_FREQUENCY_HZ) * u64::from(ms)) / 4_000;
Copy link

@bombela bombela Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u64 is very costly on AVR. The hardware can only do 8 bits addition, subtraction and multiplication. It can also do 8 bits divide par two. Everything else will be implemented in software with a tiny bit of help from the hardware for carry. So here, we have a u64 multiplication and division. This means that it will require a lot instructions, taking precious flash space. And it will be very slow at runtime.

A nice summary of the instruction set: https://en.wikipedia.org/wiki/Atmel_AVR_instruction_set
(The official doc is the absolute reference, but from a cursory look, it seems to be almost all on the wikipedia page).

avr_config::CPU_FREQUENCY_HZ is a const u32.

So the challenge is to find a way to compute as much as possible in constant expression at compile time. And end up with the least amount of arithmetic at runtime. And most of it around 8 bits.

In my code, I have basically adapted the reference implementation in C. Which cost 16 bits comparison, subtractions and multiplications. The code follows.

  /// Works for 4 KHz >= CPU freqency <= 262.14 MHz.
  /// Maximum 65_535 ms.
  pub fn busy_loop_ms(ms: u16) {
      const ITER_PER_MS: u16 = (crate::CPU_FREQUENCY_HZ / 4000) as u16;
      const MAX_MS_PER_LOOP: u16 = u16::MAX / ITER_PER_MS;
      const ITERS_PER_MAX_MS: u16 = MAX_MS_PER_LOOP * ITER_PER_MS;

      let mut ms: u16 = ms;

      while ms > MAX_MS_PER_LOOP {
          busy_loop_4_cycles(ITERS_PER_MAX_MS);
          ms -= MAX_MS_PER_LOOP;
      }

      if ms > 0 {
          busy_loop_4_cycles(ms * ITER_PER_MS);
      }
  }

It "feels" like it might be possible to do better still. Note that if the delay value is known at compile time (like delay(150)), and the code is successfully inlined, then most of the operations will be done at compile time. But you cannot count on this without looking at the assembly output and careful tuning of the application code.

Copy link
Author

@lord-ne lord-ne Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was counting on everything being inlined at compile time. The AVR C implementation that I was looking at uses floats, which are ridiculously slow when emulated at runtime. Since that implementation was depending on compile time evaluation, I felt that it was fine to also do that.

But I agree your implementation is better. The only potential problem I see is that it can only delay for a max of about one minute.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it depends on the compile time evaluation, more like it hopes it does it? My naive understanding is compile time evaluation wouldn't be guaranteed in the C version.

The instruction set does have some float multiplication, but not float divisions. I don't know how costly floats are maybe its not that bad? This was written by Atmel, I would hope they know a thing or two about their CPUs.

Float would have no minute limit while retaining high precision for small values. I will investigate and report here.

Copy link
Author

@lord-ne lord-ne Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bombela
This is from <util/delay.h> (the file defining _delay_ms and _delay_us) included with Microchip Studio:

The functions in this header file are wrappers around the basic busy-wait functions from <util/delay_basic.h>. They are meant as convenience functions where actual time values can be specified rather than a number of cycles to wait for. The idea behind is that compile-time constant expressions will be eliminated by compiler optimization so floating-point expressions can be used to calculate the number of delay cycles needed based on the CPU frequency passed by the macro F_CPU.

In order for these functions to work as intended, compiler optimizations must be enabled, and the delay time must be an expression that is a known constant at compile-time. If these requirements are not met, the resulting delay will be much longer (and basically unpredictable), and applications that otherwise do not use floating-point calculations will experience severe code bloat by the floating-point library routines linked into the application.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since AVR doesn't have hardware division at all, I think it's basically impossible to make delay_ms and delay_us work with values that aren't known at compile time without a ton of overhead. At the very least, I'd say it isn't a priority for this crate.

Copy link

@bombela bombela Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The division is not needed at runtime in my example. But it still requires 16bit substraction and multiplication.

But note that even if you call delay with a constant value. The compiler might still move the computation at runtime to reduce code bloat. I encountered this on my last project (attiny24a, 2048 bytes of flash). I used delays to output patterns on a pin, which helped me synchronize debugging on an oscilloscope. The trick was to use only two different delay constants. The compiler was happy to output only two functions.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

example, with opt-level=z:

#[no_mangle]
pub unsafe extern "C" fn main() {
    busy_loop_ms(66);
    busy_loop_ms(67);
    busy_loop_ms(68);
    loop {}
}

Produces:

main:
	ldi	r24, 66
	ldi	r25, 0
	rcall	_ZN12attiny24a_uc3mcu5delay12busy_loop_ms17hdb07dc8da5567d03E
	ldi	r24, 67
	ldi	r25, 0
	rcall	_ZN12attiny24a_uc3mcu5delay12busy_loop_ms17hdb07dc8da5567d03E
	ldi	r24, 68
	ldi	r25, 0
	rcall	_ZN12attiny24a_uc3mcu5delay12busy_loop_ms17hdb07dc8da5567d03E
.LBB3_1:
	rjmp	.LBB3_1

_ZN12attiny24a_uc3mcu5delay12busy_loop_ms17hdb07dc8da5567d03E:
	ldi	r18, 0
	ldi	r19, 250
.LBB0_1:
	cpi	r24, 33
	cpc	r25, r1
	brlo	.LBB0_3
	movw	r30, r18
	;APP
.Ltmp0:
	sbiw	r30, 1
	brne	.Ltmp0
	;NO_APP
	sbiw	r24, 32
	rjmp	.LBB0_1
.LBB0_3:
	cpi	r24, 0
	cpc	r25, r1
	breq	.LBB0_5
	ldi	r22, 208
	ldi	r23, 7
	rcall	__mulhi3
	;APP
.Ltmp1:
	sbiw	r24, 1
	brne	.Ltmp1
	;NO_APP
.LBB0_5:
	ret

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are you getting that assembly output? I've been using avr-objdump on the .elf output file, and the formatting is much less nice

Copy link

@bombela bombela Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cargo rustc $(FLAGS) --target=avr-attiny24a.json -- --emit asm
echo target/avr-attiny24a/release/deps/*.s

src/lib.rs Outdated
/// * 'count' - a u16, the number of times to cycle the loop.
#[inline(always)]
#[allow(unused_variables, unused_mut, unused_assignments)]
fn delay_loop_4_cycles(mut count: u16) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to adapt my own code to the new asm macro recently. I found all the magical register naming in the Rust compiler source code at https://github.com/rust-lang/rust/blob/9ad5d82f822b3cb67637f11be2e65c5662b66ec0/compiler/rustc_target/src/asm/avr.rs.

The documentation of the asm! macro itself can be found there: https://doc.rust-lang.org/nightly/reference/inline-assembly.html.

Volatile is now the default. So I believe the following code should be the correct translation.
Even though we do not care of the final value of _count, we still must declare the register pair as both input and output. The hardware is mutating the register pair, and the compiler should know about this.

  /// 4 cycles per iteration.
  #[inline(always)]
  pub fn busy_loop_4_cycles(count: u16) {
      let mut _count = count; // Quiet the linter.
      unsafe {
          asm!(
              "1: sbiw {regpair}, 1",
              "brne 1b",
              regpair = inout(reg_iw) _count,
          )
      }
  }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bombela please make a merge request for it.

@stappersg
Copy link
Member

Fixes the issues #13 and #15. Supersedes the pull requests #12, #14, and #16.

That is no longer true due other "progress". The so called progress is a mix of

  • being blocked by a LLVM ERROR
  • switch from llvm_asm!() to asm!()
  • merging merge requests

I hope that there will be smaller Merge Requests that replace this large MR.

@lord-ne lord-ne marked this pull request as draft June 8, 2022 17:50
@lord-ne lord-ne marked this pull request as ready for review June 8, 2022 23:07
@lord-ne
Copy link
Author

lord-ne commented Jun 8, 2022

Okay, this is ready for review again. I basically just switched the llvm_asm! to the asm! that's currently in the main repo (here) and tidied up the code a bit.

It only compiles in release mode for some reason (so does the current main branch of the main repo), but that's an issue with compiling core for AVR, not an issue with this code.

@bombela's point about delays that aren't known at compile time still stands, but I think it isn't a priority right now (AVR libc doesn't do it either), we first need to get the crate in a working state.

@lord-ne
Copy link
Author

lord-ne commented Jun 9, 2022

I might try something with magic-number division (https://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html) to work with values not known at compile time. But that will definitely be a separate pull request

EDIT: It looks like this wouldn't be that helpful, since we would end up with 64x64->128 but multiplications, which would be very slow anyway.

@bombela
Copy link

bombela commented Jun 9, 2022

This is fascinating. The math goes over my head sadly. On my side, I am trying to produce something similar to the intrinsic delay_cycles from gcc-avr. With asm! and const generics. I will make a pull request if I ever get something worthwhile.

@lord-ne lord-ne requested a review from stappersg June 9, 2022 08:06
@bombela
Copy link

bombela commented Jun 9, 2022

I actually got it working. Here is a gist (https://gist.github.com/bombela/cc90e5c29f3e7667326de4c087c1e148) with the code for reference. I tested all the corner cases I could think of, and it appears to work beautifully. It requires that the delay functions take their parameters via const generic (delay_ms::<42>()). I will try to assemble a PR together someday.

@lord-ne
Copy link
Author

lord-ne commented Jun 9, 2022

@bombela

That looks really good. A couple minor points:

  • I think that these compile time constant delay functions should probably be separate functions. There are still some use cases where the user knows that a value is known at compile time, but rust doesn't consider it valid in a const context for whatever reason (since the compiler's constant propagation is more powerful than rust's tools for determining const-ness). Although admittedly, I'm struggling to think of such cases off the top of my head. Maybe call the functions like delay_ms_const or something like that?
  • I think the arguments to delay_us, delay_ms, delay_s should be u32, so they can't overflow the Delayer. In general, I think that the Delayer itself should be made private, since it can overflow. And to be honest, does the MUL and DIV part really need to be part of the Delayer, wouldn't it be simpler to just multiply and divide at the call site (that is, in the body of delay_us, delay_ms, and delay_s) and pass in a single CYCLES value? And if it really is a useful operation, make a seperate function that takes in a u32 cycles and mul, so it can't overflow. EDIT: Overflow actually causes a compile-time error, so this isn't really an issue.
  • Some convenience macro rules so one can just write delay_s!(3 * 60) instead of delay_s::<{ 3 * 60 }>()
  • You should probably wait until Fix integer overflow in microseconds delay, typos #16 and/or Rewrote library to fix several issues #17 is merged before making another PR.

@stappersg
Copy link
Member

stappersg commented Jun 9, 2022 via email

@stappersg
Copy link
Member

afbeelding

That is not for this rewrite.

I closing this merge request. That might be stupid. The idea is getting a fresh start. Use branch rewrite for it.

@stappersg stappersg closed this Jun 9, 2022
stappersg added a commit that referenced this pull request Jun 13, 2022
Rewrite: Prevent overflow from public functions

Hopefully are the underlaying commits preserved.
Yes, I do fear that "github.com" does `git merge --squash`.

What "lord-ne" wrote:

This is the rewrite discussed in #17. Its main purpose is to ensure that public functions cannot overflow, by having them delegate to private functions that accept larger parameters. Some work was also put in to mitigating the effects of the delays not being inlined at compile time.

Additional a link to #22 are is being asked for good commit messages.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants