Loop unrolling optimization #15

Closed
nt314p opened this issue Sep 1, 2024 · 49 comments · Fixed by #16
Labels: enhancement (New feature or request)
@nt314p

nt314p commented Sep 1, 2024

Hi,

I've been messing around with methods to increase shift out speed (possibly to under 10 us per byte).

One suggestion I have to increase performance is to manually unroll loops within write functions, since the number of iterations (8, 16, 24, etc.) is known at compile time.

This will increase sketch size, so the unrolled versions could perhaps be offered either as alternative functions (like writeMSBFIRSTU, where U indicates unrolled) or behind an #ifdef guard on a USE_UNROLLED define. 🤷
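A rough sketch of the define-based variant (the flag name USE_UNROLLED is from the suggestion above; the helper shiftOutOneBit() and the function name are purely illustrative, not existing library code):

void shiftOutOneBit(bool bit);   //  illustrative helper: set the data line, pulse the clock

void writeMSBFIRST_sketch(uint8_t value)
{
#ifdef USE_UNROLLED
  //  eight explicit bit writes, no loop counter and no branch back
  shiftOutOneBit(value & 0x80);
  shiftOutOneBit(value & 0x40);
  shiftOutOneBit(value & 0x20);
  shiftOutOneBit(value & 0x10);
  shiftOutOneBit(value & 0x08);
  shiftOutOneBit(value & 0x04);
  shiftOutOneBit(value & 0x02);
  shiftOutOneBit(value & 0x01);
#else
  //  rolled loop: smaller code, but an extra compare + branch per bit
  for (uint8_t m = 0x80; m > 0; m >>= 1)
  {
    shiftOutOneBit(value & m);
  }
#endif
}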

-NT

@RobTillaart RobTillaart self-assigned this Sep 2, 2024
@RobTillaart RobTillaart added the enhancement New feature or request label Sep 2, 2024
@RobTillaart
Owner

Thanks for sharing this idea.
As I am rather busy, it will take some time before I can dive into it.
Did you make any measurements of the gain / size?

A define sounds good; it would allow setting it from the command line too.

@nt314p
Author

nt314p commented Sep 2, 2024

No worries, I am busy as well and probably will only get around to testing it on hardware in a month.

Certainly though, unrolling should remove the jump instructions needed for the loop and give some performance benefit.

--

Unrelated, but I was wondering what the purpose is of the status register manipulation and noInterrupts() call?

// inner for loop of FastShiftOut::writeLSBFIRST
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    uint8_t oldSREG = SREG; // <-
    noInterrupts(); // <-
    if ((value & m) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;
    SREG = oldSREG; // <-
  }

@RobTillaart
Owner

RobTillaart commented Sep 3, 2024

The noInterrupts() call prevents interrupts from running. The interrupt enable flag is part of SREG (the Status Register).
SREG = oldSREG restores all status flags to their state from before noInterrupts() was called.

Imagine that an IRQ handler uses the data and/or clock register while the loop runs.
E.g. *_dataOutRegister &= outmask2; is, in pseudo machine code:

1 LOAD  _dataOutRegister INTO CPUREGISTER
2 AND CPUREGISTER outmask2
3 STORE CPUREGISTER INTO _dataOutRegister 

That could corrupt the content of the data register: if the IRQ happened between steps 2 and 3 above and modified the register, step 3 would overwrite that change.


Note:
The noInterrupts() call should be moved out of the loop, as that would make the loop more predictable with respect to the arrival time of the individual bits. That could be important, even critical, depending on the receiving end.
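In sketch form (simplified from the snippet above, not the released library code) that would look like:

  uint8_t oldSREG = SREG;
  noInterrupts();                        //  disable interrupts once per byte
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    if ((value & m) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;
  }
  SREG = oldSREG;                        //  restore the previous interrupt state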

@RobTillaart
Owner

RobTillaart commented Sep 3, 2024

Reference timing of the current 0.3.3 version:

IDE: 1.8.19
Board: UNO

FastShiftOut_test.ino
FASTSHIFTOUT_LIB_VERSION: 0.3.3

Performance - time in us
        write: 23.51
        write: 45.78
        Delta: 22.27

writeLSBFIRST: 22.51
writeLSBFIRST: 45.77
        Delta: 23.26

writeMSBFIRST: 23.02
writeMSBFIRST: 44.77
        Delta: 21.75

Standard shiftOut1: 89.85
Standard shiftOut2: 179.44
             Delta: 89.60


Test print interface
println("Hello world"): 	328.92
println(1357): 			311.68
println(3.14159265, 4): 	716.04

done ...

@RobTillaart
Owner

RobTillaart commented Sep 3, 2024

Not a correct test, see below.

Quick test:

Moving the noInterrupts() out of the loop improves from 21.75 us => 19.11 us (~10%)

writeMSBFIRST: 23.02
writeMSBFIRST: 42.13
        Delta: 19.11

IDE: 1.8.19
Board UNO

Sketch uses 6544 bytes (20%) of program storage space. Maximum is 32256 bytes.
Global variables use 609 bytes (29%) of dynamic memory, leaving 1439 bytes for local variables. Maximum is 2048 bytes.

@RobTillaart
Owner

RobTillaart commented Sep 3, 2024

Not a correct test, see below.

Quick test

Unrolling the loop and replacing m by constants improves from 19.11 us ==> 11.94 us (~38%)

writeMSBFIRST: 23.02
writeMSBFIRST: 34.96
        Delta: 11.94
  uint8_t oldSREG = SREG;
  noInterrupts();

    if ((value & 0x80) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x40) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x20) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x10) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x08) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x04) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x02) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;

    if ((value & 0x01) == 0) *_dataOutRegister &= outmask2;
    else                  *_dataOutRegister |= outmask1;
    *_clockRegister |= cbmask1;
    *_clockRegister &= cbmask2;


  SREG = oldSREG;

IDE: 1.8.19
Board UNO

Sketch uses 6792 bytes (21%) of program storage space. Maximum is 32256 bytes.
Global variables use 609 bytes (29%) of dynamic memory, leaving 1439 bytes for local variables. Maximum is 2048 bytes.

Uses 250 bytes more PROGMEM.
NOTE: only the MSB code was patched for this test, so patching both would add twice this amount of PROGMEM.

@RobTillaart
Owner

RobTillaart commented Sep 3, 2024

Not a correct test, see below.

Quick test

Replacing &= by = and |= by = for all masking improves from 11.94 us ==> 4.65 us (~60%).
The improvement from the reference 21.75 us ==> 4.65 us is almost 80% (a factor of ~4.7).

writeMSBFIRST: 23.02
writeMSBFIRST: 27.67
        Delta: 4.65

  uint8_t oldSREG = SREG;
  noInterrupts();
  
  uint8_t cbmask1  = *_clockRegister | _clockBit;
  uint8_t cbmask2  = *_clockRegister & ~_clockBit;
  uint8_t outmask1 = *_dataOutRegister | _dataOutBit;
  uint8_t outmask2 = *_dataOutRegister & ~_dataOutBit;

    if ((value & 0x80) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x40) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x20) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x10) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x08) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x04) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x02) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;

    if ((value & 0x01) == 0) *_dataOutRegister = outmask2;
    else                  *_dataOutRegister = outmask1;
    *_clockRegister = cbmask1;
    *_clockRegister = cbmask2;


  SREG = oldSREG;

IDE: 1.8.19
Board UNO

Sketch uses 6708 bytes (20%) of program storage space. Maximum is 32256 bytes.
Global variables use 609 bytes (29%) of dynamic memory, leaving 1439 bytes for local variables. Maximum is 2048 bytes.

Roughly 150 bytes more PROGMEM.
NOTE: only the MSB code was patched for this test, so patching both would add twice this amount of PROGMEM.

@RobTillaart
Owner

RobTillaart commented Sep 3, 2024

@nt314p

Warning: tests above need to be verified, they are just quick tests and may have bugs.

The test program was incorrect; the code part looks OK.


OK, trading program memory for speed makes a call roughly 4.5 times faster, which is more than expected.
4.65 us means the average bit is sent in 4.65 / 8 ≈ 0.6 us (roughly 1.7 megabit/second).

Note that the duration of a pulse on the clock or data line is even less.
Needs to be checked with hardware (scope) to verify if the pulses are still "square" enough.

This code needs some serious warning in the documentation.

to be continued

@RobTillaart
Owner

RobTillaart commented Sep 3, 2024

Quick test of the print interface improvement (which has far more overhead)

Test print interface
println("Hello world"): 	218.64
println(1357): 			260.68
println(3.14159265, 4): 	648.12

where the 0.3.3 scored

Test print interface
println("Hello world"): 	328.92
println(1357): 			311.68
println(3.14159265, 4): 	716.04

Roughly 9 us gained per character sent?

As said before, the code needs to be verified.
Furthermore, the gain for LSB and MSB might differ a bit.

TODO: test the gain for write16, write24 and write32 ==> add to example

@RobTillaart
Owner

Quick test for write16/24/32

Reference 0.3.3

      write16: 45.84
      write16: 91.42
        Delta: 45.58

      write24: 68.09
      write24: 135.93
        Delta: 67.84

      write32: 90.35
      write32: 180.45
        Delta: 90.10

optimized

      write16: 28.74
      write16: 57.22
        Delta: 28.48

      write24: 42.44
      write24: 84.63
        Delta: 42.19

      write32: 56.15
      write32: 112.04
        Delta: 55.90

Still a ~37% improvement.
Writing e.g. a dedicated write32() without function call overhead would be even faster (but would use quite some program memory).

To be continued.

@RobTillaart
Owner

RobTillaart commented Sep 3, 2024

@nt314p
Ooops! Found bugs in the test program, so the above results are not correct and the gains will be less significant.
Need to rerun the tests.

The size indications and code snippets seem OK.

@RobTillaart
Owner

New reference 0.3.3

FastShiftOut_test.ino
FASTSHIFTOUT_LIB_VERSION: 0.3.3

Performance - time in us
        write: 23.51
        write: 45.78
        Delta: 22.27

writeLSBFIRST: 22.51
writeLSBFIRST: 44.76
        Delta: 22.25

writeMSBFIRST: 22.51
writeMSBFIRST: 44.77
        Delta: 22.26

Standard shiftOut1: 89.85
Standard shiftOut2: 179.44
             Delta: 89.59

      write16: 45.65
      write16: 91.04
        Delta: 45.39

      write24: 67.90
      write24: 135.56
        Delta: 67.66

      write32: 90.16
      write32: 180.07
        Delta: 89.91

Test print interface
println("Hello world"): 	328.92
println(1357): 			311.60
println(3.14159265, 4): 	716.04

@RobTillaart
Owner

New optimized writeMSBFIRST:

FSO object configured as MSBFIRST

Performance - time in us
        write: 15.02
        write: 28.80
        Delta: 13.78

writeLSBFIRST: 22.51
writeLSBFIRST: 44.77
        Delta: 22.26

writeMSBFIRST: 13.96
writeMSBFIRST: 27.67
        Delta: 13.70

Standard shiftOut1: 89.85
Standard shiftOut2: 179.44
             Delta: 89.59

      write16: 28.74
      write16: 57.21
        Delta: 28.48

      write24: 42.44
      write24: 84.62
        Delta: 42.18

      write32: 56.15
      write32: 112.04
        Delta: 55.89


Test print interface
println("Hello world"): 	218.56
println(1357): 			260.72
println(3.14159265, 4): 	648.16

The gain for writeMSBFIRST is from 22.26 us ==> 13.70 us (~38%). Still interesting.

@RobTillaart
Owner

Will create a develop branch

RobTillaart added a commit that referenced this issue Sep 3, 2024
@RobTillaart
Owner

Created develop branch, please verify.

RobTillaart added a commit that referenced this issue Sep 3, 2024
@nt314p
Author

nt314p commented Sep 3, 2024

Woah, you've done a lot of work. One thing to quickly point out is that you can't optimize away the &= and |=. If the clock and data pins reside on the same PORT register, you need to read the updated value first before clearing or setting the bit.

For example, take a hypothetical 2-bit register (bit 1 = DATA, bit 0 = CLOCK) that is initialized to 00 before running the shift out function:

uint8_t cbmask1  = *_clockRegister | _clockBit; // 01
uint8_t cbmask2  = *_clockRegister & ~_clockBit; // 00
uint8_t outmask1 = *_dataOutRegister | _dataOutBit; // 10
uint8_t outmask2 = *_dataOutRegister & ~_dataOutBit; // 00

*_clockRegister = cbmask1; // this always clocks out DATA = 0
*_clockRegister = cbmask2;

@RobTillaart
Owner

One thing to quickly point out is that you can't optimize away the &= and |=. If the clock and data pins reside on the same PORT register.

100% right, good catch, will fix it asap

RobTillaart added a commit that referenced this issue Sep 3, 2024
@RobTillaart
Owner

RobTillaart commented Sep 3, 2024

@nt314p
Pushed an updated version of the develop branch.

new figures,

        write: 18.60
        write: 35.97
        Delta: 17.36

writeLSBFIRST: 17.61
writeLSBFIRST: 34.96
        Delta: 17.35

writeMSBFIRST: 17.60
writeMSBFIRST: 34.96
        Delta: 17.36

Standard shiftOut1: 89.85
Standard shiftOut2: 179.44
             Delta: 89.59

      write16: 35.84
      write16: 71.43
        Delta: 35.59

      write24: 53.19
      write24: 106.14
        Delta: 52.94

      write32: 70.54
      write32: 140.84
        Delta: 70.29

Test print interface
println("Hello world"): 	265.16
println(1357): 			282.24
println(3.14159265, 4): 	676.80

Timing of writeMSBFIRST goes from 22.26 us ==> 17.36 us (~22%), still worth an update.
Recall that about 10% is gained by disabling the interrupts only once, so the loop unroll itself gives about 12%.

Need to think hard if more optimizations are possible.

RobTillaart added a commit that referenced this issue Sep 3, 2024
@RobTillaart
Owner

Removed the SREG construction and used interrupts() instead; that saved one instruction (0.06 us).

Tried "do not change the data register if the bit is the same as before", but it took more time to track the change :)

@nt314p
Author

nt314p commented Sep 3, 2024

Nice. For sure more optimizations are possible. If the register(s) are known at compile time it is possible to get the timing under 10 us per byte. But compile-time pin-to-port mapping is a whole other issue.

Keep in mind that by using interrupts(), you enable interrupts even if the sketch did not have them enabled before. An alternative could be using ATOMIC_BLOCK and ATOMIC_RESTORESTATE from the avr library (see https://www.nongnu.org/avr-libc/user-manual/group__util__atomic.html). This is a way to create a block of code that is executed without interrupts. I do not think it has any performance benefit though.
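For reference, a minimal sketch of how that could wrap the existing inner loop (not the library code; <util/atomic.h> is the avr-libc header):

#include <util/atomic.h>               //  avr-libc: ATOMIC_BLOCK, ATOMIC_RESTORESTATE

  ATOMIC_BLOCK(ATOMIC_RESTORESTATE)    //  saves SREG, runs cli(), restores SREG on exit
  {
    for (uint8_t m = 1; m > 0; m <<= 1)
    {
      if ((value & m) == 0) *_dataOutRegister &= outmask2;
      else                  *_dataOutRegister |= outmask1;
      *_clockRegister |= cbmask1;
      *_clockRegister &= cbmask2;
    }
  }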

Another option could be letting the user manage interrupt enable/disable. I can think of instances where interrupts might be wanted or acceptable during the shift out, and other instances where interrupts would not be desired.

@RobTillaart
Owner

Good points,
The SREG code restores the interrupt flag as it was before the call, so that would work.

The atomic block is an option; I don't know the details so I have to do some reading.

Pins could be known only at runtime; that is similar to the standard shiftOut() function. I do not want to break that user expectation.
That said, you propose to remove the interrupt handling and leave it to the user. Makes some sense; I need to sleep over that thought.

To be continued later this week.

@RobTillaart
Owner

RobTillaart commented Sep 4, 2024

After a night's sleep: interrupt suppression is left to the user.
Will update the code and readme.md a.s.a.p.

RobTillaart added a commit that referenced this issue Sep 4, 2024
@nt314p
Author

nt314p commented Sep 4, 2024

I'm still confused as to where the rest of the clock cycles are going. Because 17.36 us is ~278 clock cycles, or ~34.75 clock cycles per bit. The register load-and/or-store should take 5 clock cycles based on LD,AND/OR,ST, so you have 5 + 5 for the clock pulse, another 5 for the data line, and even throw in 10 more (generously) for checking if the data bit is set and branching. That gives 25 cycles per bit, which leaves 278 - 25 x 8 = 78 clock cycles for function calling and return overhead? It seems excessive. Is there something I am missing?

@RobTillaart
Owner

RobTillaart commented Sep 4, 2024

Found a possible optimization in the clock pulse that shaves off a microsecond.

  *_clockRegister |= cbmask1;    //  set one bit
  *_clockRegister &= cbmask2;    // reset it

==>

  uint8_t r = *_clockRegister;
  *_clockRegister = r | cbmask1;  //  set one bit
  *_clockRegister = r;            //  reset it

Results

writeLSBFIRST: 16.22
writeLSBFIRST: 32.20
        Delta: 15.98    <<<<  was 17.17

If there is no interrupt that modifies the clock register it could work.

However:
The original *_clockRegister |= cbmask1; statement could also be interrupted with a change in the clock register.
Imagine the clock is on the same port as Serial1 and data is coming in.

oops....

==> need to restore the interrupt disabling with SREG in the code again.
It should be as robust as possible.

Will try to fix that today.

@RobTillaart
Owner

With noInterrupts() added, there is about 1 us gain from the above trick.

  uint8_t r = *_clockRegister;
  *_clockRegister = r | cbmask1;  //  set one bit
  *_clockRegister = r;            //  reset it
writeLSBFIRST: 16.42
writeLSBFIRST: 32.57
        Delta: 16.16   <<<  was 17.17 

Opinion?

@RobTillaart
Owner

Applying the same trick to the "for loop not unrolled" version

writeLSBFIRST: 20.06
writeLSBFIRST: 39.87
        Delta: 19.80   <<<  was 20.94   => 0.3.3 took 22.20 us

I need time for other repos, so to be continued.

@nt314p
Author

nt314p commented Sep 4, 2024

I agree. If you want the shift out to be robust, you'll need to apply the noInterrupts()/SREG pattern to make an atomic section of code. You could also use the ATOMIC_BLOCK I suggested earlier. No difference though I believe.

It's also a matter of how aggressively we use atomic blocks. We could do

// ...
ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
{
  *_clockRegister |= cbmask1;
}
ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
{
  *_clockRegister &= cbmask2;
}
// ...

or

// ...
ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
{
  *_clockRegister |= cbmask1;
  *_clockRegister &= cbmask2;
}
// ...

or just disable interrupts for each bit, or even during the entire shift out. The more finely the atomic blocks are applied, the more optimizations they prevent. For example, your clock pulse optimization cannot work if interrupts are allowed between the clock HIGH and the clock LOW.

Looking at the assembly, it seems that the program is repeatedly loading the data and clock register pointers into CPU registers. Perhaps try caching these two values at the beginning of the function. Avoiding those loads (the ldds) should save a considerable amount. Note that the indirect load and store instructions (think reading and writing *registerPointer) take 2 clock cycles each, I believe.

I've been taking a look at the assembly on my end, but I probably should just clone the repository to get more accurate results instead of just copying snippets.

@RobTillaart
Owner

RobTillaart commented Sep 5, 2024

ATOMIC_BLOCK per bit is preferred as it is also a logic block.
Would be 8 blocks of about 2 us with current timing.

ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
{
  if ((value & 0x80) == 0) *_dataOutRegister &= outmask2;
  else                     *_dataOutRegister |= outmask1;
  uint8_t r = *_clockRegister;
  *_clockRegister = r | cbmask1;  //  set one bit
  *_clockRegister = r;            //  reset it
}

https://blog.oddbit.com/post/2019-02-01-atomicblock-magic-in-avrlibc/
https://www.nongnu.org/avr-libc/user-manual/group__util__atomic.html

Under the hood it just disables and restores interrupts, so I will keep my own noInterrupts() construct.

@RobTillaart
Owner

Updated the develop branch and created a PR for the 0.4.0 release.
Think this version has to be tested more before adding more optimizations to it.
Please download the develop branch and let me know if you find problems.

@RobTillaart
Owner

@nt314p
Created a PR for FastShiftIn class today. - RobTillaart/FastShiftIn#18

Performance gains are looking good. Snippet from Readme.md

|  function            |   0.2.3  |   0.3.2  |   0.4.0  |  0.4.0L  |
|:---------------------|---------:|---------:|---------:|---------:|
|  read()              |   19.30  |   20.49  |   20.57  |   12.40  |
|  read16()            |          |   41.04  |   41.11  |   25.64  |
|  read24()            |          |   62.91  |   62.68  |   39.47  |
|  read32()            |          |   83.95  |   82.86  |   51.92  |
|  readLSBFIRST()      |   19.04  |   19.92  |   19.82  |   12.08  |
|  readMSBFIRST()      |   19.04  |   19.92  |   19.81  |   12.07  |
|  reference shiftIn() |  107.82  |  108.20  |  108.20  |  108.05  |

- Note: 0.4.0L measured with loop unrolled flag enabled. 
- Note: 0.4.0 measured with loop unroll flag disabled.

I will only merge after FastShiftInOut has been updated too as these three libs are quite related.

@RobTillaart RobTillaart linked a pull request Sep 10, 2024 that will close this issue
@RobTillaart
Owner

@nt314p
Created a PR for FastShiftInOut class today. - RobTillaart/FastShiftInOut#8

Performance gain is about 25% so 👍

Will start merging the three PRs tomorrow.

@nt314p
Author

nt314p commented Sep 10, 2024

Great stuff. Do you intentionally disable interrupts during the entire shift out for the unrolled writeLSBFIRST function (and in other functions)? I thought it would just be per bit.

Also, as I mentioned before, I believe you should cache the volatile register pointers in locals, to avoid extra indirect loads fetching the pointers for every bit.

For example:

size_t FastShiftOut::writeLSBFIRST(uint8_t data)
{
  // other initialization...
  volatile uint8_t* localDataOutRegister = _dataOutRegister;
  volatile uint8_t* localClockRegister = _clockRegister;
  // ...

This should give another performance boost, although I do not have the hardware with me to test it.

@RobTillaart
Owner

This should give another performance boost, although I do not have the hardware with me to test it.

You can test this with just an UNO and nothing connected, as there are no return signals.
I will give it a try.

@RobTillaart
Owner

RobTillaart commented Sep 10, 2024

Yes, it is quite a bit faster.

Only patched writeLSBFIRST(), and it shaves off another 4.4 us; see the difference with the unpatched writeMSBFIRST():

writeLSBFIRST: 11.76
writeLSBFIRST: 23.26
        Delta: 11.50

writeMSBFIRST: 16.16
writeMSBFIRST: 32.07
        Delta: 15.91

The bit rate is approaching 700,000 bits/second.

TODO: update code and readme.md (maybe tomorrow)

@RobTillaart
Owner

Patched the non-unrolled loop too and moved the interrupt block out of the loop (per byte instead of per bit).

writeLSBFIRST: 14.34
writeLSBFIRST: 28.42
        Delta: 14.09

writeMSBFIRST: 22.26
writeMSBFIRST: 44.27
        Delta: 22.01

Almost 8 us off of 22 us = ~35% faster.

@RobTillaart
Owner

RobTillaart commented Sep 10, 2024

Did an update for FastShiftOut ==> roughly 45% improvement versus 0.3.3


|  function                |  0.2.4  |   0.3.1  |   0.3.3  |   0.4.0  |  0.4.0L  |
|:-------------------------|--------:|---------:|---------:|---------:|---------:|
|  write()                 |  21.66  |   22.48  |   22.27  |   14.10  |   11.51  |
|  writeLSBFIRST()         |  22.94  |   23.37  |   22.25  |   14.09  |   11.50  |
|  writeMSBFIRST()         |  20.30  |   21.86  |   22.26  |   14.08  |   11.50  |
|  reference shiftOut()    |  89.74  |   89.74  |   89.59  |   89.60  |   89.60  |
|  write16()               |   na    |    na    |   45.39  |   29.06  |   23.89  |
|  write24()               |   na    |    na    |   67.66  |   43.12  |   35.40  |
|  write32()               |   na    |    na    |   89.91  |   57.22  |   46.90  |
|  println("Hello world")  |   na    |  328.92  |  328.92  |  222.68  |  189.20  |
|  println(1357)           |   na    |  313.56  |  311.60  |  262.60  |  247.12  |
|  println(3.14159265, 4)  |   na    |  717.36  |  716.04  |  650.68  |  629.96  |

@nt314p
Author

nt314p commented Sep 10, 2024

Nice! I actually don't even have an Arduino with me. Really no hardware at all. I'm wondering what hardware setup can be used to verify that the output is behaving as we expect. It's possible we've overlooked something that breaks the shift out that the software tests cannot detect.

Somehow ~45% improved feels like an understatement. I think we shaved off 45% of the time, but that made it nearly 100% faster!

@RobTillaart
Owner

It's possible we've overlooked something that breaks the shift out that the software tests cannot detect.

One thing is sure: going twice as fast means the signals suffer more from RC effects.
The pulses will be less "square"; I expect that will be visible on an oscilloscope.

to be continued.

@RobTillaart
Owner

@nt314p

New timing for FastShiftIn with local register pointers ==> crossed the 10 us border.
(Verification with hardware still to be done.)

|  function            |   0.2.3  |   0.3.2  |   0.4.0  |  0.4.0L  |
|:---------------------|---------:|---------:|---------:|---------:|
|  read()              |   19.30  |   20.49  |   12.71  |    9.51  |
|  read16()            |          |   41.04  |   25.39  |   18.98  |
|  read24()            |          |   62.91  |   39.10  |   29.48  |
|  read32()            |          |   83.95  |   51.42  |   38.60  |
|  readLSBFIRST()      |   19.04  |   19.92  |   11.96  |    8.81  |
|  readMSBFIRST()      |   19.04  |   19.92  |   11.94  |    8.75  |
|  reference shiftIn() |  107.82  |  108.20  |  108.05  |  108.05  |

@RobTillaart
Owner

@nt314p

And the numbers for FastShiftInOut() with local registers.

|  function                |   0.1.3  |   0.2.0  |   0.2.0L  |
|:-------------------------|---------:|---------:|----------:|
|  write() (reference)     | no data  |  158.24  |  no data  |
|  write()                 |   25.52  |   17.61  |    12.26  |
|  writeLSBFIRST()         |   25.52  |   17.61  |    12.26  |
|  writeMSBFIRST()         |   25.52  |   17.60  |    12.20  |


- Note: 0.1.3 added from old table.
- Note: reference run on AVR by commenting all optimizations.
- Note: 0.2.0 measured with loop unroll flag disabled.
- Note: 0.2.0L measured with loop unrolled flag enabled.

@nt314p
Author

nt314p commented Sep 14, 2024

For the sake of cleaner code, you might consider using a macro to unroll the loop. For example,

#define UNROLL_2(x) x x
#define UNROLL_4(x) UNROLL_2(x) UNROLL_2(x)
#define UNROLL_8(x) UNROLL_4(x) UNROLL_4(x)

To make the code inside the UNROLL repeatable, you can shift the byte value instead of changing the bit mask you check each time (currently you do if ((value & 0x80) == 0), if ((value & 0x40) == 0), ...). For MSB first you can do,

UNROLL_8(
  // ...
  if ((value & 0x80) == 0) // ...
  // ...
  value <<= 1;
);

The compiler should recognize this shifting pattern and optimize it to a compile time bit check (generating the same assembly as before).
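Put together, a sketch of the MSB-first body could look like this (the local register pointer names follow the earlier suggestion; illustrative only, not the library code):

#define UNROLL_2(x) x x
#define UNROLL_4(x) UNROLL_2(x) UNROLL_2(x)
#define UNROLL_8(x) UNROLL_4(x) UNROLL_4(x)

  UNROLL_8(
    if ((value & 0x80) == 0) *localDataOutRegister &= outmask2;
    else                     *localDataOutRegister |= outmask1;
    *localClockRegister |= cbmask1;
    *localClockRegister &= cbmask2;
    value <<= 1;
  )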

@RobTillaart
Owner

The gain is only in maintainability.

First step (imho) is to verify the current code with hardware.
Then release it as a "reference".
Then look for additional optimizations.

@RobTillaart
Owner

@nt314p
Finally found time to check the code with a scope to see the signal quality.
(Disclaimer: photo taken with a phone, as the scope froze on USB.)

Arduino UNO, LOOP UNROLLED

YELLOW = DATA
PURPLE = CLOCK

  • Note a byte takes between 8 and 9 microseconds.
    (Without loop unrolling it is about 11 us; no artifacts.)
  • The data line shows a nice 0x55 pattern, alternating 0b01010101.

(scope photo of a full byte)

Zooming in on a single bit of the clock, one sees the signal looks good.
(it has a pull up resistor)

  • rise time ~20 ns
  • fall time ~40 ns
  • high time ~100 ns
  • clock pulse ==> 2 clock cycles on the UNO
  • the data pulse is rather long compared to the clock pulse
  • minor noise (ripple) on the data line when the clock rises / falls (and vice versa)

(scope photo of a single clock bit)

Think the most expensive part is now setting the output line.

  if ((value & 0x01) == 0) *localDataOutRegister &= outmask2;
  else                     *localDataOutRegister |= outmask1;

As this is done again for every bit, it might be inefficient.
For the bit pattern 01010101 one needs to change the data register 8 times.
For the bit pattern 11110000 one only needs to change the data register 2 times.
And for the bit pattern 11111111 one needs to change the data register only 1 time.
So remembering the previous state and not writing when not needed might strip off one or more clock cycles, as sketched below.
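A sketch of that idea (illustrative only, not library code; names follow the earlier snippets):

  uint8_t lastBit = 2;                          //  impossible value forces the first write
  for (uint8_t m = 0x80; m > 0; m >>= 1)
  {
    uint8_t bit = (value & m) ? 1 : 0;
    if (bit != lastBit)                         //  only touch the data register on a change
    {
      if (bit == 0) *localDataOutRegister &= outmask2;
      else          *localDataOutRegister |= outmask1;
      lastBit = bit;
    }
    *localClockRegister |= cbmask1;             //  clock pulse for every bit
    *localClockRegister &= cbmask2;
  }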

@RobTillaart
Owner

So all looks good; I'm going to merge after uploading FastShiftOut_scope_test.ino.

RobTillaart added a commit that referenced this issue Sep 19, 2024
- fix #15, loop unroll option, improving performance, kudos to nt314p
- fixed bug in test program (see #15)
- added flag to select LOOP UNROLL (is optional as it gives larger code size)
- optimized the not unrolled loop with ideas of the unrolling version.
- corrected type lastValue to uint8_t
- add FastShiftOut_scope_test.ino
- update readme.md
- minor edits
@nt314p
Author

nt314p commented Sep 19, 2024

Neat how the changes in the lines affect one another. You could have a bitmask that checks the current and next bits to see if they're 00 or 11. Might be interesting to investigate.

One other optimization you can do is eliminating some expensive jumps when setting the data register. You can even see on the scope that after the previous clock pulse, it takes longer to set the data line than to reset it. It does require some inline assembly, but only to prevent the compiler from optimizing the unconditional set away; the asm barrier forces the compiler to generate an alternative sequence, which is actually faster.

uint8_t d = *localDataOutRegister;
d |= outmask1; // always set bit
asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
if ((value & 0x01) == 0) d &= outmask2; // only reset if needed
*localDataOutRegister = d;

Should another issue be opened for more optimization discussions?

@RobTillaart
Owner

Should another issue be opened for more optimization discussions?

Think that is the way forward.
There might be some feedback / questions on the new 0.4.0 release too.
(Including comments on the related libs I released today).

@RobTillaart
Owner

Please copy the ideas to the new issue
