-
-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added integer based sin()/cos()
functions, changed all trig functions to wled_math
#4181
Conversation
…ns to wled_math - `sin16_t() / cos16_t()` are faster and more accurate than fastled versions - `sin_approx() / cos_approx()` are float wrappers for `sin16_t() / cos16_t()` and are accurate enough to replace `sinf()/cosf()` - `atan2()` is used only in octopus to calculate center offset, new approximated version saves flash - `tan(), atan(), asin(), acos(), floor(), fmod()` are used only for sunrise/sunset calculation, using wled_math version saves flash - `beatsinx()` replacements are to make use of new `sin16_t()/sin8_t()` functions to reduce flash size - Extensively tested surnise/sunset calculation: deviation is 1min. max - Tested some of the relevant FX and found no visual difference: Julia, 2D Drift, Drift Rose, Ghost rider, Rotozoomer, Palette, Arc 1D expansion - total flash savings: 7.4k
sin()/cos()
functions, changed all trig functins to wled_mathsin()/cos()
functions, changed all trig functions to wled_math
Is this with or without audioreactive usermod? Because ArduinoFFT pulls in some trig functions (sin,cos,atan, sqrt) -so possibly this finally increases flash size because two variants of the function need to be linked into the final binary. Edit: I have a fork of ArduinoFFT where "double" trigonometry functions are replaced with float functions (sinf, cosf, atanf). I thinks it's not possible to replace them with approximations ... FFT could become very noisy if we do, because the fourier transform of the window function must be as exact as possible. Any inaccuracy will add "ghost frequencies". |
One question: is the approximated function also faster than sinf() and cosf()? In earlier tests I was suprised to find that the libm functions are really fast (faster than fastled sin16/cos16), at least on esp32 and ESP32-S3 with FPU. |
I'll give it a try with pinwheel mapping in the next days. Pinwheel is very sensitive to inaccuracies in sinf() and cosf(), and possibly leaves holes as a consequence. |
// the math.h functions use several kB of flash and are to be avoided if possible | ||
// sin16_t / cos16_t are faster and much more accurate than the fastled variants | ||
// sin_approx and cos_approx are float wrappers for sin16_t/cos16_t and have an accuracy of +/-0.0015 compared to sinf() | ||
// sin8_t / cos8_t are fastled replacements and use sin16_t / cos16_t. Slightly slower than fastled version but very accurate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 very good that we have a replacement for sin8 and cos8.
These FastLed functions were so inaccurate that you could not even draw a circle with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the comparison between the new version and fastled (compared to sinf()
)
sin8() absolute error, output range is ±127:
sin16() absolute error, output range is ±32767:
the new sin8_t() is a bit slower than the fastled variant due to overhad in converstion to 16bit range. The approximation can be re-written to any scale so it could be done for 8bit inputs/outputs natively but since sin16_t() is now faster than sin8() it would make more sense to use the sin16_t() instead wherever speed is of utmost importance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I like the graphs 👍 very useful to understand accuracy and limitations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but since sin16_t() is now faster than sin8() it would make more sense to use the sin16_t() instead wherever speed is of utmost importance.
I agree - I don't think we need a specialized sin8, as most "sine wave" effects are using beatsin88() and not sin8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea about sin8_t and cos8_t:
You could move these two functions into fx.cpp and make them "static inline". This will allow the compiler to optimize out function calls (->faster). I think there is not other place where these 8bit version would be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is not other place where these 8bit version would be needed.
It is used in beatsin8, which is also used in sound simulation in util.cpp.
While making that an inline function could boost speed, I don't think it those extra few instructions will do any harm. If we really want it faster, it should have its own dedicated function (which is doable, it uses roughly 100 bytes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(oops wrong thread) see #4181 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that works, maybe a good tradoff between accuracy and speed.
I've just compared your new trig functions against the old ones, using the "Lissajous" effect on 128x64. It's really a major difference. The "fastled" lissajous curve looks very edgy and somehow wrong, more like a twisted hexagon. With your new functions, I see a perfect wave.
Good job👍 thanks.
Goot point. I checked and it is the same: ArduinoFFT uses cos(x) which is the double float implementation and sqrt() was not touched in this PR. |
Another good point. I did not actually measure that. But now I checked with rough test. Here are my findings:
I am not 100% sure about the method I used so no warranty about correctness of these numbers. Using the same method for sin16() functions gave like a 50x speed increase, which seems impossible without the compiler playing some tricks. So on an S3 the difference is (as expected) not as large thanks to the FPU but still an improvement. On S2, C3 (and ESP8266) the speed-gain is huge. |
I am looking at pinwheel RN. The issue is not the precision of sin() cos() but the number of steps: edit: |
@DedeHai pinwheel has a general problem with holes - we tested it until 56x56, so below 56x56 there should be no holes (using sinf()/cosf(), not the _t approximations from AC). |
512 was just a "shot in the dark". You're right we could increase the value to 2^15 or higher. |
I saw that and I am trying to work out the math why that is.
|
@softhack007 looking at the pinwheel some more. I have some fixes but also: why is the center calculated in such a complicated way (with shifting odd rays)? I removed all that code and to me it still looks the same. Can you explain the intentions? |
@DedeHai the "odd rays" part was an invention from @Brandon502, to reduce the number of "full rays" by only drawing outer parts of odd-numbered rays. That was around 30% faster in our tests. Its correct that omitting this optimization should not influence the look. But the odd-rays optimization should be kept as-is for performance reasons. |
@DedeHai would be nice if we find an improvement, because each time I get a larger setup, there are new holes in pinwheel. Looks like a kind of Moire effect. I've found the "magic parameters" up to 128x128: see https://github.com/MoonModules/WLED/blob/a84216947bfb28488e58837d5a6732836be93d6c/wled00/FX_fcn.cpp#L833-L854 I'm really curious to know if there is a better way to avoid these inaccuracies, while preserving speed. |
How much memory can a expand1D type use? If it could work similar to jMap you could create a lookup map on the first call then never have to do the math again. Would never produce holes and likely be much faster, but the memory footprint might be too big for 128x128+ |
@Brandon502 I have sort of working code improvements: it may take a slight hit on loop speed but it work for larger sizes and reduces steps for smaller ones by dynamically calculating the required steps, resulting in an overall speed improvement. Also "pixel holes" are controlled better by reducing the step size, running more 'blind' loops (but setting less pixels): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested and works well for me.
approximation was incorrect, now doing it right. also removed hypotf() from octopus, saving a little flash.
wled00/wled_math.cpp
Outdated
@@ -88,8 +88,8 @@ uint8_t cos8_t(uint8_t theta) { | |||
float sin_approx(float theta) | |||
{ | |||
theta = modd(theta, TWO_PI); // modulo: bring to -2pi to 2pi range | |||
if(theta < 0) theta += M_TWOPI; // 0-2pi range | |||
uint16_t scaled_theta = (uint16_t)(theta * (0xFFFF / M_TWOPI)); | |||
if(theta < 0) theta += TWO_PI; // 0-2pi range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? Why not change M_PI
below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question. I think this was auto-complete suggestion from co-pilot.
I will make it consistant to M_PI variants.
Also: the conversion of the angle using float is really not needed, I added that to fix a bug. but the actual fix is: cast a negative float to a int
and NEVER to a unsigned
type, as that behaviour is undefined and depends on the CPU/Compiler, hence the issues on the C3 which does (uint)(float) = 0
-replaced all PI references with M_PI version -there is no need to do the angle-modulo in float, casting it to an integer does the same BUT it has to be cast to an `int` first, see comment.
IMO, no but there is no reason to take my answer into account. |
@DedeHai thanks I think your judgement as the original author is very important. I agree with with, better to get some confidence (beta testing) before releasing this (and #4138) to all users. 👍 So we'll merge all the PRs tagged as "optimization" after 0.15.0 is released, right? |
I think the decision is up to @Aircoookie whether to push 0.15 release ASAP or add in these improvements. If it were up to me I would add all pending improvements I and @blazoncek made as they improve FX speed and FPS and FX quality quite a bit but only if these are extensively tested as there are chances of introducing new bugs. By all I mean this PR as well as #4138 #4145 #4245 |
FYI I just stumbled upon this PR, and decided to run some quick execution time benchmarks on the ESP32 (lower is better): Btw. I believe the |
@TripleWhy there's also code size aspect to sin_t. This was the primary reason @Aircoookie introduced these functions. |
@TripleWhy interesting results, do you have a link to the code you used? Would be interesting to compare other MCUs. One thing that is really strange in your graph is that |
Thanks @TripleWhy for the benchmark 🥇
Finally someone has an explanation for the "_t" - it's not written in the code, and my best guess was it stands for "trick" 😆 As nobody except the three of us knows what "_t" stands for, it does not matter much which letter we pick. But for minimal code change, imho it would be good to keep the "_t". |
I didn't mean to give any recommendation here, just information.
Just pushed it here: https://github.com/TripleWhy/WLED/tree/benchTrig
And some float<->int conversions.
I guess float additions and calls are expensive? Well... According to this diagram, I have no idea what's going on. I guess more fiddling is required here.
Very. Look at the raw log of sin_t vs sin_approx and cos_t vs cos_approx. They are aliases, I just left them in for sanity checking. The numbers are identical. There is also (almost) no run-to run variation, these are fixed frequency single threaded risc chips without task scheduling after all. You are free to check the code and run it for yourself.
I don't disagree. As I said, I was just publishing information that I generated locally. |
Oh right, I also did a size comparison:
|
Uh... looking at the code a bit more, it appears that this aliasing isn't intentional? |
In that case the original implementations should be removed.
Interesting. What are the differences to my runs? |
I wanted to keep the original functions as they do take a completely different approach. I could comment them out for readability though.
I ran it on a regular ESP32, this is the code I used: edit: I am not 100% sure the compiler does not play tricks, using |
I want to clarify that anything I introduced into WLED was But that does not matter. There may have been time in WLED development that needed those functions and that time may have passed. It may be time to move forward, there was a lot of development and progress in every way since that introduction. There will always be a reason to optimise either for speed or accuracy or code size. It is difficult to have all at once. As for this PR: Since @DedeHai started to replace FastLED core functions (here and elsewhere) it might be beneficial to remove FastLED entirely from dependencies if those replacements are not copy/paste and perform better. |
- speed improvement: by default M_TWOPI is treated as a double float - directly calling sin16_t in cos_approx() saves a lot of overhead
Are you happy for this to go into the 0.15.0.rc1 @softhack007 or wait for 0.1.5.1 ? |
I think it could even go into 0.15.0.rc1 - just that @DedeHai previously said he would like to see some more confidence testing on 8266. If @DedeHai is happy with the benchmarks and tests we have, then OK for me to put this into 0.15.0.rc1. |
I think it can go into 0.15.0.rc1, @TripleWhy an me were running some more tests and it is looking good, I will post results in a bit, once I confirmed it works on S2, S3 and ESP8266 without any issues. |
Ok, if you can fix the merge conflict @DedeHai fixed, we can merge |
my tests are done, its all looking good but there may be a small pitfall @TripleWhy found: calling As for my final results: here are some graphs. Input to float versions was approximately 0-2pi, integer versions just use a increment of 1 for each loop. |
…th incrementing number - sin/cos calls with incrementing numbers can lead to bad outcomes, the functions (_approx or original sinf/cosf) return bad values for very large float inputs
Added integer based `sin()/cos()` functions, changed all trig functions to wled_math
I wrote a new integer math based sine/cosine approximation function implementing Bhaskara I's approximation formula. It is blazing fast and quite accurate (almost 3 significant digits when converting to float).
sin16_t() / cos16_t()
are more accurate than fastled versionssin_approx() / cos_approx()
are float wrappers forsin16_t() / cos16_t()
and are accurate enough to replacesinf()/cosf()
for all current WLED purposesatan2()
is used only in octopus to calculate center offset, new approximated version saves flashtan(), atan(), asin(), acos(), floor(), fmod()
are used only for sunrise/sunset calculation, using existing wled_math version saves flashbeatsinx()
replacements are to make use of newsin16_t()/sin8_t()
functions to reduce flash size