* added bitwise operation based sqrt16
- replacement for fastled, it is about 10% slower for numbers smaller 128 but faster for larger numbers. speed difference is irrelevant to WLED but it saves some flash.
* updated to 32bit, improved for typical WLED use
- making it 32bits allows for larger numbers
- added another initial condition check for medium sized numbers
- increased the "small number" optimization to larger numbers: the function is currently only used to calculate sqrt(x^2+y^2) which even for small segments is larger than the initially used 64, so optimizing for 1024 makes more sense, although the value is arbitrarily chosen
- sin/cos calls with incrementing numbers can lead to bad outcomes, the functions (_approx or original sinf/cosf) return bad values for very large float inputs
-replaced all PI references with M_PI version
-there is no need to do the angle-modulo in float, casting it to an integer does the same BUT it has to be cast to an `int` first, see comment.
- `sin16_t() / cos16_t()` are faster and more accurate than fastled versions
- `sin_approx() / cos_approx()` are float wrappers for `sin16_t() / cos16_t()` and are accurate enough to replace `sinf()/cosf()`
- `atan2()` is used only in octopus to calculate center offset, new approximated version saves flash
- `tan(), atan(), asin(), acos(), floor(), fmod()` are used only for sunrise/sunset calculation, using wled_math version saves flash
- `beatsinx()` replacements are to make use of new `sin16_t()/sin8_t()` functions to reduce flash size
- Extensively tested surnise/sunset calculation: deviation is 1min. max
- Tested some of the relevant FX and found no visual difference: Julia, 2D Drift, Drift Rose, Ghost rider, Rotozoomer, Palette, Arc 1D expansion
- total flash savings: 7.4k