On the FAUST mailing list, Dario Sanfilippo suggested that we change the default one-pole smoother fi.smoo in filters.lib from the form (1 - b) * x + b * y to instead x + b * (y - x), where x is the input sample, y the previous output, and b the pole, recognizing that multiplications are more expensive in hardware than additions. This appendix was written from my email reply to the list:
This is a winner. I am strongly in favor of the change. One multiply and two additions is fundamentally less work than two multiplies and one addition. However, when two multiplies can be carried out in parallel, (1 - b) * x + b * y can be faster than x + b * (y - x), because it takes two steps instead of three. Thus, a parallel-processing implementation might prefer the first form.
Ideally both forms would compile to the same assembly, but this is not yet the case. Neither the FAUST compiler nor the C++ compiler appears to minimize multiplies relative to additions when the target architecture warrants it.
Of course we should run benchmarks to measure the actual improvement on each architecture, but looking at the assembly can also give the answer. I recently learned about the Compiler Explorer at godbolt.org, which compares assembly output across various processors and compilers, and this was my first use of it:
First, the FAUST source, adapted from the mailing-list thread, is shown in Fig.23.
```faust
import("stdfaust.lib");

smooth(coeff, x) = fb ~ _ with {
  fb(y) = y + (1.0 - coeff) * (x - y);
};

c = 1.0 - 44.1 / ma.SR;

smooth3(s, x) = fb ~ _ with {
  fb(y) = s * (y - x) + x;
};

process = _ <: si.smooth(c), smooth(c), smooth3(c);
```
Next, I compiled the FAUST source at the command line with no options, and copy/pasted the compute() function to create the stand-alone code snippet shown in Fig.24. (Note that it's no longer a virtual function.)
```cpp
#define FAUSTFLOAT float

int fSampleRate = 44100;
float fConst0 = 0.1; // linear-interpolation constant
float fConst1 = 0.9; // 1 - fConst0
float fRec0[2];
float fRec1[2];
float fRec2[2];

void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs)
{
    FAUSTFLOAT* input0  = inputs[0];
    FAUSTFLOAT* output0 = outputs[0];
    FAUSTFLOAT* output1 = outputs[1];
    FAUSTFLOAT* output2 = outputs[2];
    for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) {
        float fTemp0 = float(input0[i0]);
        fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0));
        output0[i0] = FAUSTFLOAT(fRec0[0]);
        fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1])));
        output1[i0] = FAUSTFLOAT(fRec1[0]);
        fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0)));
        output2[i0] = FAUSTFLOAT(fRec2[0]);
        fRec0[1] = fRec0[0];
        fRec1[1] = fRec1[0];
        fRec2[1] = fRec2[0];
    }
}
```
This code can be pasted into the left panel of the Compiler Explorer at godbolt.org. Next, choose your processor architecture, compiler, and compiler options in the right panel. Here I chose the first Intel case (more readable than ARM): x86-64 clang (assertions trunk), with compiler options -std=c++17 -O3. The term "trunk" refers to the latest version of the compiler source, but you can also try the earlier compiler versions listed next to it. WebAssembly, alternative compilers, and many embedded processors appear as choices. My choice is ideal for the iMac Pro where I do most of my work, but I need to check ARM as well for my iOS work. The Compiler Explorer is a great tool for fine-tuning performance at the lowest level.
Figure 25 shows the assembly output, with comments added to indicate where I guessed each piece came from. You can see that the specified computation structure is preserved all the way down to the bottom, even with -O3 optimization. The clear winner is smooth3, and benchmarks should confirm this. It has only one multiply and two additions, and only six instructions (lines of assembly code) in total.
```asm
compute(int, float**, float**):                 # @compute(int, float**, float**)
        ...
.LBB0_2:                                        # %for.body
# output0[i0] = ((fConst1 * fRec0[1]) + (fConst0 * input0[i0])), 7 lines:
        movss   xmm1, dword ptr [r8 + 4*rax]    # xmm1 = mem[0],zero,zero,zero
        mulss   xmm0, dword ptr [rip + fConst1]
        movss   xmm2, dword ptr [rip + fConst0] # xmm2 = mem[0],zero,zero,zero
        mulss   xmm2, xmm1
        addss   xmm2, xmm0
        movss   dword ptr [rip + fRec0], xmm2
        movss   dword ptr [rcx + 4*rax], xmm2
# output1[i0] = (fRec1[1] + (fConst0 * (input0[i0] - fRec1[1]))), 7 lines:
        movss   xmm0, dword ptr [rip + fRec1+4] # xmm0 = mem[0],zero,zero,zero
        movaps  xmm2, xmm1
        subss   xmm2, xmm0
        mulss   xmm2, dword ptr [rip + fConst0]
        addss   xmm2, xmm0
        movss   dword ptr [rip + fRec1], xmm2
        movss   dword ptr [rsi + 4*rax], xmm2
# output2[i0] = (input0[i0] + (fConst1 * (fRec2[1] - input0[i0]))), 6 lines:
        movss   xmm0, dword ptr [rip + fRec2+4] # xmm0 = mem[0],zero,zero,zero
        subss   xmm0, xmm1
        mulss   xmm0, dword ptr [rip + fConst1]
        addss   xmm0, xmm1
        movss   dword ptr [rip + fRec2], xmm0
        movss   dword ptr [rdx + 4*rax], xmm0
# fRec0[1] = fRec0[0];
        movss   xmm0, dword ptr [rip + fRec0]   # xmm0 = mem[0],zero,zero,zero
        movss   dword ptr [rip + fRec0+4], xmm0
# fRec1[1] = fRec1[0];
        movss   xmm1, dword ptr [rip + fRec1]   # xmm1 = mem[0],zero,zero,zero
        movss   dword ptr [rip + fRec1+4], xmm1
# fRec2[1] = fRec2[0];
        movss   xmm1, dword ptr [rip + fRec2]   # xmm1 = mem[0],zero,zero,zero
        movss   dword ptr [rip + fRec2+4], xmm1
# i0 = i0 + 1
        add     rax, 1
        cmp     rdi, rax
        jne     .LBB0_2
#------------------------ end of loop ---------------------------
        ...
```
Stéphane Letz commented in the discussion thread that in his tests using faustbench-llvm on an Apple M1, process = par(i, 10, si.smoo); was faster, while the following ran a bit slower:
```faust
voice(i) = os.osc(400 + i*300) : si.smoo;
process = par(i, 10, voice(i));
```
This is not yet understood.