On the FAUST mailing list, Dario Sanfilippo suggested that we change the default one-pole smoother fi.smoo in filters.lib from the form (1 - b) * x + b * y to instead x + b * (y - x), where x is the input sample, y the previous output, and b the pole, recognizing that multiplications are more expensive in hardware than additions. This appendix was written from my email reply to the list:
This is a winner. I am strongly in favor of the change. One multiply and two additions is fundamentally less work than two multiplies and one addition. However, when two multiplies can be carried out in parallel, (1 - b) * x + b * y can be faster than x + b * (y - x), because it takes two steps instead of three. Thus, a parallel-processing implementation might prefer the first form.
Ideally both forms would compile to the same assembly, but this is not yet the case. Neither the FAUST compiler nor the C++ compiler appears to minimize multiplies relative to additions when the target architecture warrants it.
Of course we should run benchmarks to measure the actual improvement on each architecture, but looking at the assembly can also give the answer. I recently learned about the Compiler Explorer at godbolt.org, which compares assembly output across various processors and compilers, and this was my first use of it:
First, the FAUST source, adapted from the mailing-list thread, is shown in Fig.23.
```faust
import("stdfaust.lib");

smooth(coeff, x) = fb ~ _ with {
  fb(y) = y + (1.0 - coeff) * (x - y);
};

c = 1.0 - 44.1 / ma.SR;

smooth3(s, x) = fb ~ _ with {
  fb(y) = s * (y - x) + x;
};

process = _ <: si.smooth(c), smooth(c), smooth3(c);
```
Next, I compiled the FAUST source at the command line with no options, and copy/pasted the compute() function to create the stand-alone code snippet shown in Fig.24. (Note that it's no longer a virtual function.)
```cpp
#define FAUSTFLOAT float

int fSampleRate = 44100;
float fConst0 = 0.1; // linear-interpolation constant
float fConst1 = 0.9; // 1 - fConst0
float fRec0[2];
float fRec1[2];
float fRec2[2];

void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs)
{
    FAUSTFLOAT* input0  = inputs[0];
    FAUSTFLOAT* output0 = outputs[0];
    FAUSTFLOAT* output1 = outputs[1];
    FAUSTFLOAT* output2 = outputs[2];
    for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) {
        float fTemp0 = float(input0[i0]);
        fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0));
        output0[i0] = FAUSTFLOAT(fRec0[0]);
        fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1])));
        output1[i0] = FAUSTFLOAT(fRec1[0]);
        fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0)));
        output2[i0] = FAUSTFLOAT(fRec2[0]);
        fRec0[1] = fRec0[0];
        fRec1[1] = fRec1[0];
        fRec2[1] = fRec2[0];
    }
}
```
This code can be pasted into the left panel of the Compiler Explorer at godbolt.org. Next, choose your processor architecture, compiler, and compiler options in the right panel. Here I chose the first Intel case (more readable than ARM): x86-64 clang (assertions trunk), with compiler options -std=c++17 -O3. The term "trunk" refers to the latest version of the compiler source, but you can also try the earlier compiler versions listed next to it. WebAssembly, alternative compilers, and many embedded processors appear as choices. My choice is ideal for the iMac Pro where I do most of my work, but I need to check ARM as well for my iOS work. The Compiler Explorer is a great tool for fine-tuning performance at the lowest level.
Figure 25 shows the assembly output, with comments added to indicate where I guessed each piece came from. You can see that the specified computation structure is preserved all the way down to the bottom, even with -O3 optimization. The clear winner is smooth3, and benchmarks should confirm this. It has only one multiply and two additions, and only six instructions (lines of assembly code) in total.
```asm
compute(int, float**, float**):                 # @compute(int, float**, float**)
        ...
.LBB0_2:                                        # %for.body
# output0[i0] = ((fConst1 * fRec0[1]) + (fConst0 * input0[i0])), 7 lines:
        movss   xmm1, dword ptr [r8 + 4*rax]    # xmm1 = mem[0],zero,zero,zero
        mulss   xmm0, dword ptr [rip + fConst1]
        movss   xmm2, dword ptr [rip + fConst0] # xmm2 = mem[0],zero,zero,zero
        mulss   xmm2, xmm1
        addss   xmm2, xmm0
        movss   dword ptr [rip + fRec0], xmm2
        movss   dword ptr [rcx + 4*rax], xmm2
# output1[i0] = (fRec1[1] + (fConst0 * (input0[i0] - fRec1[1]))), 7 lines:
        movss   xmm0, dword ptr [rip + fRec1+4] # xmm0 = mem[0],zero,zero,zero
        movaps  xmm2, xmm1
        subss   xmm2, xmm0
        mulss   xmm2, dword ptr [rip + fConst0]
        addss   xmm2, xmm0
        movss   dword ptr [rip + fRec1], xmm2
        movss   dword ptr [rsi + 4*rax], xmm2
# output2[i0] = (input0[i0] + (fConst1 * (fRec2[1] - input0[i0]))), 6 lines:
        movss   xmm0, dword ptr [rip + fRec2+4] # xmm0 = mem[0],zero,zero,zero
        subss   xmm0, xmm1
        mulss   xmm0, dword ptr [rip + fConst1]
        addss   xmm0, xmm1
        movss   dword ptr [rip + fRec2], xmm0
        movss   dword ptr [rdx + 4*rax], xmm0
# fRec0[1] = fRec0[0];
        movss   xmm0, dword ptr [rip + fRec0]   # xmm0 = mem[0],zero,zero,zero
        movss   dword ptr [rip + fRec0+4], xmm0
# fRec1[1] = fRec1[0];
        movss   xmm1, dword ptr [rip + fRec1]   # xmm1 = mem[0],zero,zero,zero
        movss   dword ptr [rip + fRec1+4], xmm1
# fRec2[1] = fRec2[0];
        movss   xmm1, dword ptr [rip + fRec2]   # xmm1 = mem[0],zero,zero,zero
        movss   dword ptr [rip + fRec2+4], xmm1
# i0 = i0 + 1
        add     rax, 1
        cmp     rdi, rax
        jne     .LBB0_2
#------------------------ end of loop ---------------------------
        ...
```
Stéphane Letz commented in the discussion thread that in his tests using faustbench-llvm on an Apple M1, process = par(i, 10, si.smoo); was faster, while the following ran a bit slower:
```faust
voice(i) = os.osc(400 + i*300) : si.smoo;
process = par(i, 10, voice(i));
```
This is not yet understood.