
Appendix B: Inspecting Assembly to Fine-Tune Performance

On the FAUST mailing list, Dario Sanfilippo suggested that we change the default one-pole smoother si.smoo in signals.lib from the form $(1-b) * x(n) + b * y(n-1)$ to $x(n) + b * [y(n-1) - x(n)]$, recognizing that multiplications are more expensive in hardware than additions. This appendix is adapted from my email reply to the list:

This is a winner. I am strongly in favor of the change. One multiply and two additions is fundamentally less work than two multiplies and one addition. However, when two multiplies are available in parallel, then (1-b) * x + b * y can be faster than x + b * (y-x) because it takes two steps instead of three. Thus, a parallel-processing implementation might prefer the first form.
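
The equivalence of the two forms, and the operation counts quoted above, follow from one line of algebra. Here $x$ stands for $x(n)$ and $y$ for $y(n-1)$ (shorthand introduced only for this sketch), and $1-b$ is assumed to be precomputed as a constant:

```latex
\begin{align*}
(1-b)\,x + b\,y &= x - b\,x + b\,y \\
                &= x + b\,(y - x).
\end{align*}
```

The left-hand side needs two multiplies and one addition. Its two products do not depend on each other, so with two multipliers available it finishes in two steps (multiply, then add). The right-hand side needs one multiply and two additions, but they must happen in sequence (subtract, multiply, add), which gives the three steps mentioned above.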

Ideally both forms would compile to the same assembly, but this is not yet the case. Neither the FAUST compiler nor the C++ compiler appears to trade multiplies for additions when the target architecture warrants it.

Of course we should run benchmarks to measure the actual improvement on each architecture, but looking at the assembly can also give the answer. I recently learned about the Compiler Explorer at godbolt.org, for comparing assemblies on various processors, and this was my first use of it:

First, the FAUST source, adapted from the mailing-list thread, is shown in Fig. 23.

Figure 23: FAUST program specifying three different one-pole smoothers in parallel.

 
import("stdfaust.lib");
smooth(coeff, x) = fb ~ _ with {  fb(y) = y + (1.0 - coeff) * (x - y);  };
c = 1.0 - 44.1 / ma.SR;
smooth3(s, x) = fb ~ _ with { fb(y) = s * (y - x) + x; };
process = _ <: si.smooth(c), smooth(c), smooth3(c);

Next, I compiled the FAUST source at the command line with no options, and copy/pasted the compute() function to create the stand-alone code snippet shown in Fig.24. (Note that it's no longer a virtual function.)

Figure 24: C++ program adapted from the output of simplest FAUST compilation at the command line.

 
#define FAUSTFLOAT float

int fSampleRate = 44100;
float fConst0 = 0.1; // linear-interpolation constant
float fConst1 = 0.9; // 1-fConst0
float fRec0[2];
float fRec1[2];
float fRec2[2];

void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) {
  FAUSTFLOAT* input0 = inputs[0];
  FAUSTFLOAT* output0 = outputs[0];
  FAUSTFLOAT* output1 = outputs[1];
  FAUSTFLOAT* output2 = outputs[2];
  for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) {
    float fTemp0 = float(input0[i0]);
    fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0));
    output0[i0] = FAUSTFLOAT(fRec0[0]);
    fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1])));
    output1[i0] = FAUSTFLOAT(fRec1[0]);
    fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0)));
    output2[i0] = FAUSTFLOAT(fRec2[0]);
    fRec0[1] = fRec0[0];
    fRec1[1] = fRec1[0];
    fRec2[1] = fRec2[0];
  }
}

This code can be pasted into the left panel of the Compiler Explorer at godbolt.org. Next choose your processor architecture and compiler on the right panel, and your C++ compiler options. Here I chose the first Intel case (more readable than ARM): x86-64 clang (assertions trunk), with compiler options -std=c++17 -O3. The term "trunk" refers to the latest version of the compiler source, but you can also try earlier versions of the compiler listed next. WebAssembly, alternative compilers, and many embedded processors appear as choices. My choice is ideal for my iMac Pro where I do most of my work, but I need to check ARM also for my iOS work. The Compiler Explorer is a great tool for fine-tuning performance at the lowest level.

Figure 25 shows the assembly output with added comments indicating where I guessed things came from. You can see that the specified computation structure is preserved all the way down to the bottom, even with -O3 optimization. The clear winner is smooth3, and benchmarks should verify that. It has only one multiply and two additions, and only six instructions (lines of assembly code) total.
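
One likely reason the structure survives -O3 is that, without options such as -ffast-math, a C++ compiler is not free to regroup floating-point expressions, because doing so can change the rounding of the result. The three forms are therefore equal algebraically but may differ in the last few bits. The following is a minimal sketch of how that could be checked. The names SmootherOutputs, runUnitStep, and maxDifference are hypothetical helpers introduced here, not part of FAUST, and the unit-step input is an arbitrary choice:

```cpp
#include <cassert>
#include <cmath>

// Final outputs of the three smoother forms driven by the same input.
struct SmootherOutputs {
    float twoMultiply;  // y = b*y + (1-b)*x      (compare fRec0 in Fig. 24)
    float errorForm;    // y = y + (1-b)*(x - y)  (compare fRec1)
    float oneMultiply;  // y = x + b*(y - x)      (compare fRec2)
};

// Hypothetical helper: feed n samples of a unit step through all three
// recursions with pole b, starting from zero state.
SmootherOutputs runUnitStep(float b, int n) {
    assert(n >= 0);
    const float a = 1.0f - b;  // plays the role of fConst0
    const float x = 1.0f;      // unit-step input sample
    float y0 = 0.0f, y1 = 0.0f, y2 = 0.0f;
    for (int i = 0; i < n; ++i) {
        y0 = b * y0 + a * x;
        y1 = y1 + a * (x - y1);
        y2 = x + b * (y2 - x);
    }
    return {y0, y1, y2};
}

// Largest pairwise absolute difference among the three outputs.
float maxDifference(const SmootherOutputs& r) {
    const float d01 = std::fabs(r.twoMultiply - r.errorForm);
    const float d02 = std::fabs(r.twoMultiply - r.oneMultiply);
    const float d12 = std::fabs(r.errorForm - r.oneMultiply);
    return std::fmax(d01, std::fmax(d02, d12));
}
```

Calling maxDifference(runUnitStep(0.9f, 20)) from a test program should give a value that is tiny compared with the outputs themselves, which are all close to 1 - 0.9^20. Any nonzero difference is rounding error, not a difference in the filter being computed.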

Figure 25: Intel x86-64 assembly language, generated using the first (latest) Intel option of the Compiler Explorer at godbolt.org (compiler options -std=c++17 -O3).

 
compute(int, float**, float**): # @compute(int, float**, float**)
  ...
.LBB0_2: # %for.body

# output0[i0] = ((fConst1 * fRec0[1]) + (fConst0 * input0[i0])), 7 lines:
  movss xmm1, dword ptr [r8 + 4*rax] # xmm1 = mem[0],zero,zero,zero
  mulss xmm0, dword ptr [rip + fConst1]
  movss xmm2, dword ptr [rip + fConst0] # xmm2 = mem[0],zero,zero,zero
  mulss xmm2, xmm1
  addss xmm2, xmm0
  movss dword ptr [rip + fRec0], xmm2
  movss dword ptr [rcx + 4*rax], xmm2

# output1[i0] = (fRec1[1] + (fConst0 * (input0[i0] - fRec1[1]))), 7 lines:
  movss xmm0, dword ptr [rip + fRec1+4] # xmm0 = mem[0],zero,zero,zero
  movaps xmm2, xmm1
  subss xmm2, xmm0
  mulss xmm2, dword ptr [rip + fConst0]
  addss xmm2, xmm0
  movss dword ptr [rip + fRec1], xmm2
  movss dword ptr [rsi + 4*rax], xmm2

# output2[i0] = (input0[i0] + (fConst1 * (fRec2[1] - input0[i0]))), 6 lines:
  movss xmm0, dword ptr [rip + fRec2+4] # xmm0 = mem[0],zero,zero,zero
  subss xmm0, xmm1
  mulss xmm0, dword ptr [rip + fConst1]
  addss xmm0, xmm1
  movss dword ptr [rip + fRec2], xmm0
  movss dword ptr [rdx + 4*rax], xmm0

# fRec0[1] = fRec0[0];
  movss xmm0, dword ptr [rip + fRec0] # xmm0 = mem[0],zero,zero,zero
  movss dword ptr [rip + fRec0+4], xmm0

# fRec1[1] = fRec1[0];
  movss xmm1, dword ptr [rip + fRec1] # xmm1 = mem[0],zero,zero,zero
  movss dword ptr [rip + fRec1+4], xmm1

# fRec2[1] = fRec2[0];
  movss xmm1, dword ptr [rip + fRec2] # xmm1 = mem[0],zero,zero,zero
  movss dword ptr [rip + fRec2+4], xmm1

# i0 = i0 + 1
  add rax, 1
  cmp rdi, rax
  jne .LBB0_2
  #------------------------ end of loop ---------------------------
  ...
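
Instruction counts are only a prediction, so a timing measurement is the natural follow-up. The following is a minimal sketch of a stand-alone timing harness, not the faustbench tools. The function names, the use of std::chrono::steady_clock, and the way the input buffer is passed are all assumptions made for illustration:

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helpers for illustration; not part of FAUST or faustbench.

// Two-multiply form: y(n) = (1-b)*x(n) + b*y(n-1).
float smoothTwoMultiply(const std::vector<float>& in, float b) {
    const float a = 1.0f - b;  // precomputed, like fConst0 in Fig. 24
    float y = 0.0f;
    for (float x : in) y = a * x + b * y;
    return y;
}

// One-multiply form: y(n) = x(n) + b*(y(n-1) - x(n)).
float smoothOneMultiply(const std::vector<float>& in, float b) {
    float y = 0.0f;
    for (float x : in) y = x + b * (y - x);
    return y;
}

// Time a single call f(in, b) and return the elapsed nanoseconds.
// The filter's final output is written through `out` so that the
// result is used and the loop is not optimized away.
template <typename F>
double timeNanoseconds(F f, const std::vector<float>& in, float b, float* out) {
    assert(out != nullptr);
    const auto start = std::chrono::steady_clock::now();
    *out = f(in, b);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count();
}
```

A caller would fill a std::vector<float> with test samples, call timeNanoseconds once per form, and compare the results. Single measurements are noisy, so repeating each call many times and keeping the minimum is advisable. Note also that a lone recursion like this is limited by the chain of operations from y(n-1) to y(n), so the measured ranking need not match the instruction counts.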

Stéphane Letz commented in the discussion thread that in his tests using faustbench-llvm on an Apple M1, process = par(i, 10, si.smoo); was faster, while the following ran a bit slower:

  voice(i) = os.osc(400+i*300) : si.smoo;
  process = par(i, 10, voice(i));

The reason for this difference is not yet understood.



"Audio Signal Processing in Faust", by Julius O. Smith III
Copyright © 2023-08-16 by Julius O. Smith III
Center for Computer Research in Music and Acoustics (CCRMA),   Stanford University