In computing, especially digital signal processing, the multiply–add operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator. The operation modifies an accumulator a:
a ← a + ( b × c )
When done with floating-point numbers, it might be performed with two roundings (typical in many DSPs) or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA).
In floating-point arithmetic
When done with integers, the operation is typically exact (computed modulo some power of two). Floating-point numbers, however, carry only a limited number of significand bits, so digital floating-point arithmetic is generally not associative or distributive. It therefore makes a difference to the result whether the multiply–add is performed with two roundings or in one operation with a single rounding (a fused multiply–add). IEEE 754-2008 specifies a fused multiply–add operation performed with a single rounding, yielding a more accurate result.
Fused multiply–add
A fused multiply–add (FMA or fmadd) is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b × c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire expression a + (b × c) to its full precision before rounding the final result down to N significant bits.
A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products:
Dot product
Matrix multiplication
Polynomial evaluation (e.g., with Horner’s rule)
Newton’s method for evaluating functions (from the inverse function)
Convolutions and artificial neural networks
Multiplication in double-double arithmetic
Fused multiply–add can usually be relied on to give more accurate results. However, William Kahan has pointed out that it can cause problems if used unthinkingly.
When implemented inside a microprocessor, an FMA can be faster than a multiply operation followed by an add.
Another benefit of including this instruction is that it allows an efficient software implementation of division and square root operations, thus eliminating the need for dedicated hardware for those operations.
Support
The FMA operation is included in IEEE 754-2008. The 1999 standard of the C programming language supports the FMA operation through the fma() standard math library function, and standard pragmas (#pragma STDC FP_CONTRACT) controlling optimizations based on FMA.
The fused multiply–add operation was introduced as "multiply–add fused" in the IBM POWER1 (1990) processor, and has since been added to numerous other processors.
x86 processors with FMA3 and/or FMA4 instruction set