Fused Multiply-Add (FMA)

Fused Multiply-Add (FMA)

Introduction

In computing, especially digital signal processing, the multiply-add operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator, this operation modifies an accumulator a:

  a ← a + ( b × c )

When done with floating point numbers, it might be performed with two roundings (typical in many DSPs), or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA).

In floating-point arithmetic

When done with integers, the operation is typically exact (computed modulo some power of two). However, floating-point numbers have only a certain amount of mathematical precision. That is, digital floating-point arithmetic is generally not associative or distributive. Therefore, it makes a difference to the result whether the multiply–add is performed with two roundings, or in one operation with a single rounding (a fused multiply–add). IEEE 754-2008 specifies that it must be performed with one rounding, yielding a more accurate result.

Fused multiply–add

A fused multiply–add (FMA or fmadd) is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b × c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire expression a + (b × c) to its full precision before rounding the final result down to N significant bits.

A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products:

  • Dot product
  • Matrix multiplication
  • Polynomial evaluation (e.g., with Horner’s rule)
  • Newton’s method for evaluating functions (from the inverse function)
  • Convolutions and artificial neural networks
  • Multiplication in double-double arithmetic

Fused multiply–add can usually be relied on to give more accurate results. However, William Kahan has pointed out that it can give problems if used unthinkingly.

When implemented inside a microprocessor, an FMA can be faster than a multiply operation followed by an add.

Another benefit of including this instruction is that it allows an efficient software implementation of division and square root operations, thus eliminating the need for dedicated hardware for those operations.

Support

The FMA operation is included in IEEE 754-2008. The 1999 standard of the C programming language supports the FMA operation through the fma() standard math library function, and standard pragmas (#pragma STDC FP_CONTRACT) controlling optimizations based on FMA.

The fused multiply–add operation was introduced as “multiply–add fused” in the IBM POWER1 (1990) processor, but has been added to numerous other processors since then.

  • x86 processors with FMA3 and/or FMA4 instruction set
  • ARM processors with VFPv4 and/or NEONv2
  • MIPS processors
  • IBM z/Architecture
  • GPUs and GPGPU boards
  • Vector Processors

Usage

Test program

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
int madd_i(int a, int b, int c) {
return a * b + c;
}

float madd_s(float a, float b, float c) {
return a * b + c;
}

double madd_d(double a, double b, double c) {
return a * b + c;
}

int msub_i(int a, int b, int c) {
return a * b - c;
}

float msub_s(float a, float b, float c) {
return a * b - c;
}

double msub_d(double a, double b, double c) {
return a * b - c;
}

ARMv8-A

1
2
3
4
aarch64-linux-gnu-gcc -O2 -S -o - test.c

aarch64-linux-gnu-gcc -O2 -Ofast -c test.c && \
aarch64-linux-gnu-objdump -Cw --visualize-jumps -d -S -t test.o
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
0000000000000000 <madd_i>:
0: 1b010800 madd w0, w0, w1, w2
4: d65f03c0 ret

0000000000000010 <madd_s>:
10: 1f010800 fmadd s0, s0, s1, s2
14: d65f03c0 ret

0000000000000020 <madd_d>:
20: 1f410800 fmadd d0, d0, d1, d2
24: d65f03c0 ret

0000000000000030 <msub_i>:
30: 1b017c00 mul w0, w0, w1
34: 4b020000 sub w0, w0, w2
38: d65f03c0 ret

0000000000000040 <msub_s>:
40: 1f218800 fnmsub s0, s0, s1, s2
44: d65f03c0 ret

0000000000000050 <msub_d>:
50: 1f618800 fnmsub d0, d0, d1, d2
54: d65f03c0 ret

ARMv7-A

1
2
3
4
arm-linux-gnueabi-gcc -march=armv7-a+neon-vfpv4 -mfloat-abi=hard -O2 -S -o - test.c

arm-linux-gnueabi-gcc -march=armv7-a+neon-vfpv4 -mfloat-abi=hard -Ofast -c test.c && \
arm-linux-gnueabi-objdump -Cw --visualize-jumps -d -S -t test.o
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
00000000 <madd_i>:
0: e0202091 mla r0, r1, r0, r2
4: e12fff1e bx lr

00000008 <madd_s>:
8: eea01a20 vfma.f32 s2, s0, s1
c: eeb00a41 vmov.f32 s0, s2
10: e12fff1e bx lr

00000014 <madd_d>:
14: eea02b01 vfma.f64 d2, d0, d1
18: eeb00b42 vmov.f64 d0, d2
1c: e12fff1e bx lr

00000020 <msub_i>:
20: e0000091 mul r0, r1, r0
24: e0400002 sub r0, r0, r2
28: e12fff1e bx lr

0000002c <msub_s>:
2c: ee901a20 vfnms.f32 s2, s0, s1
30: eeb00a41 vmov.f32 s0, s2
34: e12fff1e bx lr

00000038 <msub_d>:
38: ee902b01 vfnms.f64 d2, d0, d1
3c: eeb00b42 vmov.f64 d0, d2
40: e12fff1e bx lr

ARMv6

1
2
3
4
arm-linux-gnueabi-gcc -march=armv6kz -mfloat-abi=hard -mfpu=vfpv2 -O2 -S -o - test.c

arm-linux-gnueabi-gcc -march=armv6kz -mfloat-abi=hard -mfpu=vfpv2 -Ofast -c test.c && \
arm-linux-gnueabi-objdump -Cw --visualize-jumps -d -S -t test.o
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
00000000 <madd_i>:
0: e0202091 mla r0, r1, r0, r2
4: e12fff1e bx lr

00000008 <madd_s>:
8: ee001a20 vmla.f32 s2, s0, s1
c: eeb00a41 vmov.f32 s0, s2
10: e12fff1e bx lr

00000014 <madd_d>:
14: ee002b01 vmla.f64 d2, d0, d1
18: eeb00b42 vmov.f64 d0, d2
1c: e12fff1e bx lr

00000020 <msub_i>:
20: e0000091 mul r0, r1, r0
24: e0400002 sub r0, r0, r2
28: e12fff1e bx lr

0000002c <msub_s>:
2c: ee101a20 vnmls.f32 s2, s0, s1
30: eeb00a41 vmov.f32 s0, s2
34: e12fff1e bx lr

00000038 <msub_d>:
38: ee102b01 vnmls.f64 d2, d0, d1
3c: eeb00b42 vmov.f64 d0, d2
40: e12fff1e bx lr

MIPS64

1
2
3
4
mips64el-linux-gnuabi64-gcc -march=mips64r2 -O2 -S -o - test.c

mips64el-linux-gnuabi64-gcc -march=mips64r2 -Ofast -c test.c && \
mips64el-linux-gnuabi64-objdump -Cw --visualize-jumps -d -S -t test.o
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
0000000000000000 <madd_i>:
0: /-> 70851802 mul v1,a0,a1
4: \-- 03e00008 jr ra
8: 00661021 addu v0,v1,a2

0000000000000010 <madd_s>:
10: 460d6002 mul.s $f0,$f12,$f13
14: 03e00008 jr ra
18: 460e0000 add.s $f0,$f0,$f14

0000000000000020 <madd_d>:
20: 462d6002 mul.d $f0,$f12,$f13
24: 03e00008 jr ra
28: 462e0000 add.d $f0,$f0,$f14

0000000000000030 <msub_i>:
30: 70852002 mul a0,a0,a1
34: 03e00008 jr ra
38: 00861023 subu v0,a0,a2

0000000000000040 <msub_s>:
40: 460d6002 mul.s $f0,$f12,$f13
44: 03e00008 jr ra
48: 460e0001 sub.s $f0,$f0,$f14

0000000000000050 <msub_d>:
50: 462d6002 mul.d $f0,$f12,$f13
54: 03e00008 jr ra
58: 462e0001 sub.d $f0,$f0,$f14
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# mips64el-linux-gnuabi64-gcc -march=mips64r6 -Ofast -s --freestanding -nostdlib -static -o mips64 mips64.S
# mips64el-linux-gnuabi64-objdump -Cw --visualize-jumps -d -S -t mips64
# qemu-mips64el-static mips64

#include <regdef.h>
#include <asm/unistd.h>

.data
hello_string: .ascii "Hello World!\n"
hello_string_length: .quad . - hello_string

.text
.global __start
__start:
bal 1f
1:
.cpsetup ra, zero, 1b

dli a0, 1
dla a1, hello_string
ld a2, hello_string_length
dli v0, __NR_write
syscall

move a0, zero
dli v0, __NR_exit
syscall
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
$ qemu-mips64el-static mips64
Hello World!

$ mips64el-linux-gnuabi64-objdump -Cw --visualize-jumps -d -S -t mips64
0000000120000190 <__start>:
120000190: /-- 04110001 bal 120000198 <__start+0x8>
120000194: | 00000000 nop
120000198: \-> 03800025 move zero,gp
12000019c: 3c1c0002 lui gp,0x2
1200001a0: 279c8048 addiu gp,gp,-32696
1200001a4: 039fe02d daddu gp,gp,ra
1200001a8: 24040001 li a0,1
1200001ac: df858020 ld a1,-32736(gp)
1200001b0: df868028 ld a2,-32728(gp)
1200001b4: dcc601e0 ld a2,480(a2)
1200001b8: 24021389 li v0,5001
1200001bc: 0000000c syscall
1200001c0: 00002025 move a0,zero
1200001c4: 240213c2 li v0,5058
1200001c8: 0000000c syscall
1200001cc: 00000000 nop

PowerPC

1
2
3
4
powerpc64le-linux-gnu-gcc -mcpu=power8 -O2 -S -o - test.c

powerpc64le-linux-gnu-gcc -mcpu=power8 -Ofast -c test.c && \
powerpc64le-linux-gnu-objdump -Cw --visualize-jumps -d -S -t test.o
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
0000000000000000 <madd_i>:
0: d6 21 63 7c mullw r3,r3,r4
4: 14 2a 63 7c add r3,r3,r5
8: b4 07 63 7c extsw r3,r3
c: 20 00 80 4e blr

0000000000000020 <madd_s>:
20: ba 18 21 ec fmadds f1,f1,f2,f3
24: 20 00 80 4e blr

0000000000000040 <madd_d>:
40: ba 18 21 fc fmadd f1,f1,f2,f3
44: 20 00 80 4e blr

0000000000000060 <msub_i>:
60: d6 21 63 7c mullw r3,r3,r4
64: 50 18 65 7c subf r3,r5,r3
68: b4 07 63 7c extsw r3,r3
6c: 20 00 80 4e blr

0000000000000080 <msub_s>:
80: b8 18 21 ec fmsubs f1,f1,f2,f3
84: 20 00 80 4e blr

00000000000000a0 <msub_d>:
a0: b8 18 21 fc fmsub f1,f1,f2,f3
a4: 20 00 80 4e blr

RISC-V

1
2
3
4
riscv64-linux-gnu-gcc -march=rv64gc -mabi=lp64d -O2 -S -o - test.c

riscv64-linux-gnu-gcc -march=rv64gc -mabi=lp64d -Ofast -c test.c
riscv64-linux-gnu-objdump -Cw --visualize-jumps -d -S -t test.o
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
0000000000000000 <madd_i>:
0: /-> 02b5053b mulw a0,a0,a1
4: | 9d31 addw a0,a0,a2
6: \-- 8082 ret

0000000000000008 <madd_s>:
8: 60b57543 fmadd.s fa0,fa0,fa1,fa2
c: 8082 ret

000000000000000e <madd_d>:
e: 62b57543 fmadd.d fa0,fa0,fa1,fa2
12: 8082 ret

0000000000000014 <msub_i>:
14: 02b5053b mulw a0,a0,a1
18: 9d11 subw a0,a0,a2
1a: 8082 ret

000000000000001c <msub_s>:
1c: 60b57547 fmsub.s fa0,fa0,fa1,fa2
20: 8082 ret

0000000000000022 <msub_d>:
22: 62b57547 fmsub.d fa0,fa0,fa1,fa2
26: 8082 ret

x86_64

1
2
3
4
x86_64-linux-gnu-gcc -march=x86-64-v3 -masm=intel -O2 -S -o - test.c

x86_64-linux-gnu-gcc -march=x86-64-v3 -Ofast -c test.c && \
x86_64-linux-gnu-objdump -Cw --visualize-jumps -d -M intel -S -t test.o
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
0000000000000000 <madd_i>:
0: 0f af fe imul edi,esi
3: 8d 04 17 lea eax,[rdi+rdx*1]
6: c3 ret

0000000000000010 <madd_s>:
10: c4 e2 69 99 c1 vfmadd132ss xmm0,xmm2,xmm1
15: c3 ret

0000000000000020 <madd_d>:
20: c4 e2 e9 99 c1 vfmadd132sd xmm0,xmm2,xmm1
25: c3 ret

0000000000000030 <msub_i>:
30: 0f af fe imul edi,esi
33: 89 f8 mov eax,edi
35: 29 d0 sub eax,edx
37: c3 ret

0000000000000040 <msub_s>:
40: c4 e2 69 9b c1 vfmsub132ss xmm0,xmm2,xmm1
45: c3 ret

0000000000000050 <msub_d>:
50: c4 e2 e9 9b c1 vfmsub132sd xmm0,xmm2,xmm1
55: c3 ret

z/Architecture

1
2
3
4
s390x-linux-gnu-gcc -march=z13 -O2 -S -o - test.c

s390x-linux-gnu-gcc -march=z13 -Ofast -c test.c && \
s390x-linux-gnu-objdump -Cw --visualize-jumps -d -S -t test.o
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
0000000000000000 <madd_i>:
0: b2 52 00 23 msr %r2,%r3
4: 1a 24 ar %r2,%r4
6: b9 14 00 22 lgfr %r2,%r2
a: 07 fe br %r14

0000000000000010 <madd_s>:
10: b3 0e 40 02 maebr %f4,%f0,%f2
14: 28 04 ldr %f0,%f4
16: 07 fe br %r14

0000000000000020 <madd_d>:
20: e7 00 23 08 40 8f wfmadb %v0,%v0,%v2,%v4
26: 07 fe br %r14

0000000000000030 <msub_i>:
30: b2 52 00 23 msr %r2,%r3
34: 1b 24 sr %r2,%r4
36: b9 14 00 22 lgfr %r2,%r2
3a: 07 fe br %r14

0000000000000040 <msub_s>:
40: b3 0f 40 02 msebr %f4,%f0,%f2
44: 28 04 ldr %f0,%f4
46: 07 fe br %r14

0000000000000050 <msub_d>:
50: e7 00 23 08 40 8e wfmsdb %v0,%v0,%v2,%v4
56: 07 fe br %r14