Add AVX-512 optimised dot product distance function for int4 on x64 #109238

ldematte · 2024-05-31T10:12:03Z

Based on #109084; only the last commit is relevant for this draft.

Add an int4 implementation of the dot product between an unpacked vector (one value from 0x00 to 0x0F per byte) and a packed vector (two values from 0x0 to 0xF per byte).
When compiled with clang (gcc shows the same bug as in #109084), it produces the following code:

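loop:
    vmovdqu64       zmm18, zmmword ptr [r9]         # load 64 packed bytes (128 int4 values)
    vpandq  zmm22, zmm18, zmm2                      # keep the low nibbles (zmm2 is presumably the 0x0F byte mask)
    vmovdqu64       zmm23, zmmword ptr [r8]         # load the first 64 unpacked bytes
    vpmaddubsw      zmm22, zmm23, zmm22             # multiply bytes, add adjacent products into 16-bit lanes
    vpaddw  zmm17, zmm22, zmm17                     # accumulate into zmm17
    vmovdqu64       zmm22, zmmword ptr [r8 + 64]    # load the next 64 unpacked bytes
    vpsrld  zmm18, zmm18, 4                         # shift the high nibbles down
    vpandq  zmm18, zmm18, zmm2                      # mask off bits shifted in from neighbouring bytes
    vpmaddubsw      zmm18, zmm22, zmm18             # multiply-add for the high nibbles
    vpaddw  zmm17, zmm17, zmm18                     # accumulate into zmm17

    add     r8, 128                                 # advance the unpacked pointer
    add     r9, 64                                  # advance the packed pointer
    jne loop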

Note the 2 vector multiply-adds (vpmaddubsw), 2 vector ANDs, 2 vector adds, 1 vector (intra-lane) shift, and 3 loads.
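
As a cross-check of what the loop computes, here is a minimal scalar sketch in C. The function name `dot4u_scalar` and its signature are hypothetical, not taken from the PR. The pairing of nibbles to unpacked bytes (low nibble with the first 64 bytes of a block, high nibble with the next 64) is an assumption read off the listing above, and the sketch assumes the dimension count is a multiple of 128, with no tail handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Values handled per iteration of the vectorised loop: 64 packed bytes
   hold 128 int4 values, matched against 128 unpacked bytes. */
#define BLOCK_VALUES 128
#define HALF_BLOCK (BLOCK_VALUES / 2)

/* Hypothetical scalar reference (illustrative name and signature).
   Assumes dims is a multiple of BLOCK_VALUES and all values are 0..15. */
int32_t dot4u_scalar(const uint8_t *unpacked, const uint8_t *packed, size_t dims) {
    int32_t total = 0;
    for (size_t block = 0; block < dims / BLOCK_VALUES; block++) {
        const uint8_t *u = unpacked + block * BLOCK_VALUES;
        const uint8_t *p = packed + block * HALF_BLOCK;
        for (size_t i = 0; i < HALF_BLOCK; i++) {
            /* low nibble pairs with the first 64 unpacked bytes of the block */
            total += (p[i] & 0x0F) * u[i];
            /* high nibble pairs with the next 64 unpacked bytes of the block */
            total += (p[i] >> 4) * u[i + HALF_BLOCK];
        }
    }
    return total;
}
```

A reference like this is mainly useful for checking a vectorised variant against known inputs; it says nothing about performance.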
Note also that each loop iteration performs 2 multiply-add operations, so we are not limited by the FMA units: a CPU with enough ports should reach an RThroughput of 2.0 (both operations in 2 CPU cycles). Static analysis suggests this is achievable on Zen4 (link), and nearly so on Intel (between 2.3 and 3.0).
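
To put the RThroughput figures in perspective, the small C sketch below converts cycles per iteration into int4 pairs processed per cycle. The helper name is hypothetical, and the figure of 128 pairs per iteration assumes each of the two vpmaddubsw instructions covers 64 byte pairs, as in the listing. These are static estimates, not measurements.

```c
/* Element pairs consumed by one loop iteration:
   two vpmaddubsw instructions, each over 64 byte pairs (assumption). */
#define PAIRS_PER_ITERATION 128

/* Hypothetical helper: converts a reciprocal throughput
   (cycles per loop iteration) into int4 pairs per cycle. */
double pairs_per_cycle(double rthroughput_cycles) {
    return PAIRS_PER_ITERATION / rthroughput_cycles;
}
```

With an RThroughput of 2.0 this gives 64 pairs per cycle; at 3.0 it drops to roughly 42.7.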

TODO:

sqr4u
AVX2 variants
Binding on Java side
packed-packed (both operands) variant

ldematte · 2024-05-31T11:00:04Z

Benchmarking on cloud machines (which is not really ideal, but..) shows a different picture: int4 and int7 performance is identical, around 80-90 ops/us, on AMD machines and on 4th gen Xeon (Sapphire Rapids). Performance is 50% better on 3rd gen Xeon (Ice Lake).

ldematte added 8 commits May 14, 2024 15:12

Add vec_caps and inner implementation for AVX-512-F (without VNNI)

9951eb5

WIP

98e677f

Merge remote-tracking branch 'upstream/main' into native-vec-linux-x64-avx512

7b1c11c

select FNNI function name based on vec_caps; templated implementation for manual unrolling

866199c

Manual unroll sqr7u + static bind mh in outer class

83b820a

Switched compiler to clang for x64, as gcc has a bug

ee7094b

Merge remote-tracking branch 'upstream/main' into native-vec-linux-x64-avx512

02d2503

AVX-512 int4 dot product

fbea023

ldematte added the WIP label May 31, 2024

elasticsearchmachine added the v8.15.0 label May 31, 2024

ldematte mentioned this pull request Jun 18, 2024

Investigate native impl for int4 vector comparators #109811

Open

elasticsearchmachine added v8.16.0 and removed v8.15.0 labels Jul 4, 2024

mark-vieira added v9.0.0 and removed v8.16.0 labels Sep 11, 2024
