Add AVX-512 optimised dot product distance function for int4 on x64 #109238

Draft: wants to merge 8 commits into main

ldematte
Contributor

Based on #109084 -- only the last commit is relevant for this draft.

Add an int4 implementation of the dot product between an unpacked vector (one value in the range 0x00 to 0x0F per byte) and a packed vector (two values in the range 0x0 to 0xF per byte).
When compiled with clang (gcc exhibits the same bug as in #109084), it produces the following code:

loop:
    vmovdqu64       zmm18, zmmword ptr [r9]
    vpandq  zmm22, zmm18, zmm2
    vmovdqu64       zmm23, zmmword ptr [r8]
    vpmaddubsw      zmm22, zmm23, zmm22
    vpaddw  zmm17, zmm22, zmm17
    vmovdqu64       zmm22, zmmword ptr [r8 + 64]
    vpsrld  zmm18, zmm18, 4
    vpandq  zmm18, zmm18, zmm2
    vpmaddubsw      zmm18, zmm22, zmm18
    vpaddw  zmm17, zmm17, zmm18
    
    add     r8, 128
    add     r9, 64
    jne loop

Note the instruction mix per iteration: 2 vector mul/adds (vpmaddubsw), 2 vector ANDs, 2 vector adds, 1 vector (intra-lane) shift, and 3 loads.
Note also that we perform 2 mul/add operations per loop iteration, so we are not limited by the FMA units: a CPU with enough ports should be able to reach an RThroughput of 2.0 (that is, complete both operations in 2 CPU cycles). Static analysis suggests this should be possible on Zen4 (link), and close to it on Intel (between 2.3 and 3.0).
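For reference, here is a scalar sketch of what one iteration of the loop above appears to compute. It is not the code in this PR. It assumes the layout implied by the listing: each block of 64 packed bytes pairs with 128 unpacked bytes, the low nibble of packed byte j multiplies unpacked value j, and the high nibble multiplies unpacked value j + 64. The names `PACKED_BLOCK`, `pack_int4` and `dot_int4_scalar` are hypothetical, and the sketch only handles dimensions that are a multiple of 128.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64 packed bytes per block, matching one 512-bit load of the
   packed operand in the listing (hypothetical constant name). */
#define PACKED_BLOCK 64

/* Hypothetical helper: packs int4 values so that, within each block of 128
   values, the low nibble of packed[j] holds value j and the high nibble
   holds value j + 64. */
static void pack_int4(const uint8_t *unpacked, uint8_t *packed, size_t dims) {
    assert(dims % (2 * PACKED_BLOCK) == 0);
    for (size_t base = 0; base < dims; base += 2 * PACKED_BLOCK) {
        const uint8_t *u = unpacked + base;
        uint8_t *p = packed + base / 2;
        for (size_t j = 0; j < PACKED_BLOCK; j++) {
            p[j] = (uint8_t)((u[j] & 0x0F) | ((u[j + PACKED_BLOCK] & 0x0F) << 4));
        }
    }
}

/* Scalar reference: dot product of an unpacked int4 vector (one value per
   byte) with a packed int4 vector (two values per byte). */
static int32_t dot_int4_scalar(const uint8_t *unpacked, const uint8_t *packed,
                               size_t dims) {
    assert(dims % (2 * PACKED_BLOCK) == 0);
    int32_t acc = 0;
    for (size_t base = 0; base < dims; base += 2 * PACKED_BLOCK) {
        const uint8_t *u = unpacked + base;
        const uint8_t *p = packed + base / 2;
        for (size_t j = 0; j < PACKED_BLOCK; j++) {
            /* Low nibble: the "and with mask" step in the listing. */
            acc += (int32_t)u[j] * (int32_t)(p[j] & 0x0F);
            /* High nibble: the "shift right by 4, then mask" step. */
            acc += (int32_t)u[j + PACKED_BLOCK] * (int32_t)(p[j] >> 4);
        }
    }
    return acc;
}
```

A scalar version like this could serve as a correctness oracle for the vectorised variants (AVX-512 here, AVX2 on the TODO list), since it makes the nibble-to-element pairing explicit.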

TODO:

  • sqr4u
  • AVX2 variants
  • Binding on Java side
  • packed-packed (both operands) variant

@ldematte
Contributor Author

Benchmarking on cloud machines (which is not really ideal, but..) shows a different picture: int4 and int7 performance is identical, around 80 to 90 ops/us, on AMD machines and on 4th gen Xeon (Sapphire Rapids). Performance is 50% better on 3rd gen Xeon (Ice Lake).

3 participants