
Benchmarking / performance measurement of LLVM backend on Intel CPUs: Part I #613

Open
pramodk opened this issue Apr 24, 2021 · 7 comments
Assignees
Labels
llvm performance Related to performance improvement

Comments

@pramodk
Contributor

pramodk commented Apr 24, 2021

As part of this ticket, we are going to benchmark LLVM code generation backend with different configurations. Here are some practical considerations:

  • Simple synthetic kernels with basic patterns (involving memory accesses, gather, div, exp) vs real-world MOD files
  • Datasets fitting in cache vs DRAM
  • Different vector widths
  • With / without VecLibReplace with SVML
  • Different (Intel CPU) backends: SSE, AVX2, AVX-512

@georgemitenkov : I have assigned this to myself temporarily, as I am going to do simple cross-checks of the performance numbers with the recently added --veclib SVML option.
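A minimal sketch of how a sweep over the configurations listed above might be generated. The kernel file names and the fixed `--instance-size` / `--repeat` values are illustrative placeholders; only `--vector-width` and `--veclib` correspond to options discussed in this thread.

```python
import itertools

# Illustrative sweep parameters: widths roughly matching SSE / AVX2 / AVX-512
# for doubles, with and without SVML vector-library replacement.
vector_widths = [2, 4, 8]
veclibs = ["none", "SVML"]
kernels = ["memory-bound.mod", "compute-bound.mod", "hh.mod"]  # hypothetical files

commands = [
    f"nmodl {kernel} llvm --ir --vector-width {width} --veclib {lib} "
    f"benchmark --run --instance-size 100000000 --repeat 10"
    for kernel, width, lib in itertools.product(kernels, vector_widths, veclibs)
]

for cmd in commands:
    print(cmd)
```

Each command line would then be run (and its log captured) once per target backend, e.g. via a wrapper that also pins the CPU target.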

@pramodk
Contributor Author

pramodk commented Apr 27, 2021

Just to update here @georgemitenkov : I have tested a few small examples and compared SSE vs AVX2 locally. But for detailed analysis, I will wait for #611 ( / #612) so that assembly & performance metrics can be analysed in detail.

@georgemitenkov
Collaborator

georgemitenkov commented Apr 28, 2021

Great! I had an exam yesterday so Monday/Tuesday were a bit out for me. I have started looking at the debug info, so hopefully this one should be ready soonish (~Thursday).

Regarding assembly verification: ideally, do we want to dump it to the log file, so that the structure is the following?

====== start

  • NMODL source (not in log file for now)
  • NMODL after transformations (we only print kernels, so that's fine)
  • Generated LLVM
  • Generated assembly from JIT?

====== JIT part

  • Visiting time
  • Benchmark time

====== end

What do you think? @pramodk

@pramodk
Contributor Author

pramodk commented Apr 28, 2021

Great! I had an exam yesterday so Monday/Tuesday were a bit out for me.

Oh ok! Np!

What do you think? @pramodk

Yup, above part LGTM!

@castigli
Contributor

Just as an initial reference, below is a summary of the current timings on x86_64.
The first line of each pair is the JIT, the second is the external kernel (note that there is some overhead from the JIT calling mechanism).
The JIT options are:

--fmf nnan contract afn --vector-width 8 --veclib SVML benchmark \
--opt-level-ir 3 --opt-level-codegen 3 --run --instance-size 100000000 \
--repeat 10
compute-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.322915
compute-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.419407
compute-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.344690
compute-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.423696
compute-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.350585
compute-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.319667
compute-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.347119
compute-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.320830
compute-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.323365
compute-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.317312
compute-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.347382
compute-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.629991
hh_clang_-O3-march=skylake-avx512-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.659959
hh_clang_-O3-march=skylake-avx512-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 10.597442
hh_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.639105
hh_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 2.132582
hh_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.635455
hh_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.510965
hh_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.634934
hh_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 10.587418
hh_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.610168
hh_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 12.130137
hh_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.634898
hh_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 3.086421
hh_gcc_-O3-mavx2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 1.610445
hh_gcc_-O3-mavx2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 10.701414
hh_gcc_-O3-mavx512f-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 1.614212
hh_gcc_-O3-mavx512f-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 10.897828
hh_gcc_-O3-msse2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 1.611068
hh_gcc_-O3-msse2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 11.025482
hh_icpc_-O2-march=skylake-avx512-mtune=skylake-avx512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.622493
hh_icpc_-O2-march=skylake-avx512-mtune=skylake-avx512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.913908
hh_icpc_-O2-mavx2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.792381
hh_icpc_-O2-mavx2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.908091
hh_icpc_-O2-mavx512f-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.794239
hh_icpc_-O2-mavx512f-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.576430
hh_icpc_-O2-msse2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.792621
hh_icpc_-O2-msse2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 3.003994
hh_icpc_-O2-qopt-zmm-usage=high-xCORE-AVX512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.612436
hh_icpc_-O2-qopt-zmm-usage=high-xCORE-AVX512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.750384
memory-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402982
memory-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.404010
memory-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402691
memory-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403016
memory-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402822
memory-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403130
memory-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402736
memory-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403115
memory-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.405940
memory-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.406087
memory-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403234
memory-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.404857
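A small, hypothetical sketch of how these grep-style lines (`file.log:[NMODL] [info] :: Average compute time = X`) could be parsed to pair the JIT and external-kernel timings per log file. The two sample lines are copied from the output above; the parsing logic itself is an assumption, not part of the benchmark harness.

```python
import re
from collections import defaultdict

# Two sample lines in the grep-style "file:message" format shown above.
log_lines = [
    "hh_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:"
    "[NMODL] [info] :: Average compute time = 1.634898",
    "hh_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:"
    "[NMODL] [info] :: Average compute time = 3.086421",
]

timings = defaultdict(list)
for line in log_lines:
    fname, _, msg = line.partition(":")
    match = re.search(r"Average compute time = ([\d.]+)", msg)
    if match:
        timings[fname].append(float(match.group(1)))

# Per the convention above: first entry per file is JIT, second is external.
for fname, (jit, external) in timings.items():
    print(f"{fname}: external/JIT ratio = {external / jit:.2f}")
```

This makes it easy to spot outliers such as the ~10 s external-kernel times in some of the hh logs above.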

@georgemitenkov
Collaborator

Thanks @castigli ! Any specific reason we use only nnan contract afn and not fast for the fast-math flags?

@castigli
Contributor

castigli commented Jun 1, 2021

No, except that I forgot to add it! I will re-run the test.

@georgemitenkov
Collaborator

georgemitenkov commented Jun 3, 2021

@pramodk @castigli @iomaganaris

The current configurations would be as follows, with [..] indicating a test parameter:

llvm --ir [--fmf fast] [--assume-may-alias] [--single-precision] --vector-width [W] --veclib [LIB] --opt-level-ir 3 \
benchmark --run --instance-size [S] --repeat [R] --opt-level-codegen 3 --cpu [cpu name or default] --libs [...]

For CPU names, we can use any that Clang supports. We also want to see the effect of aliasing, and how performance for floats differs (float => 32 bits => the vector width is greater => maybe more scatter/gather overhead).
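The float vs double point can be made concrete with a little register arithmetic (a sketch; 128/256/512 bits are the standard SSE/AVX2/AVX-512 register sizes):

```python
# Number of SIMD lanes (i.e. the natural vector width) =
# register width in bits / element width in bits.
REGISTER_BITS = {"SSE": 128, "AVX2": 256, "AVX-512": 512}
ELEMENT_BITS = {"double": 64, "float": 32}

for isa, reg_bits in REGISTER_BITS.items():
    for dtype, elem_bits in ELEMENT_BITS.items():
        lanes = reg_bits // elem_bits
        print(f"{isa:8s} {dtype:7s} -> vector width {lanes}")
```

So switching from double to float doubles the vector width at every ISA level (e.g. 8 -> 16 lanes on AVX-512), which is why indexed accesses may incur more scatter/gather overhead per vector.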

@1uc 1uc added performance Related to performance improvement and removed benchmark labels Sep 30, 2024