
Benchmarking / performance measurement of LLVM backend on Intel CPUs: Part I #613

Open
pramodk opened this issue Apr 24, 2021 · 7 comments
Assignees
Labels
llvm performance Related to performance improvement

Comments

@pramodk
Contributor

pramodk commented Apr 24, 2021

As part of this ticket, we are going to benchmark LLVM code generation backend with different configurations. Here are some practical considerations:

  • Simple synthetic kernels with basic patterns (involving memory accesses, gather, div, exp) vs real-world MOD files
  • Datasets fitting in cache vs DRAM
  • Different vector widths
  • With / without VecLibReplace with SVML
  • Different (Intel CPU) backends: SSE, AVX2, AVX-512

@georgemitenkov : I have assigned this to myself temporarily, as I am going to do simple cross-checks of the performance numbers with the recently added --veclib SVML option.
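A minimal sketch of how a sweep over the configurations listed above might be generated. The kernel file names and the fixed `--instance-size` / `--repeat` values are illustrative placeholders; only `--vector-width` and `--veclib` correspond to options discussed in this thread.

```python
import itertools

# Illustrative sweep parameters: widths roughly matching SSE / AVX2 / AVX-512
# for doubles, with and without SVML vector-library replacement.
vector_widths = [2, 4, 8]
veclibs = ["none", "SVML"]
kernels = ["memory-bound.mod", "compute-bound.mod", "hh.mod"]  # hypothetical files

commands = [
    f"nmodl {kernel} llvm --ir --vector-width {width} --veclib {lib} "
    f"benchmark --run --instance-size 100000000 --repeat 10"
    for kernel, width, lib in itertools.product(kernels, vector_widths, veclibs)
]

for cmd in commands:
    print(cmd)
```

Each command line would then be run (and its log captured) once per target backend, e.g. via a wrapper that also pins the CPU target.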

@pramodk
Contributor Author

pramodk commented Apr 27, 2021

Just to update here @georgemitenkov : I have tested a few small examples and compared SSE vs AVX2 locally. But for detailed analysis, I will wait for #611 ( / #612) so that assembly & performance metrics can be analysed in detail.

@georgemitenkov
Collaborator

georgemitenkov commented Apr 28, 2021

Great! I had an exam yesterday so Monday/Tuesday were a bit out for me. I have started looking at the debug info, so hopefully this one should be ready soonish (~Thursday).

Regarding assembly verification: ideally, do we want to dump it to the log file, so that the structure is the following?

====== start

  • NMODL source (not in log file for now)
  • NMODL after transformations (we only print kernels, so that's fine)
  • Generated LLVM
  • Generated assembly from JIT?

====== JIT part

  • Visiting time
  • Benchmark time

====== end

What do you think? @pramodk

@pramodk
Contributor Author

pramodk commented Apr 28, 2021

Great! I had an exam yesterday so Monday/Tuesday were a bit out for me.

Oh ok! Np!

What do you think? @pramodk

Yup, above part LGTM!

@castigli
Contributor

Just as an initial reference, below is a summary of the current timings on x86_64.
The first line of each pair is the JIT, the second is the external kernel (note that there is some overhead from the JIT calling mechanism).
The JIT options are:

--fmf nnan contract afn --vector-width 8 --veclib SVML benchmark \
--opt-level-ir 3 --opt-level-codegen 3 --run --instance-size 100000000 \
--repeat 10
compute-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.322915
compute-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.419407
compute-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.344690
compute-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.423696
compute-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.350585
compute-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.319667
compute-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.347119
compute-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.320830
compute-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.323365
compute-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.317312
compute-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.347382
compute-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.629991
hh_clang_-O3-march=skylake-avx512-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.659959
hh_clang_-O3-march=skylake-avx512-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 10.597442
hh_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.639105
hh_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 2.132582
hh_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.635455
hh_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.510965
hh_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.634934
hh_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 10.587418
hh_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.610168
hh_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 12.130137
hh_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.634898
hh_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 3.086421
hh_gcc_-O3-mavx2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 1.610445
hh_gcc_-O3-mavx2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 10.701414
hh_gcc_-O3-mavx512f-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 1.614212
hh_gcc_-O3-mavx512f-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 10.897828
hh_gcc_-O3-msse2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 1.611068
hh_gcc_-O3-msse2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 11.025482
hh_icpc_-O2-march=skylake-avx512-mtune=skylake-avx512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.622493
hh_icpc_-O2-march=skylake-avx512-mtune=skylake-avx512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.913908
hh_icpc_-O2-mavx2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.792381
hh_icpc_-O2-mavx2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.908091
hh_icpc_-O2-mavx512f-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.794239
hh_icpc_-O2-mavx512f-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.576430
hh_icpc_-O2-msse2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.792621
hh_icpc_-O2-msse2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 3.003994
hh_icpc_-O2-qopt-zmm-usage=high-xCORE-AVX512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.612436
hh_icpc_-O2-qopt-zmm-usage=high-xCORE-AVX512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.750384
memory-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402982
memory-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.404010
memory-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402691
memory-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403016
memory-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402822
memory-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403130
memory-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402736
memory-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403115
memory-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.405940
memory-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.406087
memory-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403234
memory-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.404857
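A small, hypothetical sketch of how these grep-style lines (`file.log:[NMODL] [info] :: Average compute time = X`) could be parsed to pair the JIT and external-kernel timings per log file. The two sample lines are copied from the output above; the parsing logic itself is an assumption, not part of the benchmark harness.

```python
import re
from collections import defaultdict

# Two sample lines in the grep-style "file:message" format shown above.
log_lines = [
    "hh_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:"
    "[NMODL] [info] :: Average compute time = 1.634898",
    "hh_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:"
    "[NMODL] [info] :: Average compute time = 3.086421",
]

timings = defaultdict(list)
for line in log_lines:
    fname, _, msg = line.partition(":")
    match = re.search(r"Average compute time = ([\d.]+)", msg)
    if match:
        timings[fname].append(float(match.group(1)))

# Per the convention above: first entry per file is JIT, second is external.
for fname, (jit, external) in timings.items():
    print(f"{fname}: external/JIT ratio = {external / jit:.2f}")
```

This makes it easy to spot outliers such as the ~10 s external-kernel times in some of the hh logs above.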

@georgemitenkov
Collaborator

Thanks @castigli ! Any specific reason we use only nnan contract afn and not fast for the fast-math flags?

@castigli
Contributor

castigli commented Jun 1, 2021

No, except that I forgot to add it! I will re-run the test.

@georgemitenkov
Collaborator

georgemitenkov commented Jun 3, 2021

@pramodk @castigli @iomaganaris

The current configurations would be as follows, with [..] indicating a test parameter:

llvm --ir [--fmf fast] [--assume-may-alias] [--single-precision] --vector-width [W] --veclib [LIB] --opt-level-ir 3 \
benchmark --run --instance-size [S] --repeat [R] --opt-level-codegen 3 --cpu [cpu name or default] --libs [...]

For CPU names, we can use any that Clang supports. We also want to see the effect of aliasing, and how performance for floats differs (float => 32 bits => the vector width is greater => maybe more scatter/gather overhead).
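The float vs double point can be made concrete with a little register arithmetic (a sketch; 128/256/512 bits are the standard SSE/AVX2/AVX-512 register sizes):

```python
# Number of SIMD lanes (i.e. the natural vector width) =
# register width in bits / element width in bits.
REGISTER_BITS = {"SSE": 128, "AVX2": 256, "AVX-512": 512}
ELEMENT_BITS = {"double": 64, "float": 32}

for isa, reg_bits in REGISTER_BITS.items():
    for dtype, elem_bits in ELEMENT_BITS.items():
        lanes = reg_bits // elem_bits
        print(f"{isa:8s} {dtype:7s} -> vector width {lanes}")
```

So switching from double to float doubles the vector width at every ISA level (e.g. 8 -> 16 lanes on AVX-512), which is why indexed accesses may incur more scatter/gather overhead per vector.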

@1uc 1uc added performance Related to performance improvement and removed benchmark labels Sep 30, 2024