Tier1 decoder speed optimizations #783

Merged · 12 commits · Sep 13, 2016

Conversation

@rouault (Collaborator) commented on May 23, 2016

This patch series improves T1 decoding speed, giving overall decompression time gains of typically 10-15% on operational products.

Various tricks used:

  • more aggressive inlining (reusing "Move some MQC functions into a header for speed" #675)
  • specialization of the decoding of 64x64 code blocks, which are common in a lot of products
  • addition of an auxiliary colflags array, each of whose 16-bit items stores the overall state of 4 values in a column, enabling quick checks in a cache-friendly way
  • loop unrolling for the VSC steps (similarly to the non-VSC case)
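As a hedged sketch of the second trick, under the assumption that the decoder dispatches on the block dimensions (the function names below are illustrative, not the actual OpenJPEG symbols): a dedicated 64x64 variant lets the compiler treat the bounds and the row stride as compile-time constants, which enables unrolling and strength reduction that a runtime-sized loop does not get.

```c
#include <stdint.h>

/* Generic decoder: dimensions only known at run time. */
static void decode_pass_generic(int32_t *data, uint32_t w, uint32_t h)
{
    for (uint32_t j = 0; j < h; j++)
        for (uint32_t i = 0; i < w; i++)
            data[j * w + i] += 1;          /* stand-in for a real coding pass */
}

/* Specialized variant: constant bounds and stride, unrollable. */
static void decode_pass_64x64(int32_t *data)
{
    for (uint32_t j = 0; j < 64; j++)
        for (uint32_t i = 0; i < 64; i++)
            data[j * 64 + i] += 1;
}

static void decode_pass(int32_t *data, uint32_t w, uint32_t h)
{
    if (w == 64 && h == 64)
        decode_pass_64x64(data);           /* hot path for 64x64 code blocks */
    else
        decode_pass_generic(data, w, h);
}
```

The dispatch cost is one compare per code block, paid back over thousands of samples.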

Benchmarking:

This has been tested with the following files :
C1: issue135.j2k (from openjpeg-data, code blocks 32x32)
C2: Bretagne2.j2k (from openjpeg-data, code blocks 32x32)
C3: 20160307_125117_0c74.jp2 (non public test file, 3 bands, 12 bits, 6600x2200 for band 1, 3300x2200 for bands 2 and 3, code blocks 64x64)
C4: issue135_vsc.jp2 (issue135.j2k recoded with opj_compress -M 8, code blocks 64x64)
C5: issue135_raw.jp2 (issue135.j2k recoded with opj_compress -M 1, code blocks 64x64)
C6: S2A_OPER_MSI_L1C_TL_MTI__20150819T171650_A000763_T30SWE_B05.jp2 (Sentinel 2 tile, 5490x5490, 1 band, 12 bits, code blocks 64x64)

Builds were done with -DCMAKE_BUILD_TYPE=Release. The time measured is the smaller of 2 consecutive runs of "opj_decompress -i $(INPUT_FILE) -o /tmp/out.ppm", as reported in its "decode time: XXX ms" line.

Machine & OS spec: Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz, Linux 64 bit

| compiler | C1 before (ms) | C1 after (ms) | delta % | C2 before (ms) | C2 after (ms) | delta % | C3 before (ms) | C3 after (ms) | delta % | C4 before (ms) | C4 after (ms) | delta % | C5 before (ms) | C5 after (ms) | delta % | C6 before (ms) | C6 after (ms) | delta % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCC 4.4.3 | 799 | 730 | -8.6 | 1710 | 1559 | -8.8 | 4390 | 3960 | -9.8 | 2670 | 2070 | -22.5 | 1990 | 1589 | -20.2 | 4250 | 3770 | -11.3 |
| GCC 4.6.2 | 830 | 720 | -13.3 | 1700 | 1480 | -12.9 | 4620 | 3860 | -16.5 | 2469 | 2070 | -16.2 | 1930 | 1490 | -22.8 | 4400 | 3710 | -15.7 |
| GCC 4.8.0 | 880 | 740 | -15.9 | 1790 | 1569 | -12.3 | 4820 | 4040 | -16.2 | 2630 | 2120 | -19.4 | 2130 | 1530 | -28.2 | 4700 | 3880 | -17.4 |
| GCC 5.2.0 | 860 | 740 | -14.0 | 1720 | 1569 | -8.8 | 4630 | 4160 | -10.2 | 2480 | 2050 | -17.3 | 1859 | 1569 | -15.6 | 4520 | 3830 | -15.3 |
| GCC 5.3.0 | 850 | 730 | -14.1 | 1730 | 1569 | -9.3 | 4640 | 4010 | -13.6 | 2480 | 2040 | -17.7 | 1850 | 1569 | -15.2 | 4530 | 3840 | -15.2 |
| CLang 3.7.0 | 809 | 730 | -9.8 | 1670 | 1560 | -6.6 | 4510 | 4090 | -9.3 | 2770 | 2430 | -12.3 | 1859 | 1549 | -16.7 | 4390 | 3920 | -10.7 |

Same with a 32-bit build (-m32):

| compiler | C1 before (ms) | C1 after (ms) | delta % | C2 before (ms) | C2 after (ms) | delta % | C3 before (ms) | C3 after (ms) | delta % | C4 before (ms) | C4 after (ms) | delta % | C5 before (ms) | C5 after (ms) | delta % | C6 before (ms) | C6 after (ms) | delta % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCC 4.4 | 1100 | 989 | -10.1 | 2129 | 2069 | -2.8 | 5570 | 4900 | -12.0 | 3489 | 2719 | -22.1 | 2479 | 2149 | -13.3 | 5429 | 4740 | -12.7 |
| GCC 4.6.2 | 950 | 840 | -11.6 | 1940 | 1800 | -7.2 | 5210 | 4170 | -20.0 | 2950 | 2420 | -18.0 | 2280 | 1859 | -18.5 | 5110 | 4019 | -21.4 |
| GCC 4.8.0 | 1000 | 809 | -19.1 | 2060 | 1810 | -12.1 | 5560 | 4570 | -17.8 | 2960 | 2380 | -19.6 | 2500 | 1950 | -22.0 | 5450 | 4360 | -20.0 |
| GCC 5.2.0 | 909 | 779 | -14.3 | 1839 | 1700 | -7.6 | 4880 | 4230 | -13.3 | 2680 | 2340 | -12.7 | 2050 | 1770 | -13.7 | 4810 | 4090 | -15.0 |
| GCC 5.3.0 | 909 | 789 | -13.2 | 1830 | 1710 | -6.6 | 4860 | 4240 | -12.8 | 2690 | 2300 | -14.5 | 2070 | 1760 | -15.0 | 4800 | 4070 | -15.2 |
| CLang 3.7.0 | 980 | 850 | -13.3 | 2009 | 1740 | -13.4 | 5340 | 4490 | -15.9 | 3050 | 2710 | -11.1 | 2160 | 1799 | -16.7 | 5200 | 4300 | -17.3 |

This work has been funded by Planet Labs.

c0nk and others added 12 commits May 21, 2016 15:18
Allow these hot functions to be inlined. This boosts decode performance by ~10%.
We can avoid using a look-up table with some shift arithmetic.
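As a toy illustration of this commit's technique (the table below is illustrative, not the one actually touched by the commit): when a small look-up table's entries are a simple function of the index, the arithmetic itself can replace the table, removing a memory load, and a possible cache miss, from the hot loop.

```c
#include <stdint.h>

/* Toy table whose entry at index i is simply 2*i + 1. */
static const uint8_t lut_times2_plus1[8] = {1, 3, 5, 7, 9, 11, 13, 15};

static uint8_t via_lut(uint32_t v)
{
    return lut_times2_plus1[v & 7];        /* memory access */
}

static uint8_t via_shift(uint32_t v)
{
    return (uint8_t)(((v & 7) << 1) | 1);  /* pure register arithmetic */
}
```

Both functions return identical results for every input; only the shift version keeps the computation entirely in registers.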
Add an opj_t1_dec_clnpass_step_only_if_flag_not_sig_visit() method that
does the job of opj_t1_dec_clnpass_step_only(), assuming the conditions
are met, and use it in opj_t1_dec_clnpass(). The compiler generates
more efficient code.
This is essentially used as a shift into lut_ctxno_zc, which we can
precompute at the beginning of opj_t1_decode_cblk() /
opj_t1_encode_cblk().
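The idea of this commit can be sketched as hoisting the invariant shift out of the per-sample lookup (the table and function names below are stand-ins, not the real lut_ctxno_zc code):

```c
#include <stdint.h>

/* Stand-in for lut_ctxno_zc: 4 orientations x 16 neighbourhood patterns. */
static const uint8_t lut_zc[4 * 16] = {0};

/* Naive form: the orientation shift is recomputed on every lookup. */
static uint8_t ctx_per_lookup(uint32_t orient, uint32_t neigh)
{
    return lut_zc[(orient << 4) | (neigh & 15)];
}

/* Hoisted form: the shift is baked into a base pointer once per block. */
static uint8_t ctx_precomputed(const uint8_t *lut_base, uint32_t neigh)
{
    return lut_base[neigh & 15];
}
```

At the start of decoding one would compute `const uint8_t *lut_base = lut_zc + (orient << 4);` once per code block, mirroring what the commit does at the beginning of opj_t1_decode_cblk() / opj_t1_encode_cblk().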
Add a flag array such that colflags[1+0] holds the state of col=0, rows 0..3,
colflags[1+1] that of col=1, rows 0..3, colflags[1+flags_stride] that of col=0, rows 4..7, etc.
This array avoids excessive cache thrashing when processing 4 vertical samples
at a time, as done in the various decoding steps.
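The layout described above can be sketched like this (a hedged illustration: 4 state bits per row is assumed here; the exact bit assignment in the patch may differ):

```c
#include <stdint.h>

/* One 16-bit word summarizes 4 vertically adjacent samples. */
enum { COLFLAG_ROW_BITS = 4 };

static void colflag_mark(uint16_t *colflag, unsigned row, uint16_t bits)
{
    *colflag |= (uint16_t)(bits << (row * COLFLAG_ROW_BITS));
}

static int colflag_all_clear(uint16_t colflag)
{
    return colflag == 0;   /* one compare covers 4 rows at once */
}
```

A decoding step can then skip a whole column of 4 samples with a single cached 16-bit load and compare, instead of four scattered reads of the main flags array.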
```c
}
} /* VSC and BYPASS by Antonin */

static void opj_t1_dec_sigpass_mqc(
#define opj_t1_dec_sigpass_mqc_internal(t1, bpno, w, h, flags_stride) \
```

Hi, is it possible to use an inline function here so that debugging is easier? Like what was done in ae1da37.

Timings will have to be checked after that of course.
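The suggestion can be illustrated in miniature (the names below are illustrative, not the OpenJPEG code): at -O2 a static inline function typically compiles to the same code as a function-like macro, while keeping a symbol and line information that debuggers can step through.

```c
#include <stdint.h>

/* Macro version: inlined by textual substitution, invisible to debuggers. */
#define DEC_STEP_MACRO(flags, bit) ((flags) | (1u << (bit)))

/* Inline-function version: same generated code when optimized, but it is
 * type-checked, evaluates its arguments exactly once, and is steppable. */
static inline uint32_t dec_step_inline(uint32_t flags, unsigned bit)
{
    return flags | (1u << bit);
}
```

As the comment notes, timings would of course need to be re-checked after such a change.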

@detonin detonin merged commit 7092f7e into uclouvain:master Sep 13, 2016