Quick experiment regarding fast initialization of one TypedArray from another of a different type, e.g.

```js
const inp = new Float32Array(len);
const out = new Uint16Array(inp);
```

or

```js
const inp = new Int8Array(len);
const out = new Float32Array(len);
out.set(inp);
```
v8 does scalar conversion with a generic conversion routine; this module uses 256-bit-wide SIMD conversions with specialized routines per type pair. ECMAScript's conversion rules don't all match Intel's instructions well, and some have to be implemented in software. (Keep that in mind if you're doing something that needs fast conversions and don't need to adhere to the ECMAScript rules; see for example the note on Float32Array to Int32Array below.)
- Float/double conversions are correct and fast (Intel's instructions match the ECMA spec).
  - ✔️ `Float64Array` to `Float32Array`
  - ✔️ `Float32Array` to `Float64Array`
- Float to integer conversions require specializations; only the Int32Array target is done.
  - ✔️ `Float32Array` to `Int32Array`
  - ✔️ `Float64Array` to `Int32Array`

  These are correct and fairly fast (depending on the values, much faster than v8). Intel's instructions don't match ECMA262 exactly: ECMA262 specifies that NaN, +Infinity and -Infinity convert to 0, and that values wrap around on overflow, whereas Intel's `cvt[t]ps2dq` returns 0x80000000 (-2147483648) in those cases. (Also, ECMA262's `ToInt32` does not match the behavior of `static_cast<int32_t>` in C++.) Right now this module has a fast path for when the instruction matches the spec (better than v8's fast path, see TODO below), and a slow scalar path to fix up values that don't. AVX512's `vfixupimmps` is potentially useful here but not widely available. I have no use case for this conversion, but if someone else does, would it be useful to offer a fast conversion that doesn't follow the ECMA262 spec and instead just passes through Intel's instruction behavior?
TODO: I think there's a missed optimization in v8's `DoubleToInt32`. Their fast path requires this condition:

```cpp
static_cast<double>(static_cast<int32_t>(double_input)) == double_input
```

but I think it should be

```cpp
static_cast<double>(static_cast<int32_t>(double_input)) == trunc(double_input)
```
  - ❌ `Float32Array` to `Uint32Array` (AVX512)
  - ❌ `Float32Array` to `Int16Array` (SSE 4-at-a-time)
  - ❌ `Float32Array` to `Uint16Array`
  - ❌ `Float32Array` to `Int8Array` (SSE 4-at-a-time)
  - ❌ `Float32Array` to `Uint8Array`
  - ❌ `Float64Array` to `Uint32Array`
  - ❌ `Float64Array` to `Int16Array`
  - ❌ `Float64Array` to `Uint16Array`
  - ❌ `Float64Array` to `Int8Array`
  - ❌ `Float64Array` to `Uint8Array`

  These require in-software specializations.
- Integer to float conversions are correct and fast, with two exceptions.
  - ✔️ `Int32Array` to `Float64Array`
  - ✔️ `Int32Array` to `Float32Array`
  - ✔️ `Int16Array` to `Float32Array`
  - ✔️ `Uint16Array` to `Float32Array`
  - ✔️ `Int8Array` to `Float32Array`
  - ✔️ `Uint8Array` to `Float32Array`
  - ❌ `Uint32Array` to `Float32Array`
  - ❌ `Uint32Array` to `Float64Array`

  The two ❌ conversions require either AVX512 or in-software specializations.
- Widening integer conversions are correct and fast.
  - ✔️ `Int16Array` to `Int32Array`
  - ✔️ `Int16Array` to `Uint32Array`
  - ✔️ `Uint16Array` to `Int32Array`
  - ✔️ `Uint16Array` to `Uint32Array`
  - ✔️ `Int8Array` to `Int32Array`
  - ✔️ `Int8Array` to `Uint32Array`
  - ✔️ `Int8Array` to `Int16Array`
  - ✔️ `Int8Array` to `Uint16Array`
  - ✔️ `Uint8Array` to `Int32Array`
  - ✔️ `Uint8Array` to `Uint32Array`
  - ✔️ `Uint8Array` to `Int16Array`
  - ✔️ `Uint8Array` to `Uint16Array`
- Unsigned/signed conversions are just `memcpy()`s (reinterpretations of the same bit strings). v8 is already fast here, so this module passes through to `dst.set(src)`.
  - ✔️ `Int32Array` to `Uint32Array`
  - ✔️ `Uint32Array` to `Int32Array`
  - ✔️ `Int16Array` to `Uint16Array`
  - ✔️ `Uint16Array` to `Int16Array`
  - ✔️ `Int8Array` to `Uint8Array`
  - ✔️ `Uint8Array` to `Int8Array`
- Narrowing integer conversions are correct and fast.
  - ✔️ `Int32Array` to `Int16Array`
  - ✔️ `Int32Array` to `Int8Array`
  - ✔️ `Uint32Array` to `Int16Array`
  - ✔️ `Uint32Array` to `Uint16Array`
  - ✔️ `Uint32Array` to `Int8Array`
  - ✔️ `Uint32Array` to `Uint8Array`
  - ✔️ `Int16Array` to `Int8Array`
  - ✔️ `Int16Array` to `Uint8Array`
  - ✔️ `Uint16Array` to `Int8Array`
  - ✔️ `Uint16Array` to `Uint8Array`
Conversions that aren't implemented pass through to `dst.set(src)`.

Run `node ./test.js --benchmark`. Numbers are `dst.set(src)` (v8) ÷ `set(dst, src)` (this module), so higher means this module is faster. The diagonal should be 1 or slightly less than 1; its deviation from 1 gives an estimate of the noise in the benchmark. Conversions that are actually expected to be faster are marked with asterisks below.
Linux/GCC8
┌──────────────┬──────────────┬──────────────┬────────────┬─────────────┬────────────┬─────────────┬───────────┬────────────┐
│ from \ to │ Float64Array │ Float32Array │ Int32Array │ Uint32Array │ Int16Array │ Uint16Array │ Int8Array │ Uint8Array │
├──────────────┼──────────────┼──────────────┼────────────┼─────────────┼────────────┼─────────────┼───────────┼────────────┤
│ Float64Array │ 0.85 │ *4.19* │ *6.05* │ 0.97 │ 1.01 │ 0.98 │ 0.96 │ 1.02 │
│ Float32Array │ *4.46* │ 1.06 │ *22.63* │ 1.02 │ 0.99 │ 0.99 │ 1.00 │ 1.01 │
│ Int32Array │ *4.43* │ *7.18* │ 1.06 │ 1.06 │ *13.71* │ *14.65* │ *10.57* │ *7.19* │
│ Uint32Array │ 1.10 │ 0.93 │ 1.48 │ 0.99 │ *11.53* │ *10.56* │ *12.11* │ *11.95* │
│ Int16Array │ *5.76* │ *5.94* │ *9.67* │ *10.84* │ 0.96 │ 1.00 │ *21.12* │ *16.02* │
│ Uint16Array │ *4.72* │ *9.93* │ *10.60* │ *12.06* │ 1.02 │ 1.05 │ *18.54* │ *15.09* │
│ Int8Array │ *2.77* │ *12.96* │ *11.74* │ *10.85* │ *25.11* │ *21.40* │ 1.05 │ 0.75 │
│ Uint8Array │ *6.38* │ *10.49* │ *12.32* │ *9.86* │ *20.77* │ *16.01* │ 0.88 │ 0.90 │
└──────────────┴──────────────┴──────────────┴────────────┴─────────────┴────────────┴─────────────┴───────────┴────────────┘
Windows/MSVS 2017
┌──────────────┬──────────────┬──────────────┬────────────┬─────────────┬────────────┬─────────────┬───────────┬────────────┐
│ from \ to │ Float64Array │ Float32Array │ Int32Array │ Uint32Array │ Int16Array │ Uint16Array │ Int8Array │ Uint8Array │
├──────────────┼──────────────┼──────────────┼────────────┼─────────────┼────────────┼─────────────┼───────────┼────────────┤
│ Float64Array │ 1.04 │ *4.64* │ *9.21* │ 1.02 │ 1.08 │ 0.92 │ 0.95 │ 0.96 │
│ Float32Array │ *4.16* │ 1.09 │ *35.19* │ 1.00 │ 1.04 │ 0.94 │ 1.01 │ 1.03 │
│ Int32Array │ *4.49* │ *6.91* │ 1.05 │ 0.98 │ *8.57* │ *11.02* │ *9.43* │ *9.26* │
│ Uint32Array │ 0.98 │ 1.25 │ 1.02 │ 0.98 │ *7.32* │ *8.28* │ *8.45* │ *5.30* │
│ Int16Array │ *3.68* │ *9.44* │ *8.28* │ *8.42* │ 0.95 │ 0.91 │ *8.94* │ *13.77* │
│ Uint16Array │ *5.18* │ *7.81* │ *9.03* │ *7.80* │ 1.02 │ 0.80 │ *16.33* │ *9.17* │
│ Int8Array │ *6.21* │ *9.95* │ *8.10* │ *7.27* │ *14.54* │ *9.84* │ 0.97 │ 1.02 │
│ Uint8Array │ *3.82* │ *9.61* │ *9.46* │ *9.41* │ *14.75* │ *14.69* │ 1.01 │ 0.91 │
└──────────────┴──────────────┴──────────────┴────────────┴─────────────┴────────────┴─────────────┴───────────┴────────────┘
Note: the Float32Array to Int32Array benchmark has almost no cases of overflow or other fixup; actual runtime depends on the numerical values in the array.
- The `offset` parameter is ignored.
- The source/destination length must be a multiple of 8, 16 or 32. (That is, I've only dealt with the vectorized loop body, not the tail.)
- AVX2 is required. Most or all of these conversions could be done with earlier extension sets, albeit with narrower vectors, but since this library is just for fun I have no intention of adding e.g. an SSE4.2 version.
Was a fun weekend project. I have no idea if anyone ever uses these conversions.