This specification describes a Single Instruction Multiple Data (SIMD) instruction set that can be implemented efficiently on current popular instruction set architectures. It provides shared semantics for WebAssembly and SIMD.js.
The types used in this specification can be concrete or abstract. Concrete types have a defined representation as a bit pattern, while abstract types are simply a set of allowed values.
The concrete scalar integer types are not interpreted as either signed or unsigned integers.
i8
: An 8-bit integer with bits numbered 0–7.

i16
: A 16-bit integer with bits numbered 0–15.

i32
: A 32-bit integer with bits numbered 0–31.

i64
: A 64-bit integer with bits numbered 0–63.
The concrete scalar floating-point types follow the encoding and semantics of the IEEE 754-2008 standard for floating-point arithmetic. See the Floating-point semantics section for details and exceptions.
f32
: A floating-point number in the IEEE binary32 interchange format.

f64
: A floating-point number in the IEEE binary64 interchange format.
The following abstract types don't have a specified representation as a bit pattern:
boolean
: Either true or false.

LaneIdx2
: An integer in the range 0–1 identifying a lane.

LaneIdx4
: An integer in the range 0–3 identifying a lane.

LaneIdx8
: An integer in the range 0–7 identifying a lane.

LaneIdx16
: An integer in the range 0–15 identifying a lane.

LaneIdx32
: An integer in the range 0–31 identifying a lane.
All of the numerical SIMD types have a concrete mapping to a 128-bit representation. The boolean types do not have a bit-pattern representation.
v128
: A 128-bit SIMD vector. Bits are numbered 0–127.

b8x16
: A vector of 16 boolean lanes numbered 0–15.

b16x8
: A vector of 8 boolean lanes numbered 0–7.

b32x4
: A vector of 4 boolean lanes numbered 0–3.

b64x2
: A vector of 2 boolean lanes numbered 0–1.
The v128
type corresponds to a vector register in a typical SIMD ISA. The
interpretation of the 128 bits in the vector register is provided by the
individual instructions.
The abstract boolean vector types can be mapped to vector registers or predicate
registers by an implementation. They have a property S.Lanes
which is used by
the pseudo-code below:
S | S.Lanes
---|---
b8x16 | 16
b16x8 | 8
b32x4 | 4
b64x2 | 2
The single v128
SIMD type can represent packed data in multiple ways.
Instructions specify how the bits should be interpreted through a hierarchy of
interpretations.
The boolean vector types only have the one interpretation given by their type.
The first level of interpretations of the v128
type impose a lane structure on
the bits:
v8x16 : v128
: 8-bit lanes numbered 0–15. Lane n corresponds to bits 8n – 8n+7.

v16x8 : v128
: 16-bit lanes numbered 0–7. Lane n corresponds to bits 16n – 16n+15.

v32x4 : v128
: 32-bit lanes numbered 0–3. Lane n corresponds to bits 32n – 32n+31.

v64x2 : v128
: 64-bit lanes numbered 0–1. Lane n corresponds to bits 64n – 64n+63.
The lane-dividing interpretations don't say anything about the semantics of the bits in each lane. The interpretations have properties used by the semantic specification pseudo-code below:
S | S.LaneBits | S.Lanes | S.BoolType
---|---|---|---
v8x16 | 8 | 16 | b8x16
v16x8 | 16 | 8 | b16x8
v32x4 | 32 | 4 | b32x4
v64x2 | 64 | 2 | b64x2
The bits in a lane can be interpreted as integers with modulo arithmetic semantics. Many arithmetic operations can be defined on these types which don't impose a signed or unsigned integer interpretation.
i8x16 : v8x16
: Each lane is an i8.

i16x8 : v16x8
: Each lane is an i16.

i32x4 : v32x4
: Each lane is an i32.

i64x2 : v64x2
: Each lane is an i64.
Additional properties:
S | S.LaneType
---|---
i8x16 | i8
i16x8 | i16
i32x4 | i32
i64x2 | i64
Each lane is interpreted as a two's complement integer.
s8x16 : i8x16
: Lane values in the range -2^7 – 2^7-1.

s16x8 : i16x8
: Lane values in the range -2^15 – 2^15-1.

s32x4 : i32x4
: Lane values in the range -2^31 – 2^31-1.

s64x2 : i64x2
: Lane values in the range -2^63 – 2^63-1.
These interpretations get additional properties defining the range of values in a lane:
S | S.Min | S.Max
---|---|---
s8x16 | -2^7 | 2^7-1
s16x8 | -2^15 | 2^15-1
s32x4 | -2^31 | 2^31-1
s64x2 | -2^63 | 2^63-1
Each lane is interpreted as an unsigned integer.
u8x16 : i8x16
: Lane values in the range 0 – 2^8-1.

u16x8 : i16x8
: Lane values in the range 0 – 2^16-1.

u32x4 : i32x4
: Lane values in the range 0 – 2^32-1.

u64x2 : i64x2
: Lane values in the range 0 – 2^64-1.
These interpretations get additional properties defining the range of values in a lane:
S | S.Min | S.Max
---|---|---
u8x16 | 0 | 2^8-1
u16x8 | 0 | 2^16-1
u32x4 | 0 | 2^32-1
u64x2 | 0 | 2^64-1
Each lane is interpreted as an IEEE floating-point number.
f32x4 : v32x4
: Each lane is an f32.

f64x2 : v64x2
: Each lane is an f64.
Additional properties:
S | S.LaneType
---|---
f32x4 | f32
f64x2 | f64
The floating-point operations in this specification aim to be conforming to IEEE 754-2008 while being compatible with WebAssembly and JavaScript. Some things which are left unspecified by the IEEE standard are given stricter semantics by WebAssembly.
This specification does not yet provide a way of changing floating point rounding modes. All floating point operations use the default roundTiesToEven mode.
When a floating-point operation needs to return a NaN and none of its operands are NaN, it generates a default NaN value which is a quiet NaN with an all-zero payload field. The sign of the default NaN is not specified:
```python
def f32.default_nan():
    if unspecified_choice():
        bits = 0x7fc00000
    else:
        bits = 0xffc00000
    return f32.from_bits(bits)

def f64.default_nan():
    if unspecified_choice():
        bits = 0x7ff8000000000000
    else:
        bits = 0xfff8000000000000
    return f64.from_bits(bits)
```
When propagating a NaN value from an operand, all the bits of the NaN are preserved, except that a signaling NaN is quieted by setting the most significant bit of the trailing significand field.
```python
def canonicalize_nan(x):
    assert isnan(x)
    t = type(x)
    assert t == f32 or t == f64
    bits = x.to_bits()
    if t == f32:
        bits |= (1 << 22)
    else:
        bits |= (1 << 51)
    return t.from_bits(bits)
```
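The quieting step can be checked directly on bit patterns. A minimal Python sketch for binary32 (the helper name is illustrative, not part of the spec):

```python
def canonicalize_nan_f32(bits):
    # Quiet a binary32 NaN by setting bit 22, the most significant
    # bit of the trailing significand field. Other bits are preserved.
    return bits | (1 << 22)

snan = 0x7f800001  # signaling NaN: quiet bit clear, payload 1
print(hex(canonicalize_nan_f32(snan)))  # 0x7fc00001
```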
When two operands are NaN, one of them is propagated. Which one is not specified:
```python
def propagate_nan(x, y):
    assert isnan(x) or isnan(y)
    if not isnan(x):
        return canonicalize_nan(y)
    if not isnan(y):
        return canonicalize_nan(x)
    # Both x and y are NaNs: pick one to propagate.
    if unspecified_choice():
        return canonicalize_nan(x)
    else:
        return canonicalize_nan(y)
```
An implementation is allowed to flush subnormals in arithmetic floating-point operations. This means that any subnormal operand is treated as 0, and any subnormal result is rounded to 0.
Note that this differs from WebAssembly scalar floating-point semantics which require correct subnormal handling.
The SIMD operations described in this section are generally named S.Op, where S is either a SIMD type or one of the interpretations of a SIMD type.
Many operations are simply the lane-wise application of a scalar operation:
```python
def S.lanewise_unary(func, a):
    result = S.New()
    for i in range(S.Lanes):
        result[i] = func(a[i])
    return result

def S.lanewise_binary(func, a, b):
    result = S.New()
    for i in range(S.Lanes):
        result[i] = func(a[i], b[i])
    return result
```
Comparison operators produce a boolean vector:
```python
def S.lanewise_comparison(func, a, b):
    result = S.BoolType.New()
    for i in range(S.Lanes):
        result[i] = func(a[i], b[i])
    return result
```
```
b8x16.build(x: boolean[16]) -> b8x16
b16x8.build(x: boolean[8]) -> b16x8
b32x4.build(x: boolean[4]) -> b32x4
b64x2.build(x: boolean[2]) -> b64x2
i8x16.build(x: i8[16]) -> v128
i16x8.build(x: i16[8]) -> v128
i32x4.build(x: i32[4]) -> v128
i64x2.build(x: i64[2]) -> v128
f32x4.build(x: f32[4]) -> v128
f64x2.build(x: f64[2]) -> v128
```
Construct a vector from an array of individual lane values.
```python
def S.build(x):
    result = S.New()
    for i in range(S.Lanes):
        result[i] = x[i]
    return result
```
```
b8x16.splat(x: boolean) -> b8x16
b16x8.splat(x: boolean) -> b16x8
b32x4.splat(x: boolean) -> b32x4
b64x2.splat(x: boolean) -> b64x2
i8x16.splat(x: i8) -> v128
i16x8.splat(x: i16) -> v128
i32x4.splat(x: i32) -> v128
i64x2.splat(x: i64) -> v128
f32x4.splat(x: f32) -> v128
f64x2.splat(x: f64) -> v128
```
Construct a vector with x replicated to all lanes:

```python
def S.splat(x):
    result = S.New()
    for i in range(S.Lanes):
        result[i] = x
    return result
```
```
b8x16.extractLane(a: b8x16, i: LaneIdx16) -> boolean
b16x8.extractLane(a: b16x8, i: LaneIdx8) -> boolean
b32x4.extractLane(a: b32x4, i: LaneIdx4) -> boolean
b64x2.extractLane(a: b64x2, i: LaneIdx2) -> boolean
i8x16.extractLane(a: v128, i: LaneIdx16) -> i8
i16x8.extractLane(a: v128, i: LaneIdx8) -> i16
i32x4.extractLane(a: v128, i: LaneIdx4) -> i32
i64x2.extractLane(a: v128, i: LaneIdx2) -> i64
f32x4.extractLane(a: v128, i: LaneIdx4) -> f32
f64x2.extractLane(a: v128, i: LaneIdx2) -> f64
```
Extract the value of lane i in a.

```python
def S.extractLane(a, i):
    return a[i]
```
```
b8x16.replaceLane(a: b8x16, i: LaneIdx16, x: boolean) -> b8x16
b16x8.replaceLane(a: b16x8, i: LaneIdx8, x: boolean) -> b16x8
b32x4.replaceLane(a: b32x4, i: LaneIdx4, x: boolean) -> b32x4
b64x2.replaceLane(a: b64x2, i: LaneIdx2, x: boolean) -> b64x2
i8x16.replaceLane(a: v128, i: LaneIdx16, x: i8) -> v128
i16x8.replaceLane(a: v128, i: LaneIdx8, x: i16) -> v128
i32x4.replaceLane(a: v128, i: LaneIdx4, x: i32) -> v128
i64x2.replaceLane(a: v128, i: LaneIdx2, x: i64) -> v128
f32x4.replaceLane(a: v128, i: LaneIdx4, x: f32) -> v128
f64x2.replaceLane(a: v128, i: LaneIdx2, x: f64) -> v128
```
Return a new vector with lanes identical to a, except for lane i which has the value x.

```python
def S.replaceLane(a, i, x):
    result = S.New()
    for j in range(S.Lanes):
        result[j] = a[j]
    result[i] = x
    return result
```
```
v8x16.select(s: b8x16, t: v128, f: v128) -> v128
v16x8.select(s: b16x8, t: v128, f: v128) -> v128
v32x4.select(s: b32x4, t: v128, f: v128) -> v128
v64x2.select(s: b64x2, t: v128, f: v128) -> v128
```
Use a boolean vector to select lanes from two numerical vectors.
```python
def S.select(s, t, f):
    result = S.New()
    for i in range(S.Lanes):
        if s[i]:
            result[i] = t[i]
        else:
            result[i] = f[i]
    return result
```
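The lane-wise selection can be sketched on plain Python lists (the function name is illustrative):

```python
def select(s, t, f):
    # Pick each lane from t where the mask lane is true, else from f.
    return [ti if si else fi for si, ti, fi in zip(s, t, f)]

mask = [True, False, True, False]
print(select(mask, [1, 2, 3, 4], [9, 8, 7, 6]))  # [1, 8, 3, 6]
```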
```
v8x16.swizzle(a: v128, s: LaneIdx16[16]) -> v128
v16x8.swizzle(a: v128, s: LaneIdx8[8]) -> v128
v32x4.swizzle(a: v128, s: LaneIdx4[4]) -> v128
v64x2.swizzle(a: v128, s: LaneIdx2[2]) -> v128
```
Create a vector with its lanes rearranged:
```python
def S.swizzle(a, s):
    result = S.New()
    for i in range(S.Lanes):
        result[i] = a[s[i]]
    return result
```
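A swizzle on plain Python lists makes the indexing concrete, for example reversing or duplicating lanes (illustrative sketch, not spec pseudo-code):

```python
def swizzle(a, s):
    # Lane i of the result is lane s[i] of the input.
    return [a[i] for i in s]

lanes = [10, 20, 30, 40]
print(swizzle(lanes, [3, 2, 1, 0]))  # [40, 30, 20, 10]
print(swizzle(lanes, [0, 0, 1, 1]))  # [10, 10, 20, 20]
```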
```
v8x16.shuffle(a: v128, b: v128, s: LaneIdx32[16]) -> v128
v16x8.shuffle(a: v128, b: v128, s: LaneIdx16[8]) -> v128
v32x4.shuffle(a: v128, b: v128, s: LaneIdx8[4]) -> v128
v64x2.shuffle(a: v128, b: v128, s: LaneIdx4[2]) -> v128
```
Create a vector with lanes selected from the lanes of two input vectors:
```python
def S.shuffle(a, b, s):
    result = S.New()
    for i in range(S.Lanes):
        if s[i] < S.Lanes:
            result[i] = a[s[i]]
        else:
            result[i] = b[s[i] - S.Lanes]
    return result
```
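A two-input shuffle can be sketched on lists; indices below the lane count select from the first vector, the rest from the second (illustrative sketch):

```python
def shuffle(a, b, s):
    # Indices 0..len(a)-1 select from a; larger indices select from b.
    n = len(a)
    return [a[i] if i < n else b[i - n] for i in s]

a, b = [0, 1, 2, 3], [4, 5, 6, 7]
# Interleaving the low halves gives the classic unpack-low pattern.
print(shuffle(a, b, [0, 4, 1, 5]))  # [0, 4, 1, 5]
```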
Wrapping integer arithmetic discards the high bits of the result.
```python
def S.Reduce(x):
    bitmask = (1 << S.LaneBits) - 1
    return x & bitmask
```
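The wrapping semantics can be exercised directly in Python, where integers are unbounded and the mask performs the modulo reduction:

```python
def reduce(x, lane_bits):
    # Keep only the low lane_bits of the result (modulo arithmetic).
    return x & ((1 << lane_bits) - 1)

# 8-bit lanes: 200 + 100 = 300 wraps to 44.
print(reduce(200 + 100, 8))  # 44
# Negation also wraps: -1 becomes the all-ones pattern 255.
print(reduce(-1, 8))         # 255
```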
There is no integer division operation provided here. This operation is not commonly part of 128-bit SIMD ISAs.
```
i8x16.add(a: v128, b: v128) -> v128
i16x8.add(a: v128, b: v128) -> v128
i32x4.add(a: v128, b: v128) -> v128
i64x2.add(a: v128, b: v128) -> v128
```

Lane-wise wrapping integer addition:

```python
def S.add(a, b):
    def add(x, y):
        return S.Reduce(x + y)
    return S.lanewise_binary(add, a, b)
```
```
i8x16.sub(a: v128, b: v128) -> v128
i16x8.sub(a: v128, b: v128) -> v128
i32x4.sub(a: v128, b: v128) -> v128
i64x2.sub(a: v128, b: v128) -> v128
```

Lane-wise wrapping integer subtraction:

```python
def S.sub(a, b):
    def sub(x, y):
        return S.Reduce(x - y)
    return S.lanewise_binary(sub, a, b)
```
```
i8x16.mul(a: v128, b: v128) -> v128
i16x8.mul(a: v128, b: v128) -> v128
i32x4.mul(a: v128, b: v128) -> v128
i64x2.mul(a: v128, b: v128) -> v128
```

Lane-wise wrapping integer multiplication:

```python
def S.mul(a, b):
    def mul(x, y):
        return S.Reduce(x * y)
    return S.lanewise_binary(mul, a, b)
```
```
i8x16.neg(a: v128) -> v128
i16x8.neg(a: v128) -> v128
i32x4.neg(a: v128) -> v128
i64x2.neg(a: v128) -> v128
```

Lane-wise wrapping integer negation. In wrapping arithmetic, y = -x is the unique value such that x + y == 0.

```python
def S.neg(a):
    def neg(x):
        return S.Reduce(-x)
    return S.lanewise_unary(neg, a)
```
Saturating integer arithmetic behaves differently on signed and unsigned types. It is only defined for 8-bit and 16-bit integer lanes.
```python
def S.Saturate(x):
    if x < S.Min:
        return S.Min
    if x > S.Max:
        return S.Max
    return x
```
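The clamp can be written in one line in Python; the examples below show the signed and unsigned 8-bit ranges (illustrative sketch):

```python
def saturate(x, lo, hi):
    # Clamp out-of-range results to the nearest representable value.
    return max(lo, min(hi, x))

# Signed 8-bit: 100 + 100 = 200 clamps to 127.
print(saturate(100 + 100, -128, 127))  # 127
# Unsigned 8-bit: 10 - 20 = -10 clamps to 0.
print(saturate(10 - 20, 0, 255))       # 0
```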
```
s8x16.addSaturate(a: v128, b: v128) -> v128
s16x8.addSaturate(a: v128, b: v128) -> v128
u8x16.addSaturate(a: v128, b: v128) -> v128
u16x8.addSaturate(a: v128, b: v128) -> v128
```

Lane-wise saturating addition:

```python
def S.addSaturate(a, b):
    def addsat(x, y):
        return S.Saturate(x + y)
    return S.lanewise_binary(addsat, a, b)
```
```
s8x16.subSaturate(a: v128, b: v128) -> v128
s16x8.subSaturate(a: v128, b: v128) -> v128
u8x16.subSaturate(a: v128, b: v128) -> v128
u16x8.subSaturate(a: v128, b: v128) -> v128
```

Lane-wise saturating subtraction:

```python
def S.subSaturate(a, b):
    def subsat(x, y):
        return S.Saturate(x - y)
    return S.lanewise_binary(subsat, a, b)
```
```
i8x16.shiftLeftByScalar(a: v128, y: i8) -> v128
i16x8.shiftLeftByScalar(a: v128, y: i8) -> v128
i32x4.shiftLeftByScalar(a: v128, y: i8) -> v128
i64x2.shiftLeftByScalar(a: v128, y: i8) -> v128
```

Shift the bits in each lane to the left by the same amount. Only the low bits of the shift amount are used:

```python
def S.shiftLeftByScalar(a, y):
    # Number of bits to shift: 0 .. S.LaneBits - 1.
    amount = y mod S.LaneBits
    def shift(x):
        return S.Reduce(x << amount)
    return S.lanewise_unary(shift, a)
```
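The modulo reduction of the shift count is easy to check in Python (illustrative sketch):

```python
def shift_left(x, count, lane_bits):
    # Only the low bits of the count are used: the count is reduced
    # modulo the lane width before shifting, and the result wraps.
    amount = count % lane_bits
    return (x << amount) & ((1 << lane_bits) - 1)

# Shifting an 8-bit lane by 9 behaves like shifting by 9 mod 8 = 1.
print(shift_left(0x40, 9, 8))  # 128
```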
```
s8x16.shiftRightByScalar(a: v128, y: i8) -> v128
s16x8.shiftRightByScalar(a: v128, y: i8) -> v128
s32x4.shiftRightByScalar(a: v128, y: i8) -> v128
s64x2.shiftRightByScalar(a: v128, y: i8) -> v128
u8x16.shiftRightByScalar(a: v128, y: i8) -> v128
u16x8.shiftRightByScalar(a: v128, y: i8) -> v128
u32x4.shiftRightByScalar(a: v128, y: i8) -> v128
u64x2.shiftRightByScalar(a: v128, y: i8) -> v128
```

Shift the bits in each lane to the right by the same amount. This is an arithmetic right shift for the signed integer interpretations and a logical right shift for the unsigned integer interpretations.

```python
def S.shiftRightByScalar(a, y):
    # Number of bits to shift: 0 .. S.LaneBits - 1.
    amount = y mod S.LaneBits
    def shift(x):
        return x >> amount
    return S.lanewise_unary(shift, a)
```
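The difference between the two shifts shows up on the same bit pattern: 0xF8 is -8 as a signed 8-bit lane but 248 unsigned (illustrative sketch):

```python
def shift_right_signed(x, count, lane_bits):
    # Arithmetic shift: Python's >> on a negative int already
    # replicates the sign bit.
    return x >> (count % lane_bits)

def shift_right_unsigned(x, count, lane_bits):
    # Logical shift: the lane is treated as unsigned, so zeros shift in.
    return (x & ((1 << lane_bits) - 1)) >> (count % lane_bits)

print(shift_right_signed(-8, 1, 8))     # -4
print(shift_right_unsigned(0xF8, 1, 8)) # 124
```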
The logical operations are defined on the boolean SIMD types. See also the Bitwise operations below.
```
b8x16.and(a: b8x16, b: b8x16) -> b8x16
b16x8.and(a: b16x8, b: b16x8) -> b16x8
b32x4.and(a: b32x4, b: b32x4) -> b32x4
b64x2.and(a: b64x2, b: b64x2) -> b64x2
```

```python
def S.and(a, b):
    def logical_and(x, y):
        return x and y
    return S.lanewise_binary(logical_and, a, b)
```
```
b8x16.or(a: b8x16, b: b8x16) -> b8x16
b16x8.or(a: b16x8, b: b16x8) -> b16x8
b32x4.or(a: b32x4, b: b32x4) -> b32x4
b64x2.or(a: b64x2, b: b64x2) -> b64x2
```

```python
def S.or(a, b):
    def logical_or(x, y):
        return x or y
    return S.lanewise_binary(logical_or, a, b)
```
```
b8x16.xor(a: b8x16, b: b8x16) -> b8x16
b16x8.xor(a: b16x8, b: b16x8) -> b16x8
b32x4.xor(a: b32x4, b: b32x4) -> b32x4
b64x2.xor(a: b64x2, b: b64x2) -> b64x2
```

```python
def S.xor(a, b):
    def logical_xor(x, y):
        return x xor y
    return S.lanewise_binary(logical_xor, a, b)
```
```
b8x16.not(a: b8x16) -> b8x16
b16x8.not(a: b16x8) -> b16x8
b32x4.not(a: b32x4) -> b32x4
b64x2.not(a: b64x2) -> b64x2
```

```python
def S.not(a):
    def logical_not(x):
        return not x
    return S.lanewise_unary(logical_not, a)
```
The same logical operations defined on the boolean types are also available on the v128 type, where they operate bitwise in the same way C's &, |, ^, and ~ operators work on an unsigned integer type.
```
v128.and(a: v128, b: v128) -> v128
v128.or(a: v128, b: v128) -> v128
v128.xor(a: v128, b: v128) -> v128
v128.not(a: v128) -> v128
```
These operations reduce all the lanes of a boolean vector to a single scalar boolean value.
```
b8x16.anyTrue(a: b8x16) -> boolean
b16x8.anyTrue(a: b16x8) -> boolean
b32x4.anyTrue(a: b32x4) -> boolean
b64x2.anyTrue(a: b64x2) -> boolean
```

These functions return true if any lane in a is true.

```python
def S.anyTrue(a):
    for i in range(S.Lanes):
        if a[i]:
            return true
    return false
```
```
b8x16.allTrue(a: b8x16) -> boolean
b16x8.allTrue(a: b16x8) -> boolean
b32x4.allTrue(a: b32x4) -> boolean
b64x2.allTrue(a: b64x2) -> boolean
```

These functions return true if all lanes in a are true.

```python
def S.allTrue(a):
    for i in range(S.Lanes):
        if not a[i]:
            return false
    return true
```
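On plain Python lists these reductions map directly onto the built-in any() and all() functions:

```python
# A 4-lane boolean vector with mixed values.
lanes = [True, False, True, True]
print(any(lanes))  # True  (anyTrue)
print(all(lanes))  # False (allTrue)
```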
The comparison operations all compare two vectors lane-wise, and produce a boolean vector with the same number of lanes as the input interpretation.
```
i8x16.equal(a: v128, b: v128) -> b8x16
i16x8.equal(a: v128, b: v128) -> b16x8
i32x4.equal(a: v128, b: v128) -> b32x4
i64x2.equal(a: v128, b: v128) -> b64x2
f32x4.equal(a: v128, b: v128) -> b32x4
f64x2.equal(a: v128, b: v128) -> b64x2
```

Integer equality is independent of the signed/unsigned interpretation. Floating point equality follows IEEE semantics, so a NaN lane compares not equal with anything, including itself, and +0.0 is equal to -0.0:

```python
def S.equal(a, b):
    def eq(x, y):
        return x == y
    return S.lanewise_comparison(eq, a, b)
```
```
i8x16.notEqual(a: v128, b: v128) -> b8x16
i16x8.notEqual(a: v128, b: v128) -> b16x8
i32x4.notEqual(a: v128, b: v128) -> b32x4
i64x2.notEqual(a: v128, b: v128) -> b64x2
f32x4.notEqual(a: v128, b: v128) -> b32x4
f64x2.notEqual(a: v128, b: v128) -> b64x2
```

The notEqual operations produce the inverse of their equal counterparts:

```python
def S.notEqual(a, b):
    def ne(x, y):
        return x != y
    return S.lanewise_comparison(ne, a, b)
```
```
s8x16.lessThan(a: v128, b: v128) -> b8x16
s16x8.lessThan(a: v128, b: v128) -> b16x8
s32x4.lessThan(a: v128, b: v128) -> b32x4
s64x2.lessThan(a: v128, b: v128) -> b64x2
u8x16.lessThan(a: v128, b: v128) -> b8x16
u16x8.lessThan(a: v128, b: v128) -> b16x8
u32x4.lessThan(a: v128, b: v128) -> b32x4
u64x2.lessThan(a: v128, b: v128) -> b64x2
f32x4.lessThan(a: v128, b: v128) -> b32x4
f64x2.lessThan(a: v128, b: v128) -> b64x2
```

Integer magnitude comparisons depend on the signed/unsigned interpretation of the lanes. Floating point comparisons follow IEEE semantics:

```python
def S.lessThan(a, b):
    def lt(x, y):
        return x < y
    return S.lanewise_comparison(lt, a, b)
```
```
s8x16.lessThanOrEqual(a: v128, b: v128) -> b8x16
s16x8.lessThanOrEqual(a: v128, b: v128) -> b16x8
s32x4.lessThanOrEqual(a: v128, b: v128) -> b32x4
s64x2.lessThanOrEqual(a: v128, b: v128) -> b64x2
u8x16.lessThanOrEqual(a: v128, b: v128) -> b8x16
u16x8.lessThanOrEqual(a: v128, b: v128) -> b16x8
u32x4.lessThanOrEqual(a: v128, b: v128) -> b32x4
u64x2.lessThanOrEqual(a: v128, b: v128) -> b64x2
f32x4.lessThanOrEqual(a: v128, b: v128) -> b32x4
f64x2.lessThanOrEqual(a: v128, b: v128) -> b64x2
```

```python
def S.lessThanOrEqual(a, b):
    def le(x, y):
        return x <= y
    return S.lanewise_comparison(le, a, b)
```
```
s8x16.greaterThan(a: v128, b: v128) -> b8x16
s16x8.greaterThan(a: v128, b: v128) -> b16x8
s32x4.greaterThan(a: v128, b: v128) -> b32x4
s64x2.greaterThan(a: v128, b: v128) -> b64x2
u8x16.greaterThan(a: v128, b: v128) -> b8x16
u16x8.greaterThan(a: v128, b: v128) -> b16x8
u32x4.greaterThan(a: v128, b: v128) -> b32x4
u64x2.greaterThan(a: v128, b: v128) -> b64x2
f32x4.greaterThan(a: v128, b: v128) -> b32x4
f64x2.greaterThan(a: v128, b: v128) -> b64x2
```

```python
def S.greaterThan(a, b):
    def gt(x, y):
        return x > y
    return S.lanewise_comparison(gt, a, b)
```
```
s8x16.greaterThanOrEqual(a: v128, b: v128) -> b8x16
s16x8.greaterThanOrEqual(a: v128, b: v128) -> b16x8
s32x4.greaterThanOrEqual(a: v128, b: v128) -> b32x4
s64x2.greaterThanOrEqual(a: v128, b: v128) -> b64x2
u8x16.greaterThanOrEqual(a: v128, b: v128) -> b8x16
u16x8.greaterThanOrEqual(a: v128, b: v128) -> b16x8
u32x4.greaterThanOrEqual(a: v128, b: v128) -> b32x4
u64x2.greaterThanOrEqual(a: v128, b: v128) -> b64x2
f32x4.greaterThanOrEqual(a: v128, b: v128) -> b32x4
f64x2.greaterThanOrEqual(a: v128, b: v128) -> b64x2
```

```python
def S.greaterThanOrEqual(a, b):
    def ge(x, y):
        return x >= y
    return S.lanewise_comparison(ge, a, b)
```
Load and store operations are provided for v128
vectors, but not for the
boolean vectors; we don't want to impose a bitwise representation of the boolean
vectors.
The memory operations work on an abstract Buffer
instance which can be
addressed by a ByteOffset
type. Unaligned memory operations are allowed, but
they may be slower than aligned operations.
This specification does not address bounds checking and trap handling for memory operations. It is assumed that the offsets addr .. addr+15 are valid in the buffer, and that computing addr+15 does not overflow the ByteOffset type. Bounds checking should be handled by the embedding specification.
```
v8x16.load(mem: Buffer, addr: ByteOffset) -> v128
v16x8.load(mem: Buffer, addr: ByteOffset) -> v128
v32x4.load(mem: Buffer, addr: ByteOffset) -> v128
v64x2.load(mem: Buffer, addr: ByteOffset) -> v128
```

Load a v128 vector from the given buffer and offset.

```python
def S.load(mem, addr):
    assert mem.in_range(addr, 16)
    result = S.New()
    lane_bytes = S.LaneBits / 8
    for i in range(S.Lanes):
        result[i] = mem.load(S.LaneBits, addr + i * lane_bytes)
    return result
```
```
v8x16.store(mem: Buffer, addr: ByteOffset, data: v128)
v16x8.store(mem: Buffer, addr: ByteOffset, data: v128)
v32x4.store(mem: Buffer, addr: ByteOffset, data: v128)
v64x2.store(mem: Buffer, addr: ByteOffset, data: v128)
```

Store a v128 vector to the given buffer and offset.

```python
def S.store(mem, addr, data):
    assert mem.in_range(addr, 16)
    lane_bytes = S.LaneBits / 8
    for i in range(S.Lanes):
        mem.store(S.LaneBits, addr + i * lane_bytes, data[i])
```
The lane-wise load and store operations used above will read and write a lane
using the native byte order, so for example storing a vector with the i32x4
interpretation is equivalent to storing 4 i32
values to memory. This
specification has some hard requirements for the lane and bit numbering:
- The bits in a v128 are numbered 0–127.
- Lanes are numbered in the same direction as the v128 bits.
- Lanes are stored in memory in ascending addresses, so lane 0 gets the lowest address.
These hard requirements still leave multiple ways of mapping byte order to vectors:
- Little-endian direct: The bit with the lowest number in each lane is the least significant bit. This is the natural mapping for Intel SSE and the little-endian modes of ARM NEON and MIPS MSA.
- Big-endian direct: The bit with the lowest number in each lane is the most significant bit. This is the natural mapping for big-endian PowerPC.
- Big-endian hybrid: The bit with the lowest number in each lane is the least significant bit. This is the natural mapping for the big-endian modes of ARM NEON and MIPS MSA.
The mapping is visible when reinterpreting a vector:
```python
a = i64x2.build([0x0123456789abcdef, 0x1122334455667788])
x = i8x16.extractLane(a, 0)
```
The extracted lane, x
, will be 0xef
in the little-endian direct and the
big-endian hybrid mappings, but 0x01
in the big-endian direct mapping.
The big-endian hybrid mapping requires separate load and store instructions for
each lane width, while the direct mappings can use the same instruction for all
vectors. For example, the a
vector above will be stored like this with the
big-endian hybrid mapping:
```
v64x2.store: 01 23 45 67 89 ab cd ef 11 22 33 44 55 66 77 88
v32x4.store: 89 ab cd ef 01 23 45 67 55 66 77 88 11 22 33 44
v16x8.store: cd ef 89 ab 45 67 01 23 77 88 55 66 33 44 11 22
v8x16.store: ef cd ab 89 67 45 23 01 88 77 66 55 44 33 22 11
```
The big-endian direct mapping would write a
like this:
```
v64x2.store: 01 23 45 67 89 ab cd ef 11 22 33 44 55 66 77 88
v32x4.store: 01 23 45 67 89 ab cd ef 11 22 33 44 55 66 77 88
v16x8.store: 01 23 45 67 89 ab cd ef 11 22 33 44 55 66 77 88
v8x16.store: 01 23 45 67 89 ab cd ef 11 22 33 44 55 66 77 88
```
The little-endian direct mapping would write a
like this:
```
v64x2.store: ef cd ab 89 67 45 23 01 88 77 66 55 44 33 22 11
v32x4.store: ef cd ab 89 67 45 23 01 88 77 66 55 44 33 22 11
v16x8.store: ef cd ab 89 67 45 23 01 88 77 66 55 44 33 22 11
v8x16.store: ef cd ab 89 67 45 23 01 88 77 66 55 44 33 22 11
```
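Under the little-endian direct mapping the byte dump above can be reproduced with ordinary scalar stores; a sketch using Python's struct module:

```python
import struct

# a = i64x2.build([0x0123456789abcdef, 0x1122334455667788])
# stored with v64x2.store under the little-endian direct mapping:
v64 = struct.pack('<2Q', 0x0123456789abcdef, 0x1122334455667788)
print(v64.hex())  # efcdab89674523018877665544332211

# Reinterpreting the same 16 bytes as i32x4 lanes:
print([hex(x) for x in struct.unpack('<4I', v64)])
# ['0x89abcdef', '0x1234567', '0x55667788', '0x11223344']
```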
This specification doesn't address type conversions since there is only one
type, v128
, but note that it is common for more fine-grained SIMD type systems
to specify 'bit casts' between different SIMD types of the same size as
equivalent to storing one type and loading another from the same address. Both
LLVM and SIMD.js specify bit casts that way. LLVM's ARM and MIPS targets use the
hybrid lane mapping in their big-endian modes and translate bitcast
instructions to shuffles.
It would be possible for SIMD.js to use the big-endian direct mapping on ARM and
MIPS by numbering the lanes differently and using the 64x2
load/store
instructions for all memory operations. It would also be possible to use the
big-endian hybrid mapping by expanding bit casts into shuffles.
WebAssembly is little-endian only.
```
v32x4.load1(mem: Buffer, addr: ByteOffset) -> v128
v32x4.load2(mem: Buffer, addr: ByteOffset) -> v128
v32x4.load3(mem: Buffer, addr: ByteOffset) -> v128
```

These functions load the first 1, 2, or 3 lanes from a buffer and set the remaining lanes to zero. The partial loads are only defined for 4-lane interpretations.

```python
def partial_load(mem, addr, lanes):
    result = v32x4.splat(0)
    for i in range(lanes):
        result[i] = mem.load(32, addr + i * 4)
    return result

def v32x4.load1(mem, addr):
    assert mem.in_range(addr, 4)
    return partial_load(mem, addr, 1)

def v32x4.load2(mem, addr):
    assert mem.in_range(addr, 8)
    return partial_load(mem, addr, 2)

def v32x4.load3(mem, addr):
    assert mem.in_range(addr, 12)
    return partial_load(mem, addr, 3)
```
```
v32x4.store1(mem: Buffer, addr: ByteOffset, data: v128)
v32x4.store2(mem: Buffer, addr: ByteOffset, data: v128)
v32x4.store3(mem: Buffer, addr: ByteOffset, data: v128)
```

These functions store the first 1, 2, or 3 lanes to a buffer. They are only defined for the 4-lane interpretations.

```python
def partial_store(mem, addr, data, lanes):
    for i in range(lanes):
        mem.store(32, addr + i * 4, data[i])

def v32x4.store1(mem, addr, data):
    assert mem.in_range(addr, 4)
    partial_store(mem, addr, data, 1)

def v32x4.store2(mem, addr, data):
    assert mem.in_range(addr, 8)
    partial_store(mem, addr, data, 2)

def v32x4.store3(mem, addr, data):
    assert mem.in_range(addr, 12)
    partial_store(mem, addr, data, 3)
```
These floating point operations are simple manipulations of the sign bit. No changes are made to the exponent or trailing significand bits, even for NaN inputs.
```
f32x4.neg(a: v128) -> v128
f64x2.neg(a: v128) -> v128
```

Apply the IEEE negate(x) function to each lane. This simply inverts the sign bit, preserving all other bits.

```python
def S.neg(a):
    return S.lanewise_unary(ieee.negate, a)
```
```
f32x4.abs(a: v128) -> v128
f64x2.abs(a: v128) -> v128
```

Apply the IEEE abs(x) function to each lane. This simply clears the sign bit, preserving all other bits.

```python
def S.abs(a):
    return S.lanewise_unary(ieee.abs, a)
```
These operations are not part of the IEEE 754-2008 standard. Notably, the
minNum
and maxNum
operations defined here behave differently than the IEEE
minNum
and maxNum
operations when one operand is a signaling NaN.
The minimum and maximum value of +0 and -0 is computed as if -0 < +0.
```
f32x4.min(a: v128, b: v128) -> v128
f64x2.min(a: v128, b: v128) -> v128
```

Lane-wise minimum value, propagating NaNs:

```python
def S.min(a, b):
    def min(x, y):
        if isnan(x) or isnan(y):
            return propagate_nan(x, y)
        # Prefer -0 for min(-0, +0) and min(+0, -0).
        if x == 0 and y == 0 and signbit(x) != signbit(y):
            return -0.0
        if x < y:
            return x
        else:
            return y
    return S.lanewise_binary(min, a, b)
```
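The zero-handling rule is the part that differs from a naive comparison; Python's built-in min() treats +0.0 and -0.0 as equal and returns its first argument, while the SIMD semantics prefer -0.0 (illustrative sketch):

```python
import math

def zero_aware_min(x, y):
    # Unlike Python's built-in min(), treat -0.0 as smaller than +0.0.
    if x == 0.0 and y == 0.0 and math.copysign(1.0, x) != math.copysign(1.0, y):
        return -0.0
    return x if x < y else y

print(zero_aware_min(0.0, -0.0))  # -0.0
print(min(0.0, -0.0))             # 0.0 (builtin returns its first argument)
```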
```
f32x4.max(a: v128, b: v128) -> v128
f64x2.max(a: v128, b: v128) -> v128
```

Lane-wise maximum value, propagating NaNs:

```python
def S.max(a, b):
    def max(x, y):
        if isnan(x) or isnan(y):
            return propagate_nan(x, y)
        # Prefer +0 for max(-0, +0) and max(+0, -0).
        if x == 0 and y == 0 and signbit(x) != signbit(y):
            return +0.0
        if x > y:
            return x
        else:
            return y
    return S.lanewise_binary(max, a, b)
```
```
f32x4.minNum(a: v128, b: v128) -> v128
f64x2.minNum(a: v128, b: v128) -> v128
```

Lane-wise minimum value, suppressing single NaNs:

```python
def S.minNum(a, b):
    def minNum(x, y):
        if isnan(x) and isnan(y):
            return propagate_nan(x, y)
        if isnan(x):
            return y
        if isnan(y):
            return x
        # Prefer -0 for min(-0, +0) and min(+0, -0).
        if x == 0 and y == 0 and signbit(x) != signbit(y):
            return -0.0
        if x < y:
            return x
        else:
            return y
    return S.lanewise_binary(minNum, a, b)
```
Note that this function behaves differently than the IEEE 754 minNum
function
when one of the operands is a signaling NaN.
```
f32x4.maxNum(a: v128, b: v128) -> v128
f64x2.maxNum(a: v128, b: v128) -> v128
```

Lane-wise maximum value, suppressing single NaNs:

```python
def S.maxNum(a, b):
    def maxNum(x, y):
        if isnan(x) and isnan(y):
            return propagate_nan(x, y)
        if isnan(x):
            return y
        if isnan(y):
            return x
        # Prefer +0 for max(-0, +0) and max(+0, -0).
        if x == 0 and y == 0 and signbit(x) != signbit(y):
            return +0.0
        if x > y:
            return x
        else:
            return y
    return S.lanewise_binary(maxNum, a, b)
```
Note that this function behaves differently than the IEEE 754 maxNum
function
when one of the operands is a signaling NaN.
The floating-point arithmetic operations handle NaNs in a more strictly specified way than the IEEE standard requires:
```python
def wrap_fp_unary(func):
    def wrapped(x):
        if isnan(x):
            return canonicalize_nan(x)
        result = func(x)
        if isnan(result):
            return type(result).default_nan()
        else:
            return result
    return wrapped

def wrap_fp_binary(func):
    def wrapped(x, y):
        if isnan(x) or isnan(y):
            return propagate_nan(x, y)
        result = func(x, y)
        if isnan(result):
            return type(result).default_nan()
        else:
            return result
    return wrapped
```
```
f32x4.add(a: v128, b: v128) -> v128
f64x2.add(a: v128, b: v128) -> v128
```

Lane-wise IEEE addition.

```python
def S.add(a, b):
    return S.lanewise_binary(wrap_fp_binary(ieee.addition), a, b)
```
```
f32x4.sub(a: v128, b: v128) -> v128
f64x2.sub(a: v128, b: v128) -> v128
```

Lane-wise IEEE subtraction.

```python
def S.sub(a, b):
    return S.lanewise_binary(wrap_fp_binary(ieee.subtraction), a, b)
```
```
f32x4.div(a: v128, b: v128) -> v128
f64x2.div(a: v128, b: v128) -> v128
```

Lane-wise IEEE division.

```python
def S.div(a, b):
    return S.lanewise_binary(wrap_fp_binary(ieee.division), a, b)
```
```
f32x4.mul(a: v128, b: v128) -> v128
f64x2.mul(a: v128, b: v128) -> v128
```

Lane-wise IEEE multiplication.

```python
def S.mul(a, b):
    return S.lanewise_binary(wrap_fp_binary(ieee.multiplication), a, b)
```
```
f32x4.sqrt(a: v128) -> v128
f64x2.sqrt(a: v128) -> v128
```

Lane-wise IEEE squareRoot.

```python
def S.sqrt(a):
    return S.lanewise_unary(wrap_fp_unary(ieee.squareRoot), a)
```
```
f32x4.reciprocalApproximation(a: v128) -> v128
f64x2.reciprocalApproximation(a: v128) -> v128
```

Implementation-dependent approximation to the reciprocal.

```python
def S.reciprocalApproximation(a):
    def recip_approx(x):
        if isnan(x):
            return canonicalize_nan(x)
        if x == 0.0:
            # +0.0 -> +Inf, -0.0 -> -Inf.
            return 1/x
        if isinf(x):
            # +Inf -> +0.0, -Inf -> -0.0.
            return 1/x
        # The exact nature of the approximation is unspecified.
        return implementation_dependent(x)
    return S.lanewise_unary(recip_approx, a)
```
```
f32x4.reciprocalSqrtApproximation(a: v128) -> v128
f64x2.reciprocalSqrtApproximation(a: v128) -> v128
```

Implementation-dependent approximation to the reciprocal of the square root.

```python
def S.reciprocalSqrtApproximation(a):
    def recip_sqrt_approx(x):
        if isnan(x):
            return canonicalize_nan(x)
        if x == 0:
            # +0.0 -> +Inf, -0.0 -> -Inf.
            return 1/x
        if isinf(x):
            # +Inf -> +0.0, -Inf -> -0.0.
            return 1/x
        # The exact nature of the approximation is unspecified.
        return implementation_dependent(x)
    return S.lanewise_unary(recip_sqrt_approx, a)
```
```
f32x4.fromSignedInt(a: v128) -> v128
f64x2.fromSignedInt(a: v128) -> v128
f32x4.fromUnsignedInt(a: v128) -> v128
f64x2.fromUnsignedInt(a: v128) -> v128
```

Lane-wise conversion from integer to floating point. Some integer values will be rounded.

```python
def S.fromSignedInt(a):
    def convert(x):
        return S.LaneType.convertFromInt(x)
    return S.lanewise_unary(convert, a)

def S.fromUnsignedInt(a):
    def convert(x):
        return S.LaneType.convertFromInt(x)
    return S.lanewise_unary(convert, a)
```
```
s32x4.fromFloat(a: v128) -> (result: v128, fail: boolean)
s64x2.fromFloat(a: v128) -> (result: v128, fail: boolean)
u32x4.fromFloat(a: v128) -> (result: v128, fail: boolean)
u64x2.fromFloat(a: v128) -> (result: v128, fail: boolean)
```

Lane-wise conversion from floating point to integer using the IEEE convertToIntegerTowardZero function. If any lane is a NaN or the rounded integer value is outside the range of the destination type, return fail = true and an unspecified result.

```python
def S.fromFloat(a):
    result = S.New()
    fail = false
    for i in range(S.Lanes):
        r = ieee.roundToIntegralTowardZero(a[i])
        if isnan(r):
            fail = true
        elif S.Min <= r and r <= S.Max:
            result[i] = r
        else:
            fail = true
    if fail:
        return (unspecified(), true)
    else:
        return (result, false)
```
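A scalar version of the checked conversion for s32 lanes can be sketched in Python (the function name and None placeholder are illustrative, not part of the spec):

```python
import math

def s32_from_float(x):
    # Truncate toward zero; fail on NaN, infinity, or out-of-range results.
    if math.isnan(x) or math.isinf(x):
        return None, True
    r = math.trunc(x)
    if -2**31 <= r <= 2**31 - 1:
        return r, False
    return None, True

print(s32_from_float(3.9))   # (3, False)
print(s32_from_float(-3.9))  # (-3, False)
print(s32_from_float(3e10))  # (None, True)
```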