wasmparser: Perform type canonicalization for Wasm GC

The unit of canonicalization is a recursion group. Having "unnecessary" types in a recursion group can "break" canonicalization of other types within that same recursion group, as can reordering types within a recursion group.

It is an invariant that all types defined before the recursion group we are currently canonicalizing have already been canonicalized themselves.

Canonicalizing a recursion group then proceeds as follows:

* First we walk each of its `SubType` elements and put their type references (i.e. their `PackedIndex`es) into canonical form. Canonicalizing a `PackedIndex` means switching it from indexing into the Wasm module's types space to either

  1. Referencing an already-canonicalized type, for types outside of this recursion group. Because inter-group type references can only go towards types defined before this recursion group, we know the type is already canonicalized and we have a `CoreTypeId` for each of those types. This updates the `PackedIndex` into a `CoreTypeId`.

  2. Indexing into the current recursion group, for intra-group type references.

  Note that (2) has the effect of making the "same" structure of mutual type recursion look identical across recursion groups:

        ;; Before
        (rec (struct (field (module-type 1))) (struct (field (module-type 0))))
        (rec (struct (field (module-type 3))) (struct (field (module-type 2))))

        ;; After
        (rec (struct (field (rec-group-type 1))) (struct (field (rec-group-type 0))))
        (rec (struct (field (rec-group-type 1))) (struct (field (rec-group-type 0))))

* Now that the recursion group's elements are in canonical form, we can "simply" hash cons whole rec groups at a time. The `TypeList` morally maintains a hash map from `Vec<SubType>` to `RecGroupId` and we can do get-or-create operations on it. I say "morally" because we don't actually duplicate the `Vec<SubType>` key in that hash map since those elements are already stored in the `TypeList`'s internal `SnapshotList<CoreType>`. This means we need to do some low-level hash table fiddling with the `hashbrown` crate.

And that's it! That is the whole canonicalization algorithm.

Some more random things to note:

* Because we essentially already have to do the check to canonicalize, and to avoid additional passes over the types, the canonicalization pass also checks that type references are in bounds. These are the only errors that can be returned from canonicalization.

* Canonicalizing requires the `Module` to translate type indices to actual `CoreTypeId`s.

* It is important that *after* we have canonicalized all types, we don't need the `Module` anymore. This makes sure that we can, for example, intern all types from the same store into the same `TypeList`. That in turn lets us type check function imports of a same-store instance's exported functions without translating from one module's canonical representation to another module's canonical representation, and without performing additional expensive checks to see whether the types match (since the whole point of canonicalization is to avoid that!).

-------------------------------------

I initially tried to have two different Rust types for each Wasm core type (`SubType`, `FuncType`, etc...): one for the version that contains raw type space indices that are produced directly from the reader and another that contains `CoreTypeId`s after canonicalization. This approach is essentially what we do for component model types. However, this was getting really painful, because even `ValType` would have to have two different versions. The number of places I was touching, including in downstream crates, was getting out of hand.

So instead I opted to make a new index type that is morally the following enum:

```rust
enum Index {
    ModuleTypesSpaceIndex(u32),
    RecGroupLocalIndex(u32),
    CoreTypeId(CoreTypeId),
}
```

Of course, we have to be very frugal with bits to keep `RefType` fitting in 24 bits and `ValType` in 32 bits, so it is actually a bit-packed version of that. We can still represent the maximum number of Wasm types in a module. However, a `TypeList` can only have `2 * MAX_WASM_TYPES` types stored in it now (or at least that is how many are addressable; you could add more and then never stuff their `CoreTypeId`s into these bit-packed indices). We could free up some more bits here if we started bit-packing `ValType`, but the loss in ergonomics of matching on `ValType` would be pretty bad.

Anyway, I also added an unpacked version of these bit-packed indices for ergonomics. The bit-packed version can infallibly be converted to the unpacked version, and the unpacked version can fallibly be converted to the bit-packed version (it checks that the indices are representable in the number of bits we actually have available).

Finally, because we are back to only having a single Rust type for each core Wasm type, I removed the `define_core_wasm_types!` macro and inlined the definitions. Sorry for the churn! But it is definitely nicer not having them inside a macro at the end of the day.

-----------------------------------

This also fixes bytecodealliance#923, since canonicalization avoids the exponential behavior observed there.
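
For readers who want a concrete picture of the two canonicalization steps described in this message, here is a minimal, self-contained sketch. It is not the code in this commit: `Interner`, `StructType`, `add_rec_group`, and the unpacked `Index` variants are hypothetical names made up for illustration, every type is a toy struct of type references, and the map simply clones the canonical group as its key (the real `TypeList` avoids that duplication, per the note about `hashbrown` above).

```rust
use std::collections::HashMap;

/// Hypothetical, unpacked stand-in for the index type described above.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Index {
    /// Raw index into the module's types space (before canonicalization).
    Module(u32),
    /// Index relative to the start of the enclosing recursion group.
    RecGroupLocal(u32),
    /// Identifier of an already-canonicalized type.
    Id(u32),
}

/// Toy type definition: a struct whose fields are all type references.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct StructType {
    fields: Vec<Index>,
}

/// Toy interner. Assumption: it owns a copy of each canonical group as the key.
#[derive(Default)]
struct Interner {
    /// Canonical rec group -> canonical id of the group's first type.
    groups: HashMap<Vec<StructType>, u32>,
    /// Module types-space index -> canonical type id.
    module_to_id: Vec<u32>,
    /// Next unused canonical type id.
    next_id: u32,
}

impl Interner {
    /// Canonicalizes one rec group, then gets or creates its interned entry.
    /// Returns the canonical id of the group's first type.
    fn add_rec_group(&mut self, mut group: Vec<StructType>) -> Result<u32, String> {
        let start = self.module_to_id.len() as u32;
        let end = start + group.len() as u32;
        // Step 1: rewrite every module-level reference into canonical form.
        for ty in &mut group {
            for field in &mut ty.fields {
                if let Index::Module(i) = *field {
                    *field = if i < start {
                        // Earlier group: already canonicalized, so use its id.
                        Index::Id(self.module_to_id[i as usize])
                    } else if i < end {
                        // Same group: make the reference group-relative.
                        Index::RecGroupLocal(i - start)
                    } else {
                        // The bounds check is the only possible error.
                        return Err(format!("type index {} out of bounds", i));
                    };
                }
            }
        }
        // Step 2: hash cons the whole group at once.
        let len = group.len() as u32;
        let first = match self.groups.get(&group) {
            Some(&id) => id,
            None => {
                let id = self.next_id;
                self.next_id += len;
                self.groups.insert(group, id);
                id
            }
        };
        // Record which canonical ids this group's module indices map to.
        self.module_to_id.extend(first..first + len);
        Ok(first)
    }
}
```

Under these assumptions, feeding the two "Before" groups from the example above through `add_rec_group` yields the same canonical id both times, because after step 1 both groups hold identical group-relative references.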
fitzgen · Nov 1, 2023 · 23cd7ff
1 parent 706f755
Showing 27 changed files with 2,457 additions and 1,276 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/crates/wasm-compose/src/encoding.rs b/crates/wasm-compose/src/encoding.rs
@@ -434,7 +434,9 @@ impl<'a> TypeEncoder<'a> {
                 wasmparser::HeapType::Struct => HeapType::Struct,
                 wasmparser::HeapType::Array => HeapType::Array,
                 wasmparser::HeapType::I31 => HeapType::I31,
-                wasmparser::HeapType::Concrete(i) => HeapType::Concrete(i),
+                wasmparser::HeapType::Concrete(i) => {
+                    HeapType::Concrete(i.as_module_index().unwrap())
+                }
             },
         }
     }

diff --git a/crates/wasm-encoder/src/core/code.rs b/crates/wasm-encoder/src/core/code.rs
@@ -2893,6 +2893,11 @@ pub enum ConstExprConversionError {
     /// The const expression is invalid: not actually constant or something like
     /// that.
     Invalid,
+
+    /// There was a type reference that was canonicalized and no longer
+    /// references an index into a module's types space, so we cannot encode it
+    /// into a Wasm binary again.
+    CanonicalizedTypeReference,
 }
 
 #[cfg(feature = "wasmparser")]
@@ -2903,6 +2908,10 @@ impl std::fmt::Display for ConstExprConversionError {
                 write!(f, "There was an error when parsing the const expression")
             }
             Self::Invalid => write!(f, "The const expression was invalid"),
+            Self::CanonicalizedTypeReference => write!(
+                f,
+                "There was a canonicalized type reference without type index information"
+            ),
         }
     }
 }
@@ -2912,7 +2921,7 @@ impl std::error::Error for ConstExprConversionError {
     fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
         match self {
             Self::ParseError(e) => Some(e),
-            Self::Invalid => None,
+            Self::Invalid | Self::CanonicalizedTypeReference => None,
         }
     }
 }
@@ -2936,7 +2945,10 @@ impl<'a> TryFrom<wasmparser::ConstExpr<'a>> for ConstExpr {
             Some(Ok(wasmparser::Operator::V128Const { value })) => {
                 ConstExpr::v128_const(i128::from_le_bytes(*value.bytes()))
             }
-            Some(Ok(wasmparser::Operator::RefNull { hty })) => ConstExpr::ref_null(hty.into()),
+            Some(Ok(wasmparser::Operator::RefNull { hty })) => ConstExpr::ref_null(
+                HeapType::try_from(hty)
+                    .map_err(|_| ConstExprConversionError::CanonicalizedTypeReference)?,
+            ),
             Some(Ok(wasmparser::Operator::RefFunc { function_index })) => {
                 ConstExpr::ref_func(function_index)
             }

diff --git a/crates/wasm-encoder/src/core/globals.rs b/crates/wasm-encoder/src/core/globals.rs
@@ -90,11 +90,12 @@ impl Encode for GlobalType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::GlobalType> for GlobalType {
-    fn from(global_ty: wasmparser::GlobalType) -> Self {
-        GlobalType {
-            val_type: global_ty.content_type.into(),
+impl TryFrom<wasmparser::GlobalType> for GlobalType {
+    type Error = ();
+    fn try_from(global_ty: wasmparser::GlobalType) -> Result<Self, Self::Error> {
+        Ok(GlobalType {
+            val_type: global_ty.content_type.try_into()?,
             mutable: global_ty.mutable,
-        }
+        })
     }
 }
diff --git a/crates/wasm-encoder/src/core/imports.rs b/crates/wasm-encoder/src/core/imports.rs
@@ -74,15 +74,16 @@ impl From<TagType> for EntityType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::TypeRef> for EntityType {
-    fn from(type_ref: wasmparser::TypeRef) -> Self {
-        match type_ref {
+impl TryFrom<wasmparser::TypeRef> for EntityType {
+    type Error = ();
+    fn try_from(type_ref: wasmparser::TypeRef) -> Result<Self, Self::Error> {
+        Ok(match type_ref {
             wasmparser::TypeRef::Func(i) => EntityType::Function(i),
-            wasmparser::TypeRef::Table(t) => EntityType::Table(t.into()),
+            wasmparser::TypeRef::Table(t) => EntityType::Table(t.try_into()?),
             wasmparser::TypeRef::Memory(m) => EntityType::Memory(m.into()),
-            wasmparser::TypeRef::Global(g) => EntityType::Global(g.into()),
+            wasmparser::TypeRef::Global(g) => EntityType::Global(g.try_into()?),
             wasmparser::TypeRef::Tag(t) => EntityType::Tag(t.into()),
-        }
+        })
     }
 }
 

diff --git a/crates/wasm-encoder/src/core/tables.rs b/crates/wasm-encoder/src/core/tables.rs
@@ -104,12 +104,13 @@ impl Encode for TableType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::TableType> for TableType {
-    fn from(table_ty: wasmparser::TableType) -> Self {
-        TableType {
-            element_type: table_ty.element_type.into(),
+impl TryFrom<wasmparser::TableType> for TableType {
+    type Error = ();
+    fn try_from(table_ty: wasmparser::TableType) -> Result<Self, Self::Error> {
+        Ok(TableType {
+            element_type: table_ty.element_type.try_into()?,
             minimum: table_ty.initial,
             maximum: table_ty.maximum,
-        }
+        })
     }
 }
diff --git a/crates/wasm-encoder/src/core/types.rs b/crates/wasm-encoder/src/core/types.rs
@@ -12,13 +12,18 @@ pub struct SubType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::SubType> for SubType {
-    fn from(sub_ty: wasmparser::SubType) -> Self {
-        SubType {
+impl TryFrom<wasmparser::SubType> for SubType {
+    type Error = ();
+
+    fn try_from(sub_ty: wasmparser::SubType) -> Result<Self, Self::Error> {
+        Ok(SubType {
             is_final: sub_ty.is_final,
-            supertype_idx: sub_ty.supertype_idx,
-            composite_type: sub_ty.composite_type.into(),
-        }
+            supertype_idx: sub_ty
+                .supertype_idx
+                .map(|i| i.as_module_index().ok_or(()))
+                .transpose()?,
+            composite_type: sub_ty.composite_type.try_into()?,
+        })
     }
 }
 
@@ -52,13 +57,14 @@ impl Encode for CompositeType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::CompositeType> for CompositeType {
-    fn from(composite_ty: wasmparser::CompositeType) -> Self {
-        match composite_ty {
-            wasmparser::CompositeType::Func(f) => CompositeType::Func(f.into()),
-            wasmparser::CompositeType::Array(a) => CompositeType::Array(a.into()),
-            wasmparser::CompositeType::Struct(s) => CompositeType::Struct(s.into()),
-        }
+impl TryFrom<wasmparser::CompositeType> for CompositeType {
+    type Error = ();
+    fn try_from(composite_ty: wasmparser::CompositeType) -> Result<Self, Self::Error> {
+        Ok(match composite_ty {
+            wasmparser::CompositeType::Func(f) => CompositeType::Func(f.try_into()?),
+            wasmparser::CompositeType::Array(a) => CompositeType::Array(a.try_into()?),
+            wasmparser::CompositeType::Struct(s) => CompositeType::Struct(s.try_into()?),
+        })
     }
 }
 
@@ -72,12 +78,14 @@ pub struct FuncType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::FuncType> for FuncType {
-    fn from(func_ty: wasmparser::FuncType) -> Self {
-        FuncType::new(
-            func_ty.params().iter().cloned().map(Into::into),
-            func_ty.results().iter().cloned().map(Into::into),
-        )
+impl TryFrom<wasmparser::FuncType> for FuncType {
+    type Error = ();
+    fn try_from(func_ty: wasmparser::FuncType) -> Result<Self, Self::Error> {
+        let mut buf = Vec::with_capacity(func_ty.params().len() + func_ty.results().len());
+        for ty in func_ty.params().iter().chain(func_ty.results()).copied() {
+            buf.push(ty.try_into()?);
+        }
+        Ok(FuncType::from_parts(buf.into(), func_ty.params().len()))
     }
 }
 
@@ -86,9 +94,10 @@ impl From<wasmparser::FuncType> for FuncType {
 pub struct ArrayType(pub FieldType);
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::ArrayType> for ArrayType {
-    fn from(array_ty: wasmparser::ArrayType) -> Self {
-        ArrayType(array_ty.0.into())
+impl TryFrom<wasmparser::ArrayType> for ArrayType {
+    type Error = ();
+    fn try_from(array_ty: wasmparser::ArrayType) -> Result<Self, Self::Error> {
+        Ok(ArrayType(array_ty.0.try_into()?))
     }
 }
 
@@ -100,11 +109,17 @@ pub struct StructType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::StructType> for StructType {
-    fn from(struct_ty: wasmparser::StructType) -> Self {
-        StructType {
-            fields: struct_ty.fields.iter().cloned().map(Into::into).collect(),
-        }
+impl TryFrom<wasmparser::StructType> for StructType {
+    type Error = ();
+    fn try_from(struct_ty: wasmparser::StructType) -> Result<Self, Self::Error> {
+        Ok(StructType {
+            fields: struct_ty
+                .fields
+                .iter()
+                .cloned()
+                .map(TryInto::try_into)
+                .collect::<Result<_, _>>()?,
+        })
     }
 }
 
@@ -118,12 +133,13 @@ pub struct FieldType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::FieldType> for FieldType {
-    fn from(field_ty: wasmparser::FieldType) -> Self {
-        FieldType {
-            element_type: field_ty.element_type.into(),
+impl TryFrom<wasmparser::FieldType> for FieldType {
+    type Error = ();
+    fn try_from(field_ty: wasmparser::FieldType) -> Result<Self, Self::Error> {
+        Ok(FieldType {
+            element_type: field_ty.element_type.try_into()?,
             mutable: field_ty.mutable,
-        }
+        })
     }
 }
 
@@ -139,13 +155,14 @@ pub enum StorageType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::StorageType> for StorageType {
-    fn from(storage_ty: wasmparser::StorageType) -> Self {
-        match storage_ty {
+impl TryFrom<wasmparser::StorageType> for StorageType {
+    type Error = ();
+    fn try_from(storage_ty: wasmparser::StorageType) -> Result<Self, Self::Error> {
+        Ok(match storage_ty {
             wasmparser::StorageType::I8 => StorageType::I8,
             wasmparser::StorageType::I16 => StorageType::I16,
-            wasmparser::StorageType::Val(v) => StorageType::Val(v.into()),
-        }
+            wasmparser::StorageType::Val(v) => StorageType::Val(v.try_into()?),
+        })
     }
 }
 
@@ -173,16 +190,17 @@ pub enum ValType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::ValType> for ValType {
-    fn from(val_ty: wasmparser::ValType) -> Self {
-        match val_ty {
+impl TryFrom<wasmparser::ValType> for ValType {
+    type Error = ();
+    fn try_from(val_ty: wasmparser::ValType) -> Result<Self, Self::Error> {
+        Ok(match val_ty {
             wasmparser::ValType::I32 => ValType::I32,
             wasmparser::ValType::I64 => ValType::I64,
             wasmparser::ValType::F32 => ValType::F32,
             wasmparser::ValType::F64 => ValType::F64,
             wasmparser::ValType::V128 => ValType::V128,
-            wasmparser::ValType::Ref(r) => ValType::Ref(r.into()),
-        }
+            wasmparser::ValType::Ref(r) => ValType::Ref(r.try_into()?),
+        })
     }
 }
 
@@ -196,8 +214,13 @@ impl FuncType {
         let mut buffer = params.into_iter().collect::<Vec<_>>();
         let len_params = buffer.len();
         buffer.extend(results);
+        Self::from_parts(buffer.into(), len_params)
+    }
+
+    #[inline]
+    pub(crate) fn from_parts(params_results: Box<[ValType]>, len_params: usize) -> Self {
         Self {
-            params_results: buffer.into(),
+            params_results,
             len_params,
         }
     }
@@ -293,12 +316,14 @@ impl Encode for RefType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::RefType> for RefType {
-    fn from(ref_type: wasmparser::RefType) -> Self {
-        RefType {
+impl TryFrom<wasmparser::RefType> for RefType {
+    type Error = ();
+
+    fn try_from(ref_type: wasmparser::RefType) -> Result<Self, Self::Error> {
+        Ok(RefType {
             nullable: ref_type.is_nullable(),
-            heap_type: ref_type.heap_type().into(),
-        }
+            heap_type: ref_type.heap_type().try_into()?,
+        })
     }
 }
 
@@ -381,10 +406,12 @@ impl Encode for HeapType {
 }
 
 #[cfg(feature = "wasmparser")]
-impl From<wasmparser::HeapType> for HeapType {
-    fn from(heap_type: wasmparser::HeapType) -> Self {
-        match heap_type {
-            wasmparser::HeapType::Concrete(i) => HeapType::Concrete(i),
+impl TryFrom<wasmparser::HeapType> for HeapType {
+    type Error = ();
+
+    fn try_from(heap_type: wasmparser::HeapType) -> Result<Self, Self::Error> {
+        Ok(match heap_type {
+            wasmparser::HeapType::Concrete(i) => HeapType::Concrete(i.as_module_index().ok_or(())?),
             wasmparser::HeapType::Func => HeapType::Func,
             wasmparser::HeapType::Extern => HeapType::Extern,
             wasmparser::HeapType::Any => HeapType::Any,
@@ -395,7 +422,7 @@ impl From<wasmparser::HeapType> for HeapType {
             wasmparser::HeapType::Struct => HeapType::Struct,
             wasmparser::HeapType::Array => HeapType::Array,
             wasmparser::HeapType::I31 => HeapType::I31,
-        }
+        })
     }
 }
 

diff --git a/crates/wasm-mutate/src/module.rs b/crates/wasm-mutate/src/module.rs
@@ -90,7 +90,7 @@ pub fn map_ref_type(ref_ty: wasmparser::RefType) -> Result<RefType> {
             wasmparser::HeapType::Struct => HeapType::Struct,
             wasmparser::HeapType::Array => HeapType::Array,
             wasmparser::HeapType::I31 => HeapType::I31,
-            wasmparser::HeapType::Concrete(i) => HeapType::Concrete(i.into()),
+            wasmparser::HeapType::Concrete(i) => HeapType::Concrete(i.as_module_index().unwrap()),
         },
     })
 }

diff --git a/crates/wasm-mutate/src/mutators/translate.rs b/crates/wasm-mutate/src/mutators/translate.rs
@@ -210,9 +210,9 @@ pub fn heapty(t: &mut dyn Translator, ty: &wasmparser::HeapType) -> Result<HeapT
         wasmparser::HeapType::Struct => Ok(HeapType::Struct),
         wasmparser::HeapType::Array => Ok(HeapType::Array),
         wasmparser::HeapType::I31 => Ok(HeapType::I31),
-        wasmparser::HeapType::Concrete(i) => {
-            Ok(HeapType::Concrete(t.remap(Item::Type, (*i).into())?))
-        }
+        wasmparser::HeapType::Concrete(i) => Ok(HeapType::Concrete(
+            t.remap(Item::Type, i.as_module_index().unwrap())?,
+        )),
     }
 }