docs: add datatype vecf16 and svector

Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
tensorchord · Mar 5, 2024 · 7a5f83c
1 parent 6518c57
commit 7a5f83c
Showing 2 changed files with 145 additions and 4 deletions.
diff --git a/src/getting-started/overview.md b/src/getting-started/overview.md
@@ -103,14 +103,20 @@ SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
 
 `vecf16` type is the same with `vector` in anything but the scalar type. It stores 16-bit floating point numbers. If you want to reduce the memory usage to get better performance, you can try to replace `vector` type with `vecf16` type.
 
+For more on using `vecf16`, see [vector types](../usage/vector-types.html).
+
 ### Sparse vector
 
-`svector` type is a sparse vector type. It stores a vector in a sparse format. It is suitable for vectors with many zeros. 
+`svector` type is a sparse vector type. It stores a vector in a sparse format. It is suitable for vectors with many zeros.
+
+For more on using `svector`, see [vector types](../usage/vector-types.html).
 
 ### Binary vector
 
 `bvector` type is a binary vector type. It is a fixed-length bit string. Except for above 3 distances, we also support `jaccard` distance `<~>`, which defined as $1 - \frac{|X\cap Y|}{|X\cup Y|}$. And `hamming` distance is the same with squared Euclidean distance, you can use `<->` operator to calculate it. We also provide `binarize` function to construct a `bvector` from a `vector`, which set the positive elements to 1, otherwise 0.
 
+For more on using `bvector`, see [vector types](../usage/vector-types.html).
+
 ## Roadmap 🗂️
 
 Please check out [ROADMAP](../community/roadmap).

diff --git a/src/usage/vector-types.md b/src/usage/vector-types.md
@@ -24,16 +24,38 @@ We support three operators to calculate the distance between two `bvector` value
 - `<=>` (`bvector_cos_ops`): cosine distance, defined as $1 - \frac{\Sigma x_iy_i}{\sqrt{\Sigma x_i^2 \Sigma y_i^2}}$.
 - `<~>` (`bvector_jaccard_ops`): Jaccard distance, defined as $1 - \frac{|X\cap Y|}{|X\cup Y|}$.
 
-```sql
-
 Index can be created on `bvector` type as well.
 
 ```sql
-CREATE INDEX bvector ON items USING vectors (embedding bvector_l2_ops);
+CREATE INDEX your_index_name ON items USING vectors (embedding bvector_l2_ops);
 
 SELECT * FROM items ORDER BY embedding <-> '[1,0,1]' LIMIT 5;
 ```
 
+### Data type cast
+
+Cast to and from `vector`:
+```sql
+SELECT '[1, 0, 1]'::vector::bvector;
+SELECT '[1, 0, 1]'::bvector::vector;
+```
+
+From ARRAY or real[] to bvector:
+```sql
+SELECT ARRAY[1, 0, 1]::real[]::vector::bvector;
+```
+
+From string constructor:
+```sql
+SELECT '[1, 0, 1]'::bvector;
+```
+
+From binarize constructor:
+```sql
+SELECT binarize(ARRAY[-2, -1, 0, 1, 2]::real[]::vector);
+-- [0, 0, 0, 1, 1]
+```
+
 ### Performance
 
 The `bvector` type is optimized for storage and performance. It uses a bit-packed representation to store the binary vector. The distance calculation is also optimized for binary vectors.
@@ -45,3 +67,116 @@ We upsert 1M binary vectors into the table and then run a KNN query for each emb
 ![bvector](./images/bvector.png)
 
 We can see that the `bvector`'s accuracy is not as good as the `vector` type, but it exceeds 95%  if we adopt adaptive retrieval.
+
+## `svector` sparse vector
+
+Unlike dense vectors, sparse vectors are very high-dimensional but contain few non-zero values. Although they can be treated as ordinary dense vectors, they can be stored and processed much more efficiently in a [sparse format](https://en.wikipedia.org/wiki/Sparse_matrix).
+
+Typically, sparse vectors are generated from:
+- Word-word occurrence matrices
+- Term frequency-inverse document frequency (TF-IDF) vectors
+- User-item interaction matrices
+- Network adjacency matrices
+
+`pgvecto.rs` supports sparse vectors in [COO (coordinate) format](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)) through the `svector` type.
+
+::: tip
+`svector` stores 32-bit floats. Sparse vectors of 16-bit floats are not supported yet.
+:::
+
+Here's an example of creating a table with an `svector` column and inserting values:
+
+```sql {3}
+CREATE TABLE items (
+  id bigserial PRIMARY KEY,
+  embedding svector(10) NOT NULL
+);
+
+INSERT INTO items (embedding) VALUES ('[0.1,0,0,0,0,0,0,0,0,0]'), ('[0,0,0,0,0,0,0,0,0,0.5]');
+```
+
+We support three operators to calculate the distance between two `svector` values.
+
+- `<->` (`svector_l2_ops`): squared Euclidean distance, defined as $\Sigma (x_i - y_i) ^ 2$.
+- `<#>` (`svector_dot_ops`): negative dot product, defined as $- \Sigma x_iy_i$.
+- `<=>` (`svector_cos_ops`): cosine distance, defined as $1 - \frac{\Sigma x_iy_i}{\sqrt{\Sigma x_i^2 \Sigma y_i^2}}$.
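+
+As a quick check of the definitions above, the operators can also be applied directly to two literals. This is a minimal sketch: the values in the comments are worked out by hand from the formulas, not captured from a real session, and the displayed results may differ by floating-point rounding.
+
+```sql
+-- x = [1,0,0,0,2], y = [0,0,0,0,2]
+-- squared Euclidean distance: (1-0)^2 + (2-2)^2 = 1
+SELECT '[1,0,0,0,2]'::svector <-> '[0,0,0,0,2]'::svector;
+-- negative dot product: -(1*0 + 2*2) = -4
+SELECT '[1,0,0,0,2]'::svector <#> '[0,0,0,0,2]'::svector;
+-- cosine distance: 1 - 4 / sqrt(5 * 4), roughly 0.106
+SELECT '[1,0,0,0,2]'::svector <=> '[0,0,0,0,2]'::svector;
+```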
+
+Index can be created on `svector` type as well.
+
+```sql
+CREATE INDEX your_index_name ON items USING vectors (embedding svector_l2_ops);
+
+SELECT * FROM items ORDER BY embedding <-> '[0.3,0,0,0,0,0,0,0,0,0]'::svector LIMIT 1;
+```
+
+### Data type cast
+
+Cast to and from `vector`:
+```sql
+SELECT '[0.3, 0, 0, 0, 0.5]'::vector::svector;
+SELECT '[0.3, 0, 0, 0, 0.5]'::svector::vector;
+```
+
+From ARRAY or real[] to svector:
+```sql
+SELECT ARRAY[random(), 0, 0, 0, 0.5]::real[]::vector::svector;
+```
+
+From string constructor:
+```sql
+SELECT '[0.3, 0, 0, 0, 0.5]'::svector;
+```
+
+From index and value constructor:
+```sql
+SELECT to_svector(5, '{0,4}', '{0.3,0.5}');
+-- [0.3, 0, 0, 0, 0.5]
+```
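+
+Reading the example above, the first argument appears to be the dimension, the second the zero-based positions of the non-zero elements, and the third their values. As a sketch under that reading (the expected result is derived by hand, not taken from a captured run), the distance between a constructed vector and the matching literal should be zero:
+
+```sql
+-- Both sides describe [0.3, 0, 0, 0, 0.5], so the squared Euclidean distance should be 0.
+SELECT to_svector(5, '{0,4}', '{0.3,0.5}') <-> '[0.3, 0, 0, 0, 0.5]'::svector;
+```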
+
+## `vecf16` half-precision vector
+
+`vecf16` stores each element as a 16-bit half-precision float, which requires half the storage and bandwidth of `vector`.
+It is often faster than the regular `vector` data type, but may lose some precision.
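+
+One way to observe the precision loss is to round-trip a value through `vecf16`. This is a minimal sketch that only relies on the casts between `vector` and `vecf16` documented in this section. No output is shown because the exact digits depend on how the server prints the result.
+
+```sql
+-- Cast to half precision and back. Elements such as 0.1 are not exactly
+-- representable in 16 bits, so they may come back slightly changed.
+SELECT '[0.1, 0.2, 0.3]'::vector::vecf16::vector;
+```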
+
+Here's an example of creating a table with a `vecf16` column and inserting values:
+
+```sql {3}
+CREATE TABLE items (
+  id bigserial PRIMARY KEY,
+  embedding vecf16(3) NOT NULL
+);
+
+INSERT INTO items (embedding) VALUES ('[0.1, 0.2, 0]'), ('[0, 0.1, 0.2]');
+```
+
+We support three operators to calculate the distance between two `vecf16` values.
+
+- `<->` (`vecf16_l2_ops`): squared Euclidean distance, defined as $\Sigma (x_i - y_i) ^ 2$.
+- `<#>` (`vecf16_dot_ops`): negative dot product, defined as $- \Sigma x_iy_i$.
+- `<=>` (`vecf16_cos_ops`): cosine distance, defined as $1 - \frac{\Sigma x_iy_i}{\sqrt{\Sigma x_i^2 \Sigma y_i^2}}$.
+
+Index can be created on `vecf16` type as well.
+
+```sql
+CREATE INDEX your_index_name ON items USING vectors (embedding vecf16_l2_ops);
+
+SELECT * FROM items ORDER BY embedding <-> '[0.3,0.2,0.1]'::vecf16 LIMIT 1;
+```
+
+### Data type cast
+
+Cast to and from `vector`:
+```sql
+SELECT '[0.3, 0.2, 0.1]'::vector::vecf16;
+SELECT '[0.3, 0.2, 0.1]'::vecf16::vector;
+```
+
+From ARRAY or real[] to vecf16:
+```sql
+SELECT ARRAY[random(), 0, 0.1]::real[]::vector::vecf16;
+```
+
+From string constructor:
+```sql
+SELECT '[0.3, 0.2, 0.1]'::vecf16;
+```