docs: add datatype vecf16 and svector
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
cutecutecat committed Mar 5, 2024
1 parent 6518c57 commit 7a5f83c
Showing 2 changed files with 145 additions and 4 deletions.
8 changes: 7 additions & 1 deletion src/getting-started/overview.md
@@ -103,14 +103,20 @@ SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;

The `vecf16` type is identical to `vector` except for the scalar type: it stores 16-bit floating point numbers. To reduce memory usage and get better performance, you can try replacing the `vector` type with `vecf16`.

For more on using `vecf16`, see [vector types](../usage/vector-types.html).

### Sparse vector

The `svector` type is a sparse vector type. It stores a vector in a sparse format and is suitable for vectors with many zeros.

For more on using `svector`, see [vector types](../usage/vector-types.html).

### Binary vector

The `bvector` type is a binary vector type: a fixed-length bit string. In addition to the three distances above, we support the Jaccard distance `<~>`, defined as $1 - \frac{|X\cap Y|}{|X\cup Y|}$. The Hamming distance is the same as the squared Euclidean distance, so you can use the `<->` operator to calculate it. We also provide the `binarize` function to construct a `bvector` from a `vector`; it sets positive elements to 1 and all others to 0.

For more on using `bvector`, see [vector types](../usage/vector-types.html).

## Roadmap 🗂️

Please check out [ROADMAP](../community/roadmap).
141 changes: 138 additions & 3 deletions src/usage/vector-types.md
@@ -24,16 +24,38 @@ We support three operators to calculate the distance between two `bvector` value
- `<=>` (`bvector_cos_ops`): cosine distance, defined as $1 - \frac{\Sigma x_iy_i}{\sqrt{\Sigma x_i^2 \Sigma y_i^2}}$.
- `<~>` (`bvector_jaccard_ops`): Jaccard distance, defined as $1 - \frac{|X\cap Y|}{|X\cup Y|}$.

An index can be created on a `bvector` column as well.

```sql
CREATE INDEX your_index_name ON items USING vectors (embedding bvector_l2_ops);

SELECT * FROM items ORDER BY embedding <-> '[1,0,1]' LIMIT 5;
```
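As a rough illustration of the definitions above, the following Python sketch computes the Jaccard distance between two binary vectors and checks that the Hamming distance coincides with the squared Euclidean distance on 0/1 values. The function names are illustrative and are not part of `pgvecto.rs`, and the handling of two all-zero vectors is an assumption made for this sketch.

```python
def jaccard_distance(x, y):
    """1 - |X ∩ Y| / |X ∪ Y|, treating each vector as the set of positions holding 1."""
    intersection = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption for this sketch: two all-zero vectors have distance 0.
    return 0.0 if union == 0 else 1.0 - intersection / union


def hamming_distance(x, y):
    """Number of positions where the two vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def squared_euclidean(x, y):
    """Sum of squared element-wise differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


x = [1, 0, 1]
y = [1, 1, 0]
print(jaccard_distance(x, y))  # 1 - 1/3: one shared position out of three set positions
print(hamming_distance(x, y) == squared_euclidean(x, y))  # True
```

For 0/1 elements each differing position contributes exactly 1 to the squared difference, which is why `<->` can stand in for the Hamming distance.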

### Data type cast

Cast to and from `vector`:
```sql
SELECT '[1, 0, 1]'::vector::bvector;
SELECT '[1, 0, 1]'::bvector::vector;
```

From ARRAY or real[] to bvector:
```sql
SELECT ARRAY[1, 0, 1]::real[]::vector::bvector;
```

From string constructor:
```sql
SELECT '[1, 0, 1]'::bvector;
```

From binarize constructor:
```sql
SELECT binarize(ARRAY[-2, -1, 0, 1, 2]::real[]::vector);
-- [0, 0, 0, 1, 1]
```
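The rule that `binarize` applies (positive elements become 1, everything else becomes 0) can be sketched in a few lines of Python. This only mirrors the behavior described here; it is not the extension's code.

```python
def binarize(values):
    """Map each positive element to 1 and every other element (zero or negative) to 0."""
    return [1 if v > 0 else 0 for v in values]


print(binarize([-2, -1, 0, 1, 2]))  # [0, 0, 0, 1, 1]
```

Note that zero is not positive, so it maps to 0, matching the SQL example above.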

### Performance

The `bvector` type is optimized for storage and performance. It uses a bit-packed representation to store the binary vector. The distance calculation is also optimized for binary vectors.
@@ -45,3 +67,116 @@ We upsert 1M binary vectors into the table and then run a KNN query for each emb
![bvector](./images/bvector.png)

We can see that the `bvector` type is less accurate than the `vector` type, but its accuracy exceeds 95% if we adopt adaptive retrieval.

## `svector` sparse vector

Unlike dense vectors, sparse vectors are very high-dimensional but contain few non-zero values. Though you can treat them as traditional dense vectors, they can be stored and processed much more efficiently using [sparse storage formats](https://en.wikipedia.org/wiki/Sparse_matrix).

Typically, sparse vectors are generated from:
- Word-word occurrence matrices
- Term frequency-inverse document frequency (TF-IDF) vectors
- User-item interaction matrices
- Network adjacency matrices

`pgvecto.rs` supports sparse vectors stored in [COO (coordinate) format](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)) through the `svector` type.

::: tip
`svector` stores 32-bit floats. Sparse vectors of 16-bit floats are not supported yet.
:::

Here's an example of creating a table with an `svector` column and inserting values:

```sql {3}
CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding svector(10) NOT NULL
);

INSERT INTO items (embedding) VALUES ('[0.1,0,0,0,0,0,0,0,0,0]'), ('[0,0,0,0,0,0,0,0,0,0.5]');
```

We support three operators to calculate the distance between two `svector` values.

- `<->` (`svector_l2_ops`): squared Euclidean distance, defined as $\Sigma (x_i - y_i) ^ 2$.
- `<#>` (`svector_dot_ops`): negative dot product, defined as $- \Sigma x_iy_i$.
- `<=>` (`svector_cos_ops`): cosine distance, defined as $1 - \frac{\Sigma x_iy_i}{\sqrt{\Sigma x_i^2 \Sigma y_i^2}}$.

An index can be created on an `svector` column as well.

```sql
CREATE INDEX your_index_name ON items USING vectors (embedding svector_l2_ops);

SELECT * FROM items ORDER BY embedding <-> '[0.3,0,0,0,0,0,0,0,0,0]'::svector LIMIT 1;
```
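To illustrate why sparse storage helps, here is a minimal Python sketch of the three distances computed directly on sparse data. It assumes a hypothetical representation (a `dict` mapping index to non-zero value) chosen for readability; it is not how `pgvecto.rs` stores or computes anything internally, and `cosine_distance` assumes both vectors are non-zero.

```python
import math


def squared_euclidean(x, y):
    """Sum of (x_i - y_i)^2; only indexes present in either vector can contribute."""
    return sum((x.get(i, 0.0) - y.get(i, 0.0)) ** 2 for i in x.keys() | y.keys())


def negative_dot(x, y):
    """-(sum of x_i * y_i); only indexes present in both vectors can contribute."""
    return -sum(v * y[i] for i, v in x.items() if i in y)


def cosine_distance(x, y):
    """1 - dot / (|x| * |y|); assumes neither vector is all zeros."""
    norm = math.sqrt(sum(v * v for v in x.values()) * sum(v * v for v in y.values()))
    return 1.0 - (-negative_dot(x, y)) / norm


# The two rows inserted above and the query vector, keeping only non-zero entries.
row_a = {0: 0.1}
row_b = {9: 0.5}
query = {0: 0.3}

print(squared_euclidean(row_a, query) < squared_euclidean(row_b, query))  # True
```

Under squared Euclidean distance the first row is the nearer one, and the work done depends on the number of non-zero entries rather than on the declared dimension of 10.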

### Data type cast

Cast to and from `vector`:
```sql
SELECT '[0.3, 0, 0, 0, 0.5]'::vector::svector;
SELECT '[0.3, 0, 0, 0, 0.5]'::svector::vector;
```

From ARRAY or real[] to svector:
```sql
SELECT ARRAY[random(), 0, 0, 0, 0.5]::real[]::vector::svector;
```

From string constructor:
```sql
SELECT '[0.3, 0, 0, 0, 0.5]'::svector;
```

From index and value constructor:
```sql
SELECT to_svector(5, '{0,4}', '{0.3,0.5}');
-- [0.3, 0, 0, 0, 0.5]
```
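The index-and-value form corresponds to the coordinate (COO) layout: a dimension count, the positions of the non-zero entries, and their values. The Python sketch below converts between that layout and a dense list. The helper names `to_dense` and `to_coo` are hypothetical, and the zero-based indexing is inferred from the `to_svector` example above.

```python
def to_dense(dims, indexes, values):
    """Expand (dims, indexes, values) into a dense list, filling gaps with 0.0."""
    dense = [0.0] * dims
    for i, v in zip(indexes, values):
        dense[i] = v
    return dense


def to_coo(dense):
    """Collect the positions and values of the non-zero entries of a dense list."""
    indexes = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indexes]
    return len(dense), indexes, values


print(to_dense(5, [0, 4], [0.3, 0.5]))  # [0.3, 0.0, 0.0, 0.0, 0.5]
print(to_coo([0.3, 0, 0, 0, 0.5]))      # (5, [0, 4], [0.3, 0.5])
```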

## `vecf16` half-precision vector

Stored in half-precision floating-point format, `vecf16` uses 16-bit floats, which require half the storage and bandwidth of `vector`.
It is often faster than the regular `vector` data type, but may lose some precision.

Here's an example of creating a table with a `vecf16` column and inserting values:

```sql {3}
CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vecf16(3) NOT NULL
);

INSERT INTO items (embedding) VALUES ('[0.1, 0.2, 0]'), ('[0, 0.1, 0.2]');
```
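The storage and precision trade-off can be seen outside the database with Python's standard `struct` module, whose `e` format code packs IEEE 754 half-precision values. This sketch only illustrates 16-bit versus 32-bit floats in general and assumes nothing about how `pgvecto.rs` encodes `vecf16` internally.

```python
import struct


def to_half_and_back(x):
    """Round-trip a Python float through a 16-bit half-precision encoding."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# A half-precision value takes 2 bytes, a single-precision value takes 4.
print(len(struct.pack("<e", 0.1)), len(struct.pack("<f", 0.1)))  # 2 4

# 0.1 is not exactly representable, so the round trip loses a little precision.
print(to_half_and_back(0.1))  # 0.0999755859375

# Values such as 0.5 are exactly representable and survive unchanged.
print(to_half_and_back(0.5))  # 0.5
```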

We support three operators to calculate the distance between two `vecf16` values.

- `<->` (`vecf16_l2_ops`): squared Euclidean distance, defined as $\Sigma (x_i - y_i) ^ 2$.
- `<#>` (`vecf16_dot_ops`): negative dot product, defined as $- \Sigma x_iy_i$.
- `<=>` (`vecf16_cos_ops`): cosine distance, defined as $1 - \frac{\Sigma x_iy_i}{\sqrt{\Sigma x_i^2 \Sigma y_i^2}}$.

An index can be created on a `vecf16` column as well.

```sql
CREATE INDEX your_index_name ON items USING vectors (embedding vecf16_l2_ops);

SELECT * FROM items ORDER BY embedding <-> '[0.3,0.2,0.1]'::vecf16 LIMIT 1;
```

### Data type cast

Cast to and from `vector`:
```sql
SELECT '[0.3, 0.2, 0.1]'::vector::vecf16;
SELECT '[0.3, 0.2, 0.1]'::vecf16::vector;
```

From ARRAY or real[] to vecf16:
```sql
SELECT ARRAY[random(), 0, 0.1]::real[]::vector::vecf16;
```

From string constructor:
```sql
SELECT '[0.3, 0.2, 0.1]'::vecf16;
```
