doc: search rewrite
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
cutecutecat committed Jan 25, 2024
1 parent 81e3e2b commit 6605c55
Showing 2 changed files with 123 additions and 59 deletions.
30 changes: 15 additions & 15 deletions src/usage/compatibility.md
@@ -1,8 +1,8 @@
# `pgvector` compatibility

`pgvecto.rs` is natively compatible with `pgvector` at:
* `CREATE TABLE` commands, for instance, `CREATE TABLE t (val vector(3))`
* `INSERT INTO` commands, for instance, `INSERT INTO t (val) VALUES ('[0.6,0.6,0.6]')`
* `CREATE TABLE` commands, e.g. `CREATE TABLE t (val vector(3))`
* `INSERT INTO` commands, e.g. `INSERT INTO t (val) VALUES ('[0.6,0.6,0.6]')`

`pgvecto.rs` can be configured to be compatible with `pgvector` at:
* Index options, which allow you to create an index with `USING hnsw (val vector_ip_ops)`
@@ -20,32 +20,32 @@ For index `ivfflat` and `hnsw` only the following options are available.

Index options for `ivfflat`:

| Key | Type | Default | Description |
| ----- | ------- | ------- | ------------------------- |
| lists | integer | `100` | Number of cluster units. |
| Key | Type | Range | Default | Description |
| ----- | ------- | ---------------- | ------- | -------------------------------------- |
| nlist | integer | `[1, 1_000_000]` | `100` | Number of cluster units. |

Query options for `ivfflat`:

| Option | Type | Default | Description |
| ---------------- | ------------------------ | ------- | ----------------------------------------- |
| ivfflat.probes | integer (`[1, 1000000]`) | `10` | Number of lists to scan. |
| Option | Type | Range | Default | Description |
| -------------- | ------- | -------------- | ------- | ----------------------------------------- |
| ivfflat.probes | integer | `[1, 1000000]` | `10` | Number of lists to scan. |

::: warning
The default value of `ivfflat.probes` is `10`, not `1` as in pgvector.
:::

Index options for `hnsw`:

| key | type | default | description |
| --------------- | ------- | ------- | -------------------------------- |
| m | integer | `16` | Maximum degree of the node. |
| ef_construction | integer | `64` | Search extent in construction. |
| Key | Type | Range | Default | Description |
| --------------- | ------- | ------------ | ------- | -------------------------------------- |
| m | integer | `[4, 128]` | `16` | Maximum degree of the node. |
| ef_construction | integer | `[10, 2000]` | `64` | Search scope in building. |

Query options for `hnsw`:

| Option | Type | Default | Description |
| -------------- | ------------------------ | ------- | ----------------------------------------- |
| hnsw.ef_search | integer (`[1, 65535]`) | `100` | Search scope of HNSW. |
| Option | Type | Range | Default | Description |
| -------------- | ------- | -------------- | ------- | ----------------------------------------- |
| hnsw.ef_search | integer | `[1, 65535]` | `100` | Search scope of HNSW. |

::: warning
The default value of `hnsw.ef_search` is `100`, not `40` as in pgvector.
152 changes: 108 additions & 44 deletions src/usage/search.md
@@ -1,83 +1,147 @@
# Search

The SQL for searching is very simple. Here is an example of searching for the $5$ nearest embeddings in table `items`:

Get the 5 nearest neighbors to a vector:
```sql
SET vectors.hnsw_ef_search = 64;
SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
```

The vector index searches for the `64` nearest rows, and the `5` nearest of them are returned because of the `LIMIT` clause.
## Operators

## Search modes
| Name | Description |
| ---- | -------------------------- |
| <-> | squared Euclidean distance |
| <#> | negative dot product |
| <=> | cosine distance |

There are two search modes: `basic` and `vbase`.
For the operator formulas, see the [overview](../getting-started/overview).
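
To compare the three operators side by side, a query along the following lines may help. It is only a sketch: it reuses the `items` table and `embedding` column from the example above, and the `id` column is an assumption about the table layout.

```sql
-- Hypothetical layout: items(id, embedding vector(3)).
-- Each operator yields a distance-like value where smaller means "closer".
SELECT id,
       embedding <-> '[3,2,1]' AS squared_euclidean,
       embedding <#> '[3,2,1]' AS negative_dot_product,
       embedding <=> '[3,2,1]' AS cosine_distance
FROM items
ORDER BY embedding <-> '[3,2,1]'
LIMIT 5;
```

Because `<#>` returns the negative dot product, `ORDER BY ... <#> ...` in ascending order puts the largest inner products first.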

### `basic`
## Filter

For a given category, get the 10 nearest neighbors to a vector:
```sql
SELECT 1 FROM items WHERE category_id = 1 ORDER BY embedding <#> '[0.5,0.5,0.5]' LIMIT 10;
```

`basic` is the default search mode. In this mode, vector indexes behave like a vector search library. It works well if all of your queries are like this:
## Query options

Search options are specified by [PostgreSQL GUC](https://www.postgresql.org/docs/current/config-setting.html).

Set `ivf` scan lists to 1 in session:
```sql
SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
SET vectors.ivf_nprobe=1;
```

It's recommended if you do **not** take advantage of:
Set `hnsw` search scope to 40 in transaction:
```sql
SET LOCAL vectors.hnsw_ef_search=40;
```
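
`SET LOCAL` only lasts until the end of the current transaction, so it must run inside an explicit `BEGIN ... COMMIT` block to have any effect. A minimal sketch, assuming the `items` table from the first example:

```sql
BEGIN;
-- Applies only to statements inside this transaction.
SET LOCAL vectors.hnsw_ef_search = 40;
SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
COMMIT;

-- After COMMIT (or ROLLBACK) the option is back at its session value.
SHOW vectors.hnsw_ef_search;
```

This is handy when a single query needs a different search scope and the rest of the session should stay unaffected.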

* database transaction
* deletions without `VACUUM`
* WHERE clauses and very complex SQL statements
Set search mode to `vbase` as system default:
```sql
ALTER SYSTEM SET vectors.search_mode=vbase;
```

### `vbase`
Query options for `ivf`:

`vbase` is another search mode. In this mode, vector indexes behave like a database index. In `vbase` mode, searching results become a stream and every time the database pulls a row, the vector index computes a row to return. It's quite different from an ordinary vector search if you are using a vector search library, such as *faiss*. The latter always wants to know how many results are needed before searching. The original idea comes from [VBASE: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity](https://www.usenix.org/conference/osdi23/presentation/zhang-qianxi).
| Option | Type | Range | Default | Description |
| ------------------------ | ------- | -------------- | ------- | ----------------------------------------- |
| vectors.ivf_nprobe | integer | `[1, 1000000]` | `10` | Number of lists to scan. |

Query options for `hnsw`:

| Option | Type | Range | Default | Description |
| ------------------------ | ------- | -------------- | ------- | ----------------------------------------- |
| vectors.hnsw_ef_search | integer | `[1, 65535]` | `100` | Search scope of HNSW. |

General query options:

Assuming you are using the HNSW algorithm, you may want the following SQL to work:
| Option | Type | Range | Default | Description |
| ------------------------ | ------- | ------------------ | --------- | -------------------------------------- |
| vectors.enable_index | boolean | | `on` | Enables or disables the query planner's use of vector index-scan plan types. |
| vectors.search_mode | enum | `"basic", "vbase"` | `"basic"` | Search mode. |
| vectors.enable_prefilter | boolean | | `on` | Enables or disables the prefilter. |
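
Since these options are ordinary PostgreSQL settings, the standard commands for inspecting and resetting configuration parameters should apply to them as well. The sketch below uses only built-in PostgreSQL statements. The claim that a configuration reload is enough for these particular options to pick up an `ALTER SYSTEM` change is an assumption based on their being settable per session.

```sql
-- Show the value currently in effect for this session.
SHOW vectors.search_mode;

-- Drop a session-level override and return to the configured default.
RESET vectors.hnsw_ef_search;

-- ALTER SYSTEM only writes to postgresql.auto.conf; reload the
-- configuration so that the new default becomes visible.
ALTER SYSTEM SET vectors.search_mode = vbase;
SELECT pg_reload_conf();
```

Note that `ALTER SYSTEM` cannot run inside a transaction block and requires sufficient privileges (superuser by default).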

# Advanced usage

Sometimes you expect a search to return exactly `LIMIT` rows, but it returns fewer:
```sql
SET vectors.hnsw_ef_search = 64;
SELECT * FROM items WHERE id % 2 = 0 ORDER BY embedding <-> '[3,2,1]' LIMIT 64;
SELECT COUNT(1) FROM (SELECT 1 FROM t WHERE (category_id = 1) ORDER BY val <-> '[1,1,1]' limit 10) t2;
--- returns 1, much less than 10
```
That is why we introduce search modes and prefilter.

In `basic` mode, you may get only `32` rows because the HNSW algorithm performs a plain search that ignores the filter condition.
## Search modes

In `vbase` mode, the HNSW algorithm is guaranteed to return as many rows as you need, so you always get correct behavior if you do take advantage of:
There are two search modes: `basic` and `vbase`.

* database transaction
* deletions without `VACUUM`
* `WHERE` clauses and very complex SQL statements
### `basic`

You can enable `vbase` with the SQL statement `SET vectors.search_mode = vbase;`.
`basic` is the default search mode.

## Prefilter
In this mode, the filter is applied after `vectors.hnsw_ef_search` vectors are returned.
Therefore you need to increase `vectors.hnsw_ef_search` until filtered vectors are sufficient.

The appropriate value depends on the input data distribution and the filtering rate.
A value of `vectors.hnsw_ef_search` that is too large wastes memory.
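
One way to find a workable value is to raise `vectors.hnsw_ef_search` step by step and count how many rows survive the filter. The sketch below assumes a table `t` with a vector column `val` and a `category_id` column, and the values `100` and `400` are arbitrary starting points rather than recommendations.

```sql
SET vectors.search_mode = basic;

-- First attempt with the default search scope.
SET vectors.hnsw_ef_search = 100;
SELECT COUNT(1) FROM (
    SELECT 1 FROM t WHERE category_id = 1 ORDER BY val <-> '[1,1,1]' LIMIT 10
) t2;

-- If the count is below 10, widen the search scope and try again.
SET vectors.hnsw_ef_search = 400;
SELECT COUNT(1) FROM (
    SELECT 1 FROM t WHERE category_id = 1 ORDER BY val <-> '[1,1,1]' LIMIT 10
) t2;
```

Stop increasing the value once the count reaches the `LIMIT` you need for your typical filters.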

If your queries include a `WHERE` clause, you can set the search mode to `vbase`. It works well under all conditions. `vbase` is a **postfilter** method: it pulls as many rows as you need, but it also scans rows that you may not need. Since some rows will definitely be removed by the `WHERE` clause, we can skip scanning them, which makes the search faster. We call this **prefilter**.
It's recommended in these situations:

Prefilter speeds up your query under the following conditions:
* Search without filter and transaction
* Returning insufficient vectors is acceptable
* The `vbase` search mode fails due to out-of-memory

* You create a multicolumn vector index containing a vector column and many payload columns.
* The `WHERE` clause in a query is simple, like `(id % 2 = 0) AND (age > 50)`.
### `vbase`

Prefilter is also used in internal implementation for handling deleted rows in `pgvecto.rs`.
`vbase` is the recommended search mode when any filter is enabled.

Prefilter may have a negative impact on precision. Test the precision before using it.
In this mode, the filter is applied after `range` vectors are returned.
The value of `range` is **automatically chosen** by the `vbase` algorithm.
It is **transparent** to the user.

Prefilter is enabled by default because it takes effect almost only when you create a multicolumn vector index.
In most cases, `vbase` mode would return enough vectors for your filter.
For how it works, see the paper [VBASE: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity](https://www.usenix.org/conference/osdi23/presentation/zhang-qianxi).

## Options
It's recommended in these situations:

Search options are specified by PostgreSQL GUC. You can use the `SET` command to apply these options in a session or the `SET LOCAL` command to apply them in a transaction.
* Search with filter or transaction
* Returning enough vectors is important
* Tired of tuning `vectors.hnsw_ef_search` in `basic` mode

Runtime parameters for planning a query:
You can enable `vbase` with the SQL statement `SET vectors.search_mode = vbase;`.
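
As a rough illustration, a filtered query can be rerun in `vbase` mode within a session. The table `t`, its vector column `val`, and the `category_id` column are assumptions for this sketch.

```sql
-- Switch the current session to vbase mode.
SET vectors.search_mode = vbase;

-- The filter is now evaluated while the index streams candidates,
-- so the inner query can keep pulling rows until LIMIT is satisfied.
SELECT COUNT(1) FROM (
    SELECT 1 FROM t WHERE category_id = 1 ORDER BY val <-> '[1,1,1]' LIMIT 10
) t2;

-- Return to the configured default when done.
RESET vectors.search_mode;
```

If fewer rows than `LIMIT` match the filter in the whole table, the count will of course still be lower than `LIMIT`.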

| Option | Type | Range | Default | Description |
| -------------------- | ------- | ------------------ | --------- | ---------------------------------------------------------------------------- |
| vectors.enable_index | boolean | | `on` | Enables or disables the query planner's use of vector index-scan plan types. |
| vectors.search_mode | enum | `"basic", "vbase"` | `"basic"` | Search mode. |
## Prefilter

`prefilter` is an enhancement strategy that can be set in any search mode.
It is enabled by default.

If enabled, an additional filter is applied to select vectors before they are collected.
This reduces expensive distance calculations, and all collected vectors will match the filter.

The speedup grows as the filtering rate (the fraction of rows that pass the filter) shrinks.
A filtering rate of 10% results in roughly a 10x reduction in distance calculations.

::: details
Example: 10% filtering rate
```sql
SELECT * FROM generate_series(1, 10) WHERE generate_series <= 1;
```
:::

However, prefilter can have a negative impact on precision if:
* The filter is not relevant to the vector distance
* The filtering rate is too low, e.g. 1%.

::: details
Example: 1% filtering rate
```sql
SELECT * FROM generate_series(1, 100) WHERE generate_series <= 1;
```
:::

If you need a high level of precision, please test your scenarios and consider turning it off:
```sql
ALTER SYSTEM SET vectors.enable_prefilter=off;
```
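
Before changing the system-wide default, it may be easier to compare both settings within a single session. The following is a sketch only: the `items` table with `id`, `category_id`, and `embedding` columns is assumed, and it is up to you to compare the two result sets (and the timings reported by `EXPLAIN ANALYZE`) against a known-good baseline.

```sql
-- Run the query with prefilter enabled (the default).
SET vectors.enable_prefilter = on;
EXPLAIN ANALYZE
SELECT id FROM items WHERE category_id = 1
ORDER BY embedding <-> '[3,2,1]' LIMIT 10;

-- Run the same query with prefilter disabled and compare.
SET vectors.enable_prefilter = off;
EXPLAIN ANALYZE
SELECT id FROM items WHERE category_id = 1
ORDER BY embedding <-> '[3,2,1]' LIMIT 10;

-- Drop the session override again.
RESET vectors.enable_prefilter;
```

A session-level `SET` affects only the current connection, so this kind of experiment does not disturb other clients.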

Runtime parameters for executing a query:

| Option | Type | Range | Default | Description |
| ------------------------ | ------- | -------------- | ------- | ----------------------------------------- |
| vectors.enable_prefilter | boolean | | `on` | Enables or disables the use of prefilter. |
| vectors.ivf_nprobe | integer | `[1, 1000000]` | `10` | Number of lists to scan. |
| vectors.hnsw_ef_search | integer | `[1, 65535]` | `100` | Search scope of HNSW. |
