doc: search rewrite

Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
tensorchord · Jan 25, 2024 · 6605c55
1 parent 81e3e2b
commit 6605c55

Showing 2 changed files with 123 additions and 59 deletions.
diff --git a/src/usage/compatibility.md b/src/usage/compatibility.md
@@ -1,8 +1,8 @@
 # `pgvector` compatibility
 
 `pgvecto.rs` is natively compatible with `pgvector` at:
-* `CREATE TABLE` commands, for instance, `CREATE TABLE t (val vector(3))`
-* `INSERT INTO` commands, for instance, `INSERT INTO t (val) VALUES ('[0.6,0.6,0.6]')`
+* `CREATE TABLE` commands, e.g. `CREATE TABLE t (val vector(3))`
+* `INSERT INTO` commands, e.g. `INSERT INTO t (val) VALUES ('[0.6,0.6,0.6]')`
 
 `pgvecto.rs` can be configured to be compatible with `pgvector` at: 
 * Index options, which allows you to create index by `USING hnsw (val vector_ip_ops)`
@@ -20,32 +20,32 @@ For index `ivfflat` and `hnsw` only the following options are available.
 
 Index options for `ivfflat`:
 
-| Key   | Type    | Default | Description               |
-| ----- | ------- | ------- | ------------------------- |
-| lists | integer | `100`   | Number of cluster units.  |
+| Key   | Type    | Range            | Default | Description                            |
+| ----- | ------- | ---------------- | ------- | -------------------------------------- |
+| nlist | integer | `[1, 1_000_000]` | `100`  | Number of cluster units.               |
 
 Query options for `ivfflat`:
 
-| Option           | Type                     | Default | Description                               |
-| ---------------- | ------------------------ | ------- | ----------------------------------------- |
-| ivfflat.probes   | integer (`[1, 1000000]`) | `10`    | Number of lists to scan.                  |
+| Option         | Type    | Range          | Default | Description                               |
+| -------------- | ------- | -------------- | ------- | ----------------------------------------- |
+| ivfflat.probes | integer | `[1, 1000000]` | `10`    | Number of lists to scan.                  |
 
 ::: warning
 Default value of `ivfflat.probes` is `10` instead of `1` from pgvector.
 :::
 
 Index options for `hnsw`:
 
-| key             | type    | default | description                      |
-| --------------- | ------- | ------- | -------------------------------- |
-| m               | integer | `16`    | Maximum degree of the node.      |
-| ef_construction | integer | `64`    | Search extent in construction.   |
+| Key             | Type    | Range        | Default | Description                            |
+| --------------- | ------- | ------------ | ------- | -------------------------------------- |
+| m               | integer | `[4, 128]`   | `16`    | Maximum degree of the node.            |
+| ef_construction | integer | `[10, 2000]` | `64`    | Search scope during index building.    |
 
 Query options for `hnsw`:
 
-| Option         | Type                     | Default | Descrcompatibilitymodeiption                               |
-| -------------- | ------------------------ | ------- | ----------------------------------------- |
-| hnsw.ef_search | integer (`[1, 65535]`)   | `100`   | Search scope of HNSW.                     |
+| Option         | Type    | Range          | Default | Description                               |
+| -------------- | ------- | -------------- | ------- | ----------------------------------------- |
+| hnsw.ef_search | integer | `[1, 65535]`   | `100`   | Search scope of HNSW.                     |
 
 ::: warning
 Default value for `hnsw.ef_search` is `100` instead of `40` from pgvector.

diff --git a/src/usage/search.md b/src/usage/search.md
@@ -1,83 +1,147 @@
 # Search
 
-The SQL for searching is very simple. Here is an example of searching the $5$ nearest embedding in table `items`:
-
+Get the 5 nearest neighbors of a vector:
 ```sql
 SET vectors.hnsw_ef_search = 64;
 SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
 ```
 
-The vector index will search for `64` nearest rows, and `5` nearest rows is gotten since there is a `LIMIT` clause.
+## Operators
 
-## Search modes
+| Name  | Description                |
+| ----- | -------------------------- |
+| `<->` | squared Euclidean distance |
+| `<#>` | negative dot product       |
+| `<=>` | cosine distance            |
 
-There are two search modes: `basic` and `vbase`.
+For the formula behind each operator, see the [overview](../getting-started/overview).
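+
+As a minimal sketch, the three operators can be swapped into the same query. It assumes the `items` table and its 3-dimensional `embedding` column from the example at the top of this page:
+```sql
+-- Order by squared Euclidean distance
+SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
+-- Order by negative dot product (the largest dot product ranks first)
+SELECT * FROM items ORDER BY embedding <#> '[3,2,1]' LIMIT 5;
+-- Order by cosine distance
+SELECT * FROM items ORDER BY embedding <=> '[3,2,1]' LIMIT 5;
+```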
 
-### `basic`
+## Filter
+
+For a given category, get the 10 nearest neighbors of a vector:
+```sql
+SELECT * FROM items WHERE category_id = 1 ORDER BY embedding <#> '[0.5,0.5,0.5]' LIMIT 10;
+```
 
-`basic` is the default search mode. In this mode, vector indexes behave like a vector search library. It works well if all of your queries is like this:
+## Query options
 
+Search options are specified by [PostgreSQL GUC](https://www.postgresql.org/docs/current/config-setting.html).
+
+Set the number of `ivf` lists to scan to 1 for the current session:
 ```sql
-SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
+SET vectors.ivf_nprobe=1;
 ```
 
-It's recommended if your do **not** take advantages of
+Set the `hnsw` search scope to 40 for the current transaction:
+```sql
+SET LOCAL vectors.hnsw_ef_search=40;
+```
 
-* database transaction
-* deletions without `VACUUM`
-* WHERE clauses and very complex SQL statements
+Set the search mode to `vbase` as the system-wide default:
+```sql
+ALTER SYSTEM SET vectors.search_mode=vbase;
+-- ALTER SYSTEM takes effect after a configuration reload
+SELECT pg_reload_conf();
+```
 
-### `vbase`
+Query options for `ivf`:
 
-`vbase` is another search mode. In this mode, vector indexes behave like a database index. In `vbase` mode, searching results become a stream and every time the database pulls a row, the vector index computes a row to return. It's quite different from an ordinary vector search if you are using a vector search library, such as *faiss*. The latter always wants to know how many results are needed before searching. The original idea comes from [VBASE: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity](https://www.usenix.org/conference/osdi23/presentation/zhang-qianxi).
+| Option                   | Type    | Range          | Default | Description                               |
+| ------------------------ | ------- | -------------- | ------- | ----------------------------------------- |
+| vectors.ivf_nprobe       | integer | `[1, 1000000]` | `10`    | Number of lists to scan.                  |
+
+Query options for `hnsw`:
+
+| Option                   | Type    | Range          | Default | Description                               |
+| ------------------------ | ------- | -------------- | ------- | ----------------------------------------- |
+| vectors.hnsw_ef_search   | integer | `[1, 65535]`   | `100`   | Search scope of HNSW.                     |
+
+General query options:
 
-Assuming you are using HNSW algorithm, you may want the following SQL to work:
+| Option                   | Type    | Range              | Default   | Description                                                  |
+| ------------------------ | ------- | ------------------ | --------- | ------------------------------------------------------------ |
+| vectors.enable_index     | boolean |                    | `on`      | Enables or disables the planner's use of vector index scans. |
+| vectors.search_mode      | enum    | `"basic", "vbase"` | `"basic"` | Search mode.                                                 |
+| vectors.enable_prefilter | boolean |                    | `on`      | Enables or disables the prefilter.                           |
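+
+A short sketch of inspecting and toggling these options in a session. `SHOW` and `RESET` are standard PostgreSQL commands; that they behave as usual for the `vectors.*` options is an assumption here:
+```sql
+-- Inspect the current search mode
+SHOW vectors.search_mode;
+-- Disable vector index scans for this session, e.g. to compare against a plan without the index
+SET vectors.enable_index = off;
+-- Restore the configured default
+RESET vectors.enable_index;
+```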
 
+# Advanced usage
+
+Sometimes you expect a search to return exactly `LIMIT` rows, but a filter leaves it with fewer:
 ```sql
-SET vectors.hnsw_ef_search = 64;
-SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' WHERE id % 2 = 0 LIMIT 64;
+SELECT COUNT(1) FROM (SELECT 1 FROM t WHERE (category_id = 1) ORDER BY val <-> '[1,1,1]' LIMIT 10) t2;
+-- returns 1, far fewer than 10
 ```
+This is why search modes and prefilter exist.
 
-In `basic` mode, you may only get `32` rows because the HNSW algorithm does search simply so the filter condition is ignored.
+## Search modes
 
-In `vbase` mode, the HNSW algorithm is guaranteed to return rows as many as you need, so you can always get correct behavior if your do take advantages of:
+There are two search modes: `basic` and `vbase`.
 
-* database transaction
-* deletions without `VACUUM`
-* `WHERE` clauses and very complex SQL statements
+### `basic`
 
-You can enable `vbase` by a SQL statement `SET vectors.search_mode = vbase;`.
+`basic` is the default search mode.
 
-## Prefilter
+In this mode, the index returns `vectors.hnsw_ef_search` candidate vectors and the filter is applied afterwards.
+If too few rows survive the filter, increase `vectors.hnsw_ef_search` until enough remain.
+
+The appropriate value depends on the data distribution and the filtering rate.
+Setting `vectors.hnsw_ef_search` too high wastes memory.
 
-If your queries include a `WHERE` clause, you can set set search mode to `vbase`. It's good and it even works on all conditions. `vbase` is a **postfilter** method: it pulls rows as many as you need, but it scans rows that you may not need. Since some rows will definitely be removed by the `WHERE` clause, we can skip scanning them, which will make the search faster. We call it **prefilter**.
+It's recommended in these situations:
 
-Prefilter speeds your query in the following condition:
+* Searches without a filter or transaction
+* Returning fewer vectors than requested is acceptable
+* The `vbase` search mode fails with an out-of-memory error
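+
+The tuning described above can be sketched as follows, assuming the `items` table and `category_id` column from the filter example. The value `200` is purely illustrative:
+```sql
+SET vectors.search_mode = basic;
+-- Collect more candidates so that enough rows survive the filter
+SET vectors.hnsw_ef_search = 200;
+SELECT * FROM items WHERE category_id = 1 ORDER BY embedding <-> '[3,2,1]' LIMIT 10;
+```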
 
-* You create a multicolumn vector index containing a vector column and many payload columns.
-* The `WHERE` clause in a query is just simple like `(id % 2 = 0) AND (age >  50)`.
+### `vbase`
 
-Prefilter is also used in internal implementation for handling deleted rows in `pgvecto.rs`.
+`vbase` is the recommended search mode when a query uses a filter.
 
-Prefilter may have a negative impact on precision. Test the precision before using it.
+In this mode, the filter is applied after `range` vectors are returned.
+The value of `range` is **chosen automatically** by the `vbase` algorithm,
+so it is **transparent** to the user.
 
-Prefilter is enabled by default because it almost only works if you create a multicolumn vector index.
+In most cases, `vbase` mode returns enough vectors to satisfy your filter.
+For how it works, see the paper [VBASE: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity](https://www.usenix.org/conference/osdi23/presentation/zhang-qianxi).
 
-## Options
+It's recommended in these situations:
 
-Search options are specified by PostgreSQL GUC. You can use `SET` command to apply these options in session or `SET LOCAL` command to apply these options in transaction.
+* Searches with a filter or inside a transaction
+* Returning enough vectors is important
+* You want to avoid tuning `vectors.hnsw_ef_search` in `basic` mode
 
-Runtime parameters for planning a query:
+You can enable `vbase` by a SQL statement `SET vectors.search_mode = vbase;`.
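+
+To limit `vbase` to a single transaction, `SET LOCAL` can be used instead. A sketch, assuming the `items` table and `category_id` column from the filter example:
+```sql
+BEGIN;
+-- Applies only until COMMIT or ROLLBACK
+SET LOCAL vectors.search_mode = vbase;
+SELECT * FROM items WHERE category_id = 1 ORDER BY embedding <-> '[3,2,1]' LIMIT 10;
+COMMIT;
+```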
 
-| Option               | Type    | Range              | Default   | Description                                                                  |
-| -------------------- | ------- | ------------------ | --------- | ---------------------------------------------------------------------------- |
-| vectors.enable_index | boolean |                    | `on`      | Enables or disables the query planner's use of vector index-scan plan types. |
-| vectors.search_mode  | enum    | `"basic", "vbase"` | `"basic"` | Search mode.                                                                 |
+## Prefilter
+
+`prefilter` is an optimization that can be used with either search mode.
+It is enabled by default.
+
+If enabled, the filter is also applied to candidate vectors before they are collected.
+This avoids expensive distance calculations, and every collected vector matches the filter.
+
+The speedup grows as the filtering rate (the fraction of rows that pass the filter) drops.
+A filtering rate of 10% results in roughly a 10x acceleration of distance calculations.
+
+::: details
+Example: 10% filtering rate
+```sql
+SELECT * FROM generate_series(1, 10) WHERE generate_series <= 1;
+```
+:::
+
+However, prefilter can reduce precision if:
+* The filter is unrelated to the vector distance
+* The filtering rate is too low, e.g. 1%
+
+::: details
+Example: 1% filtering rate
+```sql
+SELECT * FROM generate_series(1, 100) WHERE generate_series <= 1;
+```
+:::
+
+If you need high precision, test your own workload and consider turning prefilter off:
+```sql
+ALTER SYSTEM SET vectors.enable_prefilter=off;
+-- ALTER SYSTEM takes effect after a configuration reload
+SELECT pg_reload_conf();
+```
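+
+One way to test precision is to run the same query with prefilter on and off in one session and compare the rows. This is a sketch: it assumes `vectors.enable_prefilter` can be changed with `SET` like the other query options, and it reuses the `items` table and `category_id` column from the filter example:
+```sql
+SET vectors.enable_prefilter = on;
+SELECT * FROM items WHERE category_id = 1 ORDER BY embedding <-> '[3,2,1]' LIMIT 10;
+-- Same query without prefilter, for comparison
+SET vectors.enable_prefilter = off;
+SELECT * FROM items WHERE category_id = 1 ORDER BY embedding <-> '[3,2,1]' LIMIT 10;
+```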
 
-Runtime parameters for executing a query:
 
-| Option                   | Type    | Range          | Default | Description                               |
-| ------------------------ | ------- | -------------- | ------- | ----------------------------------------- |
-| vectors.enable_prefilter | boolean |                | `on`    | Enables or disables the use of prefilter. |
-| vectors.ivf_nprobe       | integer | `[1, 1000000]` | `10`    | Number of lists to scan.                  |
-| vectors.hnsw_ef_search   | integer | `[1, 65535]`   | `100`   | Search scope of HNSW.                     |