TiDB vector search doc #18502

Open
wants to merge 75 commits into
base: master
Commits
48f906c
TiDB vector data type and vector index Doc
EricZequan Sep 2, 2024
0b31525
remove vector index part
EricZequan Sep 2, 2024
96f2701
modify cluster type
EricZequan Sep 2, 2024
3fdab1b
fix
EricZequan Sep 2, 2024
98e417b
modify expression
EricZequan Sep 3, 2024
9bd83d6
fix ci
EricZequan Sep 3, 2024
52934cb
fix comment
EricZequan Sep 3, 2024
26027f9
fix ci
EricZequan Sep 3, 2024
df68638
fix ci
EricZequan Sep 3, 2024
febd534
fix comment
EricZequan Sep 4, 2024
69429af
vector-search-overview: refine descriptions
qiancai Sep 4, 2024
d285b55
vector-search-data-types: refine descriptions
qiancai Sep 4, 2024
3852ed3
vector-search-functions-and-operators: refine descriptions
qiancai Sep 4, 2024
93b7e90
add remaining doc
EricZequan Sep 5, 2024
15e18d0
remove rows
EricZequan Sep 5, 2024
402f1d0
fix
EricZequan Sep 5, 2024
f869c42
get started: refine descriptions
qiancai Sep 5, 2024
374b687
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai Sep 5, 2024
ea2b0b1
integrate-with-django-orm: refine descriptions
qiancai Sep 6, 2024
2cf3e79
Apply suggestions from code review
qiancai Sep 6, 2024
0eea4d7
fix comment
EricZequan Sep 6, 2024
93fe602
fix comment
EricZequan Sep 6, 2024
abe5eac
integrate-with-peewee/sqlalchemy: refine descriptions
qiancai Sep 6, 2024
a01ba0e
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai Sep 6, 2024
48f47af
get-started and integrate-with-jinaai-embedding: refine descriptions
qiancai Sep 6, 2024
f9b6dc0
get-started-using-sql: update connection instructions
qiancai Sep 6, 2024
ae94864
Update vector-search-data-types.md
breezewish Sep 6, 2024
f485b58
integrate-with-llamaindex: refine descriptions
qiancai Sep 9, 2024
2e56e20
integrate-with-langchain: refine descriptions
qiancai Sep 9, 2024
bb7ce0b
overview and limitation: refine descriptions
qiancai Sep 9, 2024
d50b638
add vector index doc introduction
EricZequan Sep 10, 2024
5290946
fix comment
EricZequan Sep 10, 2024
eb604f7
modify introduction of self-hosted tidb connection type
EricZequan Sep 10, 2024
65656f8
modify tidb connection when using tidb self-hosted
EricZequan Sep 11, 2024
99f9ab7
fix comment
EricZequan Sep 11, 2024
2c5efa4
fix comment
EricZequan Sep 11, 2024
b2e32b7
fix comment
EricZequan Sep 11, 2024
9a51513
shorten index example case
EricZequan Sep 11, 2024
c09d85f
fix comment
EricZequan Sep 11, 2024
bf063b7
fix comment
EricZequan Sep 12, 2024
9eaf1f7
fix comment
EricZequan Sep 13, 2024
c62f7c9
fix comment
EricZequan Sep 13, 2024
0083ea1
fix comment
EricZequan Sep 13, 2024
6a673b9
index & improve performance: refine descriptions
qiancai Sep 13, 2024
5ebe116
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai Sep 13, 2024
a250dff
add vector index part in other document
EricZequan Sep 14, 2024
94e8ab5
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
EricZequan Sep 14, 2024
634c602
modify index name when create vector index
EricZequan Sep 14, 2024
4b54e6d
Update vector-search-improve-performance.md
EricZequan Sep 14, 2024
5eeb336
refine descriptions for TiDB self-managed connection
qiancai Sep 14, 2024
c6dad29
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai Sep 14, 2024
9a77c65
fix comment
EricZequan Sep 14, 2024
502bd52
vector-index: refine descriptions
qiancai Sep 14, 2024
652b639
remove index part when create table in integration-doc
EricZequan Sep 14, 2024
05784ea
Resolve merge conflicts
EricZequan Sep 14, 2024
cfacbef
fix comment
EricZequan Sep 14, 2024
3a68b70
Merge remote-tracking branch 'upstream/master' into pr/18502
qiancai Sep 14, 2024
eb42ae8
TiDB Serverless -> TiDB Cloud Serverless
qiancai Sep 14, 2024
8d938d4
add the experimental warning
qiancai Sep 18, 2024
f7a31f2
fix comment
EricZequan Sep 19, 2024
894fcd4
fix comment
EricZequan Sep 19, 2024
92e1aee
fix comment
EricZequan Sep 23, 2024
0f91e5a
Apply suggestions from code review
qiancai Sep 24, 2024
39958f3
UI changes: Endpoint Type -> Connection Type
qiancai Sep 24, 2024
f8af8f6
fix comment
EricZequan Sep 24, 2024
12abce8
fix comment
EricZequan Sep 24, 2024
c9ef22f
fix comment
EricZequan Sep 25, 2024
3350c10
remove 'vector64()' sytax
EricZequan Sep 26, 2024
f18d840
Update desc about tiflash upgrade
JaySon-Huang Sep 27, 2024
9cbc09e
Update desc about br support
JaySon-Huang Sep 29, 2024
13bc862
Add limitation about BR restore
JaySon-Huang Sep 29, 2024
7704118
Update desc about limitation
JaySon-Huang Sep 29, 2024
d135903
add limit about cdc
wk989898 Sep 29, 2024
f374485
Update tiflash-configuration
JaySon-Huang Sep 29, 2024
2ba7811
fix comment
EricZequan Sep 30, 2024
Binary file added media/vector-search/embedding-search.png
243 changes: 243 additions & 0 deletions vector-search-data-types.md
@@ -0,0 +1,243 @@
---
title: Vector Data Types
summary: Learn about the Vector data types in TiDB.
---

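# Vector Data Types

A vector is a sequence of floating-point numbers, such as `[0.3, 0.5, -0.1, ...]`. TiDB provides dedicated Vector data types for efficiently storing and accessing the vector embeddings that are widely used in AI applications.

The following Vector data types are currently available:

- `VECTOR`: A sequence of single-precision floating-point numbers. The vector can have any number of dimensions.
- `VECTOR(D)`: A sequence of single-precision floating-point numbers with a fixed dimension `D`.

Compared with the [`JSON`](/data-type-json.md) type, the Vector data types have the following additional advantages:

- Dimension enforcement. When you specify a fixed dimension, vectors with a different dimension are rejected on write.
- Optimized storage format. The Vector data types are optimized for vector data and are more space-efficient and performant than `JSON`.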

## Syntax

You can use a string in the following format to represent a Vector value:

```sql
'[<float>, <float>, ...]'
```

For example:

```sql
CREATE TABLE vector_table (
    id INT PRIMARY KEY,
    embedding VECTOR(3)
);

INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]');

INSERT INTO vector_table VALUES (2, NULL);
```

Inserting a string that does not conform to this syntax as vector data results in an error:

```sql
[tidb]> INSERT INTO vector_table VALUES (3, '[5, ]');
ERROR 1105 (HY000): Invalid vector text: [5, ]
```

In the following example, the `embedding` column is defined with dimension 3, so inserting a vector with a different dimension results in an error:

```sql
[tidb]> INSERT INTO vector_table VALUES (4, '[0.3, 0.5]');
ERROR 1105 (HY000): vector has 2 dimensions, does not fit VECTOR(3)
```
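
The two checks above (text syntax and fixed dimension) can be mimicked on the client side before data is sent to the database. The following is a minimal Python sketch, not TiDB's actual parser: the function name `parse_vector_text` and the `dims` parameter are hypothetical, and using `json.loads` as the parser is an assumption that happens to reject the `'[5, ]'` input shown above.

```python
import json


def parse_vector_text(text, dims=None):
    """Parse '[<float>, <float>, ...]' into a list of floats.

    `dims` plays the role of D in VECTOR(D); None stands for a plain VECTOR
    column. This is a client-side illustration only, not TiDB's parser.
    """
    try:
        values = json.loads(text)
    except json.JSONDecodeError:
        raise ValueError(f"Invalid vector text: {text}") from None
    # Accept only a flat list of numbers (bool is excluded explicitly).
    if not isinstance(values, list) or not all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    ):
        raise ValueError(f"Invalid vector text: {text}")
    # Enforce the fixed dimension, if one was declared.
    if dims is not None and len(values) != dims:
        raise ValueError(
            f"vector has {len(values)} dimensions, does not fit VECTOR({dims})"
        )
    return [float(v) for v in values]
```

Calling `parse_vector_text('[0.3, 0.5]', dims=3)` raises a `ValueError`, while the same text with `dims=None` is accepted, which mirrors the difference between `VECTOR(3)` and `VECTOR`.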

For all functions and operators that support the Vector data types, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).


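## Store vectors with different dimensions

If you omit the dimension parameter of the `VECTOR` type, you can store vectors with different dimensions in the same column: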

```sql
CREATE TABLE vector_table (
    id INT PRIMARY KEY,
    embedding VECTOR
);

INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]'); -- 3-dimensional vector, OK
INSERT INTO vector_table VALUES (2, '[0.3, 0.5]');       -- 2-dimensional vector, OK
```

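## Comparison

You can compare vectors using [comparison operators](/vector-search-functions-and-operators.md) such as `=`, `!=`, `<`, `>`, `<=`, and `>=`. For all functions and operators that support the Vector data types, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

TiDB compares vectors element by element, in order. For example: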

- `[1] < [12]`
- `[1,2,3] < [1,2,5]`
- `[1,2,3] = [1,2,3]`
- `[2,2,3] > [1,2,3]`

When two vectors have different dimensions, TiDB compares them in lexicographical order, according to the following rules:

- Elements are compared one by one, numerically.
- The first pair of elements that differ determines the result of the comparison.
- If one vector is a prefix of the other, the vector with fewer dimensions is _less than_ the vector with more dimensions.
- Two vectors with the same length and identical elements are _equal_.
- An empty vector is _less than_ any non-empty vector.
- Two empty vectors are _equal_.

For example:

- `[] < [1]`
- `[1,2,3] < [1,2,3,0]`
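
The lexicographical rules listed above can be written out as a short function. This is a minimal Python sketch of the ordering only; `compare_vectors` is a hypothetical name and the code does not reflect how TiDB implements the comparison internally.

```python
def compare_vectors(a, b):
    """Return -1, 0, or 1 for a < b, a == b, a > b in lexicographical order."""
    # Walk both vectors in parallel; the first differing pair decides.
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared positions are equal, so a prefix (the shorter vector) sorts
    # first, and two vectors of equal length are equal.
    return (len(a) > len(b)) - (len(a) < len(b))


print(compare_vectors([1, 2, 3], [1, 2, 5]))     # first difference is 3 vs 5
print(compare_vectors([1, 2, 3], [1, 2, 3, 0]))  # prefix case
print(compare_vectors([], []))                   # two empty vectors
```

Python's built-in list comparison follows the same ordering, so `[1, 2, 3] < [1, 2, 3, 0]` and `[] < [1]` both evaluate to `True` there as well.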

When comparing vectors, use an [explicit cast](#cast) to convert vector data from strings to the Vector type; otherwise TiDB compares the values as strings:

```sql
-- The operands are string literals, so TiDB compares them as strings:
[tidb]> SELECT '[12.0]' < '[4.0]';
+--------------------+
| '[12.0]' < '[4.0]' |
+--------------------+
|                  1 |
+--------------------+
1 row in set (0.01 sec)

-- Cast explicitly to the Vector type so that the vector comparison rules apply:
[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]');
+--------------------------------------------------+
| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') |
+--------------------------------------------------+
|                                                0 |
+--------------------------------------------------+
1 row in set (0.01 sec)
```
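
The same pitfall can be reproduced outside the database. The sketch below is an illustration in Python, under the assumption that a code-point string comparison is close enough to the SQL string comparison for these ASCII inputs; the variable names are hypothetical.

```python
import json

left, right = "[12.0]", "[4.0]"

# Compared as strings: the character '1' sorts before '4', so the
# string "[12.0]" is considered smaller than "[4.0]".
as_strings = left < right

# Compared as vectors: parse the text first, then compare numerically,
# element by element.
as_vectors = json.loads(left) < json.loads(right)

print(as_strings, as_vectors)
```

The two results disagree, which is why the string operands need to be converted to vectors before they are compared.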

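## Arithmetic

The Vector data types support the arithmetic operations `+` and `-`, which perform element-wise addition and subtraction on two vectors. Arithmetic on vectors with different dimensions is not supported and results in an error.

Here are some examples: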

```sql
[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]');
+---------------------------------------------+
| VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]') |
+---------------------------------------------+
| [9]                                         |
+---------------------------------------------+
1 row in set (0.01 sec)

[tidb]> SELECT VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]');
+-----------------------------------------------------+
| VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]') |
+-----------------------------------------------------+
| [1,1,1]                                             |
+-----------------------------------------------------+
1 row in set (0.01 sec)

[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[1,2,3]');
ERROR 1105 (HY000): vectors have different dimensions: 1 and 3
```
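
Element-wise addition and subtraction, together with the dimension check, can be sketched in a few lines of Python. The helper names `vec_add` and `vec_sub` are hypothetical and only model the behavior described above; they are not TiDB functions.

```python
def _check_same_dims(a, b):
    # Arithmetic is only defined for vectors of the same dimension.
    if len(a) != len(b):
        raise ValueError(
            f"vectors have different dimensions: {len(a)} and {len(b)}"
        )


def vec_add(a, b):
    """Element-wise a + b."""
    _check_same_dims(a, b)
    return [x + y for x, y in zip(a, b)]


def vec_sub(a, b):
    """Element-wise a - b."""
    _check_same_dims(a, b)
    return [x - y for x, y in zip(a, b)]
```

For instance, `vec_sub([2, 3, 4], [1, 2, 3])` yields `[1, 1, 1]`, and `vec_add([4], [1, 2, 3])` raises a `ValueError` because the dimensions differ.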

## Cast

### Cast between Vector ⇔ String

You can use the following functions to convert between vectors and strings:

- `CAST(... AS VECTOR)`: String ⇒ Vector
- `CAST(... AS CHAR)`: Vector ⇒ String
- `VEC_FROM_TEXT`: String ⇒ Vector
- `VEC_AS_TEXT`: Vector ⇒ String

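For ease of use, if a function only accepts Vector data types (such as the vector distance functions), you can pass a correctly formatted string directly, and TiDB casts it implicitly: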

```sql
-- VEC_DIMS only accepts a Vector argument, so you can pass a string directly and TiDB implicitly casts it to a vector:
[tidb]> SELECT VEC_DIMS('[0.3, 0.5, -0.1]');
+------------------------------+
| VEC_DIMS('[0.3, 0.5, -0.1]') |
+------------------------------+
|                            3 |
+------------------------------+
1 row in set (0.01 sec)

-- You can also use VEC_FROM_TEXT to explicitly cast the string to a vector before passing it to VEC_DIMS:
[tidb]> SELECT VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]'));
+---------------------------------------------+
| VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')) |
+---------------------------------------------+
|                                           3 |
+---------------------------------------------+
1 row in set (0.01 sec)

-- You can also cast explicitly with CAST(... AS VECTOR):
[tidb]> SELECT VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR));
+----------------------------------------------+
| VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR)) |
+----------------------------------------------+
|                                            3 |
+----------------------------------------------+
1 row in set (0.01 sec)
```

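When an operator or function accepts multiple data types, no implicit cast takes place. In that case, explicitly cast the string to a vector before passing it to the operator or function. For example, before a comparison, cast the strings to vectors explicitly; otherwise TiDB compares them as strings rather than as vectors: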

```sql
-- The operands are strings, so TiDB compares them as strings:
[tidb]> SELECT '[12.0]' < '[4.0]';
+--------------------+
| '[12.0]' < '[4.0]' |
+--------------------+
|                  1 |
+--------------------+
1 row in set (0.01 sec)

-- Cast to vectors so that the vector comparison rules apply:
[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]');
+--------------------------------------------------+
| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') |
+--------------------------------------------------+
|                                                0 |
+--------------------------------------------------+
1 row in set (0.01 sec)
```

You can also explicitly cast a vector to a string, for example with the `VEC_AS_TEXT()` function:

```sql
-- The string is first implicitly cast to a vector, and then explicitly cast back to a string, which yields a normalized format:
[tidb]> SELECT VEC_AS_TEXT('[0.3, 0.5, -0.1]');
+---------------------------------+
| VEC_AS_TEXT('[0.3, 0.5, -0.1]') |
+---------------------------------+
| [0.3,0.5,-0.1]                  |
+---------------------------------+
1 row in set (0.01 sec)
```
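
The round trip from text to vector and back to normalized text can be illustrated as follows. This is a minimal Python sketch: the function name `normalize_vector_text` is hypothetical, `json.loads` stands in for the parser, and the `:g` number format is a simplification that may differ from TiDB's exact float formatting.

```python
import json


def normalize_vector_text(text):
    """Parse vector text and re-serialize it without any whitespace."""
    values = json.loads(text)  # string -> list of numbers
    # Join the elements with bare commas, as in the output shown above.
    return "[" + ",".join(f"{float(v):g}" for v in values) + "]"


print(normalize_vector_text("[0.3,    0.5,  -0.1]"))
```

Inputs that differ only in whitespace produce the same normalized text, which makes the normalized form convenient for comparing or deduplicating vector literals on the client side.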

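For other cast functions, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

### Cast between Vector ⇔ other data types

Currently, direct casting between vectors and other data types (such as `JSON`) is not supported. As a workaround, use a string as an intermediate type.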
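
On the application side, the string detour is straightforward because a JSON array of numbers, written as text, already has the same `[<float>, <float>, ...]` shape as the vector text syntax. The following Python sketch illustrates this; the document structure, the `embedding` key, and both helper names are hypothetical.

```python
import json


def json_array_to_vector_text(document, key):
    """Serialize the numeric array stored under `key` as vector text."""
    return json.dumps(document[key])


def vector_text_to_json_array(vector_text):
    """Parse vector text back into a list that can be embedded in JSON."""
    return json.loads(vector_text)


doc = {"id": 1, "embedding": [0.3, 0.5, -0.1]}  # hypothetical JSON document
text = json_array_to_vector_text(doc, "embedding")
print(text)
```

The resulting string can then be supplied wherever vector text is accepted, such as the `INSERT` statements earlier in this document.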

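## Restrictions

- A vector can have at most 16383 dimensions.
- Vectors cannot contain `NaN`, `Infinity`, or `-Infinity` values.
- Currently, the Vector data types store single-precision floating-point numbers only; double-precision floating-point numbers are not supported. Support is planned for a future release.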
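
These restrictions can be checked in application code before a vector is written. The following is a minimal Python sketch under stated assumptions: the names `MAX_DIMS`, `to_single_precision`, and `check_vector` are hypothetical, the error messages are invented for illustration, and rounding through `struct` only approximates what storing a value as a single-precision float does.

```python
import math
import struct

MAX_DIMS = 16383  # maximum number of dimensions stated above


def to_single_precision(x):
    """Round a Python float (double precision) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def check_vector(values):
    """Validate a vector against the restrictions and return it as float32 values."""
    if len(values) > MAX_DIMS:
        raise ValueError(f"vector has {len(values)} dimensions, maximum is {MAX_DIMS}")
    for v in values:
        # NaN, Infinity, and -Infinity are not allowed in a vector.
        if math.isnan(v) or math.isinf(v):
            raise ValueError("NaN and Infinity are not allowed in a vector")
    return [to_single_precision(v) for v in values]
```

Because only single precision is stored, a value such as `0.1` does not survive the round trip exactly: `to_single_precision(0.1)` differs slightly from the double-precision `0.1`, while values like `0.5` are preserved exactly.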

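For other limitations, see [Vector Search Limitations](/vector-search-limitations.md).

## MySQL compatibility

The Vector data types are specific to TiDB and are not supported in MySQL.

## See also

- [Vector Functions and Operators](/vector-search-functions-and-operators.md)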