sql: implement pgvector datatype and evaluation #124292
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.
It looks like your PR touches SQL parser code but doesn't add or edit parser tests. Please make sure you add or edit parser tests if you edit the parser.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
This is great work! LGTM once you add a case to unsupported_types.go. I think Yahor will be taking a look as well.
Reviewed 14 of 14 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @nameisbhaskar, @nkodali, @rharding6373, @vidit-bhat, and @yuzefovich)
pkg/sql/sem/builtins/pgvector_builtins.go
line 41 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I don't really see the advantage besides code reuse, but if you feel strongly we can do it. I think a disadvantage is that the reader has to remember the weird names of the operators (<=> in this case) to validate that the code makes sense, whereas in the current implementation it's clear that cosine_distance calling CosDistance is the right thing.
My thinking was that it'd be nicer for the optimizer and vectorized engine (when we add a vectorized implementation) to only have to consider the operators. This is something we could change later, though, so I'd be alright with leaving it this way for now if you prefer. I guess it would also be pretty simple to just add a couple norm rules that convert the builtin function calls to the corresponding operators.
pkg/sql/sem/eval/binary_op.go
line 1236 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I think these should be fine because of logic that nullifies binary operators and builtins before they're called. I added tests, good point.
TIL: BinOp also has a CalledOnNullInput field.
pkg/sql/types/types.go
line 491 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Hmm, but at least according to the pg_vector source code it is okay to compare vectors with different widths...
Right, no need for a change then.
pkg/sql/logictest/testdata/logic_test/mixed_version_pgvector
line 1 at r4 (raw file):
# LogicTest: cockroach-go-testserver-23.2
Thanks for adding this. It reminds me - we also need a case for VECTOR in unsupported_types.go.
pkg/sql/logictest/testdata/logic_test/vector
line 25 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I did, it is permitted :/
jordan=# create table t(v vector);
CREATE TABLE
jordan=# insert into t values('[1]'),('[1,2]');
INSERT 0 2
jordan=# select * from t;
   v
-------
 [1]
 [1,2]
(2 rows)
Ah well, that's a little sad. Thanks for checking.
pkg/sql/parser/testdata/select_exprs
line 2068 at r4 (raw file):
SELECT (("[1,2]") <-> ("[3,4]")) -- fully parenthesized
SELECT "[1,2]" <-> "[3,4]" -- literals removed
SELECT _ <-> _ -- identifiers removed
Do you know why the vectors are considered identifiers instead of literals? Just because of the double quotes?
pgvector 0.7 is brand new, but I still think it's worth getting some of the new functionality in for this initial version. In particular the new halfvec type is significant (it's necessary to support OpenAI's larger embedding model without truncation) and we should make sure we have the right structure for multiple vector types.
One thing that pgvector doesn't do AFAICT but seems useful is if each vector had a bit to indicate whether it is normalized, so that cosine_distance could automatically turn into the faster inner_product when we don't need to normalize again. Would that be too much of a deviation from pgvector?
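To illustrate the idea, here is a minimal sketch, assuming both vectors have the same length and that a "normalized" flag would mean unit L2 norm. The function names are hypothetical and are not taken from this PR or from pgvector.

```go
package main

import "math"

// cosineDistance is the general form: it divides the dot product by the
// product of both norms. Hypothetical helper; assumes len(a) == len(b) and
// that neither vector is all zeros (otherwise the result is NaN).
func cosineDistance(a, b []float32) float64 {
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	return 1 - dot/math.Sqrt(normA*normB)
}

// cosineDistanceNormalized is the shortcut: when both inputs are already
// known to have unit length, the denominator is 1, so only the inner
// product is needed.
func cosineDistanceNormalized(a, b []float32) float64 {
	var dot float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
	}
	return 1 - dot
}
```

For unit-length inputs the two functions agree up to floating-point rounding; the second skips two of the three accumulations and the square root.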
I was thinking it would be best to treat vectors as binary blobs as much as possible, and only view them as arrays when needed. That would minimize decoding overhead on the assumption that eventually we're not computing inner products with a Go for loop but passing the blob off to some simd-optimized implementation.
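As a rough sketch of the blob-first approach, assuming a hypothetical layout of 4 little-endian bytes per float32 element with no header (this is an illustration, not the encoding used by this PR or by pgvector), an inner product can read elements straight out of the bytes without first materializing a []float32:

```go
package main

import (
	"encoding/binary"
	"math"
)

// encode packs the elements into a blob using the assumed layout:
// 4 little-endian bytes per float32, no header.
func encode(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// innerProductBlob decodes one element at a time from each blob and
// accumulates the product, stopping at the end of the shorter blob.
func innerProductBlob(a, b []byte) float32 {
	var sum float32
	for i := 0; i+4 <= len(a) && i+4 <= len(b); i += 4 {
		x := math.Float32frombits(binary.LittleEndian.Uint32(a[i:]))
		y := math.Float32frombits(binary.LittleEndian.Uint32(b[i:]))
		sum += x * y
	}
	return sum
}
```

The same byte slices could later be handed to a SIMD-optimized routine unchanged, which is the point of keeping the blob as the primary representation.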
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @jordanlewis, @nameisbhaskar, @nkodali, @rharding6373, @vidit-bhat, and @yuzefovich)
Great stuff! I only have some minor nits (and I'll keep on learning about the pgvector implementation).
Reviewed 51 of 68 files at r1, 12 of 21 files at r2, 2 of 4 files at r3, 14 of 14 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @jordanlewis, @nameisbhaskar, @nkodali, @rharding6373, and @vidit-bhat)
pkg/sql/parser/sql.y
line 34 at r4 (raw file):
"github.com/cockroachdb/cockroach/pkg/geo/geopb"
"github.com/cockroachdb/cockroach/pkg/geo/geopb"
nit: this import is duplicated with the line above.
pkg/sql/randgen/datum.go
line 297 at r4 (raw file):
case types.TSQueryFamily:
	return tree.NewDTSQuery(tsearch.RandomTSQuery(rng))
case types.PGVectorFamily:
nit: are there some interesting vectors we could add to randInterestingDatums?
Relatedly, is PGVector a scalar type? In other words, should it be included in types.ScalarTypes?
pkg/sql/rowenc/encoded_datum.go
line 333 at r4 (raw file):
// Note that at time of this writing we don't support arrays of JSON
// (tracked via #23468) nor of TSQuery / TSVector / PGVector types (tracked by
// #90886), so technically we don't need to do a recursive call here,
nit: arrays of PGVector type are tracked by #121432.
pkg/sql/sem/cast/cast_map.go
line 75 at r4 (raw file):
		oid.T_text: {MaxContext: ContextAssignment, origin: ContextOriginAutomaticIOConversion, Volatility: volatility.Immutable},
	},
	oidext.T_pgvector: {
Just checking: updates to castMap were generated via cast_map_gen.sh?
pkg/util/vector/vector.go
line 74 at r4 (raw file):
// String implements the fmt.Stringer interface.
func (v T) String() string {
	strs := make([]string, len(v))
nit: should we use strings.Builder?
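For reference, a strings.Builder version could look roughly like the sketch below. It assumes T is a []float32 and that the output format is the bracketed, comma-separated form shown in the psql session earlier in this thread; it is not the code that was eventually committed.

```go
package main

import (
	"strconv"
	"strings"
)

// T stands in for the vector type under review; assumed to be a float32 slice.
type T []float32

// String writes each element directly into a single strings.Builder instead
// of building an intermediate []string and joining it.
func (v T) String() string {
	var sb strings.Builder
	sb.WriteByte('[')
	for i, f := range v {
		if i > 0 {
			sb.WriteByte(',')
		}
		// bitSize 32 yields the shortest text that round-trips as a float32.
		sb.WriteString(strconv.FormatFloat(float64(f), 'g', -1, 32))
	}
	sb.WriteByte(']')
	return sb.String()
}
```

This avoids one allocation per element for the intermediate strings slice.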
pkg/util/vector/vector.go
line 83 at r4 (raw file):
// Size returns the size of the vector in bytes.
func (v T) Size() uintptr {
	return uintptr(len(v)) * 4
nit: should we do s/len/cap/ to be more precise about memory usage? Also perhaps include 24 bytes for the slice overhead?
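A sketch of what that suggestion amounts to, assuming T is a []float32 (the version actually committed may differ): count the full backing array via cap, and add the slice header, which is 24 bytes on 64-bit platforms.

```go
package main

import "unsafe"

// T stands in for the vector type under review; assumed to be a float32 slice.
type T []float32

// Size returns the memory footprint in bytes: the slice header plus the
// whole backing array (cap, not len), since the unused capacity is still
// allocated.
func (v T) Size() uintptr {
	return unsafe.Sizeof(v) + uintptr(cap(v))*unsafe.Sizeof(float32(0))
}
```

With this definition, a vector created as make(T, 2, 8) accounts for 8 elements rather than 2, which is the difference the nit is about.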
pkg/util/vector/vector.go
line 175 at r4 (raw file):
		normB += t2[i] * t2[i]
	}
	// Use sqrt(a * b) over sqrt(a) * sqrt(b)
nit: I see that this comment comes from the pgvector source code, but I find it confusing since it doesn't actually match the formula.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @nameisbhaskar, @nkodali, @rharding6373, @vidit-bhat, and @yuzefovich)
pkg/sql/randgen/datum.go
line 297 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: are there some interesting vectors we could add to randInterestingDatums?
Relatedly, is PGVector a scalar type? In other words, should it be included in types.ScalarTypes?
I don't think so because the length information is important but the interesting datums code doesn't have a hook for that. I guess we could add it, but I'm not sure how important this is.
Added to scalar types.
pkg/sql/sem/cast/cast_map.go
line 75 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Just checking: updates to castMap were generated via cast_map_gen.sh?
No, this script doesn't seem to create pgvector outputs.
pkg/util/vector/vector.go
line 74 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: should we use strings.Builder?
Done.
pkg/util/vector/vector.go
line 83 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: should we do s/len/cap/ to be more precise about memory usage? Also perhaps include 24 bytes for the slice overhead?
Done.
pkg/util/vector/vector.go
line 175 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I see that this comment comes from the pgvector source code, but I find it confusing since it doesn't actually match the formula.
I think the comment says "over" but it really means "instead of"?
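Read that way, the comment is about taking one square root of the product rather than two square roots. A small sketch of the two forms, using hypothetical helper names that do not appear in the PR:

```go
package main

import "math"

// normProductOneSqrt is the form the comment recommends: multiply the two
// squared norms first, then take a single square root.
func normProductOneSqrt(normASq, normBSq float64) float64 {
	return math.Sqrt(normASq * normBSq)
}

// normProductTwoSqrt is the form the comment advises against: two square
// roots, then a multiplication.
func normProductTwoSqrt(normASq, normBSq float64) float64 {
	return math.Sqrt(normASq) * math.Sqrt(normBSq)
}
```

For non-negative inputs the two are mathematically equal, so either can serve as the denominator of the cosine similarity; they can differ in the last bits due to rounding.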
pkg/sql/logictest/testdata/logic_test/mixed_version_pgvector
line 1 at r4 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Thanks for adding this. It reminds me - we also need a case for VECTOR in unsupported_types.go.
Done.
I removed it from scalar types again, since that causes |
I'd like to get this merged, at the very least to stop rebasing the |
No objections from me, ship it!
Reviewed 2 of 21 files at r2, 29 of 38 files at r5, 2 of 3 files at r6, 7 of 7 files at r7, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @jordanlewis, @nameisbhaskar, @nkodali, @rharding6373, and @vidit-bhat)
pkg/util/vector/vector.go
line 175 at r4 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I think the comment says "over" but it really means "instead of"?
Ah, I see, thanks.
Looks like there was a test failure:
|
Reviewed 29 of 38 files at r5, 2 of 3 files at r6, 7 of 7 files at r7, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @jordanlewis, @nameisbhaskar, @nkodali, @rharding6373, and @vidit-bhat)
This commit adds the pgvector datatype and associated evaluation operators and functions. It doesn't include index acceleration.
Functionality included:
- CREATE EXTENSION vector
- vector datatype with optional length, storage and retrieval in non-indexed table columns
- <-> operator - L2 distance
- <#> operator - (negative) inner product
- <=> operator - cosine distance
- l1_distance builtin
- l2_distance builtin
- cosine_distance builtin
- inner_product builtin
- vector_dims builtin
- vector_norm builtin

Updates #121432
Epic: None
Release note (sql change): implement pgvector encoding, decoding, and operators, without index acceleration.
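For readers unfamiliar with the distance definitions listed above, the following is a minimal Go sketch of three of them. The function names are illustrative and are not the ones used in pkg/util/vector; the sketch assumes both inputs have the same number of dimensions and does no validation.

```go
package main

import "math"

// l1Distance returns the sum of absolute element-wise differences.
func l1Distance(a, b []float32) float64 {
	var sum float64
	for i := range a {
		sum += math.Abs(float64(a[i]) - float64(b[i]))
	}
	return sum
}

// l2Distance returns the Euclidean distance, which is what the <->
// operator computes.
func l2Distance(a, b []float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

// negInnerProduct returns the negated inner product, which is what the <#>
// operator computes, so that a smaller result means a closer match.
func negInnerProduct(a, b []float32) float64 {
	var sum float64
	for i := range a {
		sum += float64(a[i]) * float64(b[i])
	}
	return -sum
}
```

Note the sign difference described in the list: the inner_product builtin returns the plain inner product, while the <#> operator returns its negation.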