-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support sparse vector #299
Conversation
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub struct SparseF32Element {
pub index: u32,
pub value: F32,
} It'll be better if we use two arrays (array of |
Probably need some experiments to see the best calculation method/storage layout. In pyanns, it stored the data along with the indices like [1, 0.1, 2, 0.1]. Therefore the read is totally sequential. Seperate them needs a seperate random access. |
Also there should be chances to use simd to optimize the computation. if statement inside calculation should be much slower |
We will read both two arrays at the same time so it's still sequential.
This layout should be nice for SIMD. |
I'm actually thinking of bitset like encoding for indices such as roaringbitmap. Then you can get the intersection of the set easily, and then select out the value to compute. Not sure how much improvement we can have here. Is it worth trying? |
Roaring bitmap is designed for a large data. How could we take advantage of it? |
Do we need to store the vector norm for L2/cos distance in sparse vector? @usamoi |
We can handle the branchless/SIMD optimization in other PR since it doesn't affect internal layout |
After reading the thesis, I think our final target should be https://github.com/UNSW-database/simd_set_operations/blob/main/setops/src/intersect/broadcast.rs#L489-L561. On AVX512 machine, it's pretty competitive with the array representation. |
Do we have the reference of format for |
src/datatype/svecf32.rs
Outdated
let values_ptr = pgrx::pg_sys::pq_getmsgbytes(buf, values_bytes as _); | ||
let mut output = SVecf32::new_zeroed_in_postgres(len as usize); | ||
output.dims = dims; | ||
std::ptr::copy(indexes_ptr, output.indexes_mut() as _, indexes_bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's detect possible data corruption for indexes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in new commit
src/datatype/svecf32.rs
Outdated
ptr.add(offset).cast() | ||
} | ||
} | ||
fn indexes_mut(&mut self) -> *mut u16 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_vectors_svecf32_recv
needs it. I updated the returned value from a pointer to a reference.
src/datatype/svecf32.rs
Outdated
fn indexes_mut(&mut self) -> *mut u16 { | ||
self.phantom.as_mut_ptr().cast() | ||
} | ||
fn values_mut(&mut self) -> *mut F32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_vectors_svecf32_recv
needs it. I updated the returned value from a pointer to a reference.
src/datatype/svecf32.rs
Outdated
fn indexes(&self) -> *const u16 { | ||
self.phantom.as_ptr().cast() | ||
} | ||
fn values(&self) -> *const F32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not return a reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in new commit
src/datatype/svecf32.rs
Outdated
pub fn len(&self) -> usize { | ||
self.len as usize | ||
} | ||
fn indexes(&self) -> *const u16 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not return a reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in new commit
} | ||
.friendly(); | ||
} | ||
SVecf32::new_in_postgres(SparseF32Ref { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to check if the dimensions are in [1, 65535]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in new commit
src/prelude/error.rs
Outdated
@@ -70,6 +70,14 @@ INFORMATION: left_dimensions = {left_dimensions}, right_dimensions = {right_dime | |||
left_dimensions: u16, | |||
right_dimensions: u16, | |||
}, | |||
#[error("\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use BadLiteral
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in new commit
src/datatype/svecf32.rs
Outdated
#[repr(C, align(8))] | ||
pub struct SVecf32 { | ||
varlena: u32, | ||
len: u16, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you swap the position of len
and dims
?
I'm considering removing from_datum
trick, and dims
is a more meaningful information if we choose one to be in header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in new commit
} | ||
} | ||
|
||
pub fn dims(&self) -> u16 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need it, do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed in new commit
@@ -130,11 +161,11 @@ impl<S: G> DynamicIndexing<S> { | |||
} | |||
} | |||
|
|||
pub fn vector(&self, i: u32) -> &[S::Scalar] { | |||
pub fn content(&self, i: u32) -> S::VectorRef<'_> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the name of vector
and open
. Afterall we already use the two names everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in new commit
I don't think we need from kv string capability now. User can easily transform kv to the array in any language. The kv format I think is rare |
And I think we don't need to worry about the sparse computation now. We can always improve it later |
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
3c56ad3
to
3e55f63
Compare
use std::fmt::Display; | ||
|
||
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)] | ||
pub struct SparseF32Element { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove it.
Please fix the implementation of comparsion operators. |
part of #252
Uasge
Design
Storage
The storage part is in
crates/service/src/prelude/storage
.DenseMmap
is originalRawMmap
, sparse storage struct is followingIndex
Hnsw
doesn't make any modification. Forivf
, it expands all vectors to dense vector as the subsamples of kmeans, and all kmeans algorithm is used on the dense vector. After getting centroids, it will calculate other vectors in sparse format.SQ
andPQ
aren't implemented now.