feat: support sparse vector #299

Merged
merged 25 commits into from
Feb 18, 2024

Conversation

silver-ymz
Member

@silver-ymz silver-ymz commented Jan 23, 2024

part of #252

Usage

CREATE TABLE t (val svector(6));

INSERT INTO t (val) SELECT ARRAY[0, random(), 0, 0, random(), random()]::real[]::vector::svector FROM generate_series(1, 1000);

INSERT INTO t (val) SELECT '[0, 1, 1, 0, 0, 0]';

CREATE INDEX ON t USING vectors (val svector_l2_ops)
WITH (options = "[indexing.hnsw]");

CREATE INDEX ON t USING vectors (val svector_dot_ops)
WITH (options = "[indexing.hnsw]");

CREATE INDEX ON t USING vectors (val svector_cos_ops)
WITH (options = "[indexing.ivf]");

Design

Storage

The storage part is in crates/service/src/prelude/storage.

DenseMmap is the original RawMmap. The sparse storage struct is as follows:

pub struct SparseMmap {
    vectors: MmapArray<SparseF32Element>,
    offsets: MmapArray<u32>,
    payload: MmapArray<Payload>,
    dims: u16,
}
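
A minimal in-memory sketch of how a layout like this can be read, assuming `offsets` follows the usual CSR convention where `offsets[i]..offsets[i + 1]` delimits vector `i`. The PR description does not spell out that convention, so treat it as an assumption. `SparseStore` and `SparseElement` are hypothetical names, `Vec` stands in for `MmapArray`, plain `f32` stands in for `F32`, and the `payload` array is left out.

```rust
/// One stored nonzero: its position in the vector and its value.
/// Hypothetical stand-in for the PR's element type (`f32` instead of `F32`).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SparseElement {
    pub index: u32,
    pub value: f32,
}

/// Hypothetical in-memory stand-in for `SparseMmap`.
pub struct SparseStore {
    vectors: Vec<SparseElement>, // nonzeros of all vectors, back to back
    offsets: Vec<u32>,           // assumed: offsets[i]..offsets[i + 1] is vector i
    dims: u16,
}

impl SparseStore {
    pub fn new(dims: u16) -> Self {
        // A leading 0 means vector i always has an end offset at i + 1.
        Self { vectors: Vec::new(), offsets: vec![0], dims }
    }

    /// Appends one vector and returns its id.
    pub fn push(&mut self, elements: &[SparseElement]) -> u32 {
        self.vectors.extend_from_slice(elements);
        self.offsets.push(self.vectors.len() as u32);
        (self.offsets.len() - 2) as u32
    }

    /// Number of stored vectors.
    pub fn len(&self) -> u32 {
        (self.offsets.len() - 1) as u32
    }

    pub fn dims(&self) -> u16 {
        self.dims
    }

    /// Returns the nonzeros of vector `i` as one contiguous slice.
    pub fn vector(&self, i: u32) -> &[SparseElement] {
        let start = self.offsets[i as usize] as usize;
        let end = self.offsets[i as usize + 1] as usize;
        &self.vectors[start..end]
    }
}
```

With this convention an all-zero vector costs one offset entry and no elements.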

Index

HNSW needs no modification. For IVF, the vectors sampled for k-means are expanded to dense vectors, and the whole k-means algorithm runs on the dense form. Once the centroids are computed, the remaining vectors are processed in sparse format.
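
A rough sketch of the two halves of that IVF flow: expanding a sample to dense form for k-means, and then comparing a sparse vector against a dense centroid without densifying it. The function names are hypothetical, and the sparse-to-dense distance identity used here is one possible approach, not necessarily what this PR implements.

```rust
/// Expands (index, value) pairs into a dense vector of length `dims`,
/// as would be done for the k-means samples.
pub fn to_dense(indexes: &[u16], values: &[f32], dims: usize) -> Vec<f32> {
    let mut dense = vec![0.0; dims];
    for (&i, &v) in indexes.iter().zip(values) {
        dense[i as usize] = v;
    }
    dense
}

/// Squared L2 distance between a sparse vector and a dense centroid.
/// Uses ||x - c||^2 = ||c||^2 + sum over nonzeros of ((x_i - c_i)^2 - c_i^2),
/// so only the nonzero positions of x are visited after the norm is known.
pub fn l2_sparse_dense(indexes: &[u16], values: &[f32], centroid: &[f32]) -> f32 {
    // In a real index the centroid norm would be computed once and cached.
    let centroid_norm2: f32 = centroid.iter().map(|c| c * c).sum();
    let mut distance = centroid_norm2;
    for (&i, &v) in indexes.iter().zip(values) {
        let c = centroid[i as usize];
        distance += (v - c) * (v - c) - c * c;
    }
    distance
}

/// Returns the id of the closest centroid, or `None` if there are no centroids.
pub fn nearest_centroid(indexes: &[u16], values: &[f32], centroids: &[Vec<f32>]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (id, centroid) in centroids.iter().enumerate() {
        let d = l2_sparse_dense(indexes, values, centroid);
        if best.map_or(true, |(_, best_d)| d < best_d) {
            best = Some((id, d));
        }
    }
    best.map(|(id, _)| id)
}
```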

SQ and PQ are not implemented yet.

Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
@silver-ymz silver-ymz marked this pull request as ready for review January 23, 2024 13:46
@silver-ymz silver-ymz changed the title [WIP] feat: support sparse vector feat: support sparse vector Jan 23, 2024
@silver-ymz
Member Author

PTAL @VoVAllen @usamoi

Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
@silver-ymz silver-ymz changed the base branch from main to 0.3 January 29, 2024 07:36
src/datatype/svecf32.rs (outdated, resolved)
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
@silver-ymz silver-ymz changed the base branch from 0.3 to main January 29, 2024 14:57
crates/service/src/prelude/global/mod.rs (resolved)
crates/service/src/prelude/scalar/sparse_f32.rs (outdated, resolved)
crates/service/src/prelude/storage/mod.rs (outdated, resolved)
src/datatype/svecf32.rs (outdated, resolved)
crates/service/src/algorithms/quantization/product.rs (outdated, resolved)
crates/service/src/algorithms/quantization/product.rs (outdated, resolved)
src/prelude/error.rs (outdated, resolved)
src/datatype/svecf32.rs (outdated, resolved)
crates/service/src/prelude/storage/sparse.rs (outdated, resolved)
crates/service/src/algorithms/quantization/scalar.rs (outdated, resolved)
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
@usamoi
Collaborator

usamoi commented Feb 2, 2024

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub struct SparseF32Element {
    pub index: u32,
    pub value: F32,
}

It'll be better if we use two arrays (array of index: u16 and value: F32). We'll save 25% memory on vectors.
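
The 25% figure can be checked with a small size calculation. The sketch below uses hypothetical struct names and plain `f32` in place of `F32`. It also shows why simply narrowing the index to `u16` inside the interleaved struct would not help: alignment padding keeps each element at 8 bytes, so the saving only appears with two separate arrays.

```rust
use std::mem::size_of;

/// Interleaved element with a 32-bit index, as in the struct quoted above.
#[repr(C)]
pub struct ElementU32 {
    pub index: u32,
    pub value: f32,
}

/// Interleaved element with a 16-bit index. The f32 field needs 4-byte
/// alignment, so two padding bytes follow the index.
#[repr(C)]
pub struct ElementU16 {
    pub index: u16,
    pub value: f32,
}

/// Bytes for `nnz` nonzeros stored as interleaved (u32, f32) elements.
pub fn bytes_interleaved_u32(nnz: usize) -> usize {
    nnz * size_of::<ElementU32>()
}

/// Bytes for `nnz` nonzeros stored as interleaved (u16, f32) elements.
pub fn bytes_interleaved_u16(nnz: usize) -> usize {
    nnz * size_of::<ElementU16>()
}

/// Bytes for `nnz` nonzeros stored as one u16 array plus one f32 array.
pub fn bytes_split(nnz: usize) -> usize {
    nnz * (size_of::<u16>() + size_of::<f32>())
}
```

For 1000 nonzeros this gives 8000 bytes for either interleaved form and 6000 bytes for the split form, which is the 25% reduction mentioned above.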

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

We probably need some experiments to find the best calculation method and storage layout. pyanns stores the values along with the indices, like [1, 0.1, 2, 0.1], so the read is fully sequential. Separating them requires a separate random access.

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

Also, there should be opportunities to use SIMD to optimize the computation. An if statement inside the calculation loop is likely to be much slower.

@usamoi
Collaborator

usamoi commented Feb 2, 2024

Separating them requires a separate random access

We will read both arrays at the same time, so it's still sequential.

there should be chances to use simd to optimize the computation

This layout should be nice for SIMD.
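
To make the access pattern concrete, here is a scalar sketch of a sparse dot product over the two-array layout, assuming both index arrays are sorted in strictly increasing order. The function name is hypothetical and this is not the PR's implementation. Both cursors only move forward, so each of the four arrays is read sequentially.

```rust
use std::cmp::Ordering;

/// Dot product of two sparse vectors stored as (sorted indexes, values) pairs.
/// Assumes each index array is strictly increasing and matches its value
/// array in length.
pub fn sparse_dot(l_idx: &[u16], l_val: &[f32], r_idx: &[u16], r_val: &[f32]) -> f32 {
    let (mut p, mut q) = (0, 0);
    let mut sum = 0.0;
    while p < l_idx.len() && q < r_idx.len() {
        match l_idx[p].cmp(&r_idx[q]) {
            // Advance whichever side is behind.
            Ordering::Less => p += 1,
            Ordering::Greater => q += 1,
            // Matching dimension: accumulate and advance both sides.
            Ordering::Equal => {
                sum += l_val[p] * r_val[q];
                p += 1;
                q += 1;
            }
        }
    }
    sum
}
```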

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

I'm actually thinking of bitset like encoding for indices such as roaringbitmap. Then you can get the intersection of the set easily, and then select out the value to compute. Not sure how much improvement we can have here. Is it worth trying?
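
A toy illustration of the bitset idea, limited to at most 64 dimensions so that a single `u64` can act as the index set. `BitsetVec` is a hypothetical type, not part of the PR and not how Roaring Bitmap is organized internally. The intersection is a single AND, and the position of each value is recovered with a popcount over the lower bits.

```rust
/// Sparse vector with a bitmask of nonzero dimensions (toy version, dims <= 64).
pub struct BitsetVec {
    mask: u64,        // bit i set means dimension i is nonzero
    values: Vec<f32>, // nonzero values in increasing dimension order
}

impl BitsetVec {
    pub fn from_dense(dense: &[f32]) -> Self {
        assert!(dense.len() <= 64, "toy sketch supports at most 64 dimensions");
        let mut mask = 0u64;
        let mut values = Vec::new();
        for (i, &v) in dense.iter().enumerate() {
            if v != 0.0 {
                mask |= 1u64 << i;
                values.push(v);
            }
        }
        Self { mask, values }
    }

    /// Number of nonzero dimensions below `bit`, i.e. the slot of `bit` in `values`.
    fn rank(&self, bit: u32) -> usize {
        (self.mask & ((1u64 << bit) - 1)).count_ones() as usize
    }

    pub fn dot(&self, other: &Self) -> f32 {
        // Set intersection of the two index sets.
        let mut common = self.mask & other.mask;
        let mut sum = 0.0;
        while common != 0 {
            let bit = common.trailing_zeros();
            sum += self.values[self.rank(bit)] * other.values[other.rank(bit)];
            common &= common - 1; // clear the lowest set bit
        }
        sum
    }
}
```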

@usamoi
Collaborator

usamoi commented Feb 2, 2024

I'm actually thinking of bitset like encoding for indices such as roaringbitmap. Then you can get the intersection of the set easily, and then select out the value to compute. Not sure how much improvement we can have here. Is it worth trying?

Roaring Bitmap is designed for large data sets. How could we take advantage of it?

@VoVAllen
Member

VoVAllen commented Feb 5, 2024

Do we need to store the vector norm for L2/cos distance in sparse vector? @usamoi
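
For context on why a stored norm could matter: if squared norms were cached per vector, both L2 and cosine distance would reduce to a single sparse dot product plus a few scalar operations. The sketch below only illustrates those identities. The function names are hypothetical, and whether the PR caches norms is exactly the open question here.

```rust
/// Squared norm of a sparse vector. Only nonzero values contribute.
pub fn norm2(values: &[f32]) -> f32 {
    values.iter().map(|v| v * v).sum()
}

/// Squared L2 distance from a precomputed dot product and squared norms:
/// ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b).
pub fn l2_from_dot(dot: f32, a_norm2: f32, b_norm2: f32) -> f32 {
    a_norm2 + b_norm2 - 2.0 * dot
}

/// Cosine distance from a precomputed dot product and squared norms:
/// 1 - (a . b) / (||a|| * ||b||). Both norms must be nonzero.
pub fn cos_distance_from_dot(dot: f32, a_norm2: f32, b_norm2: f32) -> f32 {
    1.0 - dot / (a_norm2.sqrt() * b_norm2.sqrt())
}
```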

@VoVAllen
Member

VoVAllen commented Feb 5, 2024

We can handle the branchless/SIMD optimization in another PR, since it doesn't affect the internal layout.

@VoVAllen
Member

VoVAllen commented Feb 5, 2024

I just feel a bitset-like structure is easier to use with SIMD. I found this, which might be useful as a reference: https://github.com/UNSW-database/simd_set_operations , along with the author's thesis, which includes a thorough analysis.

After reading the thesis, I think our final target should be https://github.com/UNSW-database/simd_set_operations/blob/main/setops/src/intersect/broadcast.rs#L489-L561. On an AVX-512 machine, it's pretty competitive with the array representation.

@usamoi usamoi mentioned this pull request Feb 6, 2024
14 tasks
@usamoi
Collaborator

usamoi commented Feb 6, 2024

Do we have a reference for the format of svector_from_kv_string?

let values_ptr = pgrx::pg_sys::pq_getmsgbytes(buf, values_bytes as _);
let mut output = SVecf32::new_zeroed_in_postgres(len as usize);
output.dims = dims;
std::ptr::copy(indexes_ptr, output.indexes_mut() as _, indexes_bytes);
Collaborator

@usamoi usamoi Feb 6, 2024

Let's detect possible data corruption for indexes.

Member Author

done in new commit
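
A sketch of the kind of checks such corruption detection might include when receiving a sparse vector from an untrusted byte stream. The concrete checks added in the follow-up commit are not shown in this thread, so the rules below (matching lengths, indexes within `dims`, strictly increasing order) are assumptions, and `validate_sparse` and `IndexError` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum IndexError {
    LengthMismatch { indexes: usize, values: usize },
    OutOfBounds { position: usize, index: u16, dims: u16 },
    NotStrictlyIncreasing { position: usize },
}

/// Validates the index array of a received sparse vector.
/// Assumed invariants: one value per index, every index below `dims`,
/// and indexes strictly increasing (which also rules out duplicates).
pub fn validate_sparse(dims: u16, indexes: &[u16], values_len: usize) -> Result<(), IndexError> {
    if indexes.len() != values_len {
        return Err(IndexError::LengthMismatch {
            indexes: indexes.len(),
            values: values_len,
        });
    }
    for (position, &index) in indexes.iter().enumerate() {
        if index >= dims {
            return Err(IndexError::OutOfBounds { position, index, dims });
        }
        if position > 0 && indexes[position - 1] >= index {
            return Err(IndexError::NotStrictlyIncreasing { position });
        }
    }
    Ok(())
}
```

Running a check like this before copying into the output buffer means a malformed message is rejected with an error instead of producing a vector that later distance code would misread.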

ptr.add(offset).cast()
}
}
fn indexes_mut(&mut self) -> *mut u16 {
Collaborator

Remove it.

Member Author

@silver-ymz silver-ymz Feb 7, 2024

_vectors_svecf32_recv needs it. I updated the returned value from a pointer to a reference.

    fn indexes_mut(&mut self) -> *mut u16 {
        self.phantom.as_mut_ptr().cast()
    }
    fn values_mut(&mut self) -> *mut F32 {
Collaborator

Remove it.

Member Author

_vectors_svecf32_recv needs it. I updated the returned value from a pointer to a reference.

    fn indexes(&self) -> *const u16 {
        self.phantom.as_ptr().cast()
    }
    fn values(&self) -> *const F32 {
Collaborator

Why not return a reference?

Member Author

done in new commit

    pub fn len(&self) -> usize {
        self.len as usize
    }
    fn indexes(&self) -> *const u16 {
Collaborator

Why not return a reference?

Member Author

done in new commit

}
.friendly();
}
SVecf32::new_in_postgres(SparseF32Ref {
Collaborator

We need to check if the dimensions are in [1, 65535].

Member Author

done in new commit
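
A minimal sketch of such a range check, assuming the dimension count arrives as a `usize` (for example, the length of a parsed literal). `check_dims` is a hypothetical helper, not the function added in the commit.

```rust
/// Returns the dimension count as `u16` if it lies in [1, 65535], else `None`.
pub fn check_dims(n: usize) -> Option<u16> {
    match u16::try_from(n) {
        // try_from already rejects anything above 65535; zero is rejected here.
        Ok(dims) if dims >= 1 => Some(dims),
        _ => None,
    }
}
```

Doing the conversion with `u16::try_from` instead of an `as` cast avoids silently truncating an oversized length to a smaller valid-looking value.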

@@ -70,6 +70,14 @@ INFORMATION: left_dimensions = {left_dimensions}, right_dimensions = {right_dime
left_dimensions: u16,
right_dimensions: u16,
},
#[error("\
Collaborator

Use BadLiteral.

Member Author

done in new commit

#[repr(C, align(8))]
pub struct SVecf32 {
    varlena: u32,
    len: u16,
Collaborator

@usamoi usamoi Feb 6, 2024

Can you swap the positions of len and dims?
I'm considering removing the from_datum trick, and dims is the more meaningful piece of information if we have to choose one to be in the header.

Member Author

done in new commit

}
}

pub fn dims(&self) -> u16 {
Collaborator

We do not need it, do we?

Member Author

removed in new commit

@@ -130,11 +161,11 @@ impl<S: G> DynamicIndexing<S> {
}
}

-    pub fn vector(&self, i: u32) -> &[S::Scalar] {
+    pub fn content(&self, i: u32) -> S::VectorRef<'_> {
Collaborator

Let's keep the names vector and open. After all, we already use these two names everywhere.

Member Author

done in new commit

@VoVAllen
Member

VoVAllen commented Feb 6, 2024

Do we have a reference for the format of svector_from_kv_string?

I don't think we need the from-kv-string capability now. Users can easily transform kv pairs into an array in any language. I think the kv format is rare.

@VoVAllen
Member

VoVAllen commented Feb 6, 2024

And I think we don't need to worry about the sparse computation now. We can always improve it later

Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
src/sql/finalize.sql (outdated, resolved)
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
use std::fmt::Display;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct SparseF32Element {
Collaborator

Will remove it.

@usamoi usamoi added this pull request to the merge queue Feb 18, 2024
Merged via the queue into tensorchord:main with commit d7a490c Feb 18, 2024
7 checks passed
@usamoi
Collaborator

usamoi commented Feb 18, 2024

Please fix the implementation of the comparison operators.

@silver-ymz silver-ymz deleted the feat/sparse branch February 18, 2024 08:17
3 participants