POC: Tantivy documents as a trait #2071
* Fix windows build
We don't need to serialize with a custom value, but just provide an API over the data. |
I see your point, but I don't think removing it is adding a huge amount of additional complexity. It makes the code a bit more sparse and abstracted, but the way we serialize and deserialize data is exactly the same; I just re-wrote it from a somewhat dense block of code on the The only thing I can see happening is that, if we provide custom documents for indexing but not retrieval, the question will inevitably come up as "Why?", which I can understand. It can also be quite convenient to deserialize into a custom type, but it's not necessarily the end of the world, I suppose. |
I had another look and misunderstood it before. There's no custom serialization format, right? Overall the PR looks good, nice job! |
Format is effectively the exact same as it was before, just with the additional codes to handle collections and objects, but it is not user defined. |
src/schema/document/mod.rs
Outdated
#[inline]
/// If the Value is a pre-tokenized string, returns the associated string. Returns None
/// otherwise.
fn as_tokenized_text(&self) -> Option<&'a PreTokenizedString> {
as_pretokenized_text
...
@PSeitz You can merge whenever you see fit. |
# Conflicts: # Cargo.toml # examples/warmer.rs # src/aggregation/bucket/histogram/date_histogram.rs # src/core/index.rs # src/directory/mmap_directory.rs # src/functional_test.rs # src/indexer/index_writer.rs # src/indexer/segment_writer.rs # src/lib.rs # src/query/boolean_query/boolean_query.rs # src/query/boolean_query/mod.rs # src/query/disjunction_max_query.rs # src/query/fuzzy_query.rs # src/query/more_like_this/more_like_this.rs # src/query/range_query/range_query.rs # src/query/regex_query.rs # src/query/term_query/term_query.rs # tests/failpoints/mod.rs
src/schema/value.rs
Outdated
if let Some(val) = number.as_u64() {
    Self::U64(val)
} else if let Some(val) = number.as_i64() {
    Self::I64(val)
- if let Some(val) = number.as_u64() {
-     Self::U64(val)
- } else if let Some(val) = number.as_i64() {
-     Self::I64(val)
+ if let Some(val) = number.as_i64() {
+     Self::I64(val)
+ } else if let Some(val) = number.as_u64() {
+     Self::U64(val)
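A minimal sketch of what the i64-first order changes, under stated assumptions: `Number` below is a hypothetical stand-in for an integer JSON number (it is not `serde_json::Number` itself), modelled on the behaviour that `as_u64` yields `None` for negative values and `as_i64` yields `None` for values above `i64::MAX`. `Value` is a cut-down stand-in for the enum under review.

```rust
/// Hypothetical stand-in for an integer JSON number (not serde_json's real type).
#[derive(Clone, Copy)]
enum Number {
    PosInt(u64),
    NegInt(i64),
}

impl Number {
    /// `Some` for non-negative integers, `None` for negative ones.
    fn as_u64(self) -> Option<u64> {
        match self {
            Number::PosInt(n) => Some(n),
            Number::NegInt(_) => None,
        }
    }

    /// `Some` for every integer that fits in an i64, `None` above `i64::MAX`.
    fn as_i64(self) -> Option<i64> {
        match self {
            Number::PosInt(n) if n <= i64::MAX as u64 => Some(n as i64),
            Number::PosInt(_) => None,
            Number::NegInt(n) => Some(n),
        }
    }
}

/// Cut-down stand-in for the value enum discussed in the review.
#[derive(Debug, PartialEq)]
enum Value {
    U64(u64),
    I64(i64),
}

/// i64 first: anything that fits in an i64 becomes `I64`; only values above
/// `i64::MAX` fall through to `U64`.
fn from_number(number: Number) -> Value {
    if let Some(val) = number.as_i64() {
        Value::I64(val)
    } else if let Some(val) = number.as_u64() {
        Value::U64(val)
    } else {
        unreachable!("an integer fits in either i64 or u64")
    }
}
```

With this order, small positive integers and negative integers end up with the same variant, and `U64` is only produced for values that cannot be represented as an i64.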
src/schema/value.rs
Outdated
}
serde_json::Value::String(val) => Self::Str(val),
if can_be_rfc3339_date_time(&text) {
    match OffsetDateTime::parse(&text, &Rfc3339) {
        Ok(dt) => {
            let dt_utc = dt.to_offset(time::UtcOffset::UTC);
            Self::Date(DateTime::from_utc(dt_utc))
        }
        Err(_) => Self::Str(text),
    }
} else {
    Self::Str(text)
}

fn can_be_rfc3339_date_time(text: &str) -> bool {
    if let Some(&first_byte) = text.as_bytes().get(0) {
        if first_byte >= b'0' && first_byte <= b'9' {
            return true;
        }
    }
    false
}
src/query/fuzzy_query.rs
Outdated
r#"{
    "attributes": {
        "aa": "japan"
        "as": "japan"
- "as": "japan"
+ "aa": "japan"
add Binary prefix to binary de/serialization
I had an extra pass and fixed the remaining issues. Thanks for the really nice PR! |
Thank you for your work. I have a question. The current document serialization interface |
No, but I think you could add a stored bytes field and put your custom serialized doc there |
Thank you for your answer. Is there any example of custom document and related serialization and deserialization? |
There is no custom document serialization and deserialization, having your real doc nested in a bytes field would be a workaround. |
Sorry, my previous statement was not clear enough. I want to be able to fully serialize/deserialize TantivyDocument, just like the old version of document:
Should I modify |
#[derive(Clone, Debug, serde::Serialize, serde::Deserialize, Default)]
pub struct TantivyDocument {
    field_values: Vec<FieldValue>,
}
Problem
Building on what #1352 describes, one of Tantivy's biggest limitations/pain points IMO is that you must convert whatever document type you are using in your code to a Tantivy `Document` type, which often involves re-allocating and a lot of extra code to walk through potentially nested objects (e.g. JSON objects) before you can index them with tantivy.
Use cases
A limited solution
The solution to this issue could initially be thought of as a basic document trait which simply provides an `fn field_values(&self) -> impl Iterator<Item = (Field, Value)>` method for accessing the document data. In theory, with the use of GATs, we can now even make the `Value` take borrowed data, avoiding the allocation issue.
Problems with this approach
Although the above suggestion works for a basic setup, it doesn't really solve the issue, since you still need a set of concrete tantivy values before you can index data, by which point the amount of effort it takes to convert to a `Document` is very small.
A fully flexible solution
To make document indexing more flexible and more powerful, we can extend how much of the system is represented by traits. In particular, we replace `Document`, `Value`, `FieldValue`, `serde_json::Map<String, serde_json::Value>` and `serde_json::Value` with a set of traits as described below:
`DocumentAccess` trait
Using GATs we can avoid a complicated set of lifetimes and unsafe. Instead, we simply describe the type the document uses for its values, which can borrow data (`Value<'a>`); the owned version of this type, which can potentially be the same type but can also differ depending on the application (`OwnedValue`); and finally the `FieldsValuesIter`, which is just used to get around not being able to use anonymous types in the form of `impl Iterator`.
As you can see with the code below, we've replaced the `FieldValue` type with a simple tuple instead. Technically this could be kept and made generic, but I don't think it's that useful to do so.
Compatibility
The original `Document` type has been kept and simply implements the trait, meaning a user already using tantivy should not experience any direct conflict if they just keep using the original type.
`DocValue<'a>` trait
This trait is fairly simple: it defines the common methods on the old `Document` type as methods of the trait, and has a generic `JsonVisitor` type which can be used to represent JSON data in more flexible ways and without allocation.
The original `Value` type implements this trait for compatibility and ergonomics. Technically speaking, if you wanted to take an approach similar to the first solution, you can absolutely do this just by using the original tantivy types.
`JsonVisitor<'a>` trait
The JSON visitor effectively just replaces the `Map<String, Value>` with a trait that allows for walking through the object.
This also means that any type which implements this trait can also be serialized via serde_json.
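To illustrate the idea of walking an object through a trait instead of a concrete map, here is a rough, std-only sketch. All names (`ObjectVisitor`, `ValueRef`, `Owned`, `MapVisitor`, `count_entries`) are hypothetical and are not the traits defined in this PR; the sketch only shows how a visitor can hand out borrowed keys and values without cloning them.

```rust
use std::collections::btree_map;
use std::collections::BTreeMap;

/// Hypothetical owned value, standing in for whatever the user's type stores.
enum Owned {
    Str(String),
    U64(u64),
}

/// Hypothetical borrowed view of a value; strings are borrowed, not cloned.
#[derive(Debug, PartialEq)]
enum ValueRef<'a> {
    Str(&'a str),
    U64(u64),
}

/// Hypothetical visitor: yields the entries of an object one at a time.
trait ObjectVisitor<'a> {
    fn next_entry(&mut self) -> Option<(&'a str, ValueRef<'a>)>;
}

/// Adapter that lets a borrowed `BTreeMap` be walked through the trait.
struct MapVisitor<'a> {
    inner: btree_map::Iter<'a, String, Owned>,
}

impl<'a> MapVisitor<'a> {
    fn new(map: &'a BTreeMap<String, Owned>) -> Self {
        MapVisitor { inner: map.iter() }
    }
}

impl<'a> ObjectVisitor<'a> for MapVisitor<'a> {
    fn next_entry(&mut self) -> Option<(&'a str, ValueRef<'a>)> {
        self.inner.next().map(|(key, value)| {
            let value = match value {
                Owned::Str(text) => ValueRef::Str(text.as_str()),
                Owned::U64(n) => ValueRef::U64(*n),
            };
            (key.as_str(), value)
        })
    }
}

/// Generic consumer: works with any visitor, not just the map adapter.
fn count_entries<'a, V: ObjectVisitor<'a>>(mut visitor: V) -> usize {
    let mut count = 0;
    while visitor.next_entry().is_some() {
        count += 1;
    }
    count
}
```

A consumer such as an indexer only needs the trait, so any user type that can produce such a visitor could be walked the same way.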
`JsonValueVisitor<'a>` trait
Similar to the JSON visitor, it describes the behaviour of a `serde_json::Value` rather than just an object.
Deserialization via `ValueDeserialize`
Originally I wanted to use something like `serde::de::DeserializeOwned` for this job, but it became obvious that what it gained in ease of compatibility with serde, it lost in complexity when handling custom deserializers. A single simple trait became better for this purpose.
One thing that could be improved further is passing a JSON deserializer for the JSON values. At the moment we pass in a `Map<String, serde_json::Value>`, which is fairly limited and not the most ergonomic thing in the world.
Compatibility and generic methods
One thing the trait approach requires is generics wherever documents are handled, which gives us two ways to approach it:
`Index`, etc... so it's one document type for the whole index, with a default type keeping compatibility with existing code. The problem is that it causes everything to require generics (see: Add document trait ChillFish8/tantivy#2 and the thousands of lines changed, which isn't even finished!)
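As a rough illustration of the ideas above (a GAT-based document trait plus an index that is generic over the document type with a default), here is a std-only sketch. It is not the code from this PR: `FieldId`, `DocIter`, the method name `iter_fields_and_values`, and the cut-down `Document` and `Index` types are all assumptions made for the example.

```rust
/// Hypothetical stand-in for tantivy's `Field` handle.
type FieldId = u32;

/// Hypothetical document trait. The GATs let values borrow from the document.
trait DocumentAccess {
    /// Value type handed out while iterating; may borrow from `self`.
    type Value<'a>
    where
        Self: 'a;
    /// Named iterator type, since `impl Iterator` cannot be used here.
    type FieldsValuesIter<'a>: Iterator<Item = (FieldId, Self::Value<'a>)>
    where
        Self: 'a;

    fn iter_fields_and_values(&self) -> Self::FieldsValuesIter<'_>;
}

/// Cut-down stand-in for the original concrete document type.
struct Document {
    field_values: Vec<(FieldId, String)>,
}

/// Iterator over a `Document` that yields borrowed string slices.
struct DocIter<'a> {
    inner: std::slice::Iter<'a, (FieldId, String)>,
}

impl<'a> Iterator for DocIter<'a> {
    type Item = (FieldId, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        self.inner.next().map(|(field, text)| (*field, text.as_str()))
    }
}

impl DocumentAccess for Document {
    type Value<'a> = &'a str where Self: 'a;
    type FieldsValuesIter<'a> = DocIter<'a> where Self: 'a;

    fn iter_fields_and_values(&self) -> Self::FieldsValuesIter<'_> {
        DocIter { inner: self.field_values.iter() }
    }
}

/// Hypothetical index, generic over the document type. The default parameter
/// means a plain `Index` in type position still refers to `Index<Document>`.
struct Index<D: DocumentAccess = Document> {
    docs: Vec<D>,
}

impl<D: DocumentAccess> Index<D> {
    fn new() -> Self {
        Index { docs: Vec::new() }
    }

    fn add_document(&mut self, doc: D) {
        self.docs.push(doc);
    }

    /// Counts every (field, value) pair across all stored documents.
    fn num_values(&self) -> usize {
        self.docs.iter().map(|d| d.iter_fields_and_values().count()).sum()
    }
}
```

Existing code that names the type as `Index` keeps compiling thanks to the default parameter, but every function that touches documents still has to carry the `D` parameter, which is the cost the paragraph above describes.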