Skip to content
This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Add EXISTS filter #556

Merged
merged 58 commits into from
Aug 4, 2022
Merged

Add EXISTS filter #556

merged 58 commits into from
Aug 4, 2022

Conversation

loiclec
Copy link
Contributor

@loiclec loiclec commented Jun 14, 2022

What does this PR do?

Fixes issue #2484 in the meilisearch repo.

It creates a field EXISTS filter which selects all documents containing the field key.
For example, with the following documents:

[{
	"id": 0,
	"colour": []
},
{
	"id": 1,
	"colour": ["blue", "green"]
},
{
	"id": 2,
	"colour": 145238
},
{
	"id": 3,
	"colour": null
},
{
	"id": 4,
	"colour": {
		"green": []
	}
},
{
	"id": 5,
	"colour": {}
},
{
	"id": 6
}]

Then the filter colour EXISTS selects the ids [0, 1, 2, 3, 4, 5]. The filter colour NOT EXISTS selects [6].

Details

There is a new database named facet-id-exists-docids. Its keys are field ids and its values are bitmaps of all the document ids where the corresponding field exists.

To create this database, the indexing part of milli had to be adapted. The implementation there is basically copy/pasted from the code handling the facet-id-f64-docids database, with appropriate modifications in place.

There was an issue involving the flattening of documents during (re)indexing. Previously, the following JSON:

{
    "id": 0,
    "colour": [],
    "size": {}
}

would be flattened to:

{
    "id": 0
}

prior to being given to the extraction pipeline.

This transformation would lose the information that is needed to populate the facet-id-exists-docids database. Therefore, I have also changed the implementation of the flatten-serde-json crate. Now, as it traverses the Json, it keeps track of which key was encountered. Then, at the end, if a previously encountered key is not present in the flattened object, it adds that key to the object with an empty array as value. For example:

{
    "id": 0,
    "colour": {
        "green": [],
        "blue": 1
    },
    "size": {}
} 

becomes

{
    "id": 0,
    "colour": [],
    "colour.green": [],
    "colour.blue": 1,
    "size": []
} 

@loiclec loiclec requested a review from irevoire June 14, 2022 15:22
@irevoire irevoire added DB breaking The related changes break the DB API breaking The related changes break the milli API labels Jun 14, 2022
@irevoire irevoire marked this pull request as ready for review June 15, 2022 12:40
filter-parser/src/lib.rs Outdated Show resolved Hide resolved
filter-parser/src/lib.rs Outdated Show resolved Hide resolved
filter-parser/src/lib.rs Outdated Show resolved Hide resolved
filter-parser/src/lib.rs Outdated Show resolved Hide resolved
infos/src/main.rs Outdated Show resolved Hide resolved
milli/src/heed_codec/facet/mod.rs Outdated Show resolved Hide resolved
Comment on lines 34 to 68
pub struct FieldIdCodec;

impl<'a> heed::BytesDecode<'a> for FieldIdCodec {
type DItem = FieldId;

fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
let (field_id_bytes, _) = try_split_array_at(bytes)?;
let field_id = u16::from_be_bytes(field_id_bytes);
Some(field_id)
}
}

impl<'a> heed::BytesEncode<'a> for FieldIdCodec {
type EItem = FieldId;

fn bytes_encode(field_id: &Self::EItem) -> Option<Cow<[u8]>> {
Some(Cow::Owned(field_id.to_be_bytes().to_vec()))
}
}

pub struct FieldIdDocIdCodec;

impl<'a> heed::BytesDecode<'a> for FieldIdDocIdCodec {
type DItem = (FieldId, DocumentId);

fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
let (field_id_bytes, bytes) = try_split_array_at(bytes)?;
let field_id = u16::from_be_bytes(field_id_bytes);

let document_id_bytes = bytes[..4].try_into().ok()?;
let document_id = u32::from_be_bytes(document_id_bytes);

Some((field_id, document_id))
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kerollmops don't we already have a codec for the FieldIds?

Copy link
Member

@Kerollmops Kerollmops Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum... I don't remember, we should check the codecs that we use in the Index database.

milli/src/index.rs Outdated Show resolved Hide resolved
milli/src/heed_codec/facet/mod.rs Outdated Show resolved Hide resolved
milli/src/heed_codec/facet/mod.rs Outdated Show resolved Hide resolved
milli/src/heed_codec/facet/mod.rs Outdated Show resolved Hide resolved
milli/src/update/index_documents/mod.rs Outdated Show resolved Hide resolved
milli/src/update/index_documents/mod.rs Outdated Show resolved Hide resolved
irevoire
irevoire previously approved these changes Jun 16, 2022
Copy link
Member

@irevoire irevoire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me thank you 👍

@loiclec loiclec requested a review from ManyTheFish June 16, 2022 09:43
milli/src/heed_codec/facet/field_id_codec.rs Outdated Show resolved Hide resolved
@loiclec loiclec force-pushed the filter/field-exist branch 2 times, most recently from b32bbfc to b327fde Compare July 4, 2022 07:32
Loïc Lecrenier added 6 commits July 19, 2022 10:07
OR, AND, NOT, TO must now be followed by spaces
The idea is to directly create a sorted and merged list of bitmaps
in the form of a BTreeMap<FieldId, RoaringBitmap> instead of creating
a grenad::Reader where the keys are field_id and the values are docids.

Then we send that BTreeMap to the thing that handles TypedChunks, which
inserts its content into the database.
@loiclec loiclec force-pushed the filter/field-exist branch from c8fca39 to aed8c69 Compare July 19, 2022 08:08
@loiclec loiclec requested a review from ManyTheFish July 20, 2022 07:07
ManyTheFish
ManyTheFish previously approved these changes Jul 20, 2022
Copy link
Member

@ManyTheFish ManyTheFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

loiclec and others added 5 commits July 20, 2022 16:20
Co-authored-by: Many the fish <many@meilisearch.com>
561: Enriched documents batch reader r=curquiza a=Kerollmops

~This PR is based on #555 and must be rebased on main after it has been merged to ease the review.~
This PR contains the work in #555 and can be merged on main as soon as reviewed and approved.

- [x] Create an `EnrichedDocumentsBatchReader` that contains the external documents id.
- [x] Extract the primary key name and make it accessible in the `EnrichedDocumentsBatchReader`.
- [x] Use the external id from the `EnrichedDocumentsBatchReader` in the `Transform::read_documents`.
- [x] Remove the `update_primary_key` from the _transform.rs_ file.
- [x] Really generate the auto-generated documents ids.
- [x] Insert the (auto-generated) document ids in the document while processing it in `Transform::read_documents`.

Co-authored-by: Kerollmops <clement@meilisearch.com>
595: Update version for next release (v0.32.0) r=ManyTheFish a=curquiza

In order to release on `main` (for v0.29.0, not v0.28.1)

<img width="1014" alt="Capture d’écran 2022-07-21 à 13 20 35" src="https://user-images.githubusercontent.com/20380692/180178725-381fbdf1-c0fb-4fa9-9954-452aec5a1574.png">


Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
@loiclec
Copy link
Contributor Author

loiclec commented Jul 21, 2022

@ManyTheFish thanks for your review :)

A big PR was merged into main since then and I had to make this PR up-to-date with the new main. Rebasing was quite difficult to do so instead I merged the previous filter/field-exists and main, resolved the conflicts, and pushed the result again under the same branch. I hope it's not too messy :(

ManyTheFish
ManyTheFish previously approved these changes Jul 21, 2022
Copy link
Member

@ManyTheFish ManyTheFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch! let's merge like this.
But! In this case of big merge conflicts, I prefer to squash all my commits before rebasing. Yes, I lose my history but I find it way clearer!

filter-parser/fuzz/.gitignore Outdated Show resolved Hide resolved
Co-authored-by: Many the fish <many@meilisearch.com>
Copy link
Member

@ManyTheFish ManyTheFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @loiclec, sorry for the time!

You can merge it if everything is ok on your side.

@loiclec
Copy link
Contributor Author

loiclec commented Aug 4, 2022

Thanks everyone!
bors merge

@bors
Copy link
Contributor

bors bot commented Aug 4, 2022

@bors bors bot merged commit 21284cf into main Aug 4, 2022
@bors bors bot deleted the filter/field-exist branch August 4, 2022 10:04
bors bot added a commit that referenced this pull request Aug 18, 2022
596: Filter operators: NOT + IN[..] r=irevoire a=loiclec

# Pull Request

## What does this PR do?
Implements the changes described in meilisearch/meilisearch#2580
It is based on top of #556 

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API breaking The related changes break the milli API DB breaking The related changes break the DB
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants