Add EXISTS filter #556

loiclec · 2022-06-14T15:22:38Z

What does this PR do?

Fixes issue #2484 in the meilisearch repo.

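It adds a filter of the form field EXISTS, which selects every document that contains the given field as a key, whatever value is stored under it (null, an empty array and an empty object all count).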
For example, with the following documents:

[{
	"id": 0,
	"colour": []
},
{
	"id": 1,
	"colour": ["blue", "green"]
},
{
	"id": 2,
	"colour": 145238
},
{
	"id": 3,
	"colour": null
},
{
	"id": 4,
	"colour": {
		"green": []
	}
},
{
	"id": 5,
	"colour": {}
},
{
	"id": 6
}]

Then the filter colour EXISTS selects the ids [0, 1, 2, 3, 4, 5]. The filter colour NOT EXISTS selects [6].
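The selection rule can be sketched at the document level as follows. This is only an illustration, not milli's implementation: the `Value` enum and the `doc`, `example_documents` and `select` helpers are hypothetical stand-ins (milli works on serde_json values and an on-disk index). The point it shows is that only key presence is tested, never the value.

```rust
use std::collections::BTreeMap;

// Hypothetical, minimal stand-in for a JSON value (milli itself uses serde_json).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Number(i64),
    String(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

type Document = BTreeMap<String, Value>;

// Hypothetical helper: builds a document with an id and an optional "colour" key.
fn doc(id: i64, colour: Option<Value>) -> Document {
    let mut d = Document::new();
    d.insert("id".to_string(), Value::Number(id));
    if let Some(value) = colour {
        d.insert("colour".to_string(), value);
    }
    d
}

// The seven documents from the example above.
fn example_documents() -> Vec<Document> {
    let mut green = BTreeMap::new();
    green.insert("green".to_string(), Value::Array(vec![]));
    vec![
        doc(0, Some(Value::Array(vec![]))),
        doc(1, Some(Value::Array(vec![
            Value::String("blue".to_string()),
            Value::String("green".to_string()),
        ]))),
        doc(2, Some(Value::Number(145238))),
        doc(3, Some(Value::Null)),
        doc(4, Some(Value::Object(green))),
        doc(5, Some(Value::Object(BTreeMap::new()))),
        doc(6, None),
    ]
}

// Returns the ids of the documents for which `field EXISTS` holds
// (or `field NOT EXISTS` when `want_exists` is false).
// Only key presence is tested, never the value stored under the key.
fn select(docs: &[Document], field: &str, want_exists: bool) -> Vec<i64> {
    docs.iter()
        .filter(|d| d.contains_key(field) == want_exists)
        .filter_map(|d| match d.get("id") {
            Some(Value::Number(id)) => Some(*id),
            _ => None,
        })
        .collect()
}
```

Calling `select(&example_documents(), "colour", true)` yields the ids 0 to 5, and passing `false` yields only 6, matching the two filters above.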

Details

There is a new database named facet-id-exists-docids. Its keys are field ids and its values are bitmaps of all the document ids where the corresponding field exists.
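As a rough in-memory model of that database, the sketch below maps field ids to sets of document ids. It makes several assumptions: `ExistsDb` is a hypothetical type, a `BTreeSet<u32>` stands in for the real bitmap, and computing NOT EXISTS as "all documents minus the EXISTS set" is my reading of the example above rather than a confirmed description of milli's code.

```rust
use std::collections::{BTreeMap, BTreeSet};

type FieldId = u16;
type DocumentId = u32;
// Stand-in for a bitmap of document ids (assumption: the real database stores bitmaps).
type DocIds = BTreeSet<DocumentId>;

// Hypothetical in-memory model of facet-id-exists-docids:
// field id -> ids of all documents in which that field exists.
struct ExistsDb {
    by_field: BTreeMap<FieldId, DocIds>,
}

impl ExistsDb {
    // `field EXISTS`: a single lookup; an unknown field selects nothing.
    fn exists(&self, field_id: FieldId) -> DocIds {
        self.by_field.get(&field_id).cloned().unwrap_or_default()
    }

    // `field NOT EXISTS`: every known document except those in the EXISTS set.
    fn not_exists(&self, field_id: FieldId, all_documents: &DocIds) -> DocIds {
        let present = self.exists(field_id);
        all_documents.difference(&present).copied().collect()
    }
}
```

With the seven example documents and `colour` assigned field id 1 (an arbitrary choice), the entry for 1 would hold {0, 1, 2, 3, 4, 5} and `not_exists` would return {6}.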

To create this database, the indexing part of milli had to be adapted. The implementation is largely copied from the code handling the facet-id-f64-docids database, with the modifications needed for this new database.

There was an issue involving the flattening of documents during (re)indexing. Previously, the following JSON:

{
    "id": 0,
    "colour": [],
    "size": {}
}

would be flattened to:

{
    "id": 0
}

prior to being given to the extraction pipeline.

This transformation would lose the information that is needed to populate the facet-id-exists-docids database. Therefore, I have also changed the implementation of the flatten-serde-json crate. Now, as it traverses the JSON, it keeps track of which keys were encountered. Then, at the end, if a previously encountered key is not present in the flattened object, it adds that key to the object with an empty array as its value. For example:

{
    "id": 0,
    "colour": {
        "green": [],
        "blue": 1
    },
    "size": {}
}

becomes

{
    "id": 0,
    "colour": [],
    "colour.green": [],
    "colour.blue": 1,
    "size": []
}
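The rule described above can be sketched as follows. This is a simplified illustration and not the flatten-serde-json code: the `Value` enum, `flatten` and `walk` are hypothetical names, only nested objects are flattened, and arrays are treated as leaf values (the real crate handles more cases).

```rust
use std::collections::BTreeMap;

// Hypothetical, minimal stand-in for a JSON value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Number(i64),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

type Object = BTreeMap<String, Value>;

// Flattens nested objects into dotted keys, then re-adds every key that was
// encountered during the traversal but produced no entry, with `[]` as value.
fn flatten(input: &Object) -> Object {
    let mut out = Object::new();
    let mut seen = Vec::new();
    walk(input, None, &mut out, &mut seen);
    for key in seen {
        out.entry(key).or_insert_with(|| Value::Array(Vec::new()));
    }
    out
}

fn walk(obj: &Object, prefix: Option<&str>, out: &mut Object, seen: &mut Vec<String>) {
    for (key, value) in obj {
        let path = match prefix {
            Some(p) => format!("{}.{}", p, key),
            None => key.clone(),
        };
        // Remember every key, whether or not it ends up producing an entry.
        seen.push(path.clone());
        match value {
            // Objects contribute only through their children.
            Value::Object(inner) => walk(inner, Some(&path), out, seen),
            // Empty arrays produce no entry here; `flatten` restores the key.
            Value::Array(items) if items.is_empty() => {}
            other => {
                out.insert(path, other.clone());
            }
        }
    }
}
```

Applied to the example above, the sketch produces the keys `colour`, `colour.blue`, `colour.green`, `id` and `size`, with `colour`, `colour.green` and `size` mapped to empty arrays.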

irevoire · 2022-06-15T13:00:29Z

milli/src/heed_codec/facet/mod.rs

+pub struct FieldIdCodec;
+
+impl<'a> heed::BytesDecode<'a> for FieldIdCodec {
+    type DItem = FieldId;
+
+    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
+        let (field_id_bytes, _) = try_split_array_at(bytes)?;
+        let field_id = u16::from_be_bytes(field_id_bytes);
+        Some(field_id)
+    }
+}
+
+impl<'a> heed::BytesEncode<'a> for FieldIdCodec {
+    type EItem = FieldId;
+
+    fn bytes_encode(field_id: &Self::EItem) -> Option<Cow<[u8]>> {
+        Some(Cow::Owned(field_id.to_be_bytes().to_vec()))
+    }
+}
+
+pub struct FieldIdDocIdCodec;
+
+impl<'a> heed::BytesDecode<'a> for FieldIdDocIdCodec {
+    type DItem = (FieldId, DocumentId);
+
+    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
+        let (field_id_bytes, bytes) = try_split_array_at(bytes)?;
+        let field_id = u16::from_be_bytes(field_id_bytes);
+
+        let document_id_bytes = bytes[..4].try_into().ok()?;
+        let document_id = u32::from_be_bytes(document_id_bytes);
+
+        Some((field_id, document_id))
+    }
+}


@Kerollmops don't we already have a codec for the FieldIds?

Kerollmops · 2022-06-15

Hmm... I don't remember; we should check the codecs that we use in the Index database.
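For readers unfamiliar with the key layout in the quoted diff, here is a standalone sketch of the same byte format without the heed traits. The function names are hypothetical, and the encoder is an assumption (the diff above only shows the decoder for the pair codec). The decoder checks the length up front, whereas indexing with `bytes[..4]` panics when fewer than four bytes remain.

```rust
type FieldId = u16;
type DocumentId = u32;

// Hypothetical encoder: 2 bytes of field id followed by 4 bytes of document id,
// both big-endian, so that byte-wise key order matches numeric order.
fn encode_field_id_docid(field_id: FieldId, document_id: DocumentId) -> [u8; 6] {
    let mut key = [0u8; 6];
    key[..2].copy_from_slice(&field_id.to_be_bytes());
    key[2..].copy_from_slice(&document_id.to_be_bytes());
    key
}

// Hypothetical decoder: returns None instead of panicking on short input.
fn decode_field_id_docid(bytes: &[u8]) -> Option<(FieldId, DocumentId)> {
    if bytes.len() < 6 {
        return None;
    }
    let field_id = u16::from_be_bytes([bytes[0], bytes[1]]);
    let document_id = u32::from_be_bytes([bytes[2], bytes[3], bytes[4], bytes[5]]);
    Some((field_id, document_id))
}
```

Big-endian encoding matters here because a key-value store that sorts keys as raw bytes will then iterate entries in ascending (field id, document id) order.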

irevoire

looks good to me thank you 👍

OR, AND, NOT, TO must now be followed by spaces

The idea is to directly create a sorted and merged list of bitmaps in the form of a BTreeMap<FieldId, RoaringBitmap>, instead of creating a grenad::Reader where the keys are field ids and the values are docids. We then send that BTreeMap to the code that handles TypedChunks, which inserts its content into the database.
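A minimal sketch of that accumulation step, under stated assumptions: `collect_exists_docids` is a hypothetical name, a `BTreeSet<u32>` stands in for `RoaringBitmap`, and the input is modelled as a plain iterator of (field id, document id) pairs rather than milli's actual extractor input.

```rust
use std::collections::{BTreeMap, BTreeSet};

type FieldId = u16;
type DocumentId = u32;
// Stand-in for RoaringBitmap, to keep the sketch free of external crates.
type DocIds = BTreeSet<DocumentId>;

// Builds the sorted, merged map in one pass: the BTreeMap keeps field ids
// ordered, and inserting into the per-field set merges duplicates as it goes.
fn collect_exists_docids(
    pairs: impl IntoIterator<Item = (FieldId, DocumentId)>,
) -> BTreeMap<FieldId, DocIds> {
    let mut merged: BTreeMap<FieldId, DocIds> = BTreeMap::new();
    for (field_id, document_id) in pairs {
        merged
            .entry(field_id)
            .or_insert_with(DocIds::new)
            .insert(document_id);
    }
    merged
}
```

Because the map holds one entry per field rather than one per (field, document) pair, its size grows with the number of fields, and the result is already sorted and merged when it is handed over for insertion.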

ManyTheFish

LGTM

Co-authored-by: Many the fish <many@meilisearch.com>

561: Enriched documents batch reader r=curquiza a=Kerollmops

~This PR is based on #555 and must be rebased on main after it has been merged to ease the review.~
This PR contains the work in #555 and can be merged on main as soon as reviewed and approved.

- [x] Create an `EnrichedDocumentsBatchReader` that contains the external documents id.
- [x] Extract the primary key name and make it accessible in the `EnrichedDocumentsBatchReader`.
- [x] Use the external id from the `EnrichedDocumentsBatchReader` in the `Transform::read_documents`.
- [x] Remove the `update_primary_key` from the _transform.rs_ file.
- [x] Really generate the auto-generated documents ids.
- [x] Insert the (auto-generated) document ids in the document while processing it in `Transform::read_documents`.

Co-authored-by: Kerollmops <clement@meilisearch.com>

595: Update version for next release (v0.32.0) r=ManyTheFish a=curquiza

In order to release on `main` (for v0.29.0, not v0.28.1)

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>

loiclec · 2022-07-21T13:01:30Z

@ManyTheFish thanks for your review :)

A big PR was merged into main since then, so I had to bring this PR up to date with the new main. Rebasing was quite difficult, so instead I merged the previous filter/field-exists and main, resolved the conflicts, and pushed the result again under the same branch. I hope it's not too messy :(

ManyTheFish

Ouch! Let's merge like this.
But in cases of big merge conflicts like this one, I prefer to squash all my commits before rebasing. Yes, I lose my history, but I find it way clearer!

ManyTheFish

Hello @loiclec, sorry for the delay!

You can merge it if everything is ok on your side.

loiclec · 2022-08-04T09:45:50Z

Thanks everyone!
bors merge

bors · 2022-08-04T10:04:56Z

Build succeeded:

596: Filter operators: NOT + IN[..] r=irevoire a=loiclec

# Pull Request

## What does this PR do?

Implements the changes described in meilisearch/meilisearch#2580

It is based on top of #556

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>

loiclec requested a review from irevoire June 14, 2022 15:22

irevoire added the "DB breaking" (The related changes break the DB) and "API breaking" (The related changes break the milli API) labels Jun 14, 2022

irevoire marked this pull request as ready for review June 15, 2022 12:40

irevoire suggested changes Jun 15, 2022

ManyTheFish suggested changes Jun 15, 2022

irevoire previously approved these changes Jun 16, 2022

loiclec requested a review from ManyTheFish June 16, 2022 09:43

Kerollmops previously requested changes Jun 28, 2022

loiclec dismissed irevoire’s stale review via b32bbfc July 4, 2022 07:31

loiclec force-pushed the filter/field-exist branch 2 times, most recently from b32bbfc to b327fde Compare July 4, 2022 07:32

Kerollmops added 18 commits July 12, 2022 14:52

048e174 Do not allocate when parsing CSV headers
eb63af1 Update grenad to 0.4.2
419ce39 Rework the DocumentsBatchBuilder/Reader to use grenad
e8297ad Fix the tests for the new DocumentsBatchBuilder/Reader
6d0498d Fix the fuzz tests
a4ceef9 Fix the cli for the new DocumentsBatchBuilder/Reader structs
f29114f Fix http-ui to fit with the new DocumentsBatchBuilder/Reader structs
a97d4d6 Fix the benchmarks
bdc4263 Introduce the validate_documents_batch function
cefffde Improve the .gitignore of the fuzz crate
0146175 Introduce the validate_documents_batch function
fcfc4ca Move the Object type in the lib.rs file and use it everywhere
399eec5 Fix the indexation tests
2ceeb51 Support the auto-generated ids when validating documents
19eb3b4 Make sur that we do not accept floats as documents ids
8ebf5ee Make the nested primary key work
dc3f092 Do not leak an internal grenad Error
ea85220 Fix the format used for a geo deleting benchmark

Loïc Lecrenier added 6 commits July 19, 2022 10:07

c17d616 Refactor index_documents_check_exists_database tests
ea0642c Make filter parser more strict regarding spacing around operators
        OR, AND, NOT, TO must now be followed by spaces
80b962b Run cargo fmt
4f0bd31 Remove custom implementation of BytesEncode/Decode for the FieldId
1eb1e73 Add integration tests for the EXISTS filter

loiclec force-pushed the filter/field-exist branch from c8fca39 to aed8c69 Compare July 19, 2022 08:08

ManyTheFish suggested changes Jul 19, 2022

Loïc Lecrenier added 2 commits July 19, 2022 13:54

d0eee5f Fix compiler error
1506683 Avoid using too much memory when indexing facet-exists-docids

loiclec requested a review from ManyTheFish July 20, 2022 07:07

ManyTheFish previously approved these changes Jul 20, 2022

loiclec and others added 5 commits July 20, 2022 16:20

41a0ce0 Add a code comment, as suggested in PR review
        Co-authored-by: Many the fish <many@meilisearch.com>
d5e9b73 Update version for next release (v0.32.0)
0700370 Merge branch 'filter/field-exist'

loiclec dismissed ManyTheFish’s stale review via 0700370 July 21, 2022 12:56

loiclec mentioned this pull request Jul 21, 2022

Filter operators: NOT + IN[..] #596

Merged

ManyTheFish previously approved these changes Jul 21, 2022

1fe224f Update filter-parser/fuzz/.gitignore
        Co-authored-by: Many the fish <many@meilisearch.com>

loiclec dismissed ManyTheFish’s stale review via 1fe224f July 21, 2022 14:12

ManyTheFish approved these changes Aug 4, 2022

bors bot merged commit 21284cf into main Aug 4, 2022

bors bot deleted the filter/field-exist branch August 4, 2022 10:04
