Skip to content
This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Enriched documents batch reader #561

Merged
merged 35 commits into from
Jul 21, 2022
Merged

Enriched documents batch reader #561

merged 35 commits into from
Jul 21, 2022

Conversation

Kerollmops
Copy link
Member

@Kerollmops Kerollmops commented Jun 20, 2022

This PR is based on #555 and must be rebased on main after it has been merged to ease the review.
This PR contains the work in #555 and can be merged on main as soon as reviewed and approved.

  • Create an EnrichedDocumentsBatchReader that contains the external documents id.
  • Extract the primary key name and make it accessible in the EnrichedDocumentsBatchReader.
  • Use the external id from the EnrichedDocumentsBatchReader in the Transform::read_documents.
  • Remove the update_primary_key from the transform.rs file.
  • Really generate the auto-generated documents ids.
  • Insert the (auto-generated) document ids in the document while processing it in Transform::read_documents.

@Kerollmops Kerollmops force-pushed the enriched-documents-batch-reader branch 2 times, most recently from b02c7e8 to cb490c3 Compare June 21, 2022 10:15
@Kerollmops Kerollmops added the no breaking The related changes are not breaking (DB nor API) label Jun 21, 2022
@Kerollmops
Copy link
Member Author

Kerollmops commented Jun 21, 2022

I just ran a benchmark to check that we improve the time we take to index documents compared to what we lost in #555.

https://github.com/meilisearch/milli/actions/runs/2540950078

@Kerollmops Kerollmops force-pushed the enriched-documents-batch-reader branch from fc1c80d to b6ab40b Compare June 22, 2022 08:29
@irevoire irevoire added the performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption label Jul 5, 2022
@loiclec loiclec self-requested a review July 11, 2022 10:20
Copy link
Contributor

@loiclec loiclec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a fairly big PR for me, so I did my best, but overall I think it looks good :) There are mostly two potential issues:

  • should we verify that the floats inside the geo search fields are valid longitude/latitudes, for example by checking that they are not NaN, inf, etc.?
  • I think there might be a bug in the code that finds the nested primary key

Apart from that, I had two minor comments:

  1. we'll need to reintroduce the functionality provided by serde_impl.rs, as I don't think it will be acceptable, performance-wise, to deserialise the whole array of added documents at once
  2. the “guessed” primary ID uses the whole DocumentsBatchIndex, but the Meilisearch documentation says that the guess should only use the first document. This is not a change in behaviour introduced by this PR though, so it's not blocking as far as I am concerned

milli/src/documents/serde_impl.rs Show resolved Hide resolved
milli/src/update/index_documents/enrich.rs Show resolved Hide resolved
milli/src/update/index_documents/enrich.rs Outdated Show resolved Hide resolved
milli/src/update/index_documents/enrich.rs Show resolved Hide resolved
@Kerollmops Kerollmops force-pushed the enriched-documents-batch-reader branch from e9c406e to 805c513 Compare July 11, 2022 16:38
@loiclec
Copy link
Contributor

loiclec commented Jul 12, 2022

Thank you for the changes! :)

Regarding the nested primary keys again. Could you check that the function works correctly when we have the document:

{ "a" : { "b" : { "c" :  1 }}}

and the primary keys a.b or a.b.c.d?
It is my understanding that both would select the value Number(1).

@Kerollmops
Copy link
Member Author

Thank you very much for this bug-finding again @loiclec, I don't know how you find them but that's very impressive 👀 💪

@Kerollmops Kerollmops force-pushed the enriched-documents-batch-reader branch from 903a3e0 to 448114c Compare July 12, 2022 13:22
@Kerollmops Kerollmops marked this pull request as ready for review July 12, 2022 13:22
@Kerollmops Kerollmops requested a review from loiclec July 12, 2022 13:23
@Kerollmops
Copy link
Member Author

I just rebased this PR on top of the main branch but I would like @loiclec's review as I think you did change the transform file, but please just review this file if it's too much 😃 🙏

@Kerollmops Kerollmops requested a review from irevoire July 12, 2022 13:30
loiclec
loiclec previously approved these changes Jul 13, 2022
Copy link
Contributor

@loiclec loiclec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I can't spot anything that's wrong, although I mostly looked at the transform file and (relatively) quickly read through the changes since the last review.

Copy link
Member

@irevoire irevoire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it's a super cool PR!
I think we can simplify the transform even more with your enriched batch reader 😁

milli/src/update/index_documents/enrich.rs Show resolved Hide resolved
@@ -205,8 +191,7 @@ impl<'a, 'i> Transform<'a, 'i> {
// it, transform it into a string and validate it, and then update it in the
// document. If none is found, and we were told to generate missing document ids, then
// we create the missing field, and update the new document.
let mut uuid_buffer = [0; uuid::fmt::Hyphenated::LENGTH];
let external_id = if primary_key_id_nested {
if primary_key_id_nested {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can get rid of this part too and delete the flatten_from_field_mapping function!
We'll flatten the document later with the function that works with the field id map 😁

Also, in case we need to keep this if, the comment above is outdated

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum... Thank you indeed, it maybe emerged from the rebase I did on main.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed this section and the code that supported it :)

irevoire added a commit to meilisearch/specifications that referenced this pull request Jul 13, 2022
@Kerollmops
Copy link
Member Author

I think we can simplify the transform even more with your enriched batch reader 😁

Thank you for your review 🙏, would it be possible for you to do that next week, as I am on vacation, please?

@ManyTheFish ManyTheFish self-requested a review July 18, 2022 09:31
Otherwise it is not possible to iterate over all documents while
using the fields index at the same time.
@loiclec loiclec force-pushed the enriched-documents-batch-reader branch from 3201801 to fc9f3f3 Compare July 18, 2022 14:08
Co-authored-by: Many the fish <many@meilisearch.com>
@loiclec
Copy link
Contributor

loiclec commented Jul 20, 2022

@ManyTheFish I committed your suggested change to add a comment. Was there anything else to do?

@curquiza curquiza dismissed irevoire’s stale review July 21, 2022 07:07

changed applied or issue will be open

@curquiza
Copy link
Member

Checked with @loiclec, we can finally merge!!!

bors merge

@bors
Copy link
Contributor

bors bot commented Jul 21, 2022

Build succeeded:

@bors bors bot merged commit 941af58 into main Jul 21, 2022
@bors bors bot deleted the enriched-documents-batch-reader branch July 21, 2022 07:52
gmourier pushed a commit to meilisearch/specifications that referenced this pull request Oct 3, 2022
gmourier pushed a commit to meilisearch/specifications that referenced this pull request Oct 3, 2022
gmourier added a commit to meilisearch/specifications that referenced this pull request Oct 3, 2022
* Bump openapi spec version to v0.29

* Update 0001-script-based-tokenizer.md (#159)

Change tokenizer specs to better fit Charabia implementation

* Update the geosearch error (#161)

Implemented in meilisearch/milli#561

* Auto-batching - Enable feature by default and remove unwanted options (#162)

* Update specs according to new auto-batching behavior

* update batchUid to make it internal and hidden from a task resource representation

* Remove the batchUid mentions from the task API

* Update open-api.yaml

Co-authored-by: Guillaume Mourier <guillaume@meilisearch.com>

* update future possibilities

Co-authored-by: Guillaume Mourier <guillaume@meilisearch.com>

* Search API — Filters - Introduce IN and EXISTS and describe filter capabilities in more precisely (#163)

* Write a specification for the new (and old) search filters

EXISTS
IN
NOT (new behaviour)
!= (new behaviour)

* Apply suggestions from code review

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Guillaume Mourier <guillaume@meilisearch.com>

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Guillaume Mourier <guillaume@meilisearch.com>

* Add missing settings object in the task details field of a settingsUpdate task type (#164)

* Remove `name` from indexes resource definition (#165)

* Misc — Soft deleted documents (Performance optimization) (#168)

* create a spec for the soft deleted documents

* Rename spec file, minor adjustements

* Replace You and We by A user and Meilisearch

Co-authored-by: Guillaume Mourier <guillaume@meilisearch.com>

* Add Stats Seen event (#169)

* Add examples component for each summarized task type (#170)

* Version API — Catch up (#171)

* Add version-api.md

* Add PR number as a spec file prefix

* Add health-api.md (#172)

* Search API — Add `matchingStrategy` parameter with `last` / `all` strategies (#173)

* Introduce a proposal to boot the specification

* Update telemetry

* Replace wordMatchingStrategy by matchingStrategy

* fix missing backtick md

Co-authored-by: Many the fish <legendre.maxime.isn@gmail.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: cvermand <33010418+bidoubiwa@users.noreply.github.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
no breaking The related changes are not breaking (DB nor API) performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants