Skip to content
This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Optimize document transform #402

Merged
merged 8 commits into from
Oct 26, 2021
Merged

Optimize document transform #402

merged 8 commits into from
Oct 26, 2021

Conversation

MarinPostma
Copy link
Contributor

@MarinPostma MarinPostma commented Oct 24, 2021

This pr optimizes the transform of documents additions in the obkv format. Instead on accepting any serializable objects, we instead treat json and CSV specifically:

  • For json, we build a serde Visitor, that transform the json straight into obkv without intermediate representation.
  • For csv, we directly write the lines in the obkv, applying other optimization as well.

@MarinPostma MarinPostma marked this pull request as ready for review October 25, 2021 08:28
@MarinPostma MarinPostma force-pushed the optimize-document-transform branch from e42e845 to 3fcccc3 Compare October 25, 2021 08:29
@curquiza curquiza changed the title optimize document transform Optimize document transform Oct 25, 2021
@curquiza curquiza added the DB breaking The related changes break the DB label Oct 25, 2021
milli/src/documents/mod.rs Outdated Show resolved Hide resolved
milli/src/documents/builder.rs Outdated Show resolved Hide resolved
milli/src/documents/builder.rs Outdated Show resolved Hide resolved
milli/src/documents/builder.rs Outdated Show resolved Hide resolved
milli/src/documents/builder.rs Outdated Show resolved Hide resolved
milli/src/documents/builder.rs Outdated Show resolved Hide resolved
milli/src/documents/builder.rs Outdated Show resolved Hide resolved
milli/src/documents/builder.rs Outdated Show resolved Hide resolved
milli/src/documents/builder.rs Outdated Show resolved Hide resolved
milli/src/documents/mod.rs Outdated Show resolved Hide resolved
milli/src/documents/builder.rs Outdated Show resolved Hide resolved
@MarinPostma
Copy link
Contributor Author

we don't know the document id at this point, and returning the whole document is not very helpful to locate it in a large csv document, unlike the line number...

@MarinPostma MarinPostma force-pushed the optimize-document-transform branch from a7ce5ce to 135fa16 Compare October 25, 2021 14:28
@MarinPostma MarinPostma force-pushed the optimize-document-transform branch from 135fa16 to b16c2ad Compare October 25, 2021 15:41
Kerollmops
Kerollmops previously approved these changes Oct 25, 2021
Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. we just need to make it pass the CI now 🎪

@MarinPostma
Copy link
Contributor Author

I always forget that you need to run on all targets here... 🤦‍♂️

@MarinPostma MarinPostma force-pushed the optimize-document-transform branch from 055ba13 to baddd80 Compare October 25, 2021 16:31
@MarinPostma
Copy link
Contributor Author

bors merge

@bors
Copy link
Contributor

bors bot commented Oct 26, 2021

@bors bors bot merged commit d7943fe into main Oct 26, 2021
@bors bors bot deleted the optimize-document-transform branch October 26, 2021 10:04
bors bot added a commit to meilisearch/meilisearch that referenced this pull request Oct 26, 2021
1847: Optimize document transform r=MarinPostma a=MarinPostma

integrate the optimization from meilisearch/milli#402.

optimize payload read, by reading it to RAM first instead of streaming it. This means that the payload must fit into RAM, which should not be a problem.

Add BufWriter to the obkv writer to improve write speed.

I have measured a gain of 40-45% in speed after these optimizations.


Co-authored-by: marin postma <postma.marin@protonmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
DB breaking The related changes break the DB
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants