fix: bson builder should handle schema flexibly #2334

tychoish · 2023-12-31T14:36:35Z

The RecordStructBuilder for bson previously assumed (as is the case
with our MongoDB implementation) that it would never see bson
documents with fields that don't appear in the schema/projection, and
returned an error when it did.

For MongoDB, this means that the schema inference (which is done
somewhat probabilistically) controls the projection, so you will
never see documents that have fields that aren't in the schema.

For pure-BSON document streams, however, it would be very easy to see
a document that has fields that aren't in the schema. The likelihood
increases with #2333, but it could easily happen otherwise.
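To illustrate the intended behavior, here is a minimal sketch of a lenient builder. It is not the actual RecordStructBuilder: `LenientBuilder`, `Value`, and the slice-of-pairs document shape are hypothetical stand-ins that use only the standard library. Fields outside the schema are skipped, and schema fields missing from a document become nulls.

```rust
use std::collections::HashMap;

/// Tiny stand-in for a BSON value (assumption: the real type has many more variants).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Text(String),
}

/// Hypothetical builder holding one column of optional values per schema field.
pub struct LenientBuilder {
    /// Maps a schema field name to its column index.
    index: HashMap<String, usize>,
    /// One column per schema field, in schema order.
    columns: Vec<Vec<Option<Value>>>,
}

impl LenientBuilder {
    pub fn new(schema: &[&str]) -> Self {
        let index = schema
            .iter()
            .enumerate()
            .map(|(i, name)| (name.to_string(), i))
            .collect();
        Self {
            index,
            columns: vec![Vec::new(); schema.len()],
        }
    }

    /// Appends one document as a new row.
    /// Duplicate field names are not considered in this sketch.
    pub fn append(&mut self, doc: &[(&str, Value)]) {
        // Start the row as all nulls so absent fields stay null.
        for col in &mut self.columns {
            col.push(None);
        }
        for (name, value) in doc {
            // Fields that are not in the schema are skipped instead of raising an error.
            if let Some(&i) = self.index.get(*name) {
                let slot = self.columns[i].last_mut().expect("row was just pushed");
                *slot = Some(value.clone());
            }
        }
    }

    /// Returns the column for a schema field, or None for unknown names.
    pub fn column(&self, name: &str) -> Option<&[Option<Value>]> {
        self.index.get(name).map(|&i| self.columns[i].as_slice())
    }
}
```

Appending a document with an `extra` field to a builder created with `LenientBuilder::new(&["a", "b"])` leaves columns `a` and `b` intact and silently ignores `extra`.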

scsmithr · 2024-01-01T20:58:04Z

> While it does mean that the schema inference (which for MongoDB is
> done somewhat probabilistically) controls the projection, it does mean
> that you will never see documents that have fields that aren't in the
> schema.

Is this saying that the current implementation in main would skip bson documents that have unexpected schemas? And the fix here is we can now sanely handle those documents? What happens to the extra columns/fields? Are those just dropped?

Also would we be able to throw together a quick test for this?

crates/datasources/src/bson/builder.rs

Co-authored-by: Sean Smith <scsmithr@gmail.com>

tychoish · 2024-01-02T02:05:47Z

> Is this saying that the current implementation in main would skip bson documents that have unexpected schemas? And the fix here is we can now sanely handle those documents? What happens to the extra columns/fields? Are those just dropped?

I believe that the current MongoDB implementation doesn't have this problem because we project out the fields that we don't expect to see, so we're never in a case where a document has a field we don't expect. (Caveat: nested documents/arrays are an edge case that I'd want to think about.)

The failure mode is that the query fails: while building a result set, we see a field we don't expect and return an error. This could happen with bson files.

For MongoDB, if the schema inference does not pick up a field, the query removes that field from the result set, so we are already dropping data silently. That's not great, but the solution is to not infer the schema (addressable with #2333), so this just brings the implementations on par with each other.
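As a rough sketch of the difference, assuming documents modeled as name/value pairs with `i64` values: `strict_check` and `project` below are hypothetical helpers, not functions from the codebase. The strict check stands in for the old builder behavior, and `project` stands in for the MongoDB-side projection.

```rust
/// Models the old strict behavior: any field outside the schema is an error.
pub fn strict_check(schema: &[&str], doc: &[(&str, i64)]) -> Result<(), String> {
    for (name, _) in doc {
        if !schema.contains(name) {
            return Err(format!("unexpected field: {name}"));
        }
    }
    Ok(())
}

/// Models the projection: keep only the fields named in the schema.
pub fn project<'a>(schema: &[&str], doc: &[(&'a str, i64)]) -> Vec<(&'a str, i64)> {
    doc.iter()
        .filter(|(name, _)| schema.contains(name))
        .copied()
        .collect()
}
```

With a schema of `["_id", "name"]`, a raw document that also carries a `surprise` field fails `strict_check`, while the projected document passes. Raw bson files never go through a projection step like this, which is why only they could hit the error.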

> Also would we be able to throw together a quick test for this?

So the bug (which, to be fair, isn't released) requires the schema inference to be "wrong" (or incomplete) relative to our expectations. Building a test case that has that would be pretty fragile. All of the new code is hit by the existing test, so I'm not worried about this being worse than baseline.

universalmind303 · 2024-01-02T14:40:46Z

Can we add some tests for this?

tests/tests/test_bson.py

tests/tests/dupes.bson

tests/tests/scripts/make_bad_bson.go

tychoish · 2024-01-02T21:43:03Z

In any case, I wrote the "duplicate fields always pick the first one" test as a Rust unit test.
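For readers who want a feel for that behavior, here is a small sketch under simplified assumptions. It is not the test from this PR: `first_wins` is a hypothetical helper that resolves one document (name/value pairs with `i64` values) against a schema and keeps the first value seen for each field.

```rust
/// Hypothetical helper: build one row in schema order, keeping the first
/// value seen for each field and ignoring later duplicates and unknown fields.
pub fn first_wins(schema: &[&str], doc: &[(&str, i64)]) -> Vec<Option<i64>> {
    let mut row = vec![None; schema.len()];
    for (name, value) in doc {
        if let Some(i) = schema.iter().position(|s| s == name) {
            // Only fill an empty slot, so a repeated field never overwrites.
            if row[i].is_none() {
                row[i] = Some(*value);
            }
        }
    }
    row
}

#[test]
fn duplicate_fields_keep_first_value() {
    // "a" appears twice; the second occurrence must be ignored.
    let doc = [("a", 1), ("a", 2), ("b", 3)];
    assert_eq!(first_wins(&["a", "b"], &doc), vec![Some(1), Some(3)]);
}
```

Checking for an empty slot before writing is what makes the first occurrence win; plain assignment would make the last one win instead.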

fix: bson builder should handle schema flexibly

4846428

tychoish requested a review from scsmithr January 1, 2024 18:12

scsmithr reviewed Jan 1, 2024

crates/datasources/src/bson/builder.rs (outdated)

Update crates/datasources/src/bson/builder.rs

931748f

Co-authored-by: Sean Smith <scsmithr@gmail.com>

tychoish added 2 commits January 2, 2024 08:57

Merge remote-tracking branch 'origin/main' into tycho/bson-builder

6493eab

fixup

5118e65

make bson test weird

122d225

universalmind303 reviewed Jan 2, 2024

tests/tests/test_bson.py (outdated)

tychoish added 2 commits January 2, 2024 10:59

better handling

446f54c

test example

64b02b1

tychoish requested review from universalmind303 and scsmithr January 2, 2024 17:31

scsmithr reviewed Jan 2, 2024

tests/tests/dupes.bson (outdated)

tests/tests/scripts/make_bad_bson.go (outdated)

tychoish added 2 commits January 2, 2024 16:35

add it as a unit test

47b0542

write it as a unittest

f07f2b1

tychoish requested a review from scsmithr January 2, 2024 21:43

scsmithr approved these changes Jan 2, 2024

Merge branch 'main' into tycho/bson-builder

d09d1c9

tychoish enabled auto-merge (squash) January 2, 2024 23:29

tychoish added 2 commits January 2, 2024 18:29

test buffer semantics

469b640

bson tests covered by unit tests

7b6177d

tychoish merged commit f1b0523 into main Jan 2, 2024
13 checks passed

tychoish deleted the tycho/bson-builder branch January 2, 2024 23:55
