Serde format for in situ JSON deserialization #7
Comments
Ah, I thought we were going to shift all the bytes from the right. I believe it would be something like this? (Playground)

```rust
#[derive(Debug, PartialEq)]
struct S<'a> {
    msg: &'a str,
}

fn main() {
    let mut j = "{\"msg\":\"abc\\nxyz\"}".to_owned();
    let actual = S {
        msg: unsafe {
            let p = j.as_mut_ptr().add(12);
            *p = b'\n';
            std::ptr::copy(p, p.offset(-1), 4);
            j.get_unchecked(8..15)
        },
    };
    let expected = S { msg: "abc\nxyz" };
    assert_eq!(actual, expected);
    assert_eq!(actual.msg.as_ptr(), j[8..].as_ptr());
    assert_eq!(j, "{\"msg\":\"abc\nxyzz\"}");
}
```
No, we shouldn't move around more bytes than necessary. That's more work. It doesn't save memory. The whole input buffer needs to outlive the deserialized object anyway.
I think the right amount of copying in the example would be the equivalent of:

```rust
let p = j.as_mut_ptr();
*p.offset(11) = b'\n';
ptr::copy(p.offset(13), p.offset(12), 3);
j.get_unchecked(8..15)
```
@dtolnay I don't think that matters; for me, reducing the copying by one byte is a minor detail. Either way, how should one get started on this?
https://serde.rs/impl-deserializer.html would be the best place to start: modify that implementation to take `&mut str` and process escaped strings in place. Once the basics work, you could pull in some of the number parsing from serde_json that isn't in the example code and then put together some benchmarks.
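The core of "process escaped strings in place" is a two-cursor rewrite of the buffer. Here is a minimal sketch of that idea, handling only a couple of escapes; `unescape_in_place` is a hypothetical helper written for illustration, not serde_json API or the prototype's code:

```rust
// A minimal sketch of in-place unescaping with two cursors. It assumes the
// input is the body of a valid JSON string, so a backslash is always
// followed by an escape code.
fn unescape_in_place(buf: &mut [u8]) -> usize {
    let (mut read, mut write) = (0, 0);
    while read < buf.len() {
        if buf[read] == b'\\' {
            read += 1; // step onto the escape code
            buf[write] = match buf[read] {
                b'n' => b'\n',
                b't' => b'\t',
                other => other, // `\\` and `\"` map to themselves; `\uXXXX` elided
            };
        } else {
            buf[write] = buf[read];
        }
        read += 1;
        write += 1;
    }
    write // length of the unescaped prefix; the tail still holds stale bytes
}

fn main() {
    let mut buf = *b"abc\\nxyz";
    let len = unescape_in_place(&mut buf);
    assert_eq!(&buf[..len], b"abc\nxyz");
}
```

Because `write` never overtakes `read`, every copy moves bytes leftward, which is what makes rewriting within a single buffer safe.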
@dtolnay I tried building on top of that, but I ran into multiple mutable borrows. Any idea how to solve the lifetime problem?
Thanks to alice in https://users.rust-lang.org/t/help-about-multiple-mutable-borrows/33494/8?u=pickfire, I found a way around the multiple mutable borrows. @dtolnay you can take a look at the prototype now at pickfire/json@6bd0238; it only implements the basics so far.
I've only skimmed the code, but it looks like a good start. This looks problematic though:

```rust
let mut s = "\"\\n\u{2603}\"".to_owned();
assert_eq!("\n\u{2603}", from_str::<String>(&s).unwrap());

let undefined_behavior = &[10, 10, 226, 152];
assert_eq!(undefined_behavior, from_mut_str::<&str>(&mut s).unwrap().as_bytes());
assert!(std::str::from_utf8(undefined_behavior).is_err());
```

For a valid JSON input containing no escape sequences other than `\n`, deserialization must never leave the buffer holding invalid UTF-8 like this.
Is that because I have yet to add the other escapes? pickfire/json@6bd0238#diff-f400f58d5e78694b93db5d4fe5f2a246R681

```rust
#[test]
fn test_from_mut_str_newline() {
    #[derive(Debug, PartialEq, Deserialize)]
    struct S<'a> {
        msg: &'a str,
    }

    let mut j = "{\"msg\":\"\\n\"}".to_owned();
    let actual = from_mut_str::<S>(&mut j).unwrap();
    let expected = S { msg: "\n" };
    assert_eq!(actual, expected);
    assert_eq!(actual.msg.as_ptr(), j[8..].as_ptr());
    // not a commitment in the API, but in practice this is what it will be
    assert_eq!(j, "{\"msg\":\"\n\n\"}");
}
```

A single `\n` on its own passes this test.
No, I don't think that is the reason. The input string in my repro is `"\n☃"`, which only contains one escape.
Yes, even just a single escape causes it?
To clarify, my repro contains a single `\n` escape followed by a multi-byte character. Heads up that I am going to step away from this issue for a while because I am swamped with other things at the moment, but hopefully Discord will be able to provide help if you link them to your repo.
@dtolnay I fixed the snowman bug; I forgot to ignore the memory before the first escape.
@pickfire I don't think the snowman bug is fixed in revision 2e4b6b08650b41582430ca004b190b57e0a69c2e. The issue at hand is that your code modifies the byte array of the `String` in a way that leaves it as invalid UTF-8.
The basic functionality of your code works. It can correctly deserialize a JSON string:

```rust
let mut s = r#""\n☃""#.to_owned();
assert_eq!("\n☃", from_str::<String>(&s).unwrap());
```

Let's see how the string looks on a byte level:
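(The original byte dump did not survive here; the one below is regenerated with the snippet shown further down, with annotations added.)

```text
0x22  34 00100010   '"'
0x5c  92 01011100   '\'
0x6e 110 01101110   'n'
0xe2 226 11100010   first byte of '☃'
0x98 152 10011000   continuation byte
0x83 131 10000011   continuation byte
0x22  34 00100010   '"'
```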
Now, if we use the new method, the returned value is still correct:

```rust
assert_eq!("\n☃", from_mut_str::<String>(&mut s).unwrap());
```

If we look at the bytes after the in situ deserialization step, they look like this:
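(Again regenerated rather than original:)

```text
0x22  34 00100010   '"'
0x0a  10 00001010   '\n'
0xe2 226 11100010   first byte of '☃'
0x98 152 10011000   continuation byte
0x83 131 10000011   continuation byte
0x83 131 10000011   leftover duplicate from the shift
0x22  34 00100010   '"'
```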
They start out valid and as expected, with the literal newline character and the remaining bytes shifted forwards. But after the snowman symbol (bytes 0xe2, 0x98, 0x83) we see another 0x83. This is not a valid start of a UTF-8 encoded codepoint: it starts with the two bits `10`, which mark a continuation byte, not the first byte of a sequence.

Here are two ways to show how this is problematic. Normally, taking the bytes of a string and creating a string slice from them should always work; after all, the bytes must be valid UTF-8. If we do this here, we receive an error:

```rust
std::str::from_utf8(s.as_bytes())
// Results in:
// Err(Utf8Error { valid_up_to: 5, error_len: Some(1) })
```

Another way is iterating over the chars in the string:

```rust
for c in s.chars() {
    println!("{:?}", c);
}
// Output:
// '\"'
// '\n'
// '☃'
// 'â'
```

The last character could be anything, or even some runtime error, since this is undefined behavior, and we do not see the last `"` at all.

Now how to fix this? The leftover bytes have to be overwritten with something that keeps the buffer valid UTF-8, for example zero:

```rust
std::str::from_utf8(&[0x22, 0x0a, 0xe2, 0x98, 0x83, 0, 0x22]) // <- the second 0x83 is replaced with 0
// Ok("\"\n☃\u{0}\"")
```

One difficulty which might arise is that the UTF-8 invariant also needs to be upheld if the deserialization code panics somewhere, like in a user-written deserialization function. As a guideline, you should never be forced to compare the raw bytes of the string in one of your tests, like the snowman repro above has to do with `as_bytes`. I hope this explanation helps you to understand the issue at hand better. I am looking forward to seeing an in situ parser :)
@jonasbb Wow, thanks a lot for reviewing this. Nice.

Now I know. I did not notice that the UTF-8 invariant must hold at all times; I thought keeping the remaining bytes was still fine based on the initial design.

How do you generate that? It is troublesome for me to compare byte by byte by remembering the numbers.

Yup, I noticed the last `"` went missing too.

The reason I did that is that I did not know how to represent it in valid UTF-8 (and it took me quite a while to figure out how to represent it in bytes); now I need to figure out how to zero the slice.

I was just thinking about that when I was halfway through reading, but I had not thought about panics, since I did not know the invariant has to hold even then.
The original design in the first post only shows the ASCII subset of Unicode. Here, each codepoint is encoded in a single byte, and any shifting modifications will still leave it as valid ASCII. The problem occurs with characters outside of the ASCII subset, which require 2-4 bytes in UTF-8, since chopping off the beginning or end of the encoded codepoint leads to invalid UTF-8.

I generated the numbers with this snippet (the annotations were added by hand):

```rust
for b in s.as_bytes() {
    println!("0x{:0>2x} {:>3} {:0>8b}", b, b, b);
}
```
Patch updated with zero-filling the extra bytes: pickfire/json@e4f7b20. Feel free to check on it. After that, I will get on with the others.
@jonasbb pickfire/json@21379e3 adds the rest, interested to check it out? I think only the tests, docs, and maybe some code cleanup are left.
I compared the performance of the two in a benchmark recently:

```rust
#[bench]
fn bench_deserialize_from_str(b: &mut Bencher) {
    let j = input_json();
    b.bytes = j.len() as u64;
    b.iter(|| {
        serde_json::from_str::<Twitter>(&j).unwrap();
    });
}

#[bench]
fn bench_deserialize_from_mut_str(b: &mut Bencher) {
    let j = input_json();
    b.bytes = j.len() as u64;
    b.iter(|| {
        // the clone is included in the measured time
        let mut j = j.clone();
        serde_json::from_mut_str::<Twitter>(&mut j).unwrap();
    });
}
```

Edit: Running locally for the first time got me different results. Weird. (I took the best case for each.)
@dtolnay The current implementation is unsound: it has a footgun such that it will have undefined behavior if people try to deserialize the same input twice (there is probably no reason to do that, but it is troublesome once they do). Should we make the function unsafe for this? Is there any way to take ownership of the string and prevent people from calling the function twice?
Can you share some compilable sample code that demonstrates the undefined behavior?
Just do the same thing twice. The code below is not tested, but it shows the rough idea.

```rust
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct User {
    fingerprint: String,
    location: String,
}

fn main() {
    // `j` is a `String`; `&mut j` coerces to `&mut str`
    let mut j = "
        {
            \"fingerprint\": \"0xF9BA143B95FF6D82\",
            \"location\": \"Menlo Park, CA\"
        }".to_owned();

    let u: User = serde_json::from_mut_str(&mut j).unwrap();
    let u: User = serde_json::from_mut_str(&mut j).unwrap(); // <-- this
    println!("{:#?}", u);
}
```

But you see the difference for the second call.
I don't understand why that would lead to undefined behavior. The second call would probably return an error because the string is no longer valid JSON, but that's fine.
Wait, I meant an error (now I am wondering if that issue only happened before I padded the extra data with 0 to prevent invalid Unicode characters), but yes, it may still be unexpected.
Potentially returning errors wouldn't be a reason to make a function unsafe.
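For the ownership question above, one possible shape is to consume the `String`, which statically rules out a second call on the same buffer. This is a sketch assuming the prototype's `from_mut_str`; `from_owned_str` is a hypothetical name, not an API anyone has committed to:

```rust
use serde::de::DeserializeOwned;

// Hypothetical wrapper: because `j` is taken by value, the caller can never
// hand the same buffer to the deserializer twice. `DeserializeOwned` is
// required since the scratch buffer is dropped when this function returns,
// so the result cannot borrow from it.
fn from_owned_str<T: DeserializeOwned>(mut j: String) -> serde_json::Result<T> {
    serde_json::from_mut_str(&mut j)
}
```

The trade-off is that consuming the buffer gives up exactly the zero-copy `&'a str` borrows that motivate in situ parsing, so it only suits types that own their data.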
@dtolnay Thanks for the review. But how should I start with a separate crate? Do I just implement the same things I did in the serde_json crate?
Originally filed as serde-rs/json#318, but since this will involve quite a bit more unsafe code than currently exists in serde_json, I think it could be better to put it in a separate crate. Much of the implementation can be drawn from serde_json.
The idea is to implement a function like:
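(The original snippet did not survive here; this reconstruction follows how `from_mut_str` is called throughout the thread.)

```rust
// Like serde_json::from_str, except the input buffer may be modified
// in place while deserializing.
fn from_mut_str<'a, T>(j: &'a mut str) -> Result<T>
where
    T: serde::Deserialize<'a>;
```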
backed by a Serde `Deserializer` that knows how to unescape JSON strings in place.

In the example, the incoming bytes in memory would contain a subslice looking like:
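(Diagram reconstructed from the `"abc\nxyz"` example used earlier in the thread.)

```text
a b c \ n x y z
```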
and JSON deserialization shuffles them to lay out the correct output string bytes in place while modifying the minimum amount of memory:
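(Also reconstructed, following the minimal-copy snippet above; the output `&str` covers the first seven bytes, and the final `z` is a leftover byte outside the returned slice.)

```text
a b c \n x y z z
^~~~~~~~~~~~~^    returned &str: "abc\nxyz" (trailing z is leftover)
```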