-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decimal literal number #1752
Decimal literal number #1752
Conversation
With the latest minor change the significance of the literal is preserved. |
@wtlangford @nicowilliams I wonder what this silence means. Does this pull request have a future? |
Just that I've been busy this last week. I just need to scrounge up the time to take a proper look at it. |
@wtlangford @nicowilliams so wdyt here? |
😕 |
Hey, sorry, I've been super busy. I'll have time in the next day or two to give this a good, proper read-through, due to the holiday. |
Hi.
Any news?
… On 20 Nov 2018, at 16:14, William Langford ***@***.***> wrote:
Hey, sorry, I've been super busy. I'll have time in the next day or two to give this a good, proper read-through, due to the holiday.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Yep, I'm reading my way through it. It's large, so I'm making sure to give
it all the attention it needs. Should be done soon!
…On Mon, Nov 26, 2018, 08:36 Leonid S. Usov ***@***.***> wrote:
Hi.
Any news?
> On 20 Nov 2018, at 16:14, William Langford ***@***.***>
wrote:
>
> Hey, sorry, I've been super busy. I'll have time in the next day or two
to give this a good, proper read-through, due to the holiday.
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub, or mute the thread.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1752 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ADQ4VzHRkV2Qpsv-4MKVjfvW8xWZa1DNks5uy-5HgaJpZM4X9oar>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, but I've finally made it through!
I need to go back and read up a bit on the decNumber library and how it works to better understand the allocations we're making and the semantics of it, but I didn't want to block the rest of my comments on that.
I'm going to run some benchmarks of my own, but in the meantime, do you have any benchmarking on how this change impacts performance (speed, mem usage, etc)?
docs/content/3.manual/manual.yml
Outdated
@@ -292,11 +292,37 @@ sections: | |||
program can be a useful way of formatting JSON output from, | |||
say, `curl`. | |||
|
|||
An important point about the identity filter is that it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might argue that we shouldn't be so explicit about this behavior in the documentation (though the changelog/etc is a different story).
I still like to think of it as an internal detail that we're keeping the original decimal representation around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll follow your advises with regards to the documentation, because this is quite a fuzzy topic and we could argue for no practical reason.
I simply wanted to cover the thing which was one of the main motivations behind the whole PR - the fact that today jq is not able to keep the input unchanged if the number literal appears to be of grater significance than double can represent
@@ -0,0 +1,38 @@ | |||
/* ------------------------------------------------------------------ */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can probably omit the documentation pdf and example c files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be glad to remove everything which is not used, but I wasn't sure if we can do that with regards to the open source nature of this part. If we can exclude all other things then let's do this
const char * str = jv_number_as_string(x, buf, JVP_DTOA_FMT_MAX_LEN); | ||
put_str(str, F, S, flags & JV_PRINT_ISATTY); | ||
if (jvp_number_is_nan(x)) { | ||
jv_dump_term(C, jv_null(), flags, indent, F, S); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the behavior change here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have actually fixed this annoying thing that when we fall back to null here we used output it as plain text null and it wouldn't get highlighted as real null
values in the output. Just try it in console, the actual null
and a null
of a NAN are shown differently today.
src/jv_type_invalid.c
Outdated
int jv_invalid_has_msg(jv inv) { | ||
assert(JVP_HAS_KIND(inv, JV_KIND_INVALID)); | ||
int r = JVP_HAS_FLAGS(inv, JVP_FLAGS_INVALID_MSG); | ||
jv_free(inv); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This free changes the semantics of this function. Previously, it would not free inv
, and was only freeing the extra copy on the msg
performed by jv_invalid_get_msg
(see https://github.com/stedolan/jq/pull/1752/files/bf7f1f2b8176597a04bf4a8c00243567d3c7c83d#diff-5c46af0444da42be404b9ff2acf24da2L124)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's a change of implementation to follow the actual semantics of every single place this function was called, which was resulting a memory leak.
Please check that in all places where it was called the parameter has been jv_copy
ied.
This is exactly one of the things I have tried addressing in my API name refactoring. Today there is a big mess because one can't tell by the naming convention whether the function will be consuming the value or not.
At the same time, there is a general tendency to consume parameters in the calls, so instead of fixing all places where this function was called I have decided to change the implementation. And actually optimised it as well - previously it would construct a message just to see that it wasn't nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's a change of implementation to follow the actual semantics of every single place this function was called, which was resulting a memory leak.
Please check that in all places where it was called the parameter has beenjv_copy
ied
Ah, I hadn't realized (yes, more evidence of your point. 😃). This is fine, then.
src/jv_type_private.h
Outdated
#define JVP_FLAGS(j) ((j).kind_flags) | ||
#define JVP_KIND(j) (JVP_FLAGS(j) & KIND_MASK) | ||
|
||
#define JVP_HAS_FLAGS(j, flags) (JVP_FLAGS(j) == flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put parens around flags
and kind
here.
Also, I'd like the name to be a bit clearer. JVP_HAS_FLAGS
would imply to me that I can ask about any particular bits on the flags, but this is actually saying "does it have only and exactly these specific flags?". Can we find a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed about parens.
Naming - i'm open to suggestions
src/jv_type_private.h
Outdated
#define JVP_KIND(j) (JVP_FLAGS(j) & KIND_MASK) | ||
|
||
#define JVP_HAS_FLAGS(j, flags) (JVP_FLAGS(j) == flags) | ||
#define JVP_HAS_KIND(j, kind) (JVP_KIND(j) == kind) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be better called JVP_IS_KIND
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. Or maybe not.
This way it's HAS_FLAGS and HAS_KIND. But i'm not too married to this.
src/jv_type_array.c
Outdated
jv_array_foreach(a, ai, aelem) { | ||
jv_free(aelem); | ||
jv_array_foreach(b, bi, belem) { | ||
if (!jv_equal(jv_array_get(jv_copy(a), ai + bi), jv_copy(belem))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the jv_copy
from belem
here, and you can skip the jv_free(belem)
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
src/jv_type_number.c
Outdated
static struct { | ||
uint64_t refcnt; | ||
struct dtoa_context C; | ||
} shared_dtoa_context; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have concerns about these shared contexts and thread-safety. While the jq vm itself is single-threaded (and the main jq
executable as well), we do expose a C API that other, multi-threaded programs could use. Two programs attempting to use the same contexts here could encounter issues.
I understand the purpose was to reduce allocations/improve performance, but we have to be careful with globals and shared state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Indeed, this shared data will not be usable even with external syncrhonization techniques.
I can change it to thread local storage. The only constraint here would be that the jv vm would need to be started and used and destroyed on the same thread.
Or, we could think of some "vm context" to be used anywhere inside the implementation its needed, but it looks like a much more impacting refactor, it would require passing the context everywhere around. But it would mean that a given vm instance could be used across multiple threads given some external sync to prevent simultaneous execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to avoid the thread local storage so we don't have to pin a vm to its original thread. That seems like a pretty heavy cost on our users.
The vm context probably costs a bit more to solve than we'd like (certainly in implementation complexity).
That said, why do these need to be shared at all? Why not just make a dtoa_context
or a dec_context
on the stack when we need one? These don't seem to be particularly large (though the dtoa_context
could be sneaky if it has to allocate a lot during processing). I'd like to investigate the performance cost of just making these on the stack every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's pretty heavy. The dtoa algo is creating several BigNum buffers on the heap and reusing the context will actually reuse those buffers, while creating and freeing this context will alloc and release all those BigNums.
Currently it's not an issue since the dtoa context is intialized once per parse (but actually multiple times per dump, so dumping many different numbers is not as efficient as it could have been, while dumping one array of many numbers is reusing the same context)
I have solved both issues with this approach. Actually, we could have one dtoa context per thread and expose a "thread_free" method for jv vm to cleanup once thread is not used any more.
Actually I kind of liked my approach of having shared data reused by multiple instances. If jv_object could somehow get the reference to the vm context it lives in, it could be our solution of choice to share the data between instances of the same vm and clean up once all objects in that vm are gone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I have just realised that I can come up with a thread local implementation immune to thread move.
The shared objects will be managed per thread, with refcount; once a new jv_number requires a shared context it received an initialized pointer. Then, every time it needs the contex it should pass the pointer and the system will compare it to the current thread's shared pointer. If it's different then it will reduce the refcount for the other thread's context and increment the refcount for this thread's context, and return this thread's context, so that the jv_number already has a different pointer to work with currently.
this way, the refcount will properly flow between threads and the contexts will be freed once the last object is deallocated or moved from that thread
🎉 let's rock and roll! |
Re. performance. I have linked in my other posts to the benchmarks of this library. Just to clarify, we aren't using any of the math except for comparison, and only in case two literals are being compared. Also, if you have noticed, I have delayed the I haven't run any checks cause I didn't have any infra for that at the time. I remember you mentioned trying things out so I thought that maybe you had some handy setup available, so left it out of my work scope. |
Please see |
Guys, where are we standing with this? Is this pending on me having to run the benchmarks? I'm not sure if I'll have time for this until the end of this week. Thanks. |
I've been running some benchmarks and looking at solutions for the global
state. I'll have more for you soon!
…On Wed, Dec 19, 2018, 12:24 Leonid S. Usov ***@***.***> wrote:
Guys, where are we standing with this?
Is this pending on me having to run the benchmarks? I'm not sure if I'll
have time for this until the end of this week.
However maybe there are other concerns/issues? If we make a clear roadmap
I'll try and commit to this. However if you are not interested, which would
be a bummer, I'd rather not invest additional time
Thanks.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1752 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ADQ4VyX5umYy-ssaq9-CFkAdHSSeS7XMks5u6nZQgaJpZM4X9oar>
.
|
Sorry for the delay! The holidays got the better of me. After further reflection, I feel that thread-locals are the right way to go, so I pushed up a copy of your branch to my fork with an extra commit on it (wtlangford@b2cf0f9) that switches the shared contexts over to using pthread's TLD (and makes pthread no longer optional). There's no major speed penalty to it compared to your branch, and we go back to being thread-safe. (Interestingly, most libc implementations I've looked at use similar mechanisms for their strtod context allocations). Relating to documentation: I don't want to make any guarantees that we'll maintain precision on your number (or even that we'll be keeping bignum support- maybe this causes lots of issues we aren't aware of yet and want to revert back to IEEE754 doubles). However, I'm happy with documenting our current behavior as "behavior", but not a "promise":
The last thing I might want is to put some |
🎉
Sent from mobile device, typos and false corrections are possible
Regards,
Leonid S. Usov
… On 5 Jan 2019, at 23:04, William Langford ***@***.***> wrote:
Sorry for the delay! The holidays got the better of me.
After further reflection, I feel that thread-locals are the right way to go, so I pushed up a copy of your branch to my fork with an extra commit on it ***@***.***) that switches the shared contexts over to using pthread's TLD (and makes pthread no longer optional).
There's no major speed penalty to it compared to your branch, and we go back to being thread-safe. (Interestingly, most libc implementations I've looked at use similar mechanisms for their strtod context allocations).
Relating to documentation:
I'm happy to document as guarantees the semantics we intend to guarantee. For semantics we don't intend to guarantee (details about the specific implementation, etc), I'm fine with documenting them, but we need to be very explicit that we are not making guarantees about them.
I don't want to make any guarantees that we'll maintain precision on your number (or even that we'll be keeping bignum support- maybe this causes lots of issues we aren't aware of yet and want to revert back to IEEE754 doubles). However, I'm happy with documenting our current behavior as "behavior", but not a "promise":
Numbers you don't do math on will retain their precision
The last thing I might want is to put some #ifdef USE_DEC_NUMBER guards around all of the decnumber code and a configure flag to control it. I'd like us to be able to build without this support for testing/measurement purposes. (We'd likely also want to use that configure flag to control whether or not the decNumber sources get built into the binary, too).
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
The library adds support for decimal numbers of arbitrary length. Downloaded from ICU, under ICU 1.8.1 license http://download.icu-project.org/files/decNumber/decNumber-icu-368.zip
Extend jv_number to use decNumber for storing number literals. Any math operations on the numbers will truncate them to double precision. Comparisons when both numbers are literal numbers will compare them without truncation. Delay conversion of numbers to doubles until a math operation is performed, to preserve precision. A literal jv_number will only need conversion to double once, and will reuse the resultant double on subsequent conversions. Outputting literal jv_numbers preserves the original precision. Add strong pthread requirement to manage contexts/allocations for converting numbers between their decNumber, string, and double formats.
Allow building jq in a mode that doesn't use decnumber for benchmarking purposes. decnumber support is enabled by default, and this option is meant to be removed once we're happy with the performance.
Sorry again for the delays! We're at the end of them, though; at this point, all that's left is a review from @nicowilliams! I've pushed a branch (https://github.com/stedolan/jq/compare/dec_literal_number) which contains all of this awesome work (less the @leonid-s-usov: The commits on my branch still have you as an author. Would you prefer to reset your branch to those or do you want me to close this PR and open one under the branch I pushed up? Either is fine with me. |
I'll review this weekend. Thanks @wtlangford! |
Thanks much, I’ll pull from your repo and reset
Sent from mobile device, typos and false corrections are possible
Regards,
Leonid S. Usov
… On Apr 4, 2019, at 5:47 AM, Nico Williams ***@***.***> wrote:
I'll review this weekend. Thanks @wtlangford!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
905fd84
to
2353d03
Compare
@leonid-s-usov Let's backup. Until now all Using an atomic CAS would prevent the leaking. Of course, at this time we don't use any atomic operations, and we'd have to start using atomic increment/decrement for the Anyways, please disregard as I am convinced we can make this thread-safe. |
That's cool, I understand it, and I agree. What I'm adding on top is that I believe in most cases it will be sufficient to make an atomic CAS to a member of the private structure pointed to by the jvp, under the implementing types' responsibility. This may be much more efficient than a common requirement to make a swap of the full jvp structure. |
Yes, that's what I had in mind. |
Also, no need to CAS a |
For this branch it's not about the double, apparently, but about this one, for example
|
I've chatted with @nicowilliams: I'm going to go ahead and merge this and we'll just watch for any issues that may pop up as time goes on. The thread-safety issues we discussed above can be addressed whenever we're ready to cross that bridge. Sorry to let this languish here for so long! |
Wow!
Amazing. Thank you!
Sent from mobile device, typos and false corrections are possible
Regards,
Leonid S. Usov
… On 22 Oct 2019, at 21:11, William Langford ***@***.***> wrote:
Merged #1752 into master.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
This was merged back in 2019, but no new builds since then. It would be nice not to have to create custom builds with this fix. @stedolan @leonid-s-usov @itchyny Any plans to cut a new build? |
Hi all
@pkoppstein @wtlangford @nicowilliams, let me present you the descendant of #1743, and a proud fixer of #1741 and such, the decimal number literal implementation.
Based on the other PR it features full PASS under valgrind.
Unlike my suggestion at #1741 I haven't implemented the simple math (+ and -) to be run by the decimal number logic, although it's trivial to achieve currently.
Comparisons, however, are fully decimal, so please feel free to try
and vice versa