Approximate UUID timestamp calculations #127

Closed

sergeyprokhorenko opened this issue Aug 1, 2023 · 24 comments

@sergeyprokhorenko

As a result of the attempt to implement the standard, one unfortunate shortcoming was revealed. I propose to add the following text to the section "6.1. Timestamp Considerations" in the next version of the standard:

Approximate calculations:
The UUID timestamp MAY be calculated with any reasonable calculation error. For example, for UUIDv7, a value in microseconds MAY be divided by 1024, or a value in any unit of time MAY have its least-significant portion truncated, as long as the error introduced is less than roughly a millisecond.

I believe this is now in line with the standard, as the standard says:

Implementations MAY alter the actual timestamp. Some examples include security considerations around providing a real clock value within a UUID, to correct inaccurate clocks, or to handle leap seconds. This specification makes no requirement or guarantee about how close the clock value needs to be to the actual time.

However, developers would like reasonable computational errors to be allowed explicitly.

@x4m

x4m commented Aug 1, 2023

Let me elaborate a bit.
We have the syscalls gettimeofday() and clock_gettime(), which return real time with microsecond and nanosecond precision respectively. To get millisecond resolution we must divide microseconds by 1000 or nanoseconds by 1000000.
The division instruction is expensive on many machines. Although the CSPRNG might be expensive too, I still see this division as a source of energy waste.
If we compute rand_a as the number of 1/1024-second intervals obtained from gettimeofday(), or 1/(1024 * 1024)-second intervals from clock_gettime(), the computation can be done with bit shifts, (seconds << 10) + (us >> 10) instead of (seconds * 1000 + us / 1000). Would such a unix_ts_ms be in line with the standard?
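
For illustration, a minimal C sketch of the two variants being compared (my reading of the proposal; the function names here are invented):

#include <stdint.h>
#include <sys/time.h>

/* Spec-compliant unix_ts_ms: true milliseconds since the Unix epoch. */
static uint64_t unix_ts_ms_div(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000 + (uint64_t)tv.tv_usec / 1000;
}

/* Proposed shortcut: ~1/1024-second units computed with shifts only.
   The result is not milliseconds; it drifts away from real time. */
static uint64_t unix_ts_shift(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return ((uint64_t)tv.tv_sec << 10) + ((uint64_t)tv.tv_usec >> 10);
}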

@broofa
Contributor

broofa commented Aug 3, 2023

one unfortunate shortcoming was revealed

What was the shortcoming? I don't understand how the new verbiage improves upon what's in the spec already.

I still see this division as a source of energy waste

The cost of a divide operation is several orders of magnitude below what we should be concerned with.

@x4m

x4m commented Aug 4, 2023

DIV r32 can cost you some tens of nanoseconds [0]. On some platforms RDTSC is cheaper than DIV [1], so the total cost of gettimeofday/clock_gettime can be comparable to this avoidable math. At which point are we concerned with performance?

[0] https://www.agner.org/optimize/instruction_tables.pdf
[1] https://stackoverflow.com/questions/6498972/faster-equivalent-of-gettimeofday

@broofa
Contributor

broofa commented Aug 4, 2023

At which point are we concerned with performance?

When the rate at which UUIDs can be generated becomes an issue for the people using them.

While I'll grant this is anecdotal, the NPM uuid library typically benchmarks in the 100K-1M UUIDs/second range depending on version and platform. And RFC4122 actually imposes a hard limit of 10M uuids/second. It requires implementations to throw an error if that's exceeded, which we do. In the 10+ years I've been maintaining this project, I've only had one report of that limit being hit and that was in the context of benchmarking, not an actual production use case.

So in absolute terms I think it's safe to say that generation times on the order of 1000 nanoseconds are at the extreme limit of what's expected. In real-world use cases, UUIDs are generated in contexts where other, more performance-intensive operations are the gating factor. Hence why a few tens of nanoseconds aren't a concern.

For what it's worth, the performance work we've done on uuid has always been "academic" in nature, i.e. performance for performance's sake, not because it's actually required. We've reached a point where I reject performance PRs on the basis of code readability and maintainability, because nobody cares about performance at this level of granularity.

@sergeyprokhorenko
Author

sergeyprokhorenko commented Aug 4, 2023

Here is a quote from the section 2. Motivation of the renewed standard:

The UUID generation algorithm described here supports very high allocation rates of 10 million per second per machine or more if necessary, so that they could even be used as transaction IDs.

This means a generation time of 100 nanoseconds per UUID. Therefore, tens of nanoseconds for the division operation is too much.

@broofa
Contributor

broofa commented Aug 5, 2023

Taking a bit more time to figure out what it is that's being asked for here, I think I understand the issue. To answer my own question ("what was the shortcoming?"), the concern seems to be that the spec doesn't explicitly allow for deliberately sloppy timestamp computations, such as dividing by 1024 instead of 1000 to convert from microseconds to milliseconds.

Assuming that's a fair synopsis, I would ask that we not take this new verbiage. The existing verbiage acknowledges that system clocks are known to be imperfect, while conveying the desire for accuracy. It's a good balance of "timestamps need to be accurate" and "timestamps are known to be inaccurate".

The proposed verbiage carries a different tone. It tacitly endorses implementations that are deliberately inaccurate, which I think is a mistake. And it does so for performance reasons that are (imho) well beyond the margins of actual user concern.


On the topic of performance ...

Here is a quote from the section 2. Motivation of the renewed standard

2. Motivation is just the original verbiage from RFC4122. It's included for (I assume) historical reference... and as it turns out, it is incorrect(!) Version 1 does not allow for "10 million per second ... or more". It allows for at most 10M/second.

[@kyzer-davis: Can you confirm the original RFC motivation is misleading in this regard? I don't see any provisions in 4122 that allow for generators to support rates greater than 1 uuid/timestamp interval. Would this be worth calling out or revising in the new spec?]

Thus, "10 million per second" isn't a requirement. It's an upper limit. And it's a theoretical limit based on the nature of the v1 (and v6 by extension) algorithms. None of the other versions have such a limit. v4, v5, v7? Those algorithms will let you crank out a trillion uuids/second if you can figure out how to do that. So should the spec concern itself with what might be needed to achieve picosecond-level optimizations?

@sergeyprokhorenko
Author

sergeyprokhorenko commented Aug 5, 2023

The existing verbiage acknowledges that system clocks are known to be imperfect, while conveying the desire for accuracy.

This claim is based on nothing and is misleading. This is not in the standard. This is an irresponsible speculation.

@sergeyprokhorenko
Author

sergeyprokhorenko commented Aug 5, 2023

The sole purpose of the timestamp is to order UUIDs by generation time. It does not matter how much the timestamp differs from real time. Moreover, the standard calls for distorting the timestamp when it is necessary for information security. For example, you can add or subtract several years to hide the true moment of generation. And the standard does not limit the motives for which the timestamp can be deliberately distorted.

@sergeyprokhorenko
Author

Thus, "10 million per second" isn't a requirement. It's an upper limit.

This statement grossly contradicts the text of the standard.

@broofa
Contributor

broofa commented Aug 6, 2023

This claim is based on nothing and is misleading. This is not in the standard. This is an irresponsible speculation.

Inflammatory, a little offensive, and nothing of substance to respond to.

The sole purpose of the timestamp is to order UUIDs by generation time

Then we definitely shouldn't allow non-standard time intervals. Doing so causes ordering to become a hot mess as soon as UUIDs from different generators start to commingle.

For example, take a UUID generated now (2023-08-06T13:55:18.285Z) from a generator using the proposed 1024 µs interval (microseconds divided by 1024) and insert it into a list of UUIDs generated with the standard 1 ms interval: it will sort as though it was generated in May of last year (2022-05-04T18:39:28.637Z, to be specific).

That's going to wreak havoc with the stated goal of better DB locality.
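
Concretely, the arithmetic behind those two dates (a minimal sketch, assuming a generator that simply divides microseconds by 1024):

#include <inttypes.h>
#include <stdio.h>

int main(void) {
    /* 2023-08-06T13:55:18.285Z as milliseconds since the Unix epoch */
    uint64_t real_ms = UINT64_C(1691330118285);
    uint64_t real_us = real_ms * 1000;
    /* "milliseconds" as produced by dividing microseconds by 1024 */
    uint64_t sloppy_ms = real_us / 1024;
    printf("standard unix_ts_ms: %" PRIu64 "\n", real_ms);   /* 1691330118285 */
    printf("sloppy   unix_ts_ms: %" PRIu64 "\n", sloppy_ms); /* 1651689568637 = 2022-05-04T18:39:28.637Z */
    return 0;
}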

Thus, "10 million per second" isn't a requirement. It's an upper limit.

This statement grossly contradicts the text of the standard.

Prove it.

Here's the v1 implementation for uuid. It's been written to throw a hard error if callers attempt to create more than 10M uuids/second, because that's what the spec demands. If we have that wrong then please put up a PR that corrects this. I would genuinely welcome that. You'd be doing the [very large] uuid community a pretty big service by getting rid of that throw statement.

Distinctly not holding my breath on this, however.

@x4m

x4m commented Aug 7, 2023

Folks, just my 2 cents.
"DB locality" == locality in one node. It's impossible to have time-ordered locality across many nodes.
"10M upper limit" sounds like "640K ought to be enough for anybody".
Even with a 10M upper limit, the division instruction could take ~10% of the CPU cost. Yes, in JS you don't care about that, and you will hardly ever reach 1B UUIDs per second. But a performance-oriented system can get to these numbers easily. Also, think about low-power IoT devices: in that world you trade performance for energy expenses. It's better to save the ~10% spent on a task as frequent as UUID generation.

@x4m

x4m commented Aug 7, 2023

Also, from a DB standpoint, mixing series of UUIDs generated with seriously different clocks won't have any impact on performance. The crucial part is that the working set of new data is small. When data is generated randomly across the keyspace of the whole dataset, that's a problem. If we have 13 data sources, one dividing by 1000, another by 1002, one more by 1004 ... 1024, this won't affect cache locality much.

@fabiolimace

fabiolimace commented Aug 7, 2023

I think the benefit of optimizing timestamp generation by avoiding division is of little relevance. The test below shows that the difference between division and shift (division by a power of 2) is negligible compared to the time it takes to get the current time.

Measure the duration to get 1 billion milliseconds using div and shift separately

Test result:

user@computer:~$ gcc -O0 measure_duration_get_milliseconds.c && ./a.out 
duration to get seconds (baseline):       16.958577 s
duration to get milliseconds using div:   17.956647 s
duration to get milliseconds using shift: 17.617555 s

Also, in the long run, the difference between the timestamps produced with division and shift will get bigger and bigger. Currently, the difference is more than 400 days, considering the test below.

Measure the distance between milliseconds using div and milliseconds using shift

Test result:

user@computer:~$ gcc -O0 measure_distance_get_milliseconds.c && ./a.out 
timestamp using div:   1691404335877 ms
timestamp using shift: 1731998039876 ms
distance in seconds:   40593703 s
distance in days:      469 days

Assembly difference between division and shift in the tests:

user@computer:~$ diff get_milliseconds_with_div.s get_milliseconds_with_shift.s 
23,33c23,27
< 	imulq	$1000, %rax, %rsi
< 	movq	-24(%rbp), %rcx
< 	movabsq	$4835703278458516699, %rdx
< 	movq	%rcx, %rax
< 	imulq	%rdx
< 	movq	%rdx, %rax
< 	sarq	$18, %rax
< 	sarq	$63, %rcx
< 	movq	%rcx, %rdx
< 	subq	%rdx, %rax
< 	addq	%rsi, %rax
---
> 	salq	$10, %rax
> 	movq	%rax, %rdx
> 	movq	-24(%rbp), %rax
> 	sarq	$20, %rax
> 	orq	%rdx, %rax
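
For reference, C that would plausibly produce a diff like the one above (reconstructed from the assembly; the actual source files are not shown in this thread):

#include <stdint.h>
#include <time.h>

/* Exact milliseconds: seconds * 1000 plus nanoseconds divided by 1000000
   (the compiler turns the division into the multiply-and-shift seen above). */
static uint64_t get_milliseconds_with_div(void) {
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    return (uint64_t)tp.tv_sec * 1000 + (uint64_t)tp.tv_nsec / 1000000;
}

/* Shift variant: ~1/1024-second units, which run "fast" relative to real
   milliseconds, consistent with the 469-day distance measured above. */
static uint64_t get_milliseconds_with_shift(void) {
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    return ((uint64_t)tp.tv_sec << 10) | ((uint64_t)tp.tv_nsec >> 20);
}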

@kyzer-davis
Contributor

I have not been checking this repo so I apologize for missing this thread.

@broofa, the old doc said:

The UUID generation algorithm described here supports very high allocation rates of up to 10 million per second per machine if necessary, so that they could even be used as transaction IDs.

The latest new doc says:

The UUID generation algorithm described here supports very high allocation rates of 10 million per second per machine or more if necessary, so that they could even be used as transaction IDs.

The key being that it is now uncapped and 10M is just an example. I don't remember exactly when this was changed, but I believe the idea is that one could create way more than this with modern computers. After all, that metric may just be something from early-2000s-era computing rather than a hard limit.


As for dividing microseconds by 1024 instead of 1000 for performance reasons: I see no issue adding something to the "Altering, Fuzzing, or Smearing:" sub-section which gives an example of another reason one might modify the timestamp.

Some proposed text:

Some examples include security considerations around providing a real clock value within a UUID, to correct inaccurate clocks, to handle leap seconds, or for performance reasons such as dividing a microsecond by 1024 (or some other value) instead of 1000 to obtain a millisecond value. This specification makes no requirement or guarantee about how close the clock value needs to be to the actual time.

Just to note, we just concluded IETF last call so I am not sure what changes I can realistically add that won't reset the AD reviews. Let me run this by the WG Chairs and get their opinion.

@bradleypeabody
Contributor

My take on this is that dividing by 1024 for performance reasons should be allowed by the spec, but not encouraged. From an interoperability perspective if something is actually using the timestamp or needs the ordering provided by whatever the best clock source is, then this is problematic. But, that said, one of the goals of this new spec is to allow implementations to make these kinds of tradeoffs - if the interoperability in terms of timestamp accuracy just isn't a concern for your application, then sure, use 1024ths of a second instead of milliseconds. But I don't think we should promote this idea in the spec because it's a slippery slope - why not change the timestamp entirely to some other numbering system, and if you did, is that still UUIDv7?

Just from a quick playground example (https://go.dev/play/p/3_dA3xapsaR), if you look at the difference for a specific timestamp, you can see that the date 2023-09-15 00:00:00, when converted to a Unix timestamp, divided by 1024 instead of 1000, and then converted back to a date, becomes 2022-06-12 06:33:45 - over a year earlier. The timestamp becomes mostly meaningless at this point, at least in terms of interoperability.

So again, if someone wants to do this for their system and the interoperability concerns are acceptable to them because they are sure nobody cares about how the timestamp reads, or they control all of the parsing code, great. But I don't think we should be telling people to do this as advice.

How about this text instead:

Altering, Fuzzing, or Smearing: Implementations MAY alter the actual timestamp. Some examples include security considerations around providing a real clock value within a UUID, to correct inaccurate clocks, to handle leap seconds, or to improve performance when converting a system timestamp to the necessary value (e.g. improving UUID generation speed but reducing timestamp value accuracy). Implementations should document such timestamp behaviors in order to improve interoperability. This specification makes no requirement or guarantee about how close the clock value needs to be to the actual time.

@ramsey

ramsey commented Aug 15, 2023

if the interoperability in terms of timestamp accuracy just isn't a concern for your application, then sure, use 1024ths of a second instead of milliseconds. But I don't think we should promote this idea in the spec because it's a slippery slope - why not change the timestamp entirely to some other numbering system, and if you did, is that still UUIDv7?

I would argue that you should use a UUIDv8 in this case.

@sergeyprokhorenko
Author

sergeyprokhorenko commented Aug 15, 2023

e.g. improving UUID generation speed but reducing timestamp value accuracy

I prefer the wording of @kyzer-davis as more specific; otherwise the wording of @bradleypeabody needs to be clarified. The words "timestamp value accuracy" can cause controversy, as some may think it is about minutes or hours, not years. By the way, for security reasons a deviation of several years is also a useful option. I want to point out that the only purpose of the timestamp is to ensure the generation of monotonically increasing keys for fast record lookup, and from a DB standpoint, mixing series of UUIDs with seriously different clocks won't have any impact on performance. Therefore, interoperability will not be affected. See also: 6.11 Opacity

I agree with the suggestion to document intentional timestamp deviations.

@LiosK

LiosK commented Aug 16, 2023

Could be somewhat off topic: tv_sec * 1000 + (tv_nsec >> 20) makes some sense to me, but (tv_sec << 10) + (tv_nsec >> 20) doesn't, because multiplication is almost as fast as a left shift. With the former approach, the resulting variance from tv_sec * 1000 + tv_nsec / 1000000 is less than 50 milliseconds, which is usually as negligible as rounding errors. Do we really need explicit wording about this technique in the spec?

Plus, from my benchmarks, under extremely high workload the syscall overhead becomes prominent, so caching the tv_sec * 1000 + tv_nsec / 1000000 result in user land should improve the throughput way more than eliminating divisions.
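
One way such caching might look (a rough sketch; the batch size is arbitrary and the static state is not thread-safe):

#include <stdint.h>
#include <time.h>

/* Refresh the millisecond timestamp from the kernel only every N calls and
   reuse the cached value in between. Within a single millisecond, UUIDv7
   ordering relies on rand_a / counters anyway, so a slightly stale
   unix_ts_ms is often acceptable. */
static uint64_t cached_unix_ts_ms(void) {
    enum { REFRESH_EVERY = 64 };   /* arbitrary batch size */
    static uint64_t cached_ms;
    static unsigned calls_left;

    if (calls_left == 0) {
        struct timespec tp;
        clock_gettime(CLOCK_REALTIME, &tp);
        cached_ms = (uint64_t)tp.tv_sec * 1000 + (uint64_t)tp.tv_nsec / 1000000;
        calls_left = REFRESH_EVERY;
    }
    calls_left--;
    return cached_ms;
}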

@LiosK

LiosK commented Aug 16, 2023

Btw, consider this approach as well: (uint64_t)tv_sec * 1000 + ((uint64_t)tv_nsec * 1073 >> 30)

1073 comes from 2^30 / 1000^2 ≈ 1073.74, i.e. dividing by 10^6 is approximately the same as multiplying by 1073 and shifting right by 30.

Beware of tv_nsec overflows.

@LiosK

LiosK commented Aug 19, 2023

After some research, I learned that

(uint64_t)tp.tv_sec * 1000 + ((uint64_t)tp.tv_nsec * UINT64_C(18014398510) >> 54)

is 100% compatible with

(uint64_t)tp.tv_sec * 1000 + (uint64_t)tp.tv_nsec / 1000000

Two expressions result in exactly the same value for each tv_nsec ranging from 0 to 999,999,999.
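
A brute-force way to check that claim (a quick standalone test, not the benchmark mentioned below):

#include <inttypes.h>
#include <stdio.h>

/* Compare the magic-constant expression against plain division for every
   tv_nsec value in [0, 999999999]. */
int main(void) {
    for (uint64_t ns = 0; ns < UINT64_C(1000000000); ns++) {
        uint64_t a = ns * UINT64_C(18014398510) >> 54;
        uint64_t b = ns / UINT64_C(1000000);
        if (a != b) {
            printf("mismatch at %" PRIu64 "\n", ns);
            return 1;
        }
    }
    printf("identical for all tv_nsec in [0, 999999999]\n");
    return 0;
}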

Plus, compilers conduct this sort of optimization automatically (see this paper for details), so when I look at a quick benchmark result, the former code is much faster than the latter with the unoptimized executable, while such a performance difference is not observable with the optimized executable.

I'd say we don't need to be concerned about divisions and can just stick to:

(uint64_t)tp.tv_sec * 1000 + (uint64_t)tp.tv_nsec / 1000000

@x4m

x4m commented Aug 19, 2023

Yup, I worried a bit about -O0 too, but thought that it made sense; relying on compiler optimizations did not strike me as a good idea. However, this optimization seems straightforward enough now that I've seen the code in C :)

@kyzer-davis
Contributor

I have committed the text change under ietf-wg-uuidrev/rfc4122bis@26f38c9 for draft-10

@kyzer-davis
Contributor

Merged, and closing. Thanks group.

@sergeyprokhorenko
Author

@kyzer-davis,
Is it possible to change
"such as dividing a microsecond by 1024 (or some other value) instead of 1000 to obtain a millisecond value"
for
"such as dividing a number of microseconds by 1024 (or some other value) instead of 1000 to obtain a millisecond value"?
