Selecting fastest date/time parser #6103

tsafin · 2021-05-26T22:48:48Z


Introduction

As part of the Lua/SQL type consistency task, we need to implement datetime support in the Tarantool SQL engine. But before doing so (remember, we want Lua and SQL to be consistent), we have to add date/time/timestamp type support to Lua and be able to store those types in Tarantool box.

It might be easy to pick the Unicode ICU date/time parsing implementation (ICU is already integrated into the Tarantool core for collation support), but the problem is that there are reasonable doubts that ICU is fast enough.

We needed to find the best date/time parsing code that we could use from the Lua side.

For obvious performance reasons, none of the popular pure-Lua date/time implementations could serve us well enough. Neither Penlight pl.date (which is deprecated at the moment, BTW) nor Thijs Schreijer's ("Tieske") LuaDate could provide the performance we would like to have in a builtin Tarantool module.

We need to select among C/C++ implementations that we could use as the basis for an FFI-based module.

Possible candidates

In the benchmarks repository bench-timestamp we used the following C/C++ date/time parsers for our experiments:

c-dt by Christian Hansen (chansen/c-dt), which is largely a precursor to his excellent Perl 5 p5-time-moment module and which @Mons recommended us to look into (thanks, @Mons!).

We had to patch c-dt slightly to integrate it properly into the whole CMake build process (making sure that we use the same set of compilers, with the same optimization settings selected);
Google Civil Time (cctz) C++ implementation;
the industry-standard Unicode ICU (unicode-org/icu) C++ implementation;
and, as a bonus (and simply "because we can"), a simple re2c-based reimplementation of the c-dt datetime parser, which shows the pure beauty of deterministic finite automata for parsing regular grammars :) (see timing details below).
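
To give a feel for why a DFA-based parser can be so cheap, here is a small hand-written sketch. It is not the re2c-generated code from the repository; it only checks the shape of a YYYY-MM-DD date, and all names in it are hypothetical. The point is that every input byte triggers exactly one state transition and nothing is ever re-read.

```cpp
#include <cassert>
#include <cstddef>

// Character classes fed to the automaton.
enum CharClass { kDigit, kDash, kOther };

static CharClass classify(char c) {
    if (c >= '0' && c <= '9') return kDigit;
    if (c == '-') return kDash;
    return kOther;
}

// States 0..9 are positions inside "dddd-dd-dd";
// state 10 is accepting, state 11 is the dead (reject) state.
static const int kAccept = 10;
static const int kDead = 11;

// Transition function of the automaton.
static int step(int state, CharClass cls) {
    if (state >= kAccept) return kDead;  // nothing may follow a full match
    const bool want_dash = (state == 4 || state == 7);
    if (want_dash) return cls == kDash ? state + 1 : kDead;
    return cls == kDigit ? state + 1 : kDead;
}

// One pass over the input, one transition per byte, no backtracking.
static bool matches_iso_date_shape(const char *s, size_t len) {
    int state = 0;
    for (size_t i = 0; i < len && state != kDead; i++)
        state = step(state, classify(s[i]));
    return state == kAccept;
}
```

A generated DFA usually encodes the states as labels and jumps instead of an integer, but the cost model is the same: time linear in the input length with a tiny constant.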

Googletest and Google benchmark

We use the Google Test and Google Benchmark frameworks as drivers for running our unit tests and benchmarks.

NB! We have not yet properly (seamlessly) integrated them into the build process, and the repository is not yet self-contained, so one should install googletest and libbenchmark-dev as prerequisites from elsewhere.

Examples: parsing of ISO-8601 date/time format

We have built unit tests and benchmarks that use the same set of predefined date/time literals in ISO-8601 format. Below we show the code needed for each implementation.

c-dt

c-dt is flexible with respect to the date/time formats it accepts. You do not have to worry about which format to select for parsing a date; the choice is made automagically.

The catch is that there is no single function for parsing date + time + timezone from the same literal. Instead there are separate dt_parse_iso_date(), dt_parse_iso_time_* and dt_parse_iso_zone_* functions, which parse the corresponding parts of the input text and can be composed into a full timestamp parser (see parse_datetime_extended() in bench-cdt.cpp).

Essentially, parsing a single literal looks like this:

	const char civil_string[] = "2015-02-18T10:50:31.521345123+10:00";
	int64_t secs;
	int64_t nanosecs;
	int64_t offset;
	int rc = parse_datetime_extended(civil_string, sizeof(civil_string) - 1,
					 &secs, &nanosecs, &offset);
	assert(rc == 0);
	assert(nanosecs == 521345123);
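
The composition of partial parsers described above can be sketched without c-dt itself. The block below is a minimal, self-contained illustration and not the code from bench-cdt.cpp: the helper names (parse_date, parse_time, parse_zone, parse_datetime) are hypothetical, it accepts only the extended YYYY-MM-DDThh:mm:ss[.fraction](Z|+hh:mm|-hh:mm) form, and it does not validate the day against the length of the month.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Parse exactly n decimal digits; returns false on a non-digit.
static bool digits(const char *s, size_t n, int *out) {
    int v = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] < '0' || s[i] > '9') return false;
        v = v * 10 + (s[i] - '0');
    }
    *out = v;
    return true;
}

// Days since 1970-01-01 for a proleptic Gregorian date.
static int64_t days_from_civil(int y, int m, int d) {
    y -= m <= 2;
    const int64_t era = (y >= 0 ? y : y - 399) / 400;
    const int64_t yoe = y - era * 400;
    const int64_t doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const int64_t doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

// "YYYY-MM-DD" -> days since the epoch. Returns chars consumed, 0 on failure.
static size_t parse_date(const char *s, size_t len, int64_t *days) {
    int y, m, d;
    if (len < 10 || s[4] != '-' || s[7] != '-') return 0;
    if (!digits(s, 4, &y) || !digits(s + 5, 2, &m) || !digits(s + 8, 2, &d)) return 0;
    if (m < 1 || m > 12 || d < 1 || d > 31) return 0;
    *days = days_from_civil(y, m, d);
    return 10;
}

// "hh:mm:ss[.f{1,9}]" -> second of day and nanoseconds.
static size_t parse_time(const char *s, size_t len, int *sod, int *nsec) {
    int h, mi, sec;
    if (len < 8 || s[2] != ':' || s[5] != ':') return 0;
    if (!digits(s, 2, &h) || !digits(s + 3, 2, &mi) || !digits(s + 6, 2, &sec)) return 0;
    if (h > 23 || mi > 59 || sec > 59) return 0;
    size_t n = 8;
    int ns = 0;
    if (n < len && s[n] == '.') {
        int scale = 100000000;  // weight of the first fractional digit
        size_t k = n + 1;
        for (; k < len && s[k] >= '0' && s[k] <= '9' && scale > 0; k++, scale /= 10)
            ns += (s[k] - '0') * scale;
        if (k == n + 1) return 0;  // '.' must be followed by a digit
        n = k;
    }
    *sod = h * 3600 + mi * 60 + sec;
    *nsec = ns;
    return n;
}

// "Z", "+hh:mm" or "-hh:mm" -> offset in seconds east of UTC.
static size_t parse_zone(const char *s, size_t len, int *offset) {
    if (len >= 1 && s[0] == 'Z') { *offset = 0; return 1; }
    int h, mi;
    if (len < 6 || (s[0] != '+' && s[0] != '-') || s[3] != ':') return 0;
    if (!digits(s + 1, 2, &h) || !digits(s + 4, 2, &mi) || h > 23 || mi > 59) return 0;
    *offset = (s[0] == '-' ? -1 : 1) * (h * 3600 + mi * 60);
    return 6;
}

// Compose the three partial parsers. Returns 0 on success, -1 on failure.
static int parse_datetime(const char *s, size_t len,
                          int64_t *secs, int64_t *nanosecs, int64_t *offset) {
    int64_t days;
    int sod, ns, off;
    size_t n = parse_date(s, len, &days);
    if (n == 0 || n >= len || s[n] != 'T') return -1;
    n++;
    size_t k = parse_time(s + n, len - n, &sod, &ns);
    if (k == 0) return -1;
    n += k;
    k = parse_zone(s + n, len - n, &off);
    if (k == 0 || n + k != len) return -1;  // the whole input must be consumed
    *secs = days * 86400 + sod - off;       // seconds since the epoch, in UTC
    *nanosecs = ns;
    *offset = off;
    return 0;
}
```

Each partial parser reports how many characters it consumed, so the composing function only has to advance a cursor and check the separators between the parts.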

CCTZ

Neither Google CCTZ nor ICU has an automagical way to parse arbitrary timestamp formats: you have to select a format string before applying it at runtime.

For example, parsing the same literal "2015-02-18T10:50:31.521345123+10:00" in CCTZ looks like this:

    const std::string civil_string = "2015-02-18T10:50:31.521345123+10:00";
    cctz::time_zone lax;
    std::chrono::system_clock::time_point tp;
    // load_time_zone("America/Los_Angeles", &lax);
    const bool ok = cctz::parse("%Y-%m-%dT%H:%M:%E9S%Ez", civil_string, lax, &tp);
    assert(ok);

The advantage here (at least compared to c-dt above) is that CCTZ has a timezone database, so you can use the symbolic form of a timezone in a timestamp, i.e. instead of "UTC+03:00" you could simply say "Europe/Moscow".

ICU

ICU from the Unicode organization is the most powerful, but it is sometimes (expectedly) the slowest of all those mentioned here.

We still need to know which format we are about to parse, so it is not as convenient as c-dt, and is similar to CCTZ in that respect.

One caveat though: you had better precompile the format beforehand, otherwise the full cycle of format preparation followed by parsing would be too slow.

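    const char16_t civil_string[] = u"2015-02-18T10:50:31.521345123+10:00";
    // In ICU patterns 'y' is the calendar year ('Y' is the week-based year),
    // and literal letters such as T must be quoted.
    const char16_t format[] = u"yyyy-MM-dd'T'HH:mm:ss.SXXX";

    UErrorCode status = U_ZERO_ERROR;
    // A length of -1 means the string is NUL-terminated.
    UDateFormat *fmt = udat_open(UDAT_PATTERN, UDAT_PATTERN, "en_US", NULL, 0, format, -1, &status);
    assert(!U_FAILURE(status));

    int32_t pos = 0;
    UDate d = udat_parse(fmt, civil_string, -1, &pos, &status);
    assert(!U_FAILURE(status));
    udat_close(fmt);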

If you are curious about the difference in timings between hoisting the invariant udat_open call out of the loop and doing the full cycle, please compare the timings of ICU_Parse1 (full cycle) and ICU_Parse1_Inv below...
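
The effect of hoisting invariant setup out of the measured loop can be reproduced without ICU. In the sketch below, std::regex construction stands in for the expensive udat_open() call. This substitution is an assumption made purely for illustration (it is not what the benchmark repository does), and the function names are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

static const char kPattern[] =
    R"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{1,9})?(Z|[+-]\d{2}:\d{2}))";

// Full cycle: the pattern is compiled on every iteration.
static size_t match_full_cycle(const std::string &text, size_t iterations) {
    size_t matched = 0;
    for (size_t i = 0; i < iterations; i++) {
        const std::regex re(kPattern);  // expensive, loop-invariant setup
        matched += std::regex_match(text, re);
    }
    return matched;
}

// Invariant hoisted: compile once, reuse for every iteration.
static size_t match_hoisted(const std::string &text, size_t iterations) {
    const std::regex re(kPattern);
    size_t matched = 0;
    for (size_t i = 0; i < iterations; i++)
        matched += std::regex_match(text, re);
    return matched;
}

// Wall-clock time of one call, in nanoseconds per iteration.
template <typename Fn>
static double ns_per_iteration(Fn fn, const std::string &text, size_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    const size_t matched = fn(text, iterations);
    const auto stop = std::chrono::steady_clock::now();
    (void)matched;  // keep the result observable so the loop is not optimized away
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}
```

Calling ns_per_iteration(match_full_cycle, text, n) and ns_per_iteration(match_hoisted, text, n) on the same literal should show the same kind of gap as between ICU_Parse1 and ICU_Parse1_Inv, although the exact ratio depends on the standard library and is not comparable to the ICU numbers.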

Benchmark results

We ran the default release-mode executable, compiled with gcc 8.3, which gave the following numbers.
(Be warned that your mileage may vary significantly)...

19:04 $ taskset -c 0 ./bench --benchmark_repetitions=5
2021-05-08 19:04:59
Running ./bench
Run on (8 X 1992.01 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-------------------------------------------------------------
CDT_Parse                   4555 ns       4555 ns     111968
CDT_Parse                   4567 ns       4567 ns     111968
CDT_Parse                   4157 ns       4156 ns     111968
CDT_Parse                   3808 ns       3808 ns     111968
CDT_Parse                   4318 ns       4318 ns     111968
CDT_Parse_mean              4281 ns       4281 ns     111968
CDT_Parse_median            4318 ns       4318 ns     111968
CDT_Parse_stddev             315 ns        315 ns     111968
CDT_Parse1                    96 ns         96 ns    7083418
CDT_Parse1                    87 ns         86 ns    7083418
CDT_Parse1                    97 ns         97 ns    7083418
CDT_Parse1                    88 ns         88 ns    7083418
CDT_Parse1                   103 ns        103 ns    7083418
CDT_Parse1_mean               94 ns         94 ns    7083418
CDT_Parse1_median             96 ns         96 ns    7083418
CDT_Parse1_stddev              7 ns          7 ns    7083418
CCTZ_Parse1                  776 ns        776 ns     812659
CCTZ_Parse1                  737 ns        737 ns     812659
CCTZ_Parse1                  672 ns        672 ns     812659
CCTZ_Parse1                  671 ns        671 ns     812659
CCTZ_Parse1                  789 ns        789 ns     812659
CCTZ_Parse1_mean             729 ns        729 ns     812659
CCTZ_Parse1_median           737 ns        737 ns     812659
CCTZ_Parse1_stddev            56 ns         56 ns     812659
ICU_Parse1                 83369 ns      83363 ns       7199
ICU_Parse1                 93202 ns      93203 ns       7199
ICU_Parse1                 85615 ns      85611 ns       7199
ICU_Parse1                114555 ns     114530 ns       7199
ICU_Parse1                 98986 ns      98983 ns       7199
ICU_Parse1_mean            95145 ns      95138 ns       7199
ICU_Parse1_median          93202 ns      93203 ns       7199
ICU_Parse1_stddev          12498 ns      12490 ns       7199
ICU_Parse1_Inv               300 ns        300 ns    2333216
ICU_Parse1_Inv               258 ns        258 ns    2333216
ICU_Parse1_Inv               281 ns        281 ns    2333216
ICU_Parse1_Inv               275 ns        275 ns    2333216
ICU_Parse1_Inv               291 ns        291 ns    2333216
ICU_Parse1_Inv_mean          281 ns        281 ns    2333216
ICU_Parse1_Inv_median        281 ns        281 ns    2333216
ICU_Parse1_Inv_stddev         16 ns         16 ns    2333216
RE_Parse1                     11 ns         11 ns   71724166
RE_Parse1                     10 ns         10 ns   71724166
RE_Parse1                     12 ns         12 ns   71724166
RE_Parse1                     13 ns         13 ns   71724166
RE_Parse1                     12 ns         12 ns   71724166
RE_Parse1_mean                12 ns         12 ns   71724166
RE_Parse1_median              12 ns         12 ns   71724166
RE_Parse1_stddev               1 ns          1 ns   71724166

Please compare the *_Parse1 numbers (parsing of a single literal):

CDT_Parse1 uses Christian Hansen's c-dt;
CCTZ_Parse1 uses Google CCTZ;
ICU_Parse1 uses Unicode ICU. ICU_Parse1_Inv is a modified version with all invariants moved out of the loop;
RE_Parse1 is a bonus version, where the c-dt implementation has been replaced with a simple DFA generated by re2c.

Speed comparison ratio

So, essentially, we have this table comparing the median times of each implementation:

Algo	Median	Ratio to best
RE_Parse1	12 ns	1x
CDT_Parse1	96 ns	8x
ICU_Parse1_Inv	281 ns	23x
CCTZ_Parse1	737 ns	61x

As we see, RE2C is the fastest and c-dt is good enough, while ICU (with the format precompiled) and CCTZ are roughly 23x and 61x slower than RE2C.
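
The ratios are simply each median divided by the smallest one. A small sketch of that arithmetic, using the medians from the benchmark run above (the struct and function names are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

struct Result {
    std::string name;
    double median_ns;
};

// Ratio of each median to the best (smallest) one, rounded to the nearest
// integer. Precondition: results is not empty.
static std::vector<long> ratios_to_best(const std::vector<Result> &results) {
    const double best = std::min_element(
        results.begin(), results.end(),
        [](const Result &a, const Result &b) { return a.median_ns < b.median_ns; })
        ->median_ns;
    std::vector<long> out;
    for (const Result &r : results)
        out.push_back(std::lround(r.median_ns / best));
    return out;
}

// Medians taken from the benchmark output above.
static const std::vector<Result> kMedians = {
    {"RE_Parse1", 12},       {"CDT_Parse1", 96},   {"ICU_Parse1_Inv", 281},
    {"CCTZ_Parse1", 737},    {"ICU_Parse1", 93202},
};
```

For the four rows of the table this yields 1, 8, 23 and 61. Applying the same arithmetic to the full-cycle ICU_Parse1 median (93202 ns) gives roughly 7767x.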

Tarantool plan of actions

Despite the "bestest" timings we got from our own RE2C implementation, from a managerial perspective it is still preferable to use c-dt as the basis for the Lua date/time parsing implementation, due to its maturity and completeness.
The lack of full timezone database support in c-dt can be compensated for via a special route: when the parser finds a symbolic timezone suffix, it should call the ICU database to convert the timezone name to the actual time offset (relative to the processed date). To accommodate the foreseen overhead of these ICU database calls, we may internally memoize all timezone offsets already discovered.
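
A minimal sketch of such memoization follows. It is an illustration under stated assumptions, not a proposed implementation: the class and type names are hypothetical, the resolver callback stands in for the ICU database call, and the cache is keyed by (zone name, calendar date) because the offset of a named zone depends on the date. Keying by date also assumes the offset does not change within a single calendar date, which does not hold on the day of a DST transition itself; such dates would need finer handling.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Calendar date used as part of the cache key.
struct Date {
    int year, month, day;
    bool operator<(const Date &o) const {
        return std::tie(year, month, day) < std::tie(o.year, o.month, o.day);
    }
};

// Memoizes UTC offsets (in seconds) per (zone name, date).
class ZoneOffsetCache {
public:
    // Hypothetical slow lookup, e.g. a wrapper around a tz database query.
    using Resolver = std::function<int(const std::string &, const Date &)>;

    explicit ZoneOffsetCache(Resolver resolver) : resolver_(std::move(resolver)) {}

    int offset(const std::string &zone, const Date &date) {
        const auto key = std::make_pair(zone, date);
        const auto it = cache_.find(key);
        if (it != cache_.end())
            return it->second;                  // cache hit: no database call
        const int off = resolver_(zone, date);  // slow path
        cache_.emplace(key, off);
        return off;
    }

private:
    Resolver resolver_;
    std::map<std::pair<std::string, Date>, int> cache_;
};
```

When parsing a log where most entries share a zone and a date, only the first entry for each (zone, date) pair pays for the database lookup; all later ones are served from the map.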

rybakit · 2021-05-27T11:44:24Z


To accomodate some foreseen overhead for these ICU database calls we may memoize internally all already discovered timezone offsets.

I wonder how memoization will cope with DST clock shifts?


tsafin May 27, 2021
Author

Any local caching should, certainly, be date-related.

(The assumption is that when parsing millions of log entries within the same timezone, we save a bit on timezone resolution, because we know that within the same date we are guaranteed to be inside the same daylight saving period.)

ligurio · 2021-06-03T09:38:16Z

Collaborator

In the Benchmark results section, the test output contains a warning:

***WARNING*** Library was built as DEBUG. Timings may be affected.

Was debug mode enabled intentionally?


tsafin Jun 16, 2021
Author

Was debug mode enabled intentionally?

Nope, that's an artifact of using a single prebuilt gbench binary, which is not recompiled for our build mode, and it is rather a red herring.
I assume it's not a big deal, because our own libraries (i.e. CCTZ, c-dt, and ICU) were all compiled as RelWithDebInfo.
