isdigit(Char(::UInt8)) performance regression? #25883

samoconnor · 2018-02-05T00:56:55Z

In 0.7 it seems that isdigit(Char(x::Uint8)) produces more than 3 times as many instructions as
x::UInt8 in UInt8('0'):UInt8('9').

Is there a good reason not to define isdigit(x::UInt8) = x in UInt8('0'):UInt8('9') in Base?

julia> VERSION
v"0.6.2"

julia> code_native(x->isdigit(Char(x)), (UInt8,))
	.section	__TEXT,__text,regular,pure_instructions
	pushq	%rbp
	movq	%rsp, %rbp
	addb	$-48, %dil
	cmpb	$10, %dil
	setb	%al
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)

julia> VERSION
v"0.7.0-DEV.3639"

# x in '0':'9' produces 5 instructions...
julia> code_native(x->x in UInt8('0'):UInt8('9'), (UInt8,))
	.section	__TEXT,__text,regular,pure_instructions
	addb	$-48, %dil
	cmpb	$10, %dil
	setb	%al
	retq
	nopl	(%rax)
;}

# ... but isdigit produces...
julia> code_native(x->isdigit(Char(x)), (UInt8,))
	.section	__TEXT,__text,regular,pure_instructions
	movzbl	%dil, %eax
	testb	%dil, %dil
	js	L22
	shll	$24, %eax
	cmpl	$805306368, %eax        ## imm = 0x30000000
	jae	L43
	xorl	%eax, %eax
	retq
L22:
	movl	%eax, %ecx
	andl	$63, %ecx
	andl	$192, %eax
	leal	(%rcx,%rax,4), %eax
	shll	$16, %eax
	orl	$3229614080, %eax       ## imm = 0xC0800000
L43:
	cmpl	$956301313, %eax        ## imm = 0x39000001
	setb	%al
	retq
	nopw	%cs:(%rax,%rax)

The text was updated successfully, but these errors were encountered:

samoconnor · 2018-02-05T01:14:56Z

Related: Is the omission of == and in methods for Integers and Chars deliberate?
i.e. should things like the following be defined?

==(i::T, c::Char) where T <: Integer = i == T(c)

in(a::T, b::StepRange{Char,S}) where {T <: Integer, S} = in(a, StepRange{T, S}(b))

This would allow 0x30 == '0' and 0x33 in '0':''9'.

ararslan · 2018-02-05T02:56:55Z

Is the omission of == and in methods for Integers and Chars deliberate?

Yes, they used to be defined but were removed: #16024

samoconnor · 2018-02-05T04:55:14Z

Thx for the reference @ararslan.
I think I tend to agree with @stevengj's comments in #16024.

The mapping between characters and integers [0:127] has been well defined since 1963.

If I write x == '\n', there is no ambiguity, I want to know if x is 0x0A.

#16024 (comment) raises the example of Set(Any[120, 'x']) == Set(['x']) == true as a problem. That particular use case seems unlikely. I would say that anyone putting 'x' in a Set as a special marker for something should use a Symbol :x instead. But maybe this was just intended as a minimal example and there are related real practical problems lurking...

The RFCs for UTF-8-aware wire-formats like HTTP and JSON are careful to ensure that most parsing can be done in the octet domain (ignoring UTF-8) and decoding of UTF-8 can be left to higher layers. The new CodeUnits wrapper is nice for parsers like this, but it isn't nice to have to write UInt8('[') every time I want to write c == '['.

However, in my parser module I can always define something like this:

==(a, b) = Base.isequal(a, b)
!=(a, b) = !(a == b)
==(i::T, c::Char) where T <: Integer = Base.isequal(i, T(c))
in(a, b) = Base.in(a, b)
in(a::T, b::StepRange{Char,S}) where {S, T <: Integer} = Base.in(a, StepRange{T, S}(b))

The argument that stuff not needed by base needn't be in Base is a sensible one. Maybe someone should build an ASCII.jl package that adds all the methods that "make obvious sense" when you are writing a byte parser, but are not helpful if you're doing linear algebra.

stevengj · 2018-02-05T13:26:12Z

However, in my parser module I can always define something like this:

This is type piracy, which means it can easily conflict with definitions in other modules.

samoconnor · 2018-02-05T13:29:36Z

Well, I'm not importing and modifying Base.==, I'm defining a non-expoted local MyModule.== who's default method is to call Base.==. If I understand correctly, this shouldn't have any impact outside MyModule.

JeffBezanson · 2018-02-05T18:18:16Z

I think we just made a judgment call that characters and integers aren't the same kind of thing in general, and so should not be considered equal. It might make sense to add methods to isdigit et. al. for integers though, since those functions are explicitly for classifying characters --- there's nothing else the call could mean.

stevengj · 2018-02-05T18:49:13Z

+1 for integer (or at least UInt8) versions of these functions

StefanKarpinski · 2018-02-06T01:36:46Z

Disambiguating question: when one writes isdigit(b::UInt8) is b being interpreted as a code point or a code unit? In this particular case, they give the same answer; perhaps it doesn't matter.

stevengj · 2018-02-06T02:38:09Z

A codepoint. Fortunately, because of the self-synchronizing property of UTF-8, you will never get false positives for codeunits that are part of multi-byte encoded codepoints.

samoconnor · 2018-02-06T04:06:34Z

Right, the RFC says: "[UTF-8] has the quality of preserving the full US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for a US-ASCII character, and nothing else."

StefanKarpinski · 2018-02-06T13:32:26Z

It seems like we should just broaden the character predicates to take integer code point arguments. This has the advantage that for some integer types like bytes, the answer is considerably easier to compute that way. There's probably also some optimization work to be done on conversion from UInt8 to Char. I did play around with some hand optimizations but it was not clearly better than what LLVM was generating already—LLVM is really good at partially specializing bit manipulation code on types and known values. Specifically, I tried these definitions:

function Char(b::UInt8)
    u = b % UInt32
    b < 0x80 && return reinterpret(Char, u << 24)
    m = 0xc2800000 + (UInt32(b ≥ 0xc0) << 24)
    reinterpret(Char, ((u & 0x3f) << 16) | m)
end

function Char(b::Int8)
    0 ≤ b || code_point_err(u)::Union{}
    reinterpret(Char, (b % UInt32) << 24)
end

But the generated code is nearly identical to the above for the existing definitions.

nalimilan · 2018-02-06T14:58:27Z

It sounds too bad that we'd need to allow things like isdigit(1) === false because of performance. It could be confusing for people who don't know what's a codepoint and it's inconsistent with the separation of characters and integers in the rest of the API.

The ideal solution would be that the compiler removes unnecessary conversions. Given how smart LLVM can be, I'm surprised that even simple things like code_native(x->Char(x) == '0', (UInt8,)) generate so many instructions. (But I really can't help improving that...)

StefanKarpinski · 2018-02-06T15:02:40Z

It's because the Int8 to Char conversion is no longer just a zero extension, although it should be possible for the compiler to figure out that for fixes characters to compare with a lot of this work doesn't matter. Given that the fast path here is just compare and shift, it's worth looking at if this is actually a performance issue or not. Did you benchmark it by any chance, @samoconnor?

StefanKarpinski · 2018-02-06T15:05:22Z

A thought that @Keno and I talked about at some point was having a CodePoint type that represents a code point in a way that is decoupled from the representation of Char, which would allow expressing this as isdigit(CodePoint(b)). A slight wrinkle is that CodePoint is essentially the same as a as-yet-hypothetical CharU32 representing a character encoded in UTF-32.

Keno · 2018-02-06T16:13:46Z

Oh yeah, this was basically the reason I wanted that (because I didn't like defining these predicate functions on integers).

stevengj · 2018-02-06T16:44:08Z

myisdigit(x::UInt8) = UInt8('0') ≤ x ≤ UInt8('9') seems to be about 1.7× faster than isdigit(Char(x)) on my machine with a recent Julia 0.7.

samoconnor · 2018-02-07T00:06:03Z

@stevengj Note that x in UInt8('0'):UInt8('9') seems to produce less instructions than UInt8('0') ≤ x ≤ UInt8('9') (see dump in original post above).

@StefanKarpinski, I haven't done a specific benchmark on this. I've been benchmarking and studying code_native in my LazyJSON code. I noticed that a function I thought should reduce down to only a handful of instructions was a bit big and had large int constants like $3229614080 from the Unicode code.

[Aside: I'm making progress, in both speed and memory allocation:]

Access value close to start:
LazyJSON.jl:  0.000558 seconds (3.23 k allocations: 121.719 KiB)
JSON.jl:      1.025800 seconds (9.34 M allocations: 913.682 MiB, 12.53% gc time)

Access 2 values close to end:
LazyJSON.jl:  0.162446 seconds (2.66 k allocations: 71.250 KiB)
JSON.jl:      1.019581 seconds (9.34 M allocations: 913.682 MiB, 12.22% gc time)

Another consideration is how the definition works alongside other code. e.g.

if !isdigit(c)  ||
    isjsonws(c) ||
    c == '}'    ||
    c == ']'

I'm guessing that if isdigit has a nice simple definition the compiler/llvm has a better chance of reducing the whole expression to something optimal. I would worry that in the worst case a more complex unicode isdigit would not get inlined and you might end up with a function call in an inner loop.

samoconnor · 2018-02-08T22:08:25Z

LLVM does (or did) have a pass to convert a sequence of integer comparisons to a jump table.
(from #18285 (comment))

stevengj · 2018-02-09T13:09:34Z

x in UInt8('0'):UInt8('9') seems to produce less instructions than UInt8('0') ≤ x ≤ UInt8('9')

I think that's because ... ≤ x ≤ … is short-circuiting; range containment seems equivalent to ifelse(UInt8('0') ≤ x, x ≤ UInt8('9'), false). According to @btime, however, they are both the same speed, presumably due to branch prediction.

To get a benchmark that benefits less from branch prediction (which makes doing the same isdigit call over and over deceptive), I tried benchmarking sum(isdigit, b) for b = [UInt8(rand(Bool) ? '3' : 'a') for i=1:1000]. Results:

sum(b): 74ns
isdigit(Char(x)): 764ns
UInt8('0') ≤ x ≤ UInt8('9'): 285ns
x in UInt8('0'):UInt8('9') or ifelse version: 216ns

So, in the absence of branch prediction, you can get a 4–5× speedup (not counting the time for the sum operations) by operating directly on bytes.

nalimilan added the strings "Strings!" label Feb 5, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

isdigit(Char(::UInt8)) performance regression? #25883

isdigit(Char(::UInt8)) performance regression? #25883

samoconnor commented Feb 5, 2018

samoconnor commented Feb 5, 2018

ararslan commented Feb 5, 2018

samoconnor commented Feb 5, 2018

stevengj commented Feb 5, 2018 •

edited

Loading

samoconnor commented Feb 5, 2018

JeffBezanson commented Feb 5, 2018

stevengj commented Feb 5, 2018 •

edited

Loading

StefanKarpinski commented Feb 6, 2018 •

edited

Loading

stevengj commented Feb 6, 2018

samoconnor commented Feb 6, 2018

StefanKarpinski commented Feb 6, 2018 •

edited

Loading

nalimilan commented Feb 6, 2018

StefanKarpinski commented Feb 6, 2018

StefanKarpinski commented Feb 6, 2018

Keno commented Feb 6, 2018

stevengj commented Feb 6, 2018 •

edited

Loading

samoconnor commented Feb 7, 2018

samoconnor commented Feb 8, 2018

stevengj commented Feb 9, 2018 •

edited

Loading

isdigit(Char(::UInt8)) performance regression? #25883

isdigit(Char(::UInt8)) performance regression? #25883

Comments

samoconnor commented Feb 5, 2018

samoconnor commented Feb 5, 2018

ararslan commented Feb 5, 2018

samoconnor commented Feb 5, 2018

stevengj commented Feb 5, 2018 • edited Loading

samoconnor commented Feb 5, 2018

JeffBezanson commented Feb 5, 2018

stevengj commented Feb 5, 2018 • edited Loading

StefanKarpinski commented Feb 6, 2018 • edited Loading

stevengj commented Feb 6, 2018

samoconnor commented Feb 6, 2018

StefanKarpinski commented Feb 6, 2018 • edited Loading

nalimilan commented Feb 6, 2018

StefanKarpinski commented Feb 6, 2018

StefanKarpinski commented Feb 6, 2018

Keno commented Feb 6, 2018

stevengj commented Feb 6, 2018 • edited Loading

samoconnor commented Feb 7, 2018

samoconnor commented Feb 8, 2018

stevengj commented Feb 9, 2018 • edited Loading

stevengj commented Feb 5, 2018 •

edited

Loading

stevengj commented Feb 5, 2018 •

edited

Loading

StefanKarpinski commented Feb 6, 2018 •

edited

Loading

StefanKarpinski commented Feb 6, 2018 •

edited

Loading

stevengj commented Feb 6, 2018 •

edited

Loading

stevengj commented Feb 9, 2018 •

edited

Loading