Workaround for missing str.isascii() in Python 3.6 #389

Adaephon-GH · 2019-11-29T23:14:51Z

What do these changes do?

This would allow for checking if host contains only ASCII characters with Python 3.6 and 3.5.

Are there changes in behavior for the user?

Performance tests with %timeit in ipython on Python 3.6 show that this check takes about 0.18 μs, if the first character in host is non-ASCII. 0.87 μs if the 10th character is the first non-ASCII character and 1.46 μs if the 20th character is non-ASCII. The times are about the same, if host is purely ASCII and 1, 10 or 20 characters long, respectively.

While this is quite a bit slower than str.isascii() on Python 3.8 on the same machine (about 0.038 μs, independ of length or position of the characters) it is about 25 times faster than running IDNA encoding needlessly: for 20 characters idna.encode(host, uts46=True).decode("ascii") takes about 40 μs if host is ASCII.

If some unicode character is found, the added time is negligible in comparison to the time needed for encoding: on 20 characters it takes 64 μs if one character is Unicode and about 85 - 150 μs if it contains only Unicode characters (There seems to be quite a spread depending on the characters used). So about 0.1 - 2.3 % more time, depending on where the first Unicode character is placed and how many there ares.

Related issue number

#388

Checklist

I think the code is well written
Unit tests for the changes exist (Tested by existing tests)
Documentation reflects the changes (No effect on user facing documentation)

This would allow for checking if `host` contains only ASCII characters with Python 3.6 and 3.5. Performance tests with `%timeit` in `ipython` on Python 3.6 show that this check takes about 0.18 μs, if the first character in `host` is non-ASCII. 0.87 μs if the 10th character is the first non-ASCII character and 1.46 μs if the 20th character is non-ASCII. The times are about the same, if `host` is purely ASCII and 1, 10 or 20 characters long, respectively. While this is quite a bit slower than `str.isascii()` on Python 3.8 on the same machine (about 0.038 μs, independ of length or position of the characters) it is about 25 times faster than running IDNA encoding needlessly: for 20 characters `idna.encode(host, uts46=True).decode("ascii")` takes about 40 μs if `host` is ASCII. If some unicode character is found, the added time is negligible in comparison to the time needed for encoding: on 20 characters it takes 64 μs if one character is Unicode and about 85 - 150 μs if it contains only Unicode characters (There seems to be quite a spread depending on the characters used). So about 0.1 - 2.3 % more time, depending on where the first Unicode character is placed and how many there ares.

codecov · 2019-11-29T23:17:23Z

Codecov Report

Merging #389 into master will decrease coverage by 0.29%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master     #389     +/-   ##
=========================================
- Coverage   99.54%   99.24%   -0.3%     
=========================================
  Files           2        2             
  Lines         660      664      +4     
  Branches      150      152      +2     
=========================================
+ Hits          657      659      +2     
- Misses          3        5      +2

Impacted Files	Coverage Δ
yarl/__init__.py	`98.99% <100%> (-0.4%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8218ec7...611c3e3. Read the comment docs.

Adaephon-GH · 2019-11-29T23:29:15Z

Dumb question on my part: Where do I find the CHANGES folder?

asvetlov · 2019-12-01T11:09:20Z

Heh, CHANGES folder is missing.
Need to setup towncrier for the project eventually.

The problem of the fix is a performance I guess.
str.isascii() is a highly optimized C code, it doesn't have a loop at all.
The loop in python is super expensive, I expect that the PR works slower than the original on Python 3.6, not faster.

Lexical comparison of two single letter strings ("characters") looks to be faster than first calling `ord()` on the character and doing a numerical comparison.

Adaephon-GH · 2019-12-01T23:52:55Z

I ran some more performance tests in order to check, whether string comparisons would be faster than first calling ord(). The test were ran on a different machine, so the times do not match the ones in my initial commit, but the relations match up.

First, from what I saw, running the loop with the lexical comparison (char > '\x7f') always takes less time - sometimes less than half - than calling ord(char) > 127. Hence the update to the pull request. (The loop is also faster than using generator expressions like all(char <= '\x7f' for char in host, which I first suggested in my issue report.)

As the test is ran in order to determine whether idna.encode() has to be run, the questions are:

Is the test faster than idna.encode() in those cases where no encoding is needed, i.e. when host only contains ASCII characters.
How much of an performance impact has the test, when the encoding is required anyway?

Based my test results (admittedly not very comprehensive and limited to two machines with similar setup) the answer to the first question is "Yes". Using string comparison, the test is about 50 to 70 times faster than encoding the string, if there are only ASCII characters in the string. That is a reduction of at least 98 % in the time needed.

Obviously, an all-ASCII string is the worst-case scenario for the test, as we have to loop over the whole string, only to find nothing. The test will return sooner the earlier a non-ASCII character appears in the string. In contrast, the time needed to encode a string seems to go up with the number of non-ASCII characters, although independent of position. It also seems to depend on which Unicode characters appear.

In those cases where only the very last character was an Unicode character (bad for the test, good for the encoding), the test added only around 1.1 % (0.88 % - 1.28 %) of the time the encoding needed anyway.

If the test string only contains Unicode characters, the added compute time can be as low as 0.04 % for long strings (40 characters) and still only 0.3 % for short strings (5 characters).

So while the test is expensive when compared to str.isascii(), it is far less expensive when compared to the idna.encode(). It either reduces the time needed to encode host by at least 98% (for all-ASCII strings) or it adds at most 1.3 % to the time needed (and that only strings with few Unicode characters near the end of the string).

Personally, I would think that this trade-off is acceptable, unless the majority of users of yarl use it for non-ASCII host names most of the time.

For completeness sake (and in order to make my test results verifiable) here are the tests and their results (ran on a AMD Phenom II 1090T):

% ipython
Python 3.6.1 (default, Apr 22 2017, 00:43:00) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.10.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import idna                                                                                                                                                                   

In [2]: def check_lex(s): 
   ...:     for c in s: 
   ...:         if c > '\x7f': 
   ...:             return False 
   ...:     return True 
   ...:                                                                                                                                                                               

In [3]: def check_ord(s): 
   ...:     for c in s: 
   ...:         if ord(c) > 127: 
   ...:             return False 
   ...:     return True 
   ...:                                                                                                                                                                               

In [4]: # 40-letter string, all ASCII                                                                                                                                                 

In [5]: host = 'abcdefghijklmnopqrstabcdefghijklmnopqrst'                                                                                                                             

In [6]: %timeit check_lex(host)                                                                                                                                                       
1.99 µs ± 5.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: %timeit check_ord(host)                                                                                                                                                       
4.09 µs ± 16.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [8]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                         
116 µs ± 436 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [9]: # 40-letter string, all ASCII except the last letter                                                                                                                          

In [10]: host = 'abcdefghijklmnopqrstabcdefghijklmnopqrsλ'                                                                                                                            

In [11]: %timeit check_lex(host)                                                                                                                                                      
2.09 µs ± 13.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit check_ord(host)                                                                                                                                                      
4.17 µs ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [13]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
163 µs ± 509 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [14]: # 40-letter string, all ASCII except the 21st letter                                                                                                                         

In [15]: host = 'abcdefghijklmnopqrstλabcdefghijklmnopqrs'                                                                                                                            

In [16]: %timeit check_lex(host)                                                                                                                                                      
1.2 µs ± 5.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [17]: %timeit check_ord(host)                                                                                                                                                      
2.34 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [18]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
161 µs ± 187 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [19]: # 40-letter string, no ASCII letters                                                                                                                                         

In [20]: host = 'αβγδεζηθικλμνξοπρςσταβγδεζηθικλμνξοπρςστ'                                                                                                                            

In [21]: %timeit check_lex(host)                                                                                                                                                      
241 ns ± 0.676 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [22]: %timeit check_ord(host)                                                                                                                                                      
330 ns ± 19.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [23]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
636 µs ± 722 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [24]: # 5-letter string, all ASCII                                                                                                                                                 

In [25]: host = "abcde"                                                                                                                                                               

In [26]: %timeit check_lex(host)                                                                                                                                                      
423 ns ± 0.631 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [27]: %timeit check_ord(host)                                                                                                                                                      
672 ns ± 3.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [28]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
28.9 µs ± 67.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [29]: # 5-letter string, no ASCII letters                                                                                                                                          

In [30]: host = "αβγδε"                                                                                                                                                               

In [31]: %timeit check_lex(host)                                                                                                                                                      
249 ns ± 3.54 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [32]: %timeit check_ord(host)                                                                                                                                                      
316 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [33]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
79 µs ± 59.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [34]: # 5-letter string, no ASCII letters but different alphabet                                                                                                                   

In [35]: host = "äöüßÄ"                                                                                                                                                               

In [36]: %timeit check_lex(host)                                                                                                                                                      
228 ns ± 0.876 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [37]: %timeit check_ord(host)                                                                                                                                                      
280 ns ± 0.698 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [38]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
69.1 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [39]: # 5-letter string, all ASCII except the last letter                                                                                                                          

In [40]: host = "abcdλ"                                                                                                                                                               

In [41]: %timeit check_lex(host)                                                                                                                                                      
486 ns ± 1.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [42]: %timeit check_ord(host)                                                                                                                                                      
727 ns ± 7.32 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [43]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
50.1 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [44]: # 5-letter string, all ASCII except the last letter                                                                                                                          

In [45]: %timeit check_lex(host)                                                                                                                                                      
414 ns ± 3.58 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [46]: %timeit check_ord(host)                                                                                                                                                      
685 ns ± 7.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [47]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
47.3 µs ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

asvetlov · 2019-12-04T17:20:36Z

Sorry for the delay.
Ok, if you see the speedup let's merge the PR.

str.isascii() is the shortest function in the Python source code, it is barely

static PyObject *
unicode_isascii_impl(PyObject *self)
{
    if (PyUnicode_READY(self) == -1) {
        return NULL;
    }
    return PyBool_FromLong(PyUnicode_IS_ASCII(self));
}

The function can be backported to py 3.5/3.6 easily but I don't have a motivaton to do it.
Python 3.6 is slowly passing away, I don't want to spend my personal time on improving the speed of outdated python versions.

asvetlov · 2019-12-04T17:21:47Z

thanks for the contribution!

Adaephon-GH · 2019-12-04T19:27:33Z

You're welcome! Thanks for merging!

Two things: - isascii on strings is actuall 3.7+ - 私 was becoming 代名詞 It turns out a small number of words - 私, 君, 余, but not 僕 etc. - have a lemma that looks like 私-代名詞. This is weird.

Do lexical comparison

611c3e3

Lexical comparison of two single letter strings ("characters") looks to be faster than first calling `ord()` on the character and doing a numerical comparison.

asvetlov merged commit 879a3f6 into aio-libs:master Dec 4, 2019

Adaephon-GH deleted the patch-1 branch December 4, 2019 19:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workaround for missing str.isascii() in Python 3.6 #389

Workaround for missing str.isascii() in Python 3.6 #389

Adaephon-GH commented Nov 29, 2019 •

edited by webknjaz

Loading

codecov bot commented Nov 29, 2019 •

edited

Loading

Adaephon-GH commented Nov 29, 2019

asvetlov commented Dec 1, 2019

Adaephon-GH commented Dec 1, 2019

asvetlov commented Dec 4, 2019

asvetlov commented Dec 4, 2019

Adaephon-GH commented Dec 4, 2019

Workaround for missing str.isascii() in Python 3.6 #389

Workaround for missing str.isascii() in Python 3.6 #389

Conversation

Adaephon-GH commented Nov 29, 2019 • edited by webknjaz Loading

What do these changes do?

Are there changes in behavior for the user?

Related issue number

Checklist

codecov bot commented Nov 29, 2019 • edited Loading

Codecov Report

Adaephon-GH commented Nov 29, 2019

asvetlov commented Dec 1, 2019

Adaephon-GH commented Dec 1, 2019

asvetlov commented Dec 4, 2019

asvetlov commented Dec 4, 2019

Adaephon-GH commented Dec 4, 2019

Adaephon-GH commented Nov 29, 2019 •

edited by webknjaz

Loading

codecov bot commented Nov 29, 2019 •

edited

Loading