Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround for missing str.isascii() in Python 3.6 #389

Merged
merged 2 commits into from
Dec 4, 2019

Conversation

Adaephon-GH
Copy link
Contributor

@Adaephon-GH Adaephon-GH commented Nov 29, 2019

What do these changes do?

This would allow for checking if host contains only ASCII characters with Python 3.6 and 3.5.

Are there changes in behavior for the user?

Performance tests with %timeit in ipython on Python 3.6 show that this check takes about 0.18 μs, if the first character in host is non-ASCII. 0.87 μs if the 10th character is the first non-ASCII character and 1.46 μs if the 20th character is non-ASCII. The times are about the same, if host is purely ASCII and 1, 10 or 20 characters long, respectively.

While this is quite a bit slower than str.isascii() on Python 3.8 on the same machine (about 0.038 μs, independ of length or position of the characters) it is about 25 times faster than running IDNA encoding needlessly: for 20 characters idna.encode(host, uts46=True).decode("ascii") takes about 40 μs if host is ASCII.

If some unicode character is found, the added time is negligible in comparison to the time needed for encoding: on 20 characters it takes 64 μs if one character is Unicode and about 85 - 150 μs if it contains only Unicode characters (There seems to be quite a spread depending on the characters used). So about 0.1 - 2.3 % more time, depending on where the first Unicode character is placed and how many there ares.

Related issue number

#388

Checklist

  • I think the code is well written
  • Unit tests for the changes exist (Tested by existing tests)
  • Documentation reflects the changes (No effect on user facing documentation)

This would allow for checking if `host` contains only ASCII characters with Python 3.6 and 3.5. 

Performance tests with `%timeit` in `ipython` on Python 3.6 show that this check takes about 0.18 μs, if the first character in `host` is non-ASCII. 0.87 μs if the 10th character is the first non-ASCII character and 1.46 μs if the 20th character is non-ASCII. The times are about the same, if `host` is purely ASCII and 1, 10 or 20 characters long, respectively. 

While this is quite a bit slower than `str.isascii()` on Python 3.8 on the same machine (about 0.038 μs, independ of length or position of the characters) it is about 25 times faster than running IDNA encoding needlessly: for 20 characters `idna.encode(host, uts46=True).decode("ascii")` takes about 40 μs if `host` is ASCII.

If some unicode character is found, the added time is negligible in comparison to the time needed for encoding: on 20 characters it takes 64 μs if one character is Unicode and about 85 - 150 μs if it contains only Unicode characters (There seems to be quite a spread depending on the characters used). So about 0.1 - 2.3 % more time, depending on where the first Unicode character is placed and how many there ares.
@codecov
Copy link

codecov bot commented Nov 29, 2019

Codecov Report

Merging #389 into master will decrease coverage by 0.29%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #389     +/-   ##
=========================================
- Coverage   99.54%   99.24%   -0.3%     
=========================================
  Files           2        2             
  Lines         660      664      +4     
  Branches      150      152      +2     
=========================================
+ Hits          657      659      +2     
- Misses          3        5      +2
Impacted Files Coverage Δ
yarl/__init__.py 98.99% <100%> (-0.4%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8218ec7...611c3e3. Read the comment docs.

@Adaephon-GH
Copy link
Contributor Author

Dumb question on my part: Where do I find the CHANGES folder?

@asvetlov
Copy link
Member

asvetlov commented Dec 1, 2019

Heh, CHANGES folder is missing.
Need to setup towncrier for the project eventually.

The problem of the fix is a performance I guess.
str.isascii() is a highly optimized C code, it doesn't have a loop at all.
The loop in python is super expensive, I expect that the PR works slower than the original on Python 3.6, not faster.

Lexical comparison of two single letter strings ("characters") looks to be faster than first calling `ord()` on the character and doing a numerical comparison.
@Adaephon-GH
Copy link
Contributor Author

I ran some more performance tests in order to check, whether string comparisons would be faster than first calling ord(). The test were ran on a different machine, so the times do not match the ones in my initial commit, but the relations match up.

First, from what I saw, running the loop with the lexical comparison (char > '\x7f') always takes less time - sometimes less than half - than calling ord(char) > 127. Hence the update to the pull request. (The loop is also faster than using generator expressions like all(char <= '\x7f' for char in host, which I first suggested in my issue report.)

As the test is ran in order to determine whether idna.encode() has to be run, the questions are:

  1. Is the test faster than idna.encode() in those cases where no encoding is needed, i.e. when host only contains ASCII characters.
  2. How much of an performance impact has the test, when the encoding is required anyway?

Based my test results (admittedly not very comprehensive and limited to two machines with similar setup) the answer to the first question is "Yes". Using string comparison, the test is about 50 to 70 times faster than encoding the string, if there are only ASCII characters in the string. That is a reduction of at least 98 % in the time needed.

Obviously, an all-ASCII string is the worst-case scenario for the test, as we have to loop over the whole string, only to find nothing. The test will return sooner the earlier a non-ASCII character appears in the string. In contrast, the time needed to encode a string seems to go up with the number of non-ASCII characters, although independent of position. It also seems to depend on which Unicode characters appear.

In those cases where only the very last character was an Unicode character (bad for the test, good for the encoding), the test added only around 1.1 % (0.88 % - 1.28 %) of the time the encoding needed anyway.

If the test string only contains Unicode characters, the added compute time can be as low as 0.04 % for long strings (40 characters) and still only 0.3 % for short strings (5 characters).

So while the test is expensive when compared to str.isascii(), it is far less expensive when compared to the idna.encode(). It either reduces the time needed to encode host by at least 98% (for all-ASCII strings) or it adds at most 1.3 % to the time needed (and that only strings with few Unicode characters near the end of the string).

Personally, I would think that this trade-off is acceptable, unless the majority of users of yarl use it for non-ASCII host names most of the time.

For completeness sake (and in order to make my test results verifiable) here are the tests and their results (ran on a AMD Phenom II 1090T):

% ipython
Python 3.6.1 (default, Apr 22 2017, 00:43:00) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.10.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import idna                                                                                                                                                                   

In [2]: def check_lex(s): 
   ...:     for c in s: 
   ...:         if c > '\x7f': 
   ...:             return False 
   ...:     return True 
   ...:                                                                                                                                                                               

In [3]: def check_ord(s): 
   ...:     for c in s: 
   ...:         if ord(c) > 127: 
   ...:             return False 
   ...:     return True 
   ...:                                                                                                                                                                               

In [4]: # 40-letter string, all ASCII                                                                                                                                                 

In [5]: host = 'abcdefghijklmnopqrstabcdefghijklmnopqrst'                                                                                                                             

In [6]: %timeit check_lex(host)                                                                                                                                                       
1.99 µs ± 5.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: %timeit check_ord(host)                                                                                                                                                       
4.09 µs ± 16.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [8]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                         
116 µs ± 436 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [9]: # 40-letter string, all ASCII except the last letter                                                                                                                          

In [10]: host = 'abcdefghijklmnopqrstabcdefghijklmnopqrsλ'                                                                                                                            

In [11]: %timeit check_lex(host)                                                                                                                                                      
2.09 µs ± 13.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit check_ord(host)                                                                                                                                                      
4.17 µs ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [13]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
163 µs ± 509 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [14]: # 40-letter string, all ASCII except the 21st letter                                                                                                                         

In [15]: host = 'abcdefghijklmnopqrstλabcdefghijklmnopqrs'                                                                                                                            

In [16]: %timeit check_lex(host)                                                                                                                                                      
1.2 µs ± 5.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [17]: %timeit check_ord(host)                                                                                                                                                      
2.34 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [18]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
161 µs ± 187 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [19]: # 40-letter string, no ASCII letters                                                                                                                                         

In [20]: host = 'αβγδεζηθικλμνξοπρςσταβγδεζηθικλμνξοπρςστ'                                                                                                                            

In [21]: %timeit check_lex(host)                                                                                                                                                      
241 ns ± 0.676 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [22]: %timeit check_ord(host)                                                                                                                                                      
330 ns ± 19.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [23]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
636 µs ± 722 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [24]: # 5-letter string, all ASCII                                                                                                                                                 

In [25]: host = "abcde"                                                                                                                                                               

In [26]: %timeit check_lex(host)                                                                                                                                                      
423 ns ± 0.631 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [27]: %timeit check_ord(host)                                                                                                                                                      
672 ns ± 3.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [28]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
28.9 µs ± 67.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [29]: # 5-letter string, no ASCII letters                                                                                                                                          

In [30]: host = "αβγδε"                                                                                                                                                               

In [31]: %timeit check_lex(host)                                                                                                                                                      
249 ns ± 3.54 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [32]: %timeit check_ord(host)                                                                                                                                                      
316 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [33]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
79 µs ± 59.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [34]: # 5-letter string, no ASCII letters but different alphabet                                                                                                                   

In [35]: host = "äöüßÄ"                                                                                                                                                               

In [36]: %timeit check_lex(host)                                                                                                                                                      
228 ns ± 0.876 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [37]: %timeit check_ord(host)                                                                                                                                                      
280 ns ± 0.698 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [38]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
69.1 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [39]: # 5-letter string, all ASCII except the last letter                                                                                                                          

In [40]: host = "abcdλ"                                                                                                                                                               

In [41]: %timeit check_lex(host)                                                                                                                                                      
486 ns ± 1.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [42]: %timeit check_ord(host)                                                                                                                                                      
727 ns ± 7.32 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [43]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
50.1 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [44]: # 5-letter string, all ASCII except the last letter                                                                                                                          

In [45]: %timeit check_lex(host)                                                                                                                                                      
414 ns ± 3.58 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [46]: %timeit check_ord(host)                                                                                                                                                      
685 ns ± 7.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [47]: %timeit idna.encode(host, uts46=True).decode("ascii")                                                                                                                        
47.3 µs ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@asvetlov
Copy link
Member

asvetlov commented Dec 4, 2019

Sorry for the delay.
Ok, if you see the speedup let's merge the PR.

str.isascii() is the shortest function in the Python source code, it is barely

static PyObject *
unicode_isascii_impl(PyObject *self)
{
    if (PyUnicode_READY(self) == -1) {
        return NULL;
    }
    return PyBool_FromLong(PyUnicode_IS_ASCII(self));
}

The function can be backported to py 3.5/3.6 easily but I don't have a motivaton to do it.
Python 3.6 is slowly passing away, I don't want to spend my personal time on improving the speed of outdated python versions.

@asvetlov asvetlov merged commit 879a3f6 into aio-libs:master Dec 4, 2019
@asvetlov
Copy link
Member

asvetlov commented Dec 4, 2019

thanks for the contribution!

@Adaephon-GH
Copy link
Contributor Author

You're welcome! Thanks for merging!

@Adaephon-GH Adaephon-GH deleted the patch-1 branch December 4, 2019 19:27
krackers referenced this pull request in polm/cutlet Jul 12, 2020
Two things:

- isascii on strings is actuall 3.7+
- 私 was becoming 代名詞

It turns out a small number of words - 私, 君, 余, but not 僕 etc. -
have a lemma that looks like 私-代名詞. This is weird.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants