Workaround for missing str.isascii() in Python 3.6 #389
This would allow for checking whether `host` contains only ASCII characters on Python 3.6 and 3.5.
Performance tests with `%timeit` in `ipython` on Python 3.6 show that this check takes about 0.18 μs if the first character in `host` is non-ASCII, 0.87 μs if the 10th character is the first non-ASCII character, and 1.46 μs if the 20th character is the first non-ASCII character. The times are about the same if `host` is purely ASCII and 1, 10 or 20 characters long, respectively.
While this is quite a bit slower than `str.isascii()` on Python 3.8 on the same machine (about 0.038 μs, independent of the length or the position of the characters), it is about 25 times faster than running IDNA encoding needlessly: for 20 characters, `idna.encode(host, uts46=True).decode("ascii")` takes about 40 μs if `host` is ASCII.
If a non-ASCII character is found, the added time is negligible compared to the time needed for encoding: on 20 characters, encoding takes 64 μs if one character is non-ASCII and about 85 - 150 μs if the string contains only non-ASCII characters (there seems to be quite a spread depending on the characters used). So the check adds about 0.1 - 2.3 % more time, depending on where the first non-ASCII character is placed and how many there are.
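As a rough illustration of the kind of check described here, the following is a minimal sketch, not the code from this PR. The function name `is_ascii` is hypothetical. It falls back to a character loop on interpreters without `str.isascii()` and returns as soon as the first non-ASCII character is seen, which is why the measured time depends on that character's position.

```python
import sys


def is_ascii(host: str) -> bool:
    """Return True if every character in host is an ASCII character.

    Hypothetical helper for illustration; not taken from the PR diff.
    """
    if sys.version_info >= (3, 7):
        # str.isascii() is available from Python 3.7 on.
        return host.isascii()
    # Fallback for older interpreters: stop at the first non-ASCII character.
    for char in host:
        if ord(char) > 127:
            return False
    return True
```

An all-ASCII string is the slowest case for the fallback, since the loop has to visit every character before it can return `True`.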
Codecov Report
@@ Coverage Diff @@
## master #389 +/- ##
=========================================
- Coverage 99.54% 99.24% -0.3%
=========================================
Files 2 2
Lines 660 664 +4
Branches 150 152 +2
=========================================
+ Hits 657 659 +2
- Misses 3 5 +2
Dumb question on my part: Where do I find the |
Heh, the CHANGES folder is missing. The problem with the fix is performance, I guess.
Lexical comparison of two single letter strings ("characters") looks to be faster than first calling `ord()` on the character and doing a numerical comparison.
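To make the two variants concrete, here is a small sketch under stated assumptions: the function names and the `time_both` helper are hypothetical, and the timings it returns will differ between machines and Python versions. Single-character strings compare by code point, so `ch > "\x7f"` is true exactly when `ord(ch) > 127`.

```python
import timeit


def is_ascii_ord(s: str) -> bool:
    """Variant that converts each character to its code point first."""
    for ch in s:
        if ord(ch) > 127:
            return False
    return True


def is_ascii_lexical(s: str) -> bool:
    """Variant that compares each character to "\\x7f" as a string."""
    for ch in s:
        if ch > "\x7f":
            return False
    return True


def time_both(s: str, number: int = 100_000) -> dict:
    """Return the total run time in seconds of both variants for `number` calls."""
    return {
        "ord": timeit.timeit(lambda: is_ascii_ord(s), number=number),
        "lexical": timeit.timeit(lambda: is_ascii_lexical(s), number=number),
    }
```

Calling `time_both("a" * 20)` gives a quick comparison for an all-ASCII string of 20 characters; both functions always return the same result, only the run time differs.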
I ran some more performance tests in order to check whether string comparisons would be faster than first calling First, from what I saw, running the loop with the lexical comparison ( As the test is run in order to determine whether
Based on my test results (admittedly not very comprehensive and limited to two machines with a similar setup), the answer to the first question is "Yes". Using string comparison, the test is about 50 to 70 times faster than encoding the string if there are only ASCII characters in the string. That is a reduction of at least 98 % in the time needed.
Obviously, an all-ASCII string is the worst-case scenario for the test, as we have to loop over the whole string only to find nothing. The earlier a non-ASCII character appears in the string, the sooner the test returns. In contrast, the time needed to encode a string seems to go up with the number of non-ASCII characters, independent of their position. It also seems to depend on which Unicode characters appear.
In those cases where only the very last character was a Unicode character (bad for the test, good for the encoding), the test added only around 1.1 % (0.88 % - 1.28 %) to the time the encoding needed anyway. If the test string contains only Unicode characters, the added compute time can be as low as 0.04 % for long strings (40 characters) and is still only 0.3 % for short strings (5 characters).
So while the test is expensive when compared to
Personally, I would think that this trade-off is acceptable, unless the majority of yarl users use it for non-ASCII host names most of the time.
For completeness' sake (and in order to make my test results verifiable), here are the tests and their results (run on an AMD Phenom II 1090T):
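The trade-off discussed above can be reproduced with a small harness along the following lines. This is a sketch under stated assumptions, not the tests referred to in this comment: `encode_host` and `relative_overhead` are hypothetical names, and Python's built-in `"idna"` codec (IDNA 2003) is used as a stand-in for `idna.encode(host, uts46=True)` so that the example needs no third-party package. Absolute numbers will therefore differ from the ones quoted in this thread.

```python
import timeit


def is_ascii_lexical(host: str) -> bool:
    """Return True if host contains only ASCII characters."""
    for ch in host:
        if ch > "\x7f":
            return False
    return True


def encode_host(host: str) -> str:
    """Skip the encoding step entirely for all-ASCII host names."""
    if is_ascii_lexical(host):
        return host
    # Stand-in encoder (assumption): the stdlib "idna" codec,
    # not the third-party idna package discussed in this PR.
    return host.encode("idna").decode("ascii")


def relative_overhead(host: str, number: int = 10_000) -> float:
    """Return the time of the ASCII check divided by the time of encoding."""
    check = timeit.timeit(lambda: is_ascii_lexical(host), number=number)
    encode = timeit.timeit(lambda: host.encode("idna"), number=number)
    return check / encode
```

For example, `relative_overhead("a" * 19 + "\u00fc")` measures the case where only the last of 20 characters is non-ASCII, which is the least favourable position for the check.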
Sorry for the delay.
The function can be backported to py 3.5/3.6 easily, but I don't have the motivation to do it.
Thanks for the contribution!
You're welcome! Thanks for merging!
Two things:
- isascii on strings is actually 3.7+
- 私 was becoming 代名詞
It turns out a small number of words (私, 君, 余, but not 僕 etc.) have a lemma that looks like 私-代名詞. This is weird.
Related issue number
#388