read(io, Char): fix read with too many leading ones #50552

Merged
merged 3 commits into from
Jul 17, 2023


StefanKarpinski
Member

@StefanKarpinski StefanKarpinski commented Jul 14, 2023

Fixes #50532. The `read(io, Char)` method didn't correctly handle the case where the lead byte starts with too many leading ones; this fix makes it handle that case correctly, which makes `read(io, Char)` match `collect(s)` in its interpretation of what a character is in all invalid cases.
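A minimal sketch of the comparison described above, for anyone who wants to check it locally. The helper `chars_via_read` and the sample byte sequence are illustrative assumptions, not taken from the PR or its tests; on a Julia build that includes this fix, the final line is expected to report that the two decodings agree.

```julia
# Hypothetical helper (not part of Base): decode a string by calling
# read(io, Char) repeatedly until the stream is exhausted.
function chars_via_read(s::AbstractString)
    io = IOBuffer(s)
    chars = Char[]
    while !eof(io)
        push!(chars, read(io, Char))
    end
    return chars
end

# Assumed sample input: 0xf8 has five leading one bits, so it is not a valid
# UTF-8 lead byte. It is followed by three continuation bytes.
s = "\xf8\x9f\x98\x84"

# Compare string iteration against stream reading on the malformed input.
println("collect(s) matches read(io, Char): ", collect(s) == chars_via_read(s))
```

For well-formed UTF-8 the two paths should agree regardless of the fix; the malformed case is the one this PR targets.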

@StefanKarpinski StefanKarpinski requested a review from stevengj July 14, 2023 15:32
@stevengj
Member

Thanks! Now passes my tests from JuliaIO/BufferedStreams.jl#77 where I first noticed this issue.

@stevengj stevengj added io Involving the I/O subsystem: libuv, read, write, etc. strings "Strings!" bugfix This change fixes an existing bug labels Jul 14, 2023
@stevengj
Member

stevengj commented Jul 14, 2023

The following saves one comparison by utilizing `UInt8` wraparound:

-    l::UInt8 = leading_ones(b0)
+    l = 0x08 * (0x04 - (leading_ones(b0) % UInt8))
     c = UInt32(b0) << 24
-    if 2 ≤ l ≤ 4
+    if l < 0x18
         s = 16
-        l = 8(4-l)

(Not sure if it's worth making `s` a `UInt8` as well, since it's compared to `l`?)
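To see why the single comparison can stand in for the range check, the two conditions from the suggestion can be compared over every possible lead byte. This is a sketch only: the function names below are hypothetical and the predicates are lifted out of `read` for illustration.

```julia
# Hypothetical standalone predicates; in the PR these conditions live inline
# inside read(io, Char).

# Range check from the removed lines: two comparisons.
needs_continuation_range(b0::UInt8) = 2 ≤ leading_ones(b0) ≤ 4

# Wraparound form from the suggested lines: one comparison.
function needs_continuation_wrap(b0::UInt8)
    # UInt8 arithmetic wraps modulo 256, so five or more leading ones
    # produce a large value instead of a negative one.
    l = 0x08 * (0x04 - (leading_ones(b0) % UInt8))
    return l < 0x18
end

# Exhaustive check over all 256 byte values.
agree = all(needs_continuation_range(UInt8(b)) == needs_continuation_wrap(UInt8(b))
            for b in 0:255)
println("conditions agree for every byte: ", agree)
```

With zero or one leading ones, `l` is `0x20` or `0x18`; with five or more it wraps to at least `0xe0`. Only two to four leading ones give `0x10`, `0x08`, or `0x00`, which are the values below `0x18`.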

@StefanKarpinski
Member Author

I was kind of hoping the optimizer would figure that out for me, but might as well do it explicitly. This might be what I had intended the code to do originally but got foiled by promotion with Int.
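The promotion effect mentioned here can be illustrated with a few standalone expressions. This is a sketch of Julia's general integer promotion rules, not a reconstruction of the original code; the variable names are made up for the example.

```julia
# leading_ones returns an Int, even for a UInt8 argument.
n = leading_ones(0xf8)               # five leading one bits

# With Int literals the arithmetic is done in Int, so the result goes
# negative and passes a "less than 24" style check.
l_int = 8 * (4 - n)

# With UInt8 operands throughout, the subtraction and multiplication wrap
# modulo 256, so the result is a large unsigned value instead.
l_u8 = 0x08 * (0x04 - (n % UInt8))

# Mixing an Int literal with a UInt8 value promotes to Int.
println(typeof(4 - 0x05))
println(l_int, " ", l_int < 24)
println(l_u8, " ", l_u8 < 0x18)
```

Keeping every operand a `UInt8` literal is what makes the wraparound form work; a single `Int` literal in the expression brings back the signed result.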

@stevengj
Member

Maybe backport to 1.10 at least?

@stevengj stevengj added the backport 1.10 Change should be backported to the 1.10 release label Jul 14, 2023
@StefanKarpinski StefanKarpinski added the backport 1.6 Change should be backported to release-1.6 label Jul 17, 2023
@StefanKarpinski StefanKarpinski merged commit ffe1a07 into master Jul 17, 2023
@StefanKarpinski StefanKarpinski deleted the sk/fix-read-invalid-char branch July 17, 2023 14:16
KristofferC pushed a commit that referenced this pull request Jul 17, 2023
Fixes #50532. The `read(io, Char)` method didn't correctly handle the
case where the lead byte starts with too many leading ones; this fix
makes it handle that case correctly, which makes `read(io, Char)` match
`collect(s)` in its interpretation of what a character is in all invalid
cases. Also fix and test `read(::File, Char)` which has the same bug.

(cherry picked from commit ffe1a07)
KristofferC added a commit that referenced this pull request Jul 24, 2023
Backported PRs:
- [x] #50411 <!-- Fix weird dispatch of * with zero arguments -->
- [x] #50202 <!-- Remove dynamic dispatch from _wait/wait2 -->
- [x] #50064 <!-- Fix numbered prompt with input only with comment -->
- [x] #50026 <!-- Store heapsnapshot files in tempdir() instead of
current directory -->
- [x] #50402 <!-- Add CPU feature helper function -->
- [x] #50387 <!-- update newpages pointer after actually sweeping pages
-->
- [x] #50424 <!-- avoid potential type-instability in _replace_(str,
...) -->
- [x] #50444 <!-- Optimize getfield lowering to avoid boxing in some
cases -->
- [x] #50474 <!-- docs: Fix a `!!! note` which was miscapitalized -->
- [x] #50466 <!-- relax assertion involving pg->nold to reflect that it
may be a bit in… -->
- [x] #50490 <!-- Fix compat annotation for italic printstyled -->
- [x] #50488 <!-- fix typo in `Base.isassigned` with `Tridiagonal` -->
- [x] #50476 <!-- Profile: Add specifying dir for `take_heap_snapshot`
and handling if current dir is unwritable -->
- [x] #50461 <!-- fix typo in the --gcthreads argument description -->
- [x] #50528 <!-- ssair: Correctly handle stmt insertion at end of basic
block -->
- [x] #50533 <!-- ensure internal_obj_base_ptr checks whether objects
past freelist pointer are in freelist -->
- [x] #49322 <!-- improve cat design / performance -->
- [x] #50540 <!-- gc: remove over-eager assertion -->
- [x] #50542 <!-- gf: remove unnecessary assert cycle==depth -->
- [x] #50559 <!-- Expand kwcall lowering positional default check to
vararg -->
- [x] #50058 <!-- Add unwrapping mechanism for triangular mul and solves
-->
- [x] #50551 <!-- typeintersect: also record chained `innervars` -->
- [x] #50552 <!-- read(io, Char): fix read with too many leading ones
-->
- [x] #50541 <!-- precompile: ensure globals are not accidentally
created where disallowed -->
- [x] #50576 <!-- use atomic compare exchange when setting the GC
mark-bit -->
- [x] #50578 <!-- gf: make method overwrite/delete an error during
precompile -->
- [x] #50516 <!-- Fix visibility of assert on GCC12/13 -->
- [x] #50597 <!-- Fix memory corruption if task is launched inside
finalizer -->
- [x] #50591 <!-- build: fix various makefile bugs -->
- [x] #50599 <!-- faster invalid object lookup in conservative gc -->
- [x] #50634 <!-- 🤖 [master] Bump the SparseArrays stdlib from b4b0e72
to 99c99b4 -->
- [x] #50639 <!-- Backport LLVM patches to fix various issues. -->
- [x] #50546 <!-- Revert storage of method instance in LineInfoNode -->
- [x] #50631 <!-- Shift DCE pass to optimize imaging mode code better
-->
- [x] #50525 <!-- only check that values are finite in `generic_lufact`
when `check=true` -->
- [x] #50587 <!-- isassigned for ranges with BigInt indices -->
- [x] #50144 <!-- Page based heap size heuristics -->


Need manual backport:
- [ ] #50595 <!-- Rename ENV variable `JULIA_USE_NEW_PARSER` ->
`JULIA_USE_FLISP_PARSER` -->



Non-merged PRs with backport label:
- [ ] #50637 <!-- Remove SparseArrays legacy code -->
- [ ] #50618 <!-- inference: continue const-prop' when concrete-eval
returns non-inlineable -->
- [ ] #50598 <!-- only limit types in stack traces in the REPL -->
- [ ] #50594 <!-- Disallow non-index Integer types in isassigned -->
- [ ] #50568 <!-- `Array(::AbstractRange)` should return an `Array` -->
- [ ] #50523 <!-- Avoid generic call in most cases for getproperty -->
- [ ] #50172 <!-- print feature flags used for matching pkgimage -->
@KristofferC KristofferC removed the backport 1.10 Change should be backported to the 1.10 release label Jul 24, 2023
Labels
backport 1.6 Change should be backported to release-1.6 bugfix This change fixes an existing bug io Involving the I/O subsystem: libuv, read, write, etc. strings "Strings!"
Development

Successfully merging this pull request may close these issues.

read(io, Char) doesn't match collect(string) for malformed UTF-8
3 participants