Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch supertype of CodeUnits from DenseVector to AbstractVector #54002

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

jakobnissen
Copy link
Contributor

@jakobnissen jakobnissen commented Apr 9, 2024

This is a breaking change, but it was a bug that it subtyped DenseVector previsouly, as there is no guarantee that CodeUnits are stored densely in memory.

Test failures currently seem unrelated

See #53996

This is a breaking change, but it was a bug that it subtyped `DenseVector`
previsouly, as there is no guarantee that CodeUnits are stored densely in memory.
@jakobnissen jakobnissen added minor change Marginal behavior change acceptable for a minor release needs pkgeval Tests for all registered packages should be run with this change needs nanosoldier run This PR should have benchmarks run on it bugfix This change fixes an existing bug labels Apr 9, 2024
@jakobnissen
Copy link
Contributor Author

As mentioned in #53996, since this is a breaking bugfix, this needs a PkgEval + Nanosoldier run in case someone relied on its supertype.
Depending on the results of these runs, it may be triage worthy.

@mbauman
Copy link
Member

mbauman commented Apr 9, 2024

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier runtests()

Can we do both in the same comment?

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@jakobnissen
Copy link
Contributor Author

Nanosoldier report looks fine - some suspicious numbers for BitSet but I can't see how that could be related.

@Zentrik
Copy link
Member

Zentrik commented Apr 10, 2024

Those are a couple benchmarks going from 10ns on master to 20ns and the rest of the benchmarks on BitSet seem fine (https://tealquaternion.camdvr.org/compare.html?end=54002&showRawData=true&nonRelevant=true&name=BitSet), so probably just noise.

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@jakobnissen
Copy link
Contributor Author

There are some legitimate package errors in there: PyCall, StringViews, StrBase, Base58.
Small enough that I could probably make PRs to all the packages to fix them, but then again, this might also mean this change causes issues in packages not in PkgEval.

@jakobnissen jakobnissen removed needs nanosoldier run This PR should have benchmarks run on it needs pkgeval Tests for all registered packages should be run with this change labels Apr 10, 2024
base/strings/basic.jl Outdated Show resolved Hide resolved
@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Apr 10, 2024
@LilithHafner
Copy link
Member

This PR is too breaking.

However, we have a codeunits function. We should, instead, require that strings with non-contiguous memory implement the codeunits function and return something that is not a DenseVector (or if it is, make sure it is a valid DenseVector). This would be a small breaking documentation change.

(In 2.0?, we should probably just? remove DenseVector and make it a trait/interface)

@KristofferC
Copy link
Member

There are some legitimate package errors in there:

It is not enough to look at if a package tests starts to fail based on a change to call it a "legitimate package error". Many times, packages directly test the buggy behavior in the test itself which causes the error while the package code itself is completely fine. For example:

StrBase is literally a direct call to strides on a CodeUnits: https://github.com/JuliaString/StrBase.jl/blob/90ac50265de0f356f414b9d493140fc724b7ae9d/test/basic.jl#L882 so that isn't breaking anything, it's just a nonsense test that tests the very thing that is being changed here.

Or StringViews.jl is directly testing the change that is done here:
https://github.com/JuliaStrings/StringViews.jl/blob/f94b09e5f5056ab17a3f50490e500aee4476479d/test/runtests.jl#L35

PyCall is also just a test that isn't testing any real functionality in the package https://github.com/JuliaPy/PyCall.jl/blob/2f600fbebee50ab0672e153455e3c0fda1694fba/test/runtests.jl#L563

I guess the people calling this "too breaking" just looked at the list of PkgEval results and decided it based upon that but as someone who has to do this a lot of releases, that is insufficient to make any decision at all.

I would say that the current behavior is buggy, it should be changed, and the packages just need to tweak their tests a bit (not change any code in their package) to pass with this.

@jakobnissen
Copy link
Contributor Author

Some of these failures are significant. The failures you link to, which seem less significant, are just the first to fail in their test suite.

  • For PyCall, pybytes is a user-facing function that no longer work on codeunits. Hence, this will break scripts that call pybytes to pass it into Python. This is shown as an example on the README.md of PyCall, so I'd call this breakage.
  • For StringViews, a large part of the functionality of the whole package is defined as methods with ::DenseStringView, which relies on codeunits being a subtype of dense vector. I.e. if I remove the test you link to here, then it just fails with another test (concretely, regex matching then breaks, because it's only defined for DenseStringViewAndSub)

StrBase though, does pass its tests when removing that one test for strides.

@gbaraldi
Copy link
Member

I agree with Kristoffer (I can't make that triage time :( ). This is a bug, not a behaviour change we are doing willy nilly, if someone expects codeunits do be a dense vector and it's not that's UB, i.e passing it to a C function which then might segfault.

@oscardssmith
Copy link
Member

is there some reason triage's suggested fix wouldn't also fix the bug?

@gbaraldi
Copy link
Member

gbaraldi commented Apr 25, 2024

IMO triages change is more breaking and more subtle because it seems it would basically that that codeunits doesn't take an AbstractString as it's argument but something like a AbstractDenseString

@brenhinkeller
Copy link
Contributor

That seems less likely to actually break existing code -- is there even any extant nontrivially used AbstractString type that does not store its contents densely?

@mbauman
Copy link
Member

mbauman commented Apr 25, 2024

I mean, the point of this PR was precisely to do this experiment. It is one possible fix, but not necessarily the fix.

is there even any extant nontrivially used AbstractString type that does not store its contents densely

Yes. The issue in #53996 is InlineStrings.jl — those are backed by non-mutable structs that don't necessarily land on the heap with a memory address. They do have support for grabbing a pointer through a Ref{<:InlineString}, but it's not a c-compatible one: it's backwards (stride -1), the pointer should really be offset by N for each InlineN, and it's not necessarily null terminated. It's at that c boundary that's at the crux of where many of these failures are coming from, actually.

It's very much worth noting that CodeUnits{T, <:InlineString} is currently broken (and broken loudly) when trying to use codeunits as a dense vector:

julia> pointer(codeunits(inline"hello"))
ERROR: MethodError: no method matching unsafe_convert(::Type{Ptr{UInt8}}, ::String7)

@oscardssmith
Copy link
Member

so the triage suggested change would be to define codeunits(s::InlineString) = codeunits(string(s)) or something like that.

@mbauman
Copy link
Member

mbauman commented Apr 25, 2024

Looking through current uses of the type specifically, it seems like CodeUnits has two purposes — or, rather, is commonly used in two contexts. It's either used to make a string act like an Julia vector (e.g., CSV does this) or it's used as a Cstring and passed to C (or other languages like PyCall). The latter case doesn't just require it to be a DenseVector, it needs to be stride 1, null-terminated, with UInt8 code units. And CodeUnits doesn't promise any of that, but folks are using it like it does with varying levels of additional checks that (often partially) assert (some of) those preconditions.

It really seems like packages are using CodeUnits as a way to launder strings into c-compatible strings on their way to a ccall, when they might as well just pass the string itself to ccall directly. Or they may even assume that it's necessary. I get it, it really looks like I can pass codeunits("hello") and even codeunits(inline"hello") to C. The vibes are great, and it is a nice way to distinguish between APIs like separating a string file name from string data, which does seem to get tangled up in the process.

So that's at the root of the question, really. Is CodeUnits a C-compatible vector like char*? Or is it a Julia vector? I don't think it's worth keeping it a DenseVector if we don't additionally promise its pointer can act like a Cstring.

@jakobnissen
Copy link
Contributor Author

Originally, CodeUnits was designed to allow users to access the bytes of a string safely (i.e. without wrapping it in a mutable vector). I work with strings a lot, and it's quite useful to occasionally process strings as just a sequence of bytes. This original purpose does not require denseness, nor null-termination or anything like that.

So, given that:

  • CodeUnits was not originally intended to promise C-compatibility
  • CodeUnits doesn't actually promise it currently and indeed does violate it for inline strings
  • When needing to ccall, you can just pass in your String directly

I think we should consider CodeUnits "just" an AbstractVector.

@jakobnissen
Copy link
Contributor Author

From triage:

  • DenseArray is not great API, because denseness cuts across too many types, and e.g. does not cover views and doesn't really fit with static arrays. It should be a trait.
  • Just removing the supertype is too breaking for now
  • The short-term fix for packages like InlineStrings is to implement codeunits(s::InlineString) = codeunits(string(s)). This is not a good solution, but it's semantically correct which the status quo isn't.
  • The long-term plan should be to implement an isdense trait as suggested in Make an isdense trait #54715, and then try to move the ecosystem from relying on DenseArray to this trait. If such a time come when major packages no longer rely on DenseArray, we could remove the supertype from CodeUnits. Or, maybe it turns out no-one really relies on it anyway so it won't matter. We'll see at that time.

@jakobnissen
Copy link
Contributor Author

I'll keep this PR open, so it can be revisited later when the situation changes w.r.t. the isdense trait

@jakobnissen jakobnissen removed the triage This should be discussed on a triage call label Jun 6, 2024
@jakobnissen jakobnissen added the forget me not PRs that one wants to make sure aren't forgotten label Jun 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This change fixes an existing bug forget me not PRs that one wants to make sure aren't forgotten minor change Marginal behavior change acceptable for a minor release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants