Introduce Buffer type and make Array an abstraction on top of it #12447
+100. To expound on the conversation between @malmaud, @StefanKarpinski, @carnaval, and myself.
+100 also; I love that it would fit in perfectly with what @quinnj has been working on in https://github.com/quinnj/Strings.jl.
We should definitely start experimenting with an Array defined mostly in Julia, mixing in some SubArray features. How should we handle arrays of references? I don't think we want to have explicit write barriers in Julia code, for example. If the storage of an Array can be switched at any time, you have extra pointer loads to worry about, plus aliasing issues. Another issue is in-line allocation of small arrays; ideally arrays and strings would not mandate two objects each.
Small correction: that's only true for […]. That small correction aside, I am enthusiastic about this proposal, including mixing in some features of SubArrays into an Array-as-buffer-wrapper type. A big advantage is that […]. However, I'm not so sure about switching the backing buffer of the array, and I think Jeff's points deserve careful thought.
@malmaud, what use cases do you see for dynamically switching the backing buffer of the array?
Yeah, I'm not too sure on the need to be able to switch out backing buffers. I mean, I would say we just make

```julia
immutable Array
    ...
end
```

Let's see, what else do we currently store in Array: element size? length? is-shared? is-aligned? dimensions? Would we really need the ability to change any of those fields? I would think since […].
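For concreteness, such an immutable wrapper might be sketched roughly as follows. All names here are hypothetical (the proposed `Buffer` type does not exist, so a plain `Vector` stands in for it); this is an illustration of the idea, not the proposal's API:

```julia
# Hypothetical sketch only: a plain Vector stands in for the proposed
# flat Buffer type, and BufArray stands in for the Array wrapper.
immutable Buffer{T}
    mem::Vector{T}            # stand-in for fixed-length typed memory
end

immutable BufArray{T,N} <: DenseArray{T,N}
    buf::Buffer{T}
    dims::NTuple{N,Int}       # element size, length, etc. are all derivable
end

Base.size(A::BufArray) = A.dims
Base.length(A::BufArray) = prod(A.dims)
Base.getindex(A::BufArray, i::Int) = A.buf.mem[i]
Base.setindex!(A::BufArray, v, i::Int) = (A.buf.mem[i] = v; A)
```

Since the wrapper is immutable, none of those fields can change after construction, which is exactly what the questions above are probing.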
Right now we allow changing data pointers for unshared vectors, which we use when we have to reallocate, i.e. for […].
This is very timely for 0.5. We used to have a Buffer in the very early days, before the Julia public release, and it was quite nice to be able to define Array itself in Julia. |
I don't personally see much use for switching out the buffer - I think @johnmyleswhite got feedback from a serious Torch user that it would be useful, but I don't really know what for.
I don't think having a write barrier builtin from Julia is too bad. I'm actually somewhat surprised we never needed it. However, this hypothetical buffer object will definitely need a flag to indicate whether it's full of pointers or bits for the GC, ruining a bit the illusion of dumb bytes, but well, there is no way around that. It would be interesting to see if we can provide the feature of inline small-buffer allocation in a first-class (and maybe more general) way. It's kinda contrary to the local philosophy to have builtin Arrays be variable length when no one else can be. I would be fine if resizing was limited to another type than Array (ArrayList?), but I feel this opinion is not so popular. That way, paying the pointer-indirection cost is a conscious decision (and I would venture you don't need to make it very often). Maybe we could even try to support nd-resizing that way, to add rows and columns.
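The ArrayList idea above could be sketched like this. The name and API are purely illustrative, and a plain `Vector` stands in for a fixed-length buffer; the point is only that this one type pays the pointer indirection for resizability:

```julia
# Hypothetical ArrayList sketch: only this type pays the pointer
# indirection for resizability; the backing storage stays fixed-size.
type ArrayList{T}             # mutable so `data` can be swapped on growth
    data::Vector{T}           # stand-in for a fixed-length buffer
    len::Int
end

ArrayList{T}(::Type{T}) = ArrayList{T}(Array(T, 4), 0)

function Base.push!{T}(l::ArrayList{T}, x)
    if l.len == length(l.data)               # full: allocate and swap
        newdata = Array(T, 2 * length(l.data))
        copy!(newdata, 1, l.data, 1, l.len)
        l.data = newdata                     # the one conscious indirection
    end
    l.len += 1
    l.data[l.len] = x
    return l
end
```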
I've definitely thought about this before, too, and would love to see […].
What if someone puts their whole system in a tensor and only ever operates on views of its backing memory buffer? Under the reasoning above, wouldn't all of these array instances be considered to be aliasing memory with each other?
That seems to mostly depend on whether the views are disjoint.
Right, […].
That's all true. But how is that different from the status quo?
I think the point is that we'll use different types to represent aliased (or possibly aliased) data. Currently […].
Honestly, I don't know the status quo in Julia; I was just adding my 2 cents because I saw something wrong (or at least questionable) on the internet :)
@mbauman, maybe we need […].
@quinnj that makes a lot of sense to me
See #10507. I think […].
Ultimately though, will the proposal there return a […]?
There's no way to return an […]. Performance is indeed my big worry with the […].
@timholy, but if […]?
It'd be very difficult to shoehorn all of SubArray and ReshapedArray (performantly!) into Array without slowing down the normal, non-view case. You probably could make it work with a bunch of type parameters, but then it may as well be a different type. By exposing the Buffer type, we could hoist the Array's buffer directly into a field of the view objects, which should help with their performance (if LLVM optimizations don't beat us to the punch). It could also mitigate some of the MAX_TYPE_DEPTH and over-specialization issues that we'll probably start running into with lots of deeply nested type parameter lists flying around. The problematic operations are certain combinations of reshapes of sub-views, if I remember correctly.
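The "hoist the buffer into the view" idea might look roughly like this, restricted to the 1-d case for brevity. Again, every name is hypothetical and a `Vector` stands in for the proposed buffer:

```julia
# Hypothetical sketch: a view that stores the parent's buffer directly,
# so indexing never goes back through the parent Array at all.
immutable Buffer{T}
    mem::Vector{T}            # stand-in for the proposed flat buffer
end

immutable BufView{T}
    buf::Buffer{T}            # hoisted straight from the parent's field
    indexes::Vector{Int}      # SubArray-style index map (1-d only here)
end

# One field load reaches the data, however the view was constructed:
Base.getindex(v::BufView, i::Int) = v.buf.mem[v.indexes[i]]
```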
I'll certainly defer to our resident array ninjas on the subtleties of reshape and sub operations. I'm going to make a probably poor attempt at least for the […].
Here's an example from numpy: numpy/numpy#6166 (comment), where I think the generalized […].
Indexing of SubArrays already has zero cost over indexing of ordinary Arrays, so I think we could shoehorn it all into Array if we wanted to, but there may be reasons not to. LLVM can also already hoist the load of the array pointer from SubArray out of loops in most/all cases where it matters, because they are immutable, so I don't expect improved performance for that case. Rather, I expect that trying to implement Arrays in Julia will force us to sort out issues that currently affect other AbstractArray types, e.g. the stupid GC root-related performance issues and unhoisted array pointer loads for BitArrays.
The biggest issue is that there is no (nice) view type that is closed under both reshaping and taking a cartesian subset. The only type that is closed is one that stores one memory offset per element, and that would be horrible. See #9874 (comment).
Tentatively marking as 0.6 to encourage experimentation in that timeframe. |
Agreed, that should definitely be the plan. I think this needs the stack-allocation and inline-immutables PRs to be merged first to make this perform.
I think the current cache behavior (excellent) needs to be kept, which means that the Array data must still immediately follow the header. |
I was looking for progress on this, since @StefanKarpinski announced there is a feature freeze quite soon! In principle this doesn't seem too hard for isbits element types... I made an attempt at it here using a native-Julia […].
I forgot to mention my approach only works nicely for element types which are […].
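For isbits element types, the general approach being discussed here can be sketched with `Libc.malloc` plus a finalizer. This is only an illustration of the technique, not the linked code, and `RawBuffer` is a made-up name:

```julia
# Sketch: a GC-unaware buffer for isbits element types, freed by a
# finalizer rather than managed by the GC itself.
type RawBuffer{T}
    ptr::Ptr{T}
    len::Int
    function RawBuffer(len::Int)
        isbits(T) || error("RawBuffer only supports isbits element types")
        b = new(convert(Ptr{T}, Libc.malloc(len * sizeof(T))), len)
        finalizer(b, x -> Libc.free(x.ptr))   # free when GC collects b
        return b
    end
end

Base.length(b::RawBuffer) = b.len
Base.getindex(b::RawBuffer, i::Int) = unsafe_load(b.ptr, i)
Base.setindex!(b::RawBuffer, v, i::Int) = unsafe_store!(b.ptr, v, i)
```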
Neat experiment. But finalizers are yucky; it would be much better if this used Julia's own GC to manage the buffer.
OK I admit I didn't know much about how they are implemented. I've been reading around - is the main problem here that they are slow? Would type-based finalization fix the performance problem? I can kind-of see that deferred finalization of many resources is not desirable, but our GC already does that for memory, which is what this finalizer is all about...
This is no doubt true, and makes more sense. On the other hand, having the ability to boss around the GC from Julia code could also be pretty awesome (or potentially nightmarish). However, asking it for a buffer (and telling it whether it is full of data or full of pointers it should scan) is probably precisely the only sane thing you can boss it around to do from Julia code... I really should read the code for the GC one day, but here's a question. If the buffer contains arrays where some fields are data and some fields are references to other Julia objects, how does the GC know which parts of the buffer to scan? (Is this only a problem in the future when non-isbits immutables get inlined?)
No and no.
No: the GC keeps track of memory usage, but not of resources behind finalizers.
I'd prefer not giving people access to it unless a really convincing use case is made.
Yes.
The buffer is always well typed.
Which is a requirement for this change.
Ahh! Thanks, I was just beginning to think this is the only way it would make sense. So a future […].
The circular nature didn't escape me :) It's almost cute. Though, strictly, if it is true that […].
No. |
This'll happen in the 1.0 release. It should perhaps have had more attention in 0.6 so we could have it on master longer, but hard stop is really just the 1.0 release. |
This would be useful for buffered […].
I think this is fixed now on the v1.11 prereleases due to #51319. |
There's been some talk of switching to a model for arrays with two parts (and similar for strings) as part of Arrrayageddon: […]
A few advantages: […]
This is essentially the model that Torch uses, where the memory buffer is the Storage type and the array abstraction is the Tensor type.
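In Julia terms, that Torch-style split might be sketched as follows. The names mirror Torch's Storage/Tensor vocabulary, but nothing here is real API, and a `Vector` stands in for raw typed memory:

```julia
# Sketch of the two-part model: Storage is flat memory, Tensor is the
# shape/offset/stride view on top.  Names are illustrative only.
immutable Storage{T}
    mem::Vector{T}            # stand-in for raw typed memory
end

immutable Tensor{T,N}
    storage::Storage{T}
    offset::Int               # element offset into the storage
    dims::NTuple{N,Int}
    strides::NTuple{N,Int}
end

# Two tensors can share one storage, as Torch allows:
s = Storage{Int}(collect(1:6))
a = Tensor{Int,2}(s, 0, (2, 3), (1, 2))   # 2x3 view of all six elements
b = Tensor{Int,1}(s, 3, (3,), (1,))       # 1-d view of the last three
```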