Support for index maps: Blocked and diagonal sparse arrays #56
Comments
Notice that there exist subtle issues with supporting batched CSR sparse arrays. For instance, each batch may have a different number of specified elements; that is, the batch of column indices must be a ragged array. There are a couple of ways for an array library that lacks support for ragged arrays to still support batched CSR sparse arrays:
I think I meant blocked. I've modified the issue to reflect this.
In the case of PyTorch BSR/BSC sparse tensors, indices are defined as indices of blocks, not of tensor elements as suggested here.
That's supported without any additional modifications in
This is a very interesting idea—so, essentially, you can take any sparse matrix and project it into a higher-dimensional space (using an index map). Custom formats already support a
Many requests for Binsparse parser features (e.g. automatic reformatting) seem to require a highly capable tensor framework. Finch can currently implement rudimentary support for `+`, `*` and `%` on indices, but supports these kinds of maps a little differently (using wrapper array objects rather than an index map expression). I don't expect most frameworks to have default support for index maps (and even if they support some, they may not support all). However, I think most frameworks could implement a program which writes the tensor to COO, performs math on the coordinates, sorts them, and converts back. I think that as we consider in-memory formats, we might want to move towards a clearer distinction between optional and required features.
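As a rough illustration of that fallback path, here is a minimal sketch (the helper name and expression handling are assumptions, not part of the spec or of Finch) that applies index-map expressions like the ones discussed in this thread to COO coordinate arrays with NumPy; sorting and rebuilding the target format would follow the same coordinate arithmetic:

```python
import numpy as np

def apply_index_map(coords, index_map, names=("i0", "i1")):
    """Map virtual COO coordinates to storage coordinates.

    coords   : dict of 1-D integer arrays, one per virtual index name
    index_map: list of expression strings such as "i0 / 3" or "i0 % 3",
               one per storage dimension ("/" is taken as integer division)
    """
    env = {name: np.asarray(coords[name]) for name in names}
    out = []
    for expr in index_map:
        # Restrict evaluation to the virtual index names only.
        expr = expr.replace("/", "//")
        out.append(eval(expr, {"__builtins__": {}}, env))
    return out

# Example: a 6x6 matrix viewed as 3x3 blocks (block size 3).
rows = np.array([0, 1, 4, 5])
cols = np.array([2, 3, 0, 5])
storage = apply_index_map({"i0": rows, "i1": cols},
                          ["i0 / 3", "i0 % 3", "i1 / 3", "i1 % 3"])
print(storage)  # four arrays: block row, in-block row, block column, in-block column
```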
A more moderate proposal here is allowing some kind of dimension grouping syntax. For example, we could say that (1, 2), (3, 4) represents a block-matrix interpretation of a 4-D tensor, and (1, 2, 3) is a flattened 3-tensor.
There is tremendous value in defining a spec which generalizes to established patterns, even if they are not directly supported by the reference implementation, or are considered optional, etc. If PyTorch were to consider adoption of the binsparse spec, the ability to express and differentiate between blocked and hybrid sparse-compressed formats would be an important feature (independent of the existence of any implementation, as we already implement conversion mechanisms).

However, I do not think that this proposed mechanism is an ideal way to do it. To parrot some of the points raised above by @BenBrock and @willow-ahrens, it puts a large onus on consumer implementations to be able to parse such expressions.

The spec does seem to have some ambiguities with respect to the custom format levels sub-specification. I have reviewed the language there, and that assessment is based on that and on some of the examples defining pre-defined formats using custom formats. I believe that allowing for
If this is true, then how would one specify the nested levels for a 4-D tensor compressed along dim 1 and with a 2-D dense value? In other words, a CSR matrix with a dense matrix value? It seems to me that this would be the same, without any way to indicate to the consuming library in the JSON descriptor what the correct interpretation of the binary payload is. The BSR 2-D array with shape
@amjames Ah, of course -- in this case, I misunderstood what @pearu said earlier:
That led me to believe BSR is indeed 4-D, not 2-D, when indexed. If it is indeed 2-D, then my index map proposal handles that; in my proposal the block size is
I'm of the opinion that optional or partial support may be okay here. For example, if BSR is supported, recognize BSR index map patterns and then parse just the block size.
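For a sense of what that pattern recognition could look like, here is a minimal sketch (a hypothetical helper, not part of the spec or any library) that checks whether an `index_map` matches the blocked pattern used elsewhere in this thread and extracts the block size per virtual dimension:

```python
import re

# Matches expressions like "i0 / 3" or "i1 % 16".
_BLOCK_EXPR = re.compile(r"^\s*i(\d+)\s*([/%])\s*(\d+)\s*$")

def detect_block_sizes(index_map):
    """Return {virtual_dim: block_size} if index_map encodes plain blocking
    ("i / b" followed by "i % b" for each virtual index), else None."""
    if len(index_map) % 2 != 0:
        return None
    blocks = {}
    for div_expr, mod_expr in zip(index_map[0::2], index_map[1::2]):
        d = _BLOCK_EXPR.match(div_expr)
        m = _BLOCK_EXPR.match(mod_expr)
        if not (d and m) or d.group(2) != "/" or m.group(2) != "%":
            return None
        if d.group(1) != m.group(1) or d.group(3) != m.group(3):
            return None
        blocks[int(d.group(1))] = int(d.group(3))
    return blocks

print(detect_block_sizes(["i0 / 3", "i0 % 3", "i1 / 3", "i1 % 3"]))  # {0: 3, 1: 3}
```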
@hameerabbasi That is kind of what I am getting at with the idea of a non-scalar 'element'. Rather than needing to ID different patterns from mapping expressions, the block size would be encoded into the shape of an n-d element, which, coupled with the tensor shape, should give you all the information you need to properly interpret the index arrays and the data buffer.

The index_map proposal is still a bit hard on implementations because, if you do not support blocked sparse layouts, you still need to parse the expressions and properly classify them to indicate supported or not. I am of course assuming there are some other use cases where a library may want to support other types of index maps.

Rather than a non-scalar element, perhaps it would be better expressed as a value modifier on the data types section? In that case a BSR tensor with 3x3 blocks might have levels defined the same as CSR, but something like:

```json
"data_types": {
    "values": "blocked[int8, 3, 3]"
    ...
}
```

Actually, a "Value Modifier" would be better here. The blocked value interpretation is probably more closely related to
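Parsing such a value modifier would be cheap for consumers. A minimal sketch, assuming the hypothetical `blocked[dtype, m, n]` syntax floated above:

```python
import re

def parse_value_type(type_str):
    """Split a value type such as "blocked[int8, 3, 3]" into
    (base dtype, block shape); plain types get an empty block shape."""
    m = re.fullmatch(r"blocked\[\s*(\w+)\s*((?:,\s*\d+\s*)+)\]", type_str)
    if m is None:
        return type_str, ()
    dtype = m.group(1)
    block_shape = tuple(int(n) for n in re.findall(r"\d+", m.group(2)))
    return dtype, block_shape

print(parse_value_type("blocked[int8, 3, 3]"))  # ('int8', (3, 3))
print(parse_value_type("float64"))              # ('float64', ())
```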
If the 4-tensor is really meant to represent a 2-matrix, we should make sure the format reflects this, not the values. If the requirement is that binsparse must be able to represent every format in every system in order to be adopted, we need to accept that parsers will be allowed to error when the format is not compatible, or to use a basic reformatting algorithm to convert unsupported formats to supported ones.

Upon further reflection, I think that the strategy I laid out to convert between an "index modifier" representation and a regular representation is not so difficult to implement. Like I said before, parsers that do not support blocked formats can simply copy the blocked format to COO, then do simple indexing arithmetic to convert (I, J, i, j) tuples to (I * m + i, J * n + j) tuples, then copy into an appropriate format.
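Concretely, that de-blocking step could look something like the following sketch (a plain NumPy illustration of the arithmetic described above; the array contents and the m x n block size are made up for the example):

```python
import numpy as np

# Blocked COO coordinates: block row I, block column J,
# and the row/column offsets (i, j) inside each m x n block.
m, n = 2, 3                      # block size (illustrative)
I = np.array([0, 0, 1])
J = np.array([1, 2, 0])
i = np.array([0, 1, 1])
j = np.array([2, 0, 1])

# Element-wise coordinates of the equivalent unblocked COO tensor.
rows = I * m + i
cols = J * n + j

# Sort lexicographically by (row, col) so downstream formats (CSR, etc.)
# can be built by a simple pass over the coordinates.
order = np.lexsort((cols, rows))
rows, cols = rows[order], cols[order]
print(rows, cols)
```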
What do people here think of an alternative "tree" representation, to reduce parsing load?

```json
"index_map": ["i0 / 3", "i0 % 3", "i1 / 3", "i1 % 3"],
```

becomes

```json
"index_map": [
    {"/": ["i0", 3]},
    {"%": ["i0", 3]},
    {"/": ["i1", 3]},
    {"%": ["i1", 3]}
],
```

Much more complicated to write by hand; potentially easier to generate and parse.
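To make the "easier to parse" claim concrete, here is a minimal sketch of a recursive evaluator for that tree form (a hypothetical helper; the operator set is limited to the ones discussed in this thread, with "/" as integer division):

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": operator.floordiv, "%": operator.mod}

def eval_node(node, env):
    """Evaluate one index_map entry: an int, a virtual index name, or
    a single-key dict such as {"/": ["i0", 3]}."""
    if isinstance(node, int):
        return node
    if isinstance(node, str):
        return env[node]
    (op, args), = node.items()
    lhs, rhs = (eval_node(a, env) for a in args)
    return OPS[op](lhs, rhs)

index_map = [{"/": ["i0", 3]}, {"%": ["i0", 3]}, {"/": ["i1", 3]}, {"%": ["i1", 3]}]
env = {"i0": 7, "i1": 4}
print([eval_node(n, env) for n in index_map])  # [2, 1, 1, 1]
```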
I'm hesitant about using any expression involving (+, -, *, /, %) to encode blocking, because it's very easy to write equivalent and invalid expressions, and it may require symbolic analysis to simplify the expression and determine whether it is valid. For example, we could write indices that
If we want to restrict those cases, we could propose a simpler strategy which first splits dimensions, then fuses them. The first vector would specify what dimensions the new split indices would have. If we want to split a (24, 64) matrix into a (6, 4, 32, 32) tensor, then the first vector would be [(6, 4), (32, 32)]. Then, we would fuse with a list of dimension groups. For example, if we wanted to fuse dimension 0 with 2 and dimension 1 with 3, we would write [(0, 2), (1, 3)]. So the format would look like:
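As a NumPy illustration of the split-then-fuse semantics described above (the sizes below are chosen so the products work out; this shows the mechanism rather than the proposed descriptor syntax):

```python
import numpy as np

# Split-then-fuse on a dense stand-in: split a (24, 64) matrix into a
# (6, 4, 2, 32) tensor, then fuse dimension 0 with 2 and dimension 1 with 3.
a = np.arange(24 * 64).reshape(24, 64)

split = a.reshape(6, 4, 2, 32)            # first vector: [(6, 4), (2, 32)]
fuse_groups = [(0, 2), (1, 3)]            # second vector: dimension groups to fuse

# Bring each group's dimensions next to each other, then collapse them.
perm = [d for group in fuse_groups for d in group]
fused_shape = [int(np.prod([split.shape[d] for d in group])) for group in fuse_groups]
fused = split.transpose(perm).reshape(fused_shape)
print(fused.shape)  # (12, 128)
```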
@hameerabbasi I think my proposal is similar to yours; I'm much more comfortable with these restricted styles of tensor splitting and fusion because they don't have as many edge cases.
@willow-ahrens Fair enough; your proposal does address reshaping and blocking, but it doesn't address diagonal matrices or broadcasting. Broadcasting could potentially be handled via a new level format, one that repeats indices and makes everything read-only, but I don't yet see a clean way to represent diagonal matrices.
Of this list, I think OOB is really the only one that should be forbidden. (2) is similar to

@hameerabbasi I'm curious where you see broadcasting as a useful concept for data at rest/in transit. Temporarily inserting stride-0 dimensions so that iterators can be created for operating on arrays is not something I think of as needed for this kind of transfer/storage format. I do think that diagonal representations are important too; could those also be represented by a level format that signals that the interpretation of indices at that level changes to represent diagonal offsets?
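One way to make that question concrete: a diagonal-aware level could store the offset `j - i` in place of the column index, which is roughly how DIA-style formats work. A small NumPy sketch of that reinterpretation (illustrative only, not a binsparse level definition):

```python
import numpy as np

# COO coordinates of a 5x5 matrix with entries on the main diagonal and the
# first superdiagonal.
rows = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3])
cols = np.array([0, 1, 2, 3, 4, 1, 2, 3, 4])

# Reinterpret the column index as a diagonal offset (j - i): every element on
# a given diagonal now shares one index value at that level.
offsets = cols - rows
print(np.unique(offsets))   # [0 1] -> two stored diagonals
print(offsets, rows)        # (offset, position-along-the-diagonal) pairs
```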
@amjames Notice that this isn't just about over-the-network or on-disk; it's about inter-library exchange of data as well. For example, one could do the following:

```python
def downstream_lib1_fn(arr: array) -> array:
    arr = lib1.from_binsparse(arr)  # Or `lib1.asarray(arr)` for sparse-dense agnostic code
    # Perform ops on arr to get ret
    return ret

# User code
# `lib2.broadcast_to` produced a view into an existing sparse array
broadcasted_arr = lib2.broadcast_to(lib2_arr, shape)
downstream_lib1_fn(broadcasted_arr)
```

Here,
I'm writing this issue to gauge support for index maps, which define a virtual indexing space alongside a physical storage space. These can be useful to define blocked and diagonal sparse arrays. An index map can be a mapping from the N virtual indices to the M storage indices using the operators `+`, `-`, `*`, `/`, `%` and parentheses only (division is integer division); the usual BODMAS rules apply. This also leads to supporting the diagonal format. I'll raise an issue on the binsparse repo and link it here.

Here is an example field for a blocked CSR array, with a block size of `3`. The way this can be interpreted is that there are two virtual dimensions in the array and four physical storage dimensions. `i0` and `i1` are the virtual indices, and the four elements of `index_map` tell us how to index into the existing four dimensions with two indices.

Here is another example for a 5x5 diagonal matrix:

This can also support "broadcasting" in NumPy via virtual dimensions. Here is an example with `np.broadcast_to(x[:, None, :], ...)` with `x.ndim == 2`:

xref data-apis/array-api#840 (comment)
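For concreteness, a descriptor carrying such an `index_map` field might look roughly like the sketch below (a hypothetical fragment assembled from the block-size-3 expressions quoted earlier in this thread, shown as a Python dict for brevity; it is not one of the original examples from this issue):

```python
# Hypothetical descriptor fragment for a blocked (block size 3) CSR matrix:
# two virtual dimensions (i0, i1) mapped onto four physical storage dimensions.
descriptor = {
    "shape": [9, 9],                 # virtual (logical) shape, illustrative
    "index_map": [                   # one expression per physical dimension
        "i0 / 3",  # block row
        "i0 % 3",  # row within the block
        "i1 / 3",  # block column
        "i1 % 3",  # column within the block
    ],
    # ... remaining binsparse fields (format, data_types, ...) omitted
}
```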