Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSS] Support kDLBool type #75

Closed
alonre24 opened this issue Jun 3, 2021 · 18 comments · Fixed by #114
Closed

[DISCUSS] Support kDLBool type #75

alonre24 opened this issue Jun 3, 2021 · 18 comments · Fixed by #114

Comments

@alonre24
Copy link
Contributor

alonre24 commented Jun 3, 2021

To represent tensors of type bool - is it possible adding to the DLDataTypeCode enum another type (kDLBool)? That can be very useful, as some well known machine learning platforms (TF, torch, Onnxruntime) support models whose inputs/outputs are tensors of type bool. Thanks!

@tqchen
Copy link
Member

tqchen commented Jun 3, 2021

bool is encoded as UInt(bits=1, lanes=1) type. One thing that is indeed true is that we have not clearly specified how to store the sub-byte types, in this case they are stored as the byte

@alonre24
Copy link
Contributor Author

alonre24 commented Jun 3, 2021

As I understand, the size of bool element is usually one byte, so representing it with bits=1 seems not appropriate. On the other hand, using UInt(bits=8, lanes=1) to represent bool doesn't allow distinguish between bool and uint8 types. Therefore, I suggest adding bool to DLDataTypeCode - both for having the flexibility of the desired bool representation (1 or 8 bits), and for not overlapping with existing types.

@tqchen
Copy link
Member

tqchen commented Jun 3, 2021

Thanks @alonre24 On compiler projects like LLVM, uint(1) is used to represent the bool type(think it as a register type). That does not preclude us to think of a storage type of the element type. In this particular case, I believe specifying the storage type clearly for sub-byte types would be sufficient.

@lantiga
Copy link

lantiga commented Jun 3, 2021

Thank you @tqchen, there seems to be different ways in which the different frameworks decide to implement bool. It is not guaranteed to be uint(1), for instance in PyTorch is uint(8), which makes it undistinguishable from a tensor with a uint(8) dtype.
There are talks about creating a bit tensor in PyTorch to make bool sub-byte, but at the end of the day one has no guarantees there.

I'm of the opinion that having a separate DLDataTypeCode for bool and then using the bits field to describe whether that bool is held in 1 bit or byte in the buffer a would make a lot of sense for interoperability.

@leofang
Copy link
Contributor

leofang commented Jun 4, 2021

It is not guaranteed to be uint(1), for instance in PyTorch is uint(8), which makes it undistinguishable from a tensor with a uint(8) dtype.

Indeed, this issue would happen in MPI programs too (MPI_C_BOOL vs MPI_UNIT8_T).

@tqchen
Copy link
Member

tqchen commented Jun 4, 2021

What I originally meant was to specify the sub-bit storage layout, which means uint(1) maps to the bool and stores as uint8 in DLPack.

When the bool was indeed stored as sub-byte in bitmask setting, it then becomes uint1x8(8 of 1 bits). So we can still distinguish the two using the current convention.

@leofang
Copy link
Contributor

leofang commented Jun 7, 2021

I just realized in CuPy we do not support exchanging bool arrays because of lack of kDLBool.

@tqchen I think this is a nontrivial interpretation that not everybody expects (nor was it documented). For example the ongoing NumPy PR did not support bool either (numpy/numpy#19083). AFAIK only PyTorch supports it but it's unclear to me where they learned this. I am +1 for adding kDLBool for better clarity.

cc: @rgommers @emcastillo @hameerabbasi @seberg

@seberg
Copy link
Contributor

seberg commented Jun 7, 2021

My view is that adding a new type for bool seems best (and clearest). Does PyTorch even really support it, or just export bool as a uint8 (losing the bool type information on round-tripping)?

@tqchen tqchen changed the title Support kDLBool type [DISCUSS] Support kDLBool type Jun 7, 2021
@tqchen
Copy link
Member

tqchen commented Jun 7, 2021

Thanks everyone for great discussion so far. Trying to summarize the thread.

A0: i1 to represent bool

We can use uint(1) to represent the intend of the data type. Additionally, the specification says that any sub-byte data type would need to be stored in a full byte. This means uint1 would need to be stored as uint8(full byte) consistent with the current behavior.

For cases where the intention is to pack the bit-masking(say pack 8 bools into a byte). The data type can be represented as a vector type uint1x8, which can be stored as a full byte.

A1: Introduce a bool type

One of the natural extension to think about this is to introduce bool with bits=8. Note that the default printing behavior might not be too attractive here, since while we can say int8, bool8 does not makes too much sense as a dtype name.

Discussions

Both A0 and A1 should achieve the goal of supporting bool in DLPack.

  • A0 is likely more consistent with the low level compiler implementations, for example LLVM implements bool using i1
  • A1 makes the choice of bool more explicit, although we could also introduce a helper macro (IsBool(DLDataType) to make things explicit in the case of A0)

@seberg
Copy link
Contributor

seberg commented Jun 7, 2021

For cases where the intention is to pack the bit-masking(say pack 8 bools into a byte). The data type can be represented as a vector type uint1x8, which can be stored as a full byte.

I think this as storing exactly 8-tuples of bool, under the specifications? So a shape=1 [itemsize] would translate into shape=8 [bits] booleans/bits for uint1x8?
This means:

  • A shape of SHAPE [bits] has to be described as uint1xSHAPE, with SHAPE=15 you get 15 [bits] being uint1x15?
  • A multi-dimensional bit-array stored contiguously can't be described (since it would requite bit-sized strides that are not in the last dimensions/cannot be absorbed into lanes).

Also, someone might get the idea to store uint2 in a packed/unpacked format, at which point the type distinction between boolean and bit-sized integers becomes fuzzy. That might be an unnecessary "what if", but creating bool explicitly avoids it.

It seems to me that if there is a possibility of bit-sized stride support (e.g. in form of bit-arrays), creating the explicit bool dtype is a lot cleaner. If that will never happen, I still think it is nicer, but I doubt there are serious practical issues with using uint1 as bool.

I had the impression that the standard already explicitly supports bit-strides, since the header includes the line (in a comment):

size *= (t->dtype.bits * t->dtype.lanes + 7) / 8;

although, I guess there is a fair chance that nobody would do that and you can consider it undefined right now.

A0 is likely more consistent with the low level compiler implementations, for example LLVM implements bool using i1

Yes, but I don't understand that this is a reason for or against A0? The storage needs to be clearly defined as 1 or 0 (or possible value != 0 and value == 0), of course. But the type-safety aspect is unrelated to what the compiler uses behind the scenes?

@tqchen
Copy link
Member

tqchen commented Jun 8, 2021

Thanks @seberg .

To followup on the conversation about the bitmasking. The particular comment of uint1x8 was with respect to the bit level case. In the most common cases we are not doing that, and uint1 plus the storage spec of sub-bit type would indeed resolve the issue(because all types padded to full byte to be stored). The same principle will likely applies to uint2 as well.

size *= (t->dtype.bits * t->dtype.lanes + 7) / 8; is a realization of the sub-byte storage convention.

To followup on the multi-dimensional bit-array. We would indeed require some form of logical to physical mapping in this case. Explicitly specifying a vector type uint1x8, or uint1x32 would corresponds to the physical storage plan of the data. Such plan is also usually easier to direct manipulate in the presence of SIMD. The logical case is a separate question.

In the case of A1, bool1 create some inconsistency with the current specification. Since the bool1 was intended as a bit array, while the sub-byte storage convention would indicate the other case(it should be stored as 1 byte). Of course as we discussed bit-array may be out of the scope, so not too relevant here. Just want to show some possible inconsistencies.

@seberg
Copy link
Contributor

seberg commented Jun 10, 2021

Yeah, you are right, I managed to mis-parse the code as rounding up the size rather than the itemsize :(. Although I guess many here likely didn't realize that sub-bit datatypes are standardized to be stored padded to full bytes :).

In the case of A1, bool1 create some inconsistency with the current specification. Since the bool1 was intended as a bit array

You are right. Do not use bool1 for bit-strided storage, then (by itself)! What you get is:

  • bool8 indicates a convention where True := value != 0 and False := value == 0
  • bool1 indicates a convention where True := value == 1 and False := value == 0 stored in 1 byte as per current DLPack standard
  • uint1 can be used for the set of values {0, 1} instead of meaning {False, True} (which could be surprising!)

(For what its worth, NumPy can't always guarantee bool1 style, as it, annoyingly, supports both in practice. But DLPack could be opinionated about bool1 vs. bool8. If you would switch to always using sub-byte strides, bool8 has to be specified as one of the two meanings.)

What about sub-byte strides?

I do agree that there is an inconsistency if you would use bool1 to implicitly indicate bit-strided (non byte-sized) storage, but not define uint2 to use this. But I do not understand that as an argument against introducing a new bool dtype. To me that is primarily a type safety (and clarity) issue – i.e. the last point.

I have heard the bit-mask requests for NumPy often enough to think that it may be good to anticipate it (not necessarily act on that right now, though). The future might for example see a flag (e.g. on the dtype field) to indicate true bit-sized storage (non byte-padded). That flag will allow both bool1 and uint2 (or even uint1) to be stored in a packed, sub-byte strided, manner.

@tirthasheshpatel
Copy link
Contributor

@tqchen The discussion seems to have reached an impasse at the uint vs bool argument. #75 (comment) is a nice summary of the issue. Here is my opinion about the thread:

  • Using uint(1) or uint(8) as bool: This seems obscure to me as suggested in [DISCUSS] Support kDLBool type #75 (comment). Another argument against this is that there is no way to distinguish between uint8 and bool. But @tqchen noted that it is more consistent with the lower level implementation in compilers e.g. LLVM.
  • Using kDLBool: More explicit and easier to use. Clear distinction between uint8 and bool. Worst case, introduces some redundancy for low level compilers like LLVM (but that doesn't seem to be a problem (like introducing ambiguity), right?)

I would vote for an explicit kDLBool since its more explicit. I admit, I am not aware of the problems with low-level compiler projects like LLVM but I don't think that adding a kDLBool should cause a problem when exporting at such low levels. @tqchen Do you agree with this? If yes, we can proceed with merging #76. It would definitely be nice to have the bool data type in DLPack!

@leofang
Copy link
Contributor

leofang commented Oct 27, 2022

@tqchen @rgommers @seberg @kmaehashi @tirthasheshpatel @jakirkham @kkraus14

I'd like us to revisit this topic, and specially I'd like to request for adding kDLBool. Look at the NumPy/CuPy/PyTorch adoptions as of today, it is clearly that none of these libraries support exchanging bool tensors, and it has increasingly become a source of bug reports (see the above linked issues). Given that these are already the 3 major Array API adopting libraries, we should add the bool type to clear any ambiguity and unblock them.

@tqchen
Copy link
Member

tqchen commented Oct 27, 2022

Thanks for rekindle this topic. After thinking a bit more on this topic. I also now agree that having kDLBool is going to be helpful.

I would assume the current state would call for something like

Bool(bits=8, lane=1)

to represent the current bool type in numpy/PyTorch?

@seberg
Copy link
Contributor

seberg commented Oct 27, 2022

Sounds good, it might be nice to review if anyone has expects True to be only the bit set, rather than anything != 0.

@tqchen
Copy link
Member

tqchen commented Nov 9, 2022

seems there is general consensus on this. We can consider a PR to add such spec. perhaps @leofang can send a PR?

I think it makes sense to require true be the only bit set

@leofang
Copy link
Contributor

leofang commented Nov 9, 2022

Shoooot my apology @tqchen I completely missed the traffic here for some reason. Will do later today!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants