Polish and officially release quantized types #4857

Open
strongoier opened this issue Apr 26, 2022 · 42 comments
Labels: feature request
@strongoier (Contributor)

Quantized types are an experimental feature introduced in the QuanTaichi paper. With this feature, users can significantly reduce the memory usage of their Taichi programs. The feature can also accelerate atomic operations on mobile phones.

However, the feature has been neither officially announced nor extensively maintained. As Taichi has reached version 1.0, I think it is time to polish the feature and make it officially available to users. My plan is to refine the API and implementation so that the feature fits better into current Taichi, becomes more user-friendly, and is deployable with Taichi AOT. I would like to write an RFC for it.

@strongoier strongoier added the feature request label Apr 26, 2022
@taichi-ci-bot taichi-ci-bot moved this to Untriaged in Taichi Lang Apr 26, 2022
@strongoier strongoier self-assigned this Apr 26, 2022
@strongoier strongoier added this to the Taichi v1.1.0 milestone Apr 26, 2022
@strongoier strongoier moved this from Untriaged to In Progress in Taichi Lang Apr 26, 2022
@strongoier (Contributor, Author)

Before writing a formal RFC, I would like to briefly summarize some previous discussions on this topic. I think there are still some issues to be solved, and I hope to continue the discussion here.

Background

A quantized type normally has no native hardware support. Therefore, you need to specify a parent primitive type (e.g. a 32-bit int) and describe how a group of quantized types (e.g. a 15-bit int and a 17-bit int) is packed inside it.

In Taichi, this is done by introducing two SNode types, bit_struct and bit_array. Example usage:

import taichi as ti
ti.init()

i4 = ti.quant.int(bits=4)                  # 4-bit signed integer
u28 = ti.quant.int(bits=28, signed=False)  # 28-bit unsigned integer

p = ti.field(dtype=i4)
q = ti.field(dtype=u28)
# Pack one i4 and one u28 (32 bits in total) into a 32-bit bit_struct.
ti.root.dense(ti.i, 4).bit_struct(num_bits=32).place(p, q)

r = ti.field(dtype=i4)
# Pack 8 i4 cells along axis i into each 32-bit bit_array container.
ti.root.dense(ti.i, 4).bit_array(ti.i, 8, num_bits=32).place(r)

What are the problems with the current APIs?

  1. bit_struct and bit_array are inconsistent with other SNode types. A normal SNode specifies two things: how the axes are split, and how cells are stored in the container; it also places no limitations on the components of its cells. In contrast, bit_struct has nothing to do with axes, and both bit_struct and bit_array may only have place SNodes as components of their cells, with a limit on the total number of bits of all those components. These differences make the APIs inconsistent.
  2. Users cannot use quantized types outside the SNode system. This is especially problematic for deployment, because ndarrays, which are first-class citizens in Taichi AOT, cannot work with quantized types.

What are our current thoughts on solving these problems?

As bit_array is deeply coupled with the SNode system (it indeed handles axis splitting) but is not used that often, we prefer to keep it unchanged. Our main focus is bit_struct.

Potential change 1: add type ti.types.bit_struct

ti.types.bit_struct is similar to ti.types.struct, with the following differences:

  • The whole ti.types.bit_struct is stored in a single primitive type.
  • Members of a ti.types.bit_struct must be quantized types.
  • The memory layout of the members is clearly defined.

Example usage:

s_ty = ti.types.bit_struct(32, {'a': i4, 'b': u28})
s = ti.field(dtype=s_ty)
ti.root.dense(ti.i, 4).place(s)
s[I].a, s[I].b  # access

s_arr = ti.ndarray(dtype=s_ty, shape=4)
s_arr[I].a, s_arr[I].b  # access

Pros:

  • It can be supported for both SNodes and ndarrays, so problem 2 is solved.
  • It gets rid of problem 1 for the bit_struct SNode.

Cons:

  • ti.types.bit_struct focuses on storage, so its members may not form a logical group. This can result in hard-to-read user programs.
  • When used in SNodes, users can no longer change the storage layout without modifying computation code. This sacrifices one important advantage of the SNode system.

Potential change 2: add helper function bit_struct_wrapper()

bit_struct_wrapper() is introduced to replace the bit_struct SNode. Example usage:

p = ti.field(dtype=i4)
q = ti.field(dtype=u28)
ti.root.dense(ti.i, 4).place(bit_struct_wrapper(32, [p, q]))

It aims to solve problem 1 without sacrificing anything. However, it does nothing for problem 2 because it is not compatible with ndarrays.

Considering that neither of these proposed changes is perfect, shall we apply none, one, or both of them? Or do you have other ideas? @k-ye @ailzhang @yuanming-hu

@k-ye (Member) commented Apr 27, 2022

I really like 1, because it makes the type system neat :-) However, considering that switching to bit_struct_wrapper should be easier, and that 1 and 2 are not mutually exclusive, I think it's reasonable to go with 2 first. As for quant types in ndarrays, at the bare minimum we can support storing just fixed-point scalar numbers first, then quantized vector types, then quantized struct types.

@ailzhang (Contributor) commented Apr 28, 2022

+1 on implementing #2 as a start! BTW, based on the deployment needs, I feel that not modifying computation code might not be as hard a requirement as we thought. IMHO, if it's a simple s/old/new substitution, it shouldn't be a huge problem for people who want to maximize performance. (Or is it more complicated than that? :P)
For ndarray + quant, is my understanding correct that supporting fixed-point scalar numbers can already solve our problem of floating-point atomics?

@strongoier (Contributor, Author)

not modifying computation code might not be as hard a requirement as we thought

I agree with that. The main point here is that when we introduce a new language construct, especially one as fundamental as a type, we should make it useful in most cases, instead of it being a deployment-only thing.

For ndarray + quant, is my understanding correct that supporting fixed-point scalar numbers can already solve our problem of floating-point atomics?

IMO yes. @k-ye

@strongoier (Contributor, Author) commented Apr 29, 2022

After an offline discussion with @ailzhang, we reached the following consensus:

  1. Ndarray is designed mainly for deployment purposes, with two unique advantages: avoiding memory copies and avoiding recompilation. It is important that it can be interpreted by user programs and common third-party frameworks in a trivial way. Therefore, supporting complex data storage mechanisms with ndarrays doesn't make much sense. For cases where such complex storage is really needed, users should refer to the SNode AOT solution.
  2. That said, we still want to solve the problem that floating-point atomics on mobile phones are too slow, without resorting to the SNode AOT solution. The proposal here is to add fixed32 and fixed64 types and let users convert floats from/to them:
f_ty = ti.types.fixed32(scale=100.0, signed=False)  # proposed API
arr = ti.ndarray(float, 10)

@ti.kernel
def foo(a: ti.types.ndarray()):
    for i in a:
        x = ti.cast(a[i], f_ty)   # float -> fixed-point
        ...  # calculations on x
        a[i] = ti.cast(x, float)  # fixed-point -> float

foo(arr)
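
For intuition, here is a minimal pure-Python sketch of the fixed-point convention assumed above, treating scale=100.0 as "store round(x * scale) in an integer" (the exact semantics of the proposed ti.types.fixed32 are not final):

def to_fixed(x: float, scale: float) -> int:
    # Encode a float as a plain integer, e.g. 1.23 -> 123 with scale=100.0.
    return round(x * scale)

def from_fixed(i: int, scale: float) -> float:
    # Decode the integer representation back to a float.
    return i / scale

acc = 0                         # integer accumulator
for v in (0.25, 1.5, 0.75):
    acc += to_fixed(v, 100.0)   # on a GPU backend this is an integer atomic add
print(from_fixed(acc, 100.0))   # 2.5

Because the stored representation is a plain integer and fixed-point addition distributes over the encoding, a float accumulation lowers to a native integer atomic add, which mobile GPUs support directly, whereas f32 atomics typically fall back to compare-and-swap loops.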

WDYT @k-ye
If this solution looks good, we can finalize this as an individual feature request and then re-consider other aspects of quant APIs with fewer restrictions.

@k-ye (Member) commented Apr 29, 2022

To me it seems like fixed32 is just a small wrapper around the quant API? While I agree that

it can be interpreted by user programs and common third-party frameworks in a trivial way.

it's also not too hard to convert custom quant types into primitive types.

One thing I've been thinking about: if we make quant vectors workable on mobile, how much larger a scale can we get for simulations? Note that graphics APIs already offer f16 vectors, e.g. Metal has half4, so this is something to consider.

So yeah, I guess we can agree that Ndarray doesn't need to support a fancy bit_struct. But I think it's reasonable to consider quantized scalars and vectors/matrices.

@ailzhang (Contributor)

@k-ye Yup, that wrapper is mainly used to solve the floating-point atomics problem we've seen.

Is my understanding correct that, to achieve much larger-scale simulations on mobile, we can try adding e.g. half4 as a primitive type which applies to both fields and ndarrays?

@k-ye (Member) commented Apr 29, 2022

Is my understanding correct that, to achieve much larger-scale simulations on mobile, we can try adding e.g. half4 as a primitive type which applies to both fields and ndarrays?

Yep. Additionally, this could also help with vec4 loading optimization (cc @turbo0628 @qiao-bo )

@strongoier (Contributor, Author)

To me it seems like fixed32 is just a small wrapper around the quant API?

This indeed requires us to support quantized scalars. Our current APIs cannot be used outside SNodes. Moreover, when a quant type is used as an individual scalar, a number of bits other than 32/64 doesn't make sense. As there are already f32/f64, the only meaningful types to provide are fixed32/64.

Additionally, this could also help with vec4 loading optimization

I don't quite get the point here. Using native half4/vec4 in codegen instead of current ad-hoc expansion will certainly be an optimization strategy for our ti.types.vector(4, dtype=f16/f32). How does it relate to our quantized types?

@strongoier (Contributor, Author) commented May 10, 2022

After yet another discussion with @k-ye @ailzhang @jim19930609, I have formed a mental picture of future plans and would like to share it here.

Task A: Refine current APIs of quantized types and make them available again

Although the current APIs work only in the SNode system, they are still useful, and we hope to expose them in a cleaner way.

Subtask A.1: Determine public APIs of quantized type definitions

Previously, we had two groups of APIs, type_factory and quant. The latter is built on top of the former and is used in the QuanTaichi paper; however, some real use cases adopt the former. Having both adds an unnecessary burden for users learning these APIs.

We would like to keep only quant, as it is closer to users, and make it available at ti.types.quant for consistency with other types. type_factory will be removed, and its methods will be made private under ti.types.quant.

To sum up, we will have ti.types.quant.int/fixed/float/_custom_int/_custom_float. All current usages need to be updated; a sketch of the resulting definitions follows.
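
A sketch of what the refined definitions could look like (the int/fixed/float names are from the plan above; the parameter names are assumptions for illustration):

import taichi as ti
ti.init()

i5 = ti.types.quant.int(bits=5, signed=True)     # 5-bit signed integer
u7 = ti.types.quant.int(bits=7, signed=False)    # 7-bit unsigned integer
fx10 = ti.types.quant.fixed(frac=10, range=2.0)  # 10-bit fixed-point scaled into [-2, 2]
fl15 = ti.types.quant.float(exp=5, frac=10)      # 5 exponent bits + 10 fraction bits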

Subtask A.2: Solve the inconsistency problem of the bit_struct SNode

This corresponds to problem 1 and potential change 2 mentioned above. I plan to add an API ti.bit_struct_wrapper(number_of_bits, list_of_fields, with_shared_exponent) to solve the inconsistency problem and also keep place() clean. This requires refactoring our SNode system implementation a bit, as we are getting rid of the bit_struct SNode.

Task B: Add new all-purpose and deployable APIs of quantized types

For deployment purposes, where performance is valued the most, it is worth providing some new APIs (users will have to write things in a new way). The new APIs should work both in the SNode system and for Ndarrays.

Subtask B.1: Allow unrestricted usage of quantized types as dtype

Currently, the quantized types ti.types.quant.int/float/fixed can serve as the dtype of fields, on the condition that the fields are placed under a bit_struct or bit_array. We hope to allow their direct usage as dtype with no limitations, so that they can also be used in Ndarrays and thus become easily deployable. Note that in this case, we need to pad a quantized type to the primitive type with the minimum number of bits for storage purposes.

You may wonder what the use case is, considering that no memory is saved. In fact, this support mainly targets the acceleration of atomic operations on mobile phones, by replacing float32 with 32-bit fixed-point numbers. Meanwhile, it enables experimenting with different precisions and provides a basis for subsequent tasks.
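
A hypothetical sketch of this direct usage (the exact spellings here are proposals, not an existing API):

fx32 = ti.types.quant.fixed(frac=32, range=100.0)  # hypothetical 32-bit fixed-point type
a = ti.ndarray(dtype=fx32, shape=1024)             # each element padded to one 32-bit word

@ti.kernel
def accumulate(arr: ti.types.ndarray()):
    for i in arr:
        ti.atomic_add(arr[i], 0.5)  # would lower to a native integer atomic add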

Subtask B.2: Add a quantized vector type

To enable the main advantage of quantized types, saving memory, we hope to add a quantized vector type ti.types.quant.vector(n, dtype), where dtype must be one of ti.types.quant.int/float/fixed. The whole type will be padded to the primitive type with the minimum number of bits that can hold n elements of dtype. This targets common cases like packing two or three components of some physical quantity together.
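
For example, a hypothetical sketch (ti.types.quant.vector is a proposal here, not an implemented API; parameter names are illustrative):

fx10 = ti.types.quant.fixed(frac=10, range=2.0)  # 10-bit fixed-point component
vec3_q = ti.types.quant.vector(3, fx10)          # 3 x 10 bits, padded to one 32-bit word
vel = ti.field(dtype=vec3_q, shape=1_000_000)    # 4 bytes per vector instead of 12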

Subtask B.3 (optional): Add a quantized struct type

Similar to Subtask B.2, we can add a quantized struct type, ti.types.quant.struct, which was previously referred to as ti.types.bit_struct. This can remain an optional task until a real need arises.

Task C: Add documentation and examples for quantized types

After this step we can have an official announcement of the rebirth of quantized types!

@ailzhang (Contributor)

minor nit: for Subtask B.2, I wonder if ti.types.vector(n, dtype), with ti.types.quant.int/float/fixed added to the whitelist of dtype, would be simpler for users?

@k-ye (Member) commented May 11, 2022

Thanks for writing this up! Overall it looks like a great roadmap. I have a few questions here:

Could you provide an overview of the quant API?


Subtask A.2

I plan to add an API ti.bit_struct_wrapper(number_of_bits, list_of_fields, with_shared_exponent) to solve the inconsistency problem

I wonder if with_shared_exponent is only meaningful for vector/matrix types?

Subtask A.1

type_factory will remain as an internal API at ti.types.quantized_types.type_factory

nit: I feel like we don't have to have both ti.types.quant and ti.types.quantized_types. Maybe just ti.types.quant.type_factory?

@strongoier (Contributor, Author)

minor nit: for Subtask B.2, I wonder if ti.types.vector(n, dtype), with ti.types.quant.int/float/fixed added to the whitelist of dtype, would be simpler for users?

ti.types.vector and ti.types.quant.vector differ in many ways: ti.types.quant.vector is actually stored as a primitive type, has limitations on the number of bits, and can accept quant-only configurations like with_shared_exponent.

I wonder if with_shared_exponent is only meaningful for vector/matrix types?

For structs it can make sense as well...

nit: I feel like we don't have to have both ti.types.quant and ti.types.quantized_types. Maybe just ti.types.quant.type_factory?

ti.types.quant is the actual API we want to expose. As type_factory is hidden, we have to spell out the whole module path ti.types.quantized_types for internal usage.

@k-ye (Member) commented May 11, 2022

we have to spell out the whole module path ti.types.quantized_types for internal usage.

I think there are different ways to handle this: use __all__ to control the public symbols, use quant._type_factory, etc.

@strongoier (Contributor, Author)

I think there are different ways to handle this: use __all__ to control the public symbols, use quant._type_factory, etc.

Ah yes. I was stuck on the assumption that we could not break up the two same-level classes, quant and type_factory. However, this is now a chance to refine things more aggressively.

Now I have a new design: we get rid of the legacy type_factory and directly provide the following APIs: ti.types.quant.int/fixed/float/_custom_int/_custom_float. WDYT @k-ye

BTW, which one seems better: quant.int or quant_int?

@k-ye (Member) commented May 11, 2022

Cool! I prefer quant.int, as those types can then be scoped in the same namespace quant. WDYT? (cc @ailzhang @jim19930609)

@ailzhang (Contributor)

+1 on quant.int!

@yuanming-hu (Member) commented May 11, 2022

I wonder if with_shared_exponent is only meaningful for vector/matrix types?

For structs it can make sense as well...

I feel like in real use cases shared exponents are typically used only in vectors. Do you have an example where you need that in a struct? :-) @strongoier

Another question of mine: if I split 64 bits into x: fixed21, y: fixed22, z: fixed21, can it be expressed as a quantized vector3? See also the RGB565 format in OpenGL etc.: https://www.khronos.org/opengl/wiki/Image_Format

@strongoier (Contributor, Author)

I feel like in real use cases shared exponents are typically used only in vectors. Do you have an example where you need that in a struct? :-)

Not really. My point here is just that we don't have to throw an error if those fields are not grouped as a vector.

Another question of mine: if I split 64 bits into x: fixed21, y: fixed22, z: fixed21, can it be expressed as a quantized vector3?

In fact we hope that elements of a vector have the same type. A quantized struct is needed for this purpose.

@yuanming-hu (Member) commented May 11, 2022

In fact we hope that elements of a vector have the same type. A quantized struct is needed for this purpose.

I see. Thanks for the clarification!

I feel like the user may want to access the components via [] - for example color = (fixed5, fixed6, fixed5) and the user writes luminance = a[0] + a[1] + a[2]. Do we plan to support that? :-)

@strongoier (Contributor, Author)

I feel like the user may want to access the components via [] - for example color = (fixed5, fixed6, fixed5) and the user writes luminance = a[0] + a[1] + a[2]. Do we plan to support that? :-)

Yep. It is fine to support that as syntax sugar.

@yuanming-hu (Member)

Yep. It is fine to support that as syntax sugar.

I'm thinking about this: for ti.types.quant.vector(n, dtype), can dtype be a list of quantized types? For example, we may want to allow something like rgb565 = ti.types.quant.vector(3, [fixed5, fixed6, fixed5]) :-) Then it's not simply syntax sugar, but a real vector type. (Are we worrying about dynamic indexing here?)

@strongoier (Contributor, Author)

I'm thinking about this: for ti.types.quant.vector(n, dtype), can dtype be a list of quantized types? For example, we may want to allow something like rgb565 = ti.types.quant.vector(3, [fixed5, fixed6, fixed5]) :-) Then it's not simply syntax sugar, but a real vector type. (Are we worrying about dynamic indexing here?)

I understand your point here. TBH this touches on some underlying design philosophy of Taichi, which I get confused about from time to time.

As far as I understand, in earlier Taichi a vector was a pure math concept: it promised math operations, but nothing about storage. Because of this, it had great flexibility, allowing components to be non-contiguous and to have different types. Also because of this, it could not be directly mapped to native vector types and could not support dynamic indexing perfectly.

As time goes by, different voices have arisen in the community. Many users consider vectors to be containers of contiguous same-typed values. As a result, many recent or planned efforts go in this direction: dynamic indexing, native types, etc.

However, these two directions are inherently conflicting: giving more support to one of them means giving less support to the other. To avoid going back and forth on design choices, IMHO we need a consistent and clear underlying principle. Then we can easily determine whether a quantized vector can have components with different types.

BTW I have another question: why do we have a struct type in the presence of a vector type that can have components with different types?

@k-ye (Member) commented May 11, 2022

To avoid going back and forth on design choices, IMHO we need a consistent and clear underlying principle.

I also agree with this. We have spent a great amount of time debating this, and concluded that vector/matrix should behave just the way most users would expect: they are containers holding homogeneous elements, are dynamically indexable, and provide linalg methods. Most of the time, using a Taichi vector/matrix should feel no different from using a GLM/GLSL one. This simplifies the user experience, the API design, and the implementation.

If it comes to the point where a non-trivial amount of usage of heterogeneous vectors shows up, then 1) from a storage point of view, this could supposedly be implemented via quant structs; and 2) we should consider how to offer a proxy/adaptor to help users convert between such a quant struct and vectors (in the mathematical sense). WDYT?

@strongoier (Contributor, Author) commented Jul 4, 2022

After yet another discussion with @Hanke98 @k-ye @ailzhang, I would like to share some updates.

In Taichi v1.1, we hope to first release a refined version of quantized types used with SNodes, in order to get this feature officially announced and tried by users. The TODO list of this whole issue has thus been reshuffled by priority, and I'll track the v1.1 blockers here:

  • Quantized types definition refinement.
  • bit_struct SNode refinement.
  • bit_array SNode refinement.
  • Documentation and examples.

Quantized types definition refinement plan

We will present only three basic quantized types to users: ti.types.quant.int/fixed/float.

bit_struct SNode refinement plan

Let me illustrate the API change with the following example. It is an altered version of the ti.bit_struct_wrapper() proposed in #4857 (comment), with a more natural way to express shared exponents.

Common part:

import taichi as ti
ti.init()

u4 = ti.types.quant.int(bits=4, signed=False)
f15 = ti.types.quant.float(exp=5, frac=10)  # 5 exponent bits + 10 fraction bits
f18 = ti.types.quant.float(exp=5, frac=13)  # 5 exponent bits + 13 fraction bits

p = ti.field(dtype=u4)
q = ti.field(dtype=f15)
r = ti.field(dtype=f18)

Old API:

blk = ti.root.dense(ti.i, 4).bit_struct(num_bits=32)
blk.place(p)
blk.place(q, r, shared_exponent=True)

Previous proposal:

ti.root.dense(ti.i, 4).place(ti.bit_struct_wrapper(32, [p, [q, r]]))

New API:

bitpack = ti.BitpackedFields(max_num_bits=32)
bitpack.place(p)
bitpack.place(q, r, shared_exponent=True)
ti.root.dense(ti.i, 4).place(bitpack)
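
For completeness, a sketch of how the packed fields would then be accessed (access syntax is expected to stay the same as for ordinary fields):

@ti.kernel
def fill():
    for i in range(4):
        p[i] = i        # packed into 4 bits of the shared 32-bit word
        q[i] = 0.5 * i  # q and r are stored with a shared exponent
        r[i] = 1.5 * i

fill()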

bit_array SNode refinement plan

  1. bit_array will remain an SNode, but will be renamed to quant_array, considering that "bit array" is usually used to refer to 0/1 arrays.
  2. bit_vectorize is currently an ad-hoc configuration function specifying how many bits are vectorized together, which leads to confusing API explanations like "bit_vectorize(1) means off while bit_vectorize(32) means on". As the vectorization unit is always the physical type, bit_vectorize will be turned into an on/off switch inside a loop config (see the sketch after this list).
  3. The current implementation of bit_vectorize only applies to 0/1 arrays, so it should be turned off by default. Struct-fors on quant_arrays with bit_vectorize off should work properly.
  4. ti.types.quant.fixed should be supported as an element type in quant_arrays.
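
Here is the sketch referenced in item 2, assuming the refined names quant_array and a loop_config switch (illustrative; the final spelling may differ):

u1 = ti.types.quant.int(bits=1, signed=False)  # 1-bit element type
x = ti.field(dtype=u1)
y = ti.field(dtype=u1)
N = 128
# 32 one-bit cells along axis j share one 32-bit physical word.
ti.root.dense(ti.ij, (N, N // 32)).quant_array(ti.j, 32, max_num_bits=32).place(x)
ti.root.dense(ti.ij, (N, N // 32)).quant_array(ti.j, 32, max_num_bits=32).place(y)

@ti.kernel
def assign_vectorized():
    ti.loop_config(bit_vectorize=True)  # on/off switch; the unit is the physical type
    for i, j in x:
        y[i, j] = x[i, j]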

Documentation and examples

We need to write a tutorial on using quantized types, based on what we already have in taichi_elements and quantaichi. Those repos should also be updated to the latest APIs.

@k-ye (Member) commented Jul 4, 2022

Just one comment, I wonder if we can put BitpackedFields under ti.quant as well :-)

@strongoier (Contributor, Author)

Just one comment, I wonder if we can put BitpackedFields under ti.quant as well :-)

Unfortunately, we only have a ti.types.quant module for type definitions, which may not be suitable for BitpackedFields...

@ailzhang ailzhang removed this from the Taichi v1.1.0 milestone Aug 12, 2022
@strongoier (Contributor, Author)

Let me share some updates here. All planned refinements in #4857 (comment) have been realized and announced to users in Taichi v1.1. Furthermore, codegen for quantized types is now independent of SNodes, which allows flexible extensions in the future. A potential future direction is to allow quantized types in Ndarrays (similar to task B in #4857 (comment)), which will be implemented when more real requirements arise.

strongoier added a commit that referenced this issue Sep 16, 2022
…tion (#6074)

Related issue = #5959, #4857

Support for different element types of matrix fields was introduced in
#2135 for quant. As discussed in
#4857 (comment),
the only case we need to support is different element types with **same
compute type**. This PR adds the validity check and removes test cases
which are actually not allowed.


Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>