diff --git a/docs/differentiable_programming.rst b/docs/differentiable_programming.rst
index dfded4f865872..cf39c52b6e85a 100644
--- a/docs/differentiable_programming.rst
+++ b/docs/differentiable_programming.rst
@@ -1,7 +1,9 @@
+.. _differentiable:
+
 Differentiable programming
 ==========================
 
-Please check out `the DiffTaichi paper `_ and `video `_ to learn more about Taichi differentiable programming.
+This page is work in progress. Please check out `the DiffTaichi paper `_ and `video `_ to learn more about Taichi differentiable programming.
 
 The `DiffTaichi repo `_ contains 10 differentiable physical simulators built with Taichi differentiable programming.
 
diff --git a/docs/external.rst b/docs/external.rst
index b38f61994cc0c..a122818c6b00e 100644
--- a/docs/external.rst
+++ b/docs/external.rst
@@ -1,10 +1,12 @@
+.. _external:
+
 Interacting with external arrays
-====================================
+================================
 
 Here ``external arrays`` refer to ``numpy.ndarray`` or ``torch.Tensor``.
 
 Conversion between Taichi tensors and external arrays
---------------------------------------------------------
+-----------------------------------------------------
 
 Use ``to_numpy``/``from_numpy``/``to_torch``/``from_torch``:
 
@@ -49,7 +51,7 @@ Use ``to_numpy``/``from_numpy``/``to_torch``/``from_torch``:
 
 Use external arrays as Taichi kernel parameters
--------------------------------------------------
+-----------------------------------------------
 
 The type hint for external array parameters is ``ti.ext_arr()``. Please see the example below. Note that struct-for's on external arrays are not supported.
 
diff --git a/docs/faq.rst b/docs/faq.rst
index 445456adce1de..533604420ef76 100644
--- a/docs/faq.rst
+++ b/docs/faq.rst
@@ -1,5 +1,5 @@
 Frequently Asked Questions
-====================================================
+==========================
 
 **Can a user iterate over irregular topology instead of grids, such as tetrahedra meshes, line segment vertices?**
 These structures have to be represented using 1D arrays in Taichi. You can still iterate over it using `for i in x` or `for i in range(n)`.
 
diff --git a/docs/global_settings.rst b/docs/global_settings.rst
index 84ecbc1e6cd5f..3796c57b4a29b 100644
--- a/docs/global_settings.rst
+++ b/docs/global_settings.rst
@@ -1,5 +1,5 @@
 Global Settings
------------------
+---------------
 
 - Restart the Taichi runtime system (clear memory, destroy all variables and kernels): ``ti.reset()``
 - Eliminate verbose outputs: ``ti.get_runtime().set_verbose(False)``
@@ -10,4 +10,4 @@ Global Settings
 - To specify which GPU to use for CUDA: ``export CUDA_VISIBLE_DEVICES=0``
 - To specify which Arch to use: ``export TI_ARCH=cuda``
 - To print intermediate IR generated: ``export TI_PRINT_IR=1``
-- To print verbosed details: ``export TI_VERBOSE=1``
+- To print verbose details: ``export TI_VERBOSE=1``
diff --git a/docs/internal.rst b/docs/internal.rst
index 3f7a2abada84a..ebdcc770f2eb6 100644
--- a/docs/internal.rst
+++ b/docs/internal.rst
@@ -37,3 +37,27 @@ To print out all statistics in Python:
 .. code-block:: Python
 
     ti.core.print_stat()
+
+
+Why Python frontend
+-------------------
+
+Embedding Taichi in Python has the following advantages:
+
+* Easy to learn. Taichi has a very similar syntax to Python.
+* Easy to run. No ahead-of-time compilation is needed.
+* This design allows people to reuse existing Python infrastructure:
+
+  * IDEs. A Python IDE mostly works for Taichi, with syntax highlighting, syntax checking, and autocomplete.
+  * Package manager (pip). A developed Taichi application can be easily submitted to ``PyPI``, and others can easily set it up with ``pip``.
+  * Existing packages. Interacting with other Python components (e.g. ``matplotlib`` and ``numpy``) is trivial.
+
+* The built-in AST manipulation tools in Python allow us to do magical things, as long as the kernel body can be parsed by the Python parser.
+
+However, this design has drawbacks as well:
+
+* Taichi kernels must be parseable by Python parsers. This means Taichi syntax cannot go beyond Python syntax.
+
+  * For example, indexing is always needed when accessing elements in Taichi tensors, even if the tensor is 0D. Use ``x[None] = 123`` to set the value in ``x`` if ``x`` is 0D. This is because ``x = 123`` in Python syntax will set ``x`` itself (instead of its contained value) to be the constant ``123``, and, unfortunately, we cannot modify this behavior.
+
+* Python has relatively low performance. This can cause performance issues when initializing large Taichi tensors with pure Python scripts. A Taichi kernel should be used to initialize a huge tensor, as in the sketch below.
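+
+A minimal sketch of kernel-based initialization (the tensor shape and values here are made up for illustration):
+
+.. code-block:: python
+
+    x = ti.var(ti.f32)
+    ti.root.dense(ti.ij, 1024).place(x)  # a 1024x1024 tensor
+
+    # Slow: assigning elements one by one from the Python scope
+    # for i in range(1024):
+    #     for j in range(1024):
+    #         x[i, j] = i + j
+
+    # Fast: the assignments run inside a (parallelized) Taichi kernel
+    @ti.kernel
+    def init():
+        for i, j in x:
+            x[i, j] = i + j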
diff --git a/docs/meta.rst b/docs/meta.rst
index 1d3b2e80e98ea..a9bfecb63dc3f 100644
--- a/docs/meta.rst
+++ b/docs/meta.rst
@@ -11,24 +11,38 @@ Taichi provides metaprogramming infrastructures. Metaprogramming can
 
 Taichi kernels are *lazily instantiated* and a lot of computation can happen at *compile-time*. Every kernel in Taichi is a template kernel, even if it has no template arguments.
 
+
+.. _template_metaprogramming:
+
+Template metaprogramming
+------------------------
+
+Use ``ti.template()`` as a type hint to pass a tensor as a kernel argument. For example:
+
+.. code-block:: python
+
+    @ti.kernel
+    def copy(x: ti.template(), y: ti.template()):
+        for i in x:
+            y[i] = x[i]
+
+
 Dimensionality-independent programming using grouped indices
--------------------------------------------------------------
+------------------------------------------------------------
 
 .. code-block:: python
 
-  @ti.kernel
-  def copy(x: ti.template(), y: ti.template()):
-    for I in ti.grouped(y):
-      x[I] = y[I]
-
-  @ti.kernel
-  def array_op(x: ti.template(), y: ti.template()):
-    # If tensor x is 2D
-    for I in ti.grouped(x):  # I is a vector of size x.dim() and data type i32
-      y[I + ti.Vector([0, 1])] = I[0] + I[1]
-    # is equivalent to
-    for i, j in x:
-      y[i, j + 1] = i + j
+    @ti.kernel
+    def copy(x: ti.template(), y: ti.template()):
+        for I in ti.grouped(y):
+            x[I] = y[I]
+
+    @ti.kernel
+    def array_op(x: ti.template(), y: ti.template()):
+        # If tensor x is 2D
+        for I in ti.grouped(x):  # I is a vector of size x.dim() and data type i32
+            y[I + ti.Vector([0, 1])] = I[0] + I[1]
+        # is equivalent to
+        for i, j in x:
+            y[i, j + 1] = i + j
 
 Tensor size reflection
------------------------------------------
diff --git a/docs/snode.rst b/docs/snode.rst
index cef443bb73f64..e9d70603ce5b5 100644
--- a/docs/snode.rst
+++ b/docs/snode.rst
@@ -14,7 +14,8 @@ Our language provides *structural nodes (SNodes)* to compose the hierarchy and p
 
 * dynamic: Variable-length array, with a predefined maximum length. It serves the role of ``std::vector`` in C++ or ``list`` in Python, and can be used to maintain objects (e.g. particles) contained in a block.
 
-See :ref:`layout` for more details about data layout. ``ti.root`` is the root node of the data structure.
+
+See :ref:`layout` for more details. ``ti.root`` is the root node of the data structure.
 
 .. function:: snode.place(x, ...)
@@ -172,6 +173,12 @@ Working with ``dynamic`` SNodes
 
     Inserts ``val`` into the ``dynamic`` node with indices ``indices``.
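+
+For example, a minimal sketch of appending to per-row ``dynamic`` lists (the sizes are made up, and we assume ``x.parent()`` refers to the containing ``dynamic`` node):
+
+.. code-block:: python
+
+    x = ti.var(ti.i32)
+    ti.root.dense(ti.i, 16).dynamic(ti.j, 64).place(x)
+
+    @ti.kernel
+    def add_data():
+        for i in range(16):
+            for j in range(i):
+                ti.append(x.parent(), i, j)  # append j to the list at row i
+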
+Taichi tensors like powers of two
+---------------------------------
+
+Non-power-of-two tensor dimensions are promoted into powers of two and thus these tensors will occupy more virtual address space.
+For example, a (dense) tensor of size ``(18, 65)`` will be materialized as ``(32, 128)``.
+
 Indices
 -------
diff --git a/docs/syntax.rst b/docs/syntax.rst
index 4e3bafbb95d1d..6adab10e14b65 100644
--- a/docs/syntax.rst
+++ b/docs/syntax.rst
@@ -10,15 +10,11 @@ Kernel arguments must be type-hinted. Kernels can have at most 8 parameters, e.g
 
   @ti.kernel
   def print_xy(x: ti.i32, y: ti.f32):
-    print(x + y)
+      print(x + y)
 
-  @ti.kernel
-  def copy(x: ti.template(), y: ti.template()):
-    for i in x:
-      y[i] = x[i]
 
-A kernel can have **scalar** return value. If a kernel has a return value, it must be type-hinted.
-The return value will be automatically casted into the hinted type. e.g.,
+A kernel can have a **scalar** return value. If a kernel has a return value, it must be type-hinted.
+The return value will be automatically cast into the hinted type. For example:
 
 .. code-block:: python
 
@@ -29,94 +25,63 @@ The return value will be automatically casted into the hinted type. e.g.,
   res = add_xy(2.3, 1.1)
   print(res)  # 3, since return type is ti.i32
 
-.. note::
-
-    For differentiable programming kernels should better have either serial statements or a single parallel for-loop. If you don't use differentiable programming, feel free to ignore this tip.
 
 .. note::
 
-    For now, we only support one scalar as return value. Returning ``ti.Matrix`` or `ti.Vector`` is not supported. Python-style tuple return is not supported. e.g.:
+    For now, we only support one scalar as the return value. Returning ``ti.Matrix`` or ``ti.Vector`` is not supported. Python-style tuple return is not supported either. For example:
 
 .. code-block:: python
 
     @ti.kernel
     def bad_kernel() -> ti.Matrix:
-        return ti.Matrix([[1, 0], [0, 1]])  # ERROR!
+        return ti.Matrix([[1, 0], [0, 1]])  # Error
 
     @ti.kernel
     def bad_kernel() -> (ti.i32, ti.f32):
         x = 1
         y = 0.5
-        return x, y  # ERROR!
-
-.. note::
-    For correct gradient behaviors in differentiable programming, please refrain from using kernel return values. Instead, store the result into a global variable (e.g. ``loss[None]``).
-
-(TODO: move the following to advanced topics)
-
-* For differentiable programming kernels should better have either serial statements or a single parallel for-loop. If you don't use differentiable programming, feel free to ignore this tip.
-
-.. code-block:: python
-
-  @ti.kernel
-  def a_hard_kernel_to_auto_differentiate():
-    sum = 0
-    for i in x:
-      sum += x[i]
-    for i in y:
-      y[i] = sum
+        return x, y  # Error
 
-  # instead, split it into multiple kernels to be nice to the Taichi autodiff compiler:
-  @ti.kernel
-  def reduce():
-    for i in x:
-      sum[None] += x[i]
+We also support **template arguments** (see :ref:`template_metaprogramming`) and **external array arguments** (see :ref:`external`) in Taichi kernels; a short sketch follows the warning below.
 
-  @ti.kernel
-  def assign()
-    for i in y:
-      y[i] = sum[None]
+.. warning::
 
-  def main():
-    with ti.Tape(loss):
-      ...
-      sum[None] = 0
-      reduce()
-      assign()
-      ...
+    When using differentiable programming, there are a few more constraints on kernel structures. See the **Kernel Simplicity Rule** in :ref:`differentiable`.
+    Also, please do not use kernel return values in differentiable programming, since the return value will not be tracked by automatic differentiation. Instead, store the result into a global variable (e.g. ``loss[None]``).
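+
+A minimal sketch of an external array argument (the array contents are made up; note the range-for, since struct-for's on external arrays are not supported):
+
+.. code-block:: python
+
+    import numpy as np
+
+    @ti.kernel
+    def double_arr(arr: ti.ext_arr(), n: ti.i32):
+        for i in range(n):  # range-for over the external array
+            arr[i] = arr[i] * 2.0
+
+    a = np.ones(8, dtype=np.float32)
+    double_arr(a, 8)  # the modified values are written back to the numpy array
+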
 Functions
------------------------------------------------
+---------
 
-Use ``@ti.func`` to decorate your Taichi functions. These functions are callable only in `Taichi`-scope. Don't call them in `Python`-scope. All function calls are force-inlined, so no recursion supported.
+Use ``@ti.func`` to decorate your Taichi functions. These functions are callable only in `Taichi`-scope. Do not call them in `Python`-scope.
 
 .. code-block:: python
 
   @ti.func
   def laplacian(t, i, j):
-    return inv_dx2 * (
-        -4 * p[t, i, j] + p[t, i, j - 1] + p[t, i, j + 1] + p[t, i + 1, j] +
-        p[t, i - 1, j])
+      return inv_dx2 * (
+          -4 * p[t, i, j] + p[t, i, j - 1] + p[t, i, j + 1] + p[t, i + 1, j] +
+          p[t, i - 1, j])
 
   @ti.kernel
   def fdtd(t: ti.i32):
-    for i in range(n_grid): # Parallelized over GPU threads
-      for j in range(n_grid):
-        laplacian_p = laplacian(t - 2, i, j)
-        laplacian_q = laplacian(t - 1, i, j)
-        p[t, i, j] = 2 * p[t - 1, i, j] + (
-            c * c * dt * dt + c * alpha * dt) * laplacian_q - p[
-                t - 2, i, j] - c * alpha * dt * laplacian_p
+      for i in range(n_grid):  # Parallelized
+          for j in range(n_grid):  # Serial loops in each parallel thread
+              laplacian_p = laplacian(t - 2, i, j)
+              laplacian_q = laplacian(t - 1, i, j)
+              p[t, i, j] = 2 * p[t - 1, i, j] + (
+                  c * c * dt * dt + c * alpha * dt) * laplacian_q - p[
+                      t - 2, i, j] - c * alpha * dt * laplacian_p
 
 .. warning::
 
-    Functions with multiple ``return``'s are not supported for now. Use a **local** variable to store the results, so that you end up with only one ``return``:
+    Functions with multiple ``return`` statements are not supported for now. Use a **local** variable to store the results, so that you end up with only one ``return`` statement:
 
     .. code-block:: python
 
-      # Bad function - two return's
+      # Bad function - two return statements
       @ti.func
       def safe_sqrt(x):
         if x >= 0:
@@ -124,7 +89,7 @@ Use ``@ti.func`` to decorate your Taichi functions. These functions are callable
         else:
           return 0.0
 
-      # Good function - single return
+      # Good function - single return statement
       @ti.func
       def safe_sqrt(x):
         rst = 0.0
@@ -143,14 +108,9 @@ Use ``@ti.func`` to decorate your Taichi functions. These functions are callable
 
 Function arguments are passed by value.
 
-Data layout
--------------------
-
-Non-power-of-two tensor dimensions are promoted into powers of two and thus these tensors will occupy more virtual address space.
-For example, a tensor of size ``(18, 65)`` will be materialized as ``(32, 128)``.
-
 Scalar arithmetics
------------------------------------------
+------------------
 
 Supported scalar functions:
 
 .. function:: ti.sin(x)
 .. function:: ti.cos(x)
 .. function:: ti.asin(x)
 .. function:: ti.acos(x)
 .. function:: ti.atan2(x, y)
-.. function:: ti.cast(x, type)
+.. function:: ti.cast(x, data_type)
 .. function:: ti.sqrt(x)
 .. function:: ti.floor(x)
 .. function:: ti.ceil(x)
 .. function:: ti.inv(x)
 .. function:: ti.tan(x)
 .. function:: ti.tanh(x)
 .. function:: ti.exp(x)
 .. function:: ti.log(x)
-.. function:: ti.random(type)
+.. function:: ti.random(data_type)
 .. function:: abs(x)
 .. function:: int(x)
 .. function:: float(x)
 .. function:: max(x, y)
 .. function:: min(x, y)
 .. function:: pow(x, y)
 
-Note: when these scalar functions are applied on :ref:`matrix` and :ref:`vector`, it's applied element-wise, for example:
-
-.. code-block:: python
-
-    A = ti.sin(B)
-    # is equalivant to (assuming B is a 3x2 matrix):
-    for i in ti.static(range(3)):
-        for j in ti.static(range(2)):
-            A[i, j] = ti.sin(B[i, j])
-
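+For example, a minimal sketch using a few of these functions (the constants are made up):
+
+.. code-block:: python
+
+    @ti.kernel
+    def arith():
+        x = ti.sqrt(2.0)            # 1.414...
+        y = ti.atan2(1.0, 1.0)      # pi / 4 = 0.785...
+        z = ti.cast(x + y, ti.i32)  # truncated to 2
+        print(z)
+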
 .. note::
 
     Python 3 distinguishes ``/`` (true division) and ``//`` (floor division). For example, ``1.0 / 2.0 = 0.5``, ``1 / 2 = 0.5``, ``1 // 2 = 0``, ``4.2 // 2 = 2``. Taichi follows this design:
 
-    - *true divisions* on integral types will first cast their operands to the default float point type.
-    - *floor divisions* on float-point types will first cast their operands to the default integer type.
+    - **true divisions** on integral types will first cast their operands to the default floating-point type.
+    - **floor divisions** on floating-point types will first cast their operands to the default integer type.
 
     To avoid such implicit casting, you can manually cast your operands to desired types, using ``ti.cast``.
-    Read :ref:`default_precisions` for more details on default numerical types.
+    See :ref:`default_precisions` for more details on default numerical types.
 
-Debugging
--------------------------------------------
-
-Debug your program with ``print(x)``. For example, if ``x`` is ``23``, then it shows:
-
-.. code-block::
-
-    [debug] x = 23
-
-in the console.
+.. note::
 
-.. warning::
+    When these scalar functions are applied on :ref:`matrix` and :ref:`vector`, they are applied in an element-wise manner.
+    For example:
 
-    This is not the same as the ``print`` in Python-scope. For now ``print`` in Taichi only takes **scalar numbers** as input. Strings, vectors and matrices are not supported. Please use ``print(v[0]); print(v[1])`` if you want to print a vector.
+    .. code-block:: python
 
+        B = ti.Matrix([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
 
-Why Python frontend
------------------------------------
+        A = ti.sin(B)
+        # is equivalent to
+        for i in ti.static(range(2)):
+            for j in ti.static(range(3)):
+                A[i, j] = ti.sin(B[i, j])
 
-Embedding the language in ``python`` has the following advantages:
 
-* Easy to learn. Taichi has a very similar syntax to Python.
-* Easy to run. No ahead-of-time compilation is needed.
-* This design allows people to reuse existing python infrastructure:
+Debugging
+---------
 
-  * IDEs. A python IDE mostly works for Taichi with syntax highlighting, syntax checking, and autocomplete.
-  * Package manager (pip). A developed Taichi application and be easily submitted to ``PyPI`` and others can easily set it up with ``pip``.
-  * Existing packages. Interacting with other python components (e.g. ``matplotlib`` and ``numpy``) is just trivial.
+Debug your program with ``print(x)``. For example, if ``x`` is ``23``, then it prints
 
-* The built-in AST manipulation tools in ``python`` allow us to do magical things, as long as the kernel body can be parsed by the Python parser.
+.. code-block::
 
-However, this design has drawbacks as well:
+    [debug] x = 23
 
-* Taichi kernels must parse-able by Python parsers. This means Taichi syntax cannot go beyond Python syntax.
+in the console.
 
-  * For example, indexing is always needed when accessing elements in Taichi tensors, even if the tensor is 0D. Use ``x[None] = 123`` to set the value in ``x`` if ``x`` is 0D. This is because ``x = 123`` will set ``x`` itself (instead of its containing value) to be the constant ``123`` in python syntax, and, unfortunately, we cannot modify this behavior.
+.. warning::
 
-* Python has relatively low performance. This can cause a performance issue when initializing large Taichi tensors with pure python scripts. A Taichi kernel should be used to initialize a huge tensor.
+    This is not the same as the ``print`` in Python-scope. For now ``print`` in Taichi only takes **scalar numbers** as input. Strings, vectors and matrices are not supported. Please use ``print(v[0]); print(v[1])`` if you want to print a vector.
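+
+For example, a minimal sketch of printing a vector component by component (``v`` here is a made-up local vector):
+
+.. code-block:: python
+
+    @ti.kernel
+    def print_vector():
+        v = ti.Vector([2.0, 3.0])
+        print(v[0])  # prints 2.0
+        print(v[1])  # prints 3.0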
diff --git a/docs/type.rst b/docs/type.rst
index bdc5edd517584..726f48c5144a6 100644
--- a/docs/type.rst
+++ b/docs/type.rst
@@ -58,38 +58,42 @@ Binary operations on different types will give you a promoted type, following th
 
 .. _default_precisions:
 
 Default precisions
---------------------------------------
+------------------
 
 By default, numerical literals have 32-bit precisions.
 For example, ``42`` has type ``ti.i32`` and ``3.14`` has type ``ti.f32``.
 
-Default precisions can be specified when initializing Taichi:
+Default integer and floating-point precisions (``default_ip`` and ``default_fp``) can be specified when initializing Taichi:
 
 .. code-block:: python
 
-  ti.init(..., default_fp=ti.f32)
-  ti.init(..., default_fp=ti.f64)
+    ti.init(..., default_fp=ti.f32)
+    ti.init(..., default_fp=ti.f64)
 
-  ti.init(..., default_ip=ti.i32)
-  ti.init(..., default_ip=ti.i64)
+    ti.init(..., default_ip=ti.i32)
+    ti.init(..., default_ip=ti.i64)
 
 Type casts
---------------------------------------
+----------
 
-Use ``ti.cast`` to type-cast scalar values.
+Use ``ti.cast`` to cast scalar values.
 
 .. code-block:: python
 
-  a = 1.4
-  b = ti.cast(a, ti.i32)
-  c = ti.cast(b, ti.f32)
+    a = 1.4
+    b = ti.cast(a, ti.i32)
+    c = ti.cast(b, ti.f32)
+
+    # Equivalently, use ``int()`` and ``float()``
+    # to convert to the default floating-point/integer types
+    b = int(a)
+    c = float(b)
 
-  # Equivalently, use ``int()`` and ``float()``
-  # to converting to default float-point/integer types
-  b = int(a)
-  c = float(b)
+    # Element-wise casts in matrices
+    mat = ti.Matrix([[3.0, 0.0], [0.3, 0.1]])
+    mat_int = mat.cast(int)
+    mat_int2 = mat.cast(ti.i32)
 
-  # Element-wise casts in matrices
-  mat = ti.Matrix([[3.0, 0.0], [0.3, 0.1]])
-  mat_int = mat.cast(int)
-  mat_int2 = mat.cast(ti.i32)
+Use ``ti.bit_cast`` to bit-cast a value into another data type. The underlying bits will be preserved in this cast.
+The new type must have the same width as the old type.
+For example, bit-casting ``i32`` to ``f64`` is not allowed. Use this operation with caution.
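+
+A minimal sketch (assuming ``ti.bit_cast(value, type)`` is called inside a kernel):
+
+.. code-block:: python
+
+    @ti.kernel
+    def bits():
+        x = 3.14
+        y = ti.bit_cast(x, ti.i32)  # reinterpret the 32 bits of the f32 as an i32
+        z = ti.bit_cast(y, ti.f32)  # back to 3.14
+        print(z)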