diff --git a/docs/data_layout.rst b/docs/data_layout.rst deleted file mode 100644 index 1f952a4e9aa1e..0000000000000 --- a/docs/data_layout.rst +++ /dev/null @@ -1,113 +0,0 @@ -Advanced data layouts -=========================== - -Memory layout is key to performance, especially for memory-bound applications. -A carefully designed data layout can significantly improve cache/TLB-hit rates and cacheline utilization. - -We suggested starting with the default layout specification (simply by specifying ``shape`` when creating tensors using ``ti.var/Vector/Matrix``), -and then migrate to more advanced layouts using the ``ti.root.X`` syntax. - -Taichi decouples algorithms from data layouts, and the Taichi compiler automatically optimizes data accesses -on a specific data layout. These Taichi features allow programmers to quickly experiment with different data layouts -and figure out the most efficient one on a specific task and computer architecture. - - -The default data layout using ``shape`` -------------------------------------------------------- - -By default, when allocating a ``ti.var`` , it follows the most naive data layout - -.. code-block:: python - - val = ti.var(ti.f32, shape=(32, 64, 128)) - # C++ equivalent: float val[32][64][128] - -Or equivalently, the same data layout can be specified using advanced `data layout description`: - -.. code-block:: python - - # Create the global tensor - val = ti.var(ti.f32) - # Specify the shape and layout - ti.root.dense(ti.ijk, (32, 64, 128)).place(val) - -However, oftentimes this data layout is suboptimal for computer graphics tasks. -For example, ``val[i, j, k]`` and ``val[i + 1, j, k]`` are very far away (``32 KB``) from each other, -and leads to poor access locality under certain computation tasks. Specifically, -in tasks such as texture trilinear interpolation, the two elements are not even within the same ``4KB`` pages, -creating a huge cache/TLB pressure. - -Advanced data layout specification --------------------------------------- - -A better layout might be - -.. code-block:: python - - val = ti.var(ti.f32) - ti.root.dense(ti.ijk, (8, 16, 32)).dense(ti.ijk, (4, 4, 4)).place(val) - -This organizes ``val`` in ``4x4x4`` blocks, so that with high probability ``val[i, j, k]`` and its neighbours are close to each other (i.e., in the same cacheline or memory page). - -Examples ------------ - -2D matrix, row-major - -.. code-block:: python - - A = ti.var(ti.f32) - ti.root.dense(ti.ij, (256, 256)).place(A) - -2D matrix, column-major - -.. code-block:: python - - A = ti.var(ti.f32) - ti.root.dense(ti.ji, (256, 256)).place(A) # Note ti.ji instead of ti.ij - -`8x8` blocked 2D array of size `1024x1024` - -.. code-block:: python - - density = ti.var(ti.f32) - ti.root.dense(ti.ij, (128, 128)).dense(ti.ij, (8, 8)).place(density) - - -3D Particle positions and velocities, arrays-of-structures - -.. code-block:: python - - pos = ti.Vector(3, dt=ti.f32) - vel = ti.Vector(3, dt=ti.f32) - ti.root.dense(ti.i, 1024).place(pos, vel) - # equivalent to - ti.root.dense(ti.i, 1024).place(pos(0), pos(1), pos(2), vel(0), vel(1), vel(2)) - -3D Particle positions and velocities, structures-of-arrays - -.. 
code-block:: python - - pos = ti.Vector(3, dt=ti.f32) - vel = ti.Vector(3, dt=ti.f32) - for i in range(3): - ti.root.dense(ti.i, 1024).place(pos(i)) - for i in range(3): - ti.root.dense(ti.i, 1024).place(vel(i)) - - -Struct-fors on advanced (dense) data layouts ------------------------------------------------ - -Struct-fors on nested dense data structures will automatically follow their data order in memory. For example, if 2D scalar tensor ``A`` is stored in row-major order, - -.. code-block:: python - - for i, j in A: - A[i, j] += 1 - -will iterate over elements of ``A`` following row-major order. If ``A`` is column-major, then the iteration follows the column-major order. - -If ``A`` is blocked, the iteration will happen within each block first. This maximizes memory bandwidth utilization in most cases. - -Struct-fors on sparse tensors follows the same philosophy, and will be discussed further in :ref:`sparse`. diff --git a/docs/index.rst b/docs/index.rst index 7320fb8c8d6a2..93508b794dbf9 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -27,6 +27,7 @@ The Taichi Programming Language scalar_tensor vector matrix + snode .. toctree:: @@ -34,7 +35,7 @@ The Taichi Programming Language :maxdepth: 3 meta - data_layout + layout sparse differentiable_programming odop diff --git a/docs/internal.rst b/docs/internal.rst index 996767a3bdcc9..3f7a2abada84a 100644 --- a/docs/internal.rst +++ b/docs/internal.rst @@ -1,9 +1,6 @@ Internal designs (WIP) ====================== -Vector type system ------------------- - Intermediate representation --------------------------- diff --git a/docs/layout.rst b/docs/layout.rst new file mode 100644 index 0000000000000..fcb6a6b8b4849 --- /dev/null +++ b/docs/layout.rst @@ -0,0 +1,245 @@ +.. _layout: + +Advanced dense layouts +====================== + +Tensors (:ref:`scalar_tensor`) can be *placed* in a specific shape and *layout*. +Defining a proper layout can be critical to performance, especially for memory-bound applications. A carefully designed data layout can significantly improve cache/TLB-hit rates and cacheline utilization. Although when performance is not the first priority, you probably don't have to worry about it. + +In Taichi, the layout is defined in a recursive manner. See :ref:`snode` for more details about how this works. We suggest starting with the default layout specification (simply by specifying ``shape`` when creating tensors using ``ti.var/Vector/Matrix``), +and then migrate to more advanced layouts using the ``ti.root.X`` syntax if necessary. + +Taichi decouples algorithms from data layouts, and the Taichi compiler automatically optimizes data accesses on a specific data layout. These Taichi features allow programmers to quickly experiment with different data layouts and figure out the most efficient one on a specific task and computer architecture. + + +From ``shape`` to ``ti.root.X`` +------------------------------- + +For example, this declares a 0-D tensor: + +.. code-block:: python + + x = ti.var(ti.f32) + ti.root.place(x) + # is equivalent to: + x = ti.var(ti.f32, shape=()) + +This declares a 1D tensor of size ``3``: + +.. code-block:: python + + x = ti.var(ti.f32) + ti.root.dense(ti.i, 3).place(x) + # is equivalent to: + x = ti.var(ti.f32, shape=3) + +This declares a 2D tensor of shape ``(3, 4)``: + +.. code-block:: python + + x = ti.var(ti.f32) + ti.root.dense(ti.ij, (3, 4)).place(x) + # is equivalent to: + x = ti.var(ti.f32, shape=(3, 4)) + +You may wonder, why not simply specify the ``shape`` of the tensor? 
Why bother using the more complex version?
+Good question! Let's move forward and figure out why.
+
+
+Row-major versus column-major
+-----------------------------
+
+Let's start with the simplest layout.
+
+Since address spaces are linear in modern computers, for 1D Taichi tensors, the address of the ``i``-th element is simply ``i``.
+
+To store a multi-dimensional tensor, however, it has to be flattened in order to fit into the 1D address space.
+For example, to store a 2D tensor of size ``(3, 2)``, there are two ways to do this:
+
+    1. The address of the ``(i, j)``-th element is ``base + i * 2 + j`` (row-major).
+
+    2. The address of the ``(i, j)``-th element is ``base + j * 3 + i`` (column-major).
+
+To specify which layout to use in Taichi:
+
+.. code-block:: python
+
+    ti.root.dense(ti.i, 3).dense(ti.j, 2).place(x)   # row-major (default)
+    ti.root.dense(ti.j, 2).dense(ti.i, 3).place(y)   # column-major
+
+Both ``x`` and ``y`` have the same shape ``(3, 2)``, and they can be accessed in the same manner: ``x[i, j]`` and ``y[i, j]``, where ``0 <= i < 3 && 0 <= j < 2``.
+However, they have very different memory layouts:
+
+.. code-block::
+
+    #     address low ................................. address high
+    # x:  x[0,0]  x[0,1] | x[1,0]  x[1,1] | x[2,0]  x[2,1]
+    # y:  y[0,0]  y[1,0]  y[2,0] | y[0,1]  y[1,1]  y[2,1]
+
+See? ``x`` first increases the second index (i.e. row-major), while ``y`` first increases the first index (i.e. column-major).
+
+.. note::
+
+    For readers coming from C/C++, here is what the two layouts look like:
+
+    .. code-block:: c
+
+        int x[3][2];  // row-major
+        int y[2][3];  // column-major
+
+        for (int i = 0; i < 3; i++) {
+            for (int j = 0; j < 2; j++) {
+                do_something(x[i][j]);
+                do_something(y[j][i]);
+            }
+        }
+
+
+Array of Structures (AoS), Structure of Arrays (SoA)
+----------------------------------------------------
+
+Tensors of the same size can be placed together.
+
+For example, this places two 1D tensors of size ``3`` together (array of structures, AoS):
+
+.. code-block:: python
+
+    ti.root.dense(ti.i, 3).place(x, y)
+
+Their memory layout:
+
+.. code-block::
+
+    #  address low ............. address high
+    #  x[0]  y[0] | x[1]  y[1] | x[2]  y[2]
+
+In contrast, this places the two tensors separately (structure of arrays, SoA):
+
+.. code-block:: python
+
+    ti.root.dense(ti.i, 3).place(x)
+    ti.root.dense(ti.i, 3).place(y)
+
+Now their memory layout:
+
+.. code-block::
+
+    #  address low ............. address high
+    #  x[0]  x[1]  x[2] | y[0]  y[1]  y[2]
+
+
+Normally, you don't have to worry about the performance nuances between different layouts, and can just define the simplest layout as a start.
+However, locality sometimes has a significant impact on performance, especially when the tensors are huge.
+
+**To improve the spatial locality of memory accesses (i.e. the cache-hit rate / cacheline utilization), it is sometimes helpful to place data elements that are often accessed together within relatively close storage locations.**
+Take a simple 1D wave equation solver as an example:
+
+.. code-block:: python
+
+    N = 200000
+    pos = ti.var(ti.f32)
+    vel = ti.var(ti.f32)
+    ti.root.dense(ti.i, N).place(pos)
+    ti.root.dense(ti.i, N).place(vel)
+
+    @ti.kernel
+    def step():
+        for i in range(N):
+            pos[i] += vel[i] * dt
+            vel[i] += -k * pos[i] * dt
+
+Here, we placed ``pos`` and ``vel`` separately, so the distance in address space between ``pos[i]`` and ``vel[i]`` is ``200000`` elements. This results in poor spatial locality and lots of cache misses, which hurts performance.
+A better placement is to place them together:
+
+..
code-block:: python + + ti.root.dense(ti.i, N).place(pos, vel) + +Then ``vel[i]`` is placed right next to ``pos[i]``, this can increase the cache-hit rate and therefore increase the performance. + + +Flat layouts versus hierarchical layouts +------------------------- + +By default, when allocating a ``ti.var``, it follows the simplest data layout. + +.. code-block:: python + + val = ti.var(ti.f32, shape=(32, 64, 128)) + # C++ equivalent: float val[32][64][128] + +However, at times this data layout can be suboptimal for certain types of computer graphics tasks. +For example, ``val[i, j, k]`` and ``val[i + 1, j, k]`` are very far away (``32 KB``) from each other, and leads to poor access locality under certain computation tasks. Specifically, in tasks such as texture trilinear interpolation, the two elements are not even within the same ``4KB`` pages, creating a huge cache/TLB pressure. + +A better layout might be + +.. code-block:: python + + val = ti.var(ti.f32) + ti.root.dense(ti.ijk, (8, 16, 32)).dense(ti.ijk, (4, 4, 4)).place(val) + +This organizes ``val`` in ``4x4x4`` blocks, so that with high probability ``val[i, j, k]`` and its neighbours are close to each other (i.e., in the same cacheline or memory page). + + +Struct-fors on advanced dense data layouts +------------------------------------------ + +Struct-fors on nested dense data structures will automatically follow their data order in memory. For example, if 2D scalar tensor ``A`` is stored in row-major order, + +.. code-block:: python + + for i, j in A: + A[i, j] += 1 + +will iterate over elements of ``A`` following row-major order. If ``A`` is column-major, then the iteration follows the column-major order. + +If ``A`` is hierarchical, it will be iterated level by level. This maximizes the memory bandwidth utilization in most cases. + +Struct-for loops on sparse tensors follow the same philosophy, and will be discussed further in :ref:`sparse`. + + +Examples +-------- + +2D matrix, row-major + +.. code-block:: python + + A = ti.var(ti.f32) + ti.root.dense(ti.ij, (256, 256)).place(A) + +2D matrix, column-major + +.. code-block:: python + + A = ti.var(ti.f32) + ti.root.dense(ti.ji, (256, 256)).place(A) # Note ti.ji instead of ti.ij + +`8x8` blocked 2D array of size `1024x1024` + +.. code-block:: python + + density = ti.var(ti.f32) + ti.root.dense(ti.ij, (128, 128)).dense(ti.ij, (8, 8)).place(density) + + +3D Particle positions and velocities, AoS + +.. code-block:: python + + pos = ti.Vector(3, dt=ti.f32) + vel = ti.Vector(3, dt=ti.f32) + ti.root.dense(ti.i, 1024).place(pos, vel) + # equivalent to + ti.root.dense(ti.i, 1024).place(pos(0), pos(1), pos(2), vel(0), vel(1), vel(2)) + +3D Particle positions and velocities, SoA + +.. code-block:: python + + pos = ti.Vector(3, dt=ti.f32) + vel = ti.Vector(3, dt=ti.f32) + for i in range(3): + ti.root.dense(ti.i, 1024).place(pos(i)) + for i in range(3): + ti.root.dense(ti.i, 1024).place(vel(i)) diff --git a/docs/matrix.rst b/docs/matrix.rst index 5bbbb5570ba83..b6fb42ebe4a59 100644 --- a/docs/matrix.rst +++ b/docs/matrix.rst @@ -1,4 +1,4 @@ -.. _linalg: +.. _matrix: Matrices ======== diff --git a/docs/scalar_tensor.rst b/docs/scalar_tensor.rst index 0d69c849c2c2d..18a5f5fa91c87 100644 --- a/docs/scalar_tensor.rst +++ b/docs/scalar_tensor.rst @@ -41,7 +41,7 @@ Declaration .. note:: - Not providing ``shape`` allows you to *place* the tensor as *sparse* tensors, see :ref:`sparse` for more details. 
+ Not providing ``shape`` allows you to *place* the tensor in a layout other than the default *dense*, see :ref:`layout` for more details. .. warning:: diff --git a/docs/snode.rst b/docs/snode.rst new file mode 100644 index 0000000000000..cef443bb73f64 --- /dev/null +++ b/docs/snode.rst @@ -0,0 +1,187 @@ +.. _snode: + +Structural nodes (SNodes) +========================= + +After writing the computation code, the user needs to specify the internal data structure hierarchy. Specifying a data structure includes choices at both the macro level, dictating how the data structure components nest with each other and the way they represent sparsity, and the micro level, dictating how data are grouped together (e.g. structure of arrays vs. array of structures). +Our language provides *structural nodes (SNodes)* to compose the hierarchy and particular properties. These constructs and their semantics are listed below: + +* dense: A fixed-length contiguous array. + +* bitmasked: This is similar to dense, but it also uses a mask to maintain sparsity information, one bit per child. + +* pointer: Store pointers instead of the whole structure to save memory and maintain sparsity. + +* dynamic: Variable-length array, with a predefined maximum length. It serves the role of ``std::vector`` in C++ or ``list`` in Python, and can be used to maintain objects (e.g. particles) contained in a block. + +See :ref:`layout` for more details about data layout. ``ti.root`` is the root node of the data structure. + +.. function:: snode.place(x, ...) + + :parameter snode: (SNode) where to place + :parameter x: (tensor) tensor(s) to be placed + :return: (SNode) the ``snode`` itself + + The following code places two 0-D tensors named ``x`` and ``y``: + + :: + + x = ti.var(dt=ti.i32) + y = ti.var(dt=ti.f32) + ti.root.place(x, y) + +.. function:: tensor.shape() + + :parameter tensor: the tensor + :return: (tuple of integers) the shape of tensor + + For example, + + :: + + ti.root.dense(ti.ijk, (3, 5, 4)).place(x) + x.shape() # returns (3, 5, 4) + +.. function:: snode.get_shape(index) + + :parameter snode: (SNode) + :parameter index: axis (0 for ``i`` and 1 for ``j``) + :return: (scalar) the size of tensor alone that axis + + Equivalent to ``tensor.shape()[i]``. + + :: + + ti.root.dense(ti.ijk, (3, 5, 4)).place(x) + x.snode().get_shape(0) # 3 + x.snode().get_shape(1) # 5 + x.snode().get_shape(2) # 4 + +.. function:: tensor.dim() + + :parameter tensor: the tensor + :return: (scalar) the dimensionality of the tensor + + Equivalent to ``len(tensor.shape())``. + + :: + + ti.root.dense(ti.ijk, (8, 9, 10)).place(x) + x.dim() # 3 + +.. function:: snode.parent() + + :parameter snode: (SNode) + :return: (SNode) the parent node of ``snode`` + + :: + + blk1 = ti.root.dense(ti.i, 8) + blk2 = blk1.dense(ti.j, 4) + blk3 = blk2.bitmasked(ti.k, 6) + blk1.parent() # ti.root + blk2.parent() # blk1 + blk3.parent() # blk2 + + TODO: add tensor.parent(), and add see also ref here + + +Node types +---------- + + +.. 
function:: snode.dense(indices, shape) + + :parameter snode: (SNode) parent node where the child is derived from + :parameter indices: (Index or Indices) indices used for this node + :parameter shape: (scalar or tuple) shape the tensor of vectors + :return: (SNode) the derived child node + + The following code places a 1-D tensor of size ``3``: + + :: + + x = ti.var(dt=ti.i32) + ti.root.dense(ti.i, 3).place(x) + + The following code places a 2-D tensor of shape ``(3, 4)``: + + :: + + x = ti.var(dt=ti.i32) + ti.root.dense(ti.ij, (3, 4)).place(x) + + .. note:: + + If ``shape`` is a scalar and there are multiple indices, then ``shape`` will + be automatically expanded to fit the number of indices. For example, + + :: + + snode.dense(ti.ijk, 3) + + is equivalent to + + :: + + snode.dense(ti.ijk, (3, 3, 3)) + + +.. function:: snode.dynamic(index, size, chunk_size = None) + + :parameter snode: (SNode) parent node where the child is derived from + :parameter index: (Index) the ``dynamic`` node indices + :parameter size: (scalar) the maximum size of the dynamic node + :parameter chunk_size: (optional, scalar) the number of elements in each dynamic memory allocation chunk + :return: (SNode) the derived child node + + ``dynamic`` nodes acts like ``std::vector`` in C++ or ``list`` in Python. + Taichi's dynamic memory allocation system allocates its memory on the fly. + + The following places a 1-D dynamic tensor of maximum size ``16``: + + :: + + ti.root.dynamic(ti.i, 16).place(x) + + + +.. function:: snode.bitmasked +.. function:: snode.pointer +.. function:: snode.hash + + TODO: add descriptions here + +Working with ``dynamic`` SNodes +------------------------------- + +.. function:: ti.length(snode, indices) + + :parameter snode: (SNode, dynamic) + :parameter indices: (scalar or tuple of scalars) the ``dynamic`` node indices + :return: (scalar) the current size of the dynamic node + + +.. function:: ti.append(snode, indices, val) + + :parameter snode: (SNode, dynamic) + :parameter indices: (scalar or tuple of scalars) the ``dynamic`` node indices + :parameter val: (depends on SNode data type) value to store + :return: (``int32``) the size of the dynamic node, before appending + + Inserts ``val`` into the ``dynamic`` node with indices ``indices``. + + + +Indices +------- + +.. function:: ti.i +.. function:: ti.j +.. function:: ti.k +.. function:: ti.ij +.. function:: ti.ijk +.. function:: ti.ijkl +.. function:: ti.indices(a, b, ...) + +(TODO) diff --git a/docs/syntax.rst b/docs/syntax.rst index 86d1977915888..57131ae208bef 100644 --- a/docs/syntax.rst +++ b/docs/syntax.rst @@ -29,6 +29,9 @@ The return value will be automatically casted into the hinted type. e.g., res = add_xy(2.3, 1.1) print(res) # 3, since return type is ti.i32 +.. note:: + For differentiable programming kernels should better have either serial statements or a single parallel for-loop. If you don't use differentiable programming, feel free to ignore this tip. + .. note:: For now, we only support one scalar as return value. Returning ``ti.Matrix`` or `ti.Vector`` is not supported. Python-style tuple return is not supported. e.g.: @@ -44,7 +47,6 @@ The return value will be automatically casted into the hinted type. e.g., y = 0.5 return x, y # ERROR! - .. note:: For correct gradient behaviors in differentiable programming, please refrain from using kernel return values. Instead, store the result into a global variable (e.g. ``loss[None]``). @@ -54,33 +56,33 @@ The return value will be automatically casted into the hinted type. 
e.g., .. code-block:: python - @ti.kernel - def a_hard_kernel_to_auto_differentiate(): - sum = 0 - for i in x: - sum += x[i] - for i in y: - y[i] = sum + @ti.kernel + def a_hard_kernel_to_auto_differentiate(): + sum = 0 + for i in x: + sum += x[i] + for i in y: + y[i] = sum - # instead, split it into multiple kernels to be nice to the Taichi autodiff compiler: + # instead, split it into multiple kernels to be nice to the Taichi autodiff compiler: - @ti.kernel - def reduce(): - for i in x: - sum[None] += x[i] + @ti.kernel + def reduce(): + for i in x: + sum[None] += x[i] - @ti.kernel - def assign() - for i in y: - y[i] = sum[None] + @ti.kernel + def assign() + for i in y: + y[i] = sum[None] - def main(): - with ti.Tape(loss): - ... - sum[None] = 0 - reduce() - assign() - ... + def main(): + with ti.Tape(loss): + ... + sum[None] = 0 + reduce() + assign() + ... Functions @@ -150,25 +152,27 @@ Scalar arithmetics ----------------------------------------- Supported scalar functions: -* ``ti.sin(x)`` -* ``ti.cos(x)`` -* ``ti.asin(x)`` -* ``ti.acos(x)`` -* ``ti.atan2(x, y)`` -* ``ti.cast(x, type)`` -* ``ti.sqrt(x)`` -* ``ti.floor(x)`` -* ``ti.inv(x)`` -* ``ti.tan(x)`` -* ``ti.tanh(x)`` -* ``ti.exp(x)`` -* ``ti.log(x)`` -* ``ti.random(type)`` -* ``abs(x)`` -* ``max(a, b)`` -* ``min(a, b)`` -* ``x ** y`` -* Inplace adds are atomic on global data. I.e., ``a += b`` is equivalent to ``ti.atomic_add(a, b)`` +.. function:: ti.sin(x) +.. function:: ti.cos(x) +.. function:: ti.asin(x) +.. function:: ti.acos(x) +.. function:: ti.atan2(x, y) +.. function:: ti.cast(x, type) +.. function:: ti.sqrt(x) +.. function:: ti.floor(x) +.. function:: ti.ceil(x) +.. function:: ti.inv(x) +.. function:: ti.tan(x) +.. function:: ti.tanh(x) +.. function:: ti.exp(x) +.. function:: ti.log(x) +.. function:: ti.random(type) +.. function:: abs(x) +.. function:: int(x) +.. function:: float(x) +.. function:: max(x, y) +.. function:: min(x, y) +.. function:: pow(x, y) Note: when these scalar functions are applied on :ref:`matrix` and :ref:`vector`, it's applied element-wise, for example: @@ -194,7 +198,17 @@ Note: when these scalar functions are applied on :ref:`matrix` and :ref:`vector` Debugging ------------------------------------------- -Debug your program with ``print(x)``. +Debug your program with ``print(x)``. For example, if ``x`` is ``23``, then it shows: + +.. code-block:: + + [debug] x = 23 + +in the console. + +.. warning:: + + This is not the same as the ``print`` in Python-scope. For now ``print`` in Taichi only takes **scalar numbers** as input. Strings, vectors and matrices are not supported. Please use ``print(v[0]); print(v[1])`` if you want to print a vector. Why Python frontend diff --git a/docs/tensor_matrix.rst b/docs/tensor_matrix.rst index ca029d6b7e839..2f3f3ad10f1e3 100644 --- a/docs/tensor_matrix.rst +++ b/docs/tensor_matrix.rst @@ -28,7 +28,7 @@ Suppose you have a ``128 x 64`` global grid ``A``, each node containing a ``3 x * As you may have noticed, there are two indexing operators ``[]``, the first is for tensor indexing, the second for matrix indexing. * For a tensor ``F`` of element ``ti.Matrix``, make sure you first index the tensor dimensions, and then the matrix dimensions: ``F[i, j, k][0, 2]``. (Assuming ``F`` is a 3D tensor with ``ti.Matrix`` of size ``3x3`` as elements) * ``ti.Vector`` is simply an alias of ``ti.Matrix``. -* See :ref:`linalg` for more on matrices. +* See :ref:`matrix` for more on matrices. 
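To make the two-level indexing described above concrete, here is a minimal sketch. It reuses the ``128 x 64`` grid of ``3 x 2`` matrices from this section; the kernel name ``fill`` and the values written are illustrative assumptions, not part of the original text.

.. code-block:: python

    import taichi as ti

    A = ti.Matrix(3, 2, dt=ti.f32)            # each element is a 3x2 matrix
    ti.root.dense(ti.ij, (128, 64)).place(A)  # 128 x 64 grid of such elements

    @ti.kernel
    def fill():
        for i, j in A:               # first []: tensor (grid) indices
            A[i, j][0, 1] = i + j    # second []: entry (0, 1) of the matrix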
Matrix size diff --git a/docs/vector.rst b/docs/vector.rst index f0a5a049876c4..23cb5fad1f16f 100644 --- a/docs/vector.rst +++ b/docs/vector.rst @@ -6,9 +6,9 @@ Vectors A vector in Taichi can have two forms: - as a temporary local variable. An ``n`` component vector consists of ``n`` scalar values. - - as an element of a global tensor. In this case, the tensor is an N-dimensional array of ``n`` component vectors + - as an element of a global tensor. In this case, the tensor is an N-dimensional array of ``n`` component vectors. -See :ref:`tensor_matrix` for more details. +See :ref:`tensor` for more details. Declaration ----------- @@ -122,6 +122,7 @@ Methods .. function:: a.dot(b) +.. function:: ti.dot(a, b) :parameter a: (Vector) :parameter b: (Vector)
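As a usage sketch for the two forms above (the vectors, their literal values, and the kernel name are illustrative assumptions; both calls are expected to return the same scalar):

.. code-block:: python

    import taichi as ti

    @ti.kernel
    def demo():
        a = ti.Vector([1.0, 2.0, 3.0])
        b = ti.Vector([4.0, 5.0, 6.0])
        d1 = a.dot(b)       # method form: 1*4 + 2*5 + 3*6 = 32.0
        d2 = ti.dot(a, b)   # free-function form, same result
        print(d1)
        print(d2)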