Note that this simulator is based on Google's TPU, while Eyeriss is from MIT/NVIDIA.
Each PE can access/store data in:
- Local register (scratchpad)
- Spatial flow (accessing data from a neighboring PE)
- Global memory
Each level has a corresponding access latency (sketched below).
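A minimal sketch of that hierarchy; the names and cycle counts below are illustrative assumptions, not numbers from the simulator. Only the ordering (register < neighbor < global memory) is the point:

```python
# Hypothetical per-access latencies, in cycles, for each level a PE can reach.
# The exact values are made up for illustration; the ranking is what matters.
ACCESS_LATENCY_CYCLES = {
    "local_register": 1,    # PE's own scratchpad
    "neighbor_pe":    2,    # spatial flow from an adjacent PE
    "global_memory":  100,  # shared off-array buffer
}
```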
Each PE is able to:
- Load data from its own register, from neighboring PEs, or from global memory
- Multiply data
- Add data
- Store data in its register, pass it to neighboring PEs, or write it back to global memory (see the sketch after this list)
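A hypothetical PE class capturing those abilities; the field and method names are mine, not the simulator's:

```python
class PE:
    """Minimal processing element: a scratchpad slot per operand plus a MAC."""

    def __init__(self):
        self.weight = 0.0  # value held in the local register
        self.psum = 0.0    # running partial sum

    def load_weight(self, value):
        # Value may come from its own register, a neighbor, or global memory.
        self.weight = value

    def mac(self, activation):
        # Multiply, then add into the partial sum.
        self.psum += self.weight * activation

    def store(self):
        # Result can stay local, go to a neighbor, or go back to global memory.
        return self.psum
```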
Data can be one of:
- Input feature map (typically the largest)
- Weights/filters (the filter is multiplied element-wise with an input window, the products are added up and saved as one output value, then the filter moves across the input in strides; see the sketch after this list)
- Output feature map (smaller than the input; it shrinks further with a larger stride or a larger filter)
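The output size follows O = (I - K) / S + 1 per dimension (input size I, filter size K, stride S), which is why a larger filter or stride shrinks the output. A small NumPy sketch of the sliding-window computation described in the weights bullet:

```python
import numpy as np

def conv2d_valid(ifmap, kernel, stride=1):
    """Slide the filter over the input: multiply element-wise, sum, step by stride."""
    ih, iw = ifmap.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1  # output shrinks as filter size or stride grows
    ow = (iw - kw) // stride + 1
    ofmap = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            window = ifmap[r * stride : r * stride + kh,
                           c * stride : c * stride + kw]
            ofmap[r, c] = np.sum(window * kernel)  # element-wise multiply, then add
    return ofmap

# 6x6 input, 3x3 filter: stride 1 -> 4x4 output, stride 2 -> 2x2 output.
print(conv2d_valid(np.ones((6, 6)), np.ones((3, 3)), stride=2).shape)  # (2, 2)
```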
The accelerator is composed of:
- PEs arranged in a large array
- Global memory with dedicated regions for inputs, filters (aka weights), and outputs (sketched below)
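Putting the two together, a sketch of the top-level structure; the buffer sizes are arbitrary assumptions, and it reuses the hypothetical PE class from above:

```python
import numpy as np

class Accelerator:
    """PE array plus global memory split into input, filter, and output regions."""

    def __init__(self, rows, cols):
        # Uses the PE class sketched earlier.
        self.pe_array = [[PE() for _ in range(cols)] for _ in range(rows)]
        # One global-memory region per data type (sizes chosen arbitrarily).
        self.ifmap_mem = np.zeros(4096)
        self.filter_mem = np.zeros(1024)
        self.ofmap_mem = np.zeros(4096)
```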
The dataflow can be one of the following.
Weight stationary (WS):
- Weights are loaded once and stored in the registers of the different PEs
- Inputs are loaded from global memory every time
- After multiplying in parallel, psums are spatially shared to neighbors for addition
Inputs are then discarded and the final output is written back to memory; a new input is fetched from memory while the local register keeps its weight (see the sketch below). WS is also typically the least efficient dataflow as per yet another Eyeriss paper, but this one claims that OS is more efficient than WS.
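A sketch of this weight-stationary schedule on a 1-D row of PEs, built on the hypothetical PE class from earlier; the chained psum variable stands in for the spatial neighbor-to-neighbor forwarding:

```python
def weight_stationary(weights, input_windows):
    """WS schedule: each PE keeps one weight resident; every input window is
    fetched from global memory; psums flow spatially along the PE row."""
    pes = [PE() for _ in weights]          # uses the PE class sketched above
    for pe, w in zip(pes, weights):
        pe.load_weight(w)                  # loaded once, stays in the register

    outputs = []
    for window in input_windows:           # re-fetched from memory every time
        psum = 0.0                         # psum is forwarded PE -> PE
        for pe, x in zip(pes, window):
            psum += pe.weight * x          # parallel multiply, neighbor add
        outputs.append(psum)               # final output written back to memory
    return outputs

# output[j] = sum_i weights[i] * input_windows[j][i]
```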
Output stationary (OS):
- Weights are loaded from global memory every time
- Inputs are loaded once; after being used, they are spatially shared with the other PEs
- After multiplying, psums are accumulated and stored in each PE's register (see the sketch below)
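A matching sketch of the output-stationary schedule as a 1-D convolution, again on the hypothetical PE class; here the psum never leaves the PE, while inputs shift between neighbors and weights stream in from memory:

```python
def output_stationary(weights, ifmap):
    """OS schedule (1-D conv): each PE accumulates one output in its register;
    weights are re-fetched from memory each step; inputs shift between PEs."""
    n_out = len(ifmap) - len(weights) + 1
    pes = [PE() for _ in range(n_out)]     # uses the PE class sketched above
    regs = list(ifmap[:n_out])             # inputs loaded from memory only once

    for k, w in enumerate(weights):        # weight fetched from memory each step
        for pe, x in zip(pes, regs):
            pe.load_weight(w)
            pe.mac(x)                      # psum stays put in the PE's register
        if k + 1 < len(weights):
            # Spatial sharing: each PE takes its neighbor's input; only the
            # last PE needs a fresh value from global memory.
            regs = regs[1:] + [ifmap[n_out + k]]

    return [pe.store() for pe in pes]      # one finished output per PE

print(output_stationary([1, 2], [1, 1, 1]))  # [3.0, 3.0]
```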