This work aims to port PyTorch to HammerBlade.
This assumes that you have a working HB cosimulation installed through `bsg_bladerunner`. Then:
- Enable `devtoolset-8` or any toolchain that supports C++14.
- Set the following variable to point to your `bsg_bladerunner` clone:

  ```bash
  export BRG_BSG_BLADERUNNER_DIR=<path to bsg_bladerunner that has been set up>
  ```
- Clone the hb-pytorch repo:

  ```bash
  git clone -b hb-device git@github.com:cornell-brg/hb-pytorch.git
  ```
- Create a Python virtual environment:

  ```bash
  python3.6 -m venv ./venv_pytorch
  source ./venv_pytorch/bin/activate  # activate it so the pip installs below target this venv
  ```
- Install dependencies:

  ```bash
  pip install --upgrade pip
  pip install numpy pyyaml mkl mkl-include setuptools cmake cffi typing sklearn tqdm pytest ninja hypothesis thop pillow
  ```
- Remove the automatically installed PyTorch:

  ```bash
  pip uninstall torch
  ```
- Init PyTorch third-party dependencies:

  ```bash
  git submodule update --init --recursive
  ```
- Set up the build environment variables:

  ```bash
  cd hb-pytorch && source setup_cosim_build_env.sh
  ```
- Build PyTorch. This step can take up to 15 minutes:

  ```bash
  python setup.py develop
  ```

  The above command also compiles the device kernels with the RISC-V toolchain and installs the kernel binary. Optionally, the kernels can be compiled with Clang by running the following instead:

  ```bash
  CLANG=1 python setup.py develop
  ```

  Note that `CLANG=1` must be present every time we build or rebuild `hb-pytorch` in order to compile the kernels with Clang. To check whether the current build compiled the kernels with Clang, run:

  ```bash
  readelf -p .comment <hb-pytorch-root>/build/c10/hammerblade/kernel.riscv
  ```

  When compiled with Clang, the output should look like this:

  ```
  String dump of section '.comment':
    [     0]  clang version 10.0.0 (https://github.com/bespoke-silicon-group/llvm-project.git 3ee81f3def2c4c2a818f9f939f4421b3f3af313e)
    [    7a]  GCC: (GNU) 9.2.0
  ```
- PyTorch can be used with cosim by running one of the following executables instead of `python`:
  - `pycosim`: runs Python with the cosim backend
  - `pycosim.trace`: enables device instruction trace
  - `pycosim.wave`: enables device instruction trace AND waveform dumps

  For example, a PyTorch program `foo.py` can be executed with `hb-pytorch`'s cosim backend using one of the following:

  ```bash
  pycosim foo.py
  pycosim.trace foo.py  # To get HB device execution trace
  pycosim.wave foo.py   # To get HB device execution trace and RTL simulation waveform
  ```
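Here `foo.py` stands for any PyTorch program. As a hedged illustration, a minimal script that exercises the HB device might look like this (it assumes only the `.hammerblade()` device copy used elsewhere in this README):

```python
# foo.py -- a minimal, hypothetical HB workload for cosim
import torch

x = torch.rand(2, 2).hammerblade()  # allocate inputs on the HB device
y = torch.rand(2, 2).hammerblade()
print((x + y).cpu())                # the add offloads to HB; result is copied back
```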
To build `hb-pytorch` with the emulation layer instead:

- Clone this repository:

  ```bash
  git clone git@github.com:cornell-brg/hb-pytorch.git
  ```
- Create a Python virtual environment:

  ```bash
  python3 -m venv ./venv_pytorch
  source ./venv_pytorch/bin/activate
  ```
- Install some dependencies:

  ```bash
  pip install numpy pyyaml mkl mkl-include setuptools cmake cffi typing sklearn tqdm pytest ninja hypothesis
  ```
- Init PyTorch third-party dependencies:

  ```bash
  git submodule update --init --recursive
  ```
- Set up the build environment variables:

  ```bash
  source setup_emul_build_env.sh
  ```
- Build PyTorch. This step can take up to 15 minutes:

  ```bash
  python setup.py develop
  ```
- Turn on emulation debug info:

  ```bash
  export HBEMUL_DEBUG=1
  ```

- Set the emulated HB device size:

  ```bash
  export HBEMUL_TILE_X_DIM=16
  export HBEMUL_TILE_Y_DIM=8
  ```

- Go to the hb-pytorch test directory:

  ```bash
  cd hb-pytorch/hammerblade/torch
  ```

- Run pytest:

  ```bash
  python pytest_runner.py
  ```
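Once the tests pass, a quick interactive sanity check of the emulated device is to move a tensor to HB and run a simple op. A minimal sketch, using the `.hammerblade()` device copy shown later in this README:

```python
import torch

x = torch.ones(4)
hx = x.hammerblade()  # copy the tensor to the emulated HB device
hy = hx + hx          # the add runs as an HB kernel
print(hy.cpu())       # copy back; expected: tensor([2., 2., 2., 2.])
```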
Key files and directories:

- `hammerblade/fragments/`
- `hammerblade/environment.mk`
- `baseline-README.md`
- `run-hb-pytest.sh` (source this one to run pytest!)
- `hammerblade/torch/`
- `hammerblade/torch/kernel/`
- `hammerblade/torch/tests/`
- `c10/hammerblade/`
The typical workflow for porting a new operator (here, `sigmoid`) looks like this:

- Register the kernel for HammerBlade with PyTorch by editing `aten/src/ATen/native/native_functions.yaml`:

  ```diff
    func: sigmoid(Tensor self) -> Tensor
    use_c10_dispatcher: full
    supports_named_tensor: True
    variants: function, method
    dispatch:
      CPU: sigmoid
      CUDA: sigmoid
  +   HammerBlade: sigmoid
      MkldnnCPU: mkldnn_sigmoid
  ```

- Add host code to `aten/src/ATen/native/hammerblade/Sigmoid.cpp`. Add the simplest host code possible, without calling the kernel.
- Add tests to `hammerblade/torch/tests/test_sigmoid.py` (a sketch of such a test follows this list).
- With the emulation layer, make sure the code compiles and the tests fail only because of incorrect results.
- Add kernel code to `hammerblade/torch/kernel/kernel_sigmoid.cpp`, again as the simplest code possible.
- Change the host code to be more realistic: call the kernel and do nothing else.
- Implement both the host and kernel code for real, assuming a 1x1 tile group.
- Make sure everything passes on the emulation layer, and write more tests. Then you are ready to create a PR!
- Make sure your code works on COSIM.
- Add optimizations, such as parallelization.
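As a reference for the testing step above, a minimal `test_sigmoid.py` might look like the sketch below. This is a hedged illustration, not the repository's actual test; it assumes only the `.hammerblade()` device copy used elsewhere in this README.

```python
# Hypothetical sketch of hammerblade/torch/tests/test_sigmoid.py
import torch

def test_sigmoid():
    x = torch.randn(10)
    # Run sigmoid on the HammerBlade device...
    h = torch.sigmoid(x.hammerblade())
    # ...and compare against the CPU reference implementation
    assert torch.allclose(h.cpu(), torch.sigmoid(x))
```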
- Maintaining two clones, one for emulation and one for cosim (e.g., `hb-pytorch/` and `hb-pytorch-cosim/`), eases the burden of cosim evaluation. This requires two separate PyTorch environments as well (e.g., `venv_pytorch` and `venv_pytorch_cosim`).
- Ideally, you would only ever need to run once to debug an issue. Use `gdb` extensively with emulation:

  ```
  $ gdb python
  (gdb) b tensorlib_sigmoid
  (gdb) r -m pytest test_sigmoid.py
  ```

  Linking becomes a bottleneck when rebuilding in a tight loop, so `gdb` can save a lot of time compared to printf debugging.
- Sometimes new .cpp files are not picked up by CMake. Since kernel authors only ever need to add new files either to `aten/src/ATen/native` or `hammerblade/torch/`, running the following commands might fix the failure:

  ```bash
  touch aten/src/ATen/CMakeLists.txt   # New host code sources
  touch c10/hammerblade/CMakeLists.txt # New device code sources
  ```
Native profiling tools provide ATen operator-level info, including a per-operator execution time breakdown and information about unimplemented HB operators.

- To enable the profiling tools, call `torch.hammerblade.profiler.enable()`.
- To disable the profiling tools, call `torch.hammerblade.profiler.disable()`.
- To test whether the profiling tools are currently running, call `torch.hammerblade.profiler.is_in_ROI()`.

For example:
```python
import torch

# start of ROI
torch.hammerblade.profiler.enable()
x = torch.randn(10)
y = x + x
# end of ROI
torch.hammerblade.profiler.disable()
```
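`is_in_ROI()` can also be used to check programmatically whether the profiler is active. A minimal sketch based on the calls above (assuming it returns a truthy value inside the ROI):

```python
import torch

torch.hammerblade.profiler.enable()
assert torch.hammerblade.profiler.is_in_ROI()       # profiling is active
torch.hammerblade.profiler.disable()
assert not torch.hammerblade.profiler.is_in_ROI()   # profiling is off
```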
To read the profiling data, call `torch.hammerblade.profiler.stats()`. By default, this returns a string with the per-ATen-operator execution times (`ExecTime`) and the unimplemented operators (`Unimpl`). One may also pass in a list using the keyword argument `key`. Available options are `ExecTime`, `ExecTime-Latex`, `ExecTime-Raw`, and `Unimpl`:
```python
import torch

torch.hammerblade.profiler.enable()
x = torch.randn(10)
torch.hammerblade.profiler.disable()
print(torch.hammerblade.profiler.stats(key=['ExecTime-Raw'], trimming=True))
```
Here, `trimming` is a "simulated time" correction mechanism.
HB emulation can output a file with the list of kernel calls, along with associated data, in JSON format. This can be used as follows:
```python
import torch
import torch.hammerblade.kernel_logger as hblog

x = torch.rand(2, 2).hammerblade()
y = torch.rand(2, 2).hammerblade()

# Enable the log
hblog.enable()
print(x + y)
# Disable the log
hblog.disable()

# This is excluded from the log
print(x - y)

# Logs only the tensor add
print(hblog.json())

# Clear the above operations from the logger
hblog.clear()

hblog.enable()
print(x * y)
hblog.disable()

# Logs only the tensor mul
print(hblog.json())
```
`Chart` provides a way to log the "execution chart" of key kernels in a workload. To use `Chart`, one needs to register one or more ATen operator signatures:
```python
import torch

M = torch.randn(2, 3)
mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 3)

# reset chart
torch.hammerblade.profiler.chart.clear()
# add signature
torch.hammerblade.profiler.chart.add("at::Tensor at::CPUType::{anonymous}::addmm(const at::Tensor&, const at::Tensor&, const at::Tensor&, c10::Scalar, c10::Scalar)")
# turn on profiling
torch.hammerblade.profiler.enable()
# run addmm
torch.addmm(M, mat1, mat2)
# end profiling
torch.hammerblade.profiler.disable()
# dump chart
print(torch.hammerblade.profiler.chart.json())
```
The output should be:

```json
[
  {
    "offload": false,
    "signature": "at::Tensor at::CPUType::{anonymous}::addmm(const at::Tensor&, const at::Tensor&, const at::Tensor&, c10::Scalar, c10::Scalar)"
  }
]
```
One may choose to redispatch a kernel that would otherwise run on the CPU to HB with `Route`. `Route` takes in the JSON produced by `Chart`. To redispatch a kernel, one just needs to change `"offload": false` to `"offload": true`:
```python
import json
import torch

M = torch.randn(2, 3)
mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 3)

route = """[
    {
        "offload": false,
        "signature": "at::Tensor at::CPUType::{anonymous}::addmm(const at::Tensor&, const at::Tensor&, const at::Tensor&, c10::Scalar, c10::Scalar)"
    },
    {
        "offload": true,
        "signature": "at::Tensor at::CPUType::{anonymous}::add(const at::Tensor&, const at::Tensor&, c10::Scalar)"
    }
]
"""

data = json.loads(route)
torch.hammerblade.profiler.route.set_route_from_json(data)

torch.hammerblade.profiler.enable()
torch.addmm(M, mat1, mat2)
# this add should be redispatched to HB
torch.add(M, mat1)
torch.hammerblade.profiler.disable()
```