Releases: tenstorrent/tt-metal
v0.36.1
Metal
Wormhole Bringup
- Added APIs to query device Ethernet connectivity.
- Added the first phase of Ethernet data movement support, with basic unit tests passing on N300.
API Changes
Notes not available.
Tools - Profiler
- Added device-only and host-only profiling options to the profile_this.py script
- Added examples of fast dispatch device program profiling
Tools - Watcher
- Added kernel names/paths to watcher log file
Extra features
Notes not available.
Eager/ttNN
Infrastructure
- Added initial implementation of TTNN APIs
- Added functions to interface with torch: from_torch, to_torch
- Added functions to move tensor to/from device: to_device, from_device
- Added functions to change the layout of the tensor: to_layout
- Added matmul, add, sub, mul, reshape, permute and softmax operations
- Implemented Multi-Head-Attention using TTNN APIs
- Added 3 tutorials to showcase TTNN
- Updated the documentation to describe TTNN and its APIs
Operations
The following on-device operators were added to the tt_lib.tensor module:
- interleave repeat
- triu
- tril
- rmsnorm
- groupnorm
- silu (updated to be a first-class unary operator)
Models
- For the BERT demo, added loading of cached pre-processed weights (stored as TT tensors) to avoid converting from Torch to TT tensors.
- Added a demo for ResNet that executes on TT hardware. The demo takes images from ImageNet and processes them in batches of 8.
v0.35.0
Metal
Wormhole Bringup
- Extended gtests to run on all available devices in Wormhole systems.
- Single device tests passing on remote chips.
API Changes
- These 2 functions:
  uint32_t CreateSemaphore(Program &program, const CoreRange &core_range, uint32_t initial_value)
  uint32_t CreateSemaphore(Program &program, const CoreRangeSet &core_range_set, uint32_t initial_value)
  have been replaced by:
  uint32_t CreateSemaphore(Program &program, const std::variant<CoreRange,CoreRangeSet> &core_spec, uint32_t initial_value)
- These 3 functions:
  void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreCoord &logical_core, const std::vector<uint32_t> &runtime_args)
  void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreRange &core_range, const std::vector<uint32_t> &runtime_args)
  void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreRangeSet &core_range_set, const std::vector<uint32_t> &runtime_args)
  have been replaced by:
  void SetRuntimeArgs(const Program &program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::vector<uint32_t> &runtime_args)
- These 2 functions:
  KernelID CreateDataMovementKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::optional<DataMovementConfig> &config = {})
  KernelID CreateComputeKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::optional<ComputeConfig> &config = {})
  have been replaced by:
  KernelID CreateKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::variant<DataMovementConfig,ComputeConfig> &config)
  (a combined usage sketch follows this list)
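To make the consolidation concrete, here is a minimal sketch of the new variant-based entry points. The kernel paths, core coordinates, and argument values are illustrative, and Program construction is assumed to be plain default construction in this release:

```cpp
#include <cstdint>

#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

void build_program() {
    Program program;  // assumed default-constructible in this release

    // Any of CoreCoord, CoreRange, or CoreRangeSet can now be passed as the
    // core spec; a single core is used here for illustration.
    CoreCoord core = {0, 0};

    // One CreateKernel entry point; the config variant selects whether a
    // DataMovement or Compute kernel is built (paths are hypothetical).
    KernelID reader_kernel = CreateKernel(
        program, "kernels/dataflow/reader.cpp", core, DataMovementConfig{});
    KernelID compute_kernel = CreateKernel(
        program, "kernels/compute/eltwise.cpp", core, ComputeConfig{});

    // One SetRuntimeArgs entry point for any core spec (values illustrative).
    SetRuntimeArgs(program, reader_kernel, core, {0x1000, 64});
    SetRuntimeArgs(program, compute_kernel, core, {64});

    // One CreateSemaphore entry point for CoreRange or CoreRangeSet.
    uint32_t semaphore_id = CreateSemaphore(program, CoreRange{core, core}, 0);
    (void)semaphore_id;
}
```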
Tools - Profiler
- Improved the profile_this.py log management strategy to avoid overly conservative log folder checks during profiling
Extra features
- Runtime Compute Args: arguments can now be sent to Compute Kernels at runtime in the same way as to DataMovement Kernels. The kernel uses the same get_arg_val<type>(<index>) call to retrieve them, and the host uses the same tt_metal::SetRuntimeArgs(Program program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::vector<uint32_t> &runtime_args) that it already uses to communicate with DataMovement Kernels. A kernel-side sketch follows.
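A device-side sketch for reference; the header path follows tt-metal's compute-kernel convention, and the runtime arg at index 0 (a per-core tile count) is hypothetical:

```cpp
// Compute kernel (device side): runtime args are fetched with the same
// get_arg_val call that DataMovement kernels use.
#include "compute_kernel_api/common.h"

namespace NAMESPACE {
void MAIN {
    // Hypothetical runtime arg: number of tiles this core should process.
    uint32_t num_tiles = get_arg_val<uint32_t>(0);
    for (uint32_t t = 0; t < num_tiles; ++t) {
        // ... per-tile compute work ...
    }
}
}  // namespace NAMESPACE
```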
Eager (Ops)
There have been no notable changes to communicate in this release.
Models
- Moved code that implements and tests models from tests/models to the top-level models folder. In the models folder, models are separated into demos (working models with end-to-end demo code) and experimental (models that are under development).
- Added an implementation of Falcon7B for GS, plus PyTorch demos for nanoGPT and T5
- Added a BERT Large end-to-end demo on GS (set up for question answering)
v0.34.0
Metal
API Changes
- CreateDevice: the device_id type has changed from int to chip_id_t.
- CreateCircularBuffer: the three previous variants, which differed only by their CoreCoord, CoreRange, or CoreRangeSet parameter, have been compressed into one user-facing CreateCircularBuffer function parameterized with std::variant<CoreCoord,CoreRange,CoreRangeSet>. It now accepts a CircularBufferConfig, which specifies the size, data format, and page size per buffer index. The return type has been updated from a CircularBuffer object to CircularBufferID (uintptr_t). See the sketch after this list.
- GetCircularBufferConfig: new function to retrieve a reference to the configuration of a CircularBuffer, allowing the CircularBuffer config to be updated. Updates take effect on the next call to LaunchProgram.
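A minimal sketch of the resulting flow, assuming a bfloat16 tile size and buffer index 0 (both illustrative); the set_total_size update at the end is an assumed setter shown only to illustrate the update path:

```cpp
#include <cstdint>

#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

void configure_cb(Program &program, const CoreCoord &core) {
    constexpr uint8_t cb_index = 0;
    constexpr uint32_t tile_bytes = 2 * 32 * 32;  // one bfloat16 tile

    // Size, data format, and page size are now carried by the config object.
    CircularBufferConfig cb_config =
        CircularBufferConfig(2 * tile_bytes, {{cb_index, tt::DataFormat::Float16_b}})
            .set_page_size(cb_index, tile_bytes);

    // One call for CoreCoord, CoreRange, or CoreRangeSet; returns an ID
    // rather than a CircularBuffer object.
    CircularBufferID cb_id = CreateCircularBuffer(program, core, cb_config);

    // The config can be fetched by ID and updated; changes take effect on
    // the next LaunchProgram call.
    CircularBufferConfig &live_config = GetCircularBufferConfig(program, cb_id);
    live_config.set_total_size(4 * tile_bytes);  // assumed setter
}
```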
Tools - Profiler
Tracy Python Support: profile Python-side code with Tracy. As with cProfile, the standard Python profiler module, all Python function calls are picked up by Tracy; additionally, TT's bound C++ calls are picked up automatically. The entire Python script, or just desired parts of it, can be profiled at either the function or line level.
Extra features
Runtime Compute Args: arguments can be sent to Compute Kernels at runtime. The kernel uses the same get_arg_val<type>(<index>) API to retrieve them, and the host uses the same tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core,CoreRange>, <vector of u32 runtime args>) call as for DataMovement Kernels.
Eager (Ops)
Notes not yet available.
Models
- metal_BERT_large_15: updated the model implementation to use the tt-DNN embedding operation, which executes on the GS device. Previously this model used the PyTorch embedding operation executing on CPU.
- Falcon7b: added an end-to-end demo running on a GS device. The demo takes a text prompt and returns text generated by the model to complete the prompt. It works by pre-filling the cache with decoded input prompts and then running decode for all users in parallel.
v0.33.0
Metal
Wormhole
- Basic bringup and tests running on WH B0
- Harvesting functionality working on WH B0
- Basic fast dispatch functionality working on WH B0
Host API changes
- void StartDebugPrintServer(Device *device, const std::vector<CoreCoord> &cores) is no longer callable
- Device *CreateDevice no longer requires an arch parameter
- New wrapper around the Buffer API so that users don't need to look inside buffer.hpp to figure out how to construct a buffer object (see the sketch after this list):
  Buffer CreateBuffer(Device *device, std::uint64_t size, std::uint64_t page_size, const BufferType buffer_type)
- LaunchKernels renamed to LaunchProgram(Device *device, Program &program) to match EnqueueProgram, and the obsolete stagger_start parameter was removed
- void WriteRuntimeArgsToDevice(Device *device, const Program &program) moved to the detail namespace
- bool CompileProgram(Device *device, Program &program) moved to the detail namespace
- bool ConfigureDeviceWithProgram(Device *device, const Program &program) moved to the detail namespace
- bool InitializeDevice(Device *device) removed
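A minimal sketch of the wrapped buffer flow under the renamed launch API; the sizes, the DRAM buffer type, and the default-constructed Program are illustrative assumptions:

```cpp
#include <cstdint>

#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

int main() {
    Device *device = CreateDevice(0);  // arch parameter no longer required

    // CreateBuffer hides the Buffer constructor details from buffer.hpp.
    constexpr std::uint64_t page_size = 2 * 32 * 32;   // one bfloat16 tile
    constexpr std::uint64_t total_size = 64 * page_size;
    Buffer buffer = CreateBuffer(device, total_size, page_size, BufferType::DRAM);

    Program program;  // assumed default-constructible in this release
    // ... create kernels, circular buffers, and set runtime args ...

    // LaunchKernels is now LaunchProgram, matching EnqueueProgram.
    LaunchProgram(device, program);

    CloseDevice(device);
    return 0;
}
```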
Profiler
- Fixed a device-side bug to support the new FW init process in fast and slow dispatch.
- Cleaned up RISC FW to avoid unnecessary function wrappers.
Watcher
- Added more waypoints to the watcher and added access methods to the SoC descriptor (e.g., for harvesting)
- Added some NoC sanitization and checks
- Bug fixes: don't read registers during kernel runs, don't include WH headers on GS, and allow zero-length transactions
Feature: Runtime Compute Args
- Arguments can be sent to Compute Kernels at runtime in the same way as DataMovement Kernels.
- The kernel uses the same get_arg_val<type>(<index>) API to retrieve them.
- The host uses the same tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core, CoreRange>, <vector of u32 runtime args>) call as for DataMovement Kernel communication.
Eager (Ops)
- Added support for overriding runtime args and circular buffers
- Added support for saving and loading tensors
- Added support for uint32 tensors
Models
- 5+% increase in BERT Large performance on bare metal machines.
- 15+% increase in LLaMA 7B performance on bare metal machines.