Releases: tenstorrent/tt-metal
v0.36.1
Metal
Wormhole Bringup
- Added APIs to query device Ethernet connectivity.
- Added the first phase of Ethernet data movement support, with basic unit tests passing on N300.
API Changes
Notes not available.
Tools - Profiler
- Added device-only and host-only profiling options to the profile_this.py script
- Added examples of fast dispatch device program profiling
Tools - Watcher
- Added kernel names/paths to watcher log file
Extra features
Notes not available.
Eager/ttNN
Infrastructure
- Added initial implementation of TTNN APIs
- Added functions to interface with torch: from_torch, to_torch
- Added functions to move tensor to/from device: to_device, from_device
- Added functions to change the layout of the tensor: to_layout
- Added matmul, add, sub, mul, reshape, permute and softmax operations
- Implemented Multi-Head-Attention using TTNN APIs
- Added 3 tutorials to showcase TTNN
- Updated the documentation to describe TTNN and its APIs
Operations
The following on-device operators were added to the tt_lib.tensor module:
- interleave repeat
- triu
- tril
- rmsnorm
- groupnorm
- silu (updated to be a first-class unary operator)
Models
- For the BERT demo, added loading of cached pre-processed weights (stored as TT tensors) to avoid converting from Torch to TT tensors.
- Added a demo for ResNet that executes on TT hardware. The demo takes images from ImageNet and processes them in batches of 8.
v0.35.0
Metal
Wormhole Bringup
- Extended gtests to run on all available devices in Wormhole systems.
- Single device tests passing on remote chips.
API Changes
- These 2 functions:
  uint32_t CreateSemaphore(Program &program, const CoreRange &core_range, uint32_t initial_value)
  uint32_t CreateSemaphore(Program &program, const CoreRangeSet &core_range_set, uint32_t initial_value)
  have been replaced by:
  uint32_t CreateSemaphore(Program &program, const std::variant<CoreRange,CoreRangeSet> &core_spec, uint32_t initial_value)
- These 3 functions:
  void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreCoord &logical_core, const std::vector<uint32_t> &runtime_args)
  void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreRange &core_range, const std::vector<uint32_t> &runtime_args)
  void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreRangeSet &core_range_set, const std::vector<uint32_t> &runtime_args)
  have been replaced by:
  void SetRuntimeArgs(const Program &program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::vector<uint32_t> &runtime_args)
- These 2 functions:
  KernelID CreateDataMovementKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::optional<DataMovementConfig> &config = {})
  KernelID CreateComputeKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::optional<ComputeConfig> &config = {})
  have been replaced by:
  KernelID CreateKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::variant<DataMovementConfig,ComputeConfig> &config)
  (a combined usage sketch follows this list)
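To make the consolidation concrete, here is a minimal sketch of the new variant-based entry points. The kernel paths, core coordinates, and argument values are illustrative, and Program construction is assumed to be plain default construction in this release:

```cpp
#include <cstdint>

#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

void build_program() {
    Program program;  // assumed default-constructible in this release

    // Any of CoreCoord, CoreRange, or CoreRangeSet can now be passed as the
    // core spec; a single core is used here for illustration.
    CoreCoord core = {0, 0};

    // One CreateKernel entry point; the config variant selects whether a
    // DataMovement or Compute kernel is built (paths are hypothetical).
    KernelID reader_kernel = CreateKernel(
        program, "kernels/dataflow/reader.cpp", core, DataMovementConfig{});
    KernelID compute_kernel = CreateKernel(
        program, "kernels/compute/eltwise.cpp", core, ComputeConfig{});

    // One SetRuntimeArgs entry point for any core spec (values illustrative).
    SetRuntimeArgs(program, reader_kernel, core, {0x1000, 64});
    SetRuntimeArgs(program, compute_kernel, core, {64});

    // One CreateSemaphore entry point for CoreRange or CoreRangeSet.
    uint32_t semaphore_id = CreateSemaphore(program, CoreRange{core, core}, 0);
    (void)semaphore_id;
}
```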
Tools - Profiler
- Improved the profile_this.py log management strategy to avoid overly conservative log folder checks during profiling
Extra features
- Runtime Compute Args: arguments can now be sent to Compute Kernels at runtime in the same way as to DataMovement Kernels. The kernel uses the same get_arg_val<type>(<index>) call to retrieve them, and the host uses the same tt_metal::SetRuntimeArgs(Program program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::vector<uint32_t> &runtime_args) that it already uses to communicate with DataMovement Kernels. A kernel-side sketch follows.
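A device-side sketch for reference; the header path follows tt-metal's compute-kernel convention, and the runtime arg at index 0 (a per-core tile count) is hypothetical:

```cpp
// Compute kernel (device side): runtime args are fetched with the same
// get_arg_val call that DataMovement kernels use.
#include "compute_kernel_api/common.h"

namespace NAMESPACE {
void MAIN {
    // Hypothetical runtime arg: number of tiles this core should process.
    uint32_t num_tiles = get_arg_val<uint32_t>(0);
    for (uint32_t t = 0; t < num_tiles; ++t) {
        // ... per-tile compute work ...
    }
}
}  // namespace NAMESPACE
```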
Eager (Ops)
There have been no notable changes to communicate in this release.
Models
- Moved code that implements and tests models from tests/models to the top-level models folder. In the models folder, models are separated into demos (working models with end-to-end demo code) and experimental (models that are under development).
- Added an implementation of Falcon7B for GS, plus PyTorch demos for nanoGPT and T5
- Added a BERT Large end-to-end demo on GS (set up for question answering)
v0.34.0
Metal
API Changes
- CreateDevice: the device_id type has changed from int to chip_id_t.
- CreateCircularBuffer: the three previous variants, which differed only by their CoreCoord, CoreRange, or CoreRangeSet parameter, have been compressed into one user-facing CreateCircularBuffer function parameterized with std::variant<CoreCoord,CoreRange,CoreRangeSet>. It now accepts a CircularBufferConfig, which specifies the size, data format, and page size per buffer index. The return type has been updated from a CircularBuffer object to CircularBufferID (uintptr_t). See the sketch after this list.
- GetCircularBufferConfig: new function to retrieve a reference to the configuration of a CircularBuffer, allowing the CircularBuffer config to be updated. Updates take effect on the next call to LaunchProgram.
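A minimal sketch of the resulting flow, assuming a bfloat16 tile size and buffer index 0 (both illustrative); the set_total_size update at the end is an assumed setter shown only to illustrate the update path:

```cpp
#include <cstdint>

#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

void configure_cb(Program &program, const CoreCoord &core) {
    constexpr uint8_t cb_index = 0;
    constexpr uint32_t tile_bytes = 2 * 32 * 32;  // one bfloat16 tile

    // Size, data format, and page size are now carried by the config object.
    CircularBufferConfig cb_config =
        CircularBufferConfig(2 * tile_bytes, {{cb_index, tt::DataFormat::Float16_b}})
            .set_page_size(cb_index, tile_bytes);

    // One call for CoreCoord, CoreRange, or CoreRangeSet; returns an ID
    // rather than a CircularBuffer object.
    CircularBufferID cb_id = CreateCircularBuffer(program, core, cb_config);

    // The config can be fetched by ID and updated; changes take effect on
    // the next LaunchProgram call.
    CircularBufferConfig &live_config = GetCircularBufferConfig(program, cb_id);
    live_config.set_total_size(4 * tile_bytes);  // assumed setter
}
```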
Tools - Profiler
Tracy Python Support: profile Python-side code with Tracy. As with cProfile, the standard Python profiler module, all Python function calls are picked up by Tracy; additionally, TT's bound C++ calls are picked up automatically. The entire Python script, or just desired parts of it, can be profiled at either the function or line level.
Extra features
Runtime Compute Args: arguments can be sent to Compute Kernels at runtime. The kernel uses the same get_arg_val<type>(<index>) API to retrieve them, and the host uses the same tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core,CoreRange>, <vector of u32 runtime args>) call as for DataMovement Kernels.
Eager (Ops)
Notes not yet available.
Models
- metal_BERT_large_15: updated the model implementation to use the tt-DNN embedding operation, which executes on the GS device. Previously this model used the PyTorch embedding operation executing on CPU.
- Falcon7b: added an end-to-end demo running on a GS device. The demo takes a text prompt and returns text generated by the model to complete the prompt. It works by pre-filling the cache with decoded input prompts and then running decode for all users in parallel.
v0.33.0
Metal
Wormhole
- Basic bringup and tests running on WH B0
- Harvesting functionality working on WH B0
- Basic fast dispatch functionality working on WH B0
Host API changes
- void StartDebugPrintServer(Device *device, const std::vector<CoreCoord> &cores) is no longer callable
- Device *CreateDevice no longer requires an arch parameter
- New wrapper around the Buffer API so that users don't need to look inside buffer.hpp to figure out how to construct a buffer object (see the sketch after this list):
  Buffer CreateBuffer(Device *device, std::uint64_t size, std::uint64_t page_size, const BufferType buffer_type)
- LaunchKernels renamed to LaunchProgram(Device *device, Program &program) to match EnqueueProgram, and the obsolete stagger_start parameter was removed
- void WriteRuntimeArgsToDevice(Device *device, const Program &program) moved to the detail namespace
- bool CompileProgram(Device *device, Program &program) moved to the detail namespace
- bool ConfigureDeviceWithProgram(Device *device, const Program &program) moved to the detail namespace
- bool InitializeDevice(Device *device) removed
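A minimal sketch of the wrapped buffer flow under the renamed launch API; the sizes, the DRAM buffer type, and the default-constructed Program are illustrative assumptions:

```cpp
#include <cstdint>

#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

int main() {
    Device *device = CreateDevice(0);  // arch parameter no longer required

    // CreateBuffer hides the Buffer constructor details from buffer.hpp.
    constexpr std::uint64_t page_size = 2 * 32 * 32;   // one bfloat16 tile
    constexpr std::uint64_t total_size = 64 * page_size;
    Buffer buffer = CreateBuffer(device, total_size, page_size, BufferType::DRAM);

    Program program;  // assumed default-constructible in this release
    // ... create kernels, circular buffers, and set runtime args ...

    // LaunchKernels is now LaunchProgram, matching EnqueueProgram.
    LaunchProgram(device, program);

    CloseDevice(device);
    return 0;
}
```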
Profiler
- Fixed a device-side bug to support the new FW init process in fast and slow dispatch.
- Cleaned up RISC FW to avoid unnecessary function wrappers.
Watcher
- Added more waypoints to the watcher and added access methods to the SoC descriptor (e.g., for harvesting)
- Added some NoC sanitization and checks
- Bug fixes: don't read registers during kernel runs, don't include WH headers on GS, and allow zero-length transactions
Feature: Runtime Compute Args
- Arguments can be sent to Compute Kernels at runtime in the same way as DataMovement Kernels.
- The kernel uses the same get_arg_val<type>(<index>) API to retrieve them.
- The host uses the same tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core, CoreRange>, <vector of u32 runtime args>) call as for DataMovement Kernel communication.
Eager (Ops)
- Added support for overriding runtime args and circular buffers
- Added support for saving and loading tensors
- Added support for uint32 tensors
Models
- 5+% increase in BERT Large performance on bare metal machines.
- 15+% increase in LLaMA 7B performance on bare metal machines.