
Releases: tenstorrent/tt-metal

v0.36.1

05 Nov 17:22

Metal

Wormhole Bringup

  • Added some APIs to query device ethernet connectivity.
  • Added first phase of ethernet data movement support, basic unit tests passing on N300.

API Changes

Notes not available.

Tools - Profiler

  • Added device-only and host-only profiling options to the profile_this.py script
  • Added examples of fast dispatch device program profiling

Tools - Watcher

  • Added kernel names/paths to watcher log file

Extra features

Notes not available.

Eager/ttNN

Infrastructure

  • Added initial implementation of TTNN APIs
    • Added functions to interface with torch: from_torch, to_torch
    • Added functions to move tensor to/from device: to_device, from_device
    • Added functions to change the layout of the tensor: to_layout
    • Added matmul, add, sub, mul, reshape, permute and softmax operations
  • Implemented Multi-Head-Attention using TTNN APIs
  • Added 3 tutorials to showcase TTNN
  • Updated Documentation to describe TTNN and its APIs

Operations

The following on-device operators have been added to the tt_lib.tensor module:

  • interleave repeat
  • triu
  • tril
  • rmsnorm
  • groupnorm
  • silu (updated to be a first-class unary operator)

Models

  • For the BERT demo, added loading of cached pre-processed weights (stored as TT tensors) to avoid conversion from Torch to TT tensors.
  • Added a demo for ResNet that executes on TT hardware. The demo takes images from ImageNet and processes them in batches of 8.

v0.35.0

27 Oct 23:36

Metal

Wormhole Bringup

  • Extended gtests to run on all available devices in Wormhole systems.
  • Single device tests passing on remote chips.

API Changes

  • These 2 functions:

    • uint32_t CreateSemaphore(Program &program, const CoreRange &core_range, uint32_t initial_value)
    • uint32_t CreateSemaphore(Program &program, const CoreRangeSet &core_range_set, uint32_t initial_value)

    have been replaced by

    • uint32_t CreateSemaphore(Program &program, const std::variant<CoreRange,CoreRangeSet> &core_spec, uint32_t initial_value).
  • These 3 functions:

    • void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreCoord &logical_core, const std::vector<uint32_t> &runtime_args)
    • void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreRange &core_range, const std::vector<uint32_t> &runtime_args)
    • void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreRangeSet &core_range_set, const std::vector<uint32_t> &runtime_args)

    have been replaced by

    • void SetRuntimeArgs(const Program &program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::vector<uint32_t> &runtime_args)
  • These 2 functions:

    • KernelID CreateDataMovementKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::optional<DataMovementConfig> &config = {})
    • KernelID CreateComputeKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::optional<ComputeConfig> &config = {})

    have been replaced by:

    • KernelID CreateKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::variant<DataMovementConfig,ComputeConfig> & config)

Tools - Profiler

  • Improved the profile_this.py log management strategy to avoid overly conservative log folder checks during profiling

Extra features

  • Runtime Compute Args: Arguments can now be sent to Compute Kernels at runtime in the same way as to DataMovement Kernels. The kernel retrieves them with the same get_arg_val<type>(<index>) accessor, and the host passes them with the same tt_metal::SetRuntimeArgs(const Program &program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::vector<uint32_t> &runtime_args) call used for DataMovement Kernels.

Eager (Ops)

There have been no notable changes to communicate in this release.

Models

  • Moved code that implements and tests models from tests/models to the top-level models folder. In the models folder, models are separated into demos (working models with end2end demo code) and experimental (models that are under development).
  • Added an implementation of Falcon7B for GS and PyTorch demos for nanoGPT and T5
  • Added a BERT Large end2end demo on GS (set up for question answering)

v0.34.0

13 Oct 15:22

Metal

API Changes

  • CreateDevice: device_id type has changed from int to chip_id_t
  • CreateCircularBuffer: The three previous variants, which differed only by a CoreCoord, CoreRange, or CoreRangeSet parameter, have been merged into one user-facing CreateCircularBuffer function parameterized with std::variant<CoreCoord,CoreRange,CoreRangeSet>. It now accepts a CircularBufferConfig that specifies size, data format, and page size per buffer index. The return type changed from a CircularBuffer object to CircularBufferID (uintptr_t)
  • GetCircularBufferConfig: New function that retrieves a reference to the configuration of a CircularBuffer, allowing the config to be updated. Updates take effect on the next call to LaunchProgram.

Tools - Profiler

Tracy Python Support: Profile Python-side code with Tracy. As with cProfile, the standard Python profiler module, all Python function calls are picked up by Tracy; TT's bound C++ calls are also captured automatically. The entire Python script, or just selected parts of it, can be profiled at either the function or line level.

Extra features

Runtime Compute Args: Arguments can be sent to Compute Kernels at runtime. The kernel uses the same get_arg_val<type>(<index>) API to retrieve them. The host uses the same tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core, CoreRange>, <vector of u32 runtime args>) call as for DataMovement Kernels.

Eager (Ops)

Notes not yet available.

Models

  • metal_BERT_large_15: The model implementation was updated to use the tt-DNN embedding operation, which executes on the GS device. Previously this model used the PyTorch embedding operation executing on CPU.
  • Falcon7b: Added an end-to-end demo running on the GS device. The demo takes a text prompt and returns text generated by the model to complete the prompt. It works by pre-filling the cache with decoded input prompts and then running decode for all users in parallel.

v0.33.0

06 Oct 02:29

Metal

Wormhole

  • Basic bringup and tests running on WH B0
  • Harvesting functionality working on WH B0
  • Basic fast dispatch functionality working on WH B0

Host API changes

  • void StartDebugPrintServer(Device *device, const std::vector<CoreCoord> & cores) no longer callable
  • Device *CreateDevice no longer requires arch parameter
  • New wrapper around Buffer API so that users don't need to look inside buffer.hpp to figure out how to construct a buffer object: Buffer CreateBuffer(Device *device, std::uint64_t size, std::uint64_t page_size, const BufferType buffer_type)
  • LaunchKernels renamed to LaunchProgram(Device *device, Program &program) to match EnqueueProgram and removed obsolete stagger_start parameter
  • void WriteRuntimeArgsToDevice(Device *device, const Program &program) moved to detail namespace
  • bool CompileProgram(Device *device, Program &program) moved to detail namespace
  • bool ConfigureDeviceWithProgram(Device *device, const Program &program) moved to detail namespace
  • bool InitializeDevice(Device *device) removed

Profiler

  • Bug fix on device side to support new FW init process in fast and slow dispatch.
  • RISC FW cleanup to avoid unnecessary function wrappers.

Watcher

  • Added more waypoints to the watcher and added access methods to the SoC descriptor (e.g., for harvesting)
  • Added some NoC sanitization and checks
  • Some bug fixes: don't read registers during kernel run, don't include WH headers on GS, allow 0-length transactions

Feature: Runtime Compute Args

  • Arguments can be sent to Compute Kernels at runtime in the same way as DataMovement Kernels.
  • The kernel uses the same get_arg_val<type>(<index>) API to retrieve them.
  • The host uses the same tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core, CoreRange>, <vector of u32 runtime args>) call as for DataMovement Kernel communication.

Eager (Ops)

  • Added support for overriding runtime args and circular buffers
  • Added support for saving and loading tensors
  • Added support for uint32 tensors

Models

  • 5%+ increase in BERT Large performance on bare metal machines.
  • 15%+ increase in LLaMA 7B performance on bare metal machines.