Release v0.52.0 · tenstorrent/tt-metal

Note

This is a verified, real release; however, the release notes are under construction. Thank you for understanding.

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11036234439

📦 Uncategorized

#12323: delete ctor of AllGatherFusedOpSignaler
- PR: #12324
#0: Revert "#0: Update to gcc-12.x (#12332)"
- PR: #12522
#12448: Update 1d matmul sweep test to use CoreRangeSet for core range parameters
- PR: #12523
#0: [skip ci] Fix demo invocation in Llama README
- PR: #12526
#0: Update device creations functions to use num_command_queues instead of num_hw_cqs to match mesh_device creation functions
- PR: #12517
#12273: Move full wheel build on GitHub runners and 22.04 to scheduled job and fix related
- PR: #12535
Update perf and latest features for llm models (Sept 11)
- PR: #12515
#12532: Change sweep new vector checking to use the serialized vector…
- PR: #12534
#12451: add negative ends support for slice with list splicing format
- PR: #12469
fix llama t3k demo invoke in CI
- PR: #12537
Yugao/doc
- PR: #12540
#10855: Add single-device perf measurements to sweep infra
- PR: #12338
#9340: Add optional output tensor support for assign
- PR: #12057
#0: Add ccl multichip stack overview
- PR: #12551
#12371: Migrate moreh_getitem operation from tt_eager to ttnn
- PR: #12372
#11651: Remove type_caster
- PR: #11702
#12375: Add qid and optional tensor output to ttnn.gelu_bw
- PR: #12509
#8865: Optimized ttnn.bcast dispatch times
- PR: #12383
#12196: Use split readers wherever possible in UNet Shallow
- PR: #12441
Replace exact output match with tight pcc check in post-commit
- PR: #12446
#12148: Add queue_id and optional output tensors to ttnn.mul_bw
- PR: #12162
Fix start_pos in get_rot_mat() in llama galaxy model
- PR: #12493
Yieldthought/llama31 8b/ttembed
- PR: #12560
#8865: Fix non working ops in dispatch profiling infra
- PR: #12564
#0: Remove myself from tt_lib/csrc codeowners
- PR: #12567
Update workload theoretical ethernet numbers
- PR: #12570
#12524: Update fmt and unify logging API
- PR: #12464
#0: Update fmt and unify logging API
- PR: #12587
#11133: Improve various things about the wheel, including removal of patchelf and linking runtime assets to cwd
- PR: #11884
Support for initializing with 0s for SUM reduction WHB0
- PR: #12238
#12376: Support for non-32 Height in Width Sharded Conv2d
- PR: #12382
#0: Optimize context switch decision
- PR: #12545
#0: Correct #!/bin script headers
- PR: #12582
#12538: Separate out wheel tests from build so that other wheel-dependent jobs aren't blocked by the wheel smoke tests
- PR: #12594
#0: Create Blackhole Bring-Up Programming Guide
- PR: #12610
#12552: Fix indentation pybind files
- PR: #12543
#0: Add FD nightly single-card pipeline to data pipeline
- PR: #12618
#0: [skip_ci] Updating BH bring-up programming guide
- PR: #12620
Update owner of T3K ttnn unit tests
- PR: #12622
#0: change default reduce scatter num buffers per channel to 2
- PR: #12616
#12436: port moreh_sum from tt_dnn to ttnn
- PR: #12437
#12026: add permute sweep tests for trace
- PR: #12571
#12514: port moreh_mean and moreh_mean_backward from tt_dnn to ttnn
- PR: #12519
#12207: Port moreh_dot to ttnn
- PR: #12265
#12259: Move moreh dot backward
- PR: #12261
#12164: Add queue_id and optional output tensors to backward ops
- PR: #12255
#12439: Migrate moreh_nll_loss_bwd operations (reduced and unreduced) from tt_eager to ttnn
- PR: #12494
#12578: Update Mixtral t/s/u in README
- PR: #12629
#12373: Add queue_id and optional output tensors to rsqrt_bw op
- PR: #12404
remove todos from doc
- PR: #12636
add code language formatting CclDeveloperGuide.md
- PR: #12639
#0: Update multi-chip Resnet perf numbers after dispatch optimizations
- PR: #12621
#0: Remove unused _init, _fini
- PR: #12593
#0: remove unused variable
- PR: #12646
Contiguous pages support in Reduce Scatter read/write
- PR: #12477
#12628: Resolve arithmetic error in test_multi_cq_multi_dev causing T3K multi-CQ tests to fail
- PR: #12653
#12619: Update matmul sweep timeout and core range set usage
- PR: #12655
Run custom dispatch commands on in-service runners only
- PR: #12659
#12544: support wide channels (> 256) in maxpool
- PR: #12625
#12605: Implement recommendations for Llama readme
- PR: #12657
#0: Point UMD back to main instead of metal-main
- PR: #12478
#0: ViT Trace+2CQ implementation
- PR: #12623
#0: Add BH to custom test dispatch workflow
- PR: #12667
Update ViT on GS perf
- PR: #12670
LLama selfout specific optimizations for fused all_gather_matmul op
- PR: #12292
#12520: Adding noc_async_writes_flushed between mcast writes and mcast semaphore sets for BH
- PR: #12627
#11144: Upgrade pip version to 21.2.4 to get around 22.04 import error
- PR: #12673
Remove duplicate from sfpu_split_includes.h
- PR: #12665
#12250: port moreh_matmul from tt_dnn to ttnn
- PR: #12251
#12297: Add queue_id and optional output tensors to add_bw op
- PR: #12358
#12392: Use shallow convolution in upblock3 of UNet Shallow
- PR: #12562
#0: Make CoreRangeSet thread safe
- PR: #12679
mm_sfence->tt_driver_atomics::sfence();
- PR: #12617
[New Op] Added dropout unary op
- PR: #12474
#12392: Shallow conv unet uts
- PR: #12568
Pkeller/memmap profiler
- PR: #12067
#0: Set WH_ARCH_YAML only if we have a wormhole machine
- PR: #12704
All gather expose params
- PR: #12389
Generalize nlp create head decode
- PR: #12663
#0: Remove CCL stalls, since Fabric VC support is merged
- PR: #12720
#0: Remove incorrect norelax option
- PR: #12717
#12668: SWOC bugfix
- PR: #12674
Fix start pos in get_rot_mat
- PR: #12728
#0: Remove unused CRT_START label
- PR: #12722
#12701: Split nightly tests into specific models for better reading
- PR: #12733
#0: Relax host bound tg threshold for Resnet
- PR: #12708
Rename tt::tt_metal::Shape to LegacyShape to not conflict with TTNN
- PR: #12742
#12374: Add optional output tensor support for ttnn.full_like
- PR: #12689
YoloV4 pipeline update
- PR: #12503
#12425: Add queue_id and optional output tensors to zeros_like
- PR: #12561
#12497: ttnn.empty to use create_device_tensor
- PR: #12542
#12266: Cleanup ternary backward
- PR: #12691
#0: Use absolute addressing in startup
- PR: #12723
#12595: Run profiler gather after every sweep test regardless of status
- PR: #12606
#12730: bert slice support unit tests
- PR: #12737
Reduce scatter perf sweep
- PR: #12391
#12778: Speed up sweeps parameter generation
- PR: #12780
#0: DPrint bugfix for which dispatch cores are included in 'all'
- PR: #12745
#12730: bert slice support unit tests correction
- PR: #12779
#5783: Remove watcher dependency on generated headers
- PR: #12686
#0: Update GS Resnet perf thresholds. Seeing large variation in CI
- PR: #12744
Fix issue w/ CBs getting allocated on ETH cores
- PR: #12792
#12802: add tracy option to build_metal.sh
- PR: #12803
#12748: Cleanup clamp_bw op
- PR: #12762
#12224: Add optional output tensor support for lt_bw
- PR: #12693
#12387: Workaround to_layout for height sharded tensor
- PR: #12641
#12196: Use split_reader and act db
- PR: #12769
#12508: Skip failing test in CI
- PR: #12761
#11512: Add frac, ceil and trunc sweeps
- PR: #12760
#0: Don't overwrite CMake flags in build_metal.sh
- PR: #12824
Add subtract, subalpha and rsub sweeps, interleaved
- PR: #12822
Llama tg/sharded ccls
- PR: #12814
Update peak dram speed to 288GB/s
- PR: #12528
#11169: Watcher to report if eth link retraining occurred during teardown
- PR: #12801
#0: adding jaykru-tt as codeowner for data_movement operations
- PR: #12139
Mamba CI hanging on Untilize fix
- PR: #12677
#12749: Update Test files
- PR: #12751
#12799: Add handling for pytest errors, especially those at the beginning, and expose their messages
- PR: #12838
#12529: Update comment of dataflow api for mcast loopback functions
- PR: #12825
Fix failure in llama perf on CI
- PR: #12669
fix typo - mention higher level multichip API above CCL ops
- PR: #12836
Add Mamba unit tests to post-commit test suite
- PR: #12129
#12529: Add check for in0_mcast_num_cores=1 for noc_async_write_multicast_loopback_src
- PR: #12796
#0: Change all ops which support page_table to enable non-log2 shapes
- PR: #12842
#12198: Add 2CQ and trace support for UNet Shallow
- PR: #12820
Add supports/examples for placing Reads and Writes on CQ1
- PR: #12821
#9370: Workaround: replace WRCFG with RMWCIB instructions in reduce_revert_delta
- PR: #12832
Remove UNet from landing page
- PR: #12856
#12750: Replace zeros_like with empty_like in backward ops
- PR: #12766
#12840: Add more handling for multiple attempts by restricting the space of github_job_ids we're looking at to only the ones in the workflow run attempt in question
- PR: #12858
#12729: add row-major split for BERT
- PR: #12804
#12764: Cleanup abs_bw, threshold_bw
- PR: #12827
#11919: Enable fast dispatch build-and-unit-tests
- PR: #12334
#12770: Reorganize all gather
- PR: #12808
Mesh Virtualization
- PR: #12719
#8865: Add more ops to dispatch profile infra
- PR: #12765
Sin, cos, relu, gelu and logical_or_ new infrastructure sweeps
- PR: #12672
#10936: Enable llama tg unit test
- PR: #12263
Add support for dim=3 in ttnn.argmax
- PR: #12442
#9288: Add ttnn support for distilbert model
- PR: #9507
#0: Handling Absence of Prefill in Mistral Prefill Demo
- PR: #12612
#0: Update Mamba demo outputs
- PR: #12873
Llama prefill changes to enable vLLM
- PR: #12843
#12874: Skip flaky prefill single card perf test for now
- PR: #12876
#12877: Cache pip dependencies for data collection (infra) to save some time
- PR: #12880
bilinear support for upsample.
- PR: #12385
Add clamp, ceil, cbrt and floor sweeps, interleaved
- PR: #12870
#12557: bias shape was wrong, and need to create pure torch tensor
- PR: #12846
#12654: ttnn_tutorials resnet block document update
- PR: #12692
Update TG llama3_70b README.md
- PR: #12868
#12892: Add None for sweeps device perf on device exception
- PR: #12893
Update README.md
- PR: #12904
Update RN50 GS README.md
- PR: #12906
Update README.md
- PR: #12905
Add Unsqueeze and more support for squeeze
- PR: #12734
#10016: jit_build: link substitutes, tdma_xmov, noc
- PR: #12894
#0: Update tg/tgg Resnet READMEs to be in line with the others
- PR: #12911
#0: upsample fix for undefined behavior on multi-devices run.
- PR: #12889
#12912: Update CONTRIBUTING.md
- PR: #12913
LLK Test Coverage - MathFid and DestAcc in reduce API
- PR: #12696
#12615: Add queue_id output tensor to slice op, concat_bw
- PR: #12718
#12829: Cleanup softplus_bw, hardtanh_bw, prod_bw
- PR: #12864
Flash Decode GQA (and MQA) Improvements (Round 1)
- PR: #12739
#12435: Add queue_id and optional output tensors to div_bw op
- PR: #12697
#12758: Add queue_id and optional output tensors to div and trunc op
- PR: #12861
Add more sweeps, reorganize sweeps folders
- PR: #12928
#12885: Add initial unit tests for infra, specifically for data collection
- PR: #12933
#12782: Change all sweep framework infra to use loguru
- PR: #12784
#0: Conv act split reader speed up
- PR: #12881
Distributed Sharded Layernorm/RMSNorm
- PR: #12635
n_heads> 1 and share_cache support for paged update cache
- PR: #12699
#12658: Update sweeps export to sqlite script
- PR: #12951
#12946: Fixed TGG mapping to match cluster_desc.yaml
- PR: #12947
#12527: refactor slice
- PR: #12791
Fix squeeze Dim == 0 with padding
- PR: #12940
#12909: Skip users in FlashDecode based on index
- PR: #12910
#12954: pip install only build in wheel build since that's all we should need
- PR: #12958
#0: Bugfix for launch message setting on watcher hang for Tensix
- PR: #12956
New sweeps - Add exp, exp2, tanh sweeps
- PR: #12950
Reduce padding on UNet Shallow input tensor
- PR: #12944
#0: Direct build jobs for RelWithDebInfo onto in-service runners
- PR: #12960
#0: Fix mesh_device grid specification for tgg resnet tests after #12479 was merged
- PR: #12948
#12750: Replace zeros_like with empty_like in backward ops
- PR: #12934
#0: Simplify erisc entry/exit
- PR: #12702
Don't install dependencies on self-hosted builders
- PR: #12965
[skip ci] Add the programming examples list to the landing README
- PR: #12969
[skip ci] #0: Prog Examples edits
- PR: #12972
[skip ci] #0: Prog Examples readme
- PR: #12973
#12355: Support vector of optional tensor and example
- PR: #12356
#12867: Cleanup 9 Unary Backward ops
- PR: #12920
#12795: Move numpy functions.hpp
- PR: #12817
#0: Shard and Pad programming examples
- PR: #12974
[skip ci] #0: Prog examples edits
- PR: #12976
Update README.md with unet perf
- PR: #12978
[skip ci] Update README.md
- PR: #12980
[skip ci] ViT TTNN Tech Report
- PR: #12800
[skip ci] Update README.md (ViT in TT-NN)
- PR: #12981
