Questions About Tensor Storage and Operation Mapping in TT-BUDA #58
Hello TT-BUDA Maintainers/Team,
I'm working with the tt-buda project and have a couple of questions regarding its implementation. I reviewed the documentation and relevant modules but couldn't find detailed explanations. Direct answers are appreciated, but referencing relevant code files would be even better.
1. Storage of Activation and Weight Tensors
2. Operation Balancer/Placer to Tensix core mesh
Thank you for your help!
@jhlee508 any follow up on the above?
I need more information regarding the two balancer policies (NLP, Ribbon) used to map operations to the Tensix cores. Unfortunately, I have been unable to find detailed information on these policies in the documentation you have provided. Could you please provide any references, code, or additional resources that explain these balancing and placement policies? Any guidance would be greatly appreciated. Thank you for your time and assistance.
The Balancer policy uses different heuristics depending on whether the model is primarily NLP (natural language processing) or CNN (convolutional neural network)/Ribbon. These policies help optimize the placement of operations (ops) during model execution. For NLP models, which often contain transformer layers, and CNN models, which consist of convolutional operations, we can define placement rules ahead of time, because the general shape, size, and cycle counts of these ops are known in advance. These pre-set rules guide the Balancer in determining the most efficient way to distribute ops across the grid.

The Balancer works closely with the Scheduler. The Scheduler creates a window of N ops, based on heuristics, that are likely to fit within an epoch. The Balancer then analyzes these ops to find an optimal target cycle count and balances them by adjusting grid sizes, typically influenced by the "slowest" op in the batch.

In both the NLP and CNN/Ribbon policies, utilization targets are set to ensure efficient execution. The Balancer checks that the "slowest" op, which dictates the target cycle count, is included in the epoch and that overall utilization is balanced. If certain conditions aren't met (for example, the target op isn't included in the epoch, or utilization falls below the threshold), the placement may be rejected. The only other reference I have is the source code, if you would like to observe the full details.
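To make that flow concrete, here is a minimal Python sketch. It is not the actual tt-buda Balancer code; the names (`Op`, `balance_epoch`), the `MAX_GRID` and `UTILIZATION_THRESHOLD` values, and the simple grid-scaling rule are assumptions for illustration only.

```python
import math
from dataclasses import dataclass

MAX_GRID = 64                 # assumed number of usable Tensix cores; illustration only
UTILIZATION_THRESHOLD = 0.75  # assumed rejection threshold; illustration only


@dataclass
class Op:
    name: str
    single_core_cycles: int   # estimated cycle count if the op ran on one core
    grid_size: int = 1        # number of cores the balancer assigns


def balance_epoch(window):
    """Toy version of the flow described above: the slowest op in the
    scheduler's window dictates the target cycle count, the other ops'
    grid sizes are scaled to roughly match it, and the placement is
    rejected if overall utilization falls below the threshold."""
    # The "slowest" op dictates the target cycle count for the epoch.
    slowest = max(window, key=lambda op: op.single_core_cycles)
    slowest.grid_size = MAX_GRID
    target_cycles = slowest.single_core_cycles / MAX_GRID

    for op in window:
        if op is slowest:
            continue
        # Assign just enough cores so this op finishes near the target.
        op.grid_size = min(MAX_GRID, math.ceil(op.single_core_cycles / target_cycles))

    # Utilization: how close each op's per-core work is to the target cycle count.
    utilization = sum(
        op.single_core_cycles / (op.grid_size * target_cycles) for op in window
    ) / len(window)

    # Reject the placement if overall utilization falls below the threshold.
    return window if utilization >= UTILIZATION_THRESHOLD else None


ops = [Op("matmul_0", 80_000), Op("softmax_0", 12_000), Op("matmul_1", 64_000)]
print(balance_epoch(ops))
```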
Thank you so much for your detailed and kind explanation. I'll go over the source code you have provided to understand the rest in more detail. I have one more question about tensor memory layout. As the op's distributed grid size increases (which is expected), the weights appear to be allocated across more DRAM banks, distributed in a scattered pattern that resembles an interleaved tensor storage mechanism. Just like how you explained the balancer algorithm earlier, I would greatly appreciate it if you could also explain the key idea behind this DRAM bank distribution mechanism, along with links to the relevant source code for further reading. That would be really helpful. Thank you!
They are different.
Metal Tensors in DRAM:
Buda Tensors:
Interleaving Pattern and DRAM Bank Proximity:
You can see the source code by looking at
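For intuition on the interleaving pattern itself, here is a minimal sketch of the general page-level round-robin idea, under the assumption that consecutive pages of a tensor are cycled across DRAM banks. It is not the tt-buda or tt-metal allocator; the bank count, page size, and function name are placeholders.

```python
NUM_DRAM_BANKS = 6       # placeholder; the actual bank count depends on the device
PAGE_SIZE_BYTES = 2048   # placeholder page/tile size


def interleave(tensor_size_bytes):
    """Spread consecutive pages of one tensor across DRAM banks round-robin,
    producing the 'scattered' placement observed above.
    Returns a (bank_id, slot_within_bank) pair for each page."""
    num_pages = -(-tensor_size_bytes // PAGE_SIZE_BYTES)   # ceiling division
    return [(page % NUM_DRAM_BANKS, page // NUM_DRAM_BANKS) for page in range(num_pages)]


# A 1024x256 bf16 weight slice (512 KiB) ends up with pages in every bank:
print(interleave(1024 * 256 * 2)[:8])
```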
Thank you for the detailed explanation! I executed a matrix multiplication (M=N=K=1024) using tt-buda and observed that four Tensix cores concurrently handled a quarter of the matrix multiplication operation (Input: [1024, 1024] x Partial Weights: [1024, 256]). The four partial weight tensors were allocated in DRAM using the default round-robin placement scheme across DRAM banks (in the order of 1, 2, 3, 4), without any special DRAM allocation algorithm. Regarding this, could you clarify how tt-buda accounts for the proximity of each DRAM bank to the corresponding consumer core in this setup?
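As a concrete picture of the setup described above, here is a small sketch of the column split of the weights and the default round-robin bank assignment. This is not tt-buda code; the slicing and bank numbering are illustrative, taken from the shapes and bank order reported in this comment.

```python
import numpy as np

M = K = N = 1024
NUM_CORES = 4                  # the four Tensix cores observed above
DRAM_BANKS = [1, 2, 3, 4]      # round-robin order reported above

activations = np.random.rand(M, K).astype(np.float32)
weights = np.random.rand(K, N).astype(np.float32)

# Each core reads the full activations and one [1024, 256] column slice of the weights.
cols_per_core = N // NUM_CORES
partials = []
for core in range(NUM_CORES):
    w_slice = weights[:, core * cols_per_core:(core + 1) * cols_per_core]
    bank = DRAM_BANKS[core % len(DRAM_BANKS)]   # default round-robin bank placement
    print(f"core {core}: weight slice {w_slice.shape} -> DRAM bank {bank}")
    partials.append(activations @ w_slice)      # each core produces a [1024, 256] partial

# Concatenating the partial results column-wise reproduces the full [1024, 1024] product.
assert np.allclose(np.concatenate(partials, axis=1), activations @ weights)
```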
I believe it will balance between the availability of a DRAM bank and its proximity to the consumer core. Since you only had one operation on the device, a round-robin placement would have been the most effective. There is a great article series, https://www.corsix.org/content/tt-wh-part1, which goes into the physical details of the device layout. I would recommend reading through it for a very detailed analysis.
Closing issue for now. Please reopen if there are any outstanding items.