Questions About Tensor Storage and Operation Mapping in TT-BUDA #58

Closed
jhlee508 opened this issue Sep 15, 2024 · 9 comments

@jhlee508

jhlee508 commented Sep 15, 2024

Hello TT-BUDA Maintainers/Team,

I'm working with the tt-buda project and have a couple of questions regarding its implementation. I reviewed the documentation and relevant modules but couldn't find detailed explanations. Direct answers are appreciated, but referencing relevant code files would be even better.

1. Storage of Activation and Weight Tensors

  • How are activation, weight and output tensors stored?
    • DRAM 12-channel interleaved?
    • Page Size?

2. Operation Balancer/Placer to Tensix core mesh

  • I saw a diagram showing operations beautifully mapped onto the Tensix core mesh when running DNN models. What is the algorithm used to assign operations to the cores?

Thank you for your help!

milank94 added the question (Further information is requested) label on Sep 16, 2024
milank94 self-assigned this on Sep 16, 2024
@milank94

1. Storage of Activation and Weight Tensors

  • How are activation, weight and output tensors stored?
    • DRAM 12-channel interleaved?
    • Page Size?

Activation, weight, and output tensors are stored in either L1 SRAM or DRAM, depending on the op layout and data-movement path availability. For example, if an intermediate result can be consumed by an op in the same epoch, the data stays in L1; if it will be required by an op in the next epoch, it goes to DRAM.

We use 1 GB hugepages for host <> device transfers. For Wormhole, we have 4x 1 GB hugepages per device, and for Grayskull it is 1x 1 GB hugepage per device.
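
To make that rule concrete, here is a minimal, purely illustrative sketch (a hypothetical function, not tt-buda code) of the L1-vs-DRAM decision described above:

```python
# Purely illustrative sketch (not tt-buda source): an intermediate result stays
# in L1 SRAM only if its consumer op runs in the same epoch; otherwise it is
# spilled to a DRAM queue.

def choose_buffer_location(producer_epoch: int, consumer_epoch: int) -> str:
    if producer_epoch == consumer_epoch:
        return "L1"    # consumed within the same epoch -> stays on-chip
    return "DRAM"      # crosses an epoch boundary -> spilled to DRAM

assert choose_buffer_location(0, 0) == "L1"
assert choose_buffer_location(0, 1) == "DRAM"
```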

2. Operation Balancer/Placer to Tensix core mesh

  • I saw a diagram showing operations beautifully mapped onto the Tensix core mesh when running DNN models. What is the algorithm used to assign operations to the cores?

We have different balancer policies that can be used for laying out the ops; the most notable are NLP, CNN, and Ribbon. Each one has its own strategy for how the ops should be laid out, e.g. row-major, column-major, etc.

https://docs.tenstorrent.com/pybuda/latest/advanced_user_guide.html provides an overview of these different policies and the manual overrides that are available.
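
For reference, selecting one of these policies from Python looks roughly like this; the attribute names below follow the pattern used in the tt-buda demo scripts and should be treated as an assumption to verify against your pybuda version:

```python
import pybuda

# Assumed pybuda configuration API (as seen in tt-buda demo scripts); check
# the exact attribute names against your installed pybuda version.
compiler_cfg = pybuda.config._get_global_compiler_config()
compiler_cfg.balancer_policy = "Ribbon"   # e.g. "Ribbon", "NLP", "CNN"
```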

@milank94

@jhlee508 any follow up on the above?

@jhlee508
Author

I need more information regarding the two balancer policies (NLP, Ribbon) used to map operations to the Tensix cores. Unfortunately, I have been unable to find detailed information on these policies in the documentation you have provided.

Could you please provide any references, code, or additional resources that explain these balancing and placement policies? Any guidance would be greatly appreciated.

Thank you for your time and assistance.

@milank94

The Balancer policy uses different heuristics depending on whether the model is primarily NLP (natural language processing) or CNN (convolutional neural network)/Ribbon. These policies help optimize the placement of operations (ops) during model execution.

For NLP models, which often contain transformer layers, and CNN models, which consist of convolutional operations, we can define placement rules ahead of time. This is because the general shape, size, and cycle counts of these ops are known in advance. These pre-set rules guide the Balancer in determining the most efficient way to distribute ops across the grid.

The Balancer works closely with the Scheduler. The Scheduler creates a window of N ops that, based on heuristics, are likely to fit within an epoch. The Balancer then analyzes these ops to find an optimal target cycle count and balances them by adjusting grid sizes, typically driven by the "slowest" op in the batch.

In both NLP and CNN/Ribbon policies, utilization targets are set to ensure efficient execution. The Balancer checks that the "slowest" op, which dictates the target cycle, is included in the epoch and that overall utilization is balanced. If certain conditions aren’t met—like if the target op isn't included or utilization falls below the threshold—the placement may be rejected.
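
Here is a rough, hypothetical sketch of that flow; the names, numbers, and thresholds are made up for illustration and this is not the actual balancer code:

```python
import math
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    single_core_cycles: int  # estimated cycle count on one Tensix core

def balance_epoch(op_window, grid_cores=64, utilization_floor=0.6):
    # The "slowest" op sets the target: spread it over a share of the grid and
    # use its per-core cycle count as the epoch's target cycle count.
    slowest = max(op_window, key=lambda op: op.single_core_cycles)
    slowest_cores = max(1, grid_cores // len(op_window))
    target_cycles = slowest.single_core_cycles / slowest_cores

    # Give every other op just enough cores to come in at or under the target.
    placements = []
    for op in op_window:
        cores = (slowest_cores if op is slowest
                 else max(1, math.ceil(op.single_core_cycles / target_cycles)))
        placements.append((op.name, cores))

    used = sum(cores for _, cores in placements)
    if used > grid_cores or used / grid_cores < utilization_floor:
        return None  # placement rejected, as described above
    return placements

print(balance_epoch([Op("matmul_0", 80_000), Op("gelu_0", 20_000), Op("matmul_1", 60_000)]))
# [('matmul_0', 21), ('gelu_0', 6), ('matmul_1', 16)]
```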

The only other reference I have is the source code, if you would like to see the full details:

@jhlee508
Author

Thank you so much for your detailed and kind explanation. I’ll go over the source code you have provided to understand the rest in more detail.

I have one more question about tensor memory layout. As the op's distributed grid size increases (which is expected), it seems that the weights are allocated across more DRAM banks, distributed in a scattered pattern. This pattern appears to resemble the interleaved tensor storage mechanism described in metal's documentation. Could you let me know if this is correct, or whether buda was developed entirely independently from metal and uses a slightly different mechanism to map tensors across the 12 DRAM banks?

Just as you explained the balancer algorithm earlier, I would greatly appreciate it if you could also explain the key idea behind this DRAM bank distribution mechanism, along with references to the relevant source code for further reading. That would be really helpful. Thank you!

@milank94

They are different. Metal tensors that are interleaved in DRAM are all in a canonical form (that is, each op that consumes a DRAM tensor could consume any DRAM tensor), whereas buda tensors are stored with a certain u/mblock, so each tensor has to be stored in the precise way that the consumer op expects. The pattern in which the banks are interleaved is the same; it just follows whichever DRAM bank is closest to the consumer core.

Metal Tensors in DRAM:

  • Canonical Form: The term canonical here means that Metal tensors stored in DRAM are stored in a standardized or universal format. This implies that any operation (op) that consumes these tensors can work with any tensor stored in DRAM without needing to know the specific details of how it was stored.
  • Interleaving in DRAM: Interleaving means the tensors are distributed across different DRAM banks in a way that maximizes memory bandwidth and access speed. By distributing the data across banks, the system can access multiple parts of the tensor simultaneously.

Buda Tensors:

  • u/mblock Storage: This refers to the fact that Buda tensors are stored using a more specialized format, depending on a specific structure (u/mblock). Each tensor has to be stored in a precise way, meaning the operation (op) consuming the tensor expects it to be stored exactly in that format. The u/mblock pattern defines how data is organized within memory, and deviations from this pattern would make the tensor unreadable by certain operations.
  • Consumer Dependency: The consumer operation is highly dependent on how the tensor is stored, unlike Metal tensors where the operation can work with any tensor stored in DRAM. This makes Buda tensors more rigid in terms of storage requirements.

Interleaving Pattern and DRAM Bank Proximity:

  • For both Metal and Buda tensors, the pattern used to interleave data across DRAM banks is the same. The key difference is that this pattern is influenced by the proximity of the DRAM bank to the consumer core: the closer the bank is to the core, the better the access performance.

You can see the source code by looking at resolve_input_queue_block_shapes in https://github.com/tenstorrent/tt-buda/blob/main/pybuda/csrc/balancer/legalizer/legalizer.cpp.
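
For intuition only, the round-robin interleaving idea can be pictured like this; this is a toy illustration, not the legalizer's actual logic, and "closest bank" is simply whichever of the 12 DRAM banks the placement decides the consumer core should start from:

```python
NUM_DRAM_BANKS = 12  # the 12 DRAM banks referred to in the question above

def assign_blocks_to_banks(num_blocks: int, closest_bank: int) -> list:
    """Round-robin bank assignment, starting at the consumer's closest bank."""
    return [(closest_bank + i) % NUM_DRAM_BANKS for i in range(num_blocks)]

# e.g. a weight tensor split into 8 blocks, consumer core nearest bank 5:
print(assign_blocks_to_banks(8, closest_bank=5))   # [5, 6, 7, 8, 9, 10, 11, 0]
```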

@jhlee508
Author

jhlee508 commented Oct 2, 2024

Thank you for the detailed explanation!

I executed a matrix multiplication (M=N=K=1024) using tt-buda and observed that four Tensix cores each concurrently handled a quarter of the matrix multiplication (Input: [1024, 1024] x Partial Weights: [1024, 256]). The four partial weight tensors were allocated in DRAM using the default round-robin placement scheme across DRAM banks (in the order 1, 2, 3, 4), without any special DRAM allocation algorithm.
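
(For concreteness, the split I'm describing is just the standard column-block decomposition of the weight matrix; a quick numpy sanity check, independent of tt-buda:)

```python
import numpy as np

# Each of the four cores computes Input @ (a [1024, 256] column slice of the
# weights); concatenating the partial outputs gives the full [1024, 1024] result.
A = np.random.rand(1024, 1024)
W = np.random.rand(1024, 1024)

partials = [A @ W[:, i * 256:(i + 1) * 256] for i in range(4)]  # one per Tensix core
assert np.allclose(np.concatenate(partials, axis=1), A @ W)
```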

Regarding this, could you clarify how tt-buda accounts for the proximity of each DRAM bank to the corresponding consumer core in this setup?

@milank94

milank94 commented Oct 2, 2024

I believe that it will balance between the availability and proximity of the DRAM bank to the core. Since you only had one operation on the device, using a round-robin placement would have been the most effective.

There is a great article series, https://www.corsix.org/content/tt-wh-part1, which goes into the physical details of the device layout. I would recommend reading through it for a very detailed analysis.

@milank94

Closing issue for now. Please reopen if there are any outstanding items.
