
huggingface.co - Tensor Parallelism (TP) in Transformers: 5 Minutes to Understand

From Quentin Gallouédec:

Quick Recap: What’s Inside a Transformer Network?


Before diving into tensor parallelism, let’s briefly review the core components of a transformer model. We focus on two major components:

  • the Multi-Head Attention (MHA) and
  • the Feed-Forward Network (FFN)

Other components (layer norms, embeddings, etc.) are omitted here: they are not central to how tensor parallelism is applied, and most of the model’s parameters reside in the attention and FFN components anyway.

The backbone of transformer models is the attention mechanism. Although many variants exist (e.g., Multi-Query Attention, Grouped-Query Attention, Linear Attention), the formulation below is the standard one:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V

Here, the queries Q, keys K, and values V are matrices of the same shape. In practice, they are obtained from the same input X (token embeddings) via learned linear projections:

Q = XW_Q, \quad K = XW_K, \quad V = XW_V.

A learned output projection W_O then produces the final attention output:

\text{Output} = \text{Attention}(Q, K, V)\, W_O.

Visually:

[Figure: Q, K, and V projections of X, the attention operation, and the output projection W_O]

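To make this concrete, here is a minimal PyTorch sketch of the formulas above, for a single head and with no masking. The file name and the shapes (10 tokens, d = 64) are illustrative assumptions, not taken from any particular model.

attention_sketch.py
# A minimal sketch of the attention formula above (single head, no masking);
# the shapes are illustrative assumptions.
import torch

d = 64                                   # hidden dimension
X = torch.randn(10, d)                   # 10 token embeddings
W_Q, W_K, W_V, W_O = (torch.randn(d, d) for _ in range(4))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # learned linear projections
scores = Q @ K.T / d**0.5                # QK^T / sqrt(d)
attn = torch.softmax(scores, dim=-1) @ V
output = attn @ W_O                      # final output projection
print(output.shape)                      # torch.Size([10, 64])
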
Computing attention with a large hidden dimension can be costly and may limit the model’s ability to capture diverse features. Transformers address this with Multi-Head Attention (MHA).

Instead of computing one large attention operation, we split Q, K, and V into h smaller heads of dimension d_h = d/h. Each head captures different representation subspaces. Their outputs are concatenated and projected back to dimension d, allowing the model to combine information across heads.

[Figure: Multi-Head Attention with h heads of dimension d_h]

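The head split can be written in a few lines. The following sketch reshapes Q, K, and V into h heads of dimension d_h = d/h, runs attention independently per head, and projects the concatenated result back to dimension d; the shapes are again assumed for illustration.

mha_sketch.py
# A minimal sketch of the head split described above; shapes are assumed.
import torch

d, h = 64, 8
d_h = d // h
X = torch.randn(10, d)
W_Q, W_K, W_V, W_O = (torch.randn(d, d) for _ in range(4))

# project, then view as (heads, tokens, d_h)
Q = (X @ W_Q).view(10, h, d_h).transpose(0, 1)
K = (X @ W_K).view(10, h, d_h).transpose(0, 1)
V = (X @ W_V).view(10, h, d_h).transpose(0, 1)

scores = Q @ K.transpose(-2, -1) / d_h**0.5           # per-head QK^T / sqrt(d_h)
heads = torch.softmax(scores, dim=-1) @ V             # (h, 10, d_h)
output = heads.transpose(0, 1).reshape(10, d) @ W_O   # concatenate heads, then project
print(output.shape)                                   # torch.Size([10, 64])
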
Another crucial component is the Feed-Forward Network (FFN). It’s usually composed of two linear layers with an activation in between. Many variations exist, but let’s consider the common structure, as it generalizes well:

[Figure: Feed-Forward Network with two linear layers and an activation in between]

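As a sketch of this common structure, the block below stacks two linear layers with an activation in between; the 4x expansion factor and the GELU activation are typical choices assumed here for illustration.

ffn_sketch.py
# A minimal sketch of the two-layer FFN described above; the expansion factor
# and the activation are assumptions, not tied to any specific model.
import torch
import torch.nn as nn

d, d_ff = 64, 256
ffn = nn.Sequential(
    nn.Linear(d, d_ff),   # first linear layer (expand to the hidden dimension)
    nn.GELU(),            # activation in between
    nn.Linear(d_ff, d),   # second linear layer (project back to d)
)
print(ffn(torch.randn(10, d)).shape)   # torch.Size([10, 64])
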
Transformer models have grown dramatically in size. Running inference on a single GPU is already challenging, and training is often impossible without parallelism. This motivates splitting the model across multiple GPUs, and tensor parallelism is one of the key techniques that makes this possible.

Now that we remember how attention works, let’s put that aside for a moment and introduce Tensor Parallelism (TP).

The key idea is simple: matrix multiplications can be parallelized if we split the matrices in the right way. Suppose you need to compute a matrix multiplication and you have a friend to help. How should you divide the work?

[Figure: a matrix multiplication to be split between two workers]

One option is to split the second matrix into column blocks. Each person multiplies the full first matrix by one block of columns:

[Figure: column-parallel matrix multiplication]

This is known as column-parallel matrix multiplication.

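A quick numerical check of this idea, with arbitrary sizes and a hypothetical file name: splitting B into column blocks, multiplying each block independently, and concatenating the results reproduces the full product.

column_parallel_sketch.py
# Column-parallel matmul: concat of A @ B_i over column blocks B_i equals A @ B.
import torch

A = torch.randn(4, 6)
B = torch.randn(6, 8)

B_blocks = B.chunk(2, dim=1)                     # two column blocks of B
partials = [A @ B_i for B_i in B_blocks]         # each "worker" computes A @ B_i
result = torch.cat(partials, dim=1)              # concatenate along columns

print(torch.allclose(result, A @ B, atol=1e-6))  # True
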
Another option is to split the first matrix into column blocks and the second matrix into matching row blocks. Each person computes their partial product, and then the results are summed:

[Figure: row-parallel matrix multiplication]

This is called row-parallel matrix multiplication.

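The same kind of check works for the row-parallel case, again with arbitrary sizes and a hypothetical file name: the partial products are computed independently and then summed.

row_parallel_sketch.py
# Row-parallel matmul: the sum of A_i @ B_i over matching blocks equals A @ B.
import torch

A = torch.randn(4, 6)
B = torch.randn(6, 8)

A_blocks = A.chunk(2, dim=1)                     # column blocks of A
B_blocks = B.chunk(2, dim=0)                     # matching row blocks of B
partials = [A_i @ B_i for A_i, B_i in zip(A_blocks, B_blocks)]
result = sum(partials)                           # sum of partial products

print(torch.allclose(result, A @ B, atol=1e-6))  # True
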
These strategies are extremely useful: they let each worker operate independently on its shard of the data and, more importantly, allow us to distribute the matrices across multiple GPUs, which is precisely what is needed to reduce per-GPU memory usage.

Now that we understand TP and MHA separately, let’s try to apply TP to MHA.

[Figure: tensor parallelism applied to Multi-Head Attention]

The easiest way to do it is to split the projection matrices W_Q, W_K, and W_V column-wise. Each GPU holds a subset of the output dimensions, or equivalently, a subset of the attention heads.

Each GPU therefore computes its local Q_i, K_i, and V_i for its assigned heads, with no communication required.

Since heads are independent, every GPU can compute attention for its heads entirely locally:

  • compute Q_i K_i^⊤,
  • apply softmax,
  • multiply by V_i.

Once again, no communication is needed here.

The attention output O_i is thus naturally sharded column-wise across GPUs.

The output projection W_O is then applied using a row-parallel layout:

  • each GPU multiplies its shard of the attention output by its shard of W_O independently,
  • then a single all-reduce (sum across GPUs) aggregates the partial results into the final output, as sketched below.

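To convince ourselves that this layout reproduces the unsharded computation, here is a single-process PyTorch sketch. It simulates the ranks with a Python loop and replaces the all-reduce with an explicit sum; the shapes, the TP size of 2, and the file name are illustrative assumptions, not real multi-GPU code.

tp_mha_sketch.py
# Simulated TP for MHA: W_Q, W_K, W_V split column-wise (by heads), W_O split
# row-wise; each "rank" runs attention on its own heads, and summing the
# partial outputs stands in for the all-reduce.
import torch

d, h, tp = 64, 8, 2                     # hidden dim, number of heads, TP size
d_h = d // h
X = torch.randn(10, d)                  # 10 tokens
W_Q, W_K, W_V, W_O = (torch.randn(d, d) / d**0.5 for _ in range(4))

def attention(X, Wq, Wk, Wv):
    """Multi-head attention over whichever heads the given shards contain."""
    n_heads = Wq.shape[1] // d_h
    Q = (X @ Wq).view(-1, n_heads, d_h).transpose(0, 1)
    K = (X @ Wk).view(-1, n_heads, d_h).transpose(0, 1)
    V = (X @ Wv).view(-1, n_heads, d_h).transpose(0, 1)
    heads = torch.softmax(Q @ K.transpose(-2, -1) / d_h**0.5, dim=-1) @ V
    return heads.transpose(0, 1).reshape(X.shape[0], -1)

# column-wise shards of the input projections, matching row-wise shards of W_O
Wq_s, Wk_s, Wv_s = W_Q.chunk(tp, dim=1), W_K.chunk(tp, dim=1), W_V.chunk(tp, dim=1)
Wo_s = W_O.chunk(tp, dim=0)

# each "rank" works on its own heads; summing the partials mimics the all-reduce
partials = [attention(X, Wq_s[r], Wk_s[r], Wv_s[r]) @ Wo_s[r] for r in range(tp)]
tp_output = sum(partials)

reference = attention(X, W_Q, W_K, W_V) @ W_O             # unsharded computation
print(torch.allclose(tp_output, reference, atol=1e-4))    # True
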
Tensor Parallelism in the Feed-Forward Network


Similarly, we can apply TP to the FFN in an even more straightforward way.

[Figure: tensor parallelism applied to the Feed-Forward Network]

  • The first linear layer is column-parallel.
  • The second linear layer is row-parallel (see the sketch below).

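Here is the same kind of single-process simulation for the FFN: the first weight matrix is split column-wise, the second row-wise, and the explicit sum of partial outputs stands in for the all-reduce. The shapes, the ReLU activation, and the file name are assumptions for illustration. Note that the element-wise activation can be applied locally on each column shard, which is exactly why the first layer must be the column-parallel one.

tp_ffn_sketch.py
# Simulated TP for the FFN: column-parallel first layer, row-parallel second
# layer, partial outputs summed in place of an all-reduce.
import torch

d, d_ff, tp = 64, 256, 4
X = torch.randn(10, d)
W1 = torch.randn(d, d_ff) / d**0.5
W2 = torch.randn(d_ff, d) / d_ff**0.5

W1_shards = W1.chunk(tp, dim=1)     # column-parallel first layer
W2_shards = W2.chunk(tp, dim=0)     # row-parallel second layer

# each "rank" applies its shard of both layers; the activation is element-wise,
# so it runs locally on the column shard with no communication
partials = [torch.relu(X @ W1_shards[r]) @ W2_shards[r] for r in range(tp)]
tp_output = sum(partials)

reference = torch.relu(X @ W1) @ W2
print(torch.allclose(tp_output, reference, atol=1e-4))   # True
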
Although this form of TP is elegant, it comes with a few practical constraints:

  • The TP size (number of GPUs) must be less than or equal to the number of attention heads—a single head cannot be split across GPUs.
  • The number of attention heads must be divisible by the number of GPUs, so each GPU receives an equal share of heads.
  • The feed-forward hidden dimension must be divisible by the TP size, to ensure equal distribution of the FFN parameters.

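A quick sanity check of these constraints, using assumed example values rather than any real model config:

tp_constraints_check.py
# Illustrative divisibility checks for a TP configuration; the numbers are
# placeholders, not read from an actual model.
num_attention_heads = 32
ffn_hidden_dim = 8192
tp_size = 4                 # number of GPUs

assert tp_size <= num_attention_heads, "TP size cannot exceed the number of heads"
assert num_attention_heads % tp_size == 0, "heads must split evenly across GPUs"
assert ffn_hidden_dim % tp_size == 0, "FFN hidden dim must split evenly across GPUs"
print("valid TP configuration")
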
Now that we understand the theory, how do we use TP in practice?

Fortunately, all transformer models integrated with the Hugging Face Transformers library can leverage TP via the tp_plan argument.

demo_tp.py
from transformers import AutoModelForCausalLM
import torch

# tp_plan="auto" shards the model's weights across the available GPUs
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", tp_plan="auto")
inputs = torch.tensor([[1, 2, 3, 4]], device="cuda")  # dummy token IDs, batch of 1
outputs = model(inputs)
Run it on 4 GPUs with torchrun:
torchrun --nproc_per_node 4 demo_tp.py

Read more about how to customize the TP plan in the Transformers’ documentation – Distributed inference.

While TP efficiently distributes large matrix multiplications, it does not solve all challenges of training or serving large models. Its scalability is limited by the number of attention heads, and because TP requires frequent communication between GPUs, performance can degrade across multiple nodes where inter-node bandwidth is lower. To overcome these limitations, additional forms of parallelism—such as Pipeline Parallelism (PP)—are needed. We’ll explore these techniques in future sections!

ArthurZ

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", tp_plan="auto") damn simple!