
Bring up Llama 3.1 405B on two pods #68

Open · wants to merge 31 commits into base: flash_attention_405b

Conversation

tengyifei (Collaborator)

What does this PR do?

These are the changes needed to make Llama 3.1 405B work on two Trillium TPU pods. They include:

  • Initialize the model layer by layer on the CPU, to work around an OOM bug that occurs when initializing all of the model's layers at once on the TPU.
  • Persist the inv_freq buffer in Llama for now, so that it can be initialized via load_state_dict. Otherwise, the inv_freq buffer stays a meta tensor.
  • Add a USE_SINGLE_SLICE env var before importing jax, to prevent jax from re-initializing the MegaScale client (sketched below). A corresponding custom libtpu build is required.
  • Use a custom hybrid ring mesh to improve the performance of collectives.
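
For the env var change above, a minimal sketch of the required ordering. The variable name comes from this PR; the value "1" is an assumption, and the custom libtpu build mentioned above is still required:

import os

# USE_SINGLE_SLICE must be set before `import jax`, so that jax does not
# re-initialize the MegaScale client (per the PR description). The value "1"
# is an assumption.
os.environ.setdefault("USE_SINGLE_SLICE", "1")

import jax  # noqa: E402  (deliberately imported after setting the env var)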

# Each device reports a 'slice_index' attribute; the number of slices is the
# largest slice index plus one.
device_attributes = xr.global_runtime_device_attributes()
num_slices = max(int(d.get('slice_index', 0)) for d in device_attributes) + 1

if num_slices > 1 and model_args.spmd_2d_sharding == 0:
Collaborator


Should we give users the flexibility to choose whether to use 2D sharding or not?

model = AutoModelForCausalLM.from_config(config, trust_remote_code=model_args.trust_remote_code)
# note: at this point, the mode is not materialized
Collaborator


NIT: typo, "mode" should be "model".

zpcore (Collaborator) commented Nov 1, 2024

This is great! Can you point us to some material on how the hybrid ring mesh works? Thanks.
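
Not the PR's custom ring mesh, but for orientation: the stock hybrid-mesh helper in torch_xla builds a global mesh from a per-slice (ICI) mesh and an across-slice (DCN) mesh; the custom ring variant presumably reorders devices within those axes. The shapes and axis names below are assumptions for a 2 x v6e-256 setup:

import torch_xla.distributed.spmd as xs

# Assumed 2-slice layout: within each slice (ICI), 64-way fsdp x 4-way tensor;
# across slices (DCN), 2-way along the fsdp axis. Global mesh shape: (128, 4).
mesh = xs.HybridMesh(
    ici_mesh_shape=(64, 4),
    dcn_mesh_shape=(2, 1),
    axis_names=('fsdp', 'tensor'),
)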

# Multi-slice 2D sharding
tensor_axis = model_args.spmd_2d_sharding
fsdp_axis = num_devices // tensor_axis
mesh_shape = (fsdp_axis, tensor_axis) # Should be (128, 4)
Collaborator


The comment seems to apply only to 2 slices with 4-way tensor parallelism.
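
For reference, a generalized form of the quoted hunk using the stock SPMD mesh (the value 4 for spmd_2d_sharding and the axis names are assumptions; the PR uses a custom hybrid ring mesh instead):

import numpy as np
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

num_devices = xr.global_runtime_device_count()
tensor_axis = 4                          # assumed value of model_args.spmd_2d_sharding
fsdp_axis = num_devices // tensor_axis   # (128, 4) only holds for 512 devices with TP=4
mesh = xs.Mesh(np.arange(num_devices), (fsdp_axis, tensor_axis), ('fsdp', 'tensor'))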


model.load_state_dict(dict_of_params, assign=True)
model.to('xla')
Collaborator


Question: with the meta device, did we skip loading the model into CPU RAM and instead load it onto the XLA device directly?
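
A minimal sketch of the pattern the quoted hunk appears to rely on, assuming the model skeleton is built on the meta device and the checkpoint is loaded as CPU tensors; the model name and the checkpoint-loading helper are placeholders:

import torch
import torch_xla  # registers the 'xla' device
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('meta-llama/Llama-3.1-405B')  # placeholder

# Build the module structure without allocating real storage for the weights.
with torch.device('meta'):
    model = AutoModelForCausalLM.from_config(config)

# `assign=True` replaces the meta parameters with the loaded tensors instead of
# copying into them (copying into meta tensors is not possible).
dict_of_params = load_checkpoint_shards()  # hypothetical helper returning CPU tensors
model.load_state_dict(dict_of_params, assign=True)

# Only now move the materialized parameters to the XLA device.
model = model.to('xla')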

> /workspaces/torch/transformers/<eval_with_key>.6(210)forward()

206         mul_15 = torch.ops.aten.mul.Tensor(add_7, _to_copy_3);  add_7 = _to_copy_3 = None
207         sum_1 = torch.ops.aten.sum.dim_IntList(mul_15, [0, 1], True);  mul_15 = None
208         view_58 = torch.ops.aten.view.default(sum_1, [4096]);  sum_1 = None
209         _to_copy_4 = torch.ops.aten._to_copy.default(mul_14, dtype = torch.float32, layout = torch.strided, device = device(type='xla', index=0));  mul_14 = None
210  ->     mul_16 = torch.ops.aten.mul.Tensor(_to_copy_4, _to_copy_2)
211         mul_17 = torch.ops.aten.mul.Tensor(_to_copy_4, rsqrt_1);  _to_copy_4 = None
212         sum_2 = torch.ops.aten.sum.dim_IntList(mul_16, [2], True);  mul_16 = None
213         detach_29 = torch.ops.aten.detach.default(rsqrt_1);  rsqrt_1 = None
214         detach_30 = torch.ops.aten.detach.default(detach_29);  detach_29 = None

_to_copy_2 is `meta`.
_to_copy_4 is `xla`.
This adds support for hardcoded XLA device names in the input graph (e.g. due
to casts). We need to trace the graph with XLA devices accordingly.

Why does a cast involve devices? It looks like the core ATen IR doesn't include
a dedicated cast op, only `_to_copy`: https://pytorch.org/docs/stable/torch.compiler_ir.html
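
A small repro of why a plain dtype cast shows up as `_to_copy` in a traced graph. The device kwarg appears when the traced tensors live on an XLA device; this sketch uses CPU tensors, so only the dtype is captured:

import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    # There is no dedicated cast op in the core ATen IR, so a dtype cast is
    # captured as torch.ops.aten._to_copy.default(..., dtype=torch.float32).
    return x.to(torch.float32) * 2

gm = make_fx(f)(torch.randn(4, dtype=torch.bfloat16))
print(gm.graph)  # look for aten._to_copy.default in the printed graph
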
This helps XLA avoid gathering the logits and saves a few Gigs of RAM.
Before: http://shortn/_tlV88E1Ca3
After:  http://shortn/_GodQzu6GMu

Using a custom aten slice fast path in the ptxla branch.
Before: http://shortn/_1Xyrm0lLdL
After:  http://shortn/_SgU6vG1pNm

After we trace out HLO in scan using placeholder tensors, memory usage
drops enough that we can up the batch size to 16.
Profile: http://shortn/_ESRGhAhKce

These are negative optimizations (probably some are wrong).
On 2D sharded v6e-8, memory usage goes 26 GiB -> 22 GiB.
Step time: 4.76s -> 4.71s.