I am testing GPT-2 model training using TransformerLayer.
Training slows down significantly when sequence_parallel=True, achieving roughly 1/5th of the throughput of training without sequence parallelism.
I also observe that sequence_parallel=True runs out of memory (OOM) at some batch sizes where sequence_parallel=False runs successfully.
Do you have any recommendations for achieving better throughput with sequence_parallel and fp16?
The model is ~4.3B parameters with 12 layers, tp_size=4, fp16, seq_len=2048, trained on 8 A100 GPUs.
import torch
import transformer_engine.pytorch as te

layer = te.TransformerLayer(
    5120,                              # hidden_size
    20480,                             # ffn_hidden_size
    40,                                # num_attention_heads
    layer_number=(l + 1),              # l: layer index from the build loop
    self_attn_mask_type="causal",
    tp_group=tp_group(),               # tensor-parallel process group
    tp_size=tp_size,                   # tp_size = 4
    params_dtype=torch.float16,
    output_layernorm=True,
    layer_type="encoder",
    set_parallel_mode=True,
    fuse_qkv_params=True,
    sequence_parallel=True,
    qkv_weight_interleaved=False,
    attention_softmax_in_fp32=False,
)
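
For reference, a minimal forward-pass sketch of how the layer is driven in the sequence-parallel case (batch size and tensor shapes here are illustrative assumptions, not from the snippet above): with sequence_parallel=True each tensor-parallel rank is expected to hold a slice of the sequence dimension, so the local input is [seq_len // tp_size, batch, hidden] rather than the full [seq_len, batch, hidden].

```python
# Illustrative forward/backward pass; assumes `layer`, `tp_size` as above.
import torch

seq_len, batch_size, hidden_size = 2048, 4, 5120   # batch_size is a placeholder
local_seq = seq_len // tp_size                      # 2048 // 4 = 512 tokens per rank

hidden_states = torch.randn(
    local_seq, batch_size, hidden_size,
    dtype=torch.float16, device="cuda", requires_grad=True,
)

# The causal mask is applied internally via self_attn_mask_type="causal",
# so no attention_mask needs to be passed here.
out = layer(hidden_states)          # out: [local_seq, batch, hidden]
loss = out.float().sum()            # dummy loss for illustration
loss.backward()
```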