I am testing GPT-2 model training using TransformerLayer.
Training slows down significantly when sequence_parallel=True, achieving roughly 1/5th of the throughput of training without sequence parallelism.
I also observe that sequence_parallel=True runs out of memory (OOM) at some batch sizes where sequence_parallel=False runs successfully.
Do you have any recommendations for achieving better throughput with sequence_parallel and fp16?
The model is ~4.3B parameters with 12 layers, tp_size=4, fp16, seq_len=2048, trained on 8 A100 GPUs.
import torch
import transformer_engine.pytorch as te

layer = te.TransformerLayer(
    5120,                              # hidden_size
    20480,                             # ffn_hidden_size
    40,                                # num_attention_heads
    layer_number=(l + 1),              # l: layer index from the build loop
    self_attn_mask_type="causal",
    tp_group=tp_group(),               # tensor-parallel process group
    tp_size=tp_size,                   # tp_size = 4
    params_dtype=torch.float16,
    output_layernorm=True,
    layer_type="encoder",
    set_parallel_mode=True,
    fuse_qkv_params=True,
    sequence_parallel=True,
    qkv_weight_interleaved=False,
    attention_softmax_in_fp32=False,
)
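
For reference, a minimal forward-pass sketch of how the layer is driven in the sequence-parallel case (batch size and tensor shapes here are illustrative assumptions, not from the snippet above): with sequence_parallel=True each tensor-parallel rank is expected to hold a slice of the sequence dimension, so the local input is [seq_len // tp_size, batch, hidden] rather than the full [seq_len, batch, hidden].

```python
# Illustrative forward/backward pass; assumes `layer`, `tp_size` as above.
import torch

seq_len, batch_size, hidden_size = 2048, 4, 5120   # batch_size is a placeholder
local_seq = seq_len // tp_size                      # 2048 // 4 = 512 tokens per rank

hidden_states = torch.randn(
    local_seq, batch_size, hidden_size,
    dtype=torch.float16, device="cuda", requires_grad=True,
)

# The causal mask is applied internally via self_attn_mask_type="causal",
# so no attention_mask needs to be passed here.
out = layer(hidden_states)          # out: [local_seq, batch, hidden]
loss = out.float().sum()            # dummy loss for illustration
loss.backward()
```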