https://github.com/KellerJordan/modded-nanogpt
The original code can be found at https://github.com/KellerJordan/modded-nanogpt/blob/2305e1ed6cba6399c9b20f8f7b5cd64aad3de170/records/120424_ValueEmbed/train_gpt2.py#L269
Proposed in December 2024, this is the earliest place I can find that scales a layerwise embedding in a transformer. Credit should go to Braden Koszarsky.
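
A minimal sketch of the idea, assuming a shared token-indexed value embedding `ve` mixed into each attention layer's values through a learnable per-layer scalar. The names (`lamb`, `ve`, `ValueEmbedAttention`) and details here are illustrative, not the exact modded-nanogpt code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueEmbedAttention(nn.Module):
    """Causal attention whose values are mixed with a token value embedding,
    scaled by a learnable per-layer scalar (the "layerwise embedding scaling")."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # learnable layerwise scalar; each layer learns its own mixing weight
        self.lamb = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor, ve: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # mix the layer's value projection with the shared value embedding
        v = (1 - self.lamb) * v + self.lamb * ve
        q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, C)
        return self.proj(y)

# Usage sketch: one value embedding shared across all layers, looked up once
# per batch of tokens and passed to every attention layer.
vocab_size, dim, n_heads = 50304, 768, 12
value_embed = nn.Embedding(vocab_size, dim)
attn = ValueEmbedAttention(dim, n_heads)
tokens = torch.randint(0, vocab_size, (2, 16))
x = torch.randn(2, 16, dim)
out = attn(x, value_embed(tokens))
```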