https://github.com/KellerJordan/modded-nanogpt
The original code can be found at https://github.com/KellerJordan/modded-nanogpt/blob/2305e1ed6cba6399c9b20f8f7b5cd64aad3de170/records/120424_ValueEmbed/train_gpt2.py#L269
Proposed in December 2024, this is the earliest place I can find that scales a layerwise embedding in a transformer. Credit should go to Braden Koszarsky.
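
A minimal sketch of the idea, assuming a shared token-indexed value embedding `ve` mixed into each attention layer's values through a learnable per-layer scalar. The names (`lamb`, `ve`, `ValueEmbedAttention`) and details here are illustrative, not the exact modded-nanogpt code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueEmbedAttention(nn.Module):
    """Causal attention whose values are mixed with a token value embedding,
    scaled by a learnable per-layer scalar (the "layerwise embedding scaling")."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # learnable layerwise scalar; each layer learns its own mixing weight
        self.lamb = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor, ve: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # mix the layer's value projection with the shared value embedding
        v = (1 - self.lamb) * v + self.lamb * ve
        q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, C)
        return self.proj(y)

# Usage sketch: one value embedding shared across all layers, looked up once
# per batch of tokens and passed to every attention layer.
vocab_size, dim, n_heads = 50304, 768, 12
value_embed = nn.Embedding(vocab_size, dim)
attn = ValueEmbedAttention(dim, n_heads)
tokens = torch.randint(0, vocab_size, (2, 16))
x = torch.randn(2, 16, dim)
out = attn(x, value_embed(tokens))
```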