Consider model weight averaging after pretext task?

I've heard that this can help performance. Maybe it would help us here?

Basically, do three runs on the pretext task, take the three resulting sets of weights, and average them.
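
A minimal sketch of the idea, assuming "average" means a plain element-wise mean over runs that share the same architecture and parameter names (an assumption until the paper is checked). The `Weights` type, the `average_weights` helper, and the toy values are hypothetical; plain dicts of float lists stand in for a framework's state dict.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a framework state dict: parameter name -> flat values.
Weights = Dict[str, List[float]]


def average_weights(runs: Sequence[Weights]) -> Weights:
    """Return the element-wise mean of several weight dicts with identical keys and sizes."""
    if not runs:
        raise ValueError("need at least one set of weights")
    keys = runs[0].keys()
    for w in runs[1:]:
        if w.keys() != keys:
            raise ValueError("all runs must share the same parameter names")
    averaged: Weights = {}
    for name in keys:
        columns = [w[name] for w in runs]
        if len({len(c) for c in columns}) != 1:
            raise ValueError(f"size mismatch for parameter {name!r}")
        # zip(*columns) groups the i-th value of this parameter across all runs.
        averaged[name] = [sum(vals) / len(runs) for vals in zip(*columns)]
    return averaged


# Toy weights from three hypothetical pretext runs (values are made up).
run_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
run_b = {"layer.weight": [2.0, 4.0], "layer.bias": [3.0]}
run_c = {"layer.weight": [3.0, 6.0], "layer.bias": [6.0]}

merged = average_weights([run_a, run_b, run_c])
print(merged)
```

The averaged dict would then be loaded back into a single model before the downstream task. Whether the three runs need to start from the same initialization, or differ only in seed or data order, is something to confirm against the paper.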

I need to find the paper this is from to make sure I'm understanding it correctly.