Rephrasing the Web: A Recipe for Compute & Data-Efficient Language Modeling

The authors show that generating new training examples by rephrasing web documents with an off-the-shelf model (Mistral-7B) yields roughly 3x faster convergence than training on the raw data alone. Documents are rephrased into styles such as 'like Wikipedia' or a 'question-answer format', and at different levels of style diversity, e.g. language suited to a child versus a scholar.

Here is one of a few example rephrasing prompts:

“For the following paragraph give me a paraphrase of the same in high-quality English language as in sentences on Wikipedia”
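The pipeline can be sketched as: pick a style prompt, prepend it to each web document, send it to the rephrasing model, and mix the outputs back in with the raw text for training. Below is a minimal Python sketch of that loop; the Wikipedia prompt is quoted from the paper, but the other style prompts, the helper names, and the `generate` callback (standing in for any Mistral-7B chat client) are illustrative assumptions, not the authors' exact implementation.

```python
# Style prompts for rephrasing. The "wikipedia" wording is from the paper;
# the others are hypothetical paraphrases of the styles it describes.
STYLE_PROMPTS = {
    "wikipedia": ("For the following paragraph give me a paraphrase of the same "
                  "in high-quality English language as in sentences on Wikipedia"),
    "qa": "Convert the following paragraph into a question-answer format",
    "easy": "Paraphrase the following paragraph in simple language a child can understand",
    "hard": "Paraphrase the following paragraph in terse, scholarly language",
}


def build_rephrase_prompt(document: str, style: str) -> str:
    """Assemble the full prompt sent to the rephrasing model."""
    if style not in STYLE_PROMPTS:
        raise ValueError(f"unknown style: {style}")
    return f"{STYLE_PROMPTS[style]}:\n\n{document}"


def rephrase_corpus(docs, styles, generate):
    """Yield (style, text) pairs for training.

    `generate` wraps the LLM call (e.g. Mistral-7B). The original raw
    document is yielded too, since training mixes real and synthetic data.
    """
    for doc in docs:
        yield ("raw", doc)  # keep the original web text in the mix
        for style in styles:
            yield (style, generate(build_rephrase_prompt(doc, style)))
```

A dummy `generate` (any function from prompt string to string) is enough to exercise the loop; in practice it would batch documents through the rephrasing model before pre-training begins.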
