Rephrasing the Web: A Recipe for Compute & Data-Efficient Language Modeling
The authors show that rephrasing web documents with an off-the-shelf instruction-tuned model (Mistral-7B) and training on the rephrased text yields roughly 3x faster convergence than training on the raw data alone. The rephrasing is done in a particular style, such as 'like Wikipedia' or in a 'question-answer format', and at different levels of complexity, e.g., as if written for a child or for a scholar. In their detailed analysis they found that:
- Style diversity improves the value of the rephrased data
- A reasonably capable paraphraser model is needed
- Rephrasing outperforms standard augmentations such as random deletion or synonym replacement
Here is one of the example rephrasing prompts:
“For the following paragraph give me a paraphrase of the same in high-quality English language as in sentences on Wikipedia”
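As a concrete illustration, here is a minimal sketch of what a single rephrasing step could look like using Hugging Face transformers. The checkpoint name, chat-template usage, and generation settings are assumptions for the sketch, not details taken from the paper:

```python
# Minimal sketch of a rephrasing step with an off-the-shelf
# instruction-tuned model. The checkpoint and generation settings
# below are assumptions, not the paper's exact configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

# The 'Wikipedia-style' prompt quoted above; other styles
# (Q&A format, child-level, scholarly) would swap in a
# different instruction here.
STYLE_PROMPT = (
    "For the following paragraph give me a paraphrase of the same "
    "in high-quality English language as in sentences on Wikipedia"
)

def rephrase(paragraph: str, max_new_tokens: int = 512) -> str:
    """Return one rephrased version of `paragraph`."""
    messages = [{"role": "user", "content": f"{STYLE_PROMPT}\n\n{paragraph}"}]
    input_ids = tokenizer.apply_chat_template(
        messages, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens, keeping only the generated paraphrase.
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
```

Note that this covers only the rephrasing step; in the paper's setup the rephrased text is mixed with the original web text for pre-training rather than replacing it.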