Paper Abstract
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps.
In this video, we turn DeepSeek-R1-Distill-Qwen-1.5B into a deep-thinking model that supports test-time scaling.
Note: this works with any model that generates thinking tokens!
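Below is a rough sketch of how budget forcing can be wired up with mlx-lm. It is not the code from the linked gist: the mlx-community repo id, the "</think>" delimiter handling, the sample question, and the token budgets are illustrative assumptions for R1-style models.

# Rough sketch of budget forcing with mlx-lm (pip install mlx-lm).
# Assumption: the model marks the end of its reasoning with "</think>".
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-1.5B")

question = "How many r's are in 'strawberry'?"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    tokenize=False,
)

THINK_END = "</think>"  # R1-style end-of-thinking delimiter
NUM_WAITS = 2           # how many times to push the model to keep thinking
BUDGET = 512            # max tokens per thinking round

# Lengthen thinking: whenever the model tries to stop, drop everything after
# the closing tag and append "Wait" so it re-examines its reasoning.
for _ in range(NUM_WAITS):
    completion = generate(model, tokenizer, prompt=prompt, max_tokens=BUDGET)
    prompt += completion
    if THINK_END in prompt:
        prompt = prompt.split(THINK_END)[0] + "\nWait"

# Terminate thinking: force the end-of-thinking delimiter so the model answers.
prompt += "\n" + THINK_END + "\n\n"
answer = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(answer)

Raising NUM_WAITS spends more test-time compute on thinking, which is exactly the scaling knob the paper studies; see Awni's gist linked below for the actual implementation.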
🔗 Links 🔗
s1: Simple test-time scaling
https://arxiv.org/pdf/2501.19393
MLX LM - https://pypi.org/project/mlx-lm/
Code by Awni Hannun - https://gist.github.com/awni/9d8b35ef...
❤️ If you want to support the channel ❤️
Support here:
Patreon - / 1littlecoder
Ko-Fi - https://ko-fi.com/1littlecoder
🧭 Follow me on 🧭
Twitter - / 1littlecoder