Understanding DeepSeek R1-zero and R1

Paper link: https://arxiv.org/abs/2501.12948

<aside> 🔖 We’ll try to understand the DeepSeek paper at a high level and pick up some terminology along the way. PS: This is my first blog, so if you spot any misinterpretations, please let me know; I have mentioned my Twitter profile here. I hope to improve as I write more blogs like this. So yeah, let’s get into it. 💪🏻

</aside>

This aged well, didn't it?

Let’s set the context first:

Overview of DeepSeek R1-zero:

We’ll now look at what the abstract says:

The paper introduces DeepSeek’s first-generation reasoning models, R1-zero and R1.
R1-zero is trained using large-scale reinforcement learning (RL), without supervised fine-tuning (SFT) as a preliminary step.
However, R1-zero has limitations such as poor readability and language mixing.
To overcome these, they release R1, which is trained in multiple stages and uses cold-start data before RL.

Introduction:

Gives a shout-out to other LLMs.
Highlights the importance of **post-training methods** (quantization, pruning, fine-tuning, distillation, etc.).
OpenAI’s o1 model introduced inference-time scaling, i.e. letting the model take time to think first and only then answer. This was a major reason why o1 performed so well; in technical terms, it is called generating a long Chain of Thought (CoT).
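To make "think first, then answer" a bit more concrete, here is a minimal sketch of what a long-CoT response could look like and how you might split the reasoning from the final answer. The sample text, the `<think>`/`<answer>` tag layout, and the helper name `split_reasoning` are all assumptions for illustration, not something taken from this post or from any particular API.

```python
import re
from typing import Tuple

# Hypothetical model output: the reasoning comes first, then the final answer.
# The tag layout is an assumption made for this sketch.
SAMPLE_OUTPUT = (
    "<think>17 * 3 = 51. 51 + 9 = 60.</think>"
    "<answer>60</answer>"
)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (chain_of_thought, final_answer) from a tagged response."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if think is None or answer is None:
        raise ValueError("response is missing <think> or <answer> tags")
    return think.group(1).strip(), answer.group(1).strip()


cot, final = split_reasoning(SAMPLE_OUTPUT)

# A longer chain of thought means more tokens spent at inference time.
# Whitespace splitting is only a rough stand-in for a real tokenizer.
print(f"reasoning length (rough word count): {len(cot.split())}")
print(f"final answer: {final}")
```

The point of the sketch is only that the model spends extra output on reasoning before committing to an answer; "scaling" inference time means letting that reasoning part grow longer.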