
A small 3B model with coding scores comparable to Opus 4.5: mysterious model sparks heated discussion

In recent days, a small 3B model has drawn wide attention on X. On some difficult, verifiable reasoning tasks such as programming, it has entered the performance range of frontier models such as Gemini 3 Pro, GPT-5 high, Claude Opus 4.5, GLM-5, and Kimi K2.5, while being far smaller than any of them.

 

The model is called VibeThinker-3B. It is a dense reasoning model with 3 billion parameters, built to explore how far verifiable reasoning ability can be pushed at a strictly small model scale.

After the model was released, many people were impressed by its performance and said they wanted to try it.

Notably, it is a Chinese model, developed by the Sina Weibo team.

According to the technical report, the model is designed for tasks with reliable verification signals, including mathematical reasoning, competitive programming, STEM reasoning, and instruction following under explicit constraints.

On benchmarks in these areas it performs strongly: it scores 94.3 on AIME26, 89.3 on HMMT25, and 80.2 on LiveCodeBench v6 (Pass@1). On LeetCode's latest, previously unseen weekly and biweekly contests held from April 25 to May 31, 2026, it achieved a pass rate of 96.1%.
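
For readers unfamiliar with the Pass@1 metric: it is the probability that a single sampled solution passes all tests. A common way to estimate pass@k from n samples, of which c are correct, is the unbiased estimator 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator; whether the report computes its LiveCodeBench score in exactly this way is an assumption, and the function name is only illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Illustrative helper; not taken from the VibeThinker-3B report.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With k = 1 the estimate reduces to the fraction of correct samples.
    print(pass_at_k(n=10, c=8, k=1))
```

For k = 1 the estimator is simply c / n, so a Pass@1 of 80.2 corresponds to roughly four out of five single attempts being correct on average.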

 

How is this model trained? The technical report revealed some details.

First, it is built on Qwen2.5-Coder-3B and post-trained with an upgraded Spectrum-to-Signal pipeline. This pipeline strengthens data synthesis, quality filtering, and curriculum learning in supervised fine-tuning (SFT); extends MGPO-style reinforcement learning to multiple verifiable domains; preserves complete long-context reasoning trajectories; and consolidates the resulting abilities through offline self-distillation and instruction reinforcement learning (Instruction RL).

 

VibeThinker-3B overall training process

 

Spectrum to Signal process.

In addition, VibeThinker-3B introduces Claim Level Reliability Assessment (CLR), a test-time scaling strategy for reasoning tasks with verifiable answers. CLR further improves performance on the mathematical benchmarks, raising AIME26 from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, and BruMO25 to 99.2.
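
The article does not describe how CLR works internally, so the following is only a generic illustration of test-time scaling, not the report's algorithm. It assumes that several solutions are sampled per problem and that some hypothetical scorer has assigned each intermediate claim a reliability in [0, 1]. Each solution is weighted by the product of its claim scores, and the final answer with the largest total weight is selected. The function name and the input format are assumptions.

```python
from collections import defaultdict
from math import prod


def select_answer(candidates):
    """Pick a final answer by reliability-weighted voting.

    candidates: list of (answer, claim_scores) pairs, where claim_scores
    holds hypothetical per-claim reliabilities in [0, 1].
    This is a generic sketch, not the CLR procedure from the report.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    totals = defaultdict(float)
    for answer, claim_scores in candidates:
        # A solution is only as trustworthy as all of its claims together;
        # a solution with no scored claims contributes no weight.
        weight = prod(claim_scores) if claim_scores else 0.0
        totals[answer] += weight
    return max(totals, key=totals.get)


if __name__ == "__main__":
    samples = [
        ("a", [0.1]),
        ("a", [0.1]),
        ("b", [0.9]),
    ]
    # Plain majority voting would pick "a"; weighting by reliability picks "b".
    print(select_answer(samples))
```

The point of the design is that extra inference-time samples help only if unreliable reasoning chains can be discounted, which plain majority voting does not do.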

 

The specific training process is as follows:

Two-stage, curriculum-based SFT. The first stage covers a broad range of abilities: mathematics, programming, STEM reasoning, general dialogue, and instruction following. The second stage shifts toward harder and broader reasoning samples. Diversity-exploring distillation is used to preserve multiple valid solution paths.
Multi-domain reasoning reinforcement learning. VibeThinker-3B reuses MGPO. Reinforcement learning is applied sequentially to mathematics, programming, and STEM reasoning tasks. Training uses a single 64K long-context window to preserve complete long-horizon reasoning trajectories.
Offline self-distillation. High-quality trajectories are filtered from the mathematics, programming, and STEM RL checkpoints and distilled into a single unified student model. A learning potential score is used to prioritize trajectories that are correct but not yet well imitated by the student.
Instruct RL. The final stage improves controllability on user-facing prompts. For format-sensitive and open-ended instruction data, rule-based validators and rubric-based reward models are used.
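
The article does not give the formula for the learning potential score used in offline self-distillation. As a rough sketch of the stated idea (prefer trajectories that are correct but not yet well imitated by the student), the example below scores a correct trajectory by the student's mean per-token negative log-likelihood and gives incorrect trajectories a score of zero. The `Trajectory` fields, the scoring rule, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    text: str
    is_correct: bool               # verdict from a verifier (assumed given)
    student_logprobs: list[float]  # per-token log-probs under the student


def learning_potential(t: Trajectory) -> float:
    """Hypothetical score: high when correct but poorly imitated."""
    if not t.is_correct or not t.student_logprobs:
        return 0.0
    # Mean negative log-likelihood: larger means the student
    # assigns low probability to this correct trajectory.
    return -sum(t.student_logprobs) / len(t.student_logprobs)


def prioritize(trajectories: list[Trajectory], top_k: int) -> list[Trajectory]:
    """Return up to top_k trajectories with positive score, best first."""
    ranked = sorted(trajectories, key=learning_potential, reverse=True)
    return [t for t in ranked if learning_potential(t) > 0.0][:top_k]


if __name__ == "__main__":
    pool = [
        Trajectory("easy proof", True, [-0.1, -0.1]),
        Trajectory("hard proof", True, [-2.0, -1.0]),
        Trajectory("wrong proof", False, [-3.0, -3.0]),
    ]
    for t in prioritize(pool, top_k=2):
        print(t.text, learning_potential(t))
```

Under this scoring, trajectories the student already reproduces with high probability add little training signal and fall to the bottom of the ranking, while incorrect ones are excluded entirely.
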

In a recent post, well-known AI researcher and blogger Sebastian Raschka summarized the key points disclosed in the VibeThinker-3B technical report.

If you are interested in the details, you can read the technical report in full. The model is also available for public download.

Report Title: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

However, the model's applicability is clearly limited: it does not perform well in fields that require general knowledge.

 

The team states this explicitly and proposes the "parameter compression coverage hypothesis": different abilities depend on model parameters in very different ways. Verifiable reasoning is closer to a highly compressible, parameter-dense ability whose core consists of multi-step reasoning, constraint satisfaction, self-correction, and answer verification. When the structure of the task space is clear enough and the feedback signal is reliable enough, compact models may also reach reasoning capabilities close to the frontier. In contrast, open-domain knowledge, general dialogue, and understanding of long-tail scenarios rely more on large parameter counts to cover facts, concepts, and world knowledge broadly.

This hypothesis is insightful. VentureBeat wrote in its report, "It reveals a partial decoupling between reasoning ability and factual knowledge, and the former can be compressed more effectively than previously imagined, an insight that has profound implications for how the industry views model design, deployment costs, and the widespread adoption of advanced artificial intelligence capabilities."

 

The authors state that their goal is not to build a small model that replaces large models, but to examine the true limits of small models along specific capability dimensions. With VibeThinker-3B, they hope to show that small models should not be seen only as a compromise for reducing deployment costs. In capability areas with clear feedback and verification mechanisms, small language models represent a promising research direction, with the potential to reach frontier performance and to fundamentally complement the traditional parameter-scaling paradigm.

For now, the model still faces some skepticism in the community. If you are interested, you may want to try it yourself.

Report link: https://arxiv.org/pdf/2606.16140

HuggingFace link: https://huggingface.co/WeiboAI/VibeThinker-3B
