Microsoft open-sources the Phi-4 reasoning model: the long-winded AI makes a comeback

The most interesting question in the AI world is no longer "whose model has the most parameters", but whose small model can take down a big one.

Microsoft Research recently open-sourced a "small but mighty" piece of work: Phi-4-reasoning-plus, an open-source language model built specifically for deep, structured reasoning tasks.

At 14B parameters it is about one-fifth the size of the DeepSeek 70B distillation, yet it holds its own on math, science, code, and logical reasoning.
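Since the weights are open, here is a rough sketch of trying the model locally with the Hugging Face transformers library. The model id, prompt, and generation settings below are assumptions based on the usual Hugging Face conventions, not details taken from the report:

```python
# Minimal sketch: load the open-weight checkpoint and ask it a math question.
# The model id "microsoft/Phi-4-reasoning-plus" and the chat-template usage
# are assumptions; check the official model card for the exact format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning-plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many positive divisors does 360 have?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit a long thinking block before the answer, so leave room for it.
outputs = model.generate(inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```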

On the AIME 2025 math competition, the 14B model's first-attempt accuracy across the full problem set actually beat the 70B distilled heavyweight, and came within touching distance of the 671B DeepSeek-R1.

The Microsoft team broke with convention through a set of "reasoning chain" techniques, letting the AI slow down, talk through the problem, think things over repeatedly, and even allow itself to make mistakes. This shows up mainly in:

Chain-of-Thought as the core training objective. Instead of jumping straight to the answer like a traditional large model, it is explicitly trained to write out its reasoning process; in both the training data and the output, the model is required to spell out its chain of thought in a dedicated reasoning block before giving the final answer.
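To make that concrete, here is an illustrative sketch of what a chain-of-thought training target in this style could look like. The question, solution, and wording are invented, and the <think> delimiters follow the reasoning-block format commonly reported for this model family rather than actual Phi-4 training data:

```python
# Illustrative only: the *shape* of a chain-of-thought SFT example,
# not a real Phi-4-reasoning training record.
sft_example = {
    "prompt": "What is the sum of the first 10 positive odd numbers?",
    "target": (
        "<think>\n"
        "The first 10 odd numbers are 1, 3, 5, ..., 19. Their sum is 10^2 = 100.\n"
        "Check: pairing 1+19, 3+17, 5+15, 7+13, 9+11 gives 5 pairs of 20, also 100.\n"
        "</think>\n"
        "The answer is 100."
    ),
}
```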

Encourage "slow thinking" and reward the verbose reasoning process. In the RL (reinforcement learning) stage, the reward mechanism is specially designed to encourage a longer reasoning chain when answering wrong, and encourage a conciseness when answering right; as long as the model does not answer correctly, it is encouraged to "think two more steps", and the reasoning process can be longer, more detailed, and even repeatedly self-negation and correction.

The result? Not only are the answers correct, the line of thought is also clear.

There is an interesting detail in the technical report: for Phi-4-reasoning, the reasoning chain is neither "the longer the better" nor "the shorter the stronger"; instead it aims to land "just right", mimicking the length of time a human would spend thinking.

The concrete reward scheme in the RL stage is: "if you answer correctly, be concise; if you answer wrong, you are encouraged to keep thinking." On some tasks the model will even negate its own partial answer, throw it out, and start over. Of course, not every field sees big gains; in areas such as biology, chemistry, and discrete mathematics the AI still gets "stuck".

After SFT (supervised fine-tuning), Phi-4-reasoning-plus adds a layer of rule-based reinforcement learning on top, and the reward design is quite deliberate:

Answer correctly, and conciseness is encouraged (short reasoning is rewarded)

Answer wrong, and long-windedness is encouraged (more thinking is rewarded)

Points are deducted if the output format is wrong or the thinking is disorganized

Repetitive statements are penalized, encouraging diversity and exploration

This differs from traditional RLHF (reinforcement learning from human feedback). The Phi-4 team uses automatically verifiable math problems, and the reward function is tied directly to the correctness of the answer and the length of the reasoning chain, so the model is trained to "think more, write more, and add more reflection steps whenever something goes wrong."
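The report is the authority on the exact scoring, but as a hedged sketch of the idea, a rule-based reward combining those four signals might be structured roughly like this (every weight, threshold, and helper value below is an invented placeholder, not a number from the Phi-4-reasoning-plus report):

```python
# Hedged sketch of a rule-based reward in the spirit described above.
# All weights and thresholds are illustrative assumptions.
def reasoning_reward(answer_correct: bool,
                     reasoning_tokens: int,
                     format_ok: bool,
                     repetition_score: float) -> float:
    reward = 0.0

    if answer_correct:
        # Correct answers earn the base reward, plus a bonus for staying concise.
        reward += 1.0
        reward += max(0.0, 1.0 - reasoning_tokens / 2000)
    else:
        # Wrong answers are nudged toward longer, more thorough reasoning.
        reward += min(1.0, reasoning_tokens / 2000) * 0.5

    if not format_ok:
        # Deduct points when the reasoning block or answer formatting is broken.
        reward -= 0.5

    # Penalize repeated statements to encourage diverse exploration.
    reward -= repetition_score * 0.3

    return reward
```

In practice this would sit on top of an automatic verifier that checks each model answer against the known solution of the math problem before `answer_correct` is set.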

[Figure: performance of the Phi-4 reasoning models on cross-domain benchmarks]

According to the evaluation results in the report, Phi-4-reasoning and the plus variant not only beat the larger Distill-Llama-70B and DeepSeek-R1 on math/science benchmarks such as AIME, OmniMath, and GPQA, but also showed remarkably strong transfer to new domains such as algorithms (TSP/3SAT), planning (BA-Calendar), and code (LiveCodeBench), none of which were specifically covered during training.

This is the meta-capability the reasoning chain brings: the model doesn't just know how to solve problems, it knows how to reason. It can pick up a new problem type from a single example, and when it runs into a hard problem it has never seen, it slows down and tries again and again. Compared with a traditional large model that tries to nail the answer "in one step", this kind of slowed-down AI is more reliable and more resilient.

Even on "non-reasoning" tasks, general-capability tests such as long-context Q&A, instruction following, and toxicity detection, Phi-4-reasoning-plus also improves significantly. In the end, teaching AI to think slowly and check its own work may prove more sustainable than simply piling on compute and knowledge.
