At 1:30 a.m. today, Meta's generative AI lead Ahmad Al-Dahle posted a lengthy statement on social media, officially responding to the criticism of Llama 4, which was open-sourced just two days earlier.
Ahmad said that Llama 4 was released as soon as it was ready, so some variation in model quality across different services is inevitable, and Meta will fix these issues soon to improve performance. He also denied that the model was pre-trained on test sets.
However, in its official release announcement, Meta specifically called out DeepSeek, claiming that the newly open-sourced Llama 4 Maverick matches the coding capabilities of DeepSeek's recently released V3 model. Many well-known Chinese media outlets used this comparison as a headline hook.
It now seems that Meta's first counterattack has fallen flat. We look forward to their follow-up optimizations and the 2-trillion-parameter teacher model that is still in training.
Here is Ahmad's full statement:
We're glad that everyone is starting to use Llama 4. We have already heard about many people getting great results with these models.
That said, we have also heard reports of mixed quality across different services. Since we released the models as soon as they were ready, we expect it will take several days for all public implementations to be tuned and adjusted. We will keep fixing bugs and working through the integration process with our partners.
We have also heard claims that Llama 4 was trained on test sets. That is simply not true, and we would never do that. Our best assessment is that the quality differences people are seeing come from implementations that still need to be stabilized.
We believe the Llama 4 models are a significant technological advancement, and we look forward to working with the community to unlock their full value.
In fact, on the day Llama 4 was open-sourced, people were already questioning its performance, saying its coding ability lags far behind Grok 3, DeepSeek V3, and Claude Sonnet 3.5/3.7.
"Whether it's Scout or Maverick, even with detailed prompts they seem nearly unusable for actual coding work.
Given the effort Meta has put in, I was surprised that a 400-billion-parameter model (even a mixture-of-experts one) performed so badly. It's a far cry from DeepSeek V3."
"We tested Scout and Maverick across different platforms and found that both models performed poorly, not even matching models with far smaller parameter counts.
Beyond basic programming tasks, they make mistakes and struggle to follow instructions. It's worrying that Maverick's ranking sits close to Google's Gemini 2.5. They feel like GPT-3.5-era models. It's good that Meta is taking steps to stabilize the situation."
"Getting early access to Llama 4 is great, but here is a key fact: however powerful a model is, its real-world performance depends on how it is deployed.
What you measure in the lab is not what users experience in practice. The gap between the hype and actual behavior is what really needs to be closed.
Given that many of the runtime environments are open source, maybe next time you could make sure those fixes are in place before release, to avoid this kind of mess? 'You're using it wrong' is not a good look."
Some netizens also questioned Meta's rankings: "'Mixed quality'?? In every benchmark I've seen, Llama 4 performs terribly, unless you're pointing to LMSYS's 1417 Elo result.
Which API did you provide to LMSYS? Because the version currently listed on LMSYS also performs very poorly.
Llama 4 is just trash; you messed this up badly. Rather than misleading everyone, it would be better to admit the mistake. I'm not sure whether the claims about training on the test set are true, but given its high benchmark scores and poor real-world performance, it seems quite likely."
"Meta's Llama 4 Maverick ranks first for programming on the Chatbot Arena large language model leaderboard.
Yet it fails almost every hard, or even moderately difficult, programming prompt I give it. For coding, it is far worse than DeepSeek V3-0324, Claude 3.5/3.7 Sonnet, or Gemini 2.0 Pro."
This netizen, too, is questioning Meta's leaderboard rankings.
In fact, the timing of the release alone suggests Meta was not ready this time. As one of the pioneers of open-source ChatGPT-style models, Meta launched a flagship open-source model like Llama 4 on a Saturday night in the United States (3 a.m. Sunday in China), which is highly unusual.
Previous Llama releases usually landed around 10 a.m. US time on a Tuesday or Wednesday. Releasing Llama 4 this way suggests a lack of confidence.
The rise of DeepSeek has put enormous pressure on Meta, costing it users and reputation, and the company urgently needs a heavyweight product to turn things around. When DeepSeek went viral around Chinese New Year this year, Meta even set up dedicated "war rooms" to study its models. Judging from the final result, though, the outcome is still far from ideal.
On top of that, Meta's stock has taken a heavy hit from the tariff war, and the company needed good news to lift its share price. That plan has now backfired.
The material in this article comes from Meta and public internet sources. If any of it infringes your rights, please contact us and we will remove it.