All over the world, the biggest players in LLM technology are coming up with new versions of their models at stunning speed.
But how do they stack up?
Analysts and testers (and others) keep devising brand-new assessments of these competing models, detailing their performance on everything from PhD-level questions to coding to various kinds of specialized tasks.
Yet some claim that most of this hard work makes little difference to the average end user. Let’s explore this a bit through the lens of one of my favorite podcasts.
Grok 3 and o3-mini: Some Short Observations
Two of the current contenders are OpenAI’s o3-mini model and Grok 3, the new version of xAI’s chatbot, which has its own reasoning capabilities and newly built-in functionality.
We can see charts of these models’ performance on GPQA, a graduate-level, Google-proof Q&A benchmark, and on AIME, a U.S. invitational math exam whose data dates back to 1983. Some team members at OpenAI claim that o3-mini is better across the board; others at xAI, not surprisingly, disagree.
And then there is a third opinion…
Coverage from The AI Daily Brief
On The AI Daily Brief, Nathaniel Whittemore covers these kinds of developments, starting with a quote from Matthew Lambert:
“Honestly, there are no industry benchmarks to trust. Just expect noise. It’s good: let the best models win. Make your evals anyway. AIME is virtually useless for 99% of people.”
Whittemore agrees.
“At this point, I am fully on the train that these benchmarks are completely saturated,” he says. “There is almost no meaningful signal in this … All the models are now at the high end of these things, and they just tell you almost nothing.”
He has this advice for people curious about comparative performance:
“If you are willing to take the time and resources to do it, then just try any kind of question, any kind of prompt, and any kind of challenge against all the state-of-the-art (systems) and see which one does best. Or, otherwise, you choose one, assume it will be as good as the state of the art, and it will be as good as the state of the art within two weeks, when they ship the latest update.”
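For readers who want to act on that suggestion, here is a minimal sketch of what “make your own eval” can look like in Python: run the same prompt against several models and compare the answers side by side. It assumes the OpenAI Python SDK and OpenAI-compatible endpoints; the model names, the example base_url and the prompt are placeholders to swap for whichever systems you actually want to test.

```python
# Minimal DIY eval sketch: run one prompt against several models and
# compare the outputs. Model names and the second endpoint are placeholders.
from openai import OpenAI

candidates = [
    ("model-a", OpenAI()),  # reads OPENAI_API_KEY from the environment
    ("model-b", OpenAI(base_url="https://example.com/v1", api_key="...")),
]

prompt = "Plan a one-week lesson sequence introducing high schoolers to statistics."

for name, client in candidates:
    response = client.chat.completions.create(
        model=name,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {name} ---")
    print(response.choices[0].message.content)
```

Swap in your own questions, prompts and challenges; the point, per the podcast, is that your own judgment on your own tasks beats a leaderboard.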
Anthropic’s Hybrid Model
Later in the podcast, Whittemore goes over the new Claude 3.7 Sonnet, which he calls a “hybrid” model built on both reasoning and extensive non-reasoning skills. Calling the innovation “a nudge ahead rather than a leap forward,” he acknowledges that SWE-bench performance and agentic tool use both move forward with this model.
User Reviews of the New Models
Next, let’s turn to a recent post from one of my favorite voices in AI, Ethan Mollick, on his blog One Useful Thing; it is also a piece Whittemore mentioned during the podcast.
Mollick has been experimenting with Claude 3.7 Sonnet and Grok 3, and has this to say, in general, about his observations:
“This new generation of AIs is smarter, and the jump in capabilities is striking, especially in the way these models handle complex tasks, math and code,” he writes. “These models often give me the same feeling I had when I first used ChatGPT-4, where I am equally impressed and slightly unnerved by what it can do. Take Claude’s coding ability: now I can get working programs through natural conversation or documents, no programming skills needed.”
After showing demonstrations of impressive interactive experiences built with the models, such as a time-travel simulation that is interactive, visual and multimodal, Mollick then discusses two scaling laws:
One is that larger models are more capable; or, as many have observed, we can scale up systems and make them work better. The second concerns test-time compute, which can also be called inference-time compute.
“OpenAI revealed that if you allow a model to spend more computing power working through a problem, it gets better results,” Mollick writes. “(It’s) kind of like giving a smart person a few extra minutes to solve a puzzle.”
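To make the test-time compute idea concrete, here is a toy sketch of one well-known technique, self-consistency: sample several answers to the same question and take a majority vote, so that more samples means more inference compute spent on the problem. This illustrates the general principle only, not how OpenAI’s reasoning models work internally; the model name is a placeholder.

```python
# Toy illustration of test-time compute via self-consistency:
# more samples = more compute spent on the same question.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def majority_vote_answer(question: str, n_samples: int = 5) -> str:
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="model-a",  # placeholder model name
            messages=[{"role": "user",
                       "content": question + " Answer with just the final result."}],
            temperature=0.8,  # sampling diversity, so the votes can differ
        )
        answers.append(response.choices[0].message.content.strip())
    # The most common answer wins; raising n_samples spends more compute
    # per problem and tends to improve accuracy on math-style questions.
    return Counter(answers).most_common(1)[0][0]
```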
Together, these two trends are dramatically increasing AI’s capabilities, and other advances add to them.
“The Gen3 generation offers the opportunity for a fundamental rethinking of what is possible,” he adds. “As the models get better, and as they apply more tricks such as reasoning and access to the internet, they hallucinate less (though they still make mistakes) and they are capable of higher-order ‘thinking.’”
So: less hallucination, better reasoning, more accuracy, more performance, and abilities edging closer to those of human PhDs. As Mollick writes: “Managers and leaders will have to update their beliefs about what AI can do, and how well it can do it, given these new models. Instead of assuming that they can only do low-level work, we will have to consider the ways it can serve as a genuine intellectual partner. These models can now address complex analytical tasks, creative work and even research-level problems with startling sophistication.”
There is also an interesting part of the post where Mollick mentions an idea he generated with the new models: a video game based on “Bartleby, the Scrivener” by Herman Melville. These are the kinds of projects that will turn our heads as we see what AI can do now.
Do Your Own Analysis
What I take away from all of the above thoughts about AI is that end users must do their own research and figure out what works best for them.
This makes sense, because we have something of a black box problem with LLMs. We do not know exactly how they arrive at their conclusions; we cannot read the activity of their digital neurons, obviously. There is also plenty of subjectivity involved. You can measure model results on test sets such as GPQA or AIME, but what about the everyday things end users actually want: a teacher planning a curriculum, an engineer who wants a clean git push, or a creative professional looking for material for a presentation?
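If you do want a number rather than a vibe, the benchmark idea scales down to personal use. Here is a minimal sketch of scoring a model against a tiny test set built from your own everyday tasks; the model name, the questions and the crude substring check are all placeholder assumptions, not a rigorous grading scheme.

```python
# Score a model on a tiny personal test set, GPQA/AIME-style in spirit
# but built from your own work. Model name and questions are placeholders.
from openai import OpenAI

client = OpenAI()

personal_evals = [  # (question, expected substring) pairs
    ("Which git command uploads local commits to a remote branch?", "git push"),
    ("What is 17 * 24? Reply with the number only.", "408"),
]

correct = 0
for question, expected in personal_evals:
    response = client.chat.completions.create(
        model="model-a",  # placeholder model name
        messages=[{"role": "user", "content": question}],
    )
    correct += expected in response.choices[0].message.content  # crude check

print(f"Score: {correct}/{len(personal_evals)}")
```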
Here, many of our assessments will be based on real-life examples of AI’s helpfulness, not on a comparative technical benchmark.