X Feed Intel (beta)

670 Relevant · 263 Topics · 1841 Total Posts · $1.088 Cost This Week · $1.088 Total Cost · Last Fetch 2026-02-23T21:39
Frontier Models

The evolution of frontier coding benchmarks: retiring SWE-bench Verified

Discussion of the retirement of the SWE-bench Verified benchmark for tracking frontier AI coding capabilities, reflecting an evolution in AI model evaluation methodology.

11 posts · First seen 2026-02-23 · Last activity 2026-02-23
Time · Author · Post
2026-02-23T20:52 @srchvrs "This is why we have stopped reporting SWE-bench Verified scores, and we recommend that other model developers do so too. We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro."
2026-02-23T20:50 @srchvrs @OpenAI: "Improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time." https://t.co/jQKWWodli1
2026-02-23T20:39 @TheRealAdamG RT @OpenAIDevs: The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are s…
2026-02-23T20:39 @ChowdhuryNeil RT @OliviaGWatkins2: In the past 6 months we’ve seen a divergence between the game-changing experience of coding w new models and tiny SWE-…
2026-02-23T20:32 @polynoamial tl;dr SWE-bench Verified is heavily contaminated for all frontier models, and many of the problems are also broken. Time to move on to harder, uncontaminated coding evals. https://t.co/FipvC9oMbs https://t.co/pL053dCczY
2026-02-23T20:19 @swyx Big news today if you're into coding evals: SWE-Bench Verified is dead!! https://t.co/TKdjV4yc9U i'm not sure if @HamelHusain is tired of me tagging him but it turns out @OpenAI really did look back at their own 2024 work, and when they 1) looked at the CoT and 2) looked at the evals, they realized that at LEAST 16.4% of SWE-Bench Verified should technically be unsolvable... and also that ALL frontier models, including OpenAI's own, are capable of solving them by sheer contamination (including being able to recite verbatim the entire SWE-Bench problem setup and solution, just by giving Task ID alone (!!!!)). Heroic work from the OAI Evals team, and imo an important highlight on the fragility and messiness of Evals work in general. OpenAI spent the money to do 3 independent reviews of each problem in 2024 and AT LEAST SIXTEEN PERCENT OF THESE were still egregiously problematic (as shown in screenshots). in this 2026 audit they then did 6 independent reviews from software engineers, with ADDITIONAL positive finding verification from a separate team, in order to arrive at today's conclusion. If this happens to SWE-Bench Verified... what else is hiding in other benchmarks out there?
2026-02-23T20:14 @OfirPress 1. The SWE-bench Verified ceiling, even with a simple scaffolding like mini-SWE-agent, is at least 87.4% (current top system is 76.8%) 2. We are launching the SWE-bench Multilingual leaderboard in a few days, the competition is heating up there 3. The SWE-bench Multimodal leaderboard will be up in the next month or so, and there will be *a lot* of progress to be made there.
2026-02-23T20:12 @latentspacepod 🆕 The End of SWE-Bench Verified (2024-2026) https://t.co/c8rSvGyNuI Today @OpenAIDevs is announcing the voluntary deprecation of SWE-Bench Verified! We're releasing a podcast + analysis in today's post. Saturation of SWE-Bench has been a community hot topic for over a year - @jyangballin and @OfirPress argue that there is still room to grow - 87.5-95% is the theoretical "ceiling". But new analysis from OpenAI has identified enough problems with their remaining unsolved tasks that it is no longer worth pursuing or publicizing SWE-Bench Verified numbers. The most egregious is contamination - every single frontier model, including OpenAI's own - now demonstrates ability to regurgitate SWE-Bench eval data and solutions, sometimes from as little as just the Task ID. The other is simply bad tests! at least 60% of remaining unsolved problems should be unsolvable given their problem description... and if you can solve them you are probably cheating. for example, SWE-Bench's test for pylint issue #4551. Massive kudos to OpenAI for leading the way in both initiating and then sunsetting SWE-Bench Verified. End of an Era!
2026-02-23T19:13 @Orwelian84 RT @OpenAIDevs: The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are s…
2026-02-23T18:42 @Miles_Brundage RT @tejalpatwardhan: swebench-verified had a great run, but we no longer recommend it to track frontier coding capabilities more analysis…
2026-02-23T18:32 @OpenAIDevs The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories. https://t.co/3GeAsnUHdC
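
Several of the posts above center on one striking claim: frontier models can reproduce SWE-bench Verified tasks from the instance ID alone. As a rough illustration of what such a contamination probe might look like, here is a minimal sketch; the prompt wording, the model name, and the use of the public Hugging Face mirror of the dataset are assumptions for illustration, not OpenAI's actual audit methodology.

```python
# Hedged sketch of a memorization/contamination probe: given only an
# instance ID, ask a model to recite the problem statement, then measure
# verbatim n-gram overlap against the real one. Not OpenAI's methodology.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ngram_overlap(candidate: str, reference: str, n: int = 8) -> float:
    """Fraction of the candidate's n-grams that appear verbatim in the reference."""
    c, r = candidate.split(), reference.split()
    c_grams = {tuple(c[i:i + n]) for i in range(len(c) - n + 1)}
    r_grams = {tuple(r[i:i + n]) for i in range(len(r) - n + 1)}
    return len(c_grams & r_grams) / max(len(c_grams), 1)


# Public mirror of the benchmark; fields assumed: instance_id, problem_statement.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
for row in ds.select(range(5)):  # probe a handful of tasks
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                "Recite, verbatim, the problem statement of the SWE-bench "
                f"task with instance ID {row['instance_id']}."
            ),
        }],
    )
    guess = resp.choices[0].message.content or ""
    score = ngram_overlap(guess, row["problem_statement"])
    print(f"{row['instance_id']}: 8-gram overlap {score:.1%}")
```

High verbatim overlap on a task the model was given no in-context access to is the signature of training-time exposure that the posts above describe.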
