News

AI moves fast. Here's what we're reading this week.

Research

Open-source models close in on frontier on math benchmarks

DeepSeek-V3 and Qwen-3-Math are now within 4 percentage points of GPT-4 on MATH-500. The gap on coding tasks remains wider, but the trajectory has analysts revising 2026 forecasts.

Hugging Face, 1mo ago

Research

OpenAI's Operator agent gets 50%+ on real-world web tasks

OpenAI's browser-using agent product hit 53% on a curated set of real-world tasks in independent testing. The remaining failures cluster around payment forms and CAPTCHAs, both of which, notably, are by design.

MIT Tech Review, 1mo ago