On June 10, OpenAI slashed the list price of its flagship reasoning model, o3, by roughly 80%, from $10 per million input tokens and $40 per million output tokens to $2 and $8, respectively. API resellers reacted immediately: Cursor now counts one o3 request the same as a GPT-4o call, and Windsurf lowered the “o3-reasoning” tier to a single credit as well. For Cursor users, that’s a ten-fold cost cut overnight.
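To put the new list prices in per-call terms, here's a quick back-of-the-envelope sketch. The per-million-token rates come from the announcement above; the 5k-input / 1k-output token counts are purely illustrative assumptions about a typical coding-assistant call.

```python
# Illustrative arithmetic only; the token counts per call are assumptions.
PRICES = {
    "o3_old": {"input": 10.00, "output": 40.00},  # $ per million tokens
    "o3_new": {"input": 2.00, "output": 8.00},
}

def call_cost(pricing: dict, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at the given per-million-token rates."""
    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

old = call_cost(PRICES["o3_old"], input_tokens=5_000, output_tokens=1_000)
new = call_cost(PRICES["o3_new"], input_tokens=5_000, output_tokens=1_000)
print(f"old: ${old:.4f}/call, new: ${new:.4f}/call, cut: {1 - new / old:.0%}")
# old: $0.0900/call, new: $0.0180/call, cut: 80%
```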
Latency improved in parallel. OpenAI hasn’t published new latency metrics, and third-party dashboards still show time to first token (TTFT) in the 15s to 20s range for long prompts, but thanks to fresh Nvidia GB200 clusters and a revamped scheduler that shards long prompts across more GPUs, o3 feels snappier in real use. It’s still slower than lightweight models, but no longer coffee-break slow.
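If you want to check TTFT on your own prompts rather than trust the dashboards, here is a minimal sketch using the OpenAI Python SDK’s streaming interface. The model name and prompt are placeholders, and a single timed call is only a rough stand-in for what dashboard services aggregate.

```python
# Minimal TTFT measurement sketch; assumes the openai Python SDK (v1+)
# and an OPENAI_API_KEY in the environment. Model name is a placeholder.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": "Summarize the CAP theorem in one paragraph."}],
    stream=True,
)

ttft = None
for chunk in stream:
    # Record the moment the first content token arrives.
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start
        break

print(f"time to first token: {ttft:.2f}s" if ttft else "no content received")
```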
Claude 4 is fast yet sloppy
Much of the community’s oxygen has gone to Claude 4. It’s undeniably quick, and its 200k context window feels luxurious. Yet in day-to-day coding, I, along with many Reddit and Discord posters, keep tripping over Claude’s action bias: it happily invents stubbed functions instead of real implementations, fakes unit tests, or rewrites mocks it was told to leave alone. The speed is great; the follow-through often isn’t.
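To make that failure mode concrete, here is a hypothetical illustration (not actual Claude output, and the function names are made up) of the kind of stub-plus-fake-test pattern people are complaining about:

```python
# Hypothetical example of "action bias": asked for a working CSV parser,
# the model returns a stub that type-checks but silently does nothing.
def parse_transactions(csv_path: str) -> list[dict]:
    """Parse a transactions CSV into a list of row dicts."""
    # TODO: implement parsing
    return []  # placeholder result that drops all data

# ...plus a "unit test" that passes no matter what the function returns.
def test_parse_transactions():
    assert parse_transactions("transactions.csv") is not None
```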