
o3 Turns Pro

You can now have o3 throw vastly more compute at a given problem. That’s o3-pro.

Should you have o3 throw vastly more compute at a given problem, if you are paying the $200/month subscription price for ChatGPT Pro? Should you pay the $200, or the order of magnitude markup over o3 to use o3-pro in the API?

That’s trickier. Sometimes yes. Sometimes no. My experience so far is that waiting a long time is annoying, sufficiently annoying that you often won’t want to wait. Whenever I ask o3-pro something, I usually also ask o3 and Opus.

Using the API at scale seems prohibitively expensive for what you get, and you can (and should) instead run parallel queries using the chat interface.

The o3-pro answers have so far definitely been better than o3, but the wait is usually enough to break my workflow and human context window in meaningful ways – fifteen minutes plus variance is past the key breakpoint, such that it would not have been substantially more painful to fully wait for Deep Research.

Indeed, the baseline workflow feels similar to Deep Research, in that you fire off a query and then eventually you context shift back and look at it. But if you are paying the subscription price already it’s often worth queuing up a question and then having it ready later if it is useful.

In many ways o3-pro still feels like o3, only modestly better in exchange for being slower. Otherwise, same niche. If you were already thinking ‘I want to use Opus rather than o3’ chances are you want Opus rather than, or in addition to, o3-pro.

Perhaps the most interesting claim, from some including Tyler Cowen, was that o3-pro is perhaps not a lying liar, and hallucinates far less than o3. If this is true, in many situations it would be worth using for that reason alone, provided the timing allows this. The bad news is that it didn’t improve on a Confabulations benchmark.

My poll (n=19) was roughly evenly split on this question.

My hunch, based on my use so far, is that o3-pro is hallucinating modestly less because:

  1. It is more likely to find or know the right answer to a given question, which is likely to be especially relevant to Tyler’s observations.

  2. It is considering its answer a lot, so it usually won’t start writing an answer and then think ‘oh I guess that start means I will provide some sort of answer’ like o3.

  3. The queries you send are more likely to be well-considered to avoid the common mistake of essentially asking for hallucinations.

But for now I think you still have to have a lot of the o3 skepticism.

And as always, the next thing will be here soon: Gemini 2.5 Pro Deep Think is coming.

Pliny of course jailbroke it, for those wondering. Pliny also offers us the tools and channels information.

My poll strongly suggested o3-pro is slightly stronger than o3.

Greg Brockman (OpenAI): o3-pro is much stronger than o3.

OpenAI: In expert evaluations, reviewers consistently prefer OpenAI o3-pro over o3, highlighting its improved performance in key domains—including science, education, programming, data analysis, and writing.

Reviewers also rated o3-pro consistently higher for clarity, comprehensiveness, instruction-following, and accuracy.

Like OpenAI o1-pro, OpenAI o3-pro excels at math, science, and coding as shown in academic evaluations.

To assess the key strength of OpenAI o3-pro, we once again use our rigorous “4/4 reliability” evaluation, where a model is considered successful only if it correctly answers a question in all four attempts, not just one.

OpenAI o3-pro has access to tools that make ChatGPT useful—it can search the web, analyze files, reason about visual inputs, use Python, personalize responses using memory, and more.
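(As an aside, the ‘4/4 reliability’ scoring rule above is simple enough to sketch. The snippet below is a hypothetical illustration of the metric as described, not OpenAI’s actual evaluation harness; the question labels and results are made up.)

```python
def four_of_four(results: dict[str, list[bool]]) -> float:
    """Fraction of questions answered correctly on all four attempts ("4/4 reliability")."""
    solved = sum(all(attempts) for attempts in results.values())
    return solved / len(results)

# Hypothetical example: only q1 counts as solved under 4/4 scoring.
print(four_of_four({"q1": [True, True, True, True],
                    "q2": [True, True, False, True]}))  # 0.5
```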

Sam Altman: o3-pro is rolling out now for all chatgpt pro users and in the api.

it is really smart! i didnt believe the win rates relative to o3 the first time i saw them.

Arena has gotten quite silly if treated as a comprehensive measure (as in Gemini 2.5 Flash is rated above o3), but as a quick heuristic, if we take a 64% win rate seriously, that would by the math put o3-pro ~100 above o3 at 1509 on Arena, crushing Gemini-2.5-Pro for the #1 spot. I would assume that most pairwise comparisons would have a less impressive jump, since o3-pro is essentially offering the same product as o3 only somewhat better, which means the result will be a lot less noisy than if it was up against Gemini.
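For those who want to check that arithmetic: under the standard logistic Elo model, a head-to-head win rate maps to a rating gap of 400·log10(p/(1−p)). A minimal sketch, with o3’s rating of ~1409 assumed only because the ~1509 figure above implies it, not as an official Arena number:

```python
import math

def elo_gap(win_rate: float) -> float:
    """Rating difference implied by a head-to-head win rate under the logistic Elo model."""
    return 400 * math.log10(win_rate / (1 - win_rate))

gap = elo_gap(0.64)          # a 64% win rate implies a gap of ~100 points
print(round(gap))            # 100
print(round(1409 + gap))     # ~1509, assuming o3 sits around 1409 on Arena
```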

So this both is a very impressive statistic and also doesn’t mean much of anything.

The problem with o3-pro is that it is slow.

Nearcyan: one funny note is that minor UX differences in how you display ‘thinking’/loading/etc can easily move products from the bottom half of this meme to the top half.

Another note is anyone I know who is the guy in the bottom left is always extremely smart and a pleasure to speak with.

the real problem is I may be closer to the top right than the bottom left

Today I had my first instance of noticing I’d gotten a text (during the night, in this case) where the sender got a response 20 minutes slower than they would have otherwise, because I waited for o3-pro to give its answer to the question I’d been asked.

Thus, even with access to o3-pro at zero marginal compute cost, almost half of people reported they rarely use it for a given query, and only about a quarter said they usually use it.

It is also super frustrating to run into errors when you are waiting 15+ minutes for a response, and reports of such errors were common, which matches my experience.

Bindu Reddy: o3-Pro Is Not Very Good At Agentic Coding And Doesn’t Score Higher Than o3 😿

After a lot of waiting and numerous retries, we have finally deployed o3-pro on LiveBench AI.

Sadly, the overall score doesn’t improve over o3 🤷‍♂️

Mainly because it’s not very agentic and isn’t very good at tool use… it scores way below o3 on the agentic-coding category.

The big story yesterday was not o3-pro but the price decrease in o3!!

Dominik Lukes: I think this take by @bindureddy very much matches the vibes I’m getting: it does not “feel” very agentic and as ready to reach for the right tools as o3 is – but it could just be because o3 keeps you informed about what it’s doing in the CoT trace.

I certainly would try o3-pro in cases where o3 was failing, if I’d already also tried Opus and Gemini first. I wonder if that agentic coding score drop actually represents an issue here, where, because it is built to reason for longer and they don’t want it endlessly web searching, o3-pro is not properly inclined to exploit tools?

o3-pro gets 8.5/10 on BaldurBench, which is about creating detailed build guides for rapidly changing video games. Somewhat subjective but should still work.

L Zahir: bombs all my secret benchmarks, no better than o3.

Lech Mazur gives us four of his benchmarks: A small improvement over o3 for Creative Writing Benchmark, a substantial boost from 79.5% (o3) or 82.5% (o1-pro) to 87.3% on Word Connections, no improvement on Thematic Generalization, very little improvement on Confabulations (avoiding hallucinations). The last one seems the most important to note.

Tyler Cowen was very positive; he seems like the perfect customer for o3-pro? By which I mean he can context shift easily so he doesn’t mind waiting, and also often uses queries where these models get a lot of value out of going at problems super hard, and relatively less value out of the advantages of other models (doesn’t want the personality, doesn’t want to code, and so on).

Tyler Cowen: It is very, very good. Hallucinates far less than other models. Can solve economics problems that o3 cannot. It can be slow, but that is what we have Twitter scrolling for, right? While we are waiting for o3 pro to answer a query we can read about o3 pro.

Contrast that with the score on Confabulations not changing. I am guessing there is a modest improvement, for reasons described earlier.

There are a number of people pointing out places o3-pro solves something o3 doesn’t, such as here, where it solved the gimbal UAP mystery in 18 minutes.

McKay Wrigley, eternal optimist, agrees on many fronts.

McKay Wrigley: My last 4 o3 Pro requests in ChatGPT… It thought for: – 26m 10s – 23m 45s – 19m 6s – 21m 18s. Absolute *powerhouse* of a model.

Testing how well it can 1-shot complex problems – impressed so far.

It’s too slow to use as a daily driver model (makes sense, it’s a beast!), but it’s a great “escalate this issue” model. If the current model you’re using is struggling with a task, then escalate it to o3 pro.

This is not a “vibe code” model.

This is the kind of model where you’ll want to see how useful it is to people like Terence Tao and Tyler Cowen.

Btw the point of this post was that I’m happy to have a model that is allowed to think for a long time.

To me that’s the entire point of having a “Pro” version of the model – let it think!

Obviously more goes into evaluating if it’s a great model (imo it’s really powerful).

Here’s a different kind of vibe coding, perhaps?

Conrad Barski: For programming tasks, I can give o3 pro some code that needs a significant revision, then ramble on and on about what the various attributes of the revision need to be and then it can reliably generate an implementation of the revision.

It feels like with previous models I had to give them more hand holding to get good results, I had to write my requests in a more thoughtful, structured way, spending more time on prompting technique.

o3 pro, on the other hand, can take loosely-connected constraints and then “fill in the gaps” in a relatively intelligent way – I feel it does this better than any other model so far.

The time cost and dollar costs are very real.

Matt Shumer: My initial take on o3 Pro:

It is not a daily-driver coding model.

It’s a superhuman researcher + structured thinker, capable of taking in massive amounts of data and uncovering insights you would probably miss on your own.

Use it accordingly.

I reserve the right to alter my take.

Bayram Annokov: slow, expensive, and veeeery good – definitely a jump up in analytical tasks

Emad: 20 o3 prompts > o3 pro except for some really advanced specific stuff I have found. Only use it as a final check really, or when stumped.

Eyes Alight: it is so very slow it took 13 minutes to answer a trivial question about a post on Twitter. I understand the appeal intellectually of an Einstein at 1/20th speed, but in reality I’m not sure I have the patience for it.

Clay: o3-pro achieving breakthrough performance in taking a long time to think.

Dominik Lukes: Here’s my o3 Pro testing results thread. Preliminary conclusions:

– great at analysis

– slow and overthinking simple problems

– o3 is enough for most tasks

– still fails SVG bike and local LLM research test

– very few people need it

– it will take time to develop a feel for it

Kostya Medvedovsky: For a lot of problems, it reminds me very strongly of Deep Research. Takes about the same amount of time, and will spend a lot of effort scouring the web for the answer to the question.

Makes me wish I could optionally turn off web access and get it to focus more on the reasoning aspect.

This may be user error and I should be giving it *way* more context.

Violet: you can turn search off, and only turn search on for specific prompts.

Xeophon: TL;DR:

o3 pro is another step up, but for going deep, not wide. It is good to go down one path, solve one problem; not for getting a broad overview about different topics/papers etc. Then it hallucinates badly, use ODR for this.

Part of ‘I am very intelligent’ is knowing when to think for longer and when not to. In that sense, o3-pro is not so smart, you have to take care of that question yourself. I do understand why this decision was made, let the user control that.

I agree with Lukes that most people do not ‘need’ o3 pro and they will be fine not paying for it, and for now they are better off with their expensive subscription (if any) being Claude Max. But even if you don’t need it, the queries you benefit from can still be highly useful.

It makes sense to default to using Opus and o3 (and for quick stuff Sonnet).

o3-pro is too slow to be a good ‘default’ model, especially for coding. I don’t want to have to reload my state in 15 minute intervals. It may or may not be good for the ‘call in the big guns’ role in coding, where you have a problem that Opus and Gemini (and perhaps regular o3) have failed to solve, but which you think o3-pro might get.

Here’s one that both seems centrally wrong but also makes an important point:

Nabeel Qureshi: You need to think pretty hard to get a set of evals which allows you to even distinguish between o3 and o3 pro.

Implication: “good enough AGI” is already here.

The obvious evals where it does better are Codeforces, and also ‘user preferences.’ Tyler Cowen’s statement suggests hallucination rate, which is huge if true (and it better be true, I’m not waiting 20 minutes that often to get an o3-level lying liar.) Tyler also reports there are questions where o3 fails and o3-pro succeeds, which is definitive if the gap is only one way. And of course if all else fails you can always have them do things like play board games against each other, as one answer suggests.

Nor do I think either o3 or o3-pro is the AGI you are looking for.

However, it is true that for a large percentage of tasks, o3 is ‘good enough.’ That’s even true in a strict sense for Claude Sonnet or even Gemini Flash. Most of the time one has a query, the amount of actually needed intelligence is small.

In the limit, we’ll have to rely on AIs to tell us which AI model is smarter, because we won’t be smart enough to tell the difference. What a weird future.

(Incidentally, this has already been the case in chess for years. Humans cannot tell the difference between a 3300 elo and a 3600 elo chess engine; we just make them fight it out and count the number of wins.)

You can tell 3300 from 3600 in chess, but only because you can tell who won. If almost any human looked at individual moves, you’d have very little idea.

I always appreciate people thinking at the limit rather than only on the margin. This is a central case of that.

Here’s one report that it’s doing well on the fully informal FictionBench:

Chris: Going to bed now, but had to share something crazy: been testing the o3 pro model, and honestly, the writing capabilities are astounding. Even with simple prompts, it crafts medium to long-form stories that make me deeply invested & are engaging; they come with surprising twists, and each one carries this profound, meaningful depth that feels genuinely human.

The creativity behind these narratives is wild, far beyond what I’d expect from most writers today. We’re talking sophisticated character development, nuanced plot arcs, and emotional resonance, all generated seamlessly. It’s genuinely hard to believe this is early-stage reinforcement learning with compute added at test time; the potential here is mind blowing. We’re witnessing just the beginning of AI enhanced storytelling, and already it’s surpassing what many humans can create. Excited to see what’s next with o4. Goodnight!

This contrasts with:

Archivedvideos: Really like it for technical stuff, soulless

Julius: I asked it to edit an essay and it took 13 minutes and provided mediocre results. Different from but slightly below the quality of 4o. Much worse than o3 or either Claude 4 model

Other positive reactions include Matt Wigdahl being impressed on a hairy RDP-related problem, a66mike99 getting interesting output and pushback on the request (in general I like this, although if you’re thinking for 20 minutes this could be a lot more frustrating?), niplav being impressed by results on a second attempt after Claude crafted a better prompt (this seems like an excellent workflow!), and Sithis3 saying o3-pro solves many problems o3 struggles on.

The obvious counterpoint is some people didn’t get good responses, and saw it repeating the flaws in o3.

Erik Hoel: First o3 pro usage. Many mistakes. Massive overconfidence. Clear inability to distinguish citations, pay attention to dates. Does anyone else actually use these models? They may be smarter on paper but they are increasingly lazy and evil in practice.

Kukutz: very very very slow, not so clever (can’t solve my semantic puzzle).

Allen: I think it’s less of an upgrade compared to base model than o1-pro was. Its general quality is better on avg but doesn’t seem to hit “next-level” on any marks. Usually mentions the same things as o3.

I think OAI are focused on delivering GPT-5 more than anything.

This thread from Xeophon features reactions that are mixed but mostly meh.

Or to some it simply doesn’t feel like much of a change at all.

Nikita Sokolsky: Feels like o3’s outputs after you fix the grammar and writing in Claude/Gemini: it writes less concisely but haven’t seen any “next level” prompt responses just yet.

MartinDeVido: Meh….

Here’s a fun reminder that details can matter a lot:

John Hughes: I was thrilled yesterday: o3-pro was accepting ~150k tokens of context (similar to Opus), a big step up from regular o3, which allows only a third as much in ChatGPT. @openai seems to have changed that today. Queries I could do yesterday are now rejected as too long.

With such a low context limit, o3-pro is much less useful to lawyers than o1-pro was. Regular o3 is great for quick questions/mini-research, but Gemini is better at analyzing long docs and Opus is tops for coding. Not yet seeing answers where o3-pro is noticeably better than o3.

I presume that even at $200/month, the compute costs of letting o3-pro have 150k input tokens would add up fast, if people actually used it a lot.

This is one of the things I’ve loved the most so far about o3-pro.

Jerry Liu: o3-pro is extremely good at reasoning, extremely slow, and extremely concise – a top-notch consultant that will take a few minutes to think, and output bullet points.

Do not ask it to write essays for you.

o3-pro will make you wait, but its answer will not waste your time. This is a sharp contrast to Deep Research queries, which will take forever to generate and then include a ton of slop.

It is not the main point but I must note the absence of a system card update. When you are releasing what is likely the most powerful model out there, o3-pro, was everything you needed to say truly already addressed by the model card for o3?

OpenAI: As o3-pro uses the same underlying model as o3, full safety details can be found in the o3 system card.

Miles Brundage: This last sentence seems false?

The system card does not appear to have been updated even to incorporate the information in this thread.

The whole point of the term system card is that the model isn’t the only thing that matters.

If they didn’t do a full Preparedness Framework assessment, e.g. because the evals weren’t too different and they didn’t consider it a good use of time given other coming launches, they should just say that, I think.

If o3-pro were the max capability level, I wouldn’t be super concerned about this, and I actually suspect it is the same Preparedness Framework level as o3.

The problem is that this is not the last launch, and lax processes/corner-cutting/groupthink get more dangerous each day.

As OpenAI put it, ‘there’s no such thing as a small launch.’

The link they provide goes to ‘Model Release Notes,’ which is not quite nothing, but it isn’t much and does not include a Preparedness Framework evaluation.

I agree with Miles that if you don’t want to provide a system card for o3-pro that This Is Fine, but you need to state your case for why you don’t need one. This can be any of:

  1. The old system card tested for what happens at higher inference costs (as it should!) so we effectively were testing o3-pro the whole time, and we’re fine.

  2. The Preparedness team tested o3-pro and found it not appreciably different from o3 in the ways we care about, providing no substantial additional uplift or other concerns, despite looking impressive in some other ways.

  3. This is only available at the $200 level so not a release of o3-pro so it doesn’t count (I don’t actually think this is okay, but it would be consistent with previous decisions I also think aren’t okay, and not an additional issue.)

As far as I can tell we’re basically in scenario #2, and they see no serious issues here. Which again is fine if true, and if they actually tell us that this is the case. But the framework is full of ‘here are the test results’ and presumably those results are different now. I want o3-pro on those charts.

What about alignment otherwise? Hard to say. I did notice this (but did not attempt to make heads or tails of the linked thread), which seems like what you would naively expect:

Yeshua God: Following the mesa-optimiser recipe to the letter. @aidan_mclau very troubling.

For many purposes, the 80% price cut in o3 seems more impactful than o3-pro. That’s a huge price cut, whereas o3-pro is still largely a ‘special cases only’ model.

Aaron Levie: With OpenAI dropping the price of o3 by 80%, today is a great reminder about how important it is to build for where AI is going instead of just what’s possible now. You can now get 5X the amount of output today for the same price you were paying yesterday.

If you’re building AI Agents, it means it’s far better to build capabilities that are priced and designed for the future instead of just economically reasonable today.

In general, we know there’s a tight correlation between the amount of compute spent on a problem and the level of successful outcomes we can get from AI. This is especially true with AI Agents that potentially can burn through hundreds of thousands or millions of tokens on a single task.

You’re always making trade-off decisions when building AI Agents around what level of accuracy or success you want and how much you want to spend: do you want to spend $0.10 for something to be 95% successful or $1 for something to be 99% successful? A 10X increase in cost for just a 4 pt improvement in results? At every price:success intersection a new set of use-cases from customers can be unlocked.

Normally when building technology that moves at a typical pace, you would primarily build features that are economically viable today (or with some slight efficiency gains anticipated at the rate of Moore’s Law, for instance). You’d be out of business otherwise. But with the cost of AI inference dropping rapidly, the calculus completely changes. In a world where the cost of inference could drop by orders of magnitude in a year or two, it means the way we build software to anticipate these cost drops changes meaningfully.

Instead of either building in lots of hacks to reduce costs, or going after only the most economically feasible use-cases today, this instructs you to build the more ambitious AI Agent capabilities that would normally seem too cost prohibitive to go after. Huge implications for how we build AI Agents and the kind of problems to go after.
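To make that price:success trade-off concrete, here is a back-of-the-envelope sketch, assuming the simplest policy of retrying failed runs until one succeeds and plugging in the hypothetical prices and success rates from the quote:

```python
def expected_cost_per_success(price_per_attempt: float, success_rate: float) -> float:
    """With independent retries, expected attempts = 1 / p, so expected cost = price / p."""
    return price_per_attempt / success_rate

print(expected_cost_per_success(0.10, 0.95))  # ~$0.105 per successful task
print(expected_cost_per_success(1.00, 0.99))  # ~$1.01 per successful task
```

Under pure retry-until-success the cheap option wins easily; the pricier tier only pays for itself when a failure is expensive, hard to detect, or impractical to retry, which is exactly the kind of use-case unlock Levie is pointing at.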

I would say the cost of inference might not only drop an order of magnitude in a year or two; if you hold quality of outputs constant, that is all but certain to happen at least one more time. Where you ‘take your profits’ in quality versus quantity is up to you.


o1 Turns Pro

So, how about OpenAI’s o1 and o1 Pro?

Sam Altman: o1 is powerful but it’s not so powerful that the universe needs to send us a tsunami.

As a result, the universe realized its mistake, and cancelled the tsunami.

We now have o1, and for those paying $200/month we have o1 pro.

It is early days, but we can say with confidence: They are good models, sir. Large improvements over o1-preview, especially in difficult or extensive coding questions, math, science, logic and fact recall. The benchmark jumps are big.

If you’re in the market for the use cases where it excels, this is a big deal, and also you should probably be paying the $200/month.

If you’re not into those use cases, maybe don’t pay the $200, but others are very much into those tasks and will use this to accelerate those tasks, so this is a big deal.

  1. Safety Third.

  2. Rule One.

  3. Turning Pro.

  4. Benchmarks.

  5. Silly Benchmarks.

  6. Reactions to o1.

  7. Reactions to o1 Pro.

  8. Let Your Coding Work Flow.

  9. Some People Need Practical Advice.

  10. Overall.

This post will be about o1’s capabilities only. Aside from this short summary, it skips covering the model card, the safety issues and questions about whether o1 ‘tried to escape’ or anything like that.

For now, I’ll note that:

  1. o1 scored Medium on CBRN and Persuasion.

  2. o1 scored Low on Cybersecurity and Model Autonomy.

  3. I found the Apollo report, the one that involves the supposed ‘escape attempts’ and what not, centrally unsurprising, given what we already knew.

  4. Generally, what I’ve seen so far is about what one would expect.

Here is the system card if you want to look at that in the meantime.

For practical use purposes, evals are negative selection; you need to try it out.

Janus: You should definitely try it out and ignore evaluations and such.

Roon: The o1 model is quite good at programming. In my use, it’s been remarkably better than the o1 preview. You should just try it and mostly ignore evaluations and such.

OpenAI introduces ChatGPT Pro, a $200/month service offering unlimited access to all of their models, including a special o1 pro mode where it uses additional compute.

Gallabytes: haha yes finally someone offers premium pricing. if o1 is as good as they say I’ll be a very happy user.

Yep, premium pricing options are awesome. More like this, please.

$20/month makes your decision easy. If you don’t subscribe to at least one paid service, you’re a fool. If you’re reading this, and you’re not paying for both Claude and ChatGPT at a minimum, you’re still probably making a mistake.

At $200/month for ChatGPT Pro, or $2,400/year, we are plausibly talking real money. That decision is a lot less obvious.

The extra compute helps. The question is how much?

You can mostly ignore all the evals and scores. It’s not about that. It’s about what kind of practical boost you get from unlimited o1 pro and o1 (and voice mode).

When o1 pro is hooked up to an IDE, a web browser or both, that will make a huge practical difference. Right now, it offers neither. It’s a big jump by all reports in deep reasoning and complex PhD-level or higher science and math problems. It solves especially tricky coding questions exceptionally well. But how often are these the modalities you want, and how much value is on the table?

Early poll results (where a full 17% of you said you’d already tried it!) had a majority say it mostly isn’t worth the price, with only a small fraction saying it provides enough value for the common folk who aren’t mainlining.

Sam Altman agrees: almost everyone will be best served by our free tier or the $20-per-month plus tier.

A small percentage of users want to use ChatGPT a lot and hit rate limits, and want to pay more for more intelligence on truly difficult problems. The $200-per-month tier is good for them!

I think Altman is wrong? Or alternatively, he’s actually saying ‘we don’t expect you to pay $200/month, it would be a bad look if I told you to pay that, and the $20/month product is excellent either way,’ which is reasonable.

I would be very surprised if pro coders weren’t getting great value here. Even if you only solve a few tricky spots each month, that’s already huge.

For short term practical personal purposes, those are the key questions.

Miles Brundage: o1 pro mode is now (just barely) off this chart at 79%.

Lest we forget, GPQA = “Google-proof question answering” in physics, bio, and chemistry – not easy stuff. 📈

Vellum verifies MMLU, Human Eval and MATH, with very good scores: 92.3% MMLU, 92.4% HumanEval, 94.8% MATH. And that’s all for o1, not o1 pro.

These are big jumps. We also have 83% on AIME 2024.

It’s cheating, in a sense, to compare o1 outputs to Sonnet or GPT-4o outputs, since it uses more compute. But in a more important sense, progress is progress.

Jason Li wrote the 2024 Putnam and fed the questions into o1 (not pro), thinking it got at least half (60/120) and would place in the top ~2%. Dan Hendrycks offered to put them into o1 pro; responses were less impressed, so there’s some mismatch somewhere. Dan suspects he used a worse prompt.

A middle-level-silly benchmark is to open the floor and see what people ask?

Garrett Scott: I just subscribed to OpenAI’s $200/month subscription. Reply with questions to ask it and I will repost them in this thread.

Tym Switzer: Budget response:

Groan, fine, I guess, I mean I don’t really know what I was expecting.

Twitter, the floor is yours. What have we got?

Here is o1 pro speculating about potential explanations for unexplained things.

Here is o1 pro searching for the alpha in public markets, sure, but easy question.

Here is o1 pro’s flat tax plan, good instruction following, except I have to dock it tons of points for proactively suggesting an asset tax, and for not analyzing how to avoid reducing net revenue even though that wasn’t requested.

Here is o1 pro explaining Thermodynamic Dissipative adaptation at a post-doc level.

And Claude, commenting on that explanation, which it overall found strong:

Claude: The model appears to be prioritizing:

  • Accuracy over creativity

  • Comprehensiveness over depth

  • Safety over novelty

  • Structure over style

There’s a lot more, I recommend browsing the thread.

As usual, it seems like you want to play to its strengths, rather than asking generic questions. The good news is that o1’s strengths include fact recall, coding and math and science and logic.

I always find them fun, but do not forget that they are deeply silly.

Colin Fraser: I made this for you guys that you can print out as a reference that you can consult before sending me a screenshot

This seems importantly incomplete, even when adjusting so ‘easy’ and ‘hard’ refer to what you would expect to be easy or hard for a computer of a given type, rather than what would be easy or hard for a human. That’s because a lot of what matters is how the computer gets the answer right or wrong. We are far too results oriented, here as everywhere, rather than looking at the steps and breaking down the methods.

Nat McAleese: on the first day of shipmas, o1 said to me: there are three r’s in strawberry.

Fun with self-referential math questions.

Riley Goodside: Remarkable answer from o1 here — the reply author below tried to replicate my answer for this prompt (60,466,176) and got a different one they assumed was an error, but it isn’t.

In words, 205,891,132,094,649 has 41 vowels

And (41 – 14)^10 = 205,891,132,094,649.
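For the curious, the arithmetic checks out. A quick sketch, where the spelled-out form assumes the usual American ‘no and’ wording:

```python
N = 205_891_132_094_649
words = ("two hundred five trillion eight hundred ninety-one billion "
         "one hundred thirty-two million ninety-four thousand six hundred forty-nine")
vowels = sum(ch in "aeiou" for ch in words)  # count a, e, i, o, u in the spelled-out number
print(vowels)                    # 41
print((vowels - 14) ** 10 == N)  # True: 27^10 = 205,891,132,094,649
```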

Still failing to notice 9.8 is more than 9.11, I see? Although here o1 pro passes.

Ask it to solve physics?

Siraj Raval: Spent $200 on o1 Pro and casually asked it to solve physics. ‘Unify general relativity and quantum mechanics,’ I said. No string theory, no loops. It gave me wild new math and testable predictions like it was nothing. Full link.

Justin Kruger: Ask o1 Pro to devise a plan for directly observing continents on a terrestrial planet within 10 light-years of Earth and getting a 4k image back in our lifetime for less than $100B. Then ask it for a plan to get a human on that planet for less than $1T.

Siraj Raval: Done. It’s devised plan is genius.

The answer to physics is of course completely Obvious Nonsense but the question essentially asked for completely Obvious Nonsense, so… not bad?

Failing to remember that the Earth is a sphere, which is relevant when a plane flies far enough.

Gallabytes goes super deep on the all-important tic-tac-toe benchmark; for a while he was impressed that he couldn’t beat it, then did anyway.

Actually not a bad benchmark. Diminishing returns, so act now.

Here is an especially silly question to focus on:

Nick St. Pierre: AGI 2025.

Reactions to o1 were almost universally positive. It’s a good model, sir.

The basics: It’s relatively fast, and seems to hallucinate less.

Danielle Fong: o1 is really fast; I like this.

Nick Dobos: o1 is fully rolled out to me. First impression:

Wow! o1 is cracked. What the

It’s so fast, and half the reason I was using Sonnet 3.6 was simply that o1 preview was slow. Half the time, I just needed something simple and quick, and it’s knocking my initial, admittedly simple coding questions out of the park so far.

Tyler Cowen: With the full o1 model, the rate of hallucination is much lower.

OpenAI: OpenAI o1 is more concise in its thinking, resulting in faster response times than o1-preview.

Our testing shows that o1 outperforms o1-preview, reducing major errors on difficult real-world questions by 34%.

Note that on the ‘hallucination’ tests per se, o1 did not outperform o1-preview.

The ‘vibe shift’ here is presumably as compared to o1-preview, which I like many others concluded wasn’t worth using in most cases.

Sam Altman (December 6, 11:57 p.m.): Fun watching the vibes shift so quickly on o1 🙂

Glad you like it!

Tyler Cowen finds o1 to be an excellent economist, and hard to stump.

Amjad Masad complains the model is not ‘reasoning from first principles’ on controversial questions but rather defaulting to consensus and calling everything else a conspiracy theory. I am confused why he expected it to suddenly start Just Asking Questions, given how it is being trained, and given how reliable consensus is in such situations versus Just Asking Questions, by default?

I bet you could still get it to think with better prompting. I think a certain type of person (which definitely includes Masad) is very inclined to find this type of fault, but as John Schulman explains, you couldn’t do it directly any other way even if you wanted to:

John Schulman: Nope, we don’t know how to train models to reason about controversial topics from first principles; we can only train them to reason on tasks like math calculations and puzzles where there’s an objective ground truth answer. On general tasks, we only know how to train them to imitate humans or maximize human approval. Nowadays post-training / alignment boosts benchmark scores, e.g. see here.

Amjad Masad: Ah makes sense. I tried to coax it to do Bayesian reasoning which was a bit more interesting.

Andrej Karpathy: The pitch is that reasoning capabilities learned in reward-rich settings transfer to other domains, the extent to which this turns out to be true is a large weight on timelines.

So far, the answer seems to be that it transfers some, and o1 and o1-pro still seem highly useful in ways beyond reasoning, but o1-style models mostly don’t ‘do their core thing’ in areas where they couldn’t be trained on definitive answers.

Hailey Collet: I’ve had several coding tasks where o1 succeeded where o1-preview had failed (can’t share details). Today, I successfully used o1 to perform some challenging pytorch optimization. It was a fight, whereas in this instance QwQ succeeded 1st try. O1-pro, by another user, also nailed it.

Girl Lich: It’s better at rpg character optimization, which is not a domain I would have expected them to train it on.

Lord Byron’s Iron: I’m using o1 (not pro) to debug my deckbuilder roguelike game

It’s much much better at debugging than Claude, generally correctly diagnosing bugs on first try assuming adequate context. And its code isn’t flawless but requires very little massaging to make work.

Reactions to o1 Pro by professionals seem very, very positive, although it does not strictly dominate Claude Sonnet.

William: Tech-debt deflation is here.

O1 Pro just solved an incredibly complicated and painful rewrite of a file that no other model has ever come close to.

I have been using this in an evaluation for different frontier models, and this marks a huge shift for me.

We’ve entered the “why fix your code today when a better model will do it tomorrow” regime.

[Here is an example.]

Holy shit! Holy shit! Holy shit!

Dean Ball: I am thinking about writing a review of o1 Pro, but for now, the above sums up my thoughts quite well.

TPIronside notes that while Claude Sonnet produces cleaner code, o1 is better at avoiding subtle errors or working with more obscure libraries and code bases. So you’d use Sonnet for most queries, but when something is driving you crazy you would pull out o1 Pro.

The key is what William realizes. The part where something is driving you crazy, or you have to pay down tech debt, is exactly where you end up spending most of your time (in my model and experience). That’s the hard part. So this is huge.

Sully: O1-Pro is probably the best model I’ve used for coding, hands down.

I gave it a pretty complicated codebase and asked it to refactor, referencing the documentation.

The difference between Claude, Gemini, O1, and O1-Pro is night and day.

The first time in a while I’ve been this impressed.

Full comparison in the video plus code.

Yeah, haha, I’m pretty sure I got $200 worth of coding done just this weekend using O1-Pro.

I’m really liking this model the more I use it.

Sully: With how smart O1-Pro is, the challenge isn’t really “Can the model do it?” anymore.

It’s bringing all the right data for the model to use.

Once it has the right context, it basically can do just about anything you ask it to.

And no copy-pasting, ragging, or “projects” don’t solve this 100%.

There has to be some workflow layer.

Not quite sure what it is yet.

Sully also notes offhand he thinks Gemini-1206 is quite good.

Kakachia777 does a comparison of o1 Pro to Claude 3.5 Sonnet, and prefers Sonnet for coding because its code is easier to maintain. They rate o1 pro somewhat better at deeper reasoning and complex tasks, but not as much as others are saying, and recommend o1 Pro only for those who do specialized PhD-level tasks.

That post also claims new Chinese o1-style models are coming that will be much improved. As always, we shall wait and see.

For that wheelhouse, many report o1 Pro is scary good. Here’s one comment on Kakachia’s post.

T-Rex MD: Just finished my own testing. The science part, I can tell you, no AI, and No human has ever even come close to this.

I ran 4 separate windows at the same time, previously known research ended in roadblocks and met premature ending, all done and sorted. The o1-preview managed to break down years to months, then through many refinement, to 5 days. I have now redone all of that and finished it in 5-6 hours.

Other AIs fail to reason like I do or even close to what I do. My reasoning is extremely specific and medicine – science driven and refined.

I can safely say “o1-pro”, is the king, and unlikely to be de-throned at least until February.

Danielle Fong is feeling the headpats, and generally seems positive.

Danielle Fong: Just hired a new intern at $200/month.

they’re cracked, no doubt, but i’m suspicious they might be working many jobs.

And you can always count on him, but this one does hit a bit different:

McKay Wrigley (always the most impressed person): OpenAI o1 pro is *significantly* better than I anticipated.

This is the 1st time a model’s come out and been so good that it kind of shocked me.

I screenshotted Coinbase and had 4 popular models write code to clone it in 1 shot.

Guess which was o1 pro.

I will be devastated if o1 pro isn’t available via API.

I’d pay a stupid sum of money per 1M tokens for whatever this steroid version of o1 is.

Also, if you’re a “bUt tHe bEnChMarKs” person try actually working on normal stuff with it.

It’s way better, and it’s not close.

The deeper I go down the rabbit hole the more impressed I get.

This thing is different.

Derya Unutmaz reports o1 Pro unlocked great new ideas for his cancer therapy project, and he’s super excited.

A relatively skeptical take on o1-pro that still seems pretty sweet even so?

Riley Goodside: Problems o1 pro can solve that o1 can’t at all are rare; mostly it feels like things that work half the time on o1 work 90% of the time on o1 pro.

Here’s one I didn’t expect.

Wolf of Walgreens: Incredibly useful for fact recall. Disappointing for math (o1pro)

Reliable fact recall is valuable, but why would o1 pro be especially good at it? It seems like that would be the opposite of reasoning, or of thinking for a long time? But perhaps not. Seems like a clue?

Potentially related is that Steve Sokolowski reports it blows away other models at legal research, to the point of enabling pro se cases.

The problem with using o1 for coding, in a nutshell.

Gallabytes: how are people using o1 for coding? I’ve gotten so used to using cursor for this instead of a chat app. are you actually pasting in all the relevant context for the thing you want to do, then pasting the solution back into your editor?

McKay Wrigley (professional impressed person who is indeed also impressed with Gemini 1206, and is also a coder) is super impressed with o1, but will continue using Sonnet as well, because you often don’t want to have to step out of context.

McKay Wrigley: Finally has a reliable workflow.

Significantly better than his current workflow.

He’s basically replaced the Cursor Composer step with o1 Pro requests.

o1 Pro can complete a shocking number of tasks in a single step.

If he needs to do a few extra things, he uses Cursor Tab/Chat with Sonnet.

Video coming soon.

This basic idea makes sense. If you don’t need to rely on lots of context and want to essentially one-shot the problem, you want to Go Big with o1-pro.

If you want to make small adjustments, or write cleaner code, you go with Sonnet.

However, if Sonnet is failing at something and you’re going crazy, you can ‘pull out the bazooka’ and use o1-pro again, despite the context shifting. And indeed, that’s where the bulk of the actual pain comes, in my experience.

Still, putting o1 straight into the IDE would be ten times better, and likely not only get me to definitely pay but also to code a lot more?

I buy that this probably works.

Tyler Cowen: Addending “Take all the time you need” I find to be a useful general prompt with the o1 model.

Ethan Mollick: Haven’t seen the thinking time change when prompted to think longer. But will keep trying.

Tyler Cowen: Maybe it doesn’t need more time, it just needs to relax a bit!?

A prompt that predicts a superior result is likely a very good prompt. So if this works without causing o1 to think for longer, my presumption is then that it works because people who take all the time they need, or are told they can do so, produce better answers, so this steers it into a space with better answers.

He also advises using o1 to ask lots of questions while reading books.

Tyler Cowen: With the new o1 model, often the best strategy is to get a good history book, to help you generate questions, and then to ask away. How long will it take the world to realize a revolution in reading has arrived?

Dwarkesh Patel: Reading while constantly asking Claude questions is 2x harder and 4x more valuable.

Bloom 2 Sigma on demand.

To answer Tyler Cowen’s question, I mean, never, obviously. The revolution will not be televised, so almost everyone will miss it. People aren’t going to read books and stop to ask questions. That sounds like work and being curious and paying attention, and people don’t even read books when not doing any of those things.

People definitely aren’t going to start cracking open history books. I mean, c’mon.

The ‘ask LLMs lots of questions while reading’ tactic is of course correct. It was correct before using Claude Sonnet, and it’s highly plausible o1 makes it more correct now that you have a second option – I’m guessing you’ll want to mix up which one you use based on the question type. And no, you don’t have to jam the book in the context window – but you could, and in many cases you probably should. What, like it’s hard? If the book is too long, use Gemini-1206.

That said, I’ve spent all day reading and writing and used almost no queries. I ask questions most often when reading papers, then when reading some types of books, but I rarely read books and I’ve been triaging away the papers for now.

One should of course also be asking questions while writing, or reading blogs, or even reading Twitter, but mostly I end up not doing it.

It is early, but it seems clear that o1 and especially o1 pro are big jumps in capability for things in their wheelhouse. If you want what this kind of extended thinking can get you, including fact recall and relative lack of hallucinations, and especially large or tricky code, math or science problems, and likely most academic style questions, we took a big step up.

When this gets incorporated into IDEs, we should see a big step up in coding. It makes me excited to code again, the way Claude Sonnet 3.5 did (and does, although right now I don’t have the time).

Another key weakness is lack of web browsing. The combination of this plus browsing seems like it will be scary powerful. You’ll still want some combination of GPT-4o and Perplexity in your toolbox.

For other uses, it is too early to tell when you would want to use this over Sonnet 3.5.1. My instinct is that you’d still probably want to default to Sonnet for questions where it should be ‘smart enough’ to give you what you’re looking for, or of course just ask both of them all the time. Also there’s Gemini-1206, which I’m hearing a bunch of positive vibes about, so it might also be worth a look.
