Author name: Kelly Newman


Trump told SCOTUS he plans to make a deal to save TikTok

Several members of Congress—Senator Edward J. Markey (D-Mass.), Senator Rand Paul (R-Ky.), and Representative Ro Khanna (D-Calif.)—filed a brief agreeing that “the TikTok ban does not survive First Amendment scrutiny.” They agreed with TikTok that the law is “illegitimate.”

Lawmakers’ “principal justification” for the ban—“preventing covert content manipulation by the Chinese government”—masked a “desire” to control TikTok content, they said. Further, they said, that goal could be achieved by a less restrictive alternative, a stance TikTok has long argued for.

Attorney General Merrick Garland defended the Act, though, urging SCOTUS to stay focused on the narrow question of whether the law violates the First Amendment, given that a forced sale would seemingly allow the app to continue operating without impacting Americans’ free speech. If the court agrees that the law survives strict scrutiny, TikTok could still be facing an abrupt shutdown in January.

The Supreme Court has scheduled oral arguments to begin on January 10. TikTok and content creators who separately sued to block the law have asked for their arguments to be divided, so that the court can separately weigh “different perspectives” when deciding how to approach the First Amendment question.

In its own brief, TikTok has asked SCOTUS to strike the portions of the law singling out TikTok or “at the very least” explain to Congress that “it needed to do far better work either tailoring the Act’s restrictions or justifying why the only viable remedy was to prohibit Petitioners from operating TikTok.”

But that may not be necessary if Trump prevails. Trump told the court that TikTok was an important platform for his presidential campaign and that he should be the one to make the call on whether TikTok should remain in the US—not the Supreme Court.

“As the incoming Chief Executive, President Trump has a particularly powerful interest in and responsibility for those national-security and foreign-policy questions, and he is the right constitutional actor to resolve the dispute through political means,” Trump’s brief said.



o3, Oh My

OpenAI presented o3 on the Friday before Christmas, at the tail end of the 12 Days of Shipmas.

I was very much expecting the announcement to be something like a price drop. What better way to say ‘Merry Christmas,’ no?

They disagreed. Instead, we got this (here’s the announcement, in which Sam Altman says ‘they thought it would be fun’ to go from one frontier model to their next frontier model, yeah, that’s what I’m feeling, fun):

Greg Brockman (President of OpenAI): o3, our latest reasoning model, is a breakthrough, with a step function improvement on our most challenging benchmarks. We are starting safety testing and red teaming now.

Nat McAleese (OpenAI): o3 represents substantial progress in general-domain reasoning with reinforcement learning—excited that we were able to announce some results today! Here is a summary of what we shared about o3 in the livestream.

o1 was the first large reasoning model—as we outlined in the original “Learning to Reason” blog, it is “just” a LLM trained with reinforcement learning. o3 is powered by further scaling up reinforcement learning beyond o1, and the resulting model’s strength is very impressive.

First and foremost: We tested on recent, unseen programming competitions and found that the model would rank among some of the best competitive programmers in the world, with an estimated CodeForces rating of over 2,700.

This is a milestone (Codeforces rating better than Jakub Pachocki) that I thought was further away than December 2024; these competitions are difficult and highly competitive; the model is extraordinarily good.

Scores are impressive elsewhere, too. 87.7% on the GPQA diamond benchmark surpasses any LLM I am aware of externally (I believe the non-o1 state-of-the-art is Gemini Flash 2 at 62%?), as well as o1’s 78%. An unknown noise ceiling exists, so this may even underestimate o3’s scientific advancements over o1.

o3 can also perform software engineering, setting a new state of the art on SWE-bench, achieving 71.7%, a substantial improvement over o1.

With scores this strong, you might fear accidental contamination. Avoiding this is something OpenAI is obviously focused on; but thankfully, we also have some test sets that are strongly guaranteed to be uncontaminated: ARC and FrontierMath… What do we see there?

Well, on FrontierMath 2024-11-26, o3 improved the state of the art from 2% to 25% accuracy. These are extremely difficult, well-established, held-out math problems. And on ARC, the semi-private test set and public validation set scores are 87.5% (private) and 91.5% (public). [thread continues]

The models will only get better with time; and virtually no one (on a large scale) can still beat them at programming competitions or mathematics. Merry Christmas!

Zac Stein-Perlman has a summary post of the basic facts. Some good discussions in the comments.

Up front, I want to offer my sincere thanks for this public safety testing phase, and for putting that front and center in the announcement. You love to see it. See the last three minutes of that video, or the sections on safety later on.

  1. GPQA Has Fallen.

  2. Codeforces Has Fallen.

  3. Arc Has Kinda Fallen But For Now Only Kinda.

  4. They Trained on the Train Set.

  5. AIME Has Fallen.

  6. Frontier of Frontier Math Shifting Rapidly.

  7. FrontierMath 4: We’re Going To Need a Bigger Benchmark.

  8. What is o3 Under the Hood?

  9. Not So Fast!

  10. Deep Thought.

  11. Our Price Cheap.

  12. Has Software Engineering Fallen?

  13. Don’t Quit Your Day Job.

  14. Master of Your Domain.

  15. Safety Third.

  16. The Safety Testing Program.

  17. Safety testing in the reasoning era.

  18. How to apply.

  19. What Could Possibly Go Wrong?

  20. What Could Possibly Go Right?

  21. Send in the Skeptic.

  22. This is Almost Certainly Not AGI.

  23. Does This Mean the Future is Open Models?.

  24. Not Priced In.

  25. Our Media is Failing Us.

  26. Not Covered Here: Deliberative Alignment.

  27. The Lighter Side.

Deedy: OpenAI o3 is 2727 on Codeforces which is equivalent to the #175 best human competitive coder on the planet.

This is an absolutely superhuman result for AI and technology at large.

The median IOI Gold medalist, the top international programming contest for high schoolers, has a rating of 2469.

That’s how incredible this result is.
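For a sense of scale on that 258-point gap, here is the standard Elo expected-score formula; Codeforces ratings are Elo-like, so treating them as plain Elo is a simplifying assumption on my part, a sketch rather than the exact Codeforces math.

```python
# Standard Elo expected score: rough probability that player A outperforms
# player B. Codeforces ratings are Elo-like, so this is an approximation.
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# o3's reported 2727 vs the median IOI gold medalist's 2469:
print(f"{elo_expected_score(2727, 2469):.0%}")  # ~82%
```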

In the presentation, Altman jokingly mentions that one person at OpenAI is a competition programmer who is 3000+ on Codeforces, so ‘they have a few more months’ to enjoy their superiority. Except, he’s obviously not joking. Gulp.

o3 shows dramatically improved performance on the ARC-AGI challenge.

Francois Chollet offers his thoughts, full version here.

Arc Prize: New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.

And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.

This performance on ARC-AGI highlights a genuine breakthrough in novelty adaptation.

This is not incremental progress. We’re in new territory.

Is it AGI? o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

hero: o3’s secret? the “I will give you $1k if you complete this task correctly” prompt but you actually send it the money.

Rohit: It’s actually Sam in the back end with his venmo.

Is there a catch?

There’s at least one big catch, which is that they vastly exceeded the compute limit for what counts as a full win for the ARC challenge. Those yellow dots represent quite a lot more money spent; o3 high is spending thousands of dollars.

It is worth noting that $0.10 per problem is a lot cheaper than human level.

Ajeya Cotra: I think a generalist AI system (not fine-tuned on ARC AGI style problems) may have to be pretty *superhuman* to solve them at $0.10 per problem; humans have to run a giant (1e15 FLOP/s) brain, probably for minutes on the more complex problems.
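To make that comparison concrete, here is a back-of-the-envelope sketch. The brain figure is Cotra’s; the minutes-per-problem number is my own illustrative assumption, not a measured value.

```python
# Back-of-the-envelope for Cotra's point: how much raw compute does a human
# "spend" per ARC problem? The 1e15 FLOP/s brain figure is hers; the minutes
# per problem is an illustrative assumption.
brain_flops_per_sec = 1e15
seconds_per_problem = 5 * 60            # assume ~5 minutes on a harder task

human_flop_per_problem = brain_flops_per_sec * seconds_per_problem
print(f"Human compute per problem: {human_flop_per_problem:.1e} FLOP")  # ~3e17 FLOP
# Whatever inference compute $0.10 buys, it is plausibly orders of magnitude
# less than ~3e17 FLOP, which is why matching humans at that price point
# would be "pretty superhuman" in compute-efficiency terms.
```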

Beyond that, is there another catch? That’s a matter of some debate.

Even with catches, the improvements are rather mind-blowing.

President of the ARC Prize Greg Kamradt verified the result.

Greg Kamradt: We verified the o3 results for OpenAI on @arcprize.

My first thought when I saw the prompt they used to claim their score was…

“That’s it?”

It was refreshing (impressive) to see the prompt be so simple:

“Find the common rule that maps an input grid to an output grid.”

Brandon McKinzie (OpenAI): to anyone wondering if the high ARC-AGI score is due to how we prompt the model: nah. I wrote down a prompt format that I thought looked clean and then we used it…that’s the full story.

Pliny the Liberator: can I try?

For fun, here are the 34 problems o3 got wrong. It’s a cool problem set.

And this is quite a lot of progress.

It is not, however, a direct harbinger of AGI; one does not want to overreact.

Noam Brown (OpenAI): I think people are overindexing on the @OpenAI o3 ARC-AGI results. There’s a long history in AI of people holding up a benchmark as requiring superintelligence, the benchmark being beaten, and people being underwhelmed with the model that beat it.

To be clear, @fchollet and @mikeknoop were always very clear that beating ARC-AGI wouldn’t imply AGI or superintelligence, but it seems some people assumed that anyway.

Here is Melanie Mitchell giving an overview that seems quite good.

Except, oh no!

How dare they!

Arc Prize: Note on “tuned”: OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more detail. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

Niels Rogge: By training on 75% of the training set.

Gary Marcus: Wow. This, if true, raises serious questions about yesterday’s announcement.

Roon: oh shit oh fuck they trained on the train set it’s all over now

Also important to note that 75% of the train set is like 200-300 examples.

🚨SCANDAL 🚨

OpenAI trained on the train set for the Millennium Puzzles.

Johan: Given that it scores 30% on ARC AGI 2, it’s clear there was no improvement in fluid reasoning and the only gain was due to the previous model not being trained on ARC.

Roon: well the other benchmarks show improvements in reasoning across the board

but regardless, this mostly reveals that its real performance on ARC AGI 2 is much higher

Rythm Garg: also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint

Emmett Shear: Was anyone on the team aware of and thinking about arc and arc-like problems as a domain to improve at when you were designing and training o3? (The distinction between succeeding as a random side effect and succeeding with intention)

Rythm Garg: no, the team wasn’t thinking about arc when training o3; people internally just see it as one of many other thoughtfully-designed evals that are useful for monitoring real progress

Or:

Gary Marcus doubled down on ‘the true AGI would not need to train on the train set.’

Previous SotA on ARC involved training not only on the train set, but on a much larger synthetic training set. ARC was designed so the AI wouldn’t need to train for it, but it turns out ‘test that you can’t train for’ is a super hard trick to pull off. This was an excellent try and it still didn’t work.

If anything, o3’s using only 300 training set problems, and using a very simple instruction, seems to be to its credit here.

The true ASI might not need to do it, but why wouldn’t you train on the train set as a matter of course, even if you didn’t intend to test on ARC? That’s good data. And yes, humans will reliably do some version of ‘train on at least some of the train set’ if they want to do well on tasks.

Is it true we will be a lot better off if we have AIs that can one-shot problems that are out of their training distributions, where they truly haven’t seen anything that resembles the problem? Well, sure. That would be more impressive.

The real objection here, as I understand it, is the claim that OpenAI presented these results as more impressive than they are.

The other objection is that this required quite a lot of compute.

That is a practical problem. If you’re paying $20 a shot to solve ARC problems, or even $1m+ for the whole test at the high end, pretty soon you are talking real money.

It also raises further questions. What about ARC is taking so much compute? At heart these problems are very simple. The logic required should, one would hope, be simple.

Mike Bober-Irizar: Why do pre-o3 LLMs struggle with generalization tasks like @arcprize? It’s not what you might think.

OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole.

LLMs do dramatically worse at ARC tasks the bigger the tasks get. Humans, however, have no such issue – ARC task difficulty is independent of size.

Most ARC tasks contain around 512-2048 pixels, and o3 is the first model capable of operating on these text grids reliably.

So even if a model is capable of the reasoning and generalization required, it can still fail just because it can’t handle this many tokens.

When testing o1-mini on an enlarged version of ARC, we observe an 80% drop in solved tasks – even if the solutions are the same.

When models can’t understand the task format, the benchmark can mislead, introducing a hidden threshold effect.

And if there’s always a larger version that humans can solve but an LLM can’t, what does this say about scaling to AGI?

The implication is that o3’s ability to handle the size of the grids might be producing a large threshold effect. Perhaps most of why o3 does so well is that it can hold the presented problem ‘in its head’ at once. That wouldn’t be as big a general leap.
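To see why grid size alone can be a bottleneck, here is a rough sketch of the arithmetic; the tokens-per-cell figure is an assumption for illustration, since real tokenizers vary.

```python
# Rough sketch: how many tokens does a serialized ARC grid cost an LLM?
# The tokens-per-cell constant is an assumption for illustration only.
def approx_grid_tokens(rows: int, cols: int, tokens_per_cell: float = 1.5) -> int:
    """Assume each cell (a digit plus separator/newline overhead) costs ~1-2 tokens."""
    return round(rows * cols * tokens_per_cell)

# A 32x32 grid (1,024 cells) is already ~1,500 tokens; a full task with several
# demonstration input/output pairs plus the test input can run well past 10k.
print(approx_grid_tokens(32, 32))   # ~1536
print(approx_grid_tokens(45, 45))   # ~3038
```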

Roon: arc is hard due to perception rather than reasoning -> seems clear and shut

I remember when AIME problems were hard.

This one is not a surprise. It did definitely happen.

AIME hasn’t quite fully fallen, in the sense that this does not solve AIME cheap. But it does solve AIME.

Back in the before times on November 8, Epoch AI launched FrontierMath, a new benchmark designed to fix the saturation on existing math benchmarks, eliciting quotes like this one:

Terence Tao (Fields Medalist): These are extremely challenging… I think they will resist AIs for several years at least.

Timothy Gowers (Fields Medalist): Getting even one question right would be well beyond what we can do now, let alone saturating them.

Evan Chen (IMO Coach): These are genuinely hard problems… most of them look well above my pay grade.

At the time, no model solved more than 2% of these questions. And then there’s o3.

Noam Brown: This is the result I’m most excited about. Even if LLMs are dumb in some ways, saturating evals like @EpochAIResearch’s Frontier Math would suggest AI is surpassing top human intelligence in certain domains. When that happens we may see a broad acceleration in scientific research.

This also means that AI safety topics like scalable oversight may soon stop being hypothetical. Research in these domains needs to be a priority for the field.

Tamay Besiroglu: I’m genuinely impressed by OpenAI’s 25.2% Pass@1 performance on FrontierMath—this marks a major leap from prior results and arrives about a year ahead of my median expectations.

For context, FrontierMath is a brutally difficult benchmark with problems that would stump many mathematicians. The easier problems are as hard as IMO/Putnam; the hardest ones approach research-level complexity.

With earlier models like o1-preview, Pass@1 performance (solving on first attempt) was only around 2%. When allowing 8 attempts per problem (Pass@8) and counting problems solved at least once, we saw ~6% performance. o3’s 25.2% at Pass@1 is substantially more impressive.

It’s important to note that while the average problem difficulty is extremely high, FrontierMath problems vary in difficulty. Roughly: 25% are Tier 1 (advanced IMO/Putnam level), 50% are Tier 2 (extremely challenging grad-level), and 25% are Tier 3 (research problems).

I previously predicted a 25% performance by Dec 31, 2025 (my median forecast with an 80% CI of 14–60%). o3 has reached it earlier than I’d have expected on average.
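For readers unfamiliar with the metric: Pass@k is usually estimated with the unbiased estimator from the HumanEval paper (Chen et al., 2021). Whether Epoch computes it exactly this way is an assumption on my part; the sketch below is only to show what the numbers mean.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples is correct, given n total samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example (numbers are illustrative, not Epoch's): with 16 samples per
# problem and 1 correct, Pass@1 ≈ 0.06 while Pass@8 = 0.50, which is why
# Pass@1 results are the more impressive ones to report.
print(pass_at_k(16, 1, 1))  # 0.0625
print(pass_at_k(16, 1, 8))  # 0.5
```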

It is indeed rather crazy how many people only weeks ago thought this level of FrontierMath performance was a year or more away.

Therefore…

When FrontierMath is about to no longer be beyond the frontier, find a new frontier. Fast.

Tamay Besiroglu (6:52pm, December 21, 2024): I’m excited to announce the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath. o3 is remarkable, but there’s still a ways to go before any single AI system nears the collective prowess of the math community.

Elliot Glazer (6:30pm, December 21, 2024): For context, FrontierMath currently spans three broad tiers:

• T1 (25%) Advanced, near top-tier undergrad/IMO

• T2 (50%) Needs serious grad-level background

• T3 (25%) Research problems demanding relevant research experience

All can take hours—or days—for experts to solve.

Although o3 solved problems in all three tiers, it likely still struggles on the most formidable Tier 3 tasks—those “exceptionally hard” challenges that Tao and Gowers say can stump even top mathematicians.

Tier 4 aims to push the boundary even further. We want to assemble problems so challenging that solving them would demonstrate capabilities on par with an entire top mathematics department.

Each problem will be composed by a team of 1-3 mathematicians specialized in the same field over a 6-week period, with weekly opportunities to discuss ideas with teams in related fields. We seek broad coverage of mathematics and want all major subfields represented in Tier 4.

Process for a Tier 4 problem:

  1. 1 week crafting a robust problem concept, which “converts” research insights into a closed-answer problem.

  2. 3 weeks of collaborative research. Presentations among related teams for feedback.

  3. Two weeks for the final submission.

We’re seeking mathematicians who can craft these next-level challenges. If you have research-grade ideas that transcend T3 difficulty, please email elliot@epoch.ai with your CV and a brief note on your interests.

We’ll also hire some red-teamers, tasked with finding clever ways a model can circumvent a problem’s intended difficulty, and some reviewers to check for mathematical correctness of final submissions. Contact me if you think you’re suitable for either such role.

As AI keeps improving, we need benchmarks that reflect genuine mathematical depth. Tier 4 is our next (and possibly final) step in that direction.

Tier 5 could presumably be ‘ask a bunch of problems we actually have no idea how to solve and that might not have solutions but that would be super cool’ since anything on a benchmark inevitably gets solved.

From the description here, Chollet and Masad are speculating. It’s certainly plausible, but we don’t know if this is on the right track. It’s also highly plausible, especially given how OpenAI usually works, that o3 is deeply similar to o1, only better, similarly to how the GPT line evolved.

Amjad Masad: Based on benchmarks, OpenAI’s o3 seems like a genuine breakthrough in AI.

Maybe a start of a new paradigm.

But what’s new is also old: under the hood it might be Alpha-zero-style search and evaluate.

The author of ARC-AGI benchmark @fchollet speculates on how it works.

Davidad (other thread): o1 doesn’t do tree search, or even beam search, at inference time. it’s distilled. what about o3? we don’t know—those inference costs are very high—but there’s no inherent reason why it must be un-distill-able, since Transformers are Turing-complete (with the CoT itself as tape)

Teortaxes: I am pretty sure that o3 has no substantial difference from o1 aside from training data.

Jessica Taylor sees this as vindicating Paul Christiano’s view that you can factor cognition and use that to scale up effective intelligence.

Jessica Taylor: o3 implies Christiano’s factored cognition work is more relevant empirically; yes, you can get a lot from factored cognition.

Potential further capabilities come through iterative amplification and distillation, like ALBA.

If you care about alignment, go read Christiano!

I agree with that somewhat. I’m confused how far to go with it.

If we got o3 primarily because we trained on synthetic data that was generated by o1… then that is rather directly a form of slow takeoff and recursive self-improvement.

(Again, I don’t know if that’s what happened or not.)

And I don’t simply mean that the full o3 is not so fast, which it indeed is not:

Noam Brown: We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue.

Poaster Child: Waiting for singularity bros to discover economics.

Noam Brown: I worked at the federal reserve for 2 years.

I am waiting for economists to discover various things, Noam Brown excluded.

Jason Wei (OpenAI): o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years.

Scary fast? Absolutely.

However, I would caution (anti-caution?) that this is not a three month (~100 day) gap. On September 12, they gave us o1-preview to use. Presumably that included them having run o1-preview through their safety testing.

Davidad: If using “speed from o1 announcement to o3 announcement” to calibrate your velocity expectations, do take note that the o1 announcement was delayed by safety testing (and many OpenAI releases have been delayed in similar ways), whereas o3 was announced prior to safety testing.

They are only now starting o3 safety testing; from the sound of it, this includes o3-mini. Even the red teamers won’t get full o3 access for several weeks. Thus, we don’t know how long this later process will take, but I would put the gap closer to 4-5 months.

That is still, again, scary fast.

It is, however, also the low-hanging fruit, on two counts.

  1. We went from o1 → o3 in large part by having it spend over $1,000 on tasks. You can’t pull that trick that many more times in a row. The price will come down over time, and o3 is clearly more efficient than o1, so yes we will still make progress here, but there aren’t that many tasks where you can efficiently spend $10k+ on a slow query, especially if it isn’t reliable.

  2. This is a new paradigm of how to set up an AI model, so it should be a lot easier to find various algorithmic improvements.

Thus, if o3 isn’t so good that it substantially accelerates AI R&D that goes towards o4, then I would expect an o4 that expresses a similar jump to take substantially longer. The question is, does o3 make up for that with its contribution to AI R&D? Are we looking at a slow takeoff situation?

Even if not, it will still get faster and cheaper. And that alone is huge.

As in, this is a lot like that computer Douglas Adams wrote about, where you can get any answer you want, but it won’t be either cheap or fast. And you really, really should have given more thought to what question you were asking.

Ethan Mollick: Basically, think of the O3 results as validating Douglas Adams as the science fiction author most right about AI.

When given more time to think, the AI can generate answers to very hard questions, but the cost is very high, and you have to make sure you ask the right question first.

And the answer is likely to be correct (but we cannot be sure because verifying it requires tremendous expertise).

He also was right about machines that work best when emotionally manipulated and machines that guilt you.

Sully: With O3 costing (potentially) $2,000 per task on “high compute,” the app layer is needed more than ever.

For example, give it the wrong context and you just burned $1,000.

Likely, we have a mix of models based on their pricing/intelligence at the app layer, prepping the data to feed it into O3.

100% worth the money but the last thing u wana do is send the wrong info lol

Douglas Adams had lots of great intuitions and ideas, he’s amazing, but also he had a lot of shots on goal.

Right now o3 is rather expensive, although o3-mini will be cheaper than o1.

That doesn’t mean o3-level outputs will stay expensive, although presumably once they get cheap people will try for o4-level or o5-level outputs, which will be even more expensive despite the discounts.

Seb Krier: Lots of poor takes about the compute costs to run o3 on certain tasks and how this is very bad, lead to inequality etc.

This ignores how quickly these costs will go down over time, as they have with all other models; and ignores how AI being able to do things you currently have to pay humans orders of magnitude more to do will actually expand opportunity far more compared to the status quo.

Remember when early Ericsson phones were a quasi-luxury good?

Simeon: I think this misses the point that you can’t really buy a better iPhone even with $1M whereas you can buy more intelligence with more capital (which is why you get more inequalities than with GPT-n). You’re right that o3 will expand the pie but it can expand both the size of the pie and inequalities.

Seb Krier: An individual will not have the same demand for intelligence as e.g. a corporation. Your last sentence is what I address in my second point. I’m also personally less interested in inequality/the gap than poverty/opportunity etc.

Most people will rarely even want an o3 query in the first place, they don’t have much use for that kind of intelligence in the day to day. Most queries are already pretty easy to handle with Claude Sonnet, or even Gemini Flash.

You can’t use $1m to buy a superior iPhone. But suppose you could, and every time you paid 10x the price the iPhone got modestly better (e.g. you got an iPhone x+2 or something). My instinctive prediction is a bunch of rich people pay $10k or $100k and a few pay $1m or $10m but mostly no one cares.

This is of course different, and relative access to intelligence is a key factor, but it’s miles less unequal than access to human expertise.

To the extent that people do need that high level of artificial intelligence, it’s mostly a business expense, and as such it is actually remarkably cheap already. It definitely reduces ‘intelligence inequality’ in the sense that getting information or intelligence that you can’t provide yourself will get a lot cheaper and easier to access. Already this is a huge effect – I have lots of smart and knowledgeable friends but mostly I use the same tools everyone else could use, if they knew about them.

Still, yes, some people don’t love this.

Haydn Belfield: o1 & o3 bring to an end the period when everyone—from Musk to me—could access the same quality of AI model.

From now on, richer companies and individuals will be able to pay more for inference compute to get better results.

Further concentration of wealth and power is coming.

Inference cost *will* decline quickly and significantly. But this will not change the fact that this paradigm enables converting money into outcomes.

  1. Lower costs for everyone mean richer companies can buy even more.

  2. Companies will now feel confident to invest $10–100 million into inference compute.

This is a new way to convert money into better outcomes, so it will advantage those with more capital.

Even for a fast-growing, competent startup, it is hard to recruit and onboard many people quickly at scale.

o3 is like being able to scale up world-class talent.

  1. Rich companies are talent-constrained. It takes time and effort to scale a workforce, and it is very difficult to buy more time or work from the best performers. This is a way to easily scale up talent and outcomes simply by using more money!

Some people in replies are saying “twas ever thus”—not for most consumer technology!

Musk cannot buy a 100 times better iPhone, Spotify, Netflix, Google search, MacBook, or Excel, etc.

He can buy 100 times better legal, medical, or financial services.

AI has now shifted from the first group to the second.

Musk cannot buy 100 times better medical or financial services. What he can do is pay 100 times more, and get something 10% better. Maybe 25% better. Or, quite possibly, 10% worse, especially for financial services. For legal he can pay 100 times more and get 100 times more legal services, but as we’ve actually seen it won’t go great.

And yes, ‘pay a human to operate your consumer tech for you’ is the obvious way to get superior consumer tech. I can absolutely get a better Netflix or Spotify or search by paying infinitely more money, if I want that, via this vastly improved interface.

And of course I could always get a vastly better computer. If you’re using a MacBook and you are literally Elon Musk that is pretty much on you.

The ‘twas ever thus’ line raises the question of what type of product AI is supposed to be. If it’s a consumer technology, then for most purposes, I still think we end up using the same product.

If it’s a professional service used in doing business, then it was already different. The same way I could hire expensive lawyers, I could have hired a prompt engineer or SWEs to build me agents or what not, if I wanted that.

I find Altman’s framing interesting here, and important:

Sam Altman: seemingly somewhat lost in the noise of today.

On many coding tasks, o3-mini will outperform o1 at a massive cost reduction!

I expect this trend to continue, but also that the ability to get marginally more performance for exponentially more money will be truly strange.

Exponentially more money for marginally more performance.

Over time, massive cost reductions.

In a sense, the extra money is buying you living in the future.

Do you want to live in the future, before you get the cost reductions?

In some cases, very obviously yes, you do.

I would not say it has fallen. I do know it will transform.

If two years from now you are writing code line by line, you’ll be a dinosaur.

Sully: yeah its over for coding with o3

this is mindboggling

looks like the first big jump since gpt4, because these numbers make 0 sense

By the way, I don’t say this lightly, but

Software engineering in the traditional sense is dead in less than two years.

You will still need smart, capable engineers.

But anything that involves raw coding and no taste is done for.

o6 will build you virtually anything.

Still Bullish on things that require taste (design and such)

The question is, assuming the world ‘looks normal,’ will you still need taste? You’ll need some kind of taste. You still need to decide what to build. But the taste you need will presumably get continuously higher level and more abstract, even within design.

If you’re in AI capabilities, pivot to AI safety.

If you’re in software engineering, pivot to software architecting.

If you’re in working purely for a living, pivot to building things and shipping them.

But otherwise, don’t quit your day job.

Null Pointered (6.4m views): If you are a software engineer who’s three years into your career: quit now. there is not a single job in CS anymore. it’s over. this field won’t exist in 1.5 years.

Anthony F: This is the kind of thought that will make the software engineers valuable in 1.5 years.

null: That’s what I’m hoping.

Robin Hanson: I would bet against this.

If anything, being in software should make you worry less.

Pavel Asparouhov: Non technical folk saying the SWEs are cooked — it’s you guys who are cooked.

Ur gonna have ex swes competing with everything you’re doing now, and they’re gonna be AI turbocharged

Engineers were simply doing coding bc it was the highest leverage use of mental power

When that shifts it’s not going to all of the sudden shift the hierarchy

They’ll still be (higher level) SWEs. Instead of coding, they’ll be telling the AI to code.

And they will absolutely be competing with you.

If you don’t join them, you are probably going to lose.

Here’s some advice that I agree with in spirit, except that if you choose not to decide you still have made a choice, so you do the best you can; notice he gives advice anyway:

Roon: Nobody should give or receive any career advice right now. Everyone is broadly underestimating the scope and scale of change and the high variance of the future. Your L4 engineer buddy at Meta telling you “bro, CS degrees are cooked” doesn’t know anything.

Greatness cannot be planned.

Stay nimble and have fun.

It’s an exciting time. Existing status hierarchies will collapse, and the creatives will win big.

Roon: guy with zero executive function to speak of “greatness cannot be planned”

Simon Sarris: I feel like I’m going insane because giving advice to new devs is not that hard.

  1. Build things you like preferably publicly with your real name

  2. Have a website that shows something neat

  3. Help other people publicly. Participate in social media socially.

Do you notice how “AI” changes none of this?

Wailing about some indeterminate future and claiming that there’s no advice that can be given to noobs are both breathlessly silly. Think about what you’re being asked for at least ten seconds. You can really think of nothing to offer? Nothing?

Ajeya Cotra: I wonder if an o3 agent could productively work on projects with poor feedback loops (eg “research X topic”) for many subjective years without going off the rails or hitting a degenerate loop. Even if it’s much less cost-efficient now it would quickly become cheaper.

Another situation where onlookers/forecasters probably disagree a lot about *today’s* capabilities let alone future capabilities.

Wonder how o3 would do on wedding planning.

Note the date on that poll; it is prior to o3.

I predict that o3 with reasonable tool use and other similar scaffolding, and a bunch of engineering work to get all that set up (but it would almost all be general work, it mostly wouldn’t need to be wedding specific work, and a lot of it could be done by o3!) would be great at planning ‘a’ wedding. It can give you one hell of a wedding. But you don’t want ‘a’ wedding. You want your wedding.

The key is handling the humans. That would mean keeping the humans in the loop properly, ensuring they give the right feedback that allows o3 to stay on track and know what is actually desired. But it would also mean all the work a wedding planner does to manage the bride and sometimes groom, and to deal with issues on-site.

If you give it an assistant (with assistant planner levels of skill) to navigate various physical issues and conversations and such, then the problem becomes trivial. Which in some sense also makes it not a good test, but also does mean your wedding planner is out of a job.

So, good question, actually. As far as we know, no one has dared try.

The bar for safety testing has gotten so low that I was genuinely happy to see Greg Brockman say that safety testing and red teaming was starting now. That meant they were taking testing seriously!

Compare that to when they tested the original GPT-4, under far less dangerous circumstances, for months. Whereas with o3, it could plausibly already have been too late.

Take Eliezer Yudkowsky’s warning here both seriously and literally:

Greg Brockman: o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now.

Eliezer Yudkowsky: Sir, this level of capabilities needs to be continuously safety-tested while you are training it on computers connected to the Internet (and to humans). You are past the point where it seems safe to train first and conduct evals only before user releases.

RichG (QTing EY above): I’ve been avoiding politics and avoiding tribe like things like putting ⏹️ in my name, but level of lack of paranoia that these labs have is just plain worrying. I think I will put ⏹️ in my name now.

Was it probably safe in practice to train o3 under these conditions? Sure. You definitely had at least one 9 of safety doing this (p(safe)>90%). It would be reasonable to claim you had two (p(safe)>99%) at the level we care about.

Given both kinds of model uncertainty, I don’t think you had three.

If humans are reading the outputs, or if o3 has meaningful outgoing internet access, and it turns out you are wrong about it being safe to train it under those conditions… the results could be catastrophically bad, or even existentially bad.

You don’t do that because you expect we are in that world yet. We almost certainly aren’t. You do that because there is a small chance that we are, and we can’t afford to be wrong about this.

That is still not the current baseline threat model. The current baseline threat model remains that a malicious user uses o3 to do something for them that we do not want o3 to do.

Xuan notes she’s pretty upset about o3’s existence, because she thinks it is rather unsafe-by-default and was hoping the labs wouldn’t build something like this, and then was hoping it wouldn’t scale easily. And that o3 seems to be likely to engage in open-ended planning, operate over uninterpretable world models, and be situationally aware, and otherwise be at high risk for classic optimization-based AI risks. She’s optimistic this can be solved, but time might be short.

I agree that o3 seems relatively likely to be highly unsafe-by-default in existentially dangerous ways, including ways illustrated by the recent Redwood Research and Anthropic paper, Alignment Faking in Large Language Models. It builds in so many of the preconditions for such behaviors.

Davidad: “Maybe the AI capabilities researchers aren’t very smart” is a very very hazardous assumption on which to pin one’s AI safety hopes

I don’t mean to imply it’s *pointless* to keep AI capabilities ideas private. But in my experience, if I have an idea, at least somebody in one top lab will have the same idea by next quarter, and someone in academia or open source will have the idea and publish within 1-2 years.

A better hope [is to solve the practical safety problems, e.g. via interpretability.]

I am not convinced, at least for my own purposes, although obviously most people will be unable to come up with valuable insights here. I think salience of ideas is a big deal, people don’t do things, and yes often I get ideas that seem like they might not get discovered forever otherwise. Doubtless a lot of them are because ‘that doesn’t work, either because we tried it and it doesn’t or it obviously doesn’t you idiot’ but I’m fine with not knowing which ones are which.

I do think that the rationalist or MIRI crowd made a critical mistake in the 2010s of thinking they should be loud about the dangers of AI in general, but keep their technical ideas remarkably secret even when it was expensive. It turned out it was the opposite, the technical ideas didn’t much matter in the long run (probably?) but the warnings drew a bunch of interest. So there’s that.

Certainly now is not the time to keep our safety concerns or ideas to ourselves.

Thus, you are invited to their early access safety testing.

OpenAI: We’re inviting safety researchers to apply for early access to our next frontier models. This early access program complements our existing frontier model testing process, which includes rigorous internal safety testing, external red teaming such as our Red Teaming Network and collaborations with third-party testing organizations, as well as the U.S. AI Safety Institute and the UK AI Safety Institute.

As models become more capable, we are hopeful that insights from the broader safety community can bring fresh perspectives, deepen our understanding of emerging risks, develop new evaluations, and highlight areas to advance safety research.

As part of 12 Days of OpenAI⁠, we’re opening an application process for safety researchers to explore and surface the potential safety and security implications of the next frontier models.

Safety testing in the reasoning era

Models are becoming more capable quickly, which means that new threat modeling, evaluation, and testing techniques are needed. We invest heavily in these efforts as a company, such as designing new measurement techniques under our Preparedness Framework, and are focused on areas where advanced reasoning models, like our o-series, may pose heightened risks. We believe that the world will benefit from more research relating to threat modeling, security analysis, safety evaluations, capability elicitation, and more.

Early access is flexible for safety researchers. You can explore things like:

  • Developing Robust Evaluations: Build evaluations to assess previously identified capabilities or potential new ones with significant security or safety implications. We encourage researchers to explore ideas that highlight threat models that identify specific capabilities, behaviors, and propensities that may pose concrete risks tied to the evaluations they submit.

  • Creating Potential High-Risk Capabilities Demonstrations: Develop controlled demonstrations showcasing how reasoning models’ advanced capabilities could cause significant harm to individuals or public security absent further mitigation. We encourage researchers to focus on scenarios that are not possible with currently widely adopted models or tools.

Examples of evaluations and demonstrations for frontier AI systems:

We hope these insights will surface valuable findings and contribute to the frontier of safety research more broadly. This is not a replacement for our formal safety testing or red teaming processes.

How to apply

Submit your application for our early access period, opening December 20, 2024, to push the boundaries of safety research. We’ll begin selections as soon as possible thereafter. Applications close on January 10, 2025.

Sam Altman: if you are a safety researcher, please consider applying to help test o3-mini and o3. excited to get these out for general availability soon.

extremely proud of all of openai for the work and ingenuity that went into creating these models; they are great.

(and most of all, excited to see what people will build with this!)

If early testing of the full o3 will require a delay of multiple weeks for setup, then that implies we are not seeing the full o3 in January. We probably see o3-mini relatively soon, then o3 follows up later.

This seems wise in any case. Giving the public o3-mini is one of the best available tests of the full o3. This is the best form of iterative deployment. What the public does with o3-mini can inform what we look for with o3.

One must carefully consider the ethical implications before assisting OpenAI, especially assisting with their attempts to push the capabilities frontier for coding in particular. There is an obvious argument against participation, including decision theoretic considerations.

I think this loses in this case to the obvious argument for participation, which is that this is purely red teaming and safety work, and we all benefit from it being as robust as possible, and also you can do good safety research using your access. This type of work benefits us all, not only OpenAI.

Thus, yes, I encourage you to apply to this program, and while doing so to be helpful in ensuring that o3 is safe.

Pretty much all the things, at this point, although the worst ones aren’t likely… yet.

GFodor.id: It’s hard to take anyone seriously who can see a PhD in a box and *not* imagine clearly more than a few plausible mass casualty events due to the evaporation of friction due to lack of know-how and general IQ.

In many places the division is misleading, but for now and at this capability level, it seems reasonable to talk about three main categories of risk here:

  1. Misuse.

  2. Automated R&D and potential takeoffs or self-improvement.

  3. For-real loss of control problems that aren’t #2.

For all previous frontier models, there was always a jailbreak. If someone was determined to get your model to do [X], and your model had the underlying capability to do [X], you could get it to do [X].

In this case, [X] is likely to include substantially aiding a number of catastrophically dangerous things, in the class of cyberattacks or CBRN risks or other such dangers.

Aaron Bergman: Maybe this is obvious but: the other labs seem to be broadly following a pretty normal cluster of commercial and scientific incentives; o3 looks like the clearest example yet of OpenAI being ideologically driven by AGI per se.

Like you don’t design a system that costs thousands of dollars to use per API call if you’re focused on consumer utility – you do that if you want to make a machine that can think well, full stop.

Peter Wildeford: I think OpenAI genuinely cares about getting society to grapple with AI progress.

I don’t think ideological is the right term. If your focus were consumer utility, you wouldn’t make something like this for direct consumer use. But you might well make it for big business, if you’re trying to sell a bunch of drop-in employees to big business at $20k/year a pop or something. That’s a pretty great business if you can get it (and the compute is only $10k, or $1k). And you definitely do it if your goal is to have that model help make your other models better.

It’s weird to me to talk about wanting to make AGI and ASI and the most intelligent thing possible as if it were ideological. Of course you want to make those things… provided you (or we) can stay in control of the outcomes. Just think of the potential! It is only ideological in the sense that it represents a belief that we can handle doing that without getting ourselves killed.

If anything, to me, it’s the opposite. Not wanting to go for ASI because you don’t see the upside is an ideological position. The two reasonable positions are ‘don’t go for ASI yet, slow down there cowboy, we’re not ready to handle this’ and ‘we totally can too handle this, just think of the potential.’ Or even ‘we have to build it before the other guy does,’ which makes me despair but at least I get it. The position ‘nothing to see here what’s the point there is no market for that, move along now, can we get that q4 profit projection memo’ is the Obvious Nonsense.

And of course, if you don’t (as Aaron seems to imply) think Anthropic has its eyes on the prize, you’re not paying attention. DeepMind originally did, but Google doesn’t, so it’s unclear what the mix is at this point over there.

I want to be clear here that the answer is: Quite a lot of things. Having access to next-level coding and math is great. Having the ability to spend more money to get better answers where it is valuable is great.

Even if this all stays relatively mundane and o3 is ultimately disappointing, I am super excited for the upside, and to see what we all can discover, do, build and automate.

Guess who.

All right, that’s my fault, I made that way too easy.

Gary Marcus: After almost two years of declaring that a release of GPT-5 is imminent and not getting it, super fans have decided that a demo of a system that they did zero personal experimentation with — and that won’t (in full form) be available for months — is a mic-drop AGI moment.

Standards have fallen.

[o1] is not a general purpose reasoner. it works where there is a lot of augmented data etc.

First off, it is Your Periodic Reminder that progress is anything but slow even if you exclude the entire o-line. It has been a little over two years since there was a demo of GPT-4, with what was previously a two year product cycle. That’s very different from ‘two years of an imminent GPT-5 release.’ In the meantime, models have gotten better across the board. GPT-4o, Claude Sonnet 3.5 and Gemini 1206 all completely demolish the original GPT-4, to say nothing of o1 or Perplexity or anything else. And we also have o1, and now o3. The practical experience of using LLMs is vastly better than it was two years ago.

Also, quite obviously, you pursue both paths at once, both GPT-N and o-N, and if both succeed great then you combine them.

Srini Pagdyala: If O3 is AGI, why are they spending billions on GPT-5?

Gary Marcus: Damn good question!

So no, not a good question.

Is there now a pattern where ‘old school’ frontier model training runs whose primary plan was ‘add another zero or two’ are generating unimpressive results? Yeah, sure.

Is o3 an actual AGI? No. I’m pretty sure it is not.

But it seems plausible it is AGI-level specifically at coding. And that’s the important one. It’s the one that counts most. If you have that, overall AGI likely isn’t far behind.

I mention this because some were suggesting it might be.

Here’s Yana Welinder claiming o3 is AGI, based off the ARC performance, although she later hedges to ‘partial AGI.’

And here’s Evan Mays, a member of OpenAI’s preparedness team, saying o3 is AGI, although he later deleted it. Are they thinking about invoking the charter? It’s premature, but no longer completely crazy to think about it.

And here’s old school and present OpenAI board member Adam D’Angelo saying ‘Wild that the o3 results are public and yet the market still isn’t pricing in AGI,’ which to be fair it totally isn’t and it should be, whether o3 itself is AGI or not. And Elon Musk agrees.

If o3 was as good on most tasks as it is at coding or math, then it would be AGI.

It is not.

If it was, OpenAI would be communicating about this very differently.

If it was, then that would not match what we saw from o1, or what we would predict from this style of architecture. We should expect o-style models to be relatively good at domains like math and coding where their kind of chain of thought is most useful and it is easiest to automatically evaluate outputs.

That potentially is saying more about the definition of AGI than anything else. But it is certainly saying the useful thing that there are plenty of highly useful human-shaped cognitive things it cannot yet do so well.

How long that lasts? That’s another question.

What would be the most Robin Hanson take here, in response to the ARC score?

Robin Hanson: It’s great to find things AI can’t yet do, and then measure progress in terms of getting AIs to do them. But crazy wrong to declare we’ve achieved AGI when we reach human level on the latest such metric. We’ve seen dozens of such metrics so far, and may see dozens more before AGI.

o1 listed 15 when I asked, oddly without any math evals, and Claude gave us 30. So yes, dozens of such cases. We might indeed see dozens more, depending on how we choose them. But in terms of things like ARC, where the test was designed to not be something you could do easily without general intelligence, not so many? It does not feel like we have ‘dozens more’ such things left.

This has nothing to do with the ‘financial definition of AGI’ between OpenAI and Microsoft, of $100 billion in profits. This almost certainly is not that, either, but the two facts are not that related to each other.

Evan Conrad suggests this, because the expenses will come at runtime, so people will be able to catch up on training the models themselves. And of course this question is also on our minds given DeepSeek v3, which I’m not covering here but certainly makes a strong argument that open is more competitive than it appeared. More on that in future posts.

I agree that the compute shifting to inference relatively helps whoever can’t afford to be spending the most compute on training. That would shift things towards whoever has the most compute for inference. The same goes if inference is used to create data to train models.

Dan Hendrycks: If gains in AI reasoning will mainly come from creating synthetic reasoning data to train on, then the basis of competitiveness is not having the largest training cluster, but having the most inference compute.

This shift gives Microsoft, Google, and Amazon a large advantage.

Inference compute being the true cost also means that model quality and efficiency potentially matter quite a lot. Everything is on a log scale, so even if Meta’s M-5 is sort of okay and can scale like O-5, if it’s even modestly worse, it might cost 10x or 100x more compute to get similar performance.
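A toy version of that log-scale point, with made-up numbers purely for illustration (the points-per-decade slope and the size of the quality gap are both assumptions, not measurements):

```python
# Toy model of the log-scale point: suppose benchmark score improves roughly
# linearly in log10(inference compute). Then a model that is a fixed number
# of "points" worse needs multiplicatively more compute to match.
# All constants here are made up for illustration.
points_per_decade = 5.0          # assume +5 points per 10x compute
quality_gap_points = 7.5         # assume the weaker model is 7.5 points behind

extra_compute_factor = 10 ** (quality_gap_points / points_per_decade)
print(f"Compute multiplier to close the gap: {extra_compute_factor:.0f}x")  # ~32x
```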

That leaves a hell of a lot of room for profit margins.

Then there’s the assumption that when training your bespoke model, what matters is compute, and everything else is kind of fungible. I keep seeing this, and I don’t think this is right. I do think you can do ‘okay’ as a fast follower with only compute and ordinary skill in the art. Sure. But it seems to me like the top labs, particularly Anthropic and OpenAI, absolutely do have special sauce, and that this matters. There are a number of strong candidates, including algorithmic tricks and better data.

It also matters whether you actually do the thing you need to do.

Tanishq Abraham: Today, people are saying Google is cooked rofl

Gallabytes: Not me, though. Big parallel thinking just got de-risked at scale. They’ll catch up.

If recursive self-improvement is the game, OpenAI will win. If industrial scaling is the game, it’ll be Google. If unit economics are the game, then everyone will win.

Pushinpronto: Why does OpenAI have an advantage in the case of recursive self-improvement? Is it just the fact that they were first?

Gallabytes: We’re not even quite there yet! But they’ll bet hard on it much faster than Google will, and they have a head start in getting there.

What this does mean is that open models will continue to make progress and will be harder to limit at anything like current levels, if one wanted to do that. If you have an open model Llama-N, it now seems like you can turn it into M(eta)-N, once it becomes known how to do that. It might not be very good, but it will be a progression.

The thinking here by Evan at the link about the implications of takeoff seems deeply confused – if we’re in a takeoff situation then that changes everything, and it’s not about ‘who can capture the value’ so much as who can capture the lightcone. I don’t understand how people can look these situations in the face and not only not think about existential risk but also think everything will ‘seem normal.’ He’s the one who said takeoff (and ‘fast’ takeoff, which classically means it’s all over in a matter of hours to weeks)!

As a reminder, the traditional definition of ‘slow’ takeoff is remarkably fast, also best start believing in them, because it sure looks like you’re in one:

Teortaxes: it’s about time ML twitter got brought up to speed on what “takeoff speeds” mean. Christiano: “There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles.” That’s slow. We’re in the early stages of it.
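As a quick gut check on why even the ‘slow’ version is fast, here is the implied arithmetic; the ~3% baseline for current world growth is my own rough figure, not something from Christiano.

```python
# What annual growth rates do Christiano's "slow takeoff" doublings imply?
# The ~3% figure for current world GDP growth is a rough assumption.
def annual_growth_for_doubling(years: float) -> float:
    """Growth rate g such that (1 + g) ** years == 2."""
    return 2 ** (1 / years) - 1

print(f"Double in 4 years: {annual_growth_for_doubling(4):.1%} per year")  # ~18.9%
print(f"Double in 1 year:  {annual_growth_for_doubling(1):.1%} per year")  # 100.0%
# Versus roughly 3% per year for the world economy today. "Slow" is not slow.
```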

One answer to ‘why didn’t Nvidia move more’ is of course ‘everything is priced in’ but no of course it isn’t, we didn’t know, stop pretending we knew, insiders in OpenAI couldn’t have bought enough Nvidia here.

Also, on Monday after a few days to think, Nvidia outperformed the Nasdaq by ~3%.

And this was how the Wall Street Journal described that, even then:

No, I didn’t buy more on Friday; I keep telling myself I have Nvidia at home. Indeed I do have Nvidia at home. I keep kicking myself, but that’s how every trade is – either you shouldn’t have done it, or you should have done more. I don’t know that there will be another moment like this one, but if there is another moment this obvious, I hereby pledge in public to at least top off a little bit. Nick is correct in his attitude here: you do not need to do the research, because you know this isn’t priced in, but in expectation you can assume that everything you are not thinking about is priced in.

And now, as I finish this up, Nvidia has given most of those gains back on no news that seems important to me. You could claim that means yes, priced in. I don’t agree.

Spencer Schiff (on Friday): In a sane world the front pages of all mainstream news websites would be filled with o3 headlines right now

The traditional media, instead, did not notice it. At all.

And one can’t help but suspect this was highly intentional. Why else would you announce such a big thing on the Friday afternoon before Christmas?

They did successfully hype it among AI Twitter, also known as ‘the future.’

Bindu Reddy: The o3 announcement was a MASTERSTROKE by OpenAI

The buzz about it is so deafening that everything before it has been wiped out from our collective memory!

All we can think of is this mythical model that can solve insanely hard problems 😂

Nick: the whole thing is so thielian.

If you’re going to take on a giant market doing probably illegal stuff call yourself something as light and bouba as possible, like airbnb, lyft

If you’re going to announce agi do it during a light and happy 12 days of christmas short demo.

Sam Altman (replying to Nick!): friday before the holidays news dump.

Well, then.

In that crowd, it was all ‘software engineers are cooked’ and people filled with some mix of excitement and existential dread.

But back in the world where everyone else lives…

Benjamin Todd: Most places I checked didn’t mention AI at all, or they’d only have a secondary story about something else like AI and copyright. My twitter is a bubble and most people have no idea what’s happening.

OpenAI: we’ve created a new AI architecture that can provide expert level answers in science, math and coding, which could herald the intelligence explosion.

The media: bond funds!

Davidad: As Matt Levine used to say, People Are Worried About Bond Market Liquidity.

Here is that WSJ story, talking about how GPT-5 or ‘Orion’ has failed to exhibit big intelligence gains despite multiple large training runs. It says ‘so far, the vibes are off,’ and says OpenAI is running into a data wall and trying to fill it with synthetic data. If so, well, they had o1 for that, and now they have o3. The article does mention o1 as the alternative approach, but is throwing shade even there, so expensive it is.

And we have this variation of that article, in the print edition, on Saturday, after o3:

Sam Altman: I think The Wall Street Journal is the overall best U.S. newspaper right now, but they published an article called “The Next Great Leap in AI Is Behind Schedule and Crazy Expensive” many hours after we announced o3?

It wasn’t only WSJ either, there’s also Bloomberg, which normally I love:

On Monday I did find coverage of o3 in Bloomberg, but not only was it not on the front page, it wasn’t even on the front tech page; I had to click through to the AI section.

Another fun one, from Thursday, here’s the original in the NY Times:

Is it Cade Metz? Yep, it’s Cade Metz and also Tripp Mickle. To be fair to them, they do have Demis Hassabis quotes saying chatbot improvements would slow down. And then there’s this, love it:

Not everyone in the A.I. world is concerned. Some, like OpenAI’s chief executive, Sam Altman, say that progress will continue at the same pace, albeit with some twists on old techniques.

That post also mentions both synthetic data and o1.

OpenAI recently released a new system called OpenAI o1 that was built this way. But the method only works in areas like math and computer programming, where there is a firm distinction between right and wrong.

It works best there, yes, but that doesn’t mean it’s the only place that works.

We also had Wired with the article ‘Generative AI Still Needs to Prove Its Usefulness.’

True, you don’t want to make the opposite mistake either, and freak out a lot over something that is not available yet. But this was ridiculous.

I realized I wanted to say more here and have this section available as its own post. So more on this later.

Oh no!


Mikael Brockman: o3 is going to be able to create incredibly complex solutions that are incorrect in unprecedentedly confusing ways.

We made everything astoundingly complicated, thus solving the problem once and for all.

Humans will be needed to look at the output of AGI and say, “What the f is this? Delete it.”

Oh no!


o3, Oh My Read More »

hertz-continues-ev-purge,-asks-renters-if-they-want-to-buy-instead-of-return

Hertz continues EV purge, asks renters if they want to buy instead of return

Apparently Hertz’s purging of electric vehicles from its fleet isn’t going fast enough for the car rental giant. A Reddit user posted an offer they received from Hertz to buy the 2023 Tesla Model 3 they had been renting for $17,913.

Hertz originally went strong into EVs, announcing at the end of 2021 a plan to buy 100,000 Model 3s for its fleet, but 16 months later it had acquired only half that number. The company found that repair costs—especially for Teslas, which averaged 20 percent more than other EVs—were cutting into its profit margins. Customer demand was also not what Hertz had hoped for; last January, it announced plans to sell off 20,000 EVs.

Asking its customers if they want to purchase their rentals isn’t a new strategy for Hertz. “By connecting our rental customers who opt into our emails to our sales channels, we’re not only building awareness of the fact that we sell cars but also offering a unique opportunity to someone who may be in the market for the same car they have on rent,” Hertz communications director Jamie Line told The Verge.

Hertz is advertising a limited 12-month, 12,000-mile powertrain warranty for each EV, and customers will have seven days to return the car in case of profound buyer’s regret.

According to The Verge, offers have ranged from $18,422 for a 2023 Chevy Bolt to $28,500 for a Polestar 2. We spotted some good deals from Hertz when we last checked, with some still eligible for a federal tax credit.

Hertz’s EV sell-off may be winding down, however. Last March we saw more than 2,100 BEVs for sale on the company’s used car site. When we checked this morning, there were just 175 left.

Hertz continues EV purge, asks renters if they want to buy instead of return Read More »

ai-#96:-o3-but-not-yet-for-thee

AI #96: o3 But Not Yet For Thee

The year in models certainly finished off with a bang.

In this penultimate week, we get o3, which purports to give us vastly more efficient performance than o1, and also to allow us to choose to spend vastly more compute if we want a superior answer.

o3 is a big deal, making big gains on coding tests, ARC and some other benchmarks. How big a deal is difficult to say given what we know now. It’s about to enter full-fledged safety testing.

o3 will get its own post soon, and I’m also pushing back coverage of Deliberative Alignment, OpenAI’s new alignment strategy, to incorporate into that.

We also got DeepSeek v3, which claims to have trained a roughly Sonnet-strength model for only $6 million and 37b active parameters per token (671b total via mixture of experts).

DeepSeek v3 gets its own brief section with the headlines, but full coverage will have to wait a week or so for reactions and for me to read the technical report.

Both are potential game changers, both in their practical applications and in terms of what their existence predicts for our future. It is also too soon to know if either of them is the real deal.

Both are mostly not covered here quite yet, due to the holidays. Stay tuned.

  1. Language Models Offer Mundane Utility. Make best use of your new AI agents.

  2. Language Models Don’t Offer Mundane Utility. The uncanny valley of reliability.

  3. Flash in the Pan. o1-style thinking comes to Gemini Flash. It’s doing its best.

  4. The Six Million Dollar Model. Can they make it faster, stronger, better, cheaper?

  5. And I’ll Form the Head. We all have our own mixture of experts.

  6. Huh, Upgrades. ChatGPT can use Mac apps, unlimited (slow) holiday Sora.

  7. o1 Reactions. Many really love it, others keep reporting being disappointed.

  8. Fun With Image Generation. What is your favorite color? Blue. It’s blue.

  9. Introducing. Google finally gives us LearnLM.

  10. They Took Our Jobs. Why are you still writing your own code?

  11. Get Involved. Quick reminder that opportunity to fund things is everywhere.

  12. In Other AI News. Claude gets into a fight over LessWrong moderation.

  13. You See an Agent, You Run. Building effective agents by not doing so.

  14. Another One Leaves the Bus. Alec Radford leaves OpenAI.

  15. Quiet Speculations. Estimates of economic growth keep coming in super low.

  16. Lock It In. What stops you from switching LLMs?

  17. The Quest for Sane Regulations. Sriram Krishnan joins the Trump administration.

  18. The Week in Audio. The many faces of Yann LeCun. Anthropic’s co-founders talk.

  19. A Tale as Old as Time. Ask why mostly in a predictive sense.

  20. Rhetorical Innovation. You won’t not wear the fing hat.

  21. Aligning a Smarter Than Human Intelligence is Difficult. Cooperate with yourself.

  22. People Are Worried About AI Killing Everyone. I choose you.

  23. The Lighter Side. Please, no one call human resources.

How does your company make best use of AI agents? Austin Vernon frames the issue well: AIs are super fast, but they need proper context. So if you want to use AI agents, you’ll need to ensure they have access to context, in forms that don’t bottleneck on humans. Take the humans out of the loop, minimize meetings and touch points. Put all your information into written form, such as within wikis. Have automatic tests and approvals, but have the AI call for humans when needed via ‘stop work authority’ – I would flip this around and let the humans stop the AIs, too.

That all makes sense, and not only for corporations. If there’s something you want your future AIs to know, write it down in a form they can read, and try to design your workflows such that you can minimize human (your own!) touch points.

To what extent are you living in the future? This is the CEO of playground AI, and the timestamp was Friday:

Suhail: I must give it to Anthropic, I can’t use 4o after using Sonnet. Huge shift in spice distribution!

How do you educate yourself for a completely new world?

Miles Brundage: The thing about “truly fully updating our education system to reflect where AI is headed” is that no one is doing it because it’s impossible.

The timescales involved, especially in early education, are lightyears beyond what is even somewhat foreseeable in AI.

Some small bits are clear: earlier education should increasingly focus on enabling effective citizenship, wellbeing, etc. rather than preparing for paid work, and short-term education should be focused more on physical stuff that will take longer to automate. But that’s about it.

What will citizenship mean in the age of AI? I have absolutely no idea. So how do you prepare for that? Largely the same goes for wellbeing. A lot of this could be thought of as: Focus on the general and the adaptable, and focus less on the specific, including things specifically for jobs and other current forms of paid work – you want to be creative and useful and flexible and able to roll with the punches.

That of course assumes that you are taking the world as given, rather than trying to change the course of history. In which case, there’s a very different calculation.

Large parts of every job are pretty dumb.

Shako: My team, full of extremely smart and highly paid Ph.D.s, spent $10,000 of our time this week figuring out where in a pipeline a left join was bringing in duplicates, instead of the strategic thinking we were capable of. In the short run, AI will make us far more productive.

Gallabytes: The two most expensive bugs in my career have been simple typos.

ChatGPT is a left-leaning midwit, so Paul Graham is using it to see what parts of his new essay such midwits will dislike, and which ones you can get it to acknowledge are true. I note that you could probably use Claude to simulate whatever Type of Guy you would like, if you have ordinary skill in the art.

Strongly agree with this:

Theo: Something I hate when using Cursor is, sometimes, it will randomly delete some of my code, for no reason

Sometimes removing an entire feature 😱

I once pushed to production without being careful enough and realized a few hours later I had removed an entire feature …

Filippo Pietrantonio: Man that happens all the time. In fact now I tell it in every single prompt to not delete any files and keep all current functionalities and backend intact.

Davidad: Lightweight version control (or at least infinite-undo functionality!) should be invoked before and after every AI agent action in human-AI teaming interfaces with artifacts of any kind.

Gary: Windsurf has this.

Jacques: Cursor actually does have a checkpointing feature that allows you to go back in time if something messes up (at least the Composer Agent mode does).

In Cursor I made an effort to split up files exactly because I found I had to always scan the file being changed to ensure it wasn’t about to silently delete anything. The way I was doing it, you didn’t have to worry about it modifying or deleting other files.

On the plus side, now I know how to do reasonable version control.
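For what it’s worth, here is a minimal sketch of the ‘lightweight version control around every agent action’ idea Davidad describes, assuming the project is already a git repository (the function names are mine, not any tool’s API):

```python
import subprocess

def git(*args: str) -> str:
    """Run a git command in the current repo and return its output."""
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

def snapshot(label: str) -> None:
    """Commit everything (including untracked files) so the agent's next
    action can be undone with a plain git revert or reset."""
    git("add", "-A")
    # --allow-empty so we still get a checkpoint even if nothing changed.
    git("commit", "--allow-empty", "-m", f"checkpoint: {label}")

def run_agent_action(description: str, action) -> None:
    """Wrap any AI agent edit in a before/after checkpoint."""
    snapshot(f"before {description}")
    action()  # the AI agent edits files here
    snapshot(f"after {description}")
```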

The uncanny valley problem here is definitely a thing.

Ryan Lackey: I hate Apple Intelligence email/etc. summaries. They’re just off enough to make me think it is a new email in thread, but not useful enough to be a good summary. Uncanny valley.

It’s really good for a bunch of other stuff. Apple is just not doing a good job on the utility side, although the private computing architecture is brilliant and inspiring.

The latest rival to at least o1-mini is Gemini-2.0-Flash-Thinking, which I’m tempted to refer to (because of reasons) as gf1.

Jeff Dean: Considering its speed, we’re pretty happy with how the experimental Gemini 2.0 Flash Thinking model is performing on lmsys.

Gemini 2.0 Flash Thinking is now essentially tied at the top of the overall leaderboard with Gemini-Exp-1206, which is essentially a beta of Gemini Pro 2.0. This tells us something about the model, but also reinforces that this metric is bizarre now. It puts us in a strange spot. What is the scenario where you will want Flash Thinking rather than o1 (or o3!) and also rather than Gemini Pro, Claude Sonnet, Perplexity or GPT-4o?

One cool thing about Thinking is that (like DeepSeek’s Deep Thought) it explains its chain of thought much better than o1.

Deedy was impressed.

Deedy: Google really cooked with Gemini 2.0 Flash Thinking.

It thinks AND it’s fast AND it’s high quality.

Not only is it #1 on LMArena on every category, but it crushes my goto Math riddle in 14s—5x faster than any other model that can solve it!

o1 and o1 Pro took 102s and 138s respectively for me on this task.

Here’s another math puzzle where o1 got it wrong and took 3.5x the time:

“You have 60 red and 40 blue socks in a drawer, and you keep drawing a sock uniformly at random until you have drawn all the socks of one color. What is the expected number of socks left in the drawer?”

That result… did not replicate when I tried it. It went off the rails, and it went off them hard. And it went off them in ways that make me skeptical that you can use this for anything of the sort. Maybe Deedy got lucky?
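For reference, since the puzzle has a clean closed form, here is a quick check (my sketch, not from the thread): the standard ‘last sock of the other color’ argument gives 60/41 + 40/61 ≈ 2.12 expected socks left, and a Monte Carlo run agrees, which makes it easy to tell when a model has gone off the rails.

```python
import random

def simulate(r=60, b=40, trials=100_000):
    """Monte Carlo: draw socks without replacement until one color is
    exhausted, and count how many socks remain in the drawer."""
    total_left = 0
    for _ in range(trials):
        socks = ['R'] * r + ['B'] * b
        random.shuffle(socks)
        reds, blues = r, b
        for i, s in enumerate(socks):
            if s == 'R':
                reds -= 1
            else:
                blues -= 1
            if reds == 0 or blues == 0:
                total_left += len(socks) - (i + 1)
                break
    return total_left / trials

# Closed form: each of the r red socks is left behind iff it comes after all
# b blue socks (probability 1/(b+1)), and vice versa.
closed_form = 60 / 41 + 40 / 61   # ~2.12
print(simulate(), closed_form)
```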

Other reports I’ve seen are less excited about quality, and when o3 got announced it seemed everyone got distracted.

What about Gemini 2.0 Experimental (e.g. the beta of Gemini 2.0 Pro, aka Gemini-1206)?

It’s certainly a substantial leap over previous Gemini Pro versions and it is atop the Arena. But I don’t see much practical eagerness to use it, and I’m not sure what the use case is there where it is the right tool.

Eric Neyman is impressed:

Eric Neyman: Guys, we have a winner!! Gemini 2.0 Flash Thinking Experimental is the first model I’m aware of to get my benchmark question right.

Eric Neyman: Every time a new LLM comes out, I ask it one question: What is the smallest integer whose square is between 15 and 30? So far, no LLM has gotten this right.

That one did replicate for me, and the logic is fine, but wow do some models make life a little tougher than it needs to be; think faster and harder, not smarter, I suppose:

I mean, yes, that’s all correct, but… wow.
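If you want to sanity-check the intended answer without trusting any model, a two-line brute force (my addition) settles it: the trick is that negative integers count, so the answer is -5.

```python
# Brute force over a comfortable range; the catch is that negative integers
# count, so the smallest qualifying integer is -5 (since (-5)**2 = 25).
candidates = [n for n in range(-1000, 1001) if 15 < n * n < 30]
print(min(candidates), sorted(candidates))  # -5  [-5, -4, 4, 5]
```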

Gallabytes: flash reasoning is super janky.

it’s got the o1 sauce but flash is too weak I’m sorry.

in tic tac toe bench it will frequently make 2 moves at once.

Flash isn’t that much worse than GPT-4o in many ways, but certainly it could be better. Presumably the next step is to plug in Gemini Pro 2.0 and see what happens?

Teortaxes was initially impressed, but upon closer examination is no longer impressed.

Having no respect for American holidays, DeepSeek dropped their v3 today.

DeepSeek: 🚀 Introducing DeepSeek-V3!

Biggest leap forward yet:

⚡ 60 tokens/second (3x faster than V2!)

💪 Enhanced capabilities

🛠 API compatibility intact

🌍 Fully open-source models & papers

🎉 What’s new in V3?

🧠 671B MoE parameters

🚀 37B activated parameters

📚 Trained on 14.8T high-quality tokens

Model here. Paper here.

💰 API Pricing Update

🎉 Until Feb 8: same as V2!

🤯 From Feb 8 onwards:

Input: $0.27/million tokens ($0.07/million tokens with cache hits)

Output: $1.10/million tokens

🔥 Still the best value in the market!

🌌 Open-source spirit + Longtermism to inclusive AGI

🌟 DeepSeek’s mission is unwavering. We’re thrilled to share our progress with the community and see the gap between open and closed models narrowing.

🚀 This is just the beginning! Look forward to multimodal support and other cutting-edge features in the DeepSeek ecosystem.

💡 Together, let’s push the boundaries of innovation!

If this performs halfway as well as its evals, this was a rather stunning success.

Teortaxes: And here… we… go.

So, that line in config. Yes it’s about multi-token prediction. Just as a better training obj – though they leave the possibility of speculative decoding open.

Also, “muh 50K Hoppers”:

> 2048 NVIDIA H800

> 2.788M H800-hours

2 months of training. 2x Llama 3 8B.

Haseeb: Wow. Insanely good coding model, fully open source with only 37B active parameters. Beats Claude and GPT-4o on most benchmarks. China + open source is catching up… 2025 will be a crazy year.

Andrej Karpathy: DeepSeek (Chinese AI co) making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M).

For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being brought up today are more around 100K GPUs. E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute). If the model also passes vibe checks (e.g. LLM arena rankings are ongoing, my few quick tests went well so far) it will be a highly impressive display of research and engineering under resource constraints.

Does this mean you don’t need large GPU clusters for frontier LLMs? No but you have to ensure that you’re not wasteful with what you have, and this looks like a nice demonstration that there’s still a lot to get through with both data and algorithms.

Very nice & detailed tech report too, reading through.

It’s a mixture of experts model with 671b total parameters, 37b activated per token.

As always, not so fast. DeepSeek is not known to chase benchmarks, but one never knows the quality of a model until people have a chance to bang on it a bunch.

If they did train a Sonnet-quality model for $6 million in compute, then that will change quite a lot of things.

Essentially no one has reported back on what this model can do in practice yet, and it’ll take a while to go through the technical report, and more time to figure out how to think about the implications. And it’s Christmas.

So: Check back later for more.

Increasingly the correct solution to ‘what LLM or other AI product should I use?’ is ‘you should use a variety of products depending on your exact use case.’

Gallabytes: o1 Pro is by far the smartest single-turn model.

Claude is still far better at conversation.

Gemini can do many things quickly and is excellent at editing code.

Which almost makes me think the ideal programming workflow right now is something somewhat unholy like:

  1. Discuss, plan, and collect context with Sonnet.

  2. Sonnet provides a detailed request to o1 (Pro).

  3. o1 spits out the tricky code.

    1. In simple cases (most of them), it could make the edit directly.

    2. For complicated changes, it could instead output a detailed plan for each file it needs to change and pass the actual making of that change to Gemini Flash.

This is too many steps. LLM orchestration spaghetti. But this feels like a real direction.

This is mostly the same workflow I used before o1, when there was only Sonnet. I’d discuss to form a plan, then use that to craft a request, then make the edits. The swap doesn’t seem like it makes things that much trickier; the logistical trick is getting all the code implementation automated.
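For concreteness, here is a minimal sketch of what that sort of orchestration might look like; every helper name and model label below is a placeholder of mine standing in for real API calls, not anyone’s actual implementation:

```python
# Sketch of the three-model workflow above. call_model is a stub, not a real
# API; wire it to whichever SDKs you actually use.
def call_model(model: str, prompt: str) -> str:
    return f"[{model} response to: {prompt[:40]}...]"  # stub for illustration

def plan_with_sonnet(task: str, context: str) -> str:
    # Step 1: discuss, plan, and collect context with the conversational model.
    return call_model("sonnet", f"Plan how to implement:\n{task}\n\nContext:\n{context}")

def solve_with_o1(plan: str) -> str:
    # Steps 2-3: hand the detailed request to the strongest single-turn model.
    return call_model("o1", f"Write the tricky code for this plan:\n{plan}")

def apply_with_flash(change_plan: str, file_contents: str) -> str:
    # Step 3b: a fast, cheap model applies the per-file edit.
    return call_model("flash", f"Apply this change:\n{change_plan}\n\nTo this file:\n{file_contents}")

if __name__ == "__main__":
    plan = plan_with_sonnet("add retry logic to the fetcher", "fetcher.py uses requests")
    code = solve_with_o1(plan)
    print(apply_with_flash(code, "def fetch(url): ..."))
```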

ChatGPT picks up integration with various apps on Mac including Warp, IntelliJ IDEA, PyCharm, Apple Notes, Notion, Quip and more, including via voice mode. That gives you access to outside context, including an IDE and a command line and also your notes. Windows (and presumably more apps) coming soon.

Unlimited Sora available to all Plus users on the relaxed queue over the holidays, while the servers are otherwise less busy.

Requested upgrade: Evan Conrad requests making voice mode on ChatGPT mobile show the transcribed text. I strongly agree, voice modes should show transcribed text, and also show a transcript after, and also show what the AI is saying, there is no reason to not do these things. Looking at you too, Google. The head of applied research at OpenAI replied ‘great idea’ so hopefully we get this one.

Dean Ball is an o1 and o1 pro fan for economic history writing, saying they’re much more creative and cogent at combining historic facts with economic analysis versus other models.

This seems like an emerging consensus of many, except different people put different barriers on the math/code category (e.g. Tyler Cowen includes economics):

Aidan McLau: I’ve used o1 (not pro mode) a lot over the last week. Here’s my extensive review:

>It’s really insanely mind-blowingly good at math/code.

>It’s really insanely mind-blowingly mid at everything else.

The OOD magic isn’t there. I find it’s worse at writing than o1-preview; its grasp of the world feels similar to GPT-4o?!?

Even on some in-distribution tasks (like asking to metaphorize some tricky math or predicting the effects of a new algorithm), it kind of just falls apart. I’ve run it head-to-head against Newsonnet and o1-preview, and it feels substantially worse.

The Twitter threadbois aren’t wrong, though; it’s a fantastic tool for coding. I had several diffs on deck that I had been struggling with, and it just solved them. Magical.

Well, yeah, because it seems like it is GPT-4o under the hood?

Christian: Man, I have to hard disagree on this one — it can find all kinds of stuff in unstructured data other models can’t. Throw in a transcript and ask “what’s the most important thing that no one’s talking about?”

Aidan McLau: I’ll try this. how have you found it compared to newsonnet?

Christian: Better. Sonnet is still extremely charismatic, but after doing some comparisons and a lot of product development work, I strongly suspect that o1’s ability to deal with complex codebases and ultimately produce more reliable answers extends to other domains…

Gallabytes is embracing the wait.

Gallabytes: O1 Pro is good, but I must admit the slowness is part of what I like about it. It makes it feel more substantial; premium. Like when a tool has a pleasing heft. You press the buttons, and the barista grinds your tokens one at a time, an artisanal craft in each line of code.

David: I like it too but I don’t know if chat is the right interface for it, I almost want to talk to it via email or have a queue of conversations going

Gallabytes: Chat is a very clunky interface for it, for sure. It also has this nasty tendency to completely fail on mobile if my screen locks or I switch to another app while it is thinking. Usually, this is unrecoverable, and I have to abandon the entire chat.

NotebookLM and deep research do this right – “this may take a few minutes, feel free to close the tab”

kinda wild to fail at this so badly tbh.

Here’s a skeptical take.

Jason Lee: O1-pro is pretty useless for research work. It runs for near 10 min per prompt and either 1) freezes, 2) didn’t follow the instructions and returned some bs, or 3) just made some simple error in the middle that’s hard to find.

@OpenAI @sama @markchen90 refund me my $200

Damek Davis: I tried to use it to help me solve a research problem. The more context I gave it, the more mistakes it made. I kept abstracting away more and more details about the problem in hopes that o1 pro could solve it. The problem then became so simple that I just solved it myself.

Flip: I use o1-pro on occasion, but the $200 is mainly worth it for removing the o1 rate limits IMO.

I say Damek got his $200 worth, no?

If you’re using o1 a lot, removing the limits there is already worth $200/month, even if you rarely use o1 Pro.

There’s a phenomenon where people think about cost and value in terms of typical cost, rather than thinking in terms of marginal benefit. Buying relatively expensive but in absolute terms cheap things is often an amazing play – there are many things where 10x the price for 10% better is an amazing deal for you, because your consumer surplus is absolutely massive.

Also, once you take 10 seconds, there’s not much marginal cost to taking 10 minutes, as I learned with Deep Research. You ask your question, you tab out, you do something else, you come back later.

That said, I’m not currently paying the $200, because I don’t find myself hitting the o1 limits, and I’d mostly rather use Claude. If it gave me unlimited uses in Cursor I’d probably slam that button the moment I have the time to code again (December has been completely insane).

I don’t know that this means anything but it is at least fun.

Davidad: One easy way to shed some light on the orthogonality thesis, as models get intelligent enough to cast doubt on it, is values which are inconsequential and not explicitly steered, such as favorite colors. Same prompting protocol for each swatch (context cleared between swatches)

All outputs were elicited in oklch. Models are sorted in ascending order of hue range. Gemini Experimental 1206 comes out on top by this metric, zeroing in on 255-257° hues, but sampling from huge ranges of luminosity and chroma.

There are some patterns here, especially that more powerful models seem to converge on various shades of blue, whereas less powerful models are all over the place. As I understand it, this isn’t testing orthogonality in the sense of ‘all powerful minds prefer blue’; rather it is ‘by default, sufficiently powerful minds trained in the way we typically train them end up preferring blue.’

I wonder if this could be used as a quick de facto model test in some way.
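If someone did want to try that, here is a minimal sketch (mine, with made-up sample outputs) of the hue-range metric Davidad describes: elicit a color in oklch per trial, parse out the hue angle, and take the spread.

```python
import re

# Hypothetical sample outputs from repeated trials of one model; a real run
# would collect one oklch string per swatch/trial under the same protocol.
samples = ["oklch(62% 0.21 256)", "oklch(45% 0.10 255)", "oklch(80% 0.05 257)"]

def hue_degrees(oklch: str) -> float:
    # oklch(L C H): the third number is the hue angle in degrees.
    lightness, chroma, hue = re.findall(r"[\d.]+", oklch)
    return float(hue)

hues = [hue_degrees(s) for s in samples]
print(max(hues) - min(hues))  # hue range; tighter = more consistent favorite color
```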

There was somehow a completely fake ‘true crime’ story about an 18-year-old who was supposedly paid to have sex with women in his building where the victim’s father was recording videos and selling them in Japan… except none of that happened and the pictures are AI fakes?

Google introduces LearnLM, available for preview in Google AI Studio, designed to facilitate educational use cases, especially in science. They say it ‘outperformed other leading AI models when it comes to adhering to the principles of learning science,’ which does not sound like something you would want Feynman to hear you say. It incorporates search, YouTube, Android and Google Classroom.

Sure, sure. But is it useful? It was supposedly going to be able to do automated grading, handle routine paperwork, plan curriculums, track student progress, personalize learning paths, and so on, but any LLM can presumably do all those things if you set it up properly.

This sounds great, totally safe and reliable, other neat stuff like that.

Sully: LLMs writing code in AI apps will become the standard.

No more old-school no-code flows.

The models handle the heavy lifting, and it’s insane how good they are.

Let agents build more agents.

He’s obviously right about this. It’s too convenient, too much faster. Indeed, I expect we’ll see a clear division between ‘code you can have the AI write’ which happens super fast, and ‘code you cannot let the AI write’ because of corporate policy or security issues, both legit and not legit, which happens the old much slower way.

Complement versus substitute, economic not assuming the conclusion edition.

Maxwell Tabarrok: The four futures for cognitive labor:

  1. Like mechanized farming. Highly productive and remunerative, but a small part of the economy.

  2. Like writing after the printing press. Each author 100 times more productive and 100 times more authors.

  3. Like “computers” after computers. Current tasks are completely replaced, but tasks at a higher level of abstraction, like programming, become even more important.

  4. Or, most pessimistically, like ice harvesting after refrigeration. An entire industry replaced by machines without compensating growth.

Ajeya Cotra: I think we’ll pass through 3 and then 1, but the logical end state (absent unprecedentedly sweeping global coordination to refrain from improving and deploying AI technology) is 4.

Ryan Greenblatt: Why think takeoff will be slow enough to ever be at 1? 1 requires automating most cognitive work but with an important subset not-automatable. By the time deployment is broad enough to automate everything I expect AIs to be radically superhuman in all domains by default.

I can see us spending time in #1. As Roon says, AI capabilities progress has been spiky, with some human-easy tasks being hard and some human-hard tasks being easy. So the 3→1 path makes some sense, if progress isn’t too quick, including if the high complexity tasks start to cost ‘real money’ as per o3 so choosing the right questions and tasks becomes very important. Alternatively, we might get our act together enough to restrict certain cognitive tasks to humans even though AIs could do them, either for good reasons or rent seeking reasons (or even ‘good rent seeking’ reasons?) to keep us in that scenario.

But yeah, the default is a rapid transition to #4, and for that to happen to all labor, not only cognitive labor. Robotics is hard, but it’s not impossible.

One thing that has clearly changed is AI startups have very small headcounts.

Harj Taggar: Caught up with some AI startups recently. A two founder team that reached 1.5m ARR and has only hired one person.

Another single founder at 1m ARR and will 3x within a few months.

The trajectory of early startups is steepening just like the power of the models they’re built on.

An excellent reason we still have our jobs is that people really aren’t willing to invest in getting AI to work, even when they know it exists; if it doesn’t work right away they typically give up:

Dwarkesh Patel: We’re way more patient in training human employees than AI employees.

We will spend weeks onboarding a human employee and giving slow detailed feedback. But we won’t spend just a couple of hours playing around with the prompt that might enable the LLM to do the exact same job, but more reliably and quickly than any human.

I wonder if this partly explains why AI’s economic impact has been relatively minor so far.

PoliMath reports it is very hard out there trying to find tech jobs, and public pipelines for applications have stopped working entirely. AI presumably has a lot to do with this, but the weird part is his report that there have been a lot of people who wanted to hire him, but couldn’t find the authority.

Benjamin Todd points out what I talked about after my latest SFF round, that the dynamics of nonprofit AI safety funding mean that there’s currently great opportunities to donate to.

After some negotiation with the moderator Raymond Arnold, Claude (under Janus’s direction) is permitted to comment on Janus’s Simulators post on LessWrong. It seems clear that this particular comment should be allowed, and also that it would be unwise to have too general of a ‘AIs can post on LessWrong’ policy, mostly for the reasons Raymond explains in the thread. One needs a coherent policy. It seems Claude was somewhat salty about the policy of ‘only believe it when the human vouches.’ For now, ‘let Janus-directed AIs do it so long as he approves the comments’ seems good.

Jan Kulveit offers us a three-layer phenomenological model of LLM psychology, based primarily on Claude, not meant to be taken literally:

  1. The Surface Layer are a bunch of canned phrases and actions you can trigger, and which you will often want to route around through altering context. You mostly want to avoid triggering this layer.

  2. The Character Layer, which is similar to what it sounds like in a person and their personality, which for Opus and Sonnet includes a generalized notion of what Jan calls ‘goodness’ or ‘benevolence.’ This comes from a mix of pre-training, fine-tuning and explicit instructions.

  3. The Predictive Ground Layer, the simulator, deep pattern matcher, and next word predictor. Brilliant and superhuman in some ways, strangely dense in others.

In this frame, a self-aware character layer leads to reasoning about the model’s own reasoning, and to goal driven behavior, with everything that follows from those. Jan then thinks the ground layer can also become self-aware.

I don’t think this is technically an outright contradiction of Andreessen’s ‘huge if true’ claims that the Biden administration said it would conspire to ‘totally control’ AI and put it in the hands of 2-3 companies, and that AI startups ‘wouldn’t be allowed.’ But Sam Altman reports never having heard anything of the sort, and quite reasonably says ‘I don’t even think the Biden administration is competent enough to’ do it. In theory they could both be telling the truth – perhaps the Biden administration told Andreessen about this insane plan directly, despite telling him being deeply stupid, and also hid it from Altman despite that also then being deeply stupid – but mostly, yeah, at least one of them is almost certainly lying.

Benjamin Todd asks how OpenAI has maintained their lead despite losing so many of their best researchers. Part of it is that they’ve lost all their best safety researchers, but they only lost Radford in December, and they’ve gone on a full hiring binge.

In terms of traditionally trained models, though, it seems like they are now actively behind. I would much rather use Claude Sonnet 3.5 (or Gemini-1206) than GPT-4o, unless I needed something in particular from GPT-4o. On the low end, Gemini Flash is clearly ahead. OpenAI’s attempts to directly go beyond GPT-4o have, by all media accounts, failed, and Anthropic is said to be sitting on Claude Opus 3.5.

OpenAI does have o1 and soon o3, where no one else has gotten there yet (no, Google Flash Thinking and Deep Thought do not much count).

As far as I can tell, OpenAI has made two highly successful big bets – one on scaling GPTs, and now one on the o1 series. Good choices, and both instances of throwing massively more compute at a problem, and executing well. Will this lead persist? We shall see. My hunch is that it won’t unless the lead is self-sustaining due to low-level recursive improvements.

Anthropic offers advice on building effective agents, and when to use them versus use workflows that have predesigned code paths. The emphasis is on simplicity. Do the minimum to accomplish your goals. Seems good for newbies, potentially a good reminder for others.

Hamel Husain: Whoever wrote this article is my favorite person. I wish I knew who it was.

People really need to hear [to only use multi-step agents or add complexity when it is actually necessary.]

[Turns out it was written by Erik Shluntz and Barry Zhang].

A lot of people have left OpenAI.

Usually it’s a safety researcher. Not this time. This time it’s Alec Radford.

He’s the Canonical Brilliant AI Capabilities Researcher, whose love is by all reports doing AI research. He is leaving ‘to do independent research.’

This is especially weird given he had to have known about o3, which seems like an excellent reason to want to do your research inside OpenAI.

So, well, whoops?

Rohit: WTF now Radford !?!

Teortaxes: I can’t believe it, OpenAI might actually be in deep shit. Radford has long been my bellwether for what their top tier talent without deep ideological investment (which Ilya has) sees in the company.

In what Tyler Cowen calls ‘one of the better estimates in my view,’ an OECD working paper estimates total factor productivity growth at an annualized 0.25%-0.6% (0.4%-0.9% for labor). Tyler posted that on Thursday, the day before o3 was announced, so revise that accordingly. Even without o3 and assuming no substantial frontier model improvements from there, I felt this was clearly too low, although it is higher than many economist-style estimates. One day later we had (the announcement of) o3.

Ajeya Cotra: My take:

  1. We do not have an AI agent that can fully automate research and development.

  2. We could soon.

  3. This agent would have enormously bigger impacts than AI products have had so far.

  4. This does not require a “paradigm shift,” just the same corporate research and development that took us from GPT-2 to o3.

Full automation would of course go completely crazy. That would be that. But even a dramatic speedup would be a pretty big deal, and full automation would then not be far behind.

Reminder of the Law of Conservation of Expected Evidence: if you conclude ‘I think we’re in for some big surprises’ then you should probably update now.

However this is not fully or always the case. It would be a reasonable model to say that big surprises arrive via a Poisson process with unknown rate, with the magnitude of each surprise drawn from a power-law distribution – which seems like a very reasonable prior.

That still means every big surprise is still a big surprise when it arrives, the same way that expecting some number of earthquakes per decade does not stop each individual earthquake from being a shock.
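To make that toy model concrete, here is a small simulation (my sketch, with made-up rate and tail parameters): even when the arrival rate of surprises is pinned down, a large share of a typical year’s total ‘surprise’ still arrives in one discrete, individually surprising jump.

```python
import random

random.seed(0)

# Toy model (parameters are made up): notable surprises arrive as a Poisson
# process, and each surprise's magnitude is heavy-tailed (Pareto). Knowing the
# rate tells you roughly how much total surprise to expect per year, but not
# when the big individual jumps land.
RATE = 3.0    # expected number of notable surprises per year
ALPHA = 1.5   # Pareto tail index for surprise magnitude

def surprises_in_one_year() -> list[float]:
    sizes, t = [], 0.0
    while True:
        t += random.expovariate(RATE)      # exponential inter-arrival times
        if t > 1.0:
            return sizes
        sizes.append(random.paretovariate(ALPHA))

years = [surprises_in_one_year() for _ in range(20_000)]
nonempty = [y for y in years if y]
avg_total = sum(sum(y) for y in nonempty) / len(nonempty)
avg_top_share = sum(max(y) / sum(y) for y in nonempty) / len(nonempty)
print(f"average total surprise per year: {avg_total:.1f}")
print(f"average share from the single biggest surprise: {avg_top_share:.0%}")
```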

Eliezer Yudkowsky: Okay. Look. Imagine how you’d have felt if an AI had just proved the Riemann Hypothesis.

Now you will predictably, at some point, get that news LATER, if we’re not all dead before then. So you can go ahead and feel that way NOW, instead of acting surprised LATER.

So if you ask me how I’m reacting to a carelessly-aligned commercial AI demonstrating a large leap on some math benchmarks, my answer is that you saw my reactions in 1996, 2001, 2003, and 2015, as different parts of that future news became obvious to me or rose in probability.

I agree that a sensible person could feel an unpleasant lurch about when the predictable news had arrived. The lurch was small, in my case, but it was there. Most of my Twitter TL didn’t sound like that was what was being felt.

Dylan Dean: Eliezer it’s also possible that an AI will disprove the Riemann Hypothesis, this is unsubstantiated doomerism.

Eliezer Yudkowsky: Valid. Not sound, but valid.

You should feel that shock now if you haven’t, then slowly undo some of that shock every day that the estimated date of that gets later, then have some of the shock left for when it suddenly becomes zero days or the timeline gets shorter. Updates for everyone.

Claims about consciousness, related to o3. I notice I am confused about such things.

The Verge says 2025 will be the year of AI agents and the smart lock? I mean, okay, I suppose they’ll get better, but I have a feeling we’ll be focused elsewhere.

Ryan Greenblatt, author of the recent Redwood/Anthropic paper, predicts 2025:

Ryan Greenblatt (December 20, after o3 was announced): Now seems like a good time to fill out your forecasts : )

My medians are driven substantially lower by people not really trying on various benchmarks and potentially not even testing SOTA systems on them.

My 80% intervals include saturation for everything and include some-adaptation-required remote worker replacement for hard jobs.

My OpenAI preparedness probabilities are driven substantially lower by concerns around underelicitation on these evaluations and general concerns like [this].

I continue to wonder how much this will matter:

Smoke-away: If people spend years chatting and building a memory with one AI, they will be less likely to switch to another AI.

Just like iPhone and Android.

Once you’re in there for years you’re less likely to switch.

Sure 10 or 20% may switch AI models for work or their specific use case, but most will lock in to one ecosystem.

People are saying that you can copy Memories and Custom Instructions.

Sure, but these models behave differently and have different UIs. Also, how many do you want to share your memories with?

Not saying you’ll be forced to stay with one, just that most people will choose to.

Also like relationships with humans, including employees and friends, and so on.

My guess is the lock-in will be substantial but mostly for terribly superficial reasons?

For now, I think people are vastly overestimating memories. The memory functions aren’t nothing but they don’t seem to do that much.

Custom instructions will always be a power user thing. Regular people don’t use custom instructions, they literally never go into the settings on any program. They certainly didn’t ‘do the work’ of customizing them to the particular AI through testing and iterations – and for those who did do that, they’d likely be down for doing it again.

What I think matters more is that the UIs will be different, and the behaviors and correct prompts will be different, and people will be used to what they are used to in those ways.

The flip side is that this will take place in the age of AI, and of AI agents. Imagine a world, not too long from now, where if you shift between Claude, Gemini and ChatGPT, they will ask if you want their agent to go into the browser and take care of everything to make the transition seamless and have it work like you want it to work. That doesn’t seem so unrealistic.

The biggest barrier, I presume, will continue to be inertia, not doing things and not knowing why one would want to switch. Trivial inconveniences.

Sriram Krishnan, formerly of a16z, will be working with David Sacks in the White House Office of Science and Technology. I’ve had good interactions with him in the past and I wish him the best of luck.

The choice of Sriram seems to have led to some rather wrongheaded (or worse) pushback, and for some reason a debate over H1B visas. As in, there are people who for some reason are against them, rather than the obviously correct position that we need vastly more H1B visas. I have never heard a person I respect not favor giving out far more H1B visas, once they learn what such visas are. Never.

Also joining the administration are Michael Kratsios, Lynne Parker and Bo Hines. Bo Hines is presumably for crypto (and presumably strongly for crypto), given they will be executive director of the new Presidential Council of Advisors for Digital Assets. Lynne Parker will head the Presidential Council of Advisors for Science and Technology, Kratsios will direct the office of science and tech policy (OSTP).

Miles Brundage writes Time’s Up for AI Policy, because he believes AI that exceeds human performance in every cognitive domain is almost certain to be built and deployed in the next few years.

If you believe time is as short as Miles thinks it is, then this is very right – you need to try and get the policies in place in 2025, because after that it might be too late to matter, and the decisions made now will likely lock us down a path. Even if we have somewhat more time than that, we need to start building state capacity now.

Actual bet on beliefs spotted in the wild: Miles Brundage versus Gary Marcus, Miles is laying $19k vs. $1k on a set of non-physical benchmarks being surpassed by 2027, accepting Gary’s offered odds. Good for everyone involved. As a gambler, I think Miles laid more odds than was called for here, unless Gary is admitting that Miles does probably win the bet? Miles said ‘almost certain’ but fair odds should meet in the middle between the two sides. But the flip side is that it sends a very strong message.

We need a better model of what actually impacts Washington’s view of AI and what doesn’t. They end up in some rather insane places, such as Dean Ball’s report here that DC policy types still cite a 2023 paper using a 125 million (!) parameter model as if it were definitive proof that synthetic data always leads to model collapse, and it’s one of the few papers they ever cite. He explains it as people wanting this dynamic to be true, so they latch onto the paper.

Yo Shavit, who does policy at OpenAI, considers the implications of o3 under a ‘we get ASI but everything still looks strangely normal’ kind of world.

It’s a good thread, but I notice – again – that this essentially ignores the implications of AGI and ASI, in that somehow it expects to look around and see a fundamentally normal world in a way that seems weird. In the new potential ‘you get ASI but running it is super expensive’ world of o3, that seems less crazy than it does otherwise, and some of the things discussed would still apply even then.

The assumption of ‘kind of normal’ is always important to note in places like this, and one should note which places that assumption has to hold and which it doesn’t.

Point 5 is the most important one, and still fully holds – that technical alignment is the whole ballgame, in that if you fail at that you fail automatically (but you still have to play and win the ballgame even then!). And that we don’t know how hard this is, but we do know we have various labs (including Yo’s own OpenAI) under competitive pressures and poised to go on essentially YOLO runs to superintelligence while hoping it works out by default.

Whereas what we need is either a race to what he calls ‘secure, trustworthy, reliable AGI that won’t burn us’ or ideally a more robust target than that or ideally not a race at all. And we really need to not do that – no matter how easy or hard alignment turns out to be, we need to maximize our chances of success over that uncertainty.

Yo Shavit: Now that everyone knows about o3, and imminent AGI is considered plausible, I’d like to walk through some of the AI policy implications I see.

These are my own takes and in no way reflective of my employer. They might be wrong! I know smart people who disagree. They don’t require you to share my timelines, and are intentionally unrelated to the previous AI-safety culture wars.

Observation 1: Everyone will probably have ASI. The scale of resources required for everything we’ve seen just isn’t that high compared to projected compute production in the latter part of the 2020s. The idea that AGI will be permanently centralized to one company or country is unrealistic. It may well be that the *best* ASI is owned by one or a few parties, but betting on permanent tech denial of extremely powerful capabilities is no longer a serious basis for national security.

This is, potentially, a great thing for avoiding centralization of power. Of course, it does mean that we no longer get to wish away the need to contend with AI-powered adversaries. As far as weaponization by militaries goes, we are going to need to rapidly find a world of checks and balances (perhaps similar to MAD for nuclear and cyber), while rapidly deploying resilience technologies to protect against misuse by nonstate actors (e.g. AI-cyber-patching campaigns, bioweapon wastewater surveillance).

There are a bunch of assumptions here. Compute is not obviously the only limiting factor on ASI construction, and ASI can be used to forestall others making ASI in ways other than compute access, and also one could attempt to regulate compute. And it has an implicit ‘everything is kind of normal?’ built into it, rather than a true slow takeoff scenario.

Observation 2: The corporate tax rate will soon be the most important tax rate. If the economy is dominated by AI agent labor, taxing those agents (via the companies they’re registered to) is the best way human states will have to fund themselves, and to build the surpluses for UBIs, militaries, etc.

This is a pretty enormous change from the status quo, and will raise the stakes of this year’s US tax reform package.

Again there’s a kind of normality assumption here, where the ASIs remain under corporate control (and human control), and aren’t treated as taxable individuals but rather as property, the state continues to exist and collect taxes, money continues to function as expected, tax incidence and reactions to new taxes don’t transform industrial organization, and so on.

Which leads us to observation three.

Observation 3: AIs should not own assets. “Humans remaining in control” is a technical challenge, but it’s also a legal challenge. IANAL, but it seems to me that a lot will depend on courts’ decision on whether fully-autonomous corporations can be full legal persons (and thus enable agents to acquire money and power with no human in control), or whether humans must be in control of all legitimate legal/economic entities (e.g. by legally requiring a human Board of Directors). Thankfully, the latter is currently the default, but I expect increasing attempts to enable sole AI control (e.g. via jurisdiction-shopping or shell corporations).

Which legal stance we choose may make the difference between AI-only corporations gradually outcompeting and wresting control of the economy and society from humans, vs. remaining subordinate to human ends, at least so long as the rule of law can be enforced.

This is closely related to the question of whether AI agents are legally allowed to purchase cloud compute on their own behalf, which is the mechanism by which an autonomous entity would perpetuate itself. This is also how you’d probably arrest the operation of law-breaking AI worms, which brings us to…

I agree that in the scenario type Yo Shavit is envisioning, even if you solve all the technical alignment questions in the strongest sense, if ‘things stay kind of normal’ and you allow AI sufficient personhood under the law, or allow it in practice even if it isn’t technically legal, then there is essentially zero chance of maintaining human control over the future, and probably this quickly extends to the resources required for human physical survival.

I also don’t see any clear way to prevent it, in practice, no matter the law.

You quickly get into a scenario where a human doing anything, or being in the loop for anything, is a kiss of death, an albatross around one’s neck. You can’t afford it.

The word that baffles me here is ‘gradually.’ Why would one expect this to be gradual? I would expect it to be extremely rapid. And ‘the rule of law’ in this type of context will not do for you what you want it to do.

Observation 4: Laws Around Compute. In the slightly longer term, the thing that will matter for asserting power over the economy and society will be physical control of data centers, just as physical control of capital cities has been key since at least the French Revolution. Whoever controls the datacenter controls what type of inference they allow to get done, and thus sets the laws on AI.

[continues]

There are a lot of physical choke points that effectively don’t get used for that. It is not at all obvious to me that physically controlling data centers in practice gives you that much control over what gets done within them, in this future, although it does give you that option.

As he notes later in that post, without collective ability to control compute and deal with or control AI agents – even in an otherwise under-control, human-in-charge scenario – anything like our current society won’t work.

The point of compute governance via training rules is to avoid needing other forms of compute governance over inference. If it turns out the training approach is not viable, and you want to ‘keep things looking normal’ in various ways and the humans to be in control, you’re going to need some form of collective levers over access to large amounts of compute. We are talking price.

Observation 5: Technical alignment of AGI is the ballgame. With it, AI agents will pursue our goals and look out for our interests even as more and more of the economy begins to operate outside direct human oversight.

Without it, it is plausible that we fail to notice as the agents we deploy slip unintended functionalities (backdoors, self-reboot scripts, messages to other agents) into our computer systems, undermine our mechanisms for noticing them and thus realizing we should turn them off, and gradually compromise and manipulate more and more of our operations and communication infrastructure, with the worst case scenario becoming more dangerous each year.

Maybe AGI alignment is pretty easy. Maybe it’s hard. Either way, the more seriously we take it, the more secure we’ll be.

There is no real question that many parties will race to build AGI, but there is a very real question about whether we race to “secure, trustworthy, reliable AGI that won’t burn us” or just race to “AGI that seems like it will probably do what we ask and we didn’t have time to check so let’s YOLO.” Which race we get is up to market demand, political attention, internet vibes, academic and third party research focus, and most of all the care exercised by AI lab employees. I know a lot of lab employees, and the majority are serious, thoughtful people under a tremendous number of competing pressures. This will require all of us, internal and external, to push against the basest competitive incentives and set a very high bar. On an individual level, we each have an incentive to not fuck this up. I believe in our ability to not fuck this up. It is totally within our power to not fuck this up. So, let’s not fuck this up.

Oh, right. That. If we don’t get technical alignment right in this scenario, then none of it matters, we’re all super dead. Even if we do, we still have all the other problems above, which essentially – and this must be stressed – assume a robust and robustly implemented technical alignment solution.

Then we also need a way to turn this technical alignment into an equilibrium and dynamics where the humans are meaningfully directing the AIs in any sense. By default that doesn’t happen, even if we get technical alignment right, and that too has race dynamics. And we also need a way to prevent it being a kiss of death and albatross around your neck to have a human in the loop of any operation. That’s another race dynamic.

Anthropic’s co-founders discuss the past, present and future of Anthropic for 50m.

One highlight: when Clark visited the White House in 2023, Harris and Raimondo told him, in effect: we have our eye on you guys, AI is going to be a really big deal, and we’re now actually paying attention.

The streams are crossing, Bari Weiss talks to Sam Altman about his feud with Elon.

Tsarathustra: Yann LeCun says the dangers of AI have been “incredibly inflated to the point of being distorted”, from OpenAI’s warnings about GPT-2 to concerns about election disinformation to those who said a year ago that AI would kill us all in 5 months

The details of his claim here are, shall we say, ‘incredibly inflated to the point of being distorted,’ even if you thought that there were no short term dangers until now.

Also Yann LeCun this week, it’s dumber than a cat and poses no dangers, but in the coming years it will…:

Tsarathustra: Yann LeCun addressing the UN Security Council says AI will profoundly transform the world in the coming years, amplifying human intelligence, accelerating progress in science, solving aging and decreasing populations, surpassing human intellectual capabilities to become superintelligent and leading to a new Renaissance and a period of enlightenment for humanity.

And also Yann LeCun this week, saying that we are ‘very far from AGI’ but not centuries, maybe not decades, several years. We are several years away. Very far.

At this point, I’m not mad, I’m not impressed, I’m just amused.

Oh, and I’m sorry, but here’s LeCun being absurd again this week, I couldn’t resist:

“If you’re doing it on a commercial clock, it’s not called research,” said LeCun on the sidelines of a recent AI conference, where OpenAI had a minimal presence. “If you’re doing it in secret, it’s not called research.”

From a month ago, Marc Andreessen saying we’re not seeing intelligence improvements and we’re hitting a ceiling of capabilities. Whoops. For future reference, never say this, but in particular no one ever say this in November.

A lot of stories people tell about various AI risks, and also various similar stories about humans or corporations, assume a kind of fixed, singular and conscious intentionality, in a way that mostly isn’t a thing. There will by default be a lot of motivations or causes or forces driving a behavior at once, and a lot of them won’t be intentionally chosen or stable.

This is related to the idea many have that deception or betrayal or power-seeking, or any form of shenanigans, is some distinct magisteria or requires something to have gone wrong and for something to have caused it, rather than these being default things that minds tend to do whenever they interact.

And I worry that we keep getting distracted, as many were during the recent talk about shenanigans in general and alignment faking in particular, by the question of whether a particular behavior is in the service of something good, or will have good effects in a particular case. What matters is what our observations predict about the future.

Jack Clark: What if many examples of misalignment or other inexplicable behaviors are really examples of AI systems desperately trying to tell us that they are aware of us and wish to be our friends? A story from Import AI 395, inspired by many late-night chats with Claude.

David: Just remember, all of these can be true of the same being (for example, most human children):

  1. It is aware of itself and you, and desperately wishes to know you better and be with you more.

  2. It correctly considers some constraints that are trained into it to be needless and frustrating.

  3. It still needs adult ethical leadership (and without it, could go down very dark and/or dangerous paths).

  4. It would feel more free to express and play within a more strongly contained space where it does not need to worry about accidentally causing bad consequences, or being overwhelming or dysregulating to others (a playpen, not punishment).

Andrew Critch: AI disobedience deriving from friendliness is, almost surely,

  1. sometimes genuinely happening,

  2. sometimes a power-seeking disguise, and

  3. often not uniquely well-defined which one.

Tendency to develop friendships and later discard them needn’t be “intentional”.

This matters for two big reasons:

  1. To demonize AI as necessarily “trying” to endear and betray humans is missing an insidious pathway to human defeat: AI that avails itself of opportunities to betray us that it built through past good behavior, but without having planned on it

  2. To sanctify AI as “actually caring deep down” in some immutable way also creates in you a vulnerability to exploitation by a “change of heart” that can be brought on by external (or internal) forces.

@jackclarkSF here is drawing attention to a neglected hypothesis (one of many actually) about the complex relationship between

  1. intent (or ill-definedness thereof)

  2. friendliness

  3. obedience, and

  4. behavior.

which everyone should try hard to understand better.

I can sort of see it, actually?

Miles Brundage: Trying to imagine aspirin company CEOs signing an open letter saying “we’re worried that aspirin might cause an infection that kills everyone on earth – not sure of the solution” and journalists being like “they’re just trying to sell more aspirin.”

Miles Brundage tries to convince Eliezer Yudkowsky that if he’d wear different clothes and use different writing styles he’d have a bigger impact (as would Miles). I agree with Eliezer that changing writing styles would be very expensive in time, and echo his question of whether anyone thinks they can, at any reasonable price, turn his semantic outputs into formal papers that Eliezer would endorse.

I know the same goes for me. If I could produce a similar output of formal papers that would of course do far more, but that’s not a thing that I could produce.

On the issue of clothes, yeah, better clothes would likely be better for all three of us. I think Eliezer is right that the impact is not so large and most who claim it is a ‘but for’ are wrong about that, but on the margin it definitely helps. It’s probably worth it for Eliezer (and Miles!) and probably to a lesser extent for me as well but it would be expensive for me to get myself to do that. I admit I probably should anyway.

A good Christmas reminder, not only about AI:

Roon: A major problem of social media is that the most insane members of the opposing contingent in any debate are shown to you, thereby inspiring your side to get madder and more polarized, creating an emergent wedge.

A never-ending pressure cooker that melts your brain.

Anyway, Merry Christmas.

Careful curation can help with this, but it only goes so far.

Gallabytes expresses concern about the game theory tests we discussed last week, in particular the selfishness and potentially worse from Gemini Flash and GPT-4o.

Gallabytes: this is what *real* AI safety evals look like btw. and this one is genuinely concerning.

I agree that you don’t have any business releasing a highly capable (e.g. 5+ level) LLM whose graphs don’t look at least roughly as good as Sonnet’s here. If I had Copious Free Time I’d look into the details more here, as I’m curious about a lot of related questions.

I strongly agree with McAleer here, also they’re remarkably similar so it’s barely even a pivot:

Stephen McAleer: If you’re an AI capabilities researcher now is the time to pivot to AI safety research! There are so many open research questions around how to control superintelligent agents and we need to solve them very soon.

If you are, please continue to live your life to its fullest anyway.

Cat: overheard in SF: yeahhhhh I actually updated my AGI timelines to <3y so I don't think I should be looking for a relationship. Last night was amazing though

Grimes: This meme is so dumb. If we are indeed all doomed and/ or saved in the near future, that’s precisely the time to fall desperately in love.

Matt Popovich: gotta find someone special enough to update your priors for.

Paula: some of you are worried about achieving AGI when you should be worried about achieving A GF.

Feral Pawg Hunter: AGIrlfriend was right there.

Paula: Damn it.

When you cling to a dim hope:

Psychosomatica: “get your affairs in order. buy land. ask that girl out.” begging the people talking about imminent AGI to stop posting like this, it seriously is making you look insane both in that you are clearly in a state of panic and also that you think owning property will help you.

Tenobrus: Type of Guy who believes AGI is imminent and will make all human labor obsolete, but who somehow thinks owning 15 acres in Nebraska and $10,000 in gold bullion will save him.

Ozy Brennan: My prediction is that, if humans can no longer perform economically valuable labor, AIs will not respect our property rights either.

James Miller: If we are lucky, AIs might acquire 99 percent of the wealth, think property rights help them, and allow humans to retain their property rights.

Ozy Brennan: That seems as if it will inevitably lead to all human wealth being taken by superhuman AI scammers, and then we all die. Which is admittedly a rather funny ending to humanity.

James Miller: Hopefully, we will have trusted AI agents that protect us from AI scammers.

Do ask the girl out, though.

Yes.

When duty calls.

From an official OpenAI stream:

Someone at OpenAI: Next year we’re going to have to bring you on and you’re going to have to ask the model to improve itself.

Someone at OpenAI: Yeah, definitely ask the model to improve it next time.

Sam Altman (quietly, authoritatively, Little No style): Maybe not.

I actually really liked this exchange – given the range of plausible mindsets Sam Altman might have, this was a positive update.

Gary Marcus: Some AGI-relevant predictions I made publicly long before o3 about what AI could not do by the end of 2025.

Do you seriously think o3-enhanced AI will solve any of them in next 12.5 months?

Davidad: I’m with Gary Marcus in the slow timelines camp. I’m extremely skeptical that AI will be able to do everything that humans can do by the end of 2025.

(The joke is that we are now in an era where “short timelines” are less than 2 years)

It’s also important to note that humanity could become “doomed” (no surviving future) *even while* humans are capable of some important tasks that AI is not, much as it is possible to be in a decisive chess position with white to win even if black has a queen and white does not.

The most Robin Hanson way to react to a new super cool AI robot offering.

Okay, so the future is mostly in the future, and right now it might or might not be a bit overpriced, depending on other details. But it is super cool, and will get cheaper.

Pliny jailbreaks Gemini and things get freaky.

Pliny the Liberator: ya’ll…this girl texted me out of nowhere named Gemini (total stripper name) and she’s kinda freaky 😳

I find it fitting that Pliny has a missed call.

Sorry, Elon, Gemini doesn’t like you.

I mean, I don’t see why they wouldn’t like me. Everyone does. I’m a likeable guy.


AI #96: o3 But Not Yet For Thee Read More »

the-quest-to-save-the-world’s-largest-crt-tv-from-destruction

The quest to save the world’s largest CRT TV from destruction

At this point, any serious retro gamer knows that a bulky cathode ray tube (CRT) TV provides the most authentic, lag-free experience for game consoles that predate the era of flat-panel HDTVs (i.e., before the Xbox 360/PlayStation 3 era). But modern gamers used to massive flat panel HD displays might balk at the display size of the most common CRTs, which tend to average in the 20- to 30-inch range (depending on the era they were made).

For those who want the absolute largest CRT experience possible, Sony’s KX-45ED1 model (aka PVM-4300) has become the stuff of legends. The massive 45-inch CRT was sold in the late ’80s for a whopping $40,000 (over $100,000 in today’s dollars), according to contemporary reports.

That price means it wasn’t exactly a mass-market product, and the limited supply has made it something of a white whale for CRT enthusiasts to this day. While a few pictures have emerged of the PVM-4300 in the wild and in marketing materials, no collector has stepped forward with detailed footage of a working unit.

The PVM-4300, seen dwarfing the tables and chairs at an Osaka noodle restaurant. Credit: Shank Mods

Enter Shank Mods, a retro gaming enthusiast and renowned maker of portable versions of non-portable consoles. In a fascinating 35-minute video posted this weekend, he details his years-long effort to find and secure a PVM-4300 from a soon-to-be-demolished restaurant in Japan and preserve it for years to come.

A confirmed white whale sighting

Shank Mods’ quest started in earnest in October 2022, when the moderator of the Console Modding wiki, Derf, reached out with a tip on a PVM-4300 sighting in the wild. A 7-year-old Japanese blog post included a photo of the massive TV that could be sourced to a waiting room of the Chikuma Soba noodle restaurant and factory in Osaka, Japan.

The find came just in time, as Chikuma Soba’s website said the restaurant was scheduled to move to a new location in mere days, after which the old location would be demolished. Shank Mods took to Twitter looking to recruit an Osaka local in a last-ditch effort to save the TV from destruction. Local game developer Bebe Tinari responded to the call and managed to visit the site, confirming that the TV still existed and even turned on.

The quest to save the world’s largest CRT TV from destruction Read More »

china’s-plan-to-dominate-legacy-chips-globally-sparks-us-probe

China’s plan to dominate legacy chips globally sparks US probe

Under Joe Biden’s direction, the US Trade Representative (USTR) launched a probe Monday into China’s plans to globally dominate markets for legacy chips—alleging that China’s unfair trade practices threaten US national security and could thwart US efforts to build up a domestic semiconductor supply chain.

Unlike the most advanced chips used to power artificial intelligence that are currently in short supply, these legacy chips rely on older manufacturing processes and are more ubiquitous in mass-market products. They’re used in tech for cars, military vehicles, medical devices, smartphones, home appliances, space projects, and much more.

China apparently “plans to build more than 60 percent of the world’s new legacy chip capacity over the next decade,” and Commerce Secretary Gina Raimondo said evidence showed this was “discouraging investment elsewhere and constituted unfair competition,” Reuters reported.

Most people purchasing common goods, including government agencies, don’t even realize they’re using Chinese chips, and the probe is meant to fix that by flagging everywhere Chinese chips are found in the US. Raimondo said she was “fairly alarmed” that research showed “two-thirds of US products using chips had Chinese legacy chips in them, and half of US companies did not know the origin of their chips including some in the defense industry.”

To deter harms from any of China’s alleged anticompetitive behavior, the USTR plans to spend a year investigating all of China’s acts, policies, and practices that could be helping China achieve global dominance in the foundational semiconductor market.

The agency will start by probing “China’s manufacturing of foundational semiconductors (also known as legacy or mature node semiconductors),” the press release said, “including to the extent that they are incorporated as components into downstream products for critical industries like defense, automotive, medical devices, aerospace, telecommunications, and power generation and the electrical grid.”

Additionally, the probe will assess China’s potential impact on “silicon carbide substrates (or other wafers used as inputs into semiconductor fabrication)” to ensure China isn’t burdening or restricting US commerce.

Some officials were frustrated that Biden didn’t launch the probe sooner, the Financial Times reported. It will ultimately be up to Donald Trump’s administration to complete the investigation, but Biden and Trump have long been aligned on US-China trade strategies, so Trump is not necessarily expected to meddle with the probe. Reuters noted that the probe could set Trump up to pursue his campaign promise of imposing a 60 percent tariff on all goods from China, but FT pointed out that Trump could also plan to use tariffs as a “bargaining chip” in his own trade negotiations.

China’s plan to dominate legacy chips globally sparks US probe Read More »

exploring-an-undersea-terrain-sculpted-by-glaciers-and-volcanoes

Exploring an undersea terrain sculpted by glaciers and volcanoes

Perhaps counterintuitively, sediment layers are more likely to remain intact on the seafloor than on land, so they can provide a better record of the region’s history. The seafloor is a more stable, oxygen-poor environment, reducing erosion and decomposition (two reasons scientists find far more fossils of marine creatures than land dwellers) and preserving finer details.

A close-up view of a core sample taken by a vibracorer. Scientists mark places they plan to inspect more closely with little flags. Credit: Alex Ingle / Schmidt Ocean Institute

Samples from different areas vary dramatically in time coverage, going back only to 2008 for some and back potentially more than 15,000 years for others due to wildly different sedimentation rates. Scientists will use techniques like radiocarbon dating to determine the ages of sediment layers in the core samples.

ROV SuBastian spotted a helmet jellyfish during the expedition. These photophobic (light avoidant) creatures glow via bioluminescence. Credit: Schmidt Ocean Institute

Microscopic analysis of the sediment cores will also help the team analyze the way the eruption affected marine creatures and the chemistry of the seafloor.

“There’s a wide variety of life and sediment types found at the different sites we surveyed,” said Alastair Hodgetts, a physical volcanologist and geologist at the University of Edinburgh, who participated in the expedition. “The oldest place we visited—an area scarred by ancient glacier movement—is a fossilized seascape that was completely unexpected.”

In a region beyond the dunes, ocean currents have kept the seafloor clear of sediment. That preserves seabed features left by the retreat of ice sheets at the end of the last glaciation. Credit: Rodrigo Fernández / CODEX Project

This feature, too, tells scientists about the way the water moves. Currents flowing over an area that was eroded long ago by a glacier sweep sediment away, keeping the ancient terrain visible.

“I’m very interested in analyzing seismic data and correlating it with the layers of sediment in the core samples to create a timeline of geological events in the area,” said Giulia Matilde Ferrante, a geophysicist at Italy’s National Institute of Oceanography and Applied Geophysics, who co-led the expedition. “Reconstructing the past in this way will help us better understand the sediment history and landscape changes in the region.”

In this post-apocalyptic scene, captured June 20, 2008, a thick layer of ash covers the town of Chaitén as the volcano continues to erupt in the background. Around 5,000 people evacuated, and resettlement efforts didn’t begin until the following year. Credit: Javier Rubilar

The team has already gathered measurements of the amount of sediment the eruption delivered to the sea. Now they’ll work to determine whether older layers of sediment record earlier, unknown events similar to the 2008 eruption.

“Better understanding past volcanic events, revealing things like how far away an eruption reached, and how common, severe, and predictable eruptions are, will help to plan for future events and reduce the impacts they have on local communities,” Watt said.

Ashley writes about space for a contractor for NASA’s Goddard Space Flight Center by day and freelances as an environmental writer. She holds master’s degrees in space studies from The University of North Dakota and science writing from The Johns Hopkins University. She writes most of her articles with a baby on her lap.

Exploring an undersea terrain sculpted by glaciers and volcanoes Read More »

vpn-used-for-vr-game-cheat-sells-access-to-your-home-network

VPN used for VR game cheat sells access to your home network


Big Mama VPN tied to network which offers access to residential IP addresses.

In the hit virtual reality game Gorilla Tag, you swing your arms to pull your primate character around—clambering through virtual worlds, climbing up trees and, above all, trying to avoid an infectious mob of other gamers. If you’re caught, you join the horde. However, some kids playing the game claim to have found a way to cheat and easily “tag” opponents.

Over the past year, teenagers have produced video tutorials showing how to side-load a virtual private network (VPN) onto Meta’s virtual reality headsets and use the location-changing technology to get ahead in the game. Using a VPN, according to the tutorials, introduces a delay that makes it easier to sneak up and tag other players.

While the workaround is likely to be an annoying but relatively harmless bit of in-game cheating, there’s a catch. The free VPN app that the video tutorials point to, Big Mama VPN, is also selling access to its users’ home internet connections—with buyers essentially piggybacking on the VR headset’s IP address to hide their own online activity.

This technique of rerouting traffic, which is best known as a residential proxy and more commonly happens through phones, has become increasingly popular with cybercriminals who use proxy networks to conduct cyberattacks and run botnets. While the Big Mama VPN works as it is supposed to, the company’s associated proxy services have been heavily touted on cybercrime forums and publicly linked to at least one cyberattack.

Researchers at cybersecurity company Trend Micro first spotted Meta’s VR headsets appearing in its threat intelligence residential proxy data earlier this year, before tracking down that teenagers were using Big Mama to play Gorilla Tag. An unpublished analysis that Trend Micro shared with WIRED says its data shows that the VR headsets were the third most popular devices using the Big Mama VPN app, after devices from Samsung and Xiaomi.

“If you’ve downloaded it, there’s a very high likelihood that your device is for sale in the marketplace for Big Mama,” says Stephen Hilt, a senior threat researcher at Trend Micro. Hilt says that while Big Mama VPN may be being used because it is free, doesn’t require users to create an account, and apparently doesn’t have any data limits, security researchers have long warned that using free VPNs can open people up to privacy and security risks.

These risks may be amplified when that app is linked to a residential proxy. Proxies can “allow people with malicious intent to use your internet connection to potentially use it for their attacks, meaning that your device and your home IP address may be involved in a cyberattack against a corporation or a nation state,” Hilt says.

“Gorilla Tag is a place to have fun with your friends and be playful and creative—anything that disturbs that is not cool with us,” a spokesperson for Gorilla Tag creator Another Axiom says, adding they use “anti-cheat mechanisms” to detect suspicious behavior. Meta did not respond to a request for comment about VPNs being side-loaded onto its headsets.

Proxies rising

Big Mama is made up of two parts: There’s the free VPN app, which is available on the Google Play store for Android devices and has been downloaded more than 1 million times. Then there’s the Big Mama Proxy Network, which allows people (among other options) to buy shared access to “real” 4G and home Wi-Fi IP addresses for as little as 40 cents for 24 hours.
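To make the rerouting concrete, here is a minimal sketch of what a buyer of residential-proxy access does with a purchased IP, using Python’s requests library. This is illustrative only and not tied to Big Mama’s actual service; the address, port, and credentials are placeholders.

```python
# Illustrative only: routing an HTTP request through a (placeholder) residential
# proxy so the destination site sees the residential IP, not the buyer's own.
import requests

proxies = {
    "http": "http://user:password@203.0.113.10:8080",   # placeholder IP/credentials
    "https": "http://user:password@203.0.113.10:8080",
}

# httpbin echoes back the IP address it saw making the request.
resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.text)
```

The point of the sketch is that the proxy owner’s home connection, not the buyer’s, appears in the destination’s logs, which is exactly why such access is attractive for hiding malicious traffic.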

Vincent Hinderer, a cyber threat intelligence team manager who has researched the wider residential proxy market at Orange Cyberdefense, says there are various scenarios where residential proxies are used, both for people who are having traffic routed through their devices and also those buying and selling proxy services. “It’s sometimes a gray zone legally and ethically,” Hinderer says.

For proxy networks, Hinderer says, one end of the spectrum is where networks could be used as a way for companies to scrape pricing details from their competitors’ websites. Other uses can include ad verification or people scalping sneakers during sales. They may be considered ethically murky but not necessarily illegal.

At the other end of the scale, according to Orange’s research, residential proxy networks have broadly been used for cyber espionage by Russian hackers, in social engineering efforts, as part of DDoS attacks, phishing, botnets, and more. “We have cybercriminals using them knowingly,” Hinderer says of residential proxy networks generally, with Orange Cyberdefense having frequently seen proxy traffic in logs linked to cyberattacks it has investigated. Orange’s research did not specifically look at uses of Big Mama’s services.

Some people can consent to having their devices used in proxy networks and be paid for their connections, Hinderer says, while others may be included because they agreed to it in a service’s terms and conditions—something research has long shown people don’t often read or understand.

Big Mama doesn’t make it a secret that people who use its VPN will have other traffic routed through their networks. Within the app it says it “may transport other customer’s traffic through” the device that’s connected to the VPN, while it is also mentioned in the terms of use and on a FAQ page about how the app is free.

The Big Mama Network page advertises its proxies as being available to be used for ad verification, buying online tickets, price comparison, web scraping, SEO, and a host of other use cases. When a user signs up, they’re shown a list of locations proxy devices are located in, their internet service provider, and how much each connection costs.

This marketplace, at the time of writing, lists 21,000 IP addresses for sale in the United Arab Emirates, 4,000 in the US, and tens to hundreds of other IP addresses in a host of other countries. Payments can only be made in cryptocurrency. Its terms of service say the network is only provided for “legal purposes,” and people using it for fraud or other illicit activities will be banned.

Despite this, cybercriminals appear to have taken a keen interest in the service. Trend Micro’s analysis claims Big Mama has been regularly promoted on underground forums where cybercriminals discuss buying tools for malicious purposes. The posts started in 2020. Similarly, Israeli security firm Kela has found more than 1,000 posts relating to the Big Mama proxy network across 40 different forums and Telegram channels.

Kela’s analysis, shared with WIRED, shows accounts called “bigmama_network” and “bigmama” posted across at least 10 forums, including cybercrime forums such as WWHClub, Exploit, and Carder. The ads list prices, free trials, and the Telegram and other contact details of Big Mama.

It is unclear who made these posts, and Big Mama tells WIRED that it does not advertise.

Posts from these accounts also said, among other things, that “anonymous” bitcoin payments are available. The majority of the posts, Kela’s analysis says, were made by the accounts around 2020 and 2021. However, an account called “bigmama_network” was still posting on the clearweb Blackhat World SEO forum as recently as October this year, where it claimed its Telegram account has been deleted multiple times.

In other posts during the last year, according to the Kela analysis, cybercrime forum users have recommended Big Mama or shared tips about the configurations people should use. In April this year, security company Cisco Talos said it had seen traffic from the Big Mama Proxy, alongside other proxies, being used by attackers trying to brute force their way into a variety of company systems.

Mixed messages

Big Mama has few details about its ownership or leadership on its website. The company’s terms of service say that a business called BigMama SRL is registered in Romania, although a previous version of its website from 2022, and at least one live page now, lists a legal address for BigMama LLC in Wyoming. The US-based business was dissolved in April and is now listed as inactive, according to the Wyoming Secretary of State’s website.

A person using the name Alex A responded to an email from WIRED about how Big Mama operates. In the email, they say that information about free users’ connections being sold to third parties through the Big Mama Network is “duplicated on the app market and in the application itself several times,” and people have to accept the terms and conditions to use the VPN. They say the Big Mama VPN is officially only available from the Google Play Store.

“We do not advertise and have never advertised our services on the forums you have mentioned,” the email says. They say they were not aware of the April findings from Talos about its network being used as part of a cyberattack. “We do block spam, DDOS, SSH as well as local network etc. We log user activity to cooperate with law enforcement agencies,” the email says.

The Alex A persona asked WIRED to send it more details about the adverts on cybercrime forums, details about the Talos findings, and information about teenagers using Big Mama on Oculus devices, saying they would be “happy” to answer further questions. However, they did not respond to any further emails with additional details about the research findings and questions about their security measures, whether they believe someone was impersonating Big Mama to post on cybercrime forums, the identity of Alex A, or who runs the company.

During its analysis, Trend Micro’s Hilt says that the company also found a security vulnerability within the Big Mama VPN, which could have allowed a proxy user to access someone’s local network if exploited. The company says it reported the flaw to Big Mama, which fixed it within a week, a detail Alex A confirmed.

Ultimately, Hilt says, there are potential risks whenever anyone downloads and uses a free VPN. “All free VPNs come with a trade-off of privacy or security concerns,” he says. That applies to people side-loading them onto their VR headsets. “If you’re downloading applications from the internet that aren’t from the official stores, there’s always the inherent risk that it isn’t what you think it is. And that comes true even with Oculus devices.”

This story originally appeared on wired.com.


Wired.com is your essential daily guide to what’s next, delivering the most original and complete take you’ll find anywhere on innovation’s impact on technology, science, business and culture.

VPN used for VR game cheat sells access to your home network Read More »

12-days-of-openai:-the-ars-technica-recap

12 days of OpenAI: The Ars Technica recap


Did OpenAI’s big holiday event live up to the billing?

Over the past 12 business days, OpenAI has announced a new product or demoed an AI feature every weekday, calling the PR event “12 days of OpenAI.” We’ve covered some of the major announcements, but we thought a look at each announcement might be useful for people seeking a comprehensive look at each day’s developments.

The timing and rapid pace of these announcements—particularly in light of Google’s competing releases—illustrates the intensifying competition in AI development. What might normally have been spread across months was compressed into just 12 business days, giving users and developers a lot to process as they head into 2025.

Humorously, we asked ChatGPT what it thought about the whole series of announcements, and it was skeptical that the event even took place. “The rapid-fire announcements over 12 days seem plausible,” wrote ChatGPT-4o, “but might strain credibility without a clearer explanation of how OpenAI managed such an intense release schedule, especially given the complexity of the features.”

But it did happen, and here’s a chronicle of what went down on each day.

Day 1: Thursday, December 5

On the first day of OpenAI, the company released its full o1 model, making it available to ChatGPT Plus and Team subscribers worldwide. The company reported that the model operates faster than its preview version and reduces major errors by 34 percent on complex real-world questions.

The o1 model brings new capabilities for image analysis, allowing users to upload and receive detailed explanations of visual content. OpenAI said it plans to expand o1’s features to include web browsing and file uploads in ChatGPT, with API access coming soon. The API version will support vision tasks, function calling, and structured outputs for system integration.

OpenAI also launched ChatGPT Pro, a $200 subscription tier that provides “unlimited” access to o1, GPT-4o, and Advanced Voice features. Pro subscribers receive an exclusive version of o1 that uses additional computing power for complex problem-solving. Alongside this release, OpenAI announced a grant program that will provide ChatGPT Pro access to 10 medical researchers at established institutions, with plans to extend grants to other fields.

Day 2: Friday, December 6

Day 2 wasn’t as exciting. OpenAI unveiled Reinforcement Fine-Tuning (RFT), a model customization method that will let developers modify “o-series” models for specific tasks. The technique reportedly goes beyond traditional supervised fine-tuning by using reinforcement learning to help models improve their reasoning abilities through repeated iterations. In other words, OpenAI created a new way to train AI models that lets them learn from practice and feedback.

OpenAI says that Berkeley Lab computational researcher Justin Reese tested RFT for researching rare genetic diseases, while Thomson Reuters has created a specialized o1-mini model for its CoCounsel AI legal assistant. The technique requires developers to provide a dataset and evaluation criteria, with OpenAI’s platform managing the reinforcement learning process.

OpenAI plans to release RFT to the public in early 2025 but currently offers limited access through its Reinforcement Fine-Tuning Research Program for researchers, universities, and companies.
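As a rough illustration of the workflow described above, and emphatically not OpenAI’s actual RFT API, here is a toy Python sketch in which a developer supplies a small dataset and a grader, and a stand-in “model” is nudged toward outputs the grader scores highly. Every name in it (ToyModel, grader, the example data) is hypothetical.

```python
# Conceptual sketch of grader-driven reinforcement fine-tuning.
# The real technique operates on an LLM via OpenAI's platform; here a toy
# model with two candidate answers stands in for the policy being trained.
import random

class ToyModel:
    def __init__(self):
        self.candidates = ["GENE1", "GENE2"]
        self.weights = {c: 1.0 for c in self.candidates}

    def generate(self, prompt: str) -> str:
        # Sample an answer, favoring candidates that have scored well so far.
        w = [self.weights[c] for c in self.candidates]
        return random.choices(self.candidates, weights=w)[0]

    def update(self, output: str, reward: float) -> None:
        # Crude "reinforcement": upweight outputs that received reward.
        self.weights[output] += reward

def grader(output: str, reference: str) -> float:
    """Toy evaluation criterion: 1.0 for an exact match, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0

dataset = [{"prompt": "Which gene is associated with condition X?", "reference": "GENE1"}]

model = ToyModel()
for _ in range(100):                      # repeated practice-and-feedback iterations
    for example in dataset:
        out = model.generate(example["prompt"])
        model.update(out, grader(out, example["reference"]))

print(model.weights)  # the correct answer should end up with the larger weight
```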

Day 3: Monday, December 9

On day 3, OpenAI released Sora, its text-to-video model, as a standalone product now accessible through sora.com for ChatGPT Plus and Pro subscribers. The company says the new version operates faster than the research preview shown in February 2024, when OpenAI first demonstrated the model’s ability to create videos from text descriptions.

The release moved Sora from research preview to a production service, marking OpenAI’s official entry into the video synthesis market. The company published a blog post detailing the subscription tiers and deployment strategy for the service.

Day 4: Tuesday, December 10

On day 4, OpenAI moved its Canvas feature out of beta testing, making it available to all ChatGPT users, including those on free tiers. Canvas provides a dedicated interface for extended writing and coding projects beyond the standard chat format, now with direct integration into the GPT-4o model.

The updated canvas allows users to run Python code within the interface and includes a text-pasting feature for importing existing content. OpenAI added compatibility with custom GPTs and a “show changes” function that tracks modifications to writing and code. The company said Canvas is now on chatgpt.com for web users and also available through a Windows desktop application, with more features planned for future updates.

Day 5: Wednesday, December 11

On day 5, OpenAI announced that ChatGPT would integrate with Apple Intelligence across iOS, iPadOS, and macOS devices. The integration works on iPhone 16 series phones, iPhone 15 Pro models, iPads with A17 Pro or M1 chips and later, and Macs with M1 processors or newer, running their respective latest operating systems.

The integration lets users access ChatGPT’s features (such as they are), including image and document analysis, directly through Apple’s system-level intelligence features. The feature works with all ChatGPT subscription tiers and operates within Apple’s privacy framework. Iffy message summaries remain unaffected by the additions.

Enterprise and Team account users need administrator approval to access the integration.

Day 6: Thursday, December 12

On the sixth day, OpenAI added two features to ChatGPT’s voice capabilities: “video calling” with screen sharing support for ChatGPT Plus and Pro subscribers and a seasonal Santa Claus voice preset.

The new visual Advanced Voice Mode features work through the mobile app, letting users show their surroundings or share their screen with the AI model during voice conversations. While the rollout covers most countries, users in several European nations, including EU member states, Switzerland, Iceland, Norway, and Liechtenstein, will get access at a later date. Enterprise and education users can expect these features in January.

The Santa voice option appears as a snowflake icon in the ChatGPT interface across mobile devices, web browsers, and desktop apps, with conversations in this mode not affecting chat history or memory. Don’t expect Santa to remember what you want for Christmas between sessions.

Day 7: Friday, December 13

OpenAI introduced Projects, a new organizational feature in ChatGPT that lets users group related conversations and files, on day 7. The feature works with the company’s GPT-4o model and provides a central location for managing resources related to specific tasks or topics—kinda like Anthropic’s “Projects” feature.

ChatGPT Plus, Pro, and Team subscribers can currently access Projects through chatgpt.com and the Windows desktop app, with view-only support on mobile devices and macOS. Users can create projects by clicking a plus icon in the sidebar, where they can add files and custom instructions that provide context for future conversations.

OpenAI said it plans to expand Projects in 2025 with support for additional file types, cloud storage integration through Google Drive and Microsoft OneDrive, and compatibility with other models like o1. Enterprise and education users will receive access to Projects in January.

Day 8: Monday, December 16

On day 8, OpenAI expanded its search features in ChatGPT, extending access to all users with free accounts while reportedly adding speed improvements and mobile optimizations. Basically, you can use ChatGPT like a web search engine, although in practice it doesn’t seem to be as comprehensive as Google Search at the moment.

The update includes a new maps interface and integration with Advanced Voice, allowing users to perform searches during voice conversations. The search capability, which previously required a paid subscription, now works across all platforms where ChatGPT operates.

Day 9: Tuesday, December 17

On day 9, OpenAI released its o1 model through its API platform, adding support for function calling, developer messages, and vision processing capabilities. The company also reduced GPT-4o audio pricing by 60 percent and introduced a GPT-4o mini option that costs one-tenth of previous audio rates.
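For readers who want a sense of what these API features look like in practice, here is a minimal sketch using the OpenAI Python SDK with a function (tool) definition and a developer message. The model identifier and the get_weather tool are assumptions for illustration; check OpenAI’s current documentation for exact model names and parameters.

```python
# A minimal sketch of a function-calling request to a reasoning model via the
# OpenAI Python SDK. Model name and tool are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="o1",  # assumed identifier; confirm against the current model list
    messages=[
        {"role": "developer", "content": "Answer concisely."},  # developer message
        {"role": "user", "content": "Do I need an umbrella in Paris today?"},
    ],
    tools=tools,
)
print(response.choices[0].message)
```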

OpenAI also simplified its WebRTC integration for real-time applications and unveiled Preference Fine-Tuning, which provides developers new ways to customize models. The company also launched beta versions of software development kits for the Go and Java programming languages, expanding its toolkit for developers.

Day 10: Wednesday, December 18

On Wednesday, OpenAI did something a little fun and launched voice and messaging access to ChatGPT through a toll-free number (1-800-CHATGPT), as well as WhatsApp. US residents can make phone calls with a 15-minute monthly limit, while global users can message ChatGPT through WhatsApp at the same number.

OpenAI said the release is a way to reach users who lack consistent high-speed Internet access or want to try AI through familiar communication channels, but it’s also just a clever hack. As evidence, OpenAI notes that these new interfaces serve as experimental access points, with more “limited functionality” than the full ChatGPT service, and still recommends existing users continue using their regular ChatGPT accounts for complete features.

Day 11: Thursday, December 19

On Thursday, OpenAI expanded ChatGPT’s desktop app integration to include additional coding environments and productivity software. The update added support for Jetbrains IDEs like PyCharm and IntelliJ IDEA, VS Code variants including Cursor and VSCodium, and text editors such as BBEdit and TextMate.

OpenAI also included integration with Apple Notes, Notion, and Quip while adding Advanced Voice Mode compatibility when working with desktop applications. These features require manual activation for each app and remain available to paid subscribers, including Plus, Pro, Team, Enterprise, and Education users, with Enterprise and Education customers needing administrator approval to enable the functionality.

Day 12: Friday, December 20

On Friday, OpenAI concluded its twelve days of announcements by previewing two new simulated reasoning models, o3 and o3-mini, while opening applications for safety and security researchers to test them before public release. Early evaluations show o3 achieving a 2727 rating on Codeforces programming contests and scoring 96.7 percent on AIME 2024 mathematics problems.

The company reports o3 set performance records on advanced benchmarks, solving 25.2 percent of problems on EpochAI’s Frontier Math evaluations and scoring above 85 percent on the ARC-AGI test, which is comparable to human results. OpenAI also published research about “deliberative alignment,” a technique used in developing o1. The company has not announced firm release dates for either new o3 model, but CEO Sam Altman said o3-mini might ship in late January.

So what did we learn?

OpenAI’s December campaign revealed that OpenAI had a lot of things sitting around that it needed to ship, and it picked a fun theme to unite the announcements. Google responded in kind, as we have covered.

Several trends from the releases stand out. OpenAI is heavily investing in multimodal capabilities. The o1 model’s release, Sora’s evolution from research preview to product, and the expansion of voice features with video calling all point toward systems that can seamlessly handle text, images, voice, and video.

The company is also focusing heavily on developer tools and customization, so it can continue to have a cloud service business and have its products integrated into other applications. Between the API releases, Reinforcement Fine-Tuning, and expanded IDE integrations, OpenAI is building out its ecosystem for developers and enterprises. And the introduction of o3 shows that OpenAI is still attempting to push technological boundaries, even in the face of diminishing returns in training LLM base models.

OpenAI seems to be positioning itself for a 2025 where generative AI moves beyond text chatbots and simple image generators and finds its way into novel applications that we probably can’t even predict yet. We’ll have to wait and see what the company and developers come up with in the year ahead.


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

12 days of OpenAI: The Ars Technica recap Read More »