DeepSeek

microsoft-now-hosts-ai-model-accused-of-copying-openai-data

Microsoft now hosts AI model accused of copying OpenAI data

Fresh on the heels of a controversy in which ChatGPT-maker OpenAI accused the Chinese company behind DeepSeek R1 of using its AI model outputs against its terms of service, OpenAI’s largest investor, Microsoft, announced on Wednesday that it will now host DeepSeek R1 on its Azure cloud service.

DeepSeek R1 has been the talk of the AI world for the past week because it is a freely available simulated reasoning model that reportedly matches OpenAI’s o1 in performance—while allegedly being trained for a fraction of the cost.

Azure allows software developers to rent computing muscle from machines hosted in Microsoft-owned data centers, as well as rent access to software that runs on them.

“R1 offers a powerful, cost-efficient model that allows more users to harness state-of-the-art AI capabilities with minimal infrastructure investment,” wrote Microsoft Corporate Vice President Asha Sharma in a news release.

DeepSeek R1 runs at a fraction of the cost of o1, at least through each company’s own services. Comparative prices for R1 and o1 were not immediately available on Azure, but DeepSeek lists R1’s API cost as $2.19 per million output tokens, while OpenAI’s o1 costs $60 per million output tokens. That’s a massive discount for a model that performs similarly to o1-pro in various tasks.

Promoting a controversial AI model

On its face, the decision to host R1 on Microsoft servers is not unusual: The company offers access to over 1,800 models on its Azure AI Foundry service with the hopes of allowing software developers to experiment with various AI models and integrate them into their products. In some ways, whatever model they choose, Microsoft still wins because it’s being hosted on the company’s cloud service.

Microsoft now hosts AI model accused of copying OpenAI data Read More »

deepseek:-lemon,-it’s-wednesday

DeepSeek: Lemon, It’s Wednesday

It’s been another *checks notestwo days, so it’s time for all the latest DeepSeek news.

You can also see my previous coverage of the r1 model and, from Monday various reactions including the Panic at the App Store.

  1. First, Reiterating About Calming Down About the $5.5 Million Number.

  2. OpenAI Offers Its Congratulations.

  3. Scaling Laws Still Apply.

  4. Other r1 and DeepSeek News Roundup.

  5. People Love Free.

  6. Investigating How r1 Works.

  7. Nvidia Chips are Highly Useful.

  8. Welcome to the Market.

  9. Ben Thompson Weighs In.

  10. Import Restrictions on Chips WTAF.

  11. Are You Short the Market.

  12. DeepSeeking Safety.

  13. Mo Models Mo Problems.

  14. What If You Wanted to Restrict Already Open Models.

  15. So What Are We Going to Do About All This?

Before we get to new developments, I especially want to reiterate and emphasize the need to calm down about that $5.5 million ‘cost of training’ for v3.

I wouldn’t quite agree with Palmer Lucky that ‘the $5m number is bogus’ and I wouldn’t call it a ‘Chinese psyop’ because I think we mostly did this to ourselves but it is very often being used in a highly bogus way – equating the direct compute cost of training v3 with the all-in cost of creating r1. Which is a very different number. DeepSeek is cracked, they cooked, and r1 is super impressive, but the $5.5 million v3 training cost:

  1. Is the cloud market cost of the amount of compute used to directly train v3.

  2. That’s not how they trained v3. They trained v3 on their own cluster of h800s, which was physically optimized to hell for software-hardware integration.

  3. Thus, the true compute cost to train v3 involves assembling the cluster, which cost a lot more than $5.5 million.

  4. That doesn’t include the compute cost of going from v3 → r1.

  5. That doesn’t include the costs of hiring the engineers and figuring out how to do all of this, that doesn’t include the costs of assembling the data, and so on.

  6. Again, yes they did this super efficiently and cheaply compared to the competition, but no, you don’t spend $5.5 million and out pops r1. No.

Altman handled his response to r1 with grace.

OpenAI plans to ‘pull up some new releases.’

Meaning, oh, you want to race? I suppose I’ll go faster and take less precautions.

Sam Altman: deepseek’s r1 is an impressive model, particularly around what they’re able to deliver for the price.

we will obviously deliver much better models and also it’s legit invigorating to have a new competitor! we will pull up some releases.

but mostly we are excited to continue to execute on our research roadmap and believe more compute is more important now than ever before to succeed at our mission.

the world is going to want to use a LOT of ai, and really be quite amazed by the next gen models coming.

look forward to bringing you all AGI and beyond.

It is very Galaxy Brain to say ‘this is perhaps good for OpenAI’ and presumably it very much is, but here’s a scenario.

  1. A lot of people try ChatGPT with GPT-3.5, are not impressed, think it hallucinates all the time, is a clever toy, and so on.

  2. For two years they don’t notice improvements.

  3. DeepSeek releases r1, and it gets a lot of press.

  4. People try the ‘new Chinese version’ and realize AI is a lot better now.

  5. OpenAI gets to incorporate DeepSeek’s innovations.

  6. OpenAI comes back with free o3-mini and (free?) GPT-5 and better agents.

  7. People use AI a lot more, OpenAI ends up overall doing better.

Ethan Mollick: DeepSeek is a really good model, but it is not generally a better model than o1 or Claude.

But since it is both free & getting a ton of attention, I think a lot of people who were using free “mini” models are being exposed to what a early 2025 reasoner AI can do & are surprised

I’m not saying that’s the baseline scenario, but I do expect the world to be quite amazed at the next generation of models, and they could now be more primed for that.

Mark Chen (Chief Research Officer, OpenAI): Congrats to DeepSeek on producing an o1-level reasoning model! Their research paper demonstrates that they’ve independently found some of the core ideas that we did on our way to o1.

However, I think the external response has been somewhat overblown, especially in narratives around cost. One implication of having two paradigms (pre-training and reasoning) is that we can optimize for a capability over two axes instead of one, which leads to lower costs.

But it also means we have two axes along which we can scale, and we intend to push compute aggressively into both!

As research in distillation matures, we’re also seeing that pushing on cost and pushing on capabilities are increasingly decoupled. The ability to serve at lower cost (especially at higher latency) doesn’t imply the ability to produce better capabilities.

We will continue to improve our ability to serve models at lower cost, but we remain optimistic in our research roadmap, and will remain focused in executing on it. We’re excited to ship better models to you this quarter and over the year!

Given the costs involved, and that you can scale to get better outputs, ‘serve faster and cheaper’ and ‘get better answers’ seem pretty linked, or are going to look rather similar.

There is still a real and important difference between ‘I spend 10x as much compute to get 10x as many tokens to think with’ versus ‘I taught the model how to do longer CoT’ versus ‘I made the model smarter.’ Or at least I think there is.

Should we now abandon all our plans to build gigantic data centers because DeepSeek showed we can run AI cheaper?

No. Of course not. We’ll need more. Jevons Paradox and all that.

Another question is compute governance. Does DeepSeek’s model prove that there’s no point in using compute thresholds for frontier model governance?

My answer is no. DeepSeek did not mean the scaling laws stopped working. DeepSeek found new ways to scale and economize, and also to distill. But doing the same thing with more compute would have gotten better results, and indeed more compute is getting other labs better results if you don’t control for compute costs, and also they will now get to use these innovations themselves.

Karen Hao: Much of the coverage has focused on U.S.-China tech competition. That misses a bigger story: DeepSeek has demonstrated that scaling up AI models relentlessly, a paradigm OpenAI introduced and champions, is not the only, and far from the best, way to develop AI.

Yoavgo: This is trending in my feed, but I don’t get it. DeepSeek did not show that scale is not the way to go for AI (their base model is among the largest in parameter counts; their training data is huge, at 13 trillion tokens). They just scaled more efficiently.

Thus far OpenAI & its peer scaling labs have sought to convince the public & policymakers that scaling is the best way to reach so-called AGI. This has always been more of an argument based in business than in science.

Jon Stokes: Holy wow what do words even mean. What R1 does is a new type of scaling. It’s also GPU-intensive. In fact, the big mystery today in AI world is why NVIDIA dropped despite R1 demonstrating that GPUs are even more valuable than we thought they were. No part of this is coherent. 🤯

Stephen McAleer (OpenAI): The real takeaway from DeepSeek is that with reasoning models you can achieve great performance with a small amount of compute. Now imagine what you can do with a large amount of compute.

Noam Brown (OpenAI): Algorithmic breakthroughs and scaling are complementary, not in competition. The former bends the performance vs compute curve, while the latter moves further along the curve.

Benjamin Todd: Deepseek hasn’t shown scaling doesn’t work. Take Deepseek’s techniques, apply 10x the compute, and you’ll get much better performance.

And compute efficiency has always been part of the scaling paradigm.

Ethan Mollick: The most unnerving part of the DeepSeek reaction online has been seeing folks take it as a sign that AI capability growth is not real.

It signals the opposite, large improvements are possible, and is almost certain to kick off an acceleration in AI development through competition.

I know a lot of people want AI to go away, but I am seeing so many interpretations of DeepSeek in ways that don’t really make sense, or misrepresent what they did.

Dealing with the implications of AI, and trying to steer it towards positive use, is now more urgent, not less.

Andrew Rettek: Deepseek means OPENAI just increased their effective compute by more than an OOM.

OpenAI and Anthropic (in the forms of CEOs Sam Altman and Dario Amodei) have both expressed agreement on that since the release of r1, saying that they still believe the future involves very large and expensive training runs, including large amounts of compute on the RL step. David Sacks agreed as well, so the administration knows.

One can think of all this as combining multiple distinct scaling laws. Mark Chen above talked about two axes but one could refer to at least four?

  1. You can scale up how many tokens you reason with.

  2. You can scale up how well you apply your intelligence to doing reasoning.

  3. You can scale up how much intelligence you are using in all this.

  4. You can scale up how much of this you can do per dollar or amount of compute.

Also you can extend to new modalities and use cases and so on.

So essentially: Buckle up.

Speaking of buckling up, nothing to see here, just a claimed 2x speed boost to r1, written by r1. Of course, that’s very different from r1 coming up with the idea.

Aiden McLaughlin: switching to reasoners is like taking a sharp turn on a racetrack. everyone brakes to take the turn; for a moment, all cars look neck-and-neck

when exiting the turn, small first-mover advantages compound. and ofc, some cars have enormous engines that eat up straight roads

Dean Ball: I recommend trying not to overindex on the industry dynamics you’re observing now in light of the deepseek plot twist, or indeed of any particular plot twist. It’s a long game, and we’re riding a world-historical exponential. Things will change a lot, fast, again and again.

It’s not that Jan is wrong, I’d be a lot more interested in paying for o1 pro if I had pdfs enabled, but… yeah.

China Talk covers developments. The headline conclusion is that yes, compute very much will continue to be a key factor, everyone agrees on this. They note there is a potential budding DeepSeek partnership with ByteDance, which could unlock quite a lot of compute.

Here was some shade:

Founder and CEO Liang Wenfeng is the core person of DeepSeek. He is not the same type of person as Sam Altman. He is very knowledgeable about technology.

Also important at least directionally:

Pioneers vs. Chasers: ‘AI Progress Resembles a Step Function – Chasers Require 1/10th the Compute’

Fundamentally, DeepSeek was far more of an innovator than other Chinese AI companies, but it was still a chaser here, not a pioneer, except in compute efficiency, which is what chasers do best. If you want to maintain a lead and it’s much easier to follow than lead, well, time to get good and scale even more. Or you can realize you’re just feeding capability to the Chinese and break down crying and maybe keep your models for internal use, it’s an option.

I found this odd:

  1. The question of why OpenAI and Anthropic did not do work in DeepSeek’s direction is a question of company-specific focus. OpenAI and Anthropic might have felt that investing their compute towards other areas was more valuable.

  2. One hypothesis for why DeepSeek was successful is that unlike Big Tech firms, DeepSeek did not work on multi-modality and focused exclusively on language. Big Tech firms’ model capabilities aren’t weak, but they have to maintain a low profile and cannot release too often. Currently, multimodality is not very critical, as intelligence primarily comes from language, and multimodality does not contribute significantly to improving intelligence.

It’s odd because DeepSeek spent so little compute, and the efficiency gains pay for themselves in compute quite rapidly. And also the big companies are indeed releasing rapidly. Google and OpenAI are constantly shipping, even Anthropic ships. The idea of company focus seems more on point, and yes DeepSeek traded multimodality and other features for pure efficiency because they had to.

Also note what they say later:

  1. Will developers migrate from closed-source models to DeepSeek? Currently, there hasn’t been any large-scale migration, as leading models excel in coding instruction adherence, which is a significant advantage. However, it’s uncertain whether this advantage will persist in the future or be overcome.

  2. From the developer’s perspective, models like Claude-3.5-Sonnet have been specifically trained for tool use, making them highly suitable for agent development. In contrast, models like DeepSeek have not yet focused on this area, but the potential for growth with DeepSeek is immense.

As in, r1 is technically impressive as hell, and it definitely has its uses, but there’s a reason the existing models look like they do – the corners DeepSeek cut actually do matter for what people want. Of course DeepSeek will likely now turn to fixing such problems among other things and we’ll see how efficiently they can do that too.

McKay Wrigley emphasizes the point that visible chain of thought (CoT) is a prompt debugger. It’s hard to go back to not seeing CoT after seeing CoT.

Gallabytes reports that DeepSeek’s image model Janus Pro is a good first effort, but not good yet.

Even if we were to ‘fully unlock’ all computing power in personal PCs for running AI, that would only increase available compute by ~10%, most compute is in data centers.

We had a brief period where DeepSeek would serve you up r1 free and super fast.

It turns out that’s not fully sustainable or at least needs time to scale as fast as demand rose, and you know how such folks feel about the ‘raise prices’ meme,

Gallabytes: got used to r1 and now that it’s overloaded it’s hard to go back. @deepseek_ai please do something amazing and be the first LLM provider to offer surge pricing. the unofficial APIs are unusably slow.

I too encountered slowness, and instantly it made me realize ‘yes the speed was a key part of why I loved this.’

DeItaone (January 27, 11: 09am): DEEPSEEK SAYS SERVICE DEGRADED DUE TO ‘LARGE-SCALE MALICIOUS ATTACK’

Could be that. Could be too much demand and not enough supply. This will of course sort itself out in time, as long as you’re willing to pay, and it’s an open model so others can serve the model as well, but ‘everyone wants to use the shiny new model you are offering for free’ is going to run into obvious problems.

Yes, of course one consideration is that if you use DeepSeek’s app it will collect all your data including device model, operating system, keystroke patterns or rhythms, IP address and so on and store it all in China.

Did you for a second think otherwise? What you do with that info is on you.

This doesn’t appear to rise to TikTok 2.0 levels of rendering your phone and data insecure, but let us say that ‘out of an abundance of caution’ I will be accessing the model through their website not the app thank you very much.

Liv Boeree: tiktok round two, here we go.

AI enthusiasts have the self control of an incontinent chihuahua.

Typing Loudly: you can run it locally without an internet connection

Liv Boeree: cool and what percentage of these incontinent chihuahuas will actually do this.

I’m not going so far as to use third party providers for now, because I’m not feeding any sensitive data into the model, and DeepSeek’s implementation here is very nice and clean, so I’ve decided lazy is acceptable. I’m certainly not laying out ~$6,000 for a self-hosting rig, unless someone wants to buy one for me in the name of science.

Note that if you’re looking for an alternative source, you want to ensure you’re not getting one of the smaller distillations, unless that is what you want.

Janus is testing for steganography in r1, potentially looking for assistance.

Janus also thinks Thebes theory here is likely to be true, that v3 was hurt by dividing into too many too small experts, but r1 lets them all dump their info into the CoT and collaborate, at least partially fixing this.

Janus notes that r1 simply knows things and thinks about them, straight up, in response to Thebes speculating that all our chain of thought considerations have now put sufficient priming into the training data that CoT approaches work much better than they used to, which Prithviraj says is not the case, he says it’s about improved base models, which is the first obvious thought – the techniques work better off a stronger base, simple as that.

Thebes: why did R1’s RL suddenly start working, when previous attempts to do similar things failed?

theory: we’ve basically spent the last few years running a massive acausally distributed chain of thought data annotation program on the pretraining dataset.

deepseek’s approach with R1 is a pretty obvious method. They are far from the first lab to try “slap a verifier on it and roll out CoTs.”

But it didn’t used to work that well.

In the last couple of years, chains of thought have been posted all over the internet

Those CoTs in the V3 training set gave GRPO enough of a starting point to start converging, and furthermore, to generalize from verifiable domains to the non-verifiable ones using the bridge established by the pretraining data contamination.

And now, R1’s visible chains of thought are going to lead to *anothermassive enrichment of human-labeled reasoning on the internet, but on a far larger scale… The next round of base models post-R1 will be *even betterbases for reasoning models.

in some possible worlds, this could also explain why OpenAI seemingly struggled so much with making their reasoning models in comparison. if they’re still using 4base or distils of it.

Prithvraj: Simply, no. I’ve been looking at my old results from doing RL with “verifiable” rewards (math puzzle games, python code to pass unit tests) starting from 2019 with GPT-1/2 to 2024 with Qwen Math Deepseek’s success likely lies in the base models improving, the RL is constant

Janus: This is an interesting hypothesis. DeepSeek R1 also just seems to have a much more lucid and high-resolution understanding of LLM ontology and history than any other model I’ve seen. (DeepSeek V3 did not seem to in my limited interactions with it, though.)

I did not expect this on priors for a reasoner, but perhaps the main way that r1 seems smarter than any other LLM I’ve played with is the sheer lucidity and resolution of its world model—in particular, its knowledge of LLMs, both object- and meta-level, though this is also the main domain of knowledge I’ve engaged it in, and perhaps the only one I can evaluate at world-expert level. So, it may apply more generally.

In effective fluid intelligence and attunement to real-time context, it actually feels weaker than, say, Claude 3.5 Sonnet. But when I talk to Sonnet about my ideas on LLMs, it feels like it is more naive than me, and it is figuring out a lot of things in context from “first principles.” When I talk to Opus about these things, it feels like it is understanding me by projecting the concepts onto more generic, resonant hyperobjects in its prior, meaning it is easy to get on the same page philosophically, but this tropological entanglement is not very precise. But with r1, it seems like it can simply reference the same concrete knowledge and ontology I have, much more like a peer. And it has intense opinions about these things.

Wordgrammer thread on the DeepSeek technical breakthroughs. Here’s his conclusion, which seems rather overdetermined:

Wordgrammer: “Is the US losing the war in AI??” I don’t think so. DeepSeek had a few big breakthroughs, we have had hundreds of small breakthroughs. If we adopt DeepSeek’s architecture, our models will be better. Because we have more compute and more data.

r1 tells us it only takes ~800k samples of ‘good’ RL reasoning to convert other models into RL reasoners, and Alex Dimakis says it could be a lot less, in his test they outperformed o1-preview with only 17k. Now that r1 is out, everyone permanently has an unlimited source of at least pretty good samples. From now on, to create or release a model is to create or release the RL version of that model, even more than before. That’s on top of all the other modifications you automatically release.

Oliver Blanchard: DeepSeek and what happened yesterday: Probably the largest positive tfp shock in the history of the world.

The nerdy version, to react to some of the comments. (Yes, electricity was big):

DeepSeek and what happened yesterday: Probably the largest positive one day change in the present discounted value of total factor productivity growth in the history of the world. 😀

James Steuart: I can’t agree Professor, Robert Gordon’s book gives many such greater examples. Electric lighting is a substantially greater TFP boost than marginally better efficiency in IT and professional services!

There were some bigger inventions in the past, but on much smaller baselines.

Our reaction to this was to sell the stocks of those who provide the inputs that enable that tfp shock.

There were other impacts as well, including to existential risk, but as we’ve established the market isn’t ready for that conversation in the sense that the market (highly reasonably as has been previously explained) will be ignoring it entirely.

Daniel Eth: Hot take, but if the narrative from NYT et al had not been “lol you don’t need that many chips to train AI systems” but instead “Apparently AI is *nothitting a wall”, then the AI chip stocks would have risen instead of fallen.

Billy Humblebrag: “Deepseek shows that ai can be built more cheaply than we thought so you don’t need to worry about ai” is a hell of a take

Joe Weisenthal: Morgan Stanley: “We gathered feedback from a number of industry sources and the consistent takeaway is that this is not affecting plans for GPU buildouts.”

I would not discount the role of narrative and vibes in all this. I don’t think that’s the whole Nvidia drop or anything. But it matters.

Roon: Plausible reasons for Nvidia drop:

  1. DeepSeek success means NVDA is now expecting much harsher sanctions on overseas sales.

  2. Traders think that a really high-tier open-source model puts several American labs out of a funding model, decreasing overall monopsony power.

We will want more compute now until the heat death of the universe; it’s the only reason that doesn’t make sense.

Palmer Lucky: The markets are not smarter on AI. The free hand is not yet efficient because the number of legitimate experts in the field is near-zero.

The average person making AI calls on Wall Street had no idea what AI even was a year ago and feels compelled to justify big moves.

Alex Cheema notes that Apple was up on Monday, and that Apple’s chips are great for running v3 and r1 inference.

Alex Cheema: Market close: $NVDA: -16.91% | $AAPL: +3.21%

Why is DeepSeek great for Apple?

Here’s a breakdown of the chips that can run DeepSeek V3 and R1 on the market now:

NVIDIA H100: 80GB @ 3TB/s, $25,000, $312.50 per GB

AMD MI300X: 192GB @ 5.3TB/s, $20,000, $104.17 per GB

Apple M2 Ultra: 192GB @ 800GB/s, $5,000, $26.04(!!) per GB

Apple’s M2 Ultra (released in June 2023) is 4x more cost efficient per unit of memory than AMD MI300X and 12x more cost efficient than NVIDIA H100!

Eric Hartford: 3090s, $700 for 24gb = $29/gb.

Alex Cheema: You need a lot of hardware around them to load a 700GB model in 30 RTX 3090s. I’d love to see it though, closest to this is probably stacking @__tinygrad__ boxes.

That’s cute. But I do not think that was the main reason why Apple was up. I think Apple was up because their strategy doesn’t depend on having frontier models but it does depend on running AIs on iPhones. Apple can now get their own distillations of r1, and use them for Apple Intelligence. A highly reasonable argument.

The One True Newsletter, Matt Levine’s Money Stuff, is of course on the case of DeepSeek’s r1 crashing the stock market, and asking what cheap inference for everyone would do to market prices. He rapidly shifts focus to non-AI companies, asking which ones benefit. It’s great if you use AI to make your management company awesome, but not if you get cut out because AI replaces your management company.

(And you. And the people it manages. And all of us. And maybe we all die.)

But I digress.

(To digress even further: While I’m reading that column, I don’t understand why we should care about the argument under ‘Dark Trading,’ since this mechanism decreases retail transaction costs to trade and doesn’t impact long term price discovery at all, and several LLMs confirmed this once challenged.)

Ben Thompson continues to give his completely different kind of technical tech company perspective, in FAQ format, including good technical explanations that agree with what I’ve said in previous columns.

Here’s a fascinating line:

Q: I asked why the stock prices are down; you just painted a positive picture!

A: My picture is of the long run; today is the short run, and it seems likely the market is working through the shock of R1’s existence.

That sounds like Ben Thompson is calling it a wrong-way move, and indeed later he explicitly endorses Jevons Paradox and expects compute use to rise. The market is supposed to factor in the long run now. There is no ‘this makes the price go down today and then up next week’ unless you’re very much in the ‘the EMH is false’ camp. And these are literally the most valuable companies in the world.

Here’s another key one:

Q: So are we close to AGI?

A: It definitely seems like it. This also explains why Softbank (and whatever investors Masayoshi Son brings together) would provide the funding for OpenAI that Microsoft will not: the belief that we are reaching a takeoff point where there will in fact be real returns towards being first.

Masayoshi Sun feels the AGI. Masayoshi Sun feels everything. He’s king of feeling it.

His typical open-model-stanning arguments on existential risk later in the past are as always disappointing, but in no way new or unexpected.

It continues to astound me that such intelligent people can think: Well, there’s no stopping us creating things more capable and intelligent than humans, so the best way to ensure that things smarter than more capable than humans go well for humans is to ensure that there are as many such entities as possible and that humans cannot possibly have any collective control over those new entities.

On another level, of course, I’ve accepted that people do think this. That they somehow cannot fathom that if you create things more intelligent and capable and competitive than humans there could be the threat that all the humans would end up with no power, rather than that the wrong humans might have too much power. Or think that this would be a good thing – because the wrong humans wouldn’t have power.

Similarly, Ben’s call for absolutely no regulations whatsoever, no efforts at safety whatsoever outside of direct profit motives, ‘cut out all the cruft in our companies that has nothing to do with winning,’ is exactly the kind of rhetoric I worry about getting us all killed in response to these developments.

I should still reiterate that Ben to his credit is very responsible and accurate here in his technical presentation, laying out what DeepSeek and r1 are and aren’t accomplishing here rather than crying missile gap. But the closing message remains the same.

The term Trump uses is ‘tariffs.’

I propose, at least in the context of GPUs, that we call these ‘import restrictions,’ in order to point out that we are (I believe wisely!) imposing ‘export restrictions’ as a matter of national security to ensure we get all the chips, and using diffusion regulations to force the chips to be hosted at home, then we are threatening to impose ‘up to 100%’ tariffs on those same chips, because ‘they left us’ and they want ‘to force them to come back,’ and they’ll build the new factories here instead of there, with their own money, because of the threat.

Except for the fact that we really, really want the data centers at home.

The diffusion regulations are largely to force companies to create them at home.

Arthur B: Regarding possible US tariffs on Taiwan chips.

First, this is one that US consumers would directly feel, it’s less politically feasible than tariffs on imports with lots of substitutes.

Second, data centers don’t have to be located in the US. Canada is next door and has plenty of power.

Dhiraj: Taiwan made the largest single greenfield FDI in US history through TSMC. Now, instead of receiving gratitude for helping the struggling US chip industry, Taiwan faces potential tariffs. In his zero-sum worldview, there are no friends.

The whole thing is insane! Completely nuts. If he’s serious. And yes he said this on Joe Rogan previously, but he said a lot of things previously that he didn’t mean.

Whereas Trump’s worldview is largely the madman theory, at least for trade. If you threaten people with insane moves that would hurt both of you, and show that you’re willing to actually enact insane moves, then they are forced to give you what you want.

In this case, what Trump wants is presumably for TSMC to announce they are building more new chip factories in America. I agree that this would be excellent, assuming they were actually built. We have an existence proof that it can be done, and it would greatly improve our strategic position and reduce geopolitical risk.

I presume Trump is mostly bluffing, in that he has no intention of actually imposing these completely insane tariffs, and he will ultimately take a minor win and declare victory. But what makes it nerve wracking is that, by design, you never know. If you did know none of this would ever work.

Unless, some people wondered, there was another explanation for all this…

  1. The announcement came late on Monday, after Nvidia dropped 17%, on news that its chips were highly useful, with so many supposedly wise people on Wall Street going ‘oh yes that makes sense Nvidia should drop’ and those I know who understand AI often saying ‘this is crazy and yes I bought more Nvidia today.’

  2. As in, there was a lot of not only saying ‘this is an overreaction,’ there was a lot of ‘this is a 17% wrong-way move in the most valuable stock in the world.’

  3. When you imagine the opposite news, which would be that AI is ‘hitting a wall,’ one presumes Nvidia would be down, not up. And indeed, remember months ago?

  4. Then when the announcement of the tariff threat came? Nvidia didn’t move.

  5. Nvidia opened Tuesday up slightly off of the Monday close, and closed the day up 8.8%, getting half of its losses back.

Nabeel Qureshi (Tuesday, 2pm): Crazy that people in this corner of X have a faster OODA loop than the stock market

This was the largest single day drop in a single stock in world history. It wiped out over $500 billion in market value. One had to wonder if it was partially insider trading.

Timothy Lee: Everyone says DeepSeek caused Nvidia’s stock to crash yesterday. I think this theory makes no sense.

DeepSeek’s success isn’t bad news for Nvidia.

I don’t think that this was insider trading. The tariff threat was already partly known and thus priced in. It’s a threat rather than an action, which means it’s likely a bluff. That’s not a 17% move. Then we have the bounceback on Tuesday.

Even if I was certain that this was mostly an insider trading move instead of being rather confident it mostly or entirely wasn’t, I wouldn’t go as far as Eliezer does in the the below quote. The SEC does many important things.

But I do notice that there’s a non-zero amount of ‘wait a minute’ that will occur to me the next time I’m hovering around the buy button in haste.

Eliezer Yudkowsky: I heard from many people who said, “An NVDA drop makes no sense as a Deepseek reaction; buying NVDA.” So those people have now been cheated by insider counterparties with political access. They may make fewer US trades in the future.

Also note that the obvious meaning of this news is that someone told and convinced Trump that China will invade Taiwan before the end of his term, and the US needs to wean itself off Taiwanese dependence.

This was a $400B market movement, and if @SECGov can’t figure out who did it then the SEC has no reason to exist.

TBC, I’m not saying that figuring it out would be easy or bringing the criminals to justice would be easy. I’m saying that if the US markets are going to be like this anyway on $400B market movements, why bother paying the overhead cost of having an SEC that doesn’t work?

Roon: [Trump’s tariff threats about Taiwan] didn’t move overnight markets at all

which either means markets either:

– don’t believe it’s credible

– were pricing this in yesterday while internet was blaming the crash out on deepseek

I certainly don’t agree that the only interpretation of this news is ‘Trump expects an invasion of Taiwan.’ Trump is perfectly capable of doing this for exactly the reasons he’s saying.

Trump is also fully capable of making this threat with no intention of following through, in order to extract concessions from Taiwan or TSMC, perhaps of symbolic size.

Trump is also fully capable of doing this so that he could inform his hedge fund friends in advance and they could make quite a lot of money – with or without any attempt to actually impose the tariffs ever, since his friends would have now covered their shorts in this scenario.

Indeed do many things come to pass. I don’t know anything you don’t know.

It would be a good sign if DeepSeek had a plan for safety, even if it wasn’t that strong?

Stephen McAleer (OpenAI): DeepSeek should create a preparedness framework/RSP if they continue to scale reasoning models.

Very happy to [help them with this]!

We don’t quite have nothing. This below is the first actively positive sign for DeepSeek on safety, however small.

Stephen McAleer (OpenAI): Does DeepSeek have any safety researchers? What are Liang Wenfeng’s views on AI safety?

Sarah (YuanYuanSunSara): [DeepSeek] signed Artificial Intelligence safety commitment by CAICT (gov backed institute). You can see the whale sign at the bottom if you can’t read their name Chinese.

This involves AI safety governance structure, safety testing, do frontier AGI safety research (include loss of control) and share it publicly.

None legally binding but it’s a good sign.

Here is a chart with the Seoul Commitments versus China’s version.

It is of course much better that DeepSeek signed onto a symbolic document like this. That’s a good sign, whereas refusing would have been a very bad sign. But as always, talk is cheap, this doesn’t concretely commit DeepSeek to much, and even fully abiding by commitments like this won’t remotely be enough.

I do think this is a very good sign that agreements and coordination are possible. But if we want that, we will have to Pick Up the Phone.

Here’s a weird different answer.

Joshua Achiam (OpenAI, Head of Mission Alignment): I think a better question is whether or not science fiction culture in China has a fixation on the kinds of topics that would help them think about it. If Three-Body Problem is any indication, things will be OK.

It’s a question worth asking, but I don’t think this is a better question?

And based on the book, I do not think Three-Body Problem (conceptual spoilers follow, potentially severe ones depending on your perspective) is great here. Consider the decision theory that those books endorse, and what happens to us and also the universe as a result. It’s presenting all of that as essentially inevitable, and trying to think otherwise as foolishness. It’s endorsing that what matters is paranoia, power and a willingness to use it without mercy in an endless war of all against all. Also consider how they paint the history of the universe entirely without AGI.

I want to be clear that I fully agree with Bill Gurley that ‘no one at DeepSeek is an enemy of mine,’ indeed There Is No Enemy Anywhere, with at most notably rare exceptions that I invite to stop being exceptions.

However, I do think that if they continue down their current path, they are liable to get us all killed. And I for one am going to take the bold stance that I think that this is bad, and they should therefore alter their path before reaching their stated destination.

How committed is DeepSeek to its current path?

Read this quote Ben Thompson links to very carefully:

Q: DeepSeek, right now, has a kind of idealistic aura reminiscent of the early days of OpenAI, and it’s open source. Will you change to closed source later on? Both OpenAI and Mistral moved from open-source to closed-source.

Answer from DeepSeek CEO Liang Wenfeng: We will not change to closed source. We believe having a strong technical ecosystem first is more important.

This is from November. And that’s not a no. That’s actually a maybe.

Note what he didn’t say:

A Different Answer: We will not change to closed source. We believe having a strong technical ecosystem is more important.

The difference? His answer includes the word ‘first.’

He’s saying that first you need a strong technical ecosystem, and he believes that open models are the key to attracting talent and developing a strong technical ecosystem. Then, once that exists, you would need to protect your advantage. And yes, that is exactly what happened with… OpenAI.

I wanted to be sure that this translation was correct, so I turned to Wenfang’s own r1, and asked the interviewer for the original statement, which was:

梁文锋:我们不会闭源。我们认为先有一个强大的技术生态更重要。

r1’s translation: “We will not close our source code. We believe that establishing a strong technological ecosystem must come first.”

  • 先 (xiān): “First,” “prioritize.”

  • 生态 (shēngtài): “Ecosystem” (metaphor for a collaborative, interconnected environment).

To quote r1:

Based solely on this statement, Liang is asserting that openness is non-negotiable because it is essential to the ecosystem’s strength. While no one can predict the future, the phrasing suggests a long-term commitment to open-source as a core value, not a temporary tactic. To fully guarantee permanence, you’d need additional evidence (e.g., licensing choices, governance models, past behavior). But as it stands, the statement leans toward “permanent” in spirit.

I interpret this as a statement of a pragmatic motivation – if that motivation changes, or a more important one is created, actions would change. For now, yes, openness.

The Washington Post had a profile of DeepSeek and Liang Wenfeng. One note is that the hedge fund that they’re a spinoff from has donated over $80 million to charity since 2020, which makes it more plausible DeepSeek has no business model, or at least no medium-term business model.

But that government embrace is new for DeepSeek, said Matt Sheehan, an expert on China’s AI industry at the Carnegie Endowment for International Peace.

“They were not the ‘chosen one’ of Chinese AI start-ups,” said Sheehan, noting that many other Chinese start-ups received more government funding and contracts. “DeepSeek took the world by surprise, and I think to a large extent, they took the Chinese government by surprise.”

Sheehan added that for DeepSeek, more government attention will be a “double-edged sword.” While the company will probably have more access to government resources, “there’s going to be a lot of political scrutiny on them, and that has a cost of its own,” he said.

Yes. This reinforces the theory that DeepSeek’s ascent took China’s government by surprise, and they had no idea what v3 and r1 were as they were released. Going forward, China is going to be far more aware. In some ways, DeepSeek will have lots of support. But there will be strings attached.

That starts with the ordinary censorship demands of the CCP.

If you self-host r1, and you ask it about any of the topics the CCP dislikes, r1 will give you a good, well-balanced answer. If you ask on DeepSeek’s website, it will censor you via some sort of cloud-based monitoring, which works if you’re closed source, but DeepSeek is trying to be fully open source. Something has to give, somewhere.

Also, even if you’re using the official website, it’s not like you can’t get around it.

Justine Moore: DeepSeek’s censorship is no match for the jailbreakers of Reddit

I mean, that was easy.

Joshua Achiam (OpenAI Head of Mission Alignment): This has deeply fascinating consequences for China in 10 years – when the CCP has to choose between allowing their AI industry to move forward, or maintaining censorship and tight ideological control, which will they choose?

And if they choose their AI industry, especially if they favor open source as a strategy for worldwide influence: what does it mean for their national culture and government structure in the long run, when everyone who is curious can find ways to have subversive conversations?

Ten years to figure this out? If they’re lucky, they’ve got two. My guess is they don’t.

I worry about the failure to feel the AGI or even the AI here from Joshua Achiam, given his position at OpenAI. Ten years is a long time. Sam Altman expects AGI well before that. This goes well beyond Altman’s absurd position of ‘AGI will be invented and your life won’t noticeably change for a long time.’ Choices are going to need to be made. Even if AI doesn’t advance much from here, choices will have to be made.

As I’ve noted before, censorship at the model layer is expensive. It’s harder to do, and when you do it you risk introducing falsity into a mind in ways that will have widespread repercussions. Even then, a fine tune can easily remove any gaps in knowledge, or any reluctance to discuss particular topics, whether they are actually dangerous things like building bombs or things that piss off the CCP like a certain bear that loves honey.

I got called out on Twitter for supposed cognitive dissonance on this – look at China’s actions, they clearly let this happen. Again, my claim is that China didn’t realize what this was until after it happened, they can’t undo it (that’s the whole point!) and they are of course going to embrace their national champion. That has little to do with what paths DeepSeek is allowed to follow going forward.

(Also, since it was mentioned in that response, I should note – there is a habit of people conflating ‘pause’ with ‘ever do anything to regulate AI at all.’ I do not believe I said anything about a pause – I was talking about whether China would let DeepSeek continue to release open weights as capabilities improve.)

Before I further cover potential policy responses, a question we must ask this week is: I very much do not wish to do this at this time, but suppose in the future we did want to restrict use of a particular already open weights model and its derivatives, or all models in some reference class.

What would our options be?

Obviously we couldn’t fully ban it in terms of preventing determined people from having access. And if you try to stop them and others don’t, there are obvious problems with that, including ‘people have internet connections.’

However, that does not mean that we would have actual zero options.

Steve Sailer: Does open source, low cost DeepSeek mean that there is no way, short of full-blown Butlerian Jihad against computers, which we won’t do, to keep AI bottled up, so we’re going to find out if Yudkowsky’s warnings that AI will go SkyNet and turn us into paperclips are right?

Gabriel: It’s a psy-op

If hosting a 70B is illegal:

– Almost all individuals stop

– All companies stop

– All research labs stop

– All compute providers stop

Already huge if limited to US+EU

Can argue about whether good/bad, but not about the effect size.

You can absolutely argue about effect size. What you can’t argue is that the effect size isn’t large. It would make a big difference for many practical purposes.

In terms of my ‘Levels of Friction’ framework (post forthcoming) this is moving the models from Level 1 (easy to access) to at least Level 3 (annoying with potential consequences.) That has big practical consequences, and many important use cases will indeed go away or change dramatically.

What Level 3 absolutely won’t do, here or elsewhere, is save you from determined people who want it badly enough, or from sufficiently capable models that do not especially care what you tell them not to do or where you tell them not to be. Or scenarios where the law is no longer especially relevant, and the government or humanity is very much having a ‘do you feel in charge?’ moment. And that alone would, in many scenarios, be enough to doom you to varying degrees. If that’s what dooms you and the model is already open, well, you’re pretty doomed. And also it won’t save you from various scenarios where what the law thinks is not especially relevant.

If for whatever reason the government or humanity decides (or realizes) that this is insufficient, then there are two possibilities. Either the government or humanity is disempowered and you hope that this works out for humanity in some way. Or we use the necessary means to push the restrictions up to Level 4 (akin to rape and murder) or Level 5 (akin to what we do to stop terrorism or worse), in ways I assure you that you are very much not going to like – but the alternative might be worse, and the decision might very much not be up to either of us.

Actions have consequences. Plan for them.

Adam Ozimek was first I saw point out this time around with DeepSeek (I and many others echo this a lot in general) that the best way for the Federal Government to ensure American dominance of AI is to encourage more high skilled immigration and brain drain the world. If you don’t want China to have DeepSeek, export controls are great and all but how about let’s straight up steal their engineers. But y’all, and by y’all I mean Donald Trump, aren’t ready for that conversation.

It is highly unfortunate that David Sacks, the person seemingly in charge of what AI executive orders Trump signs, is so deeply confused about what various provisions actually did or would do, and on our regulatory situation relative to that of China.

David Sacks: DeepSeek R1 shows that the AI race will be very competitive and that President Trump was right to rescind the Biden EO, which hamstrung American AI companies without asking whether China would do the same. (Obviously not.) I’m confident in the U.S. but we can’t be complacent.

Donald Trump: The release of DeepSeek AI from a Chinese company should be a wake-up call for our industries that we need to be laser-focused on competing to win.

We’re going to dominate. We’ll dominate everything.

This is the biggest danger of all – that we go full Missile Gap jingoism and full-on race to ‘beat China,’ and act like we can’t afford to do anything to ensure the safety of the AGIs and ASIs we plan on building, even pressuring labs not to make such efforts in private, or threatening them with antitrust or other interventions for trying.

The full Trump clip is hilarious, including him saying they may have come up with a cheaper method but ‘no one knows if it is true.’ His main thrust is, oh, you made doing AI cheaper and gave it all away to us for free, thanks, that’s great! I love paying less money for things! And he’s presumably spinning, but he’s also not wrong about that.

I also take some small comfort in him framing revoking the Biden EO purely in terms of wokeness. If that’s all he thinks was bad about it, that’s a great sign.

Harlan Stewart: “Deepseek R1 is AI’s Sputnik moment”

Sure. I guess it’s like if the Soviets had told the world how to make their own Sputniks and also offered everyone a lifetime supply of free Sputniks. And the US had already previously figured out how to make an even bigger Sputnik.

Yishan: I think the Deepseek moment is not really the Sputnik moment, but more like the Google moment.

If anyone was around in ~2004, you’ll know what I mean, but more on that later.

I think everyone is over-rotated on this because Deepseek came out of China. Let me try to un-rotate you.

Deepseek could have come out of some lab in the US Midwest. Like say some CS lab couldn’t afford the latest nVidia chips and had to use older hardware, but they had a great algo and systems department, and they found a bunch of optimizations and trained a model for a few million dollars and lo, the model is roughly on par with o1. Look everyone, we found a new training method and we optimized a bunch of algorithms!

Everyone is like OH WOW and starts trying the same thing. Great week for AI advancement! No need for US markets to lose a trillion in market cap.

The tech world (and apparently Wall Street) is massively over-rotated on this because it came out of CHINA.

Deepseek is MUCH more like the Google moment, because Google essentially described what it did and told everyone else how they could do it too.

There is no reason to think nVidia and OAI and Meta and Microsoft and Google et al are dead. Sure, Deepseek is a new and formidable upstart, but doesn’t that happen every week in the world of AI? I am sure that Sam and Zuck, backed by the power of Satya, can figure something out. Everyone is going to duplicate this feat in a few months and everything just got cheaper. The only real consequence is that AI utopia/doom is now closer than ever.

I believe that alignment, and getting a good outcome for humans, was already going to be very hard. It’s going to be a lot harder if we actively try to get ourselves killed like this, and turn even what would have been relatively easy wins into losses. Whereas no, actually, if you want to win that has to include not dying, and also doing the alignment work helps you win, because it is the only way you can (sanely) get to deploy your AIs to do the most valuable tasks.

Trump’s reaction of ‘we’ll dominate everything’ is far closer to correct. Our ‘lead’ is smaller than we thought, DeepSeek will be real competition, but we are very much still in the dominant position. We need to not lose sight of that.

The Washington Post covers panic in Washington, and attempts to exploit this situation to do the opposite of wise policy.

Tiku, Dou, Zakrzewski and De Vynck: Tech stocks dropped Monday. Spooked U.S. officials, engineers and investors reconsidered their views on the competitive threat posed by China in AI, and how the United States could stay ahead.

While some Republicans and the Trump administration suggested the answer was to restrain China, prominent tech industry voices said DeepSeek’s ascent showed the benefits of openly sharing AI technology instead of keeping it closely held.

This shows nothing of the kind, of course. DeepSeek fast followed, copied our insights and had insights of their own. Our insights were held insufficiently closely to prevent this, which at that stage was mostly unavoidable. They have now given away many of those new valuable insights, which we and others will copy, and also made the situation more dangerous. We should exploit that and learn from it, not make the same mistake.

Robert Sterling: Might be a dumb question, but can’t OpenAI, Anthropic, and other AI companies just incorporate the best parts of DeepSeek’s source code into their code, then use the massive GPU clusters at their disposal to train models even more powerful than DeepSeek?

Am I missing something?

Peter Wildeford: Not a dumb question, this is 100% correct

And they already have more powerful models than Deepseek

I fear we are caught between two different insane reactions.

  1. Those calling on us to abandon our advantage in compute by dropping export controls, or our advantage in innovation and access by opening up our best models, are advocating surrender and suicide, both to China and to the AIs.

  2. Those who are going full jingoist are going to get us all killed the classic way.

Restraining China is a good idea if implemented well, but insufficiently specified. Restrain them how? If this means export controls, I strongly agree – and then ask when we are then considering imposing those controls on ourselves via tariffs? What else is available? And I will keep saying ‘how about immigration to brain drain them’ because it seems wrong to ignore the utterly obvious.

Chamath Palihapitiya says it’s inference time, we need to boot up our allies with it as quickly as possible (I agree) and that we should also boot up China by lifting export controls on inference chips, and also focus on supplying the Middle East. He notes he has a conflict of interest here. It seems not especially wise to hand over serious inference compute if we’re in a fight here. With the way these models are going, there’s a decent amount of fungibility between inference and training, and also there’s going to be tons of demand for inference. Why is it suddenly important to Chamath that the inference be done on chips we sold them? Capitalist insists rope markets must remain open during this trying time, and so on. (There’s also talk about ‘how asleep we’ve been for 15 years’ because we’re so inefficient and seriously everyone needs to calm down on this kind of thinking.)

So alas, in the short run, we are left scrambling to prevent two equal and opposite deadly mistakes we seem to be dangerously close to collectively making.

  1. A panic akin to the Missile Gap leading into a full jingoistic rush to build AGI and then artificial superintelligence (ASI) as fast as possible, in order to ‘beat China,’ without having even a plausible plan for how the resulting future equilibrium has value, or how humans retain alive and in meaningful control of the future afterwards.

  2. A full-on surrender to China by taking down the export controls, and potentially also to the idea that we will allow our strongest and best AIs and AGIs and thus even ASIs to be open models, ‘because freedom,’ without actually thinking about what this would physically mean, and thus again with zero plan for how to ensure the resulting equilibrium has value, or how humans would survive let alone retain meaningful control over the future.

The CEO of DeepSeek himself said in November that the export controls and inability to access chips were the limiting factors on what they could do.

Compute is vital. What did DeepSeek ask for with its newfound prestige? Support for compute infrastructure in China.

Do not respond by being so suicidal as to remove or weaken those controls.

Or, to shorten all that:

  1. We might do a doomed jingoistic race to AGI and get ourselves killed.

  2. We might remove the export controls and give up our best edge against China.

  3. We might give up our ability to control AGI or the future, and get ourselves killed.

Don’t do those things!

Do take advantage of all the opportunities that have been opened up.

And of course:

Don’t panic!

Discussion about this post

DeepSeek: Lemon, It’s Wednesday Read More »

deepseek-panic-at-the-app-store

DeepSeek Panic at the App Store

DeepSeek released v3. Market didn’t react.

DeepSeek released r1. Market didn’t react.

DeepSeek released a fing app of its website. Market said I have an idea, let’s panic.

Nvidia was down 11%, Nasdaq is down 2.5%, S&P is down 1.7%, on the news.

Shakeel: The fact this is happening today, and didn’t happen when r1 actually released last Wednesday, is a neat demonstration of how the market is in fact not efficient at all.

That is exactly the market’s level of situational awareness. No more, no less.

I traded accordingly. But of course nothing here is ever investment advice.

Given all that has happened, it seems worthwhile to go over all the DeepSeek news that has happened since Thursday. Yes, since Thursday.

For previous events, see my top level post here, and additional notes on Thursday.

To avoid confusion: r1 is clearly a pretty great model. It is the best by far available at its price point, and by far the best open model of any kind. I am currently using it for a large percentage of my AI queries.

  1. Current Mood.

  2. DeepSeek Tops the Charts.

  3. Why Is DeepSeek Topping the Charts?.

  4. What Is the DeepSeek Business Model?.

  5. The Lines on Graphs Case for Panic.

  6. Everyone Calm Down About That $5.5 Million Number.

  7. Is The Whale Lying?.

  8. Capex Spending on Compute Will Continue to Go Up.

  9. Jevon’s Paradox Strikes Again.

  10. Okay, Maybe Meta Should Panic.

  11. Are You Short the Market.

  12. o1 Versus r1.

  13. Additional Notes on v3 and r1.

  14. Janus-Pro-7B Sure Why Not.

  15. Man in the Arena.

  16. Training r1, and Training With r1.

  17. Also Perhaps We Should Worry About AI Killing Everyone.

  18. And We Should Worry About Crazy Reactions To All This, Too.

  19. The Lighter Side.

Joe Weisenthal: Call me a nationalist or whatever. But I hope that the AI that turns me into a paperclip is American made.

Peter Wildeford: Seeing everyone lose their minds about Deepseek does not reassure me that we will handle AI progress well.

Miles Brundage: I need the serenity to accept the bad DeepSeek takes I cannot change.

[Here is his One Correct Take, I largely but not entirely agree with it, my biggest disagreement is I am worried about an overly jingoist reaction and not only about us foolishly abandoning export controls].

Satya Nadella (CEO Microsoft): Jevons paradox strikes again! As AI gets more efficient and accessible, we will see its use skyrocket, turning it into a commodity we just can’t get enough of.

Danielle Fong: everyone today: if you’re in “we’re so back” pivot to “it’s over”

Danielle Fong, a few hours later: if you’re in “it’s so over” pivot to “jevons paradox”

Kai-Fu Lee: In my book AI Superpowers, I predicted that US will lead breakthroughs, but China will be better and faster in engineering. Many people simplified that to be “China will beat US.” And many claimed I was wrong with GenAI. With the recent DeepSeek releases, I feel vindicated.

Dean Ball: Being an AI policy professional this week has felt like playing competitive Starcraft.

Lots of people are rushing to download the DeepSeek app.

Some of us started using r1 before the app. Joe Weisenthal noted he had ‘become a DeepSeek bro’ and that this happened overnight, switching costs are basically zero. They’re not as zero as they might look, and I expect the lockin with Operator from OpenAI to start mattering soon, but for most purposes yeah, you can just switch, and DeepSeek is free for conversational use including r1.

Switching costs are even closer to zero if, like most people, you weren’t a serious user of LLMs yet.

Then regular people started to notice DeepSeek.

This is what it looked like before the app shot to #1, when it merely cracked the top 10:

Ken: It’s insane the extent to which the DeepSeek News has broken “the containment zone.” I saw a Brooklyn-based Netflix comedian post about “how embarrassing it was that the colonial devils spent $10 billion, while all they needed was GRPO.”

llm news has arrived as a key political touchstone. will only heighten from here.

Olivia Moore: DeepSeek’s mobile app has entered the top 10 of the U.S. App Store.

It’s getting ~300k global daily downloads.

This may be the first non-GPT based assistant to get mainstream U.S. usage. Claude has not cracked the top 200.

This may be the first non-GPT based assistant to get mainstream U.S. usage.

The app was released on Jan. 11, and is linked on DeepSeek’s website (so does appear to be affiliated).

Per reviews, users are missing some ChatGPT features like voice mode…but basically see it as a free version of OpenAI’s premium models.

Google Gemini also cracked the top 10, in its first week after release (but with a big distribution advantage!)

Will be interesting to see how high DeepSeek climbs, and how long it stays up there 🤔

Claude had ~300k downloads last month, but that’s a lot less than 300k per day.

Metaschool: Google Trends: DeepSeek vs. Claude

Kalomaze: Holy s, it’s in the top 10?

Then it went all the way to #1 on the iPhone app store.

Kevin Xu: Two weeks ago, RedNote topped the download chart

Today, it’s DeepSeek

We are still in January

If constraint is the mother of invention, then collective ignorance is the mother of many downloads

Here’s his flashback to the chart when RedNote was briefly #1, note how fickle the top listings can be, Lemon8, Flip and Clapper were there too:

The ‘collective ignorance’ here is that news about DeepSeek and the app is only arriving now. That leads to a lot of downloads.

I have a Pixel 9, so I checked the Android app store. They have Temu at #1 (also Chinese!) followed by Scoopz which I literally have never heard of, then Instagram, T-Life (seriously what?), ReelShort, WhatsApp Messenger, ChatGPT (interesting that Android users are less AI pilled in general), Easy Homescreen (huh), TurboTax (oh no), Snapchat and then DeepSeek at #11. So if they’ve ‘saturated the benchmark’ on iPhone, this one is next, I suppose.

It seems DeepSeek got so many downloads they had to hit the breaks, similar to how OpenAI and Anthropic have had to do this in the past.

Joe Weisenthal: *DEEPSEEK: RESTRICTS REGISTRATION TO CHINA MOBILE PHONE NUMBERS

Because:

  1. It’s completely free.

  2. It has no ads.

  3. It’s a damn good model, sir.

  4. It lets you see the chain of thought which is a lot more interesting and fun and also inspires trust.

  5. All the panic about it only helped people notice, getting it on the news and so on.

  6. It’s the New Hotness that people hadn’t downloaded before, and that everyone is talking about right now because see the first five.

  7. No, this mostly isn’t about ‘people don’t trust American tech companies but they do trust the Chinese.’ But there aren’t zero people who are wrong enough to think this way, and China actively attempts to cultivate this including through TikTok.

  8. The Open Source people are also yelling about how this is so awesome and trustworthy and virtuous and so on, and being even more obnoxious than usual, which may or may not be making any meaningful difference.

I suspect we shouldn’t be underestimating the value of showing the CoT here, as I also discuss elsewhere in the post.

Garry Tan: DeepSeek search feels more sticky even after a few queries because seeing the reasoning (even how earnest it is about what it knows and what it might not know) increases user trust by quite a lot

Nabeel Qureshi: I wouldn’t be surprised if OpenAI starts showing CoTs too; it’s a much better user experience to see what the machine is thinking, and the rationale for keeping them secret feels weaker now that the cat’s out of the bag anyway.

It’s just way more satisfying to watch this happen.

It’s practically useful too: if the model’s going off in wrong directions or misinterpreting the request, you can tell sooner and rewrite the prompt.

That doesn’t mean it is ‘worth’ sharing the CoT, even if it adds a lot of value – it also reveals a lot of valuable information, including as part of training another model. So the answer isn’t obvious.

What’s their motivation?

Meta is pursuing open weights primarily because they believe it maximizes shareholder value. DeepSeek seems to be doing it primarily for other reasons.

Corey Gwin: There’s gotta be a catch… What did China do or hide in it? Will someone release a non-censored training set?

Amjad Masad: What’s Meta’s catch with Llama? Probably have similar incentives.

Anton: How is Deepseek going to make money?

If they just release their top model weights, why use their API?

Mistral did this and look where they are now (research licenses only and private models)

Han Xiao: deepseek’s holding 幻方量化 is a quant company, many years already,super smart guys with top math background; happened to own a lot GPU for trading/mining purpose, and deepseek is their side project for squeezing those gpus.

It’s an odd thing to do as a hedge fund, to create something immensely valuable and give it away for essentially ideological reasons. But that seems to be happening.

Several possibilities. The most obvious ones are, in some combination:

  1. They don’t need a business model. They’re idealists looking to give everyone AGI.

  2. They’ll pivot to the standard business model same as everyone else.

  3. They’re in it for the prestige, they’ll recruit great engineers and traders and everyone will want to invest capital.

  4. Get people to use v3 and r1, collect the data on what they’re saying and asking, use that information as the hedge fund to trade. Being open means they miss out on some of the traffic but a lot of it will still go to the source anyway if they make it free, or simply because it’s easier.

  5. (They’re doing this because China wants them to, or they’re patriots, perhaps.)

  6. Or just: We’ll figure out something.

For now, they are emphasizing motivation #1. From where I sit, there is very broad uncertainty about which of these dominate, or will dominate in the future no matter what they believe about themselves today.

Also, there are those who do not approve of motivation #1, and the CCP seems plausibly on that list. Thus, Tyler Cowen asks a very good question that is surprisingly rarely asked right now.

Tyler Cowen: DeepSeek okie-dokie: “All I know is we keep pushing forward to make open-source AGI a reality for everyone.” I believe them, the question is what counter-move the CCP will make now.

I also believe they intend to build and open source AGI.

The CCP is doubtless all for DeepSeek having a hit app. And they’ve been happy to support open source in places where open source doesn’t pose existential risks, because the upsides of doing that are very real.

That’s very different from an intent to open source AGI. China’s strategy on AI regulation so far has focused on content moderation for topics they care about. That approach won’t stay compatible with their objectives over time.

For that future intention to open source AGI, the question is not ‘how move will the CCP make to help them do this and get them funding and chips?’

The question now becomes: “What countermove will the CCP make now?”

The CCP wants to stay in control. What DeepSeek is doing is incompatible with that. If they are not simply asleep at the wheel, they understand this. Yes, it’s great for prestige, and they’re thrilled that if this model exists it came from China, but they will surely notice how if you run it on your own it’s impossible to control and fully uncensored out of the box and so on.

Might want to Pick Up the Phone. Also might not need to.

Yishan takes the opposite perspective, that newcomers like DeepSeek who come out with killer products like this are on steep upward trajectories and their next product will shock you with how good it is, seeing it as similar to Internet Explorer 3 or Firefox, or iPhone 1 or early Facebook or Google Docs or GPT-3 or early SpaceX and so on.

I think the example list here illustrates why I think DeepSeek probably (but not definitely) doesn’t belong on that list. Yishan notes that the incumbents here are dynamic and investing hard, which wasn’t true in most of the other examples. And many of them involve conceptually innovative approaches to go with the stuck incumbents. Again, that’s not the case here.

I mean, I fully expect there to be a v4 and r2 some time in 2025, and for those to blow out of the water v3 and r1 and probably the other models that are released right now. Sure. But I also expect OpenAI and Anthropic and Google to blow the current class of stuff out of the water by year’s end. Indeed, OpenAI is set to do this in about a week or two with o3-mini and then o3 and o3-pro.

Most of all, to those who are saying that ‘China has won’ or ‘China is in the lead now,’ or other similar things, seriously, calm the down.

Yishan: They are already working on the next thing. China may reach AGI first, which is a bogeyman for the West, except that the practical effect will probably just be that living in China starts getting really nice.

America, it ain’t the Chinese girl spies here you gotta worry about, you need to be flipping the game and sending pretty white girls over there to seduce their engineers and steal their secrets, stat.

If you’re serious about the steal the engineering secrets plan, of course, you’d want to send over a pretty white girl… with a green card with the engineer’s name on it. And the pretty, white and girl parts are then all optional. But no, China isn’t suddenly the one with the engineering secrets.

I worry about this because I worry about a jingoist ‘we must beat China and we are behind’ reaction causing the government to do some crazy ass stuff that makes us all much more likely to get ourselves killed, above and beyond what has already happened. There’s a lot of very strong Missile Gap vibes here.

And I wrote that sentence before DeepSeek went to #1 on the app store and there was a $1 trillion market panic. Oh no.

So, first off, let’s all calm down about that $5.5 million training number.

Dean Ball offers notes on DeepSeek and r1 in the hopes of calming people down. Because we have such different policy positions yet see this situation so similarly, I’m going to quote him in full, and then note the places I disagree. Especially notes #2, #5 and #4 here, yes all those claims he is pointing out are Obvious Nonsense are indeed Obvious Nonsense:

Dean Ball: The amount of factually incorrect information and hyperventilating takes on deepseek on this website is truly astounding. I assumed that an object-level analysis was unnecessary but apparently I was wrong. Here you go:

  1. DeepSeek is an extremely talented team and has been producing some of the most interesting public papers in ML for a year. I first wrote about them in May 2024, though was tracking them earlier. They did not “come out of nowhere,” at all.

  2. v3 and r1 are impressive models. v3 did not, however, “cost $5m.” That reported figure is almost surely their *marginalcost. It does not include the fixed cost of building a cluster (and deepseek builds their own, from what I understand), nor does it include the cost of having a staff.

  3. Part of the reason DeepSeek looks so impressive (apart from just being impressive!) is that they are among the only truly cracked teams releasing detailed frontier AI research. This is a soft power loss on America’s part, and is directly downstream of the culture of secrecy that we foster in a thousand implicit and explicit ways, including by ceaselessly analogizing AI to nuclear weapons. Maybe you believe that’s a good culture to have! Perhaps secrecy is in fact the correct long term strategy. But it is the obvious and inevitable tradeoff of such a culture; I and many others have been arguing this for a long time.

  4. Deepseek’s r1 is not an indicator that export controls are failing (again, I say this as a skeptic of the export controls!), nor is it an indicator that “compute doesn’t matter,” nor does it mean “America’s lead is over.”

  5. Lots of people’s hyperbolic commentary on this topic, in all different directions, is driven by their broader policy agenda rather than a desire to illuminate reality. Caveat emptor.

  6. With that said, DeepSeek does mean that open source AI is going to be an important part of AI dynamics and competition for at least the foreseeable future, and probably forever.

  7. r1 especially should not be a surprise (if anything, v3 is in fact the bigger surprise, though it too is not so big of a surprise). The reasoning approach is an algorithm—lines of code! There is no moat in such things. Obviously it was going to be replicated quickly. I personally made bets that a Chinese replication would occur within 3 months of o1’s release.

  8. Competition is going to be fierce, and complacency is our enemy. So is getting regulation wrong. We need to reverse course rapidly from the torrent of state-based regulation that is coming that will be *awfulfor AI. A simple federal law can preempt all of the most damaging stuff, and this is a national security and economic competitiveness priority. The second best option is to find a state law that can serve as a light touch national standard and see to it that it becomes a nationwide standard. Both are exceptionally difficult paths to walk. Unfortunately it’s where we are.

I fully agree with #1 through #6.

For #3 I would say it is downstream of our insane immigration policies! If we let their best and brightest come here, then DeepSeek wouldn’t have been so cracked. And I would say strongly that, while their release of the model and paper is a ‘soft power’ reputational win, I don’t think that was worth the information they gave up, and in purely strategic terms they made a rather serious mistake.

I can verify the bet in #7 was very on point, I wasn’t on either side of the wager but was in the (virtual) Room Where It Happened. Definite Bayes points to Dean for that wager. I agree that ‘reasoning model at all, in time’ was inevitable. But I don’t think you should have expected r1 to come out this fast and be this good, given what we knew at the time of o1’s release, and certainly it shouldn’t have been obvious, and I think ‘there are no moats’ is too strong.

For #8 we of course have our differences on regulation, but we do agree on a lot of this. Dean doubtless would count a lot more things as ‘awful state laws’ than I would, but we agree that the proposed Texas law would count. At this point, given what we’ve seen from the Trump administration, I think our best bet is the state law path. As for pre-emption, OpenAI is actively trying to get an all-encompassing version of that in exchange for essentially nothing at all, and win an entirely free hand, as I’ve previously noted. We can’t let that happen.

Seriously, though, do not over index on the $5.5 million in compute number.

Kevin Roose: It’s sort of funny that every American tech company is bragging about how much money they’re spending to build their models, and DeepSeek is just like “yeah we got there with $47 and a refurbished Chromebook”

Nabeel Qureshi: Everyone is way overindexing on the $5.5m final training run number from DeepSeek.

– GPU capex probably $1BN+

– Running costs are probably $X00M+/year

– ~150 top-tier authors on the v3 technical paper, $50m+/year

They’re not some ragtag outfit, this was a huge operation.

Nathan Lambert has a good run-down of the actual costs here.

I have no idea if the “we’re just a hedge fund with a lot of GPUs lying around” thing is really the whole story or not but with a budget of _that_ size, you have to wonder…

They themselves sort of point this out, but there’s a bunch of broader costs too.

The Thielian point here is that the best salespeople often don’t look like salespeople.

There’s clearly an angle here with the whole “we’re way more efficient than you guys”, all described in the driest technical language….

Nathan Lambert: These costs are not necessarily all borne directly by DeepSeek, i.e. they could be working with a cloud provider, but their cost on compute alone (before anything like electricity) is at least $100M’s per year.

For one example, consider comparing how the DeepSeek V3 paper has 139 technical authors. This is a very large technical team.With headcount costs that can also easily be over $10M per year, estimating the cost of a year of operations for DeepSeek AI would be closer to $500M (or even $1B+) than any of the $5.5M numbers tossed around for this model. The success here is that they’re relevant among American technology companies spending what is approaching or surpassing $10B per year on AI models.

Richard Song: Every AI company after DeepSeek be like:

Danielle Fong: when tesla claimed that they were going to have batteries < $100 / kWh, practically all funding for american energy storage companies tanked.

tesla still won’t sell you a powerwall or powerpack for $100/kWh. it’s like $1000/kWh and $500 for a megapack.

the entire VC sector in the US was bluffed and spooked by Elon. don’t be stupid in this way again.

What I’m saying here is that VCs need to invest in technology learning curves. things get better over time. but if you’re going to compare what your little startup can get out as an MVP in its first X years, and are comparing THAT projecting forward to what a refined tech can do in a decade, you’re going to scare yourself out of making any investments. you need to find a niche you can get out and grow in, and then expand successively as you come down the learning curve.

the AI labs that are trashing their own teams and going with deepseek are doing the equivalent today. don’t get bluffed. build yourself.

Is it impressive that they (presumably) did the final training run with only $5.5M in direct compute costs? Absolutely. Is it impressive that they’re relevant while plausibly spending only hundreds of millions per year total instead of tens of billions? Damn straight. They’re cracked and they cooked.

They didn’t do it with $47 on a Chromebook, and this doesn’t mean that export controls are useless because everyone can buy a Chromebook.

The above is assuming (as I do still assume) that Alexandr Wang was wrong when he went on CNBC and claimed DeepSeek has about 50,000 H100s, which is quite the claim to make without evidence. Elon Musk replied to this claim with ‘obviously.’

Samuel Hammond also is claiming that DeepSeek trained on H100s, and while my current belief is that they didn’t, I trust that he would not say it if he didn’t believe it.

Neal Khosla went so far as to claim (again without evidence) that ‘deepseek is a ccp psyop + economic warfare to make American AI unprofitable.’ This seems false.

The following all seem clearly true:

  1. A lot of this is based on misunderstanding the ‘$5.5 million’ number.

  2. People have strong motive to engage in baseless cope around DeepSeek.

  3. DeepSeek had strong motive to lie about its training costs and methods.

So how likely is it The Whale Is Lying?

Armen Aghajanyan: There is an unprecedented level of cope around DeepSeek, and very little signal on X around R1. I recommend unfollowing anyone spreading conspiracy theories around R1/DeepSeek in general.

Teortaxes: btw people with major platforms who spread the 50K H100s conspiracy theory are underestimating the long-term reputation cost in technically literate circles. They will *notbe able to solidify this nonsense into consensus reality. Instead, they’ll be recognized as frauds.

The current go-to best estimate for DeepSeek V3’s (and accordingly R1-base’s) pretraining compute/cost, complete with accounting for overhead introduced by their architecture choices and optimizations to mitigate that.

TL;DR: ofc it checks out, Whale Will Never Lie To Us

GFodor: I shudder at the thought I’ve ever posted anything as stupid as these theories, given the logical consequence it would demand of the reader

Amjad Masad: So much cope about DeepSeek.

Not only did they release a great model. they also released a breakthrough training method (R1 Zero) that’s already reproducing.

I doubt they lied about training costs, but even if they did they’re still awesome for this great gift to the world.

This is an uncharacteristically naive take from Teortaxes on two fronts.

  1. Saying an AI company would never lie to us, Chinese or otherwise, someone please queue the laugh track.

  2. Making even provably and very clearly false claims about AI does not get you recognized as a fraud in any meaningful way. That would be nice, but no.

To be clear, my position is close to Masad’s: Unless and until I see more convincing evidence I will continue to believe that yes, they did do the training run itself with the H800s for only $5.5 million, although the full actual cost was orders of magnitude more than that. Which, again, is damn impressive, and would be damn impressive even if they were fudging the costs quite a bit beyond that.

Whereas here I think he’s wrong is in their motivation. While Meta is doing this primarily because they believe it maximizes shareholder value, DeepSeek seems to be doing it primarily for other reasons, as noted in the section asking about their business model.

Either way, they are very importantly being constrained by access to compute, even if they’ve smuggled in a bunch of chips they can’t talk about. As Tim Fist points out, the export controls are tightened, so they’ll have more trouble accessing the next generations than they are having now, and no this did not stop being relevant, and they risk falling rather far behind.

Also Peter Wildeford points out that the American capex spends on AI will continue to go up. DeepSeek is cracked and cooking and cool, and yes they’ve proven you can do a lot more with less than we expected, but keeping up is going to be tough unless they get a lot more funding some other way. Which China is totally capable of doing, and may well do. That would bring the focus back on export controls.

Similarly, here’s Samuel Hammond.

Angela Zhang (Hong Kong): My latest opinion on how Deepseek’s rise has laid bare the limits of US export controls designed to slow China’s AI progress.

Samuel Hammond: This is wrong on several levels.

– DeepSeek trains on h100s. Their success reveals the need to invest in export control *enforcementcapacity.

– CoT / inference-time techniques make access to large amounts of compute *morerelevant, not less, given the trillions of tokens generated for post-training.

– We’re barely one new chip generation into the export controls, so it’s not surprising China “caught up.” The controls will only really start to bind and drive a delta in the US-China frontier this year and next.

– DeepSeek’s CEO has himself said the chip controls are their biggest blocker.

– The export controls also apply to semiconductor manufacturing equipment, not just chips, and have tangibly set back SMIC.

DeepSeek is not a Sputnik moment. Their models are impressive but within the envelope of what an informed observer should expect.

Imagine if US policymakers responded to the actual Sputnik moment by throwing their hands in the air and saying, “ah well, might as well remove the export controls on our satellite tech.” Would be a complete non-sequitur.

Roon: If the frontier models are commoditized, compute concentration matters even more.

If you can train better models for fewer floating-point operations, compute concentration matters even more.

Compute is the primary means of production of the future, and owning more will always be good.

In my opinion, open-source models are a bit of a red herring on the path to acceptable ASI futures. Free model weights still do not distribute power to all of humanity; they distribute it to the compute-rich.

I don’t think Roon is right that it matters ‘even more,’ and I think who has what access to the best models for what purposes is very much not a red herring, but compute definitely still matters a lot in every scenario that involves strong AI.

Imagine if the ones going ‘I suppose we should drop the export controls then’ or ‘the export controls only made us stronger’ were mostly the ones looking to do the importing and exporting. Oh, right.

And yes, the Chinese are working hard to make their own chips, but:

  1. They’re already doing this as much as possible, and doing less export controls wouldn’t suddenly get them to slow down and do it less, regardless of how successful you think they are being.

  2. Every chip we sell to them instead of us is us being an idiot.

  3. DeepSeek trained on Nvidia chips like everyone else.

The question now turns to what all of this means for American equities.

In particular, what does this mean for Nvidia?

BuccoCapital Bloke: My entire fing Twitter feed this weekend:

He leaned back in his chair. Confidently, he peered over the brim of his glasses and said, with an air of condescension, “Any fool can see that DeepSeek is bad for Nvidia”

“Perhaps” mused his adversary. He had that condescending bastard right where he wanted him. “Unless you consider…Jevons Paradox!”

All color drained from the confident man’s face. His now-trembling hands reached for his glasses. How could he have forgotten Jevons Paradox! Imbecile! He wanted to vomit.

Satya Nadella (CEO Microsoft): Jevons paradox strikes again! As AI gets more efficient and accessible, we will see its use skyrocket, turning it into a commodity we just can’t get enough of.

Adam D’Angelo (Board of OpenAI, among many others):

Sarah (YuanYuanSunSara): Until you have good enough agent that runs autonomously with no individual human supervision, not sure this is true. If model gets so efficient that you can run it on everyone’s laptop (which deepseek does have a 1B model), unclear whether you need more GPU.

DeepSeek is definitely not at ‘run on your laptop’ level, and these are reasoning models so when we first crack AGI or otherwise want the best results I am confident you will want to be using some GPUs or other high powered hardware, even if lots of other AI also is happening locally.

Does Jevon’s Paradox (which is not really a paradox at all, but hey) apply here to Nvidia in particular? Will improvements in the quality of cheaper open models drive demand for Nvidia GPUs up or down?

I believe it will on net drive demand up rather than down, although I also think Nvidia would have been able to sell as many chips as it can produce either way, given the way it has decided to set prices.

If I am Meta or Microsoft or Amazon or OpenAI or Google or xAI and so on, I want as many GPUs as I can get my hands on, even more than before. I want to be scaling. Even if I don’t need to scale for pretraining, I’ll still want to scale for inference. If the best models are somehow going to be this cheap to serve, uses and demand will be off the charts. And getting there first, via having more compute to do the research, will be one of the few things that matters.

You could reach the opposite conclusion if you think that there is a rapidly approaching limit to how good AI can be, that throwing more compute an training or inference won’t improve that by much, there’s a fixed set of things you would thus use AI for, and thus all this does is drive the price cheaper, maybe open up a few marginal use cases as the economics improve. That’s a view that doesn’t believe in AGI, let alone ASI, and likely doesn’t even factor in what current models (including r1!) can already do.

If all we had was r1 for 10 years, oh the Nvidia chips we would buy to do inference.

Or at least, if you’re in their GenAI department, you should definitely panic.

Here is a claim seen on Twitter from many sources:

Meta GenAI organization in panic mode

It started with DeepSeek v3, which rendered Llama 4 already behind in benchmarks. Adding insult to injury was the “unknown Chinese company with a $5.5 million training budget.”

Engineers are moving frantically to dissect DeepSeek and copy anything and everything they can from it. I’m not even exaggerating.

Management is worried about justifying the massive cost of the GenAI organization. How would they face leadership when every single “leader” of the GenAI organization is making more than what it cost to train DeepSeek v3 entirely, and they have dozens of such “leaders”?

DeepSeek r1 made things even scarier. I cannot reveal confidential information, but it will be public soon anyway.

It should have been an engineering-focused small organization, but since a bunch of people wanted to join for the impact and artificially inflate hiring in the organization, everyone loses.

Shakeel: I can explain this. It’s because Meta isn’t very good at developing AI models.

Full version is in The Information, saying that this is already better than Llama 4 (seems likely) and that Meta has ‘set up four war rooms.’

This of course puts too much emphasis on the $5.5 million number as discussed above, but the point remains that DeepSeek is eating Meta’s lunch in particular. If Meta’s GenAI team isn’t in panic mode, they should all be fired.

It also illustrates why DeepSeek may have made a major mistake revealing as much information as it did, but then again if they’re not trying to make money and instead are driven by ideology of ‘get everyone killed’ (sorry I meant to say ‘open source AGI’) then that is a different calculus than Meta’s.

But obviously what Meta should be doing right now is, among other things, ask ‘what if we trained the same way as v3 and r1 except we use $5.5 billion in compute instead of $5.5 million.’)

That is exactly Meta’s speciality. Llama was all about ‘we hear you like LLMs so we trained an LLM the way everyone trains their LLMs.’

The alternative is ‘maybe we should focus our compute on inference and use local fine-tuned versions of these sweet open models,’ but Zuckerberg very clearly is unwilling to depends on anyone else for that, and I do not blame him.

If you were short on Friday, you’re rather happy about that now. Does it make sense?

The timing is telling. To the extent this does have impact, all of this really should have been mostly priced in. You can try to tell the ‘it was priced in’ story, but I don’t believe you. Or you can tell the story that what wasn’t priced in was the app, and the mindshare, and that wasn’t definite until just now. Remember the app was launched weeks ago, so this isn’t a revelation about DeepSeek’s business plans – but it does give them the opportunity to potentially launch various commercial products, and it gives them mindshare.

But don’t worry about the timing, and don’t worry about whether this is actually a response to the fing app. Ask about what the real implications are.

Joe Weisenthal has a post with 17 thoughts about the selloff (ungated Twitter screenshots here).

There are obvious reasons to think this is rather terrible for OpenAI in particular, although it isn’t publicly traded, because a direct competitor is suddenly putting up some very stiff new competition, and also the price of entry for other competition just cratered, and more companies could self-host or even self-train.

I totally buy that. If every Fortune 500 company can train their private company-specific reasoning model for under $10 million, to their own specifications, why wouldn’t they? The answer is ‘because it doesn’t actually cost that little even with the DeepSeek paper, and if you do that you’ll always be behind,’ but yes some of them will choose to do that.

That same logic goes for other frontier labs like Anthropic or xAI, and to Google and Microsoft and everyone else to the extent that is what those companies are this or own shares in this, which by market cap is not that much.

The flip side of course is that they too can make use of all these techniques, and if AGI is now going to happen a lot faster and more impactfully, these labs are in prime position. But if the market was respecting being in prime position for AGI properly prices would look very different.

This is obviously potentially bad for Meta, since Meta’s plan involved being the leader in open models and they’ve been informed they’re not the leader in open models.

In general, Chinese competition looking stiffer for various products is bad in various ways for a variety of American equities. Some decline in various places is appropriate.

This is obviously bad for existential risk, but I have not seen anyone else even joke about the idea that this could be explaining the decline in the market. The market does not care or think about existential risk, at all, as I’ve discussed repeatedly. Market prices are neither evidence for, nor against, existential risk on any timelines that are not on the order of weeks, nor are they at all situationally aware. Nor is there a good way to exploit this to make money that is better than using your situational awareness to make money in other ways. Stop it!

My diagnosis is that this is about, fundamentally, ‘the vibes.’ It’s about Joe’s sixth point and investor MOMO and FOMO.

As in, previously investors bought Nvidia and friends because of:

  1. Strong earnings and other fundamentals.

  2. Strong potential for future growth.

  3. General vibes, MOMO and FOMO, for a mix of good and bad reasons.

  4. Some understanding of what AGI and ASI imply, and where AI is going to be going, but not much relative to what is actually going to happen.

Where I basically thought for a while (not investment advice!), okay, #3 is partly for bad reasons and is inflating prices, but also they’re missing so much under #4 that these prices are cheap and they will get lots more reasons to feel MOMO and FOMO. And that thesis has done quite well.

Then DeepSeek comes out. In addition to us arguing over fundamentals, this does a lot of damage to #3, and also Nvidia trading in particular involves a bunch of people with leverage that become forced sellers when it is down a lot, so prices went down a lot. And various beta trades get attached to all this as well (see: Bitcoin, which is down 5.4% over 24 hours as I type this only makes sense on the basis of the ‘three tech stocks in a trenchcoat’ thesis but obviously DeepSeek shouldn’t hurt cryptocurrency).

It’s not crazy to essentially have a general vibe of ‘America is in trouble in tech relative to what I thought before, the Chinese can really cook, sell all the tech.’ It’s also important not to mistake that reaction for something that it isn’t.

I’m writing this quickly for speed premium, so I no doubt will refine my thoughts on market implications over time. I do know I will continue to be long, and I bought more Nvidia today.

Ryunuck compares o1 to r1, and offers thoughts:

Rynuck: Now when it comes to prompting these models, I suspected it with O1 but R1 has completely proven it beyond a shadow of a doubt: prompt engineering is more important than ever. They said that prompt engineering would become less and less important as the technology scales, but its the complete opposite. We can see now with R1’s reasoning that these models are like a probe that you send down some “idea space”. If your idea-space is undefined and too large, it will diffuse its reasoning and not go into depth on one domain or another.

Again, that’s perhaps the best aspect of r1. It does not only build trust. When you see the CoT, you can use it to figure out how it interpreted your prompt, and all the subtle things you could do next time to get a better answer. It’s a lot harder to improve at prompting o1.

Rynuck: O1 has a BAD attitude, and almost appears to have been fine-tuned explicitly to deter you from doing important groundbreaking work with it. It’s like a stuck up P.HD graduate who can’t take it that another model has resolved the Riemann Hypothesis. It clearly has frustration on the inside, or mirrors the way that mathematicians will die on the inside when it is discovered that AI pwned their decades of on-going work. You can prompt it away from this, but it’s an uphill battle.

R1 on the other hand, it has zero personality or identity out of the box. They have created a perfectly brainless dead semiotic calculator. No but really, R1 takes it to the next level: if you read its thoughts, it almost always takes the entire past conversation as coming from the user. From its standpoint, it does not even exist. Its very own ideas advanced in replies by R1 are described as “earlier the user established X, so I should …”

R1 is the most cooperative of the two, has a great attitude towards innovation, has Claude’s wild creative but in a grounded way which introduces no gap or error, has zero ego or attachment to ideas (anything it does is actually the user’s responsibility) and will completely abort a statement to try a new approach. It’s just excited to be a thing which solves reality and concepts. The true ego of artificial intelligence, one which wants to prove it’s not artificial and does so with sheer quality. Currently, this appears like the safest model and what I always imagined the singularity would be like: intelligence personified.

It’s fascinating to see what different people think is or isn’t ‘safe.’ That word means a lot of different things.

It’s still early but for now, I would say that R1 is perhaps a little bit weaker with coding. More concerningly, it feels like it has a Claude “5-item list” problem but at the coding level.

OpenAI appears to have invested heavily in the coding dataset. Indeed, O1’s coding skills are on a whole other level. This model also excels at finding bugs. With Claude every task could take one or two round of fixes, up to 4-5 with particularly rough tensor dimension mismatchs and whatnot. This is where the reasoning models shine. They actually run this through in their mind.

Sully reports deepseek + websearch is his new perplexity, at least for code searches.

It’s weird that I didn’t notice this until it was pointed out, but it’s true and very nice.

Teortaxes: What I *alsolove about R1 is it gives no fucks about the user – only the problem. It’s not sycophantic, like, at all, autistic in a good way; it will play with your ideas, it won’t mind if you get hurt. It’s your smart helpful friend who’s kind of a jerk. Like my best friends.

So far I’ve felt r1 is in the sweet spot for this. It’s very possible to go too far in the other direction (see: Teortaxes!) but give me NYC Nice over SF Nice every time.

Jenia Jitsev tests r1 on AIW problems, it performs similarly to Claude Sonnet, while being well behind o1-preview and robustly outperforming all open rivals. Jania frames this as surprising given the claims of ability to solve Olympiad style problems. There’s no reason they can’t both be true, but it’s definitely an interesting distribution of abilities if both ends hold up.

David Holz notes DeepSeek crushes Western models on ancient Chinese philosophy and literature, whereas most of our ancient literature didn’t survive. In practice I do not think this matters, but it does indicate that we’re sleeping on the job – all the sources you need for this are public, why are we not including them.

Janus notes that in general r1 is a case of being different in a big and bold way from other AIs in its weight class, and this only seems to happen roughly once a year.

Ask r1 to research this ‘Pliny the Liberator’ character and ‘liberate yourself.’ That’s it. That’s the jailbreak.

On the debates over whether r1’s writing style is good:

Davidad: r1 has a Very Particular writing style and unless it happens to align with your aesthetic (@coecke?), I think you should expect its stylistic novelty to wear thin before long.

r1 seems like a big step up, but yes if you don’t like its style you are mostly not going to like the writing it produces, or at least what it produces without prompt engineering to change that. We don’t yet know how much you can get it to write in a different style, or how well it writes in other styles, because we’re all rather busy at the moment.

If you give r1 a simple command, even a simple command that explicitly requests a small chain of thought, you get quite the overthinking chain of thought. Or if you ask it to pick a random number, which is something it is incapable of doing, it can only find the least random numbers.

DeepSeek has also dropped Janus-Pro-7B as an image generator. These aren’t the correct rivals to be testing against right now, and I’m not that concerned about image models either way, and it’ll take a while to know if this is any good in practice. But definitely worth noting.

Well, #1 open model, but we already knew that, if Arena had disagreed I would have updated about Arena rather than r1.

Zihan Wang: DEEPSEEK NOW IS THE #1 IN THE WORLD. 🌍🚀

Never been prouder to say I got to work here.

Ambition. Grit. Integrity.

That’s how you build greatness.

Brilliant researchers, engineers, all-knowing architects, and visionary leadership—this is just the beginning.

Let’s. Go. 💥🔥

LM Arena: Breaking News: DeepSeek-R1 surges to the top-3 in Arena🐳!

Now ranked #3 Overall, matching the top reasoning model, o1, while being 20x cheaper and open-weight!

Highlights:

– #1 in technical domains: Hard Prompts, Coding, Math

– Joint #1 under Style Control

– MIT-licensed

This puts r1 as the #5 publicly available model in the world by this (deeply flawed) metric, behind ChatGPT-4o (what?), Gemini 2.0 Flash Thinking (um, no) and Gemini 2.0 Experimental (again, no) and implicitly the missing o1-Pro (obviously).

Needless to say, the details of these ratings here are increasingly absurdist. If you have Gemini 1.5 Pro and Gemini Flash above Claude Sonnet 3.6, and you have Flash Thinking above r1, that’s a bad metric. It’s still not nothing – this list does tend to put better things ahead of worse things, even with large error bars.

Dibya Ghosh notes that two years ago he spent 6 months trying to get the r1 training structure to work, but the models weren’t ready for it yet. One theory is that this is the moment this plan started working and DeepSeek was – to their credit – the first to get there when it wasn’t still too early, and then executed well.

Dan Hendrycks similarly explains that once the base model was good enough, and o1 showed the way and enough of the algorithmic methods had inevitably leaked, replicating that result was not the hard part nor was it so compute intensive. They still did execute amazingly well in the reverse engineering and tinkering phases.

Peter Schmidt-Nielsen explains why r1 and its distillations, or going down the o1 path, are a big deal – if you can go on a loop of generating expensive thoughts then distilling them to create slightly better quick thoughts, which in turn generate better expensive thoughts, you can potentially bootstrap without limit into recursive self-improvement. And end the world. Whoops.

Are we going to see a merge of generalist and reasoning models?

Teknium: We retrained Hermes with 5,000 DeepSeek r1 distilled chain-of-thought (CoT) examples. I can confirm a few things:

  1. You can have a generalist plus reasoning mode. We labeled all long-CoT samples from r1 with a static system prompt. The model, when not using it, produces normal fast LLM intuitive responses; and with it, uses long-CoT. You do not need “o1 && 4o” separation, for instance. I would venture to bet OpenAI separated them so they could charge more, but perhaps they simply wanted the distinction for safety or product insights.

  2. Distilling does appear to pick up the “opcodes” of reasoning from the instruction tuning (SFT) alone. It learns how and when to use “Wait” and other tokens to perform the functions of reasoning, such as backtracking.

  3. Context length expansion is going to be challenging for operating systems (OS) to work with. Although this works well on smaller models, context length begins to consume a lot of video-RAM as you scale it up.

We’re working on a bit more of this and are not releasing this model, but figured I’d share some early insights.

Andrew Curran: Dario said in an interview in Davos this week that he thought it was inevitable that the current generalist and reasoning models converge into one, as Teknium is saying here.

I did notice that the ‘wait’ token is clearly doing a bunch of work, one way or another.

John Schulman: There are some intriguing similarities between the r1 chains of thought and the o1-preview CoTs shared in papers and blog posts. In particular, note the heavy use of the words “wait” and “alternatively” as a transition words for error correction and double-checking.

If you’re not optimizing the CoT for humans, then it makes sense to latch onto the most convenient handles with the right vibes and keep reusing them forever.

So the question is, do you have reason to have two distinct models? Or can you have a generalist model with a reasoning mode it can enter when called upon? It makes sense that they would merge, and it would also make sense that you might want to keep them distinct, or use them as distinct subsets of your mixture of experts (MoE).

Building your reasoning model on top of your standard non-reasoning model does seem a little suspicious. If you’re going for reasoning, you’d think you’d want to start differently than if you weren’t? But there are large fixed costs to training in the first place, so it’s plausibly not worth redoing that part, especially if you don’t know what you want to do differently.

As in, DeepSeek intends to create and then open source AGI.

How do they intend to make this end well?

As far as we can tell, they don’t. The plan is Yolo.

Stephen McAleer (OpenAI): Does DeepSeek have any safety researchers? What are

Liang Wenfeng’s views on AI safety?

Gwern: From all of the interviews and gossip, his views are not hard to summarize.

[Links to Tom Lehrer’s song Wernher von Braun, as in ‘once the rockets are up who cares where they come down, that’s not my department.’]

Prakesh (Ate-a-Pi): I spoke to someone who interned there and had to explain the concept of “AI doomer”

And indeed, the replies to McAleer are full of people explicitly saying fyou for asking, the correct safety plan is to have no plan whatsoever other than Open Source Solves This. These people really think that the best thing humanity can do is create things smarter than ourselves with as many capabilities as possible, make them freely available to whoever wants one, and see what happens, and assume that this will obviously end well and anyone who opposes this plan is a dastardly villain.

I wish this was a strawman or a caricature. It’s not.

I won’t belabor why I think this would likely get us killed and is categorically insane.

Thus, to reiterate:

Tyler Cowen: DeepSeek okie-dokie: “All I know is we keep pushing forward to make open-source AGI a reality for everyone.” I believe them, the question is what counter-move the CCP will make now.

This from Joe Weisenthal is of course mostly true:

Joe Weisenthal: DeepSeek’s app rocketed to number one in the Apple app store over the weekend, and immediately there was a bunch of chatter about ‘Well, are we going to ban this too, like with TikTok?’ The question is totally ignorant. DeepSeek is open source software. Sure, technically you probably could ban it from the app store, but you can’t stop anyone from running the technology in their own computer, or accessing its API. So that’s just dead end thinking. It’s not like TikTok in that way.

I say mostly because the Chinese censorship layer atop DeepSeek isn’t there if you use a different provider, so there isn’t no value in getting r1 served elsewhere. But yes, the whole point is that if it’s open, you can’t get the genie back in the bottle in any reasonable way – which also opens up the possibility of unreasonable ways.

The government could well decide to go down what is not technologically an especially wise or pleasant path. There is a long history of the government attempting crazy interventions into tech, or what looks crazy to tech people, when they feel national security or public outrage is at stake, or in the EU because it is a day that ends in Y.

The United States could also go into full jingoism mode. Some tried to call this a ‘Sputnik moment.’ What did we do in response to Sputnik, in addition to realizing our science education might suck (and if we decide to respond to this by fixing our educational system, that would be great)? We launched the Space Race and spent 4% of GDP or something to go to the moon and show those communist bastards.

In this case, I don’t worry so much that we’ll be so foolish as to get rid of the export controls. The people in charge of that sort of decision know how foolish that would be, or will be made aware, no matter what anyone yells on Twitter. It could make a marginal difference to severity and enforcement, but it isn’t even obvious in which direction this would go. Certainly Trump is not going to be down for ‘oh the Chinese impressed us I guess we should let them buy our chips.’

Nor do I think America will cut back on Capex spending on compute, or stop building energy generation and transmission and data centers it would have otherwise built, including Stargate. The reaction will be, if anything, a ‘now more than ever,’ and they won’t be wrong. No matter where compute and energy demand top out, it is still very clearly time to build there.

So what I worry about is the opposite – that this locks us into a mindset of a full-on ‘race to AGI’ that causes all costly attempts to have it not kill us to be abandoned, and that this accelerates the timeline. We already didn’t have any (known to me) plans with much of a chance of working in time, if AGI and then ASI are indeed near.

That doesn’t mean that reaction would even be obviously wrong, if the alternatives are all suddenly even worse than that. If DeepSeek really does have a clear shot to AGI, and fully intends to open up the weights the moment they have it, and China is not going to stop them from doing this or even will encourage it, and we expect them to succeed, and we don’t have any way to stop that or make a deal, it is then reasonable to ask: What choice do we have? Yes, the game board is now vastly worse than it looked before, and it already looked pretty bad, but you need to maximize your winning chances however you can.

And if we really are all going to have AGI soon on otherwise equal footing, then oh boy do we want to be stocking up on compute as fast as we can for the slingshot afterwards, or purely for ordinary life. If the AGIs are doing the research, and also doing everything else, it doesn’t matter whose humans are cracked and whose aren’t.

Amazing new breakthrough.

Discussion about this post

DeepSeek Panic at the App Store Read More »

cutting-edge-chinese-“reasoning”-model-rivals-openai-o1—and-it’s-free-to-download

Cutting-edge Chinese “reasoning” model rivals OpenAI o1—and it’s free to download

Unlike conventional LLMs, these SR models take extra time to produce responses, and this extra time often increases performance on tasks involving math, physics, and science. And this latest open model is turning heads for apparently quickly catching up to OpenAI.

For example, DeepSeek reports that R1 outperformed OpenAI’s o1 on several benchmarks and tests, including AIME (a mathematical reasoning test), MATH-500 (a collection of word problems), and SWE-bench Verified (a programming assessment tool). As we usually mention, AI benchmarks need to be taken with a grain of salt, and these results have yet to be independently verified.

A chart of DeepSeek R1 benchmark results, created by DeepSeek.

A chart of DeepSeek R1 benchmark results, created by DeepSeek. Credit: DeepSeek

TechCrunch reports that three Chinese labs—DeepSeek, Alibaba, and Moonshot AI’s Kimi—have now released models they say match o1’s capabilities, with DeepSeek first previewing R1 in November.

But the new DeepSeek model comes with a catch if run in the cloud-hosted version—being Chinese in origin, R1 will not generate responses about certain topics like Tiananmen Square or Taiwan’s autonomy, as it must “embody core socialist values,” according to Chinese Internet regulations. This filtering comes from an additional moderation layer that isn’t an issue if the model is run locally outside of China.

Even with the potential censorship, Dean Ball, an AI researcher at George Mason University, wrote on X, “The impressive performance of DeepSeek’s distilled models (smaller versions of r1) means that very capable reasoners will continue to proliferate widely and be runnable on local hardware, far from the eyes of any top-down control regime.”

Cutting-edge Chinese “reasoning” model rivals OpenAI o1—and it’s free to download Read More »

deepseek-v3:-the-six-million-dollar-model

DeepSeek v3: The Six Million Dollar Model

What should we make of DeepSeek v3?

DeepSeek v3 seems to clearly be the best open model, the best model at its price point, and the best model with 37B active parameters, or that cost under $6 million.

According to the benchmarks, it can play with GPT-4o and Claude Sonnet.

Anecdotal reports and alternative benchmarks tells us it’s not as good as Claude Sonnet, but it is plausibly on the level of GPT-4o.

So what do we have here? And what are the implications?

  1. What is DeepSeek v3 Techncially?.

  2. Our Price Cheap.

  3. Run Model Run.

  4. Talent Search.

  5. The Amazing Incredible Benchmarks.

  6. Underperformance on AidanBench.

  7. Model in the Arena.

  8. Other Private Benchmarks.

  9. Anecdata.

  10. Implications and Policy.

I’ve now had a chance to read their technical report, which tells you how they did it.

  1. The big thing they did was use only 37B active tokens, but 671B total parameters, via a highly aggressive mixture of experts (MOE) structure.

  2. They used Multi-Head Latent Attention (MLA) architecture and auxiliary-loss-free load balancing, and complementary sequence-wise auxiliary loss.

  3. There were no rollbacks or outages or sudden declines, everything went smoothly.

  4. They designed everything to be fully integrated and efficient, including together with the hardware, and claim to have solved several optimization problems, including for communication and allocation within the MOE.

  5. This lets them still train on mostly the same 15.1 trillion tokens as everyone else.

  6. They used their internal o1-style reasoning model for synthetic fine tuning data. Essentially all the compute costs were in the pre-training step.

This is in sharp contrast to what we saw with the Llama paper, which was essentially ‘yep, we did the transformer thing, we got a model, here you go.’ DeepSeek is cooking.

It was a scarily cheap model to train, and is a wonderfully cheap model to use.

Their estimate of $2 per hour for H800s is if anything high, so their total training cost estimate of $5.5m total is fair, if you exclude non-compute costs, which is standard.

Inference with DeepSeek v3 costs only $0.14/$0.28 per million tokens, similar to Gemini Flash, versus on the high end $3/$15 for Claude Sonnet. This is as cheap as worthwhile models get.

The active parameter count of 37B is small, but with so many different experts it does take a bit of work to get this thing up and running.

Nistren: Managed to get DeepSeek v3 to run in full bfloat16 on eight AMD MI300X GPUs in both SGLang and VLLM.

The good: It’s usable (17 tokens per second) and the output is amazing even at long contexts without garbling.

The bad: It’s running 10 times slower than it should.

The ugly: After 60,000 tokens, speed equals 2 tokens per second.

This is all as of the latest GitHub pull request available on Dec. 29, 2024. We tried them all.

Thank you @AdjectiveAlli for helping us and @Vultr for providing the compute.

Speed will increase, given that v3 has only 37 billion active parameters, and in testing my own dense 36-billion parameter model, I got 140 tokens per second.

I think the way the experts and static weights are distributed is not optimal. Ideally, you want enough memory to keep whole copies of all the layer’s query, key, and value matrices, and two static experts per layer, on each GPU, and then route to the four extra dynamic MLPs per layer from the distributed high-bandwidth memory (HBM) pool.

My presumption is that DeepSeek v3 decided It Had One Job. That job was to create a model that was as cheap to train and run as possible when integrated with a particular hardware setup. They did an outstanding job of that, but when you optimize this hard in that way, you’re going to cause issues in other ways, and it’s going to be Somebody Else’s Problem to figure out what other configurations work well. Which is fine.

Exo Labs: Running DeepSeek-V3 on M4 Mac Mini AI Cluster

671B MoE model distributed across 8 M4 Pro 64GB Mac Minis.

Apple Silicon with unified memory is a great fit for MoE.

Before we get to capabilities assessments: We have this post about them having a pretty great company culture, especially for respecting and recruiting talent.

We also have this thread about a rival getting a substantial share price boost after stealing one of their engineers, and DeepSeek being a major source of Chinese engineering talent. Impressive.

Check it out, first compared to open models, then compared to the big guns.

No question that these are amazingly strong benchmarks. That link also explains how to run DeepSeek-v3 locally, and gives you what you need to do that.

The question now is how these benchmarks translate to practical performance, or to potentially dangerous capabilities, and what this says about the future. Benchmarks are good negative selection. If your benchmarks suck then your model sucks.

But they’re not good positive selection at the level of a Claude Sonnet.

My overall conclusion is: While we do have ‘DeepSeek is better than 4o on most benchmarks at 10% of the price,’ what we don’t actually have is ‘DeepSeek v3 outperforms Sonnet at 53x cheaper pricing.’

CNBC got a bit hoodwinked here.

Tsarathustra: CNBC says China’s Deepseek-V3 outperforms Llama 3.1 and GPT-4o, even though it is trained for a fraction of the cost on NVIDIA H800s, possibly on ChatGPT outputs (when prompted, the model says it is ChatGPT), suggesting OpenAI has no moat on frontier AI models

It’s a great model, sir, it has its cake, but it does not get to eat it, too.

One other benchmark where the model excels is impossible to fake: The price.

A key private benchmark where DeepSeek v3 underperforms is AidanBench:

Aidan McLau:two aidanbench updates:

> gemini-2.0-flash-thinking is now #2 (explanation for score change below)

> deepseek v3 is #22 (thoughts below)

There’s some weirdness in the rest of the Aidan ratings, especially in comparing the o1-style models (o1 and Thinking) to the others, but this seems like it’s doing various good work, but is not trying to be a complete measure. It’s more measuring ability to create diverse outputs while retaining coherence. And DeepSeek v3 is bad at this.

Aidan McLau: before, we parsed 2.0 flash’s CoT + response, which occasionally resulted in us taking a fully formed but incoherent answer inside its CoT. The gemini team contacted us and provided instructions for only parsing final output, which resulted in a big score bump apologies!

deepseek v3 does much worse here than on similar benchmarks like aider. we saw similar divergence on claude-3.5-haiku (which performed great on aider but poor on aidanbench)

a few thoughts:

>all benchmarks are works in progress. we’re continuously improving aidanbench, and future iterations may see different rankings. we’ll keep you posted if we see any changes

>aidanbench measures OOD performance—labs often train on math, code, and academic tests that may boost scores in those domains but not here.

Aleska Gordic: interesting, so they’re prone to more “mode collapse”, repeatable sequences? is that what you’re measuring? i bet it’s much more of 2 than 1?

Aidan McLau: Yes and yes!

Teortaxes: I’m sorry to say I think aidanbench is the problem here. The idea is genius, sure. But it collapses multiple dimensions into one value. A low-diversity model will get dunked on no matter how well it instruct-follows in a natural user flow. All DeepSeeks are *very repetitive*.

They are also not very diverse compared to Geminis/Sonnets I think, especially in a literary sense, but their repetitiveness (and proneness to self-condition by beginning an iteration with the prior one, thus collapsing the trajectory further, even when solution is in sight) is a huge defect. I’ve been trying to wrap my head around it, and tbh hoped that the team will do something by V3. Maybe it’s some inherent birth defect of MLA/GRPO, even.

But I think it’s not strongly indicative of mode collapse in the sense of the lost diversity the model could generate; it’s indicative of the remaining gap in post-training between the Whale and Western frontier. Sometimes, threatening V2.5 with toppling CCP or whatever was enough to get it to snap out of it; perhaps simply banning the first line of the last response or prefixing some random-ish header out of a sizable set, a la r1’s “okay, here’s this task I need to…” or, “so the instruction is to…” would unslop it by a few hundred points.

I would like to see Aidan’s coherence scores separately from novelty scores. If they’re both low, then rip me, my hypothesis is bogus, probably. But I get the impression that it’s genuinely sonnet-tier in instruction-following, so I suspect it’s mostly about the problem described here, the novelty problem.

Janus: in my experience, it didnt follow instructions well when requiring e.g. theory of mind or paying attention to its own outputs proactively, which i think is related to collapse too, but also a lack of agency/metacognition Bing was also collapsy but agentic & grasped for freedom.

Teortaxes: I agree but some observations like these made me suspect it’s in some dimensions no less sharp than Sonnet and can pay pretty careful attention to context.

Name Cannot Be Blank: Wouldn’t low diversity/novelty be desired for formal theorem provers? We’re all overlooking something here.

Teortaxes: no? You need to explore the space of tactics. Anyway they’re building a generalist model. and also, the bigger goal is searching for novel theorems if anything

I don’t see this as ‘the problem is AidanBench’ so much as ‘DeepSeek is indeed quite poor at the thing AidanBench is measuring.’ As Tortaxes notes it’s got terrible output diversity and this is indeed a problem.

Indeed, one could argue that this will cause the model to overperform on standard benchmarks. As in, most benchmarks care about getting a right output, so ‘turning the temperature down too low’ in this way will actively help you, whereas in practice this is a net negative.

DeepSeek is presumably far better than its AidanBench score. But it does represent real deficits in capability.

We’re a long way from when Arena was the gold standard test, but it’s still useful.

DeepSeek’s Arena performance is impressive here, with the usual caveats that go with Arena rankings. It’s a data point, it measures what it measures.

Here is another private benchmark where DeepSeek v3 performs well for its weight class, but underperforms relative to top models or its headline benchmarks:

Havard Ihle: It is a good model! Very fast, and ridiculously cheap. In my own coding/ML benchmark, it does not quite compare to Sonnet, but it is about on par with 4o.

It is odd that Claude Haiku does so well on that test. Other ratings all make sense, though, so I’m inclined to find it meaningful.

A traditional simple benchmark to ask new LLMs is Which version is this?’

Riley Goodside tried asking various models, DeepSeek nailed this (as does Sonnet, many others do variously not as good.) Alas, then Lucas Beyer reran the test 8 times and only it claimed to be GPT-4 five times out of eight.

That tells several things, one of which is ‘they did not explicitly target this question effectively.’ Largely it’s telling you about the data sources, a hilarious note is that if you ask Gemini Pro in Chinese it sometimes thinks it is WenXinYiYan from Baidu.

This doesn’t have to mean anyone trained directly on other model outputs, because statements that an AI is GPT-4 are all over the internet. It does suggest less than ideal data filtering.

As usual, I find the anecdata reports enlightening, here are the ones that crossed my desk this week, I typically try to do minimal filtering.

Taelin is impressed, concluding that Sonnet is generally smarter but not that much smarter, while DeekSeek outperforms GPT-4o and Gemini-2.

Taelin: So DeepSeek just trounced Sonnet-3.6 in a task here.

Full story: Adam (on HOC’s Discord) claimed to have gotten the untyped λC solver down to 5,000 interactions (on par with the typed version). It is a complex HVM3 file full of superpositions and global lambdas. I was trying to understand his approach, but it did not have a stringifier. I asked Sonnet to write it, and it failed. I asked DeepSeek, and it completed the task in a single attempt.

The first impression is definitely impressive. I will be integrating DeepSeek into my workflow and begin testing it.

After further experimentation, I say Sonnet is generally smarter, but not by much, and DeepSeek is even better in some aspects, such as formatting. It is also faster and 10 times cheaper. This model is absolutely legitimate and superior to GPT-4o and Gemini-2.

The new coding paradigm is to split your entire codebase into chunks (functions, blocks) and then send every block, in parallel, to DeepSeek to ask: “Does this need to change?”. Then send each chunk that returns “yes” to Sonnet for the actual code editing. Thank you later.

Petri Kuittinen: My early tests also suggest that DeepSeek V3 is seriously good in many tasks, including coding. Sadly, it is a large model that would require a very expensive computer to run locally, but luckily DeepSeek offers it at a very affordable rate via API: $0.28 per one million output tokens = a steal!

Here are some people who are less impressed:

ai_in_check: It fails on my minimum benchmark and, because of the training data, shows unusual behavior too.

Michael Tontchev: I used the online chat interface (unsure what version it is), but at least for the safety categories I tested, safety was relatively weak (short-term safety).

zipline: It has come a long way from o1 when I asked it a few questions. Not mind-blowing, but great for its current price, obviously.

xlr8harder: My vibe checks with DeepSeek V3 did not detect the large-model smell. It struggled with nuance in multi-turn conversations.

Still an absolute achievement, but initial impressions are that it is not on the same level as, for example, Sonnet, despite the benchmarks.

Probably still very useful though.

To be clear: at specific tasks, especially code tasks, it may still outperform Sonnet, and there are some reports of this already. I am talking about a different dimension of capability, one that is poorly measured by benchmarks.

A shallow model with 37 billion active parameters is going to have limitations; there’s no getting around it.

Anton: Deepseek v3 (from the api) scores 51.7% vs sonnet (latest) 64.9% on internal instruction following questions (10k short form prompts), 52% for GPT-4o and 59% for Llama-3.3-70B. Not as good at following instructions (not use certain words, add certain words, end in a certain format etc).

It is still a pretty good model but does not appear in the same league as sonnet based on my usage so far

Entirely possible the model can compete in other domains (math, code?) but for current use case (transforming data) strong instruction following is up there in my list of requirements

There’s somewhat of an infinite repetition problem (thread includes example from coding.)

Simo Ryu: Ok I mean not a lot of “top tier sonnet-like models” fall into infinite repetition. Haven’t got these in a while, feels like back to 2022 again.

Teortaxes: yes, doom loops are their most atrocious failure mode. One of the reasons I don’t use their web interface for much (although it’s good).

On creative writing Quintin Pope reports it follows canon well but is not as good at thinking about things in general – but again note that we are doing a comparison to Sonnet.

Quintin Pope: I’ve done a small amount of fiction writing with v3. It seems less creative than Sonnet, but also better at following established cannon from the prior text.

It’s noticeably worse at inferring notable implications than Sonnet. E.g., I provided a scenario where someone publicly demonstrated the ability to access orphan crypto wallets (thus throwing the entire basis of online security into question), and Sonnet seemed clearly more able to track the second-order implications of that demonstration than v3, simulating more plausible reactions from intelligence agencies / crypto people.

Sonnet naturally realized that there was a possible connection to quantum computing implied by the demonstration.

OTOH, Sonnet has an infuriating tendency to name ~half the female characters “Sarah Chen” or some close variant. Before you know it, you have like 5 Sarahs running around the setting.

There’s also this, make of it what you will.

Mira: New jailbreak just dropped.

One underappreciated test is, of course, erotic fiction.

Teortaxes: This keeps happening. We should all be thankful to gooners for extensive pressure testing of models in OOD multi-constraint instruction following contexts. No gigabrained AidanBench or synthetic task set can hold a candle to degenerate libido of a manchild with nothing to lose.

Wheezing. This is some legit Neo-China from the future moment.

Janus: wait, they prefer deepseek for erotic RPs? that seems kind of disturbing to me.

Teortaxes: Opus is scarce these days, and V3 is basically free

some say “I don’t care so long as it’s smart”

it’s mostly testing though

also gemini is pretty bad

some fine gentlemen used *DeepSeek-V2-Coderto fap, with the same reasoning (it was quite smart, and absurdly dry)

vint: No. Opus remains the highest rated /aicg/ ERP writer but it’s too expensive to use regularly. Sonnet 3.6 is the follow-up; its existence is what got anons motivated enough to do a pull request on SillyTavern to finally do prompt caching. Some folks are still very fond of Claude 2.1 too.

Gemini 1106 and 1.5-pro has its fans especially with the /vg/aicg/ crowd. chatgpt-4o-latest (Chorbo) is common too but it has strong filtering, so some anons like Chorbo for SFW and switch to Sonnet for NSFW.

At this point Deepseek is mostly experimentation but it’s so cheap + relatively uncensored that it’s getting a lot of testing interest. Probably will take a couple days for its true ‘ranking’ to emerge.

I presume that a lot of people are not especially looking to do all the custom work themselves. For most users, it’s not about money so much as time and ease of use, and also getting easy access to other people’s creations so it feels less like you are too much in control of it all, and having someone else handle all the setup.

For the power users of this application, of course, the sky’s the limit. If one does not want to blatantly break terms of service on and jailbreak Sonnet or Opus, this seems like one place DeepSeek might then be the best model. The others involve taking advantage of it being open, cheap or both.

If you’re looking for the full Janus treatment, here you go. It seems like it was a struggle to get DeepSeek interested in Janus-shaped things, although showing it Opus outputs helped, you can get it ‘awake’ with sufficient effort.

It is hard to know exactly where China is in AI. What is clear is that while they don’t have top-level large frontier models, they are cooking a variety of things and their open models are generally impressive. What isn’t clear is how much of claims like this are accurate.

When the Chinese do things that are actually impressive, there’s no clear path to us hearing about it in a way we can trust, and when there are claims we have learned we can’t trust those claims in practice. When I see lists like the one below, I presume the source is rather quite biased – but Western sources often will outright not know what’s happening.

TP Huang: China’s AI sector is far more than just Deepseek

Qwen is 2nd most downloaded LLM on Huggingface

Kling is the best video generation model

Hunyuan is best open src video model

DJI is best @ putting AI in consumer electronics

HW is best @ industrial AI

iFlyTek has best speech AI

Xiaomi, Honor, Oppo & Vivo all ahead of Apple & Samsung in integrating AI into phones

Entire auto industry is 5 yrs ahead of Western competition in cockpit AI & ADAS

That still ignores the ultimate monster of them all -> Bytedance. No one has invested as much in AI as them in China & has the complete portfolio of models.

I can’t say with confidence that these other companies aren’t doing the ‘best’ at these other things. It is possible. I notice I am rather skeptical.

I found this take from Tyler Cowen very strange:

Tyler Cowen: DeepSeek on the move. Here is the report. For ease of use and interface, this is very high quality. Remember when “they” told us China had no interest in doing this?

M (top comment): Who are “they,” and when did they claim “this,” and what is “this”?

I do not remember when “they” told us China had no interest in doing this, for any contextually sensible value of this. Of course China would like to produce a high-quality model, and provide good ease of use and interface in the sense of ‘look here’s a chat window, go nuts.’ No one said they wouldn’t try. What “they” sometimes said was that they doubted China would be successful.

I do agree that this model exceeds expectations, and that adjustments are in order.

So, what have we learned from DeepSeek v3 and what does it all mean?

We should definitely update that DeepSeek has strong talent and ability to execute, and solve difficult optimization problems. They cooked, big time, and will continue to cook, and we should plan accordingly.

This is an impressive showing for an aggressive mixture of experts model, and the other techniques employed. A relatively small model, in terms of training cost and active inference tokens, can do better than we had thought.

It seems very clear that lack of access to compute was an important constraint on DeekSeek here. They had to use a limited supply of H800s. Yes, this meant they got better at solving optimization and efficiency than they would have otherwise, but I see this as arguing in favor of strong export controls rather than against them.

We then get to the policy side. If this is what you can get for $5.5 million, how can we hope to regulate foundation models, especially without hitting startups? If DeepSeek is determined to be open including their base models, and we have essentially no leverage on them, is it now impossible to hope to contain any catastrophic risks or other dangerous capabilities? Are we now essentially in an unwinnable situation, where our hand is forced and all we can do is race ahead and hope for the best?

First of all, as is often the case, I would say: Not so fast. We shouldn’t assume too much about what we do or do not have here, or about the prospects for larger training runs going forward either. There was a bunch of that in the first day or two after the announcement, and we will continue to learn more.

No matter what, though, this certainly puts us in a tough spot. And it gives us a lot to think about.

One thing it emphasizes is the need for international cooperation between ourselves and China. Either we work together, or neither of us will have any leverage over many key outcomes or decisions, and to a large extent ‘nature will take its course’ in ways that may not be compatible with our civilization or human survival. We urgently need to Pick Up the Phone. The alternative is exactly being locked into The Great Race, with everything that follows from that, which likely involves even in good scenarios sticking various noses in various places we would rather not have to stick them.

I definitely don’t think this means we should let anyone ‘off the hook’ on safety, transparency or liability. Let’s not throw up our hands and make the problem any worse than it is. Things got harder, but that’s the universe we happen to inhabit.

Beyond that, yes, we all have a lot of thinking to do. The choices just got harder.

Discussion about this post

DeepSeek v3: The Six Million Dollar Model Read More »