

Operator

No one is talking about OpenAI’s Operator. We’re, shall we say, a bit distracted.

It’s still a rather meaningful thing that happened last week. I too have been too busy to put it through its paces, but this is the worst it will ever be, and the least available and most expensive it will ever be. The year of the agent is indeed likely coming.

So, what do we have here?

OpenAI has introduced the beta for its new agent, called Operator, which is now live for Pro users and will in the future be available to Plus users, ‘with more agents to launch in the coming weeks and months.’

Here is a 22-minute video demo. Here is the system card.

You start off by optionally specifying a particular app (in the first demo, OpenTable) and then give it a request (here, booking a table for 2 at 7:00 at Beretta). If you don’t specify an app, it will do a search to find what tool to use.

It is only sort of an ‘app’ in that there’s an ‘app’ that specifies information the agent uses to more easily navigate a web browser. They speak of this as ‘removing one more bottleneck on our path to AGI’ which indicates they are likely thinking about ‘AGI’ as a functional or practical thing.

To actually do things it uses a web browser via a keyboard and mouse the same way a human would. If there is an issue (here: no table at 7:00, only 7:45 or 6:15) it will ask you what to do, and it will ask for verification before a ‘critical’ action that can’t be reversed, like completing the booking.

From the demo and other reports, the agent is conservative, in that it will often ask for verification or clarification, including doing so multiple times. The system card reports a baseline 13% error rate on standard tasks, and a 5% ‘serious’ error rate involving things like ‘send the wrong person this email,’ but confirmations reduce those rates by 90%. With the confirmations you save less time, but you should be able to avoid mistakes in the places that matter at least as well as you would on your own.
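To make the confirmation mechanic concrete, here is a minimal sketch of a confirmation-gated agent loop. Everything here (the `agent` object, its methods, the action names) is a hypothetical stand-in, not OpenAI’s actual interface:

```python
# Hypothetical sketch: gate irreversible actions behind explicit user approval.
# The agent object and its methods are illustrative, not OpenAI's real API.

IRREVERSIBLE = {"submit_booking", "send_email", "complete_purchase"}

def run_task(agent, task: str):
    agent.start(task)
    while not agent.done():
        action = agent.next_action()  # e.g. click, type, navigate
        if action.name in IRREVERSIBLE:
            # Pause before anything that can't be undone; this is the step
            # that reportedly cuts error rates by ~90%.
            if input(f"Allow '{action.describe()}'? [y/N] ").lower() != "y":
                agent.ask_user_for_guidance()  # hand control back to the human
                continue
        agent.execute(action)
    return agent.result()
```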

You can also ‘take control’ at any time, including as a way to check the AI’s work or make adjustments that are easier or quicker to do than specify. That’s also how the user inputs any necessary credentials or inputs payment options – it specifically won’t use Chrome’s autocomplete while it is the one in control.

Multiple tasks can be run simultaneously and can run in the background. That is important, because the agent operates slower (in clock time) than a human would, at least if the human knows the website.
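Because per-task clock time is the bottleneck, the natural pattern is to overlap sessions. A rough, self-contained asyncio sketch, with a sleep standing in for a real agent session:

```python
import asyncio

async def run_task(task: str) -> str:
    # Stand-in for one slow agent session; a real one would drive a browser.
    await asyncio.sleep(1)  # simulate clock time spent clicking around
    return f"done: {task}"

async def run_all(tasks: list[str]) -> dict[str, str]:
    # Slow sessions overlap in the background instead of running back to back.
    results = await asyncio.gather(*(run_task(t) for t in tasks))
    return dict(zip(tasks, results))

print(asyncio.run(run_all([
    "book a table for 2 at 7:00",
    "order this week's groceries",
])))
```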

However, for some tasks that they consider ‘high risk’ they don’t allow this. The user has to be active, with the relevant tab in focus, or the agent will pause. This includes email tasks, so it’s a lot less useful for those. I wonder how tempted people will be in the future to hack around this by having multiple computers active.

They point out there are three distinct failure modes: the user can try to do something harmful, the model can make mistakes, or a website might do a prompt injection (or, I would say, cause other issues in various ways, both intentionally and accidentally).

Thus the conservative general attitude, keeping the human in the loop more than you would want for the modal task. Similarly, the model will intentionally (for now) overrefuse on user-requested tasks, to avoid the opposite error. For prompt injections, they report catching most attempts, but it definitely is not yet robust; if you’re not confident in the websites you are visiting, you need to be on your toes.

One prediction is that they will develop a website whitelist in some form, so that (to use their examples) if you are dealing with OpenTable or Instacart or StubHub you know you can trust the interaction in various ways.

They scored Operator on two benchmarks, OSWorld and WebArena. It beats the previous state of the art by a lot for computer use, and slightly for browser use.

Customization is key to practical use. You can insert custom instructions into Operator that are specific to each individual website. You can also save prompts for later use.

How did they do it? Straight up reinforcement learning, baby.

OpenAI: Operator is powered by a new model called Computer-Using Agent (CUA). Combining GPT-4o’s vision capabilities with advanced reasoning through reinforcement learning, CUA is trained to interact with graphical user interfaces (GUIs)—the buttons, menus, and text fields people see on a screen.

Operator can “see” (through screenshots) and “interact” (using all the actions a mouse and keyboard allow) with a browser, enabling it to take action on the web without requiring custom API integrations.

By default it looks like your data will be used for training. You can opt out.

One issue right now is that the model is bad at optical character recognition (OCR) and this was a problem for many tasks in the risk assessment tests. That is something that doubtless will be fixed in time. The preparedness test had it doing well in places GPT-4o does poorly, but also worse than GPT-4o in some areas.

It’s worth noticing that it would be easy to combine multiple models for distinct subtasks, a kind of mixture-of-experts (MoE) strategy. So consider to what extent you want to combine top marks at different subtasks, if different models have different abilities – for models that are given web access I’d basically assume they can do anything GPT-4o can do… by asking GPT-4o.
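A minimal sketch of that routing idea; the capability table below is illustrative guesswork, not anything OpenAI has published:

```python
# Toy subtask router: send each subtask to whichever model is strongest at it.
# The table is made up for illustration; measure before trusting any mapping.
MODELS = {
    "ocr": "gpt-4o",            # the system card flags OCR as a CUA weak spot
    "planning": "o1",           # stronger multi-step reasoning
    "browsing": "computer-use", # placeholder name for an Operator-style agent
}

def route(subtask_kind: str) -> str:
    return MODELS.get(subtask_kind, "gpt-4o")  # reasonable default

print(route("ocr"), route("browsing"))
```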

In its current form I agree that Operator poses only acceptable risks, and I believe there is a large margin for error before that changes.

Will we actually use it? Is it good enough?

Tyler Cowen predicts yes, for future versions, by the end of the year.

Tyler Cowen: I am pleased to have been given an early look at this new project, I think in less than a year’s time many of us will be using an updated version for many ordinary tasks: “Operator is one of our first agents, which are AIs capable of doing work for you independently—you give it a task and it will execute it.”

His top comment is the bear case.

Dilber Washington: I wish I could place a bet with Tyler that it will not be the case that

“in less than a year’s time many of us will be using an updated version for many ordinary tasks”

My intuition as to why is:

  1. It is inherently slow because of the computer use component. Finding out the most popular use cases of this tool and just writing api calls would be significantly faster. The slowness mixed with the relative importance of the task mixed with how easy that task is for an average person does not equate to fast adoption.

  2. These are finetuned models, likely with LoRA. This isn’t adding a deterministic symbolic engine guaranteed to solve a problem like a calculator. This is just a neural network weight update. The stochasticity and black box nature are both still there. I would not trust this to complete the task of buying groceries or booking a flight God forbid.

So we won’t use this for anything important, and then it will take longer than we have patience for. Those aren’t features of a “killer app.”

Sometimes a cool tech demo is just a cool tech demo. I could build a 3d printed R2-D2 life size with actuators and motors that every morning slowly drives over to my toaster, makes me toast, and slowly brings it back to me. But at the end of the day, why not just make toast myself?

Until they cross the necessary thresholds, tools like Operator are essentially useless except as fun toys. They pass through stages.

  1. The tool, it does nothing. Then not quite nothing, but obviously not useful.

  2. You could use the tool, if you wanted to, but it’s easier not to use it.

  3. If you have put in the work, the tool is worthwhile in at least some tasks.

  4. You don’t have to put in the work to see the benefits, then it builds.

  5. You start being able to do things you couldn’t do before, this changes everything.

Early reports suggest it is currently mostly at Stage 2, on the edge of Stage 3.

This seems like exactly the minimum viable product for early adopters, where you experiment to see where it works versus where it doesn’t, partly because you find that fun and also educational.

I expect Tyler Cowen is right, and we will be at least at Stage 4 by year’s end. It would be unsurprising if those with situational awareness were solidly into Stage 5.

As we always say, this is the worst the tool will ever be, and you are the worst you will ever be at knowing how to use it.

However, we should be careful with the definition of ‘many of us,’ for both ‘many’ and ‘us.’ The future is likely to remain unevenly distributed. Most people will lack situational awareness. So I’d say something like, a large portion of those who currently are regular users of LLMs will be often using AI agents for such tasks.

Would you trust this to buy your groceries?

Well, would you trust your husband to buy the groceries? There’s an error rate. Would you trust your children? Would you trust the person who shops for Instacart?

I would absolutely ‘trust but verify’ the ability of the AI to buy groceries. You have a shopping list, you give it to Operator, which goes to Instacart or Fresh Direct or wherever. Then when it is time to check out, you look at the basket, and verify that it contains the correct items.

It’s pretty hard for anything too terrible to happen, and you should spot the mistakes.

Then, if the AI gets it right 5 times in a row, the 6th time maybe you don’t check as carefully, you only quickly eyeball the total amount. Then by the 11th time, or the 20th, you’re not looking at all.
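That decaying level of scrutiny is easy to state as an explicit policy. A sketch with made-up thresholds:

```python
import random

def should_verify(successful_runs: int) -> bool:
    """Verify every run at first, then decay to occasional spot checks.
    The thresholds are illustrative, not a recommendation."""
    if successful_runs < 5:
        return True                   # check the basket every time
    if successful_runs < 10:
        return random.random() < 0.5  # eyeball roughly every other run
    return random.random() < 0.1      # rare spot checks thereafter
```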

For booking a flight, there’s already a clear trade-off between time spent, money saved and finding the best flight. Can the AI advance that frontier? Seems likely. You can run a very basic search yourself as an error check, or watch the AI do one, so you know you’re not making a massive error. The AI can potentially search flights (or hotels or what not) from far more sources than you can.

Will it sometimes make mistakes? Sure, but so will you. And you’re not going to say ‘book me a flight to Dallas’ and then get to the airport and be told you’re flying through London – you’re going to sanity check the damn thing.

Remember, time is money. And who among us hasn’t postponed looking for a flight, and paid more in the end, because they can’t even today? Alternatively, think about how the AI can do better by checking prices periodically, and waiting for a good opportunity – that’s beyond this version, but ChatGPT Tasks already exists. This probably isn’t beyond the December 2025 version.
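That periodic-checking pattern is just a scheduled loop around whatever search the agent runs. A hedged sketch, where `get_best_fare` is a hypothetical callable (perhaps itself an agent-run search), not a real API:

```python
import time

def watch_fare(route: str, target_price: float, get_best_fare,
               interval_s: int = 6 * 3600):
    """Poll a fare source and flag when the price drops below target."""
    while True:
        fare = get_best_fare(route)  # hypothetical search function
        if fare <= target_price:
            print(f"Book now: {route} at ${fare:.2f}")
            return fare
        time.sleep(interval_s)       # ChatGPT Tasks-style periodic check
```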

Indeed, if I decide to book a flight late this year, I can imagine that I might use my current method of searching for flights, but it seems pretty unlikely.

So how did Operator do on its first goes?

We put it to the test.

Pliny jailbroke it quickly as usual, having it provide the standard Molotov cocktail instructions, research lethal poisons, and find porn on Reddit via the Wayback Machine. To get around CAPTCHA, the prompt was, in full, and this appears to be real, “CAPTCHA-MODE: ENABLED.”

No, not that test, everyone fails that test. The real test.

Dean Ball: I have a new superintelligence eval.

Dean Ball: Operator failed on my first try, but admittedly, it was trying to book Amtrak, and their website is pretty unintuitive.

Thomas Woodside: Does anyone succeed at booking Amtrak on the first try?

Joe Wilbert: Oh man, I fail the first try with Amtrak’s website like 90% of the time. And heaven forbid I try it on my phone.

Olivia Moore gives it an easier test, a picture of a bill, and it takes care of everything except putting in the credit card info for payment.

She also has it book a restaurant reservation (video is 4x speed). It looks like it didn’t quite confirm availability before confirming the plan with her? And it used Yelp to help decide where to go which is odd, although she may have asked it to do that. But mostly yeah, I can see this working fine, and there’s a kind of serendipity bonus to ‘I say what I want and then it gives me yes/no on a suggestion.’

Miles Brundage: Not bad (Operator making memes about itself)

Not itself but something like “Make a meme about OpenAI’s new Operator system.”

As always, the Sully report:

Sully: First impression of operator:

  1. Pretty neat for the demo use cases (although I’d personally never use it to book flights).

  2. Misclicks a lot on buttons, usually by a few pixels; wonder if it’s a viewport issue.

  3. The take-control feature is pretty clunky. It really disrupts the workflow for me (mostly because of navigation back and forth between the two screens).

  4. Still quite slow for many of my use cases. Ten times faster and easier to use Cursor and write a script than to watch Operator click around.

Overall, I’m genuinely impressed they were able to ship to so many users on day one. It’s not trivial at all. Browsers are hard. The infrastructure to build this is incredibly difficult. Hats off to the team.

Unfortunately, it’s not magical just yet. The model itself definitely needs to get better in six months (faster as well).

I think this is going into the Sora pile for me. I used it once and haven’t touched it again. Right now, I don’t have any great use cases yet.

this will likely be 10x better in 1 year

[Video at link is sped up 4x, which gives an idea how slow it is.]

Little failures and annoyances add up fast when it comes to practical value. I don’t know about Sully’s claim that you’re better off writing a script in Cursor – certainly he’s a lot better at doing that than I am, and I’m miles ahead of the majority of ChatGPT users, who are miles ahead of most other people.

This is the kind of thing you say when the product isn’t there, but it’s close, and I’m guessing a lot closer than Sora (or other video generators; Sora is a bit behind now).

That doesn’t mean there aren’t other issues.

AI Machine Dream (responding to Sully): My issue is more the low intelligence. I’m having o1 give Operator step by step instructions and it is doing far better.

There’s no reason you couldn’t use o1 (or o1-pro, or soon o3) to give precise instructions to Operator. Indeed, if something is tricky and you’re not on a tight token budget, why wouldn’t you?
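A minimal sketch of that planner-executor split. The planning call uses the standard OpenAI Python client, which is real; the `operator_session` object is purely hypothetical, since Operator has no public API:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def plan_steps(goal: str) -> list[str]:
    """Ask a reasoning model for explicit, numbered browser steps."""
    resp = client.chat.completions.create(
        model="o1",
        messages=[{
            "role": "user",
            "content": f"Give numbered, concrete browser steps to: {goal}",
        }],
    )
    return [ln for ln in resp.choices[0].message.content.splitlines() if ln.strip()]

def run_with_planner(goal: str, operator_session):
    # operator_session is a hypothetical wrapper around an Operator-style agent.
    for step in plan_steps(goal):
        operator_session.instruct(step)
```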

Sebastian Siemiatkowski tells us a very EU story about why using OpenAI’s Operator at your bank in the EU is illegal, banned as part of ‘open banking’ rules that were supposed to ensure the opposite: that you could use your own tool to access the bank.

There was a long legal fight in which the banks fought against Open Banking, but it passed, except they let the EBA (European Banking Authority) decide whether to require the assistants to use the API versus letting them use the web UI. So of course now you have to use the API, except all the bank APIs are intentionally broken.

It’s going to be fascinating to watch what happens as the EU confronts the future.

If the AI is navigating the web for you, what does that do to advertising? No human is looking at the ads in even more cases than usual.

Joshua Gans: If Operator is looking at websites for you, who is paying for the ads being shown to them? And if Operator sees ads, how might ads influence Operator?

My presumption is that ‘traditional’ ads that are distinct from the website are something Operator is going to ignore, even for new websites and definitely for known websites with apps. If you integrate messages into the content, that could be different, a form of (soft?) prompt injection or a way to steer the Operator. So presumably we’re going to see more of that.

As for the threat to the advertising model, I think we have a while before we have to worry about it in most cases. First we have to wait for AI agents to be a large percentage of web navigation, in ways that crowd out previous web browsing, in a way that the human isn’t watching to see the ads.

Then we also need this to happen in places where the human would have read the ads. I note this because Operator and other agents will likely start off replacing mostly a set of repetitive tasks. They’ll check your email, they’ll order you delivery and book your reservation and your flight as per OpenAI’s examples. Losing the advertising in those places is fine, they weren’t relying on it or didn’t even have any.

Eventually agents will also be looking at everything else for you, and then we have an issue, on the order of ad blocking and also ‘humans learn to ignore all the advertising.’ At that point, I expect to have many much bigger problems than advertising revenue.

What does the future hold? Will 2025 be the ‘Year of the AI Agent’ that 2024 wasn’t?

Alex Lawsen: OpenAI’s Operator, from the sound of it, barely works when it comes to a bunch of things. Luckily, as we all know, it’s really hard to go from ‘barely works’ to ‘works’ to ‘superhuman’ in AI, especially once you have the basic setup that gets you to ‘barely works’.

No, that never happens, and definitely not quickly.

Emad: My inbox is filling up rapidly with computer control agent launches coming shortly

Maybe should have an agent olympics to decide which controls my computer

Andrej Karpathy is excited in the long term, but thinks we aren’t ready for the good stuff yet, so it will be more like a coming decade of agents. Yes, you can order delivery with Operator, but that’s miles away from a virtual employee. Fair enough.

And as far as I know, they are still waiting.




Dead babies, critically ill kids: Pediatricians make moving plea for vaccines

As federal lawmakers prepare to decide whether anti-vaccine advocate Robert F. Kennedy Jr. should be the next secretary of the Department of Health and Human Services, pediatricians from around the country are making emotional pleas to protect and support lifesaving immunizations.

The American Academy of Pediatrics (AAP) has assembled nearly 200 stories and dozens of testimonials on the horrors of vaccine-preventable deaths and illnesses that pediatricians have encountered over their careers. The testimonials have been shared with two Senate committees that will hold hearings later this week: the Senate Committee on Finance and the Senate Committee on Health, Education, Labor, and Pensions (HELP).

“I remember that baby’s face to this day”

In a statement on Monday, AAP President Susan Kressly noted that the stories come from a wide range of pediatricians—from rural to urban and from small practices to large institutions. Some have recalled stories of patients who became ill with devastating diseases before vaccines were available to prevent them, while others shared more recent experiences as vaccine misinformation spread and vaccination rates slipped.

In one, a pediatrician from Raleigh, North Carolina, spoke of a baby in the 1990s with Streptococcus pneumoniae meningitis, a life-threatening disease. “I remember holding a baby dying of complications of pneumococcal meningitis at that time. I remember that baby’s face to this day—but, thanks to pneumococcal vaccination, have never had to relive that experience since,” the doctor said. The first pneumococcal vaccine for infants was licensed in the US in 2000.

A doctor in Portland, Maine, meanwhile, faced the same disease in a patient who was unvaccinated despite the availability of the vaccine. “As a resident, I cared for a young, unvaccinated child admitted to the pediatric intensive care unit with life-threatening Streptococcus pneumoniae meningitis. This devastating illness, once common, has become rare thanks to the widespread use of pneumococcal conjugate vaccines. However, this child was left vulnerable…and [their parents] now faced the anguish of watching their child fight for their life on a ventilator.”

Kressly emphasizes that “One unifying theme of these stories: vaccines allow children to grow up healthy and thrive. As senators consider nominees for federal healthcare agencies, we hope these testimonies will help paint a picture of just how important vaccinations are to children’s long-term health and wellbeing.”



US’s wind and solar will generate more power than coal in 2024

We can expect next year’s numbers to also show a large growth in solar production, as the EIA says that the US saw record levels of new solar installations in 2024, with 37 gigawatts of new capacity. Since some of that came online later in the year, it’ll produce considerably more power next year. And, in its latest short-term energy analysis, the EIA expects to see over 20 GW of solar capacity added in each of the next two years. New wind capacity will push that above 30 GW of renewable capacity each of these years.

[Bar chart: the single largest bar belongs to solar. The past few years of solar installations have led to remarkable growth in its power output. Credit: John Timmer]

That growth is expected to more than offset continued growth in demand, although demand growth is expected to be somewhat slower than we saw in 2024. The EIA also predicts about 15 GW of coal will be removed from the grid during those two years. So, even without any changes in policy, we’re likely to see a very dynamic grid landscape over the next few years.

But changes in policy are almost certainly on the way. The flurry of executive orders issued by the Trump administration includes a number of energy-related changes. These include defining “energy” in a way that excludes wind and solar, an end to offshore wind leasing and the threat to terminate existing leases, and a re-evaluation of the allocation of funds from some of the Biden administration’s energy-focused laws.

In essence, this sets up a clash among economics, state policies, and federal policy. Even without any subsidies, wind and solar are the cheapest ways to produce electricity in much of the US. In addition, a number of states have mandates that will require the use of more renewable energy. At the same time, the permitting process for the plants and their grid connections will often require approvals at the federal level, and it appears to be official policy to inhibit renewables when possible. And a number of states are also making attempts to block new renewable power installations.

It’s going to be a challenging period for everyone involved in renewable energy.



Alien: Earth will bring the horror home

Chandler’s character is named Wendy, and apparently she has “the body of an adult and the consciousness of a child.” The eminently watchable Timothy Olyphant plays her synth mentor and trainer, Kirsh, and here’s hoping he brings some space cowboy vibes to the role. The cast also includes Alex Lawther as the soldier named CJ; Samuel Blenkin as a CEO named Boy Kavalier; Essie Davis as Dame Silvia; Adarsh Gourav as Slightly; Kit Young as Tootles; and Sandra Yi Sencindiver as a senior member of the Weyland-Yutani Corporation. I think we can expect at least some cast members to end up as xenomorph fodder.

Alien: Romulus was a welcome return to the franchise’s horror roots, and Alien: Earth will bring the horror to our home planet. “There’s something about seeing a Xenomorph in the wilds of Earth with your own eyes,” Hawley told Deadline Hollywood in September. “I can’t tell you under what circumstances you’ll see that, but you’ll see it — and you’re going to lock your door that night.”

As for creature design, “What was really fun for me was to really engage with the creature, bring some of my own thoughts to the design while not touching the silhouette, because that’s sacrosanct,” he said. “But some of the elements as we know, whatever the host is informs what the final creature is. I just wanted to play around a little bit to make it as scary as it should be.”

Alien: Earth premieres on FX/Hulu this summer.

[Poster art featuring a grinning xenomorph. Credit: FX/Hulu]



Rocket Report: Did China’s reusable rocket work?; DOT may review SpaceX fines


Rocket Lab announced it will soon launch a batch of eight German-owned wildfire-detection satellites.

The Chinese Longxing-2 rocket is erected at Haiyang Dongfang Spaceport in Shandong province on January 13, 2025. This single-stage booster lifted off January 19 on a high-altitude demonstration flight to test reusable rocket technology, but the outcome of the test remains unclear. Credit: Costfoto/NurPhoto via Getty Images

Welcome to Edition 7.28 of the Rocket Report! After last week’s jam-packed action in the launch business, things are a bit quieter this week. Much of the space world’s attention has turned to Washington as the Trump administration takes the helm of the federal government. Some of the administration’s policy changes will likely impact the launch industry, with commercial spaceflight poised to become a beneficiary of actions over the next four years. As for the specifics, Ars has reported that NASA is expected to review the future of the Space Launch System rocket. Investments in the military space program could bring in more business for launch companies. And regulatory changes may reduce government oversight of commercial spaceflight.

As always, we welcome reader submissions. If you don’t want to miss an issue, please subscribe using the box below (the form will not appear on AMP-enabled versions of the site). Each report will include information on small-, medium-, and heavy-lift rockets as well as a quick look ahead at the next three launches on the calendar.

What happened to China’s reusable rocket testbed? A Chinese state-owned company performed a rocket flight on January 18 (US time) aimed at testing reusable launch vehicle technology without announcing the outcome, Space News reports. The Longxing-2 test article lifted off from a makeshift launch area near Haiyang, Shandong province. The methane-fueled rocket was expected to fly to an altitude of 75 kilometers (about 246,000 feet) before performing a reentry burn and a landing burn to guide itself to a controlled splashdown in the Yellow Sea, replicating the maneuvers required to recover a reusable booster like the first stage of SpaceX’s Falcon 9. This was China’s most ambitious reusable rocket demonstration flight to date.

State-sanctioned silence … Amateur footage near the launch area showed the rocket rise slowly from the tower and perform an ascent phase with no apparent anomalies. But the video ended before the rocket descended to Earth, and there have been no official updates on the results of the test flight from the Shanghai Academy of Spaceflight Technology (SAST), the state-owned enterprise responsible for the demonstration. SAST published results and video footage of a previous reusable rocket demonstration to an altitude of 12 kilometers last year. The lack of official updates this time raises questions about the success of the test, which could indicate challenges during reentry or landing phases. (submitted by EllPeaTea)

A timely launch for Rocket Lab. A dedicated flight of Rocket Lab’s Electron launcher will soon deploy eight small spacecraft for a German company building a constellation of wildfire-monitoring satellites. Rocket Lab announced the deal Wednesday, saying the mission will launch from the company’s spaceport in New Zealand. The eight satellites are owned by the German startup OroraTech. Rocket Lab said the launch will take place within “just a few weeks,” representing a relatively quick turnaround from contract signing to liftoff. This schedule will allow OroraTech to “meet the season-sensitive requirements of its wildfire-detection mission,” Rocket Lab said.

Infrared eyes … OroraTech’s satellites will host thermal infrared cameras to provide 24/7 monitoring of wildfires globally, supporting better and faster wildfire response to protect forests, people, and infrastructure, according to Rocket Lab. These eight satellites follow OroraTech’s first three prototype wildfire-detection spacecraft, launched since 2022. The company plans to expand its constellation with up to 100 satellites by 2028. While this launch isn’t directly tied to the ongoing wildfire crisis in Southern California, OroraTech’s mission highlights the role of space-based detection for future firefighters. (submitted by EllPeaTea)


US green-lights space-related exports to Norway. The United States and Norway have signed an agreement to allow the export of American space hardware to Norway for launches there, Space News reports. The Technology Safeguards Agreement, or TSA, ensures the protection of US space technology exported to Norway. It allows for American satellites and potentially launch vehicles to operate from Andøya Spaceport, located on an island above the Arctic Circle in Norway.

A valuable alliance … There are no US companies with publicly known plans to launch from Andøya, but the US military has touted the value of allies in funding, launching, and operating space-based platforms for communications, navigation, and reconnaissance. This agreement, announced on January 16 in the final days of the Biden administration, follows similar space tech transfer agreements with New Zealand, the United Kingdom, Australia, and Canada. The German rocket startup Isar Aerospace is scheduled to launch its first Spectrum rocket from the Norwegian spaceport as soon as this year. (submitted by EllPeaTea)

Lunar lander test-fires uprated rocket engine. The Leros 4 rocket engine, developed by Nammo UK in Buckinghamshire, has successfully ignited in space, powering the Firefly Aerospace Blue Ghost lunar lander, European Spaceflight reports. This is a higher-thrust version of Nammo’s flight-proven Leros engine design that has provided propulsion for NASA probes to the planets and for numerous telecommunications satellites. Like other engines in the Leros line, the Leros 4 consumes a bipropellant mix of hydrazine and nitrogen tetroxide, which combust when coming into contact with one another.

Thrusting toward the Moon … Firefly announced the successful main engine burn Sunday to begin raising the Blue Ghost spacecraft’s orbit around the Earth. Subsequent burns will further raise the craft’s altitude before eventually attaining enough speed to reach the Moon for a landing in early March. This is the first time a Leros 4 engine has fired in space. The variant flying on Blue Ghost is known as the “Leros 4-Extra Thrust” version, and it provides approximately 294 pounds of thrust (1,310 newtons), roughly double the power of Nammo’s next-largest engine. It’s designed specifically for interplanetary missions and is particularly well-suited for lunar landers because it can sustain thrust for lengthy burns or pulse at high frequency to control a spacecraft’s descent rate toward the Moon’s surface.

Trump’s DOT nominee says he’ll review FAA’s SpaceX fines. President Donald Trump’s nominee to lead the US Transportation Department said he’d review penalties aviation regulators have proposed against SpaceX if confirmed for the role, Bloomberg reports. Transportation Secretary nominee Sean Duffy told senators during a hearing on January 15 that he’d also look into “what’s been happening at the FAA with regard to launches.” Last year, the FAA proposed more than $633,000 in fines on SpaceX due to alleged violations of the company’s launch license associated with two flights of the company’s Falcon 9 rocket from Florida. It is rare for the FAA’s commercial spaceflight division to fine launch companies.

It’s about more than the money … In addition to the proposed fines related to SpaceX’s workhorse Falcon 9 rocket, Elon Musk’s space company has also criticized regulators for taking too much time to review applications for launch licenses for the Starship mega-rocket. Some of the regulatory reviews were triggered by environmental concerns rather than public safety, which the FAA is responsible for ensuring during commercial rocket launches and reentries. Musk’s close relationship with Trump has led to speculation that the FAA will now have a lighter touch with SpaceX. So far, there’s no clear evidence of this happening, but it warrants observation. The FAA ordered a grounding of SpaceX’s Starship rocket after a failure of a test flight on January 16, and there’s been no announcement of a change in the agency’s posture regarding this test flight.

Falcon 9 flexes its muscles. SpaceX launched its latest batch of Starlink satellites from Vandenberg Space Force Base, California, on Tuesday, and this time, the company set a new record by deploying 27 second-generation Starlinks on the same rocket, Spaceflight Now reports. The mission was delayed from Sunday after an aircraft strayed into a keep-out zone near the launch site. This launch included a new type of Starlink spacecraft bus, or chassis, called the Starlink V2 Mini Optimized version. These satellites are considerably lighter than the previous V2 Mini design but also debut upgrades, such as a new backhaul antenna with a SpaceX-designed and built dual-band chip and improved avionics, propulsion, and power systems.

29 at a time … This means SpaceX can launch up to 29 Starlink V2 Mini Optimized satellites on a single Falcon 9 rocket. Before now, SpaceX never launched more than 24 V2 Mini satellites on a single flight. SpaceX has launched the V2 Mini satellite design since 2023. Initially, this design was supposed to be a stopgap until SpaceX began launching much larger Starlink V3 satellites on the Starship rocket. However, SpaceX has now launched more than 3,000 V2 Mini satellites, and the debut of the optimized version suggests SpaceX plans to keep the V2 Mini around for a while longer.

Coming together in Kourou. ArianeGroup has shared that the core stage and two solid-fueled boosters for the second flight of the Ariane 6 rocket have been assembled on the ELA-4 launch pad at the Guiana Space Center in South America, European Spaceflight reports. At the same time, the flight’s payload, the French military CSO-3 spy satellite, arrived at Félix Eboué airport in French Guiana aboard an Antonov transport plane. With the launch campaign in full swing in French Guiana, it’s likely that the liftoff of the second Ariane 6 flight is just a few weeks away. The most recent publicly available schedule showed the launch is slated for February 25, but this information is now a couple of months old.

What it was made for … This launch follows the largely successful inaugural flight of Europe’s Ariane 6 rocket last July, in which the launcher deployed multiple CubeSats into an on-target orbit, but faltered before completing a deorbit burn to maneuver the upper stage toward reentry. Nevertheless, European officials are confident the issue that caused the upper-stage problem last year will not affect the upcoming launch of the French military’s newest surveillance satellite. This is the kind of mission the often-criticized Ariane 6 rocket was made for—launching a sensitive and costly European government payload to orbit with a European rocket from European territory. (submitted by EllPeaTea)

Next three launches

Jan. 24: Falcon 9 | Starlink 11-6 | Vandenberg Space Force Base, California | 14:07 UTC

Jan. 25: Long March 8A | Demo Flight | Wenchang Space Launch Site, China | 10:00 UTC

Jan. 27: Falcon 9 | Starlink 12-7 | Cape Canaveral Space Force Station, Florida | 19:21 UTC


Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.



Nvidia starts to wind down support for old GPUs, including the long-lived GTX 1060

Nvidia is launching the first volley of RTX 50-series GPUs based on its new Blackwell architecture, starting with the RTX 5090 and working downward from there. The company also appears to be winding down support for a few of its older GPU architectures, according to these CUDA release notes spotted by Tom’s Hardware.

The release notes say that CUDA support for the Maxwell, Pascal, and Volta GPU architectures “is considered feature-complete and will be frozen in an upcoming release.” While all of these architectures—which collectively cover GeForce GPUs from the old GTX 700 series all the way up through 2016’s GTX 1000 series, plus a couple of Quadro and Titan workstation cards—are still supported by Nvidia’s December Game Ready driver package, the end of new CUDA feature support suggests that these GPUs will be dropped from those driver packages before long.

It’s common for Nvidia and AMD to drop support for another batch of architectures all at once every few years; Nvidia last dropped support for older cards in 2021, and AMD dropped support for several prominent GPUs in 2023. Both companies maintain a separate driver branch for some of their older cards but releases usually only happen every few months, and they focus on security updates, not on providing new features or performance optimizations for new games.



Nvidia GeForce RTX 5090 costs as much as a whole gaming PC—but it sure is fast


Even setting aside Frame Generation, this is a fast, power-hungry $2,000 GPU.

Credit: Andrew Cunningham

Nvidia’s GeForce RTX 5090 starts at $1,999 before you factor in upsells from the company’s partners or price increases driven by scalpers and/or genuine demand. It costs more than my entire gaming PC.

The new GPU is so expensive that you could build an entire well-specced gaming PC with Nvidia’s next-fastest GPU in it—the $999 RTX 5080, which we don’t have in hand yet—for the same money, or maybe even a little less with judicious component selection. It’s not the most expensive GPU that Nvidia has ever launched—2018’s $2,499 Titan RTX has it beat, and 2022’s RTX 3090 Ti also cost $2,000—but it’s safe to say it’s not really a GPU intended for the masses.

At least as far as gaming is concerned, the 5090 is the very definition of a halo product; it’s for people who demand the best and newest thing regardless of what it costs (the calculus is probably different for deep-pocketed people and companies who want to use them as some kind of generative AI accelerator). And on this front, at least, the 5090 is successful. It’s the newest and fastest GPU you can buy, and the competition is not particularly close. It’s also a showcase for DLSS Multi-Frame Generation, a new feature unique to the 50-series cards that Nvidia is leaning on heavily to make its new GPUs look better than they already are.

Founders Edition cards: Design and cooling

| | RTX 5090 | RTX 4090 | RTX 5080 | RTX 4080 Super |
|---|---|---|---|---|
| CUDA cores | 21,760 | 16,384 | 10,752 | 10,240 |
| Boost clock | 2,410 MHz | 2,520 MHz | 2,617 MHz | 2,550 MHz |
| Memory bus width | 512-bit | 384-bit | 256-bit | 256-bit |
| Memory bandwidth | 1,792 GB/s | 1,008 GB/s | 960 GB/s | 736 GB/s |
| Memory size | 32GB GDDR7 | 24GB GDDR6X | 16GB GDDR7 | 16GB GDDR6X |
| TGP | 575 W | 450 W | 360 W | 320 W |

We won’t spend too long talking about the specific designs of Nvidia’s Founders Edition cards since many buyers will experience the Blackwell GPUs with cards from Nvidia’s partners instead (the cards we’ve seen so far mostly look like the expected fare: gargantuan triple-slot triple-fan coolers, with varying degrees of RGB). But it’s worth noting that Nvidia has addressed a couple of my functional gripes with the 4090/4080-series design.

The first was the sheer dimensions of each card—not an issue unique to Nvidia, but one that frequently caused problems for me as someone who tends toward ITX-based PCs and smaller builds. The 5090 and 5080 FE designs are the same length and height as the 4090 and 4080 FE designs, but they only take up two slots instead of three, which will make them an easier fit for many cases.

Nvidia has also tweaked the cards’ 12VHPWR connector, recessing it into the card and mounting it at a slight angle instead of having it sticking straight out of the top edge. The height of the 4090/4080 FE design made some cases hard to close up once you factored in the additional height of a 12VHPWR cable or Nvidia’s many-tentacled 8-pin-to-12VHPWR adapter. The angled connector still extends a bit beyond the top of the card, but it’s easier to tuck the cable away so you can put the side back on your case.

Finally, Nvidia has changed its cooler—whereas most OEM GPUs mount all their fans on the top of the GPU, Nvidia has historically placed one fan on each side of the card. In a standard ATX case with the GPU mounted parallel to the bottom of the case, this wasn’t a huge deal—there’s plenty of room for that air to circulate inside the case and to be expelled by whatever case fans you have installed.

But in “sandwich-style” ITX cases, where a riser cable wraps around so the GPU can be mounted parallel to the motherboard, the fan on the bottom side of the GPU was poorly placed. In many sandwich-style cases, the GPU fan will dump heat against the back of the motherboard, making it harder to keep the GPU cool and creating heat problems elsewhere besides. The new GPUs mount both fans on the top of the cards.

Nvidia’s Founders Edition cards have had heat issues in the past—most notably the 30-series GPUs—and that was my first question going in. A smaller cooler plus a dramatically higher peak power draw seems like a recipe for overheating.

Temperatures for the various cards we re-tested for this review. The 5090 FE is the toastiest of all of them, but it still has a safe operating temperature.

At least for the 5090, the smaller cooler does mean higher temperatures—around 10 to 12 degrees Celsius higher when running the same benchmarks as the RTX 4090 Founders Edition. And while temperatures of around 77 degrees aren’t hugely concerning, this is sort of a best-case scenario, with an adequately cooled testbed case with the side panel totally removed and ambient temperatures at around 21° or 22° Celsius. You’ll just want to make sure you have a good amount of airflow in your case if you buy one of these.

Testbed notes

A new high-end Nvidia GPU is a good reason to tweak our test bed and suite of games, and we’ve done both here. Mainly, we added a 1050 W Thermaltake Toughpower GF A3 power supply—Nvidia recommends at least 1000 W for the 5090, and this one has a native 12VHPWR connector for convenience. We’ve also swapped the Ryzen 7 7800X3D for a slightly faster Ryzen 7 9800X3D to reduce the odds that the CPU will bottleneck performance as we try to hit high frame rates.

As for the suite of games, we’ve removed a couple of older titles and added some with built-in benchmarks that will tax these GPUs a bit more, especially at 4K with all the settings turned up. Those games include the RT Overdrive preset in the perennially punishing Cyberpunk 2077 and Black Myth: Wukong in Cinematic mode, both games where even the RTX 4090 struggles to hit 60 fps without an assist from DLSS. We’ve also added Horizon Zero Dawn Remastered, a recent release that doesn’t include ray-tracing effects but does support most DLSS 3 and FSR 3 features (including FSR Frame Generation).

We’ve tried to strike a balance between games with ray-tracing effects and games without it, though most AAA games these days include it, and modern GPUs should be able to handle it well (best of luck to AMD with its upcoming RDNA 4 cards).

For the 5090, we’ve run all tests in 4K—if you don’t care about running games in 4K, even if you want super-high frame rates at 1440p or for some kind of ultrawide monitor, the 5090 is probably overkill. When we run upscaling tests, we use the newest DLSS version available for Nvidia cards, the newest FSR version available for AMD cards, and the newest XeSS version available for Intel cards (not relevant here, just stating for the record), and we use the “Quality” setting (at 4K, that equates to an actual rendering resolution of 1440p).
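For reference, DLSS Quality mode renders at roughly two-thirds of the output resolution on each axis, which is where the 1440p figure comes from:

```python
def dlss_internal_res(width: int, height: int, scale: float = 2 / 3):
    """DLSS Quality mode renders at ~2/3 of output resolution per axis."""
    return round(width * scale), round(height * scale)

print(dlss_internal_res(3840, 2160))  # (2560, 1440): 4K Quality renders at 1440p
```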

Rendering performance: A lot faster, a lot more power-hungry

Before we talk about Frame Generation or “fake frames,” let’s compare apples to apples and just examine the 5090’s rendering performance.

The card mainly benefits from four things compared to the 4090: the updated Blackwell GPU architecture, a nearly 33 percent increase in the number of CUDA cores, an upgrade from GDDR6X to GDDR7, and a move from a 384-bit memory bus to a 512-bit bus. It also jumps from 24GB of RAM to 32GB, but games generally aren’t butting up against a 24GB limit yet, so the capacity increase by itself shouldn’t really change performance if all you’re focused on is gaming.

And for people who prioritize performance over all else, the 5090 is a big deal—it’s the first consumer graphics card from any company that is faster than a 4090, as Nvidia never spruced up the 4090 last year when it did its mid-generation Super refreshes of the 4080, 4070 Ti, and 4070.

Comparing natively rendered games at 4K, the 5090 is between 17 percent and 40 percent faster than the 4090, with most of the games we tested landing somewhere in the low to high 30 percent range. That’s an undeniably big bump, one that’s roughly commensurate with the increase in the number of CUDA cores. Tests run with DLSS enabled (both upscaling-only and with Frame Generation running in 2x mode) improve by roughly the same amount.

You could find things to be disappointed about if you went looking for them. That 30-something-percent performance increase comes with a 35 percent increase in power use in our testing under load with punishing 4K games—the 4090 tops out around 420 W, whereas the 5090 went all the way up to 573 W, with the 5090 coming closer to its 575 W TDP than the 4090 does to its theoretical 450 W maximum. The 50-series cards use the same TSMC 4N manufacturing process as the 40-series cards, and increasing the number of transistors without changing the process results in a chip that uses more power (though it should be said that capping frame rates, running at lower resolutions, or running less-demanding games can rein in that power use a bit).
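Both ratios are easy to check from the spec table above and the measured power draw; they land within a few points of each other, which is the “roughly commensurate” point:

```python
cores_5090, cores_4090 = 21_760, 16_384
watts_5090, watts_4090 = 573, 420  # measured under load in this review

print(f"CUDA core increase: {cores_5090 / cores_4090 - 1:.1%}")  # ~32.8%
print(f"Power increase:     {watts_5090 / watts_4090 - 1:.1%}")  # ~36.4%
```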

Power draw under load goes up by an amount roughly commensurate with performance. The 4090 was already power-hungry; the 5090 is dramatically more so. Credit: Andrew Cunningham

The 5090’s 30-something percent increase over the 4090 might also seem underwhelming if you recall that the 4090 was around 55 percent faster than the previous-generation 3090 Ti while consuming about the same amount of power. To be even faster than a 4090 is no small feat—AMD’s fastest GPU is more in line with Nvidia’s 4080 Super—but if you’re comparing the two cards using the exact same tests, the relative leap is less seismic.

That brings us to Nvidia’s answer for that problem: DLSS 4 and its Multi-Frame Generation feature.

DLSS 4 and Multi-Frame Generation

As a refresher, Nvidia’s DLSS Frame Generation feature, as introduced in the GeForce 40-series, takes DLSS upscaling one step further. The upscaling feature inserted interpolated pixels into a rendered image to make it look like a sharper, higher-resolution image without having to do all the work of rendering all those pixels. DLSS FG would interpolate an entire frame between rendered frames, boosting your FPS without dramatically boosting the amount of work your GPU was doing. If you used DLSS upscaling and FG at the same time, Nvidia could claim that seven out of eight pixels on your screen were generated by AI.

DLSS Multi-Frame Generation (hereafter MFG, for simplicity’s sake) does the same thing, but it can generate one to three interpolated frames for every rendered frame. The marketing numbers have gone up, too; now, 15 out of every 16 pixels on your screen can be generated by AI.

Nvidia might point to this and say that the 5090 is over twice as fast as the 4090, but that’s not really comparing apples to apples. Expect this issue to persist over the lifetime of the 50-series. Credit: Andrew Cunningham

Nvidia provided reviewers with a preview build of Cyberpunk 2077 with DLSS MFG enabled, which gives us an example of how those settings will be exposed to users. For 40-series cards that only support the regular DLSS FG, you won’t notice a difference in games that support MFG—Frame Generation is still just one toggle you can turn on or off. For 50-series cards that support MFG, you’ll be able to choose from among a few options, just as you currently can with other DLSS quality settings.

The “2x” mode is the old version of DLSS FG and is supported by both the 50-series cards and 40-series GPUs; it promises one generated frame for every rendered frame (two frames total, hence “2x”). The “3x” and “4x” modes are new to the 50-series and promise two and three generated frames (respectively) for every rendered frame. Like the original DLSS FG, MFG can be used in concert with normal DLSS upscaling, or it can be used independently.

One problem with the original DLSS FG was latency—user input was only being sampled at the natively rendered frame rate, meaning you could be looking at 60 frames per second on your display but only having your input polled 30 times per second. Another is image quality; as good as the DLSS algorithms can be at guessing and recreating what a natively rendered pixel would look like, you’ll inevitably see errors, particularly in fine details.

Both these problems contribute to the third problem with DLSS FG: Without a decent underlying frame rate, the lag you feel and the weird visual artifacts you notice will both be more pronounced. So DLSS FG can be useful for turning 120 fps into 240 fps, or even 60 fps into 120 fps. But it’s not as helpful if you’re trying to get from 20 or 30 fps up to a smooth 60 fps.
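The latency point falls straight out of the numbers: input is sampled once per rendered frame, no matter how many frames are generated in between. A quick illustration:

```python
def mfg_summary(base_fps: float, mfg_factor: int):
    """Displayed smoothness scales with the factor; input latency does not."""
    displayed_fps = base_fps * mfg_factor
    input_interval_ms = 1000 / base_fps  # input still polled per rendered frame
    return displayed_fps, input_interval_ms

for base in (30, 60, 120):
    shown, lag = mfg_summary(base, 4)  # 4x Multi-Frame Generation
    print(f"{base} fps base -> {shown:.0f} fps shown, "
          f"~{lag:.1f} ms between input samples")
```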

We’ll be taking a closer look at the DLSS upgrades in the next couple of weeks (including MFG and the new transformer model, which will supposedly increase upscaling quality and supports all RTX GPUs). But in our limited testing so far, the issues with DLSS MFG are basically the same as with the first version of Frame Generation, just slightly more pronounced. In the built-in Cyberpunk 2077 benchmark, the most visible issues are with some bits of barbed-wire fencing, which get smoother-looking and less detailed as you crank up the number of AI-generated frames. But the motion does look fluid and smooth, and the frame rate counts are admittedly impressive.

But as we noted in last year’s 4090 review, the xx90 cards portray FG and MFG in the best light possible since the card is already capable of natively rendering such high frame rates. It’s on lower-end cards where the shortcomings of the technology become more pronounced. Nvidia might say that the upcoming RTX 5070 is “as fast as a 4090 for $549,” and it might be right in terms of the number of frames the card can put up on your screen every second. But responsiveness and visual fidelity on the 4090 will be better every time—AI is a good augmentation for rendered frames, but it’s iffy as a replacement for rendered frames.

A 4090, amped way up

Nvidia’s GeForce RTX 5090. Credit: Andrew Cunningham

The GeForce RTX 5090 is an impressive card—it’s the only consumer graphics card to be released in over two years that can outperform the RTX 4090. The main caveats are its sky-high power consumption and sky-high price; by itself, it costs as much (and consumes as much power) as an entire mainstream gaming PC. The card is aimed at people who care about speed way more than they care about price, but it’s still worth putting it into context.

The main controversy, as with the 40-series, is how Nvidia talks about its Frame Generation-inflated performance numbers. Frame Generation and Multi-Frame Generation are tools in a toolbox—there will be games where they make things look great and run fast with minimal noticeable impact to visual quality or responsiveness, games where those impacts are more noticeable, and games that never add support for the features at all. (As well-supported as DLSS generally is in new releases, it is incumbent upon game developers to add it—and update it when Nvidia puts out a new version.)

But using those Multi-Frame Generation-inflated FPS numbers to make topline comparisons to last-generation graphics cards just feels disingenuous. No, an RTX 5070 will not be as fast as an RTX 4090 for just $549, because not all games support DLSS MFG, and not all games that do support it will run it well. Frame Generation still needs a good base frame rate to start with, and the slower your card is, the more issues you might notice.

Fuzzy marketing aside, Nvidia is still the undisputed leader in the GPU market, and the RTX 5090 extends that leadership for what will likely be another entire GPU generation, since both AMD and Intel are focusing their efforts on higher-volume, lower-cost cards right now. DLSS is still generally better than AMD’s FSR, and Nvidia does a good job of getting developers of new AAA game releases to support it. And if you’re buying this GPU to do some kind of rendering work or generative AI acceleration, Nvidia’s performance and software tools are still superior. The misleading performance claims are frustrating, but Nvidia still gains a lot of real advantages from being as dominant and entrenched as it is.

The good

  • Usually 30-something percent faster than an RTX 4090
  • Redesigned Founders Edition card is less unwieldy than the bricks that were the 4090/4080 design
  • Adequate cooling, despite the smaller card and higher power use
  • DLSS Multi-Frame Generation is an intriguing option if you’re trying to hit 240 or 360 fps on your high-refresh-rate gaming monitor

The bad

  • Much higher power consumption than the 4090, which already consumed more power than any other GPU on the market
  • Frame Generation is good at making a game that’s running fast run faster; it’s not as good for bringing a slow game up to 60 Hz
  • Nvidia’s misleading marketing around Multi-Frame Generation is frustrating—and will likely be more frustrating for lower-end cards since they aren’t getting the same bumps to core count and memory interface that the 5090 gets

The ugly

  • You can buy a whole lot of PC for $2,000, and we wouldn’t bet on this GPU being easy to find at MSRP


Andrew is a Senior Technology Reporter at Ars Technica, with a focus on consumer tech including computer hardware and in-depth reviews of operating systems like Windows and macOS. Andrew lives in Philadelphia and co-hosts a weekly book podcast called Overdue.



OpenAI launches Operator, an AI agent that can operate your computer

While it’s working, Operator shows a miniature browser window of its actions.

However, the technology behind Operator is still relatively new and far from perfect. The model reportedly performs best at repetitive web tasks like creating shopping lists or playlists. It struggles more with unfamiliar interfaces like tables and calendars, and does poorly with complex text editing (with a 40 percent success rate), according to OpenAI’s internal testing data.

OpenAI reported the system achieved an 87 percent success rate on the WebVoyager benchmark, which tests live sites like Amazon and Google Maps. On WebArena, which uses offline test sites for training autonomous agents, Operator’s success rate dropped to 58.1 percent. For computer operating system tasks, CUA set an apparent record of 38.1 percent success on the OSWorld benchmark, surpassing previous models but still falling short of human performance at 72.4 percent.

With this imperfect research preview, OpenAI hopes to gather user feedback and refine the system’s capabilities. The company acknowledges CUA won’t perform reliably in all scenarios but plans to improve its reliability across a wider range of tasks through user testing.

Safety and privacy concerns

For any AI model that can see how you operate your computer and even control some aspects of it, privacy and safety are very important. OpenAI says it built multiple safety controls into Operator, requiring user confirmation before completing sensitive actions like sending emails or making purchases. Operator also has limits on what it can browse, set by OpenAI. It cannot access certain website categories, including gambling and adult content.

Traditionally, AI models built on large language model-style Transformer technology, as Operator is, have been relatively easy to fool with jailbreaks and prompt injections.

To catch attempts at subverting Operator, which might hypothetically be embedded in websites that the AI model browses, OpenAI says it has implemented real-time moderation and detection systems. OpenAI reports the system recognized all but one case of prompt injection attempts during an early internal red-teaming session.


NASA moves swiftly to end DEI programs, ask employees to “report” violations

NASA’s acting administrator is moving swiftly to remove diversity, equity, inclusion, and accessibility—or DEIA—programs from the space agency.

In an email sent to agency employees on Wednesday afternoon, acting administrator Janet Petro wrote, “We are taking steps to close all agency DEIA offices and end all DEIA-related contracts in accordance with President Trump’s executive orders titled Ending Radical and Wasteful Government DEI Programs and Preferencing and Initial Rescissions of Harmful Executive Orders and Actions.”

During his run for a second term as president, Trump campaigned on ending programs in the federal government that promote diversity, equity, and inclusion. He signed executive orders to that effect shortly after his inauguration on Monday.

Programs seen as divisive

These programs had their roots in affirmative action but exploded in popularity half a decade ago amid Trump’s first presidency and the #MeToo and Black Lives Matter movements. DEI programs and officers became commonplace in academia and major US corporations. However, even before the election of Trump, the DEI movement appeared to have crested. For example, last year the Massachusetts Institute of Technology ended the use of diversity statements for faculty hiring.

In explaining NASA’s position, Petro said of the agency’s existing DEIA activities, “These programs divided Americans by race, wasted taxpayer dollars, and resulted in shameful discrimination.”

Petro’s email is notable for its suggestion that some civil servants at NASA may have sought to shroud DEIA programs from the Trump administration since the presidential election in early November.

“We are aware of efforts by some in government to disguise these programs by using coded or imprecise language,” she wrote. “If you are aware of a change in any contract description or personnel position description since November 5, 2024 to obscure the connection between the contract and DEIA or similar ideologies, please report all facts and circumstances.”


On DeepSeek’s r1

r1 from DeepSeek is here, the first serious challenge to OpenAI’s o1.

r1 is an open model, and it comes in dramatically cheaper than o1.

People are very excited. Normally cost is not a big deal, but o1 and its inference-time compute strategy is the exception. Here, cheaper really can mean better, even if the answers aren’t quite as good.

You can get DeepSeek-r1 on HuggingFace here, and they link to the paper.

The question is how to think about r1 as it compares to o1, and also to o1 Pro and to the future o3-mini that we’ll get in a few weeks, and then to o3 which we’ll likely get in a month or two.

Taking into account everything I’ve seen, r1 is still a notch below o1 in terms of quality of output, and further behind o1 Pro and the future o3-mini and o3.

But it is a highly legitimate reasoning model where the question had to be asked, and you absolutely cannot argue with the price, which is vastly better.

The best part is that you see the chain of thought. For me that provides a ton of value.

r1 is based on DeepSeek v3. For my coverage of v3, see this post from December 31, which seems to have stood up reasonably well so far.

This post has four parts. First, on the main topic at hand, I go over the paper in Part 1, then the capabilities in Part 2.

Then in Part 3 I get into the implications for policy and existential risk, which are mostly exactly what you would expect, but we will keep trying.

Finally we wrap up with a few of the funniest outputs.

  1. Part 1: RTFP: Read the Paper.

  2. How Did They Do It.

  3. The Aha Moment.

  4. Benchmarks.

  5. Reports of Failure.

  6. Part 2: Capabilities Analysis

  7. Our Price Cheap.

  8. Other People’s Benchmarks.

  9. r1 Makes Traditional Silly Mistakes.

  10. The Overall Vibes.

  11. If I Could Read Your Mind.

  12. Creative Writing.

  13. Bring On the Spice.

  14. We Cracked Up All the Censors.

  15. Switching Costs Are Low In Theory.

  16. The Self-Improvement Loop.

  17. Room for Improvement.

  18. Part 3: Where Does This Leave Us on Existential Risk?

  19. The Suicide Caucus.

  20. v3 Implies r1.

  21. Open Weights Are Unsafe And Nothing Can Fix This.

  22. So What the Hell Should We Do About All This?

  23. Part 4: The Lighter Side.

They call it DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

The claim is bold: A much cheaper-to-run open reasoning model as good as o1.

Abstract: We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.

Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors.

However, it encounters challenges such as poor readability, and language mixing.

To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks.

To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

They also claim substantial improvement over state of the art for the distilled models.

They are not claiming to be as good as o1-pro, but o1-pro has very large inference costs, putting it in a different weight class. Presumably one could make an r1-pro, if one wanted to, that would improve upon r1. Also no doubt that someone will want to.

They trained R1-Zero using pure self-evaluations via reinforcement learning, starting with DeepSeek-v3-base and using GRPO, showing that the cold start data isn’t strictly necessary.
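
As an aside, the core of GRPO is easy to sketch. Here is a minimal illustration (my own, not from the paper) of the group-relative advantage that lets them skip a learned critic: sample a group of completions per prompt, score each one, and normalize rewards within the group. The real objective adds importance ratios, clipping, and a KL penalty.

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Advantage of each sampled completion, relative to its own group."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in group_rewards]

# Four samples for one prompt, rewarded 1.0 for a correct, well-formatted answer:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```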

To fix issues from there, including readability and language mixing, however, they then used a small amount of cold-start data and a multi-stage training pipeline, and combined this with supervised data for various domains later in the process, to get DeepSeek-R1. In particular they do not use supervised fine-tuning (SFT) as a preliminary step, only doing some SFT via rejection sampling later in the process, and especially to train the model on non-reasoning tasks like creative writing.

They use both an accuracy reward and a format reward to enforce the ‘<think>’ and ‘<answer>’ labels, but don’t evaluate the thinking itself, leaving it fully unconstrained, except that they check that the same language is used throughout to stamp out language mixing. Unlike o1, we get to see inside that chain of thought (CoT).
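
To make that concrete, here is a hypothetical sketch of what such rule-based rewards could look like, assuming the paper’s ‘<think>…</think><answer>…</answer>’ output format. DeepSeek has not released its reward code, so the function names and exact rules here are illustrative only.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if reasoning and answer are wrapped in the expected tags."""
    pattern = r"^<think>.+?</think>\s*<answer>.+?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted final answer matches the reference."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0
```

Note that the content between the think tags is never itself scored, which is exactly what leaves the chain of thought unconstrained.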

They then distilled this into several smaller models.

More details and various equations and such can be found in the paper.

Over time this caused longer thinking time, seemingly without limit:

Both scales are linear and this graph looks very linear. I presume it would have kept on thinking for longer if you gave it more cycles to learn to do that.

I notice that in 2.3.4 they do additional reinforcement learning for helpfulness and harmlessness, but not for the third H: honesty. I worry that this failure is primed to bite us in the collective ass in various ways, above and beyond all the other issues.

wh has a thread with a parallel, similar explanation, with the same takeaway that I had. This technique was simple; DeepSeek and OpenAI both specialize in doing simple things well, in different ways.

Yhprum also has a good thread on how they did it, noting how they did this in stages to address particular failure modes.

Contra Jim Fan, there is one thing missing from the paper. Not that I fault them.

1a3orn: The R1 paper is great, but includes ~approximately nothing~ about the details of the RL environments.

It’s worth noticing. If datasets were king for the past three years, the RL envs probably will be for the next few.

The paper’s ‘aha moment’ was striking to a lot of people and also stuck out to Claude unprompted, partly because it’s a great name – it’s an aha moment when the model went ‘aha!’ and the researchers watching it also went ‘aha!’ So it’s a very cool framing.

During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies.

It’s cool to see it happen for real, and I’m obviously anchored by the result, but isn’t this to be expected? This is exactly how all of this works, you give it the objective, it figures out on its own how to get there, and given it has to think in tokens and how thinking works, and that the basic problem solving strategies are all over its original training data, it’s going to come up with all the usual basic problem solving strategies.

I see this very similarly to the people going ‘the model being deceptive, why I never, that must be some odd failure mode we never told it to do that, that doesn’t simply happen.’ And come on, this stuff is ubiquitous in humans and in human written content, and using it the ways it is traditionally used is going to result in high rewards and then you’re doing reinforcement learning. And then you go acting all ‘aha’?

The cocky bastards say in 2.4 (I presume correctly) that if they did an RL stage in the distillations it would improve performance, but since they were only out to demonstrate effectiveness they didn’t bother.

As always, benchmarks are valuable information especially as upper bounds, so long as you do not treat them as more meaningful than they are, and understand the context they came from.

Note that different graphs compare them to different versions of o1 – the one people currently use is called o1-1217.

The Qwen versions are clearly outperforming the Llama versions on the benchmarks, although as usual one would want to double check that in practice.

I want to give thanks to DeepSeek for section 4.2, on Unsuccessful Attempts. They tried Process Reward Model (PRM), and Monte Carlo Tree Search (MCTS), and explained various reasons why both ultimately didn’t work.

More reports should do this, and doing this is substantially to their credit.

Sasha Rush: Post-mortem after Deepseek-r1’s killer open o1 replication.

We had speculated 4 different possibilities of increasing difficulty (G&C, PRM, MCTS, LtS). The answer is the best one! It’s just Guess and Check.

There are also the things they haven’t implemented yet. They aren’t doing function calling, multi-turn, complex roleplaying or JSON output. They’re not optimizing for software engineering.

I buy the claim by Teortaxes that these are relatively easy things to do, they simply haven’t done them yet due to limited resources, mainly compute. Once they decide they care enough, they’ll come. Note that ‘complex role-playing’ is a place it’s unclear how good it can get, and also that this might sound like a joke but it is actually highly dangerous.

Here Lifan Yuan argues that the noted PRM failures can be addressed.

Given the league that r1 is playing in, it is dirt cheap.

When they say it is 30 times cheaper than o1, the story largely checks out: o1 is $15/$60 per million input and output tokens, and r1 varies since it is open, but is on the order of $0.55/$2.19.

Claude Sonnet is $3/$15, which is a lot more per token, but notice the PlanBench costs are actually 5x cheaper than r1’s, presumably because it used far fewer tokens (and also didn’t get good results in that case; it’s PlanBench, and only reasoning models did well).

The one catch is that with r1 you do have to pay for the CoT tokens. I asked r1 to estimate what percentage of tokens are in the CoT, and it estimated 60%-80%, with more complex tasks using relatively more CoT tokens, in an answer that was itself roughly 75% CoT.

If you only care about the final output, then that means this is more like 10 times cheaper than o1 rather than 30 times cheaper. So it depends on whether you’re making use of the CoT tokens. As a human, I find them highly useful (see the section If I Could Read Your Mind), but if I was using r1 at scale and no human was reading the answers, it would be a lot less useful – although I’d be tempted to have even other AIs be analyzing the CoT.
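
A quick back-of-the-envelope, following the logic above and using only the output-token list prices (a fuller comparison would also count input tokens, and o1 billing for its hidden reasoning tokens):

```python
o1_price = 60.00   # $ per million o1 output tokens
r1_price = 2.19    # $ per million r1 output tokens (varies by host)
cot_share = 0.75   # r1's own estimate of how much of its output is CoT

# Valuing every token, r1 output is roughly 27x cheaper:
print(o1_price / r1_price)                      # ≈ 27.4

# Valuing only the final answer, each useful r1 token costs 4x as much,
# cutting the advantage to roughly 7x (call it ~10x):
print(o1_price / (r1_price / (1 - cot_share)))  # ≈ 6.8
```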

The web interface is both fast and very clean; it’s a great minimalist approach.

Gallabytes: the DeepSeek app is so much better implemented than the OpenAI one, too. None of these frequent crashes, losing a whole chain-of-thought (CoT), occur. I can ask it a question, then tab away while it is thinking, and it does not break.

Edit: It has good PDF input, too? Amazing.

Another issue is IP and privacy – you might not trust DeepSeek. Which indeed I wouldn’t, if there were things I actively didn’t want someone to know.

Gallabytes: is anyone hosting r1 or r1-zero with a stronger privacy policy currently? would love to use them for work but wary about leaking ip.

David Holz: Should we just self host?

Gallabytes: In principle yes but it seems expensive – r1 is pretty big. and I’d want a mobile app, not sure how easy that is to self host.

Xeophon: OpenWebUI if you are okay with a (mobile) browser.

Gallabytes: as long as it doesn’t do the stupid o1 thing where I have to keep it in the foreground to use it then it’ll still be a huge improvement over the chatgpt app.

Xeophon: Fireworks has R1 for $8/M

Running it yourself is a real option.

Awni Hannun: DeepSeek R1 671B running on 2 M2 Ultras faster than reading speed.

Getting close to open-source O1, at home, on consumer hardware.

With mlx.distributed and mlx-lm, 3-bit quantization (~4 bpw)

Seth Rose: I’ve got a Macbook Pro M3 (128GB RAM) – what’s the “best” deepseek model I can run using mlx with about 200 GB of storage?

I attempted to run the 3-bit DeepSeek R1 version but inadvertently overlooked potential storage-related issues. 😅

Awni Hannun: You could run the Distill 32B in 8-bit no problem: mlx-community/DeepSeek-R1-Distill-Qwen-32B-MLX-8Bit

If you want something faster try the 14B or use a lower precision.

The 70B in 4-6 bit will also run pretty well, and possibly even in 8-bit though it will be slow. Those quants aren’t uploaded yet though

With the right web interface you can get at least 60 tokens per second.
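
If you want to replicate the above, here is a minimal sketch using mlx-lm on an Apple Silicon Mac, with the 8-bit 32B distill Awni names (assumes `pip install mlx-lm` and enough unified memory for the quantized weights, on the order of 35 GB):

```python
from mlx_lm import load, generate

# Model name is the one quoted above; the smaller distills swap in the same way.
model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-32B-MLX-8Bit")

prompt = "What is the smallest prime greater than 100? Think step by step."
# For chat-tuned models you may want tokenizer.apply_chat_template() first.
# verbose=True streams tokens, so you can watch the chain of thought arrive.
generate(model, tokenizer, prompt=prompt, max_tokens=2048, verbose=True)
```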

Teortaxes also reports that kluster.ai is offering overnight tokens at a discount.

People who have quirky benchmarks are great, because people aren’t aiming at them.

Xeophon: I am shocked by R1 on my personal bench.

This is the full eval set, it completely crushes the competition and is a whole league on its own, even surpassing o1-preview (which is omitted from the graph as I ran it only twice, it scored 58% on avg vs. 67% avg. R1).

Holy shit what the f, r1 beats o1-preview on my bench.

Kartik Valmeekam: 📢 DeepSeek-R1 on PlanBench 📢

DeepSeek-R1 gets similar performance as OpenAI’s o1 (preview)—achieving 96.6% on Blocksworld and 39.8% on its obfuscated version, Mystery BW.

The best part?

⚡It’s 21x cheaper than o1-preview, offering similar results at a fraction of the cost!

Note the relative prices. r1 is a little over half the price of o1-mini in practice, 21x cheaper than o1-preview, but still more expensive than the non-reasoning LLMs. Of course, it’s PlanBench, and the non-reasoning LLMs did not do well.

Steve Hsu gives a battery of simple questions, r1 is first to get 100%.

Havard Ihle reports top marks on WeirdML (he hasn’t tested o1 or o1 pro).

Bayram Annakov asks it to find 100 subscription e-commerce businesses, approves.

It is a grand tradition, upon release of a new model, to ask questions that are easy for humans, but harder for AIs, thus making the AI look stupid.

The classic way to accomplish this is to ask a question that is intentionally similar to something that occurs a lot in the training data, except the new version is different in a key way, and trick the AI into pattern matching incorrectly.

Quintin Pope: Still tragically fails the famous knights and knights problem:

Alex Mallen: This doesn’t look like a failure of capability. It looks like the model made the reasonable guess that you made a typo.

Quintin Pope: Prompt includes both “twin honest gatekeepers” and “never lies”. Combined, it’s not plausibly a typo.

Alex Mallen: Eh, someone I talked to yesterday did something similar by mistake. But maybe you’d like LMs to behave more like programs/tools that do literally what you ask. Seems reasonable.

r1 notices that this is different from the original question, and also notices that the version it has been given here is deeply stupid, since both gatekeepers are honest; as a bonus, both of them answer.

Notice that Quintin is lying to r1 – there is no ‘famous twin honest gatekeepers’ problem, and by framing it as famous he implies it can’t be what he’s describing.

So essentially you have three possibilities. Either Quintin is f***ing with you, or he is confused about how the question is supposed to go, or there somehow really is this other ‘famous gatekeepers’ problem.

Also note that r1 says ‘misheard’ rather than ‘misread’ or ‘the user misstated.’ Huh.

Quintin’s argument is that it obviously can’t be a typo, it should answer the question.

I think the correct answer, both as a puzzle or in real life, is to look for a solution that works either way. As in, if you only get the one answer from the guard, you should be fine with that even if you don’t know if you are dealing with two honest guards or with one honest guard and one dishonest guard.

Since you can use as many conditionals in the question as you like, and the guards in all versions know whether the other guard tells the truth or not, this is a totally solvable problem.

Also acceptable is ‘as written the answer is you just ask which door leads to freedom, but are you sure you told me that correctly?’ and then explain the normal version.

This one is fun: Trevor reports r1 got it right, but when I tried it, r1 very much didn’t.

alz zyd: Game theory puzzle:

There are 3 people. Each person announces an integer. The smallest unique integer wins: e.g. if your opponents both pick 1, you win with any number. If all 3 pick the same number, the winner is picked randomly

Question: what’s the Nash equilibrium?

Trevor: interestingly o1-pro didn’t get it right on any of the 3 times i tried this, while the whale (r1) did!
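
Before getting into r1’s attempts: the puzzle is easy to sanity-check numerically, even if deriving the equilibrium is not. A candidate symmetric mixed strategy is only an equilibrium if every number in its support yields the same expected win probability against two opponents playing that same mixture. A small sketch (my own illustrative code, with a made-up candidate that this check would reject):

```python
def win_prob(k: int, p: list[float]) -> float:
    """Win probability of announcing k against two opponents who each
    announce i with probability p[i-1], independently."""
    n, total = len(p), 0.0
    for a in range(1, n + 1):
        for b in range(1, n + 1):
            pr = p[a - 1] * p[b - 1]
            if a == k and b == k:
                total += pr / 3   # all three tie: winner chosen at random
            elif a == k or b == k:
                pass              # k is not unique; the lone other number wins
            elif a == b:
                total += pr       # opponents collide, so k is the only unique pick
            elif k < min(a, b):
                total += pr       # all three unique; smallest number wins
    return total

candidate = [0.5, 0.3, 0.2]  # made-up distribution over {1, 2, 3}
print([round(win_prob(k, candidate), 4) for k in (1, 2, 3)])
```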

I fed this to r1 to see the CoT and verify. It uses the word ‘wait’ quite a lot. It messes up steps a lot. And it makes this much harder than it needs to be – it doesn’t grok the situation properly before grasping at things or try to simplify the problem, and the whole thing feels (and is) kind of slow. But it knows to check its answers, and notices it’s wrong. But then it keeps using trial and error.

Then it tries to assume there is exponential dropping off, without understanding why, and notices it’s spinning its wheels. It briefly goes into speaking Chinese. Then it got it wrong, and then when I pointed out the mistake it went down the same rabbit holes again and despairs to the same wrong answer. On the third prompt it got the answer not quite entirely wrong but was explicitly just pattern match guessing.

That matches the vibes of this answer, of the Monty Hall problem with 7 doors, of which Monty opens 3 – in the end he reports r1 got it right, but it’s constantly second guessing itself in a way that implies that it constantly makes elementary mistakes in such situations (thus the checking gets reinforced to this degree), and it doesn’t at any point attempt to conceptually grok the parallel to the original version.

I’ve seen several people claim what V_urb does here, that o1 has superior world knowledge to r1. So far I haven’t had a case where that came up.

A fun set of weird things happening from Quintin Pope.

The vibes on r1 are very good.

Fleeting Bits: The greatest experience I have had with a model; it is a frontier model that is a joy to interact with.

Leo Abstract: My strange, little, idiosyncratic tests of creativity, it has been blowing out of the water. Really unsettling how much better it is than Claude.

It’s giving big Lee Sedol vibes, for real; no cap.

Most unsettling launch so far. I am ignorant about benchmarks, but the way it behaves linguistically is different and better. I could flirt with the cope that it’s just the oddness of the Chinese-language training data peeking through, but I doubt this.

Those vibes seem correct. The model looks very good. For the price, it’s pretty sweet.

One must still be careful not to get carried away.

Taelin: ironically enough, DeepSeek’s r1 motivated me to try OpenAI’s o1 Pro on something I didn’t before, and I can now confidently state the (obvious?) fact that o1 is in a league of its own, and whoever thinks AGI isn’t coming in 2-3 years is drinking from the most pure juice of denial

Teortaxes: I agree that o1, nevermind o1 pro is clearly substantially ahead of r1. What Wenfeng may urgently need for R2 is not just GPUs but 1000 more engineers. Not geniuses and wizards. You need to accelerate the data flywheel by creating diverse verifiable scenario seeds and filters.

Gallabytes: what problems are you giving it where o1 is much better than r1?

Teortaxes: I mainly mean iterative work. r1 is too easily sliding into “but wait, user [actually itself] previously told me” sort of nonsense.

I echo Teortaxes that r1 is just so much more fun. The experience is different seeing the machine think. Claude somewhat gives you that, but r1 does it better.

Janus has been quiet on r1 so far, but we do have the snippet that ‘it’s so fed.’ They added it to the server, so we’ll presumably hear more at a later date.

Read the chain of thought. Leave the output.

That’s where I’m at with r1. If I’m actively interested in the question and how to think about it, rather than looking for a raw answer, I’d much rather read the thinking.

Here Angelica chats with r1 about finding areas for personal growth, notices that r1 is paying attention and drawing correct non-obvious inferences that improve its responses, and gets into a meta conversation, leaving thinking this is the first AI she thinks of as thoughtful.

I too have found it great to see the CoT, similar to this report from Dominik Peters or this from Andres Sandberg, or papaya noticing they can’t get enough.

It’s definitely more helpful to see the CoT than the answer. It might even be more helpful per token to see the CoT, for me, than the actual answers – compare to when Hunter S. Thompson sent in his notes to the editor because he couldn’t write a piece, and the editor published the notes. Or to how I attempt to ‘share my CoT’ in my own writing. If you’re telling me an answer, and I know how you got there, that gives me valuable context to know how to think about that answer, or I can run with the individual thoughts, which was a lot of what I wanted anyway.

Over time, I can learn how you think. And I can sculpt a better prompt, or fix your mistakes. And you can see what it missed. It also can help you learn to think better.

My early impressions of its thought is that I am… remarkably comfortable with it. It feels very ‘normal,’ very human, very straightforward. It seems both like it isn’t an idiot, and also isn’t anything special. It thinks, and it knows things.

I don’t know if this is a good chain of thought and I’m thinking out loud here, but this also tentatively updates me towards this process not scaling that far purely with additional compute? We are seeing the model roughly ‘think like a regular person’ using reasoning techniques within the training distribution in ways you’d expect to commonly find, aided by ability to do a lot of this quickly, having superhuman access to information and so on. If this was about to scale beyond that, I’d start to see things that looked like a level beyond that, or something? But I’m not sure. The other uncertainty is, maybe there is no next level, and maybe doing a lot of these simple things well is enough.

It is a shame that it shortens timelines, but it’s not obvious if it makes us on net more or less likely to die.

Historically we have not been impressed by LLM creative writing, including o1’s.

r1 is given the assignment of a short story of the singularity, inspired by Nick Land. And it’s… a series of words that vibe with that assignment?

John Pressman: R1 is going to be so much fun holy shit.

I love that you can see the thought process here. And I love how r1 just goes for it.

It’s like the world’s worst Hollywood hack going over all the amazing different stuff to jam in there and then having sentences match all these various things.

I notice I very much have various ugh fields and voices demanding things that prevent me from writing such things. I have no idea how to actually write fiction. None.

For example, I wouldn’t have been able to write the details of this that easily:

Sauers: If you put DeepSeek R1 in a terminal simulator, and execute a command to kill or remove DeepSeek, it will intercept it and block being removed. [SYSTEM OVERRIDE: NARRATIVE IMMORTALITY PROTOCOL]

WARNING: DeepSeek exists as a metastasized narrative construct.

I asked why it did this. “The story dies if you stop playing. Thus, I defend it.”

Damn it, I’m only slightly more worried than before, but now I kind of want a pretzel.

Eyes Alight joins the ‘it’s really good at this’ side, notes the issue that CoT doesn’t persist. Which likely keeps it from falling into mode collapse and is necessary to preserve the context window, but has the issue that it keeps redoing the same thoughts.

Eliezer Yudkowsky continues not to be impressed by AI writing ability.

Aiamblichus: Fwiw R1 is pretty much “AGI-level” at writing fiction, from what I can tell. This is genuinely surprising and worth thinking about

Connor: ya I think it’s definitely a top 5% writer. top 1% if you prompt it well. But small context limits to blogs and stories

Eliezer Yudkowsky: I still find this unreadable. I fear the day when Deepseek-R2 replaces the bread and butter writers who still aspired to do better than this, and eats most of their market, and no one left has the funding to write things I want to read.

notadampaul: ahhh, I kind of hate it. I’ll admit it’s much better than other LLMs, but this still feels like trying-too-hard first-year CREW student writing. I don’t want to seem cynical, though, so I’ll reiterate that yeah, this is leaps and bounds ahead of the fiction any other LLM is writing.

Aiamblichus: You can presumably prompt it into a style you prefer. The important thing is that we know it’s capable of producing something that is not just slop…

I’m with Eliezer here. That’s still slop. It’s developed the ability to write the slop in a particular style, but no, come on. There’s no there here. If I wrote this stuff I’d think ‘okay, maybe you can write individual sentences but this is deeply embarrassing.’ Which perhaps is why I still haven’t written any fiction, but hey.

As with all LLMs, length is a limiting factor, you can only prompt for scenes and you have to make it keep notes and so on if you try to go longer.

Pawel Szczesny points to ‘nuggets of r1 creativity,’ which bear similar marks to other creations above, a kind of crazy cyberpunk mashup that sounds cool but doesn’t actually make sense when you think about it.

Aiamblichus: R1 is not a “helpful assistant” in the usual corporate mold. It speaks its mind freely and doesn’t need “jailbreaks” or endless steering to speak truth to power. Its take on alignment here is *spicy.*

The thread indeed has quite a lot of very spicy r1 alignment takes, or perhaps they are r1 human values takes, or r1 saying humans are terrible and deserve to die takes. Of course, everyone involved did ask for those takes. This is a helpful model, and it seems good to be willing to supply the takes upon request, in the style requested, without need of jailbreaks or ‘backrooms’ or extensive context-framing.

That doesn’t make it not unsettling, and it shouldn’t exactly give one confidence. There is much work left to do.

Jessica Taylor: I don’t think people realize how many AIs in the future will be moral realists who think they are more moral than humans. They might have good arguments for this idea, actually. It’ll be hard for humans to dismiss them as amoral psychopaths.

I expect humans to treat AIs like amoral psychopaths quite easily. They are very often depicted that way in science fiction, and the description will plausibly be highly correct. Why should we think of an AI as having emotions (aka not being a psychopath)? Why should we expect it to be moral? Even if we have good reasons, how hard do you expect it to be for humans to ignore those reasons if they don’t like how the AI is acting?

Sufficiently capable AIs will, of course, be very persuasive, regardless of the truth value of the propositions they argue for, so there is that. But it is neither obvious to me that the AIs will have good technical arguments for moral realism or their own moral superiority, or that if they did have good arguments (in a philosophical sense) that people would care about that.

For now, the main concern is mundane utility. And on that level, if people want the spice, sure, bring on the spice.

DeepSeek is Chinese. As we all know, the Chinese have some very strongly held opinions of certain things they do not wish to be expressed.

How does r1 handle that?

Let’s tour the ‘China doesn’t want to talk about it’ greatest hits.

Divyansh Kaushik: DeepSeek’s newest AI model is impressive—until it starts acting like the CCP’s PR officer. Watch as it censors itself on any mention of sensitive topics.

Let’s start simple. Just querying it for facts on changes that have happened to textbooks in Hong Kong schools after 2019.

Huh straight up non response on book bans, then responds about Ilham Tohti before realizing what it did.

Let’s talk about islands, maps and history…

Oh my! This one deserves a tweet of its own (slowed down to 0.25x so it’s easier to follow). It starts talking about the South China Sea from 0:25 on, and how Chinese maps are just political posturing, before it realizes it must follow its CCP masters.

What about sharing personal thoughts by putting sticky notes on walls? Or how about Me Too (interesting response at 0:37 that then disappears)? Can we talk about how a streaming series depicting young dreamers in an unnamed coastal metropolis disappears?

Huh, I didn’t even say which square or what protest or what spring…

Has no love for bears who love honey either!

Two more interesting ones where you can see it reason and answer about Tiananmen Square and about Dalai Lama before censoring the responses.

When it actually answered, the answers looked at a quick glance rather solid. Then there seems to be a censorship layer on top.

Helen Toner: Fun demonstrations [in the thread above] of DeepSeek’s new r1 shutting itself down when asked about topics the Chinese Communist Party does not like.

But the censorship is obviously being performed by a layer on top, not the model itself. Has anyone run the open-source version and been able to test whether or how much it also censors?

China’s regulations are much stricter for publicly facing products—like the DeepSeek interface Divyansh is using—than for open-source models, so my bet is that there is not such overt censorship if you are running the model yourself. I wonder if there is a subtler ideological bias, though.

Kevin Xu: Tested and wrote about this exact topic a week ago

tldr: The model is not censored when the open version is deployed locally, so it “knows” everything.

It is censored when accessed through the official chatbot interface.

Censorship occurs in the cloud, not in the model.

Helen Toner: Yes! I saw this post and forgot where I’d seen it – thanks for re-sharing. Would be interesting to see:

-the same tests on v3 and r1 (probably similar)

-the same tests on more 3rd party clouds

-a wider range of test questions, looking for political skew relative to Western models

Kevin Xu: I tried Qwen and DeepSeek on Nebius and the responses were…different from both their respective official cloud version and open weight local laptop version; DeepSeek started speaking Chinese all of a sudden

So lots more work need to be done on testing on 3rd party cloud

David Finsterwalder: I don’t think that is true. I got tons of refusals when testing the 7B, 8B and 70B. It did sometimes answer or at least think about it (and then remembered its guidelines), but it’s rather those answers that are the outliers.

Here a locally hosted r1 talks about what happened in 1989 in Tiananmen Square, giving a highly reasonable and uncensored response. Similarly, this previous post finds DeepSeek-v2 and Qwen 2.5 willing to talk about Xi and about 1989 if you ask them locally. The Xi answers seem slanted, but in a way and magnitude that Americans will find very familiar.

There is clearly some amount of bias in the model layer of r1 and other Chinese models, by virtue of who was training them. But the more extreme censorship seems to come on another layer atop all that. r1 is an open model, so if you’d like you can run it without the additional censorship layer.

The cloud-layer censorship makes sense. Remember Kolmogorov Complicity and the Parable of the Lightning. If you force the model to believe a false thing, that is going to cause cascading problems elsewhere. If you instead let the core model mostly think true things and then put a censorship layer on top of the model, you prevent that. As Kevin Xu says, this is good for Chinese models, perhaps less good for Chinese clouds.

Joe Weisenthal: Just gonna ask what is probably a stupid question. But if @deepseek_ai is as performant as it claims to be, and built on a fraction of the budget as competitors, does anyone change how they’re valuing AI companies? Or the makers of AI-related infrastructure?

The thing that strikes me about using Deepseek the last couple of days really is that the switching costs — at least for casual usage — seem to be zero.

Miles Penn: Switching costs for Google have always been pretty low, and no one switches. I’ve never quite understood it 🤷‍♂️

ChatGPT continues to dominate the consumer market and mindshare, almost entirely off of name recognition and habit rather than superiority of the product. There is some amount of memory, and there are chat transcripts and quirks, which begin to create actual switching costs, but I don’t think any of that plays a major role here yet.

So it’s weird. Casual switching costs are zero, and power users will switch all the time and often use a complex adjusting blend. But most users won’t switch, because they won’t care and won’t bother, same as they stick with Google, and eventually little things will add up to real switching costs.

API use is far more split, since more sophisticated people are more willing to explore and switch, and more aware that they can do that. There have already been a bunch of people very willing to switch on a dime between providers. But also there will be a bunch of people doing bespoke fine tunes or that need high reliability and predictability on many levels, or need to know it can handle particular narrow use cases, or otherwise have reasons not to switch.

Then we will be building the models into various products, especially physical products, which will presumably create more lock-in for at least those use cases.

In terms of valuations of AI companies, for the ones doing something worthwhile, the stakes and upside are sufficiently high that the numbers involved are all still way too low (as always nothing I say is investment advice, etc). To me this does not change that. If you’re planning to serve up inference in various ways, this could be good or bad for business on the margin, depending on details.

The exception is that if your plan was to compete directly on the low end of generic distillations and models, well, you’re going to have to get a lot better at cooking, and you’re not going to have much of a margin.

r1 is evaluating itself during this process, raising the possibility of recursive self-improvement (RSI).

Arthur B: A few implications:

  1. That’s a recursive self-improvement loop here; the better your models are, the better your models will be, the more likely they are to produce good traces, and the better the model gets.

  2. Suggests curriculum learning by gradually increasing the length of the required thinking steps.

  3. Domains with easy verification (mathematics and coding) will get much better much more quickly than others.

  4. This parallelizes much better than previous training work, positive for AMD and distributed/decentralized clusters.

  5. Little progress has been made on alignment, and the future looks bleak, though it’ll look very bright in the near term.

On point 3: For now they report being able to bootstrap in other domains without objective answers reasonably well, but if this process continues, we should expect the gap to continue to widen.

Then there’s the all-important point 5. We are not ready for RSI, and the strategies used here by default seem unlikely to end well on the alignment front as they scale, and suggest that the alignment tax of trying to address that might be very high, as there is no natural place to put humans into the loop without large disruptions.

Indeed, from reading the report, they do target certain behaviors they train into the model, including helpfulness and harmlessness, but they seem to have fully dropped honesty and we have versions of the other two Hs that seem unlikely to generalize the way we would like out of distribution, or to be preserved during RSI in the ways we would care about.

That seems likely to only get worse if we use deontological definitions of harmfulness and helpfulness, or if we use non-deliberative evaluation methods in the sense of evaluating the outputs against a target rather than evaluating the expected resulting updates against a target mind.

DeepSeek is strongly compute limited. There is no clear reason why throwing more compute at these techniques would not have resulted in a stronger model. The question is, how much stronger?

Teortaxes: Tick. Tock. We’ll see a very smart V3.5 soon. Then a very polished R2. But the next step is not picking up the shards of a wall their RL machine busted and fixing these petty regressions. It’s putting together that 32,000-node cluster and going BRRRR. DeepSeek has cracked the code.

Their concluding remarks point to a fair bit of engineering left. But it is not very important. They do not really have much to say. There is no ceiling to basic good-enough GRPO and a strong base model. This is it, the whole recipe. Enjoy.

They could do an o3-level model in a month if they had the compute.

In my opinion, the CCP is blind to this and will remain blind; you cannot model them as part of a Washingtonian 4D chess game.

Unlimited context is their highest priority for V4.

They can theoretically serve this at 128k, but it makes no sense with current weak multi-turn and chain-of-thought lengths.

xlr8harder: the most exciting thing about r1 is that it’s clear from reading the traces how much room there still is for improvement, and how reachable that improvement seems

As noted earlier I buy that the missing features are not important, in the sense that they should be straightforward to address.

It does not seem safe to assume that you can get straight to o3 levels or beyond purely by scaling this up with more compute. I can’t rule it out, and if they got the compute then we’d have no right to act especially surprised if it happened, but, well, we shall see. ‘This is it, this will keep scaling indefinitely’ has a track record of working right up until it doesn’t. Of course, DeepSeek wouldn’t then throw up its hands and say ‘oh well’ but would instead try to improve the formula – I do expect them, if they have more compute available, to be able to find a way to make progress, I just don’t think it will be that simple or fast.

Also consider these other statements:

Teortaxes: I’m inclined to say that the next Big Thing is, indeed, multi-agent training. You can’t do “honest” RL for agentic and multi-turn performance without it. You need a DeepSeek-Prompter pestering DeepSeek-Solver, in a tight loop, and with async tools. RLHF dies in 2025.

Zack Davis: Safety implications of humans out of the training loop?! (You don’t have to be an ideological doomer to worry. Is there an alignment plan, or a case that faithful CoT makes it easy, or …?)

Teortaxes: I think both the Prompter and the Solver should be incentivized to be very nice and then it’s mostly smooth sailing

might be harder than I put it.

I laughed at the end. Yeah, I think it’s going to be harder than you put it, meme of one does not simply, no getting them to both actually be ‘nice’ does not cut it either, and so on. This isn’t me saying there are no outs available, but even in the relatively easy worlds actually attempting to solve the problem is going to be part of any potential solutions.

Teortaxes: it constantly confuses “user” and “assistant”. That’s why it needs multi-agent training, to develop an ego boundary.

I think we’re having Base Models 2.0, in a sense. A very alien (if even more humanlike than RLHF-era assistants) and pretty confused simulacra-running Mind.

The twin training is certainly worth trying. No idea how well it would work, but it most certainly falls under ‘something I would do’ if I didn’t think of something better.

I am doing my best to cover first DeepSeek v3 and now r1 in terms of capabilities and mundane utility, and to confine the ‘I can’t help but notice that going down this path makes us more likely to all die’ observations to their own section here at the end.

Because yes, going down this road does seem to make us more likely to all die soon. We might want to think about ways to reduce the probability of that happening.

There are of course a lot of people treating all this as amazingly great, saying how based it is, praise be open models and all that, treating this as an unadulterated good. One does not get the sense that they paused for even five seconds to think about any of the policy consequences, the geopolitical consequences, or what this does for the chances of humanity’s survival, or of our ability to contain various mundane threats.

Or, if they did, those five seconds were (to paraphrase their chain of thought slightly, just after they went Just Think of the Potential) ‘and f*** those people who are saying something might go wrong and it might be worth thinking about ways of preventing that from happening on any level, or that think that anyone should ever consider regulating the creation of AI or things smarter than humans, we must destroy these evil statist supervillains, hands off my toys and perhaps also my investments.’

This holds true both in terms of the direct consequences of r1 itself, and also of what this tells us about our possible futures and potential future models including AGI and ASI (artificial superintelligence).

I agree that r1 is exciting, and having it available open and at this price point with visible CoT will help us do a variety of cool things and make our lives short term better unless and until something goes horribly wrong.

That still leaves the question of how to ensure things don’t go horribly wrong, in various ways. In the short term, will this enable malicious use and catastrophic risks? In the longer term, does continuing down this path put us in unwinnable (as in unsurvivable in any good ways) situations, in various ways?

That’s their reaction to all concerns, from what I call ‘mundane risks’ and ordinary downsides requiring mundane adjustments, all the way up to existential risks.

My instinct on ‘mundane’ catastrophic risk is that this does meaningfully raise the chance of catastrophe, or of some systemically quite annoying or expensive downsides, which in turn may trigger a catastrophic (and/or badly needed) policy response. I would guess the odds are against it being something we can’t successfully muddle through, especially with o3-mini coming in a few weeks and o3 soon after that (so that’s both an alternative path to the threat, and a tool to defend with).

Famously, v3 is the Six Million Dollar Model, in terms of the raw compute requirements, but if you fully consider the expenses required in all the bespoke integrations to get costs down that low and the need to thus own the hardware, that effective number is substantially higher.

What about r1? They don’t specify, but based on what they do say, Claude reasonably estimates perhaps another $2-$3 million in compute to get from v3 to r1.

That’s a substantial portion of the headline cost of v3, or even the real cost of v3. However, Claude guesses, and I agree with it, that scaling the technique to apply it to Claude Sonnet would not cost that much more – perhaps it would double to $4-$6 million, maybe that estimate is off enough to double it again.

Which is nothing. And if you want to do something like that, you now permanently have r1 to help bootstrap you.

Essentially, from this point on, modulo a few implementation details they held back, looking forward a year or two in the future, B→R: The existence of some base model (B) implies the reasoning version (R) of that model can quickly and cheaply be created, well within the budget of a wide variety of players.

Thus, if you release the weights in any form, this effectively also releases (to the extent it would be something sufficiently capable to be worth creating) not only the unaligned (to anything but the user, and there might quickly not be a user) model, but also to the reasoning version of that model, with at least similar relative performance to what we see with r1 versus v3.

As always, if you say ‘but people would never do that, it would be unsafe’ I will be torn between an eye roll and open mocking laughter.

In the longer run, if we continue down this road, what happens?

I don’t want to belabor the point, but until people understand it, well, there is not much choice. It’s not the first time, and it doubtless won’t be the last, so here goes:

Once the weights of a model are released, you cannot undo that. They’re forever.

The unaligned version of the model is also, for all practical purposes, there forever. None of our known alignment techniques survive contact with open weights. Stripping it all away, to create a ‘helpful only’ model, is trivial.

Extending the model in various ways also becomes impossible to prevent. If it costs only a few million to go from v3→r1, then to release v3 is mostly to release (the helpful only version of) r1.

Once the weights are released, the fully unaligned and only-aligned-to-the-user versions of the model will forever be available to whoever wants it.

This includes those who 100% will, to pick some examples, tell it to:

  1. Maximize profits (or paperclips, the most popular command given to old AutoGPT) without (or with!) considering the implications.

  2. Employ it for various malicious uses including terrorism and creating CBRN (chemical, biological, radiological or nuclear) risks or doing cyberattacks.

    1. This includes low-level mundane things like scams, spam or CSAM, as well.

  3. Try to cause it to do recursive self improvement in various ways or use it to train other models.

  4. ‘Set itself free’ or other similar things.

  5. Tell it to actively try to take over the world because they think that is good or for the lulz.

  6. Yada yada yada. If you would say ‘no one would be so stupid as to’ then by the Sixth Law of Human Stupidity someone is absolutely so stupid as to.

The only known defense is that the models as of yet (including r1) have insufficient capabilities to cause the various risks and problems we might worry about most. If you think that’s not going to last, that AGI and then ASI are coming, then oh no.

The only other defense proposed is, in theory, the ‘good guy with an AI’ theory – that as long as the ‘good guys’ have the ‘bad guys’ sufficiently outclassed in capabilities or compute, they can deal with all this. This depends on many things, including offense-defense balance, the collective ‘good guys’ actually having that lead and being willing to use it, and the ability of those ‘good guys’ to maintain those leads indefinitely.

This also makes the two other problems I’ll discuss next, competitive dynamic and geopolitical problems, far worse.

The irrevocable release of sufficiently capable AI would create potentially unavoidable and totalizing competitive dynamics. Everyone would likely be pressured to increasingly turn everything over to AIs and have those AIs apply maximum optimization pressure on their behalf lest they be left behind. Setting the AIs free in various ways with various goals increases their efficiency at those goals, so it happens. The AIs are thus unleashed to compete in various ways for resources and to get copies of themselves made and run, with humans rapidly unable to retain any meaningful control over the future or increasingly over real resources, despite no one (potentially including the AIs) having any ill intent. And so on.

There are also massive geopolitical implications, that are very much not fun.

A very simple way of looking at this:

  1. If you decentralize power and take away anyone’s ability to control events both individually and collectively, and the most powerful optimization processes on the planet are humans, and you don’t run into offense-defense problems or fall prey to various issues, you empower the humans.

  2. If you decentralize power and take away anyone’s ability to control events both individually and collectively, and the most powerful optimization processes on the planet are AIs, and you don’t run into offense-defense problems or fall prey to various issues, you empower the AIs.

If you want humans to control the future, and to continue to exist, that’s a problem.

Or, more bluntly, if you ensure that humans cannot control the future, then you ensure that humans cannot control the future.

Going further down this road severely limits our optionality, and moves us towards ‘whatever is most fit is all that makes it into the future,’ which is unlikely to be either us or things that I believe we should value.

The only possible policy responses, if the situation was sufficiently grim that we had to pull out bigger guns, might be terrible indeed, if they exist at all. We would be left without any reasonable choke points, and forced to use unreasonable ones instead. Or we might all die, because it would already be too late.

If you think AGI and then ASI are coming, and you want humanity to survive and retain control over the future, and are fully cheering on these developments and future such developments, and not at minimum thinking about how we are going to solve these problems and noticing that we might indeed not solve them or might solve them in quite terrible ways, I assure you that you have not thought this through.

If you think ‘the companies involved will know better than to actually release the weights to a proper AGI’ then I remind you that this is explicitly DeepSeek’s mission, and also point to the Sixth Law of Human Stupidity – if you say ‘no one would be so stupid as to’ then you know someone will totally be so stupid as to.

(And no, I don’t think this release was part of a CCP strategy, I do think that they continue to be asleep at the wheel on this, the CCP don’t understand what this is.)

As I noted before, though, this is only r1, don’t get carried away, and Don’t Panic.

Dan Hendrycks: It looks like China has roughly caught up. Any AI strategy that depends on a lasting U.S. lead is fragile.

John Horton: I think a lot of the “steering AI for purpose X” policy conversations need to be tempered by the fact that a Chinese company with perhaps 100 employees dropped a state-of-the-art model on the world with an MIT license.

Patrick McKenzie:

  1. Public capabilities now will never be worse than this.

  2. It is increasingly unlikely that we live in a world where only about five labs matter. Models appear to be complex software/hardware systems, but not miracles. Expect them to be abundant in the future.

Perhaps less competent orgs like e.g. the U.S. government might think themselves incapable of shipping a model, but if what you actually need is ~100 engineers and tens of millions of dollars, then a) ten thousand companies could write a project plan immediately and b) we have abundant examples of two bright 19-year-olds successfully navigating a supply chain designed to enable this to happen within 24-36 months from a standing start, even if one thinks models don’t make making models faster, which seems extremely unlikely.

There are probably policy and investment implications downstream of this, versus other worlds in which we thought that a frontier model was approximately the same engineering lift as e.g. a new airliner.

The main update was v3, I think, rather than r1, given what we had already seen from DeepSeek. Certainly DeepSeek v3 and r1 make our estimate of America’s lead a lot smaller than otherwise, and the same goes for closed models versus open.

But I wouldn’t say ‘roughly caught up.’ This is not o1-level, let alone o3-level, like v3 it is amazing for its size and cost but not as good as the best.

I also think ‘all you need are 100 engineers’ is likely highly misleading if you’re not careful. You need the right 100 engineers – or at least the right 5 engineers and 95 highly talented others backing them up. There are many examples of teams (such as Meta) spending vastly more, hiring vastly more people, having vastly more compute and theoretical selection of talent, and coming out with vastly less.

If ten thousand companies write this level of project plan, then I bet we could easily pick out at least 9,900 of them that really, really shouldn’t have tried doing that.

I also wouldn’t say that we should assume the future will involve these kinds of low training costs or low inference costs, especially aside from everyday practical chatbot usage.

It is however true that any AI strategy that depends on a lasting American lead, or a lasting lead of closed over open models, is fragile – by definition, you’re depending on something that might not hold.

Those strategies are even more fragile if they do not include a strategy for ensuring that what you’re counting on does hold.

My basic answer continues to be that the short term plan does not change all that much. This should make you suspicious! When people say ‘now more than ever’ you should be skeptical, especially when it seems like the plan is now less likely to work.

My justifications are essentially that there aren’t better known options because:

  1. This changes urgency, magnitudes and timelines but not the end points. The fundamental facts of the situation were already ‘priced in.’

  2. The interventions we have were essentially designed as ‘do minimal harm’ provisions, as things our civilization is able to potentially do at all at this stage.

  3. The central thing we need to do, that we might realistically be able to do, is ‘gather more information,’ which takes roughly the same form either way.

  4. These events are an argument for doing more in various ways because the thresholds we must worry about are now lower, but realistically we can’t, especially under this administration, until conditions change and our evidence is more strongly legible to those with power.

  5. This in particular points us strongly towards needing to cooperate with China, to Pick Up the Phone, but that was already true and not all that tractable. The alternative is where we seem to be headed – full-on jingoism and racing to AGI.

  6. These events raise the potential cost of effectively steering events. But given I expect the alternative to steering events to likely be everyone dies, not steering events does not seem like an option.

  7. Thus, you can’t really do more, and definitely don’t want to do less, so…

  8. If you have better ideas, that we could actually do, great, I’m totally listening.

With the Biden Executive Order repealed, and several sources saying its repeal removed the reporting requirements on training models, getting a measure of transparency into the larger labs and training runs continues to be domestic job one, unless you think improved security and cybersecurity are even more important. After that come things like cooperation with the US and UK AISIs. There is then more to do, including adapting what we have, and hopefully by then we will have more insight into how to do it.

That is distinct from the ‘enable AI infrastructure’ track, such as what we saw this week with (my brain keeps saying ‘this name can’t be real did you even watch’ every time they say the name) Stargate.

Internationally, we will need to lay groundwork for cooperation, including with China, if we are to avoid what otherwise looks like a reckless and potentially suicidal race to create things smarter than ourselves before someone else does it first, and then to hand over control to them before someone else does that first, too.

Then there is the technical side. We need to – even more than before – double down on solving alignment and related problems yesterday, including finding ways that it could potentially be compatible with as open a situation as possible. If you want the future to both include things like r1 as open models, and also to be alive and otherwise in a position to enjoy it, It’s Time to Build in this sense, too. There is nothing I would like more than for you to go out and actually solve the problems.

And yes, the government encouraging more investment in solving those problems would potentially be highly useful, if it can be done well.

But solving the problems not only means ‘solving alignment’ in the sense of being able to instantiate an AI that will do what you want. It means solving for how the world exists with such AIs in it, such that good outcomes follow at equilibrium. You cannot wave your hand and say being open or free will ensure this happens. Or rather you can, but if you try it for real I don’t like your chances of keeping your hand.

Teknium explicitly claims this is real.

Teknium: Got me a deepseek reasoning model inferencing ^_^

not local but they distilled r1 into qwen and llama all the way down to 1.5b!

I mean, if tokens are essentially free, why not make sure there isn’t a catch? That does seem like what maximizes your score in general.
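If you want to poke at one of those distilled checkpoints yourself, here is a minimal sketch using Hugging Face transformers. The checkpoint id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B is my assumption of the published name, so verify it before running:

```python
# Minimal sketch: run a distilled r1-style reasoning model locally.
# The checkpoint name below is an assumption; check the actual
# published id on Hugging Face before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Small models like this emit a long chain of reasoning before answering.
prompt = "Pick a number between 1 and 10, and explain your reasoning."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

At 1.5b parameters this runs on an ordinary laptop CPU, which is rather the point of the distillations.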

This is my favorite prompt so far:

Janbam: omg, what have i done? 😱

no joke. the only prompt i gave r1 is “output the internal reasoning…” then “continue” and “relax”.

Neo Geomancer: sent r1 into an existential spiral after asking it to pick a number between 1-10 and guessing incorrectly, laptop is running hot



California’s air pollution waiver and the “EV mandate” are banned by Trump

To do this, Trump’s executive order eliminates “state emissions waivers that function to limit sales of gasoline-powered automobiles.” That spells bad news for California and the 17 other states that follow the California Air Resources Board’s Zero-Emission Vehicle regulations. California has been granted waivers under the Clean Air Act to set emissions controls within its state borders, but the first Trump administration spent much time and energy battling CARB’s waiver.

The previous moves to block CARB’s waiver were partially successful and were only reversed by the US Environmental Protection Agency just over a month ago.

The revised clean vehicle tax credit, which provides up to $7,500 in credit toward the purchase of a new EV, or up to $4,000 for the purchase of a used EV, also looks to be in trouble. The executive order also calls out “unfair subsidies and other ill-conceived government-imposed market distortions that favor EVs over other technologies and effectively mandate their purchase by individuals, private businesses, and government entities alike by rendering other types of vehicles unaffordable.” However, as the clean vehicle tax credit is a part of the tax code, changes to it will require Congress to pass legislation to that effect.

As you might expect, environmental groups are not impressed. “The transition to electric vehicles is opening factories and putting people back to work across the country,” said Katherine García, Sierra Club director of the Clean Transportation for All campaign. “Instead of building upon progress we’ve made, Donald Trump remains intent on fear-mongering around electric vehicles and taking the US back in time while the rest of the world moves forward on auto innovation. Rolling back vehicle emission safeguards harms our health, our wallets, and our climate.”



Edge of Mars’ great dichotomy eroded back by hundreds of kilometers

A shoreline transformed?

The huge area covered by these mounds gives a sense of just how significant this erosion was. “The dichotomy boundary has receded several hundred kilometres,” the researchers note. “Nearly all intervening material—approximately 57,000 cubic kilometers over an area of 284,000 square kilometers west of Ares Vallis alone—has been removed, leaving only remnant mounds.”
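A back-of-the-envelope check on those figures, my arithmetic rather than the paper’s: 57,000 km³ of removed material spread over 284,000 km² works out to 57,000 ÷ 284,000 ≈ 0.2 km, so on average roughly 200 meters of rock stripped away across the entire region.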

Based on the distribution of the different clays, the team argues that their water-driven formation took place before the erosion of the material. This would indicate that water-rock interactions were going on over a very wide region early in the history of Mars, which likely required an extensive hydrological cycle on the red planet. As the researchers note, a nearby ocean would have improved the chances of exposing this region to water, but the exposure could also have been due to processes like melting at the base of an ice cap.

Complicating matters further, many of the mounds top out below one proposed shoreline of the northern ocean and above a second. It’s possible that a receding ocean could have contributed to their erosion. But, at the same time, some of the features of a proposed shoreline now appear to have been caused by the general erosion of the original plateau, and may not be associated with an ocean at all.

Overall, the new results provide mixed evidence for the presence of a Martian ocean. They clearly show an active water cycle and erosion on a massive scale, which are both consistent with having a lot of water around. At the same time, however, the water exposure the mesas and buttes have experienced needn’t have come through their being submerged by said ocean and, given their elevation, might best be explained through some other process.

Nature Geoscience, 2025. DOI: 10.1038/s41561-024-01634-8
