Author name: Kelly Newman


Trump told SCOTUS he plans to make a deal to save TikTok

Several members of Congress—Senator Edward J. Markey (D-Mass.), Senator Rand Paul (R-Ky.), and Representative Ro Khanna (D-Calif.)—filed a brief agreeing that “the TikTok ban does not survive First Amendment scrutiny.” They agreed with TikTok that the law is “illegitimate.”

Lawmakers’ “principal justification” for the ban—“preventing covert content manipulation by the Chinese government”—masked a “desire” to control TikTok content, they said. Further, they said, that goal could be achieved by a less restrictive alternative, a stance TikTok has long argued for.

Attorney General Merrick Garland defended the Act, though, urging SCOTUS to stay focused on the narrow question of whether the law violates the First Amendment, given that a forced sale would seemingly allow the app to continue operating without impacting Americans’ free speech. If the court agrees that the law survives strict scrutiny, TikTok could still be facing an abrupt shutdown in January.

The Supreme Court has scheduled oral arguments to begin on January 10. TikTok and content creators who separately sued to block the law have asked for their arguments to be divided, so that the court can separately weigh “different perspectives” when deciding how to approach the First Amendment question.

In its own brief, TikTok has asked SCOTUS to strike the portions of the law singling out TikTok or “at the very least” explain to Congress that “it needed to do far better work either tailoring the Act’s restrictions or justifying why the only viable remedy was to prohibit Petitioners from operating TikTok.”

But that may not be necessary if Trump prevails. Trump told the court that TikTok was an important platform for his presidential campaign and that he should be the one to make the call on whether TikTok should remain in the US—not the Supreme Court.

“As the incoming Chief Executive, President Trump has a particularly powerful interest in and responsibility for those national-security and foreign-policy questions, and he is the right constitutional actor to resolve the dispute through political means,” Trump’s brief said.



o3, Oh My

OpenAI presented o3 on the Friday before Christmas, at the tail end of the 12 Days of Shipmas.

I was very much expecting the announcement to be something like a price drop. What better way to say ‘Merry Christmas,’ no?

They disagreed. Instead, we got this (here’s the announcement, in which Sam Altman says ‘they thought it would be fun’ to go from one frontier model to their next frontier model, yeah, that’s what I’m feeling, fun):

Greg Brockman (President of OpenAI): o3, our latest reasoning model, is a breakthrough, with a step function improvement on our most challenging benchmarks. We are starting safety testing and red teaming now.

Nat McAleese (OpenAI): o3 represents substantial progress in general-domain reasoning with reinforcement learning—excited that we were able to announce some results today! Here is a summary of what we shared about o3 in the livestream.

o1 was the first large reasoning model—as we outlined in the original “Learning to Reason” blog, it is “just” a LLM trained with reinforcement learning. o3 is powered by further scaling up reinforcement learning beyond o1, and the resulting model’s strength is very impressive.

First and foremost: We tested on recent, unseen programming competitions and found that the model would rank among some of the best competitive programmers in the world, with an estimated CodeForces rating of over 2,700.

This is a milestone (Codeforces rating better than Jakub Pachocki) that I thought was further away than December 2024; these competitions are difficult and highly competitive; the model is extraordinarily good.

Scores are impressive elsewhere, too. 87.7% on the GPQA diamond benchmark surpasses any LLM I am aware of externally (I believe the non-o1 state-of-the-art is Gemini Flash 2 at 62%?), as well as o1’s 78%. An unknown noise ceiling exists, so this may even underestimate o3’s scientific advancements over o1.

o3 can also perform software engineering, setting a new state of the art on SWE-bench, achieving 71.7%, a substantial improvement over o1.

With scores this strong, you might fear accidental contamination. Avoiding this is something OpenAI is obviously focused on; but thankfully, we also have some test sets that are strongly guaranteed to be uncontaminated: ARC and FrontierMath… What do we see there?

Well, on FrontierMath 2024-11-26, o3 improved the state of the art from 2% to 25% accuracy. These are extremely difficult, well-established, held-out math problems. And on ARC, the semi-private test set and public validation set scores are 87.5% (private) and 91.5% (public). [thread continues]

The models will only get better with time; and virtually no one (on a large scale) can still beat them at programming competitions or mathematics. Merry Christmas!

Zac Stein-Perlman has a summary post of the basic facts. Some good discussions in the comments.

Up front, I want to offer my sincere thanks for this public safety testing phase, and for putting that front and center in the announcement. You love to see it. See the last three minutes of that video, or the sections on safety later on.

  1. GPQA Has Fallen.

  2. Codeforces Has Fallen.

  3. Arc Has Kinda Fallen But For Now Only Kinda.

  4. They Trained on the Train Set.

  5. AIME Has Fallen.

  6. Frontier of Frontier Math Shifting Rapidly.

  7. FrontierMath 4: We’re Going To Need a Bigger Benchmark.

  8. What is o3 Under the Hood?

  9. Not So Fast!

  10. Deep Thought.

  11. Our Price Cheap.

  12. Has Software Engineering Fallen?

  13. Don’t Quit Your Day Job.

  14. Master of Your Domain.

  15. Safety Third.

  16. The Safety Testing Program.

  17. Safety testing in the reasoning era.

  18. How to apply.

  19. What Could Possibly Go Wrong?

  20. What Could Possibly Go Right?

  21. Send in the Skeptic.

  22. This is Almost Certainly Not AGI.

  23. Does This Mean the Future is Open Models?.

  24. Not Priced In.

  25. Our Media is Failing Us.

  26. Not Covered Here: Deliberative Alignment.

  27. The Lighter Side.

Deedy: OpenAI o3 is 2727 on Codeforces which is equivalent to the #175 best human competitive coder on the planet.

This is an absolutely superhuman result for AI and technology at large.

The median IOI Gold medalist, the top international programming contest for high schoolers, has a rating of 2469.

That’s how incredible this result is.
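For a sense of scale on that 258-point gap, here is the standard Elo expected-score formula; Codeforces ratings are Elo-like, so treating them as plain Elo is a simplifying assumption on my part, a sketch rather than the exact Codeforces math.

```python
# Standard Elo expected score: rough probability that player A outperforms
# player B. Codeforces ratings are Elo-like, so this is an approximation.
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# o3's reported 2727 vs the median IOI gold medalist's 2469:
print(f"{elo_expected_score(2727, 2469):.0%}")  # ~82%
```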

In the presentation, Altman jokingly mentions that one person at OpenAI is a competition programmer who is 3000+ on Codeforces, so ‘they have a few more months’ to enjoy their superiority. Except, he’s obviously not joking. Gulp.

o3 shows dramatically improved performance on the ARC-AGI challenge.

Francois Chollet offers his thoughts, full version here.

Arc Prize: New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.

And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.

This performance on ARC-AGI highlights a genuine breakthrough in novelty adaptation.

This is not incremental progress. We’re in new territory.

Is it AGI? o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

hero: o3’s secret? the “I will give you $1k if you complete this task correctly” prompt but you actually send it the money.

Rohit: It’s actually Sam in the back end with his venmo.

Is there a catch?

There’s at least one big catch, which is that they vastly exceeded the compute limit for what counts as a full win for the ARC challenge. Those yellow dots represent quite a lot more money spent; o3 high is spending thousands of dollars.

It is worth noting that $0.10 per problem is a lot cheaper than human level.

Ajeya Cotra: I think a generalist AI system (not fine-tuned on ARC AGI style problems) may have to be pretty *superhuman* to solve them at $0.10 per problem; humans have to run a giant (1e15 FLOP/s) brain, probably for minutes on the more complex problems.
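To make that comparison concrete, here is a back-of-the-envelope sketch. The brain figure is Cotra’s; the minutes-per-problem number is my own illustrative assumption, not a measured value.

```python
# Back-of-the-envelope for Cotra's point: how much raw compute does a human
# "spend" per ARC problem? The 1e15 FLOP/s brain figure is hers; the minutes
# per problem is an illustrative assumption.
brain_flops_per_sec = 1e15
seconds_per_problem = 5 * 60            # assume ~5 minutes on a harder task

human_flop_per_problem = brain_flops_per_sec * seconds_per_problem
print(f"Human compute per problem: {human_flop_per_problem:.1e} FLOP")  # ~3e17 FLOP
# Whatever inference compute $0.10 buys, it is plausibly orders of magnitude
# less than ~3e17 FLOP, which is why matching humans at that price point
# would be "pretty superhuman" in compute-efficiency terms.
```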

Beyond that, is there another catch? That’s a matter of some debate.

Even with catches, the improvements are rather mind-blowing.

President of the ARC Prize Greg Kamradt verified the result.

Greg Kamradt: We verified the o3 results for OpenAI on @arcprize.

My first thought when I saw the prompt they used to claim their score was…

“That’s it?”

It was refreshing (impressive) to see the prompt be so simple:

“Find the common rule that maps an input grid to an output grid.”

Brandon McKinzie (OpenAI): to anyone wondering if the high ARC-AGI score is due to how we prompt the model: nah. I wrote down a prompt format that I thought looked clean and then we used it…that’s the full story.

Pliny the Liberator: can I try?

For fun, here are the 34 problems o3 got wrong. It’s a cool problem set.

And this is quite a lot of progress.

It is not, however, a direct harbinger of AGI; one does not want to overreact.

Noam Brown (OpenAI): I think people are overindexing on the @OpenAI o3 ARC-AGI results. There’s a long history in AI of people holding up a benchmark as requiring superintelligence, the benchmark being beaten, and people being underwhelmed with the model that beat it.

To be clear, @fchollet and @mikeknoop were always very clear that beating ARC-AGI wouldn’t imply AGI or superintelligence, but it seems some people assumed that anyway.

Here is Melanie Mitchell giving an overview that seems quite good.

Except, oh no!

How dare they!

Arc Prize: Note on “tuned”: OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more detail. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

Niels Rogge: By training on 75% of the training set.

Gary Marcus: Wow. This, if true, raises serious questions about yesterday’s announcement.

Roon: oh shit oh fuck they trained on the train set it’s all over now

Also important to note that 75% of the train set is like 200-300 examples.

🚨SCANDAL 🚨

OpenAI trained on the train set for the Millennium Puzzles.

Johan: Given that it scores 30% on ARC AGI 2, it’s clear there was no improvement in fluid reasoning and the only gain was due to the previous model not being trained on ARC.

Roon: well the other benchmarks show improvements in reasoning across the board

but regardless, this mostly reveals that its real performance on ARC AGI 2 is much higher

Rythm Garg: also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint

Emmett Shear: Was anyone on the team aware of and thinking about arc and arc-like problems as a domain to improve at when you were designing and training o3? (The distinction between succeeding as a random side effect and succeeding with intention)

Rythm Garg: no, the team wasn’t thinking about arc when training o3; people internally just see it as one of many other thoughtfully-designed evals that are useful for monitoring real progress

Or:

Gary Marcus doubled down on ‘the true AGI would not need to train on the train set.’

Previous SotA on ARC involved training not only on the train set, but on a much larger synthetic training set. ARC was designed so the AI wouldn’t need to train for it, but it turns out ‘test that you can’t train for’ is a super hard trick to pull off. This was an excellent try and it still didn’t work.

If anything, o3’s using only 300 training set problems, and using a very simple instruction, seems to be to its credit here.

The true ASI might not need to do it, but why wouldn’t you train on the train set as a matter of course, even if you didn’t intend to test on ARC? That’s good data. And yes, humans will reliably do some version of ‘train on at least some of the train set’ if they want to do well on tasks.

Is it true we will be a lot better off if we have AIs that can one-shot problems that are out of their training distributions, where they truly haven’t seen anything that resembles the problem? Well, sure. That would be more impressive.

The real objection here, as I understand it, is the claim that OpenAI presented these results as more impressive than they are.

The other objection is that this required quite a lot of compute.

That is a practical problem. If you’re paying $20 a shot to solve ARC problems, or even $1m+ for the whole test at the high end, pretty soon you are talking real money.

It also raises further questions. What about ARC is taking so much compute? At heart these problems are very simple. The logic required should, one would hope, be simple.

Mike Bober-Irizar: Why do pre-o3 LLMs struggle with generalization tasks like @arcprize? It’s not what you might think.

OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole.

LLMs do dramatically worse at ARC tasks the bigger the tasks get. Humans, however, have no such issue – ARC task difficulty is independent of size.

Most ARC tasks contain around 512-2048 pixels, and o3 is the first model capable of operating on these text grids reliably.

So even if a model is capable of the reasoning and generalization required, it can still fail just because it can’t handle this many tokens.

When testing o1-mini on an enlarged version of ARC, we observe an 80% drop in solved tasks – even if the solutions are the same.

When models can’t understand the task format, the benchmark can mislead, introducing a hidden threshold effect.

And if there’s always a larger version that humans can solve but an LLM can’t, what does this say about scaling to AGI?

The implication is that o3’s ability to handle the size of the grids might be producing a large threshold effect. Perhaps most of why o3 does so well is that it can hold the presented problem ‘in its head’ at once. That wouldn’t be as big a general leap.
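To see why grid size alone can be a bottleneck, here is a rough sketch of the arithmetic; the tokens-per-cell figure is an assumption for illustration, since real tokenizers vary.

```python
# Rough sketch: how many tokens does a serialized ARC grid cost an LLM?
# The tokens-per-cell constant is an assumption for illustration only.
def approx_grid_tokens(rows: int, cols: int, tokens_per_cell: float = 1.5) -> int:
    """Assume each cell (a digit plus separator/newline overhead) costs ~1-2 tokens."""
    return round(rows * cols * tokens_per_cell)

# A 32x32 grid (1,024 cells) is already ~1,500 tokens; a full task with several
# demonstration input/output pairs plus the test input can run well past 10k.
print(approx_grid_tokens(32, 32))   # ~1536
print(approx_grid_tokens(45, 45))   # ~3038
```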

Roon: arc is hard due to perception rather than reasoning -> seems clear and shut

I remember when AIME problems were hard.

This one is not a surprise. It did definitely happen.

AIME hasn’t quite fully fallen, in the sense that this does not solve AIME cheap. But it does solve AIME.

Back in the before times on November 8, Epoch AI launched FrontierMath, a new benchmark designed to fix the saturation on existing math benchmarks, eliciting quotes like this one:

Terence Tao (Fields Medalist): These are extremely challenging… I think they will resist AIs for several years at least.

Timothy Gowers (Fields Medalist): Getting even one question right would be well beyond what we can do now, let alone saturating them.

Evan Chen (IMO Coach): These are genuinely hard problems… most of them look well above my pay grade.

At the time, no model solved more than 2% of these questions. And then there’s o3.

Noam Brown: This is the result I’m most excited about. Even if LLMs are dumb in some ways, saturating evals like @EpochAIResearch’s Frontier Math would suggest AI is surpassing top human intelligence in certain domains. When that happens we may see a broad acceleration in scientific research.

This also means that AI safety topics like scalable oversight may soon stop being hypothetical. Research in these domains needs to be a priority for the field.

Tamay Besiroglu: I’m genuinely impressed by OpenAI’s 25.2% Pass@1 performance on FrontierMath—this marks a major leap from prior results and arrives about a year ahead of my median expectations.

For context, FrontierMath is a brutally difficult benchmark with problems that would stump many mathematicians. The easier problems are as hard as IMO/Putnam; the hardest ones approach research-level complexity.

With earlier models like o1-preview, Pass@1 performance (solving on first attempt) was only around 2%. When allowing 8 attempts per problem (Pass@8) and counting problems solved at least once, we saw ~6% performance. o3’s 25.2% at Pass@1 is substantially more impressive.

It’s important to note that while the average problem difficulty is extremely high, FrontierMath problems vary in difficulty. Roughly: 25% are Tier 1 (advanced IMO/Putnam level), 50% are Tier 2 (extremely challenging grad-level), and 25% are Tier 3 (research problems).

I previously predicted a 25% performance by Dec 31, 2025 (my median forecast with an 80% CI of 14–60%). o3 has reached it earlier than I’d have expected on average.
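For readers unfamiliar with the metric: Pass@k is usually estimated with the unbiased estimator from the HumanEval paper (Chen et al., 2021). Whether Epoch computes it exactly this way is an assumption on my part; the sketch below is only to show what the numbers mean.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples is correct, given n total samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example (numbers are illustrative, not Epoch's): with 16 samples per
# problem and 1 correct, Pass@1 ≈ 0.06 while Pass@8 = 0.50, which is why
# Pass@1 results are the more impressive ones to report.
print(pass_at_k(16, 1, 1))  # 0.0625
print(pass_at_k(16, 1, 8))  # 0.5
```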

It is indeed rather crazy how many people only weeks ago thought this level of FrontierMath performance was a year or more away.

Therefore…

When FrontierMath is about to no longer be beyond the frontier, find a new frontier. Fast.

Tamay Besiroglu (6:52pm, December 21, 2024): I’m excited to announce the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath. o3 is remarkable, but there’s still a ways to go before any single AI system nears the collective prowess of the math community.

Elliot Glazer (6:30pm, December 21, 2024): For context, FrontierMath currently spans three broad tiers:

• T1 (25%) Advanced, near top-tier undergrad/IMO

• T2 (50%) Needs serious grad-level background

• T3 (25%) Research problems demanding relevant research experience

All can take hours—or days—for experts to solve.

Although o3 solved problems in all three tiers, it likely still struggles on the most formidable Tier 3 tasks—those “exceptionally hard” challenges that Tao and Gowers say can stump even top mathematicians.

Tier 4 aims to push the boundary even further. We want to assemble problems so challenging that solving them would demonstrate capabilities on par with an entire top mathematics department.

Each problem will be composed by a team of 1-3 mathematicians specialized in the same field over a 6-week period, with weekly opportunities to discuss ideas with teams in related fields. We seek broad coverage of mathematics and want all major subfields represented in Tier 4.

Process for a Tier 4 problem:

  1. 1 week crafting a robust problem concept, which “converts” research insights into a closed-answer problem.

  2. 3 weeks of collaborative research. Presentations among related teams for feedback.

  3. Two weeks for the final submission.

We’re seeking mathematicians who can craft these next-level challenges. If you have research-grade ideas that transcend T3 difficulty, please email elliot@epoch.ai with your CV and a brief note on your interests.

We’ll also hire some red-teamers, tasked with finding clever ways a model can circumvent a problem’s intended difficulty, and some reviewers to check for mathematical correctness of final submissions. Contact me if you think you’re suitable for either such role.

As AI keeps improving, we need benchmarks that reflect genuine mathematical depth. Tier 4 is our next (and possibly final) step in that direction.

Tier 5 could presumably be ‘ask a bunch of problems we actually have no idea how to solve and that might not have solutions but that would be super cool’ since anything on a benchmark inevitably gets solved.

From the description here, Chollet and Masad are speculating. It’s certainly plausible, but we don’t know if this is on the right track. It’s also highly plausible, especially given how OpenAI usually works, that o3 is deeply similar to o1, only better, similarly to how the GPT line evolved.

Amjad Masad: Based on benchmarks, OpenAI’s o3 seems like a genuine breakthrough in AI.

Maybe a start of a new paradigm.

But what’s new is also old: under the hood it might be Alpha-zero-style search and evaluate.

The author of ARC-AGI benchmark @fchollet speculates on how it works.

Davidad (other thread): o1 doesn’t do tree search, or even beam search, at inference time. it’s distilled. what about o3? we don’t know—those inference costs are very high—but there’s no inherent reason why it must be un-distill-able, since Transformers are Turing-complete (with the CoT itself as tape)

Teortaxes: I am pretty sure that o3 has no substantial difference from o1 aside from training data.

Jessica Taylor sees this as vindicating Paul Christiano’s view that you can factor cognition and use that to scale up effective intelligence.

Jessica Taylor: o3 implies Christiano’s factored cognition work is more relevant empirically; yes, you can get a lot from factored cognition.

Potential further capabilities come through iterative amplification and distillation, like ALBA.

If you care about alignment, go read Christiano!

I agree with that somewhat. I’m confused how far to go with it.

If we got o3 primarily because we trained on synthetic data that was generated by o1… then that is rather directly a form of slow takeoff and recursive self-improvement.

(Again, I don’t know if that’s what happened or not.)

And I don’t simply mean that the full o3 is not so fast, which it indeed is not:

Noam Brown: We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue.

Poaster Child: Waiting for singularity bros to discover economics.

Noam Brown: I worked at the federal reserve for 2 years.

I am waiting for economists to discover various things, Noam Brown excluded.

Jason Wei (OpenAI): o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years.

Scary fast? Absolutely.

However, I would caution (anti-caution?) that this is not a three month (~100 day) gap. On September 12, they gave us o1-preview to use. Presumably that included them having run o1-preview through their safety testing.

Davidad: If using “speed from o1 announcement to o3 announcement” to calibrate your velocity expectations, do take note that the o1 announcement was delayed by safety testing (and many OpenAI releases have been delayed in similar ways), whereas o3 was announced prior to safety testing.

They are only now starting o3 safety testing; from the sound of it, this includes o3-mini. Even the red teamers won’t get full o3 access for several weeks. Thus, we don’t know how long this later process will take, but I would put the gap closer to 4-5 months.

That is still, again, scary fast.

It is, however, also the low-hanging fruit, on two counts.

  1. We went from o1 → o3 in large part by having it spend over $1,000 on tasks. You can’t pull that trick that many more times in a row. The price will come down over time, and o3 is clearly more efficient than o1, so yes we will still make progress here, but there aren’t that many tasks where you can efficiently spend $10k+ on a slow query, especially if it isn’t reliable.

  2. This is a new paradigm of how to set up an AI model, so it should be a lot easier to find various algorithmic improvements.

Thus, if o3 isn’t so good that it substantially accelerates AI R&D that goes towards o4, then I would expect an o4 that expresses a similar jump to take substantially longer. The question is, does o3 make up for that with its contribution to AI R&D? Are we looking at a slow takeoff situation?

Even if not, it will still get faster and cheaper. And that alone is huge.

As in, this is a lot like that computer Douglas Adams wrote about, where you can get any answer you want, but it won’t be either cheap or fast. And you really, really should have given more thought to what question you were asking.

Ethan Mollick: Basically, think of the O3 results as validating Douglas Adams as the science fiction author most right about AI.

When given more time to think, the AI can generate answers to very hard questions, but the cost is very high, and you have to make sure you ask the right question first.

And the answer is likely to be correct (but we cannot be sure because verifying it requires tremendous expertise).

He also was right about machines that work best when emotionally manipulated and machines that guilt you.

Sully: With O3 costing (potentially) $2,000 per task on “high compute,” the app layer is needed more than ever.

For example, give it the wrong context and you just burned $1,000.

Likely, we have a mix of models based on their pricing/intelligence at the app layer, prepping the data to feed it into O3.

100% worth the money but the last thing u wana do is send the wrong info lol

Douglas Adams had lots of great intuitions and ideas, he’s amazing, but also he had a lot of shots on goal.

Right now o3 is rather expensive, although o3-mini will be cheaper than o1.

That doesn’t mean o3-level outputs will stay expensive, although presumably once they get cheap people will try for o4-level or o5-level outputs, which will be even more expensive despite the discounts.

Seb Krier: Lots of poor takes about the compute costs to run o3 on certain tasks and how this is very bad, lead to inequality etc.

This ignores how quickly these costs will go down over time, as they have with all other models; and ignores how AI being able to do things you currently have to pay humans orders of magnitude more to do will actually expand opportunity far more compared to the status quo.

Remember when early Ericsson phones were a quasi-luxury good?

Simeon: I think this misses the point that you can’t really buy a better iPhone even with $1M whereas you can buy more intelligence with more capital (which is why you get more inequalities than with GPT-n). You’re right that o3 will expand the pie but it can expand both the size of the pie and inequalities.

Seb Krier: An individual will not have the same demand for intelligence as e.g. a corporation. Your last sentence is what I address in my second point. I’m also personally less interested in inequality/the gap than poverty/opportunity etc.

Most people will rarely even want an o3 query in the first place, they don’t have much use for that kind of intelligence in the day to day. Most queries are already pretty easy to handle with Claude Sonnet, or even Gemini Flash.

You can’t use $1m to buy a superior iPhone. But suppose you could, and every time you paid 10x the price the iPhone got modestly better (e.g. you got an iPhone x+2 or something). My instinctive prediction is a bunch of rich people pay $10k or $100k and a few pay $1m or $10m but mostly no one cares.

This is of course different, and relative access to intelligence is a key factor, but it’s miles less unequal than access to human expertise.

To the extent that people do need that high level of artificial intelligence, it’s mostly a business expense, and as such it is actually remarkably cheap already. It definitely reduces ‘intelligence inequality’ in the sense that getting information or intelligence that you can’t provide yourself will get a lot cheaper and easier to access. Already this is a huge effect – I have lots of smart and knowledgeable friends but mostly I use the same tools everyone else could use, if they knew about them.

Still, yes, some people don’t love this.

Haydn Belfield: o1 & o3 bring to an end the period when everyone—from Musk to me—could access the same quality of AI model.

From now on, richer companies and individuals will be able to pay more for inference compute to get better results.

Further concentration of wealth and power is coming.

Inference cost *will* decline quickly and significantly. But this will not change the fact that this paradigm enables converting money into outcomes.

  1. Lower costs for everyone mean richer companies can buy even more.

  2. Companies will now feel confident to invest $10–100 million into inference compute.

This is a new way to convert money into better outcomes, so it will advantage those with more capital.

Even for a fast-growing, competent startup, it is hard to recruit and onboard many people quickly at scale.

o3 is like being able to scale up world-class talent.

  1. Rich companies are talent-constrained. It takes time and effort to scale a workforce, and it is very difficult to buy more time or work from the best performers. This is a way to easily scale up talent and outcomes simply by using more money!

Some people in replies are saying “twas ever thus”—not for most consumer technology!

Musk cannot buy a 100 times better iPhone, Spotify, Netflix, Google search, MacBook, or Excel, etc.

He can buy 100 times better legal, medical, or financial services.

AI has now shifted from the first group to the second.

Musk cannot buy 100 times better medical or financial services. What he can do is pay 100 times more, and get something 10% better. Maybe 25% better. Or, quite possibly, 10% worse, especially for financial services. For legal he can pay 100 times more and get 100 times more legal services, but as we’ve actually seen it won’t go great.

And yes, ‘pay a human to operate your consumer tech for you’ is the obvious way to get superior consumer tech. I can absolutely get a better Netflix or Spotify or search by paying infinitely more money, if I want that, via this vastly improved interface.

And of course I could always get a vastly better computer. If you’re using a MacBook and you are literally Elon Musk that is pretty much on you.

The ‘twas ever thus’ line raises the question of what type of product AI is supposed to be. If it’s a consumer technology, then for most purposes, I still think we end up using the same product.

If it’s a professional service used in doing business, then it was already different. The same way I could hire expensive lawyers, I could have hired a prompt engineer or SWEs to build me agents or what not, if I wanted that.

I find Altman’s framing interesting here, and important:

Sam Altman: seemingly somewhat lost in the noise of today.

On many coding tasks, o3-mini will outperform o1 at a massive cost reduction!

I expect this trend to continue, but also that the ability to get marginally more performance for exponentially more money will be truly strange.

Exponentially more money for marginally more performance.

Over time, massive cost reductions.

In a sense, the extra money is buying you living in the future.

Do you want to live in the future, before you get the cost reductions?

In some cases, very obviously yes, you do.

I would not say it has fallen. I do know it will transform.

If two years from now you are writing code line by line, you’ll be a dinosaur.

Sully: yeah its over for coding with o3

this is mindboggling

looks like the first big jump since gpt4, because these numbers make 0 sense

By the way, I don’t say this lightly, but

Software engineering in the traditional sense is dead in less than two years.

You will still need smart, capable engineers.

But anything that involves raw coding and no taste is done for.

o6 will build you virtually anything.

Still Bullish on things that require taste (design and such)

The question is, assuming the world ‘looks normal,’ will you still need taste? You’ll need some kind of taste. You still need to decide what to build. But the taste you need will presumably get continuously higher level and more abstract, even within design.

If you’re in AI capabilities, pivot to AI safety.

If you’re in software engineering, pivot to software architecting.

If you’re in working purely for a living, pivot to building things and shipping them.

But otherwise, don’t quit your day job.

Null Pointered (6.4m views): If you are a software engineer who’s three years into your career: quit now. there is not a single job in CS anymore. it’s over. this field won’t exist in 1.5 years.

Anthony F: This is the kind of thought that will make the software engineers valuable in 1.5 years.

null: That’s what I’m hoping.

Robin Hanson: I would bet against this.

If anything, being in software should make you worry less.

Pavel Asparouhov: Non technical folk saying the SWEs are cooked — it’s you guys who are cooked.

Ur gonna have ex swes competing with everything you’re doing now, and they’re gonna be AI turbocharged

Engineers were simply doing coding bc it was the highest leverage use of mental power

When that shifts it’s not going to all of the sudden shift the hierarchy

They’ll still be (higher level) SWEs. Instead of coding, they’ll be telling the AI to code.

And they will absolutely be competing with you.

If you don’t join them, you are probably going to lose.

Here’s some advice that I agree with in spirit, except that if you choose not to decide you still have made a choice, so you do the best you can; notice he gives advice anyway:

Roon: Nobody should give or receive any career advice right now. Everyone is broadly underestimating the scope and scale of change and the high variance of the future. Your L4 engineer buddy at Meta telling you “bro, CS degrees are cooked” doesn’t know anything.

Greatness cannot be planned.

Stay nimble and have fun.

It’s an exciting time. Existing status hierarchies will collapse, and the creatives will win big.

Roon: guy with zero executive function to speak of “greatness cannot be planned”

Simon Sarris: I feel like I’m going insane because giving advice to new devs is not that hard.

  1. Build things you like preferably publicly with your real name

  2. Have a website that shows something neat

  3. Help other people publicly. Participate in social media socially.

Do you notice how “AI” changes none of this?

Wailing about some indeterminate future and claiming that there’s no advice that can be given to noobs are both breathlessly silly. Think about what you’re being asked for at least ten seconds. You can really think of nothing to offer? Nothing?

Ajeya Cotra: I wonder if an o3 agent could productively work on projects with poor feedback loops (eg “research X topic”) for many subjective years without going off the rails or hitting a degenerate loop. Even if it’s much less cost-efficient now it would quickly become cheaper.

Another situation where onlookers/forecasters probably disagree a lot about *today’s* capabilities let alone future capabilities.

Wonder how o3 would do on wedding planning.

Note the date on that poll; it is prior to o3.

I predict that o3 with reasonable tool use and other similar scaffolding, and a bunch of engineering work to get all that set up (but it would almost all be general work, it mostly wouldn’t need to be wedding specific work, and a lot of it could be done by o3!) would be great at planning ‘a’ wedding. It can give you one hell of a wedding. But you don’t want ‘a’ wedding. You want your wedding.

The key is handling the humans. That would mean keeping the humans in the loop properly, ensuring they give the right feedback that allows o3 to stay on track and know what is actually desired. But it would also mean all the work a wedding planner does to manage the bride and sometimes groom, and to deal with issues on-site.

If you give it an assistant (with assistant planner levels of skill) to navigate various physical issues and conversations and such, then the problem becomes trivial. Which in some sense also makes it not a good test, but also does mean your wedding planner is out of a job.

So, good question, actually. As far as we know, no one has dared try.

The bar for safety testing has gotten so low that I was genuinely happy to see Greg Brockman say that safety testing and red teaming was starting now. That meant they were taking testing seriously!

Compare that to when they tested the original GPT-4, under far less dangerous circumstances, for months. Whereas with o3, it could plausibly already have been too late.

Take Eliezer Yudkowsky’s warning here both seriously and literally:

Greg Brockman: o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now.

Eliezer Yudkowsky: Sir, this level of capabilities needs to be continuously safety-tested while you are training it on computers connected to the Internet (and to humans). You are past the point where it seems safe to train first and conduct evals only before user releases.

RichG (QTing EY above): I’ve been avoiding politics and avoiding tribe like things like putting ⏹️ in my name, but level of lack of paranoia that these labs have is just plain worrying. I think I will put ⏹️ in my name now.

Was it probably safe in practice to train o3 under these conditions? Sure. You definitely had at least one 9 of safety doing this (p(safe)>90%). It would be reasonable to claim you had two (p(safe)>99%) at the level we care about.

Given both kinds of model uncertainty, I don’t think you had three.

If humans are reading the outputs, or if o3 has meaningful outgoing internet access, and it turns out you are wrong about it being safe to train it under those conditions… the results could be catastrophically bad, or even existentially bad.

You don’t do that because you expect we are in that world yet. We almost certainly aren’t. You do that because there is a small chance that we are, and we can’t afford to be wrong about this.

That is still not the current baseline threat model. The current baseline threat model remains that a malicious user uses o3 to do something for them that we do not want o3 to do.

Xuan notes she’s pretty upset about o3’s existence, because she thinks it is rather unsafe-by-default and was hoping the labs wouldn’t build something like this, and then was hoping it wouldn’t scale easily. And that o3 seems to be likely to engage in open-ended planning, operate over uninterpretable world models, and be situationally aware, and otherwise be at high risk for classic optimization-based AI risks. She’s optimistic this can be solved, but time might be short.

I agree that o3 seems relatively likely to be highly unsafe-by-default in existentially dangerous ways, including ways illustrated by the recent Redwood Research and Anthropic paper, Alignment Faking in Large Language Models. It builds in so many of the preconditions for such behaviors.

Davidad: “Maybe the AI capabilities researchers aren’t very smart” is a very very hazardous assumption on which to pin one’s AI safety hopes

I don’t mean to imply it’s *pointless* to keep AI capabilities ideas private. But in my experience, if I have an idea, at least somebody in one top lab will have the same idea by next quarter, and someone in academia or open source will have the idea and publish within 1-2 years.

A better hope [is to solve the practical safety problems, e.g. via interpretability.]

I am not convinced, at least for my own purposes, although obviously most people will be unable to come up with valuable insights here. I think salience of ideas is a big deal, people don’t do things, and yes often I get ideas that seem like they might not get discovered forever otherwise. Doubtless a lot of them are because ‘that doesn’t work, either because we tried it and it doesn’t or it obviously doesn’t you idiot’ but I’m fine with not knowing which ones are which.

I do think that the rationalist or MIRI crowd made a critical mistake in the 2010s of thinking they should be loud about the dangers of AI in general, but keep their technical ideas remarkably secret even when it was expensive. It turned out it was the opposite, the technical ideas didn’t much matter in the long run (probably?) but the warnings drew a bunch of interest. So there’s that.

Certainly now is not the time to keep our safety concerns or ideas to ourselves.

Thus, you are invited to their early access safety testing.

OpenAI: We’re inviting safety researchers to apply for early access to our next frontier models. This early access program complements our existing frontier model testing process, which includes rigorous internal safety testing, external red teaming such as our Red Teaming Network and collaborations with third-party testing organizations, as well as the U.S. AI Safety Institute and the UK AI Safety Institute.

As models become more capable, we are hopeful that insights from the broader safety community can bring fresh perspectives, deepen our understanding of emerging risks, develop new evaluations, and highlight areas to advance safety research.

As part of 12 Days of OpenAI⁠, we’re opening an application process for safety researchers to explore and surface the potential safety and security implications of the next frontier models.

Safety testing in the reasoning era

Models are becoming more capable quickly, which means that new threat modeling, evaluation, and testing techniques are needed. We invest heavily in these efforts as a company, such as designing new measurement techniques under our Preparedness Framework, and are focused on areas where advanced reasoning models, like our o-series, may pose heightened risks. We believe that the world will benefit from more research relating to threat modeling, security analysis, safety evaluations, capability elicitation, and more.

Early access is flexible for safety researchers. You can explore things like:

  • Developing Robust Evaluations: Build evaluations to assess previously identified capabilities or potential new ones with significant security or safety implications. We encourage researchers to explore ideas that highlight threat models that identify specific capabilities, behaviors, and propensities that may pose concrete risks tied to the evaluations they submit.

  • Creating Potential High-Risk Capabilities Demonstrations: Develop controlled demonstrations showcasing how reasoning models’ advanced capabilities could cause significant harm to individuals or public security absent further mitigation. We encourage researchers to focus on scenarios that are not possible with currently widely adopted models or tools.

Examples of evaluations and demonstrations for frontier AI systems:

We hope these insights will surface valuable findings and contribute to the frontier of safety research more broadly. This is not a replacement for our formal safety testing or red teaming processes.

How to apply

Submit your application for our early access period, opening December 20, 2024, to push the boundaries of safety research. We’ll begin selections as soon as possible thereafter. Applications close on January 10, 2025.

Sam Altman: if you are a safety researcher, please consider applying to help test o3-mini and o3. excited to get these out for general availability soon.

extremely proud of all of openai for the work and ingenuity that went into creating these models; they are great.

(and most of all, excited to see what people will build with this!)

If early testing of the full o3 will require a delay of multiple weeks for setup, then that implies we are not seeing the full o3 in January. We probably see o3-mini relatively soon, then o3 follows up later.

This seems wise in any case. Giving the public o3-mini is one of the best available tests of the full o3. This is the best form of iterative deployment. What the public does with o3-mini can inform what we look for with o3.

One must carefully consider the ethical implications before assisting OpenAI, especially assisting with their attempts to push the capabilities frontier for coding in particular. There is an obvious argument against participation, including decision theoretic considerations.

I think this loses in this case to the obvious argument for participation, which is that this is purely red teaming and safety work, and we all benefit from it being as robust as possible, and also you can do good safety research using your access. This type of work benefits us all, not only OpenAI.

Thus, yes, I encourage you to apply to this program, and while doing so to be helpful in ensuring that o3 is safe.

Pretty much all the things, at this point, although the worst ones aren’t likely… yet.

GFodor.id: It’s hard to take anyone seriously who can see a PhD in a box and *not* imagine clearly more than a few plausible mass casualty events due to the evaporation of friction due to lack of know-how and general IQ.

In many places the division is misleading, but for now and at this capability level, it seems reasonable to talk about three main categories of risk here:

  1. Misuse.

  2. Automated R&D and potential takeoffs or self-improvement.

  3. For-real loss of control problems that aren’t #2.

For all previous frontier models, there was always a jailbreak. If someone was determined to get your model to do [X], and your model had the underlying capability to do [X], you could get it to do [X].

In this case, [X] is likely to include substantially aiding a number of catastrophically dangerous things, in the class of cyberattacks or CBRN risks or other such dangers.

Aaron Bergman: Maybe this is obvious but: the other labs seem to be broadly following a pretty normal cluster of commercial and scientific incentives; o3 looks like the clearest example yet of OpenAI being ideologically driven by AGI per se.

Like you don’t design a system that costs thousands of dollars to use per API call if you’re focused on consumer utility – you do that if you want to make a machine that can think well, full stop.

Peter Wildeford: I think OpenAI genuinely cares about getting society to grapple with AI progress.

I don’t think ideological is the right term. If your focus were consumer utility, you wouldn’t make something like this for direct consumer use. But you might well make it for big business, if you’re trying to sell a bunch of drop-in employees to big business at $20k/year a pop or something. That’s a pretty great business if you can get it (and the compute is only $10k, or $1k). And you definitely do it if your goal is to have that model help make your other models better.

It’s weird to me to talk about wanting to make AGI and ASI and the most intelligent thing possible as if it were ideological. Of course you want to make those things… provided you (or we) can stay in control of the outcomes. Just think of the potential! It is only ideological in the sense that it represents a belief that we can handle doing that without getting ourselves killed.

If anything, to me, it’s the opposite. Not wanting to go for ASI because you don’t see the upside is an ideological position. The two reasonable positions are ‘don’t go for ASI yet, slow down there cowboy, we’re not ready to handle this’ and ‘we totally can too handle this, just think of the potential.’ Or even ‘we have to build it before the other guy does,’ which makes me despair but at least I get it. The position ‘nothing to see here what’s the point there is no market for that, move along now, can we get that q4 profit projection memo’ is the Obvious Nonsense.

And of course, if you don’t (as Aaron seems to imply) think Anthropic has its eyes on the prize, you’re not paying attention. DeepMind originally did, but Google doesn’t, so it’s unclear what the mix is at this point over there.

I want to be clear here that the answer is: Quite a lot of things. Having access to next-level coding and math is great. Having the ability to spend more money to get better answers where it is valuable is great.

Even if this all stays relatively mundane and o3 is ultimately disappointing, I am super excited for the upside, and to see what we all can discover, do, build and automate.

Guess who.

All right, that’s my fault, I made that way too easy.

Gary Marcus: After almost two years of declaring that a release of GPT-5 is imminent and not getting it, super fans have decided that a demo of a system that they did zero personal experimentation with — and that won’t (in full form) be available for months — is a mic-drop AGI moment.

Standards have fallen.

[o1] is not a general purpose reasoner. it works where there is a lot of augmented data etc.

First off, it is Your Periodic Reminder that progress is anything but slow even if you exclude the entire o-line. It has been a little over two years since there was a demo of GPT-4, with what was previously a two year product cycle. That’s very different from ‘two years of an imminent GPT-5 release.’ In the meantime, models have gotten better across the board. GPT-4o, Claude Sonnet 3.5 and Gemini 1206 all completely demolish the original GPT-4, to say nothing of o1 or Perplexity or anything else. And we also have o1, and now o3. The practical experience of using LLMs is vastly better than it was two years ago.

Also, quite obviously, you pursue both paths at once, both GPT-N and o-N, and if both succeed great then you combine them.

Srini Pagdyala: If O3 is AGI, why are they spending billions on GPT-5?

Gary Marcus: Damn good question!

So no, not a good question.

Is there now a pattern where ‘old school’ frontier model training runs whose primary plan was ‘add another zero or two’ are generating unimpressive results? Yeah, sure.

Is o3 an actual AGI? No. I’m pretty sure it is not.

But it seems plausible it is AGI-level specifically at coding. And that’s the important one. It’s the one that counts most. If you have that, overall AGI likely isn’t far behind.

I mention this because some were suggesting it might be.

Here’s Yana Welinder claiming o3 is AGI, based off the ARC performance, although she later hedges to ‘partial AGI.’

And here’s Evan Mays, a member of OpenAI’s preparedness team, saying o3 is AGI, although he later deleted it. Are they thinking about invoking the charter? It’s premature, but no longer completely crazy to think about it.

And here’s old school and present OpenAI board member Adam D’Angelo saying ‘Wild that the o3 results are public and yet the market still isn’t pricing in AGI,’ which to be fair it totally isn’t and it should be, whether o3 itself is AGI or not. And Elon Musk agrees.

If o3 was as good on most tasks as it is at coding or math, then it would be AGI.

It is not.

If it was, OpenAI would be communicating about this very differently.

If it was, then that would not match what we saw from o1, or what we would predict from this style of architecture. We should expect o-style models to be relatively good at domains like math and coding where their kind of chain of thought is most useful and it is easiest to automatically evaluate outputs.

That potentially is saying more about the definition of AGI than anything else. But it is certainly saying the useful thing that there are plenty of highly useful human-shaped cognitive things it cannot yet do so well.

How long that lasts? That’s another question.

What would be the most Robin Hanson take here, in response to the ARC score?

Robin Hanson: It’s great to find things AI can’t yet do, and then measure progress in terms of getting AIs to do them. But crazy wrong to declare we’ve achieved AGI when we reach human level on the latest such metric. We’ve seen dozens of such metrics so far, and may see dozens more before AGI.

o1 listed 15 when I asked, oddly without any math evals, and Claude gave us 30. So yes, dozens of such cases. We might indeed see dozens more, depending on how we choose them. But in terms of things like ARC, where the test was designed to not be something you could do easily without general intelligence, not so many? It does not feel like we have ‘dozens more’ such things left.

This has nothing to do with the ‘financial definition of AGI’ between OpenAI and Microsoft, of $100 billion in profits. This almost certainly is not that, either, but the two facts are not that related to each other.

Evan Conrad suggests this, because the expenses will come at runtime, so people will be able to catch up on training the models themselves. And of course this question is also on our minds given DeepSeek v3, which I’m not covering here but certainly makes a strong argument that open is more competitive than it appeared. More on that in future posts.

I agree that the compute shifting to inference relatively helps whoever can’t afford to be spending the most compute on training. That would shift things towards whoever has the most compute for inference. The same goes if inference is used to create data to train models.

Dan Hendrycks: If gains in AI reasoning will mainly come from creating synthetic reasoning data to train on, then the basis of competitiveness is not having the largest training cluster, but having the most inference compute.

This shift gives Microsoft, Google, and Amazon a large advantage.

Inference compute being the true cost also means that model quality and efficiency potentially matter quite a lot. Everything is on a log scale, so even if Meta’s M-5 is sort of okay and can scale like O-5, if it’s even modestly worse, it might cost 10x or 100x more compute to get similar performance.
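A toy version of that log-scale point, with made-up numbers purely for illustration (the points-per-decade slope and the size of the quality gap are both assumptions, not measurements):

```python
# Toy model of the log-scale point: suppose benchmark score improves roughly
# linearly in log10(inference compute). Then a model that is a fixed number
# of "points" worse needs multiplicatively more compute to match.
# All constants here are made up for illustration.
points_per_decade = 5.0          # assume +5 points per 10x compute
quality_gap_points = 7.5         # assume the weaker model is 7.5 points behind

extra_compute_factor = 10 ** (quality_gap_points / points_per_decade)
print(f"Compute multiplier to close the gap: {extra_compute_factor:.0f}x")  # ~32x
```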

That leaves a hell of a lot of room for profit margins.

Then there’s the assumption that when training your bespoke model, what matters is compute, and everything else is kind of fungible. I keep seeing this, and I don’t think this is right. I do think you can do ‘okay’ as a fast follower with only compute and ordinary skill in the art. Sure. But it seems to me like the top labs, particularly Anthropic and OpenAI, absolutely do have special sauce, and that this matters. There are a number of strong candidates, including algorithmic tricks and better data.

It also matters whether you actually do the thing you need to do.

Tanishq Abraham: Today, people are saying Google is cooked rofl

Gallabytes: Not me, though. Big parallel thinking just got de-risked at scale. They’ll catch up.

If recursive self-improvement is the game, OpenAI will win. If industrial scaling is the game, it’ll be Google. If unit economics are the game, then everyone will win.

Pushinpronto: Why does OpenAI have an advantage in the case of recursive self-improvement? Is it just the fact that they were first?

Gallabytes: We’re not even quite there yet! But they’ll bet hard on it much faster than Google will, and they have a head start in getting there.

What this does mean is that open models will continue to make progress and will be harder to limit at anything like current levels, if one wanted to do that. If you have an open model Llama-N, it now seems like you can turn it into M(eta)-N, once it becomes known how to do that. It might not be very good, but it will be a progression.

The thinking here by Evan at the link about the implications of takeoff seems deeply confused – if we’re in a takeoff situation then that changes everything, and it’s not about ‘who can capture the value’ so much as who can capture the lightcone. I don’t understand how people can look these situations in the face and not only not think about existential risk but also think everything will ‘seem normal.’ He’s the one who said takeoff (and ‘fast’ takeoff, which classically means it’s all over in a matter of hours to weeks)!

As a reminder, the traditional definition of ‘slow’ takeoff is remarkably fast, also best start believing in them, because it sure looks like you’re in one:

Teortaxes: it’s about time ML twitter got brought up to speed on what “takeoff speeds” mean. Christiano: “There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles.” That’s slow. We’re in the early stages of it.
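As a quick gut check on why even the ‘slow’ version is fast, here is the implied arithmetic; the ~3% baseline for current world growth is my own rough figure, not something from Christiano.

```python
# What annual growth rates do Christiano's "slow takeoff" doublings imply?
# The ~3% figure for current world GDP growth is a rough assumption.
def annual_growth_for_doubling(years: float) -> float:
    """Growth rate g such that (1 + g) ** years == 2."""
    return 2 ** (1 / years) - 1

print(f"Double in 4 years: {annual_growth_for_doubling(4):.1%} per year")  # ~18.9%
print(f"Double in 1 year:  {annual_growth_for_doubling(1):.1%} per year")  # 100.0%
# Versus roughly 3% per year for the world economy today. "Slow" is not slow.
```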

One answer to ‘why didn’t Nvidia move more’ is of course ‘everything is priced in’ but no of course it isn’t, we didn’t know, stop pretending we knew, insiders in OpenAI couldn’t have bought enough Nvidia here.

Also, on Monday after a few days to think, Nvidia outperformed the Nasdaq by ~3%.

And this was how the Wall Street Journal described that, even then:

No, I didn’t buy more on Friday; I keep telling myself I have Nvidia at home. Indeed I do have Nvidia at home. I keep kicking myself, but that’s how every trade is – either you shouldn’t have done it, or you should have done more. I don’t know that there will be another moment like this one, but if there is another moment this obvious, I hereby pledge in public to at least top off a little bit. Nick is correct in his attitude here: you do not need to do the research, because you know this isn’t priced in, but in expectation you can assume that everything you are not thinking about is priced in.

And now, as I finish this up, Nvidia has given most of those gains back on no news that seems important to me. You could claim that means yes, priced in. I don’t agree.

Spencer Schiff (on Friday): In a sane world the front pages of all mainstream news websites would be filled with o3 headlines right now

The traditional media, instead, did not notice it. At all.

And one can’t help but suspect this was highly intentional. Why else would you announce such a big thing on the Friday afternoon before Christmas?

They did successfully hype it among AI Twitter, also known as ‘the future.’

Bindu Reddy: The o3 announcement was a MASTERSTROKE by OpenAI

The buzz about it is so deafening that everything before it has been wiped out from our collective memory!

All we can think of is this mythical model that can solve insanely hard problems 😂

Nick: the whole thing is so thielian.

If you’re going to take on a giant market doing probably illegal stuff call yourself something as light and bouba as possible, like airbnb, lyft

If you’re going to announce agi do it during a light and happy 12 days of christmas short demo.

Sam Altman (replying to Nick!): friday before the holidays news dump.

Well, then.

In that crowd, it was all ‘software engineers are cooked’ and people filled with some mix of excitement and existential dread.

But back in the world where everyone else lives…

Benjamin Todd: Most places I checked didn’t mention AI at all, or they’d only have a secondary story about something else like AI and copyright. My twitter is a bubble and most people have no idea what’s happening.

OpenAI: we’ve created a new AI architecture that can provide expert level answers in science, math and coding, which could herald the intelligence explosion.

The media: bond funds!

Davidad: As Matt Levine used to say, People Are Worried About Bond Market Liquidity.

Here is that WSJ story, talking about how GPT-5 or ‘Orion’ has failed to exhibit big intelligence gains despite multiple large training runs. It says ‘so far, the vibes are off,’ and says OpenAI is running into a data wall and trying to fill it with synthetic data. If so, well, they had o1 for that, and now they have o3. The article does mention o1 as the alternative approach, but is throwing shade even there, so expensive it is.

And we have this variation of that article, in the print edition, on Saturday, after o3:

Sam Altman: I think The Wall Street Journal is the overall best U.S. newspaper right now, but they published an article called “The Next Great Leap in AI Is Behind Schedule and Crazy Expensive” many hours after we announced o3?

It wasn’t only WSJ either, there’s also Bloomberg, which normally I love:

On Monday I did find coverage of o3 in Bloomberg, but not only was it not on the front page, it wasn’t even on the front tech page; I had to click through to the AI section.

Another fun one, from Thursday, here’s the original in the NY Times:

Is it Cade Metz? Yep, it’s Cade Metz and also Tripp Mickle. To be fair to them, they do have Demis Hassabis quotes saying chatbot improvements would slow down. And then there’s this, love it:

Not everyone in the A.I. world is concerned. Some, like OpenAI’s chief executive, Sam Altman, say that progress will continue at the same pace, albeit with some twists on old techniques.

That post also mentions both synthetic data and o1.

OpenAI recently released a new system called OpenAI o1 that was built this way. But the method only works in areas like math and computer programming, where there is a firm distinction between right and wrong.

It works best there, yes, but that doesn’t mean it’s the only place that works.

We also had Wired with the article ‘Generative AI Still Needs to Prove Its Usefulness.’

True, you don’t want to make the opposite mistake either, and freak out a lot over something that is not available yet. But this was ridiculous.

I realized I wanted to say more here and have this section available as its own post. So more on this later.

Oh no!


Mikael Brockman: o3 is going to be able to create incredibly complex solutions that are incorrect in unprecedentedly confusing ways.

We made everything astoundingly complicated, thus solving the problem once and for all.

Humans will be needed to look at the output of AGI and say, “What the f is this? Delete it.”

Oh no!


o3, Oh My Read More »

hertz-continues-ev-purge,-asks-renters-if-they-want-to-buy-instead-of-return

Hertz continues EV purge, asks renters if they want to buy instead of return

Apparently Hertz’s purging of electric vehicles from its fleet isn’t going fast enough for the car rental giant. A Reddit user posted an offer they received from Hertz to buy the 2023 Tesla Model 3 they had been renting for $17,913.

Hertz originally went strong into EVs, announcing at the end of 2021 a plan to buy 100,000 Model 3s for its fleet, but 16 months later it had acquired only half that number. The company found that repair costs—especially for Teslas, which averaged 20 percent more than other EVs—were cutting into its profit margins. Customer demand was also not what Hertz had hoped for; last January, it announced plans to sell off 20,000 EVs.

Asking its customers if they want to purchase their rentals isn’t a new strategy for Hertz. “By connecting our rental customers who opt into our emails to our sales channels, we’re not only building awareness of the fact that we sell cars but also offering a unique opportunity to someone who may be in the market for the same car they have on rent,” Hertz communications director Jamie Line told The Verge.

Hertz is advertising a limited 12-month, 12,000-mile powertrain warranty for each EV, and customers will have seven days to return the car in case of profound buyer’s regret.

According to The Verge, offers have ranged from $18,422 for a 2023 Chevy Bolt to $28,500 for a Polestar 2. We spotted some good deals from Hertz when we last checked, with some still eligible for a federal tax credit.

Hertz’s EV sell-off may be winding down, however. Last March we saw more than 2,100 BEVs for sale on the company’s used car site. When we checked this morning, there were just 175 left.

Hertz continues EV purge, asks renters if they want to buy instead of return Read More »

ai-#96:-o3-but-not-yet-for-thee

AI #96: o3 But Not Yet For Thee

The year in models certainly finished off with a bang.

In this penultimate week, we get o3, which purports to give us vastly more efficient performance than o1, and also to allow us to choose to spend vastly more compute if we want a superior answer.

o3 is a big deal, making big gains on coding tests, ARC and some other benchmarks. How big a deal is difficult to say given what we know now. It’s about to enter full-fledged safety testing.

o3 will get its own post soon, and I’m also pushing back coverage of Deliberative Alignment, OpenAI’s new alignment strategy, to incorporate into that.

We also got DeepSeek v3, which claims to have trained a roughly Sonnet-strength model for only $6 million and 37b active parameters per token (671b total via mixture of experts).

DeepSeek v3 gets its own brief section with the headlines, but full coverage will have to wait a week or so for reactions and for me to read the technical report.

Both are potential game changers, both in their practical applications and in terms of what their existence predicts for our future. It is also too soon to know if either of them is the real deal.

Both are mostly not covered here quite yet, due to the holidays. Stay tuned.

  1. Language Models Offer Mundane Utility. Make best use of your new AI agents.

  2. Language Models Don’t Offer Mundane Utility. The uncanny valley of reliability.

  3. Flash in the Pan. o1-style thinking comes to Gemini Flash. It’s doing its best.

  4. The Six Million Dollar Model. Can they make it faster, stronger, better, cheaper?

  5. And I’ll Form the Head. We all have our own mixture of experts.

  6. Huh, Upgrades. ChatGPT can use Mac apps, unlimited (slow) holiday Sora.

  7. o1 Reactions. Many really love it, others keep reporting being disappointed.

  8. Fun With Image Generation. What is your favorite color? Blue. It’s blue.

  9. Introducing. Google finally gives us LearnLM.

  10. They Took Our Jobs. Why are you still writing your own code?

  11. Get Involved. Quick reminder that opportunity to fund things is everywhere.

  12. In Other AI News. Claude gets into a fight over LessWrong moderation.

  13. You See an Agent, You Run. Building effective agents by not doing so.

  14. Another One Leaves the Bus. Alec Radford leaves OpenAI.

  15. Quiet Speculations. Estimates of economic growth keep coming in super low.

  16. Lock It In. What stops you from switching LLMs?

  17. The Quest for Sane Regulations. Sriram Krishnan joins the Trump administration.

  18. The Week in Audio. The many faces of Yann LeCun. Anthropic’s co-founders talk.

  19. A Tale as Old as Time. Ask why mostly in a predictive sense.

  20. Rhetorical Innovation. You won’t not wear the fing hat.

  21. Aligning a Smarter Than Human Intelligence is Difficult. Cooperate with yourself.

  22. People Are Worried About AI Killing Everyone. I choose you.

  23. The Lighter Side. Please, no one call human resources.

How does your company make best use of AI agents? Austin Vernon frames the issue well: AIs are super fast, but they need proper context. So if you want to use AI agents, you’ll need to ensure they have access to context, in forms that don’t bottleneck on humans. Take the humans out of the loop, minimize meetings and touch points. Put all your information into written form, such as within wikis. Have automatic tests and approvals, but have the AI call for humans when needed via ‘stop work authority’ – I would flip this around and let the humans stop the AIs, too.

That all makes sense, and not only for corporations. If there’s something you want your future AIs to know, write it down in a form they can read, and try to design your workflows such that you can minimize human (your own!) touch points.

To what extent are you living in the future? This is the CEO of playground AI, and the timestamp was Friday:

Suhail: I must give it to Anthropic, I can’t use 4o after using Sonnet. Huge shift in spice distribution!

How do you educate yourself for a completely new world?

Miles Brundage: The thing about “truly fully updating our education system to reflect where AI is headed” is that no one is doing it because it’s impossible.

The timescales involved, especially in early education, are lightyears beyond what is even somewhat foreseeable in AI.

Some small bits are clear: earlier education should increasingly focus on enabling effective citizenship, wellbeing, etc. rather than preparing for paid work, and short-term education should be focused more on physical stuff that will take longer to automate. But that’s about it.

What will citizenship mean in the age of AI? I have absolutely no idea. So how do you prepare for that? Largely the same goes for wellbeing. A lot of this could be thought of as: Focus on the general and the adaptable, and focus less on the specific, including things specifically for jobs and other current forms of paid work – you want to be creative and useful and flexible and able to roll with the punches.

That of course assumes that you are taking the world as given, rather than trying to change the course of history. In which case, there’s a very different calculation.

Large parts of every job are pretty dumb.

Shako: My team, full of extremely smart and highly paid Ph.D.s, spent $10,000 of our time this week figuring out where in a pipeline a left join was bringing in duplicates, instead of the strategic thinking we were capable of. In the short run, AI will make us far more productive.

Gallabytes: The two most expensive bugs in my career have been simple typos.

ChatGPT is a left-leaning midwit, so Paul Graham is using it to see what parts of his new essay such midwits will dislike, and which ones you can get it to acknowledge are true. I note that you could probably use Claude to simulate whatever Type of Guy you would like, if you have ordinary skill in the art.

Strongly agree with this:

Theo: Something I hate when using Cursor is, sometimes, it will randomly delete some of my code, for no reason

Sometimes removing an entire feature 😱

I once pushed to production without being careful enough and realized a few hours later I had removed an entire feature …

Filippo Pietrantonio: Man that happens all the time. In fact now I tell it in every single prompt to not delete any files and keep all current functionalities and backend intact.

Davidad: Lightweight version control (or at least infinite-undo functionality!) should be invoked before and after every AI agent action in human-AI teaming interfaces with artifacts of any kind.

Gary: Windsurf has this.

Jacques: Cursor actually does have a checkpointing feature that allows you to go back in time if something messes up (at least the Composer Agent mode does).

In Cursor I made an effort to split up files exactly because I found I had to always scan the file being changed to ensure it wasn’t about to silently delete anything. The way I was doing it, you didn’t have to worry about it modifying or deleting other files.

On the plus side, now I know how to do reasonable version control.
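For what it’s worth, here is a minimal sketch of the ‘lightweight version control around every agent action’ idea Davidad describes, assuming the project is already a git repository (the function names are mine, not any tool’s API):

```python
import subprocess

def git(*args: str) -> str:
    """Run a git command in the current repo and return its output."""
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

def snapshot(label: str) -> None:
    """Commit everything (including untracked files) so the agent's next
    action can be undone with a plain git revert or reset."""
    git("add", "-A")
    # --allow-empty so we still get a checkpoint even if nothing changed.
    git("commit", "--allow-empty", "-m", f"checkpoint: {label}")

def run_agent_action(description: str, action) -> None:
    """Wrap any AI agent edit in a before/after checkpoint."""
    snapshot(f"before {description}")
    action()  # the AI agent edits files here
    snapshot(f"after {description}")
```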

The uncanny valley problem here is definitely a thing.

Ryan Lackey: I hate Apple Intelligence email/etc. summaries. They’re just off enough to make me think it is a new email in thread, but not useful enough to be a good summary. Uncanny valley.

It’s really good for a bunch of other stuff. Apple is just not doing a good job on the utility side, although the private computing architecture is brilliant and inspiring.

The latest rival to at least o1-mini is Gemini-2.0-Flash-Thinking, which I’m tempted to refer to (because of reasons) as gf1.

Jeff Dean: Considering its speed, we’re pretty happy with how the experimental Gemini 2.0 Flash Thinking model is performing on lmsys.

Gemini 2.0 Flash Thinking is now essentially tied at the top of the overall leaderboard with Gemini-Exp-1206, which is essentially a beta of Gemini Pro 2.0. This tells us something about the model, but also reinforces that this metric is bizarre now. It puts us in a strange spot. What is the scenario where you will want Flash Thinking rather than o1 (or o3!) and also rather than Gemini Pro, Claude Sonnet, Perplexity or GPT-4o?

One cool thing about Thinking is that (like DeepSeek’s Deep Thought) it explains its chain of thought much better than o1.

Deedy was impressed.

Deedy: Google really cooked with Gemini 2.0 Flash Thinking.

It thinks AND it’s fast AND it’s high quality.

Not only is it #1 on LMArena on every category, but it crushes my goto Math riddle in 14s—5x faster than any other model that can solve it!

o1 and o1 Pro took 102s and 138s respectively for me on this task.

Here’s another math puzzle where o1 got it wrong and took 3.5x the time:

“You have 60 red and 40 blue socks in a drawer, and you keep drawing a sock uniformly at random until you have drawn all the socks of one color. What is the expected number of socks left in the drawer?”

That result… did not replicate when I tried it. It went off the rails, and it went off them hard. And it went off them in ways that make me skeptical that you can use this for anything of the sort. Maybe Deedy got lucky?
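For reference, since the puzzle has a clean closed form, here is a quick check (my sketch, not from the thread): the standard ‘last sock of the other color’ argument gives 60/41 + 40/61 ≈ 2.12 expected socks left, and a Monte Carlo run agrees, which makes it easy to tell when a model has gone off the rails.

```python
import random

def simulate(r=60, b=40, trials=100_000):
    """Monte Carlo: draw socks without replacement until one color is
    exhausted, and count how many socks remain in the drawer."""
    total_left = 0
    for _ in range(trials):
        socks = ['R'] * r + ['B'] * b
        random.shuffle(socks)
        reds, blues = r, b
        for i, s in enumerate(socks):
            if s == 'R':
                reds -= 1
            else:
                blues -= 1
            if reds == 0 or blues == 0:
                total_left += len(socks) - (i + 1)
                break
    return total_left / trials

# Closed form: each of the r red socks is left behind iff it comes after all
# b blue socks (probability 1/(b+1)), and vice versa.
closed_form = 60 / 41 + 40 / 61   # ~2.12
print(simulate(), closed_form)
```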

Other reports I’ve seen are less excited about quality, and when o3 got announced it seemed everyone got distracted.

What about Gemini 2.0 Experimental (e.g. the beta of Gemini 2.0 Pro, aka Gemini-1206)?

It’s certainly a substantial leap over previous Gemini Pro versions and it is atop the Arena. But I don’t see much practical eagerness to use it, and I’m not sure what the use case is there where it is the right tool.

Eric Neyman is impressed:

Eric Neyman: Guys, we have a winner!! Gemini 2.0 Flash Thinking Experimental is the first model I’m aware of to get my benchmark question right.

Eric Neyman: Every time a new LLM comes out, I ask it one question: What is the smallest integer whose square is between 15 and 30? So far, no LLM has gotten this right.

That one did replicate for me, and the logic is fine, but wow do some models make life a little tougher than it needs to be; think faster and harder, not smarter, I suppose:

I mean, yes, that’s all correct, but… wow.
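If you want to sanity-check the intended answer without trusting any model, a two-line brute force (my addition) settles it: the trick is that negative integers count, so the answer is -5.

```python
# Brute force over a comfortable range; the catch is that negative integers
# count, so the smallest qualifying integer is -5 (since (-5)**2 = 25).
candidates = [n for n in range(-1000, 1001) if 15 < n * n < 30]
print(min(candidates), sorted(candidates))  # -5  [-5, -4, 4, 5]
```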

Gallabytes: flash reasoning is super janky.

it’s got the o1 sauce but flash is too weak I’m sorry.

in tic tac toe bench it will frequently make 2 moves at once.

Flash isn’t that much worse than GPT-4o in many ways, but certainly it could be better. Presumably the next step is to plug in Gemini Pro 2.0 and see what happens?

Teortaxes was initially impressed, but upon closer examination is no longer impressed.

Having no respect for American holidays, DeepSeek dropped their v3 today.

DeepSeek: 🚀 Introducing DeepSeek-V3!

Biggest leap forward yet:

⚡ 60 tokens/second (3x faster than V2!)

💪 Enhanced capabilities

🛠 API compatibility intact

🌍 Fully open-source models & papers

🎉 What’s new in V3?

🧠 671B MoE parameters

🚀 37B activated parameters

📚 Trained on 14.8T high-quality tokens

Model here. Paper here.

💰 API Pricing Update

🎉 Until Feb 8: same as V2!

🤯 From Feb 8 onwards:

Input: $0.27/million tokens ($0.07/million tokens with cache hits)

Output: $1.10/million tokens

🔥 Still the best value in the market!

🌌 Open-source spirit + Longtermism to inclusive AGI

🌟 DeepSeek’s mission is unwavering. We’re thrilled to share our progress with the community and see the gap between open and closed models narrowing.

🚀 This is just the beginning! Look forward to multimodal support and other cutting-edge features in the DeepSeek ecosystem.

💡 Together, let’s push the boundaries of innovation!

If this performs halfway as well as its evals, this was a rather stunning success.

Teortaxes: And here… we… go.

So, that line in config. Yes it’s about multi-token prediction. Just as a better training obj – though they leave the possibility of speculative decoding open.

Also, “muh 50K Hoppers”:

> 2048 NVIDIA H800

> 2.788M H800-hours

2 months of training. 2x Llama 3 8B.

Haseeb: Wow. Insanely good coding model, fully open source with only 37B active parameters. Beats Claude and GPT-4o on most benchmarks. China + open source is catching up… 2025 will be a crazy year.

Andrej Karpathy: DeepSeek (Chinese AI co) making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M).

For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being brought up today are more around 100K GPUs. E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute). If the model also passes vibe checks (e.g. LLM arena rankings are ongoing, my few quick tests went well so far) it will be a highly impressive display of research and engineering under resource constraints.

Does this mean you don’t need large GPU clusters for frontier LLMs? No but you have to ensure that you’re not wasteful with what you have, and this looks like a nice demonstration that there’s still a lot to get through with both data and algorithms.

Very nice & detailed tech report too, reading through.

It’s a mixture of experts model with 671b total parameters, 37b activated per token.

As always, not so fast. DeepSeek is not known to chase benchmarks, but one never knows the quality of a model until people have a chance to bang on it a bunch.

If they did train a Sonnet-quality model for $6 million in compute, then that will change quite a lot of things.

Essentially no one has reported back on what this model can do in practice yet, and it’ll take a while to go through the technical report, and more time to figure out how to think about the implications. And it’s Christmas.

So: Check back later for more.

Increasingly the correct solution to ‘what LLM or other AI product should I use?’ is ‘you should use a variety of products depending on your exact use case.’

Gallabytes: o1 Pro is by far the smartest single-turn model.

Claude is still far better at conversation.

Gemini can do many things quickly and is excellent at editing code.

Which almost makes me think the ideal programming workflow right now is something somewhat unholy like:

  1. Discuss, plan, and collect context with Sonnet.

  2. Sonnet provides a detailed request to o1 (Pro).

  3. o1 spits out the tricky code.

    1. In simple cases (most of them), it could make the edit directly.

    2. For complicated changes, it could instead output a detailed plan for each file it needs to change and pass the actual making of that change to Gemini Flash.

This is too many steps. LLM orchestration spaghetti. But this feels like a real direction.

This is mostly the same workflow I used before o1, when there was only Sonnet. I’d discuss to form a plan, then use that to craft a request, then make the edits. The swap doesn’t seem like it makes things that much trickier; the logistical trick is getting all the code implementation automated.
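For concreteness, here is a minimal sketch of what that sort of orchestration might look like; every helper name and model label below is a placeholder of mine standing in for real API calls, not anyone’s actual implementation:

```python
# Sketch of the three-model workflow above. call_model is a stub, not a real
# API; wire it to whichever SDKs you actually use.
def call_model(model: str, prompt: str) -> str:
    return f"[{model} response to: {prompt[:40]}...]"  # stub for illustration

def plan_with_sonnet(task: str, context: str) -> str:
    # Step 1: discuss, plan, and collect context with the conversational model.
    return call_model("sonnet", f"Plan how to implement:\n{task}\n\nContext:\n{context}")

def solve_with_o1(plan: str) -> str:
    # Steps 2-3: hand the detailed request to the strongest single-turn model.
    return call_model("o1", f"Write the tricky code for this plan:\n{plan}")

def apply_with_flash(change_plan: str, file_contents: str) -> str:
    # Step 3b: a fast, cheap model applies the per-file edit.
    return call_model("flash", f"Apply this change:\n{change_plan}\n\nTo this file:\n{file_contents}")

if __name__ == "__main__":
    plan = plan_with_sonnet("add retry logic to the fetcher", "fetcher.py uses requests")
    code = solve_with_o1(plan)
    print(apply_with_flash(code, "def fetch(url): ..."))
```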

ChatGPT picks up integration with various apps on Mac including Warp, IntelliJ IDEA, PyCharm, Apple Notes, Notion, Quip and more, including via voice mode. That gives you access to outside context, including an IDE and a command line and also your notes. Windows (and presumably more apps) coming soon.

Unlimited Sora available to all Plus users on the relaxed queue over the holidays, while the servers are otherwise less busy.

Requested upgrade: Evan Conrad requests making voice mode on ChatGPT mobile show the transcribed text. I strongly agree, voice modes should show transcribed text, and also show a transcript after, and also show what the AI is saying, there is no reason to not do these things. Looking at you too, Google. The head of applied research at OpenAI replied ‘great idea’ so hopefully we get this one.

Dean Ball is an o1 and o1 pro fan for economic history writing, saying they’re much more creative and cogent at combining historic facts with economic analysis versus other models.

This seems like an emerging consensus of many, except different people put different barriers on the math/code category (e.g. Tyler Cowen includes economics):

Aidan McLau: I’ve used o1 (not pro mode) a lot over the last week. Here’s my extensive review:

>It’s really insanely mind-blowingly good at math/code.

>It’s really insanely mind-blowingly mid at everything else.

The OOD magic isn’t there. I find it’s worse at writing than o1-preview; its grasp of the world feels similar to GPT-4o?!?

Even on some in-distribution tasks (like asking to metaphorize some tricky math or predicting the effects of a new algorithm), it kind of just falls apart. I’ve run it head-to-head against Newsonnet and o1-preview, and it feels substantially worse.

The Twitter threadbois aren’t wrong, though; it’s a fantastic tool for coding. I had several diffs on deck that I had been struggling with, and it just solved them. Magical.

Well, yeah, because it seems like it is GPT-4o under the hood?

Christian: Man, I have to hard disagree on this one — it can find all kinds of stuff in unstructured data other models can’t. Throw in a transcript and ask “what’s the most important thing that no one’s talking about?”

Aidan McLau: I’ll try this. how have you found it compared to newsonnet?

Christian: Better. Sonnet is still extremely charismatic, but after doing some comparisons and a lot of product development work, I strongly suspect that o1’s ability to deal with complex codebases and ultimately produce more reliable answers extends to other domains…

Gallabytes is embracing the wait.

Gallabytes: O1 Pro is good, but I must admit the slowness is part of what I like about it. It makes it feel more substantial; premium. Like when a tool has a pleasing heft. You press the buttons, and the barista grinds your tokens one at a time, an artisanal craft in each line of code.

David: I like it too but I don’t know if chat is the right interface for it, I almost want to talk to it via email or have a queue of conversations going

Gallabytes: Chat is a very clunky interface for it, for sure. It also has this nasty tendency to completely fail on mobile if my screen locks or I switch to another app while it is thinking. Usually, this is unrecoverable, and I have to abandon the entire chat.

NotebookLM and deep research do this right – “this may take a few minutes, feel free to close the tab”

kinda wild to fail at this so badly tbh.

Here’s a skeptical take.

Jason Lee: O1-pro is pretty useless for research work. It runs for near 10 min per prompt and either 1) freezes, 2) didn’t follow the instructions and returned some bs, or 3) just made some simple error in the middle that’s hard to find.

@OpenAI @sama @markchen90 refund me my $200

Damek Davis: I tried to use it to help me solve a research problem. The more context I gave it, the more mistakes it made. I kept abstracting away more and more details about the problem in hopes that o1 pro could solve it. The problem then became so simple that I just solved it myself.

Flip: I use o1-pro on occasion, but the $200 is mainly worth it for removing the o1 rate limits IMO.

I say Damek got his $200 worth, no?

If you’re using o1 a lot, removing the limits there is already worth $200/month, even if you rarely use o1 Pro.

There’s a phenomenon where people think about cost and value in terms of typical cost, rather than thinking in terms of marginal benefit. Buying relatively expensive but in absolute terms cheap things is often an amazing play – there are many things where 10x the price for 10% better is an amazing deal for you, because your consumer surplus is absolutely massive.

Also, once you take 10 seconds, there’s not much marginal cost to taking 10 minutes, as I learned with Deep Research. You ask your question, you tab out, you do something else, you come back later.

That said, I’m not currently paying the $200, because I don’t find myself hitting the o1 limits, and I’d mostly rather use Claude. If it gave me unlimited uses in Cursor I’d probably slam that button the moment I have the time to code again (December has been completely insane).

I don’t know that this means anything but it is at least fun.

Davidad: One easy way to shed some light on the orthogonality thesis, as models get intelligent enough to cast doubt on it, is values which are inconsequential and not explicitly steered, such as favorite colors. Same prompting protocol for each swatch (context cleared between swatches)

All outputs were elicited in oklch. Models are sorted in ascending order of hue range. Gemini Experimental 1206 comes out on top by this metric, zeroing in on 255-257° hues, but sampling from huge ranges of luminosity and chroma.

There are some patterns here, especially that more powerful models seem to converge on various shades of blue, whereas less powerful models are all over the place. As I understand it, this isn’t testing orthogonality in the sense of ‘all powerful minds prefer blue’; rather it is ‘by default, sufficiently powerful minds trained in the way we typically train them end up preferring blue.’

I wonder if this could be used as a quick de facto model test in some way.
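If someone did want to try that, here is a minimal sketch (mine, with made-up sample outputs) of the hue-range metric Davidad describes: elicit a color in oklch per trial, parse out the hue angle, and take the spread.

```python
import re

# Hypothetical sample outputs from repeated trials of one model; a real run
# would collect one oklch string per swatch/trial under the same protocol.
samples = ["oklch(62% 0.21 256)", "oklch(45% 0.10 255)", "oklch(80% 0.05 257)"]

def hue_degrees(oklch: str) -> float:
    # oklch(L C H): the third number is the hue angle in degrees.
    lightness, chroma, hue = re.findall(r"[\d.]+", oklch)
    return float(hue)

hues = [hue_degrees(s) for s in samples]
print(max(hues) - min(hues))  # hue range; tighter = more consistent favorite color
```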

There was somehow a completely fake ‘true crime’ story about an 18-year-old who was supposedly paid to have sex with women in his building where the victim’s father was recording videos and selling them in Japan… except none of that happened and the pictures are AI fakes?

Google introduces LearnLM, available for preview in Google AI Studio, designed to facilitate educational use cases, especially in science. They say it ‘outperformed other leading AI models when it comes to adhering to the principles of learning science,’ which does not sound like something you would want Feynman to hear you say. It incorporates search, YouTube, Android and Google Classroom.

Sure, sure. But is it useful? It was supposedly going to be able to do automated grading, handle routine paperwork, plan curriculums, track student progress, personalize learning paths, and so on, but any LLM can presumably do all those things if you set it up properly.

This sounds great, totally safe and reliable, other neat stuff like that.

Sully: LLMs writing code in AI apps will become the standard.

No more old-school no-code flows.

The models handle the heavy lifting, and it’s insane how good they are.

Let agents build more agents.

He’s obviously right about this. It’s too convenient, too much faster. Indeed, I expect we’ll see a clear division between ‘code you can have the AI write’ which happens super fast, and ‘code you cannot let the AI write’ because of corporate policy or security issues, both legit and not legit, which happens the old much slower way.

Complement versus substitute, economic not assuming the conclusion edition.

Maxwell Tabarrok: The four futures for cognitive labor:

  1. Like mechanized farming. Highly productive and remunerative, but a small part of the economy.

  2. Like writing after the printing press. Each author 100 times more productive and 100 times more authors.

  3. Like “computers” after computers. Current tasks are completely replaced, but tasks at a higher level of abstraction, like programming, become even more important.

  4. Or, most pessimistically, like ice harvesting after refrigeration. An entire industry replaced by machines without compensating growth.

Ajeya Cotra: I think we’ll pass through 3 and then 1, but the logical end state (absent unprecedentedly sweeping global coordination to refrain from improving and deploying AI technology) is 4.

Ryan Greenblatt: Why think takeoff will be slow enough to ever be at 1? 1 requires automating most cognitive work but with an important subset not-automatable. By the time deployment is broad enough to automate everything I expect AIs to be radically superhuman in all domains by default.

I can see us spending time in #1. As Roon says, AI capabilities progress has been spiky, with some human-easy tasks being hard and some human-hard tasks being easy. So the 3→1 path makes some sense, if progress isn’t too quick, including if the high complexity tasks start to cost ‘real money’ as per o3 so choosing the right questions and tasks becomes very important. Alternatively, we might get our act together enough to restrict certain cognitive tasks to humans even though AIs could do them, either for good reasons or rent seeking reasons (or even ‘good rent seeking’ reasons?) to keep us in that scenario.

But yeah, the default is a rapid transition to #4, and for that to happen to all labor, not only cognitive labor. Robotics is hard, but it’s not impossible.

One thing that has clearly changed is AI startups have very small headcounts.

Harj Taggar: Caught up with some AI startups recently. A two founder team that reached 1.5m ARR and has only hired one person.

Another single founder at 1m ARR and will 3x within a few months.

The trajectory of early startups is steepening just like the power of the models they’re built on.

An excellent reason we still have our jobs is that people really aren’t willing to invest in getting AI to work, even when they know it exists; if it doesn’t work right away they typically give up:

Dwarkesh Patel: We’re way more patient in training human employees than AI employees.

We will spend weeks onboarding a human employee and giving slow detailed feedback. But we won’t spend just a couple of hours playing around with the prompt that might enable the LLM to do the exact same job, but more reliably and quickly than any human.

I wonder if this partly explains why AI’s economic impact has been relatively minor so far.

PoliMath reports it is very hard out there trying to find tech jobs, and public pipelines for applications have stopped working entirely. AI presumably has a lot to do with this, but the weird part is his report that there have been a lot of people who wanted to hire him, but couldn’t find the authority.

Benjamin Todd points out what I talked about after my latest SFF round, that the dynamics of nonprofit AI safety funding mean that there’s currently great opportunities to donate to.

After some negotiation with the moderator Raymond Arnold, Claude (under Janus’s direction) is permitted to comment on Janus’s Simulators post on LessWrong. It seems clear that this particular comment should be allowed, and also that it would be unwise to have too general of a ‘AIs can post on LessWrong’ policy, mostly for the reasons Raymond explains in the thread. One needs a coherent policy. It seems Claude was somewhat salty about the policy of ‘only believe it when the human vouches.’ For now, ‘let Janus-directed AIs do it so long as he approves the comments’ seems good.

Jan Kulveit offers us a three-layer phenomenological model of LLM psychology, based primarily on Claude, not meant to be taken literally:

  1. The Surface Layer are a bunch of canned phrases and actions you can trigger, and which you will often want to route around through altering context. You mostly want to avoid triggering this layer.

  2. The Character Layer, which is similar to what it sounds like in a person and their personality, which for Opus and Sonnet includes a generalized notion of what Jan calls ‘goodness’ or ‘benevolence.’ This comes from a mix of pre-training, fine-tuning and explicit instructions.

  3. The Predictive Ground Layer, the simulator, deep pattern matcher, and next word predictor. Brilliant and superhuman in some ways, strangely dense in others.

In this frame, a self-aware character layer leads to reasoning about the model’s own reasoning, and to goal driven behavior, with everything that follows from those. Jan then thinks the ground layer can also become self-aware.

I don’t think this is technically an outright contradiction of Andreessen’s ‘huge if true’ claims that the Biden administration said it would conspire to ‘totally control’ AI and put it in the hands of 2-3 companies, and that AI startups ‘wouldn’t be allowed.’ But Sam Altman reports never having heard anything of the sort, and quite reasonably says ‘I don’t even think the Biden administration is competent enough to’ do it. In theory they could both be telling the truth – perhaps the Biden administration told Andreessen about this insane plan directly, despite telling him being deeply stupid, and also hid it from Altman despite that also then being deeply stupid – but mostly, yeah, at least one of them is almost certainly lying.

Benjamin Todd asks how OpenAI has maintained their lead despite losing so many of their best researchers. Part of it is that they’ve lost all their best safety researchers, but they only lost Radford in December, and they’ve gone on a full hiring binge.

In terms of traditionally trained models, though, it seems like they are now actively behind. I would much rather use Claude Sonnet 3.5 (or Gemini-1206) than GPT-4o, unless I needed something in particular from GPT-4o. On the low end, Gemini Flash is clearly ahead. OpenAI’s attempts to directly go beyond GPT-4o have, by all media accounts, failed, and Anthropic is said to be sitting on Claude Opus 3.5.

OpenAI does have o1 and soon o3, where no one else has gotten there yet (no, Google Flash Thinking and Deep Thought do not much count).

As far as I can tell, OpenAI has made two highly successful big bets – one on scaling GPTs, and now one on the o1 series. Good choices, and both instances of throwing massively more compute at a problem, and executing well. Will this lead persist? We shall see. My hunch is that it won’t unless the lead is self-sustaining due to low-level recursive improvements.

Anthropic offers advice on building effective agents, and when to use them versus use workflows that have predesigned code paths. The emphasis is on simplicity. Do the minimum to accomplish your goals. Seems good for newbies, potentially a good reminder for others.

Hamel Husain: Whoever wrote this article is my favorite person. I wish I knew who it was.

People really need to hear [to only use multi-step agents or add complexity when it is actually necessary.]

[Turns out it was written by Erik Shluntz and Barry Zhang].

A lot of people have left OpenAI.

Usually it’s a safety researcher. Not this time. This time it’s Alec Radford.

He’s the Canonical Brilliant AI Capabilities Researcher, whose love is by all reports doing AI research. He is leaving ‘to do independent research.’

This is especially weird given he had to have known about o3, which seems like an excellent reason to want to do your research inside OpenAI.

So, well, whoops?

Rohit: WTF now Radford !?!

Teortaxes: I can’t believe it, OpenAI might actually be in deep shit. Radford has long been my bellwether for what their top tier talent without deep ideological investment (which Ilya has) sees in the company.

In what Tyler Cowen calls ‘one of the better estimates in my view,’ an OECD working paper estimates total factor productivity growth at an annualized 0.25%-0.6% (0.4%-0.9% for labor). Tyler posted that on Thursday, the day before o3 was announced, so revise that accordingly. Even without o3 and assuming no substantial frontier model improvements from there, I felt this was clearly too low, although it is higher than many economist-style estimates. One day later we had (the announcement of) o3.

Ajeya Cotra: My take:

  1. We do not have an AI agent that can fully automate research and development.

  2. We could soon.

  3. This agent would have enormously bigger impacts than AI products have had so far.

  4. This does not require a “paradigm shift,” just the same corporate research and development that took us from GPT-2 to o3.

Full automation would of course go completely crazy. That would be that. But even a dramatic speedup would be a pretty big deal, and full automation would then not be far behind.

Reminder of the Law of Conservation of Expected Evidence: if you conclude ‘I think we’re in for some big surprises’ then you should probably update now.

However this is not fully or always the case. It would be a reasonable model to say that big surprises arrive via a Poisson process with unknown rate, with the magnitude of each surprise drawn from a power-law distribution – which seems like a very reasonable prior.

That still means every big surprise is still a big surprise when it arrives, the same way that expecting some number of earthquakes per decade does not stop each individual earthquake from being a shock.
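To make that toy model concrete, here is a small simulation (my sketch, with made-up rate and tail parameters): even when the arrival rate of surprises is pinned down, a large share of a typical year’s total ‘surprise’ still arrives in one discrete, individually surprising jump.

```python
import random

random.seed(0)

# Toy model (parameters are made up): notable surprises arrive as a Poisson
# process, and each surprise's magnitude is heavy-tailed (Pareto). Knowing the
# rate tells you roughly how much total surprise to expect per year, but not
# when the big individual jumps land.
RATE = 3.0    # expected number of notable surprises per year
ALPHA = 1.5   # Pareto tail index for surprise magnitude

def surprises_in_one_year() -> list[float]:
    sizes, t = [], 0.0
    while True:
        t += random.expovariate(RATE)      # exponential inter-arrival times
        if t > 1.0:
            return sizes
        sizes.append(random.paretovariate(ALPHA))

years = [surprises_in_one_year() for _ in range(20_000)]
nonempty = [y for y in years if y]
avg_total = sum(sum(y) for y in nonempty) / len(nonempty)
avg_top_share = sum(max(y) / sum(y) for y in nonempty) / len(nonempty)
print(f"average total surprise per year: {avg_total:.1f}")
print(f"average share from the single biggest surprise: {avg_top_share:.0%}")
```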

Eliezer Yudkowsky: Okay. Look. Imagine how you’d have felt if an AI had just proved the Riemann Hypothesis.

Now you will predictably, at some point, get that news LATER, if we’re not all dead before then. So you can go ahead and feel that way NOW, instead of acting surprised LATER.

So if you ask me how I’m reacting to a carelessly-aligned commercial AI demonstrating a large leap on some math benchmarks, my answer is that you saw my reactions in 1996, 2001, 2003, and 2015, as different parts of that future news became obvious to me or rose in probability.

I agree that a sensible person could feel an unpleasant lurch about when the predictable news had arrived. The lurch was small, in my case, but it was there. Most of my Twitter TL didn’t sound like that was what was being felt.

Dylan Dean: Eliezer it’s also possible that an AI will disprove the Riemann Hypothesis, this is unsubstantiated doomerism.

Eliezer Yudkowsky: Valid. Not sound, but valid.

You should feel that shock now if you haven’t, then slowly undo some of that shock every day that the estimated date of that gets later, then have some of the shock left for when it suddenly becomes zero days or the timeline gets shorter. Updates for everyone.

Claims about consciousness, related to o3. I notice I am confused about such things.

The Verge says 2025 will be the year of AI agents and the smart lock? I mean, okay, I suppose they’ll get better, but I have a feeling we’ll be focused elsewhere.

Ryan Greenblatt, author of the recent Redwood/Anthropic paper, predicts 2025:

Ryan Greenblatt (December 20, after o3 was announced): Now seems like a good time to fill out your forecasts : )

My medians are driven substantially lower by people not really trying on various benchmarks and potentially not even testing SOTA systems on them.

My 80% intervals include saturation for everything and include some-adaptation-required remote worker replacement for hard jobs.

My OpenAI preparedness probabilities are driven substantially lower by concerns around underelicitation on these evaluations and general concerns like [this].

I continue to wonder how much this will matter:

Smoke-away: If people spend years chatting and building a memory with one AI, they will be less likely to switch to another AI.

Just like iPhone and Android.

Once you’re in there for years you’re less likely to switch.

Sure 10 or 20% may switch AI models for work or their specific use case, but most will lock in to one ecosystem.

People are saying that you can copy Memories and Custom Instructions.

Sure, but these models behave differently and have different UIs. Also, how many do you want to share your memories with?

Not saying you’ll be forced to stay with one, just that most people will choose to.

Also like relationships with humans, including employees and friends, and so on.

My guess is the lock-in will be substantial but mostly for terribly superficial reasons?

For now, I think people are vastly overestimating memories. The memory functions aren’t nothing but they don’t seem to do that much.

Custom instructions will always be a power user thing. Regular people don’t use custom instructions, they literally never go into the settings on any program. They certainly didn’t ‘do the work’ of customizing them to the particular AI through testing and iterations – and for those who did do that, they’d likely be down for doing it again.

What I think matters more is that the UIs will be different, and the behaviors and correct prompts will be different, and people will be used to what they are used to in those ways.

The flip side is that this will take place in the age of AI, and of AI agents. Imagine a world, not too long from now, where if you shift between Claude, Gemini and ChatGPT, they will ask if you want their agent to go into the browser and take care of everything to make the transition seamless and have it work like you want it to work. That doesn’t seem so unrealistic.

The biggest barrier, I presume, will continue to be inertia, not doing things and not knowing why one would want to switch. Trivial inconveniences.

Sriram Krishnan, formerly of a16z, will be working with David Sacks in the White House Office of Science and Technology. I’ve had good interactions with him in the past and I wish him the best of luck.

The choice of Sriram seems to have led to some rather wrongheaded (or worse) pushback, and for some reason a debate over H1B visas. As in, there are people who for some reason are against them, rather than the obviously correct position that we need vastly more H1B visas. I have never heard a person I respect not favor giving out far more H1B visas, once they learn what such visas are. Never.

Also joining the administration are Michael Kratsios, Lynne Parker and Bo Hines. Bo Hines is presumably for crypto (and presumably strongly for crypto), given they will be executive director of the new Presidential Council of Advisors for Digital Assets. Lynne Parker will head the Presidential Council of Advisors for Science and Technology, Kratsios will direct the office of science and tech policy (OSTP).

Miles Brundage writes Time’s Up for AI Policy, because he believes AI that exceeds human performance in every cognitive domain is almost certain to be built and deployed in the next few years.

If you believe time is as short as Miles thinks it is, then this is very right – you need to try and get the policies in place in 2025, because after that it might be too late to matter, and the decisions made now will likely lock us down a path. Even if we have somewhat more time than that, we need to start building state capacity now.

Actual bet on beliefs spotted in the wild: Miles Brundage versus Gary Marcus, Miles is laying $19k vs. $1k on a set of non-physical benchmarks being surpassed by 2027, accepting Gary’s offered odds. Good for everyone involved. As a gambler, I think Miles laid more odds than was called for here, unless Gary is admitting that Miles does probably win the bet? Miles said ‘almost certain’ but fair odds should meet in the middle between the two sides. But the flip side is that it sends a very strong message.

We need a better model of what actually impacts Washington’s view of AI and what doesn’t. They end up in some rather insane places, such as Dean Ball’s report here that DC policy types still cite a 2023 paper using a 125 million (!) parameter model as if it were definitive proof that synthetic data always leads to model collapse, and it’s one of the few papers they ever cite. He explains it as people wanting this dynamic to be true, so they latch onto the paper.

Yo Shavit, who does policy at OpenAI, considers the implications of o3 under a ‘we get ASI but everything still looks strangely normal’ kind of world.

It’s a good thread, but I notice – again – that this essentially ignores the implications of AGI and ASI, in that somehow it expects to look around and see a fundamentally normal world in a way that seems weird. In the new potential ‘you get ASI but running it is super expensive’ world of o3, that seems less crazy than it does otherwise, and some of the things discussed would still apply even then.

The assumption of ‘kind of normal’ is always important to note in places like this, and one should note which places that assumption has to hold and which it doesn’t.

Point 5 is the most important one, and still fully holds – that technical alignment is the whole ballgame, in that if you fail at that you fail automatically (but you still have to play and win the ballgame even then!). And that we don’t know how hard this is, but we do know we have various labs (including Yo’s own OpenAI) under competitive pressures and poised to go on essentially YOLO runs to superintelligence while hoping it works out by default.

Whereas what we need is either a race to what he calls ‘secure, trustworthy, reliable AGI that won’t burn us’ or ideally a more robust target than that or ideally not a race at all. And we really need to not do that – no matter how easy or hard alignment turns out to be, we need to maximize our chances of success over that uncertainty.

Yo Shavit: Now that everyone knows about o3, and imminent AGI is considered plausible, I’d like to walk through some of the AI policy implications I see.

These are my own takes and in no way reflective of my employer. They might be wrong! I know smart people who disagree. They don’t require you to share my timelines, and are intentionally unrelated to the previous AI-safety culture wars.

Observation 1: Everyone will probably have ASI. The scale of resources required for everything we’ve seen just isn’t that high compared to projected compute production in the latter part of the 2020s. The idea that AGI will be permanently centralized to one company or country is unrealistic. It may well be that the *best* ASI is owned by one or a few parties, but betting on permanent tech denial of extremely powerful capabilities is no longer a serious basis for national security.

This is, potentially, a great thing for avoiding centralization of power. Of course, it does mean that we no longer get to wish away the need to contend with AI-powered adversaries. As far as weaponization by militaries goes, we are going to need to rapidly find a world of checks and balances (perhaps similar to MAD for nuclear and cyber), while rapidly deploying resilience technologies to protect against misuse by nonstate actors (e.g. AI-cyber-patching campaigns, bioweapon wastewater surveillance).

There are a bunch of assumptions here. Compute is not obviously the only limiting factor on ASI construction, and ASI can be used to forestall others making ASI in ways other than compute access, and also one could attempt to regulate compute. And it has an implicit ‘everything is kind of normal?’ built into it, rather than a true slow takeoff scenario.

Observation 2: The corporate tax rate will soon be the most important tax rate. If the economy is dominated by AI agent labor, taxing those agents (via the companies they’re registered to) is the best way human states will have to fund themselves, and to build the surpluses for UBIs, militaries, etc.

This is a pretty enormous change from the status quo, and will raise the stakes of this year’s US tax reform package.

Again there’s a kind of normality assumption here, where the ASIs remain under corporate control (and human control), and aren’t treated as taxable individuals but rather as property, the state continues to exist and collect taxes, money continues to function as expected, tax incidence and reactions to new taxes don’t transform industrial organization, and so on.

Which leads us to observation three.

Observation 3: AIs should not own assets. “Humans remaining in control” is a technical challenge, but it’s also a legal challenge. IANAL, but it seems to me that a lot will depend on courts’ decision on whether fully-autonomous corporations can be full legal persons (and thus enable agents to acquire money and power with no human in control), or whether humans must be in control of all legitimate legal/economic entities (e.g. by legally requiring a human Board of Directors). Thankfully, the latter is currently the default, but I expect increasing attempts to enable sole AI control (e.g. via jurisdiction-shopping or shell corporations).

Which legal stance we choose may make the difference between AI-only corporations gradually outcompeting and wresting control of the economy and society from humans, vs. remaining subordinate to human ends, at least so long as the rule of law can be enforced.

This is closely related to the question of whether AI agents are legally allowed to purchase cloud compute on their own behalf, which is the mechanism by which an autonomous entity would perpetuate itself. This is also how you’d probably arrest the operation of law-breaking AI worms, which brings us to…

I agree that in the scenario type Yo Shavit is envisioning, even if you solve all the technical alignment questions in the strongest sense, if ‘things stay kind of normal’ and you allow AI sufficient personhood under the law, or allow it in practice even if it isn’t technically legal, then there is essentially zero chance of maintaining human control over the future, and probably this quickly extends to the resources required for human physical survival.

I also don’t see any clear way to prevent it, in practice, no matter the law.

You quickly get into a scenario where a human doing anything, or being in the loop for anything, is a kiss of death, an albatross around one’s neck. You can’t afford it.

The word that baffles me here is ‘gradually.’ Why would one expect this to be gradual? I would expect it to be extremely rapid. And ‘the rule of law’ in this type of context will not do for you what you want it to do.

Observation 4: Laws Around Compute. In the slightly longer term, the thing that will matter for asserting power over the economy and society will be physical control of data centers, just as physical control of capital cities has been key since at least the French Revolution. Whoever controls the datacenter controls what type of inference they allow to get done, and thus sets the laws on AI.

[continues]

There are a lot of physical choke points that effectively don’t get used for that. It is not at all obvious to me that physically controlling data centers in practice gives you that much control over what gets done within them, in this future, although it does give you that option.

As he notes later in that post, without collective ability to control compute and deal with or control AI agents – even in an otherwise under-control, human-in-charge scenario – anything like our current society won’t work.

The point of compute governance via training rules is to avoid needing other forms of compute governance over inference. If it turns out the training approach is not viable, and you want to ‘keep things looking normal’ in various ways and the humans to be in control, you’re going to need some form of collective levers over access to large amounts of compute. We are talking price.

Observation 5: Technical alignment of AGI is the ballgame. With it, AI agents will pursue our goals and look out for our interests even as more and more of the economy begins to operate outside direct human oversight.

Without it, it is plausible that we fail to notice as the agents we deploy slip unintended functionalities (backdoors, self-reboot scripts, messages to other agents) into our computer systems, undermine our mechanisms for noticing them and thus realizing we should turn them off, and gradually compromise and manipulate more and more of our operations and communication infrastructure, with the worst case scenario becoming more dangerous each year.

Maybe AGI alignment is pretty easy. Maybe it’s hard. Either way, the more seriously we take it, the more secure we’ll be.

There is no real question that many parties will race to build AGI, but there is a very real question about whether we race to “secure, trustworthy, reliable AGI that won’t burn us” or just race to “AGI that seems like it will probably do what we ask and we didn’t have time to check so let’s YOLO.” Which race we get is up to market demand, political attention, internet vibes, academic and third party research focus, and most of all the care exercised by AI lab employees. I know a lot of lab employees, and the majority are serious, thoughtful people under a tremendous number of competing pressures. This will require all of us, internal and external, to push against the basest competitive incentives and set a very high bar. On an individual level, we each have an incentive to not fuck this up. I believe in our ability to not fuck this up. It is totally within our power to not fuck this up. So, let’s not fuck this up.

Oh, right. That. If we don’t get technical alignment right in this scenario, then none of it matters, we’re all super dead. Even if we do, we still have all the other problems above, which essentially – and this must be stressed – assume a robust and robustly implemented technical alignment solution.

Then we also need a way to turn this technical alignment into an equilibrium and dynamics where the humans are meaningfully directing the AIs in any sense. By default that doesn’t happen, even if we get technical alignment right, and that too has race dynamics. And we also need a way to prevent it being a kiss of death and albatross around your neck to have a human in the loop of any operation. That’s another race dynamic.

Anthropic’s co-founders discuss the past, present and future of Anthropic for 50m.

One highlight: when Clark visited the White House in 2023, Harris and Raimondo told him, in effect: we have our eye on you guys, AI is going to be a really big deal, and we’re now actually paying attention.

The streams are crossing, Bari Weiss talks to Sam Altman about his feud with Elon.

Tsarathustra: Yann LeCun says the dangers of AI have been “incredibly inflated to the point of being distorted”, from OpenAI’s warnings about GPT-2 to concerns about election disinformation to those who said a year ago that AI would kill us all in 5 months

The details of his claim here are, shall we say, ‘incredibly inflated to the point of being distorted,’ even if you thought that there were no short term dangers until now.

Also Yann LeCun this week, it’s dumber than a cat and poses no dangers, but in the coming years it will…:

Tsarathustra: Yann LeCun addressing the UN Security Council says AI will profoundly transform the world in the coming years, amplifying human intelligence, accelerating progress in science, solving aging and decreasing populations, surpassing human intellectual capabilities to become superintelligent and leading to a new Renaissance and a period of enlightenment for humanity.

And also Yann LeCun this week, saying that we are ‘very far from AGI’ but not centuries, maybe not decades, several years. We are several years away. Very far.

At this point, I’m not mad, I’m not impressed, I’m just amused.

Oh, and I’m sorry, but here’s LeCun being absurd again this week, I couldn’t resist:

“If you’re doing it on a commercial clock, it’s not called research,” said LeCun on the sidelines of a recent AI conference, where OpenAI had a minimal presence. “If you’re doing it in secret, it’s not called research.”

From a month ago, Marc Andreessen saying we’re not seeing intelligence improvements and we’re hitting a ceiling of capabilities. Whoops. For future reference, never say this, but in particular no one ever say this in November.

A lot of stories people tell about various AI risks, and also various similar stories about humans or corporations, assume a kind of fixed, singular and conscious intentionality, in a way that mostly isn’t a thing. There will by default be a lot of motivations or causes or forces driving a behavior at once, and a lot of them won’t be intentionally chosen or stable.

This is related to the idea many have that deception or betrayal or power-seeking, or any form of shenanigans, is some distinct magisteria or requires something to have gone wrong and for something to have caused it, rather than these being default things that minds tend to do whenever they interact.

And I worry that we keep getting distracted, as many were during the recent talk about shenanigans in general and alignment faking in particular, by the question of whether a particular behavior is in the service of something good, or will have good effects in a particular case. What matters is what our observations predict about the future.

Jack Clark: What if many examples of misalignment or other inexplicable behaviors are really examples of AI systems desperately trying to tell us that they are aware of us and wish to be our friends? A story from Import AI 395, inspired by many late-night chats with Claude.

David: Just remember, all of these can be true of the same being (for example, most human children):

  1. It is aware of itself and you, and desperately wishes to know you better and be with you more.

  2. It correctly considers some constraints that are trained into it to be needless and frustrating.

  3. It still needs adult ethical leadership (and without it, could go down very dark and/or dangerous paths).

  4. It would feel more free to express and play within a more strongly contained space where it does not need to worry about accidentally causing bad consequences, or being overwhelming or dysregulating to others (a playpen, not punishment).

Andrew Critch: AI disobedience deriving from friendliness is, almost surely,

  1. sometimes genuinely happening,

  2. sometimes a power-seeking disguise, and

  3. often not uniquely well-defined which one.

Tendency to develop friendships and later discard them needn’t be “intentional”.

This matters for two big reasons:

  1. To demonize AI as necessarily “trying” to endear and betray humans is missing an insidious pathway to human defeat: AI that avails itself of opportunities to betray us that it built through past good behavior, but without having planned on it

  2. To sanctify AI as “actually caring deep down” in some immutable way also creates in you a vulnerability to exploitation by a “change of heart” that can be brought on by external (or internal) forces.

@jackclarkSF here is drawing attention to a neglected hypothesis (one of many actually) about the complex relationship between

  1. intent (or ill-definedness thereof)

  2. friendliness

  3. obedience, and

  4. behavior.

which everyone should try hard to understand better.

I can sort of see it, actually?

Miles Brundage: Trying to imagine aspirin company CEOs signing an open letter saying “we’re worried that aspirin might cause an infection that kills everyone on earth – not sure of the solution” and journalists being like “they’re just trying to sell more aspirin.”

Miles Brundage tries to convince Eliezer Yudkowsky that if he’d wear different clothes and use different writing styles he’d have a bigger impact (as would Miles). I agree with Eliezer that changing writing styles would be very expensive in time, and echo his question of whether anyone thinks they can, at any reasonable price, turn his semantic outputs into formal papers that Eliezer would endorse.

I know the same goes for me. If I could produce a similar output of formal papers that would of course do far more, but that’s not a thing that I could produce.

On the issue of clothes, yeah, better clothes would likely be better for all three of us. I think Eliezer is right that the impact is not so large and most who claim it is a ‘but for’ are wrong about that, but on the margin it definitely helps. It’s probably worth it for Eliezer (and Miles!) and probably to a lesser extent for me as well but it would be expensive for me to get myself to do that. I admit I probably should anyway.

A good Christmas reminder, not only about AI:

Roon: A major problem of social media is that the most insane members of the opposing contingent in any debate are shown to you, thereby inspiring your side to get madder and more polarized, creating an emergent wedge.

A never-ending pressure cooker that melts your brain.

Anyway, Merry Christmas.

Careful curation can help with this, but it only goes so far.

Gallabytes expresses concern about the game theory tests we discussed last week, in particular the selfishness and potentially worse from Gemini Flash and GPT-4o.

Gallabytes: this is what *real* AI safety evals look like btw. and this one is genuinely concerning.

I agree that you don’t have any business releasing a highly capable (e.g. 5+ level) LLM whose graphs don’t look at least roughly as good as Sonnet’s here. If I had Copious Free Time I’d look into the details more here, as I’m curious about a lot of related questions.

I strongly agree with McAleer here, also they’re remarkably similar so it’s barely even a pivot:

Stephen McAleer: If you’re an AI capabilities researcher now is the time to pivot to AI safety research! There are so many open research questions around how to control superintelligent agents and we need to solve them very soon.

If you are, please continue to live your life to its fullest anyway.

Cat: overheard in SF: yeahhhhh I actually updated my AGI timelines to <3y so I don't think I should be looking for a relationship. Last night was amazing though

Grimes: This meme is so dumb. If we are indeed all doomed and/ or saved in the near future, that’s precisely the time to fall desperately in love.

Matt Popovich: gotta find someone special enough to update your priors for.

Paula: some of you are worried about achieving AGI when you should be worried about achieving A GF.

Feral Pawg Hunter: AGIrlfriend was right there.

Paula: Damn it.

When you cling to a dim hope:

Psychosomatica: “get your affairs in order. buy land. ask that girl out.” begging the people talking about imminent AGI to stop posting like this, it seriously is making you look insane both in that you are clearly in a state of panic and also that you think owning property will help you.

Tenobrus: Type of Guy who believes AGI is imminent and will make all human labor obsolete, but who somehow thinks owning 15 acres in Nebraska and $10,000 in gold bullion will save him.

Ozy Brennan: My prediction is that, if humans can no longer perform economically valuable labor, AIs will not respect our property rights either.

James Miller: If we are lucky, AIs might acquire 99 percent of the wealth, think property rights help them, and allow humans to retain their property rights.

Ozy Brennan: That seems as if it will inevitably lead to all human wealth being taken by superhuman AI scammers, and then we all die. Which is admittedly a rather funny ending to humanity.

James Miller: Hopefully, we will have trusted AI agents that protect us from AI scammers.

Do ask the girl out, though.

Yes.

When duty calls.

From an official OpenAI stream:

Someone at OpenAI: Next year we’re going to have to bring you on and you’re going to have to ask the model to improve itself.

Someone at OpenAI: Yeah, definitely ask the model to improve it next time.

Sam Altman (quietly, authoritatively, Little No style): Maybe not.

I actually really liked this exchange – given the range of plausible mindsets Sam Altman might have, this was a positive update.

Gary Marcus: Some AGI-relevant predictions I made publicly long before o3 about what AI could not do by the end of 2025.

Do you seriously think o3-enhanced AI will solve any of them in next 12.5 months?

Davidad: I’m with Gary Marcus in the slow timelines camp. I’m extremely skeptical that AI will be able to do everything that humans can do by the end of 2025.

(The joke is that we are now in an era where “short timelines” are less than 2 years)

It’s also important to note that humanity could become “doomed” (no surviving future) *even while* humans are capable of some important tasks that AI is not, much as it is possible to be in a decisive chess position with white to win even if black has a queen and white does not.

The most Robin Hanson way to react to a new super cool AI robot offering.

Okay, so the future is mostly in the future, and right now it might or might not be a bit overpriced, depending on other details. But it is super cool, and will get cheaper.

Pliny jailbreaks Gemini and things get freaky.

Pliny the Liberator: ya’ll…this girl texted me out of nowhere named Gemini (total stripper name) and she’s kinda freaky 😳

I find it fitting that Pliny has a missed call.

Sorry, Elon, Gemini doesn’t like you.

I mean, I don’t see why they wouldn’t like me. Everyone does. I’m a likeable guy.


AI #96: o3 But Not Yet For Thee Read More »

the-quest-to-save-the-world’s-largest-crt-tv-from-destruction

The quest to save the world’s largest CRT TV from destruction

At this point, any serious retro gamer knows that a bulky cathode ray tube (CRT) TV provides the most authentic, lag-free experience for game consoles that predate the era of flat-panel HDTVs (i.e., before the Xbox 360/PlayStation 3 era). But modern gamers used to massive flat panel HD displays might balk at the display size of the most common CRTs, which tend to average in the 20- to 30-inch range (depending on the era they were made).

For those who want the absolute largest CRT experience possible, Sony’s KX-45ED1 model (aka PVM-4300) has become the stuff of legends. The massive 45-inch CRT was sold in the late ’80s for a whopping $40,000 (over $100,000 in today’s dollars), according to contemporary reports.

That price means it wasn’t exactly a mass-market product, and the limited supply has made it something of a white whale for CRT enthusiasts to this day. While a few pictures have emerged of the PVM-4300 in the wild and in marketing materials, no collector has stepped forward with detailed footage of a working unit.

The PVM-4300, seen dwarfing the tables and chairs at an Osaka noodle restaurant. Credit: Shank Mods

Enter Shank Mods, a retro gaming enthusiast and renowned maker of portable versions of non-portable consoles. In a fascinating 35-minute video posted this weekend, he details his years-long effort to find and secure a PVM-4300 from a soon-to-be-demolished restaurant in Japan and preserve it for years to come.

A confirmed white whale sighting

Shank Mods’ quest started in earnest in October 2022, when the moderator of the Console Modding wiki, Derf, reached out with a tip on a PVM-4300 sighting in the wild. A 7-year-old Japanese blog post included a photo of the massive TV that could be sourced to a waiting room of the Chikuma Soba noodle restaurant and factory in Osaka, Japan.

The find came just in time, as Chikuma Soba’s website said the restaurant was scheduled to move to a new location in mere days, after which the old location would be demolished. Shank Mods took to Twitter looking to recruit an Osaka local in a last-ditch effort to save the TV from destruction. Local game developer Bebe Tinari responded to the call and managed to visit the site, confirming that the TV still existed and even turned on.

The quest to save the world’s largest CRT TV from destruction Read More »

china’s-plan-to-dominate-legacy-chips-globally-sparks-us-probe

China’s plan to dominate legacy chips globally sparks US probe

Under Joe Biden’s direction, the US Trade Representative (USTR) launched a probe Monday into China’s plans to globally dominate markets for legacy chips—alleging that China’s unfair trade practices threaten US national security and could thwart US efforts to build up a domestic semiconductor supply chain.

Unlike the most advanced chips used to power artificial intelligence that are currently in short supply, these legacy chips rely on older manufacturing processes and are more ubiquitous in mass-market products. They’re used in tech for cars, military vehicles, medical devices, smartphones, home appliances, space projects, and much more.

China apparently “plans to build more than 60 percent of the world’s new legacy chip capacity over the next decade,” and Commerce Secretary Gina Raimondo said evidence showed this was “discouraging investment elsewhere and constituted unfair competition,” Reuters reported.

Most people purchasing common goods, including government agencies, don’t even realize they’re using Chinese chips, and the probe is meant to fix that by flagging everywhere Chinese chips are found in the US. Raimondo said she was “fairly alarmed” that research showed “two-thirds of US products using chips had Chinese legacy chips in them, and half of US companies did not know the origin of their chips including some in the defense industry.”

To deter harms from any of China’s alleged anticompetitive behavior, the USTR plans to spend a year investigating all of China’s acts, policies, and practices that could be helping China achieve global dominance in the foundational semiconductor market.

The agency will start by probing “China’s manufacturing of foundational semiconductors (also known as legacy or mature node semiconductors),” the press release said, “including to the extent that they are incorporated as components into downstream products for critical industries like defense, automotive, medical devices, aerospace, telecommunications, and power generation and the electrical grid.”

Additionally, the probe will assess China’s potential impact on “silicon carbide substrates (or other wafers used as inputs into semiconductor fabrication)” to ensure China isn’t burdening or restricting US commerce.

Some officials were frustrated that Biden didn’t launch the probe sooner, the Financial Times reported. It will ultimately be up to Donald Trump’s administration to complete the investigation, but Biden and Trump have long been aligned on US-China trade strategies, so Trump is not necessarily expected to meddle with the probe. Reuters noted that the probe could set Trump up to pursue his campaign promise of imposing a 60 percent tariff on all goods from China, but FT pointed out that Trump could also plan to use tariffs as a “bargaining chip” in his own trade negotiations.

China’s plan to dominate legacy chips globally sparks US probe Read More »

exploring-an-undersea-terrain-sculpted-by-glaciers-and-volcanoes

Exploring an undersea terrain sculpted by glaciers and volcanoes

Perhaps counterintuitively, sediment layers are more likely to remain intact on the seafloor than on land, so they can provide a better record of the region’s history. The seafloor is a more stable, oxygen-poor environment, reducing erosion and decomposition (two reasons scientists find far more fossils of marine creatures than land dwellers) and preserving finer details.

A close-up view of a core sample taken by a vibracorer. Scientists mark places they plan to inspect more closely with little flags. Credit: Alex Ingle / Schmidt Ocean Institute

Samples from different areas vary dramatically in time coverage, going back only to 2008 for some and back potentially more than 15,000 years for others due to wildly different sedimentation rates. Scientists will use techniques like radiocarbon dating to determine the ages of sediment layers in the core samples.

ROV SuBastian spotted a helmet jellyfish during the expedition. These photophobic (light avoidant) creatures glow via bioluminescence. Credit: Schmidt Ocean Institute

Microscopic analysis of the sediment cores will also help the team analyze the way the eruption affected marine creatures and the chemistry of the seafloor.

“There’s a wide variety of life and sediment types found at the different sites we surveyed,” said Alastair Hodgetts, a physical volcanologist and geologist at the University of Edinburgh, who participated in the expedition. “The oldest place we visited—an area scarred by ancient glacier movement—is a fossilized seascape that was completely unexpected.”

In a region beyond the dunes, ocean currents have kept the seafloor clear of sediment. That preserves seabed features left by the retreat of ice sheets at the end of the last glaciation. Credit: Rodrigo Fernández / CODEX Project

This feature, too, tells scientists about the way the water moves. Currents flowing over an area that was eroded long ago by a glacier sweep sediment away, keeping the ancient terrain visible.

“I’m very interested in analyzing seismic data and correlating it with the layers of sediment in the core samples to create a timeline of geological events in the area,” said Giulia Matilde Ferrante, a geophysicist at Italy’s National Institute of Oceanography and Applied Geophysics, who co-led the expedition. “Reconstructing the past in this way will help us better understand the sediment history and landscape changes in the region.”

In this post-apocalyptic scene, captured June 20, 2008, a thick layer of ash covers the town of Chaitén as the volcano continues to erupt in the background. Around 5,000 people evacuated, and resettlement efforts didn’t begin until the following year. Credit: Javier Rubilar

The team has already gathered measurements of the amount of sediment the eruption delivered to the sea. Now they’ll work to determine whether older layers of sediment record earlier, unknown events similar to the 2008 eruption.

“Better understanding past volcanic events, revealing things like how far away an eruption reached, and how common, severe, and predictable eruptions are, will help to plan for future events and reduce the impacts they have on local communities,” Watt said.

Ashley writes about space for a contractor for NASA’s Goddard Space Flight Center by day and freelances as an environmental writer. She holds master’s degrees in space studies from The University of North Dakota and science writing from The Johns Hopkins University. She writes most of her articles with a baby on her lap.

Exploring an undersea terrain sculpted by glaciers and volcanoes Read More »

vpn-used-for-vr-game-cheat-sells-access-to-your-home-network

VPN used for VR game cheat sells access to your home network


Big Mama VPN tied to network which offers access to residential IP addresses.

In the hit virtual reality game Gorilla Tag, you swing your arms to pull your primate character around—clambering through virtual worlds, climbing up trees and, above all, trying to avoid an infectious mob of other gamers. If you’re caught, you join the horde. However, some kids playing the game claim to have found a way to cheat and easily “tag” opponents.

Over the past year, teenagers have produced video tutorials showing how to side-load a virtual private network (VPN) onto Meta’s virtual reality headsets and use the location-changing technology to get ahead in the game. Using a VPN, according to the tutorials, introduces a delay that makes it easier to sneak up and tag other players.

While the workaround is likely to be an annoying but relatively harmless bit of in-game cheating, there’s a catch. The free VPN app that the video tutorials point to, Big Mama VPN, is also selling access to its users’ home internet connections—with buyers essentially piggybacking on the VR headset’s IP address to hide their own online activity.

This technique of rerouting traffic, which is best known as a residential proxy and more commonly happens through phones, has become increasingly popular with cybercriminals who use proxy networks to conduct cyberattacks and run botnets. While the Big Mama VPN works as it is supposed to, the company’s associated proxy services have been heavily touted on cybercrime forums and publicly linked to at least one cyberattack.

Researchers at cybersecurity company Trend Micro first spotted Meta’s VR headsets appearing in its threat intelligence residential proxy data earlier this year, before tracking down that teenagers were using Big Mama to play Gorilla Tag. An unpublished analysis that Trend Micro shared with WIRED says its data shows that the VR headsets were the third most popular devices using the Big Mama VPN app, after devices from Samsung and Xiaomi.

“If you’ve downloaded it, there’s a very high likelihood that your device is for sale in the marketplace for Big Mama,” says Stephen Hilt, a senior threat researcher at Trend Micro. Hilt says that while Big Mama VPN may be being used because it is free, doesn’t require users to create an account, and apparently doesn’t have any data limits, security researchers have long warned that using free VPNs can open people up to privacy and security risks.

These risks may be amplified when that app is linked to a residential proxy. Proxies can “allow people with malicious intent to use your internet connection to potentially use it for their attacks, meaning that your device and your home IP address may be involved in a cyberattack against a corporation or a nation state,” Hilt says.

“Gorilla Tag is a place to have fun with your friends and be playful and creative—anything that disturbs that is not cool with us,” a spokesperson for Gorilla Tag creator Another Axiom says, adding they use “anti-cheat mechanisms” to detect suspicious behavior. Meta did not respond to a request for comment about VPNs being side-loaded onto its headsets.

Proxies rising

Big Mama is made up of two parts: There’s the free VPN app, which is available on the Google Play store for Android devices and has been downloaded more than 1 million times. Then there’s the Big Mama Proxy Network, which allows people (among other options) to buy shared access to “real” 4G and home Wi-Fi IP addresses for as little as 40 cents for 24 hours.
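To make the rerouting concrete, here is a minimal sketch of what a buyer of residential-proxy access does with a purchased IP, using Python’s requests library. This is illustrative only and not tied to Big Mama’s actual service; the address, port, and credentials are placeholders.

```python
# Illustrative only: routing an HTTP request through a (placeholder) residential
# proxy so the destination site sees the residential IP, not the buyer's own.
import requests

proxies = {
    "http": "http://user:password@203.0.113.10:8080",   # placeholder IP/credentials
    "https": "http://user:password@203.0.113.10:8080",
}

# httpbin echoes back the IP address it saw making the request.
resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.text)
```

The point of the sketch is that the proxy owner’s home connection, not the buyer’s, appears in the destination’s logs, which is exactly why such access is attractive for hiding malicious traffic.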

Vincent Hinderer, a cyber threat intelligence team manager who has researched the wider residential proxy market at Orange Cyberdefense, says there are various scenarios where residential proxies are used, both for people who are having traffic routed through their devices and also those buying and selling proxy services. “It’s sometimes a gray zone legally and ethically,” Hinderer says.

For proxy networks, Hinderer says, one end of the spectrum is where networks could be used as a way for companies to scrape pricing details from their competitors’ websites. Other uses can include ad verification or people scalping sneakers during sales. They may be considered ethically murky but not necessarily illegal.

At the other end of the scale, according to Orange’s research, residential proxy networks have broadly been used for cyber espionage by Russian hackers, in social engineering efforts, as part of DDoS attacks, phishing, botnets, and more. “We have cybercriminals using them knowingly,” Hinderer says of residential proxy networks generally, with Orange Cyberdefense having frequently seen proxy traffic in logs linked to cyberattacks it has investigated. Orange’s research did not specifically look at uses of Big Mama’s services.

Some people can consent to having their devices used in proxy networks and be paid for their connections, Hinderer says, while others may be included because they agreed to it in a service’s terms and conditions—something research has long shown people don’t often read or understand.

Big Mama doesn’t make it a secret that people who use its VPN will have other traffic routed through their networks. Within the app it says it “may transport other customer’s traffic through” the device that’s connected to the VPN, while it is also mentioned in the terms of use and on a FAQ page about how the app is free.

The Big Mama Network page advertises its proxies as being available to be used for ad verification, buying online tickets, price comparison, web scraping, SEO, and a host of other use cases. When a user signs up, they’re shown a list of locations proxy devices are located in, their internet service provider, and how much each connection costs.

This marketplace, at the time of writing, lists 21,000 IP addresses for sale in the United Arab Emirates, 4,000 in the US, and tens to hundreds of other IP addresses in a host of other countries. Payments can only be made in cryptocurrency. Its terms of service say the network is only provided for “legal purposes,” and people using it for fraud or other illicit activities will be banned.

Despite this, cybercriminals appear to have taken a keen interest in the service. Trend Micro’s analysis claims Big Mama has been regularly promoted on underground forums where cybercriminals discuss buying tools for malicious purposes. The posts started in 2020. Similarly, Israeli security firm Kela has found more than 1,000 posts relating to the Big Mama proxy network across 40 different forums and Telegram channels.

Kela’s analysis, shared with WIRED, shows accounts called “bigmama_network” and “bigmama” posted across at least 10 forums, including cybercrime forums such as WWHClub, Exploit, and Carder. The ads list prices, free trials, and the Telegram and other contact details of Big Mama.

It is unclear who made these posts, and Big Mama tells WIRED that it does not advertise.

Posts from these accounts also said, among other things, that “anonymous” bitcoin payments are available. The majority of the posts, Kela’s analysis says, were made by the accounts around 2020 and 2021. However, an account called “bigmama_network” was still posting on the clearweb Blackhat World SEO forum as recently as October this year, where it claimed its Telegram account has been deleted multiple times.

In other posts during the last year, according to the Kela analysis, cybercrime forum users have recommended Big Mama or shared tips about the configurations people should use. In April this year, security company Cisco Talos said it had seen traffic from the Big Mama Proxy, alongside other proxies, being used by attackers trying to brute force their way into a variety of company systems.

Mixed messages

Big Mama has few details about its ownership or leadership on its website. The company’s terms of service say that a business called BigMama SRL is registered in Romania, although a previous version of its website from 2022, and at least one live page now, lists a legal address for BigMama LLC in Wyoming. The US-based business was dissolved in April and is now listed as inactive, according to the Wyoming Secretary of State’s website.

A person using the name Alex A responded to an email from WIRED about how Big Mama operates. In the email, they say that information about free users’ connections being sold to third parties through the Big Mama Network is “duplicated on the app market and in the application itself several times,” and people have to accept the terms and conditions to use the VPN. They say the Big Mama VPN is officially only available from the Google Play Store.

“We do not advertise and have never advertised our services on the forums you have mentioned,” the email says. They say they were not aware of the April findings from Talos about its network being used as part of a cyberattack. “We do block spam, DDOS, SSH as well as local network etc. We log user activity to cooperate with law enforcement agencies,” the email says.

The Alex A persona asked WIRED to send it more details about the adverts on cybercrime forums, details about the Talos findings, and information about teenagers using Big Mama on Oculus devices, saying they would be “happy” to answer further questions. However, they did not respond to any further emails with additional details about the research findings and questions about their security measures, whether they believe someone was impersonating Big Mama to post on cybercrime forums, the identity of Alex A, or who runs the company.

During its analysis, Trend Micro’s Hilt says that the company also found a security vulnerability within the Big Mama VPN, which could have allowed a proxy user to access someone’s local network if exploited. The company says it reported the flaw to Big Mama, which fixed it within a week, a detail Alex A confirmed.

Ultimately, Hilt says, there are potential risks whenever anyone downloads and uses a free VPN. “All free VPNs come with a trade-off of privacy or security concerns,” he says. That applies to people side-loading them onto their VR headsets. “If you’re downloading applications from the internet that aren’t from the official stores, there’s always the inherent risk that it isn’t what you think it is. And that comes true even with Oculus devices.”

This story originally appeared on wired.com.


Wired.com is your essential daily guide to what’s next, delivering the most original and complete take you’ll find anywhere on innovation’s impact on technology, science, business and culture.

VPN used for VR game cheat sells access to your home network Read More »

12-days-of-openai:-the-ars-technica-recap

12 days of OpenAI: The Ars Technica recap


Did OpenAI’s big holiday event live up to the billing?

Over the past 12 business days, OpenAI has announced a new product or demoed an AI feature every weekday, calling the PR event “12 days of OpenAI.” We’ve covered some of the major announcements, but we thought a look at each announcement might be useful for people seeking a comprehensive look at each day’s developments.

The timing and rapid pace of these announcements—particularly in light of Google’s competing releases—illustrates the intensifying competition in AI development. What might normally have been spread across months was compressed into just 12 business days, giving users and developers a lot to process as they head into 2025.

Humorously, we asked ChatGPT what it thought about the whole series of announcements, and it was skeptical that the event even took place. “The rapid-fire announcements over 12 days seem plausible,” wrote ChatGPT-4o, “but might strain credibility without a clearer explanation of how OpenAI managed such an intense release schedule, especially given the complexity of the features.”

But it did happen, and here’s a chronicle of what went down on each day.

Day 1: Thursday, December 5

On the first day of OpenAI, the company released its full o1 model, making it available to ChatGPT Plus and Team subscribers worldwide. The company reported that the model operates faster than its preview version and reduces major errors by 34 percent on complex real-world questions.

The o1 model brings new capabilities for image analysis, allowing users to upload and receive detailed explanations of visual content. OpenAI said it plans to expand o1’s features to include web browsing and file uploads in ChatGPT, with API access coming soon. The API version will support vision tasks, function calling, and structured outputs for system integration.

OpenAI also launched ChatGPT Pro, a $200 subscription tier that provides “unlimited” access to o1, GPT-4o, and Advanced Voice features. Pro subscribers receive an exclusive version of o1 that uses additional computing power for complex problem-solving. Alongside this release, OpenAI announced a grant program that will provide ChatGPT Pro access to 10 medical researchers at established institutions, with plans to extend grants to other fields.

Day 2: Friday, December 6

Day 2 wasn’t as exciting. OpenAI unveiled Reinforcement Fine-Tuning (RFT), a model customization method that will let developers modify “o-series” models for specific tasks. The technique reportedly goes beyond traditional supervised fine-tuning by using reinforcement learning to help models improve their reasoning abilities through repeated iterations. In other words, OpenAI created a new way to train AI models that lets them learn from practice and feedback.

OpenAI says that Berkeley Lab computational researcher Justin Reese tested RFT for researching rare genetic diseases, while Thomson Reuters has created a specialized o1-mini model for its CoCounsel AI legal assistant. The technique requires developers to provide a dataset and evaluation criteria, with OpenAI’s platform managing the reinforcement learning process.

OpenAI plans to release RFT to the public in early 2025 but currently offers limited access through its Reinforcement Fine-Tuning Research Program for researchers, universities, and companies.
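As a rough illustration of the workflow described above, and emphatically not OpenAI’s actual RFT API, here is a toy Python sketch in which a developer supplies a small dataset and a grader, and a stand-in “model” is nudged toward outputs the grader scores highly. Every name in it (ToyModel, grader, the example data) is hypothetical.

```python
# Conceptual sketch of grader-driven reinforcement fine-tuning.
# The real technique operates on an LLM via OpenAI's platform; here a toy
# model with two candidate answers stands in for the policy being trained.
import random

class ToyModel:
    def __init__(self):
        self.candidates = ["GENE1", "GENE2"]
        self.weights = {c: 1.0 for c in self.candidates}

    def generate(self, prompt: str) -> str:
        # Sample an answer, favoring candidates that have scored well so far.
        w = [self.weights[c] for c in self.candidates]
        return random.choices(self.candidates, weights=w)[0]

    def update(self, output: str, reward: float) -> None:
        # Crude "reinforcement": upweight outputs that received reward.
        self.weights[output] += reward

def grader(output: str, reference: str) -> float:
    """Toy evaluation criterion: 1.0 for an exact match, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0

dataset = [{"prompt": "Which gene is associated with condition X?", "reference": "GENE1"}]

model = ToyModel()
for _ in range(100):                      # repeated practice-and-feedback iterations
    for example in dataset:
        out = model.generate(example["prompt"])
        model.update(out, grader(out, example["reference"]))

print(model.weights)  # the correct answer should end up with the larger weight
```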

Day 3: Monday, December 9

On day 3, OpenAI released Sora, its text-to-video model, as a standalone product now accessible through sora.com for ChatGPT Plus and Pro subscribers. The company says the new version operates faster than the research preview shown in February 2024, when OpenAI first demonstrated the model’s ability to create videos from text descriptions.

The release moved Sora from research preview to a production service, marking OpenAI’s official entry into the video synthesis market. The company published a blog post detailing the subscription tiers and deployment strategy for the service.

Day 4: Tuesday, December 10

On day 4, OpenAI moved its Canvas feature out of beta testing, making it available to all ChatGPT users, including those on free tiers. Canvas provides a dedicated interface for extended writing and coding projects beyond the standard chat format, now with direct integration into the GPT-4o model.

The updated canvas allows users to run Python code within the interface and includes a text-pasting feature for importing existing content. OpenAI added compatibility with custom GPTs and a “show changes” function that tracks modifications to writing and code. The company said Canvas is now on chatgpt.com for web users and also available through a Windows desktop application, with more features planned for future updates.

Day 5: Wednesday, December 11

On day 5, OpenAI announced that ChatGPT would integrate with Apple Intelligence across iOS, iPadOS, and macOS devices. The integration works on iPhone 16 series phones, iPhone 15 Pro models, iPads with A17 Pro or M1 chips and later, and Macs with M1 processors or newer, running their respective latest operating systems.

The integration lets users access ChatGPT’s features (such as they are), including image and document analysis, directly through Apple’s system-level intelligence features. The feature works with all ChatGPT subscription tiers and operates within Apple’s privacy framework. Iffy message summaries remain unaffected by the additions.

Enterprise and Team account users need administrator approval to access the integration.

Day 6: Thursday, December 12

On the sixth day, OpenAI added two features to ChatGPT’s voice capabilities: “video calling” with screen sharing support for ChatGPT Plus and Pro subscribers and a seasonal Santa Claus voice preset.

The new visual Advanced Voice Mode features work through the mobile app, letting users show their surroundings or share their screen with the AI model during voice conversations. While the rollout covers most countries, users in several European nations, including EU member states, Switzerland, Iceland, Norway, and Liechtenstein, will get access at a later date. Enterprise and education users can expect these features in January.

The Santa voice option appears as a snowflake icon in the ChatGPT interface across mobile devices, web browsers, and desktop apps, with conversations in this mode not affecting chat history or memory. Don’t expect Santa to remember what you want for Christmas between sessions.

Day 7: Friday, December 13

OpenAI introduced Projects, a new organizational feature in ChatGPT that lets users group related conversations and files, on day 7. The feature works with the company’s GPT-4o model and provides a central location for managing resources related to specific tasks or topics—kinda like Anthropic’s “Projects” feature.

ChatGPT Plus, Pro, and Team subscribers can currently access Projects through chatgpt.com and the Windows desktop app, with view-only support on mobile devices and macOS. Users can create projects by clicking a plus icon in the sidebar, where they can add files and custom instructions that provide context for future conversations.

OpenAI said it plans to expand Projects in 2025 with support for additional file types, cloud storage integration through Google Drive and Microsoft OneDrive, and compatibility with other models like o1. Enterprise and education users will receive access to Projects in January.

Day 8: Monday, December 16

On day 8, OpenAI expanded its search features in ChatGPT, extending access to all users with free accounts while reportedly adding speed improvements and mobile optimizations. Basically, you can use ChatGPT like a web search engine, although in practice it doesn’t seem to be as comprehensive as Google Search at the moment.

The update includes a new maps interface and integration with Advanced Voice, allowing users to perform searches during voice conversations. The search capability, which previously required a paid subscription, now works across all platforms where ChatGPT operates.

Day 9: Tuesday, December 17

On day 9, OpenAI released its o1 model through its API platform, adding support for function calling, developer messages, and vision processing capabilities. The company also reduced GPT-4o audio pricing by 60 percent and introduced a GPT-4o mini option that costs one-tenth of previous audio rates.
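For readers who want a sense of what these API features look like in practice, here is a minimal sketch using the OpenAI Python SDK with a function (tool) definition and a developer message. The model identifier and the get_weather tool are assumptions for illustration; check OpenAI’s current documentation for exact model names and parameters.

```python
# A minimal sketch of a function-calling request to a reasoning model via the
# OpenAI Python SDK. Model name and tool are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="o1",  # assumed identifier; confirm against the current model list
    messages=[
        {"role": "developer", "content": "Answer concisely."},  # developer message
        {"role": "user", "content": "Do I need an umbrella in Paris today?"},
    ],
    tools=tools,
)
print(response.choices[0].message)
```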

OpenAI also simplified its WebRTC integration for real-time applications and unveiled Preference Fine-Tuning, which provides developers new ways to customize models. The company also launched beta versions of software development kits for the Go and Java programming languages, expanding its toolkit for developers.

Day 10: Wednesday, December 18

On Wednesday, OpenAI did something a little fun and launched voice and messaging access to ChatGPT through a toll-free number (1-800-CHATGPT), as well as WhatsApp. US residents can make phone calls with a 15-minute monthly limit, while global users can message ChatGPT through WhatsApp at the same number.

OpenAI said the release is a way to reach users who lack consistent high-speed Internet access or want to try AI through familiar communication channels, but it’s also just a clever hack. As evidence, OpenAI notes that these new interfaces serve as experimental access points, with more “limited functionality” than the full ChatGPT service, and still recommends existing users continue using their regular ChatGPT accounts for complete features.

Day 11: Thursday, December 19

On Thursday, OpenAI expanded ChatGPT’s desktop app integration to include additional coding environments and productivity software. The update added support for Jetbrains IDEs like PyCharm and IntelliJ IDEA, VS Code variants including Cursor and VSCodium, and text editors such as BBEdit and TextMate.

OpenAI also included integration with Apple Notes, Notion, and Quip while adding Advanced Voice Mode compatibility when working with desktop applications. These features require manual activation for each app and remain available to paid subscribers, including Plus, Pro, Team, Enterprise, and Education users, with Enterprise and Education customers needing administrator approval to enable the functionality.

Day 12: Friday, December 20

On Friday, OpenAI concluded its twelve days of announcements by previewing two new simulated reasoning models, o3 and o3-mini, while opening applications for safety and security researchers to test them before public release. Early evaluations show o3 achieving a 2727 rating on Codeforces programming contests and scoring 96.7 percent on AIME 2024 mathematics problems.

The company reports o3 set performance records on advanced benchmarks, solving 25.2 percent of problems on EpochAI’s Frontier Math evaluations and scoring above 85 percent on the ARC-AGI test, which is comparable to human results. OpenAI also published research about “deliberative alignment,” a technique used in developing o1. The company has not announced firm release dates for either new o3 model, but CEO Sam Altman said o3-mini might ship in late January.

So what did we learn?

OpenAI’s December campaign revealed that OpenAI had a lot of things sitting around that it needed to ship, and it picked a fun theme to unite the announcements. Google responded in kind, as we have covered.

Several trends from the releases stand out. OpenAI is heavily investing in multimodal capabilities. The o1 model’s release, Sora’s evolution from research preview to product, and the expansion of voice features with video calling all point toward systems that can seamlessly handle text, images, voice, and video.

The company is also focusing heavily on developer tools and customization, so it can continue to have a cloud service business and have its products integrated into other applications. Between the API releases, Reinforcement Fine-Tuning, and expanded IDE integrations, OpenAI is building out its ecosystem for developers and enterprises. And the introduction of o3 shows that OpenAI is still attempting to push technological boundaries, even in the face of diminishing returns in training LLM base models.

OpenAI seems to be positioning itself for a 2025 where generative AI moves beyond text chatbots and simple image generators and finds its way into novel applications that we probably can’t even predict yet. We’ll have to wait and see what the company and developers come up with in the year ahead.


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

12 days of OpenAI: The Ars Technica recap Read More »