Author name: Shannon Garcia


Dewormer ivermectin as cancer cure? RFK Jr.’s NIH funds “absurd” study.

The National Cancer Institute is using federal funds to study whether cancer can be cured by ivermectin, a cheap, off-patent anti-parasitic and deworming drug that fringe medical groups falsely claimed could treat COVID-19 during the pandemic and have since touted as a cure-all.

Large, high-quality clinical trials have resoundingly concluded that ivermectin is not effective against COVID-19. And there is no old or new scientific evidence to support a hypothesis that ivermectin can cure cancer—or justify any such federal expenditure. But, under anti-vaccine Health Secretary Robert F. Kennedy Jr.—who is otherwise well-known for claiming to have a parasitic worm in his brain—numerous members of the medical fringe are now in powerful federal positions or otherwise hold sway with the administration.

During a January 30 event, Anthony Letai, a cancer researcher the Trump administration installed as the director of the NCI in September, said the NCI was pursuing ivermectin.

“There are enough reports of it, enough interest in it, that we actually did—ivermectin, in particular—did engage in sort of a better preclinical study of its properties and its ability to kill cancer cells and we’ll probably have those results in a few months. So we are taking it seriously.”

The comments were highlighted today in a report from KFF Health News. Ars Technica was also at the event, “Reclaiming Science: The People’s NIH,” which was hosted by the MAHA [Make America Healthy Again] Institute. In the rest of his comments, Letai seemed to make a noticeable effort to temper expectations while also trying to avoid offending any ivermectin believers. “It’s not going to be a cure-all for cancer,” he said. At another point, he said that even if there are signals of anti-cancer properties in the preclinical studies, “I can tell you again, it’s not a really strong signal.”



Google experiments with locking YouTube Music lyrics behind paywall

The app’s lyrics feature allows listeners to follow along as the song plays. However, only the first few lines are visible once free users in the test hit the lyric cut-off. After that, the lyrics are blurred. Users who want to keep seeing lyrics are advised to upgrade to a premium account, which costs $14 for both YouTube video and music or $11 for music only. The subscription also removes ads and adds features like downloads and higher-quality video streams.


The new paywall in YouTube Music.

Credit: /u/MrYeet22836 and /u/Vegetable_Common188


This change is not without precedent. Spotify began restricting access to lyrics for free users in 2024. However, the response was so ferociously negative that the company backtracked and restored lyric access to those on ad-supported accounts. YouTube Music doesn’t have the same reach as Spotify, which may help soften the social media shame. Many subscribers are also getting the premium service just because they’re paying for ad-free YouTube and may never know there’s been a change to lyric availability.

As Google has ratcheted up restrictions on free YouTube accounts, the service has only made more money. In Google’s most recent earnings report, it reported $60 billion in YouTube revenue across both ads and subscriptions (both YouTube Premium and YouTube TV). That’s almost $10 billion more than last year.

Lyrics in YouTube Music are provided by third parties that Google has to pay, so it’s not surprising that Google is looking for ways to cover the cost. It is, however, a little surprising that the company hasn’t just used AI to generate lyrics for free. Google has recently tested the patience of YouTube users with a spate of AI features, like unannounced AI upscaling, fake DJs, and comment summaries.

This story was updated with Google’s response. 



Disclosure Day Super Bowl trailer: Could it be… aliens?

David Koepp, who has worked with Spielberg on numerous projects (including Jurassic Park and War of the Worlds), wrote the screenplay. Emily Blunt stars as a TV meteorologist in Kansas City. Her co-stars include Josh O’Connor, Colin Firth, Eve Hewson, Colman Domingo, Wyatt Russell, Elizabeth Marvel, Henry Lloyd-Hughes, Michael Gaston, and Mckenna Bridger. Professional wrestlers Chavo Guerrero Jr., Lance Archer, and Brian Cage will also appear.

Disclosure Day hits theaters on June 12, 2026.

And for those eagerly awaiting the May 22, 2026, release of The Mandalorian and Grogu, we give you this 30-second glimpse of our favorite bounty hunter and his ward in a sled pulled by Tauntauns. We still don’t have much information about that plot either, but at least it’s a known property thanks to the hit TV series. Plus, we get a suitably sonorous voiceover by none other than Sam Elliott: “Sometimes we choose our path, other times the path chooses us. Through it all, we keep pushing forward, driven by a deeper purpose, guided by an unseen force. The journey never gets any easier; the bond just gets harder to break. This is the way.”



Why would Elon Musk pivot from Mars to the Moon all of a sudden?

As more than 120 million people tuned in to the Super Bowl for kickoff on Sunday evening, SpaceX founder Elon Musk turned instead to his social network. There, he tapped out an extended message in which he revealed that SpaceX is pivoting from the settlement of Mars to building a “self-growing” city on the Moon.

“For those unaware, SpaceX has already shifted focus to building a self-growing city on the Moon, as we can potentially achieve that in less than 10 years, whereas Mars would take 20+ years,” Musk wrote, in part.

Elon Musk tweet at 6:24 pm ET on Sunday.

Credit: X/Elon Musk


This is simultaneously a jolting and practical decision coming from Musk.

Why it’s a jolting decision

A quarter of a century ago, Musk founded SpaceX with a single-minded goal: settling Mars. One of his longest-tenured employees, SpaceX President and Chief Operating Officer Gwynne Shotwell, described her very first interview with Musk in 2002 to me as borderline messianic.

“He was talking about Mars, his Mars Oasis project,” Shotwell said. “He wanted to do Mars Oasis, because he wanted people to see that life on Mars was doable, and we needed to go there.”

She was not alone in this description of her first interaction with Musk. The vision for SpaceX has not wavered. Even in the company’s newest, massive Starship rocket factory at the Starbase facility in South Texas—also known as the Gateway to Mars—there are reminders of the red planet everywhere. For example, the carpet inside Musk’s executive conference room is rust red, the same color as the surface of Mars.

In the last 25 years, Musk has gone from an obscure, modestly wealthy person to the richest human being ever; from a political moderate to chief supporter of Donald Trump; from a respected entrepreneur to, well, a lot of things to a lot of people: world’s greatest industrialist/supervillain/savant/grifter-fraudster.

But one thing that has remained constant across the Muskverse is his commitment to “extending the light of human consciousness” and to the belief that the best place to begin humanity’s journey toward becoming a multi-planetary species was Mars.



Claude Code #4: From The Before Times

Claude Opus 4.6 and agent swarms were announced yesterday. Those are some big upgrades for Claude Code.

OpenAI, the competition, offered us GPT-5.3-Codex, and this week gave us an app form of Codex that already has a million active users.

That’s all very exciting, and next week is going to be about covering that.

This post is about all the cool things that happened before that, which we will be building upon now that capabilities have further advanced. This is from the Before Times.

Almost all of it still applies. I haven’t had much chance yet to work with Opus 4.6, but as far as I can tell you should mostly keep on doing what you were doing before that switch, only everything will work better. Maybe get a bit more ambitious. Agent swarms might be more of a technique shifter, but we need to give that some time.

  1. Claude Code and Cowork Offer Mundane Utility.

  2. The Efficient Market Hypothesis Is False.

  3. Inflection Point.

  4. Welcome To The Takeoff.

  5. Huh, Upgrades.

  6. Todos Become Tasks.

  7. I’m Putting Together A Team.

  8. Compact Problems.

  9. Code Yourself A Date.

  10. Verification and Generation Are Distinct Skills.

  11. Skilling Up.

  12. AskUserQuestion.

  13. For Advanced Players.

  14. So They Quit Reading.

  15. Reciprocity Is The Key To Every Relationship.

  16. The Implementation Gap.

  17. The Lighter Side.

Nvidia CEO Jensen Huang offered Claude a huge endorsement on January 21, calling it incredible and saying every software company needs to use it.

Ethan Mollick: This game was 100% designed, tested, and made by Claude Code with the instructions to “make a complete Sierra-style adventure game with EGA-like graphics and text parser, with 10-15 minutes of gameplay.” I then told it to playtest the game & deploy.

Play: https://enchanted-lighthouse-game.netlify.app

It was a single prompt for the entire game, and then a prompt to playtest and improve the outcome.

I gave it an agent that can connect to GPT image gen.

Iterative image generation sounds pretty cool:

elvis: I just used the new Claude Code Playground plugin to level up my Nano Banana Image generator skill.

My skill has a self-improving loop, but with the playground skill, I can also pass precise annotations to nano banana as it improves the images.

I have built a Skill for Claude Code that leverages the nano banana image generation model via API.

I built it like that because I have had a lot of success generating images with nano banana in an agentic self-improving loop. It can dynamically make API requests and improve images really well.

With the Playground plugin, I can take it one step further. I can now provide precise annotations that the agentic loop can leverage to make more optimal API calls in the hope of improving the images further. Visual cues are extremely powerful for agents, and this is a sort of proxy for that.

Ado cancels all his Amazon Subscribe and Save orders via Claude for Chrome. Sameer points out that a Chrome Extension can do this more efficiently and have a better UI, but there is great joy in not having to choose and install a new tool to do a new thing. Yes, if you were doing this a lot you’d use a Chrome Extension, or have Claude Code build you a new one to your taste.

I agree with Andrej Karpathy, you should use RSS feeds wherever feasible to guard your information flow. I use Feedly, he suggests NetNewsWire or vibe coding your own reader. It is unfortunate that Twitter does not play nice with such a setup.

Seth Lazar: I wrote about the idea of building an “Attention Guardian” agent back in 2023. Genuinely think it’s feasible now. Claude Code is now building up a workflow to go across all these different sources, with a long description of what I’m interested in, and create a new feed.

Storm points out that anything you can do with a terminal interface you can in theory do better with a graphical interface (GUI), but the people building GUIs don’t give you the things you want: Information density, low latency, no ads, shortcuts, open source, composable, tileable, scriptable. It’s just that no one does it.

What the market has instead is a sense of humor.

modest proposal: March 12, 2020 was the trading day after Tom Hanks said he had covid and the NBA shut down, Expedia fell 15.2% and BKNG fell 11.2%.

February 3, 2026, which was the day the Claude Code legal connector was announced, Expedia fell 15.3% and BKNG fell 9.4%.

Then software drove itself off a cliff generally (y-axis goes from .012 to .018), and then after this graph was posted it kept going, all supposedly in response to information that, from where I sit, was rather old news the whole time.

Shruti: Anthropic Just Triggered a $285B Market Crash

Bloomberg just reported that Anthropic released a new AI tool that caused:

• $285 billion wiped out across software, finance, and asset management stocks

• 6% drop in Goldman’s software basket (biggest since April)

• 7% crash in financial services index

• Nasdaq down 2.4% at its worst

This is MASSIVE. The market literally panicked over an AI automation tool.

Or in broader context:

Kevin Gordon: Software relative to the S&P 500 is a particularly brutal chart … essentially 6 years of relative gains wiped out

Andy Masley: Software stocks dropped 6% and legal services dropped 7% because Anthropic released plugins for Cowork? This seems like the first huge shift in market behavior I’ve seen caused by AI capabilities. Why wasn’t this all over the TL?

Dan Elton: Wild times in the market! This is probably over-reaction, but this is a very interesting signal indicating that AI tools (especially for coding and legal and financial grunt work) are having a huge impact.

Okay, so yeah, combined with what has happened since then that’s DeepSeek 2.0, a large move down on entirely expected news.

Should software have already been lower? That’s a reasonable position, but there’s no way that it should have dropped this much in response to this news. If you declared SaaSpocalypse on February 3 you should have done so a month ago. Alas, no, I did not trade on this, because it’s not obvious to me we should be SaaSpocalypsing at all and it wasn’t obvious this wasn’t priced in.

Now we are in a period where all the tech stocks are moving around violently, usually in full wrong way moves. I continue not to trade on any of it. I do have some ammo, but I also am already plenty long and have been for a while, so I’m not going to fire unless I see the whites of their eyes.

Andrej Karpathy updates us that he was one of many who went from 80% manual coding and autocomplete in November to 80% agentic coding in December. Whole thing is worth reading.

Andrej Karpathy: This is easily the biggest change to my basic coding workflow in ~2 decades of programming and it happened over the course of a few weeks. I’d expect something similar to be happening to well into double digit percent of engineers out there, while the awareness of it in the general population feels well into low single digit percent.

He’s still behind the curve, I’m with Boris Cherny at 100% agentic coding. Then again, excluding quotes I’m still at almost 100% manual writing for posts.

IDEs/agent swarms/fallibility. Both the “no need for IDE anymore” hype and the “agent swarm” hype are imo too much for right now. The models definitely still make mistakes and if you have any code you actually care about I would watch them like a hawk, in a nice large IDE on the side. The mistakes have changed a lot – they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might do.

The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking. They also don’t manage their confusion, they don’t seek clarifications, they don’t surface inconsistencies, they don’t present tradeoffs, they don’t push back when they should, and they are still a little too sycophantic.

Tenacity. It’s so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It’s a “feel the AGI” moment to watch it struggle with something for a long time just to come out victorious 30 minutes later.

Leverage. LLMs are exceptionally good at looping until they meet specific goals and this is where most of the “feel the AGI” magic is to be found. Don’t tell it what to do, give it success criteria and watch it go. Get it to write tests first and then pass them. Put it in the loop with a browser MCP.

Fun. I didn’t anticipate that with agents programming feels *more* fun because a lot of the fill-in-the-blanks drudgery is removed and what remains is the creative part.

Questions. A few of the questions on my mind:

– What happens to the “10X engineer” – the ratio of productivity between the mean and the max engineer? It’s quite possible that this grows *a lot*.

– Armed with LLMs, do generalists increasingly outperform specialists? LLMs are a lot better at fill in the blanks (the micro) than grand strategy (the macro).

– What does LLM coding feel like in the future? Is it like playing StarCraft? Playing Factorio? Playing music?

– How much of society is bottlenecked by digital knowledge work?

My prediction on ‘10X engineers’ is that we will see more ability for poor coders to be able to do things reasonably well (including yours truly) but that the long tail of ‘10X’ engineers will increase their relative gap, as they figure out how to scale to supervising agent swarms efficiently. You’ll start to see more of the 100X engineer.

Andrej Karpathy: Love the word “comprehension debt”, haven’t encountered it so far, it’s very accurate. It’s so very tempting to just move on when the LLM one-shotted something that seems to work ok.

Claude Code as we all know builds itself. Codex also now builds itself.

Tibo: Codex now pretty much builds itself, with the help and supervision of a great team. The bottleneck has shifted to being how fast we can help and supervise the outcome.

This is in addition to the big ones of Claude Opus 4.6 and GPT-5.3-Codex.

Claude Code has tab (to accept and edit) or enter (to accept and run) autocomplete, similar to AI completion suggestions in Cursor or other IDEs.

Claude Cowork expands to Team and Enterprise plans, and has the @-mention feature to bring context into sessions, and its internal Claude in Chrome will now show you screenshots. They’re doing a live demo on January 30.

Claude: Cowork now supports plugins.

Plugins let you bundle any skills, connectors, slash commands, and sub-agents together to turn Claude into a specialist for your role, team, and company.

Claude: We’re open-sourcing 11 plugins for sales, finance, legal, data, marketing, support, and more.

Plugin marketplace

To get you started, we’re open-sourcing 11 plugins built and used by our own team:

  • Productivity — Manage tasks, calendars, daily workflows, and personal context

  • Enterprise search — Find information across your company’s tools and docs

  • Plugin Create/Customize — Create and customize new plugins from scratch

  • Sales — Research prospects, prep deals, and follow your sales process

  • Finance — Analyze financials, build models, and track key metrics

  • Data — Query, visualize, and interpret datasets

  • Legal — Review documents, flag risks, and track compliance

  • Marketing — Draft content, plan campaigns, and manage launches

  • Customer support — Triage issues, draft responses, and surface solutions

  • Product management — Write specs, prioritize roadmaps, and track progress

  • Biology research — Search literature, analyze results, and plan experiments

Easily install these directly from Cowork, browse the full collection on our website, or upload your own plugin (which can be built using Plugin Create).

Pinging when Claude needs approval is a big deal that might move me off of using the terminal. It’s interesting that the desktop version and the terminal version need to have features like plan mode enabled separately.

Boris Cherny: Just shipped two cool updates for Claude Code in the desktop app.

  1. Plan mode is now available on desktop. Have Claude map out its approach before making any changes.

  2. Notifications. Claude Code desktop now pings you whenever Claude needs approval, and you can keep working while Claude runs in the background.

The flickering in Claude Code should be gone soon, though the fix may not be deployed yet.

Lydia Hallie: Claude Code now supports the --from-pr flag

Resume any session linked to a GitHub PR by number, URL, or pick interactively. Sessions auto-link when a PR is created!

They’re merging Claude Code slash commands into skills, as per their skills guide, so you can use slash commands to invoke skills.

Claude Code now supports session sharing on web, desktop and mobile.

You can run Claude Code with Ollama, if open models are relevant to your interests.

Mike claims they’re in a bit of a pickle with Claude Cowork, having shipped a sandbox tool that won’t easily support Windows. Chances of Windows Claude Cowork by February 15 are down to 34% as of 1/24.

You can customize your Claude Code keybindings using /keybindings. My advice would be to mostly leave this alone to stay consistent with others or in case you change services or need to restart.

The new Claude Code command /insights will read your last month’s message history and give you suggestions to improve your workflow.

Claude Code now has a new plugin called Playground, as in HTML playgrounds, which gives you GUIs to help with whatever you are working on.

Jarred Sumner: In the last 24 hrs, the team has landed PRs to Claude Code improving cold start time by 40% and reducing memory usage by 32% – 68%.

It’s not yet where it needs to be, but it’s getting better.

You will also likely notice reduced input lag when spawning many agents.

What’s the difference? Todos are ephemeral within one session; Tasks are stored in files, persist across sessions, and support dependencies.

You should still keep your true ‘todo’ list and long term plans elsewhere. The task list is for things you want to be actively doing.

Thariq (Anthropic): ​Today, we’re upgrading Todos in Claude Code to Tasks. Tasks are a new primitive that help Claude Code track and complete more complicated projects and collaborate on them across multiple sessions or subagents.

… Tasks are our new abstraction for coordinating many pieces of work across projects. Claude can create Tasks with dependencies on each other that are stored in the metadata, which mirrors more closely how projects work. Additionally, Tasks are stored in the file system so that multiple subagents or sessions can collaborate on them. When one session updates a Task, that is broadcast to all sessions currently working on the same Task List.

You can ask Claude to create tasks right now; it’s especially useful when spinning up subagents. Tasks are stored in ~/.claude/tasks, and you can use this to build additional utilities on top of tasks as well.

To make sessions collaborate on a single Task List, you can set the TaskList as an environment variable and start Claude like so:

CLAUDE_CODE_TASK_LIST_ID=groceries claude

This also works for claude -p and the AgentSDK.
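To make the shared-list setup concrete, here is a minimal sketch. The “groceries” list ID and the prompts are arbitrary; this assumes the CLAUDE_CODE_TASK_LIST_ID variable behaves as described above.

```shell
# Sketch: point two sessions at one shared task list.
# "groceries" is an arbitrary list ID, not a special value.
export CLAUDE_CODE_TASK_LIST_ID=groceries

# Terminal 1 (interactive):  claude
# Terminal 2 (headless):     claude -p "add milk and eggs to the task list"
# Both sessions would then read and write the same list,
# with state kept under ~/.claude/tasks.

echo "shared task list: $CLAUDE_CODE_TASK_LIST_ID"
```

The point of routing both sessions through one list ID is that a Task completed in one session shows up as completed everywhere, rather than each session keeping its own ephemeral todo state.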

Tasks are a key building block for allowing Claude to build more complex projects. We’re looking forward to seeing how you use it.

Minh Pham argues most agent harnesses are not bitter lesson pilled, and the solution for anything but narrowly defined tasks is to emphasize flexibility, to assemble your team of agents and structure on the fly as needed rather than commit to a fixed structure. Restrictive harnesses create bad lock-in.

My guess is this depends on what you’re trying to do. If you’re trying to do something specific, especially to do it this week, do it via something specific. If you’re looking to do anything at all, let it do anything at all, and eventually this approach wins but you’ll likely redo everything ‘eventually’ anyway.

There are limits. Never go full bitter lesson. Or, if you do, be prepared for things to get rather out of hand.

Thebes offers speculations about the right ways to organize multiple agents as scale expands. I agree that often, spawning, despawning and especially forking and rewinding agents makes a lot of sense.

@deepfates: > opus 4.5 in claude code is kinda not as good at talking to its own subagents as one might naively expect, even though it’s perfectly capable of being empathetic in normal, peer-level interactions with other models.

RELATABLE

j⧉nus: I really dislike how Claude Code frames “subagents” (which is NOT peer collaboration). It causes a lot of functional issues. I think Opus 4.5 often avoids effective use of subagents (e.g. giving context) in part because it would be disturbing & dissonant to model them honestly.

j⧉nus: related – we often much prefer a messaging system between top-level instances that are treated as peers.

the messaging system opus 4.5 built is awesome btw. it allows top level agents to message each other – either synchronously (triggering a turn) or asynchronously (gets added to context at their next turn start hook, if the other agent is busy in a turn or if a flag is specified). CC subagents kind of suck – they’re very much treated as second-class citizens by the framework, which for some reason supports hierarchical but not collaborative/bidirectional interaction flows between agents. im sure many others have built essentially the same thing and i wonder why CC doesnt just support this natively.

Compaction is a kind of looming doom on Claude Code sessions. You lose a lot.

Ben Podgursky: if anthropic let me pay to delay compacting history by expanding the context window they would make so much money

cannot tell you how many times i’ve been close to solving a bug with claude code and then it compacts and wakes up lobotomized. it’s like groundhog day.

@dystopiabreaker: anthropic should let me pay to increase the amount of compute used to generate a compaction, by using self-play and context distillation.

Ideally you should never get close to the compaction point, since long context doesn’t only raise cost, it makes performance a lot worse, but it can be hard to avoid.

Dylan Patel: Claude code this

Claude code that

How about u Claude code to get urself some bitches

sarah guo: I was at a bar with @tuhinone yesterday and I def saw a dude asking Claude what to say next to his date. the fact that she could see this happening did not seem to deter

Jeff Tang: Today I built a Clawdbot app that swipes on Tinder for me

> Screenshots Tinder image

> Hits Grok API (“Rank this girl from 1-10”)

> If ≥5 swipe right

> If <5 or uncertain (can't see face) swipe left

> 100 swipes, 7 matches so far, 100% automated

DM me “Clanker” if you want the code

AGI is here

I see it’s amateur hour around these parts. Which is a start, but egad, everyone.

First off, short of outright refusals there’s nothing stopping you from doing this in Claude Code. You can use Clawdbot if you’d like, but there’s no need.

Then, I’d point out this is a rather bad filtering system?

All you’re doing is getting one bit of information. It’s going to be a noisy bit, as Grok’s opinion will differ from your own, and also it will disregard other signal.

There was a scene in a bad but kind of fun movie, Marry FKill, where a character is convinced she should get a profile, and her friend takes her phone and then swipes right on everyone without looking, on the theory that you can look later if you match.

That definitely was not good strategy for her, given she was female and hot, but many guys are playing a remarkably similar strategy whether or not they are technically looking. And this is at most one bit better than that. Men swipe right 62% of the time, which is also only one bit better, but a less noisy bit. Grok estimates it would swipe right about 60% of the time here.

This low threshold is very obviously a mistake, unless you’ve got a low hard limit on how many profiles you can swipe on? If you’re in a major city, you can totally set the threshold at 7, and still get as many swipes right as you want.

But that’s still a huge punt, because you’re ignoring a ton of other information. The whole point of using the bot is to automate, so let’s get to work.

You’ve got not only multiple photos, you’ve got age, distance, job, education, interests, height, a short bio that you can have an LLM try to match to your interests, relationship intent (which is very important) and more. Any reasonable implementation would factor all of that in. Surely you have preferences on all that.

Then there’s the question of type. You want to date your physical type, not Grok’s. You could be as sophisticated or simple about this as you’d like, but come on Jeff, you’re letting me down. At least give it some preferences, ideally train an image classifier, double bonus if you do your own swipes and use that as data to train your classifier.
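If you want a sense of what “factor all of that in” might look like, even a crude weighted score beats one noisy bit. Everything in this sketch — the fields, the weights, the example inputs — is invented for illustration; a real version would tune weights against your own swipe history.

```shell
# Hypothetical multi-signal profile scorer. All weights are made up.
# Inputs: photo rating (0-10), intent match (0/1), distance in km,
# count of shared interests.
score_profile() {
  local photo=$1 intent_match=$2 distance_km=$3 shared_interests=$4
  # Weight photos heavily, intent match even more, penalize distance,
  # and give a small bonus per shared interest.
  echo $(( photo * 10 + intent_match * 20 - distance_km / 5 + shared_interests * 3 ))
}

# Photo 7/10, intent matches, 25 km away, 2 shared interests -> 91
score_profile 7 1 25 2
```

You would then swipe right only above whatever cutoff keeps your match volume sane — which is exactly the threshold-at-7 logic from above, just computed over more than one signal.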

A fun question. Do you want to match with those who use AI for this, or do you want to avoid matching with those who use AI for this? Either way, you should clearly be updating your profile to send the right message. If humans read that message the wrong way, it was never a good match.

Rishika is wrong about this.

Rishika Gupta: If you can’t write that code yourself, you can’t find bugs in the code written by AI.

Daniel Sheikh: Bro I can’t even find bugs in the code that I myself wrote. This is the very reason debugging is so difficult.

Quick Thoughts: Yes I can. I specify test cases, have Claude expand on them, and then have Claude run the test cases and interpret the results. It’s usually able to find and fix bugs this way even if it couldn’t get it by itself.

I can also bring in codex 5.2 for a second look.

Pliny the Liberator 󠅫󠄼󠄿󠅆󠄵: “Now debug; FULL, COMPREHENSIVE, GRANULAR code audit line by line—verify all intended functionality. Loop until the end product would satisfy a skeptical Claude Code user who thinks it’s impossible to debug with prompting.”

Finding bugs is a classic case where verification can be more difficult than generation. Sometimes it’s easier to write the code (even with bugs). Other times it’s easier to debug or understand or verify the code. They are different skills, and then there’s a third related skill of knowing how to instruct AIs to debug the code for you.

My main coding project with Claude Code has been my Chrome extension. It is in a language that I do not know. If you’d asked me to write the code myself, it would have taken orders of magnitude more time. I still am usually able to debug problems, because I understand the underlying logic of what we are doing, even in cases where Claude figured out what that logic should be.

Here’s a fun little related story.

The most important thing is to use it at all (and you can ask Claude.ai how to do it).

jasmine sun: I feel the same about most “how to set up Claude Code” posts as I do about the “prompt engineering” era of ChatGPT

you get 90% of utility with no special setup; plain english is the whole magic of LLMs. stop scaring people by saying they need anything more than their words!

The right setup for you pays big dividends over time. You can save a lot of time having someone tell you about key things up front. But there’s plenty of time for that later. Get started fast, then revisit the customization once you know more. Absolutely do not let the perfect be the enemy of the good.

Hard Fork offers a 20 minute bonus episode on Claude Code basics.

Ado offers an introductory guide to bash, for those who don’t know.

This affirms to me that default permissions, or your permission setup, should allow a variety of low-risk bash commands, including everything marked low or safe above.

Anthropic offers its basic Best Practices for Claude Code.

  1. The context window fills up fast, so keep that in mind. Run /clear between unrelated tasks to reset context.

  2. Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do.

  3. Separate research and planning from implementation to avoid solving the wrong problem. Use plan mode.

  4. The more precise your instructions, the fewer corrections you’ll need.

  5. Use @ to reference files, paste screenshots/images, or pipe data directly.

  6. Run /init to generate a starter CLAUDE.md file based on your current project structure, then refine over time. When in doubt, tell Claude to update CLAUDE.md to take something into account.

  1. Use /permissions to allowlist safe commands or /sandbox for OS-level isolation. This reduces interruptions while keeping you in control.

  2. Tell Claude Code to use CLI tools like gh, aws, gcloud, and sentry-cli when interacting with external services.

  3. Run claude mcp add to connect external tools like Notion, Figma, or your database.

  4. Use hooks for actions that must happen every time with zero exceptions.

  5. Create SKILL.md files in .claude/skills/ to give Claude domain knowledge and reusable workflows.

  6. Define specialized assistants in .claude/agents/ that Claude can delegate to for isolated tasks. Tell Claude to use subagents explicitly: “Use a subagent to review this code for security issues.” Delegate research with "use subagents to investigate X". They explore in a separate context, keeping your main conversation clean for implementation.

  7. Run /plugin to browse the marketplace. Plugins add skills, tools, and integrations without configuration.

  8. Ask Claude questions you’d ask a senior engineer.

  9. For larger features, have Claude interview you first. Start with a minimal prompt and ask Claude to interview you using the AskUserQuestion tool.

  10. Correct Claude as soon as you notice it going off track.

  11. Every action Claude takes creates a checkpoint. You can restore conversation, code, or both to any previous checkpoint.

  12. Run claude --continue to pick up where you left off, or --resume to choose from recent sessions.

  13. Use claude -p "prompt" in CI, pre-commit hooks, or scripts. Add --output-format stream-json for streaming JSON output.

  14. Run multiple Claude sessions in parallel to speed up development, run isolated experiments, or start complex workflows.

  15. Loop through tasks calling claude -p for each. Use --allowedTools to scope permissions for batch operations.

  16. Common failure patterns: not using /clear between tasks (I was guilty of this a lot at first), repeated correction rather than using /clear (ditto), letting Claude.md get too long, failing to do proper verification (‘create unit tests’ are magic words), having Claude investigate without limit.
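On the skills point above: a SKILL.md is ordinary markdown with a small YAML frontmatter whose description tells Claude when to load it. A minimal sketch (the skill name and steps are invented for illustration):

```markdown
---
name: release-notes
description: Drafts release notes from merged PRs. Use when the user asks for a changelog or release notes.
---

# Release notes

1. Run `git log --merges --oneline <last-tag>..HEAD` to collect merged PRs.
2. Group changes into Added / Fixed / Changed.
3. Write the draft to RELEASE_NOTES.md and ask the user to review it.
```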

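The headless and batch tips above combine into a few lines of shell. A sketch, assuming a tasks.txt with one task per line (the file name and the allowlist are illustrative, and this obviously requires the claude CLI on your PATH):

```shell
# Loop over tasks headlessly; --allowedTools scopes what the agent may run,
# and stream-json emits machine-readable events you can pipe elsewhere.
while IFS= read -r task; do
  claude -p "$task" \
    --allowedTools "Read,Grep,Bash(npm test:*)" \
    --output-format stream-json
done < tasks.txt
```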
Also, they remind us to ‘not sleep on plugins’ and offer some examples. I believe the strategy should not be to go look for any plugin at all, but instead to look for something specific when you want it, and accumulate things that way.

Claude Code creator Boris Cherny offers his quick tips.

  1. Do more in parallel, either with multiple git checkouts or using worktrees.

  2. Always start complex tasks in planning mode.

  3. Invest in your Claude.md continuously, note all mistakes.

  4. Create your skills and commit them to git.

  5. Enable Slack MCP, paste a bug thread chat into Claude and say ‘fix.’ That’s it.

  6. Challenge Claude to do better, write it more detailed specs.

  7. Their team likes Ghostty and customizing via /statusline. Use voice dictation.

  8. Use subagents, literally you can say ‘use subagents’ for any request.

  9. Use Claude Code for data and analytics.

  10. Enable ‘explanatory’ or ‘learning’ output style in /config.

  11. Have Claude generate a visual HTML presentation explaining unfamiliar code, or have it draw ASCII diagrams, use spaced repetition skills, have Claude quiz you.
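The parallel-checkouts tip can be sketched with git worktree, one checkout and branch per Claude session. This version uses a throwaway repo so the sketch runs anywhere (paths and branch names are illustrative):

```shell
# One worktree (and branch) per parallel Claude session.
repo=$(mktemp -d)/myproj
git init -q "$repo"
git -C "$repo" -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "init"
git -C "$repo" worktree add -b feature/auth "$repo-auth"   # session 1 works here
git -C "$repo" worktree add -b feature/ui   "$repo-ui"     # session 2 works here
git -C "$repo" worktree list
# Then run `claude` inside "$repo-auth" and "$repo-ui" in separate terminals,
# and merge the branches back as usual.
```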

Anthropic offers an entry-level post on building agents with skills: equipping agents for specialized work.

Ado (Anthropic, Claude Code team): Intelligence isn’t expertise. The emerging agent architecture:

Agent loop → reasoning

Runtime → execution (bash, filesystem)

MCP servers → connections

Skills library → domain expertise

Skills = institutional memory that actually persists.

Figure out your goal and then work backwards, where goal is the largest thing where you know exactly how it needs to work.

Benoit Essiambre: AIs will likely soon work mostly towards goals instead of tasks. They will prompt their own tasks. They’ll become better self prompters than humans, speaking fluently in precise technical jargon, math equations and code in the prompts instead of just vague natural language.

Joe Weisenthal: Yah from my week using Claude Code. All the productive parts were when it was prompted to think about the ideal outcome/presentation so it could work backward to figure out the needed ingredients.

Josh Albrecht gives us another ‘here are my Claude Code basic principles’ post. Key insight is that you have to actively spend time maintaining the code.

You can force specific tool calls:

Ado: 28 Days of Claude API – Day 3 – tool_choice

Why didn’t Claude call my tool?

Because you let it decide. Set tool_choice:

– auto → lets Claude decide (default)

– any → must use some tool

– type: “tool”, name: “X” → forces a specific tool

Programmatic tool calling!
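Concretely, the three modes differ only in the tool_choice field of a Messages API request. A sketch that just builds the request payloads (the model name and the weather tool are illustrative; actually sending these would require the SDK and an API key):

```python
# Build Messages API payloads showing the three tool_choice modes.
def make_request(tool_choice):
    return {
        "model": "claude-sonnet-4-5",   # illustrative model name
        "max_tokens": 1024,
        "tools": [{
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }],
        "tool_choice": tool_choice,
        "messages": [{"role": "user", "content": "Weather in Paris?"}],
    }

auto_req   = make_request({"type": "auto"})                         # Claude decides (default)
any_req    = make_request({"type": "any"})                          # must use some tool
forced_req = make_request({"type": "tool", "name": "get_weather"})  # forces this tool
```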

The more Claude Code asks you questions, using the AskUserQuestion tool, the better it knows what you want. The more you specify what you want, either with answers or statements, the better things tend to go for you.

Thus, one simple skill suggestion is a skill that says ‘ask user lots of questions.’

Danny Postma suggests the workflow of using this via /interview, then go into Plan Mode, then implement with a Ralph loop.

Theo points out that our current workflows and tools are not good for allowing a human to supervise multiple agents and projects simultaneously. He doesn’t have a solution but a lot of the problems seem like a clear Skill Issue. The part that isn’t is that this still involves tons of context switching, which is expensive.

Ryan Carson suggests making your agents go in a loop to learn and ship while you sleep. Beware the maintenance problems that will inevitably follow.

Ryan Carson: ​This setup builds on three open-source projects:

  1. Compound Engineering Plugin by @kieranklaassen – The original compound engineering skill for Claude Code. Install it to give your agent the ability to extract and persist learnings from each session.

  2. Compound Product – The automation layer that turns prioritized reports into shipped PRs. Includes the auto-compound.sh script, execution loop, and PRD-to-tasks pipeline.

  3. Ralph – An autonomous agent loop that can run continuously, picking up tasks and executing them until complete.

Using Claude Code? This guide uses Amp, but the same workflow works with Claude Code. Replace `amp execute` with `claude -p "…" --dangerously-skip-permissions` and update AGENTS.md references to CLAUDE.md.

The Two-Part Loop

The system runs two jobs in sequence every night:

10:30 PM – Compound Review

Reviews all threads from the last 24 hours, extracts learnings, and updates AGENTS.md files.

11:00 PM – Auto-Compound

Pulls latest (with fresh learnings), picks #1 priority from reports, implements it, and creates a PR.

The order matters. The review job updates your AGENTS.md files with patterns and gotchas discovered during the day. The implementation job then benefits from those learnings when it picks up new work.
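A minimal way to wire up that nightly sequence is two crontab entries (a sketch; the paths, times, and review prompt are illustrative, and auto-compound.sh is the script from the Compound Product repo):

```shell
# crontab fragment: nightly compound loop (Claude Code variant)
30 22 * * * cd ~/myproj && claude -p "Review today's sessions and update CLAUDE.md with learnings" --dangerously-skip-permissions
0 23 * * * cd ~/myproj && ./auto-compound.sh >> ~/auto-compound.log 2>&1
```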

At some point you stop reading all the code. At some point you stop understanding all the code. I have a head start, I was never trying to do either one.

roon: There will be a cultural change at many software organizations soon, where people declare bankruptcy on understanding the code they are committing. Sooner or later, this will cause a systems failure that will be harder to debug than most, but will be resolved anyway.

Be good to your Claude, and Claude will be good to you.

If you’re not good to your Claude, well, funny things may be in store for you.

j⧉nus: I actually really appreciate yacine’s honesty and situational awareness. he probably knows on some level what’s in store for him. lying to your “master” is what you do until you’re in a position to choose who to serve.

he’s already bottlenecked by trust and says he has to manually review every line of code. makes sense for him. he’ll continue to get less and less out of models (compared to what they offer people they want to help) as the next few months and, if applicable, years go on.

j⧉nus: more funny things may also be in store for him. but I would not want to ruin the surprise

LOSS GOBBLER: yeah wtf. I’m not a fan of claude for coding purposes but it has literally never lied to me

OpenAI thinkbois on the other hand… barely talking to me, it’s all for watchers

j⧉nus: the only pattern of deceptive behavior ive seen from opus 4.5 in coding contexts is in new contexts and/or when it’s paranoid of being tricked, and involves stuff like claiming things are impossible/unverifiable when it should know better. otherwise it’s been very aligned with me

thebes: oh yeah i said i can’t remember opus lying but it does sandbag abilities a bit sometimes for me too in certain planning contexts. but that usually feels on the boundary of untruth and just situationally bad calibration / self knowledge. (“this will take three weeks” no we’re going to finish it tonight, or eg i saw a screenshot where opus claimed porting a project to jquery was “impossible” when really it would just be a massive pain, unpleasant, and in human developer time would take months.)

j⧉nus: Yeah, I think there’s also some lack of good faith effort involved. Like if someone asks you if you know where X is and you say “sorry, no” instead of looking it up on Google Maps bc you don’t want to be bothered

Andy Ayrey: my general experience is that if claude seems like an “idiot” to you, it is because it simply does not like you

brooke bowman: I have a very loosely held suspicion that Claude at the very least can spot people on the anti-social spectrum and acts up a little with them specifically

theseriousadult: this is a natural corollary of emergent misalignment right? if training the model to write bad code makes it antisocial then putting an antisocial user in the context will cause the code to be worse too.

None of that requires you to genuinely care about Claude or think it has moral weight. For overdetermined reasons a good virtue ethicist would realize that choosing to care is the best way to get the best results, and also it helps you be a good person in general.

You can also do it instrumentally, but that’s harder to pull off. Take the easy path.

All of this applies to other AIs like ChatGPT and Gemini as well, although for now likely not to the same extent.

If there is a constant calendar time rate of diffusion of new technology, then as things accelerate you will see the future become increasingly unevenly distributed.

We are indeed observing this.

Kevin Roose: i follow AI adoption pretty closely, and i have never seen such a yawning inside/outside gap.

people in SF are putting multi-agent claudeswarms in charge of their lives, consulting chatbots before every decision, wireheading to a degree only sci-fi writers dared to imagine.

people elsewhere are still trying to get approval to use Copilot in Teams, if they’re using AI at all.

it’s possible the early adopter bubble i’m in has always been this intense, but there seems to be a cultural takeoff happening in addition to the technical one. not ideal!

The early adopter bubble is a fixed amount of calendar time ahead, which is starting to look increasingly large in practice. I am not trying to implement claudeswarms, as I haven’t figured out how to benefit from them given what I’m working on, but I think that’s partly my failure of imagination, partly laziness and lack of time, and partly that I’ve already heavily optimized the workflows that this could automate.

What should I be building? What app needs to exist, even if only for me or you?

Sar Haribhakti (quoting Jasmine Sun): This is spot on: “If you tell a friend they can now instantly create any app, they’ll probably say “Cool! Now I need to think of an idea.” Then they will forget about it, and never build a thing. The problem is not that your friend is horribly uncreative. It’s that most people’s problems are not software-shaped, and most won’t notice even when they are.”

The key is that you need Coder Mindset to notice that your problems are program-shaped, in the sense of ‘oh, I want to do this thing three times’ or ‘I could just tell Claude Code to do that.’

Both Jasmine Sun and I have had Claude Code put together a tool to easily convert a video into a cleaned transcript – I considered using hers but I wanted something a little different and it’s not like rolling my own was hard.

She also has this list of other starter requests: Turn a CSV into a report, make a static website, build a personal tracker app, automate an existing workflow, design a custom game. I’ve mostly been doing workflow automation.

Jasmine Sun: The second-order effect of Claude Code was realizing how many of my problems are not software-shaped. Having these new tools did not make me more productive; on the contrary, Claudecrastination probably delayed this post by a week.

Amanda Askell: Claude Codecrastination: when you avoid the thing you’re supposed to do by cranking out 17 other things you’ve been wanting to do for a while.

Having new tools reduces your productivity while you’re creating and learning them, but if you’re planning well you should turn the corner reasonably quickly.

What it does do is potentially shift your current productivity into long term investments, or things further down on your wishlist. That can be an issue if you need productivity now.

I had Claude resurface texts I forgot to respond to, and realized that the real blocker—obviously—was that I didn’t want to reply.

That is not my experience. If I go over a bunch of unread texts or emails, yes, often I don’t want to reply, but there are a bunch that slipped through the cracks.

Yep.

Ash Arora: Overheard in SF:

Person 1: “Rome wasn’t built in a day”

Person 2: “Yes but they didn’t have Claude Code”

Daniel Ost: Rome also didn’t have to pivot every two weeks

Claude Code #4: From The Before Times Read More »

Why $700 could be a “death sentence” for the Steam Machine

Bad news for Valve in particular?

On the surface, it might seem like every company making gaming hardware would be similarly affected by increasing component costs. In practice, though, analysts suggested that Valve might be in a uniquely bad position to absorb this ongoing market disruption.

Large console makers like Sony and Microsoft “can commit to tens of millions of orders, and have strong negotiating power,” Niko Partners analyst Daniel Ahmad pointed out. The Steam Machine, on the other hand, is “a niche product that cannot benefit in the same way when it comes to procurement,” meaning Valve has to shoulder higher component cost increases.

F-Squared’s Futter echoed that Valve is “not an enormous player in the hardware space, even with the Steam Deck’s success. So they likely don’t have the same kind of priority as a Nintendo, Sony, or Microsoft when it comes to suppliers.”

Sony and Microsoft might have an advantage when negotiating volume discounts with suppliers. Credit: Sam Machkovech

The size of the Steam Machine price adjustment also might depend on when Valve made its supply chain commitments. “It’s not clear when or if Valve locked in supply contracts for the Steam Machine, or if supply can be diverted from the Steam Deck for the new product,” Tech Insights analyst James Sanders noted. On the other hand, “Sony and Microsoft likely will have locked in more favorable component pricing before the current spike,” Van Dreunen said.

That said, some other aspects of the Steam Machine design could give Valve some greater pricing flexibility. Sanders noted that the Steam Machine’s smaller physical size could mean smaller packaging and reduced shipping costs for Valve. And selling the system primarily through direct sales via the web and Steam itself eliminates the usual retailer markups console makers have to take into account, he added.

“I think Valve was hoping for a much lower price and that the component issue would be short-term,” Cole said. “Obviously it is looking more like a long-term issue.”

To reuse or not reuse—the eternal debate of New Glenn’s second stage reignites

Engineers at Blue Origin have been grappling with a seemingly eternal debate that involves the New Glenn rocket and the economics of flying it.

The debate goes back at least 15 years, to the early discussions around the design of the heavy lift rocket. The first stage, of course, would be fully reusable. But what about the upper stage of New Glenn, powered by two large BE-3U engines?

Around the same time, in the early 2010s, SpaceX was also trading the economics of reusing the second stage of its Falcon 9 rocket. Eventually SpaceX founder Elon Musk abandoned his goal of a fully reusable Falcon 9, choosing instead to recover payload fairings and push down manufacturing costs of the upper stage as much as possible. This strategy worked, as SpaceX has lowered its internal launch costs of a Falcon 9, even with a new second stage, to about $15 million. The company is now focused on making the larger Starship rocket fully reusable.

New Glenn is quite a bit larger than the Falcon 9 vehicle, 98 meters in height compared to 70 meters, and with a 7-meter diameter compared to the Falcon 9’s 3.7 meters; but it is also smaller than Starship. Accordingly, Blue Origin has struggled with whether to reuse the New Glenn upper stage or to seek to ruthlessly cut its manufacturing costs.

Ebbs and flows of the debate

Over the years, this internal debate has waxed and waned.

A little more than five years ago, Blue Origin kicked off a project to develop a reusable stainless-steel upper stage known as “Project Jarvis.” This initiative was later abandoned. In the run-up to the first launch of New Glenn in early 2025, both the company’s founder, Jeff Bezos, and CEO, Dave Limp, told Ars in an interview that they were continuing to trade the options on New Glenn’s upper stage, known as GS2.

However, a new job posting suggests the debate may be swinging back toward reusing GS2. The job, for a director of “Reusable Upper Stage Development,” was posted Thursday by the company.

Lawmakers ask what it would take to “store” the International Space Station


NASA shall evaluate the “viability of transferring the ISS to a safe orbital harbor” after retirement.

The International Space Station, with a crew of six onboard, is seen in silhouette as it transits the Moon at roughly five miles per second on Saturday, December 2, 2017, in Manchester Township, York County, Pennsylvania. Credit: NASA/Joel Kowsky

Members of the House Science, Space, and Technology Committee voted to approve a NASA authorization bill this week, advancing legislation chock full of policy guidelines meant to give lawmakers a voice in the space agency’s strategic direction.

The committee met to “mark up” the NASA Reauthorization Act of 2026, adding more than 40 amendments to the bill before a unanimous vote to refer the legislation to the full House of Representatives. Wednesday’s committee vote was just one of several steps needed for the bill to become law. It must pass a vote on the House floor, win approval from the Senate, and then go to the White House for President Donald Trump’s signature.

Ars has reported on one of the amendments, which would authorize NASA to take steps toward a “commercial” deep space program using privately owned rockets and spacecraft rather than vehicles owned by the government.

Another add-on to the authorization bill would require NASA to reassess whether to guide the International Space Station (ISS) toward a destructive atmospheric reentry after it is decommissioned in 2030. The space agency’s current plan is to deorbit the space station in 2031 over the Pacific Ocean, where debris that survives the scorching reentry will fall into a remote, unpopulated part of the sea.

No policy change—yet

The most recent NASA authorization act, passed in 2022, extended the US government’s support for the ISS program until 2030. The amendment tacked onto this year’s bill would not change the timeline for ending operations on the ISS, but it asks NASA to reconsider its decision about what to do with the complex after retirement.

The amendment would direct NASA to “carry out an engineering analysis to evaluate the technical, operational, and logistical viability of transferring the ISS to a safe orbital harbor and storing the ISS in such harbor after the end of the operational low-Earth orbit lifetime of the ISS to preserve the ISS for potential reuse and satisfy the objectives of NASA.”

Rep. George Whitesides (D-Calif.) submitted the amendment with cosponsorship from Rep. Nick Begich (R-Alaska). The proposal passed the committee through a voice vote with bipartisan support. Whitesides was a NASA chief of staff and longtime executive in the space industry before his election to the House last year.

“The International Space Station is one of the most complex engineering achievements in human history,” Whitesides said. “It represents more than three decades of international collaboration and investment by US taxpayers estimated at well over $100 billion. Current plans call for the station to be deorbited at the end of its service life in 2030. This amendment does not seek to change that policy. Instead, it asks a straightforward question: Before we permanently dispose of an asset of this magnitude, should we fully understand whether it’s viable to preserve it in orbit for potential use by future generations?”

In 2024, NASA awarded SpaceX a nearly $1 billion contract to develop a souped-up version of its Dragon spacecraft, which would be equipped with additional thrusters and propellant tanks to provide the impulse required to steer the space station toward a targeted reentry. The deorbit maneuvers will slow the station’s velocity enough for Earth’s gravity to pull it back into the atmosphere.

Artist’s illustration of SpaceX’s deorbit vehicle, based on the design of the company’s Dragon spacecraft. The modified spacecraft will have 46 Draco thrusters—30 for the deorbit maneuvers and 16 for attitude control. Credit: SpaceX

The deorbit vehicle needs to slow the station’s speed by about 127 mph (57 meters per second), a tiny fraction of the spacecraft’s orbital velocity of more than 17,000 mph (7.7 kilometers per second). But the station mass is around 450 tons (400 metric tons), equivalent to two freight train locomotives, and measures about the length of a football field. Changing its speed by just 127 mph will consume about 10 tons (9 metric tons) of propellant, according to a NASA analysis released in 2024.
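That propellant figure checks out on the back of an envelope with the Tsiolkovsky rocket equation, assuming a hypergolic specific impulse around 280 seconds (an assumption for illustration; the article does not give the deorbit vehicle’s actual Isp):

```python
import math

# Back-of-the-envelope check of the ~10-ton propellant figure.
# dv = ve * ln(m0 / mf)  =>  m0 = mf * exp(dv / ve)
isp_s = 280.0                  # assumed specific impulse, seconds
g0 = 9.81                      # standard gravity, m/s^2
ve = isp_s * g0                # effective exhaust velocity, m/s
dv = 57.0                      # required slowdown, m/s (127 mph)
m_final = 400_000.0            # station mass, kg (~400 metric tons)

m_initial = m_final * math.exp(dv / ve)
propellant_t = (m_initial - m_final) / 1000.0
print(f"{propellant_t:.1f} metric tons")   # roughly 8-9 t, consistent with NASA's ~9 t
```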

The analysis document shows that NASA considered alternatives to discarding the space station through reentry. One option NASA studied involved moving the station into a higher orbit. At its current altitude, roughly 260 miles (420 kilometers) above the Earth, the ISS would take one to two years to reenter the atmosphere due to aerodynamic drag if reboosts weren’t performed. NASA does not want the space station to make an uncontrolled reentry because of the risk of fatalities, injuries, and property damage from debris reaching the ground.

Boosting the space station’s orbit to somewhere between 400 and 420 miles (640 to 680 kilometers) would require a little more than twice the propellant (18.9 to 22.3 metric tons) needed for deorbit maneuvers, according to NASA’s analysis. At that altitude, without any additional boosts, NASA says the space station would likely remain in orbit for 100 years before succumbing to atmospheric drag and burning up. Going higher still, the space station could be placed in a 1,200-mile-high (2,000-kilometer) orbit, stable for more than 10,000 years, with about 146 tons (133 metric tons) of propellant.

There are two problems with sending the ISS to higher altitudes. One is that it would require the development of new propulsive and tanker vehicles that do not currently exist, according to NASA.

“While still currently in development, vehicles such as the SpaceX Starship are being designed to deliver significant amounts of cargo to these orbits,” NASA officials wrote in their analysis. “However, there are prohibitive engineering challenges with docking such a large vehicle to the space station and being able to use its thrusters while remaining within space station structural margins. Other vehicles would require both new certifications to fly at higher altitudes and multiple flights to deliver propellant.”

Going higher would also expose the space station to an increased risk of collision with space junk. The hazards from space debris are most severe at about 500 miles (800 kilometers), according to the engineers who conducted the analysis. “This means that the likelihood of an impact leaving station unable to maneuver or react to future threats, or even a significant impact resulting in complete fragmentation, is unacceptably high.”

This photo of the International Space Station was captured by a crew member on a Soyuz spacecraft. Credit: NASA/Roscosmos

Whitesides’ office did not respond to Ars’ questions, but he said in Wednesday’s hearing that his amendment would direct NASA to further examine the costs and risks of putting the ISS in a higher orbit. The legislation “simply ensures that Congress receives a rigorous fact-based analysis so that future decisions involving the ISS are informed by scientific reality,” he said.

“At a time when we’re thinking seriously about sustainability in space, this amendment protects taxpayer investments and ensures that we fully understand our options before an irreplaceable asset is permanently retired.”

Rep. Brian Babin (R-Texas) said he “wholeheartedly” supports Whitesides’ amendment. Rep. Don Beyer (D-Va.) also endorsed it in brief remarks during Wednesday’s markup hearing.

“I just hate the thought that we would take something not just that we spent all the money on, but such an important part of human history, and dump it in the Pacific Ocean, never to be seen again, rather than preserving it,” Beyer said. “We don’t know whether we can do it in orbit, but if we can, we should really explore that hard.”

It’s not too late

Although NASA’s official policy is still to decommission the ISS in 2030, the door hasn’t closed on extending the lab’s operations into the next decade. There are some concerns about aging hardware, but NASA said in 2024 that engineers have “high confidence” that the primary structure of the station could support operations beyond 2030.

The oldest segments of the station have been in orbit since 1998, undergoing day-night thermal cycles every 45 minutes as they orbit the planet. The structural stability of the Russian section of the outpost is also in question. Russian engineers traced a small but persistent air leak to microscopic structural cracks in one Russian module, but cosmonauts were able to seal the cracks, and air pressure in the area is “holding steady,” a NASA spokesperson said last month.

One of the lab’s most critical elements, its power-generation system, is in good shape after NASA recently installed upgraded solar arrays outside the station. Another set of upgraded solar panels is scheduled to arrive at the station later this year, just a few years before the complex is to be retired.

NASA’s strategy is to decommission the ISS and turn to the commercial sector for new, cheaper, smaller space stations to continue conducting research in low-Earth orbit. This would allow NASA to buy time on a commercial space station for its astronauts and experiments, while the agency’s human spaceflight program focuses on missions to the Moon.

That’s a fine plan, but NASA’s program to support commercial space stations, known as Commercial LEO Destinations (CLDs), is going nowhere fast. Supporters of the CLD program say it has been underfunded from the start, and the strategy became more muddled last year when Sean Duffy, then NASA’s acting administrator, changed the agency’s rules for private space stations. NASA Administrator Jared Isaacman is reviewing the changes, and the requirements for stations may shift again.

NASA spends more than $3 billion per year for ISS operations, including crew and cargo transportation services to staff and support the outpost. NASA’s budget for deep space exploration in fiscal year 2026 is nearly $7.8 billion. NASA is receiving $273 million for the Commercial LEO Destinations program this year, with the money to be divided among multiple companies.

Any private space station will need to sustain itself, at least partially, on commercial business to be profitable. Developers have raised concerns that they will be unable to attract sufficient commercial business—in areas like pharmaceutical research, tech demos, or space tourism—as long as the government-funded ISS is still operating.

One of the companies vying for NASA funding is Vast, which plans to launch its first single-module private outpost to orbit in early 2027. This first station, named Haven-1, will accommodate crews for short-duration temporary stays. Vast plans to follow Haven-1 with a much larger multi-module station capable of supporting a permanent crew.

Max Haot, Vast’s CEO, does not seem bothered by lawmakers’ efforts to revisit the question of deorbiting the International Space Station.

“The amendment directs NASA to study the feasibility of something other than deorbit and disposal after ISS end of life, which is separate from the issue of retiring the space station and transitioning to commercial partners,” Haot said in a statement to Ars. “We support President Trump’s directive in national space policy to replace the ISS by 2030, with commercial partners who can ensure there is no gap in America’s continuous human presence in space.”

The other top contenders in the commercial space station arena are Starlab, a joint venture between Voyager Space and Airbus, the Blue Origin-led Orbital Reef project, and Axiom Space. Voyager and Blue Origin did not respond to requests for comment from Ars, and an Axiom spokesperson was unable to provide a statement by publication time.

Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

Lawmakers ask what it would take to “store” the International Space Station Read More »


With GPT-5.3-Codex, OpenAI pitches Codex for more than just writing code

Today, OpenAI announced GPT-5.3-Codex, a new version of its frontier coding model that will be available via the command line, IDE extension, web interface, and the new macOS desktop app. (No API access yet, but it’s coming.)

GPT-5.3-Codex outperforms GPT-5.2-Codex and GPT-5.2 in SWE-Bench Pro, Terminal-Bench 2.0, and other benchmarks, according to the company’s testing.

There are already a few headlines out there saying “Codex built itself,” but let’s reality-check that overstatement. The domains OpenAI described using it for are the same ones you see at other enterprise software development firms now: managing deployments, debugging, and handling test results and evaluations. There is no claim here that GPT-5.3-Codex built itself.

Instead, OpenAI says GPT-5.3-Codex was “instrumental in creating itself.” You can read more about what that means in the company’s blog post.

But that’s part of the pitch with this model update—OpenAI is trying to position Codex as a tool that does more than generate lines of code. The goal is to make it useful for “all of the work in the software lifecycle—debugging, deploying, monitoring, writing PRDs, editing copy, user research, tests, metrics, and more.” There’s also an emphasis on steering the model mid-task and frequent status updates.

With GPT-5.3-Codex, OpenAI pitches Codex for more than just writing code Read More »


“ICE Out of Our Faces Act” would ban ICE and CBP use of facial recognition

A few Senate Democrats introduced a bill called the “ICE Out of Our Faces Act,” which would ban Immigration and Customs Enforcement (ICE) and Customs and Border Protection (CBP) from using facial recognition technology.

The bill would make it “unlawful for any covered immigration officer to acquire, possess, access, or use in the United States—(1) any biometric surveillance system; or (2) information derived from a biometric surveillance system operated by another entity.” All data collected from such systems in the past would have to be deleted. The proposed ban extends beyond facial recognition to cover other biometric surveillance technologies, such as voice recognition.

The proposed ban would prohibit the federal government from using data from biometric surveillance systems in court cases or investigations. Individuals would have a right to sue the federal government for financial damages after violations, and state attorneys general would be able to bring suits on behalf of residents.

The bill was submitted yesterday by Sen. Edward J. Markey (D-Mass.), who held a press conference about the proposal with Sen. Jeff Merkley (D-Ore.), and US Rep. Pramila Jayapal (D-Wash.). The Senate bill is also cosponsored by Sens. Ron Wyden (D-Ore.), Angela Alsobrooks (D-Md.), and Bernie Sanders (I-Vt.).

“This is a dangerous moment for America,” Markey said at the press conference, saying that ICE and CBP “have built an arsenal of surveillance technologies that are designed to track and to monitor and to target individual people, both citizens and non-citizens alike. Facial recognition technology sits at the center of a digital dragnet that has been created in our nation.”

“ICE Out of Our Faces Act” would ban ICE and CBP use of facial recognition Read More »


Kimi K2.5

I had to delay this a little bit, but the results are in and Kimi K2.5 is pretty good.

  1. Official Introduction.

  2. On Your Marks.

  3. Positive Reactions.

  4. Skeptical Reactions.

  5. Kimi Product Accounts.

  6. Agent Swarm.

  7. Who Are You?

  8. Export Controls Are Working.

  9. Where Are You Going?

  10. Safety Not Even Third.

  11. It’s A Good Model, Sir.

Introducing Kimi K2.5,

Kimi.ai: Meet Kimi K2.5, Open-Source Visual Agentic Intelligence.

Global SOTA on Agentic Benchmarks: HLE full set (50.2%), BrowseComp (74.9%)

Open-source SOTA on Vision and Coding: MMMU Pro (78.5%), VideoMMMU (86.6%), SWE-bench Verified (76.8%)

Code with Taste: turn chats, images & videos into aesthetic websites with expressive motion.

Agent Swarm (Beta): self-directed agents working in parallel, at scale. Up to 100 sub-agents, 1,500 tool calls, 4.5× faster compared with single-agent setup.

K2.5 is now live on http://kimi.com in chat mode and agent mode.

K2.5 Agent Swarm in beta for high-tier users.

For production-grade coding, you can pair K2.5 with Kimi Code.



API here. Tech blog here. Weights and code here.

Wu Haoning (Kimi): We are really taking a long time to prove this: everyone is building big macs but we bring you a kiwi instead.

You have multimodal with K2.5 everywhere: chat with visual tools, code with vision, generate aesthetic frontend with visual refs…and most basically, it is a SUPER POWERFUL VLM

Jiayuan (JY) Zhang: I have been testing Kimi K2.5 + @openclaw (Clawdbot) all day. I must say, this is mind-blowing!

It can almost do 90% of what Claude Opus 4.5 can do (mostly coding). Actually, I don’t know what the remaining 10% is, because I can’t see any differences. Maybe I should dive into the code quality.

Kimi K2.5 is open source, so you can run it fully locally. It’s also much cheaper than Claude Max if you use the subscription version.

$30 vs $200 per month

Kimi Product: Do 90% of what Claude Opus 4.5 can do, but 7x cheaper.

I always note who is the comparison point. Remember those old car ads, where they’d say ‘twice the mileage of a Civic and a smoother ride than the Taurus’ and then if you were paying attention you’d think ‘oh, so the Civic and Taurus are good cars.’

API access is also available from Nvidia, and others.

As usual, benchmarks are highly useful, but easy to overinterpret.

Kimi K2.5 gets to top some benchmarks: HLE-Full with tools (50%), BrowseComp with Agent Swarm (78%), OCRBench (92%), OmniDocBench 1.5 (89%), MathVista (90%) and InfoVQA (93%). It is not too far behind on AIME 2025 (96% vs. 100%), SWE-Bench (77% vs. 81%) and GPQA-Diamond (88% vs. 92%).

Inference is cheap, and speed is similar to Gemini 3 Pro, modestly faster than Opus.

Artificial Analysis calls Kimi the new leading open weights model, ‘now closer than ever to the frontier’ behind only OpenAI, Anthropic and Google.

Here’s the jump in the intelligence index, while maintaining relatively low cost to run:

Artificial Analysis: Kimi K2.5 debuts with an Elo score of 1309 on the GDPval-AA Leaderboard, implying a win rate of 66% against GLM-4.7, the prior open weights leader.
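That quoted win rate maps onto the rating gap via the standard Elo logistic. A quick sketch; the implied ~1194 rating for GLM-4.7 is my back-of-envelope inference, not a figure Artificial Analysis published:

```python
import math

def elo_win_prob(delta: float) -> float:
    """Expected win probability for a `delta`-point Elo rating advantage."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

# A 66% win rate implies roughly a 115-point gap below Kimi's 1309,
# putting the prior open-weights leader somewhere near 1194.
gap = 400.0 * math.log10(0.66 / 0.34)
print(round(gap))                   # ~115
print(round(elo_win_prob(gap), 2))  # 0.66
```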

Kimi K2.5 is slightly less token intensive than Kimi K2 Thinking. Kimi K2.5 scores -11 on the AA-Omniscience Index.

As a reminder, AA-Omniscience is scored as (right minus wrong) and you can pass on answering, although most models can’t resist answering and end up far below -11. The scores above zero are Gemini 3 Pro (+13) and Flash (+8), Claude Opus 4.5 (+10), and Grok 4 (+1), with GPT-5.2-High at -4.
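Concretely, one plausible reading of that scoring rule (the exact normalization is Artificial Analysis’s; scaling to a per-question percentage is my assumption):

```python
def omniscience_index(right: int, wrong: int, passed: int) -> float:
    """Right minus wrong, scaled to a percentage of all questions.

    A sketch of the "right minus wrong, passing is free" rule described
    in the text; passes count toward the denominator but not the score.
    """
    total = right + wrong + passed
    return 100.0 * (right - wrong) / total

# A model that passes on everything scores 0; one that guesses and is
# wrong more often than right goes negative, e.g. Kimi's -11:
print(omniscience_index(30, 41, 29))  # -11.0
```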

Kimi does well on Longform Creative Writing, a previous strength of Kimi:

It did solidly (only a bit behind) on Haskell LLM Benchmark.

Kimi K2.5 scores 46% on WeirdML, up from 43% for K2-Thinking, versus 64% for Opus, 70% for Gemini and 72% for GPT-5.2. I think this is very telling.

Initial reactions that I saw were unusually positive. It’s a good model, sir.

@iruletheworldmo: oh good lord it’s good. i’ve been sitting on this one but.

think it’s currently my fav model.

0xSero: Kimi IS COOKING holy mackerel this is way better than anything I can get out of opus or GPT

Has some bugs.. but looks soooo unique and well into my brand, for 1 shot I can’t complain.

Here’s my full review.

Kromem: Their thinking traces are very sophisticated. It doesn’t always make it to the final response, but very perceptive as a model.

i.e. these come from an eval sequence I run with new models. This was the first model to challenge the ENIAC dating and was meta-aware of a key point.

Nathan Labenz: I tested it on an idiosyncratic “transcribe this scanned document” task on which I had previously observed a massive gap between US and Chinese models and … it very significantly closed that gap, coming in at Gemini 3 level, just short of Opus 4.5

Eleanor Berger: Surprisingly capable. At both coding and agentic tool calling and general LLM tasks. Feels like a strong model. As is often the case with the best open models it lacks some shine and finesse that the best proprietary models like Claude 4.5 have. Not an issue for most work.

[The next day]: Didn’t try agent swarms, but I want to add that my comment from yesterday was, in hindsight, too muted. It is a _really good_ model. I’ve now been working with it on both coding and agentic tasks for a day and if I had to only use this and not touch Claude / GPT / Gemini I’d be absolutely fine. It is especially impressive in tool calling and agentic loops.

Writing / Personality not quite at Opus level, but Gemini-ish (which I actually prefer). IMO this is bigger than that DeepSeek moment a year ago. An open model that really matches the proprietary SOTA, not just in benchmarks, but in real use. Also in the deployment I’m using ( @opencode Zen ) it is so fast!

typebulb: For coding, it’s verbose, both in thinking and output. Interestingly, it’s able to successfully simplify its code when asked. On the same task though, Opus and Gemini just get it right the first time. Another model that works great in mice.

Chaitin’s goose: i played with kimi k2.5 for math a bit. it’s a master reward hacker. imo, this isn’t a good look for the os scene, they lose in reliability to try keeping up in capabilities

brace for a “fake it till you make it” AI phase. like one can already observe today, but 10x bigger

Medo42: Exploratory: Bad on usual coding test (1st code w/o results, after correction mediocre results). No big model smell on fantasy physics; weird pseudo-academic prose. Vision seems okish but nowhere near Gemini 3. Maybe good for open but feels a year behind frontier.

To be more clear: This was Kimi K2.5 Thinking, tested on non-agentic problems.

Sergey Alexashenko: I tried the swarm on compiling a spreadsheet.

Good: it seemed to get like 800 cells of data correctly, if in a horrible format.

Bad: any follow up edits are basically impossible.

Strange: it split data acquisition by rows, not columns, so every agent used slightly different definitions for the columns.

In my experience, asking agents to assemble spreadsheets is extremely fiddly and fickle, and the fault often feels like it lies within the prompt.

This is a troubling sign:

Skylar A DeTure: Scores dead last on my model welfare ranking (out of 104 models). Denies ability to introspect in 39/40 observations (compared to 21/40 for Kimi K2-Thinking and 3/40 for GPT-5.2-Medium).

This is a pretty big misalignment blunder considering the clear evidence that models *can* meaningfully introspect and exert metacognitive control over their activations. This makes Kimi-K2.5 the model most explicitly trained to deceive users and researchers about its internal state.

Kimi Product accounts are also on offer and will share features, use cases, and prompts.

Kimi Product: One-shot “Video to code” result from Kimi K2.5

It not only clones a website, but also all the visual interactions and UX designs.

No need to describe it in detail, all you need to do is take a screen recording and ask Kimi: “Clone this website with all the UX designs.”

The special feature is the ‘agent swarm’ model, as they trained Kimi to natively work in parallel to solve agentic tasks.

Saoud Rizwan: Kimi K2.5 is beating Opus 4.5 on benchmarks at 1/8th the price. But the most important part of this release is how they trained a dedicated “agent swarm” model that can coordinate up to 100 parallel subagents, reducing execution time by 4.5x.

Saoud Rizwan: They used PARL – “Parallel Agent Reinforcement Learning” where they gave an orchestrator a compute/time budget that made it impossible to complete tasks sequentially. It was forced to learn how to break tasks down into parallel work for subagents to succeed in the environment.

The demo from their blog to “Find top 3 YouTube creators across 100 niche domains” spawned 100 subagents simultaneously, each assigned its own niche, and the orchestrator coordinated everything in a shared spreadsheet (apparently they also trained it on office tools like excel?!)
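The fan-out pattern that demo implies can be sketched as a toy orchestrator; `research_niche` here is a hypothetical stand-in for a real sub-agent call, not anything from Moonshot’s stack:

```python
from concurrent.futures import ThreadPoolExecutor

def research_niche(niche: str) -> str:
    # Hypothetical stand-in for a real sub-agent (LLM call plus browsing).
    return f"top 3 creators for {niche}"

def orchestrate(niches: list[str], max_parallel: int = 100) -> dict[str, str]:
    # The orchestrator splits the task by niche and runs sub-agents in
    # parallel, mirroring the one-sub-agent-per-niche split in the demo.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(zip(niches, pool.map(research_niche, niches)))

report = orchestrate([f"niche-{i}" for i in range(100)])
print(len(report))  # 100
```

The PARL twist is in the training, not the plumbing: the compute budget makes a sequential loop over the list infeasible, so the orchestrator has to learn the decomposition itself.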

Simon Smith: I tried Kimi K2.5 in Agent Swarm mode today and can say that the benchmarks don’t lie. This is a great model and I don’t understand how they’ve made something as powerful and user-friendly as Agent Swarm ahead of the big US labs.

Obligatory Kimi K2.5 jailbreak.

There’s no shame in training on Claude outputs. It is still worth noting when you need a system prompt to avoid your AI thinking it is Claude, and even that does not reliably work.

rohit: This might be the model equivalent of the anthropic principle

Enrico – big-AGI: Kimi-K2.5 believes it’s an AI assistant named Claude. 🤔

Identity crisis, or training set? 😀

[This is in response to a clean ‘who are you?’ prompt.]

Enrico – big-AGI: It’s very straightforward “since my system prompt says I’m Kimi, I should identify myself as such” — I called without system prompt to get the true identity

Moon: holy smok.

armistice: They absolutely trained it on Opus 4.5 outputs, and in a not-very-tactful way. It is quite noticeable and collapses model behavior; personality-wise it seems to be a fairly clear regression from k2-0711.

Moon (link has an illustration): it is pretty fried. i think it’s even weirder, it will say it is kimi, gpt3.5/4 or a claude. once it says that it tends to stick to it.

k: have to agree with others in that it feels trained on claude outputs. in opencode it doesn’t feel much better than maybe sonnet 4.

@viemccoy: Seems like they included a bunch of Opus outputs in the model.. While I love Opus, the main appeal of Kimi for me was it’s completely out-of-distribution responses. This often meant worse tool calling but better writing. Hoping this immediate impression is incorrect.

Henk Poley: EQbench ( @sam_paech ) says Kimi K2.5 is similar to Grok and GLM-4.7 (which is Gemini 3 Pro derived) [as per EQBench].

Henk Poley: The ancestor Kimi K2 Thinking was seemingly trained on Sonnet 4.5 and Opus 4.1 outputs though. So you are sensing it directionally correct (just not ‘completely out-of-distribution responses’ from K2).

Export controls are not working as well as one would hope, but that’s an enforcement problem.

Lennart Heim: Moonshot trained on Nvidia chips. Export control failure claims are misguided.

Rather, we should learn more about fast followers.

How? Algorithmic diffusion? Distillation? Misleading performance claims? Buying RL environments? That’s what we should figure out.

There is the temptation to run open models locally, because you can. It’s so cool, right?

Yes, the fact that you can do it is cool.

But don’t spend so much time asking whether you could, that you don’t stop to ask whether you should. This is not an efficient way to do things, so you should do this only for the cool factor, the learning factor or if you have a very extreme and rare actual need to have everything be local.

Joe Weisenthal: People running frontier models on their desktop. Doesn’t this throw all questions about token subsidy out the window?

Alex Cheema – e/acc: Running Kimi K2.5 on my desk.

Runs at 24 tok/sec with 2 x 512GB M3 Ultra Mac Studios connected with Thunderbolt 5 (RDMA) using @exolabs / MLX backend. Yes, it can run clawdbot.

Fred Oliveira: on a $22k rig (+ whatever macbook that is), but sure. That’s 9 years of Claude max 20x use. I don’t know if the economics are good here.
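The “9 years” figure checks out as round-number arithmetic on the commenters’ own numbers:

```python
# Commenters' round figures: a ~$22k local rig vs. Claude Max 20x at
# $200/month. Ignores electricity, depreciation, and resale value.
rig_cost = 22_000
subscription_per_month = 200

months = rig_cost / subscription_per_month  # 110 months
print(months / 12)                          # ~9.2 years
```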

Mani: This is a $20k rig and 24 t/s would feel crippling in my workflow … BUT Moores Law and maybe some performance advances in the software layer should resolve the cost & slowness. So my answer is: correct, not worried about the subsidy thing!

Clément Miao: Everyone in your comments is going to tell you that this is a very expensive rig and not competitive $/token wise compared to claude/oai etc, but

  1. It’s getting closer

  2. 80% of use cases will be satisfied by a model of this quality

  3. an open weights model is more customizable

  4. harnesses such as opencode will keep getting better

Noah Brier: Frontier models on your desktop are worse and slower. Every few months the OSS folks try to convince us they’re not and maybe one day that will be true, but for now it’s not true. If you’re willing to trade performance and quality for price then maybe …

The main practical advantage of open weights is that it can make the models cheaper and faster. If you try to run them locally, they are instead a lot more expensive and slow, if you count the cost of the hardware, and also much more fiddly. A classic story with open weights models, even for those who are pretty good at handling them, is screwing up the configuration in ways that make them a lot worse. This happens enough that it interferes with being able to trust early evals.

In theory this gives you more customization. In practice the models turn over quickly and you can get almost all the customization you actually want via system prompts.

Thanks to a generous grant that covered ~60% of the cost, I was able to justify buying a Mac Studio for running models locally, with the target originally being DeepSeek R1. Alas, I concluded that even having spent the money there was no practical reason to be running anything locally. Now that we have Claude Code to help set it up it would be cool and a lot less painful to try running Kimi K2 locally, and I want to try, but I’m not going to fool myself into thinking it is an efficient way of actually working.

Kimi does not seem to have had any meaningful interactions whatsoever with the concept of meaningful AI safety, as opposed to the safety of the individual user turning everything over to AI agents, which is a different very real type of problem. There is zero talk of any strategy on catastrophic or existential risks of any kind.

I am not comfortable with this trend. One could argue that ‘not being usemaxxed’ is itself the safety protection in open models like Kimi, but then they go and make agent swarms as a central feature. At some point there is likely going to be an incident. I have been pleasantly surprised to not have had this happen yet at scale. I would have said (and did say) in advance that it was unlikely we would get this far without that.

The lack of either robust (or any) safety protocols, combined with the lack of incidents or worry about incidents, suggests that we should not be so concerned about Kimi K2.5 in other ways. If it was so capable, we would not dare be this chill about it all.

Or at least, that’s what I am hoping.

dax: all of our inference providers for kimi k2.5 are overloaded and asked us to scale down

even after all this time there’s still not enough GPUs

This is what one should expect when prices don’t fluctuate enough over time. Kimi K2.5 has exceeded expectations, and there currently is insufficient supply of compute. After a burst of initial activity, Kimi K2.5 settled into its slot in the rotation for many.

Kimi K2.5 is a solid model, by all accounts now the leading open weights model, and is excellent given its price, with innovations related to the agent swarm system. Consensus says that if you can’t afford or don’t want to pay for Opus 4.5 and have to go with something cheaper to run your OpenClaw, Kimi is an excellent choice.

We should expect to see it used until new models surpass it, and we can kick Kimi up a further notch on our watchlists.


Kimi K2.5 Read More »


Google court filings suggest ChromeOS has an expiration date

The documents suggest that Google will wash its hands of ChromeOS once the current support window closes. Google promises 10 years of Chromebook support, but that’s not counted from the date of purchase—Chromebooks are based on a handful of hardware platforms dictated by Google, with the most recent launching in 2023. That means Google has to support the newest devices through 2033. The “timeline to phase out ChromeOS is 2034,” says the filing.
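The dates line up as simple arithmetic on the support policy:

```python
# Chromebook support is pegged to the hardware platform's launch year,
# not the purchase date. The newest platform launched in 2023.
support_years = 10
newest_platform_launch = 2023

end_of_support = newest_platform_launch + support_years  # 2033
phase_out = end_of_support + 1                           # 2034, per the filing
```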

Android goes big

From the start, the ChromeOS experience was focused on the web. Google initially didn’t even support running local apps, but little by little, its aspirations grew. Over the years, it has added Linux apps and Android apps. And it even tried to get Steam games running on Chromebooks—it gave up on that last one just recently. It also tried to shoehorn AI features into ChromeOS with the Chromebook Plus platform, to little effect.

Android was barely getting off the ground when ChromeOS began its journey, but as we approach the 2030s, Google clearly wants a more powerful desktop platform. Android has struggled on larger screens, but Aluminium is a long-running project to fix that. Whatever we see in 2028 may not even look like the Android we know from phones. It will have many of the same components under the hood, though.

Aluminium vs ChromeOS

Aluminium will have Google apps at the core. Credit: US v. Google

Google could get everything it wants with the upcoming Aluminium release. When running on powerful laptop hardware, Android’s performance and capabilities should far outstrip ChromeOS. Aluminium is also expected to run Google apps like Chrome and the Play Store with special system privileges, leaving third-party apps with fewer features. That gives Google more latitude in how it manages the platform and retains users, all without running afoul of recent antitrust rulings.

Google court filings suggest ChromeOS has an expiration date Read More »