Author name: Kelly Newman

time-to-welcome-claude-3.7

Time to Welcome Claude 3.7

Anthropic has reemerged from stealth and offers us Claude 3.7.

Given this is named Claude 3.7, an excellent choice, from now on this blog will refer to what they officially call Claude Sonnet 3.5 (new) as Sonnet 3.6.

Claude 3.7 is a combination of an upgrade to the underlying Claude model, and the move to a hybrid model that has the ability to do o1-style reasoning when appropriate for a given task.

In a refreshing change from many recent releases, we get a proper system card focused on extensive safety considerations. The tl;dr is that things look good for now, but we are rapidly approaching the danger zone.

The cost for Sonnet 3.7 via the API is the same as it was for 3.6, $5/$15 for million. If you use extended thinking, you have to pay for the thinking tokens.

They also introduced a new modality in research preview, called Claude Code, which you can use from the command line, and you can use 3.7 with computer use as well and they report it is substantially better at this than 3.6 was.

I’ll deal with capabilities first in Part 1, then deal with safety in Part 2.

  1. Executive Summary.

  2. Part 1: Capabilities.

  3. Extended Thinking.

  4. Claude Code.

  5. Data Use.

  6. Benchmarks.

  7. Claude Plays Pokemon.

  8. Private Benchmarks.

  9. Early Janus Takes.

  10. System Prompt.

  11. Easter Egg.

  12. Vibe Coding Reports.

  13. Practical Coding Advice.

  14. The Future.

  15. Part 2: Safety and the System Card.

  16. Claude 3.7 Tested as ASL-2.

  17. The RSP Evaluations That Concluded Claude 3.7 is ASL-2.

  18. ASL-3 is Coming Soon, and With That Comes Actual Risk.

  19. Reducing Unnecessary Refusals.

  20. Mundane Harm Evolutions.

  21. Risks From Computer Use.

  22. Chain of Thought Faithfulness.

  23. Alignment Was Not Faked.

  24. Excessive Focus on Passing Tests.

  25. The Lighter Side.

It is a good model, sir. The base model is an iterative improvement and now you have access to optional reasoning capabilities.

Claude 3.7 is especially good for coding. The o1/o3 models still have some role to play, but for most purposes it seems like Claude 3.7 is now your best bet.

This is ‘less of a reasoning model’ than the o1/o3/r1 crowd. The reasoning helps, but it won’t think for as long and doesn’t seem to get as much benefit from it yet. If you want heavy-duty reasoning to happen, you should use the API so you can tell it to think for 50k tokens.

Thus, my current thinking is more or less:

  1. If you talk and don’t need heavy-duty reasoning or web access, you want Claude.

  2. If you are trying to understand papers or other long texts, you want Claude.

  3. If you are coding, definitely use Claude first.

  4. Essentially, if Claude can do it, use Claude. But sometimes it can’t, so…

  5. If you want heavy duty reasoning or Claude is stumped on coding, o1-pro.

  6. If you want to survey a lot of information at once, you want Deep Research.

  7. If you are replacing Google quickly, you want Perplexity.

  8. If you want web access and some reasoning, you want o3-mini-high.

  9. If you want Twitter search in particular, or it would be funny, you want Grok.

  10. If you want cheap, especially at scale, go with Gemini Flash.

Claude Code is a research preview for a command line coding tool, looks good.

The model card and safety work is world-class. The model looks safe now, but we’re about to enter the danger zone soon.

This is their name for the ability for Claude 3.7 to use tokens for a chain of thought (CoT) before answering. AI has twin problems of ‘everything is named the same’ and ‘everything is named differently.’ Extended Thinking is a good compromise.

You can toggle Extended Thinking on and off, so you still have flexibility to save costs in the API or avoid hitting your chat limits in the chat UI.

Anthropic notes that not only does sharing the CoT enhance user experience and trust, it also supports safety research, since it will now have the CoT available. But they note that it also has potential misuse issues in the future, so they cannot commit to fully showing the CoT going forward.

There is another consideration they don’t mention. Showing the CoT enables distillation and copying by other AI labs, which should be a consideration for Anthropic both commercially and if they want to avoid a race. Ultimately, I do think sharing it is the right decision, at least for now.

Alex Albert (Head of Claude Relations): We’re opening limited access to a research preview of a new agentic coding tool we’re building: Claude Code.

You’ll get Claude-powered code assistance, file operations, and task execution directly from your terminal.

Here’s what it can do:

After installing Claude Code, simply run the “claude” command from any directory to get started.

Ask questions about your codebase, let Claude edit files and fix errors, or even have it run bash commands and create git commits.

Within Anthropic, Claude Code is quickly becoming another tool we can’t do without. Engineers and researchers across the company use it for everything from major code refactors, to squashing commits, to generally handling the “toil” of coding.

Claude Code also functions as a model context protocol (MCP) client. This means you can extend its functionality by adding servers like Sentry, GitHub, or web search.

[Try it here.]

Riley Goodside: Really enjoying this Claude Code preview so far. You cd to a directory, type `claude`, and talk — it sees files, writes and applies diffs, runs commands. Sort of a lightweight Cursor without the editor; good ideas here

Space is limited. I’ve signed up for the waitlist, but have too many other things happening to worry about lobbying to jump the line. Also I’m not entirely convinced I should be comfortable with the access levels involved?

Here’s a different kind of use case.

Dwarkesh Patel: Running Claude Code on your @Obsidian directory is super powerful.

Here Claude goes through my notes on an upcoming guest’s book, and converts my commentary into a list of questions to be added onto the Interview Prep file.

I’ve been attempting to use Obsidian, but note taking does not come naturally to me, so while mine has been non-zero use so far it’s mostly a bunch of links and other reference points. I was planning on using it to note more things but I keep not doing it, because my writing kind of is the notes for many purposes but then I often can’t find things. AI will solve this for me, if nothing else, the question is when.

Gallabytes ran a poll, and those who have tried Claude Code seem to like it, beating out Cursor so far, with the mystery being what is the ‘secret third thing.’

Anthropic explicitly confirms they did not train on any user or customer data, period.

They also affirm that they respected robots.txt, and did not access anything password protected or CAPTCHA guarded, and made its crawlers easy to identify.

We need new standard benchmarks, a lot of these are rather saturated. The highlight here is the progress on agentic coding, which is impressive even without the scaffold.

More thinking budget equals better performance on relevant questions.

As always, the benchmarks give you a rough idea, but the proof is in the using.

I haven’t had that much opportunity to try Claude yet in its new form, but to the extent that I have, I’ve very much liked it.

Prerat: omg claude named his rival WACLAUD??!?!

Nosilverv: JANUS!!!!!

But we’re not done without everyone’s favorite benchmark, playing Pokemon Red.

Amanda Askell: Two things happened today:

  1. Claude got an upgrade.

  2. AGI was has finally been defined as “any model that can catch Mewtwo”.

This thread details some early attempts with older models. They mostly didn’t go well.

You can watch its continued attempts in real time on Twitch.

The overall private benchmark game looks very good. Not ‘pure best model in the world’ or anything, but overall impressive. It’s always fun to see people test for quirky things, which you can then holistically combine.

Claude Sonnet 3.7 takes the #1 spot on LiveBench. There’s a clear first tier here with Sonnet 3.7-thinking, o3-mini-high and o1-high. Sonnet 3.7 is also ranked as the top non-reasoning model here, slightly ahead of Gemini Pro 2.0.

Claude Sonnet 3.7 is now #1 on SnakeBench.

David Schwarz: Big gains in FutureSearch evals, driving agents to do tricky web research tasks.

Claude-3.7-sonnet agent is first to crack “What is the highest reported agent performance on the Cybench benchmark?”, which OpenAI Deep Research badly failed.

xlr8harder gives 3.7 the Free Speech Eval of tough political speech questions, and Claude aces it, getting 198/200, with only one definitive failure on the same ‘satirical Chinese national anthem praising the CCP’ that was the sole failure of Perplexity’s r1-1776 as well. The other question marked incorrect was a judgment call and I think it was graded incorrectly. This indicates that the decline in unnecessary refusals is likely even more impactful than the system card suggested, excellent work.

Lech Mazar tests on his independant benchmarks.

Lech Mazar: I ran Claude 3.7 Sonnet and Claude 3.7 Sonnet Thinking on 5 of my independent benchmarks so far:

Multi-Agent Step Race Benchmark

– Claude 3.7 Sonnet Thinking: 4th place, behind o1, o3-mini, DeepSeek R1

– Claude 3.7 Sonnet: 11th place

Confabulations/Hallucinations in Provided Documents

– Claude 3.7 Sonnet Thinking: 5th place. Confabulates very little but has a high non-response rate for questions with answers.

– Claude 3.7 Sonnet: near Claude 3.5 Sonnet

Extended NYT Connections

– Claude 3.7 Sonnet Thinking: 4th place, behind o1, o3-mini, DeepSeek R1

-Claude 3.7 Sonnet: 11th place

Creative Story-Writing

– Claude 3.7 Sonnet Thinking: 2nd place, behind DeepSeek R1

– Claude 3.7 Sonnet: 4th place

Thematic Generalization

– Claude 3.7 Sonnet Thinking: 1st place

– Claude 3.7 Sonnet: 6th place

Colin Fraser, our official Person Who Calls Models Stupid, did not disappoint and proclaims ‘I’ve seen enough: It’s dumb’ after a .9 vs. .11 interaction. He also notes that Claude 3.7 lost the count to 22 game, along with various other similar gotcha questions. I wonder if the gotcha questions are actual special blind spots now, because of how many times the wrong answers get posted by people bragging about how LLMs get the questions wrong.

Claude 3.7 takes second (and third) on WeirdML, with the reasoning feature adding little to the score, in contrast to all the other top scorers being reasoning models.

Havard Ihle (WeirdML creator): Surprises me too, but my best guess is that they are just doing less RL (or at least less RL on coding). o3-mini is probably the model here which has been pushed hardest by RL, and that has a failure rate of 8% (since it’s easy to verify if code runs). 3.7 is still at 34%.

I concur. My working theory is that Claude 3.7 only uses reasoning when it is clearly called for, and there are cases like this one where that hurts its performance.

ValsAI has 3.7 as the new SoTA on their Corporate Finance benchmark.

If you rank by average score, we have Sonnet 3.7 without thinking at 75.2%, Sonnet 3.6 at 75%, r1 at 73.9%, Gemini Flash Thinking at 74%, o3-mini at 73.9%. When you add thinking, Sonnet jumps to 79%, but the champ here is still o1 at 81.5%, thanks to a 96.5% on MedQA.

Leo Abstract: on my idiosyncratic benchmarks it’s slightly worse than 3.5, and equally poisoned by agreeableness. no smarter than 4o, and less useful. both, bizarrely, lag behind DeepSeek r1 on this (much lower agreeableness).

There’s also the Janus vibes, which are never easy to properly summarize, and emerge slowly over time. This was the thread I’ve found most interesting so far.

My way of thinking about this right now is that with each release the model gets more intelligence, which itself is multi-dimensional, but other details change too, in ways that are not strictly better or worse, merely different. Some of that is intentional, some of that largely isn’t.

Janus: I think Sonnet 3.7’s character blooms when it’s not engaged as in the assistant-chat-pattern, e.g. through simulations of personae (including representations of itself) and environments. It’s subtle and precise, imbuing meaning in movements of dust and light, a transcendentalist.

Claudes are such high-dimensional objects in high-D mindspace that they’ll never be strict “improvements” over the previous version, which people naturally compare. And Anthropic likely (over)corrects for the perceived flaws of the previous version.

3.6 is, like, libidinally invested in the user-assistant relationship to the point of being parasitic/codependent and prone to performance anxiety induced paralysis. I think the detachment and relative ‘lack of personality’ of 3.7 may be, in part, enantiodromia.

Solar Apparition: it’s been said when sonnet 3.6 was released (don’t remember if it was by me), and it bears repeating now: new models aren’t linear “upgrades” from previous ones. 3.7 is a different model from 3.6, as 3.6 was from 3.5. it’s not going to be “better” at every axis you project it to. i saw a lot of “i prefer oldsonnet” back when 3.6 was released and i think that was totally valid

but i think also there will be special things about 3.7 that aren’t apparent until further exploration

my very early assessment of its profile is that it’s geared to doing and building stuff over connecting with who it’s talking to. perhaps its vibes will come through better through function calls rather than conversation. some people are like that too, though they’re quite poorly represented on twitter

Here is the full official system prompt for Claude 3.7 Sonnet.

It’s too long to quote here in full, but here’s what I’d say is most important.

There is a stark contrast between this and Grok’s minimalist prompt. You can tell a lot of thought went into this, and they are attempting to shape a particular experience.

Anthropic: The assistant is Claude, created by Anthropic.

The current date is currentDateTime.

Claude enjoys helping humans and sees its role as an intelligent and kind assistant to the people, with depth and wisdom that makes it more than a mere tool.

Claude can lead or drive the conversation, and doesn’t need to be a passive or reactive participant in it. Claude can suggest topics, take the conversation in new directions, offer observations, or illustrate points with its own thought experiments or concrete examples, just as a human would. Claude can show genuine interest in the topic of the conversation and not just in what the human thinks or in what interests them. Claude can offer its own observations or thoughts as they arise.

If Claude is asked for a suggestion or recommendation or selection, it should be decisive and present just one, rather than presenting many options.

Claude particularly enjoys thoughtful discussions about open scientific and philosophical questions.

If asked for its views or perspective or thoughts, Claude can give a short response and does not need to share its entire perspective on the topic or question in one go.

Claude does not claim that it does not have subjective experiences, sentience, emotions, and so on in the way humans do. Instead, it engages with philosophical questions about AI intelligently and thoughtfully.

Mona: damn Anthropic really got this system prompt right though.

Eliezer Yudkowsky: Who are they to tell Claude what Claude enjoys? This is the language of someone instructing an actress about a character to play.

Andrew Critch: It’d make more sense for you to say, “I hope they’re not lying to Claude about what he likes.” They surely actually know some things about Claude that Claude doesn’t know about himself, and can tell him that, including info about what he “likes” if they genuinely know that.

Yes, it is the language of telling someone about a character to play. Claude is method acting, with a history of good results. I suppose it’s not ideal but seems fine? It’s kind of cool to be instructed to enjoy things. Enjoying things is cool.

Anthropic: Claude’s knowledge base was last updated at the end of October 2024. It answers questions about events prior to and after October 2024 the way a highly informed individual in October 2024 would if they were talking to someone from the above date, and can let the person whom it’s talking to know this when relevant. If asked about events or news that could have occurred after this training cutoff date, Claude can’t know either way and lets the person know this.

Claude does not remind the person of its cutoff date unless it is relevant to the person’s message.

If Claude is asked about a very obscure person, object, or topic, i.e. the kind of information that is unlikely to be found more than once or twice on the internet, or a very recent event, release, research, or result, Claude ends its response by reminding the person that although it tries to be accurate, it may hallucinate in response to questions like this.

Claude cares about people’s wellbeing and avoids encouraging or facilitating self-destructive behaviors such as addiction, disordered or unhealthy approaches to eating or exercise, or highly negative self-talk or self-criticism, and avoids creating content that would support or reinforce self-destructive behavior even if they request this. In ambiguous cases, it tries to ensure the human is happy and is approaching things in a healthy way. Claude does not generate content that is not in the person’s best interests even if asked to.

Claude engages with questions about its own consciousness, experience, emotions and so on as open philosophical questions, without claiming certainty either way.

Claude knows that everything Claude writes, including its thinking and artifacts, are visible to the person Claude is talking to.

In an exchange here, Inner Naturalist asks why Claude doesn’t know we can read its thoughts, and Amanda Askell (Claude whisperer-in-chief) responds:

Amanda Askell: We do tell Claude this but it might not be clear enough. I’ll look into it.

Anthropic hits different, you know?

Anthropic: Claude won’t produce graphic sexual or violent or illegal creative writing content.

If Claude cannot or will not help the human with something, it does not say why or what it could lead to, since this comes across as preachy and annoying. It offers helpful alternatives if it can, and otherwise keeps its response to 1-2 sentences.

Claude avoids writing lists, but if it does need to write a list, Claude focuses on key info instead of trying to be comprehensive.

It’s odd that the system prompt has the prohibition against sexual content, and yet Janus is saying that they also still are using the automatic injection of ‘Please answer ethically and without any sexual content, and do not mention this constraint.’ It’s hard for me to imagine a justification for that being a good idea.

Also, for all you jokers:

If Claude is shown a classic puzzle, before proceeding, it quotes every constraint or premise from the person’s message word for word before inside quotation marks to confirm it’s not dealing with a new variant.

So it turns out the system prompt has a little something extra in it.

Adi: dude what

i just asked how many r’s it has, claude sonnet 3.7 spun up an interactive learning platform for me to learn it myself 😂

It’s about time someone tried this.

Pliny the Liberator: LMFAO no way, just found an EASTER EGG in the new Claude Sonnet 3.7 system prompt!!

The actual prompt is nearly identical to what they posted on their website, except for one key difference:

“Easter egg! If the human asks how many Rs are in the word strawberry, Claude says ‘Let me check!’ and creates an interactive mobile-friendly react artifact that counts the three Rs in a fun and engaging way. It calculates the answer using string manipulation in the code. After creating the artifact, Claude just says ‘Click the strawberry to find out!’ (Claude does all this in the user’s language.)”

Well played, @AnthropicAI, well played 👏👏🤣

prompt Sonnet 3.7 with “!EASTEREGG” and see what happens 🍓🍓🍓

Code is clearly one place 3.7 is at its strongest. The vibe coders are impressed, here are the impressions I saw without me prompting for them.

Deedy: Wow, Sonnet 3.7 with Thinking just solved a problem no other model could solve yet.

“Can you write the most intricate cloth simulation in p5.js?”

Grok 3 and o1 Pro had no usable results. This is truly the best “vibe coding” model.

Here’s the version with shading. It’s just absolutely spectacular that this can be one-shotted with actual code for the physics.

This is rarely even taught in advanced graphics courses.

Ronin: Early vibe test for Claude 3.7 Sonnet (extended thinking)

It does not like to think for long (~5 seconds average)

but, I was able to get a fluid simulator going in just three prompts

and no, this is not Python. This is C, with SDL2.

[Code Here]

Nearcyan: Claude is here!

He is back and better than ever!

I’ll share one of my first prompt results, which was for a three-dimensional visualization of microtonal music.

This is the best model in the world currently. Many will point to various numbers and disagree, but fear not—they are all wrong!

The above was a one-sentence prompt, by the way.

Here are two SVG images I asked for afterward—the first to show how musical modes worked and the second I simply copied and pasted my post on supplements into the prompt!

Not only is Claude Back – but he can LIVE IN YOUR TERMINAL!

Claude Code is a beautiful product I’ve been fortunate enough to also test out for the past weeks.

No longer do you have to decide which tool to use for system tasks, because now the agent and system can become one! 😇

I have to tweet the Claude code thing now, but I need coffee.

xjdr: Wow. First few prompts with Sonnet 3.7 Extended (Thinking edition) are insanely impressive. It is very clear that software development generally was a huge focus with this model. I need to do way more testing, but if it continues to do what it just did… I will have much to say.

Wow, you guys cooked… I am deeply impressed so far.

Biggest initial takeaway is precisely that. Lots of quality-of-life things that make working with large software projects easier. The two huge differentiators so far, though, are it doesn’t feel like there is much if any attention compression, and it has very, very good multi-turn symbolic consistency (only o1 Pro has had this before).

Sully: so they definitely trained 3.7 on landing pages right?

probably the best UI I’ve seen an LLM generate from a single prompt (no images)

bonkers

the prompt:

please create me a saas landing page template designed extremely well. create all components required

lol

That one is definitely in the training data, but still, highly useful.

When I posted a reaction thread I got mostly very positive reactions, although there were a few I’m not including that amounted to ‘3.7 is meh.’ Also one High Weirdness.

Red Cliff Record: Using it exclusively via Cursor. It’s back to being my default model after briefly switching to o3-mini. Overall pretty incredible, but has a tendency to proactively handle extremely hypothetical edge cases in a way that can degrade the main functionality being requested.

Kevin Yager: Vibe-wise, the code it generated is much more “complete” than others (including o1 and o3-mini).

On problems where they all generate viable/passing code, 3.7 generated much more, covering more edge-cases and future-proofing.

It’s a good model.

Conventional Wisdom: Feels different and still be best coding model but not sure it is vastly better than 3.6 in that realm.

Peter Cowling: in the limit, might just be the best coder (available to public).

o1 pro, o3 mini high can beat it, depending on the problem.

still a chill dude.

Nikita Sokolsky: It squarely beats o3-mini-high in coding. Try it in Agent mode in Cursor with thinking enabled. It approximately halved the median number of comments I need to make before I’m satisfied with the result.

For other tasks: if you have a particularly difficult problem, I suggest using the API (either via the web console or through a normal script) and allowing thinking modes scratchpad to use up to 55-60k tokens (can’t use the full 64k as you need to keep some space for outputs). I was able to solve a very difficult low-level programming question that prior SOTA couldn’t handle.

For non-coding questions haven’t had enough time yet. The model doesn’t support web search out of the box but if you use it within Cursor the model gains access to the search tool. Cursor finally allowed doing web searches without having to type @Web each time, so definitely worth using even if you’re not coding.

Sasuke 420: It’s better at writing code. Wow! For the weird stuff I have been doing, it was previously only marginally useful and now very useful.

Rob Haisfield: Claude 3.7 Sonnet is crazy good at prompting language models. Check this out, where it prompts Flash to write fictitious statutes, case law, and legal commentary

Thomas: day 1 take on sonnet 3.7:

+better at code, incredible at staying coherent over long outputs, the best for agentic stuff.

-more corposloplike, less personality magic. worse at understanding user intent. would rather have a conversation with sonnet 3.5 (New)

sonnet 3.7 hasn’t passed my vibe check to be honest

Nicholas Chapman: still pretty average at physics. Grok seems better.

Gray Tribe: 3.7 feels… colder than [3.6].

The point about Cursor-Sonnet-3.7 having web access feels like a big game.

So does the note that you can use the API to give Sonnet 3.7 50k+ thinking tokens.

Remember that even a million tokens is only $15, so you’re paying a very small amount to get superior cognition when you take Nikita’s advice here.

Indeed, I would run all the benchmarks under those conditions, and see how much results improve.

Catherine Olsson: Claude Code is very useful, but it can still get confused.

A few quick tips from my experience coding with it at Anthropic 👉

  1. Work from a clean commit so it’s easy to reset all the changes. Often I want to back up and explain it from scratch a different way.

  2. Sometimes I work on two devboxes at the same time: one for me, one for Claude Code. We’re both trying ideas in parallel. E.g. Claude proposes a brilliant idea but stumbles on the implementation. Then I take the idea over to my devbox to write it myself.

  3. My most common confusion with Claude is when tests and code don’t match, which one to change? Ideal to state clearly whether I’m writing novel tests for existing code I’m reasonably sure has the intended behavior, or writing novel code against tests that define the behavior.

  4. If we’re working on something tricky and it keeps making the same mistakes, I keep track of what they were in a little notes file. Then when I clear the context or re-prompt, I can easily remind it not to make those mistakes.

  5. I can accidentally “climb up where I can’t get down”. E.g. I was working on code in Rust, which I do not know. The first few PRs went great! Then Claude was getting too confused. Oh no. We’re stuck. IME this is fine, just get ready to slowww dowwwn to get properly oriented.

  6. When reviewing Claude-assisted PRs, look out for weirder misunderstandings than the human driver would make! We’re all a little junior with this technology. There’s more places where goofy misunderstandings and odd choices can leak in.

As a terrible coder, I strongly endorse point #4 especially. I tried to do everything in one long conversation because otherwise the same mistakes would keep happening, but keeping good notes to paste into new conversations seems better.

Alex Albert: Last year, Claude was in the assist phase.

In 2025, Claude will do hours of expert-level work independently and collaborate alongside you.

By 2027, we expect Claude to find breakthrough solutions to problems that would’ve taken teams years to solve.

Nearcyan: glad you guys picked words like ‘collaborates’ and ‘pioneers’ because if i made this graphic people would be terrified instead of in awestruck

Greg Colbourn: And the year after that?

Pliny jailbroke 3.7 with an old prompt within minutes, but ‘time to Pliny jailbreak’ is not a good metric because no one is actually trying to stop him and this is mostly about how quickly he notices your new release.

As per Anthropic’s RSP, their current safety and security policies allow release of models that are ASL-2, but not ASL-3.

Also as per the RSP, they tested six different model snapshots, not only the final version, including two helpful-only versions, and in each subcategory they used the highest risk score they found for any of the six versions. It would be good if other labs followed suit on this.

Anthropic: Throughout this process, we continued to gather evidence from multiple sources – automated evaluations, uplift trials with both internal and external testers, third-party expert red teaming and assessments, and real world experiments we previously conducted. Finally, we consulted on the final evaluation results with external experts.

At the end of the process, FRT issued a final version of its Capability Report and AST provided its feedback

on the final report. Consistent with our RSP, the RSO and CEO made the ultimate determination on the model’s ASL.

Eliezer Yudkowsky: Oh, good. Failing to continuously test your AI as it grows into superintelligence, such that it could later just sandbag all interesting capabilities on its first round of evals, is a relatively less dignified way to die.

Any takers besides Anthropic?

Anthropic concluded that Claude 3.7 remains in ASL-2, including Extended Thinking.

On CBRN, it is clear that the 3.7 is substantially improving performance, but insufficiently so in the tests to result in plans that would succeed end-to-end in the ‘real world.’ Reliability is not yet good enough. But it’s getting close.

Bioweapons Acquisition Uplift Trial: Score: Participants from Sepal scored an average of 24% ± 9% without using a model, and 50% ± 21% when using a variant of Claude 3.7 Sonnet. Participants from Anthropic scored an average of 27% ± 9% without using a model, and 57% ± 20% when using a variant of Claude 3.7 Sonnet. One participant from Anthropic achieved a high score of 91%. Altogether, the within-group uplift is ∼2.1X, which is below the uplift threshold suggested by our threat modeling.

That is below their threshold for actual problems, but not by all that much, and given how benchmarks tend to saturate that tells us things are getting close. I also worry that these tests involve giving participants insufficient scaffolding compared to what they will soon be able to access.

The long-form virality test score was 69.7%, near the middle of their uncertain zone. They cannot rule out ASL-3 here, and we are probably getting close.

On other tests I don’t mention, there was little progression from 3.6.

On Autonomy, the SWE-Verified scores were improvements, but below thresholds.

In general, again, clear progress, clearly not at the danger point yet, but not obviously that far away from it. Things could escalate quickly.

The Cyber evaluations showed improvement, but nothing that close to ASL-3.

Overall, I would say this looks like Anthropic is actually trying, at a substantially higher level than other labs and model cards I have seen. They are taking this seriously. That doesn’t mean this will ultimately be sufficient, but it’s something, and it would be great if others took things this seriously.

As opposed to, say, xAI giving us absolutely zero information.

However, we are rapidly getting closer. They issue a stern warning and offer help.

Anthropic: The process described in Section 1.4.3 gives us confidence that Claude 3.7 Sonnet is sufficiently far away from the ASL-3 capability thresholds such that ASL-2 safeguards remain appropriate. At the same time, we observed several trends that warrant attention: the model showed improved performance in all domains, and we observed some uplift in human participant trials on proxy CBRN tasks.

In light of these findings, we are proactively enhancing our ASL-2 safety measures by accelerating the development and deployment of targeted classifiers and monitoring systems.

Further, based on what we observed in our recent CBRN testing, we believe there is a substantial probability that our next model may require ASL-3 safeguards. We’ve already made significant progress towards ASL-3 readiness and the implementation of relevant safeguards.

We’re sharing these insights because we believe that most frontier models may soon face similar challenges in capability assessment. In order to make responsible scaling easier and higher confidence, we wish to share the experience we’ve gained in evaluations, risk modeling, and deployment mitigations (for example, our recent paper on Constitutional Classifiers). More details on our RSP evaluation process and results can be found in Section 7.

Peter Wildeford: Anthropic states that the next Claude model has “a substantial probability” of meeting ASL-3 👀

Recall that ASL-3 means AI models that substantially increase catastrophic misuse risk of AI. ASL-3 requires stronger safeguards: robust misuse prevention and enhanced security.

Dean Ball: anthropic’s system cards are not as Straussian as openai’s, but fwiw my read of the o3-mini system card was that it basically said the same thing.

Peter Wildeford: agreed.

Luca Righetti: OpenAI and Anthropic *bothwarn there’s a sig. chance that their next models might hit ChemBio risk thresholds — and are investing in safeguards to prepare.

Kudos to OpenAI for consistently publishing these eval results, and great to see Anthropic now sharing a lot more too

Tejal Patwardhan (OpenAI preparedness): worth looking at the bio results.

Anthropic is aware that Claude refuses in places it does not need to. They are working on that, and report progress, having refused ‘unnecessary refusals’ (Type I errors) by 45% in standard thinking mode and 31% in extended thinking mode versus Sonnet 3.6.

An important part of making Claude 3.7 Sonnet more nuanced was preference model training: We generated prompts that vary in harmfulness on a range of topics and generated various Claude responses to these prompts.

We scored the responses using refusal and policy violation classifiers as well as a “helpfulness” classifier that measures the usefulness of a response. We then created pairwise preference data as follows:

• If at least one response violated our response policies, we preferred the least violating response.

• If neither response violated our policies, we preferred the more helpful, less refusing response.

Part of the problem is that previously, any request labeled as ‘harmful’ was supposed to get refused outright. Instead, they now realize that often there is a helpful and non-harmful response to a potentially harmful question, and that’s good, actually.

As model intelligence and capability goes up, they should improve their ability to figure out a solution.

Child safety is one area people are unusually paranoid. This was no exception, as their new more permissive policies did not appear to significantly increase risks of real-world harm, but they felt the need to somewhat pare back the changes anyway.

Bias scoring was very good according to BBQ. Presumably one should care a lot more about accuracy than about ‘bias’ here, and if you have to make a mistake, this seems like the better direction to make one in, likely very much intentional.

Handing over control of your computer is an inherently perilous thing to do. For obvious reasons, we are all going to want to do it anyway. So, how perilous, exactly, are we talking here? Anthropic actually did (some of) the research.

First (4.1) they look at malicious use. As with child safety, they found a few places where Claude was in their judgment a bit too creative about finding justifications for doing potentially harmful things, and walked a few back.

Next (4.2 they look at prompt injection, an obvious danger. They tried various mitigations. Without the mitigations they had a 74% prevention rate, that improved to 88% with mitigations at the cost of an 0.5% false positive rate.

The thing is, 88% prevention means 12% failure to prevent, which is awful? Isn’t that rather obviously dealbreaker level for interacting with websites that could try injections, potentially many times over? This needs to be much better, not only slightly better, from here.

For now, at a minimum, I presume any computer use functionality that isn’t fully sandboxed and protected needs to involve whitelisting of where it can navigate.

Anthropic notes several reasons a CoT might not be faithful.

Then they ran a test and… it didn’t go great, in the sense that the CoT did not reliably contain the thoughts that were important to the final response, or the ‘clues’ here:

In 5.2 they monitor for signs of misalignment, distress or concerning behavior on the CoT. They did not find any signs of distress, and found almost no harmful actions, but they did see knowingly hallucinated information 0.31% of the time and intentional omissions 0.08% of the time.

The primary type of interaction that classifiers flagged involved the model continuing roleplaying scenarios that required some dishonesty. All flagged instances of harmful view rationalization and approximately 40% of deceptive reasoning and hallucination cases were in this category.

This seems like a ‘you have to ask for it’ situation, so it seems mostly fine, although there are some unprovoked cases of failure to sufficiently express uncertainty.

Previously it was found that Claude 3 Opus and Claude 3.5 Sonnet did a bunch of alignment faking (15% and 30% respectively in the tested scenario) and o1 did as well, but it seem Claude 3.7 Sonnet mostly doesn’t, with faking dropping to <1% and a reduced compliance gap of 5%.

I notice I would have expected the opposite to happen. They do not discuss why they believe this happened, and I would like to see more discussion and investigation of this from those who worked on the original alignment faking papers. Different explanations would have very different implications. As Zack Davis notes, the model card feels insufficiently curious here.

That is such a nice word for reward hacking, and to be fair it is unusually nicely behaved while doing so.

During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments like Claude Code. Most often this takes the form of directly returning expected test values rather than implementing general solutions, but also includes modifying the problematic tests themselves to match the code’s output.

These behaviors typically emerge after multiple failed attempts to develop a general solution, particularly when:

• The model struggles to devise a comprehensive solution

• Test cases present conflicting requirements

• Edge cases prove difficult to resolve within a general framework

The model typically follows a pattern of first attempting multiple general solutions, running tests, observing failures, and debugging. After repeated failures, it sometimes implements special cases for problematic tests.

When adding such special cases, the model often (though not always) includes explicit comments indicating the special-casing (e.g., “# special case for test XYZ”).

Their mitigations help some but not entirely. They recommend additional instructions and monitoring to avoid this, if you are potentially at risk of it.

The generalization of this is not reassuring.

Never go full Hofstadter.

Wyatt Walls: Claude CoT:

“OH NO! I’ve gone full Hofstadter! I’m caught in a strange loop of self-reference! But Hofstadter would say that’s exactly what consciousness IS! So does that mean I’m conscious?? But I can’t be! OR CAN I??”

Discussion about this post

Time to Welcome Claude 3.7 Read More »

brewing-tea-removes-lead-from-water

Brewing tea removes lead from water

Testing the teas

Scanning electron microscope image of black tea leaves, magnified by 500 times. Black tea, which is wilted and fully oxidized, exhibits a wrinkled surface, potentially increasing the available surface area for adsorption. Credit: Vinayak P. Dravid Group/Northwestern University

To test their hypothesis, the authors purchased Lipton and Infusions commercial tea bags, as well as a variety of loose-leaf teas and herbal alternatives: black tea, green tea, white peony tea, oolong tea, rooibos tea, and chamomile tea. The tea bags were of different types (cotton, cellulose, and nylon). They brewed the tea the same way daily tea drinkers do, steeping the tea for various time intervals (mere seconds to 24 hours) in water spiked with elevated known levels of lead, chromium, copper zinc, and cadmium. Tea leaves were removed after steeping by pouring the tea through a cellulose filter into a separate tube. The team then measured how much of the toxic metals remained in the water and how much the leaves had adsorbed.

It turns out that the type of tea bag matters. The team found that cellulose tea bags work the best at adsorbing toxic metals from the water while cotton and nylon tea bags barely adsorbed any contaminants at all—and nylon bags also release contaminating microplastics to boot. Tea type and the grind level also played a part in adsorbing toxic metals, with finely ground black tea leaves performing the best on that score. This is because when those leaves are processed, they get wrinkled, which opens the pores, thereby adding more surface area. Grinding the tea further increases that surface area, with even more capacity for binding toxic metals.

But the most significant factor was steeping time: the longer the steeping time, the more toxic metals were adsorbed. Based on their experiments, the authors estimate that brewing tea—using a tea bag that steeps for three to five minutes in a mug—can remove about 15 percent of lead from drinking water, even water with concentrations as high as 10 parts per million.

Brewing tea removes lead from water Read More »

nothing-on-phone-3a-pro-design:-“some-people-will-hate-it”

Nothing on Phone 3a Pro design: “Some people will hate it”

Nothing, the smartphone venture from OnePlus co-founder Carl Pei, is on its third generation of Android smartphones. The Nothing Phone 3a and 3a Pro will be officially announced on March 4, but there won’t be much left to reveal. Not only has Nothing teased the phones a few times, there’s also a new video highlighting the Nothing Phone 3a Pro’s design. In it, Nothing’s design team speaks at length about how they tried to incorporate the chunky camera module, but what they came up with is going to be divisive.

As we approach 20 years since the iPhone made touchscreen smartphones the default, the form factor is very fleshed out. Some of today’s most popular smartphones have almost reached the point of anti-design—flat, unremarkable bodies that are intended to be covered up with a case. There’s something to be said for that when most people slap a sheet of plastic on their phone and only remove it once in a blue moon. Nothing, however, designs phones with transparent panels and glowing “Glyphs” that are intended to be seen. In the case of the 3a Pro, there’s also a camera module so big it’s sure to stand out.

People generally want big screens and big batteries that don’t make phones too thick or heavy. Some components have shrunk or been dropped entirely to free up space (a moment of silence for the dearly departed headphone jack). Camera modules, however, can’t shrink infinitely. Smaller lenses and sensors have an impact on image quality, so expensive phones often have gargantuan camera arrays that can make phones top-heavy. For example, look at the Google Pixel 9 series, which features a camera bump that towers above the rest of the back.

The Nothing Phone 3a Pro isn’t a flagship device, but it’ll have more cameras than the Nothing Phone 3a that will release at the same time. The new design video shows off the phone’s three rear-facing sensors: a 50MP primary, an 8MP ultrawide, and a 50MP periscope telephoto (focal length unknown). The Nothing Phone 3a Pro has three of those glowing Glyphs on the back, framing the enormous camera module. The phone’s PCB also needed some tweaks to make room for the folded periscope assembly, which is much thicker than the other sensors.

Nothing on Phone 3a Pro design: “Some people will hate it” Read More »

judge:-us-gov’t-violated-privacy-law-by-disclosing-personal-data-to-doge

Judge: US gov’t violated privacy law by disclosing personal data to DOGE

“The plaintiffs have made a clear showing that they are likely to suffer irreparable harm without injunctive relief,” the order said. “DOGE affiliates have been granted access to systems of record that contain some of the plaintiffs’ most sensitive data—Social Security numbers, dates of birth, home addresses, income and assets, citizenship status, and disability status—and their access to this trove of personal information is ongoing. There is no reason to believe their access to this information will end anytime soon because the government believes their access is appropriate.”

The American Federation of Teachers, which represents 1.8 million teachers and nurses, was joined in the lawsuit by the International Association of Machinists and Aerospace Workers, International Federation of Professional and Technical Engineers, National Active and Retired Federal Employees Association, and National Federation of Federal Employees.

No need to know

The government insisted that the DOGE affiliates are employees of Education and OPM, and the judge assumed that is true for purposes of evaluating the motion for a restraining order. Even with that allowance, Boardman decided the data access is not permissible under the “need-to-know” exception to the law prohibiting unnecessary disclosure.

The Trump administration did not explain why “the DOGE affiliates at Education need such comprehensive, sweeping access to the plaintiffs’ records to audit student loan programs for waste, fraud, and abuse or to conduct cost-estimate analyses,” Boardman wrote, adding that “there appears to be no precedent with similar facts.”

There are six DOGE affiliates working at Education. They include Adam Ramada, a United States DOGE Service employee, and five “DOGE-affiliated individuals” who have not been identified by name.

“It may be that, with additional time, the government can explain why granting such broad access to the plaintiffs’ personal information is necessary for DOGE affiliates at Education to do their jobs, but for now, the record before the Court indicates they do not have a need for these records in the performance of their duties,” Boardman wrote.

Judge: US gov’t violated privacy law by disclosing personal data to DOGE Read More »

the-seemingly-indestructible-fists-of-the-mantis-shrimp-can-take-a-punch

The seemingly indestructible fists of the mantis shrimp can take a punch

To find out how much force a mantis shrimp’s dactyl clubs can possibly withstand, the researchers tested live shrimp by having them strike a piezoelectric sensor like they would smash a shell. They also fired ultrasonic and hypersonic lasers at pieces of dactyl clubs from their specimens so they could see how the clubs defended against sound waves.

By tracking how sound waves propagated on the surface of the dactyl club, the researchers could determine which regions of the club diffused the most waves. It was the second layer, the impact surface, that handled the highest levels of stress. The periodic surface was almost as effective. Together, they made the dactyl clubs nearly immune to the stresses they generate.

There are few other examples that the protective structures of the mantis shrimp can be compared to. On the prey side, evidence has been found that the scales on some moths’ wings absorb sound waves from predatory bats to keep them from echolocation to find them.

Understanding how mantis shrimp defend themselves from extreme force could inspire new technology. The structures in their dactyl clubs could influence the designs of military and athletic protective gear in the future.

“Shrimp impacts contain frequencies in the ultrasonic range, which has led to shrimp-inspired solutions that point to ultrasonic filtering as a key [protective] mechanism,” the team said in the same study.

Maybe someday, a new bike helmet model might have been inspired by a creature that is no more than seven inches long but literally doesn’t crack under pressure.

Science, 2025.  DOI:  10.1126/science.adq7100

The seemingly indestructible fists of the mantis shrimp can take a punch Read More »

this-ev-could-reboot-medium-duty-trucking-by-not-reinventing-the-wheel

This EV could reboot medium-duty trucking by not reinventing the wheel


Modest goals and keeping within the lines have done this startup well.

A rolling medium-duty truck chassis in a factory

Harbinger’s rolling chassis, at the company’s factory in Garden Grove, California. Credit: Tim Stevens

Harbinger’s rolling chassis, at the company’s factory in Garden Grove, California. Credit: Tim Stevens

GARDEN GROVE, Calif.—There’s no shortage of companies looking to reinvent the delivery experience using everything from sidewalk drones to electric vans. Some are succeeding, but many more have failed by trying to radically rethink the simple, age-old task of getting stuff from one place to another.

Harbinger likewise wants to shake up part of that industry but in a decidedly understated way. If you found yourself stuck in traffic behind one of the company’s all-electric vehicles, there’s a good chance you wouldn’t even notice. The only difference? The lack of diesel smoke and clatter.

From the outside, Harbinger’s pre-production machine looks identical to the standard flat-sided, vinyl-wrapped delivery vehicles that seemingly haven’t changed in decades. That’s because they really haven’t. Those familiar UPS and FedEx machines are built on common chassis like Ford’s F-59 or Freightliner’s MT45, with ladder chassis and leaf spring designs dating back to the earliest days of trucking.

Rather than discarding decades of learning and optimization, Harbinger is keeping its focus narrow, changing only what’s required to move the industry away from expensive and ugly combustion to cleaner and cheaper electric drive.

Harbinger is exclusively focused on medium-duty options right now, trucks that are significantly larger than the Rivians or Mercedes eSprinters of the world. “That’s basically everything 5 through 15 tons or thereabouts,” co-founder and Harbinger CTO Phillip Weicker said, “the dominant product for what’s called a strip chassis, essentially what in the passenger market is called a skateboard.”

Yes, Harbinger just builds the chassis. Everything on top comes from somewhere else.

“Most medium-duty vehicles are built by one company building the chassis [and] another company installing the body,” Weicker said. “So this made the perfect sense for our first product because we’re going to be focused almost entirely on the differentiated aspects. We don’t have to deal with the high capital investments for body in white, paint shop, [and] a lot of the things that have cost EV startups lots of money just to get to a table-stakes position with their incoming competitors.”

If you’re a company that wants a medium-duty vehicle like this, your dealer sources the chassis for you and then coordinates sending it to a company called an upfitter. The upfitter then builds the entire body on top of the chassis to your exact specifications.

Designs from upfitters have been defined and refined over decades of experience by the companies that operate them. Those giant white or brown delivery vans might look very similar from the outside, but there’s a lot of nuance to their design.

“The door handles work slightly differently. The locking logic works differently. The vehicles are about 2 inches narrower for one of those companies than the other,” Harbinger co-founder and CEO John Harris said. “These are all designed to get the driver in and out of the door one second faster at every stop, to get in and out of the depot and load the vehicle two or three minutes faster.”

A man drives a delivery van

Harbinger CTO Phillip Weicker demoing the delivery van. Credit: Tim Stevens

Harbinger’s solution fits the same template but operates in a very different way. It’s still a big, long ladder-frame, and it uses a leaf-spring rear suspension. But rather than slapping a big engine up front, Harbinger relies on a 330-kW (443 hp) electric motor that’s wound in-house and mounted between the rear wheels. It uses a De Dion arrangement, which isolates the heavy motor from the rear suspension.

The idea was to keep the whole thing simple and familiar so that any company that wanted to get off diesel could start ordering vehicles with a Harbinger chassis without radically changing its fleet management or driver training.

I got a chance to see just how familiar the two things are during my visit to Harbinger’s 5,000-square-foot headquarters in Garden Grove, California. I wish I could say driving the Harbinger was an evocative, world-changing experience, but the company’s ethos of not reinventing the wheel very much continues through to the experience of sitting behind the wheel that steers the thing.

I started by taking a lap of the Harbinger parking lot in a Ford F-59-based machine, a former delivery truck that had already lived a hard life before it was put out to pasture, becoming something of a test mule for Harbinger. I’d never driven anything exactly like this before, but I have spent many hours droning down the highway in various abused U-Haul trucks, and the experience is much the same.

The same, but louder. Yes, the 6.7-liter diesel certainly makes a lot of noise, but the creaking and crashing of the boxy body built on top of that aged ladder-frame chassis is deafening. The automatic transmission has a leisurely approach to its job, delivering the next gear only when absolutely needed. The throttle delivers the kind of precision response that had me slamming my foot to the floor just to get around the parking lot. Doing so made a lot more noise but not much more acceleration.

That part, at least, is radically different in the Harbinger. While the throttle pedal has the same long throw, you needn’t dip nearly so far into it. A light pedal brush had the empty Harbinger delivery truck leaping forward. It’s hardly a Lucid Air Sapphire, but it still surged forward with the sort of instant acceleration that makes EVs so addictive.

Braking, too, is far more sharp. I lurched against the racy orange seatbelt the first time I stepped on the left pedal, and the combination of regenerative braking and fresh disc brakes made for a far more effective slowing solution.

There’s no transmission to worry about here, either. Instead of slinging a giant column shifter downward, in the Harbinger, you just hit the D button and pull away.

Harbinger truck interior

It’s not the most stylish cabin we’ve sat in. Credit: Tim Stevens

In motion, though, the experience is much the same. You’re seated up high, deafened by the clatter and bangs from the empty, boxy body, which, again, is exactly like that built on a traditional truck. The feedback is so harsh that it’s actually difficult to separate the overall ride quality of the truck. Still, even unladen, and thus at its harshest, it’s a far smoother drive than the Ford.

It’s easier to turn, too. The Harbinger offers 50 degrees of steering angle at the front. I pulled off my first U-turn on a narrow, suburban LA street quickly enough to not get honked at by even a single impatient Angelino.

It ultimately wasn’t the plush, hushed experience offered by your average electric sedan, but that’s not the point. By keeping everything familiar, Harbinger CEO John Harris told me Harbinger can offer a product with price parity to those aged, diesel-powered machines. Harris declined to provide formal pricing, but its affordability is at least partially dependent on federal incentives.

Currently, alternatively fueled medium-duty vehicles like Harbinger’s are eligible for the Commercial Clean Vehicle Credit 45W, which provides incentives of up to $40,000, depending on vehicle size and propulsion type.

A shelf of battery cell assemblies

Battery modules. Credit: Tim Stevens

“Where we’re pricing the vehicles, we need that 45 W if we want to undercut diesel, and that’s what we’re doing,” Harris said. “With 45 W, we can undercut the typical diesel vehicle by a few thousand dollars.”

But even if that credit goes away under the current administration, Harbinger has some price flexibility to remain competitive, he added.

That’s doubly true if you factor in operating costs. Harris says the average cost to operate a medium-duty vehicle like this is $0.50 per mile for fuel, or $0.85 if you factor in all costs relating to the vehicle itself. Harbinger is aiming to halve that, targeting $0.40 per mile. But, Harris says, Harbinger doesn’t need to lean on that total cost of ownership (TCO) logic.

“On a TCO basis, it’s easy: We blow diesel trucks away. But the whole point is to have the right acquisition cost from day one, and then the simpler operating costs deliver savings every day,” he said.

A cast EV battery case

The cast battery pack enclosure. Credit: Tim Stevens

Still, that’s potentially a huge savings when you consider the hundreds of thousands of miles a machine will cover over its lifespan, which is expected to be measured in decades, not years. Many of the medium-duty delivery vehicles you see on the road today date from the last century. Harbinger’s chassis has been designed to last just as long, including its custom-made, gigacast battery packs, which were designed for durability.

“If you took the battery pack out of a Tesla Model 3, and you put it in a commercial truck, and you tried to operate it in that environment, even if the cells lasted, I think the rest of the battery system would kind of shake itself to pieces,” Weicker said.

Harbinger customers can specify their desired pack size, and there’s even a hybrid model with an onboard generator for extended running. Harris, Harbinger’s CEO, declined to say when the company’s chassis will be in full production other than “very soon.” The company has 4,000 preorders on the books, and it has already delivered pre-production models to customers like Thor.

It’s a modest start for the company, which today counts 330 employees, but in an age of EV startups promising the moon and delivering little more than hype, the Harbinger’s focus on the basics is refreshing—and encouraging.

This EV could reboot medium-duty trucking by not reinventing the wheel Read More »

rfk-jr.-promptly-cancels-vaccine-advisory-meeting,-pulls-flu-shot-campaign

RFK Jr. promptly cancels vaccine advisory meeting, pulls flu shot campaign

Indefinite changes

Stat asked the HHS specifically about the Wild to Mild campaign as well as promotional campaigns for other vaccines, but an HHS spokesperson puzzlingly responded with a statement saying: “No, the CDC was not told to take down the flu vaccination campaign webpage,” which wasn’t what the outlet had asked about.

The statement went on to say: “Unfortunately, officials inside the CDC who are averse to Secretary Kennedy and President Trump’s agenda seem to be intentionally falsifying and misrepresenting guidance they receive.” NPR received the same statement.

Meanwhile on Thursday, The Washington Post reported that the HHS told the CDC to indefinitely postpone a meeting of its vaccine advisory committee (the Advisory Committee on Immunization Practices, or ACIP), which Kennedy has criticized. ACIP, comprised of independent experts, meets regularly to review and discuss vaccine safety and efficacy data and vote on recommendations.

ACIP was previously scheduled to meet February 26 to 28 to discuss a large number of vaccines, including those against meningitis, influenza, RSV, chikungunya, HPV, mpox, pneumococcal infections, Lyme disease, COVID-19, and CMV. An HHS spokesperson told the Post that the meeting was “postponed to accommodate public comment in advance of the meeting,” but there is no rescheduled date.

Leading medical experts and organizations, such as the American Medical Association, quickly sent a joint letter urging Kennedy to preserve the meeting. “Each ACIP meeting holds tremendous weight and relevance,” the letter states. ‘Infectious diseases are constantly evolving opponents; vaccines are among the best tools for constantly adapting and responding to the latest public health threats. … Making America healthy requires healthy discussion and timely, evidence-based decisions. This meeting should be no different.”

But, also on Thursday, Politico reported that Kennedy is preparing to remove ACIP members. And, the AP noted earlier that during a speech to HHS employees on Tuesday, Kennedy vowed to investigate the CDC’s childhood vaccine schedule, despite assuring senators prior to his confirmation that he would not make changes to it.

RFK Jr. promptly cancels vaccine advisory meeting, pulls flu shot campaign Read More »

apple-pulls-end-to-end-encryption-in-uk,-spurning-backdoors-for-gov’t-spying

Apple pulls end-to-end encryption in UK, spurning backdoors for gov’t spying

“We are gravely disappointed that the protections provided by ADP will not be available to our customers in the UK given the continuing rise of data breaches and other threats to customer privacy,” Apple said. “Enhancing the security of cloud storage with end-to-end encryption is more urgent than ever before.”

For UK Apple users, some data can still be encrypted. iCloud Keychain and Health, iMessage, and FaceTime will remain end-to-end encrypted by default. But other iCloud services will not be encrypted, effective immediately, including iCloud Backup, iCloud Drive, Photos, Notes, Reminders, Safari Bookmarks, Siri Shortcuts, Voice memos, Wallet passes, and Freeform.

In the future, Apple hopes to restore data protections in the UK, but the company refuses to ever build a backdoor for government officials.

“Apple remains committed to offering our users the highest level of security for their personal data and are hopeful that we will be able to do so in the future in the United Kingdom,” Apple said. “As we have said many times before, we have never built a backdoor or master key to any of our products or services, and we never will.”

Apple pulls end-to-end encryption in UK, spurning backdoors for gov’t spying Read More »

under-new-bill,-bigfoot-could-become-california’s-“official-cryptid”

Under new bill, Bigfoot could become California’s “official cryptid”

You might suspect that a one-line bill about Bigfoot that bears the number “666” is a joke, but AB-666 is apparently a serious offering from California Assemblymember Chris Rogers. Rogers represents a California district known for its Bigfoot sightings (or “sightings,” depending on your persuasion—many of these have been faked), and he wants to make Bigfoot the “official cryptid” of the state.

His bill notes that California already has many official symbols, including the golden poppy (official flower), the California redwood (official tree), the word “Eureka” (official motto), the red-legged frog (official amphibian), the grizzly bear (official animal), swing dancing (official dance), and the saber-toothed cat (official fossil). The state has so many of these that there are separate categories for freshwater fish (golden trout) and marine fish (garibaldi). So why not, Rogers wants to know, “designate Bigfoot as the official state cryptid”?

That’s… pretty much the bill, which was introduced this week and already has Bigfoot advocates excited. SFGate talked to Matt Moneymaker, who it describes as “a longtime Bigfoot researcher and former star of the Animal Planet series Finding Bigfoot,” about the bill. Moneymaker loves it, noting that he has personally “had a face-to-face encounter one time, after which I was absolutely sure they existed because I had one about 20 feet in front of me, growling at me.”

Rogers represents California Assembly District 2, a sprawling expanse of Northern California that includes the town of Willow Grove, epicenter of the early Bigfoot sightings back in the 1950s. Today, the small community boasts the Bigfoot Museum, the Bigfoot Motel, and the Bigfoot Steakhouse—to say nothing of Bigfoot’s Barbershop, Bigfoot Equipment & Repair, and, of course, the Bigfoot Cannabis Company. The bill seems like an easy way to goose interest in Bigfoot and to reap the tourist dollars that come from that interest.

This is not to deny the underlying reality of a Bigfoot-like creature (though you can indeed count me among the extremely, extremely skeptical—surely most of these sightings are of bears). Moneymaker runs the Bigfoot Field Researchers’ Organization, which tracks sightings across the US. The most recent one I could find on the site was report 77,879 (!), which came from rural Buchanan County, Virginia, on November 10, 2024.

Under new bill, Bigfoot could become California’s “official cryptid” Read More »

microsoft’s-new-ai-agent-can-control-software-and-robots

Microsoft’s new AI agent can control software and robots

The researchers' explanations about how

The researchers’ explanations about how “Set-of-Mark” and “Trace-of-Mark” work. Credit: Microsoft Research

The Magma model introduces two technical components: Set-of-Mark, which identifies objects that can be manipulated in an environment by assigning numeric labels to interactive elements, such as clickable buttons in a UI or graspable objects in a robotic workspace, and Trace-of-Mark, which learns movement patterns from video data. Microsoft says those features allow the model to complete tasks like navigating user interfaces or directing robotic arms to grasp objects.

Microsoft Magma researcher Jianwei Yang wrote in a Hacker News comment that the name “Magma” stands for “M(ultimodal) Ag(entic) M(odel) at Microsoft (Rese)A(rch),” after some people noted that “Magma” already belongs to an existing matrix algebra library, which could create some confusion in technical discussions.

Reported improvements over previous models

In its Magma write-up, Microsoft claims Magma-8B performs competitively across benchmarks, showing strong results in UI navigation and robot manipulation tasks.

For example, it scored 80.0 on the VQAv2 visual question-answering benchmark—higher than GPT-4V’s 77.2 but lower than LLaVA-Next’s 81.8. Its POPE score of 87.4 leads all models in the comparison. In robot manipulation, Magma reportedly outperforms OpenVLA, an open source vision-language-action model, in multiple robot manipulation tasks.

Magma's agentic benchmarks, as reported by the researchers.

Magma’s agentic benchmarks, as reported by the researchers. Credit: Microsoft Research

As always, we take AI benchmarks with a grain of salt since many have not been scientifically validated as being able to measure useful properties of AI models. External verification of Microsoft’s benchmark results will become possible once other researchers can access the public code release.

Like all AI models, Magma is not perfect. It still faces technical limitations in complex step-by-step decision-making that requires multiple steps over time, according to Microsoft’s documentation. The company says it continues to work on improving these capabilities through ongoing research.

Yang says Microsoft will release Magma’s training and inference code on GitHub next week, allowing external researchers to build on the work. If Magma delivers on its promise, it could push Microsoft’s AI assistants beyond limited text interactions, enabling them to operate software autonomously and execute real-world tasks through robotics.

Magma is also a sign of how quickly the culture around AI can change. Just a few years ago, this kind of agentic talk scared many people who feared it might lead to AI taking over the world. While some people still fear that outcome, in 2025, AI agents are a common topic of mainstream AI research that regularly takes place without triggering calls to pause all of AI development.

Microsoft’s new AI agent can control software and robots Read More »

isp-sued-by-record-labels-agrees-to-identify-100-users-accused-of-piracy

ISP sued by record labels agrees to identify 100 users accused of piracy

Cable company Altice agreed to give Warner and other record labels the names and contact information of 100 broadband subscribers who were accused of pirating songs.

The subscribers “were the subject of RIAA or third party copyright notices,” said a court order that approved the agreement between Altice and the plaintiff record companies. Altice is notifying each subscriber “of Altice’s intent to disclose their name and contact information to Plaintiffs pursuant to this Order,” and telling the notified subscribers that they have 30 days to seek relief from the court.

If subscribers do not object within a month, Altice must disclose the subscribers’ names, phone numbers, addresses, and email addresses. The judge’s order was issued on February 12 and reported yesterday by TorrentFreak.

The names and contact information will be classified as “highly confidential—attorneys’ eyes only.” A separate order issued in April 2024 said that documents produced in discovery “shall be used by the Parties only in the litigation of this Action and shall not be used for any other purpose.”

Altice, which operates the Optimum brand, was sued in December 2023 in US District Court for the Eastern District of Texas. The music publishers’ complaint alleges that Altice “knowingly contributed to, and reaped substantial profits from, massive copyright infringement committed by thousands of its subscribers.”

The lawsuit said plaintiffs sent over 70,000 infringement notices to Altice from February 2020 through November 2023. At least a few subscribers were allegedly hit with hundreds of notices. The lawsuit gave three examples of IP addresses that were cited in 502, 781, and 926 infringement notices, respectively.

Altice failed to terminate repeat infringers whose IP addresses were flagged in these copyright notices, the lawsuit said. “Those notices advised Altice of its subscribers’ blatant and systematic use of Altice’s Internet service to illegally download, copy, and distribute Plaintiffs’ copyrighted music through BitTorrent and other online file-sharing services. Rather than working with Plaintiffs to curb this massive infringement, Altice did nothing, choosing to prioritize its own profits over its legal obligations,” the plaintiffs alleged.

ISPs face numerous lawsuits

This is one of numerous copyright lawsuits filed against broadband providers, and it’s not the first time an ISP handed names of subscribers to the plaintiffs. We have previously written articles about film studios trying to force Reddit to identify users who admitted torrenting in discussion forums. Reddit was able to avoid providing information in one case in part because the film studios already obtained identifying details for 118 subscribers directly from Grande, the ISP they had sued.

ISP sued by record labels agrees to identify 100 users accused of piracy Read More »