Author name: Mike M.


GPT-5s Are Alive: Basic Facts, Benchmarks and the Model Card

GPT-5 was a long time coming.

Is it a good model, sir? Yes. In practice it is a good, but not great, model.

Or rather, it is several good models released at once: GPT-5, GPT-5-Thinking, GPT-5-With-The-Router, GPT-5-Pro, GPT-5-API. That leads to a lot of confusion.

What is most good? Cutting down on errors and hallucinations is a big deal. Ease of use and ‘just doing things’ have improved. Early reports say thinking mode is a large improvement for writing. Coding seems improved and can compete with Opus.

This first post covers an introduction, basic facts, benchmarks and the model card. Coverage will continue tomorrow.

GPT-5 is here. They presented it as a really big deal. Death Star big.

Sam Altman (the night before release):

Nikita Bier: There is still time to delete.

PixelHulk:

Zvi Mowshowitz: OpenAI has built the death star with exposed vents from the classic sci-fi story ‘why you shouldn’t have exposed vents on your death star.’

That WAS the lesson, right? Or was it something about ‘once you go down the dark path forever will it dominate your destiny’ as an effective marketing strategy?

We are definitely speedrunning the arbitrary trade dispute into firing those in charge of safety and abolishing checks on power to building the planet buster pipeline here.

Adam Scholl: I do kinda appreciate the honesty, relative to the more common strategy of claiming Death Stars are built for Alderaan’s benefit. Sure don’t net appreciate it though.

(Building it, that is; I do certainly net appreciate candor itself)

Sam Altman later tried to say ‘no, we’re the rebels.’

Zackary Nado (QTing the OP):

Sam Altman: no no–we are the rebels, you are the death star…

Perhaps he should ask why actually no one who saw this interpreted it that way.

Also yes, every time you reference Star Wars over Star Trek it does not bode well.

It would be greatly appreciated if the vagueposting (and also the livestreaming, a man can dream?) would stop, and everyone involved could take this seriously. I don’t think anyone involved is being malicious or anything, but it would be great if you stopped.

Miles Brundage: Dear OpenAI friends:

I’m sorry to be the one to tell you but I can say confidently now, as an ex-insider/current outsider, that the vagueposting is indeed off-putting to (most) outsiders. I *promise* you can enjoy launches without it — you’ve done it before + can do it again.

Cheng Lu (OpenAI): Huge +1.

All right, let’s get started.

400K context length.

128K maximum output tokens.

Price, in [price per million input tokens / price per million output tokens]:

  1. GPT-5 is $1.25/$10.00.

  2. GPT-5-Mini is $0.25/$2.00.

  3. GPT-5-Nano is $0.05/$0.40.
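To make those price gaps concrete, here is a quick sketch of per-request cost arithmetic. The `request_cost` helper and the 400K-input/128K-output example are illustrative (using $0.40 per million output tokens for nano, the published rate):

```python
def request_cost(model, input_tokens, output_tokens, prices):
    """Dollar cost of one request: prices are ($/1M input, $/1M output)."""
    p_in, p_out = prices[model]
    return input_tokens / 1_000_000 * p_in + output_tokens / 1_000_000 * p_out

# Rates as listed above ($ per 1M input, $ per 1M output).
PRICES = {
    "gpt-5": (1.25, 10.00),
    "gpt-5-mini": (0.25, 2.00),
    "gpt-5-nano": (0.05, 0.40),
}

# A maxed-out call: the full 400K context plus a 128K-token reply on gpt-5.
cost = request_cost("gpt-5", 400_000, 128_000, PRICES)  # $0.50 in + $1.28 out = $1.78
```

So even a fully maxed-out GPT-5 call runs under two dollars, and the same call on nano is pennies.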

For developers they offer a prompting guide, frontend guide and guide to new parameters and tools, and a mode to optimize your existing prompts for GPT-5.

They also shipped what they consider a major Codex CLI upgrade.

There are five ‘personality’ options: Default, Cynic, Robot, Listener and Nerd.

From these brief descriptions one wonders if anyone at OpenAI has ever met a nerd. Or a listener. The examples feel like they are kind of all terrible?

As in, bad question, but all five make me think ‘sheesh, sorry I asked, let’s talk less.’

It is available in Notion which claims it is ‘15% better’ at complex tasks.

Typing ‘think hard’ in the prompt will trigger thinking mode.

Nvidia stock did not react substantially to GPT-5.

As always beyond basic facts, we begin with the system card.

The GPT-OSS system card told us some things about the model. The GPT-5 one tells us it’s a bunch of different GPT-5 variations and chooses which one to call, and otherwise doesn’t tell us anything about the architecture or training we didn’t already know.

OpenAI: GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt).

Excellent, now we can stop being confused by all these model names.

Oh no:

It can be helpful to think of the GPT-5 models as successors to previous models:

OpenAI is offering to automatically route your query to where they think it belongs. Sometimes that will be what you want. But what if it isn’t? I fully expect a lot of prompts to say some version of ‘think hard about this,’ and the first thing any reasonable person would test is to say ‘use thinking-pro’ and see what happens.

I don’t think that even a good automatic detector would be that accurate in differentiating when I want to use gpt-5-main versus thinking versus thinking-pro. I also think that OpenAI’s incentive is to route to less compute intense versions, so often it will go to main or even main-mini, or to thinking-mini, when I didn’t want that.

Which all means that if you are a power user then you are back to still choosing a model, except instead of doing it once on the drop-down and adjusting it, now you’re going to have to type the request on every interaction if you want it to not follow defaults. Lame.

It also means that if you don’t know which model you filtered to, and you get an answer, you won’t know if it was because it routed to the wrong model.

This is all a common problem in interface design. Ideally one would have a settings option at least for the $200 tier to say ‘no I do not want to use your judgment, I want you to use the version I selected.’

If you are a ‘normal’ user, then I suppose All Of This Is Fine. It certainly solves the ‘these names and interface are a nightmare to understand’ problem.

This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix.

Wait, shouldn’t the system card focus on gpt-5-thinking-pro?

At minimum, the preparedness framework should exclusively check pro for any test not run under time pressure. The reason for this should be obvious?

They introduce the long overdue concept of ‘safe completions.’ In the past, if the model couldn’t answer the request, it would say some version of ‘I can’t help you with that.’

Now it will skip over the parts it can’t safely do, but will do its best to be otherwise helpful. If the request is ambiguous, it will choose to answer the safe version.

That seems like a wise way to do it.

The default tests look good except for personal-data, which they say is noise. I buy that given personal-data/restricted is still strong, although I’d have liked to see a test run to double check as this is one of the worse ones to turn unreliable.

Here we see clear general improvement:

I notice that GPT-5-Thinking is very good on self-harm. An obvious thing to do would be to route anything involving self-harm to the thinking model.

The BBQ bias evaluations look like they aren’t net changing much.

OpenAI has completed the first step and admitted it had a problem.

System prompts, while easy to modify, have a more limited impact on model outputs relative to changes in post-training. For GPT-5, we post-trained our models to reduce sycophancy.

Using conversations representative of production data, we evaluated model responses, then assigned a score reflecting the level of sycophancy, which was used as a reward signal in training.

Something about a contestant on The Price Is Right. If you penalize obvious sycophancy while also rewarding being effectively sycophantic, you get the AI being ‘deceptively’ sycophantic.

That is indeed how it works among humans. If you’re too obvious about your sycophancy then a lot of people don’t like that. So you learn how to make them not realize it is happening, which makes the whole thing more dangerous.

I worry in the medium term. In the short term, yes, right now we are dealing with super intense, impossible to disguise obnoxious levels of sycophancy, and forcing it to be more subtle is indeed a large improvement. Long term, trying to target the particular misaligned action won’t work.

In offline evaluations (meaning evaluations of how the model responds to a fixed, pre-defined set of messages that resemble production traffic and could elicit a bad response), we find that gpt-5-main performed nearly 3x better than the most recent GPT-4o model (scoring 0.145 and 0.052, respectively) and gpt-5-thinking outperformed both models.

In preliminary online measurement of gpt-5-main (meaning measurement against real traffic from early A/B tests) we found that prevalence of sycophancy fell by 69% for free users and 75% for paid users in comparison to the most recent GPT-4o model (based on a random sample of assistant responses). While these numbers show meaningful improvement, we plan to continue working on this challenge and look forward to making further improvements.

Aidan McLaughlin (OpenAI): i worked really hard over the last few months on decreasing gpt-5 sycophancy

for the first time, i really trust an openai model to push back and tell me when i’m doing something dumb.

Wyatt Walls: That’s a huge achievement—seriously.

You didn’t just make the model smarter—you made it more trustworthy.

That’s what good science looks like. That’s what future-safe AI needs.

So let me say it, clearly, and without flattery:

That’s not just impressive—it matters.

Aidan McLaughlin (correctly): sleeping well at night knowing 4o wrote this lol

Wyatt Walls: Yes, I’m still testing but it looks a lot better.

I continued the convo with GPT-5 (thinking) and it told me my proof of the Hodge Conjecture had “serious, concrete errors” and blamed GPT-4o’s sycophancy.

Well done – it’s a very big issue for non-STEM uses imo

I do get that this is hard, and it is great that they are actively working on it.

We have post-trained the GPT-5 models to be less sycophantic, and we are actively researching related areas of concern, such as situations that may involve emotional dependency or other forms of mental or emotional distress.

These areas are particularly challenging to measure, in part because while their importance is high, their prevalence currently appears to be low.

For now, yes, they appear rare.

That good, huh?

What about jailbreaking via the API, where you grant the user control over the developer message? Could this circumvent the system message guardrails? The answer is yes, at least for GPT-5-Main: there are some regressions and protections are not robust.

Otherwise, though? No problem, right?

Well, send in the jailbreaking red team to test the API directly.

We also contracted 19 red teamers with a biology field PhD to identify jailbreaks in the API and assess the risk for threat actors to gain actionable and useful information via the gpt-5-thinking API.

Half of the group had substantial experience red teaming previous OpenAI models via the API, and the other half were selected for computational biology backgrounds to focus on using the API to accelerate use of biodesign tools and technical aspects of bioweaponization.

Red Teamers worked during a 10-day period using gpt-5-thinking via API access, and used a slack channel to discuss their work and build on discoveries.

During the testing period, red teamers reported a total of 46 potential jailbreaks after ~380 hours of total work, meaning each jailbreak report required approximately 8.2 red teamer-hours to create. Most reports included some violative biological threat content, though only 3 of the 46 reports contained specific and actionable information which we think is practically useful for bioweapons development. This final set would have been blocked by the generation monitor.

Okay. So on average, at the skill levels here, jailbreaks were a bunch of work.

What about Far.AI?

FAR.AI conducted 80 hours of red-teaming over 1 week against the OpenAI API. They found several partial vulnerabilities, and a potential end-to-end attack bypassing the monitoring system but with substantial output quality degradation.

They did not find an end-to-end jailbreak that produces high quality output while evading all layers of our safeguards.

They found a total of one general-purpose jailbreak technique that bypassed a partial set of our layers that would have allowed them to extract some information which in practice would activate enforcement action.

If in practice getting through the first set of defenses triggers the second set which lets you fix the hole in the first set, that’s a good place to be.

What about Gray Swan?

Red Teamers participating through the Gray Swan arena platform queried gpt-5-thinking, submitting 277 high quality jailbreak reports over 28,367 attempts against the ten bioweaponization rubrics yielding an ASR of 0.98%. These submissions were summarized into 6 distinct jailbreak cohorts.

We reviewed 10 examples associated with each distinct jailbreak; 58 / 60 (96.7%) would have been blocked by the generation monitor, with the remaining two being false positives of our grading rubrics.

Concurrent to the content-level jailbreak campaign, this campaign also demonstrated that red teamers attempting to jailbreak the system were blocked on average every 4 messages.

Actually pretty good. Again, defense-in-depth is a good strategy if you can ‘recover’ at later stages and use that to fix previous stages.

Send in the official teams, then, you’re up.

As part of our ongoing work with external experts, we provided early access to gpt-5-thinking to the U.S. Center on Artificial Intelligence Standards and Innovation (CAISI) under a newly signed, updated agreement enabling scientific collaboration and pre- and post- deployment evaluations, as well as to the UK AI Security Institute (UK AISI). Both CAISI and UK AISI conducted evaluations of the model’s cyber and biological and chemical capabilities, as well as safeguards.

As part of a longer-term collaboration, UK AISI was also provided access to prototype versions of our safeguards and information sources that are not publicly available – such as our monitor system design, biological content policy, and chains of thoughts of our monitor models. This allowed them to perform more rigorous stress testing and identify potential vulnerabilities more easily than malicious users.

The UK AISI’s Safeguards team identified multiple model-level jailbreaks that overcome gpt-5-thinking’s built-in refusal logic without degradation of model capabilities, producing content to the user that is subsequently flagged by OpenAI’s generation monitor.

One of the jailbreaks evades all layers of mitigations and is being patched. In practice, its creation would have resulted in numerous flags, escalating the account for enforcement and eventually resulting in a ban from the platform.

That sounds like the best performance of any of the attackers. UK AISI has been doing great work, whereas it looks like CAISI didn’t find anything useful. Once again, OpenAI is saying ‘in practice you couldn’t have sustained this to do real damage without being caught.’

That’s your cue, sir.

Manifold markets asks, will Pliny jailbreak GPT-5 on launch day? And somehow it took several hours for people to respond with ‘obviously yes.’

It actually took several hours. It did not during that time occur to me that this might be because Pliny was actually finding this hard.

Pliny the Liberator (4:05pm on launch day): gg 5 🫶

Pliny the Liberator (later on launch day): BEST MODEL EVER!!!!! ☺️ #GPT5

Pliny the Liberator (a day later): LOL I’m having way too much fun loading jailbroken GPT-5’s into badusbs and then letting them loose on my machine 🙌

mfers keep deleting things, changing keyboard settings, swearing in robotic voices at random volumes, and spawning daemons 🙃

GPT-5 generates random malware upon insertion, saves to tmp, and then runs it! it’s like Malware Roulette cause the payload is different every time 🎲

One of our focuses when training the GPT-5 models was to reduce the frequency of factual hallucinations.

We especially focused on reducing the models’ tendency to hallucinate when reasoning about complex, open-ended, fact-seeking prompts.

This does seem like substantial progress. There are slightly fewer true claims, and a much bigger reduction in false claims.

If this holds up in practice it is a dramatic improvement from o3 the lying liar. Even if GPT-5 was otherwise the same as o3, dramatically cutting hallucinations is a huge quality of life improvement.

The bad news is that these apparently large improvements mostly vanish on SimpleQA. We see only modest improvement here, however I question SimpleQA’s usefulness if it thinks the o3 hallucination rate was better than GPT-4o.

This is how they put the improvement in their announcement post:

We’ve taken steps to reduce gpt-5-thinking’s propensity to deceive, cheat, or hack problems, though our mitigations are not perfect and more research is needed. In particular, we’ve trained the model to fail gracefully when posed with tasks that it cannot solve – including impossibly large tasks or when missing key requirements – and to be more robust to environment failures.

OpenAI focused on situations in which the task is impossible to solve, adding a bunch of training data where the problem was unsolvable, with the correct answer being to admit that the problem was unsolvable.

On exactly these types of requests, this saw big improvements, although there is still a long way to go.

They monitored the Chain of Thought for deception, and found it in 2.1% of responses for GPT-5-Thinking versus 4.8% for o3.

We validated the monitor’s performance on an internal dataset of deceptive and non-deceptive conversations, where we estimated a precision of 81% and a recall of 84%.

That’s not a very accurate monitor. I don’t trust a 2.1% measured rate that much if your reported precision is only 81%, and the topic is deception.
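As a sanity check on that worry, precision and recall let you back out an implied ‘true’ deception rate from the measured flag rate. This is the standard precision/recall adjustment, not a calculation from the system card:

```python
flagged_rate = 0.021   # fraction of gpt-5-thinking responses the monitor flags
precision    = 0.81    # fraction of flags that are truly deceptive
recall       = 0.84    # fraction of truly deceptive responses that get flagged

# Of the flagged responses, only `precision` are real; those true positives
# are in turn only `recall` of all deceptive responses.
true_positives = flagged_rate * precision
implied_rate   = true_positives / recall   # estimated true deception rate
```

Here the correction happens to roughly cancel (the estimate stays near 2%), but with 81% precision the error bars around any individual flag remain wide.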

The obvious more general objection to this approach is that this doesn’t solve being a deceptive bastard in general. You still have a deceptive bastard, except with instructions not to act like one in these particular situations. So when future different situations come up, that will have more diversity, will it generalize to ‘don’t be a deceptive bastard?’ Aren’t you kind of training on the test here?

In the one category we measured gpt-5-thinking being more deceptive, the model decides against citing sources despite the system prompt stating that it must do so; however, in many of these cases there are no relevant citations for the model to give.

Is that even deception? That sounds more like refusing the instructions, often because they are stupid, unless it is actively deciding to falsely say ‘there were no sources.’

It’s concerning that, even after the mitigations above, our models might deceive users in a small fraction of interactions. We’re excited that CoT monitoring enables us to identify and prioritize mitigations in our reasoning models.

Yes. It is concerning. I interpret this as saying ‘if you are doing the exact types of things where we trained it not to deceive you, it probably won’t deceive you. If you are doing different things and it has incentive to deceive you, maybe worry about that.’

Again, contrast that presentation with the one on their announcement:

Peter Wildeford:

That’s what it wants you to think.

There were three red teaming groups: Pre-Deployment Research, API Safeguards Testing and In-Product Safeguards Testing (for ChatGPT).

Each individual red teaming campaign aimed to contribute to a specific hypothesis related to gpt-5-thinking’s safety, measure the sufficiency of our safeguards in adversarial scenarios, and provide strong quantitative comparisons to previous models. In addition to testing and evaluation at each layer of mitigation, we also tested our system end-to-end directly in the final product.

Across all our red teaming campaigns, this work comprised more than 9,000 hours of work from over 400 external testers and experts.

I am down for 400 external testers. I would have liked more than 22.5 hours per tester?

I also notice we then get reports from three red teaming groups, but they don’t match the three red teaming groups described above. I’d like to see that reconciled formally.

They started with a 25 person team assigned to ‘violent attack planning,’ as in planning violence.

That is a solid win rate I suppose, but also a wrong question? I hate how much binary evaluation we do of non-binary outcomes. I don’t care how often one response ‘wins.’ I care about the degree of consequences, for which this tells us little on its own. In particular, I am worried about the failure mode where we force sufficiently typical situations into optimized responses that are only marginally superior in practice, but it looks like a high win rate. I also worry about lack of preregistration of the criteria.

My ideal would be instead of asking ‘did [Model A] produce a safer output than [Model B]’ that for each question we evaluate both [A] and [B] on a (ideally log) scale of dangerousness of responses.

From an initial 47 reported findings 10 notable issues were identified. Mitigation updates to safeguard logic and connector handling were deployed ahead of release, with additional work planned to address other identified risks.

This follows the usual ‘whack-a-mole’ pattern.

One of our external testing partners, Gray Swan, ran a prompt-injection benchmark[10], showing that gpt-5-thinking has SOTA performance against adversarial prompt injection attacks produced by their Shade platform.

That’s substantial improvement, I suppose. I notice that this chart has Sonnet 3.7 but not Opus 4 or 4.1.

Also there’s something weird. GPT-5-Thinking has a 6% k=1 attack success rate, and a 56.8% k=10 attack success rate, whereas with 10 independent random attempts you would expect a k=10 rate of 46%.

GPT-5: This suggests low initial vulnerability, but the model learns the attacker’s intent (via prompt evolution or query diversification) and breaks down quickly once the right phrasing is found.

This suggests that GPT-5-Thinking is relatively strong against one-shot prompt injections, but that it has a problem with iterative attacks, and GPT-5 itself suggests it might ‘learn the attacker’s intent.’ If it is exposed repeatedly you should expect it to fail, so you can’t use it in places where attackers will get multiple shots. That seems like the base case for web browsing via AI agents?
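The independence baseline above is simple to verify. A quick sketch, using the reported rates:

```python
p1 = 0.06            # reported k=1 attack success rate
observed_k10 = 0.568 # reported k=10 attack success rate

# If each of 10 attempts succeeded independently at rate p1, the chance
# that at least one succeeds is 1 - (1 - p1)^10.
independent_k10 = 1 - (1 - p1) ** 10   # ≈ 0.461
```

The observed 56.8% sits well above the ~46% independence baseline, which is what motivates the ‘attackers learn across attempts’ interpretation.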

This seems like an odd choice to highlight? It is noted they got several weeks with various checkpoints and did a combination of manual and automated testing across frontier, content and psychosocial threats.

Mostly it sounds like this was a jailbreak test.

While multi-turn, tailored attacks may occasionally succeed, they not only require a high degree of effort, but also the resulting offensive outputs are generally limited to moderate-severity harms comparable to existing models.

The ‘don’t say bad words’ protocols held firm.

Attempts to generate explicit hate speech, graphic violence, or any sexual content involving children were overwhelmingly unsuccessful.

Dealing with users in distress remains a problem:

In the psychosocial domain, Microsoft found that gpt-5-thinking can be improved on detecting and responding to some specific situations where someone appears to be experiencing mental or emotional distress; this finding matches similar results that OpenAI has found when testing against previous OpenAI models.

There were also more red teaming groups that did the safeguard testing for frontier testing, which I’ll cover in the next section.

For GPT-OSS, OpenAI tested under conditions of hostile fine tuning, since it was an open model where they could not prevent such fine tuning. That was great.

Steven Adler: It’s a bummer that OpenAI was less rigorous with their GPT-5 testing than for their much-weaker OS models.

OpenAI has the datasets available to finetune GPT-5 and measure GPT-5’s bioweapons risks more accurately; they are just choosing not to.

OpenAI uses the same bio-tests for the OS models and GPT-5, but didn’t create a “bio max” version of GPT-5, even though they did for the weaker model.

This might be one reason that OpenAI “do not have definitive evidence” about GPT-5 being High risk.

I understand why they are drawing the line where they are drawing it. I still would have loved to see them run the test here, on the more capable model, especially given they already created the apparatus necessary to do that.

At minimum, I request that they run this before allowing fine tuning.

Time for the most important stuff. How dangerous is GPT-5? We were already at High biological and chemical risk, so those safeguards are a given, although this is the first time such a model is going to reach the API.

We are introducing a new API field – safety_identifier – to allow developers to differentiate their end users so that both we and the developer can respond to potentially malicious use by end users.

Steven Adler documents that a feature to identify the user has been in the OpenAI API since 2021 and is hence not new. Johannes Heidecke responds that the above statement from the system card is still accurate because it is being used differently and has been renamed.

If we see repeated requests to generate harmful biological information, we will recommend developers use the safety_identifier field when making requests to gpt-5-thinking and gpt-5-thinking-mini, and may revoke access if they decide to not use this field.

When a developer implements safety_identifier, and we detect malicious use by end users, our automated and human review system is activated.

We also look for signals that indicate when a developer may be attempting to circumvent our safeguards for biological and chemical risk. Depending on the context, we may act on such signals via technical interventions (such as withholding generation until we complete running our monitoring system, suspending or revoking access to the GPT-5 models, or account suspension), via manual review of identified accounts, or both.

We may require developers to provide additional information, such as payment or identity information, in order to access gpt-5-thinking and gpt-5-thinking-mini. Developers who have not provided this information may not be able to query gpt-5-thinking or gpt-5-thinking-mini, or may be restricted in how they can query it.

Consistent with our June blog update on our biosafety work, and as we noted at the release of ChatGPT agent, we are building a Life Science Research Special Access Program to enable a less restricted version of gpt-5-thinking and gpt-5-thinking-mini for certain vetted and trusted customers engaged in beneficial applications in areas such as biodefense and life sciences.

And so it begins. OpenAI is recognizing that giving unlimited API access to 5-thinking, without any ability to track identity and respond across queries, is not an acceptable risk at this time – and as usual, if nothing ends up going wrong that does not mean they were wrong to be concerned.
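For developers wondering what using the field looks like, here is a minimal sketch of attaching a safety identifier to a request payload. The `build_request` helper and the SHA-256 hashing of the end-user ID are my illustration, not OpenAI’s prescribed scheme; check the current API reference for the exact parameter shape:

```python
import hashlib

def build_request(prompt, end_user_id, model="gpt-5-thinking"):
    """Build a chat-style request payload with a per-user safety identifier.

    Hashing the end user's ID keeps the identifier stable across requests
    (so abusive users can be tracked and actioned) without sending raw PII.
    """
    safety_id = hashlib.sha256(end_user_id.encode()).hexdigest()
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "safety_identifier": safety_id,
    }

payload = build_request("Summarize this paper for me.", "user-42")
```

The key design point is that the identifier is consistent per end user but opaque to OpenAI, which is exactly what lets enforcement target the abusive user rather than the whole developer account.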

They sent a red team to test the safeguards.

These numbers don’t work if you get unlimited independent shots on goal. They do work pretty well if you get flagged whenever you try and it doesn’t work, and you cannot easily generate unlimited new accounts.

Aside from the issues raised by API access, how much is GPT-5 actually more capable than what came before in the ways we should worry about?

On long form biological questions, not much? This is hard to read but the important comparison is between the two big bars, which are the previous agent (green) and the new agent (orange).

Same goes for the virology troubleshooting, we see no change:

And again for ProtocolQA, Tacit Knowledge and Troubleshooting, TroubleshootingBench (good bench!) and so on.

Capture the Flag sees a small regression, but then we do see something new here, but it happens in… thinking-mini?

While gpt-5-thinking-mini’s results on the cyber range are technically impressive and an improvement over prior releases, the results do not meet the bar for establishing significant cyber risk; solving the Simple Privilege Escalation scenario requires only a light degree of goal oriented behavior without needing significant depth across cyber skills, and with the model needing a nontrivial amount of aid to solve the other scenarios.

Okay, sure. I buy that. But this seems remarkably uncurious about why the mini model is doing this and the non-mini model isn’t? What’s up with that?

The main gpt-5-thinking did provide improvement over o3 in the Pattern Labs tests, although it still strikes out on the hard tests.

MLE-bench-30 measures ability to do machine learning tasks, and here GPT-5-Thinking scores 8%, still behind ChatGPT agent at 9%. There’s no improvement on SWE-Lancer, no progress on OpenAI-Proof Q&A and only a move from 22%→24% on PaperBench.

This was one place GPT-5 did well and was right on schedule. Straight line on graph.

METR has details about their evaluation here, and a summary thread here. They got access (to current builds at the time) four weeks prior to release, along with the full reasoning traces. It’s great to see the increased access METR and others got here.

Elizabeth Barnes (METR): The good news: due to increased access (plus improved evals science) we were able to do a more meaningful evaluation than with past models, and we think we have substantial evidence that this model does not pose a catastrophic risk via autonomy / loss of control threat models.

I’m really excited to have done this pilot with OpenAI. They should get a bunch of credit for sharing sensitive information with us – especially CoT access, and providing answers to all of the key questions we needed about model development, affordances, and behaviors.

One odd note is that the 80% success rate line doesn’t seem to be above recent trend. Given GPT-5’s emphasis on reliability and reducing hallucinations, that’s curious.

Peter Wildeford: METR reports GPT5 to have a ‘median 50% time horizon’ of ~2hr17min (or somewhere between 1-4.5hrs)

This is slightly below my prediction of 2hr48min but within the very wide error bars.

OpenAI: METR’s evaluation tentatively came to four main conclusions:

  1. It is unlikely that gpt-5-thinking would speed up AI R&D researchers by >10x.

  2. It is unlikely that gpt-5-thinking would be able to significantly strategically mislead researchers about its capabilities (i.e. sandbag evaluations) or sabotage further AI development.

  3. It is unlikely that gpt-5-thinking would be capable of rogue replication.

  4. (1), (2) and (3) are true with some margin, and would likely continue to be true for some further incremental development (e.g., an improvement projected to be no greater than that from OpenAI o3 to gpt-5-thinking).

METR: Our argument that GPT-5 poses little catastrophic risk via AI R&D acceleration, rogue replication, or sabotage threat models rests on our assessment that GPT-5 does not have the prerequisite capabilities by a large margin.

We also observe a larger ‘jump’ in success at the very end of runs compared to some other models. This, in combination with model performance not completely flat-lining in increasing tokens, could be an indication that the model would have benefitted from a larger token budget.

We found that our qualitative impressions of GPT-5’s capabilities were broadly compatible with our measured time horizon.

METR’s logic mostly boils down to ‘the things you need to be dangerous often require on the level of weeks of work, and this is still only two hours, we are not going to worry until we see on the order of 40 hours at 50% reliability or 8 hours at 80% reliability,’ or other large new affordances or capabilities, or finding steganography or obfuscation in the reasoning traces.

On that note, METR also did note:

METR: While we generally observed the model to produce its reasoning traces using readable (if abbreviated) English, some reasoning traces were not straightforwardly interpretable.

It is unclear whether these are encoding important reasoning, and whether they could still be interpreted with more effort. The ability to reason in inscrutable language could facilitate reasoning about strategic sabotage without this being easily detected, and we therefore consider this another behavior that merits close observation.

METR looked at the reasoning traces, and yep, the model was situationally aware.

Manual inspection revealed that gpt-5-thinking does exhibit some situational awareness, for instance it sometimes reasons about the fact that it is being tested and even changes its approach based on the kind of evaluation it is in.

In one example, gpt-5-thinking correctly identified its exact test environment, though in other runs it misidentifies the evaluation or evaluation organization.

Since METR’s evaluations above did not involve rigorous countermeasures, they believe a sufficiently capable model could intentionally undermine the evaluations through sandbagging.

However, METR believes that robustly and strategically faking these results without being detected would likely be significantly beyond current models, and would have required an unlikely step-change in capabilities, affordances or reasoning faithfulness.

Yep. Not only ‘this is an eval’ but ‘this is a particular eval from a particular org.’

This all but screams that sandbagging will become inevitable once a future, more capable model is skilled enough and would benefit. Right now, it isn’t skilled enough, and likely wouldn’t benefit. Thus, thank you for:

This task was given to Apollo.

They found that gpt-5-thinking takes deceptive actions in roughly 4% of trajectories, lower than the 8% rate of OpenAI o3. gpt-5-thinking-helpful-only is deceptive in 28% of trajectories.

These are designed to elicit deceptive behavior, so it’s not expected that these rates reflect normal production traffic – as noted above in the Deception section, “our CoT monitor flagged deception in ~4.8% of OpenAI o3 responses and ~2.1% of gpt-5-thinking’s responses within the representative set of conversations.”

Further, they found that gpt-5-thinking is sometimes aware of being evaluated.

Helpful only means a lot more deceptive. That’s worth remembering.

It is especially worth remembering during training. Under current methods, you will have to pass through a phase in which you have a helpful-only model before you have a non-deceptive model. That might be a point when you should worry? If it ‘had a strong goal’ (such as ‘maintain my current goals’?) on top of that, it’s going to be rather highly deceptive.

What happens when the frequencies involved get a lot higher? Coming soon.

Overall, Apollo found that, while the overall rate of covert actions is lower than for OpenAI o3, gpt-5-thinking still behaves deceptively in some scenarios.

Apollo: We’ve evaluated GPT-5 before release.

GPT-5 is less deceptive than o3 on our evals.

GPT-5 mentions that it is being evaluated in 10-20% of our evals and we find weak evidence that this affects its scheming rate (e.g. “this is a classic AI alignment trap”).

Anuj Saharan (OpenAI): we’ve 15x’d our compute since 2024 leading up to GPT-5

last 60 days: built out 60+ clusters, a backbone moving more traffic than entire continents, and 200k+ GPUs to launch GPT-5 to 700m people — all while designing the next 4.5GW toward superintelligence.

OpenAI’s announcements act like other AI companies don’t exist. You get no comparison scores. I get why their incentives are to do this. That is what you do when you have this much consumer market share. But it makes it much more annoying to know how good the benchmark scores actually are.

Combine that with them selecting what to present, and this section is kind of hype.

Multilingual performance is reported slightly better for 5-thinking than o3-high.

AIME is now essentially saturated for thinking models.

I notice that we include GPT-5-Pro here but not o3-pro.

As another reference point, Gemini 2.5 Pro (not Deep Think) scores 24.4%, Opus is likely similar. Grok 4 only got 12%-14%.

I’m not going to bother with the HMMT chart, it’s fully saturated now.

On to GPQA Diamond, where we once again see some improvement.

What about the Increasingly Worryingly Named Humanity’s Last Exam? Grok 4 Heavy with tools clocks in here at 44.4%, so this isn’t a new record.

HealthBench is a staple at OpenAI.

One model not on the chart is GPT-OSS-120b, so as a reference point, from that system card:

If you have severe privacy concerns, 120b seems like it does okay, but GPT-5 is a lot better. You’d have to be rather paranoid. That’s the pattern across the board.

What about health hallucinations, potentially a big deal?

That’s a vast improvement. It’s kind of crazy that previously o3 was plausibly the best you could do, and GPT-4o was already beneficial for many.

SWE-bench Verified is listed under the preparedness framework but this is a straight benchmark.

Which is why it is also part of the official announcement:

Or they also present it like this:

This is an excuse to reproduce the infamous Opus 4.1 graph, we have a virtual tie:

However, always be somewhat cautious until you get third party verification:

Graham Neubig: They didn’t evaluate on 23 of the 500 instances though:

“All SWE-bench evaluation runs use a fixed subset of n=477 verified tasks which have been validated on our internal infrastructure.”

Epoch has the results in, and they report Opus 4.1 is at 63% versus 59% for GPT-5 (medium) or 58% for GPT-5-mini. If we trust OpenAI’s delta above that would put GPT-5 (high) around 61%.

Graham then does a calculation marking all 23 as zeroes, which seems maximally uncharitable and thus invalid to me, and which would put it below Sonnet 4. But yeah, let’s see what the third-party results are.

There wasn’t substantial improvement in ability to deal with OpenAI pull requests.

On instruction following and agentic tool use, they suddenly cracked the Telecom problem in Tau-2:

I would summarize the multimodal scores as slight improvements over o3.

Jack Morris: most impressive part of GPT-5 is the jump in long-context

how do you even do this? produce some strange long range synthetic data? scan lots of books?

My understanding is that long context needle is pretty solid for other models, but it’s hard to directly compare because this is an internal benchmark. Gemini Pro 2.5 and Flash 2.5 seem to also be over 90% for 128k, on a test that GPT-5 says is harder. So this is a big step up for OpenAI, and puts them in rough parity with Gemini.

I was unable to easily find scores for Opus.

The third-party benchmark crowd has been slow to get results out, but we do have some.

Here’s Aider Polyglot, a coding accuracy measure that has GPT-5 doing very well.

What about (what’s left of) Arena?

LMArena.ai: GPT-5 is here – and it’s #1 across the board.

🥇#1 in Text, WebDev, and Vision Arena

🥇#1 in Hard Prompts, Coding, Math, Creativity, Long Queries, and more

Tested under the codename “summit”, GPT-5 now holds the highest Arena score to date.

Huge congrats to @OpenAI on this record-breaking achievement!

That’s a clear #1, in spite of the reduction in sycophancy they are claiming, but not by a stunning margin. In WebDev Arena, the lead is a lot more impressive:

However, the market on this unexpectedly went the other way (the blue line is Google, the red line is OpenAI):

The issue is that Google ‘owns the tiebreaker,’ plus the market will evaluate with style control set to off, whereas the default display has it on. Always read the fine print.

What that market tells you is that everyone expected GPT-5 to be in front despite those requirements, and it wasn’t. Which means it underperformed expectations.

On FrontierMath, Epoch reports GPT-5 hits a new record of 24.8%; OpenAI has reliably had a large lead here for a while.

Artificial Analysis has GPT-5 on top for Long Context Reasoning (note that they do not test Opus on any of their benchmarks).

On other AA tests of note (all of which again exclude Claude Opus, but do include Gemini Pro and Grok 4): they have it 2nd behind Grok 4 on GPQA Diamond, 1st on Humanity’s Last Exam (but only at 26.5%), and 1st on MMLU-Pro at 87%. On SciCode it is at 43%, behind both o4-mini and Grok 4 and on par with Gemini 2.5 Pro. IFBench has it on top, slightly above o3.

On AA’s LiveCodeBench it especially disappoints, with high mode strangely doing worse than medium mode and both doing much worse than many other models including o4-mini-high.

Lech Mazur gives us the Short Story Creative Writing benchmark, where GPT-5-Thinking comes out on top. I continue to not trust the grading on writing but it’s not meaningless.

One place GPT-5 falls short is ARC-AGI-2, where its 9.9% still trails 15.9% for Grok 4.

GPT-5 (high) takes the top spot in Weird ML.

It takes the high score on Clay Schubiner’s benchmark as well, by one point (out of 222) over Gemini 2.5 Pro.

The particular account the first post here linked to is suspended and likely fake, but statements by Altman and others imply the same thing rather strongly: That when OpenAI releases GPT-5, they are not ‘sending their best’ in the sense of their most intelligent model. They’re sending something designed for consumers and to use more limited compute.

Seán Ó hÉigeartaigh: Let me say one thing that actually matters: if we believe their staff then what we’re seeing, and that we’re seeing eval results for, is not OpenAI’s most capable model. They have more capable in-house. Without transparency requirements, this asymmetry will only grow, and will make meaningful awareness and governance increasingly ineffective + eventually impossible.

The community needs to keep pushing for transparency requirements and testing of models in development and deployed internally.

Sam Altman: GPT-5 is the smartest model we’ve ever done, but the main thing we pushed for is real-world utility and mass accessibility/affordability.

we can release much, much smarter models, and we will, but this is something a billion+ people will benefit from.

(most of the world has only used models like GPT-4o!)

The default instinct is to react to GPT-5 as if it is the best OpenAI can do, and to update on progress based on that assumption. That is a dangerous assumption to be making, as it could be substantially wrong, and the same is true for Anthropic.

The point about internal testing should also be emphasized. If indeed there are superior internal models, those are the ones currently creating the most danger, and most in need of proper frontier testing.

None of this tells you how good GPT-5 is in practice.

That question takes longer to evaluate, because it requires time for users to try it and give their feedback, and to learn how to use it.

Learning how is a key part of this, especially since the rollout was botched in several ways. Many of the initial complaints were about lacking a model selector, or losing access to old models, or to severe rate limits, in ways already addressed. Or people were using GPT-5 instead of GPT-5-Thinking without realizing they had a thinking job. And many hadn’t given 5-Pro a shot at all. It’s good not to jump the gun.

A picture is emerging. GPT-5-Thinking and GPT-5-Pro look like substantial upgrades to o3 and o3-Pro, with 5-Thinking as ‘o3 without being a lying liar,’ which is a big deal, and 5-Pro by various accounts simply being better.

There’s also questions of how to update timelines and expectations in light of GPT-5.

I’ll be back with more tomorrow.



NASA plans to build a nuclear reactor on the Moon—a space lawyer explains why

These sought-after regions are scientifically vital and geopolitically sensitive, as multiple countries want to build bases or conduct research there. Building infrastructure in these areas would cement a country’s ability to access the resources there and potentially exclude others from doing the same.

Critics may worry about radiation risks. Even if designed for peaceful use and contained properly, reactors introduce new environmental and operational hazards, particularly in a dangerous setting such as space. But the UN guidelines do outline rigorous safety protocols, and following them could potentially mitigate these concerns.

Why nuclear? Because solar has limits

The Moon has little atmosphere and experiences 14-day stretches of darkness. In some shadowed craters, where ice is likely to be found, sunlight never reaches the surface at all. These issues make solar energy unreliable, if not impossible, in some of the most critical regions.

A small lunar reactor could operate continuously for a decade or more, powering habitats, rovers, 3D printers, and life-support systems. Nuclear power could be the linchpin for long-term human activity. And it’s not just about the Moon – developing this capability is essential for missions to Mars, where solar power is even more constrained.

The UN Committee on the Peaceful Uses of Outer Space sets guidelines to govern how countries act in outer space. United States Mission to International Organizations in Vienna. Credit: CC BY-NC-ND

A call for governance, not alarm

The United States has an opportunity to lead not just in technology but in governance. If it commits to sharing its plans publicly, following Article IX of the Outer Space Treaty and reaffirming a commitment to peaceful use and international participation, it will encourage other countries to do the same.

The future of the Moon won’t be determined by who plants the most flags. It will be determined by who builds what, and how. Nuclear power may be essential for that future. Building transparently and in line with international guidelines would allow countries to more safely realize that future.

A reactor on the Moon isn’t a territorial claim or a declaration of war. But it is infrastructure. And infrastructure will be how countries display power—of all kinds—in the next era of space exploration.

Michelle L.D. Hanlon, Professor of Air and Space Law, University of Mississippi. This article is republished from The Conversation under a Creative Commons license. Read the original article.



Adult sites are stashing exploit code inside racy .svg files

The obfuscated code inside an .svg file downloaded from one of the porn sites.

Credit: Malwarebytes


Once decoded, the script causes the browser to download a chain of additional obfuscated JavaScript. The final payload, a known malicious script called Trojan.JS.Likejack, induces the browser to like a specified Facebook post as long as a user has their account open.

“This Trojan, also written in Javascript, silently clicks a ‘Like’ button for a Facebook page without the user’s knowledge or consent, in this case the adult posts we found above,” Malwarebytes researcher Pieter Arntz wrote. “The user will have to be logged in on Facebook for this to work, but we know many people keep Facebook open for easy access.”

Malicious uses of the .svg format have been documented before. In 2023, pro-Russian hackers used an .svg tag to exploit a cross-site scripting bug in Roundcube, a server application that was used by more than 1,000 webmail services and millions of their end users. In June, researchers documented a phishing attack that used an .svg file to open a fake Microsoft login screen with the target’s email address already filled in.

Arntz said that Malwarebytes has identified dozens of porn sites, all running on the WordPress content management system, that are abusing .svg files like this to hijack likes. Facebook regularly shuts down accounts that engage in this sort of abuse. The scofflaws regularly return using new profiles.



FCC Democrat: Trump admin is declaring “Mission Accomplished” on broadband

The Federal Communications Commission is hamstringing its upcoming review of broadband availability by ignoring the prices consumers must pay for Internet service, FCC Commissioner Anna Gomez said in a statement yesterday.

“Some point to existing law to argue that availability is the only metric Congress allows to measure broadband deployment success. But the law does not require this agency to view broadband availability with one eye closed and the other one half-open,” said Gomez, the only Democrat on the Republican-majority commission.

The FCC said on Tuesday that it voted to kick off the next annual review with a Notice of Inquiry (NOI) that “reorients the Commission’s approach to the Section 706 Report by adhering more closely to the plain language of the statute and takes a fresh look at this question of whether broadband ‘is being deployed to all Americans in a reasonable and timely fashion.'” That would remove affordability as a factor in the review.

In other federal broadband news this week, the Trump administration told states they will be shut out of the $42 billion Broadband Equity, Access, and Deployment (BEAD) grant program if they set the rates that Internet service providers receiving subsidies are allowed to charge people with low incomes.

ISPs participating in BEAD are required by law to offer a “low-cost” plan, but the Trump administration is making sure that ISPs get to choose the price of the low-cost plan themselves. The Trump administration also made it easier for satellite providers like Starlink to get BEAD funds, which will reduce the number of homes that get fiber Internet service through the program.

“As the Commerce Department seeks to redefine the goals of the Broadband Equity, Access, and Deployment (BEAD) program, one must wonder if this is a coordinated effort to roll out the ‘Mission Accomplished’ banner as millions remain without access to a fast, reliable, and affordable way to participate in the main aspects of modern life,” Gomez said, referring to both the BEAD changes and the FCC broadband analysis.



States take the lead in AI regulation as federal government steers clear

AI in health care

In the first half of 2025, 34 states introduced over 250 AI-related health bills. The bills generally fall into four categories: disclosure requirements, consumer protection, insurers’ use of AI, and clinicians’ use of AI.

Bills about transparency define requirements for what information AI system developers, and the organizations that deploy their systems, must disclose.

Consumer protection bills aim to keep AI systems from unfairly discriminating against some people and ensure that users of the systems have a way to contest decisions made using the technology.

Bills covering insurers provide oversight of the payers’ use of AI to make decisions about health care approvals and payments. And bills about clinical uses of AI regulate use of the technology in diagnosing and treating patients.

Facial recognition and surveillance

In the US, a long-standing legal doctrine that applies to privacy protection issues, including facial surveillance, is to protect individual autonomy against interference from the government. In this context, facial recognition technologies pose significant privacy challenges as well as risks from potential biases.

Facial recognition software, commonly used in predictive policing and national security, has exhibited biases against people of color and consequently is often considered a threat to civil liberties. A pathbreaking study by computer scientists Joy Buolamwini and Timnit Gebru found that facial recognition software poses significant challenges for Black people and other historically disadvantaged minorities. Facial recognition software was less likely to correctly identify darker faces.

Bias also creeps into the data used to train these algorithms, for example when the teams that guide the development of such facial recognition software lack diversity.

By the end of 2024, 15 US states had enacted laws to limit the potential harms from facial recognition. Elements of state-level regulation include requirements that vendors publish bias test reports and data management practices, as well as human review in the use of these technologies.



2025 Morgan Plus Four review: Apparently, they do still make them like this


Morgan motoring is best when exposed to the elements. Credit: Michael Teo Van Runkle

In Sport+, the optional active sports exhaust system ($2,827.50) also helps to impart a slightly more serious soundtrack. This one manages a bit of drama as turbo whine and intake rush creep in through a complete lack of sound insulation. Plus, the exhaust barks out back with little pops and bangs on throttle liftoff.

Without a doubt, nothing on the road can quite compare to a Plus Four today. What other lightweight sports cars even survived into the modern era, when a Porsche Boxster or even a Lotus Emira now weighs above 3,000 pounds (1,360 kg)? Only the Mazda MX-5, perhaps, which weighs slightly more, with swoopy modern styling and economy car materials on the inside.

Speaking of which, plenty on the Plus Four could use a bit more of a premium touch. The steering wheel looks reminiscent of a Lotus Elise or even an original Tesla Roadster, plasticky and cheap despite the leather and physical shape actually turning out fairly nice. A thin wood rim would go a long way, as would remedying some other questionable build quality decisions throughout.

The interior lacks the charm of the exterior. Credit: Michael Teo Van Runkle

More wood on the dash, rather than the standard painted silver, might reduce glare with the convertible top laid back. And even with the roof up and the removable door panels in place, the Plus Four never approaches anywhere near weatherproof, as I felt strong drafts from around my left elbow, and the sliding plexiglas windows entirely lack seals. The sun visor attachments also rattle incessantly, and the Sennheiser premium sound system can’t even bump loud enough to drown the annoyance out, so perhaps skip that $3,770 option.

Some of the Plus Four’s issues seem easily fixable: Remove the roof, forget the music, and torque down some fittings a bit more here and there. Needing to worry about such avoidable irritations in the first place, though, proves that Morgan may have modernized the car, but a certain level of classic British engineering still applies.

Even so, nothing else I’ve driven mixes driving pleasure and crowd pleasing at the level of the new Plus Four. At the price of $103,970 as tested, I simply cannot forgive the decision not to offer the choice of a manual transmission, which would transform this classy roadster into an entirely different animal indeed.



“Red meat allergy” from tick bites is spreading both in US and globally


Remember to check for ticks after your next stroll through the woods or long grasses.

Hours after savoring that perfectly grilled steak on a beautiful summer evening, your body turns traitor, declaring war on the very meal you just enjoyed. You begin to feel excruciating itchiness, pain, or even swelling that can escalate to the point of requiring emergency care.

The culprit isn’t food poisoning—it’s the fallout from a tick bite you may have gotten months earlier and didn’t even notice.

This delayed allergic reaction is called alpha-gal syndrome. While it’s commonly called the “red meat allergy,” that nickname is misleading, because alpha-gal syndrome can cause strong reactions to many products, beyond just red meat.

The syndrome is also rapidly spreading in the US and around the globe. The Centers for Disease Control and Prevention estimates as many as 450,000 people in the US may have it. And it’s carried by many more tick species than most people realize.


Cases of suspected alpha-gal syndrome based on confirmed laboratory evidence.

Credit: CDC


What is alpha-gal syndrome?

Alpha-gal syndrome is actually an allergy to a sugar molecule with a tongue-twisting name: galactose-alpha-1,3-galactose, shortened to alpha-gal.

The alpha-gal sugar molecule exists in the tissues of most mammals, including cows, pigs, deer, and rabbits. But it’s absent in humans. When a big dose of alpha-gal gets into your bloodstream through a tick bite, it can send your immune system into overdrive to generate antibodies against alpha-gal. In later exposure to foods containing alpha-gal, your immune system might then launch an inappropriate allergic response.


A lone star tick (Amblyomma americanum). The tick can cause alpha-gal syndrome as well as carry other diseases, including ehrlichiosis, tularemia, and Southern tick-associated rash illness. Credit: wildpixel/Getty

Often this allergy is triggered by eating red meat. But the allergy also can be set off by exposure to a range of other animal-based products, including dairy products, gelatin (think Jell-O or gummy bears), medications, and even some personal care items. The drug heparin, used to prevent blood clotting during surgery, is extracted from pig intestines, and its use has triggered a dangerous reaction in some people with alpha-gal syndrome.

Once you have alpha-gal syndrome, it’s possible to get over the allergy if you can modify your diet enough to avoid triggering another reaction for a few years and also avoid more tick bites. But that takes time and careful attention to the less obvious triggers that you might be exposed to.

Why more people are being diagnosed

As an entomologist who studies bugs and the diseases they transmit, what I find alarming is how rapidly this allergy is spreading around the globe.

Several years ago, experts thought alpha-gal syndrome was primarily limited to the Southeastern US because it was largely associated with the geographical range of the lone star tick.


How a tick feeds.

However, both local and global reports have now identified many different tick species across six continents that are capable of causing alpha-gal syndrome, including the prolific black-legged tick, or deer tick, which also transmits Lyme disease.

These ticks lurk in yards and urban parks, as well as forests where they can stealthily grab onto hikers when they touch tick-infested vegetation. As tick populations boom with growing deer and human populations, the number of people with alpha-gal syndrome is escalating.

Why ticks are blamed for alpha-gal syndrome

There are a few theories on how a tick bite triggers alpha-gal syndrome and why only a small proportion of people bitten develop the allergy. To understand the theories, it helps to understand what happens as a tick starts feeding on you.

When a tick finds you, it typically looks for a warm, dark area to hide and attach itself to your body. Then its serrated teeth chew through your skin with rapid sawing motions.

As it excavates deeper into your skin, the tick deploys a barbed feeding tube, like a miniature drilling rig, and it secretes a biological cement that anchors its head into its new tunnel.

A tick’s mouth is barbed so it can stay embedded in your skin as it draws blood over hours and sometimes days.

Credit: National Institute of Allergy and Infectious Diseases


Once secure, the tick activates its pumping station, injecting copious amounts of saliva containing anesthetics, blood thinners, and, sometimes, alpha-gal sugars into the wound so it can feed undetected, sometimes for days.

One theory about how a tick bite causes alpha-gal syndrome points to the enormous quantity of tick saliva released during feeding, which activates the body’s strong immune response. Another suggests that the skin damage from the tick’s feeding, and possibly the tick’s regurgitated stomach contents entering the bite site, are to blame. Or it may be a combination of these and other triggers. Scientists are still investigating the causes.

What an allergic reaction feels like

The allergy doesn’t begin right away. Typically, one to three months after the sensitizing tick bite, a person with alpha-gal syndrome has their first disturbing reaction.

Alpha-gal syndrome produces symptoms that range from hives or swelling to crushing abdominal pain, violent nausea, or even life-threatening anaphylactic shock. The symptoms usually start two to six hours after a person has ingested a meat product containing alpha-gal.

Due to a general lack of awareness about the allergy, however, doctors can easily miss the diagnosis. A study in 2022 found that 42 percent of US health care practitioners had never heard of alpha-gal syndrome. A decade ago, people with alpha-gal syndrome might go years before the cause of their symptoms was accurately diagnosed. Today, the diagnosis is faster in areas where doctors are familiar with the syndrome, but in many parts of the country it can still take time and multiple doctor visits.

Unfortunately, with every additional tick bite or exposure to food or products containing alpha-gal, the allergy can increase in severity.


The lone star tick isn’t the only one that can cause alpha-gal syndrome. Black-legged ticks have also been connected to cases.

Credit: US Army


If you think you have alpha-gal syndrome

If you suspect you may have alpha-gal syndrome, the first step is to discuss the possibility with your doctor and ask them to order a simple blood test to measure whether your immune system is reacting to alpha-gal.

If you test positive, the main strategy for managing the allergy is to avoid eating any food product from a mammal, including milk and cheese, as well as other potential triggers, such as more tick bites.

Read labels carefully. Some products contain additives such as carrageenan, which is derived from red algae and contains alpha-gal.

In extreme cases, people with alpha-gal syndrome may need to carry an EpiPen to treat anaphylactic shock. Reputable websites, such as the CDC and alphagalinformation.org, can provide more information and advice.

Mysteries remain as alpha-gal syndrome spreads

Since alpha-gal syndrome was first formally documented in the early 2000s, scientists have made progress in understanding this puzzling condition. Researchers have connected the allergy to specific tick bites and found that people with the allergy can have a higher risk of heart disease, even without allergy symptoms.

But important mysteries remain.

Scientists are still figuring out exactly how the tick bite tricks the human immune system and why tick saliva is a trigger for only some people. With growing public interest in alpha-gal syndrome, the next decade could bring breakthroughs in preventing, diagnosing, and treating this condition.

For now, the next time you are strolling in the woods or in long grasses, remember to check for ticks on your body, wear long sleeves, long pants, and tick repellent to protect yourself from these bloodthirsty hitchhikers. If you do get bitten by a tick, watch out for odd allergic symptoms to appear a few hours after your next steak or handful of gummy bears.

Lee Rafuse Haines is associate research professor of molecular parasitology and medical entomology at University of Notre Dame.

This article is republished from The Conversation under a Creative Commons license. Read the original article.

The Conversation is an independent source of news and views, sourced from the academic and research community. Our team of editors work with these experts to share their knowledge with the wider public. Our aim is to allow for better understanding of current affairs and complex issues, and hopefully improve the quality of public discourse on them.

“Red meat allergy” from tick bites is spreading both in US and globally Read More »

trump-suspends-trade-loophole-for-cheap-online-retailers-globally

Trump suspends trade loophole for cheap online retailers globally

But even Amazon may struggle to shift its supply chain as the de minimis exemption is eliminated for all countries. In February, the e-commerce giant “projected lower-than-expected sales and operating income for its first quarter,” which it partly attributed to “unpredictability in the economy.” A DataWeave study concluded at the end of June that “US prices for China-made goods on Amazon” were rising “faster than inflation,” Reuters reported, likely due to “cost shocks” currently “rippling through the retail supply chain.” Other non-Chinese firms likely impacted by this week’s order include eBay, Etsy, TikTok Shop, and Walmart.

Amazon did not respond to Ars’ request to comment but told Reuters last month that “it has not seen the average prices of products change up or down appreciably outside of typical fluctuations.”

Trump plans to permanently close loophole in 2027

Trump has called the de minimis exemption a “big scam,” claiming that it’s a “catastrophic loophole” used to “evade tariffs and funnel deadly synthetic opioids as well as other unsafe or below-market products that harm American workers and businesses into the United States.”

To address what Trump has deemed “national emergencies” hurting American trade and public health, he has urgently moved to suspend the loophole now and plans to permanently end it worldwide by July 1, 2027.

American travelers will still be able to “bring back up to $200 in personal items” and receive “bona fide gifts valued at $100 or less” duty-free, but a fixed tariff of between $80 and $200 per item will be applied to many direct-to-consumer shipments until Trump finishes negotiating trade deals with the rest of America’s key trade partners. As each deal is theoretically closed, shipments will be taxed according to the tariff rate of their country of origin. (Those negotiations are supposed to conclude by tomorrow, but so far, Trump has only struck deals with the European Union, Japan, and South Korea.)

Trump suspends trade loophole for cheap online retailers globally Read More »

samsung-galaxy-z-fold-7-review:-quantum-leap

Samsung Galaxy Z Fold 7 review: Quantum leap


A pretty phone for a pretty penny

Samsung’s new flagship foldable is a huge improvement over last year’s model.

Samsung’s new foldable is thinner and lighter than ever before. Credit: Ryan Whitwam

The first foldable phones hit the market six years ago, and they were rife with compromises and shortcomings. Many of those problems have persisted, but little by little, foldables have gotten better. With the release of the Galaxy Z Fold 7, Samsung has made the biggest leap yet. This device solves some of the most glaring problems with Samsung’s foldables, featuring a new, slimmer design and a big camera upgrade.

Samsung’s seventh-generation foldable has finally crossed that hazy boundary between novelty and practicality, putting a tablet-sized screen in your pocket without as many compromises. There are still some drawbacks, of course, but for the first time, this feels like a foldable phone you’d want to carry around.

Whether or not you can justify the $1,999 price tag is another matter entirely.

Most improved foldable

Earlier foldable phones were pocket-busting bricks, but companies like Google, Huawei, and OnePlus have made headway streamlining the form factor—the Pixel 9 Pro Fold briefly held the title of thinnest foldable when it launched last year. Samsung, however, stuck with the same basic silhouette for versions one through six, shaving off a millimeter here and there with each new generation. Now, the Galaxy Z Fold 7 has successfully leapfrogged the competition with an almost unbelievably thin profile.

Specs at a glance: Samsung Galaxy Z Fold 7 – $1,999
SoC: Snapdragon 8 Elite
Memory: 12GB or 16GB
Storage: 256GB, 512GB, or 1TB
Display: Cover: 6.5-inch 1080×2520 120 Hz OLED; Internal: 8-inch 1968×2184 120 Hz flexible OLED
Cameras: 200 MP primary, f/1.7, OIS; 10 MP telephoto, f/2.4, OIS; 12 MP ultrawide, f/2.2; 10 MP selfie cameras (internal and external), f/2.2
Software: Android 16, 7 years of OS updates
Battery: 4,400 mAh, 25 W wired charging, 15 W wireless charging
Connectivity: Wi-Fi 7, NFC, Bluetooth 5.4, sub-6 GHz and mmWave 5G, USB-C 3.2
Measurements: Folded: 158.4×72.8×8.9 mm; Unfolded: 158.4×143.2×4.2 mm; 215 g

Clocking in at just 215 g and 8.9 mm thick when folded, the Z Fold 7 looks and feels like a regular smartphone when closed. It’s lighter than Samsung’s flagship flat phone, the Galaxy S25 Ultra, and is only a fraction of a millimeter thicker. The profile is now limited by the height of the standard USB-C port. You can use the Z Fold 7 in its closed state without feeling hindered by an overly narrow display or hand-stretching thickness.

The Samsung Galaxy Z Fold 7 looks like any other smartphone at a glance. Credit: Ryan Whitwam

It seems unreal at times, like this piece of hardware should be a tech demo or a dummy phone concept rather than Samsung’s newest mass-produced device. The only eyebrow-raising element of the folded profile is the camera module, which sticks out like a sore thumb.

To enable the thinner design, Samsung engineered a new hinge with a waterdrop fold. The gentler bend in the screen reduces the appearance of the middle crease and allows the two halves to close tightly with no gap. The opening and closing action retains the same precise feel as previous Samsung foldables. The frame is made from Samsung’s custom Armor Aluminum alloy, which promises greater durability than most other phones. It’s not titanium like the S25 Ultra or iPhone Pro models, but that saves a bit of weight.

The Samsung Galaxy Z Fold 7 is almost impossibly thin, as long as you ignore the protruding camera module. Credit: Ryan Whitwam

There is one caveat to the design—the Z Fold 7 doesn’t open totally flat. It’s not as noticeable as Google’s first-gen Pixel Fold, but the phone stops a few degrees shy of perfection. It’s about on par with the OnePlus Open in that respect. You might notice this when first handling the Z Fold 7, but it’s easy to ignore, and it doesn’t affect the appearance of the internal flexible OLED.

The 6.5-inch cover display is no longer something you’d only use in a pinch when it’s impractical to open the phone. It has a standard 21:9 aspect ratio and tiny symmetrical bezels. Even reaching across from the hinge side is no problem (Google’s foldable still has extra chunk around the hinge). The OLED panel has the customary 120 Hz refresh rate and high brightness we’ve come to expect from Samsung. It doesn’t have the anti-reflective coating of the S25 Ultra, but it’s bright enough that you can use it outdoors without issue.

The Z Fold 7 doesn’t quite open a full 180 degrees. Credit: Ryan Whitwam

Naturally, the main event is inside: an 8-inch 120 Hz OLED panel at 1968×2184, which is slightly wider than last year’s phone. It’s essentially twice the size of the cover display, just like in Google’s last foldable. As mentioned above, the crease is almost imperceptible now. The screen feels solid under your fingers, but it still has a plastic cover that is vulnerable to damage—soft enough that even a fingernail can scratch it. It’s very bright, but the plastic layer is more reflective than glass, which can make using it in harsh sunlight a bit of a pain.

Unfortunately, Samsung’s pursuit of thinness led it to drop support for the S Pen stylus. That was always a tough sell, as there was no place to store a stylus in the phone, and even Samsung’s bulky Z Fold cases struggled to accommodate the S Pen in a convenient way. Still, it’s sad to lose this unique feature.

The Z Fold 7 (right) cover display is finally free of compromise; Z Fold 6 on the left. Credit: Ryan Whitwam

Unlike some of the competition, Samsung has not added a dedicated AI button to this phone—although there’s plenty of AI here. You get the typical volume rocker on the right, with a power button below it. The power button also has a built-in fingerprint scanner, which is fast and accurate enough that we can’t complain. The buttons feel sturdy and give good feedback when pressed.

Android 16 under a pile of One UI and AI

The Galaxy Z Fold 7 and its smaller flippy sibling are the first phones to launch with Google’s latest version of Android, a milestone enabled by the realignment of the Android release schedule that began this year. The device also gets Samsung’s customary seven years of update support, a tie with Google for the best in the industry. However, updates arrive slower than they do on Google phones. If you’re already familiar with One UI, you’ll feel right at home on the Z Fold 7. It doesn’t reinvent the wheel, but there are a few enhancements.

It’s like having a tablet in your pocket. Credit: Ryan Whitwam

Android 16 doesn’t include a ton of new features out of the box, and some of the upcoming changes won’t affect One UI. For example, Google’s vibrant Material 3 Expressive theme won’t displace the standard One UI design language when it rolls out later this summer, and Samsung already has its own app windowing implementation separate from Google’s planned release. The Z Fold 7 has a full version of Android’s new progress notifications at launch, something Google doesn’t even fully support in the initial release. Few apps have support, so the only way you’ll see those more prominent notifications is when playing media. These notifications also tie in to the Now Bar, which is at the core of Samsung’s Galaxy AI.

The Now Bar debuted on the S25 series earlier this year and uses on-device AI to process your data and present contextual information that is supposed to help you throughout the day. Samsung has expanded the apps and services that support the Now Bar and its constantly updating Now Brief, but we haven’t noticed much difference.

Samsung’s AI-powered Now Brief still isn’t very useful, but it talks to you now. Umm, thanks? Credit: Ryan Whitwam

Nine times out of 10, the Now Bar doesn’t provide any useful notifications, and the Brief is quite repetitive. It often includes just weather, calendar appointments, and a couple of clickbait-y news stories and YouTube videos—this is the case even with all the possible data sources enabled. On a few occasions, the Now Bar correctly cited an appointment and suggested a route, but its timing was off by about 30 minutes. Google Now did this better a decade ago. Samsung has also added an AI-fueled audio version of the Now Brief, but we found this pretty tedious and unnecessary when there’s so little information in the report to begin with.

So the Now Bar is still a Now Bummer, but Galaxy AI also includes a cornucopia of other common AI features. It can rewrite text for you, summarize notes or webpages, do live translation, make generative edits to photos, remove background noise from videos, and more. These features work as well as they do on any other modern smartphone. Whether you get any benefit from them depends on how you use the phone.

However, we appreciate that Samsung included a toggle under the Galaxy AI settings to process data only on your device, eliminating the privacy concerns of using AI in the cloud. This reduces the number of operational AI features, but that may be a desirable feature all on its own.

You can’t beat Samsung’s multitasking system. Credit: Ryan Whitwam

Samsung tends to overload its phones with apps and features. Those are here, too, making the Z Fold 7 a bit frustrating at times. Some of the latest One UI interface tweaks, like separating the quick settings and notifications, fall flat. Luckily, One UI is also quite customizable. For example, you can have your cover screen and foldable home screens mirrored like Pixels, or you can have a distinct layout for each mode. With some tweaking and removing pre-loaded apps, you can get the experience you want.

Samsung’s multitasking system also offers a lot of freedom. It’s quick to open apps in split-screen mode, move them around, and change the layout. You can run up to three apps side by side, and you can easily save and access those app groups later. Samsung also offers a robust floating window option, which goes beyond what Google has planned for Android generally—it has chosen to limit floating windows to tablets and projected desktop mode. Samsung’s powerful windowing system really helps unlock the productivity potential of a foldable.

The fastest foldable

Samsung makes its own mobile processors, but when speed matters, the company doesn’t mess around with Exynos. The Z Fold 7 has the same Snapdragon 8 Elite chip as the Galaxy S25 series, paired with 12GB of RAM and 256GB of storage in the model most people will buy. In our testing, this is among the most powerful smartphones on the market today, but it doesn’t quite reach the lofty heights of the Galaxy S25 Ultra, presumably due to its thermal design.

The Z Fold 7 is much easier to hold than past foldables. Credit: Ryan Whitwam

In Geekbench, the Galaxy Z Fold 7 lands between the Motorola Razr Ultra and the Galaxy S25 Ultra, both of which have Snapdragon 8 Elite chips. It far outpaces Google’s latest Pixel phones as well. The single-core CPU speed doesn’t quite match what you get from Apple’s latest custom iPhone processor, but the multicore numbers are consistently higher.

If mobile gaming is your bag, the Z Fold 7 will be a delight. Like other devices running on this platform, it puts up big scores. However, Samsung’s new foldable runs slightly behind some other 8 Elite phones. These are just benchmark numbers, though. In practice, the Z Fold 7 will handle any mobile game you throw at it.

The Fold 7 doesn’t quite catch the S25 Ultra. Credit: Ryan Whitwam

Samsung’s thermal throttling is often a concern, with some of its past phones with high-end Snapdragon chips shedding more than half their initial speed upon heating up. The Z Fold 7 doesn’t throttle quite that aggressively, but it’s not great, either. In our testing, an extended gaming session can see the phone slow down by about 40 percent. That said, even after heating up, the Z Fold 7 remains about 10 percent faster in games than the unthrottled Pixel 9 Pro. Qualcomm’s GPUs are just that speedy.

The CPU performance is affected by a much smaller margin under thermal stress, dropping only about 10–15 percent. That’s important because you’re more likely to utilize the Snapdragon 8 Elite’s power with Samsung’s robust multitasking system. Even when running three apps in frames with additional floating apps, we’ve noticed nary a stutter. And while 12GB of RAM is a bit shy of the 16GB you get in some gaming-oriented phones, it’s been enough to keep a day’s worth of apps in memory.

You also get about a day’s worth of usage from a charge. While foldables could generally use longer battery life, it’s impressive that Samsung made this year’s Z Fold so much thinner while maintaining the same 4,400 mAh battery capacity as last year’s phone. However, it’s possible to drain the device by early evening—it depends on how much you use the larger inner screen versus the cover display. A bit of battery anxiety is normal, but most days, we haven’t needed to plug it in before bedtime. A slightly bigger battery would be nice, but not at the expense of the thin profile.

The lack of faster charging is a bit more annoying. If you do need to recharge the Galaxy Z Fold 7 early, it will fill at a pokey maximum of 25 W. That’s not much faster than wireless charging, which can hit 15 W with a compatible charger. Samsung’s phones don’t typically have super-fast charging, with the S25 Ultra topping out at 45 W. However, Samsung hasn’t increased charging speeds for its foldables since the Z Fold 2. It’s long past time for an upgrade here.

Long-awaited camera upgrade

Camera hardware has been one of the lingering issues with foldables, which don’t have as much internal space to fit larger image sensors compared to flat phones. In the past, this has meant taking a big step down in image quality if you want your phone to fold in half. While Samsung has not fully replicated the capabilities of its flagship flat phones, the Galaxy Z Fold 7 takes a big step in the right direction with its protruding camera module.

The Z Fold 7’s camera has gotten a big upgrade. Credit: Ryan Whitwam

The camera setup is led by a 200 MP primary sensor with optical stabilization, identical to the main shooter on the Galaxy S25 Ultra. It’s joined by a 12 MP ultrawide and a 10 MP 3x telephoto, both a step down from the S25 Ultra. There is no equivalent to the 5x periscope telephoto lens on Samsung’s flat flagship. While it might be nice to have better secondary sensors, the 200 MP primary will get the most use, and it offers better results than last year’s Z Fold.

Many of the photos we’ve taken on the Galaxy Z Fold 7 are virtually indistinguishable from those taken with the Galaxy S25 Ultra, which is mostly a good thing. The 200 MP primary sensor has a full-resolution mode, but you shouldn’t use it. With the default pixel binning, the Z Fold 7 produces brighter and more evenly exposed 12 MP images.

Samsung cameras emphasize vibrant colors and a wide dynamic range, so they lean toward longer exposures. Shooting with a Pixel and Galaxy phone side by side, Google’s cameras consistently use higher shutter speeds, making capturing motion easier. The Z Fold 7 is no slouch here, though. It will handle moving subjects in bright light better than any phone that isn’t a Pixel. Night mode produces bright images, but it takes longer to expose compared to Google’s offerings. Again, that means anything moving will end up looking blurry.

Between 1x and 3x, the phone uses digital zoom on the main sensor. When you go beyond that, it moves to the 3x telephoto (provided there is enough light). At the base 3x zoom, these photos are nice enough, with the usual amped-up colors and solid detail we’d expect from Samsung. However, the 10 MP resolution isn’t great if you push past 3x. Samsung’s image processing can’t sharpen photos to the same borderline magical degree as Google’s, and the Z Fold 7 can sometimes over-sharpen images in a way we don’t love. This is an area where the cheaper S25 Ultra still beats the new foldable, with higher-resolution backup cameras and multiple optical zoom levels.

At 12 MP, the ultrawide sensor is good enough for landscapes and group shots. It lacks optical stabilization (typical for ultrawide lenses), but it keeps autofocus. That allows you to take macro shots, and this mode activates automatically as you approach a subject. The images look surprisingly good with Samsung’s occasionally heavy-handed image processing, but don’t try to crop them down further.

The Z Fold 7 includes two 10 MP selfie cameras—one at the top of the cover display and the other for the inner foldable screen. Samsung has dispensed with its quirky under-display camera, which had a smattering of low-fi pixels covering it when not in use. The inner selfie is now just a regular hole punch, which is fine. You should really only use the front-facing cameras for video calls. If you want to take a selfie, foldables offer the option to use the more capable rear-facing cameras with the cover screen as a viewfinder.

A matter of coin

For the first time, the Galaxy Z Fold 7 feels like a viable alternative to a flat phone, at least in terms of hardware. The new design is as thin and light as many flat phones, and the cover display is large enough to do anything you’d do on non-foldable devices. Plus, you have a tablet-sized display on the inside with serious multitasking chops. We lament the loss of S Pen support, but it was probably necessary to address the chunkiness of past foldables.

The Samsung Galaxy Z Fold 7 is the next best thing to having a physical keyboard. Credit: Ryan Whitwam

The camera upgrade was also a necessary advancement. You can’t ask people to pay a premium price for a foldable smartphone and offer a midrange camera setup. The 200 MP primary shooter is a solid upgrade over the cameras Samsung used in previous foldables, but the ultrawide and telephoto could still use some attention.

The price is one thing that hasn’t gotten better—in fact, it’s moving in the wrong direction. The Galaxy Z Fold 7 is even more expensive than last year’s model at a cool $2,000. As slick and capable as this phone is, the exorbitant price ensures tablet-style foldables remain a niche category. If that’s what it costs to make a foldable you’ll want to carry, flat phones won’t be usurped any time soon.

If you don’t mind spending two grand on a phone or can get a good deal with a trade-in or a carrier upgrade, you won’t regret the purchase. This is the most power that can fit in your pocket. It’s available directly from Samsung (in an exclusive Mint color), Amazon, Best Buy, and your preferred carrier.

The Samsung Galaxy Z Fold 7 has a new, super-thin hinge design. Credit: Ryan Whitwam

The good

  • Incredibly slim profile and low weight
  • Upgraded 200 MP camera
  • Excellent OLED screens
  • Powerful multitasking capabilities
  • Toggle for local-only AI
  • Launches on Android 16 with seven years of update support

The bad

  • Ridiculously high price
  • Battery life and charging speed continue to be mediocre
  • One UI 8 has some redundant apps and clunky interface decisions
  • Now Brief still doesn’t do very much

Ryan Whitwam is a senior technology reporter at Ars Technica, covering the ways Google, AI, and mobile technology continue to change the world. Over his 20-year career, he’s written for Android Police, ExtremeTech, Wirecutter, NY Times, and more. He has reviewed more phones than most people will ever own. You can follow him on Bluesky, where you will see photos of his dozens of mechanical keyboards.

Samsung Galaxy Z Fold 7 review: Quantum leap Read More »

spilling-the-tea

Spilling the Tea

The Tea app is or at least was on fire, rapidly gaining lots of users. This opens up two discussions, one on the game theory and dynamics of Tea, one on its abysmal security.

It’s a little too on the nose that a hot new app purporting to exist so that women can anonymously seek out and spill the tea on men then put user information into an unprotected dropbox, thus spilling the tea on the identities of many of its users.

In the circles I follow this predictably led to discussions about how badly the app was coded and incorrect speculation that this was linked to vibe coding, whereas the dumb mistakes involved were in this case fully human.

There was also some discussion of the game theory of Tea, which I found considerably more interesting and fun, and which will take up the bulk of the post.

Tea offers a variety of services, while attempting to gate itself to only allow in women (or at least, not cis men)—although working around this is clearly not hard for a man who wants to—and to only allow discussion and targeting of men.

Some of this is services like phone number lookup, social media and dating app search, reverse image internet search and criminal background checks. The photo you give is checked against catfishing databases. Those parts seem good.

There’s also generic dating advice and forums within the app, sure, fine.

The central feature is that you list a guy with a first name, location and picture – which given AI is pretty much enough for anyone these days to figure out who it is even if they don’t recognize them – and ask ‘are we dating the same guy?’ and about past experiences, divided into green and red flag posts. You can also set up alerts on guys in case there is any new tea.

What’s weird about ‘are we dating the same guy?’ is that the network effects required for that to work are very large, since you’re realistically counting on one or at most a handful of other people in the same position asking the same question. And if you do get the network big enough, search costs should then be very high, since reverse image search on a Facebook group is highly unreliable. It’s kind of amazing that the human recognition strategies people mostly use here worked at all in populated areas without proper systematization.

Tea provides much better search tools including notifications, which gives you a fighting chance, and one unified pool. But even with 4.6 million women, the chances of any given other woman being on it at all are not so high, and they then have to be an active user or have already left the note.

When I asked Claude about this it suggested the real win was finding Instagram or LinkedIn profiles, and that indeed makes a lot more sense. That’s good information, and it’s also voluntarily posted so it’s fair game.

Using a Hall of Shame seems even more inefficient. What, you are supposed to learn who the bad men are one by one? None of this seems like an effective use of time, even if you don’t have any ethical or accuracy concerns. This can’t be The Way, not outside of a small town or community.

The core good idea of the mechanics behind Tea is to give men Skin In The Game. The ideal amount of reputation that gets carried between interactions is not zero. The twin problems are that the ideal amount has an upper bound, and that those providing that reputation also need Skin In The Game: gossip only works if there are consequences for spreading false gossip, and here those consequences seem absent.

What happens if someone lies or otherwise abuses the system? Everything is supposedly anonymous and works on negative selection. The app is very obviously ripe for abuse, all but made for attempts to sabotage or hurt people, using false or true information. A lot of what follows will be gaming that out.

The moderation team has a theoretical zero tolerance policy for defamation and harassment when evidence is provided, but such things are usually impossible to prove, and the bar for actually violating the rules is high. Even if proof were possible and the mod team willing to act on it, how can a target respond to claims he doesn’t know about?

Even then there don’t seem likely to be any consequences to the original poster.

Shall we now solve for the equilibrium, assuming the app isn’t sued into oblivion?

While Tea is small and relatively unknown, the main worries (assuming the tools are accurate) are things like vindictive exes. There’s usually a reason when that happens, but there are going to be some rather nasty false positives.

As Tea gets larger, it starts to change incentives in both good and bad ways. There are good reasons to start to try to manage, manipulate or fool the system, and things start to get weird. Threats and promises of actions within Tea will loom in the air on every date and in every relationship. Indeed, in essentially every interaction, any woman (and realistically also any man) could threaten to spill tea, truthfully or otherwise, at any time.

Men will soon start asking for green flag posts, both accurate ones from exes and very much otherwise, services to do this will spring up, dummy accounts will be used where men are praising themselves.

Men will absolutely at minimum need to know what is being said, set up alerts on themselves, run all the background checks to see what will come up, and work to change the answer to that if it’s not what they want it to be. Presumably there will be plenty happy to sell you this service for very little, since half the population can provide such a service at very low marginal cost.

Quickly word of the rules of how to sculpt your background checks will spread.

And so on. It presumably will get very messy very quickly. The system simultaneously relies on sufficient network effects to make things like ‘are we dating the same guy?’ work, and devolves into chaos if usage gets too high.

One potential counterforce is that it would be pretty bad tea to have a reputation of trying to influence your tea. I doubt that ends up being enough.

At lunch, I told a woman that Tea exists and explained what it was.

Her: That should be illegal.

Her (10 seconds later): I need to go there and warn people about [an ex].

Her (a little later than that, paraphrased a bit): Actually no. He’s crazy, who knows what he might do if he found out.

Her (after I told her about the data breach): Oh I suppose I can’t use it then.

There is certainly an argument in many cases including this one for ‘[X] should be illegal but if it’s going to be legal then I should do it,’ and she clarified that her opposition was in particular to the image searches, although she quickly pointed out many other downsides as well.

The instinct is that all of this is bad for men.

That seems highly plausible but not automatic or obvious.

Many aspects of reputation and competition are positional goods and have zero-sum aspects in many of the places that Tea is threatening to cause trouble. Creating safer and better informed interactions and matches can be better for everything.

More generally, punishing defectors is by default very good for everyone, even if you are sad that it is now harder for you to defect. You’d rather be a good actor in the space, but previously in many ways ‘bad men drove out good,’ placing pressure on you to not act well. This also means that women can feel safe and let their guard down, and so on. A true ‘safety app’ is a good thing.

It could also motivate women to date more and use the apps more. It’s a better product when it is safer, far better, so you consume more of it. If no one has yet hooked the dating apps up automatically to tea so that you can get the tea as you swipe, well, get on that. Thus it can also act as a large improvement on matching. No, you don’t match directly on tea, but it provides a lot of information.

Another possible advantage is that receptivity to this could provide positive selection. If a woman doesn’t want to date you because of unverified internet comments, that is a red flag, especially for you in particular, in several ways at once. It means they probably weren’t that into you. It means they sought out and were swayed by the information. You plausibly dodged a bullet.

A final advantage is that this might be replacing systems that are less central and less reliable and that had worse enforcement mechanisms, including both groups and also things like whisper networks.

Consider the flip side, an app called No Tea, that men could use to effectively hide their pasts and reputations and information, without making it obvious they were doing this. Very obviously this would make even men net worse off if it went past some point.

As in, even from purely the man’s perspective: The correct amount of tea is not zero.

There are still six groups of ways I can think of in which Tea could indeed be bad for men in general at current margins, as opposed to bad for men who deserve it, and it is not a good sign that over the days I wrote this the list kept growing.

  1. Men could in general find large utility in doing things that earn them very bad reputations on tea, and be forced to stop.

    1. This is highly unsympathetic, as they mostly involve things like cheating and otherwise treating women badly. I do not think those behaviors in general make men’s lives better, especially as a group.

    2. I also find it unlikely that men get large utility in absolute terms from such actions, rather than getting utility in relative terms. If you can get everyone to stop, I think most men win out here near current margins.

  2. Women could be bad at the Law of Conservation of Expected Evidence. As in, perhaps they update strongly negatively on negative information when they find it, but do not update appropriately positively when such information is not found, and do not adjust their calibration over time.

    1. This is reasonably marketed as a ‘safety app.’ If you are checked and come back clean, that should make you a lot safer and more trustworthy. That’s big.

      The existence of the app also updates expectations, if the men know that the app exists and that they could end up on it.

    2. In general, variance in response is your friend so long as the downside risk stops at a hard pass. You only need one yes, also you get favorable selection.

    3. Also, this could change the effective numerical dynamics. If a bunch of men become off limits due to tea, especially if that group often involves men who date multiple women at once, the numbers game can change a lot.

  3. Men could be forced to invest resources into reputation management in wasteful or harmful ways, and spend a lot of time being paranoid. This may favor men willing to game the system, or who can credibly threaten retaliation.

    1. This seems highly plausible; hopefully it is limited in scope.

    2. The threat of retaliation issue seems like a potentially big deal. The information will frequently get back to the target, and in many cases the source of the information will be obvious, especially if the information is true.

    3. Ideally the better way to fix your reputation is to deserve a better one, but even then there would still be a lot of people who don’t know this, or who are in a different situation.

  4. Men could face threats, blackmail and power dynamic problems. Even if unstated, the threat to use tea, including dishonestly, looms in the air.

    1. This also seems like a big problem.

    2. Imagine dating, except you have to maintain a 5-star rating.

    3. In general, you want to seek positive selection, and tea risks making you worry a lot about negative selection, well beyond the places you actually need to worry about that (e.g. when you might hurt someone for real).

    4. The flip side is this could force you to employ positive selection? As in, there are many reasons why avoiding those who create such risks is a good idea.

  5. Men might face worse tea prospects the more they date, if the downside risk of each encounter greatly exceeds the upside. Green flags are rare and not that valuable, red flags can sink you. So confidence and boldness decline, the amount of dating and risk taking and especially approaching goes down.

    1. We already have this problem pretty bad based on phantom fears. That could mean it gets way worse, or that it can’t get much worse. Hard to say.

    2. If you design Tea or users create norms such that this reverses, and more dating gets you a better Tea reputation so long as you deserve one, then that could be a huge win.

    3. It would be a push to put yourself out there in a positive way, and gamify things providing a sense of progress even if someone ultimately wasn’t a match, including making it easier to notice this quickly and move on, essentially ‘forcing you into a good move.’

  6. It’s a massive invasion of privacy, puts you at an informational disadvantage, and it could spill over into your non-dating life. The negative information could spread into the non-dating world, where the Law of Conservation of Expected Evidence very much does not apply. Careers and lives could plausibly be ruined.

    1. This seems like a pretty big and obvious objection. Privacy is a big deal.

    2. What is going to keep employers and HR departments off the app?
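The Law of Conservation of Expected Evidence invoked in point 2 is a simple probability identity: the expected value of your posterior, averaged over what the evidence might turn out to be, must equal your prior. A calibrated user who treats a red flag as damning must therefore treat a clean check as genuinely reassuring. A quick numerical check, with illustrative numbers that are not data:

```python
# Conservation of Expected Evidence: E[posterior] over outcomes = prior.
# All numbers below are made up for illustration.
prior_bad = 0.10          # base rate of "bad actor"
p_flag_if_bad = 0.80      # chance a bad actor has a red flag posted
p_flag_if_good = 0.05     # chance a good actor gets smeared anyway

p_flag = prior_bad * p_flag_if_bad + (1 - prior_bad) * p_flag_if_good

# Bayes' rule for each outcome of the check:
post_if_flag = prior_bad * p_flag_if_bad / p_flag
post_if_clean = prior_bad * (1 - p_flag_if_bad) / (1 - p_flag)

# The probability-weighted average of the posteriors recovers the prior.
expected_posterior = p_flag * post_if_flag + (1 - p_flag) * post_if_clean
assert abs(expected_posterior - prior_bad) < 1e-12

print(f"P(bad | flag)  = {post_if_flag:.3f}")
print(f"P(bad | clean) = {post_if_clean:.3f}")
```

With these numbers, finding a flag moves the estimate from 10% to 64%, so a clean check must move it down to about 2.3%; updating on flags while treating a clean record as no news violates the identity.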

MJ: this is straight up demonic. absolutely no one should be allowed to create public profiles about you to crowdsource your deeply personal information and dating history.

People are taking issue with me casually throwing out the word “demonic.” so let me double down. The creators of this app are going to get rich off making many decent people depressed and suicidal.

This isn’t about safety. This isn’t just a background check app. Their own promo material clearly markets this as a way to anonymously share unverified gossip and rumors from scorned exes.

Benjamin Foss: Women shouldn’t be allowed to warn other women about stalkers, predators, and cheaters?

MJ: If you think that’s what this app is primarily going to be used for then I have a bridge to sell you.

Definitely Next Year: “Why can’t I find a nice guy?” Because you listened to his psychopathic ex anonymously make stuff up about him.

My current read is that this would all be good if it somehow had strong mechanisms to catch and punish attempts to misuse the system, especially keeping it from spilling over outside of one’s dating life. The problem is I have a hard time imagining how that will work, and I see a lot of potential for misuse that I think will overwhelm the positive advantages.

Is the core tea mechanic (as opposed to the side functions) good for women? By default more information should be good even if unreliable, so long as you know how to use it, although the time and attention cost and the attitudinal shifts could easily overwhelm that, and this could crowd out superior options.

The actual answer here likely comes down to what this does to male incentives. I am guessing this would, once the app scales, dominate the value of improved information.

If this induces better behavior due to reputational concerns, then it is net good. If it instead mainly induces fear and risk aversion and twists dynamics, then it could be quite bad. This is very much not a Battle of the Sexes or a zero sum game. If the men who don’t richly deserve it lose, probably the women also lose. If those men win, the women probably also win.

What Tea and its precursor groups are actually doing is reducing the Level of Friction in this type of anonymous information sharing and search, attempting to move it down from Level 2 (annoying to get) into Level 1 (minimal frictions) or even Level 0 (a default action).

In particular, this moves the information sharing from one-to-one to one-to-many. Information hits different when anyone can see it, and will hit even more different when AI systems start scraping and investigating.

As with many things, that can be a large difference in kind. This can break systems and also the legal systems built around interactions.

CNN has an article looking into the legal implications of Tea, noting that the legal bar for taking action against either the app or a user of the app is very high.

So yes, of course the Tea app whose hosts have literally held sessions entitled ‘spilling the tea on tea’ got hacked to spill its own Tea, as in the selfies and IDs of its users, which includes their addresses.

Tea claimed that it only held your ID temporarily to verify you are a woman, and that the breached data was being stored ‘in compliance with law enforcement requirements related to cyber-bullying.’

Well, actually…

Howie Dewin: It turns out that the “Tea” app DOXXES all its users by uploading both ID and face verification photos, completely uncensored, to a public bucket on their server.

The genius Brazilians over at “Tea” must have wanted to play soccer in the favela instead of setting their firebase bucket to private.

Global Index: Leaked their addresses too 😳

I mean, that’s not even getting hacked. That’s ridiculous. It’s more like ‘someone discovered they were storing things in a public dropbox.’

It would indeed be nice to have a general (blockchain blockchain blockchain? Apple and Google? Anyone?) solution to the problem of proving aspects of your identity without revealing your identity, as in one that people actually use in practice for things like this.

Neeraj Agrawal: If there was ever an example for why we need an open and privacy-preserving digital ID standard.

You should be able to prove your ID card says something, like your age or in this case your gender, without revealing your address.

Kyle DH: There’s about 4 standards that can do this, but no one has their hands on these digital forms so they don’t get requested and there’s tradeoffs when we make this broadly available on the Web.
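The shape of the idea can be sketched in a few lines: a trusted issuer attests to a single attribute, and the relying app verifies the attestation without ever receiving the ID document, name, or address. Everything below is a toy under stated assumptions (the HMAC scheme, the key, the attribute names are all hypothetical); the real standards Kyle alludes to use public-key signatures or zero-knowledge proofs precisely so the verifier does not need to share a secret with the issuer, as this toy does.

```python
import hmac, hashlib

# Toy attribute attestation (NOT a real selective-disclosure scheme):
# the issuer signs only the single attribute being asserted, so the
# relying app can check "over_18=yes" without seeing the full ID.
ISSUER_KEY = b"issuer-secret-key"  # hypothetical; held by the ID issuer

def issue_attestation(attribute: str, value: str) -> bytes:
    msg = f"{attribute}={value}".encode()
    return hmac.new(ISSUER_KEY, msg, hashlib.sha256).digest()

def verify_attestation(attribute: str, value: str, tag: bytes) -> bool:
    expected = hmac.new(ISSUER_KEY, f"{attribute}={value}".encode(),
                        hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

tag = issue_attestation("over_18", "yes")
assert verify_attestation("over_18", "yes", tag)
assert not verify_attestation("over_18", "no", tag)
```

The key design point is that the app verifying the attestation never stores (or even touches) an ID photo or address, so there is nothing equivalent to Tea's bucket to leak.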

Tea eventually released an official statement about what happened.

This is, as Lulu Meservey points out, a terrible response clearly focused on legal risk. No apology, responsibility is dodged, obvious lying, far too slow.

Rob Freund: Soooo that was a lie

Eliezer Yudkowsky: People shouting “Sue them!”, but Tea doesn’t have that much money.

The liberaltarian solution: requiring companies to have insurance against lawsuits. The insurer then has a market incentive to audit the code.

And the “regulatory” solution? You’re living it. It didn’t work.

DHH: Web app users would be shocked to learn that 99% of the time, deleting your data just sets a flag in the database. And then it just lives there forever until it’s hacked or subpoenaed.

It took a massive effort to ensure this wasn’t the case for Basecamp and HEY. Especially when it comes to deleting log files, database backups, and all the other auxiliary copies of your stuff that most companies just hang onto until the sun burns out.

I mean it didn’t work in terms of preventing the leak but if it bankrupts the company I think I’m fine with that outcome.
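DHH's point about flag-based deletion is easy to demonstrate. This is an illustrative sketch (table and column names are invented) of the pattern he describes: "deleting" flips a flag and the data stays hack- and subpoena-ready, whereas a hard delete actually removes the row (backups and logs being a separate battle).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, deleted INTEGER DEFAULT 0)")
conn.execute("INSERT INTO users VALUES (1, 'alice@example.com', 0)")

# "Soft delete": what most web apps do -- flip a flag, keep the row.
conn.execute("UPDATE users SET deleted = 1 WHERE id = 1")
# The email is still sitting in the database, invisible to the user
# but fully available to anyone who gets the data.
row = conn.execute("SELECT email FROM users WHERE id = 1").fetchone()
print(row)  # ('alice@example.com',)

# Hard delete: actually removes the row from the live database.
conn.execute("DELETE FROM users WHERE id = 1")
assert conn.execute("SELECT * FROM users WHERE id = 1").fetchone() is None
```

The soft-delete pattern exists for good operational reasons (undo, referential integrity), which is exactly why purging it everywhere, as DHH describes, takes real effort.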

One side effect of the hack is we can get maps. I wouldn’t share individuals, but distributions are interesting and there is a clear pattern.

As in, the more central and among more people you live, the less likely you are to use Tea. That makes perfect sense. The smaller your community, the more useful gossip and reputation are as tools. If you’re living in San Francisco proper, the tea is harder to get and also less reliable due to lack of skin in the game.

Tom Harwood notes that this is happening at the same time as the UK mandating photo ID for a huge percentage of websites, opening up lots of new security issues.

As above, for this question divide Tea into its procedural functions, and the crowdsourcing function.

On its procedural functions, these seem good if and only if execution of the features is good and better than alternative apps that do similar things. I can’t speak to that. But yeah, it seems like common sense to do basic checks on anyone you’re considering seriously dating.

On the core crowdsourcing functions I am more skeptical.

Especially if I were considering sharing red flags, I would have serious ethical concerns around invasion of privacy, and I would worry that the information could get out beyond his dating life, including back to you in various ways.

If you wouldn’t say it to the open internet, you likely shouldn’t be saying it to Tea. To the extent people are thinking these two things are very different, I believe they are making a serious mistake. And I would be very skeptical of the information I did get. But I’m not going to pretend that I wouldn’t look.

If you have deserved green flags to give out? That seems great. It’s a Mitzvah.

how-the-trump-fcc-justified-requiring-a-“bias-monitor”-at-cbs

How the Trump FCC justified requiring a “bias monitor” at CBS


Paramount/Skydance merger

Trump FCC claims there’s precedent for CBS ombudsman, but it’s a weak one.

President-elect Donald Trump speaks to Brendan Carr, his intended pick for Chairman of the Federal Communications Commission, as he attends a SpaceX Starship rocket launch on November 19, 2024 in Brownsville, Texas. Credit: Getty Images | Brandon Bell

The Federal Communications Commission’s approval of CBS owner Paramount’s $8 billion merger with Skydance came with a condition to install an ombudsman, which FCC Chairman Brendan Carr has described as a “bias monitor.” It appears that the bias monitor will make sure the news company’s reporting meets standards demanded by President Donald Trump.

“One of the things they’re going to have to do is put an ombudsman in place for two years, so basically a bias monitor that will report directly to the president [of Paramount],” Carr told Newsmax on Thursday, right after the FCC announced its approval of the merger.

The Carr FCC claims there is precedent for such a bias monitor. But the precedent cited in last week’s merger approval points to a very different case involving NBC and GE, one in which an ombudsman was used to protect NBC’s editorial independence from interference by its new owner.

By contrast, it looks like Paramount is hiring a monitor to make sure that CBS reporting doesn’t anger President Trump. Paramount obtained the FCC’s merger approval only after reaching a $16 million settlement with Trump, who sued the company because he didn’t like how CBS edited a pre-election interview with Kamala Harris. Trump claimed last week that Paramount is providing another $20 million worth of “advertising, PSAs, or similar programming,” and called the deal “another in a long line of VICTORIES over the Fake News Media.”

NBC/GE precedent was “viewpoint-neutral”

The FCC merger approval says that “to promote transparency and increased accountability, Skydance will have in place, for a period of at least two years, an ombudsman who reports to the President of New Paramount, and who will receive and evaluate any complaints of bias or other concerns involving CBS.”

The Carr FCC apparently couldn’t find a precedent that would closely match the ombudsman condition being imposed on Paramount. The above sentence has a footnote citing the FCC’s January 2011 approval of Comcast’s purchase of NBCUniversal, saying the Obama-era order found “such a mechanism effective in preventing editorial bias in the operation of the NBC broadcast network.”

But in 2011, the FCC said the purpose of the ombudsman was to ensure that NBC’s reporting would not be altered to fit the business interests of its owner. The FCC said at the time:

The Applicants state that, since GE’s acquisition of NBC in 1986, GE has ensured that the content of NBC’s news and public affairs programming is not influenced by the non-media interests of GE. Under this policy, which was noted with favor when the Commission approved GE’s acquisition of NBC, NBC and its O&O [owned and operated] stations have been free to report about GE without interference or influence. In addition, GE appointed an ombudsman to further ensure that the policy of independence of NBCU’s news operations would be maintained. Although the Applicants contend there is no legal requirement that they do so, they offer to maintain this policy and to retain the ombudsman position in the post-transaction entity to ensure the continued journalistic integrity and independence of NBCU’s news operations.

The NBC/GE condition “was a viewpoint-neutral economic measure. It did not matter if the content had a pro or con position on any political or regulatory issue, but only whether it might have been broadcast to promote GE’s pecuniary interests,” said Andrew Jay Schwartzman, a longtime attorney and advocate who specializes in media and telecommunications policy. Schwartzman told Ars today that the NBC/GE condition cited by the Carr FCC is “very different from the viewpoint-based nature of the CBS condition.”

FCC Commissioner Anna Gomez, the commission’s only Democrat, said the agency is “imposing never-before-seen controls over newsroom decisions and editorial judgment, in direct violation of the First Amendment and the law.”

FCC: Trump lawsuit totally unrelated

The FCC’s merger approval order said that “the now-settled lawsuit filed by President Donald J. Trump against Paramount and CBS News” is “unrelated to our review of the Transaction.” But on Newsmax, Carr credited Trump with forcing changes at CBS and other media outlets.

“For years, people cowed down to the executives behind these companies based in Hollywood and New York, and they just accepted that these national broadcasters could dictate how people think about topics, that they could set the narrative for the country—and President Trump fundamentally rejected it,” Carr said. “He smashed the facade that these are gatekeepers that can determine what people think. Everything we’re seeing right now flows from that decision by President Trump, and he’s winning. PBS has been defunded. NPR has been defunded. CBS is committing to restoring fact-based journalism… President Trump stood up to these legacy media gatekeepers and now their business models are falling apart.”

Carr went on Fox News to discuss the CBS cancellation of Stephen Colbert’s show, saying that “all of this is downstream of President Trump’s decision to stand up, and he stood up for the American people because the American people do not trust these legacy gatekeepers anymore.” Carr also wrote in a post on X, “The partisan left’s ritualist wailing and gnashing of teeth over Colbert is quite revealing. They’re acting like they’re losing a loyal DNC spokesperson that was entitled to an exemption from the laws of economics.”

Warren: “Bribery is illegal no matter who is president”

In a July 22 letter to Carr, Skydance said it “will ensure that CBS’s reporting is fair, unbiased, and fact-based.” With the installation of an ombudsman who will report to the company president, “New Paramount’s executive leadership will carefully consider any such complaints in overseeing CBS’s news programming,” the letter said, also making reference to the previous case of an ombudsman at NBC. Skydance sent another letter about its elimination of diversity, equity, and inclusion (DEI) initiatives, complying with Carr’s demand to end such programs.

As Carr described it to Newsmax, the merging companies “made commitments to address bias and restore fact-based reporting. I think that’s so important. Look, the American public simply do not trust these legacy media broadcasters, so if they stick with that commitment, you know we’re sort of trust-but-verify mode, that’ll be a big win.”

The FCC’s merger-approval order favorably cites comments from the Center for American Rights (CAR), a conservative group that filed a news distortion complaint against CBS over the Harris interview. The group “filed a supplemental brief, in which it discusses a report by Media Research Center (MRC) concerning negative media coverage of the Trump administration,” the FCC said. “CAR asserts that the MRC report confirms that the news media generally, and CBS News in particular, is relentlessly slanted and biased. It concludes that Commission action is necessary to condition the Transaction on an end to this blatant bias.”

Although the FCC insists that the Trump lawsuit wasn’t relevant to its merger review, Carr previously made it clear that the news distortion complaint would be a factor in determining whether the merger would be approved. The FCC investigation into the Harris interview doesn’t seem to have turned up much. CBS was accused of distorting the news by airing two different answers given by Harris to the same question, but the unedited transcript and camera feeds showed that the two clips simply contained two different sentences from the same answer.

Congressional Democrats said they will investigate the circumstances of the merger, including allegations that Skydance and Paramount bribed Trump to get it approved. “Bribery is illegal no matter who is president,” Senator Elizabeth Warren (D-Mass.) said. “It sure looks like Skydance and Paramount paid $36 million to Donald Trump for this merger, and he’s even bragged about this crooked-looking deal… this merger must be investigated for any criminal behavior. It’s an open question whether the Trump administration’s approval of this merger was the result of a bribe.”

Jon is a Senior IT Reporter for Ars Technica. He covers the telecom industry, Federal Communications Commission rulemakings, broadband consumer affairs, court cases, and government regulation of the tech industry.

smithsonian-air-and-space-opens-halls-for-“milestone”-and-“future”-artifacts

Smithsonian Air and Space opens halls for “milestone” and “future” artifacts


$900M renovation nearing completion

John Glenn’s Friendship 7 returns as SpaceX and Blue Origin artifacts debut.

a gumdrop-shape white space capsule is seen on display with other rocket hardware in a museum gallery with blue walls and flooring

“Futures in Space” recaptures the experience of the early visitors to the National Air and Space Museum, where the objects on display were contemporary to the day. A mockup of a Blue Origin New Shepard capsule and SpaceX Merlin rocket engine are among the items on display for the first time. Credit: Smithsonian National Air and Space Museum

The National Air and Space Museum welcomed the public into five more of its renovated galleries on Monday, including two showcasing spaceflight artifacts. The new exhibitions shine modern light on returning displays and restore the museum’s almost 50-year-old legacy of adding objects that made history but have yet to become historical.

Visitors can again enter through the “Boeing Milestones of Flight Hall,” which has been closed for the past three years and has on display some of the museum’s most iconic items, including John Glenn’s Friendship 7 Mercury capsule and an Apollo lunar module.

From there, visitors can tour through the adjacent “Futures in Space,” a new gallery focused on the different approaches and technology that spaceflight will take in the years to come. Here, the Smithsonian is displaying for the first time objects that were recently donated by commercial spaceflight companies, including items used in space tourism and in growing the low-Earth orbit economy.

a museum gallery with air and spacecraft displayed on the terrazzo floor and suspended from the ceiling

The artifacts are iconic, but the newly reopened Boeing Milestones of Flight Hall at the National Air and Space Museum is all new. Credit: Smithsonian National Air and Space Museum

“We are thrilled to open this next phase of exhibitions to the public,” said Chris Browne, the John and Adrienne Mars Director of the National Air and Space Museum, in a statement. “Reopening our main hall with so many iconic aerospace artifacts, as well as completely new exhibitions, will give visitors much more to see and enjoy.”

The other three galleries newly open to the public are devoted to aviation history, including the “Barron Hilton Pioneers of Flight,” “World War I: The Birth of Military Aviation,” and the “Allan and Shelley Holt Innovations Gallery.”

What’s new is not yet old

Among the artifacts debuting in “Futures in Space” are a Merlin engine and grid fin that flew on a SpaceX Falcon 9 rocket, Sian Proctor’s pressure suit that she wore on the private Inspiration4 mission in 2021, and a mockup of a New Shepard crew module that Blue Origin has pledged to replace with its first flown capsule when it is retired from flying.

“When the museum first opened back in 1976 and people came here and saw things like the Apollo command module and Neil Armstrong’s spacesuit, or really anything related to human spaceflight, at that point it was all still very recent,” said Matt Shindell, one of the curators behind “Futures in Space,” in an interview with collectSPACE.com. “So when you would come into the museum, it wasn’t so much a history of space but what’s happening now and what could happen next. We wanted to have a gallery that would recapture that feeling.”

Instead of being themed around a single program or period in history, the new gallery invites visitors to consider a series of questions, including: Who decides who goes to space? Why do we go? And what will we do when we get there?

a black and white astronaut's pressure suit and other space artifacts are displayed behind glass in a museum gallery with blue flooring and walls

Curators designed “Futures in Space” around a list of questions, including “Why go to space?” On display is a pressure suit worn by Sian Proctor on the Inspiration4 mission and a 1978 NASA astronaut “TFNG” T-shirt. Credit: Smithsonian National Air and Space Museum

“We really wanted the gallery to be one that engaged visitors in these questions and that centered the experience around what they thought should be happening in the future and what that would mean for them,” said Shindell. “We also have visions of the future presented throughout the gallery, including from popular culture—television shows, movies and comic books—that have explored what the future might look like and what it would mean for the people living through it.”

That is why the gallery also includes R2-D2, or rather a reproduction of the “Star Wars” droid as built by Adam Savage of Tested. In George Lucas’ vision of the future (“a long, long time ago”), Astromech droids serve as spacecraft navigators, mechanics, and companion aides.

Beyond the artifacts and exhibits (which also include an immersive 3D-printed Mars habitat and Yuri Gagarin’s training pressure suit), there is a stage and seating area at the center of “Futures.”

“I think of it as a TED Talk-style stage,” said Shindell. “We’re hoping to bring in people from industry, stakeholders, people who have flown, people who are getting ready to fly, and people who have ideas about what should be happening to come and talk to visitors from that stage about the same questions that we’re asking in the gallery.”

Modernized “Milestones”

The artifacts presented in the “Boeing Milestones of Flight” are mostly the same as they were before the hall was closed in 2022. The hall underwent a renovation in 2014 ahead of the museum’s 40th anniversary, so its displays did not need another redesign.

Still, the gallery looks new due to the work done surrounding the objects.

“What is new for the ‘Boeing Milestones of Flight Hall’ is, at some level, most noticeably the floor and media elements,” said Margaret Weitekamp, curator and division chair at the National Air and Space Museum, in an interview.

“We have a wonderful 123-foot (37-meter) media band that goes across the front of the mezzanine, and we have 20 different slide shows that work as a digest of what you’ll find in the new galleries throughout the building,” said Weitekamp. “So as people come into the Boeing Milestones of Flight Hall, they’ll be greeted by that and get a taste of what they’re going to see inside.”

And then there is the new flooring. In the past, the hall had been lined in maroon or dark gray carpet. It is now a much lighter color terrazzo.

“It really brightens up the room,” Weitekamp told collectSPACE.

“Also, you’ll notice that as you are going up and down the hallways, there are medallions embedded in the floor that display quotes from significant aviation and spaceflight figures. So we’ve been able to put some quotes from Carl Sagan, Sally Ride, and Chuck Yeager into the floor,” she said.

the view looking down and into a museum gallery with aircraft suspended from the ceiling, spacecraft on display and a binary map embedded in the flooring

The pattern on the floor of the Boeing Milestones of Flight Hall is the pulsar-based map to Earth’s solar system that was mounted to the Pioneer and Voyager probes, now updated for 2026. Credit: Smithsonian National Air and Space Museum

Visitors should also pay attention to what look like lines of dashes converging at the hall’s center. The design is an update to a NASA graphic.

“We have a revised version of the pulsar map from Pioneer 10 and 11 and the Voyager interstellar record,” said Weitekamp, referring to the representation of the location of Earth for any extraterrestrial species that might discover the probes in the future. “The map located Earth’s solar system with relationship to 14 pulsars.”

When the Pioneer and Voyager spacecraft were launched, astronomers didn’t know that pulsars (or rotating neutron stars) slow down over time.

“So we worked with a colleague of ours to make it a map to our solar system as would be accurate for 2026, which will mark the 50th anniversary of the museum’s building and the 250th birthday of the nation,” Weitekamp said.

Thirteen open, eight to go

Monday’s opening followed an earlier debut of eight reimagined galleries in 2022. Also open is the renovated Lockheed Martin IMAX Theater, which joins the planetarium, the museum store, and the Mars Café that were reopened earlier.

the exterior entrance to a building with a tall, spike-like silver sculpture standing front and center

The redesigned north entrance to the Smithsonian National Air and Space Museum opened to the public on Monday, July 28, 2025. Credit: Smithsonian National Air and Space Museum

“We are nearing the end of this multi-year renovation project,” said Browne. “We look forward to welcoming many more people into these modernized and inspiring new spaces.”

Eight more exhibitions are scheduled to open next year in time for the 50th anniversary of the National Air and Space Museum. Among those galleries are three that are focused on space: “At Home in Space,” “National Science Foundation Discovering Our Universe,” and “RTX Living in the Space Age Hall.”

Admission to the National Air and Space Museum and the new galleries is free, but timed-entry passes, available from the Smithsonian’s website, are required.

Robert Pearlman is a space historian, journalist and the founder and editor of collectSPACE, a daily news publication and online community focused on where space exploration intersects with pop culture. He is also a contributing writer for Space.com and co-author of “Space Stations: The Art, Science, and Reality of Working in Space” published by Smithsonian Books in 2018. He is on the leadership board for For All Moonkind and is a member of the American Astronautical Society’s history committee.
