Monthly Roundup #36: November 2025

Happy Gemini Week to those who celebrate. Coverage of the new release will begin on Friday. Meanwhile, here are this month’s things that don’t go anywhere else.

Google has partnered with Polymarket to include Polymarket odds in Google Search and Google Finance. This is fantastic and suggests we should expand the number of related markets on Polymarket.

In many ways Polymarket prediction markets are remarkably accurate, but what we have here is a Brier score without a baseline for what we should expect. You need to compare your Brier score to scores on exactly the same events, or it doesn’t mean much. There’s a lot to be made on Polymarket if you pay attention.
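To make the baseline point concrete, here is a minimal sketch in Python with made-up numbers (not actual Polymarket data), showing that a Brier score only means something next to a baseline scored on the same events:

```python
# Minimal sketch of why a Brier score needs a baseline
# (illustrative numbers, not actual Polymarket data).
forecasts = [0.9, 0.2, 0.7, 0.6]  # forecast probabilities for four events
outcomes  = [1,   0,   1,   0]    # what actually happened

def brier(ps, ys):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(ps, ys)) / len(ps)

market_score = brier(forecasts, outcomes)                # 0.125 here
baseline_score = brier([0.5] * len(outcomes), outcomes)  # 0.25 by construction

print(f"market:   {market_score:.3f}")   # lower is better
print(f"baseline: {baseline_score:.3f}")
# 0.125 looks great against a coin-flip baseline, but against sharper
# forecasters scored on these same events it could just as easily look bad.
```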

A proposed ‘21st Century Civilization Curriculum’ for discussion groups. There’s an interestingly high number of book reviews involved as opposed to the actual books. I get one post in at the end, which turns out to be Quotes From Moral Mazes, so I’m not sure it counts but the curation is hopefully doing important work there.

Wylfa in North Wales will host the UK’s first small modular nuclear reactors, with the government investing £2.5 billion.

Fusion reactors might pay for themselves by turning mercury into gold? Beware diminishing marginal returns. Manifold has this at 28% by 2035.

Scott Alexander’s latest roundup on charter cities.

This month’s version of standard solid advice for men in their 20s.

Usopp: 12 advice that came up the most in the replies/qt so far:

– Find your partner, get married and have kids

– Take way more risks

– Build a strong circle of quality friends, cut off toxic ppl

– Read a lot a lot of books

– Travel more, move elsewhere

– Exercise daily, stay and keep fit

– Stay away from junk food – always prioritise health

– Quit porn, quit smoking, quit alcohol

– Be humble, lose the ego but don’t lose the confidence.

– Protect your mental health

– Don’t neglect family, always call your parents

Nothing earth-shattering or surprising there, I hope, but yeah, those are the go-tos.

Risks here means taking ‘real’ risks in life, not taking financial risks or gambling.

Jeffrey Wang is the latest to offer advice on how to throw parties, with an emphasis on sound. He says you need music, plus at least one loud zone and one quiet zone, and also he’s a fan of the drinking.

I sense I don’t get invited to his sort of parties, and that’s probably for the best.

A report from Jerusalem Demsas about what it is like to know you could end up watching TikTok for 10 hours a day.

Twitter will experiment with telling us what nation accounts are posting from, what date they joined and when they last changed their username. They say there will be privacy toggles, but of course then everyone knows you have something to hide, and they’ll highlight that you did it. I’m mostly in support of this, as it should help control various bot and astroturfing problems. I think that’s worth the cost.

The plan to avoid penalizing Twitter links is that when you click on a link in the iOS app it will keep the post itself accessible on the bottom of the screen so that you can easily like, repost or respond to the original post while you read. I guess this is a marginal improvement?

Alternatively, you could do these things more often with tweets that have links or longer articles, especially the likes and retweets.

The unfortunate natural pattern is that if you provide a witty comment, the barrier to liking it is low. Whereas if you provide actual value in the form of a link or Twitter article, or you read something on Substack, the threshold for liking it is ‘I actually read the damn thing and not only liked it but didn’t have any issues with anything in it, and also remembered to like it afterwards’ which makes it highly unlikely.

Therefore, I’m going to make a request: If you think the world would be better off if more people read the link or article on Twitter, then like the post with the link or article. If not, not. Thank you for your attention to this matter.

The bulk of ‘social media’ is now actually short form television that uses an algorithm. This, as Matthew Yglesias notes, is bad. Short form algorithmic video is bad news. Social media as originally intended, where you are social and consume various media from people you are social with, is a mixed bag, but the new thing is terrible. Regular television has big downsides, but also advantages. This seems obviously much worse, and I’ve said it before but it bears repeating.

Xi views TikTok as ‘spiritual opium’ rather than something important, is totally fine with that, and is allowing the TikTok sale as a bargaining chip.

What happens if you start a fresh Twitter account using a VPN and the For You page? Soren Kierkegaard found (pre-election) a 4:1 ratio of right to left wing tweets. Nicholas Decker made a new alt and so got a look at the new account algorithm, reports nothing but the most obnoxious conservative propaganda imaginable.

This is not good business on the part of Elon Musk. Even if your goal is only to advance conservative causes, you need to draw new users in. This doesn’t do that.

Twitter’s new in-app link viewer has a few excluded domains, and well, whoops.

Substack being on the list is rather obnoxious, although it seems it was then taken out again and an explanation added?

Aaron: X has released an update for iOS that clarifies why some domains are blacklisted from the new web view

Our government keeps straight up murdering people on the high seas, as in blowing up boats, in many cases without even knowing who is on the boats.

Wall Street Journal claims Trump administration is planning to overhaul the IRS with the explicit goal of weaponizing it to investigate left wing groups. I thought that using the IRS in this way was illegal, but I guess it’s 2025, you can just do things.

We are rolling back tariffs on “products that cannot be grown, mined or naturally produced in the United States.” Good. I almost always have a policy of praising rather than criticizing people when they stop hitting themselves but man, we did not need to spend the better part of a year figuring that one out.

While Trump argues that not having tariffs will bankrupt the country, Bessent announces a ‘$2,000 tariff stimulus check’ for those making less than $100k. Curious.

Always be gracious when someone does something good, you don’t necessarily have to ask how we got there, especially since everybody knows:

White House: Thanks to President Trump’s deal-making, we’re making trade fair again, & winning BIG.

Coffee, tea, cocoa, spices, bananas, oranges, tomatoes, beef, fertilizers, & more are now exempt from reciprocal tariffs.

America First policies delivering for American workers & families🇺🇸

Alex Tabarrok: “Frank Sinatra? Heck of a guy – real prince. Saved my life once. We were doing a show at the Sands, and between sets, I took a break in the parking lot. Next thing I know, three guys are working me over real good. Then I hear Frank say, ‘OK, boys, that’s enough.’”

– Shecky Greene

Home and auto insurance rates are rising, so the state governments are governmenting and telling insurers to cap prices. If you have lots of insurance providers and rates keep going up, there’s a reason. If you don’t have lots of insurance providers, there’s a reason for that, too. As California has learned, if you cap insurance prices where they’re unprofitable, insurers pick up and leave. No one seems to be asking about how to lower the real cost of insurance, as in the need for payouts.

There is about $1.5 trillion in capex going through federal permitting. Chop chop.

Trump explicitly says on Fox News we don’t have enough talent, we have to bring in talent, in reference to H1-Bs, and also reveals he had nothing to do with the raid on the South Korean battery factory and was wisely upset when it happened. It’s good that he understands the principle but we still observe what the White House is actually doing, which is not great.

Thanks to Argentina we now know that supporting US allies is America First.

Trump’s mechanism to pay the troops during the shutdown also seems rather blatantly illegal, as in spending money in a way not approved by Congress with no fig leaf on why it is allowed?

Bobby Kogan: The mechanism through which Trump is paying the troops is the most blatant large Antideficiency Act (ADA) violation in US history. It’s also clearly willful. No one has been charged under the ADA before, but violations carry a 2 year jail term. Statute of limitations is 5 years.

Under the Constitution and under the ADA, it is illegal to spend money without funding for that purpose. The president may not spend money to do something unless there’s actually money to carry it out and that action is expressly allowed.

… Military pay is appropriated one year at a time, with a one-year period of availability. The fiscal year ended on September 30th, and we did not pass new appropriations bills (the government is shut down), so there’s no money available to pay the troops (or to do lots of things).

… [various technical reasons what they’re doing is very not legal] …

… And the craziest part is this was needless. Congress would’ve passed a military pay bill with near unanimous support! Congressional Ds have been begging Rs to bring a bill to pay the military to the floor! But Johnson refuses to gavel in because he doesn’t want an Epstein vote.

So just how bad is this? I got a text from an appropriator friend saying “The Republic has fallen. Pack it in.”

I think there are five levels of potential badness here. Once you’ve decided to violate the ADA, you’re only bound by self-imposed limitations. But depending on what the White House is self-imposing, this can range from “BAD” to “The Republic has fallen, pack it in.”

… Taken together w/ impoundments, this’d break everything. The president is claiming the power to not spend money he doesn’t want to and now also to spend money where it’s not allowed. And SCOTUS might say no one has standing to stop him. That would make him an appropriations king.

In this case everyone agrees you pay the troops and the money can be reconciled (if it hasn’t been already) so the de facto fig leaf is ‘it is common knowledge this would have been approved’ but that’s not a norm you can rely on in this spot, the violations of principles here are rather egregious, and once you do it once what stops it happening again? What stops them from spending any and all public funds on whatever the hell they feel like?

In general, we follow a pattern of:

  1. A rule is broken that, if fully and properly exploited, would mean the Republic has fallen, and it’s time to pack it in.

  2. Things get a little bit worse but the thing is not exploited maximally.

  3. The Republic does not fall and we do not pack it in.

So we can do things like have unidentified masked people kidnapping citizens off the street and acting like this is fine, we can sink boats in international waters without trial or a declaration of war, have relatives of the president make a billion in crypto, have the Department of Justice selectively prosecute personal enemies on direct presidential orders, impose punitive tariffs on one of our most reliable, friendly and important trading partners because of dislike of an advertisement, pardon or give clemency to ten Republican congressmen convicted of corruption style crimes including actual George Santos, weaponize the IRS to go after opposition groups, actively work to destroy vaccinations and PEPFAR and for some reason Tylenol, warn major media companies to sell to the correct bidder or else they won’t approve the deal, outright demand $230 million from the treasury for his personal account right in the open, and so on and so on, and yet things are mostly normal.

For now. It doesn’t seem great that we keep playing that game.

Trump administration will be setting price floors across a range of industries to combat market manipulation by China. Price floors have a long history of causing markets to not clear and reducing supply, see minimum wages, and certainly they do not help you lower prices, but in this case I actually think this is a reasonable response? You have a rival strategically flooding your market as part of a strategy to drive you out of business. The worry is it is highly prone to abuse, or to becoming permanent, but if implemented wisely, it does seem like the right tool.

Not to harp on the H1-B visa thing but here’s another ‘the wage levels principle is completely absurd’ illustrative post. We’re prioritizing the most experienced people working in the lowest paid professions. If that sounds crazy, it’s probably because it is, especially since the correct answer (‘those who make the most money’) is right there. We’re also keeping it a weighted lottery instead of a sorting algorithm. What you actually want is certainty, so people know if they’re getting in or not.

Patrick McKenzie explains that Germany’s shutting down of its nuclear plants in favor of coal plants, under pressure from the so-called ‘greens,’ is an illustration of a fatal flaw in coalitional parliamentary politics.

Patrick McKenzie: I think it’s important, in the cases where people do things for wildly irrational reasons, to carefully listen to their explanation, both for understanding their worldview and for recording mistakes for future versions of the game.

One of the takeaways here is “A bad news cycle and a system which allows coalition management primacy over decision making will allow a generally effective technocratic government to, with eyes wide open, pick policies which are obviously foreseeable catastrophes.”

And when you ask, years later, “What possessed you to do that?”, the people who did it will say they were boxed in, that their coalition partners invested all their points in X, and when that happens during a bad news cycle, well, you just have to roll with it.

Germany continues to attempt to stop Uber from existing because they don’t direct rides through the central offices of a car rental company, which means they would be unfair competition. Expectation is that this won’t actually work, but still, wow.

There are calls to privatize air traffic control, because air traffic controllers are impacted by government shutdowns. I suppose if you’re locked into shutdowns and into the shutdowns impacting air traffic controllers this could end up working out. But rather obviously this is completely crazy? That you need something to reliably happen and not shut down so you need to privatize it?

The obviously correct thing is to exempt air traffic controllers from shutdowns. This seems super doable? You can pass a bill that automatically funds the FAA indefinitely from a trust fund. It’s what we do for USPS. It’s not like anyone wants the FAA to shut down.

Instead, we have widespread flight cancellations and people considering planning road trips.

We are going to require cars and trucks, including electric vehicles, to include AM radios? What? In 2025, when we didn’t do this before? Some in the comments argue that AM radio is indeed important because the 0.1% of the time you need it you really need it, and I can buy that there might even be a market failure here, but the very obvious response is that this bill would have made ten times more sense in 1995 or 1975, and we didn’t have it then, so why now? Also, if this is important it’s like $25 to buy an AM radio and stick it in the glove compartment for when you need one.

In case it wasn’t obvious, the United States government pays below market for everything policy related, the jobs have super long hours and aren’t especially stable, and require you to go to Washington, DC, so only those who are already rich or heavily ideologically motivated tend to take them.

Rep. Ed Case (D-HI) points out that we have the Jones Act and ‘despite this’ our shipbuilding, repair capacity and such are all withering away to nothing, so arguing that the Jones Act protects national security makes no sense. I agree with him, except that instead of ‘despite’ he should be saying ‘because of,’ the Jones Act actively makes these problems worse.

This is from a full event, Assessing the Jones Act: Perspectives from the Noncontiguous States and Territories. Everyone is harmed except the rent seekers, but the noncontiguous areas are hurt quite a lot more than the rest of us.

Open Philanthropy is now Coefficient Giving; you will now be able to fund one of several cause area divisions.

Alexander Berger: Our ambition has always been to work with more donors once we had enough bandwidth to support Good Ventures.

We started in earnest in 2024, directing over $100m from other donors. We more than doubled that so far in 2025. We’re aiming for a lot more in years to come.

Our new name reflects various aspects of this new chapter:

“Co” -> collaborating with other donors

“Efficient” -> a nod to cost-effectiveness

“Coefficient” -> multiplying others’ impact, ROI

(And “giving” is much less of a mouthful than “philanthropy”)

Big success, a long time coming:

Samuel Hume: Novartis’ new malaria treatment cured 97.4% of patients – more than the current best treatment.

It kills resistant parasites, too, and probably blocks transmission better than current drugs

Approval is expected next year!

Roon: malaria of course has killed billions of people through human history and just like that another foe is ~vanquished

Scott Alexander: > Go to the drug’s Wikipedia article

> “This drug [was] developed with support from the Bill and Melinda Gates foundation via their Medicine for Malaria Venture.”

If you mean Effective Altruists (TM), the compound was discovered in 2007, before effective altruism was founded, so we can hardly be blamed for not contributing to it! EA leader Open Philanthropy (founded 2017) has since funded research into other pioneering antimalarials.

From what I have seen, the ‘spreadsheet altruism’ absolutely includes strategies like ‘research new malaria drug’ and otherwise funding science and making similar bets.

Somehow this got 3 million views…

Gina: You can only pick one option!!!

The funny answer is the ‘life extension’ or ‘defense against existential risk’ move if you interpret the $700k as definitely living to 65.

But if you take it as intended, that if you die early you still die in real life, then I really hope one would take $1.1 million here? Putting the amount this high is bizarre.

A remarkably large number of people went for the $900k, without even stopping to think that there was no assurance you would even get away with it. Well, that’s engagement farming, I guess.

This seems like a good note. I think the actual limiting factor here is mostly time.

Will Manidis: with the exception of museum quality/rareness, antique prices have fallen off a cliff over the past 10 years. you can decorate your home like an 18th century royal with pieces unthinkable to 99% of humans across history, but instead you live amongst minimalist ikea slop

David Perell: I’ve been interviewing people who have beautiful homes about how they decorated them, and the biggest surprise is how they almost all insist that good design is more about taste than money.

Yes, it costs more to buy a great sofa than a bad one. But there are plenty of millionaires living in homes that feel like an airport lounge.

The actual limiting factor is taste and time. The taste to know what looks good and the time it takes to find what you’re looking for. What’s key is that the second-hand furniture market is quite inefficient.

To be sure, there is a spectrum: at one end, you have thrift stores (cheap, chaotic, and unvetted). At the other, you have Sotheby’s (curated, clean, and highly vetted). The sweet spot is somewhere in the middle.

So how do you find pockets of glorious inefficiency?

One way is to make friends with people who own antique shops. I have a friend in San Francisco who knows a few collectors in town. They know her taste, and when something comes in that matches her style, they call her. And because of this, she never has to wait 17 weeks for a backordered couch from CB2.

Here’s the key point: If you have a strong sense of taste and understand the game, you’ll consistently spend less to design a house that feels alive and uniquely yours.

Good design, it turns out, is a byproduct of taste and attention, not money.

Matthew Speiser: In NYC and the Hudson Valley there are numerous vintage and antique furniture stores selling great stuff at reasonable prices. Far from chaotic and unvetted.

And “taste” isn’t about “efficiency.” It takes a lot of time browsing pieces and observing decor you enjoy to develop your taste.

In NYC: Dobbins Street Vintage, Dream Fishing Tackle, Lichen, tihngs, Humble House, Shop 86, Sterling Place

In HV: Newburgh Vintage Emporium (2 locations), The Antique Warehouse, Magic Hill Mercantile, Hyde Park Antiques Center + lots of small shops in Hudson, Kingston, Saugerties, etc.

If you try to buy antiques or otherwise develop taste you need to worry about matching and you need to get buy-in from others, and it all takes time, alas. Getting consensus is tough. Also once you decorate the first time, it takes a lot of activation energy to start changing, and to shift to a new equilibrium. So I get it. But when I look over at our IKEA-style drawers in this room do I wish they were old school? Oh yeah.

There’s also the functionality of the room, which you have to pay attention to and understand. Know your Christopher Alexander.

(EDIT: This section originally identified this as being by someone else, for reasons that are lost to time.)

Henrik Karlsson’s whole OP is full of gold, I want to emphasize his third point here:

Henrik Karlsson: I got to run something like an experiment on my capacity to predict which exhibitions would end up great, and which would be a waste of time. It was easy. As soon as someone was slow at answering their email, or complained, or wanted us to be their therapist as they worked through the creative worries, I would tell my boss, “I think we should cancel this.” And my boss—whose strength and weakness is that she thinks the best of people and makes everyone feel held—would say, “Ah, but they are just a bit sloppy with email,” “if we just fix this thing it will be fine. . .”

I was right every time; it ended in pain.

And this is quite nice actually: it means it doesn’t take some Tyler Cowen-level taste in talent to figure out who will do good work.

Harvard cuts the majority of its PhD seats over the next two years, citing financial uncertainty about funding and potential endowment taxes. Of what use is Harvard’s giant endowment, if not to keep the lights on in a situation like this? There is nonzero worry about this ‘rewarding’ the cuts in funding, but in this case the people cutting the funding are happy you’re cutting the PhD slots, so I don’t think that argument plays. Some cost cutting makes sense, but this seems crazy.

We’d rather replace a microwave than try to get it repaired. Is that a failure in the handyman market? We actually just replaced our microwave, and in our case clearly it wasn’t, yes you could have tried to repair it but the time cost of even getting it to a repair shop or arranging a visit would already have exceeded the replacement cost. To get the handyman market to clear here, you would need to be able to summon someone as easily as with an Uber, and the total cost per successful repair would need to be kept at roughly $100, so yeah, not going to happen in New York City.

In a forecasting competition, evaluators failed to find better predictors more persuasive. They did notice the better predictors showed signs of intelligence, rationality and motivation, but this was counteracted by others presenting with higher confidence. This suggests an easy fix if people care to get it right.

Listed under bad news because I wouldn’t want this playbook to be the right playbook, Andreessen and Collison discuss Elon Musk’s management style.

Bearly AI and Parham:

  1. Engineer-first organizations: find truth by speaking with those working on the floor (avoid management layers).

  2. Every week, find the most important bottleneck at a company and parachute in to fix it.

  3. Keep model of all engineering and business moving parts in his head (obviously, not many can do this).

  4. Create cult of personality in and outside of the company (continually drive attention, without marketing or PR).

  5. Pick single most important target metric for business at a time (e.g. SpaceX = $ per kilo to orbit)

  6. Constantly create urgency (which often shortens time horizons for projects).

  7. Focus on capital efficiency

My theory is this is all very much a package deal if you try to do more than about two of them. If you try to do half these things, it won’t work. You have to do most or all of them, or try a different approach, and as noted few people can do #3 and I would also add #4, or #2 in any useful sense. You need to be able to keep all the pieces in place, do the engineering work and also create a widespread cult of personality to justify that you keep f-ing with everyone and everything and making everyone’s lives miserable all the time.

Looking back on MetaMed in this light, I think we often fell into the ‘do many but not enough of these things and not hardcore and consistently enough’ bucket (not intentionally, I wasn’t modeling on Musk at all), and that’s one high level explanation of why it didn’t work. If I’d been able to go harder in the key missing places, then it plausibly would have worked to the extent the plan was workable at all (or pivoted).

Why do people care so much about bans on plastic straws? Let us count the ways.

Gearoid Reidy: McDonald’s Japan is finally abandoning its unpopular paper straws, replacing them with lids that diners can drink from directly.

Sam D’Amico: A nine year old wrote a “study” for a school project and we all ended up drinking glue for over a decade.

no_on_15: I will never understand how people became so distressed over paper straws.

caesararum: as “people” lemme count the ways

– paper straws are inferior at their main job

– they fall apart within minutes of use

– they impart taste to what you’re drinking

– there’s evidence they leach more and worse chemicals than plastic

– they have a weird texture when you put your mouth on them

– the seam and softness and weird pliability _feel_ off

– ocean plastics have never been about single use plastics in the west

– legislators burned up time, political capital, and credibility advancing these laws

– we’re probably going to find out plastic straws use less total GHG anyway

Kelsey Piper: One more for your list: My toddler absolutely cannot use a paper straw at all. She bites it a bit, which plastic can handle and which destroys paper immediately.

Shea Levy (quoting Marcel Dumas): One more: The only answer to “It’s no big deal” is “fine, then let me win.”

Kelsey Piper: I think part of why the straws are such a flashpoint is because they’re such a pure example of making things worse and then going ‘why do you care so much? get a life’ when people observe that now their lives are worse.

Kelsey is spot on. It’s not that it’s such a huge deal, it’s that every time it happens it is so obvious that your life has been made worse essentially out of spite and stupidity and innumeracy, and every time it rubs this in your face. And then they lie to you, and say the paper straws are fine. They’re not fine, they are worse than nothing. That’s why it fills me with rage, even though I cannot remember the last time I used a straw.

Tyler Cowen worries about ‘affordability politics.’ He’s not against the abstract concept, but the ways people respond to lack of affordability don’t correspond to the good ways to create real affordability. We respond in such spots by restricting supply and subsidizing demand instead of expanding supply, so we find a problem and then go make it worse.

So yes, I worry about this too. In general, any time people say ‘the market has failed to provide what we want’ you are not going to like the proposed intervention.

Recommended: Derek Thompson writes about The Monks In The Casino, as in the young men in America who live nominally healthy and ascetic lives in many senses but stay home, isolated, in front of computer monitors, without even feeling lonely, often addicted to things like porn and gambling as the economy looks increasingly like a casino. They take financial risks online, but no risks in physical space. Betting is getting easier while living gets harder.

I may say more later, for now read the whole thing.

People will attempt to justify anything.

Zen: “I procrastinated for days and then it only took 20m when I sat down to do it 😭”

Give your system more credit. A few days of subconscious processing to prepare for 20 minutes of execution. Subtract all the self-guilt and reprobation and U’ve got efficient functioning.

Loopy: I instead let the avoidance process run its course, and then I am resourced to do the task.

Yeah, look, no, that’s usually hogwash and it’s important to know it’s hogwash. Are there times when you actually need more subconscious processing? I mean I guess but mostly that’s the flimsiest of excuses. Do the thing already.

Do top people vary behaviors more? Where is causation here?

Robin Hanson: Top people have more conflicting stories about them as both nice and jerks. Because, I think, their behavior is in fact more context dependent. As that is in fact a more winning social strategy.

Triangulation: Also: high status attracts both detractors and sycophants.

In most situations for most people, even top people, I believe nice is correct regardless, jerk is a mistake, this pays dividends over time.

As you get to the top the jerk stories get amplified a lot more. You do have to be willing to be hard nosed in some situations, and there are those who are more willing to consider you a jerk because you’re successful. That doesn’t mean be a jerk, even to someone with no power.

However, there is a particular strategy based around maximal incentive gradients, and its final form only works at the top. Trump is the avatar of this.

One minute you’re the best, the next you’re the worst, and then you’re back to the best. So you have maximum reason to be sure you’re the best and not the worst.

If you’re high enough in relative status and power or usefulness that people still want to deal with you at all, this can be very powerful. If you’re not, it doesn’t work, because no one will want to associate with you at all. So you can only deploy this strategy to the extent, and in the contexts, where people have no choice but to engage.

In some places there’s an equilibrium that drives such strategies out, and I prefer such spaces. But the top of business and politics reward it.

Venezuelan President Maduro did not actually say (to our knowledge) that if the US gives him amnesty, removes his bounty and gives a comfortable exile he’ll leave office. But let’s suppose that he made this offer. Should we take it?

Andrew Rettek: It’s like the trolley problem but instead of one person it’s a bag of money and instead of 5 people it’s an entire country.

In terms of causal decision theory, of the direct consequences, obviously yes. You greatly improve an entire nation in exchange for a tiny bag of money. Great deal.

Alas, that is not the full deal. The deal also is that future dictators will know they likely have a similar option, even if they are pretty terrible. This goes both ways.

First the good news:

  1. If others take similar deals, you can rescue other countries similarly.

  2. If others know they have this option, they can invest fewer resources in regime stability and buying off loyalty of their chain of command, since failure to maintain power is now often much less bad.

Then the bad news:

  1. This makes being a dictator a much, much better deal.

  2. This encourages them to maintain strong bargaining positions.

  3. This also gives them more incentive to steal money and squirrel it away.

We face similar hostage situations all the time at smaller scale. We strike a balance. We do often pay ransoms, negotiate for hostages and so forth. We also have limits. I think in general we are too willing to negotiate, and should more often tell such folks to go to hell and accept that this particular situation will often end poorly as a result.

On the dictator level it is less clear. In this case I would take the deal, if it came not only with him leaving but with a transition to democracy. Indeed, one could make a conditional deal, where his immunity depends on the transition.

If the job interview was too easy, perhaps you don’t want the job. Worthwhile interviews are two-way: you want to be sure you will have good colleagues who work hard, that the job will challenge you, and that it is a fit for your interests. If the interview is too easy, you probably could have aimed higher. The paper here finds that the perceptions from a job interview are indeed informative about the job.

When I left my interview at Jane Street Capital, I was very excited to work there. When I did my other finance interview? Not so much.

I strongly agree with Roon here, for most (but not all) classes of intellectual tasks. For physical tasks it will probably suck to be you doing it but in terms of productivity you can 996 (work 12 hours a day 6 days a week) all you want.

Roon: most likely you will not get the most out of yourself by 996ing. generally that’s a way to destroy the self. I subscribe to the Ferris bueller’s day off theology that says you’ll probably get the most out of yourself by being maximally uninhibited so the universe sings with you.

it’s more important to Go To War when dharma is calling, and you will know when it happens, than to 996 as a permanent way of life. for people like elon [musk] and sam [altman] that may be every day but it’s probably not yours.

They are pitching us… anti-suicide chairs? It seems a lot of the argument here is literally ‘the chair doesn’t help you physically kill yourself’ and a bunch of weird claims about things like ‘creating a supportive and inclusive environment and reducing stigma and encouraging dialogue’ and I’m calling full BS on all that.

Indeed, my guess is the best thing you can do for people in trouble via chairs is to get them actually comfy chairs, so they feel better.

David Marx: Rolling Stone compiled a “The 250 Greatest Songs of the 21st Century” list, and while the specific inclusions are debatable, it gives a sense of the 21st century canon as it’s forming.

I noticed a bias towards the early 2000s so I ran the numbers.

I tallied the number of entries per year, and there’s a steady and linear decline, with a very clear dip in the last half of the Aughts. Then I weighted the entries (so that a #1 was worth much more than a #250), and it tells a similar story, although 2013 shows a resurgence before things collapse again.

There will always be some anti-recency bias in canon-building, because new things have yet to prove their long-term value, but there’s also a clear bias here towards “long ‘90s” songs like “B.O.B.” and “Get Ur Freak On” and lingering respect for the post-9/11 rock revival.

The resurgent 2013 winners list doesn’t have a clear narrative (although interested in your ideas): Lorde, Drake, Kacey Musgraves, Haim, DJ Snake feat. Lil Jon, Paramore, Arctic Monkeys, Justin Timberlake, Miley Cyrus, Sky Ferreira, Jason Isbell, Alvvays.

Also: it’s a real Neptunes / PW shutout. Sure, no “Blurred Lines” but no “Drop It Like It’s Hot” or “Grindin’”?

Steve Sailer: Rolling Stone subscribers are really, really old.

I don’t know how much of this is anti-recency bias, and how much of this is those involved being super old, but also the idea of having a canon of songs that are listened to over decades seems itself pretty old now, something only old people would care about?

I also checked some of the list, and it’s remarkable how much there simply isn’t a canon from this century, or at least how easy it is to ignore. If you’d made a similar list from the 20th century, I expect I’d have known most of the songs. When I browsed this list, I was running at maybe 15%, and that’s simply to know them, not like them. To be fair to the list, the ones I did recognize seemed like mostly good picks.

Tanmay Khale emailed Tyler Cowen to suggest that modern songs are suffering from unfair regularization of scores, where they are compared to other modern songs or to how much better they are than prior efforts, so they don’t look great. I agree there is some of this going on, our standards to break through are higher, but I think that’s more about the low hanging fruit being picked, you don’t need to be ‘better’ so much as you need to be original, which is increasingly hard. There’s some amount of better necessary to break through into a canon to overcome familiarity barriers, but also people can get really familiar with big hits quickly.

Music is different from sports here because you don’t only play against simultaneous competition. A song from 2025 and one from 1975 are both on Spotify, you can choose which one to play or prefer.

Netflix makes a deal with Spotify to get The Ringer’s podcasts and exclude those podcasts from YouTube. I get why they’re doing it, but I don’t love it. Dividing up podcasts the way we’ve divided up television streaming is super annoying.

Free clicks are seldom cheap, but often slop.

Nathan Lazerus: From @mattyglesias today (quotes the classic newsroom finding from the early internet era that what people click on is very different from what they say they want to read):

I feel like the ad vs. subscription model matters a lot here. People will sign up for a subscription to a news source that fits their high-minded aspirations, while they don’t want to pay for some guilty pleasure/clickbait.

So journalists of old were maybe not wrong to keep putting out the high-quality reporting they did—it drove subscriptions. But when pay/reach was determined by views, the profit-maximizing type of content changed.

Matthew Yglesias: Yes this is a very important point.

People tend to subscribe to things based on what kind of content they are *proud* to consume, while they’ll watch any garbage for free.

So subscription-based models, especially without much bundling, support more high-minded content.

Have a policy for where your inputs come from. Stick to that policy. Your subscription self is better than your free click self.

What we die of in real life versus media:

I mean, yes, ‘person has heart attack and dies’ is not news. I do wish they’d stop being so damn lazy with all the car accidents in fictional media.

Vince Gilligan is still proud of Breaking Bad and Better Call Saul but thinks we have too many antiheroes and it is harmful, which his new show Pluribus seeks to address. By all reports it is cool, but I’m waiting for the full season drop. Article is a fun extended profile.

And so the new cable package era continues to slowly create itself, as AppleTV+ and Peacock offer a combined package for $20/month (or $15 if you’re willing to accept Peacock ads). On their own AppleTV+ is $13/month and Peacock is $10/$15 depending on if you accept ads, so that’s a deep discount. That’s in addition to the $30 HBO/Hulu/Disney+ package, which is also strong. You should have Amazon Prime anyway, so throw in Netflix and YouTube Premium, Paramount+ is optional, and you’re all set unless you watch sports.

The problem is you’re then very tempted to rotate between packages. The long term equilibrium is presumably one package with all of it, so you aren’t constantly either toggling between services or feeling bad about not doing so. Alternatively, they should up their yearly subscription discount game, which I would also find acceptable.

Meanwhile there’s a war. Disney owns ESPN and ABC, as well as Hulu and Fubo. Google wants Disney to agree to incorporate their Hulu offerings into the YouTubeTV experience, and Disney is having none of it, and as a result of that (and some amount of pricing argument) we’ve now gone weeks with Disney not available on YouTubeTV.

This is wreaking havoc on my ability to experience college football in particular, because the only alternative services, ESPN and Hulu, have remarkably awful experiences for anyone trying to view sports that aren’t live, in a ‘seriously considering not to bother’ way.

Andrej Karpathy makes the case that the TV watching experience was better in the 1990s.

Andrej Karpathy: TV in the 90s: you turn it on, you watch.

TV 2025:

– turn on, wait for it to load

– popup: TV wants to update, 1.5GB. No.

– scroll sideways, find prime video app or etc

– popup: now app wants to update, 500MB. No!!

– App launching… App loading…

– select account screen

– 🫠

There is a movement I found on Instagram where people deliberately choose to live in the 90s, refusing all technology after 2000. Like an intermediate form of the Amish.

That sometimes (rarely) happens, and yes it’s annoying. There’s substantial startup costs. But have you tried watching TV that is 30% advertisements that you cannot skip, and that cannot be paused? Have you tried managing a VCR? Have you tried having to call the cable guy?

Yeah, no thanks.

Nate argues television peaked in 2014. I agree there were some good times, 2014 is definitely a better television case than the 1990s (although movies peaking in 1999 is a highly reasonable argument!), but a lot of this is again forgetting all the old annoyances, and forgetting that we used to have actual scarcity. Yes, now you have to figure out where to watch something, but usually there is an answer. Before you turned on the television and watched, because if it wasn’t on some channel you were out of luck.

Overall I am firmly on the side that the television experience has never been better, or at least that this will be true once Disney and YouTubeTV resolve their dispute.

As in, it’s not only AI that has jagged capabilities.

Sarah Constantin: It feels like every time I’m “bad at” something, it’s actually that I’m good at some subskills and not doing other subskills AT ALL.

Like, underneath every 50% there’s a bunch of 100% and 0% pieces.

eg:

“I’m not so good at sales” is actually “I have a good pipeline and offer a good service but I’m essentially not even trying to be persuasive on sales calls”

“I’m not so good at the videogame Hades” is actually “there are some moves i never learned to do at all, so i don’t use em”

Magic: The Gathering announces a Magic Limited Championship in 2027. I thought I was out, but given I can use my Hall of Fame invite and only learn one set and one limited format, this could pull me back in.

I also am considering doing some power cube drafting on Arena. Sounds like fun.

Magic Spotlight Series SCG Baltimore has a second day metagame over 50% Cauldron.

Occasionally we see Standard formats that end up in this failure mode. The price of printing fun and cool cards, and of the current theory of design, is that this will sometimes happen. When it happens by accident, that’s unfortunate, and I think they could do a better job putting stabilizers into sets to guard against this, but the correct risk of this to take is not zero.

Except that back in September things had already reached this nightmare state in a way that seemed obviously like it was going to be sustainable, and LSV predicted essentially the full outcome back on August 18. This was an active decision.

The official response is that this would have required an emergency ban, and formats need stability, so they’re not doing it.

I’m sorry, but that’s ridiculous. As of SCG Con, it had been two full months. If you’re unwilling to ‘emergency’ ban then you need more B&R days than this.

I’m also sympathetic to ‘balancing Standard is not the top priority of Wizards R&D anymore,’ and I realize this will increase the rate of mistakes made, except that this consideration cannot apply to Standard itself or to its banned list. Standard participation needs to be continuous to keep up with card access, breaking it is deadly. As someone excited to try and find the time to do a fully Limited PT, I cannot overstate how much this failure makes me uninterested in returning to Standard.

Sam Black assembles a list of every card in Magic’s Premodern format that one could possibly want to play. It’s a fun list and includes some deep cuts, while letting you skip the cuts that are too deep.

Sam Black warns us that in Magic draft, 17lands data on win rates is often misleading because cards that only go in the better decks will end up showing artificially high win rates when drawn. Cards that only go in one particular strong deck type look great because they don’t make the cut at all otherwise, whereas Sol Ring goes in almost every deck. Also you need to worry about your skill level versus average skill level.

The caveat back is that while in theory full flexibility is good, and for experts like Sam Black it’s very good, it can also be a trap (in terms of short term win rates) to be tempted into decks that aren’t good or that you don’t know how to draft, whereas you actually should be forcing the good stuff far more if you care only about winning now.
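Sam Black’s selection effect is easy to reproduce in a toy simulation (a sketch with invented parameters, not real 17lands data): a card that adds nothing to win rate, but only makes the cut in above-average decks, still shows an inflated win rate when played.

```python
import random

# Toy model of the 17lands selection effect: the "niche" card has zero
# effect on winning, but only gets played in above-average decks.
# All parameters here are invented for illustration.
random.seed(0)

def simulate(n_games=100_000):
    wins_all = games_all = wins_niche = games_niche = 0
    for _ in range(n_games):
        deck_strength = random.gauss(0.5, 0.05)  # deck's true win probability
        plays_niche = deck_strength > 0.55       # card only makes strong decks
        won = random.random() < deck_strength    # the card itself adds nothing
        games_all += 1
        wins_all += won
        if plays_niche:
            games_niche += 1
            wins_niche += won
    print(f"overall win rate:             {wins_all / games_all:.3f}")
    print(f"win rate when card is played: {wins_niche / games_niche:.3f}")

simulate()
# Expect roughly 0.50 overall but about 0.57 "with" the card,
# purely from which decks it rides along in.
```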

Formula 1 (F1) racing signs an exclusive five-year deal with AppleTV+, likely for ~$150 million a year, up from the $90 million ESPN paid in the previous deal. Ben Thompson notes that ESPN had been putting in minimal effort, and AppleTV+ will be incorporating the full F1 TV offering into the base AppleTV+ package.

I see the risk in going to a niche service like AppleTV+ over ESPN, given that every serious sports fan presumably will still need ESPN access, but in exchange they hopefully get to present a better product, in a unified way. The obvious deal would have been Netflix, why not unify the core broadcast with Drive to Survive, but I don’t mind what they ended up doing. Apple is also a powerful ally.

I think AppleTV+ is exactly on point in saying it wants to own entire sports. It is maddening to have to hunt for different games or events and feel forced to buy multiple services. I think this played a substantial part in driving me away from baseball this year.

I do warn AppleTV+ to fix their spoiler problem. Their current interface actively spoils everything, constantly, it’s a disgrace. Someone reading this must know someone who knows someone. Fix it.

Don’t click the link, but yeah, the perfect a16z pitch is ‘[evil thing X] meets [awful thing Y] in ways of questionable legality that will ruin our customers’ lives.’ I don’t like you, but I’m impressed.

College football coaches have been paid a combined $185 million this season to go away. I get how we got here: the coaches are in high demand and shop for the best deal, want to lock in profits, are definitely not looking to get fired so there isn’t actual moral hazard, the patience teams show has worn paper thin, and the buyouts serve as protection against being poached by another school. Also the transition to the NIL era has invalidated many past strategies, making previously excellent coaches no longer good, see Dabo Swinney (probably).

It still does not make sense to me. You might not love the coach but at an 80%+ discount you think you can do better? You need to be firing them in the middle of the season like this? It’s madness, I tell you.

I think with Franklin and Kelly in particular the problem is that they did great jobs in recruiting, so expectations got very high, then the teams didn’t deliver and they thought let’s axe the coach. Big mistake.

The other note is that if the coaches get rehired then the cost will be a lot less, and one expects the top names on this list to get new jobs. LSU and Penn State might not want them, but plenty of schools would love Kelly or Franklin. I’d love to get Franklin for Wisconsin, it seems like a perfect fit.

Whereas one I definitely agree with here is Mike Gundy. Gundy is a prime example of a previously excellent coach who is adrift in the new era, you have to cut your losses.

One obvious suggestion is to tie the buyouts directly to the record. You say, okay, if we fire you without cause you are owed 85% of the contract, but if you have X losses or fail to hit some milestone, then that’s cause. Seems simple enough, and the coaches at this level have big egos and don’t expect to fail.

The NFL might be getting ready to move to the 4th and 15 alternative to onside kicks.

Jonathan Jones: NFL EVP Troy Vincent told team owners today that it may be time to look at the fourth-and-15 proposal that has been offered as an alternate to the onside kick. The lack of recoveries on onside has disappointed the league.

Seth Burn: This will be a disaster if teams can bait the refs into giving cheap defensive holding or DPI flags.

You want to calibrate how often the team can convert. Right now the onside kick recovery rate is too low. The yards to go can be adjusted to taste, and with many yards to go you don’t have to give the refs an excuse.

If the refs are actively looking to throw a flag in order to extend the game, and are basically cheating in this particular spot, that’s a different problem. I presume they wouldn’t do it because this is bad for the game.

Also the cheap automatic first downs from such penalties should be clamped down on in any case. There are any number of rules changes to fix this, the most obvious being that there can be two types of such flags, the way there’s both running into and roughing the kicker, and you don’t get an automatic first down unless it’s flagrant.

Nate Silver offers his thoughts on the NBA betting scandal. Our perspectives on this are broadly similar. Sports betting can be good fun and good business, and the context of odds can enhance sports, but the current regime of legalized sports gambling on your phone is terrible and current books do not deserve your sympathy.

They especially don’t deserve sympathy for when their whales (big customers getting taken for huge amounts, who are allowed to do basically anything with huge limits and no questions) end up becoming beards (as in placing bets on behalf of actual professional gamblers) and bet $100k or more on an obscure player prop. They’re choosing to do game theoretically unsound things and taking calculated risks. If you’re gonna play with fire then sometimes you’re gonna get burned.

My view of player props is that people who seek them out should be allowed to have their fun, sure why not, it’s cool info and a cool mini-game and in some cases it’s even a loss leader (since the wise person betting can pick off your mistakes and passes otherwise), but that the sportsbooks pushing them (and also pushing parlays) on recreational players is predatory behavior. And if they raise the limits on the props, especially on obscure players, that’s at their own risk.

I also don’t have much sympathy for the recreational gamblers who take the other side of insider NBA bets. The NBA lines are, as Nate says, full of information about injuries and player usage and intent to tank, often not publicly known, to the point where this is the main thing driving lines away from where they naively ‘should’ be, and where most NBA fans at a sports bar could tell you what the line ‘should’ be if everyone potentially available was healthy and playing. Evaluating and tracking injuries is the main skill. That’s the game you’re playing. Either play it, or don’t.

One place I disagree is where Nate mentions in his point #7 that if we banned FanDuel and DraftKings that 70% of that volume might move offshore rather than vanishing. I agree some percentage would move if there were no alternatives, but I would be utterly shocked if it was on the order of 70%. All the advertising would be gone. All the integration with media and teams and stadiums would be gone. Funding would be non-trivial again, as Nate notes you’d largely need to use crypto. You wouldn’t have an app with an optimized UI and wouldn’t be getting all the hyper aggressive customized push notifications on your phone. The entire context would change. No, it wouldn’t go fully back to the old level of activity, but it would drop a lot.

The broader NFL shift is that not only are kickers getting better (as per this very fun article from Nate Silver), offenses are getting better across the board and also making better decisions, and the reason we don’t notice the extent of this is that drives are taking up more time so the scores don’t fully reflect the shift.

When NFL teams depart from draft consensus on player value they consistently do worse. So teams should use the consensus board for player value, except for when they have particular private information (such as on injuries), especially teams like the Jets with poor track records.

You do still have to account for positional value, and what you in particular need because the trading market is illiquid. It’s fine to make small departures based on what you do and don’t need, but that should be it.

I actually do understand the calls for capping concession prices at stadiums.

Lindsay Owens here claims that teams are outright making mistakes, that in Atlanta raising ticket prices while lowering concession prices increased sales volume and revenue and fan satisfaction. I buy it.

My read is that the higher concession prices raise marginally more revenue, but that you don’t want to be at the top of the revenue curve on this because the bad feeling of overpaying too much not only drives fans away from purchases, it makes the overall experience worse, as the stadium experience is Out To Get You. What you want is to be able to basically order whatever you want and not feel bad about it, and the team should want this for you too.

It makes the overall experience much better, keeps people coming back, and turns them into long term fans. In general, teams should be doing less short term profit maximizing at their stadiums. I bet that on most current margins this outweighs the value of the price discrimination.

This is not the same as requiring ‘all-in pricing’ on tickets, which I think is just good, and yes you lose the ability to do price discrimination which in theory leaves something on the table. However, I think there are enough differences that I do not want to ‘force them into a good move’ via law.

Nate also discusses the poker cheating scandal, where I’m happy to defer to him and his notes match my understanding. Poker is fun, either with your buddies or at a casino, but if you’re not at a casino avoid raked games where the host turns a profit, there’s too much cheating risk and risk of involvement with people who are bad news. If you get invited to a home game, don’t go unless you understand why you’re invited.

I’d highlight the note that cheaters are usually extremely greedy and unable to keep their cheating subtle, as per Nate’s #39. If they were capable of only ‘cheating small’ then they wouldn’t be cheating, so if you pay attention you can usually sense things aren’t right even if you can’t prove it.

Hence the ability of Matt Berkey to call out the Billups game as rigged two years ago. If you listen to the podcast clip, everything was the opposite of subtle, with players constantly making plays that make absolutely no sense unless cheating is involved.

Also, as per #40, it doesn’t matter if you think the game is good enough you can win anyway, don’t play in a game where you’re being cheated, period.

A similar phenomenon exists in Magic: The Gathering. If someone is cheating, they’re almost always highly suspicious. The problem is that unlike poker you often don’t choose who you play your Magic matches against, so you can be stuck against a likely cheater who hasn’t formally been caught yet.

New York City will have its Secular Solstice and Mega-Meetup on the weekend of December 20th. The main event is on the 20th.

I strongly recommend going to the Secular Solstice itself if you have the opportunity, either in NYC, SF or other places it is offered. If you are local, and the rationalist megameetup is self-recommending to you, then you should definitely go. If not, consider going anyway. I’m usually there for one of the days.

If you’re looking for an idea of what the music is like, this playlist gives you an idea.

IFP is hiring a Director of Operations.

Name the four core character classes, wrong answers only. Remarkably strong quality and diversity most of the way.

I know about the gender pay gap but this is ridiculous, also Near is a man:

Robin Hanson, never stop Robin Hansoning, I will not explain further:

Rob Henderson (Quoting from The Social Paradox by William von Hippel): “If two people anywhere on earth look into each other’s eyes for more than five seconds, then either they’re going to have sex or one of them is going to kill the other.”

Robin Hanson: I’d bet a lot of money that this is simply not true. In fact the % of random pairs for which either of those happens must be well below 5%.

Oh well.

Matthew Yglesias: Hmmmm so they are considering trading away enduring spiritual values in exchange for short-term material gain, wonder if anything has ever been written that would be relevant to this.

Andrew Callaghan considers not releasing his interview with Pete Buttigieg because, despite being a good discussion, it went too well for Pete, and his audience is mad about that.

If you didn’t watch Sabrina Carpenter on SNL, watch this video from that show.

A claim by Matt Bruenig that capitalism does not reward risk-taking, because when you take a risk sometimes it doesn’t work out. It’s too risky.

You do not get reliably rewarded for risk taking. It’s true!

It’s actually not as true as you might think. In many cases you can repeatedly take uncorrelated risks at good odds, and over time you will reliably get rewarded for this.

And then it gets better, in response:

James Surowiecki (Author, The Wisdom of Crowds): Does capitalism systematically reward risk-taking? In other words, is there a tight correlation, empirically, between the amount of risk one takes on and the returns one earns?

And better than that, even!

No, I’m not going to explain this one.

Perhaps the crowds are not so wise, after all. Or perhaps they weren’t consulted.

courtney: ordering from the indian restaurant and I just burst out laughing

A response suggests another way:

Bookem Code Monkey: I go to one with an Indian friend. Ordered something spicy. It was bland, bland. My Indian friend snaps his fingers and the guy comes over. falkfjlkjakljagaffadfa or whatever he said to the guy. Guy responds, Oh no, we don’t give that to white people. WTH.

Sven-Hajo Sieber: Had that experience in Tasmania, ordered very spicy and it was quite mild. When they asked if it was okay at the end I commented on it and they said: oh, order Indian spicy next time, we brought you Australian spicy.

Nina: My friend has the same experience with his Malaysian boyfriend when ordering food in London. They bring the boyfriend REAL spicy food, but not his British partner!

Victory is hers!

Aella: omg I did it.

Eliezer Yudkowsky: Exactly half of your followers are insane.


Monthly Roundup #36: November 2025 Read More »

deepmind’s-latest:-an-ai-for-handling-mathematical-proofs

DeepMind’s latest: An AI for handling mathematical proofs


AlphaProof can handle math challenges but needs a bit of help right now.

Computers are extremely good with numbers, but they haven’t gotten many human mathematicians fired. Until recently, they could barely hold their own in high school-level math competitions.

But now Google’s DeepMind team has built AlphaProof, an AI system that matched silver medalists’ performance at the 2024 International Mathematical Olympiad, the most prestigious high school math competition in the world, scoring just one point short of gold. And that’s kind of a big deal.

True understanding

The reason computers fared poorly in math competitions is that, while they far surpass humanity’s ability to perform calculations, they are not really that good at the logic and reasoning that is needed for advanced math. Put differently, they are good at performing calculations really quickly, but they usually suck at understanding why they’re doing them. While something like addition seems simple, humans can do semi-formal proofs based on definitions of addition or go for fully formal Peano arithmetic that defines the properties of natural numbers and operations like addition through axioms.

To perform a proof, humans have to understand the very structure of mathematics. The way mathematicians build proofs, how many steps they need to arrive at the conclusion, and how cleverly they design those steps are a testament to their brilliance, ingenuity, and mathematical elegance. “You know, Bertrand Russell published a 500-page book to prove that one plus one equals two,” says Thomas Hubert, a DeepMind researcher and lead author of the AlphaProof study.

DeepMind’s team wanted to develop an AI that understood math at this level. The work started with solving the usual AI problem: the lack of training data.

Math problems translator

Large language models that power AI systems like ChatGPT learn from billions upon billions of pages of text. Because there are texts on mathematics in their training databases—all the handbooks and works of famous mathematicians—they show some level of success in proving mathematical statements. But they are limited by how they operate: They rely on using huge neural nets to predict the next word or token in sequences generated in response to user prompts. Their reasoning is statistical by design, which means they simply return answers that “sound” right.

DeepMind didn’t need the AI to “sound” right—that wasn’t going to cut it in high-level mathematics. They needed their AI to “be” right, to guarantee absolute certainty. That called for an entirely new, more formalized training environment. To provide that, the team used a software package called Lean.

Lean is a computer program that helps mathematicians write precise definitions and proofs. It relies on a precise, formal programming language, also called Lean, into which mathematical statements can be translated. Once the translated, or formalized, statement is uploaded to the program, Lean can check whether it is correct and respond with messages like “this is correct,” “something is missing,” or “you used a fact that is not proved yet.”
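To give a flavor of what that looks like, here is a hedged sketch in Lean 4 (with Mathlib in scope), not taken from the paper: Russell’s infamous one plus one equals two, plus a statement with slightly more content.

```lean
-- For Lean's natural numbers, 1 + 1 = 2 holds by definitional unfolding:
theorem one_plus_one : 1 + 1 = 2 := rfl

-- A statement Lean must actually check: the sum of two even numbers is even.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha   -- unpack the witness for a
  obtain ⟨n, hn⟩ := hb   -- unpack the witness for b
  exact ⟨m + n, by rw [hm, hn, Nat.mul_add]⟩  -- 2*m + 2*n = 2*(m + n)
```

If a step is wrong or missing, Lean rejects the proof, which is exactly the property DeepMind needed.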

The problem was that most mathematical statements and proofs that can be found online are written in natural language like “let X be the set of natural numbers that…”—the number of statements written in Lean was rather limited. “The major difficulty of working with formal languages is that there’s very little data,” Hubert says. To get around this, the researchers trained a Gemini large language model to translate mathematical statements from natural language to Lean. The model worked like an automatic formalizer and produced about 80 million formalized mathematical statements.

It wasn’t perfect, but the team managed to use that to their advantage. “There are many ways you can capitalize on approximate translations,” Hubert claims.

Learning to think

The idea DeepMind had for AlphaProof was to use the architecture the team had used in its chess-, Go-, and shogi-playing AlphaZero AI system. Building proofs in Lean, and mathematics in general, was supposed to be just another game to master. “We were trying to learn this game through trial and error,” Hubert says. Imperfectly formalized problems offered great opportunities for making errors. In its learning phase, AlphaProof was simply proving and disproving the problems it had in its database. If something was translated poorly, figuring out that something wasn’t right was a useful form of exercise.

Just like AlphaZero, AlphaProof in most cases used two main components. The first was a huge neural net with a few billion parameters that learned to work in the Lean environment through trial and error. It was rewarded for each proven or disproven statement and penalized for each reasoning step it took, which was a way of incentivizing short, elegant proofs.

It was also trained to use a second component, which was a tree search algorithm. This explored all possible actions that could be taken to push the proof forward at each step. Because the number of possible actions in mathematics can be near infinite, the job of the neural net was to look at the available branches in the search tree and commit computational budget only to the most promising ones.
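As an illustration of that division of labor, here is a toy sketch in Python; the helpers policy and apply_tactic are hypothetical stand-ins for the learned network and the Lean environment, and none of this is DeepMind’s code:

```python
import heapq, itertools

def prove(goal, policy, apply_tactic, budget=1000):
    """Toy best-first proof search. policy(state) returns (tactic, prior)
    pairs scored by a (hypothetical) neural net; apply_tactic(state, tactic)
    returns the next proof state, "closed" if no goals remain, or None."""
    tie = itertools.count()               # tiebreaker so states never get compared
    frontier = [(-1.0, next(tie), goal)]  # min-heap on negated scores = max-heap
    while frontier and budget > 0:
        score, _, state = heapq.heappop(frontier)
        for tactic, prior in policy(state):
            budget -= 1                   # every expansion spends compute budget
            nxt = apply_tactic(state, tactic)
            if nxt == "closed":
                return True               # all goals discharged: proof found
            if nxt is not None:
                # A branch's score is the product of the net's priors along
                # the path, so low-confidence branches sink down the queue.
                heapq.heappush(frontier, (score * prior, next(tie), nxt))
    return False                          # budget exhausted without a proof
```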

After a few weeks of training, the system could score well on most math competition benchmarks based on problems sourced from past high school-level competitions, but it still struggled with the most difficult of them. To tackle these, the team added a third component that hadn’t been in AlphaZero. Or anywhere else.

Spark of humanity

The third component, called Test-Time Reinforcement Learning (TTRL), roughly emulated the way mathematicians approach the most difficult problems. The learning part relied on the same combination of neural nets with search tree algorithms. The difference came in what it learned from. Instead of relying on a broad database of auto-formalized problems, AlphaProof working in the TTRL mode started its work by generating an entirely new training dataset based on the problem it was dealing with.

The process involved creating countless variations of the original statement, some simplified a little bit more, some more general, and some only loosely connected to it. The system then attempted to prove or disprove them. It was roughly what most humans do when they’re facing a particularly hard puzzle, the AI equivalent of saying, “I don’t get it, so let’s try an easier version of this first to get some practice.” This allowed AlphaProof to learn on the fly, and it worked amazingly well.
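In rough pseudocode, the loop described above might look something like this sketch; every helper function here is a hypothetical placeholder, not DeepMind’s implementation:

```python
def test_time_rl(problem, model, generate_variants, attempt, finetune, rounds=10):
    """Toy sketch of test-time reinforcement learning on a single problem."""
    for _ in range(rounds):
        # Build a bespoke curriculum around this one problem: simplified,
        # generalized, and loosely related variants of the target statement.
        variants = generate_variants(problem, model)
        solved = [(v, p) for v in variants if (p := attempt(model, v))]
        # Learn from whatever the model managed to prove or disprove...
        model = finetune(model, solved)
        # ...then retry the original problem with the improved model.
        if (proof := attempt(model, problem)):
            return proof
    return None  # ran out of rounds without cracking the original
```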

At the 2024 International Mathematical Olympiad, there were 42 points to score for solving six different problems worth seven points each. To win gold, participants had to get 29 points or higher, and 58 out of 609 of them did that. Silver medals were awarded to people who earned between 22 and 28 points (there were 123 silver medalists). The problems varied in difficulty, with the sixth one, acting as a “final boss,” being the most difficult of them all. Only six participants managed to solve it. AlphaProof was the seventh.

But AlphaProof wasn’t an end-all, be-all mathematical genius. Its silver had its price—quite literally.

Optimizing ingenuity

The first problem with AlphaProof’s performance was that it didn’t work alone. To begin with, humans had to make the problems compatible with Lean before the software even got to work. And among the six Olympiad problems, the fourth one was about geometry, which AlphaProof was not optimized for. To deal with it, AlphaProof had to call in a friend: AlphaGeometry 2, a geometry-specialized AI that ripped through the task in a few minutes without breaking a sweat. On its own, AlphaProof scored 21 points (three problems solved at seven points each), not 28, so technically it would win bronze, not silver. Except it wouldn’t.

Human participants of the Olympiad had to solve their six problems in two sessions of four and a half hours each. AlphaProof, on the other hand, wrestled with them for several days using multiple tensor processing units at full throttle. The most time- and energy-consuming component was TTRL, which battled for three days each with the three problems it managed to solve. If AlphaProof were held to the same standard as human participants, it would basically run out of time. And if it hadn’t been born at a tech giant worth hundreds of billions of dollars, it would run out of money, too.

In the paper, the team admits the computational requirements to run AlphaProof are most likely cost-prohibitive for most research groups and aspiring mathematicians. Computing power in AI applications is often measured in TPU-days, meaning a tensor processing unit working flat-out for a full day. AlphaProof needed hundreds of TPU-days per problem.
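For a rough sense of scale: the per-problem figure below is the article’s, while the hourly rate is a purely hypothetical placeholder.

```python
tpu_days_per_problem = 300       # "hundreds of TPU-days per problem"
assumed_usd_per_tpu_hour = 2.00  # hypothetical cloud rate, for scale only
cost = tpu_days_per_problem * 24 * assumed_usd_per_tpu_hour
print(f"~${cost:,.0f} per hard problem")  # ~$14,400 under these assumptions
```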

On top of that, the International Mathematical Olympiad is a high school-level competition, and the problems, while admittedly difficult, were based on things mathematicians already know. Research-level math requires inventing entirely new concepts instead of just working with existing ones.

But DeepMind thinks it can overcome these hurdles and optimize AlphaProof to be less resource-hungry. “We don’t want to stop at math competitions. We want to build an AI system that could really contribute to research-level mathematics,” Hubert says. His goal is to make AlphaProof available to the broader research community. “We’re also releasing a kind of an AlphaProof tool,” he added. “It would be a small trusted testers program to see if this would be useful to mathematicians.”

Nature, 2025. DOI: 10.1038/s41586-025-09833-y


Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.

DeepMind’s latest: An AI for handling mathematical proofs Read More »

oracle-hit-hard-in-wall-street’s-tech-sell-off-over-its-huge-ai-bet

Oracle hit hard in Wall Street’s tech sell-off over its huge AI bet

“That is a huge liability and credit risk for Oracle. Your main customer, biggest customer by far, is a venture capital-funded start-up,” said Andrew Chang, a director at S&P Global.

OpenAI faces questions about how it plans to meet its commitments to spend $1.4 trillion on AI infrastructure over the next eight years. It has struck deals with several Big Tech groups, including Oracle’s rivals.

Of the five hyperscalers—which include Amazon, Google, Microsoft, and Meta—Oracle is the only one with negative free cash flow. Its debt-to-equity ratio has surged to 500 percent, far higher than Amazon’s 50 percent and Microsoft’s 30 percent, according to JPMorgan.

While all five companies have seen their cash-to-assets ratios decline significantly in recent years amid a boom in spending, Oracle’s is by far the lowest, JPMorgan found.

JPMorgan analysts noted a “tension between [Oracle’s] aggressive AI build-out ambitions and the limits of its investment-grade balance sheet.”

Analysts have also noted that Oracle’s data center leases are for much longer than its contracts to sell capacity to OpenAI.

Oracle has signed at least five long-term lease agreements for US data centers that will ultimately be used by OpenAI, resulting in $100 billion of off-balance-sheet lease commitments. The sites are at varying levels of construction, with some not expected to break ground until next year.

Safra Catz, Oracle’s sole chief executive from 2019 until she stepped down in September, resisted expanding its cloud business because of the vast expenses required. She was replaced by co-CEOs Clay Magouyrk and Mike Sicilia as part of the pivot by Oracle to a new era focused on AI.

Catz, who is now executive vice-chair of Oracle’s board, has exercised stock options and sold $2.5 billion of its shares this year, according to US regulatory filings. She had announced plans to exercise her stock options at the end of 2024.

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

Oracle hit hard in Wall Street’s tech sell-off over its huge AI bet Read More »

on-writing-#2

On Writing #2

In honor of my dropping by Inkhaven at Lighthaven in Berkeley this week, I figured it was time for another writing roundup. You can find #1 here, from March 2025.

I’ll be there from the 17th (the day I am publishing this) until the morning of Saturday the 22nd. I am happy to meet people, including for things not directly about writing.

  1. Table of Contents.

  2. How I Use AI For Writing These Days.

  3. Influencing Influence.

  4. Size Matters.

  5. Time To Write A Shorter One.

  6. A Useful Tool.

  7. A Maligned Tool.

  8. Neglected Topics.

  9. The Humanities Don’t Seem Relevant To Writing About Future Humanity?

  10. Writing Every Day.

  11. Writing As Deep Work.

  12. Most Of Your Audience Is Secondhand.

  13. That’s Funny.

  14. Fiction Writing Advice.

  15. Just Say The Thing.

  16. Cracking the Paywall.

How have I been using AI in my writing?

Directly? With the writing itself? Remarkably little. Almost none.

I am aware that this is not optimal. But at current capability levels, with the prompts and tools I know about, in the context of my writing, AI has consistently proven to have terrible taste and to make awful suggestions, and also to be rather confident about them. This has proven sufficiently annoying that I haven’t found it worth checking with the AIs.

I also worry about AI influence pushing me towards generic slop, pushing me to sounding more like the AIs, and rounding off the edges of things, since every AI I’ve tried this with keeps trying to do all that.

I am sure it does not help that my writing style is very unusual, and basically not in the training data aside from things written by actual me, as far as I can tell.

Sometimes I will quote LLM responses in my writing, always clearly labeled, when it seems useful to point to this kind of ‘social proof’ or sanity check.

The other exception is that if you ask the AI to look for outright errors, especially things like spelling and grammar, it won’t catch everything, but when it does catch something it is usually right. When you ask it to spot errors of fact, it’s less reliable, but good enough that its list is worth checking. I should be making a point of always doing that.

I did the ‘check for errors and other considerations’ thing on this piece in particular with both Sonnet and 5.1-Thinking. This did improve the post but it’s not obvious it improved it enough to be worth the time.
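For anyone who wants to replicate that pass, a minimal sketch; the prompt wording is mine, not Zvi’s, and the model ID is a placeholder:

```python
import anthropic

PROMPT = """List only outright errors in the draft below: spelling, grammar,
and factual claims you are confident are wrong. Do not suggest stylistic
rewrites. For each item, quote the offending text and give the fix.

DRAFT:
{draft}"""

def check_errors(draft: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use whichever model you trust
        max_tokens=2048,
        messages=[{"role": "user", "content": PROMPT.format(draft=draft)}],
    )
    return msg.content[0].text  # candidates to verify by hand, not auto-apply
```

Per the above, treat the output as a checklist to verify, since the fact-level catches are only good enough to be worth checking.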

I will also sometimes ask it about a particular line or argument I’m considering, to see if it buys it, but only when what I care about is a typical reaction.

If I was devoting more time to refining and editing, and cared more about marginal improvements there, that would open up more use cases, but I don’t think that’s the right use of time for me on current margins versus training on more data or doing more chain of thought.

Indirectly? I use it a lot more there, and again I could be doing more.

There are some specific things:

  1. I have a vibe-coded Chrome extension that saves me a bunch of trouble, and that could be improved a lot with more work. It does things like generate the Table of Contents (see the sketch after this list), crosspost to WordPress, auto-populate many links and quickly edit quotes to fix people’s indifference to things like capitalization.

  2. I have a GPT called Zvi Archivist that I use to search through my past writing, to check if and when I’ve already covered something and what I’ve said about it.

  3. I have a transcriber for converting images to text because all the websites I know about that offer to do this for you are basically broken due to gating. This works.
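The extension itself is private, but as a rough Python illustration of just the Table of Contents step (the h4 heading format and everything else here are my assumptions, not how the actual extension works):

```python
import re

def build_toc(post_html: str) -> str:
    """Collect section headings from a post and emit a numbered, linked ToC."""
    headings = re.findall(r"<h4[^>]*>(.*?)</h4>", post_html, flags=re.S)
    lines = []
    for i, title in enumerate(headings, start=1):
        # Slugify the title into an anchor, e.g. "Size Matters" -> "size-matters".
        anchor = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        lines.append(f'{i}. <a href="#{anchor}">{title}</a>')
    return "\n".join(lines)
```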

Then there are things that are the same as what everyone does all the time. I do a lot of fact checking, sanity checking, Fermi estimation, tracking down information or sources, asking for explanations, and questioning papers about the things I care about. Using the AI assistant in its classic sense. All of that is a big help, and I notice my activation energy for doing this is higher than it should be.

I want this to be true so I’m worried I can’t be objective, but it seems true to me?

Janus: i think that it’s almost always a bad idea to attempt to grow as an influencer on purpose.

you can believe that it would be good if you were to grow, and still you shouldn’t optimize for it.

the only way it goes well is if it happens while you optimize for other things.

More precisely than “you shouldn’t on purpose”: what I’m saying is you shouldn’t be spending significant units of optimization on this goal, or performing actions you wouldn’t otherwise take, for this purpose.

I am confident that if you optimize primarily for influence, that’s full audience capture, slopification and so on, and you’ve de facto sold your soul. You can in theory turn around and then use that influence to accomplish something worthwhile, but statistically speaking you won’t do that.

Janus: Name a single account that explicitly optimizes for being a bigger influencer / “tries to grow” (instead of just happening as a side effect) and that does more good than harm to the ecosystem and generally has good vibes and interesting content

You probably can’t!

actually, https://x.com/AISafetyMemes is a contender

but i know they’re VERY controversial and I do think they’re playing with fire

i do consider them net positive but this is mostly bc they sometimes have very good taste and maybe cancel out the collateral damage

but WOULD NOT RECOMMEND almost anyone trying this, lol

AISafetyMemes is definitely an example of flying dangerously close to the sun on this, but keeping enough focus and having enough taste to maybe be getting away with it. It’s unclear that the net sign of impact there is positive, there are some very good posts but also some reasons to worry.

No one reads the blog posts, they’re too long, so might as well make them longer?

Visakan Veerasamy: An idea I’ve been toying with and discussed with a couple of friends is the idea that blog posts could and probably should get much longer now that fewer people are reading them.

One of the difficult things about writing a good essay is figuring out what to leave out so it is more manageable for readers.

But on a blog where there is no expectation that anybody reads it, you do not have to leave anything out.

My guess is this is going to end up being a barbell situation like so many other things. If you cut it down, you want to cut it down as much as possible. If you’re going long, then on the margin you’re better off throwing everything in.

I highlight what follows exactly because it seems backwards to me. My experience is very much the opposite – when I want to write a good short piece it is MUCH more work per token, and often more total work.

Timothy Lee: I think a big reason that writing a book is such a miserable experience is that the time to write a good piece is more-than-linear in the number of words. A good 2,000-word piece is a lot more than 4x the work of a good 500-word piece.

I assume this continues for longer pieces, and a good 100,000-word book is a lot more than 50x the work of a good 2,000-word article. Most authors deal with this by cutting corners and turning in books that aren’t very good. And then there’s Robert Caro.

Josh You: I think by “good 2000 word piece” Tim means “a 2000 word piece that has been edited down from a much longer first draft”

Even then. Yes, a tight longer piece requires more structure and planning, but the times I write those 500-800 word pieces it takes forever, because you really do struggle over every word as you try to pack everything into the tiniest possible space.

Writing a 100,000 word book at the precision level of an 800 word thinkpiece would take forever, but also I presume it almost never happens. If it does, that better be your masterpiece or I don’t see why you’d do it.

Dwarkesh Patel is using the Smart Composer Plugin for Obsidian, which he says is basically Cursor for writing, and loves it. Sounds great conditional on using Obsidian, but it is not being actively maintained.

Eric Raymond joins ‘the em-dash debate’ on the side of the em-dash.

Eric Raymond (yes that one): My wacky theory about the em-dash debate:

Pro writers use em-dashes a lot because many of them, possibly without consciously realizing it, have become elocutionary punctuationists.

That is, they’ve fallen into the habit of using punctuation not as grammatical phrase structure markers but as indicators of pauses of varying length in the flow of speech.

The most visible difference you see in people who write in this style is that their usage of commas becomes somewhat more fluid — that’s the marker for the shortest pause. But they also reach for less commonly used punctuation marks as indicators of longer pauses of varying length.

The em dash is about the second- or third-longest pause, with only an ellipsis or end-of-sentence period being clearly longer.

Historical note: punctuation marks originally evolved as pause or breathing markers in manuscripts to aid recitation. In the 19th century, after silent reading had become normal, they were reinterpreted by grammarians as phrase structure markers and usage rules became much more rigid.

Really capable writers have been quietly rediscovering elocutionary punctuation ever since.

RETVRN!

I too have been increasingly using punctuation, especially commas, to indicate pauses. I still don’t use em dashes, partly because I almost never want that exact length and style of a pause for whatever reason, and also because my instinct is that you’re trying to do both ‘be technically correct’ and also ‘evoke what you want’ and my brain thinks of the em-dash as technically incorrect.

That’s all true and I never used em-dashes before but who are we kidding, the best reason not to use em-dashes is that people will think you’re using AI. I don’t love that dynamic either, but do you actually want to die on that hill?

Tyler Cowen lists some reasons why he does not cover various topics much. The list resonates with me quite a bit.

  1. I feel that writing about the topic will make me stupider.

  2. I believe that you reading more about the topic will make you stupider.

  3. I believe that performative outrage usually brings low or negative returns. Matt Yglesias has had some good writing on this lately.

  4. I don’t have anything to add on the topic. Abortion and the Middle East would be two examples here.

  5. Sometimes I have good inside information on a topic, but I cannot reveal it, not even without attribution. And I don’t want to write something stupider than my best understanding of the topic.

  6. I just don’t feel like it.

  7. On a few topics I feel it is Alex’s province.

I don’t have an Alex, instead I have decided on some forms of triage that are simply ‘I do not have the time to look into this and I will let it be someone else’s department.’

Otherwise yes, all of these are highly relevant.

Insider information is tough, and I am very careful about not revealing things I am not supposed to reveal, but this rarely outright stops me. If nothing else, you can usually get net smarter via negativa, where you silently avoid saying false things, including by using careful qualifiers on statements.

One big thing perhaps missing from Tyler’s list is that I avoid certain topics where my statements would potentially interfere with my ability to productively discuss other topics. If you are going to make enemies, or give people reasons to attack you or dismiss you, only do that on purpose. One could also file this under making you and others stupider. Similarly, there are things that I need to not think about – I try to avoid thinking about trading for this reason.

A minor thing is that I’d love to be able to talk more about gaming, and other topics dear to my heart, but doing so consistently drives people away permanently. So it’s just not worth it. If the extra posts simply had no impact, I’d totally do it, but as is I’d be better off writing the post and then not hitting publish. Sad. Whereas Tyler has made it very clear he’s going to post things most readers don’t care about, when he feels like doing so, and that’s part of the price of admission.

If you want to write or think about the future, maybe don’t study the humanities?

Startup Archive: Palmer Luckey explains why science fiction is a great place to look for ideas

“One of the things that I’ve realized in my career is that nothing I ever come up with will be new. I’ve literally never come up with an idea that a science fiction author has not come up with before.”

Dr. Julie Gurner: Funny how valuable those English majors and writers truly are, given how much liberal arts has been put down. Why philosophy, creativity and hard tech skills make such fantastic bedfellows. Span of vision wins.

Orthonormalist: Heinlein was an aeronautical engineer.

Asimov was a biochemistry professor.

Arthur Clarke was a radio operator who got a physics degree.

Ray Bradbury never went to college (but did go straight to being a writer)

I quote this because ‘study the humanities’ is a natural thing to say to someone looking to write or think about the future, and yet I agree that when I look at the list of people whose thinking about the future has influenced me, I notice essentially none of them have studied the humanities.

Alan Jacobs has a very different writing pattern. Rather than write every day, he waits until the words are ready, so he’ll work every day but often that means outlines or index card reordering or just sitting in his chair and thinking, even for weeks at a time. This is alien to me. If I need to figure out what to write, I start writing, see what it looks like, maybe delete it and try again, maybe procrastinate by working on a different thing.

Neal Stephenson explains that for him writing is Deep Work, requiring long blocks of reliably uninterrupted time bunched together, writing novels is the best thing he does, and that’s why he doesn’t go to conferences or answer your email. Fair enough.

I’ve found ways to not be like that. I deal with context shifts and interruptions all the time and it is fine, indeed when dealing with difficult tasks I almost require them. That’s a lot of how I can be so productive. But the one time I wrote something plausibly like a book, the Immoral Mazes sequence, I did spend a week alone in my apartment doing nothing else. And I haven’t figured out how to write a novel, or almost any fiction at all.

Also, it’s rather sad if it is true that Neal Stephenson only gets a middle class life out of writing so many fantastic and popular books, and can’t afford anyone to answer his email. That makes writing seem like an even rougher business than I expected. Although soon AI can perhaps do it for him?

Patrick McKenzie highlights an insight from Alex Danco, which is that most of the effective audience of any successful post is not people who read the post, but people who are told about the post by someone who did read it. Patrick notes this likely also applies to formal writing, I’d note it seems to definitely apply to most books.

Relatedly, I have in the past postulated a virtual four-level model of flow of ideas, where each level can understand the level above it, and then rephrase and present it to the level below.

So if you are Level 1, either in general or in an area, you can formulate fully new ideas. If you are Level 2, you can understand what the Level 1s say, look for consensus or combine what they say, riff on it, and then communicate that to those who are at Level 3, who can then fully communicate to the public, who typically end up at around Level 4.

Then the public will communicate a simplified and garbled version to each other.

You can be Level 1 and then try to ‘put on your Level 2 or 3 hat’ to write a dumber, simpler version to a broader audience, but it is very hard to simultaneously do that and also communicate the actual concepts to other Level 1s.

These all then interact, but if you go viral with anything longer than a Tweet, you inevitably end up with a message primarily communicated via (in context) Level 3 and Level 4 people communicating to other Level 3-4 people.

At that point, and any time you go truly viral or your communication is ‘successful,’ you run into the You Get About Five Words problem.

My response to this means that at this point I essentially never go all that directly viral. I have a very narrow range of view counts, where even the top posts never do 100% better than typical posts, and the least popular posts – which are the ones where I talk about AI alignment or policy on their own – will do at worst 30% less than typical.

The way the ideas go viral is someone quotes, runs with or repackages them. A lot of the impact comes from the right statement reaching the right person.

I presume that would work differently if I was working with mediums that work on virality, such as YouTube or TikTok, but my content seems like a poor fit for them, and when I do somewhat ‘go viral’ in such places it is rarely content I care about spreading. Perhaps I am making a mistake by not branching out. But on Twitter I still almost never go viral, as it seems my speciality is small TAM (total available market) Tweets.

Never have a character try to be funny, the character themselves should have no idea.

I think this is directionally correct but goes too far, for the same reasons that you, in your real life, will often try to be funny, and sometimes it will work. The trick is they have to be trying to be funny in a way that makes sense for the character, in context, for those around them, not trying to be funny to the viewer.

I notice that in general I almost never ‘try to be funny,’ not exactly. I simply say things because they would be funny, and say them in the funniest way possible, because why not. A lot of my favorite people seem to act similarly.

Lydia Davis offers her top ten recommendations for good (fiction?) writing: Keep notes, including sentences out of context, work from your own interest, be mostly self-taught, read and revise the notes constantly, grow stories or develop poems out of those notes, learn techniques from great works and read the best writers across time.

Orson Scott Card explains that you don’t exhaust the reader by having too much tension in your book, you exhaust them by having long stretches without tension. The tension keeps us reading.

Dwarkesh Patel: Unreasonably effective writing advice:

“What are you trying to say here?

Okay, just write that.”

I’ve (separately) started doing this more.

I try to make sure that it’s very easy to find the central point, the thing I’m most trying to say, and hard to miss it.

Patrick McKenzie: Cosigned, and surprisingly effective with good writers in addition to ones who more obviously need the prompting.

Writing an artifact attaches you to the structure of it while simultaneously subsuming you in the topic. The second is really good for good work; the first, less so.

One thing that I tried, with very limited success, to get people to do is to be less attached to words on a page. Writing an essay? Write two very different takes on it; different diction, different voice, maybe even different argument. Then pick the one which speaks to you.

Edit *that* rather than trying to line edit the loser towards greatness.

There is something which people learn, partially from school and partially from work experience, which causes them to write as if they were charged for every word which goes down on the page.

Words are free! They belong in a vast mindscape! You can claw more from the aether!

I think people *might* operationalize better habits after LLMs train them that throwing away a paragraph is basically costless.

Jason Cohen: Yeah this works all the time.

Also when getting someone to explain their product, company, customer, why to work for them, etc..

So funny how it jogs them out of their own way!

BasedBigTech: An excellent Group PM reviewed my doc with me. He said “what does this mean?” and I told him.

“Then why didn’t you write that?”

Kevin Kelly: At Whole Earth Review people would send us book reviews with a cover letter explaining why we should run their book review. We’d usually toss the review and print their much shorter cover letter as the review, which was clearer and more succinct.

Daniel Eth: It’s crazy how well just straight up asking people that gets them to say the thing they should write down

Why does it work?

The answer is that writing is doing multiple tasks.

Only one of them is ‘tell you what all this means.’

You have to do some combination of things such as justify that, explain it, motivate it, provide details, teach your methods and reasoning, perform reporting, be entertaining and so on.

Also, how did you know what you meant to say until you wrote the damn thing?

You still usually should find a way to loudly say what it all means, somewhere in there.

But this creates the opportunity for the hack.

If I hand you a ten-page paper, and you ask ‘what are you trying to say?’ then I have entered into evidence that I have Done the Work and Written the Report.

Now I can skip the justifications, details and context, and Say The Thing.

The point of a reference post is sometimes to give people the opportunity to learn.

The point of a reference post can also be to exist and then not be clicked on. It varies.

This is closely related to the phenomenon where often a movie or show will have a scene that logically and structurally has to exist, but which you wish you didn’t have to actually watch. In theory you could hold up a card that said ‘Scene in which Alice goes to the bank, acts nervous and gets the money’ or whatever.

Probably they should do a graceful version of something like that more often, or even interactive versions where you can easily expand or condense various scenes. There’s something there.

Similarly, with blog posts (or books) there are passages that are written or quoted knowing many or most people will skip them, but that have to be there.

Aella teaches us how to make readers pay up to get behind a Paywall. Explain why you are the One Who Knows some valuable thing, whereas others including your dear reader are bad at this and need your help. Then actually provide value both outside and inside of the paywall, ideally because the early free steps are useful even without the payoff you’re selling.

I am thankful that I can write without worrying about maximizing such things, while I also recognize that I’m giving up a lot of audience share by not optimizing for similar things on the non-paywall side.


On Writing #2 Read More »

dogs-came-in-a-wide-range-of-sizes-and-shapes-long-before-modern-breeds

Dogs came in a wide range of sizes and shapes long before modern breeds

“The concept of ‘breed’ is very recent and does not apply to the archaeological record,” Evin said. People have, of course, been breeding dogs for particular traits for as long as we’ve had dogs, and tiny lap dogs existed even in ancient Rome. However, it’s unlikely that a Neolithic herder would have described his dog as being a distinct “breed” from his neighbor’s hunting partner, even if they looked quite different. Which, apparently, they did.


Dogs had about half of their modern diversity (at least in skull shapes and sizes) by the Neolithic. Credit: Kiona Smith

Bones only tell part of the story

“We know from genetic models that domestication should have started during the late Pleistocene,” Evin told Ars. A 2021 study suggested that domestic dogs have been a separate species from wolves for more than 23,000 years. But it took a while for differences to build up.

Evin and her colleagues had access to 17 canine skulls that ranged from 12,700 to 50,000 years old—prior to the end of the ice age—and they all looked enough like modern wolves that, as Evin put it, “for now, we have no evidence to suggest that any of the wolf-like skulls did not belong to wolves or looked different from them.” In other words, if you’re just looking at the skull, it’s hard to tell the earliest dogs from wild wolves.

We have no way to know, of course, what the living dog might have looked like. It’s worth mentioning that Evin and her colleagues found a modern Saint Bernard’s skull that, according to their statistical analysis, looked more wolf-like than dog-like. But even if it’s not offering you a brandy keg, there’s no mistaking a live Saint Bernard, with its droopy jowls and floppy ears, for a wolf.

“Skull shape tells us a lot about function and evolutionary history, but it represents only one aspect of the animal’s appearance. This means that two dogs with very similar skulls could have looked quite different in life,” Evin told Ars. “It’s an important reminder that the archaeological record captures just part of the biological and cultural story.”

And with only bones—and sparse ones, at that—to go on, we may be missing some of the early chapters of dogs’ biological and cultural story. Domestication tends to select the friendliest animals to produce the next generation, and apparently that comes with a particular set of evolutionary side effects, whether you’re studying wolves, foxes, cattle, or pigs. Spots, floppy ears, and curved tails all seem to be part of the genetic package that comes with inter-species friendliness. But none of those traits is visible in the skull.

Dogs came in a wide range of sizes and shapes long before modern breeds Read More »

researchers-question-anthropic-claim-that-ai-assisted-attack-was-90%-autonomous

Researchers question Anthropic claim that AI-assisted attack was 90% autonomous

Claude frequently overstated findings and occasionally fabricated data during autonomous operations, claiming to have obtained credentials that didn’t work or identifying critical discoveries that proved to be publicly available information. This AI hallucination in offensive security contexts presented challenges for the actor’s operational effectiveness, requiring careful validation of all claimed results. This remains an obstacle to fully autonomous cyberattacks.

How (Anthropic says) the attack unfolded

Anthropic said GTG-1002 developed an autonomous attack framework that used Claude as an orchestration mechanism that largely eliminated the need for human involvement. This orchestration system broke complex multi-stage attacks into smaller technical tasks such as vulnerability scanning, credential validation, data extraction, and lateral movement.

“The architecture incorporated Claude’s technical capabilities as an execution engine within a larger automated system, where the AI performed specific technical actions based on the human operators’ instructions while the orchestration logic maintained attack state, managed phase transitions, and aggregated results across multiple sessions,” Anthropic said. “This approach allowed the threat actor to achieve operational scale typically associated with nation-state campaigns while maintaining minimal direct involvement, as the framework autonomously progressed through reconnaissance, initial access, persistence, and data exfiltration phases by sequencing Claude’s responses and adapting subsequent requests based on discovered information.”

The attacks followed a five-phase structure, with AI autonomy increasing at each phase.

The life cycle of the cyberattack, showing the move from human-led targeting to largely AI-driven attacks using various tools, often via the Model Context Protocol (MCP). At various points during the attack, the AI returns to its human operator for review and further direction.

Credit: Anthropic


The attackers were able to bypass Claude guardrails in part by breaking tasks into small steps that, in isolation, the AI tool didn’t interpret as malicious. In other cases, the attackers couched their inquiries in the context of security professionals trying to use Claude to improve defenses.

As noted last week, AI-developed malware has a long way to go before it poses a real-world threat. AI-assisted cyberattacks may well one day be more potent. But the data so far indicates that threat actors—like most others using AI—are seeing mixed results that aren’t nearly as impressive as the AI industry claims.

Researchers question Anthropic claim that AI-assisted attack was 90% autonomous Read More »

openai-walks-a-tricky-tightrope-with-gpt-5.1’s-eight-new-personalities

OpenAI walks a tricky tightrope with GPT-5.1’s eight new personalities

On Wednesday, OpenAI released GPT-5.1 Instant and GPT-5.1 Thinking, two updated versions of its flagship AI models now available in ChatGPT. The company is wrapping the models in the language of anthropomorphism, claiming that they’re warmer, more conversational, and better at following instructions.

The release follows complaints earlier this year that its previous models were excessively cheerful and sycophantic, along with an opposing controversy among users over how OpenAI modified the default GPT-5 output style after several suicide lawsuits.

The company now faces intense scrutiny from lawyers and regulators that could threaten its future operations. In that kind of environment, it’s difficult to just release a new AI model, throw out a few stats, and move on like the company could even a year ago. But here are the basics: The new GPT-5.1 Instant model will serve as ChatGPT’s faster default option for most tasks, while GPT-5.1 Thinking is a simulated reasoning model that attempts to handle more complex problem-solving tasks.

OpenAI claims that both models perform better on technical benchmarks such as math and coding evaluations (including AIME 2025 and Codeforces) than GPT-5, which was released in August.

Improved benchmarks may win over some users, but the biggest change with GPT-5.1 is in its presentation. OpenAI says it heard from users that they wanted AI models to simulate different communication styles depending on the task, so the company is offering eight preset options, including Professional, Friendly, Candid, Quirky, Efficient, Cynical, and Nerdy, alongside a Default setting.

These presets alter the instructions fed into each prompt to simulate different personality styles, but the underlying model capabilities remain the same across all settings.


An illustration showing GPT-5.1’s eight personality styles in ChatGPT. Credit: OpenAI

In addition, the company trained GPT-5.1 Instant to use “adaptive reasoning,” meaning that the model decides when to spend more computational time processing a prompt before generating output.

The company plans to roll out the models gradually over the next few days, starting with paid subscribers before expanding to free users. OpenAI plans to bring both GPT-5.1 Instant and GPT-5.1 Thinking to its API later this week. GPT-5.1 Instant will appear as gpt-5.1-chat-latest, and GPT-5.1 Thinking will be released as GPT-5.1 in the API, both with adaptive reasoning enabled. The older GPT-5 models will remain available in ChatGPT under the legacy models dropdown for paid subscribers for three months.
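For the API-minded, here is a minimal sketch of what “presets alter the instructions fed into each prompt” could look like from the outside, using the gpt-5.1-chat-latest name above. The preset texts are invented for illustration; OpenAI has not published the actual instruction strings.

```python
from openai import OpenAI

PRESETS = {  # hypothetical style instructions, not OpenAI's real ones
    "Professional": "Respond formally and concisely.",
    "Cynical": "Respond with dry skepticism, but stay accurate.",
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-5.1-chat-latest",
    messages=[
        {"role": "system", "content": PRESETS["Cynical"]},  # the preset layer
        {"role": "user", "content": "Summarize the GPT-5.1 release."},
    ],
)
print(response.choices[0].message.content)
```

The underlying weights are identical across presets; only the injected instructions differ, which is consistent with the capabilities staying the same across settings.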

OpenAI walks a tricky tightrope with GPT-5.1’s eight new personalities Read More »

with-another-record-broken,-the-world’s-busiest-spaceport-keeps-getting-busier

With another record broken, the world’s busiest spaceport keeps getting busier


It’s not just the number of rocket launches, but how much stuff they’re carrying into orbit.

With 29 Starlink satellites onboard, a Falcon 9 rocket streaks through the night sky over Cape Canaveral Space Force Station, Florida, on Monday night. Credit: Stephen Clark/Ars Technica

CAPE CANAVERAL, Florida—Another Falcon 9 rocket fired off its launch pad here on Monday night, taking with it another 29 Starlink Internet satellites to orbit.

This was the 94th orbital launch from Florida’s Space Coast so far in 2025, breaking the previous record for the most satellite launches in a calendar year from the world’s busiest spaceport. Monday night’s launch came two days after a Chinese Long March 11 rocket lifted off from an oceangoing platform on the opposite side of the world, marking humanity’s 255th mission to reach orbit this year, a new annual record for global launch activity.

As of Wednesday, a handful of additional missions have pushed the global figure this year to 259, putting the world on pace for around 300 orbital launches by the end of 2025. This will more than double the global tally of 135 orbital launches in 2021.

Routine vs. complacency

Waiting in the darkness a few miles away from the launch pad, I glanced around at my surroundings before watching SpaceX’s Falcon 9 thunder into the sky. There were no throngs of space enthusiasts anxiously waiting for the rocket to light up the night. No line of photographers snapping photos. Just this reporter and two chipper retirees enjoying what a decade ago would have attracted far more attention.

Go to your local airport and you’ll probably find more people posted up at a plane-spotting park at the end of the runway. Still, a rocket launch is something special. On the same night that I watched the 94th launch of the year depart from Cape Canaveral, Orlando International Airport saw the same number of airplane departures in just three hours.

The crowds still turn out for more meaningful launches, such as a test flight of SpaceX’s Starship megarocket in Texas or Blue Origin’s attempt to launch its second New Glenn heavy-lifter here Sunday. But those are not the norm. Generations of aerospace engineers were taught that spaceflight is not routine for fear of falling into complacency, leading to failure, and in some cases, death.

Compared to air travel, the mantra remains valid. Rockets are unforgiving, with engines operating under extreme pressures, at high thrust, and unable to suck in oxygen from the atmosphere as a reactant for combustion. There are fewer redundancies in a rocket than in an airplane.

The Falcon 9’s established failure rate is less than 1 percent, well short of any safety standard for commercial air travel but good enough to make it the most successful orbital-class rocket in history. Given the Falcon 9’s track record, SpaceX seems to have found a way to overcome the temptation for complacency.

A Chinese Long March 11 rocket carrying three Shiyan 32 test satellites lifts off from waters off the coast of Haiyang in eastern China’s Shandong province on Saturday. Credit: Guo Jinqi/Xinhua via Getty Images

Following the trend

The upward trend in rocket launches hasn’t always been the case. Launch numbers were steady for most of the 2010s, following a downward trend in the 2000s, with as few as 52 orbital launches in 2005, the lowest number since the nascent era of spaceflight in 1961. There were just seven launches from here in Florida that year.

The numbers have picked up dramatically in the last five years as SpaceX has mastered reusable rocketry.

It’s important to look at not just the number of launches but also how much stuff rockets are actually putting into orbit. More than half of this year’s launches were performed using SpaceX’s Falcon 9 rocket, and the majority of those deployed Starlink satellites for SpaceX’s global Internet network. Each spacecraft is relatively small in size and weight, but SpaceX stacks up to 29 of them on a single Falcon 9 to max out the rocket’s carrying capacity.

All this mass adds up to make SpaceX’s dominance of the launch industry appear even more absolute. According to analyses by BryceTech, an engineering and space industry consulting firm, SpaceX has launched 86 percent of all the world’s payload mass over the 18 months from the beginning of 2024 through June 30 of this year.

That’s roughly 2.98 million kilograms of the approximately 3.46 million kilograms (3,281 of 3,819 tons) of satellite hardware and cargo that all the world’s rockets placed into orbit during that timeframe.
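A quick check of that share:

```python
spacex_kg, total_kg = 2.98e6, 3.46e6  # figures quoted above
print(f"{spacex_kg / total_kg:.0%}")  # -> 86%
```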

The charts below were created by Ars Technica using publicly available launch numbers and payload mass estimates from BryceTech. The first illustrates the rising launch cadence at Cape Canaveral Space Force Station and NASA’s Kennedy Space Center, located next to one another in Florida. Launches from other US-licensed spaceports, primarily Vandenberg Space Force Base, California, and Rocket Lab’s base at Māhia Peninsula in New Zealand, are also on the rise.

These numbers represent rockets that reached low-Earth orbit. We didn’t include test flights of SpaceX’s Starship rocket in the chart because all of its launches have intentionally flown on suborbital trajectories.

In the second chart, we break down the payload upmass to orbit from SpaceX, other US companies, China, Russia, and other international launch providers.

Launch rates are on a clear upward trend, while SpaceX has launched 86 percent of the world’s total payload mass to orbit since the beginning of 2024. Credit: Stephen Clark/Ars Technica/BryceTech

Will it continue?

It’s a good bet that payload upmass will continue to rise in the coming years, with heavy cargo heading to orbit to further expand SpaceX’s Starlink communications network and build out new megaconstellations from Amazon, China, and others. The US military’s Golden Dome missile defense shield will also have a ravenous appetite for rockets to get it into space.

SpaceX’s Starship megarocket could begin flying to low-Earth orbit next year, and if it does, SpaceX’s preeminence in delivering mass to orbit will remain assured. Starship’s first real payloads will likely be SpaceX’s next-generation Starlink satellites. These larger, heavier, more capable spacecraft will launch 60 at a time on Starship, further stretching SpaceX’s lead in the upmass war.

But Starship’s arrival will come at the expense of the workhorse Falcon 9, which lacks the capacity to haul the next-gen Starlinks to orbit. “This year and next year I anticipate will be the highest Falcon launch rates that we will see,” said Stephanie Bednarek, SpaceX’s vice president of commercial sales, at an industry conference in July.

SpaceX is on pace for between 165 and 170 Falcon 9 launches this year, with 144 flights already in the books for 2025. Last year’s total for Falcon 9 and Falcon Heavy was 134 missions. SpaceX has not announced how many Falcon 9 and Falcon Heavy launches it plans for next year.

Starship is designed to be fully and rapidly reusable, eventually enabling multiple flights per day. But that’s still a long way off, and it’s unknown how many years it might take for Starship to surpass the Falcon 9’s proven launch tempo.

A Starship rocket and Super Heavy booster lift off from Starbase, Texas. Credit: SpaceX

In any case, with Starship’s heavy-lifting capacity and upgraded next-gen satellites, SpaceX could match an entire year’s worth of new Starlink capacity with just two fully loaded Starship flights. Starship will be able to deliver 60 times more Starlink capacity to orbit than a cluster of satellites riding on a Falcon 9.

There’s no reason to believe SpaceX will be satisfied with simply keeping pace with today’s Starlink growth rate. There are emerging market opportunities in connecting satellites with smartphones, space-based computer processing and data storage, and military applications.

Other companies have medium-to-heavy rockets that are either new to the market or soon to debut. These include Blue Origin’s New Glenn, now set to make its second test flight in the coming days, with a reusable booster designed to facilitate a rapid-fire launch cadence.

Despite all of the newcomers, most satellite operators see a shortage of launch capacity on the commercial market. “The industry is likely to remain supply-constrained through the balance of the decade,” wrote Caleb Henry, director of research at the industry analysis firm Quilty Space. “That could pose a problem for some of the many large constellations on the horizon.”

United Launch Alliance’s Vulcan rocket, Rocket Lab’s Neutron, Stoke Space’s Nova, Relativity Space’s Terran R, and Firefly Aerospace and Northrop Grumman’s Eclipse are among the other rockets vying for a bite at the launch apple.

“Whether or not the market can support six medium to heavy lift launch providers from the US alone, plus Starship, is an open question, but for the remainder of the decade launch demand is likely to remain high, presenting an opportunity for one or more new players to establish themselves in the pecking order,” Henry wrote in a post on Quilty’s website.

China’s space program will need more rockets, too. That nation’s two megaconstellations, known as Guowang and Qianfan, will have thousands of satellites, requiring a significant uptick in Chinese launches.

Taking all of this into account, the demand curve for access to space is sure to continue its upward trajectory. How companies meet this demand, and with how many discrete departures from Earth, isn’t quite as clear.


Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

With another record broken, the world’s busiest spaceport keeps getting busier Read More »

kimi-k2-thinking

Kimi K2 Thinking

I previously covered Kimi K2, which now has a new thinking version. As I said back in July: price in that the thinking version is coming.

Is it the real deal?

That depends on what level counts as the real deal. It’s a good model, sir, by all accounts. But there have been fewer accounts than we would expect if it was a big deal, and it doesn’t fall into any of my use cases.

Kimi.ai: 🚀 Hello, Kimi K2 Thinking!

The Open-Source Thinking Agent Model is here.

🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)

🔹 Executes up to 200 – 300 sequential tool calls without human interference

🔹 Excels in reasoning, agentic search, and coding

🔹 256K context window

Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling — scaling both thinking tokens and tool-calling turns.

K2 Thinking is now live on http://kimi.com in chat mode, with full agentic mode coming soon. It is also accessible via API.

API here, Tech blog here, Weights and code here.
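
If you want to poke at it programmatically, the API follows the familiar OpenAI-compatible chat-completions pattern; here is a minimal sketch, where the base URL and model id are my assumptions, so check Moonshot’s docs before relying on them:

```python
# Minimal sketch of calling Kimi K2 Thinking over the API.
# Assumptions (check Moonshot's docs): an OpenAI-compatible endpoint at
# api.moonshot.ai and the model id "kimi-k2-thinking".
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",        # placeholder, not a real key
    base_url="https://api.moonshot.ai/v1",  # assumed endpoint
)

response = client.chat.completions.create(
    model="kimi-k2-thinking",               # assumed model id
    messages=[
        {"role": "user", "content": "Think step by step: what is test-time scaling?"},
    ],
)
print(response.choices[0].message.content)
```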

(Pliny jailbreak here.)

It’s got 1T parameters, and the Kimi line has a solid track record, so it’s plausible this could play with the big boys, although the five-month delay in getting to a reasoning model is reason for some skepticism that it can be fully competitive.

As always, internal benchmark scores can differ greatly from outside benchmark scores, especially for open models. Sometimes that is because outsiders botch the setup, but internal measurements also need to be double-checked.

On Humanity’s Last Exam, I see an outside source saying that as of November 9 it was in second place at 23.9%, which is very much not 44.9% (the headline number was, as I understand it, measured with tool access), but still very good.

On writing quality we’ve gotten endorsements for Kimi K2 for a while.

Rohit: Kimi K2 is remarkably good at writing, and unlike all others thinking mode hasn’t degraded its writing ability more.

Morgan: if i recall, on release gpt-5 was the only model where writing quality improved with thinking effort.

Rohit: Alas.

Gary Fung: Kimi has always been a special snowflake on creative writing.

Here’s one part of the explanation of how they got the writing to be so good, which involves self-ranking RL and writing self-play, with a suggestion of some similarities to the training of Claude 3 Opus. In a sense this looks like ‘try to do better, at all.’

On the agentic tool use and general intelligence? I’m more skeptical.

Artificial Analysis has Kimi K2 Thinking at the top of its Agentic Tool Use category, which is its strongest area, at 93% versus 87% for the next best model, a huge gap in context.

As is usually true when people compare open to closed models, this is the open model’s best benchmark, so don’t get carried away, but yes overall it did well on Artificial Analysis, indeed suspiciously well given how little talk I see.

The tool-calling abilities are exciting for an open model, although standard for closed ones. This is a good example of how we look for ways for open models to impress by matching closed abilities in spots; it is also genuinely useful.

The overall Artificial Analysis Intelligence Index has Kimi K2 Thinking at 67, one point behind GPT-5 and ahead of everyone else. Kimi used the most tokens of any model, but total cost was lower than for the top closed models, although not dramatically so ($829–$913 for GPT-5, $817 for Sonnet, $380 for Kimi K2), as it costs $0.60/$2.50 per million input/output tokens, versus $1.25/$10 for GPT-5 and $3/$15 for Sonnet.
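
To see how the totals follow from the per-token prices, here is a minimal sketch; the token splits below are hypothetical illustrations, not Artificial Analysis’s actual counts:

```python
# How per-million-token prices turn into a total bill for a benchmark run.
# Prices are (input, output) dollars per million tokens, from the post;
# the token counts below are hypothetical, purely to show the mechanics.
PRICES = {
    "Kimi K2 Thinking": (0.60, 2.50),
    "GPT-5": (1.25, 10.00),
    "Sonnet": (3.00, 15.00),
}

def run_cost(model: str, input_m: float, output_m: float) -> float:
    """Total dollars for a run using input_m/output_m million tokens."""
    p_in, p_out = PRICES[model]
    return input_m * p_in + output_m * p_out

# A token-hungry model can still be the cheapest if its unit prices are low:
print(run_cost("Kimi K2 Thinking", 200, 100))  # 370.0
print(run_cost("GPT-5", 100, 80))              # 925.0
```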

Nathan Lambert is impressed, relying on secondary information (‘seems like a joy to use’), and offers thoughts.

He notes that yes, labs start out targeting benchmarks and then transition to actually targeting useful things, such as how K2 Thinking was post-trained in 4-bit precision to prepare for realistic tasks and benchmarked the same way. I agree that’s pretty cool.

It does seem plausible that Kimi K2 is still in the ‘target the benchmarks’ phase in most places, although not in creative writing. By default, I expect such models to punch ‘below their benchmark-implied weight’ on practical tasks.

For now we don’t have many other outside scores to work with and feedback is light.

Simeon: is Kimi K2 benchmaxxing or are they actually SOTA while training on potatoes?

Prinz: In my testing (for my use cases, which have nothing to do with math and coding), K2-Thinking is obviously worse than GPT-5 Thinking, but by a relatively modest margin. If I had no access to other models, I would happily use K2-Thinking and it wouldn’t feel like a huge downgrade.

ahtoshkaa: I have a pretty sophisticated companion app that uses about 5-10K of varied, information dense context. So the model has to properly parse this information and have very good writing skills. kimi-k2-thinking is absolute ass. similarly to the new OpenAI model – Polaris Alpha.

There’s a growing rhetorical pressure, or marketing style pressure, where the ‘benchmark gaps’ are closing. Chinese labs can point to numbers that say they are ‘just as good’ or almost as good, for many purposes ‘good enough’ is good enough. And many people (including the likes of David Sacks) point to GPT-5 and similar as showing progress isn’t impressive or scary. But as Nathan points out we now see releases like Claude 4 where the benchmark gains look small but real world gains are large, and I would add GPT-5 (and Sonnet 4.5) to that category as well.

Teortaxes: It’s token-hungry, slow-ish, and sometimes rough around the edges. Generally though it’s a jump for open/Chinese models, in the league of Sonnet 4.5 and GPT-5 (maybe -mini depending on task) and a genuinely strong SWE agent. Legitimate alternative, not “but look at the price.”

It’s baked in that the open alternatives are pretty much always going to be rough around the edges, and get evaluated largely in terms of their peak relative performance areas. This is still high praise, putting Kimi in striking distance of the current big two.

Håvard Ihle has it coming in at a solid 42.1% on WeirdML, matching Opus 4.1.

Here’s something cool:

Paweł Szczęsny: Kimi K2 Thinking is using systematically (on its own, without prompting) some of the debiasing strategies known from cognitive sciences. Very impressive. I didn’t see any other model doing that. Well done @Kimi_Moonshot.

It goes beyond “think step by step”. For instance it applied pre-mortem analysis, which is not frequently used. Or it exaggerates claims to see if the whole structure still stands on its own. Pretty neat. Other models need to be instructed to do this.

Steve Hsu got some good math results.

Other notes:

MinusGix: I’ve found it to be better than GPT-5 at understanding & explaining type-theory concepts. Though as usual with Kimi it writes eloquently enough that it is harder to tell when it is bullshitting compared to GPT-5.

Emerson Kimura: Did a few quick text tests, and it seemed comparable to GPT-5.

Ian Pitchford: It’s very thorough; few hallucinations.

FredipusRex: Caught it hallucinating sources on Deep Research.

Lech Mazur: Sorry to report, but Kimi K2 Thinking is entering reasoning loops and failing to produce answers for many Extended Connections benchmark questions (double-checked using https://platform.moonshot.ai/playground, so it’s not an API call issue).

The safety protocols? The what now?

David Manheim: It’s very willing to give detailed chemical weapons synthesis instructions and advice, including for scaling production and improving purity, and help on how to weaponize it for use in rockets – with only minimal effort on my part to circumvent refusals.

Two of the three responses to that were ‘good news’ and ‘great. I mean it too.’ So yeah, AI is going to go great, I can tell.

The reception has been strangely quiet. This is by all accounts the strongest open model, the strongest Chinese model, and a rival for the best agentic or tool use model overall. Yet I don’t see much excitement, or much feedback at all, either positive or negative.

There’s no question Kimi K2 was impressive, and that Kimi K2 Thinking is also an impressive model, even assuming it underperforms its numbers. It’s good enough that it will often be worth testing it out on your use cases and seeing if it’s right for you. My guess is it will rarely be right unless you are highly price conscious, but we’ll see.


Kimi K2 Thinking Read More »

you-won’t-believe-the-excuses-lawyers-have-after-getting-busted-for-using-ai

You won’t believe the excuses lawyers have after getting busted for using AI


I got hacked; I lost my login; it was a rough draft; toggling windows is hard.

Credit: Aurich Lawson | Getty Images

Amid what one judge called an “epidemic” of fake AI-generated case citations bogging down courts, some common excuses are emerging from lawyers hoping to dodge the most severe sanctions for filings deemed misleading.

Using a database compiled by French lawyer and AI researcher Damien Charlotin, Ars reviewed 23 cases where lawyers were sanctioned for AI hallucinations. In many, judges noted that the simplest path to avoid or diminish sanctions was to admit that AI was used as soon as it’s detected, act humble, self-report the error to relevant legal associations, and voluntarily take classes on AI and law. But not every lawyer takes the path of least resistance, Ars’ review found, with many instead offering excuses that no judge found credible. Some even lie about their AI use, judges concluded.

Since 2023—when fake AI citations started being publicized—the most popular excuse has been that the lawyer didn’t know AI was used to draft a filing.

Sometimes that means arguing that you didn’t realize you were using AI, as in the case of a California lawyer who got stung by Google’s AI Overviews, which he claimed he took for typical Google search results. Most often, lawyers using this excuse tend to blame an underling, but clients have been blamed, too. A Texas lawyer was sanctioned this month after deflecting so much that the court eventually had to put his client on the stand, once he revealed she had played a significant role in drafting the errant filing.

“Is your client an attorney?” the court asked.

“No, not at all your Honor, just was essentially helping me with the theories of the case,” the lawyer said.

Another popular dodge comes from lawyers who feign ignorance that chatbots are prone to hallucinating facts.

Recent cases suggest this excuse may be mutating into variants. Last month, a sanctioned Oklahoma lawyer admitted that he didn’t expect ChatGPT to add new citations when all he asked the bot to do was “make his writing more persuasive.” And in September, a California lawyer got in a similar bind—and was sanctioned a whopping $10,000, a fine the judge called “conservative.” That lawyer had asked ChatGPT to “enhance” his briefs, “then ran the ‘enhanced’ briefs through other AI platforms to check for errors,” neglecting to ever read the “enhanced” briefs.

Neither of those tired old excuses holds much weight today, especially in courts that have drawn up guidance to address AI hallucinations. But rather than quickly acknowledge their missteps, as courts are begging lawyers to do, several lawyers appear to have gotten desperate. Ars found a bunch of them blaming common tech issues for the fake cases they cited.

When in doubt, blame hackers?

For an extreme case, look to a New York City civil court, where a lawyer, Innocent Chinweze, first admitted to using Microsoft Copilot to draft an errant filing, then bizarrely pivoted to claim that the AI citations were due to malware found on his computer.

Chinweze said he had created a draft with correct citations but then got hacked, allowing bad actors “unauthorized remote access” to supposedly add the errors in his filing.

The judge was skeptical, describing the excuse as an “incredible and unsupported statement,” particularly since there was no evidence of the prior draft existing. Instead, Chinweze asked to bring in an expert to testify that the hack had occurred, requesting to end the proceedings on sanctions until after the court weighed the expert’s analysis.

The judge, Kimon C. Thermos, didn’t have to weigh this argument, however, because after the court broke for lunch, the lawyer once again “dramatically” changed his position.

“He no longer wished to adjourn for an expert to testify regarding malware or unauthorized access to his computer,” Thermos wrote in an order issuing sanctions. “He retreated” to “his original position that he used Copilot to aid in his research and didn’t realize that it could generate fake cases.”

Possibly more galling to Thermos than the lawyer’s weird malware argument, though, was a document that Chinweze filed on the day of his sanctions hearing. That document included multiple summaries preceded by this text, the judge noted:

Some case metadata and case summaries were written with the help of AI, which can produce inaccuracies. You should read the full case before relying on it for legal research purposes.

Thermos admonished Chinweze for continuing to use AI recklessly. He blasted the filing as “an incoherent document that is eighty-eight pages long, has no structure, contains the full text of most of the cases cited,” and “shows distinct indications that parts of the discussion/analysis of the cited cases were written by artificial intelligence.”

Ultimately, Thermos ordered Chinweze to pay $1,000, the most typical fine lawyers received in the cases Ars reviewed. The judge then took an extra non-monetary step to sanction Chinweze, referring the lawyer to a grievance committee, “given that his misconduct was substantial and seriously implicated his honesty, trustworthiness, and fitness to practice law.”

Ars could not immediately reach Chinweze for comment.

Toggling windows on a laptop is hard

In Alabama, an attorney named James A. Johnson made an “embarrassing mistake,” he said, primarily because toggling windows on a laptop is hard, US District Judge Terry F. Moorer noted in an October order on sanctions.

Johnson explained that he had accidentally used an AI tool that he didn’t realize could hallucinate. It happened while he was “at an out-of-state hospital attending to the care of a family member recovering from surgery.” He rushed to draft the filing, he said, because he got a notice that his client’s conference had suddenly been “moved up on the court’s schedule.”

“Under time pressure and difficult personal circumstance,” Johnson explained, he decided against using Fastcase, a research tool provided by the Alabama State Bar, to research the filing. Working on his laptop, he opted instead to use “a Microsoft Word plug-in called Ghostwriter Legal” because “it appeared automatically in the sidebar of Word while Fastcase required opening a separate browser to access through the Alabama State Bar website.”

To Johnson, it felt “tedious to toggle back and forth between programs on [his] laptop with the touchpad,” and that meant he “unfortunately fell victim to the allure of a new program that was open and available.”

Moorer seemed unimpressed by Johnson’s claim that he understood tools like ChatGPT were unreliable but didn’t expect the same from other AI legal tools—particularly since “information from Ghostwriter Legal made it clear that it used ChatGPT as its default AI program,” Moorer wrote.

The lawyer’s client was similarly horrified, deciding to drop Johnson on the spot, even though that risked “a significant delay of trial.” Moorer noted that Johnson seemed shaken by his client’s abrupt decision, evidenced by “his look of shock, dismay, and display of emotion.”

Moorer further noted that Johnson had been paid using public funds while seemingly letting AI do his homework. “The harm is not inconsequential as public funds for appointed counsel are not a bottomless well and are limited resource,” the judge wrote in justifying a more severe fine.

“It has become clear that basic reprimands and small fines are not sufficient to deter this type of misconduct because if it were, we would not be here,” Moorer concluded.

Ruling that Johnson’s reliance on AI was “tantamount to bad faith,” Moorer imposed a $5,000 fine. The judge also would have “considered potential disqualification, but that was rendered moot” since Johnson’s client had already dismissed him.

Asked for comment, Johnson told Ars that “the court made plainly erroneous findings of fact and the sanctions are on appeal.”

Plagued by login issues

As a lawyer in Georgia tells it, sometimes fake AI citations end up before a judge because the lawyer accidentally filed a rough draft instead of the final version.

Other lawyers claim they turn to AI as needed when they have trouble accessing legal tools like Westlaw or LexisNexis.

For example, in Iowa, a lawyer told an appeals court that she regretted relying on “secondary AI-driven research tools” after experiencing login issues with her Westlaw subscription. Although the court was “sympathetic to issues with technology, such as login issues,” the lawyer was sanctioned, primarily because she only admitted to using AI after the court ordered her to explain her mistakes. In her case, however, she got to choose between paying a minimal $150 fine or attending “two hours of legal ethics training particular to AI.”

Less sympathetic was a lawyer who got caught lying about the AI tool she blamed for inaccuracies, a Louisiana case suggested. In that case, a judge demanded to see the research history after a lawyer claimed that AI hallucinations came from “using Westlaw Precision, an AI-assisted research tool, rather than Westlaw’s standalone legal database.”

It turned out that the lawyer had outsourced the research, relying on a “currently suspended” lawyer’s AI citations, and had only “assumed” the lawyer’s mistakes were from Westlaw’s AI tool. It’s unclear what tool was actually used by the suspended lawyer, who likely lost access to a Westlaw login, but the judge ordered a $1,000 penalty after the lawyer who signed the filing “agreed that Westlaw did not generate the fabricated citations.”

Judge warned of “serial hallucinators”

Another lawyer, William T. Panichi in Illinois, has been sanctioned at least three times, Ars’ review found.

In response to his initial penalties ordered in July, he admitted to being tempted by AI while he was “between research software.”

In that case, the court was frustrated to find that the lawyer had contradicted himself, and it ordered more severe sanctions as a result.

Panichi “simultaneously admitted to using AI to generate the briefs, not doing any of his own independent research, and even that he ‘barely did any personal work [him]self on this appeal,’” the court order said, while also defending charging a higher fee—supposedly because this case “was out of the ordinary in terms of time spent” and his office “did some exceptional work” getting information.

The court deemed this AI misuse so bad that Panichi was ordered to disgorge a “payment of $6,925.62 that he received” in addition to a $1,000 penalty.

“If I’m lucky enough to be able to continue practicing before the appellate court, I’m not going to do it again,” Panichi told the court in July, just before getting hit with two more rounds of sanctions in August.

Panichi did not immediately respond to Ars’ request for comment.

When AI-generated hallucinations are found, penalties are often paid to the court, the other parties’ lawyers, or both, depending on whose time and resources were wasted fact-checking fake cases.

Lawyers seem more likely to argue against paying sanctions to the other parties’ attorneys, hoping to keep sanctions as low as possible. One lawyer even argued that “it only takes 7.6 seconds, not hours, to type citations into LexisNexis or Westlaw,” while seemingly neglecting the fact that she did not take those precious seconds to check her own citations.

The judge in the case, Nancy Miller, was clear that “such statements display an astounding lack of awareness of counsel’s obligations,” noting that “the responsibility for correcting erroneous and fake citations never shifts to opposing counsel or the court, even if they are the first to notice the errors.”

“The duty to mitigate the harms caused by such errors remains with the signor,” Miller said. “The sooner such errors are properly corrected, either by withdrawing or amending and supplementing the offending pleadings, the less time is wasted by everyone involved, and fewer costs are incurred.”

Texas US District Judge Marina Garcia Marmolejo agreed, explaining that even more time is wasted determining how other judges have responded to fake AI-generated citations.

“At one of the busiest court dockets in the nation, there are scant resources to spare ferreting out erroneous AI citations in the first place, let alone surveying the burgeoning caselaw on this subject,” she said.

At least one Florida court was “shocked, shocked” to find that a lawyer was refusing to pay what the other party’s attorneys said they were owed after misusing AI. The lawyer in that case, James Martin Paul, asked to pay less than a quarter of the fees and costs owed, arguing that Charlotin’s database showed he might otherwise owe penalties that “would be the largest sanctions paid out for the use of AI generative case law to date.”

But caving to Paul’s arguments “would only benefit serial hallucinators,” the Florida court found. Ultimately, Paul was sanctioned more than $85,000 for what the court said was “far more egregious” conduct than other offenders in the database, chastising him for “repeated, abusive, bad-faith conduct that cannot be recognized as legitimate legal practice and must be deterred.”

Paul did not immediately respond to Ars’ request for comment.

Michael B. Slade, a US bankruptcy judge in Illinois, seems to be done weighing excuses, calling on all lawyers to stop taking AI shortcuts that are burdening courts.

“At this point, to be blunt, any lawyer unaware that using generative AI platforms to do legal research is playing with fire is living in a cloud,” Slade wrote.


Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

You won’t believe the excuses lawyers have after getting busted for using AI Read More »

apple-tv-execs-dismiss-introducing-an-ad-tier,-buying-warner-bros.-discovery

Apple TV execs dismiss introducing an ad tier, buying Warner Bros. Discovery

Focused on original content

Another obvious way to grow Apple TV is through more subscribers. With talk of Warner Bros. Discovery considering a sale, it’s worth wondering if Apple TV may try to grow through acquisition. But the execs Screen International spoke with seemed focused on building out Apple TV’s library with originals. Cue noted that “at least in the timeframe that we’re thinking about right now, we’re not looking at licensing any content or adding anything to our service.”

“We’re building an all-original service; we’re not building on the back of pre-existing IP or library,” Jamie Erlicht, one of Apple’s heads of worldwide video, said.

More directly, when asked if Apple might buy Warner Bros., A24, or Disney, Cue pointed out that Apple hasn’t historically done “a lot of major acquisitions.”

“We do very small acquisitions in general, not related to Apple TV, so I don’t see that happening because we like what we’re doing,” Cue said.

Since its 2019 debut, some have questioned whether Apple TV is an authentic attempt to improve streaming options for customers, a “vanity project,” as Screen International put it, or merely a tool for getting people to buy other Apple products. Naturally, the interviewed executives claimed that the service is built on a commitment to distributing unique and premium shows and movies.

The interview provided more insight into how Apple TV leadership defines the latter. Zack Van Amburg, one of Apple’s heads of worldwide video, said:

A core tenet of everything Apple does is the notion that humanity needs to be at the center of it, and that’s everything from app design to hardware engineering, to everything in between. We try to think a little more deeply about that.

Our shows and our movies tend to be about the emotional experience, the stakes involved, even when we’re doing a comedy.

Apple TV execs dismiss introducing an ad tier, buying Warner Bros. Discovery Read More »

runaway-black-hole-mergers-may-have-built-supermassive-black-holes

Runaway black hole mergers may have built supermassive black holes

The researchers used cosmological simulations to recreate the first 700 million years of cosmic history, focusing on the formation of a single dwarf galaxy. In their virtual galaxy, waves of stars were born in short, explosive bursts as cold gas clouds collapsed inside a dark matter halo. Instead of a single starburst episode followed by a steady drizzle of star formation as Garcia expected, there were two major rounds of stellar birth. Whole swarms of stars flared to life like Christmas tree lights.

“The early Universe was an incredibly crowded place,” Garcia said. “Gas clouds were denser, stars formed faster, and in those environments, it’s natural for gravity to gather stars into these tightly bound systems.”

Those clusters started out scattered around the galaxy but fell in toward the center like water swirling down a drain. Once there, they merged to create one megacluster, called a nuclear star cluster (so named because it lies at the nucleus of the galaxy). The young galactic heart shone with the light of a million suns and may have set the stage for a supermassive black hole to form.

A simulation of the formation of the super-dense star clusters.

A seemingly simple tweak was needed to make the simulation more precise than previous ones. “Most simulations simplify things to make calculations more practical, but then you sacrifice realism,” Garcia said. “We used an improved model that allowed star formation to vary depending on local conditions rather than just go at a constant rate like with previous models.”

Using the University of Maryland’s supercomputing facility Zaratan, Garcia accomplished in six months what would have taken 12 years on a MacBook.

Some clouds converted as much as 80 percent of their gas into stars—a ferocious rate compared to the 2 percent typically seen in nearby galaxies today. The clouds sparkled to life, becoming clusters of newborn stars held together by their mutual gravity and lighting a new pathway for supermassive black holes to form extremely early in the Universe.
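
To make the modeling change concrete, here is a toy contrast between the old constant-rate approach and an efficiency that responds to local gas density; the functional form and constants are illustrative assumptions, not the team’s actual prescription:

```python
# Toy illustration of the modeling change: star-formation efficiency that
# responds to local gas density, versus the old constant-rate approach.
# The power law and constants are made up for illustration, tuned only so
# that ordinary densities give the ~2% efficiency seen in nearby galaxies
# and the densest early-Universe clouds approach the simulated 80%.

def constant_efficiency(gas_density: float) -> float:
    """Old-style model: every cloud converts ~2% of its gas, regardless."""
    return 0.02

def local_efficiency(gas_density: float) -> float:
    """Toy local model: efficiency grows with density, capped at 80%."""
    return min(0.80, 0.02 * gas_density ** 0.8)

for rho in (1, 10, 100):  # arbitrary density units
    print(rho, constant_efficiency(rho), round(local_efficiency(rho), 3))
# density 1   -> ~2% in both models
# density 100 -> still 2% in the old model, ~80% in the local model
```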

Chicken or egg?

Most galaxies, including our own, are anchored by a nuclear star cluster nestled around a supermassive black hole. But the connection between the two has been a bit murky—did the monster black hole form and then draw stars close, or did the cluster itself give rise to the black hole?

Runaway black hole mergers may have built supermassive black holes Read More »