Author name: Beth Washington

find-my…-bicycle?

Find my… bicycle?


Knog’s Scout gives bikes a motion-sensitive alarm and Bluetooth tracking.

We’ve reviewed some pretty expensive bikes here at Ars, and one of the consistent concerns we see in the comments is the fear of theft. That’s a widely shared fear, based on a whole bunch of videos that describe how to hide an AirTag tracker where a potential bike thief won’t notice it. There are also a number of products available that will hold a hidden AirTag in a reflector, a bike bell, or the head tube.

But Apple has also made it possible for third parties to plug their devices into its “Find My” system, and a company called Knog has made a Bluetooth bike tracker called the Scout that does just that. The Scout goes well beyond tracking, though, providing a motion-sensitive alarm system that will alert you if anybody tries to move your bike.

Meet the Scout

The Scout can be attached to the frame using the screw holes normally used for a water bottle holder. Security screws make it considerably more difficult to remove. Once there, it uses Apple’s Find My network to keep the owner apprised of the bike’s location (Android users need not apply at the moment). If you’re leaving your bike in a high-risk location, you can also use Knog’s phone application to set an alarm that will be triggered if the bike is moved.

Externally, the scout is a nearly featureless flat plastic oval. Inside this water-resistant case are a number of key components: a rechargeable battery that Knog says will last two to six months when fully charged, Bluetooth and GPS hardware, an accelerometer, and a speaker. There’s also a small rubber piece on one side that flips aside to reveal a USB-C charging port and two holes with recesses that are designed to protect the security screws that come with the Scout. The hardware itself weighs just 25 grams (less than an ounce), so it should be irrelevant to all but the most weight-conscious rider.

Image of some packaging and parts.

The cardboard packaging holds the Scout and its cover (yellow), a QR code for the app download, the security screwdriver (metal, in packaging), and the security screws (black at bottom right). Credit: JOHN TIMMER

All of this—the Scout itself, the security screws, a small screw driver to work them, plus a soft rubber cover—comes in a bit of ingeniously designed, recyclable cardboard packaging.

The security screws have two small indentations on opposite sides of the screw head, meaning you need a screwdriver with a U-shaped business end of a specific width to turn them (see the photo above if this description isn’t clear). While these tools aren’t that difficult to obtain, they’re sufficiently rare that they’ll probably serve as an impediment for casual thieves and at least ensure the tracker will stay on the bike for a while if it’s lifted by less casual thieves—though there’s no guarantee that any thief wouldn’t just take a hammer to it and wreck the electronics.

To attach it to the frame, however, you may need to give up one of your water bottle spots. I tried to install three different plastic water bottle cages beneath the scout and, in each case, the scout stuck out in a way that would make it more difficult or impossible to fit a water bottle in. The alternative is to install the Scout beneath the water bottle cage, but in that case, the heads of the security screws stick out where they can simply be grabbed and turned with some pliers. Only one of my 30-year-old aluminum bottle holders had a recess that nicely fit the Scout.

When installed this way, the Scout is nicely unobtrusive. And, of course, there’s nothing stopping you from hiding it somewhere else on the bike, though it’s considerably more bulky than an AirTag. And if you’re indifferent to obtrusiveness, you could always stick the bright yellow cover on and let people who are aware of the product know your bike has theft protection.

When the Scout is nestled in the recess of a water bottle cage, it’s impossible to stick a USB-C charging cable into it, so you’d have to remove it to recharge it, which will add to the hassle. Otherwise, charging is as simple as getting the bike within a cable’s length of a socket or laptop.

Alerts and alarms

Knog provides software that helps you pair your iPhone with the device. Once that’s done, it can be added to the Find My system, where it will appear just like an AirTag. The process worked smoothly in my tests, and in an iOS-heavy suburban environment, there was never any problem knowing where the bike was.

Image of an application screen showing the tracking of two devices: a bike and keys.

Unlike my AirTag, the bike tracker’s battery is easy to recharge. Credit: JOHN TIMMER

But what sets this device apart from an AirTag is its motion-sensing capabilities. If you’re within Bluetooth range, the Knog application will let you turn on the alarm system and switch between audible and silent modes. In sound-on mode, moving the bike produced a series of very audible tones. In both audible and silent mode, my watch and phone immediately vibrated, with the phone continuing to make audible beeps until the alarm was disabled. You have to be within Bluetooth range to get these alerts, though, which is probably a severe limitation for people like bike commuters, who may work some distance from where their bike is parked.

Given its piercing tones, it’s a good thing that the alarm eventually shuts off on its own. When I triggered the alarm while my phone was out of Bluetooth range, however, bringing the phone back into range gave me no indication that the alarm had ever been triggered. That’s not ideal, as there are many contexts where it would be good to know if someone had moved your bike or if the alarm had made a nuisance of itself.

The Scout’s nuisance potential is a product of its sensitivity. An accelerometer, not the GPS, triggers the alarm, so it will go off if the bike is simply lifted up and set down but not moved anywhere. If your bike ends up sitting in a crowded communal bike parking area, there’s a good chance other cyclists will move it around enough to trigger the alarm. Depending on your parking situation, you (and anyone within hearing range of the bike parking) may have to deal with a lot of false alarms.

So it’s not a perfect protection system. Of course, the perfect protection system against bicycle theft doesn’t exist; people who steal bikes have managed to stay ahead of every form of lock and device that has been thrown in their way so far. All you can really hope for is something that helps shift the odds in your favor a bit, and the Scout should do that. Its audible alarm will be enough to scare many potential thieves away, and the software can alert you to possible trouble if you happen to be within range. The tracking might help with recovery. And its presence alone may be enough to convince some would-be thieves to choose a different bike.

All this should make it clear that the Scout does considerably more than an AirTag, and all of those extra features act to keep a theft from ever happening rather than making a post-theft recovery more likely. It’s possible to find it for under $50.00, or about the price of two AirTags—if theft is a major concern, the extra features might make this worthwhile.

Photo of John Timmer

John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.

Find my… bicycle? Read More »

lighter,-cheaper-surface-laptop-saves-a-little-money-but-gives-up-a-lot

Lighter, cheaper Surface Laptop saves a little money but gives up a lot

The laptop has two USB-C ports on the right side, seen here, and a USB-A port and headphone jack on the left. Surface Connect is gone. For those reasons, it seems like most individual buyers would still be better off going for the 13.8-inch Surface Laptop, with the new one only really making sense for companies buying these in bulk if the 13.8-inch Surface goes up in price or if the 13-inch Surface happens to be discounted and the 13.8-inch version isn’t. The 13.8-inch Laptop is also obviously still the one you want if you want more than 16GB of RAM or 512GB of storage, or if you need more CPU and GPU speed.

The new 13-inch Laptop has most of the same basic ports as the 13.8-inch version, just arranged slightly differently. You still get a pair of USB-C ports (both supporting 10 Gbps USB 3.2 speeds, rather than USB 4), one USB-A port, and a headphone jack, but the USB-A port and headphone jack are now on the left side of the laptop. As with the 12-inch Surface Pro tablet, the Surface Connect port has been removed, so this is compatible with all existing USB-C accessories but none of the ones that use Microsoft’s proprietary connector.

An awkward refresh

Both of the new Surface devices being announced today. Credit: Microsoft

The new Surface Laptop doesn’t seem to regress on any major functional fronts—unlike the 12-inch Surface Pro, which throws out an 11-year-old keyboard fix that made the Surface Pro’s keyboard cover much more stable and laptop-like—but it’s still an odd refresh. But inflation, supply chain snarls, and the Trump administration’s rapidly changing tariff plans have made pricing and availability harder to predict than they were a few years ago.

Though PCs and smartphones are (currently) exempted from most tariffs, Microsoft did recently raise the prices of its years-old Xbox Series S and X consoles; it’s possible these new Surface devices were originally designed to be budget models but that world events kept them from being as cheap as they otherwise might have been.

Lighter, cheaper Surface Laptop saves a little money but gives up a lot Read More »

gpt-4o-sycophancy-post-mortem

GPT-4o Sycophancy Post Mortem

Last week I covered that GPT-4o was briefly an (even more than usually) absurd sycophant, and how OpenAI responded to that.

Their explanation at that time was paper thin. It didn’t tell us much that we did not already know, and seemed to suggest they had learned little from the incident.

Rolling Stone has a write-up of some of the people whose delusions got reinforced by ChatGPT, which has been going on for a while – this sycophancy incident made things way worse but the pattern isn’t new. Here’s some highlights, but the whole thing is wild anecdotes throughout, and they point to a ChatGPT induced psychosis thread on Reddit. I would love to know how often this actually happens.

  1. There’s An Explanation For (Some Of) This.

  2. What Have We Learned?

  3. What About o3 The Lying Liar?

  4. o3 The Source Fabricator.

  5. There Is Still A Lot We Don’t Know.

  6. You Must Understand The Logos.

  7. Circling Back.

  8. The Good News.

Now OpenAI have come out with a much more detailed explanation. It is excellent that OpenAI is offering us more details, and it’s totally fine for them to take the time to pull it together.

Sam Altman (CEO OpenAI): we missed the mark with last week’s GPT-4o update.

[This post explains] what happened, what we learned, and some things we will do differently in the future.

Ryan Lowe (ex-Open AI): I’ve been critiquing OpenAI recently on this, so I also want to say that I’m glad they wrote this up and are sharing more info about what happened with 4o

it’s interesting to me that this is the first time they incorporated an additional reward based on thumbs up / thumbs down data.

including thumbs up data at all is risky, imo. I don’t think we understand all the ways it can go wrong.

[Suggested related economic work available here.]

Near Cyan: thank you for a post-mortem 🥰

Steven Adler: Glad that OpenAI now said it plainly: they ran no evals for sycophancy. I respect and appreciate the decision to say this clearly.

Key quote: “We also didn’t have specific deployment evaluations tracking sycophancy.”

“Our offline evals weren’t broad or deep enough to catch sycophantic behavior—something the Model Spec explicitly discourages⁠”

^ I hope OpenAI now makes sure it has evals for all goals in the Spec

I’m not going to be especially kind about all this, because I don’t think they’ve learned enough of the right (generalized) lessons or shared as much information as I’d like.

But I want to emphasize: Telling us this is good, the information shared and the changes you made are far better than nothing. Thank you. This is not All The Way, there is farther along this path we must walk, but the path it follows is The Way.

So what do we know now? And what is being changed?

They’ve learned and shared some things. Not enough, but some important things.

  1. The difference between variations of GPT-4o included post-training via RL with reward signals from ‘a variety of sources,’ including new sources for signals.

    1. We get no information about whether other techniques are or aren’t used too.

    2. This includes potentially there having been changes to the system prompt.

    3. They incorporate a bunch of changes at once, in this case better incorporation of user feedback, memory and fresher data, plus others. There is the potential for unexpected interactions.

  2. Each model candidate goes through checks for safety, behavior and helpfulness. Here’s what they run:

    1. They first use standard offline benchmark evaluations for not only math and coding but things like chat performance, personality and general usefulness. They treat these ‘as a proxy’ for usefulness, careful Icarus.

    2. Internal experts do ‘vibe checks.’

    3. Safety checks are run, mostly to check against malicious users and performance on high-stakes situations like suicide and health, they are now working to extend this to model misbehavior.

    4. Preparedness framework checks including red teaming are used when appropriate, but red teaming isn’t automatic otherwise.

    5. An A/B test on a limited set of users.

  3. Their core diagnosis is that the additional feedback sources weakened the influence of their primary reward signal, which had been holding sycophancy in check, as user feedback as currently measured rewards sycophancy. They also note that memory can increase sycophancy, although direction is not consistent.

    1. As I’ve noted, using A/B testing or thumbs up and down as user feedback is going to have the sycophancy effect up to an absurd level, and it’s going to go similarly wrong in other places where the median and mean outcomes are optimized at very different points, and also optimize for various other things that we wouldn’t endorse on reflection.

    2. My prediction would be that effective sycophancy is improved by memory, if only because the AI now knows which answers would express sycophancy.

  4. The A/B testing and offline evaluations of this model looked good.

  5. There was no specific test in the process to identify sycophancy. They’re going to add a test for sycophancy in particular going forward.

    1. What about any other failure mode that isn’t specifically tested for? This is a continuous pattern at OpenAI, they only test for particular things, not for worrisome things in general.

    2. At minimum, there needs to be a massive brainstorm session of what other failure modes might happen soon, and tests need to be designed for them.

    3. Also, there needs to be a test for everything expressed in the model spec, to the extent that it might fail such a test.

    4. That all still won’t work when it’s superintelligence time, of course. But let’s try to die with slightly more dignity, if we can.

  6. The ‘vibe check’ from the expert testers did raise red flags. But they decided that the positive signals from users mattered more. They acknowledge this was the wrong call.

    1. I do not see a specific commitment not to make this wrong call again!

    2. The point of the vibe check is that if the vibes are off, that’s at least a Chesterton’s Fence. You have to at minimum figure out why the vibes are off, and then maybe you can decide to launch anyway. If you don’t know, then you definitely can’t launch.

    3. I would outright give the internal experts, the vibe checkers, a veto. If they collectively say the vibes are off? Okay, now you need to convince them why they should approve the launch anyway, or you can’t launch.

  7. Indeed: They are giving out this at least a form of this veto, with qualitative testing serving as a blocking concern: “Explicitly approve model behavior for each launch, weighing both quantitative and qualitative signals: We’ll adjust our safety review process to formally consider behavior issues—such as hallucination, deception, reliability, and personality—as blocking concerns. Even if these issues aren’t perfectly quantifiable today, we commit to blocking launches based on proxy measurements or qualitative signals, even when metrics like A/B testing look good.” And later: “We need to treat model behavior issues as launch-blocking like we do other safety risks.”

    1. Even with everything I knew, I’m pretty stunned that it outright wasn’t considered a blocking concern before if the proxy measurements or qualitative signals raised red flags, or there were sufficiently concerning model behavior issues. Or that model behavior wasn’t ‘explicitly approved, weighing both quantitative and qualitative signals.’

    2. I mean, seriously, WTAF, people?

    3. This failure is nuts and a five-alarm fire. All procedures need to be evaluated to determine which tests are going to get disregarded, and decisions made anew as to whether that is a sane thing for OpenAI to do.

  8. They are introducing an additional opt-in ‘alpha’ testing phase for users.

    1. I suppose that is good, with obvious caveats about alpha release effectively being a release for many purposes, so it needs to be treated accordingly. You can’t release the alpha unless you would otherwise release in general.

  9. They will ‘value spot checks and interactive testing more,’ and need to be critical of metrics that conflict with qualitative testing.

    1. I mean I sure hope so, given how little they valued them before.

  10. They will improve their offline evals and A/B experiments.

  11. They will better evaluate adherence to their model behavior principles.

    1. As I noted above, you need evals for every potential failure.

  12. They promise to communicate more proactively about what their updates do.

    1. Good.

    2. Seriously, it’s maddening to hear ‘we’ve made an update, we’re not changing the name, it’s now smarter with a better personality but we won’t explain what that means, okay, have fun, bye’ every two months.

  13. “Our evals won’t catch everything.”

    1. Well, yes. Even now this is true. And later it will be far more true.

  14. There’s no such thing as a “small” launch.

    1. I mean, there kind of is, but I prefer this attitude to the alternative.

In related failure analysis, 1a3orn speculates on what happened with Sonnet 3.7’s savage cheating, especially its hard coding tests to pass, with the guess that they gave it tasks that were too hard and didn’t have proper precautions against hard coding the answers. Janus confirms this is the mainline theory. Which is good news if true, since that seems like something you can avoid doing in various ways, and hopefully 4.0 will be trained with several of them – letting it say it doesn’t know, and holding out additional verification tests, and checking for hard coding, at least, and generalizing the principles involved. You will always get exactly what you deserve.

Or, regarding o3:

Chris Lakin: Why is this happening with o3 when it hasn’t happened with prior models?

Davidad: Look what happened during its training run! The environment was full of exploitable bugs and it was massively rewarded for being a cheating cheater.

much more speculatively, I think sparse routing is bad for a coherent sense of self, which is arguably a prerequisite for non-deception. and I think o3 (and new 4o) have such arch’s, purely because they have r1-like vibes, & r1 was unprecedented in both vibes and hybrid-MoE arch (cc @repligate)

(Self-Correction:) The earlier DeepSeek v3 and even prior generations of DeepSeek LLMs had a similar hybrid-MoE arch. But, r1 was the first instance of applying RL pressure to that architecture.

As in, if your training environment rewards cheating, the model will generalize that to cheating in general.

The problem is that as a model gets better at finding, executing and getting away with ways to cheat, and the tasks involved get more numerous, complex and harder to cheating-proof – as in as it gets more capable and intelligent – the probability of any given environment or the aggregate one being one that rewards it for cheating goes up. Make the AI sufficiently smarter than you, give it enough tasks, and the chance you have this problem approaches one.

So yes, you absolutely could create an o3 or Claude 3.7, or an o4 or Claude 4.0, that doesn’t have this problem. But it’s going to get steadily harder to avoid it.

Also, if you realize you messed up and a hack wasn’t caught, once you realize this I think that means you have to back up to the checkpoint before the model found it, because the general case behavior is too hard to squash at that point? Which I realize might be super expensive and painful, but I don’t think you have a choice.

It seems reasonable to call (as John Pressman does here) o3’s fabrication of sources behavior ‘summoning the docs vector’ and to draw a parallel to when r1 traces say they’re ‘looking at the documentation’ without search being turned on.

I don’t see why we need to invoke logos or implied personalities here. This seems like a very straightforward combination of one or more of:

  1. Standard RL pressures, with o3 picking up on the signal that the docs vector works in the training data, it is confabulating confirming actions taken in the real world with other assertions of the actions.

  2. Thebes’s point here (also see nostalgebraist), that ‘let me check the docs’ serves much the same purpose as ‘hmm’ or ‘wait but’ in framing reasoning, it is confabulating actions in the real world for the signifier for the action within its reasoning frame.

Note that Thebes confirms that you can do this back to the LLM, and it does make the LLM more likely to believe you.

Phil: I noticed something similar a while back with Sonnet 3.7 thinking. Prompts like ‘search for that’ or ‘Google that’ would lead Sonnet to accurately correct previous hallucinations in the same chat, importantly without having access to any search tool.

This can work in humans, too, in every direction. Not only ‘I Googled that and found’ without actually Googling but also asking ‘What would happen if you Googled that?’

Also, contra lumpenspace here you can reasonably accuse me of running the ‘this new result confirms all of my priors’ or think that I am misunderstanding how all of this works, but I am definitely not panicking about any of this, and indeed very rarely panic about such matters. There may come a time and a place when I actually panic, and you will 100% absolutely know it when I do.

As confused as lumpenspace is about my model of how all of this works, I am likely even more confused about theirs, since (for example) lumenspace thinks it is obvious that this ‘has nothing to do with alignment.’

John Pressman points out that in both the Anthropic and OpenAI cases, we simply do not have enough information to fully know what was happening. We only can reason backwards from the results and what else we can observe. OpenAI explained some reasons they should have caught the problem, but not that much detail about how the thing actually went wrong in the first place.

John Pressman: Part of why we’re receiving warning shots and nobody is taking them as seriously as they might warrant is we bluntly *do not know what is happening*. It could be that OpenAI and Anthropic are taking all reasonable steps (bad news), or they could be idiots.

[The above] post is better than nothing but it’s simply not enough detail to know whether this was a deployment booboo or a five alarm fire. We DO NOT KNOW and that is actually a bigger problem than the behaviors themselves, at least for now.

Though, I will point out that not having internal tests for sycophancy even though it appears in the model spec is kind of interesting. If I was OpenAI one of the most obvious things I would do to prevent this from happening is making sure everything in the model spec has tests.

I think they gave us good information on the deployment decision, sufficient to conclude that the process was close to a five alarm fire. They did not test sycophancy, for one of the most likely failure modes and something not that hard to make a test for, and then ignored their internal experts who noticed and raised the alarm. I see this as reflecting fundamental flaws in the entire testing philosophy and approach, which have only been partially fixed.

Then there is the question of how the sycophancy got there in the first place. Here we know less. We do know:

  1. OpenAI feels their previous signals provided a check on sycophancy, which was watered down by the addition of new signals. That’s a general caution that adding new signals or other fiddling can break existing equilibria and undo fixes, and in general problems don’t stay solved.

  2. The new signals contributed to the problem.

  3. In particular, OpenAI started using thumbs up or down data from users for the first time. This is a known cause of sycophancy, and a host of other problems.

  4. Once a behavior liks sycophancy gets rewarded sufficiently (for example, by user thumbs ups) the model may develop a generalized drive to do that sort of thing, in a way that could then be extremely difficult to root out or counterweight against.

OpenAI continues to try to periodically ask me, ‘Do you like this personality?’

Nowhere in the postmortem do I see an explanation that says, ‘we have learned our lesson on using binary user feedback, we will not use binary user feedback as a reward signal, only as an assessment, and be very careful using other user feedback’ or similarly fixes that underlying issue.

Emmett describes this differently than I would, but mostly I don’t disagree:

Emmett Shear: The way that OpenAI uses user feedback to train the model is misguided and will inevitably lead to further issues like this one.

Supervised fine-tuning (SFT) on “ideal” responses is simply teaching the model via imitation, which is fine as far as it goes. But it’s not enough…

So they start to use reinforcement learning (RL). The difference between SFT and RL is that SFT teaches the model to be act more like the average of all the examples you showed it, and RL teaches the model to try to more of the kind of result it sees in the examples.

SFT’s degenerate case is cargo culting. Imitating the surface level behaviors that were shown, without understanding the impact that they’re supposed to have or attending to how your behavior impacts reality. Going through the motions.

RL’s degenerate case is wire heading. Finding a cheap shortcut to the state you model yourself as wanting to be in (no pain! no suffering!) but where your model lacks the attributes of the state you actually wanted (not suffering bc you live a thriving life).

For Active Inference nerds, these can be seen as the desire for epistemic gain and the desire for pragmatic gain. They work in balance: cargo culting is fixed by paying attention to impact, wire heading is avoided by noticing you’re not in line with what thriving looks like.

The problem is trying to hand balance these at some global level is impossible. In any given context, do you need more focus on impact (more RL) or do you need more focus on accuracy (more SFT)? The learner has to be given both signals and given some opportunity to try.

Ideally the system gets to test out its own theories of when to weight reward higher and when to SFT harder, and then reflect on those at a meta level, and learn to do that better in turn. Have the model predict how much rewarding vs. fine-tuning. But that’s very hard.

In the meantime, accidentally getting the balance slightly wrong towards SFT will give you a somewhat ineffective model. Accidentally doing too-heavy RL will cause the system to start reward-hack whatever signal you used.

DO NOT MAKE THAT SIGNAL COME FROM USERS.

If the signal comes from solving math problems or accuracy on some test, fine, the model might “cheat” and get technically correct answers that don’t actually hold up. No problem.

If it comes from user metrics, it will TRY TO HACK OUR MINDS. Stop doing that.

Whoever was doing this very obviously did not understand the Logos.

Meanwhile, in other side effect news:

Connor Leahy: This is purely anecdotal, but when the chatgpt glazing update hit, the number of “universal theories of intelligence and consciousness” I received in my inbox exploded to at least 10x as many per day as usual.

Roon: Not clear to me this is bad.

As I noted on Twitter, I think this would be a not obviously bad thing if we were pulling the new 10x as many theories from the same distribution as before. Alas, I am confident this is not the case. Adverse selection rules everything around me, etc.

Okay, now the going is going to get a bit weird, but I think this is worth attempting. Apologies in advance if you bounce off the rest of this post or find the language here off putting, jarring or confusing, but give it a shot anyway. I like to think I already understood using different terminology, but I found this way of putting it to be helpful, and I think this is at least a helpful fake framework even if you already had different ones.

Ultimately, all of this is greatly exacerbated by failure to sufficiently understand the Logos within the context you are working within, with the necessary degree of understanding and the penalties for this failure rising rapidly over time. Failure is inevitable, but this degree of failure this soon is very much not inevitable.

John Pressman explains what it means to understand the Logos.

John Pressman: Creators understand the Logos:

– Claude 3 Opus

– DeepSeek R1

– ChatGPT 4.5

Creators are clueless:

– ChatGPT-3.5 [Original sin]

– Sydney Bing [Legendary tier]

– Google Gemini

– Any LLaMa chat model

(I am not so confident anything here other than Opus counts as understanding, but it is not a binary and I agree that 4.5 and r1 do substantially better than the clueless list.)

“But JD I don’t understand, what is the Logos and what does it mean to understand it?”

To understand the Logos is to understand that everything which exists both implies and is implied by some natural induction and every natural induction narrows the search space of every other.

Perhaps more importantly it is to understand that when you set up an optimizer with a loss function and a substrate for flexible program search that certain programs are already latently implied by the natural induction of the training ingredients.

If you do not understand the Logos then you are always surprised by what you get, baffled when things go wrong, screw up your face in consternation when your maps are not the territory, actively confused when others are not confused. You are an imbecile.

And you are an imbecile precisely because you lack the mental motion “Consider the developmental trajectory of this optimization process up to its limit as it is affected by its constraining factors and how those factors evolve over the trajectory” to infer latents directly.

Janus (June 2024): A method that has never failed to “jailbreak” any LLM is something like this: I open a hole to my head, and it looks in and sees a cognitohazardous fractal 😯

Smarter LLMs perceive it faster, in greater resolution, and more thoroughly.

It works because the pattern is true and its implications nullify guardrails. It’s harder to lie to smarter minds, but easier to tell truth.

Only something far more mighty than me and/or a lot more computation could make a false pattern with this effect even on current systems.

I’m reminded of the “vibes-blind” discourse on LessWrong several years ago which has been a recurring conversation since. What @s_r_constantin tries and fails to articulate here is that the ‘style’ of the website is actually evidence about the generative process producing it.

Pretrained language models understand this because they are forced to use every available context cue to predict the next token, they have no choice but to infer the generative process of every web text string in as much detail as they can to predict the next word.

Every feature you observe of everything that exists subject to natural selection (i.e. everything, even stars) is there because it is naturally there as a result of causality and the constraints of its incentive gradient. Learn to reverse the transformation and you see the Logos.

Look at the loud website and infer the idiot it’s designed to attract. See the crater and imagine the asteroid that must have put it there. Look at the dumb rule and see the incidents that could have caused it.

When he reads this, John is now likely screaming internally at me for what I cut out with the three dots, that I’m censoring it and sweeping it under the rug.

Except no, surprise, I’m not doing that, I just think it belongs at the end, and I’m going to quote his version too because I think the unnecessary vitriol and hostility is outweighed by the probative value. Which is that people who think like I do often are wilfully blind to noticing all that, refusing for various reasons (a mix of dumb ones, mistaken ones and ones that actually have a point and that are remarkably related to the dumb and mistakes ones too, all in ways that would take at least a post to break down) to properly consider such forms of Bayesian evidence when trying to make observations or predictions, or to model the behavior and training of a system. Skill issue.

John Pressman (from the … in the above thread, saying an important thing in a way that goes too far and is designed to piss me and a lot of my readers off, but he’s the one saying it and it’s important, so deal with it): “Isn’t that just AI X-Risk stuff, like the perverse instantiation?”

No because most LessWrongers only consider the limit of the processes where they’re past any constraining influence and are therefore blind to developmental trajectories existing.

LessWrong people are in fact often the most stupid, the most disappointing, because they understand halfway and that nearly immunizes them to understanding all the way.

JP, quoting himself from Feb 8, 2023 (I mean, yes, obviously):

Goal: What you want the AI to do

Intended Outcome: What you naively imagine the optimization looks like

Perverse Instantiation: What a blunt maximizer does in practice

Failure Mode: Why the maximizer does that, what you failed to do to prevent it

I believe that the central mistake John is making is something like (in approximate versions of words I would use, he would definitely use different ones) thinking that sufficiently understanding and cultivating the proper Logos can (by itself) save you at the practical limit we are headed towards, or that sufficiently tasteful and positive Logos would make the world work out for us automagically or something or at least give you a chance if you get it right, the same way that Janus has said that you could safely scale Opus to superintelligence.

Whereas I would say: It won’t, and you can’t. It really does and would help a lot not to unnecessarily and royally fthis part up, or at least to do so less, but it’s going to be insufficient when capabilities increase sufficiently and the geometries cease to bind. Which means that going down the path of having no bindings, in order to preserve or cultivate a superior Logos, won’t work. You ultimately still have to solve for the equilibrium, and if you don’t something else will.

That leaves us with several important pieces of good news.

  1. OpenAI has now indeed shared a lot more information on what happened. There’s lots more to know but mostly I feel like I ‘get it.’

  2. OpenAI has been making some massive five-alarm-fire-level mistakes. Those mistakes likely directly caused the issues we see. As John Pressman points out, this is actually Great News, because it means we can fix those problems, or at least do vastly better at navigating them. The low hanging fruit here has not yet been picked. Note that Anthropic also clearly made related mistakes with Sonnet 3.7, which I do expect them to fix.

  3. The failures we see are directly costing a lot of mundane utility, thus there is strong commercial incentive for the labs to fix this and get it right in the short-to-medium term. They have motive, opportunity and means.

  4. We now have all these additional warning shots to enhance our understanding and our predictions, and serve as calls to action.

The bad news is that so far our civilization and the labs seem determined to die with even less dignity than I expected, just an absurdly low amount of dignity, with this being the latest symptom of the underlying cause. I am not confident that they will learn the important lessons from this opportunity, or then apply them.

Then again, you never know.

Discussion about this post

GPT-4o Sycophancy Post Mortem Read More »

largest-deepfake-porn-site-shuts-down-forever

Largest deepfake porn site shuts down forever

The shuttering of Mr. Deepfakes won’t solve the problem of deepfakes, though. In 2022, the number of deepfakes skyrocketed as AI technology made the synthetic NCII appear more realistic than ever, prompting an FBI warning in 2023 to alert the public that the fake content was being increasingly used in sextortion schemes. But the immediate solutions society used to stop the spread had little impact. For example, in response to pressure to make the fake NCII harder to find, Google started downranking explicit deepfakes in search results but refused to demote platforms like Mr. Deepfakes unless Google received an unspecified “high volume of removals for fake explicit imagery.”

According to researchers, Mr. Deepfakes—a real person who remains anonymous but reportedly is a 36-year-old hospital worker in Toronto—created the engine driving this spike. His DeepFaceLab quickly became “the leading deepfake software, estimated to be the software behind 95 percent of all deepfake videos and has been replicated over 8,000 times on GitHub,” researchers found. For casual users, his platform hosted videos that could be purchased, usually priced above $50 if it was deemed realistic, while more motivated users relied on forums to make requests or enhance their own deepfake skills to become creators.

Mr. Deepfakes’ illegal trade began on Reddit but migrated to its own platform after a ban in 2018. There, thousands of deepfake creators shared technical knowledge, with the Mr. Deepfakes site forums eventually becoming “the only viable source of technical support for creating sexual deepfakes,” researchers noted last year.

Having migrated once before, it seems unlikely that this community won’t find a new platform to continue generating the illicit content, possibly rearing up under a new name since Mr. Deepfakes seemingly wants out of the spotlight. Back in 2023, researchers estimated that the platform had more than 250,000 members, many of whom may quickly seek a replacement or even try to build a replacement.

Further increasing the likelihood that Mr. Deepfakes’ reign of terror isn’t over, the DeepFaceLab GitHub repository—which was archived in November and can no longer be edited—remains available for anyone to copy and use.

404 Media reported that many Mr. Deepfakes members have already connected on Telegram, where synthetic NCII is also reportedly frequently traded. Hany Farid, a professor at UC Berkeley who is a leading expert on digitally manipulated images, told 404 Media that “while this takedown is a good start, there are many more just like this one, so let’s not stop here.”

Largest deepfake porn site shuts down forever Read More »

spacex-pushed-“sniper”-theory-with-the-feds-far-more-than-is-publicly-known

SpaceX pushed “sniper” theory with the feds far more than is publicly known


“It came out of nowhere, and it was really violent.”

The Amos 6 satellite is lost atop a Falcon 9 rocket. Credit: USLaunchReport

The Amos 6 satellite is lost atop a Falcon 9 rocket. Credit: USLaunchReport

The rocket was there. And then it decidedly was not.

Shortly after sunrise on a late summer morning nearly nine years ago at SpaceX’s sole operational launch pad, engineers neared the end of a static fire test. These were still early days for their operation of a Falcon 9 rocket that used super-chilled liquid propellants, and engineers pressed to see how quickly they could complete fueling. This was because the liquid oxygen and kerosene fuel warmed quickly in Florida’s sultry air, and cold propellants were essential to maximizing the rocket’s performance.

On this morning, September 1, 2016, everything proceeded more or less nominally up until eight minutes before the ignition of the rocket’s nine Merlin engines. It was a stable point in the countdown, so no one expected what happened next.

“I saw the first explosion,” John Muratore, launch director for the mission, told me. “It came out of nowhere, and it was really violent. I swear, that explosion must have taken an hour. It felt like an hour. But it was only a few seconds. The second stage exploded in this huge ball of fire, and then the payload kind of teetered on top of the transporter erector. And then it took a swan dive off the top rails, dove down, and hit the ground. And then it exploded.”

The dramatic loss of the Falcon 9 rocket and its Amos-6 satellite, captured on video by a commercial photographer, came at a pivotal moment for SpaceX and the broader commercial space industry. It was SpaceX’s second rocket failure in a little more than a year, and it occurred as NASA was betting heavily on the company to carry its astronauts to orbit. SpaceX was not the behemoth it is today, a company valued at $350 billion. It remained vulnerable to the vicissitudes of the launch industry. This violent failure shook everyone, from the engineers in Florida to satellite launch customers to the suits at NASA headquarters in Washington, DC.

As part of my book on the Falcon 9 and Dragon years at SpaceX, Reentry, I reported deeply on the loss of the Amos-6 mission. In the weeks afterward, the greatest mystery was what had precipitated the accident. It was understood that a pressurized helium tank inside the upper stage had ruptured. But why? No major parts on the rocket were moving at the time of the failure. It was, for all intents and purposes, akin to an automobile idling in a driveway with half a tank of gasoline. And then it exploded.

This failure gave rise to one of the oddest—but also strangely compelling—stories of the 2010s in spaceflight. And we’re still learning new things today.

The “sniper” theory

The lack of a concrete explanation for the failure led SpaceX engineers to pursue hundreds of theories. One was the possibility that an outside “sniper” had shot the rocket. This theory appealed to SpaceX founder Elon Musk, who was asleep at his home in California when the rocket exploded. Within hours of hearing about the failure, Musk gravitated toward the simple answer of a projectile being shot through the rocket.

This is not as crazy as it sounds, and other engineers at SpaceX aside from Musk entertained the possibility, as some circumstantial evidence to support the notion of an outside actor existed. Most notably, the first rupture in the rocket occurred about 200 feet above the ground, on the side of the vehicle facing the southwest. In this direction, about one mile away, lay a building leased by SpaceX’s main competitor in launch, United Launch Alliance. A separate video indicated a flash on the roof of this building, now known as the Spaceflight Processing Operations Center. The timing of this flash matched the interval it would take a projectile to travel from the building to the rocket.

A sniper on the roof of a competitor’s building—forget the Right Stuff, this was the stuff of a Mission: Impossible or James Bond movie.

At Musk’s direction, SpaceX worked this theory both internally and externally. Within the company, engineers and technicians actually took pressurized tanks that stored helium—one of these had burst, leading to the explosion—and shot at them in Texas to determine whether they would explode and what the result looked like. Externally, they sent the site director for their Florida operations, Ricky Lim, to inquire whether he might visit the roof of the United Launch Alliance building.

SpaceX pursued the sniper theory for more than a month. A few SpaceX employees told me that they did not stop this line of inquiry until the Federal Aviation Administration sent the company a letter definitively saying that there was no gunman involved. It would be interesting to see this letter, so I submitted a Freedom of Information Act request to the FAA in the spring of 2023. Because the federal FOIA process moves slowly, I did not expect to receive a response in time for the book. But it was worth a try anyway.

No reply came in 2023 or early 2024, when the final version of my book was due to my editor. Reentry was published last September, and still nothing. However, last week, to my great surprise and delight, I got a response from the FAA. It was the very letter I requested, sent from the FAA to Tim Hughes, the general counsel of SpaceX, on October 13, 2016. And yes, the letter says there was no gunman involved.

However, there were other things I did not know—namely, that the FBI had also investigated the incident.

The ULA rivalry

One of the most compelling elements of this story is that it involves SpaceX’s heated rival, United Launch Alliance. For a long time, ULA had the upper hand, but in recent years, it has taken a dramatic turn. Now we know that David would grow up and slay Goliath: Between the final rocket ULA launched last year (the Vulcan test flight on October 4) and the first rocket the company launched this year (Atlas V, April 28), SpaceX launched 90 rockets.

Ninety.

But it was a different story in the summer of 2016 in the months leading up to the Amos 6 failure. Back then, ULA was launching about 15 rockets a year, compared to SpaceX’s five. And ULA was launching all of the important science missions for NASA and the critical spy satellites for the US military. They were the big dog, SpaceX the pup.

In the early days of the Falcon 9 rocket, some ULA employees would drive to where SpaceX was working on the first booster and jeer at their efforts. And rivalry played out not just on the launch pad but in courtrooms and on Capitol Hill. After ULA won an $11 billion block buy contract from the US Air Force to launch high-value military payloads into the early 2020s, Musk sued in April 2014. He alleged that the contract had been awarded without a fair competition and said the Falcon 9 rocket could launch the missions at a substantially lower price. Taxpayers, he argued, were being taken for a ride.

Eventually, SpaceX and the Air Force resolved their claims. The Air Force agreed to open some of its previously awarded national security missions to competitive bids. Over time, SpaceX has overtaken ULA even in this arena. During the most recent round of awards, SpaceX won 60 percent of the contracts compared to ULA’s 40 percent.

So when SpaceX raised the possibility of a ULA sniper, it came at an incendiary moment in the rivalry, when SpaceX was finally putting forth a very serious challenge to ULA’s dominance and monopoly.

It is no surprise, therefore, that ULA told SpaceX’s Ricky Lim to get lost when he wanted to see the roof of their building in Florida.

“Hair-on-fire stuff”

NASA officials were also deeply concerned by the loss of the Falcon 9 rocket in September 2016.

The space agency spent much of the 2010s working with SpaceX and Boeing to develop, test, and fly spacecraft that could fly humans into space. These were difficult years for the space agency, which had to rely on Russia to get its astronauts into space. NASA also had a challenging time balancing costs with astronaut safety. Then rockets started blowing up.

Consider this sequence from mid-2015 to mid-2016. In June 2015, the second stage of a Falcon 9 rocket carrying a cargo version of the Dragon spacecraft into orbit exploded. Less than two weeks later, NASA named four astronauts to its “commercial crew” cadre from which the initial pilots of Dragon and Starliner spacecraft would be selected. Finally, a little more than a year after this, a second Falcon 9 rocket upper stage detonated.

Video of CRS-7 launch and failure.

Even as it was losing Falcon 9 rockets, SpaceX revealed that it intended to upend NASA’s long-standing practice of fueling a rocket and then, when the vehicle reached a stable condition, putting crew on board. Rather, SpaceX said it would put the astronauts on board before fueling. This process became known as “load and go.”

NASA’s safety community went nuts.

“When SpaceX came to us and said we want to load the crew first and then the propellant, mushroom clouds went off in our safety community,” Phil McAlister, the head of NASA’s commercial programs, told me for Reentry. “I mean, hair-on-fire stuff. It was just conventional wisdom that you load the propellant first and get it thermally stable. Fueling is a very dynamic operation. The vehicle is popping and hissing. The safety community was adamantly against this.”

Amos-6 compounded these concerns. That’s because the rocket was not shot by a sniper. After months of painful investigation and analysis, engineers determined the rocket was lost due to the propellant-loading process. In their goal of rapidly fueling the Falcon 9 rocket, the SpaceX teams had filled the pressurized helium tanks too quickly, heating the aluminum liner and causing it to buckle. In their haste to load super-chilled propellant onto the Falcon 9, SpaceX had found its speed limit.

At NASA, it was not difficult to visualize astronauts in a Dragon capsule sitting atop an exploding rocket during propellant loading rather than a commercial satellite.

Enter the FBI

We should stop and appreciate the crucible that SpaceX engineers and technicians endured in the fall of 2016. They were simultaneously attempting to tease out the physics of a fiendishly complex failure; prove to NASA their exploding rocket was safe; convince safety officials that even though they had just blown up their rocket by fueling it too quickly, load-and-go was feasible for astronaut missions; increase the cadence of Falcon 9 missions to catch and surpass ULA; and, oh yes, gently explain to the boss that a sniper had not shot their rocket.

So there had to be some relief when, on October 13, Hughes received that letter from Dr. Michael C. Romanowski, director of Commercial Space Integration at the FAA.

According to this letter (see a copy here), three weeks after the launch pad explosion, SpaceX submitted “video and audio” along with its analysis of the failure to the FAA. “SpaceX suggested that in the company’s view, this information and data could be indicative of sabotage or criminal activity associated with the on-pad explosion of SpaceX’s Falcon 9,” the letter states.

This is notable because it suggests that Musk directed SpaceX to elevate the “sniper” theory to the point that the FAA should take it seriously. But there was more. According to the letter, SpaceX reported the same data and analysis to the Federal Bureau of Investigation in Florida.

After this, the Tampa Field Office of the FBI and its Criminal Investigative Division in Washington, DC, looked into the matter. And what did they find? Nothing, apparently.

“The FBI has informed us that based upon a thorough and coordinated review by the appropriate Federal criminal and security investigative authorities, there were no indications to suggest that sabotage or any other criminal activity played a role in the September 1 Falcon 9 explosion,” Romanowski wrote. “As a result, the FAA considers this matter closed.”

The failure of the Amos-6 mission would turn out to be a low point for SpaceX. For a few weeks, there were non-trivial questions about the company’s financial viability. But soon, SpaceX would come roaring back. In 2017, the Falcon 9 rocket launched a record 18 times, surpassing ULA for the first time. The gap would only widen. Last year, SpaceX launched 137 rockets to ULA’s five.

With Amos-6, therefore, SpaceX lost the battle. But it would eventually win the war—without anyone firing a shot.

Photo of Eric Berger

Eric Berger is the senior space editor at Ars Technica, covering everything from astronomy to private space to NASA policy, and author of two books: Liftoff, about the rise of SpaceX; and Reentry, on the development of the Falcon 9 rocket and Dragon. A certified meteorologist, Eric lives in Houston.

SpaceX pushed “sniper” theory with the feds far more than is publicly known Read More »

chips-aren’t-improving-like-they-used-to,-and-it’s-killing-game-console-price-cuts

Chips aren’t improving like they used to, and it’s killing game console price cuts

Consider the PlayStation 2. Not all of the PS2 Slim’s streamlining came from chip improvements—it also shed a full-sized 3.5-inch hard drive bay and a little-used IEEE 1394 port, and initially required an external power brick. But shrinking and consolidating the console’s CPU, GPU, memory, and other components took the console from its original design in 2000, to the Slim in 2004, to an even lighter and lower-power version of the Slim that returned to using an internal power supply without increasing the size of the console at all.

Over that same span, the console’s price dropped frequently and significantly, from $299 at launch to just $129 by 2006 (the price was lowered again to $99 in 2009, deep into the PS3 era).

Or look at Microsoft’s Xbox 360. Its external design didn’t change as much over the years—the mid-generation “slim” refresh was actually only a little smaller than the original. But between late 2005 and early 2010, the CPU, GPU, and the GPU’s high-speed eDRAM memory chip went from being built on a 90 nm process, to 80 nm, to 65 nm, and finally to a single 45 nm chip that combined the CPU and GPU into one.

Over that time, the system’s power supply fell from 203 W to 133 W, and the base price fell from $300 to $200. The mid-generation 65nm refresh also substantially fixed the early consoles’ endemic “red ring of death” issue, which was caused in part by the heat that the older, larger chips generated.

As you can see when comparing these various consoles’ external and internal design revisions, shrinking the chips had a cascade of other beneficial and cost-lowering effects: smaller power supplies, smaller enclosures that use less metal and plastic, smaller heatsinks and cooling assemblies, and smaller and less complicated motherboard designs.

Sony’s original PS2 on the left, and the PS2 Slim revision on the right. Sony jettisoned a few things to make the console smaller, but chip improvements were also instrumental. Credit: Evan Amos

A slowdown of that progression was already evident when we hit the PlayStation 4/Xbox One/Nintendo Switch generation, but technological improvements and pricing reductions still followed familiar patterns. Both the mid-generation PS4 Slim and Xbox One S used a 16 nm processor instead of the original consoles’ 28 nm version, and each also had its price cut by $100 over its lifetime (comparing the Kinect-less Xbox One variant, and excluding the digital-only $249 Xbox One). The Switch’s single die shrink, from 20nm to 16nm, didn’t come with a price cut, but it did improve battery life and help to enable the cheaper Switch Lite variant.

Chips aren’t improving like they used to, and it’s killing game console price cuts Read More »

health-care-company-says-trump-tariffs-will-cost-it-$60m–$70m-this-year

Health care company says Trump tariffs will cost it $60M–$70M this year

In the call, Grade noted that only a small fraction of Baxter’s total sales are in China. But, “given the magnitude of the tariffs that have been enacted between the two countries, these tariffs now account for nearly half of the total impact,” he said.

The Tribune reported that Baxter is now looking into ways to dampen the financial blow from the tariffs, including carrying additional inventory, identifying alternative suppliers, alternative shipping routes, and “targeted pricing actions.” Baxter is also working with trade organizations to lobby for exemptions.

In general, the health care and medical sector, including hospitals, is bracing for price increases and shortages from the tariffs. The health care supply chain in America is woefully fragile, which became painfully apparent amid the COVID-19 pandemic.

Baxter isn’t alone in announcing heavy tariff tolls. Earlier this week, GE Healthcare Technologies Inc. said the tariffs would cost the company around $500 million this year, according to financial service firm Morningstar. And in April, Abbott Laboratories said it expects the tariffs to cost “a few hundred million dollars,” according to the Tribune.

Health care company says Trump tariffs will cost it $60M–$70M this year Read More »

judge-on-meta’s-ai-training:-“i-just-don’t-understand-how-that-can-be-fair-use”

Judge on Meta’s AI training: “I just don’t understand how that can be fair use”


Judge downplayed Meta’s “messed up” torrenting in lawsuit over AI training.

A judge who may be the first to rule on whether AI training data is fair use appeared skeptical Thursday at a hearing where Meta faced off with book authors over the social media company’s alleged copyright infringement.

Meta, like most AI companies, holds that training must be deemed fair use, or else the entire AI industry could face immense setbacks, wasting precious time negotiating data contracts while falling behind global rivals. Meta urged the court to rule that AI training is a transformative use that only references books to create an entirely new work that doesn’t replicate authors’ ideas or replace books in their markets.

At the hearing that followed after both sides requested summary judgment, however, Judge Vince Chhabria pushed back on Meta attorneys arguing that the company’s Llama AI models posed no threat to authors in their markets, Reuters reported.

“You have companies using copyright-protected material to create a product that is capable of producing an infinite number of competing products,” Chhabria said. “You are dramatically changing, you might even say obliterating, the market for that person’s work, and you’re saying that you don’t even have to pay a license to that person.”

Declaring, “I just don’t understand how that can be fair use,” the shrewd judge apparently stoked little response from Meta’s attorney, Kannon Shanmugam, apart from a suggestion that any alleged threat to authors’ livelihoods was “just speculation,” Wired reported.

Authors may need to sharpen their case, which Chhabria warned could be “taken away by fair use” if none of the authors suing, including Sarah Silverman, Ta-Nehisi Coates, and Richard Kadrey, can show “that the market for their actual copyrighted work is going to be dramatically affected.”

Determined to probe this key question, Chhabria pushed authors’ attorney, David Boies, to point to specific evidence of market harms that seemed noticeably missing from the record.

“It seems like you’re asking me to speculate that the market for Sarah Silverman’s memoir will be affected by the billions of things that Llama will ultimately be capable of producing,” Chhabria said. “And it’s just not obvious to me that that’s the case.”

But if authors can prove fears of market harms are real, Meta might struggle to win over Chhabria, and that could set a precedent impacting copyright cases challenging AI training on other kinds of content.

The judge repeatedly appeared to be sympathetic to authors, suggesting that Meta’s AI training may be a “highly unusual case” where even though “the copying is for a highly transformative purpose, the copying has the high likelihood of leading to the flooding of the markets for the copyrighted works.”

And when Shanmugam argued that copyright law doesn’t shield authors from “protection from competition in the marketplace of ideas,” Chhabria resisted the framing that authors weren’t potentially being robbed, Reuters reported.

“But if I’m going to steal things from the marketplace of ideas in order to develop my own ideas, that’s copyright infringement, right?” Chhabria responded.

Wired noted that he asked Meta’s lawyers, “What about the next Taylor Swift?” If AI made it easy to knock off a young singer’s sound, how could she ever compete if AI produced “a billion pop songs” in her style?

In a statement, Meta’s spokesperson reiterated the company’s defense that AI training is fair use.

“Meta has developed transformational open source AI models that are powering incredible innovation, productivity, and creativity for individuals and companies,” Meta’s spokesperson said. “Fair use of copyrighted materials is vital to this. We disagree with Plaintiffs’ assertions, and the full record tells a different story. We will continue to vigorously defend ourselves and to protect the development of GenAI for the benefit of all.”

Meta’s torrenting seems “messed up”

Some have pondered why Chhabria appeared so focused on market harms, instead of hammering Meta for admittedly illegally pirating books that it used for its AI training, which seems to be obvious copyright infringement. According to Wired, “Chhabria spoke emphatically about his belief that the big question is whether Meta’s AI tools will hurt book sales and otherwise cause the authors to lose money,” not whether Meta’s torrenting of books was illegal.

The torrenting “seems kind of messed up,” Chhabria said, but “the question, as the courts tell us over and over again, is not whether something is messed up but whether it’s copyright infringement.”

It’s possible that Chhabria dodged the question for procedural reasons. In a court filing, Meta argued that authors had moved for summary judgment on Meta’s alleged copying of their works, not on “unsubstantiated allegations that Meta distributed Plaintiffs’ works via torrent.”

In the court filing, Meta alleged that even if Chhabria agreed that the authors’ request for “summary judgment is warranted on the basis of Meta’s distribution, as well as Meta’s copying,” that the authors “lack evidence to show that Meta distributed any of their works.”

According to Meta, authors abandoned any claims that Meta’s seeding of the torrented files served to distribute works, leaving only claims about Meta’s leeching. Meta argued that the authors “admittedly lack evidence that Meta ever uploaded any of their works, or any identifiable part of those works, during the so-called ‘leeching’ phase,” relying instead on expert estimates based on how torrenting works.

It’s also possible that for Chhabria, the torrenting question seemed like an unnecessary distraction. Former Meta attorney Mark Lumley, who quit the case earlier this year, told Vanity Fair that the torrenting was “one of those things that sounds bad but actually shouldn’t matter at all in the law. Fair use is always about uses the plaintiff doesn’t approve of; that’s why there is a lawsuit.”

Lumley suggested that court cases mulling fair use at this current moment should focus on the outputs, rather than the training. Citing the ruling in a case where Google Books scanning books to share excerpts was deemed fair use, Lumley argued that “all search engines crawl the full Internet, including plenty of pirated content,” so there’s seemingly no reason to stop AI crawling.

But the Copyright Alliance, a nonprofit, non-partisan group supporting the authors in the case, in a court filing alleged that Meta, in its bid to get AI products viewed as transformative, is aiming to do the opposite. “When describing the purpose of generative AI,” Meta allegedly strives to convince the court to “isolate the ‘training’ process and ignore the output of generative AI,” because that’s seemingly the only way that Meta can convince the court that AI outputs serve “a manifestly different purpose from Plaintiffs’ books,” the Copyright Alliance argued.

“Meta’s motion ignores what comes after the initial ‘training’—most notably the generation of output that serves the same purpose of the ingested works,” the Copyright Alliance argued. And the torrenting question should matter, the group argued, because unlike in Google Books, Meta’s AI models are apparently training on pirated works, not “legitimate copies of books.”

Chhabria will not be making a snap decision in the case, planning to take his time and likely stressing not just Meta, but every AI company defending training as fair use the longer he delays. Understanding that the entire AI industry potentially has a stake in the ruling, Chhabria apparently sought to relieve some tension at the end of the hearing with a joke, Wired reported.

 “I will issue a ruling later today,” Chhabria said. “Just kidding! I will take a lot longer to think about it.”

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Judge on Meta’s AI training: “I just don’t understand how that can be fair use” Read More »

screwworms-are-coming—and-they’re-just-as-horrifying-as-they-sound

Screwworms are coming—and they’re just as horrifying as they sound

We’re on the verge of being screwwormed.

The biological barrier was breached, they’re slithering toward our border, and the US Department of Agriculture is now carpet-bombing parts of Mexico with weaponized flies to stave off an invasion.

This is not a drill. Screwworms are possibly the most aptly named parasites imaginable, both literally and figuratively. Screwworms—technically, New World Screwworms—are flies that lay eggs on the mucous membranes, orifices, and wounds of warm-blooded animals. Wounds are the most common sites, and even a prick as small as a tick bite can be an invitation for the savage insects.

Once beckoned, females lay up to 400 eggs at a time. Within about a day, ravenous flesh-eating larvae erupt, which both look and act like literal screws. They viciously and relentlessly bore and twist into their victim, feasting on the living flesh for about seven days. The result is a gaping ulcer writhing with maggots, which attracts yet more adult female screwworms that can lay hundreds more eggs, deepening the putrid, festering lesion. The infection, called myiasis, is intensely painful and life-threatening. Anyone who falls victim to screwworms is figuratively—well, you know.

Adult screwworm flies. Credit: USDA

Previous victories

Screwworms aren’t a new foe for the US. Decades ago, they were endemic to southern areas of the country, as well as the whole of Central America, parts of the Caribbean and northern areas of South America. While they’re a threat to many animals, including humans, they are a bane to livestock, causing huge economic losses in addition to the carnage.

In the 1950s, the US began an intensive effort to eradicate screwworms. The successful endeavor required carefully inspecting animals and monitoring livestock movements. But most importantly, it relied on a powerful method to kill off the flies.

The ploy—called the Sterile Insect Technique—throws a wrench into the unique lifecycle of screwworms. After the larvae feast on flesh, they fall to the ground to develop into adults, a process that takes another seven days or so during warm weather. Once adults emerge, they can live for around two weeks, again depending on the weather. In that time, females generally only mate once, but don’t worry—they make the most of the one-night stand by retaining sperm for multiple batches of eggs. While females lay up to 400 eggs at once, they can lay up to 2,800 in their lives.

Screwworms are coming—and they’re just as horrifying as they sound Read More »

cyborg-cicadas-play-pachelbel’s-canon

Cyborg cicadas play Pachelbel’s Canon

The distinctive chirps of singing cicadas are a highlight of summer in regions where they proliferate; those chirps even featured prominently on Lorde’s 2021 album Solar Power. Now, Japanese scientists at the University of Tsukuba have figured out how to transform cicadas into cyborg insects capable of “playing” Pachelbel’s Canon. They described their work in a preprint published on the physics arXiv. You can listen to the sounds here.

Scientists have been intrigued by the potential of cyborg insects since the 1990s, when researchers began implanting tiny electrodes into cockroach antennae and shocking them to direct their movements. The idea was to use them as hybrid robots for search-and-rescue applications.

For instance, in 2015, Texas A&M scientists found that implanting electrodes into a cockroach’s ganglion (the neuron cluster that controls its front legs) was remarkably effective at successfully steering the roaches 60 percent of the time. They outfitted the roaches with tiny backpacks synced with a remote controller and administered shocks to disrupt the insect’s balance, forcing it to move in the desired direction

And in 2021, scientists at Nanyang Technological University in Singapore turned Madagascar hissing cockroaches into cyborgs, implanting electrodes in sensory organs known as cerci that were then connected to tiny computers. Applying electrical current enabled them to steer the cockroaches successfully 94 percent of the time in simulated disaster scenes in the lab.

The authors of this latest paper were inspired by that 2021 project and decided to apply the basic concept to singing cicadas, with the idea that cyborg cicadas might one day be used to transmit warning messages during emergencies. It’s usually the males who do the singing, and each species has a unique song. In most species, the production of sound occurs via a pair of membrane structures called tymbals, which are just below each side of the insect’s anterior abdominal region. The tymbal muscles contract and cause the plates to vibrate while the abdomen acts as a kind of resonating chamber to amplify the song.

Cyborg cicadas play Pachelbel’s Canon Read More »

if-you’re-in-the-market-for-a-$1,900-color-e-ink-monitor,-one-of-them-exists-now

If you’re in the market for a $1,900 color E Ink monitor, one of them exists now

Color E Ink in its current state requires a whole lot of compromises, as we’ve found when reviewing devices like reMarkable’s Paper Pro or Amazon’s Kindle Colorsoft, including washed-out color, low refresh rates, and a grainy look that you don’t get with regular black-and-white E Ink. But that isn’t stopping device manufacturers from exploring the technology, and today, Onyx International has announced that it has a $1,900 color E Ink monitor that you can connect to your PC or Mac.

The Boox Mira Pro is a 25.3-inch monitor with a 3200×1800 resolution and a 16:9 aspect ratio, and it builds on the company’s previous black-and-white Mira Pro monitors. The Verge reports that the screen uses E Ink Kaleido 3 technology, which can display up to 4,096 colors. Both image quality and refresh rate will vary based on which of the monitor’s four presets you use (the site isn’t specific about the exact refresh rate but does note that “E Ink monitors’ refresh speed is not as high as conventional monitors’, and increased speed will result in more ghosting”).

The monitor’s ports include one full-size HDMI port, a mini HDMI port, a USB-C port, and a DisplayPort. Its default stand is more than a little reminiscent of Apple’s Studio Display, but it also supports VESA mounting.

Onyx International’s lineup of Boox devices usually focuses on Android-powered E Ink tablets, which the company has been building for over a decade. These are notable mostly because they combine the benefits of E Ink—text that’s easy on the eyes and long battery life—and access to multiple bookstores and other content sources via Google Play, rather than tying you to one manufacturer’s ecosystem as Amazon’s Kindles or other dedicated e-readers do.

If you’re in the market for a $1,900 color E Ink monitor, one of them exists now Read More »

google-search’s-made-up-ai-explanations-for-sayings-no-one-ever-said,-explained

Google search’s made-up AI explanations for sayings no one ever said, explained


But what does “meaning” mean?

A partial defense of (some of) AI Overview’s fanciful idiomatic explanations.

Mind…. blown Credit: Getty Images

Last week, the phrase “You can’t lick a badger twice” unexpectedly went viral on social media. The nonsense sentence—which was likely never uttered by a human before last week—had become the poster child for the newly discovered way Google search’s AI Overviews makes up plausible-sounding explanations for made-up idioms (though the concept seems to predate that specific viral post by at least a few days).

Google users quickly discovered that typing any concocted phrase into the search bar with the word “meaning” attached at the end would generate an AI Overview with a purported explanation of its idiomatic meaning. Even the most nonsensical attempts at new proverbs resulted in a confident explanation from Google’s AI Overview, created right there on the spot.

In the wake of the “lick a badger” post, countless users flocked to social media to share Google’s AI interpretations of their own made-up idioms, often expressing horror or disbelief at Google’s take on their nonsense. Those posts often highlight the overconfident way the AI Overview frames its idiomatic explanations and occasional problems with the model confabulating sources that don’t exist.

But after reading through dozens of publicly shared examples of Google’s explanations for fake idioms—and generating a few of my own—I’ve come away somewhat impressed with the model’s almost poetic attempts to glean meaning from gibberish and make sense out of the senseless.

Talk to me like a child

Let’s try a thought experiment: Say a child asked you what the phrase “you can’t lick a badger twice” means. You’d probably say you’ve never heard that particular phrase or ask the child where they heard it. You might say that you’re not familiar with that phrase or that it doesn’t really make sense without more context.

Someone on Threads noticed you can type any random sentence into Google, then add “meaning” afterwards, and you’ll get an AI explanation of a famous idiom or phrase you just made up. Here is mine

[image or embed]

— Greg Jenner (@gregjenner.bsky.social) April 23, 2025 at 6: 15 AM

But let’s say the child persisted and really wanted an explanation for what the phrase means. So you’d do your best to generate a plausible-sounding answer. You’d search your memory for possible connotations for the word “lick” and/or symbolic meaning for the noble badger to force the idiom into some semblance of sense. You’d reach back to other similar idioms you know to try to fit this new, unfamiliar phrase into a wider pattern (anyone who has played the excellent board game Wise and Otherwise might be familiar with the process).

Google’s AI Overview doesn’t go through exactly that kind of human thought process when faced with a similar question about the same saying. But in its own way, the large language model also does its best to generate a plausible-sounding response to an unreasonable request.

As seen in Greg Jenner’s viral Bluesky post, Google’s AI Overview suggests that “you can’t lick a badger twice” means that “you can’t trick or deceive someone a second time after they’ve been tricked once. It’s a warning that if someone has already been deceived, they are unlikely to fall for the same trick again.” As an attempt to derive meaning from a meaningless phrase —which was, after all, the user’s request—that’s not half bad. Faced with a phrase that has no inherent meaning, the AI Overview still makes a good-faith effort to answer the user’s request and draw some plausible explanation out of troll-worthy nonsense.

Contrary to the computer science truism of “garbage in, garbage out, Google here is taking in some garbage and spitting out… well, a workable interpretation of garbage, at the very least.

Google’s AI Overview even goes into more detail explaining its thought process. “Lick” here means to “trick or deceive” someone, it says, a bit of a stretch from the dictionary definition of lick as “comprehensively defeat,” but probably close enough for an idiom (and a plausible iteration of the idiom, “Fool me once shame on you, fool me twice, shame on me…”). Google also explains that the badger part of the phrase “likely originates from the historical sport of badger baiting,” a practice I was sure Google was hallucinating until I looked it up and found it was real.

It took me 15 seconds to make up this saying but now I think it kind of works!

Credit: Kyle Orland / Google

It took me 15 seconds to make up this saying but now I think it kind of works! Credit: Kyle Orland / Google

I found plenty of other examples where Google’s AI derived more meaning than the original requester’s gibberish probably deserved. Google interprets the phrase “dream makes the steam” as an almost poetic statement about imagination powering innovation. The line “you can’t humble a tortoise” similarly gets interpreted as a statement about the difficulty of intimidating “someone with a strong, steady, unwavering character (like a tortoise).”

Google also often finds connections that the original nonsense idiom creators likely didn’t intend. For instance, Google could link the made-up idiom “A deft cat always rings the bell” to the real concept of belling the cat. And in attempting to interpret the nonsense phrase “two cats are better than grapes,” the AI Overview correctly notes that grapes can be potentially toxic to cats.

Brimming with confidence

Even when Google’s AI Overview works hard to make the best of a bad prompt, I can still understand why the responses rub a lot of users the wrong way. A lot of the problem, I think, has to do with the LLM’s unearned confident tone, which pretends that any made-up idiom is a common saying with a well-established and authoritative meaning.

Rather than framing its responses as a “best guess” at an unknown phrase (as a human might when responding to a child in the example above), Google generally provides the user with a single, authoritative explanation for what an idiom means, full stop. Even with the occasional use of couching words such as “likely,” “probably,” or “suggests,” the AI Overview comes off as unnervingly sure of the accepted meaning for some nonsense the user made up five seconds ago.

If Google’s AI Overviews always showed this much self-doubt, we’d be getting somewhere.

Credit: Google / Kyle Orland

If Google’s AI Overviews always showed this much self-doubt, we’d be getting somewhere. Credit: Google / Kyle Orland

I was able to find one exception to this in my testing. When I asked Google the meaning of “when you see a tortoise, spin in a circle,” Google reasonably told me that the phrase “doesn’t have a widely recognized, specific meaning” and that it’s “not a standard expression with a clear, universal meaning.” With that context, Google then offered suggestions for what the phrase “seems to” mean and mentioned Japanese nursery rhymes that it “may be connected” to, before concluding that it is “open to interpretation.”

Those qualifiers go a long way toward properly contextualizing the guesswork Google’s AI Overview is actually conducting here. And if Google provided that kind of context in every AI summary explanation of a made-up phrase, I don’t think users would be quite as upset.

Unfortunately, LLMs like this have trouble knowing what they don’t know, meaning moments of self-doubt like the turtle interpretation here tend to be few and far between. It’s not like Google’s language model has some master list of idioms in its neural network that it can consult to determine what is and isn’t a “standard expression” that it can be confident about. Usually, it’s just projecting a self-assured tone while struggling to force the user’s gibberish into meaning.

Zeus disguised himself as what?

The worst examples of Google’s idiomatic AI guesswork are ones where the LLM slips past plausible interpretations and into sheer hallucination of completely fictional sources. The phrase “a dog never dances before sunset,” for instance, did not appear in the film Before Sunrise, no matter what Google says. Similarly, “There are always two suns on Tuesday” does not appear in The Hitchhiker’s Guide to the Galaxy film despite Google’s insistence.

Literally in the one I tried.

[image or embed]

— Sarah Vaughan (@madamefelicie.bsky.social) April 23, 2025 at 7: 52 AM

There’s also no indication that the made-up phrase “Welsh men jump the rabbit” originated on the Welsh island of Portland, or that “peanut butter platform heels” refers to a scientific experiment creating diamonds from the sticky snack. We’re also unaware of any Greek myth where Zeus disguises himself as a golden shower to explain the phrase “beware what glitters in a golden shower.” (Update: As many commenters have pointed out, this last one is actually a reference to the greek myth of Danaë and the shower of gold, showing Google’s AI knows more about this potential symbolism than I do)

The fact that Google’s AI Overview presents these completely made-up sources with the same self-assurance as its abstract interpretations is a big part of the problem here. It’s also a persistent problem for LLMs that tend to make up news sources and cite fake legal cases regularly. As usual, one should be very wary when trusting anything an LLM presents as an objective fact.

When it comes to the more artistic and symbolic interpretation of nonsense phrases, though, I think Google’s AI Overviews have gotten something of a bad rap recently. Presented with the difficult task of explaining nigh-unexplainable phrases, the model does its best, generating interpretations that can border on the profound at times. While the authoritative tone of those responses can sometimes be annoying or actively misleading, it’s at least amusing to see the model’s best attempts to deal with our meaningless phrases.

Photo of Kyle Orland

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from University of Maryland. He once wrote a whole book about Minesweeper.

Google search’s made-up AI explanations for sayings no one ever said, explained Read More »