Author name: Beth Washington

Honda’s hopper suddenly makes the Japanese carmaker a serious player in rocketry

The company has not disclosed its spending on rocket development. Honda’s hopper is smaller than similar prototype boosters SpaceX has used for vertical landing demos, so engineers will have to scale up the design to create a viable launch vehicle.

But Tuesday’s test catapulted Honda into an exclusive club of companies that have flown reusable rocket hoppers with an eye toward orbital flight, including SpaceX, Blue Origin, and a handful of Chinese startups. Meanwhile, European and Japanese space agencies have funded a pair of reusable rocket hoppers named Themis and Callisto. After years of delays, neither rocket has flown.

Honda’s experimental rocket lifts off from a test site in Taiki, a community in northern Japan.

Before Honda’s leadership green-lit the rocket project in 2019, a group of the company’s younger engineers proposed applying its expertise in combustion and control technologies to a launch vehicle. Honda officials believe the carmaker “has the potential to contribute more to people’s daily lives by launching satellites with its own rockets.”

The company suggested in its press release Tuesday that a Honda-built rocket might launch Earth observation satellites to monitor global warming and extreme weather, and satellite constellations for wide-area communications. Specifically, the company noted the importance of satellite communications to enabling connected features in cars, airplanes, and other Honda products.

“In this market environment, Honda has chosen to take on the technological challenge of developing reusable rockets by utilizing Honda technologies amassed in the development of various products and automated driving systems, based on a belief that reusable rockets will contribute to achieving sustainable transportation,” Honda said.

Toyota, Japan’s largest car company, also has a stake in the launch business. Interstellar Technologies, a Japanese space startup, announced a $44 million investment from Toyota in January. The two firms said they were establishing an alliance to draw on Toyota’s formula for automobile manufacturing to set up a factory for mass-producing orbital-class rockets. Interstellar has launched a handful of sounding rockets but hasn’t yet built an orbital launcher.

Japan’s primary rocket builder, Mitsubishi Heavy Industries, is another titan of Japanese industry, but it has never launched more than six space missions in a single year. MHI’s newest rocket, the H3, debuted in 2023 but is fully expendable.

The second-biggest Japanese automaker, Honda, is now making its own play. Car companies aren’t accustomed to making vehicles that can only be used once.

Here’s Kia’s new small, affordable electric car: The 2026 EV4 sedan

The mesh headrests are a clever touch, as they’re both comfortable and lightweight. The controls built into the side of the passenger seat that let the driver change its position are a specialty of the automaker. There are also plenty of other conveniences, including wireless device charging, 100 W USB-C ports, and wireless Android Auto and Apple CarPlay. I relied on the native navigation app, which is not as visually pretty as the one you cast from your phone to the 12.3-inch infotainment screen, but it kept me on course on unfamiliar roads in a foreign country while I was suffering from jet lag. That seems worthy of a mention.

Public transport

Traffic in and around Seoul makes a wonderful case for public transport, but it gave the EV4 little opportunity to show its stuff beyond relatively low-speed stop-and-go driving, mostly topping out at 50 mph (80 km/h) on roads that are heavily studded with traffic cameras. As a result, a true impression of the car’s range will have to wait until we can spend more time with it on US roads.

It was, however, an easy car to drive in traffic and to drive slowly. It’s no speed demon anyway; 0–62 mph (100 km/h) takes 7.4 seconds if you floor it in the standard-range car, or 7.7 seconds in the big-battery one. The ride is good over broken tarmac, although it is quite firm when dealing with short-duration bumps. Meanwhile, the steering is light but not particularly informative when it comes to providing a picture of what the front tires are doing.

Good driving dynamics help sell a car once someone has had a test drive, but most will only get that far if the pricing is right. That’s yet to be announced, and who knows what will happen with tariffs and the clean vehicle tax credit between now and when the cars arrive in dealerships toward the end of the year. However, we expect the standard-range car to start between $37,000 and $39,000, undercutting the Tesla Model 3 in the process. That sounds rather compelling to me.

Companies may soon pay a fee for their rockets to share the skies with airplanes


Some space companies aren’t necessarily against this idea, but SpaceX hasn’t spoken.

Starship soars through the stratosphere. Credit: Stephen Clark/Ars Technica

The Federal Aviation Administration may soon levy fees on companies seeking launch and reentry licenses, a new tack in the push to give the agency the resources it needs to keep up with the rapidly growing commercial space industry.

The text of a budget reconciliation bill released by Sen. Ted Cruz (R-Texas) last week calls for the FAA’s Office of Commercial Space Transportation, known as AST, to begin charging licensing fees to space companies next year. The fees would phase in over eight years, after which the FAA would adjust them to keep pace with inflation. The money would go into a trust fund to help pay for the operating costs of the FAA’s commercial space office.

The bill released by Cruz’s office last week covers federal agencies under the oversight of the Senate Commerce Committee, which he chairs. These agencies include the FAA and NASA. Ars recently covered Cruz’s proposals for NASA to keep the Space Launch System rocket, Orion spacecraft, and Gateway lunar space station alive, while the Trump administration aims to cancel Gateway and end the SLS and Orion programs after two crew missions to the Moon.

The Trump administration’s fiscal year 2026 budget request, released last month, proposes $42 million for the FAA’s Office of Commercial Space Transportation, a fraction of the agency’s overall budget request of $22 billion. The FAA’s commercial space office received an almost identical funding level in 2024 and 2025. Accounting for inflation, this is effectively a budget cut for AST. The office’s budget increased from $27.6 million to more than $42 million between 2021 and 2024, when companies like SpaceX began complaining the FAA was not equipped to keep up with the fast-moving commercial launch industry.

The FAA licensed 11 commercial launch and reentry operations in 2015, when AST’s budget was $16.6 million. Last year, the number of space operations increased to 164, and the US industry is on track to conduct more than 200 commercial launches and reentries in 2025. SpaceX’s Falcon 9 rocket is doing most of these launches.

While the FAA’s commercial space office receives more federal funding today, the budget hasn’t grown to keep up with the cadence of commercial spaceflight. SpaceX officials urged the FAA to double its licensing staff in 2023 after the company experienced delays in securing launch licenses.

In the background, a Falcon 9 rocket climbs away from Space Launch Complex 40 at Cape Canaveral Space Force Station, Florida. Another Falcon 9 stands on its launch pad at neighboring Kennedy Space Center awaiting its opportunity to fly.

Adding it up

Cruz’s section of the Senate reconciliation bill calls for the FAA to charge commercial space companies per pound of payload mass, beginning with 25 cents per pound in 2026 and increasing to $1.50 per pound in 2033. Subsequent fee rates would change based on inflation. The overall fee per launch or entry would be capped at $30,000 in 2026, increasing to $200,000 in 2033, and then adjusted to keep pace with inflation.

The Trump administration has not weighed in on Cruz’s proposed fee schedule, but Trump’s nominee for the next FAA administrator, Bryan Bedford, agreed with the need for launch and reentry licensing fees in a Senate confirmation hearing Wednesday. Most of the hearing’s question-and-answer session focused on the safety of commercial air travel, but there was a notable exchange on the topic of commercial spaceflight.

Cruz said the rising number of space launches will “add considerable strain to the airspace system” in the United States. Airlines and their passengers pay FAA-mandated fees for each flight segment, and private owners pay the FAA a fee to register their aircraft. The FAA also charges overflight fees to aircraft traveling through US airspace, even if they don’t take off or land in the United States.

“Nearly every user of the National Airspace System pays something back into the system to help cover their operational costs, yet under current law, space launch companies do not, and there is no mechanism for them to pay even if they wish to,” Cruz said. “As commercial spaceflight expands rapidly, so does its impact on the FAA’s ability to operate the National Airspace System. This proposal accounts for that.”

When asked if he agreed, Trump’s FAA nominee suggested he did. Bedford, president and CEO of Republic Airways, is poised to take the helm of the federal aviation regulator if he passes Senate confirmation.

Bryan Bedford is seen prior to his nomination hearing before the Senate Commerce Committee to lead the Federal Aviation Administration on June 11, 2025. Credit: Craig Hudson for The Washington Post via Getty Images

The FAA clears airspace of commercial and private air traffic along the flight corridors of rockets as they launch into space, and around the paths of spacecraft as they return to Earth. The agency is primarily charged with ensuring commercial rockets don’t endanger the public. The National Airspace System (NAS) consists of 29 million square miles of airspace over land and oceans. The FAA says more than 45,000 flights and 2.9 million airline passengers travel through the airspace every day.

Bedford said he didn’t want to speak on specific policy proposals before the Trump administration announces an official position on the matter.

“But I’ll confirm you’re exactly right,” Bedford told Cruz. “Passengers and airlines themselves pay significant taxes. … Those taxes are designed to modernize our NAS. One of the things that is absolutely critical in modernization is making sure we design the NAS so it can accommodate an increased cadence in space launch, so I certainly support where you’re going with that.”

SpaceX would be the company most affected by the proposed licensing fees. The majority of SpaceX’s missions launch the company’s own Starlink broadband satellites aboard Falcon 9 rockets. Most of those launches carry around 17 metric tons (about 37,500 pounds) of usable payload mass.

A quick calculation shows that SpaceX would pay a fee of roughly $9,400 for an average Starlink launch on a Falcon 9 rocket next year if Cruz’s legislation is signed into law. SpaceX launched 89 dedicated Starlink missions last year. That would add up to more than $800,000 in annual fees going into the FAA’s coffers under Cruz’s licensing scheme. Once you account for all of SpaceX’s other commercial launches, this number would likely exceed $1 million.

Assuming Falcon 9s continue to launch Starlink satellites in 2033, the fees would rise to approximately $56,000 per launch. SpaceX may have switched over all Starlink missions to its giant new Starship rocket by then, in which case the company will likely reach the FAA’s proposed fee cap of $200,000 per launch. SpaceX hopes to launch Starships at lower cost than it currently launches the Falcon 9 rocket, so this proposal would see SpaceX pay a significantly larger fraction of its per-mission costs in the form of FAA fees.
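
Put in code, the fee arithmetic above looks like this minimal sketch. The per-pound rates and caps come from the bill as reported; the 37,500-pound figure is the article’s estimate of a Starlink mission’s usable payload, and because the bill’s year-by-year phase-in between 2026 and 2033 isn’t spelled out here, only the endpoint years are modeled.

```python
# Minimal sketch of the proposed FAA licensing fees, assuming the reported
# bill figures: a per-pound rate with a per-mission cap, at the 2026 and
# 2033 endpoints of the phase-in.

FEE_SCHEDULE = {
    2026: {"rate_per_lb": 0.25, "cap_usd": 30_000},
    2033: {"rate_per_lb": 1.50, "cap_usd": 200_000},
}

def launch_fee(payload_lb: float, year: int) -> float:
    """Fee for one launch or reentry: per-pound rate, capped per mission."""
    sched = FEE_SCHEDULE[year]
    return min(payload_lb * sched["rate_per_lb"], sched["cap_usd"])

starlink_lb = 37_500  # ~17 metric tons of usable payload, per the article

print(launch_fee(starlink_lb, 2026))        # 9375.0   -> "roughly $9,400"
print(launch_fee(starlink_lb, 2026) * 89)   # 834375.0 -> "more than $800,000"
print(launch_fee(starlink_lb, 2033))        # 56250.0  -> "approximately $56,000"
```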

Industry reaction

A senior transportation official in the Biden administration voiced tentative support in 2023 for a fee scheme similar to the one under consideration by the Senate. Michael Huerta, a former FAA administrator during the Obama administration and the first Trump administration, told NPR last year that he supports the idea.

“You have this group of new users that are paying nothing into the system that are an increasing share of the operations,” Huerta said. “I truly believe the current structure isn’t sustainable.”

The Commercial Spaceflight Federation, an industry advocacy group that includes SpaceX and Blue Origin among its membership, signaled last year it was against the idea of creating launch and reentry fees, or taxes, as some industry officials call them. Commercial launch and reentry companies have been excluded from FAA fees to remove regulatory burdens and help the industry grow. The federation told NPR last year that because the commercial space industry requires access to US airspace much less often than the aviation industry, it would not yet be appropriate to have space companies pay into an FAA trust fund.

SpaceX did not respond to questions from Ars on the matter. United Launch Alliance would likely be on the hook to become the second-largest payer of FAA fees, at least over the next couple of years, with numerous missions in its backlog to launch massive stacks of Internet satellites for Amazon’s Project Kuiper network from Cape Canaveral Space Force Station in Florida.

A ULA spokesperson told Ars the company is still reviewing and assessing the Senate Commerce Committee’s proposal. “In general, we are supportive of fees that are affordable, do not disadvantage US companies against their foreign counterparts, are fair, equitable, and are used to directly improve the shared infrastructure at the Cape and other spaceports,” the spokesperson said.

Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

“Two years of work in two months”: States cope with Trump broadband overhaul


Trump overhaul of $42B broadband fund upends states’ plans to expand access.

Spools of fiber conduits for broadband network construction. Credit: Getty Images | Akchamczuk

The Trump administration has upended plans that state governments made to distribute $42 billion in federal broadband funding, forcing state officials to scrap much of the preparation work they did over the previous couple of years.

Secretary of Commerce Howard Lutnick essentially put the Broadband Equity, Access, and Deployment (BEAD) program on hold earlier this year and last week announced details of a rules overhaul that requires states to change how they distribute money to Internet service providers. To find out how this affects states, we spoke with Andrew Butcher, president of the Maine Connectivity Authority (MCA).

“We had been in position to be making awards this month, but for [the Trump administration’s] deliberations and program changes, so it’s pretty unfortunate,” Butcher told Ars. Established by a 2021 state law, the MCA is a quasi-governmental agency that oversees Maine’s BEAD planning and other programs that increase broadband access.

“This is the construction season,” Butcher said. “We planned it so that projects would be able to get ready with their pre-construction activities and their construction activities beginning in the summer, so they would have all summer and through the fall and early winter to get in motion.” The National Telecommunications and Information Administration (NTIA), a division of the Commerce Department, “has now essentially relegated the process to not even begin pre-construction until late fall, early winter at the earliest,” he said.

The Biden administration spent about three years developing rules and procedures for BEAD and then evaluating plans submitted by each US state and territory. Maine has been working on its plans for about two years, Butcher said. The process included analyzing which addresses in Maine are unserved and eligible for funding to subsidize network construction, and inviting ISPs to bid on projects. Maine and other states will have to go through the bidding process with ISPs again due to the overhaul.

Two years of work in two months

The change “undoubtedly creates additional work and effort for Maine and every other state and territory,” Butcher said. “So we will execute it as quickly and efficiently as possible, but it kind of jams two years of work into two months.” The new timeline is difficult, but “Secretary Lutnick has committed that funds will be awarded and projects started this year. We’re going to hold them to that,” he said.

Butcher said he was relieved that the BEAD program wasn’t canceled entirely. He pointed to President Trump’s recent move to kill the separate $2.7 billion grant program created by the Digital Equity Act of 2021.

Maine was supposed to receive $35 million from the Digital Equity Act for several programs that would provide devices, digital skills training, STEM education, telehealth access, and other services. Trump claimed the Digital Equity Act is “racist and illegal.”

Butcher said that “for all anyone knows, it was canceled simply because the word ‘equity’ is in it.” He pointed out that the same word appears in the title of the Broadband Equity, Access, and Deployment program. Given that, “the updated policy guidance for the BEAD program could have been worse,” Butcher said.

US eliminates fiber preference

Lutnick and other Republicans didn’t like the Biden administration’s decision to prioritize the building of fiber networks in BEAD, arguing that fixed wireless and satellite services like Starlink should have an equal shot at obtaining grants. The NTIA said on June 6 that states and territories must conduct “an additional ‘Benefit of the Bargain Round’ of subgrantee selection that permits all applicants to compete on a level playing field.” That will give non-fiber ISPs a better chance to obtain grants.

Senate Democrats accused the Trump administration of forcing states to subsidize Starlink instead of more robust fiber networks.

“States must maintain the flexibility to choose the highest quality broadband options, rather than be forced by bureaucrats in Washington to funnel funds to Elon Musk’s Starlink, which lacks the scalability, reliability, and speed of fiber or other terrestrial broadband solutions,” Senate Democrats wrote in a May 30 letter to Trump and Lutnick. The letter said that forcing states to scrap their previous work could cause them to “not only miss this year’s construction season but next year’s as well, delaying broadband deployment by years.”

Commerce Secretary slammed cost

Lutnick has pushed for lower per-location costs and made a social media post criticizing Nevada’s plans. “The Biden Administration approved their BEAD application with 24 project areas in the state with a PER LOCATION cost of over $100,000 each, incredible,” Lutnick wrote. “One location cost over $228,000!! We will stop this absurd spending while delivering the benefit of the bargain by connecting unserved communities with satellite, fixed wireless, and/or fiber: whichever makes the most economic sense.”

Lutnick also complained that “Congress set aside $42.45 billion for rural broadband in November 2021. More than three years later, not a single person has been connected to the Internet under the BEAD program.”

Sen. Jacky Rosen (D-Nev.) called Lutnick’s complaint disingenuous. “You’ve been holding up BEAD funding that was already APPROVED for my state since January, and you’re complaining no one has been connected yet?” she wrote.

Butcher said he trusts the expertise of Nevada’s broadband office to “make the most of the available funding,” even if Lutnick thinks the state is spending too much in some areas. “We are talking about facilitating a once-in-a-lifetime level of critical infrastructure investment,” Butcher said. “Every place is going to be different.”

Butcher said Lutnick is exercising “authority as a central government over the rights and expertise of a state body, which I guess I don’t understand how the party’s values work anymore, but that to me feels like a pretty strange Republican imposition.”

Butcher still expects significant fiber deployment

Overall, Nevada’s plan was to use $416 million to connect 43,715 households and businesses. Maine was to receive about $272 million, which Butcher said would “provide deployment to about 25,000 unserved households and businesses” and about 3,500 community anchor institutions. Anchor institutions under the BEAD program can include places like schools, libraries, hospitals and other health facilities, public safety facilities, public housing, and community centers.
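
As a rough back-of-the-envelope check on those figures (a sketch only: it ignores matching funds, administrative costs, and anything else outside the headline totals), the average cost per funded location in both states comes out far below the $100,000-plus project areas Lutnick criticized:

```python
# Crude per-location averages from the totals quoted above; real project
# areas vary widely, which is the nub of the dispute over Nevada's plan.

nevada_avg = 416_000_000 / 43_715            # households and businesses
maine_avg = 272_000_000 / (25_000 + 3_500)   # incl. community anchor institutions

print(f"Nevada average: ${nevada_avg:,.0f} per location")  # ~ $9,516
print(f"Maine average:  ${maine_avg:,.0f} per location")   # ~ $9,544
```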

“With our available funding, we really don’t have the ability to consider a cost per passing anywhere near” the $228,000 example cited by Lutnick, Butcher said. “We have to be resourceful and efficient in the decision-making… to squeeze the value out of that as much as possible.”

Fiber is Butcher’s first choice, and he said he is not convinced that the Trump administration’s new guidelines will significantly reduce the amount of fiber deployment that ultimately happens once BEAD funds are finally spent.

“The introduction of more of a preference or bias towards the cheapest deployment option… actually may very well drive competition and further incentivize fiber providers to be more aggressive” in their bids for projects, he said.

Still, he said the cost of laying fiber lines in certain locations means that wireless and satellite networks have their place. “There are some places where fiber is a prohibitive cost. Maine is a big place without a lot of people,” Butcher said.

Starlink not the first choice

When the government gives money to a fiber ISP to subsidize deployment, it’s easy to see the results: The provider is required under the terms of the grant to install fiber at homes and businesses that weren’t previously served. The benefits aren’t as immediately clear with Starlink, which is already deploying satellites that can serve most of the country.

But residents can benefit from deals between Starlink and local governments by gaining access to equipment and higher levels of service. Maine already partnered with Starlink last year to coordinate bulk purchases of equipment for Internet users and guarantee service availability.

Starlink availability and speed vary by region. But with last year’s deal between Maine and Starlink, “we’ve been able to establish a network reservation to ensure a higher standard of service performance,” Butcher said. He called Starlink a great option for remote areas but said that satellite is “far from the policy standard that we should be looking to” for every location in Maine.

Despite the BEAD holdup and Digital Equity Act cancellation, the MCA has been distributing other funds. “Over the last three years, MCA has facilitated over $250 million in public and private investments to address about 86,000 unserved locations,” Butcher said.

With the BEAD changes, Butcher said the MCA is ready to do the work needed to obtain the funding. “I think in the context of our DOGE environment, it’s important to note that teams like the MCA team are ready to rise to the moment and to do really hard work. But this is the kind of thing that absolutely grinds people down,” Butcher said. “It’s not just MCA, it’s this entire network of Internet service providers, their subcontractors, workforce training providers, community volunteer broadband committees. These investments are reflective of an entire ecosystem which doesn’t just entail pole-in-the-ground and attaching wires to the pole and equipment to that. It is a robust set of public-private partnerships.”

Jon is a Senior IT Reporter for Ars Technica. He covers the telecom industry, Federal Communications Commission rulemakings, broadband consumer affairs, court cases, and government regulation of the tech industry.

Engineer creates first custom motherboard for 1990s PlayStation console

The nsOne project joins a growing community of homebrew PlayStation 1 hardware developments. Other recent projects include Picostation, a Raspberry Pi Pico-based optical disc emulator (ODE) that allows PlayStation 1 consoles to load games from SD cards instead of physical discs. Other ODEs like MODE and PSIO have also become popular solutions for retrogaming collectors who play games on original hardware as optical drives age and fail.

From repair job to reverse-engineering project

To understand the classic console’s physical architecture, Brodesco sanded down an original motherboard to expose its internal layers, then cross-referenced the exposed traces with component datasheets and service manuals.

“I realized that detailed documentation on the original motherboard was either incomplete or entirely unavailable,” Brodesco explained in his Kickstarter campaign. This discovery launched what would become a comprehensive documentation effort, including tracing every connection on the board and creating multi-layer graphic representations of the circuitry.

A photo of the nsOne PlayStation motherboard. Credit: Lorentio Brodesco

Using optical scanning and manual net-by-net reverse-engineering, Brodesco recreated the PlayStation 1’s schematic in modern PCB design software. This process involved creating component symbols with accurate pin mappings and identifying—or in some cases creating—the correct footprints for each proprietary component that Sony had never publicly documented.

Brodesco also identified what he calls the “minimum architecture” required to boot the console without BIOS modifications, streamlining the design process while maintaining full compatibility.

The mock-up board shown in photos validates the footprints of chips and connectors, all redrawn from scratch. According to Brodesco, a fully routed version with complete multilayer routing and final layout is already in development.

A photo of the nsOne PlayStation motherboard. Credit: Lorentio Brodesco

As Brodesco noted on Kickstarter, his project’s goal is to “create comprehensive documentation, design files, and production-ready blueprints for manufacturing fully functional motherboards.”

Beyond repairs, the documentation and design files Brodesco is creating would preserve the PlayStation 1’s hardware architecture for future generations: “It’s a tribute to the PS1, to retro hardware, and to the belief that one person really can build the impossible.”

Apple’s Craig Federighi on the long road to the iPad’s Mac-like multitasking


Federighi talks to Ars about why the iPad’s Mac-style multitasking took so long.

iPads! Running iPadOS 26! Credit: Apple

CUPERTINO, Calif.—When Apple Senior Vice President of Software Engineering Craig Federighi introduced the new multitasking UI in iPadOS 26 at the company’s Worldwide Developers Conference this week, he did it the same way he introduced the Calculator app for the iPad last year or timers in the iPad’s Clock app the year before—with a hint of sarcasm.

“Wow,” Federighi enthuses in a lightly exaggerated tone about an hour and 19 minutes into a 90-minute presentation. “More windows, a pointier pointer, and a menu bar? Who would’ve thought? We’ve truly pulled off a mind-blowing release!”

This elicits a sensible chuckle from the gathered audience of developers, media, and Apple employees watching the keynote on the Apple Park campus, where I have grabbed myself a good-but-not-great seat to watch the largely pre-recorded keynote on a gigantic outdoor screen.

Federighi is acknowledging—and lightly poking fun at—the audience of developers, pro users, and media personalities who have been asking for years that Apple’s iPad behave more like a traditional computer. And after many incremental steps, including a big swing and partial miss with the buggy, limited Stage Manager interface a couple of years ago, Apple has finally responded to requests for Mac-like multitasking with a distinctly Mac-like interface, an improved file manager, and better support for running tasks in the background.

But if this move was so forehead-slappingly obvious, why did it take so long to get here? This is one of the questions we dug into when we sat down with Federighi and Senior Vice President of Worldwide Marketing Greg Joswiak for a post-keynote chat earlier this week.

It used to be about hardware restrictions

People have been trying to use iPads (and make a philosophical case for them) as quote-unquote real computers practically from the moment they were introduced 15 years ago.

But those early iPads lacked so much of what we expect from modern PCs and Macs, most notably robust multi-window multitasking and the ability for third-party apps to exchange data. The first iPads were almost literally iPhone internals connected to big screens, with just a fraction of the RAM and storage available in the Macs of the day; that necessitated the use of a blown-up version of the iPhone’s operating system and the iPhone’s one-full-screen-app-at-a-time interface.

“If you want to rewind all the way to the time we introduced Split View and Slide Over [in iOS 9], you have to start with the grounding that the iPad is a direct manipulation touch-first device,” Federighi told Ars. “It is a foundational requirement that if you touch the screen and start to move something, that it responds. Otherwise, the entire interaction model is broken—it’s a psychic break with your contract with the device.”

Mac users, Federighi said, were more tolerant of small latency on their devices because they were already manipulating apps on the screen indirectly, but the iPads of a decade or so ago “didn’t have the capacity to run an unlimited number of windowed apps with perfect responsiveness.”

It’s also worth noting the technical limitations of iPhone and iPad apps at the time, which up until then had mostly been designed and coded to match the specific screen sizes and resolutions of the (then-manageable) number of iDevices that existed. It simply wasn’t possible for the apps of the day to be dynamically resized as desktop windows are, because no one was coding their apps that way.

Apple’s iPad Pros—and, later, the iPad Airs—have gradually adopted hardware and software features that make them more Mac-like. Credit: Andrew Cunningham

Of course, those hardware limitations no longer exist. Apple’s iPad Pros started boosting the tablets’ processing power, RAM, and storage in earnest in the late 2010s, and Apple introduced a Microsoft Surface-like keyboard and stylus accessories that moved the iPad away from its role as a content consumption device. For years now, Apple’s faster tablets have been based on the same hardware as its slower Macs—we know the hardware can do more because Apple is already doing more with it elsewhere.

“Over time the iPad’s gotten more powerful, the screens have gotten larger, the user base has shifted into a mode where there is a little bit more trackpad and keyboard use in how many people use the device,” Federighi told Ars. “And so the stars kind of aligned to where many of the things that you traditionally do with a Mac were possible to do on an iPad for the first time and still meet iPad’s basic contract.”

On correcting some of Stage Manager’s problems

More multitasking in iPadOS 26. Credit: Apple

Apple has already tried a windowed multitasking system on modern iPads once this decade, of course, with iPadOS 16’s Stage Manager interface.

Any first crack at windowed multitasking on the iPad was going to have a steep climb. This was the first time Apple or its developers had needed to contend with truly dynamically resizable app windows in iOS or iPadOS, the first time Apple had implemented a virtual memory system on the iPad, and the first time Apple had tried true multi-monitor support. Stage Manager was in such rough shape that Apple delayed that year’s iPadOS release to keep working on it.

But the biggest problem with Stage Manager was actually that it just didn’t work on a whole bunch of iPads. You could only use it on new expensive models—if you had a new cheap model or even an older expensive model, your iPad was stuck with the older Slide Over and Split View modes that had been designed around the hardware limitations of mid-2010s iPads.

“We wanted to offer a new baseline of a totally consistent experience of what it meant to have Stage Manager,” Federighi told Ars. “And for us, that meant four simultaneous apps on the internal display and an external display with four simultaneous apps. So, eight apps running at once. And we said that’s the baseline, and that’s what it means to be Stage Manager; we didn’t want to say ‘you get Stage Manager, but you get Stage Manager-lite here’ or something like that. And so immediately that established a floor for how low we could go.”

Fixing that was one of the primary goals of the new windowing system.

“We decided this time: make everything we can make available,” said Federighi, “even if it has some nuances on older hardware, because we saw so much demand [for Stage Manager].”

That slight change in approach, combined with other behind-the-scenes optimizations, makes the new multitasking model more widely compatible than Stage Manager is. There are still limits on those devices—not to the number of windows you can open, but to how many of those windows can be active and up-to-date at once. And true multi-monitor support would remain the purview of the faster, more-expensive models.

“We have discovered many, many optimizations,” Federighi said. “We re-architected our windowing system and we re-architected the way that we manage background tasks, background processing, that enabled us to squeeze more out of other devices than we were able to do at the time we introduced Stage Manager.”

Stage Manager still exists in iPadOS 26, but as an optional extra multitasking mode that you have to choose to enable instead of the new windowed multitasking system. You can also choose to turn both multitasking systems off entirely, preserving the iPad’s traditional big-iPhone-for-watching-Netflix interface for the people who prefer it.

“iPad’s gonna be iPad”

The $349 base-model iPad is one that stands to gain the most from iPadOS 26. Credit: Andrew Cunningham

However, while the new iPadOS 26 UI takes big steps toward the Mac’s interface, the company still tries to treat them as different products with different priorities. To date, that has meant no touch screens on the Mac (despite years of rumors), and it will continue to mean that there are some Mac things that the iPad will remain unable to do.

“But we’ve looked and said, as [the iPad and Mac] come together, where on the iPad the Mac idiom for doing something, like where we put the window close controls and maximize controls, what color are they—we’ve said why not, where it makes sense, use a converged design for those things so it’s familiar and comfortable,” Federighi told Ars. “But where it doesn’t make sense, iPad’s gonna be iPad.”

There will still be limitations and frustrations when trying to fit an iPad into a Mac-shaped hole in your computing setup. While tasks can run in the background, for example, Apple only allows apps to run workloads with a definitive endpoint, things like a video export or a file transfer. System agents or other apps that perform some routine on-and-off tasks continuously in the background aren’t supported. All the demos we’ve seen so far are also on new, high-end iPad hardware, and it remains to be seen how well the new features behave on low-end tablets like the 11th-generation A16 iPad, or old 2019-era hardware like the iPad Air 3.

But it does feel like Apple has finally settled on a design that might stick and that adds capability to the iPad without wrecking its simplicity for the people who still just want a big screen for reading and streaming.

Andrew is a Senior Technology Reporter at Ars Technica, with a focus on consumer tech including computer hardware and in-depth reviews of operating systems like Windows and macOS. Andrew lives in Philadelphia and co-hosts a weekly book podcast called Overdue.

New Apple study challenges whether AI models truly “reason” through problems


Puzzle-based experiments reveal limitations of simulated reasoning, but others dispute findings.

An illustration of Tower of Hanoi from Popular Science in 1885. Credit: Public Domain

In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The researchers found results similar to those of a recent April study built on problems from the United States of America Mathematical Olympiad (USAMO), which showed that these same models achieved low scores on novel mathematical proofs.

The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.

The researchers examined what they call “large reasoning models” (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called “chain-of-thought reasoning” that ostensibly assists with solving problems in a step-by-step fashion.

To do that, they pitted the AI models against four classic puzzles—Tower of Hanoi (moving disks between pegs), checkers jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks)—scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).
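
To make the scale concrete, here is the classic recursive Tower of Hanoi algorithm, the kind of well-known solution the study’s critics point to later in this piece. This is a textbook sketch, not code from the Apple paper: an n-disk puzzle takes 2^n − 1 moves, which is where the “over a million moves” figure for 20 disks comes from.

```python
# Classic recursive Tower of Hanoi: an n-disk puzzle takes 2**n - 1 moves,
# so 20 disks require 1,048,575 moves.

def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move sequence for n disks as (from, to) pairs."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, src, dst, aux, moves)  # park n-1 disks on the spare peg
        moves.append((src, dst))            # move the largest disk
        hanoi(n - 1, aux, src, dst, moves)  # restack the n-1 disks on top
    return moves

print(len(hanoi(3)))   # 7
print(len(hanoi(20)))  # 1048575
```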

Figure 1 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers write. In other words, today’s tests only care if the model gets the right answer to math or coding problems that may already be in its training data—they don’t examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.

Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.

Known skeptics and new evidence

AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results “pretty devastating to LLMs.” While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism.

“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve—a finding that study co-lead Iman Mirzadeh argued shows “their process is not logical and intelligent.”

Figure 4 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

The Apple team found that simulated reasoning models behave differently from “standard” models (like GPT-4o) depending on puzzle difficulty. On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would “overthink” and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models’ methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given.

The researchers also identified what they call a “counterintuitive scaling limit.” As problem complexity increases, simulated reasoning models initially generate more thinking tokens but then reduce their reasoning effort beyond a threshold, despite having adequate computational resources.

The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational.

Competing interpretations emerge

However, not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. University of Toronto economist Kevin A. Bryan argued on X that the observed limitations may reflect deliberate training constraints rather than inherent inabilities.

“If you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I’ll probably give you an approximate solution or a heuristic. This is exactly what foundation models with thinking are RL’d to do,” Bryan wrote, suggesting that models are specifically trained through reinforcement learning (RL) to avoid excessive computation.

Bryan suggests that unspecified industry benchmarks show “performance strictly increases as we increase in tokens used for inference, on ~every problem domain tried,” but notes that deployed models intentionally limit this to prevent “overthinking” simple queries. This perspective suggests the Apple paper may be measuring engineered constraints rather than fundamental reasoning limits.

Figure 6 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 “immediately decides ‘generating all those moves manually is impossible,’ because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails.” Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.

Other researchers also question whether these puzzle-based evaluations are even appropriate for LLMs. Independent AI researcher Simon Willison told Ars Technica in an interview that the Tower of Hanoi approach was “not exactly a sensible way to apply LLMs, with or without reasoning,” and suggested the failures might simply reflect running out of tokens in the context window (the maximum amount of text an AI model can process) rather than reasoning deficits. He characterized the paper as potentially overblown research that gained attention primarily due to its “irresistible headline” about Apple claiming LLMs don’t reason.

The Apple researchers themselves caution against over-extrapolating the results of their study, acknowledging in their limitations section that “puzzle environments represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.” The paper also acknowledges that reasoning models show improvements in the “medium complexity” range and continue to demonstrate utility in some real-world applications.

Implications remain contested

Has the credibility of claims about AI reasoning models been completely destroyed by these two studies? Not necessarily.

What these studies may suggest instead is that the kinds of extended context reasoning hacks used by SR models may not be a pathway to general intelligence, as some have hoped. In that case, the path to more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods.

As Willison noted above, the results of the Apple study have so far been explosive in the AI community. Generative AI is a controversial topic, with many people gravitating toward extreme positions in an ongoing ideological battle over the models’ general utility. Many proponents of generative AI have contested the Apple results, while critics have latched onto the study as a definitive knockout blow for LLM credibility.

Apple’s results, combined with the USAMO findings, seem to strengthen the case made by critics like Marcus that these systems rely on elaborate pattern-matching rather than the kind of systematic reasoning their marketing might suggest. To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work. In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs.

However, that doesn’t mean these AI models are useless. Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people who use them, given an understanding of their drawbacks and confabulations. As Marcus concedes, “At least for the next decade, LLMs (with and without inference time ‘reasoning’) will continue to have their uses, especially for coding and brainstorming and writing.”

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Fair or fixed? Why Le Mans is all about “balance of performance” now.


Last year’s data plus plenty of simulation are meant to create a level playing field.

LE MANS, FRANCE – JUNE 10: The #35 Alpine Endurance Team Alpine A424 of Paul-Loup Chatin, Ferdinand Habsburg-Lothringen, and Charles Milesi sits among the 2025 Le Mans entry for a group picture on the main straight at the Circuit de la Sarthe on June 10, 2025 in Le Mans, France. Credit: Ker Robertson/Getty Images

This coming weekend will see the annual 24 Hours of Le Mans take place in France. In total, 62 cars will compete, split into three different classes. At the front of the field are the very fastest hypercars—wickedly fast prototypes that are also all hybrids, with the exception of the V12 Aston Martin Valkyries. In the middle are the pro-am LMP2s, followed by 24 GT3 cars—modified versions of performance cars that include everything from Ford Mustangs to McLaren 720s. It is racing nirvana. But with so many different makes and models of cars in the Hypercar class, some two-wheel drive, others with all-wheel drive, how do they ensure it’s a fair race?

Get ready for some acronyms

Sports car racing can be (needlessly) complicated at times. Take the Hypercar class at Le Mans. The 21 cars that will contest it are actually built to two separate rulebooks.

One, called LMH (for Le Mans Hypercar), was written by the organizers of Le Mans and the World Endurance Championship. These prototypes can be hybrids, with the electric motor on the front axle: Ferrari, Peugeot, and Toyota have all taken this route. But they don’t have to be; the Aston Martin Valkyrie already had to lose a lot of power to meet the rules, so it just relies on its big V12 to do all the work. Most of the cars are purpose-built for the race, but Aston Martin went the other route and converted a road car for racing.

The other is called LMDh (Le Mans Daytona hybrid) and hails from the US, in the rulebook written for the International Motor Sports Association’s GTP category. As the name suggests, these cars must be hybrids, and all must use the same specified motor, battery, and gearbox. LMDh cars also all need to start off using one of four approved carbon-fiber chassis (or spines), onto which automakers can style their own bodies and add their own engines. Alpine, BMW, Cadillac, and Porsche all have LMDh cars entered in this year’s Le Mans.

Convergence

In a parallel universe, the result would be two competing series, neither with many cars on the grid. But the people at IMSA get on pretty well with the organizers of Le Mans (the Automobile Club de l’Ouest or ACO) and the World Endurance Championship (the Fédération Internationale de l’Automobile, or FIA), and they decided to create a way to allow everyone to play together in the same sandbox.

“2021 [was] the first year with LMH, and at that time, the only big manufacturer involved was Toyota; Glickenhaus was there at the time, but there were not many manufacturers, let’s say, interested in that kind of category,” said Thierry Bouvet, competition director at the ACO.

“So together with IMSA, while the world was [isolating] during the pandemic, we basically wrote a set of technical regulations, LMDh which was, on paper, a little bit of a different car [with] more focus on avoiding cost escalation. After a couple of years of writing those regulations, we had an interesting process of convergence, we call it, to be able to have the LMH and LMDh racing together,” he said.

It’s not the first time that different cars have competed against each other at Le Mans. Before Hypercar, the top category was called LMP1h (Le Mans Prototype 1 hybrids), which burned brightly for a few short years but collapsed under the weight of F1-level budgets that proved too much for both Audi and Porsche, leaving just Toyota and some privateers. LMP1h used a complicated “Equivalence of Technology,” but now the approach, first perfected with the slower GT3 cars, is called Balance of Performance, or BoP.

The race starts at 10 am ET on Saturday, June 14. Credit: Ker Robertson/Getty Images

Obviously, none of the automakers behind the LMDh teams would have entered the race if they thought only LMH cars had a chance of winning overall.

“So it went through a couple of long and very interesting—in terms of technique, technically speaking—simulation working groups, where we involved all the manufacturers from both categories, and we believe we achieved… a nice working point in the middle, which allows both cars to be competitive, through the different restrictions, through BoP and so on. Now we feel that we’ve got a really fair and equitable working point,” Bouvet said. As evidence, he pointed to the fact that last year Toyota took the World Endurance Championship for constructors while Porsche’s drivers cemented the WEC drivers’ title, with Ferrari winning Le Mans.

Imma hit you with the BoP gun

The rules limit both the amount of downforce and the amount of drag that the cars can generate from their bodywork, which have to be in a 4:1 ratio; this prevents any one manufacturer from having a massive advantage in terms of cornering grip or fuel efficiency. From there, the BoP gets more granular, setting minimum weights and maximum power outputs (above and below 250 km/h), the maximum amount of energy allowed to be sent to the wheels between pit stops, as well as any extra time added to pit stops.

Weighing cars is easy, and timing them in pit stops is old hat, too. But the advance here is the torque sensors at each axle that feed back data to the race officials, letting them know exactly how much power each car is deploying to its wheels.

“We had to think of something which will work independently, whether it’s hybrid power or internal combustion engine power. Should we think about fuel only? That will only be concerning, obviously, the internal combustion engine and not do the job for the hybrid system. So, power at the wheel is a nice and elegant solution,” he said.
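
To illustrate the idea (a hedged sketch, not the actual WEC telemetry system): power at the wheel is torque multiplied by angular velocity, summed across the axles, so the measurement is agnostic about whether the power comes from the combustion engine or the hybrid motor. The torque and rpm values below are made-up illustration numbers; only the 520 kW cap comes from the figures in this article.

```python
# Sketch of the "power at the wheel" check: P = torque * angular velocity,
# summed over axles and compared against the car's BoP power cap.

import math

def wheel_power_kw(torque_nm: float, wheel_rpm: float) -> float:
    omega = wheel_rpm * 2 * math.pi / 60  # wheel speed in rad/s
    return torque_nm * omega / 1000       # watts -> kilowatts

# Hypothetical LMH car: hybrid motor on the front axle, engine on the rear.
total = wheel_power_kw(350, 1500) + wheel_power_kw(1100, 1500)
print(f"{total:.0f} kW deployed, cap 520 kW -> "
      f"{'OK' if total <= 520 else 'over the limit'}")
```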

The Aston Martin Valkyrie is the only road-going hypercar to be entered into the Hypercar category at Le Mans. Credit: James Moy Photography/Getty Images

For the World Endurance Championship, BoP is calculated on a rolling average of the last three races, with some OEMs getting a little more weight or a little less power if necessary. While the 24 Hours of Le Mans counts as a round of the WEC, it’s open to other entrants as well, and BoP works a bit differently. Instead, Bouvet and his team based this year’s BoP on data from last year’s 24-hour race, plus the simulations he mentioned. This is done to prevent teams from sandbagging in the races that lead up to their most important race of the year.

As the newest and least competitive car, the Valkyrie gets the biggest break, with a minimum weight of just 2,271 lbs (1,030 kg) and a maximum power of 697 hp (520 kW). The Toyota GR010—which won the race in 2021 and 2022—can also deploy 697 hp but at a minimum weight of 2,321 lbs (1,052 kg), more than any other car in the class.

No process is perfect, and there is little that racing fans like to complain about more than BoP, which some feel makes racing too artificial, or even fixed. You’re unlikely to hear complaints about it from competitors at Le Mans, though—criticizing BoP is not allowed in WEC, although both Porsche and Toyota have recently expressed their feelings about BoP within those strictures.

The first qualifying session for this weekend’s race took place earlier today, sorting out the 15 fastest Hypercars that will compete later this week to see who leads the pack to the start line on Saturday.

Jonathan is the Automotive Editor at Ars Technica. He has a BSc and PhD in Pharmacology. In 2014 he decided to indulge his lifelong passion for the car by leaving the National Human Genome Research Institute and launching Ars Technica’s automotive coverage. He lives in Washington, DC.

With the launch of o3-pro, let’s talk about what AI “reasoning” actually does


inquiring artificial minds want to know

New studies reveal pattern-matching reality behind the AI industry’s reasoning claims.

On Tuesday, OpenAI announced that o3-pro, a new version of its most capable simulated reasoning model, is now available to ChatGPT Pro and Team users, replacing o1-pro in the model picker. The company also reduced API pricing for o3-pro by 87 percent compared to o1-pro while cutting o3 prices by 80 percent. While “reasoning” is useful for some analytical tasks, new studies have posed fundamental questions about what the word actually means when applied to these AI systems.

We’ll take a deeper look at “reasoning” in a minute, but first, let’s examine what’s new. While OpenAI originally launched o3 (non-pro) in April, the o3-pro model focuses on mathematics, science, and coding while adding new capabilities like web search, file analysis, image analysis, and Python execution. Since these tool integrations slow response times even further (o1-pro was already slow), OpenAI recommends using the model for complex problems where accuracy matters more than speed. However, reasoning models like o3-pro do not necessarily confabulate less than “non-reasoning” AI models (they still introduce factual errors), which is a significant caveat when seeking accurate results.

Beyond the reported performance improvements, OpenAI announced a substantial price reduction for developers. In the API, o3-pro costs $20 per million input tokens and $80 per million output tokens, making it 87 percent cheaper than o1-pro. The company also reduced the price of the standard o3 model by 80 percent.

These reductions address one of the main concerns with reasoning models—their high cost compared to standard models. The original o1 cost $15 per million input tokens and $60 per million output tokens, while o3-mini cost $1.10 per million input tokens and $4.40 per million output tokens.

Why use o3-pro?

Unlike general-purpose models like GPT-4o that prioritize speed, broad knowledge, and making users feel good about themselves, o3-pro uses a chain-of-thought simulated reasoning process to devote more output tokens toward working through complex problems, making it generally better for technical challenges that require deeper analysis. But it’s still not perfect.


An OpenAI o3-pro benchmark chart. Credit: OpenAI

Measuring so-called “reasoning” capability is tricky since benchmarks can be easy to game by cherry-picking or training data contamination, but OpenAI reports that o3-pro is popular among testers, at least. “In expert evaluations, reviewers consistently prefer o3-pro over o3 in every tested category and especially in key domains like science, education, programming, business, and writing help,” writes OpenAI in its release notes. “Reviewers also rated o3-pro consistently higher for clarity, comprehensiveness, instruction-following, and accuracy.”


An OpenAI o3-pro benchmark chart. Credit: OpenAI

OpenAI shared benchmark results showing o3-pro’s reported performance improvements. On the AIME 2024 mathematics competition, o3-pro achieved 93 percent pass@1 accuracy, compared to 90 percent for o3 (medium) and 86 percent for o1-pro. The model reached 84 percent on PhD-level science questions from GPQA Diamond, up from 81 percent for o3 (medium) and 79 percent for o1-pro. For programming tasks measured by Codeforces, o3-pro achieved an Elo rating of 2748, surpassing o3 (medium) at 2517 and o1-pro at 1707.

When reasoning is simulated



It’s easy for laypeople to be thrown off by the anthropomorphic claims of “reasoning” in AI models. In this case, as with the borrowed anthropomorphic term “hallucinations,” “reasoning” has become a term of art in the AI industry that basically means “devoting more compute time to solving a problem.” It does not necessarily mean the AI models systematically apply logic or possess the ability to construct solutions to truly novel problems. This is why we at Ars Technica continue to use the term “simulated reasoning” (SR) to describe these models. They are simulating a human-style reasoning process that does not necessarily produce the same results as human reasoning when faced with novel challenges.

While simulated reasoning models like o3-pro often show measurable improvements over general-purpose models on analytical tasks, research suggests these gains come from what researchers call “inference-time compute” scaling: allocating more computational resources to traverse a model’s neural network in smaller, more directed steps. When these models use “chain-of-thought” techniques, they dedicate more computational resources to exploring connections between concepts in their neural network data. Each intermediate “reasoning” step (produced as output tokens) serves as context for the next token prediction, effectively constraining the model’s outputs in ways that tend to improve accuracy and reduce mathematical errors (though not necessarily factual ones).
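To make that mechanism concrete, here is a minimal sketch of the autoregressive loop behind chain-of-thought generation. The `sample_next_token` callable is a hypothetical stand-in for a real model call; the canned toy at the bottom exists only so the sketch runs end to end.

```python
from typing import Callable

def generate_with_reasoning(
    prompt: str,
    sample_next_token: Callable[[str], str],  # hypothetical stand-in for an LLM call
    max_tokens: int = 1024,
    stop: str = "<end>",
) -> str:
    """Each emitted 'reasoning' token is appended to the context, so it
    constrains what the model predicts next; more tokens, more compute."""
    context = prompt + "\nLet's think step by step.\n"
    pieces = []
    for _ in range(max_tokens):
        token = sample_next_token(context)
        if token == stop:
            break
        pieces.append(token)
        context += token  # the intermediate step becomes context for the next one
    return "".join(pieces)

# Toy stand-in "model" so the sketch runs:
canned = iter(["2 + 2 = 4. ", "Doubling 4 gives ", "8.", "<end>"])
print(generate_with_reasoning("What is (2 + 2) * 2?", lambda ctx: next(canned)))
```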

But fundamentally, all Transformer-based AI models are pattern-matching marvels. They borrow reasoning patterns from examples in the training data that researchers use to create them. Recent studies on Math Olympiad problems reveal that SR models still function as sophisticated pattern-matching machines—they cannot catch their own mistakes or adjust failing approaches, often producing confidently incorrect solutions without any “awareness” of errors.

Apple researchers found similar limitations when testing SR models on controlled puzzle environments. Even when provided explicit algorithms for solving puzzles like Tower of Hanoi, the models failed to execute them correctly—suggesting their process relies on pattern matching from training data rather than logical reasoning. As problem complexity increased, these models showed a “counterintuitive scaling limit,” reducing their reasoning effort despite having adequate computational resources. This aligns with the USAMO findings showing that models made basic logical errors and continued with flawed approaches even when generating contradictory results.

However, there’s some serious nuance here that you may miss if you’re reaching quickly for a pro-AI or anti-AI take. Pattern-matching and reasoning aren’t necessarily mutually exclusive. Since it’s difficult to mechanically define human reasoning at a fundamental level, we can’t definitively say whether sophisticated pattern-matching is categorically different from “genuine” reasoning or just a different implementation of similar underlying processes. The Tower of Hanoi failures are compelling evidence of current limitations, but they don’t resolve the deeper philosophical question of what reasoning actually is.


And understanding these limitations doesn’t diminish the genuine utility of SR models. For many real-world applications—debugging code, solving math problems, or analyzing structured data—pattern matching from vast training sets is enough to be useful. But as we consider the industry’s stated trajectory toward artificial general intelligence and even superintelligence, the evidence so far suggests that simply scaling up current approaches or adding more “thinking” tokens may not bridge the gap between statistical pattern recognition and what might be called generalist algorithmic reasoning.

But the technology is evolving rapidly, and new approaches are already being developed to address those shortcomings. For example, self-consistency sampling allows models to generate multiple solution paths and check for agreement, while self-critique prompts attempt to make models evaluate their own outputs for errors. Tool augmentation represents another useful direction already used by o3-pro and other ChatGPT models—by connecting LLMs to calculators, symbolic math engines, or formal verification systems, researchers can compensate for some of the models’ computational weaknesses. These methods show promise, though they don’t yet fully address the fundamental pattern-matching nature of current systems.
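As an illustration of the first of those ideas, here is a hedged sketch of self-consistency sampling; the `solve` callable is a hypothetical stand-in for a stochastic model call, and the flaky toy solver exists only so the example runs.

```python
import random
from collections import Counter
from typing import Callable

def self_consistency(solve: Callable[[str], str], prompt: str, n_samples: int = 9) -> str:
    """Sample several independent solution paths and majority-vote on the
    final answer, trading extra compute for reliability."""
    answers = [solve(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in: a "model" that answers correctly about two-thirds of the time.
def flaky_solver(prompt: str) -> str:
    return "42" if random.random() < 2 / 3 else str(random.randint(0, 99))

print(self_consistency(flaky_solver, "What is 6 * 7?"))  # usually prints 42
```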

For now, o3-pro is a better, cheaper version of what OpenAI previously provided. It’s good at solving familiar problems, struggles with truly new ones, and still makes confident mistakes. If you understand its limitations, it can be a powerful tool, but always double-check the results.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

With the launch of o3-pro, let’s talk about what AI “reasoning” actually does Read More »

scientists-built-a-badminton-playing-robot-with-ai-powered-skills

Scientists built a badminton-playing robot with AI-powered skills

It also learned fall avoidance and determined how much risk was reasonable to take given its limited speed. The robot did not attempt impossible plays that would create the potential for serious damage—it was committed, but not suicidal.

But when it finally played humans, it turned out ANYmal, as a badminton player, was amateur at best.

The major leagues

The first problem was its reaction time. An average human reacts to visual stimuli in around 0.2–0.25 seconds. Elite badminton players with trained reflexes, anticipation, and muscle memory can cut this time down to 0.12–0.15 seconds. ANYmal needed roughly 0.35 seconds after the opponent hit the shuttlecock to register trajectories and figure out what to do.

Part of the problem was poor eyesight. “I think perception is still a big issue,” Ma said. “The robot localized the shuttlecock with the stereo camera and there could be a positioning error introduced at each timeframe.” The camera also had a limited field of view, which meant the robot could see the shuttlecock for only a limited time before it had to act. “Overall, it was suited for more friendly matches—when the human player starts to smash, the success rate goes way down for the robot,” Ma acknowledged.

But his team already has some ideas on how to make ANYmal better. Reaction time can be improved by predicting the shuttlecock trajectory based on the opponent’s body position rather than waiting to see the shuttlecock itself—a technique commonly used by elite badminton or tennis players. To improve ANYmal’s perception, the team wants to fit it with more advanced hardware, like event cameras—vision sensors that register movement with ultra-low latencies in the microseconds range. Other improvements might include faster, more capable actuators.
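To illustrate the kind of prediction involved, here is a hedged sketch (not the team's published method): fit the shuttlecock's recent stereo-camera positions to a simple ballistic arc and extrapolate to where it crosses the robot's hitting height. A real system would need to model the shuttlecock's heavy aerodynamic drag, or learn the trajectory from data.

```python
import numpy as np

def predict_intercept(t, xyz, hit_height=1.2):
    """t: (N,) timestamps; xyz: (N, 3) positions in meters.
    Returns (x, y, t_hit) where the arc crosses hit_height, or None."""
    cx = np.polyfit(t, xyz[:, 0], 1)  # roughly constant velocity in x
    cy = np.polyfit(t, xyz[:, 1], 1)  # roughly constant velocity in y
    cz = np.polyfit(t, xyz[:, 2], 2)  # quadratic in z (gravity)
    # Solve z(t) = hit_height for a crossing time in the future:
    roots = np.roots(cz - np.array([0, 0, hit_height]))
    future = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real > t[-1]]
    if not future:
        return None  # no crossing ahead -- don't attempt the shot
    t_hit = min(future)
    return np.polyval(cx, t_hit), np.polyval(cy, t_hit), t_hit
```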

“I think the training framework we propose would be useful in any application where you need to balance perception and control—picking objects up, even catching and throwing stuff,” Ma suggested. Sadly, one thing that’s almost certainly off the table is taking ANYmal to major leagues in badminton or tennis. “Would I set up a company selling badminton-playing robots? Well, maybe not,” Ma said.

Science Robotics, 2025. DOI: 10.1126/scirobotics.adu3922

Scientists built a badminton-playing robot with AI-powered skills Read More »

ocean-acidification-crosses-“planetary-boundaries”

Ocean acidification crosses “planetary boundaries”

A critical measure of the ocean’s health suggests that the world’s marine systems are in greater peril than scientists had previously realized and that parts of the ocean have already reached dangerous tipping points.

A study, published Monday in the journal Global Change Biology, found that ocean acidification—the process in which the world’s oceans absorb excess carbon dioxide from the atmosphere, becoming more acidic—crossed a “planetary boundary” five years ago.

“A lot of people think it’s not so bad,” said Nina Bednaršek, one of the study’s authors and a senior researcher at Oregon State University. “But what we’re showing is that all of the changes that were projected, and even more so, are already happening—in all corners of the world, from the most pristine to the little corner you care about. We have not changed just one bay, we have changed the whole ocean on a global level.”

The new study, also authored by researchers at the UK’s Plymouth Marine Laboratory and the National Oceanic and Atmospheric Administration (NOAA), finds that by 2020 the world’s oceans were already very close to the “danger zone” for ocean acidity, and in some regions had already crossed into it.

Scientists had determined that ocean acidification enters this danger zone, or crosses this planetary boundary, when the concentration of calcium carbonate—which allows marine organisms to develop shells—falls 20 percent below pre-industrial levels. The new report puts the current global decline at about 17 percent.

“Ocean acidification isn’t just an environmental crisis, it’s a ticking time bomb for marine ecosystems and coastal economies,” said Steve Widdicombe, director of science at the Plymouth lab, in a press release. “As our seas increase in acidity, we’re witnessing the loss of critical habitats that countless marine species depend on and this, in turn, has major societal and economic implications.”

Scientists have determined that there are nine planetary boundaries that, once breached, threaten humanity’s ability to live and thrive. One of these is climate change itself, which scientists have said is already beyond humanity’s “safe operating space” because of the continued emissions of heat-trapping gases. Another is ocean acidification, also caused by burning fossil fuels.

Ocean acidification crosses “planetary boundaries” Read More »

ibm-now-describing-its-first-error-resistant-quantum-compute-system

IBM now describing its first error-resistant quantum compute system


Company is moving past focus on qubits, shifting to functional compute units.

A rendering of what IBM expects will be needed to house a Starling quantum computer. Credit: IBM

On Tuesday, IBM released its plans for building a system that should push quantum computing into entirely new territory: a system that can perform useful calculations while catching and fixing errors, and that would be utterly impossible to model using classical computing methods. The hardware, which will be called Starling, is expected to be able to perform 100 million operations without error on a collection of 200 logical qubits. And the company expects to have it available for use in 2029.

Perhaps just as significant, IBM is also committing to a detailed description of the intermediate steps to Starling. These include a number of processors that will be configured to host a collection of error-corrected qubits, essentially forming a functional compute unit. This marks a major transition for the company, as it involves moving away from talking about collections of individual hardware qubits and focusing instead on units of functional computational hardware. If all goes well, it should be possible to build Starling by chaining a sufficient number of these compute units together.

“We’re updating [our roadmap] now with a series of deliverables that are very precise,” IBM VP Jay Gambetta told Ars, “because we feel that we’ve now answered basically all the science questions associated with error correction and it’s becoming more of a path towards an engineering problem.”

New architectures

Error correction on quantum hardware involves entangling a group of qubits in a way that distributes one or more quantum bit values among them and includes additional qubits that can be used to check the state of the system. It can be helpful to think of these as data and measurement qubits. Performing weak quantum measurements on the measurement qubits produces what’s called “syndrome data,” which can be interpreted to determine whether anything about the data qubits has changed (indicating an error) and how to correct it.
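A classical toy makes the data/measurement split easier to see. The sketch below is the three-bit repetition code, not IBM's quantum code: two parity checks play the role of measurement qubits, and their combined output (the syndrome) locates a single flipped data bit without reading the data directly.

```python
def syndrome(data):
    """Two parity checks over three data bits, analogous to measurement
    qubits: each reports only whether its pair of bits agrees."""
    s1 = data[0] ^ data[1]  # check 1: parity of bits 0 and 1
    s2 = data[1] ^ data[2]  # check 2: parity of bits 1 and 2
    return (s1, s2)

# Each syndrome value points at the single bit (if any) that flipped:
CORRECTION = {(0, 0): None, (1, 0): 0, (1, 1): 1, (0, 1): 2}

def correct(data):
    flip = CORRECTION[syndrome(data)]
    if flip is not None:
        data[flip] ^= 1  # undo the detected error
    return data

print(correct([0, 1, 0]))  # bit 1 flipped; the syndrome finds and fixes it
```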

There are lots of potential ways to arrange different combinations of data and measurement qubits for this to work, each referred to as a code. But, as a general rule, the more hardware qubits committed to a code, the more robust it will be to errors, and the more logical qubits can be distributed among its hardware qubits.

Some quantum hardware, like that based on trapped ions or neutral atoms, is relatively flexible when it comes to hosting error-correction codes. The hardware qubits can be moved around so that any two can be entangled, so it’s possible to adopt a huge range of configurations, albeit at the cost of the time spent moving atoms around. IBM’s technology is quite different. It relies on qubits made of superconducting electronics laid out on a chip, with entanglement mediated by wiring that runs between qubits. The layout of this wiring is set during the chip’s manufacture, and so the chip’s design commits it to a limited number of potential error-correction codes.

Unfortunately, this wiring can also enable crosstalk between neighboring qubits, causing them to lose their state. To avoid this, existing IBM processors have their qubits wired in what they term a “heavy hex” configuration, named for its hexagonal arrangements of connections among its qubits. This has worked well to keep the error rate of its hardware down, but it also poses a challenge, since IBM has decided to go with an error-correction code that’s incompatible with the heavy hex geometry.

A couple of years back, an IBM team described a compact error-correction code called a low-density parity-check (LDPC) code. This requires a square grid of nearest-neighbor connections among its qubits, as well as wiring to connect qubits that are relatively distant on the chip. To get its chips and error-correction scheme in sync, IBM has made two key advances. The first is in its chip packaging, which now uses several layers of wiring sitting above the hardware qubits to enable all of the connections needed for the LDPC code.

We’ll see that first in a processor called Loon that’s on the company’s developmental roadmap. “We’ve already demonstrated these three things: high connectivity, long-range couplers, and couplers that break the plane [of the chip] and connect to other qubits,” Gambetta said. “We have to combine them all as a single demonstration showing that all these parts of packaging can be done, and that’s what I want to achieve with Loon.” Loon will be made public later this year.


On the left, the simple layout of the connections in a current-generation Heron processor. At right, the complicated web of connections that will be present in Loon. Credit: IBM

The second advance IBM has made is to eliminate the crosstalk that the heavy hex geometry was used to minimize, so heavy hex will be going away. “We are releasing this year a bird for near-term experiments that is a square array that has almost zero crosstalk,” Gambetta said, “and that is Nighthawk.” The more densely connected qubits cut the overhead needed to perform calculations by a factor of 15, Gambetta told Ars.

Nighthawk is a 2025 release on a parallel roadmap that you can think of as user-facing. Iterations on its basic design will be released annually through 2028, each enabling more operations without error (going from 5,000 gate operations this year to 15,000 in 2028). Each individual Nighthawk processor will host 120 hardware qubits, but 2026 will see three of them chained together and operating as a unit, providing 360 hardware qubits. That will be followed in 2027 by a machine with nine linked Nighthawk processors, boosting the hardware qubit count to over 1,000 (9 × 120 = 1,080).

Riding the bicycle

The real future of IBM’s hardware, however, will be happening over on the developmental line of processors, where talk about hardware qubit counts will become increasingly irrelevant. In a technical document released today, IBM is describing the specific LDPC code it will be using, termed a bivariate bicycle code due to some cylindrical symmetries in its details that vaguely resemble bicycle wheels. The details of the connections matter less than the overall picture of what it takes to use this error code in practice.

IBM describes two implementations of this form of LDPC code. In the first, 144 hardware qubits are arranged so that they play host to 12 logical qubits and all of the measurement qubits needed to perform error checks. The standard measure of a code’s ability to catch and correct errors is called its distance, and in this case, the distance is 12. As an alternative, they also describe a code that uses 288 hardware qubits to host the same 12 logical qubits but boost the distance to 18, meaning it’s more resistant to errors. IBM will make one of these collections of logical qubits available as a Kookaburra processor in 2026, which will use them to enable stable quantum memory.
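In the standard shorthand for quantum error-correcting codes, where [[n, k, d]] lists physical qubits, logical qubits, and distance, the two options described here work out to:

$$[[n,\ k,\ d]] = [[144,\ 12,\ 12]] \quad \text{or} \quad [[288,\ 12,\ 18]],$$

for encoding rates of 12/144 = 1/12 and 12/288 = 1/24, respectively.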

The follow-on will bundle these with a handful of additional qubits that can produce quantum states that are needed for some operations. Those, plus hardware needed for the quantum memory, form a single, functional computation unit, built on a single chip, that is capable of performing all the operations needed to implement any quantum algorithm.

That will appear with the Cockatoo chip, which will also enable multiple processing units to be linked on a single bus, allowing the logical qubit count to grow beyond 12. (The company says that one of the dozen logical qubits in each unit will be used to mediate entanglement with other units and so won’t be available for computation.) That will be followed by the first test versions of Starling, which will allow universal computations on a limited number of logical qubits spread across multiple chips.

Separately, IBM is releasing a document that describes a key component of the system that will run on classical computing hardware. Full error correction requires evaluating the syndrome data derived from the state of all the measurement qubits in order to determine the state of the logical qubits and whether any corrections need to be made. As the complexity of the logical qubits grows, the computational burden of that evaluation grows with it. If the evaluation can’t be executed in real time, it becomes impossible to perform error-corrected calculations.

To address this, IBM has developed a message-passing decoder that can perform parallel evaluations of the syndrome data. The system explores more of the solution space through a combination of randomizing the weight given to the memory of past solutions and handing any seemingly non-optimal solutions on to new instances for additional evaluation. The key thing is that IBM estimates this can be run in real time using FPGAs, ensuring that the system works.
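The sketch below is a hedged illustration of the general strategy (many randomized decoding attempts in parallel, keep the best valid answer), not IBM's actual message-passing algorithm. Given a parity-check matrix H and a syndrome s, each instance searches randomly for a low-weight error pattern e satisfying H·e = s (mod 2).

```python
import numpy as np

def randomized_decode(H, s, tries=32, steps=200, seed=None):
    """Run many randomized greedy searches for an error e with H @ e = s (mod 2);
    return the lowest-weight correction found. Illustrative only."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(tries):  # independent randomized instances
        e = np.zeros(H.shape[1], dtype=int)
        for _ in range(steps):
            residual = (H @ e - s) % 2
            if not residual.any():
                break
            # Flip a random bit touching a randomly chosen unsatisfied check:
            check = rng.choice(np.flatnonzero(residual))
            e[rng.choice(np.flatnonzero(H[check]))] ^= 1
        if not ((H @ e - s) % 2).any() and (best is None or e.sum() < best.sum()):
            best = e.copy()
    return best

# Demo on the three-bit repetition code from above:
H = np.array([[1, 1, 0], [0, 1, 1]])
s = np.array([1, 1])            # syndrome produced by a flip on bit 1
print(randomized_decode(H, s))  # typically [0 1 0], the weight-1 fix
```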

A quantum architecture

There are a lot more details beyond those, as well. Gambetta described the linkage between each computational unit—IBM is calling it a Universal Bridge—which requires one microwave cable for each code distance of the logical qubits being linked. (In other words, a distance 12 code would need 12 microwave-carrying cables to connect each chip.) He also said that IBM is developing control hardware that can operate inside the refrigeration hardware, based on what they’re calling “cold CMOS,” which is capable of functioning at 4 Kelvin.

The company is also releasing renderings of what it expects Starling to look like: a series of dilution refrigerators, all connected by a single pipe that contains the Universal Bridge. “It’s an architecture now,” Gambetta said. “I have never put details in the roadmap that I didn’t feel we could hit, and now we’re putting a lot more details.”

The striking thing to me about this is that it marks a shift away from a focus on individual qubits, their connectivity, and their error rates. The hardware error rates are now good enough (about 4 × 10⁻⁴) for this to work, although Gambetta felt that a few more improvements should be expected. And connectivity will now be directed exclusively toward creating a functional computational unit.

That said, there’s still a lot of space beyond Starling on IBM’s roadmap. The 200 logical qubits it promises will be enough to handle some problems, but not enough to perform the complex algorithms needed to do things like break encryption. That will need to wait for something closer to Blue Jay, a 2033 system that IBM expects will have 2,000 logical qubits. And, as of right now, it’s the only thing listed beyond Starling.

Photo of John Timmer

John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.

IBM now describing its first error-resistant quantum compute system Read More »