CUDA


Nvidia’s 50-series cards drop support for PhysX, impacting older games

Nvidia’s PhysX offerings to developers didn’t always generate warm feelings. As part of its broader GameWorks package, PhysX was cited as one of the reasons The Witcher 3 ran at notably sub-optimal levels at launch. Protagonist Geralt’s hair, rendered in PhysX-powered HairWorks, was a burden on some chipsets.

PhysX started appearing in general game engines, like Unity 5, and was eventually open-sourced, first in limited computer and mobile form, then more broadly. As an application wrapped up in Nvidia’s 32-bit CUDA API and platform, the PhysX engine had a built-in shelf life. Now the expiration date is known, and it is conditional on buying into Nvidia’s 50-series video cards—whenever they approach reasonable human prices.

Dune buggy in Borderlands 3, dodging rockets shot by a hovering attack craft just over a sand dune.

See that smoke? It’s from Sweden, originally. Credit: Gearbox/Take 2

The real dynamic particles were the friends we made…

Nvidia noted in mid-January that 32-bit applications cannot be developed or debugged on the latest versions of its CUDA toolkit. They will still run on cards that predate the 50 series. Technically, you could also keep an older card installed on your system for compatibility, which is real dedication to early-2010s-era particle physics.

Technically, a 64-bit game could still support PhysX on Nvidia’s newest GPUs, but the heyday of PhysX, as a stand-alone technology switched on in game settings, tended to coincide with the 32-bit computing era.

If you load up a 32-bit game now with PhysX enabled (or forced in a config file) and a 50-series Nvidia GPU installed, there’s a good chance the physics work will be passed to the CPU instead of the GPU, likely bottlenecking the game and steeply lowering frame rates. Of course, turning off PhysX entirely raises frame rates even higher than running it with native GPU support.
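To get a rough sense of why a CPU fallback tanks frame rates, it helps to look at the frame-budget arithmetic. The sketch below uses made-up millisecond figures, not measurements from any particular game or card, and assumes physics and rendering run serially each frame.

```python
# Illustrative arithmetic only; the millisecond figures are hypothetical,
# not benchmarks of any real game or GPU.
def fps(render_ms, physics_ms):
    """Approximate frame rate when rendering and physics run serially each frame."""
    return 1000.0 / (render_ms + physics_ms)

print(fps(render_ms=12.0, physics_ms=2.0))   # GPU-accelerated PhysX: ~71 fps
print(fps(render_ms=12.0, physics_ms=30.0))  # CPU fallback: ~24 fps
print(fps(render_ms=12.0, physics_ms=0.0))   # PhysX off entirely: ~83 fps
```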

Forcing Borderlands 2 to keep using PhysX meant the game “runs terrible,” noted one Redditor, even if the dust clouds and flapping cloth strips looked interesting. Other games with PhysX baked in, as listed by ResetEra completists, include Metro 2033, Assassin’s Creed IV: Black Flag, and the 2013 Star Trek game.

Commenters on Reddit and ResetEra note that many of the listed games had performance issues with PhysX long before Nvidia forced the feature to be either turned off or offloaded to the CPU. For some games, however, PhysX enabled destructible environments, “dynamic bank notes” and “posters” (in the Arkham games), fluid simulations, and base gameplay physics.

Anyone who works in, or cares about, game preservation has always had their work cut out for them. But it’s a particularly tough challenge to see certain aspects of a game’s operation lost to the forward march of the CUDA platform, something that’s harder to explain than a scratched CD or Windows compatibility.



The next Nvidia driver makes even more GPUs “open,” in a specific, quirky way

You know open when you see it

You can’t see inside the firmware, but more open code can translate it for you.

GeForce RTX 4060 cards on display in a case. Credit: Getty Images

You have to read the headline on Nvidia’s latest GPU announcement slowly, parsing each clause as it arrives.

“Nvidia transitions fully” sounds like real commitment, a burn-the-boats call. “Towards open-source GPU,” yes, evoking the company’s “first step” announcement a little over two years ago, so this must be progress, right? But, back up a word here, then finish: “GPU kernel modules.”

So, Nvidia has “achieved equivalent or better application performance with our open-source GPU kernel modules,” and added some new capabilities to them. And now most of Nvidia’s modern GPUs will default to using open source GPU kernel modules, starting with driver release R560, with dual GPL and MIT licensing. But Nvidia has moved most of its proprietary logic into a closed-source firmware blob. The parts of Nvidia’s GPUs that interact with the broader Linux system are open, but the user-space drivers and firmware are none of your or the OSS community’s business.

Is it better than what existed before? Certainly. AMD and Intel have maintained open source GPU drivers, in both the kernel and user space, for years, though also with proprietary firmware. This brings Nvidia a bit closer to the Linux community and allows for community debugging and contribution. There’s no indication that Nvidia aims to go further with its open source moves, however, and its modules remain outside the main kernel, packaged up for users to install themselves.

Not all GPUs will be able to use the open source modules: chips from the Maxwell, Pascal, and Volta lines must stick with the proprietary driver; GPUs from the Turing, Ampere, Ada Lovelace, and Hopper architectures are recommended to switch to the open bits; and Grace Hopper and Blackwell units must do so.

As noted by Hector Martin, a developer on the Asahi Linux distribution, at the time of the first announcement, this shift makes it easier to sandbox closed-source code while using Nvidia hardware. But the net amount of closed-off code is about the same as before.

Nvidia’s blog post has details on how to integrate its open kernel modules onto various systems, including CUDA setups.



What kind of bug would make machine learning suddenly 40% worse at NetHack?

Large Moon Models (LMMs)

One day, a roguelike-playing system just kept biffing it, for celestial reasons.

Moon rendered in ASCII text. Credit: Aurich Lawson

Members of the Legendary Computer Bugs Tribunal, honored guests, if I may have your attention? I would, humbly, submit a new contender for your esteemed judgment. You may or may not find it novel, you may even deign to call it a “bug,” but I assure you, you will find it entertaining.

Consider NetHack. It is one of the all-time roguelike games, and I mean that in the more strict sense of that term. The content is procedurally generated, deaths are permanent, and the only thing you keep from game to game is your skill and knowledge. I do understand that the only thing two roguelike fans can agree on is how wrong the third roguelike fan is in their definition of roguelike, but, please, let us move on.

NetHack is great for machine learning…

Being a difficult game full of consequential choices and random challenges, as well as a “single-agent” game that can be generated and played at lightning speed on modern computers, NetHack is great for those working in machine learning—or imitation learning, actually, as detailed in Jens Tuyls’ paper on how compute scaling affects single-agent game learning. Using Tuyls’ model of expert NetHack behavior, Bartłomiej Cupiał and Maciej Wołczyk trained a neural network to play and improve itself using reinforcement learning.
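The article doesn’t name the tooling, but work in this vein typically drives NetHack through the open-source NetHack Learning Environment (nle), which wraps the game as a fast Gym environment. A minimal sketch of that interaction loop, assuming that package and the classic Gym API it was built against (and with a random policy standing in for a trained one), looks roughly like this:

```python
# Minimal sketch, assuming the NetHack Learning Environment (nle) package and
# the classic 4-tuple Gym step API; this is not the researchers' actual code.
import gym
import nle  # importing registers the NetHack environments with Gym

env = gym.make("NetHackScore-v0")  # reward here tracks the in-game score
obs = env.reset()

done, episode_return, steps = False, 0.0, 0
while not done and steps < 1000:
    action = env.action_space.sample()  # a trained policy would choose this
    obs, reward, done, info = env.step(action)
    episode_return += reward
    steps += 1

print(f"Episode ended after {steps} steps with score-based return {episode_return}")
env.close()
```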

By mid-May of this year, the two had their model consistently scoring 5,000 points by their own metrics. Then, on one run, the model suddenly got worse, on the order of 40 percent. It scored 3,000 points. Machine learning generally, gradually, goes in one direction with these types of problems. It didn’t make sense.

Cupiał and Wołczyk tried quite a few things: reverting their code, restoring their entire software stack from a Singularity backup, and rolling back their CUDA libraries. The result? 3,000 points. They rebuilt everything from scratch, and it was still 3,000 points.

NetHack, played by a regular human.

… except on certain nights

As detailed in Cupiał’s X (formerly Twitter) thread, this was several hours of confused trial and error by him and Wołczyk. “I am starting to feel like a madman. I can’t even watch a TV show constantly thinking about the bug,” Cupiał wrote. In desperation, he asked model author Tuyls if he knew what could be wrong. He woke up in Kraków to an answer:

“Oh yes, it’s probably a full moon today.”

In NetHack, the game in which the DevTeam has thought of everything, if the game detects from your system clock that it should be a full moon, it will generate a message: “You are lucky! Full moon tonight.” A full moon imparts a few player benefits: a single point added to Luck, and werecreatures mostly kept to their animal forms.
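NetHack’s real check lives in its C source and derives the phase from the date; the Python below is only a rough illustration of the idea of computing a moon phase from the system clock, not NetHack’s actual algorithm, and the reference epoch and tolerance are approximations.

```python
# Rough illustration of a clock-based full moon check; not NetHack's algorithm.
import datetime

SYNODIC_MONTH = 29.530588  # mean length of a lunar cycle, in days
# A well-known new moon (2000-01-06 18:14 UTC) used as a reference epoch.
KNOWN_NEW_MOON = datetime.datetime(2000, 1, 6, 18, 14, tzinfo=datetime.timezone.utc)

def moon_phase(now=None):
    """Phase as a fraction of the synodic month: 0.0 is new, 0.5 is full."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    days = (now - KNOWN_NEW_MOON).total_seconds() / 86400.0
    return (days % SYNODIC_MONTH) / SYNODIC_MONTH

def is_full_moon(now=None, tolerance=0.02):
    """True when the phase is within roughly 14 hours of full."""
    return abs(moon_phase(now) - 0.5) < tolerance

if is_full_moon():
    print("You are lucky! Full moon tonight.")
```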

It’s an easier game, all things considered, so why would the learning agent’s score be lower? Its training data simply never covered full-moon conditions, so a branching series of decisions likely leads to lesser outcomes, or just confusion. It was indeed a full moon in Kraków when the 3,000-ish scores started showing up. What a terrible night to have a learning model.

Of course, “score” is not a real metric for success in NetHack, as Cupiał himself noted. Ask a model to get the best score, and it will farm the heck out of low-level monsters because it never gets bored. “Finding items required for [ascension] or even [just] doing a quest is too much for pure RL agent,” Cupiał wrote. Another neural network, AutoAscend, does a better job of progressing through the game, but “even it can only solve sokoban and reach mines end,” Cupiał notes.

Is it a bug?

I submit to you that, although NetHack responded to the full moon in its intended way, this quirky, very hard-to-fathom stop on a machine-learning journey was indeed a bug and a worthy one in the pantheon. It’s not a Harvard moth, nor a 500-mile email, but what is?

Because the team used Singularity to back up and restore their stack, they inadvertently carried forward the machine time and resulting bug each time they tried to solve it. The machine’s resulting behavior was so bizarre, and seemingly based on unseen forces, that it drove a coder into fits. And the story has a beginning, a climactic middle, and a denouement that teaches us something, however obscure.

The NetHack Lunar Learning Bug is, I submit, quite worth memorializing. Thank you for your time.
