Author name: DJ Henderson


Twirling body horror in gymnastics video exposes AI’s flaws


The slithy toves did gyre and gimble in the wabe

Nonsensical jabberwocky movements created by OpenAI’s Sora are typical for current AI-generated video, and here’s why.

A still image from an AI-generated video of an ever-morphing synthetic gymnast. Credit: OpenAI / Deedy

On Wednesday, a video from OpenAI’s newly launched Sora AI video generator went viral on social media, featuring a gymnast who sprouts extra limbs and briefly loses her head during what appears to be an Olympic-style floor routine.

As it turns out, the nonsensical synthesis errors in the video—what we like to call “jabberwockies”—hint at technical details about how AI video generators work and how they might get better in the future.

But before we dig into the details, let’s take a look at the video.

An AI-generated video of an impossible gymnast, created with OpenAI Sora.

In the video, we see a view of what looks like a floor gymnastics routine. The subject of the video flips and flails as new legs and arms rapidly and fluidly emerge and morph out of her twirling and transforming body. At one point, about 9 seconds in, she loses her head, and it reattaches to her body spontaneously.

“As cool as the new Sora is, gymnastics is still very much the Turing test for AI video,” wrote venture capitalist Deedy Das when he originally shared the video on X. The video inspired plenty of reaction jokes, such as this reply to a similar post on Bluesky: “hi, gymnastics expert here! this is not funny, gymnasts only do this when they’re in extreme distress.”

We reached out to Das, and he confirmed that he generated the video using Sora. He also provided the prompt, which was very long and split into four parts, generated by Anthropic’s Claude, using complex instructions like “The gymnast initiates from the back right corner, taking position with her right foot pointed behind in B-plus stance.”

“I’ve known for the last 6 months having played with text to video models that they struggle with complex physics movements like gymnastics,” Das told us in a conversation. “I had to try it [in Sora] because the character consistency seemed improved. Overall, it was an improvement because previously… the gymnast would just teleport away or change their outfit mid flip, but overall it still looks downright horrifying. We hoped AI video would learn physics by default, but that hasn’t happened yet!”

So what went wrong?

When examining how the video fails, you must first consider how Sora “knows” how to create anything that resembles a gymnastics routine. During the training phase, when the Sora model was created, OpenAI fed example videos of gymnastics routines (among many other types of videos) into a specialized neural network that associates the progression of images with text-based descriptions of them.

That type of training is a distinct phase that happens once before the model’s release. Later, when the finished model is running and you give a video-synthesis model like Sora a written prompt, it draws upon statistical associations between words and images to produce a predictive output. It’s continuously making next-frame predictions based on the last frame of the video. But Sora has another trick for attempting to preserve coherency over time. “By giving the model foresight of many frames at a time,” reads OpenAI’s Sora System Card, “we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily.”
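
To make that mechanism concrete, here is a toy sketch of autoregressive next-frame prediction over a sliding context window. It is purely illustrative (a linear extrapolator stands in for the learned network, and the window size is invented), not OpenAI’s actual architecture:

```python
import numpy as np

def predict_next_frame(context: np.ndarray) -> np.ndarray:
    """Stand-in for the learned model: a real video generator would be a
    large neural network; here we just extrapolate pixel values linearly
    from the last two frames in the context window."""
    return context[-1] + (context[-1] - context[-2])

def generate_video(seed_frames: np.ndarray, num_frames: int, window: int = 8) -> np.ndarray:
    """Autoregressive rollout: each new frame is predicted from a sliding
    window of prior frames. A wider window gives the model more 'foresight,'
    which is what helps keep a subject consistent across frames."""
    frames = list(seed_frames)
    for _ in range(num_frames):
        context = np.stack(frames[-window:])
        frames.append(predict_next_frame(context))
    return np.stack(frames)

# Two 4x4 grayscale seed frames; the rollout extrapolates the rest.
seed = np.stack([np.zeros((4, 4)), np.full((4, 4), 0.1)])
print(generate_video(seed, num_frames=16).shape)  # (18, 4, 4)
```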


A still image from a moment where the AI-generated gymnast loses her head. It soon reattaches to her body. Credit: OpenAI / Deedy

Maybe not quite solved yet. In this case, rapidly moving limbs prove a particular challenge when the model attempts to predict the next frame properly. The result is an incoherent amalgam of gymnastics footage that shows the same gymnast performing running flips and spins, but Sora doesn’t know the correct order in which to assemble them. It’s drawing on statistical averages of wildly different body movements from its relatively limited training data of gymnastics videos, which also likely did not include limb-level precision in its descriptive metadata.

Sora doesn’t know anything about physics or how the human body should work, either. It’s drawing upon statistical associations between pixels in the videos in its training dataset to predict the next frame, with a little bit of look-ahead to keep things more consistent.

This problem is not unique to Sora. All AI video generators can produce wildly nonsensical results when your prompts reach too far past their training data, as we saw earlier this year when testing Runway’s Gen-3. In fact, we ran some gymnast prompts through Hunyuan Video, the latest open-source AI video model that may rival Sora in some ways, and it produced similar twirling, morphing results, seen below. And we used a much simpler prompt than Das did with Sora.

An example from open source Chinese AI model Hunyuan Video with the prompt, “A young woman doing a complex floor gymnastics routine at the olympics, featuring running and flips.”

AI models based on transformer technology are fundamentally imitative in nature. They’re great at transforming one type of data into another type or morphing one style into another. What they’re not great at (yet) is producing coherent generations that are truly original. So if you happen to provide a prompt that closely matches a training video, you might get a good result. Otherwise, you may get madness.

As we wrote about image-synthesis model Stable Diffusion 3’s body horror generations earlier this year, “Basically, any time a user prompt homes in on a concept that isn’t represented well in the AI model’s training dataset, the image-synthesis model will confabulate its best interpretation of what the user is asking for. And sometimes that can be completely terrifying.”

For the engineers who make these models, success in AI video generation quickly becomes a question of how many examples (and how much training) you need before the model can generalize enough to produce convincing and coherent results. It’s also a question of metadata quality—how accurately the videos are labeled. In this case, OpenAI used an AI vision model to describe its training videos, which helped improve quality, but apparently not enough—yet.
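
In spirit, that labeling step looks something like the sketch below: a vision model captions sampled frames, and the merged captions become the clip’s training metadata. Everything here (function names, the sample caption) is a hypothetical placeholder, not OpenAI’s tooling; the point is that if captions stay this coarse, the model gets no limb-level signal to learn from.

```python
from dataclasses import dataclass

@dataclass
class LabeledClip:
    path: str
    caption: str

def caption_frame(frame) -> str:
    """Placeholder for a vision-language model call that describes an image.
    Note the caption is coarse: it never mentions individual limbs."""
    return "a gymnast mid-flip during a floor routine"

def label_clip(path: str, frames: list) -> LabeledClip:
    # Caption a handful of sampled frames and merge the unique descriptions
    # into one text label stored alongside the video for training.
    captions = sorted({caption_frame(f) for f in frames})
    return LabeledClip(path=path, caption="; ".join(captions))

print(label_clip("gymnastics_001.mp4", frames=[None, None, None]))
```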

We’re looking at an AI jabberwocky in action

In a way, the type of generation failure in the gymnast video is a form of confabulation (or hallucination, as some call it), but it’s even worse because it’s not coherent. So instead of calling it a confabulation, which is a plausible-sounding fabrication, we’re going to lean on a new term, “jabberwocky,” which Dictionary.com defines as “a playful imitation of language consisting of invented, meaningless words; nonsense; gibberish,” taken from Lewis Carroll’s nonsense poem of the same name. Imitation and nonsense, you say? Check and check.

We’ve covered jabberwockies in AI video before with people mocking Chinese video-synthesis models, a monstrously weird AI beer commercial, and even Will Smith eating spaghetti. They’re a form of misconfabulation where an AI model completely fails to produce a plausible output. This will not be the last time we see them, either.

How could AI video models get better and avoid jabberwockies?

In our coverage of Gen-3 Alpha, we called the threshold where you get a level of useful generalization in an AI model the “illusion of understanding,” where training data and training time reach a critical mass that produces good enough results to generalize across enough novel prompts.

One of the key reasons language models like OpenAI’s GPT-4 impressed users was that they finally reached a size where they had absorbed enough information to give the appearance of genuinely understanding the world. With video synthesis, achieving this same apparent level of “understanding” will require not just massive amounts of well-labeled training data but also the computational power to process it effectively.

AI boosters hope that these current models represent one of the key steps on the way to something like truly general intelligence (often called AGI) in text, or in AI video, what OpenAI and Runway researchers call “world simulators” or “world models” that somehow encode enough physics rules about the world to produce any realistic result.

Judging by the morphing alien shoggoth gymnast, that may still be a ways off. Still, it’s early days in AI video generation, and judging by how quickly AI image-synthesis models like Midjourney progressed from crude abstract shapes into coherent imagery, it’s likely video synthesis will have a similar trajectory over time. Until then, enjoy the AI-generated jabberwocky madness.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.


Generating power with a thin, flexible thermoelectric film

The No. 1 nuisance with smartphones and smartwatches is that we need to charge them every day. As warm-blooded creatures, however, we generate heat all the time, and that heat can be converted into electricity for some of the electronic gadgetry we carry.

Flexible thermoelectric devices, or F-TEDs, can convert thermal energy into electric power. The problem is that F-TEDs haven’t been flexible enough to comfortably wear or efficient enough to power even a smartwatch. They have also been very expensive to make.

But now, a team of Australian researchers thinks it has finally achieved a breakthrough that might get F-TEDs off the ground.

“The power generated by the flexible thermoelectric film we have created would not be enough to charge a smartphone but should be enough to keep a smartwatch going,” said Zhi-Gang Chen, a professor at Queensland University of Technology in Brisbane, Australia. Does that mean we have reached a point where it would be possible to make a thermoelectric Apple Watch band that could keep the watch charged all the time? “It would take some industrial engineering and optimization, but we can definitely achieve a smartwatch band like that,” Chen said.

Manufacturing heaven

Thermoelectric generators producing enough power to run something like an Apple Watch have, so far, been made with rigid bulk materials. The obvious problem with them was that nobody would want to wear a metal slab on their wrist or run a power cable from anywhere else to their watch. Flexible thermoelectric devices, on the other hand, were perfectly wearable but offered efficiencies that made them good for low-power health-monitoring electronics rather than more power-hungry hardware like smartwatches.

Back in 2021, generating 35 microwatts per square centimeter in a wristband worn during a typical walk outside was impressive enough to land your research paper in Nature. Now, Chen and his colleagues have made a flexible thermoelectric device that performs over 34 times better at room temperature. “To the best of our knowledge, we hold a current record in this field,” Chen says.
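
Some back-of-envelope arithmetic shows why that multiplier matters for wearables. The band area and the watch’s average power draw below are my own assumptions for illustration, not figures from the paper:

```python
baseline_uW_per_cm2 = 35   # the 2021 wristband figure cited above
improvement = 34           # "over 34 times better" at room temperature
band_area_cm2 = 15         # assumed skin-contact area of a watch band

power_mW = baseline_uW_per_cm2 * improvement * band_area_cm2 / 1000
print(f"{power_mW:.1f} mW")  # ~17.9 mW
# A smartwatch averages somewhere in the tens of milliwatts over a day,
# so milliwatt-scale harvesting is plausibly enough to keep one topped up.
```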


Russia takes unusual route to hack Starlink-connected devices in Ukraine

“Microsoft assesses that Secret Blizzard either used the Amadey malware as a service (MaaS) or accessed the Amadey command-and-control (C2) panels surreptitiously to download a PowerShell dropper on target devices,” Microsoft said. “The PowerShell dropper contained a Base64-encoded Amadey payload appended by code that invoked a request to Secret Blizzard C2 infrastructure.”

The ultimate objective was to install Tavdig, a backdoor Secret Blizzard used to conduct reconnaissance on targets of interest. The Amadey sample Microsoft uncovered collected information from device clipboards and harvested passwords from browsers. It would then go on to install a custom reconnaissance tool that was “selectively deployed to devices of further interest by the threat actor—for example, devices egressing from STARLINK IP addresses, a common signature of Ukrainian front-line military devices.”

When Secret Blizzard assessed a target was of high value, it would then install Tavdig to collect information, including “user info, netstat, and installed patches and to import registry settings into the compromised device.”

Earlier in the year, Microsoft said, company investigators observed Secret Blizzard using tools belonging to Storm-1837 to also target Ukrainian military personnel. Microsoft researchers wrote:

In January 2024, Microsoft observed a military-related device in Ukraine compromised by a Storm-1837 backdoor configured to use the Telegram API to launch a cmdlet with credentials (supplied as parameters) for an account on the file-sharing platform Mega. The cmdlet appeared to have facilitated remote connections to the account at Mega and likely invoked the download of commands or files for launch on the target device. When the Storm-1837 PowerShell backdoor launched, Microsoft noted a PowerShell dropper deployed to the device. The dropper was very similar to the one observed during the use of Amadey bots and contained two base64 encoded files containing the previously referenced Tavdig backdoor payload (rastls.dll) and the Symantec binary (kavp.exe).

As with the Amadey bot attack chain, Secret Blizzard used the Tavdig backdoor loaded into kavp.exe to conduct initial reconnaissance on the device. Secret Blizzard then used Tavdig to import a registry file, which was used to install and provide persistence for the KazuarV2 backdoor, which was subsequently observed launching on the affected device.

Although Microsoft did not directly observe the Storm-1837 PowerShell backdoor downloading the Tavdig loader, based on the temporal proximity between the execution of the Storm-1837 backdoor and the observation of the PowerShell dropper, Microsoft assesses that it is likely that the Storm-1837 backdoor was used by Secret Blizzard to deploy the Tavdig loader.

Wednesday’s post comes a week after both Microsoft and Lumen’s Black Lotus Labs reported that Secret Blizzard co-opted the tools of a Pakistan-based threat group tracked as Storm-0156 to install backdoors and collect intel on targets in South Asia. Microsoft first observed the activity in late 2022. In all, Microsoft said, Secret Blizzard has used the tools and infrastructure of at least six other threat groups in the past seven years.


iOS 18.2, macOS 15.2 updates arrive today with image and emoji generation

Apple has announced that it will be releasing the iOS 18.2, iPadOS 18.2, and macOS Sequoia 15.2 updates to the public later this afternoon, following weeks of beta testing for developers and users. As with iOS 18.1, the headlining features are new additions to Apple Intelligence, mainly the image-generation capabilities: Image Playground for general images, and “Genmoji” for making custom images in the style of Apple’s built-in Unicode-based emoji characters.

Other AI features include “Image Wand,” which will take sketched images from the Notes app and turn them into a “polished image” using context clues from other notes; and ChatGPT integration for the Writing Tools feature.

The updates also include a long list of bug fixes and security updates, for those who don’t care about Apple Intelligence. Safari gets better data importing and exporting support, an HTTPS Priority feature that “upgrades URLs to HTTPS whenever possible,” and a download status indicator for iPhones with a Dynamic Island. Mail in iOS offers to automatically sort messages to bring important ones to the top of your inbox. There are also various tweaks and improvements for the Photos, Podcasts, Voice Memos, and Stocks apps, while the Weather app in macOS can optionally display the weather in your menu bar.


NASA believes it understands why Ingenuity crashed on Mars

Eleven months after the Ingenuity helicopter made its final flight on Mars, engineers and scientists at NASA and a private company that helped build the flying vehicle said they have identified what probably caused it to crash on the surface of Mars.

In short, the helicopter’s on-board navigation sensors were unable to discern enough features in the relatively smooth surface of Mars to determine its position, so when it touched down, it did so moving horizontally. This caused the vehicle to tumble, snapping off all four of the helicopter’s blades.

Delving into the root cause

It is not easy to conduct a forensic analysis like this on Mars, which is typically about 100 million miles from Earth. Ingenuity carried no black box on board, so investigators have had to piece together their findings from limited data and imagery.

“While multiple scenarios are viable with the available data, we have one we believe is most likely: Lack of surface texture gave the navigation system too little information to work with,” said Ingenuity’s first pilot, Håvard Grip of NASA’s Jet Propulsion Laboratory, in a news release.

A team from NASA and AeroVironment, a company that specializes in unmanned aerial vehicles, started by looking at the terrain Ingenuity was flying over during its 72nd flight, on January 18 of this year. The helicopter’s navigation system tracked visual features on the surface using a downward-looking camera. During its initial flights, Ingenuity was able to discern pebbles and other features to determine its position. But nearly three years later, Ingenuity was flying in a region of Jezero Crater filled with steep, relatively featureless sand ripples.
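
The failure mode is easy to reproduce with any feature-based tracking front end. Below is a minimal sketch using OpenCV’s corner detector and Lucas-Kanade optical flow; it is generic visual-odometry code with assumed thresholds, not Ingenuity’s flight software:

```python
import cv2
import numpy as np

def estimate_motion(prev_gray: np.ndarray, curr_gray: np.ndarray,
                    min_features: int = 30):
    """Detect corners in the previous frame and track them into the current
    frame. Returns the mean pixel displacement, or None when the terrain is
    too featureless to navigate by (Ingenuity's problem over smooth sand)."""
    corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                      qualityLevel=0.01, minDistance=7)
    if corners is None or len(corners) < min_features:
        return None  # not enough surface texture to lock onto
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, corners, None)
    good = status.ravel() == 1
    if good.sum() < min_features:
        return None  # too few tracks survived; position estimate unreliable
    # Average feature displacement approximates horizontal motion over ground.
    return (nxt[good] - corners[good]).mean(axis=0)
```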


Startup will brick $800 emotional support robot for kids without refunds

In addition to the robot being bricked, Embodied noted that warranties, repair services, the corresponding parent app and guides, and support staff will no longer be accessible.

“Unable to offer refunds”

Embodied said it is “unable” to offer most Moxie owners refunds due to its “financial situation and impending dissolution.” The potential exception is for people who bought a Moxie within 30 days. For those customers, Embodied said that “if the company or its assets are sold, we will do our best to prioritize refunds for purchases,” but it emphasized that this is not a guarantee.

Embodied also acknowledged complications for those who acquired the expensive robot through a third-party lender. Embodied advised such customers to contact their lender, but it’s possible that some will end up paying interest on a toy that no longer works.

Embodied said it’s looking for another company to buy Moxie. Should that happen, the new company will receive Embodied customer data and determine how it may use it, according to Embodied’s Terms of Service. Otherwise, Embodied said it “securely” erases user data “in accordance with our privacy policy and applicable law,” which includes deleting personally identifiable information from Embodied systems.

Another smart gadget bites the dust

Currently, there’s some hope that Moxies can be resurrected. Things look grim for Moxie owners, but we’ve seen failed smart device companies, like Insteon, be resurrected before. It’s also possible that someone will release an open-source version of the product, like the one made for Spotify Car Thing, which Spotify officially bricked today.

But the short-lived, expensive nature of Moxie is exactly why some groups, like right-to-repair activists, are pushing the FTC to more strongly regulate smart devices, particularly when it comes to disclosure and commitments around software support. With smart gadget makers trying to determine how to navigate challenging economic landscapes, the owners of various types of smart devices—from AeroGarden indoor gardening systems to Snoo bassinets—have had to deal with the consequences, including broken devices and paywalled features. Last month, the FTC noted that smart device manufacturers that don’t commit to software support may be breaking the law.

For Moxie owners, disappointment doesn’t just come from wasted money and e-waste creation but also from the pain of giving a child a tech “companion” to grow with and then having it suddenly taken away.


Location data firm helps police find out when suspects visited their doctor

The intake form “was included in an email thread between a Fog representative and Bryan Kimbell, chief human trafficking investigator at the Office of the Attorney General in Georgia,” according to 404 Media.

Services like Fog Data Science have triggered concerns about how police might use location tracking to prosecute abortions. “For several thousand dollars annually, the software lets police trace unique borders around large, customized regions to generate a list of devices in the area. Police can use Fog Reveal to geofence entire buildings or street blocks—like the area surrounding an abortion clinic—and get information on devices used within and surrounding those buildings to identify suspects,” Ars wrote in November 2022.
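
Mechanically, that kind of geofence query reduces to a point-in-polygon test over a store of device pings. Here is a minimal sketch using the shapely library, with invented coordinates and advertising IDs; it is not Fog’s actual system:

```python
from shapely.geometry import Point, Polygon

# Invented example: a polygon drawn around a single city block (lon, lat).
fence = Polygon([(-84.390, 33.749), (-84.388, 33.749),
                 (-84.388, 33.751), (-84.390, 33.751)])

# Hypothetical (device_id, lon, lat) pings from an ad-tech location feed.
pings = [("ad-id-1", -84.389, 33.750),
         ("ad-id-2", -84.400, 33.760)]

# Devices whose pings fall inside the fenced area.
hits = [dev for dev, lon, lat in pings if fence.contains(Point(lon, lat))]
print(hits)  # ['ad-id-1']
```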

The EFF’s 2022 investigation found that Fog obtained data from the firm Venntel, which is the subject of a Federal Trade Commission action. The FTC last week announced a proposed settlement with Venntel and its owner, Gravy Analytics. The FTC alleged that “Gravy Analytics and Venntel violated the FTC Act by unfairly selling sensitive consumer location data, and by collecting and using consumers’ location data without obtaining verifiable user consent for commercial and government uses.”

“Surreptitious surveillance by data brokers undermines our civil liberties and puts servicemembers, union workers, religious minorities, and others at risk,” Samuel Levine, director of the FTC Bureau of Consumer Protection, said in the announcement. According to the FTC, Gravy Analytics used geofencing “to identify and sell lists of consumers who attended certain events related to medical conditions and places of worship and sold additional lists that associate individual consumers to other sensitive characteristics.”

If the proposed order takes effect, “Gravy Analytics and Venntel will be prohibited from selling, disclosing, or using sensitive location data in any product or service, and must establish a sensitive data location program,” the FTC said. One of the FTC’s two Republicans partially dissented from the decision, which may not be finalized until after President-elect Trump takes office.


o1 Turns Pro

So, how about OpenAI’s o1 and o1 Pro?

Sam Altman: o1 is powerful but it’s not so powerful that the universe needs to send us a tsunami.

As a result, the universe realized its mistake, and cancelled the tsunami.

We now have o1, and for those paying $200/month we have o1 pro.

It is early days, but we can say with confidence: They are good models, sir. Large improvements over o1-preview, especially in difficult or extensive coding questions, math, science, logic and fact recall. The benchmark jumps are big.

If you’re in the market for the use cases where it excels, this is a big deal, and also you should probably be paying the $200/month.

If you’re not into those use cases, maybe don’t pay the $200, but others are very much into those tasks and will use this to accelerate those tasks, so this is a big deal.

  1. Safety Third.

  2. Rule One.

  3. Turning Pro.

  4. Benchmarks.

  5. Silly Benchmarks.

  6. Reactions to o1.

  7. Reactions to o1 Pro.

  8. Let Your Coding Work Flow.

  9. Some People Need Practical Advice.

  10. Overall.

This post will be about o1’s capabilities only. Aside from this short summary, it skips covering the model card, the safety issues and questions about whether o1 ‘tried to escape’ or anything like that.

For now, I’ll note that:

  1. o1 scored Medium on CBRN and Persuasion.

  2. o1 scored Low on Cybersecurity and Model Autonomy.

  3. I found the Apollo report, the one that involves the supposed ‘escape attempts’ and what not, centrally unsurprising, given what we already knew.

  4. Generally, what I’ve seen so far is about what one would expect.

Here is the system card if you want to look at that in the meantime.

For practical use purposes, evals are negative selection; you need to try it out.

Janus: You should definitely try it out and ignore evaluations and such.

Roon: The o1 model is quite good at programming. In my use, it’s been remarkably better than the o1 preview. You should just try it and mostly ignore evaluations and such.

OpenAI introduces ChatGPT Pro, a $200/month service offering unlimited access to all of their models, including a special o1 pro mode where it uses additional compute.

Gallabytes: haha yes finally someone offers premium pricing. if o1 is as good as they say I’ll be a very happy user.

Yep, premium pricing options are awesome. More like this, please.

$20/month makes your decision easy. If you don’t subscribe to at least one paid service, you’re a fool. If you’re reading this, and you’re not paying for both Claude and ChatGPT at a minimum, you’re still probably making a mistake.

At $200/month for ChatGPT Pro, or $2,400/year, we are plausibly talking real money. That decision is a lot less obvious.

The extra compute helps. The question is how much?

You can mostly ignore all the evals and scores. It’s not about that. It’s about what kind of practical boost you get from unlimited o1 pro and o1 (and voice mode).

When o1 pro is hooked up to an IDE, a web browser or both, that will make a huge practical difference. Right now, it offers neither. It’s a big jump by all reports in deep reasoning and complex PhD-level or higher science and math problems. It solves especially tricky coding questions exceptionally well. But how often are these the modalities you want, and how much value is on the table?

Early poll results (where a full 17% of you said you’d already tried it!) had a majority say it mostly isn’t worth the price, with only a small fraction saying it provides enough value for the common folk who aren’t mainlining.

Sam Altman agrees: almost everyone will be best served by our free tier or the $20-per-month plus tier.

A small percentage of users want to use ChatGPT a lot and hit rate limits, and want to pay more for more intelligence on truly difficult problems. The $200-per-month tier is good for them!

I think Altman is wrong? Or alternatively, he’s actually saying ‘we don’t expect you to pay $200/month, it would be a bad look if I told you to pay that, and the $20/month product is excellent either way,’ which is reasonable.

I would be very surprised if pro coders weren’t getting great value here. Even if you only solve a few tricky spots each month, that’s already huge.

For short term practical personal purposes, those are the key questions.

Miles Brundage: o1 pro mode is now (just barely) off this chart at 79%.

Lest we forget, GPQA = “Google-proof question answering” in physics, bio, and chemistry – not easy stuff. 📈

Vellum verifies MMLU, HumanEval, and MATH, with very good scores: 92.3% MMLU, 92.4% HumanEval, 94.8% MATH. And that’s all for o1, not o1 pro.

These are big jumps. We also have 83% on AIME 2024.

It’s cheating, in a sense, to compare o1 outputs to Sonnet or GPT-4o outputs, since it uses more compute. But in a more important sense, progress is progress.

Jason Li wrote the 2024 Putnam and fed the questions into o1 (not pro), thinking it got at least half (60/120) and would place in the top ~2%. Dan Hendrycks offered to put them into o1 pro; responses were less impressed, so there’s some mismatch somewhere. Dan suspects he used a worse prompt.

A middle-level-silly benchmark is to open the floor and see what people ask?

Garrett Scott: I just subscribed to OpenAI’s $200/month subscription. Reply with questions to ask it and I will repost them in this thread.

Tym Switzer: Budget response:

Groan, fine, I guess, I mean I don’t really know what I was expecting.

Twitter, the floor is yours. What have we got?

Here is o1 pro speculating about potential explanations for unexplained things.

Here is o1 pro searching for the alpha in public markets, sure, but easy question.

Here is o1 pro’s flat tax plan, good instruction following, except I have to dock it tons of points for proactively suggesting an asset tax, and for not analyzing how to avoid reducing net revenue even though that wasn’t requested.

Here is o1 pro explaining Thermodynamic Dissipative adaptation at a post-doc level.

And Claude, commenting on that explanation, which it overall found strong:

Claude: The model appears to be prioritizing:

  • Accuracy over creativity

  • Comprehensiveness over depth

  • Safety over novelty

  • Structure over style

There’s a lot more; I recommend browsing the thread.

As usual, it seems like you want to play to its strengths, rather than asking generic questions. The good news is that o1’s strengths include fact recall, coding and math and science and logic.

I always find them fun, but do not forget that they are deeply silly.

Colin Fraser: I made this for you guys that you can print out as a reference that you can consult before sending me a screenshot

This seems importantly incomplete, even when adjusting so ‘easy’ and ‘hard’ refer to what you would expect to be easy or hard for a computer of a given type, rather than what would be easy or hard for a human. That’s because a lot of what matters is how the computer gets the answer right or wrong. We are far too results oriented, here as everywhere, rather than looking at the steps and breaking down the methods.

Nat McAleese: on the first day of shipmas, o1 said to me: there are three r’s in strawberry.

Fun with self-referential math questions.

Riley Goodside: Remarkable answer from o1 here — the reply author below tried to replicate my answer for this prompt (60,466,176) and got a different one they assumed was an error, but it isn’t.

In words, 205,891,132,094,649 has 41 vowels

And (41 – 14)^10 = 205,891,132,094,649.
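
The power itself is easy to verify; the vowel count depends on exactly how the number is spelled out in words, so only the arithmetic is checked here:

```python
n = 205_891_132_094_649
print((41 - 14) ** 10 == n)  # True: 27**10 == 205,891,132,094,649
```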

Still failing to notice 9.8 is more than 9.11, I see? Although here o1 pro passes.

Ask it to solve physics?

Siraj Raval: Spent $200 on o1 Pro and casually asked it to solve physics. ‘Unify general relativity and quantum mechanics,’ I said. No string theory, no loops. It gave me wild new math and testable predictions like it was nothing. Full link.

Justin Kruger: Ask o1 Pro to devise a plan for directly observing continents on a terrestrial planet within 10 light-years of Earth and getting a 4k image back in our lifetime for less than $100B. Then ask it for a plan to get a human on that planet for less than $1T.

Siraj Raval: Done. Its devised plan is genius.

The answer to physics is of course completely Obvious Nonsense but the question essentially asked for completely Obvious Nonsense, so… not bad?

Failing to remember that the Earth is a sphere, which is relevant when a plane flies far enough.

Gallabytes goes super deep on the all-important tic-tac-toe benchmark, for a while was impressed that he couldn’t beat it, then did anyway.

Actually not a bad benchmark. Diminishing returns, so act now.

Here is an especially silly question to focus on:

Nick St. Pierre: AGI 2025.

Reactions to o1 were almost universally positive. It’s a good model, sir.

The basics: It’s relatively fast, and seems to hallucinate less.

Danielle Fong: o1 is really fast; I like this.

Nick Dobos: o1 is fully rolled out to me. First impression:

Wow! o1 is cracked. What the

It’s so fast, and half the reason I was using Sonnet 3.6 was simply that o1 preview was slow. Half the time, I just needed something simple and quick, and it’s knocking my initial, admittedly simple coding questions out of the park so far.

Tyler Cowen: With the full o1 model, the rate of hallucination is much lower.

OpenAI: OpenAI o1 is more concise in its thinking, resulting in faster response times than o1-preview.

Our testing shows that o1 outperforms o1-preview, reducing major errors on difficult real-world questions by 34%.

Note that on the ‘hallucination’ tests per se, o1 did not outperform o1-preview.

The ‘vibe shift’ here is presumably as compared to o1-preview, which I like many others concluded wasn’t worth using in most cases.

Sam Altman (December 6, 11:57 p.m.): Fun watching the vibes shift so quickly on o1 🙂

Glad you like it!

Tyler Cowen finds o1 to be an excellent economist, and hard to stump.

Amjad Masad complains the model is not ‘reasoning from first principles’ on controversial questions but rather defaulting to consensus and calling everything else a conspiracy theory. I am confused why he expected it to suddenly start Just Asking Questions, given how it is being trained, and given how reliable consensus is in such situations versus Just Asking Questions, by default?

I bet you could still get it to think with better prompting. I think a certain type of person (which definitely includes Masad) is very inclined to find this type of fault, but as John Schulman explains, you couldn’t do it directly any other way even if you wanted to:

John Schulman: Nope, we don’t know how to train models to reason about controversial topics from first principles; we can only train them to reason on tasks like math calculations and puzzles where there’s an objective ground truth answer. On general tasks, we only know how to train them to imitate humans or maximize human approval. Nowadays post-training / alignment boosts benchmark scores, e.g. see here.

Amjad Masad: Ah makes sense. I tried to coax it to do Bayesian reasoning which was a bit more interesting.

Andrej Karpathy: The pitch is that reasoning capabilities learned in reward-rich settings transfer to other domains, the extent to which this turns out to be true is a large weight on timelines.

So far, the answer seems to be that it transfers some, and o1 and o1-pro still seem highly useful in ways beyond reasoning, but o1-style models mostly don’t ‘do their core thing’ in areas where they couldn’t be trained on definitive answers.

Hailey Collet: I’ve had several coding tasks where o1 has succeeded where o1-preview had failed (can’t share details). Today, I successfully used o1 to perform some challenging PyTorch optimization. It was a fight, whereas in this instance QwQ succeeded on the first try. o1 pro, per another user, also nailed it.

Girl Lich: It’s better at RPG character optimization, which is not a domain I would have expected them to train it on.

Lord Byron’s Iron: I’m using o1 (not pro) to debug my deckbuilder roguelike game

It’s much much better at debugging than Claude, generally correctly diagnosing bugs on first try assuming adequate context. And its code isn’t flawless but requires very little massaging to make work.

Reactions to o1 Pro by professionals seem very, very positive, although it does not strictly dominate Claude Sonnet.

William: Tech-debt deflation is here.

O1 Pro just solved an incredibly complicated and painful rewrite of a file that no other model has ever come close to.

I have been using this in an evaluation for different frontier models, and this marks a huge shift for me.

We’ve entered the “why fix your code today when a better model will do it tomorrow” regime.

[Here is an example.]

Holy shit! Holy shit! Holy shit!

Dean Ball: I am thinking about writing a review of o1 Pro, but for now, the above sums up my thoughts quite well.

TPIronside notes that while Claude Sonnet produces cleaner code, o1 is better at avoiding subtle errors or working with more obscure libraries and code bases. So you’d use Sonnet for most queries, but when something is driving you crazy you would pull out o1 Pro.

The key is what William realizes. The part where something is driving you crazy, or you have to pay down tech debt, is exactly where you end up spending most of your time (in my model and experience). That’s the hard part. So this is huge.

Sully: O1-Pro is probably the best model I’ve used for coding, hands down.

I gave it a pretty complicated codebase and asked it to refactor, referencing the documentation.

The difference between Claude, Gemini, O1, and O1-Pro is night and day.

The first time in a while I’ve been this impressed.

Full comparison in the video plus code.

Yeah, haha, I’m pretty sure I got $200 worth of coding done just this weekend using O1-Pro.

I’m really liking this model the more I use it.

Sully: With how smart O1-Pro is, the challenge isn’t really “Can the model do it?” anymore.

It’s bringing all the right data for the model to use.

Once it has the right context, it basically can do just about anything you ask it to.

And no copy-pasting, ragging, or “projects” don’t solve this 100%.

There has to be some workflow layer.

Not quite sure what it is yet.

Sully also notes offhand he thinks Gemini-1206 is quite good.

Kakachia777 does a comparison of o1 Pro to Claude 3.5 Sonnet, preferring Sonnet for coding because its code is easier to maintain. They rate o1 Pro somewhat better at deeper reasoning and complex tasks, though not by as much as others are saying, and recommend o1 Pro only for those who do specialized PhD-level tasks.

That post also claims new Chinese o1-style models are coming that will be much improved. As always, we shall wait and see.

For that wheelhouse, many report o1 Pro is scary good. Here’s one comment on Kakachia’s post.

T-Rex MD: Just finished my own testing. The science part, I can tell you, no AI, and No human has ever even come close to this.

I ran 4 separate windows at the same time; previously known research that had ended in roadblocks and premature endings is all done and sorted. The o1-preview managed to break down years to months, then, through many refinements, to 5 days. I have now redone all of that and finished it in 5-6 hours.

Other AIs fail to reason like I do or even close to what I do. My reasoning is extremely specific and medicine – science driven and refined.

I can safely say “o1-pro”, is the king, and unlikely to be de-throned at least until February.

Danielle Fong is feeling the headpats, and generally seems positive.

Danielle Fong: Just hired a new intern at $200/month.

they’re cracked, no doubt, but i’m suspicious they might be working many jobs.

And you can always count on him, but this one does hit a bit different:

McKay Wrigley (always the most impressed person): OpenAI o1 pro is *significantly* better than I anticipated.

This is the 1st time a model’s come out and been so good that it kind of shocked me.

I screenshotted Coinbase and had 4 popular models write code to clone it in 1 shot.

Guess which was o1 pro.

I will be devastated if o1 pro isn’t available via API.

I’d pay a stupid sum of money per 1M tokens for whatever this steroid version of o1 is.

Also, if you’re a “bUt tHe bEnChMarKs” person try actually working on normal stuff with it.

It’s way better, and it’s not close.

The deeper I go down the rabbit hole the more impressed I get.

This thing is different.

Derya Unutmaz reports o1 Pro unlocked great new ideas for his cancer therapy project, and he’s super excited.

A relatively skeptical take on o1-pro that still seems pretty sweet even so?

Riley Goodside: Problems o1 pro can solve that o1 can’t at all are rare; mostly it feels like things that work half the time on o1 work 90% of the time on o1 pro.

Here’s one I didn’t expect.

Wolf of Walgreens: Incredibly useful for fact recall. Disappointing for math (o1pro)

Reliable fact recall is valuable, but why would o1 pro be especially good at it? It seems like that would be the opposite of reasoning, or of thinking for a long time? But perhaps not. Seems like a clue?

Potentially related is that Steve Sokolowski reports it blows away other models at legal research, to the point of enabling pro se cases.

The problem with using o1 for coding, in a nutshell.

Gallabytes: how are people using o1 for coding? I’ve gotten so used to using cursor for this instead of a chat app. are you actually pasting in all the relevant context for the thing you want to do, then pasting the solution back into your editor?

McKay Wrigley (professional impressed person who is indeed also impressed with Gemini 1206, and is also a coder) is super impressed with o1, but will continue using Sonnet as well, because you often don’t want to have to step out of context.

McKay Wrigley: Finally has a reliable workflow.

Significantly better than his current workflow.

He’s basically replaced the Cursor Composer step with o1 Pro requests.

o1 Pro can complete a shocking number of tasks in a single step.

If he needs to do a few extra things, he uses Cursor Tab/Chat with Sonnet.

Video coming soon.

This basic idea makes sense. If you don’t need to rely on lots of context and want to essentially one-shot the problem, you want to Go Big with o1-pro.

If you want to make small adjustments, or write cleaner code, you go with Sonnet.

However, if Sonnet is failing at something and you’re going crazy, you can ‘pull out the bazooka’ and use o1-pro again, despite the context shifting. And indeed, that’s where the bulk of the actual pain comes, in my experience.

Still, putting o1 straight into the IDE would be ten times better, and likely not only get me to definitely pay but also to code a lot more?

I buy that this probably works.

Tyler Cowen: Appending “Take all the time you need” I find to be a useful general prompt with the o1 model.

Ethan Mollick: Haven’t seen the thinking time change when prompted to think longer. But will keep trying.

Tyler Cowen: Maybe it doesn’t need more time, it just needs to relax a bit!?

A prompt that predicts a superior result is likely a very good prompt. So if this works without causing o1 to think for longer, my presumption is then that it works because people who take all the time they need, or are told they can do so, produce better answers, so this steers it into a space with better answers.

He also advises using o1 to ask lots of questions while reading books.

Tyler Cowen: With the new o1 model, often the best strategy is to get a good history book, to help you generate questions, and then to ask away. How long will it take the world to realize a revolution in reading has arrived?

Dwarkesh Patel: Reading while constantly asking Claude questions is 2x harder and 4x more valuable.

Bloom 2 Sigma on demand.

To answer Tyler Cowen’s question, I mean, never, obviously. The revolution will not be televised, so almost everyone will miss it. People aren’t going to read books and stop to ask questions. That sounds like work and being curious and paying attention, and people don’t even read books when not doing any of those things.

People definitely aren’t going to start cracking open history books. I mean, ‘cmon.

The ‘ask LLMs lots of questions while reading’ tactic is of course correct. It was correct before using Claude Sonnet, and it’s highly plausible o1 makes it more correct now that you have a second option – I’m guessing you’ll want to mix up which one you use based on the question type. And no, you don’t have to jam the book in the context window – but you could, and in many cases you probably should. What, like it’s hard? If the book is too long, use Gemini-1206.

That said, I’ve spent all day reading and writing and used almost no queries. I ask questions most often when reading papers, then when reading some types of books, but I rarely read books and I’ve been triaging away the papers for now.

One should of course also be asking questions while writing, or reading blogs, or even reading Twitter, but mostly I end up not doing it.

It is early, but it seems clear that o1 and especially o1 pro are big jumps in capability for things in their wheelhouse. If you want what this kind of extended thinking can get you, including fact recall and relative lack of hallucinations, and especially large or tricky code, math or science problems, and likely most academic style questions, we took a big step up.

When this gets incorporated into IDEs, we should see a big step up in coding. It makes me excited to code again, the way Claude Sonnet 3.5 did (and does, although right now I don’t have the time).

Another key weakness is lack of web browsing. The combination of this plus browsing seems like it will be scary powerful. You’ll still want some combination of GPT-4o and Perplexity in your toolbox.

For other uses, it is too early to tell when you would want to use this over Sonnet 3.5.1. My instinct is that you’d still probably want to default to Sonnet for questions where it should be ‘smart enough’ to give you what you’re looking for, or of course just ask both of them all the time. Also there’s Gemini-1206, which I’m hearing a bunch of positive vibes about, so it might also be worth a look.


Apple hit with $1.2B lawsuit after killing controversial CSAM-detecting tool

When Apple devices are used to spread CSAM, it’s a huge problem for survivors, who allegedly face a range of harms, including “exposure to predators, sexual exploitation, dissociative behavior, withdrawal symptoms, social isolation, damage to body image and self-worth, increased risky behavior, and profound mental health issues, including but not limited to depression, anxiety, suicidal ideation, self-harm, insomnia, eating disorders, death, and other harmful effects.” One survivor told The Times she “lives in constant fear that someone might track her down and recognize her.”

Survivors suing have also incurred medical and other expenses due to Apple’s inaction, the lawsuit alleged. And those expenses will keep piling up if the court battle drags on for years and Apple’s practices remain unchanged.

Apple could win, a lawyer and policy fellow at the Stanford Institute for Human-Centered Artificial Intelligence, Riana Pfefferkorn, told The Times, as survivors face “significant hurdles” seeking liability for mishandling content that Apple says Section 230 shields. And a win for survivors could “backfire,” Pfefferkorn suggested, if Apple proves that forced scanning of devices and services violates the Fourth Amendment.

Survivors, some of whom own iPhones, think that Apple has a responsibility to protect them. In a press release, Margaret E. Mabie, a lawyer representing survivors, praised survivors for raising “a call for justice and a demand for Apple to finally take responsibility and protect these victims.”

“Thousands of brave survivors are coming forward to demand accountability from one of the most successful technology companies on the planet,” Mabie said. “Apple has not only rejected helping these victims, it has advertised the fact that it does not detect child sex abuse material on its platform or devices thereby exponentially increasing the ongoing harm caused to these victims.”


Rocket Report: NASA delays Artemis again; SpinLaunch spins a little cash


All the news that’s fit to lift

A report in which we read some tea leaves.

Look at the rocket, which has now launched 400 times. Credit: SpaceX

Welcome to Edition 7.22 of the Rocket Report! The big news is the Trump administration’s announcement that commercial astronaut Jared Isaacman would be put forward as the nominee to serve as the next NASA Administrator. Isaacman has flown to space twice, and demonstrated that he takes spaceflight seriously. More background on Isaacman, and possible changes, can be found here.

As always, we welcome reader submissions, and if you don’t want to miss an issue, please subscribe using the box below (the form will not appear on AMP-enabled versions of the site). Each report will include information on small-, medium-, and heavy-lift rockets as well as a quick look ahead at the next three launches on the calendar.

Orbex pauses launch site work in Sutherland, Scotland. Small-launch vehicle developer Orbex will halt work on its own launch site in northern Scotland and instead use a rival facility in the Shetland Islands, Space News reports. Orbex announced December 4 that it would “pause” construction of Sutherland Spaceport in Scotland and instead use the SaxaVord Spaceport on the island of Unst in the Shetlands for its Prime launch vehicle. Orbex had been linked to Spaceport Sutherland since the UK Space Agency announced in 2018 it selected the site for a vertical launch complex.

Pivoting to medium lift? … The move, Orbex said, will free up resources to allow the company to focus on launch vehicle development, including both Prime and a new medium-class vehicle called Proxima. “This decision will help us to reach first launch in 2025 and provides SaxaVord with another customer to further strengthen its commercial proposition. It’s a win-win for UK and Scottish space,” Phil Chambers, chief executive of Orbex, said. If you’re reading the tea leaves here, one might guess that the smaller Prime rocket will never launch, and the medium-lift design is a hail mary. We’ll see. (submitted by Ken the Bin)

SpinLaunch raises a little cash. Space startup SpinLaunch is fundraising again, though TechCrunch reports that it was exploring raising a significantly more ambitious sum earlier this year. The company has closed an $11.5 million round out of a planned $25 million, according to a filing with the US Securities and Exchange Commission. SpinLaunch confirmed funding to TechCrunch but did not comment on the amount raised. It last raised $71 million in a Series B funding round in 2022. SpinLaunch, as the name implies, plans to build a kinetic launch system as a low-cost, high-cadence alternative to rockets.

Putting the spin into SpinLaunch … A person familiar with the company’s plans told TechCrunch that the startup had talked to investors around nine months ago, hoping they would pile into a $350 million round at a $2 billion valuation. In response to a question about this fundraising target, SpinLaunch CEO David Wrenn said the figures were “highly inaccurate and misleading” and that he was “pleased with our recently closed financing.” Someone is spinning something, clearly. (submitted by Ken the Bin)


Vega C successfully returns to flight. After originally targeting November 29 for the return-to-flight mission of the Vega C rocket, Arianespace successfully launched the vehicle on Thursday, December 5, Space News reports. The three solid-fuel lower stages of the Vega C performed as expected, followed by three burns by the liquid-propellant AVUM+ upper stage. That upper stage deployed its payload, the Sentinel-1C satellite, about one hour and 45 minutes after liftoff. The launch was the first for the Vega C since a December 2022 launch failure on the rocket’s second flight that destroyed two Pléiades Neo imaging satellites.

Eyes in the sky … The payload, Sentinel-1C, is a radar imaging satellite built by Thales Alenia Space for the Copernicus program of Earth observation missions run by ESA and the European Commission. It replaces the Sentinel-1B spacecraft that malfunctioned in orbit nearly three years ago. It joins the existing, but aging, Sentinel-1A satellite and includes new capabilities to monitor maritime traffic with an Automatic Identification System receiver.

PLD Space secures loan for Miura 5 rocket. The Spanish launch company said this week that it had secured an 11 million euro loan ($11.6 million) from COFIDES, a state-owned development fund, to support the development of the launch site for its Miura 5 rocket in Kourou, French Guiana. The company said the funding bolsters its mission to ensure autonomous and competitive European access to space while strengthening Europe’s space infrastructure.

A public-private partnership … “This initiative exemplifies the critical role of public-private collaboration in supporting strategic and innovative projects, which rely on institutional backing as an anchor investor during the early stages of technological development,” added Spanish Secretary of State for Trade Amparo López Senovilla. The Miura 5 rocket will have an estimated payload capacity of 1 metric ton to low-Earth orbit and may make its debut in 2026. (submitted by Ken the Bin)

SpaceX value may soar higher. SpaceX is in talks to sell insider shares in a transaction valuing the rocket and satellite maker at about $350 billion, according to people familiar with the matter, Bloomberg reports. That would be a significant premium to a previously mulled valuation of $255 billion as reported by Bloomberg News and other media outlets just last month. SpaceX was last valued at about $210 billion in a tender offer earlier this year.

A big post-election bump … The current conversations are ongoing, and the details could change depending on interest from insider sellers and buyers, sources told the publication. The potential transaction would cement SpaceX’s status as the most valuable private startup in the world and rival the market capitalizations of some of the largest public companies. SpaceX has established itself as the industry’s preeminent rocket launch provider, lofting satellites, cargo, and people to space for NASA, the Pentagon, and commercial partners, and is building out a large network of Starlink satellites providing Internet service. (submitted by Ken the Bin)

China debuts a new medium-lift rocket. China’s new Long March 12 rocket made a successful inaugural flight Saturday, placing two experimental satellites into orbit and testing uprated, higher-thrust engines that will allow a larger Chinese launcher in development to send astronauts to the Moon. The Long March 12 is the newest member of China’s Long March rocket family, which has been flying since China launched its first satellite into orbit in 1970, Ars reports.

Rocket likely to be used for megaconstellation deployment … Like all of China’s other existing rockets, the Long March 12 configuration that flew Saturday is fully disposable. At the Zhuhai Airshow earlier this month, China’s largest rocket company displayed another version of the Long March 12 with a reusable first stage but with scant design details. The Long March 12 is powered by four kerosene-fueled YF-100K engines on its first stage, generating more than 1.1 million pounds, or 5,000 kilonewtons, of thrust at full throttle. These engines are upgraded, higher-thrust versions of the YF-100 engines used on several other types of Long March rockets. (submitted by EllPeaTea and Ken the Bin)

Falcon 9 rocket reaches some remarkable milestones. About 10 days ago, SpaceX launched a batch of Starlink v2-mini satellites from Kennedy Space Center in Florida on a Falcon 9 rocket, marking the 400th successful mission by the Falcon 9 rocket. Additionally, it was the Falcon program’s 375th booster recovery, according to SpaceX. Finally, with this mission, the company shattered its record for turnaround time from the landing of a booster to its next launch, cutting it to 13 days and 12 hours from the previous mark of 21 days, Ars reports.

A rapidly reusable shuttle … All told, in November, SpaceX launched 16 Falcon 9 rockets. The previous record for monthly launches by the Falcon 9 rocket was 14. SpaceX is on pace to launch 135 or more Falcon 9 and Falcon Heavy missions this year. That is a meaningful number, because over the course of the three decades it flew into orbit, NASA’s space shuttle flew 135 missions. The space shuttle was a significantly more complex vehicle, and unlike the Falcon 9 rocket, humans flew aboard it during every mission. However, there is some historical significance in the fact that the Falcon rocket may fly as many missions in a single year as the space shuttle did during its lifetime.

Long March 3B hits a milestone. China launched a new communication engineering test satellite early Tuesday on its workhorse Long March 3B rocket, adding to a series of satellites potentially intended for undisclosed military purposes, Space News reports. The launch was, notably, the 100th flight of the Long March 3B.

First time to the century marker … The rocket has performed 96 successful launches, along with two failures and two partial failures. The first launch, in February 1996 carrying Intelsat 708, infamously saw the rocket veer off course shortly after clearing the tower and impact a nearby village. Developed by the state-run China Academy of Launch Vehicle Technology, the three-stage rocket, with its four liquid boosters, is the only Chinese launcher to reach 100 launches. (submitted by Ken the Bin)

NASA delays Artemis launches again. In a news conference Thursday, NASA officials discussed changes to the timeline for future Artemis missions due to problems with Orion’s heat shield. The agency announced it is now targeting April 2026 for Artemis II (from September 2025) and mid-2027 for Artemis III (from September 2026). NASA said it now understands the technical cause of the heat shield issues observed during the Artemis I flight in late 2022 and will fly the heat shield as-is on Artemis II, with some changes to the reentry profile.

This may not be the final plan … The timing of this news conference was interesting, as there will be a changing of administrations at NASA in the coming weeks. The Trump administration has put forward commercial astronaut Jared Isaacman to lead NASA, and as Ars reported Thursday, there are likely some significant shakeups coming in the Artemis program. One possibility is that the Space Launch System rocket could be scrapped, with commercial rockets used to fly the Artemis missions.

Next three launches

Dec. 8: Falcon 9 | Starlink 12-5 | Cape Canaveral Space Force Station, Florida | 05:10 UTC

Dec. 12: Falcon 9 | Starlink 11-2 | Vandenberg Space Force Base, California | 19:33 UTC

Dec. 12: Falcon 9 | O3b mPOWER 7 & 8 | Kennedy Space Center, Florida | 20:58 UTC

Eric Berger is the senior space editor at Ars Technica, covering everything from astronomy to private space to NASA policy, and author of two books: Liftoff, about the rise of SpaceX; and Reentry, on the development of the Falcon 9 rocket and Dragon. A certified meteorologist, Eric lives in Houston.

indiana-jones-and-the-great-circle-is-pitch-perfect-archaeological-adventuring

Indiana Jones and the Great Circle is pitch-perfect archaeological adventuring


Review: Amazing open-world environs round out a tight, fun-filled adventure story.

No need to put Harrison Ford through the de-aging filter here! Credit: Bethesda / MachineGames

Historically, games based on popular film or TV franchises have generally been seen as cheap cash-ins, slapping familiar characters and settings on a shovelware clone of a popular genre and counting on the license to sell enough copies to devoted fans. Indiana Jones and the Great Circle clearly has grander ambitions than that, putting a AAA budget behind a unique open-world exploration game built around stealth, melee combat, and puzzle solving.

Building such a game on top of such well-loved source material comes with plenty of challenges. The developers at MachineGames need to pay homage to that material without resorting to the kind of slavish devotion that amounts to a mere retread of a familiar story. At the same time, any new Indy adventure carries with it the weight not just of the character’s many film and TV appearances but also of well-remembered games like Indiana Jones and the Fate of Atlantis. Then there are game franchises like Tomb Raider and Uncharted, which have already put their own significant stamps on the Indiana Jones formula of action-packed, devil-may-care treasure-hunting.

No, this is not a scene from a new Uncharted game. Credit: Bethesda / MachineGames

Surprisingly, Indiana Jones and the Great Circle bears all this pressure pretty well. While the stealth-exploration gameplay and simplistic puzzles can feel a bit trite at points, the game’s excellent presentation, top-notch world-building, and fun-filled, campy storyline drive one of Indy’s most memorable adventures since the original movie trilogy.

A fun-filled adventure

The year is 1937, and Indiana Jones has already Raided a Lost Ark but has yet to investigate the Last Crusade. After a short introductory flashback that retells an interactive version of Raiders of the Lost Ark‘s famous golden idol extraction, Professor Jones gets unexpectedly drawn away from preparations for midterms when a giant of a man breaks into Marshall College’s antiquities wing and steals a lone mummified cat.

Investigating that theft takes Jones on a globetrotting tour of locations along “The Great Circle,” a ring of archaeologically significant sites around the world that house ancient artifacts rumored to hold great and mysterious power. Those rumors have attracted the attention of the Nazis (who else would you expect?), dragging Indy into a race to secure the artifacts before they can be used to alter the course of an impending world war.

You see a whip, I see a grappling hook. Credit: Bethesda / MachineGames

The game’s overarching narrative—told mainly through lengthy cut scenes that serve as the most captivating reward for in-game achievements—does a pitch-perfect job of replicating the campy, madcap, fun-filled, adventurous tone Indy is known for. The writing is full of all the pithy one-liners and cheesy puns you could hope for, as well as countless overt and subtle references to Indy movie moments that will be familiar to even casual fans.

Indy here is his usual mix of archaeological superhero and bumbling everyman. One moment, he’s using his whip and some hard-to-believe upper body strength to jump around some quickly crumbling ruins. The next, he’s avoiding death during a madcap fight scene through a combination of sheer dumb luck and overconfident opposition. The next, he’s solving ancient riddles with reams of historical techno-babble and showing a downright supernatural ability to decipher long-dead languages in an instant when the plot demands it.

You have to admit it, this circle is pretty great! Credit: Bethesda / MachineGames

It all works in large part thanks to Troy Baker’s excellent vocal performance as Jones, which he somehow pulls off as a compelling cross between Harrison Ford and Jeff Goldblum. The music does some heavy lifting in setting the tone, too; it’s full of downright cinematic stirring horns and tension-packed strings that fade in and out perfectly in sync with the on-screen action. The game even shows some great restraint in its sparing use of the famous Indiana Jones theme, which I ended up humming to myself as I played more often than I actually heard it referenced in the game’s score.

Indy quips well off of Gina, a roving reporter searching for her missing sister who serves as the obligatory love interest/globetrotting exploration partner. But the game’s best scenes all involve Emmerich Voss, the Nazi archaeologist antagonist who makes an absolute meal out of his scenery chewing. From his obsession with cranial shapes to his preening diatribes about the inferiority of American culture, Voss makes the perfect foil for Indy’s no-nonsense, homespun apple pie forthrightness.

Voss steals literally every scene he’s in. Credit: Bethesda / MachineGames

By the time the plot descends into an inevitable mess of pseudo-religious magical mysticism, it’s clear that this is a story that doesn’t take itself too seriously. You may cringe a bit at how over the top it all gets, but you’ll probably be having too much fun to care.

Take a look around

In between the cut scenes—which together could form the basis for a strong Indiana Jones-themed episodic streaming miniseries—there’s an actual interactive game to play here as well. That game primarily plays out across three decently sized maps—one urban, one desert, and one water-logged marsh—that you can explore relatively freely, broken up by shorter, more linear interludes in between.

Following the main story quests in each of these locales generally has you zigzagging across the map through a series of glorified fetch quests. Go to location A to collect some mystical doodad, then return it to unlock some fun exposition and a reason to go to location B. Repeat as necessary.

I say “location A” there, but it’s usually more accurate to say the game points you toward “circle A” on the map. Once you get there, you often have to do a bit of unguided exploring to find the hidden trinket or secret entry point you need.

Am I going in the right direction? Credit: Bethesda / MachineGames

At their best, these exploration bits made me feel more like an archaeological detective than the usual in-game tourist blindly following a waypoint from location to location. At their worst, I spent 15 minutes searching through one of these map circles before finding my in-game partner Gina standing right next to the target I was probably intended to find immediately. So it goes.

Traipsing across the map in this way slowly reveals the sizable scale of the game’s environments, which often extend beyond what’s first apparent on the map to multi-floor buildings and gigantic subterranean caverns. Unlocking and/or figuring out all of the best paths through these labyrinthine locales—which can involve climbing across rooftops or crawling through enemy barracks—is often half the fun.

As you crisscross the map, you also invariably stumble on a seemingly endless array of optional sidequests, mysteries, and “fieldwork,” which you keep track of in a dynamically updated journal. While there’s an attempt at a plot justification for each of these optional fetch quests, the ones I tried ended up being much less compelling than the main plot, which seems to have taken most of the writers’ attention.

Indiana Jones, famous Vatican tourist. Credit: Bethesda / MachineGames

As you explore, a tiny icon in the corner of the screen will also alert you to photo opportunities, which can unlock important bits of lore or context for puzzles. I thoroughly enjoyed these quick excuses to appreciate the game’s well-designed architecture and environments, even as they made Indy feel a bit more like a random tourist than a badass archaeologist hero.

Quick, hide!

Unfortunately, your ability to freely explore The Great Circle‘s environments is often hampered by large groups of roaming Nazi and/or fascist soldiers. Sometimes, you can put on a disguise to walk among them unseen, but even then, certain enemies can pick you out of the crowd, something that was not clear to me until I had already been plucked out of obscurity more than a few times.

When undisguised, you’ll spend a lot of time kneeling and sneaking silently just outside the soldiers’ vision cones or patiently waiting for them to move so you can sneak through a newly safe path. Remaining unseen also lets you silently take out enemies from behind, which includes pushing unsuspecting enemy sentries off of ledges in a hilarious move that never, ever gets old.

They’ll never find me up here. Credit: Bethesda / MachineGames

When your sneaking skills fail you amid a large group of enemies, the best and easiest thing to do is immediately run and hide. For the most part, the enemies are incredibly inept in their inevitable pursuit; dodge around a couple of corners and hide in a dark alley and they’ll usually quickly lose track of you. While I appreciated that being spotted wasn’t an instant death sentence, the ease with which I could outsmart these soldiers made the sneaking a lot less tense.

If you get spotted by a group of just one or two enemy soldiers, though, it’s time for some first-person melee combat, which draws heavy inspiration from the developers’ previous work on the early ’00s Chronicles of Riddick games. These fights usually play out like the world’s most overdesigned game of Punch-Out!!—you stand there waiting for a heavily telegraphed punch to come in, at which point you throw up a quick block or dodge and then counter with a series of rapid, crunchy punches of your own. Repeat until the enemy goes down.

You can spice things up a bit here by disarming and/or unbalancing your foes with your whip or by grabbing a wide variety of nearby objects to use as improvised melee weapons. After a while, though, all the fistfights start to feel pretty rote and unmemorable. The first time you hit a Nazi upside the head with a plunger is hilarious. The fifth time is a bit tiresome.

It’s always a good time to punch a Nazi. Credit: Bethesda / MachineGames

While you can also pull out a trusty revolver to simply shoot your foes, the racket the shots make usually leads to so much unwelcome enemy attention that it’s rarely worth the trouble. Aside from a handful of obligatory sections where the game practically forces you into a shooting gallery situation, I found little need to engage in the serviceable but unexciting gun combat.

And while The Great Circle is far from a horror game, there are a few combat moments of genuine terror with foes more formidable than the average grunt. I don’t want to give away too much, but those with fear of underwater creatures, the dark, or confined spaces will find some parts of the game incredibly tense.

Not so puzzling

My favorite gameplay moments in The Great Circle were the extended sections where I didn’t have to worry about stealth or combat and could just focus on exploring massive underground ruins. These feature some of the game’s most interesting traversal challenges, where looking around and figuring out just how to make it to the next objective is engaging on its own terms. There’s little of the Uncharted-style hand-holding that practically highlights every handhold and jump with a flashing red sign.

When giant mechanical gears need placing, you know who to call! Credit: Bethesda / MachineGames

These exploratory bits are broken up by some obligatory puzzles, usually involving Indiana Jones’ trademark of unbelievably intricate ancient stone machinery. Arrange the giant stone gears so the door opens, put the right relic in the right spot, shine a light on some emblems with a few mirrors, and so on. You know the drill if you’ve played any number of similar action-adventure games, and you probably won’t find the puzzles all that challenging if you’re capable of some basic logic and exploration (though snapping pictures with the in-game camera offers hints for those who get unexpectedly stuck).

But even during the least engaging puzzles or humdrum fights in The Great Circle, I was compelled forward by the promise of some intricate ruin or pithy cut scene quip to come. Like the best Indiana Jones movies, there’s a propulsive force to the game’s most exciting scenes that helps you push past any brief feelings of tedium in between. Here’s hoping we see a lot more of this version of Indiana Jones in the future.

A note on performance

Indiana Jones and the Great Circle has received some recent negative attention for having relatively beefy system requirements, including calling for GPUs that have some form of real-time ray-tracing acceleration. We tested the game on a system with an Nvidia RTX 2080 Ti and an Intel i7-8700K CPU with 32 GB of RAM, which puts it roughly between the “minimum” and “recommended” specs suggested by the publisher.

Trace those rays. Credit: Bethesda / MachineGames

Despite this, we were able to run the game at 1440p resolution and “High” graphical settings at a steady 60 fps throughout. The game did occasionally suffer some heavy frame stuttering when loading new scenes, and far-off background elements had a tendency to noticeably “pop in” when running, but otherwise, we had few complaints about the graphical performance.

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.

us-to-start-nationwide-testing-for-h5n1-flu-virus-in-milk-supply

US to start nationwide testing for H5N1 flu virus in milk supply

So, the ultimate goal of the USDA is to eliminate cattle as a reservoir. When the agency announced it was planning this program, it noted that there were two candidate vaccines in trials. Until those are validated, it plans to use the standard playbook for handling emerging infections: contact tracing and isolation. And it has the ability to compel cattle and their owners to be more cooperative than the human population turned out to be.

The five-step plan

The USDA refers to isolation and contact tracing as Stage 3 of a five-stage plan for controlling H5N1 in cattle, with the two earlier stages being mandatory sampling and testing, handled on a state-by-state basis. Once the virus has been successfully contained in a state, the USDA will move on to batch sampling to ensure the state remains virus-free. This is essential, given that we don’t have a clear picture of how many times the virus has jumped from its normal reservoir in birds into the cattle population.

That makes it possible that reaching Stage 5, which the USDA terms “Demonstrating Freedom from H5 in US Dairy Cattle,” will turn out to be impossible. Dairy cattle are likely to have daily contact with birds, and it may be that the virus will be regularly re-introduced into the population, leaving containment as the only option until the vaccines are ready.

Testing will initially focus on states where cattle-to-human transmission is known to have occurred or the virus is known to be present: California, Colorado, Michigan, Mississippi, Oregon, and Pennsylvania. If you wish to track the progress of the USDA’s efforts, the agency will be posting weekly updates.
