AI

the-first-gpt-4-class-ai-model-anyone-can-download-has-arrived:-llama-405b

The first GPT-4-class AI model anyone can download has arrived: Llama 405B

A new llama emerges —

“Open source AI is the path forward,” says Mark Zuckerberg, misusing the term.

A red llama in a blue desert illustration based on a photo.

In the AI world, there’s a buzz in the air about a new AI language model released Tuesday by Meta: Llama 3.1 405B. The reason? It’s potentially the first time anyone can download a GPT-4-class large language model (LLM) for free and run it on their own hardware. You’ll still need some beefy hardware: Meta says it can run on a “single server node,” which isn’t desktop PC-grade equipment. But it’s a provocative shot across the bow of “closed” AI model vendors such as OpenAI and Anthropic.

“Llama 3.1 405B is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation,” says Meta. Company CEO Mark Zuckerberg calls 405B “the first frontier-level open source AI model.”

In the AI industry, “frontier model” is a term for an AI system designed to push the boundaries of current capabilities. In this case, Meta is positioning 405B among the likes of the industry’s top AI models, such as OpenAI’s GPT-4o, Claude’s 3.5 Sonnet, and Google Gemini 1.5 Pro.

A chart published by Meta suggests that 405B gets very close to matching the performance of GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in benchmarks like MMLU (undergraduate level knowledge), GSM8K (grade school math), and HumanEval (coding).

But as we’ve noted many times since March, these benchmarks aren’t necessarily scientifically sound or translate to the subjective experience of interacting with AI language models. In fact, this traditional slate of AI benchmarks is so generally useless to laypeople that even Meta’s PR department now just posts a few images of charts and doesn’t even try to explain them in any detail.

A Meta-provided chart that shows Llama 3.1 405B benchmark results versus other major AI models.

Enlarge / A Meta-provided chart that shows Llama 3.1 405B benchmark results versus other major AI models.

We’ve instead found that measuring the subjective experience of using a conversational AI model (through what might be called “vibemarking”) on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs. In the absence of Chatbot Arena data, Meta has provided the results of its own human evaluations of 405B’s outputs that seem to show Meta’s new model holding its own against GPT-4 Turbo and Claude 3.5 Sonnet.

A Meta-provided chart that shows how humans rated Llama 3.1 405B's outputs compared to GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in its own studies.

Enlarge / A Meta-provided chart that shows how humans rated Llama 3.1 405B’s outputs compared to GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in its own studies.

Whatever the benchmarks, early word on the street (after the model leaked on 4chan yesterday) seems to match the claim that 405B is roughly equivalent to GPT-4. It took a lot of expensive computer training time to get there—and money, of which the social media giant has plenty to burn. Meta trained the 405B model on over 15 trillion tokens of training data scraped from the web (then parsed, filtered, and annotated by Llama 2), using more than 16,000 H100 GPUs.

So what’s with the 405B name? In this case, “405B” means 405 billion parameters, and parameters are numerical values that store trained information in a neural network. More parameters translate to a larger neural network powering the AI model, which generally (but not always) means more capability, such as better ability to make contextual connections between concepts. But larger-parameter models have a tradeoff in needing more computing power (AKA “compute”) to run.

We’ve been expecting the release of a 400 billion-plus parameter model of the Llama 3 family since Meta gave word that it was training one in April, and today’s announcement isn’t just about the biggest member of the Llama 3 family: There’s an entirely new iteration of improved Llama models with the designation “Llama 3.1.” That includes upgraded versions of its smaller 8B and 70B models, which now feature multilingual support and an extended context length of 128,000 tokens (the “context length” is roughly the working memory capacity of the model, and “tokens” are chunks of data used by LLMs to process information).

Meta says that 405B is useful for long-form text summarization, multilingual conversational agents, and coding assistants and for creating synthetic data used to train future AI language models. Notably, that last use-case—allowing developers to use outputs from Llama models to improve other AI models—is now officially supported by Meta’s Llama 3.1 license for the first time.

Abusing the term “open source”

Llama 3.1 405B is an open-weights model, which means anyone can download the trained neural network files and run them or fine-tune them. That directly challenges a business model where companies like OpenAI keep the weights to themselves and instead monetize the model through subscription wrappers like ChatGPT or charge for access by the token through an API.

Fighting the “closed” AI model is a big deal to Mark Zuckerberg, who simultaneously released a 2,300-word manifesto today on why the company believes in open releases of AI models, titled, “Open Source AI Is the Path Forward.” More on the terminology in a minute. But briefly, he writes about the need for customizable AI models that offer user control and encourage better data security, higher cost-efficiency, and better future-proofing, as opposed to vendor-locked solutions.

All that sounds reasonable, but undermining your competitors using a model subsidized by a social media war chest is also an efficient way to play spoiler in a market where you might not always win with the most cutting-edge tech. That benefits Meta, Zuckerberg says, because he doesn’t want to get locked into a system where companies like his have to pay a toll to access AI capabilities, drawing comparisons to “taxes” Apple levies on developers through its App Store.

A screenshot of Mark Zuckerberg's essay,

Enlarge / A screenshot of Mark Zuckerberg’s essay, “Open Source AI Is the Path Forward,” published on July 23, 2024.

So, about that “open source” term. As we first wrote in an update to our Llama 2 launch article a year ago, “open source” has a very particular meaning that has traditionally been defined by the Open Source Initiative. The AI industry has not yet settled on terminology for AI model releases that ship either code or weights with restrictions (such as Llama 3.1) or that ship without providing training data. We’ve been calling these releases “open weights” instead.

Unfortunately for terminology sticklers, Zuckerberg has now baked the erroneous “open source” label into the title of his potentially historic aforementioned essay on open AI releases, so fighting for the correct term in AI may be a losing battle. Still, his usage annoys people like independent AI researcher Simon Willison, who likes Zuckerberg’s essay otherwise.

“I see Zuck’s prominent misuse of ‘open source’ as a small-scale act of cultural vandalism,” Willison told Ars Technica. “Open source should have an agreed meaning. Abusing the term weakens that meaning which makes the term less generally useful, because if someone says ‘it’s open source,’ that no longer tells me anything useful. I have to then dig in and figure out what they’re actually talking about.”

The Llama 3.1 models are available for download through Meta’s own website and on Hugging Face. They both require providing contact information and agreeing to a license and an acceptable use policy, which means that Meta can technically legally pull the rug out from under your use of Llama 3.1 or its outputs at any time.

The first GPT-4-class AI model anyone can download has arrived: Llama 405B Read More »

astronomers-discover-technique-to-spot-ai-fakes-using-galaxy-measurement-tools

Astronomers discover technique to spot AI fakes using galaxy-measurement tools

stars in their eyes —

Researchers use technique to quantify eyeball reflections that often reveal deepfake images.

Researchers write,

Enlarge / Researchers write, “In this image, the person on the left (Scarlett Johansson) is real, while the person on the right is AI-generated. Their eyeballs are depicted underneath their faces. The reflections in the eyeballs are consistent for the real person, but incorrect (from a physics point of view) for the fake person.”

In 2024, it’s almost trivial to create realistic AI-generated images of people, which has led to fears about how these deceptive images might be detected. Researchers at the University of Hull recently unveiled a novel method for detecting AI-generated deepfake images by analyzing reflections in human eyes. The technique, presented at the Royal Astronomical Society’s National Astronomy Meeting last week, adapts tools used by astronomers to study galaxies for scrutinizing the consistency of light reflections in eyeballs.

Adejumoke Owolabi, an MSc student at the University of Hull, headed the research under the guidance of Dr. Kevin Pimbblet, professor of astrophysics.

Their detection technique is based on a simple principle: A pair of eyes being illuminated by the same set of light sources will typically have a similarly shaped set of light reflections in each eyeball. Many AI-generated images created to date don’t take eyeball reflections into account, so the simulated light reflections are often inconsistent between each eye.

A series of real eyes showing largely consistent reflections in both eyes.

Enlarge / A series of real eyes showing largely consistent reflections in both eyes.

In some ways, the astronomy angle isn’t always necessary for this kind of deepfake detection because a quick glance at a pair of eyes in a photo can reveal reflection inconsistencies, which is something artists who paint portraits have to keep in mind. But the application of astronomy tools to automatically measure and quantify eye reflections in deepfakes is a novel development.

Automated detection

In a Royal Astronomical Society blog post, Pimbblet explained that Owolabi developed a technique to detect eyeball reflections automatically and ran the reflections’ morphological features through indices to compare similarity between left and right eyeballs. Their findings revealed that deepfakes often exhibit differences between the pair of eyes.

The team applied methods from astronomy to quantify and compare eyeball reflections. They used the Gini coefficient, typically employed to measure light distribution in galaxy images, to assess the uniformity of reflections across eye pixels. A Gini value closer to 0 indicates evenly distributed light, while a value approaching 1 suggests concentrated light in a single pixel.

A series of deepfake eyes showing inconsistent reflections in each eye.

Enlarge / A series of deepfake eyes showing inconsistent reflections in each eye.

In the Royal Astronomical Society post, Pimbblet drew comparisons between how they measured eyeball reflection shape and how they typically measure galaxy shape in telescope imagery: “To measure the shapes of galaxies, we analyze whether they’re centrally compact, whether they’re symmetric, and how smooth they are. We analyze the light distribution.”

The researchers also explored the use of CAS parameters (concentration, asymmetry, smoothness), another tool from astronomy for measuring galactic light distribution. However, this method proved less effective in identifying fake eyes.

A detection arms race

While the eye-reflection technique offers a potential path for detecting AI-generated images, the method might not work if AI models evolve to incorporate physically accurate eye reflections, perhaps applied as a subsequent step after image generation. The technique also requires a clear, up-close view of eyeballs to work.

The approach also risks producing false positives, as even authentic photos can sometimes exhibit inconsistent eye reflections due to varied lighting conditions or post-processing techniques. But analyzing eye reflections may still be a useful tool in a larger deepfake detection toolset that also considers other factors such as hair texture, anatomy, skin details, and background consistency.

While the technique shows promise in the short term, Dr. Pimbblet cautioned that it’s not perfect. “There are false positives and false negatives; it’s not going to get everything,” he told the Royal Astronomical Society. “But this method provides us with a basis, a plan of attack, in the arms race to detect deepfakes.”

Astronomers discover technique to spot AI fakes using galaxy-measurement tools Read More »

elon-musk’s-x-tests-letting-users-request-community-notes-on-bad-posts

Elon Musk’s X tests letting users request Community Notes on bad posts

Elon Musk’s X tests letting users request Community Notes on bad posts

Continuing to evolve the fact-checking service that launched as Twitter’s Birdwatch, X has announced that Community Notes can now be requested to clarify problematic posts spreading on Elon Musk’s platform.

X’s Community Notes account confirmed late Thursday that, due to “popular demand,” X had launched a pilot test on the web-based version of the platform. The test is active now and the same functionality will be “coming soon” to Android and iOS, the Community Notes account said.

Through the current web-based pilot, if you’re an eligible user, you can click on the “•••” menu on any X post on the web and request fact-checking from one of Community Notes’ top contributors, X explained. If X receives five or more requests within 24 hours of the post going live, a Community Note will be added.

Only X users with verified phone numbers will be eligible to request Community Notes, X said, and to start, users will be limited to five requests a day.

“The limit may increase if requests successfully result in helpful notes, or may decrease if requests are on posts that people don’t agree need a note,” X’s website said. “This helps prevent spam and keep note writers focused on posts that could use helpful notes.”

Once X receives five or more requests for a Community Note within a single day, top contributors with diverse views will be alerted to respond. On X, top contributors are constantly changing, as their notes are voted as either helpful or not. If at least 4 percent of their notes are rated “helpful,” X explained on its site, and the impact of their notes meets X standards, they can be eligible to receive alerts.

“A contributor’s Top Writer status can always change as their notes are rated by others,” X’s website said.

Ultimately, X considers notes helpful if they “contain accurate, high-quality information” and “help inform people’s understanding of the subject matter in posts,” X said on another part of its site. To gauge the former, X said that the platform partners with “professional reviewers” from the Associated Press and Reuters. X also continually monitors whether notes marked helpful by top writers match what general X users marked as helpful.

“We don’t expect all notes to be perceived as helpful by all people all the time,” X’s website said. “Instead, the goal is to ensure that on average notes that earn the status of Helpful are likely to be seen as helpful by a wide range of people from different points of view, and not only be seen as helpful by people from one viewpoint.”

X will also be allowing half of the top contributors to request notes during the pilot phase, which X said will help the platform evaluate “whether it is beneficial for Community Notes contributors to have both the ability to write notes and request notes.”

According to X, the criteria for requesting a note have intentionally been designed to be simple during the pilot stage, but X expects “these criteria to evolve, with the goal that requests are frequently found valuable to contributors, and not noisy.”

It’s hard to tell from the outside looking in how helpful Community Notes are to X users. The most recent Community Notes survey data that X points to is from 2022 when the platform was still called Twitter and the fact-checking service was still called Birdwatch.

That data showed that “on average,” users were “20–40 percent less likely to agree with the substance of a potentially misleading Tweet than someone who sees the Tweet alone.” And based on Twitter’s “internal data” at that time, the platform also estimated that “people on Twitter who see notes are, on average, 15–35 percent less likely to Like or Retweet a Tweet than someone who sees the Tweet alone.”

Elon Musk’s X tests letting users request Community Notes on bad posts Read More »

microsoft-cto-kevin-scott-thinks-llm-“scaling-laws”-will-hold-despite-criticism

Microsoft CTO Kevin Scott thinks LLM “scaling laws” will hold despite criticism

As the word turns —

Will LLMs keep improving if we throw more compute at them? OpenAI dealmaker thinks so.

Kevin Scott, CTO and EVP of AI at Microsoft speaks onstage during Vox Media's 2023 Code Conference at The Ritz-Carlton, Laguna Niguel on September 27, 2023 in Dana Point, California.

Enlarge / Kevin Scott, CTO and EVP of AI at Microsoft speaks onstage during Vox Media’s 2023 Code Conference at The Ritz-Carlton, Laguna Niguel on September 27, 2023 in Dana Point, California.

During an interview with Sequoia Capital’s Training Data podcast published last Tuesday, Microsoft CTO Kevin Scott doubled down on his belief that so-called large language model (LLM) “scaling laws” will continue to drive AI progress, despite some skepticism in the field that progress has leveled out. Scott played a key role in forging a $13 billion technology-sharing deal between Microsoft and OpenAI.

“Despite what other people think, we’re not at diminishing marginal returns on scale-up,” Scott said. “And I try to help people understand there is an exponential here, and the unfortunate thing is you only get to sample it every couple of years because it just takes a while to build supercomputers and then train models on top of them.”

LLM scaling laws refer to patterns explored by OpenAI researchers in 2020 showing that the performance of language models tends to improve predictably as the models get larger (more parameters), are trained on more data, and have access to more computational power (compute). The laws suggest that simply scaling up model size and training data can lead to significant improvements in AI capabilities without necessarily requiring fundamental algorithmic breakthroughs.

Since then, other researchers have challenged the idea of persisting scaling laws over time, but the concept is still a cornerstone of OpenAI’s AI development philosophy.

You can see Scott’s comments in the video below beginning around 46: 05:

Microsoft CTO Kevin Scott on how far scaling laws will extend

Scott’s optimism contrasts with a narrative among some critics in the AI community that progress in LLMs has plateaued around GPT-4 class models. The perception has been fueled by largely informal observations—and some benchmark results—about recent models like Google’s Gemini 1.5 Pro, Anthropic’s Claude Opus, and even OpenAI’s GPT-4o, which some argue haven’t shown the dramatic leaps in capability seen in earlier generations, and that LLM development may be approaching diminishing returns.

“We all know that GPT-3 was vastly better than GPT-2. And we all know that GPT-4 (released thirteen months ago) was vastly better than GPT-3,” wrote AI critic Gary Marcus in April. “But what has happened since?”

The perception of plateau

Scott’s stance suggests that tech giants like Microsoft still feel justified in investing heavily in larger AI models, betting on continued breakthroughs rather than hitting a capability plateau. Given Microsoft’s investment in OpenAI and strong marketing of its own Microsoft Copilot AI features, the company has a strong interest in maintaining the perception of continued progress, even if the tech stalls.

Frequent AI critic Ed Zitron recently wrote in a post on his blog that one defense of continued investment into generative AI is that “OpenAI has something we don’t know about. A big, sexy, secret technology that will eternally break the bones of every hater,” he wrote. “Yet, I have a counterpoint: no it doesn’t.”

Some perceptions of slowing progress in LLM capabilities and benchmarking may be due to the rapid onset of AI in the public eye when, in fact, LLMs have been developing for years prior. OpenAI continued to develop LLMs during a roughly three-year gap between the release of GPT-3 in 2020 and GPT-4 in 2023. Many people likely perceived a rapid jump in capability with GPT-4’s launch in 2023 because they had only become recently aware of GPT-3-class models with the launch of ChatGPT in late November 2022, which used GPT-3.5.

In the podcast interview, the Microsoft CTO pushed back against the idea that AI progress has stalled, but he acknowledged the challenge of infrequent data points in this field, as new models often take years to develop. Despite this, Scott expressed confidence that future iterations will show improvements, particularly in areas where current models struggle.

“The next sample is coming, and I can’t tell you when, and I can’t predict exactly how good it’s going to be, but it will almost certainly be better at the things that are brittle right now, where you’re like, oh my god, this is a little too expensive, or a little too fragile, for me to use,” Scott said in the interview. “All of that gets better. It’ll get cheaper, and things will become less fragile. And then more complicated things will become possible. That is the story of each generation of these models as we’ve scaled up.”

Microsoft CTO Kevin Scott thinks LLM “scaling laws” will hold despite criticism Read More »

openai-reportedly-nears-breakthrough-with-“reasoning”-ai,-reveals-progress-framework

OpenAI reportedly nears breakthrough with “reasoning” AI, reveals progress framework

studies in hype-otheticals —

Five-level AI classification system probably best seen as a marketing exercise.

Illustration of a robot with many arms.

OpenAI recently unveiled a five-tier system to gauge its advancement toward developing artificial general intelligence (AGI), according to an OpenAI spokesperson who spoke with Bloomberg. The company shared this new classification system on Tuesday with employees during an all-hands meeting, aiming to provide a clear framework for understanding AI advancement. However, the system describes hypothetical technology that does not yet exist and is possibly best interpreted as a marketing move to garner investment dollars.

OpenAI has previously stated that AGI—a nebulous term for a hypothetical concept that means an AI system that can perform novel tasks like a human without specialized training—is currently the primary goal of the company. The pursuit of technology that can replace humans at most intellectual work drives most of the enduring hype over the firm, even though such a technology would likely be wildly disruptive to society.

OpenAI CEO Sam Altman has previously stated his belief that AGI could be achieved within this decade, and a large part of the CEO’s public messaging has been related to how the company (and society in general) might handle the disruption that AGI may bring. Along those lines, a ranking system to communicate AI milestones achieved internally on the path to AGI makes sense.

OpenAI’s five levels—which it plans to share with investors—range from current AI capabilities to systems that could potentially manage entire organizations. The company believes its technology (such as GPT-4o that powers ChatGPT) currently sits at Level 1, which encompasses AI that can engage in conversational interactions. However, OpenAI executives reportedly told staff they’re on the verge of reaching Level 2, dubbed “Reasoners.”

Bloomberg lists OpenAI’s five “Stages of Artificial Intelligence” as follows:

  • Level 1: Chatbots, AI with conversational language
  • Level 2: Reasoners, human-level problem solving
  • Level 3: Agents, systems that can take actions
  • Level 4: Innovators, AI that can aid in invention
  • Level 5: Organizations, AI that can do the work of an organization

A Level 2 AI system would reportedly be capable of basic problem-solving on par with a human who holds a doctorate degree but lacks access to external tools. During the all-hands meeting, OpenAI leadership reportedly demonstrated a research project using their GPT-4 model that the researchers believe shows signs of approaching this human-like reasoning ability, according to someone familiar with the discussion who spoke with Bloomberg.

The upper levels of OpenAI’s classification describe increasingly potent hypothetical AI capabilities. Level 3 “Agents” could work autonomously on tasks for days. Level 4 systems would generate novel innovations. The pinnacle, Level 5, envisions AI managing entire organizations.

This classification system is still a work in progress. OpenAI plans to gather feedback from employees, investors, and board members, potentially refining the levels over time.

Ars Technica asked OpenAI about the ranking system and the accuracy of the Bloomberg report, and a company spokesperson said they had “nothing to add.”

The problem with ranking AI capabilities

OpenAI isn’t alone in attempting to quantify levels of AI capabilities. As Bloomberg notes, OpenAI’s system feels similar to levels of autonomous driving mapped out by automakers. And in November 2023, researchers at Google DeepMind proposed their own five-level framework for assessing AI advancement, showing that other AI labs have also been trying to figure out how to rank things that don’t yet exist.

OpenAI’s classification system also somewhat resembles Anthropic’s “AI Safety Levels” (ASLs) first published by the maker of the Claude AI assistant in September 2023. Both systems aim to categorize AI capabilities, though they focus on different aspects. Anthropic’s ASLs are more explicitly focused on safety and catastrophic risks (such as ASL-2, which refers to “systems that show early signs of dangerous capabilities”), while OpenAI’s levels track general capabilities.

However, any AI classification system raises questions about whether it’s possible to meaningfully quantify AI progress and what constitutes an advancement (or even what constitutes a “dangerous” AI system, as in the case of Anthropic). The tech industry so far has a history of overpromising AI capabilities, and linear progression models like OpenAI’s potentially risk fueling unrealistic expectations.

There is currently no consensus in the AI research community on how to measure progress toward AGI or even if AGI is a well-defined or achievable goal. As such, OpenAI’s five-tier system should likely be viewed as a communications tool to entice investors that shows the company’s aspirational goals rather than a scientific or even technical measurement of progress.

OpenAI reportedly nears breakthrough with “reasoning” AI, reveals progress framework Read More »

“superhuman”-go-ais-still-have-trouble-defending-against-these-simple-exploits

“Superhuman” Go AIs still have trouble defending against these simple exploits

Man vs. machine —

Plugging up “worst-case” algorithmic holes is proving more difficult than expected.

Man vs. machine in a sea of stones.

Enlarge / Man vs. machine in a sea of stones.

Getty Images

In the ancient Chinese game of Go, state-of-the-art artificial intelligence has generally been able to defeat the best human players since at least 2016. But in the last few years, researchers have discovered flaws in these top-level AI Go algorithms that give humans a fighting chance. By using unorthodox “cyclic” strategies—ones that even a beginning human player could detect and defeat—a crafty human can often exploit gaps in a top-level AI’s strategy and fool the algorithm into a loss.

Researchers at MIT and FAR AI wanted to see if they could improve this “worst case” performance in otherwise “superhuman” AI Go algorithms, testing a trio of methods to harden the top-level KataGo algorithm‘s defenses against adversarial attacks. The results show that creating truly robust, unexploitable AIs may be difficult, even in areas as tightly controlled as board games.

Three failed strategies

In the pre-print paper “Can Go AIs be adversarially robust?”, the researchers aim to create a Go AI that is truly “robust” against any and all attacks. That means an algorithm that can’t be fooled into “game-losing blunders that a human would not commit” but also one that would require any competing AI algorithm to spend significant computing resources to defeat it. Ideally, a robust algorithm should also be able to overcome potential exploits by using additional computing resources when confronted with unfamiliar situations.

An example of the original cyclic attack in action.

Enlarge / An example of the original cyclic attack in action.

The researchers tried three methods to generate such a robust Go algorithm. In the first, they simply fine-tuned the KataGo model using more examples of the unorthodox cyclic strategies that previously defeated it, hoping that KataGo could learn to detect and defeat these patterns after seeing more of them.

This strategy initially seemed promising, letting KataGo win 100 percent of games against a cyclic “attacker.” But after the attacker itself was fine-tuned (a process that used much less computing power than KataGo’s fine-tuning), that win rate fell back down to 9 percent against a slight variation on the original attack.

For its second defense attempt, the researchers iterated a multi-round “arms race” where new adversarial models discover novel exploits and new defensive models seek to plug up those newly discovered holes. After 10 rounds of such iterative training, the final defending algorithm still only won 19 percent of games against a final attacking algorithm that had discovered previously unseen variation on the exploit. This was true even as the updated algorithm maintained an edge against earlier attackers that it had been trained against in the past.

Go AI if they know the right algorithm-exploiting strategy.” height=”427″ src=”https://cdn.arstechnica.net/wp-content/uploads/2024/07/GettyImages-109417607-640×427.jpg” width=”640″>

Enlarge / Even a child can beat a world-class Go AI if they know the right algorithm-exploiting strategy.

Getty Images

In their final attempt, researchers tried a completely new type of training using vision transformers, in an attempt to avoid what might be “bad inductive biases” found in the convolutional neural networks that initially trained KataGo. This method also failed, winning only 22 percent of the time against a variation on the cyclic attack that “can be replicated by a human expert,” the researchers wrote.

Will anything work?

In all three defense attempts, the KataGo-beating adversaries didn’t represent some new, previously unseen height in general Go-playing ability. Instead, these attacking algorithms were laser-focused on discovering exploitable weaknesses in an otherwise performant AI algorithm, even if those simple attack strategies would lose to most human players.

Those exploitable holes highlight the importance of evaluating “worst-case” performance in AI systems, even when the “average-case” performance can seem downright superhuman. On average, KataGo can dominate even high-level human players using traditional strategies. But in the worst case, otherwise “weak” adversaries can find holes in the system that make it fall apart.

It’s easy to extend this kind of thinking to other types of generative AI systems. LLMs that can succeed at some complex creative and reference tasks might still utterly fail when confronted with trivial math problems (or even get “poisoned” by malicious prompts). Visual AI models that can describe and analyze complex photos may nonetheless fail horribly when presented with basic geometric shapes.

If you can solve these kinds of puzzles, you may have better visual reasoning than state-of-the-art AIs.

Enlarge / If you can solve these kinds of puzzles, you may have better visual reasoning than state-of-the-art AIs.

Improving these kinds of “worst case” scenarios is key to avoiding embarrassing mistakes when rolling an AI system out to the public. But this new research shows that determined “adversaries” can often discover new holes in an AI algorithm’s performance much more quickly and easily than that algorithm can evolve to fix those problems.

And if that’s true in Go—a monstrously complex game that nonetheless has tightly defined rules—it might be even more true in less controlled environments. “The key takeaway for AI is that these vulnerabilities will be difficult to eliminate,” FAR CEO Adam Gleave told Nature. “If we can’t solve the issue in a simple domain like Go, then in the near-term there seems little prospect of patching similar issues like jailbreaks in ChatGPT.”

Still, the researchers aren’t despairing. While none of their methods were able to “make [new] attacks impossible” in Go, their strategies were able to plug up unchanging “fixed” exploits that had been previously identified. That suggests “it may be possible to fully defend a Go AI by training against a large enough corpus of attacks,” they write, with proposals for future research that could make this happen.

Regardless, this new research shows that making AI systems more robust against worst-case scenarios might be at least as valuable as chasing new, more human/superhuman capabilities.

“Superhuman” Go AIs still have trouble defending against these simple exploits Read More »

can-you-do-better-than-top-level-ai-models-on-these-basic-vision-tests?

Can you do better than top-level AI models on these basic vision tests?

A bit myopic —

Abstract analysis that is trivial for humans often stymies GPT-4o, Gemini, and Sonnet.

Whatever you do, don't ask the AI how many horizontal lines are in this image.

Enlarge / Whatever you do, don’t ask the AI how many horizontal lines are in this image.

Getty Images

In the last couple of years, we’ve seen amazing advancements in AI systems when it comes to recognizing and analyzing the contents of complicated images. But a new paper highlights how many state-of-the-art “vision learning Models” (VLMs) often fail at simple, low-level visual analysis tasks that are trivially easy for a human.

In the provocatively titled pre-print paper “Vision language models are blind (which has a PDF version that includes a dark sunglasses emoji in the title), researchers from Auburn University and the University of Alberta create eight simple visual acuity tests with objectively correct answers. These range from identifying how often two colored lines intersect to identifying which letter in a long word has been circled to counting how many nested shapes exist in an image (representative examples and results can be viewed on the research team’s webpage).

  • If you can solve these kinds of puzzles, you may have better visual reasoning than state-of-the-art AIs.

  • The puzzles on the right are like something out of Highlights magazine.

  • A representative sample shows AI models failing at a task that most human children would find trivial.

Crucially, these tests are generated by custom code and don’t rely on pre-existing images or tests that could be found on the public Internet, thereby “minimiz[ing] the chance that VLMs can solve by memorization,” according to the researchers. The tests also “require minimal to zero world knowledge” beyond basic 2D shapes, making it difficult for the answer to be inferred from “textual question and choices alone” (which has been identified as an issue for some other visual AI benchmarks).

Are you smarter than a fifth grader?

After running multiple tests across four different visual models—GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5—the researchers found all four fell well short of the 100 percent accuracy you might expect for such simple visual analysis tasks (and which most sighted humans would have little trouble achieving). But the size of the AI underperformance varied greatly depending on the specific task. When asked to count the number of rows and columns in a blank grid, for instance, the best-performing model only gave an accurate answer less than 60 percent of the time. On the other hand, Gemini-1.5 Pro hit nearly 93 percent accuracy in identifying circled letters, approaching human-level performance.

  • For some reason, the models tend to incorrectly guess the “o” is circled a lot more often than all the other letters in this test.

  • The models performed perfectly in counting five interlocking circles, a pattern they might be familiar with from common images of the Olympic rings.

  • Do you have an easier time counting columns than rows in a grid? If so, you probably aren’t an AI.

Even small changes to the tasks could also lead to huge changes in results. While all four tested models were able to correctly identify five overlapping hollow circles, the accuracy across all models dropped to well below 50 percent when six to nine circles were involved. The researchers hypothesize that this “suggests that VLMs are biased towards the well-known Olympic logo, which has 5 circles.” In other cases, models occasionally hallucinated nonsensical answers, such as guessing “9,” “n”, or “©” as the circled letter in the word “Subdermatoglyphic.”

Overall, the results highlight how AI models that can perform well at high-level visual reasoning have some significant “blind spots” (sorry) when it comes to low-level abstract images. It’s all somewhat reminiscent of similar capability gaps that we often see in state-of-the-art large language models, which can create extremely cogent summaries of lengthy texts while at the same time failing extremely basic math and spelling questions.

These gaps in VLM capabilities could come down to the inability of these systems to generalize beyond the kinds of content they are explicitly trained on. Yet when the researchers tried fine-tuning a model using specific images drawn from one of their tasks (the “are two circles touching?” test), that model showed only modest improvement, from 17 percent accuracy up to around 37 percent. “The loss values for all these experiments were very close to zero, indicating that the model overfits the training set but fails to generalize,” the researchers write.

The researchers propose that the VLM capability gap may be related to the so-called “late fusion” of vision encoders onto pre-trained large language models. An “early fusion” training approach that integrates visual encoding alongside language training could lead to better results on these low-level tasks, the researchers suggest (without providing any sort of analysis of this question).

Can you do better than top-level AI models on these basic vision tests? Read More »

intuit’s-ai-gamble:-mass-layoff-of-1,800-paired-with-hiring-spree

Intuit’s AI gamble: Mass layoff of 1,800 paired with hiring spree

In the name of AI —

Intuit CEO: “Companies that aren’t prepared to take advantage of [AI] will fall behind.”

Signage for financial software company Intuit at the company's headquarters in the Silicon Valley town of Mountain View, California, August 24, 2016.

On Wednesday, Intuit CEO Sasan Goodarzi announced in a letter to the company that it would be laying off 1,800 employees—about 10 percent of its workforce of around 18,000—while simultaneously planning to hire the same number of new workers as part of a major restructuring effort purportedly focused on AI.

“As I’ve shared many times, the era of AI is one of the most significant technology shifts of our lifetime,” wrote Goodarzi in a blog post on Intuit’s website. “This is truly an extraordinary time—AI is igniting global innovation at an incredible pace, transforming every industry and company in ways that were unimaginable just a few years ago. Companies that aren’t prepared to take advantage of this AI revolution will fall behind and, over time, will no longer exist.”

The CEO says Intuit is in a position of strength and that the layoffs are not cost-cutting related, but they allow the company to “allocate additional investments to our most critical areas to support our customers and drive growth.” With new hires, the company expects its overall headcount to grow in its 2025 fiscal year.

Intuit’s layoffs (which collectively qualify as a “mass layoff” under the WARN act) hit various departments within the company, including closing Intuit’s offices in Edmonton, Canada, and Boise, Idaho, affecting over 250 employees. Approximately 1,050 employees will receive layoffs because they’re “not meeting expectations,” according to Goodarzi’s letter. Intuit has also eliminated more than 300 roles across the company to “streamline” operations and shift resources toward AI, and the company plans to consolidate 80 tech roles to “sites where we are strategically growing our technology teams and capabilities,” such as Atlanta, Bangalore, New York, Tel Aviv, and Toronto.

In turn, the company plans to accelerate investments in its AI-powered financial assistant, Intuit Assist, which provides AI-generated financial recommendations. The company also plans to hire new talent in engineering, product development, data science, and customer-facing roles, with a particular emphasis on AI expertise.

Not just about AI

Despite Goodarzi’s heavily AI-focused message, the restructuring at Intuit reveals a more complex picture. A closer look at the layoffs shows that many of the 1,800 job cuts stem from performance-based departures (such as the aforementioned 1,050). The restructuring also includes a 10 percent reduction in executive positions at the director level and above (“To continue increasing our velocity of decision making,” Goodarzi says).

These numbers suggest that the reorganization may also serve as an opportunity for Intuit to trim its workforce of underperforming staff, using the AI hype cycle as a compelling backdrop for a broader house-cleaning effort.

But as far as CEOs are concerned, it’s always a good time to talk about how they’re embracing the latest, hottest thing in technology: “With the introduction of GenAI,” Goodarzi wrote, “we are now delivering even more compelling customer experiences, increasing monetization potential, and driving efficiencies in how the work gets done within Intuit. But it’s just the beginning of the AI revolution.”

Intuit’s AI gamble: Mass layoff of 1,800 paired with hiring spree Read More »

court-ordered-penalties-for-15-teens-who-created-naked-ai-images-of-classmates

Court ordered penalties for 15 teens who created naked AI images of classmates

Real consequences —

Teens ordered to attend classes on sex education and responsible use of AI.

Court ordered penalties for 15 teens who created naked AI images of classmates

A Spanish youth court has sentenced 15 minors to one year of probation after spreading AI-generated nude images of female classmates in two WhatsApp groups.

The minors were charged with 20 counts of creating child sex abuse images and 20 counts of offenses against their victims’ moral integrity. In addition to probation, the teens will also be required to attend classes on gender and equality, as well as on the “responsible use of information and communication technologies,” a press release from the Juvenile Court of Badajoz said.

Many of the victims were too ashamed to speak up when the inappropriate fake images began spreading last year. Prior to the sentencing, a mother of one of the victims told The Guardian that girls like her daughter “were completely terrified and had tremendous anxiety attacks because they were suffering this in silence.”

The court confirmed that the teens used artificial intelligence to create images where female classmates “appear naked” by swiping photos from their social media profiles and superimposing their faces on “other naked female bodies.”

Teens using AI to sexualize and harass classmates has become an alarming global trend. Police have probed disturbing cases in both high schools and middle schools in the US, and earlier this year, the European Union proposed expanding its definition of child sex abuse to more effectively “prosecute the production and dissemination of deepfakes and AI-generated material.” Last year, US President Joe Biden issued an executive order urging lawmakers to pass more protections.

In addition to mental health impacts, victims have reported losing trust in classmates who targeted them and wanting to switch schools to avoid further contact with harassers. Others stopped posting photos online and remained fearful that the harmful AI images will resurface.

Minors targeting classmates may not realize exactly how far images can potentially spread when generating fake child sex abuse materials (CSAM); they could even end up on the dark web. An investigation by the United Kingdom-based Internet Watch Foundation (IWF) last year reported that “20,254 AI-generated images were found to have been posted to one dark web CSAM forum in a one-month period,” with more than half determined most likely to be criminal.

IWF warned that it has identified a growing market for AI-generated CSAM and concluded that “most AI CSAM found is now realistic enough to be treated as ‘real’ CSAM.” One “shocked” mother of a female classmate victimized in Spain agreed. She told The Guardian that “if I didn’t know my daughter’s body, I would have thought that image was real.”

More drastic steps to stop deepfakes

While lawmakers struggle to apply existing protections against CSAM to AI-generated images or to update laws to explicitly prosecute the offense, other more drastic solutions to prevent the harmful spread of deepfakes have been proposed.

In an op-ed for The Guardian today, journalist Lucia Osborne-Crowley advocated for laws restricting sites used to both generate and surface deepfake pornography, including regulating this harmful content when it appears on social media sites and search engines. And IWF suggested that, like jurisdictions that restrict sharing bomb-making information, lawmakers could also restrict guides instructing bad actors on how to use AI to generate CSAM.

The Malvaluna Association, which represented families of victims in Spain and broadly advocates for better sex education, told El Diario that beyond more regulations, more education is needed to stop teens motivated to use AI to attack classmates. Because the teens were ordered to attend classes, the association agreed to the sentencing measures.

“Beyond this particular trial, these facts should make us reflect on the need to educate people about equality between men and women,” the Malvaluna Association said. The group urged that today’s kids should not be learning about sex through pornography that “generates more sexism and violence.”

Teens sentenced in Spain were between the ages of 13 and 15. According to the Guardian, Spanish law prevented sentencing of minors under 14, but the youth court “can force them to take part in rehabilitation courses.”

Tech companies could also make it easier to report and remove harmful deepfakes. Ars could not immediately reach Meta for comment on efforts to combat the proliferation of AI-generated CSAM on WhatsApp, the private messaging app that was used to share fake images in Spain.

An FAQ said that “WhatsApp has zero tolerance for child sexual exploitation and abuse, and we ban users when we become aware they are sharing content that exploits or endangers children,” but it does not mention AI.

Court ordered penalties for 15 teens who created naked AI images of classmates Read More »

three-betas-in,-ios-18-testers-still-can’t-try-out-apple-intelligence-features

Three betas in, iOS 18 testers still can’t try out Apple Intelligence features

intel inside? —

Apple has said some features will be available to test “this summer.”

Three betas in, iOS 18 testers still can’t try out Apple Intelligence features

Apple

The beta-testing cycle for Apple’s latest operating system updates is in full swing—earlier this week, the third developer betas rolled out for iOS 18, iPadOS 18, macOS 15 Sequoia, and the rest of this fall’s updates. The fourth developer beta ought to be out in a couple of weeks, and it’s reasonably likely to coincide with the first betas that Apple offers to the full public (though the less-stable developer-only betas got significantly more public last year when Apple stopped making people pay for a developer account to access them).

Many of the new updates’ features are present and available to test, including cosmetic updates and under-the-hood improvements. But none of Apple’s much-hyped Apple Intelligence features are available to test in any form. MacRumors reports that Settings menus for the Apple Intelligence features have appeared in the Xcode Simulator for current versions of iOS 18 but, as of now, those settings still appear to be non-functional placeholders that don’t actually do anything.

That may change soon; Apple did say that the first wave of Apple Intelligence features would be available “this summer,” and I would wager a small amount of money on the first ones being available in the public beta builds later this month. But the current state of the betas does reinforce reporting from Bloomberg’s Mark Gurman that suggested Apple was “caught flat-footed” by the tech world’s intense interest in generative AI.

Even when they do arrive, the Apple Intelligence features will be rolled out gradually. Some will be available earlier than others—Gurman recently reported that the new Siri, specifically, might not be available for testing until January and might not actually be ready to launch until sometime in early 2025. The first wave of features will only work in US English, and only relatively recent Apple hardware will be capable of using most of them. For now, that means iPads and Macs with an M-series chip, or the iPhone 15 Pro, though presumably this year’s new crop of Pro and non-Pro iPhones will all be Apple Intelligence-compatible.

Apple’s relatively slow rollout of generative AI features isn’t necessarily a bad thing. Look at Microsoft, which has been repeatedly burned by its desire to rush AI-powered features into its Bing search engine, Edge browser, and Windows operating system. Windows 11’s Recall feature, a comprehensive database of screenshots and text tracking everything that users do on their PCs, was announced and then delayed multiple times after security researchers and other testers demonstrated how it could put users’ personal data at risk.

Three betas in, iOS 18 testers still can’t try out Apple Intelligence features Read More »

in-bid-to-loosen-nvidia’s-grip-on-ai,-amd-to-buy-finnish-startup-for-$665m

In bid to loosen Nvidia’s grip on AI, AMD to buy Finnish startup for $665M

AI tech stack —

The acquisition is the largest of its kind in Europe in a decade.

In bid to loosen Nvidia’s grip on AI, AMD to buy Finnish startup for $665M

AMD is to buy Finnish artificial intelligence startup Silo AI for $665 million in one of the largest such takeovers in Europe as the US chipmaker seeks to expand its AI services to compete with market leader Nvidia.

California-based AMD said Silo’s 300-member team would use its software tools to build custom large language models (LLMs), the kind of AI technology that underpins chatbots such as OpenAI’s ChatGPT and Google’s Gemini. The all-cash acquisition is expected to close in the second half of this year, subject to regulatory approval.

“This agreement helps us both accelerate our customer engagements and deployments while also helping us accelerate our own AI tech stack,” Vamsi Boppana, senior vice president of AMD’s artificial intelligence group, told the Financial Times.

The acquisition is the largest of a privately held AI startup in Europe since Google acquired UK-based DeepMind for around 400 million pounds in 2014, according to data from Dealroom.

The deal comes at a time when buyouts by Silicon Valley companies have come under tougher scrutiny from regulators in Brussels and the UK. Europe-based AI startups, including Mistral, DeepL, and Helsing, have raised hundreds of millions of dollars this year as investors seek out a local champion to rival US-based OpenAI and Anthropic.

Helsinki-based Silo AI, which is among the largest private AI labs in Europe, offers tailored AI models and platforms to enterprise customers. The Finnish company launched an initiative last year to build LLMs in European languages, including Swedish, Icelandic, and Danish.

AMD’s AI technology competes with that of Nvidia, which has taken the lion’s share of the high-performance chip market. Nvidia’s success has propelled its valuation past $3 trillion this year as tech companies push to build the computing infrastructure needed to power the biggest AI models. AMD started to roll out its MI300 chips late last year in a direct challenge to Nvidia’s “Hopper” line of chips.

Peter Sarlin, Silo AI co-founder and chief executive, called the acquisition the “logical next step” as the Finnish group seeks to become a “flagship” AI company.

Silo AI is committed to “open source” AI models, which are available for free and can be customized by anyone. This distinguishes it from the likes of OpenAI and Google, which favor their own proprietary or “closed” models.

The startup previously described its family of open models, called “Poro,” as an important step toward “strengthening European digital sovereignty” and democratizing access to LLMs.

The concentration of the most powerful LLMs into the hands of a few US-based Big Tech companies is meanwhile attracting attention from antitrust regulators in Washington and Brussels.

The Silo deal shows AMD seeking to scale its business quickly and drive customer engagement with its own offering. AMD views Silo, which builds custom models for clients, as a link between its “foundational” AI software and the real-world applications of the technology.

Software has become a new battleground for semiconductor companies as they try to lock in customers to their hardware and generate more predictable revenues, outside the boom-and-bust chip sales cycle.

Nvidia’s success in the AI market stems from its multibillion-dollar investment in Cuda, its proprietary software that allows chips originally designed for processing computer graphics and video games to run a wider range of applications.

Since starting to develop Cuda in 2006, Nvidia has expanded its software platform to include a range of apps and services, largely aimed at corporate customers that lack the in-house resources and skills that Big Tech companies have to build on its technology.

Nvidia now offers more than 600 “pre-trained” models, meaning they are simpler for customers to deploy. The Santa Clara, California-based group last month started rolling out a “microservices” platform, called NIM, which promises to let developers build chatbots and AI “co-pilot” services quickly.

Historically, Nvidia has offered its software free of charge to buyers of its chips, but said this year that it planned to charge for products such as NIM.

AMD is among several companies contributing to the development of an OpenAI-led rival to Cuda, called Triton, which would let AI developers switch more easily between chip providers. Meta, Microsoft, and Intel have also worked on Triton.

© 2024 The Financial Times Ltd. All rights reserved. Please do not copy and paste FT articles and redistribute by email or post to the web.

In bid to loosen Nvidia’s grip on AI, AMD to buy Finnish startup for $665M Read More »

chatgpt’s-much-heralded-mac-app-was-storing-conversations-as-plain-text

ChatGPT’s much-heralded Mac app was storing conversations as plain text

Seriously? —

The app was updated to address the issue after it gained public attention.

A message field for ChatGPT pops up over a Mac desktop

Enlarge / The app lets you invoke ChatGPT from anywhere in the system with a keyboard shortcut, Spotlight-style.

Samuel Axon

OpenAI announced its Mac desktop app for ChatGPT with a lot of fanfare a few weeks ago, but it turns out it had a rather serious security issue: user chats were stored in plain text, where any bad actor could find them if they gained access to your machine.

As Threads user Pedro José Pereira Vieito noted earlier this week, “the OpenAI ChatGPT app on macOS is not sandboxed and stores all the conversations in plain-text in a non-protected location,” meaning “any other running app / process / malware can read all your ChatGPT conversations without any permission prompt.”

He added:

macOS has blocked access to any user private data since macOS Mojave 10.14 (6 years ago!). Any app accessing private user data (Calendar, Contacts, Mail, Photos, any third-party app sandbox, etc.) now requires explicit user access.

OpenAI chose to opt-out of the sandbox and store the conversations in plain text in a non-protected location, disabling all of these built-in defenses.

OpenAI has now updated the app, and the local chats are now encrypted, though they are still not sandboxed. (The app is only available as a direct download from OpenAI’s website and is not available through Apple’s App Store where more stringent security is required.)

Many people now use ChatGPT like they might use Google: to ask important questions, sort through issues, and so on. Often, sensitive personal data could be shared in those conversations.

It’s not a great look for OpenAI, which recently entered into a partnership with Apple to offer chat bot services built into Siri queries in Apple operating systems. Apple detailed some of the security around those queries at WWDC last month, though, and they’re more stringent than what OpenAI did (or to be more precise, didn’t do) with its Mac app, which is a separate initiative from the partnership.

If you’ve been using the app recently, be sure to update it as soon as possible.

ChatGPT’s much-heralded Mac app was storing conversations as plain text Read More »