large language models

Elon Musk claims he is training “the world’s most powerful AI by every metric”

the biggest, most powerful —

One snag: xAI might not have the electrical power contracts to do it.

Elon Musk, chief executive officer of Tesla Inc., during a fireside discussion on artificial intelligence risks with Rishi Sunak, UK prime minister, in London, UK, on Thursday, Nov. 2, 2023.

On Monday, Elon Musk announced the start of training for what he calls “the world’s most powerful AI training cluster” at xAI’s new supercomputer facility in Memphis, Tennessee. The billionaire entrepreneur and CEO of multiple tech companies took to X (formerly Twitter) to share that the so-called “Memphis Supercluster” began operations at approximately 4:20 am local time that day.

Musk’s xAI team, in collaboration with X and Nvidia, launched the supercomputer cluster featuring 100,000 liquid-cooled H100 GPUs on a single RDMA fabric. This setup, according to Musk, gives xAI “a significant advantage in training the world’s most powerful AI by every metric by December this year.”

Given issues with xAI’s Grok chatbot throughout the year, skeptics would be justified in questioning whether those claims will match reality, especially considering Musk’s penchant for grandiose, off-the-cuff remarks on the social media platform he runs.

Power issues

According to a report by News Channel 3 WREG Memphis, the startup of the massive AI training facility marks a milestone for the city. WREG reports that xAI’s investment represents the largest capital investment by a new company in Memphis’s history. However, the project has raised questions among local residents and officials about its impact on the area’s power grid and infrastructure.

WREG reports that Doug McGowen, president of Memphis Light, Gas and Water (MLGW), previously stated that xAI could consume up to 150 megawatts of power at peak times. This substantial power requirement has prompted discussions with the Tennessee Valley Authority (TVA) regarding the project’s electricity demands and connection to the power system.
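
For rough context, here is a back-of-the-envelope sketch of where a number like that comes from. This is our own arithmetic, not a figure from xAI or MLGW, and it assumes the roughly 700 W maximum draw Nvidia lists for an SXM-class H100 plus a generic overhead factor for host servers, networking, and cooling:

```python
# Back-of-the-envelope power estimate (our own arithmetic, not xAI/MLGW figures).
# Assumes ~700 W per H100 SXM GPU and a rough 1.5x overhead factor for host
# servers, networking, and cooling.
gpus = 100_000
watts_per_gpu = 700          # assumed maximum draw per H100 SXM
overhead_factor = 1.5        # assumed facility overhead

gpu_only_mw = gpus * watts_per_gpu / 1_000_000
facility_mw = gpu_only_mw * overhead_factor

print(f"GPUs alone: ~{gpu_only_mw:.0f} MW")      # ~70 MW
print(f"With overhead: ~{facility_mw:.0f} MW")   # ~105 MW
```

Even under those rough assumptions, the GPUs alone land in the tens of megawatts, so a peak figure of 150 megawatts for the full facility is not surprising.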

The TVA told the local news station, “TVA does not have a contract in place with xAI. We are working with xAI and our partners at MLGW on the details of the proposal and electricity demand needs.”

The local news outlet confirms that MLGW has said xAI moved into an existing building with utility services already in place, but the full extent of the company’s power usage and its potential effects on local utilities remain unclear. To address community concerns, WREG reports that MLGW plans to host public forums in the coming days to provide more information about the project and its implications for the city.

For now, Tom’s Hardware reports that Musk is sidestepping power issues by installing a fleet of 14 VoltaGrid natural gas generators that provide supplementary power to the Memphis computer cluster while his company works out an agreement with the local power utility.

As training at the Memphis Supercluster gets underway, all eyes are on xAI and Musk’s ambitious goal of developing the world’s most powerful AI by the end of the year (by which metric, we are uncertain), given the competitive landscape in AI at the moment between OpenAI/Microsoft, Amazon, Apple, Anthropic, and Google. If such an AI model emerges from xAI, we’ll be ready to write about it.

This article was updated on July 24, 2024 at 1:11 pm to mention Musk installing natural gas generators onsite in Memphis.

The first GPT-4-class AI model anyone can download has arrived: Llama 405B

A new llama emerges —

“Open source AI is the path forward,” says Mark Zuckerberg, misusing the term.

A red llama in a blue desert illustration based on a photo.

In the AI world, there’s a buzz in the air about a new AI language model released Tuesday by Meta: Llama 3.1 405B. The reason? It’s potentially the first time anyone can download a GPT-4-class large language model (LLM) for free and run it on their own hardware. You’ll still need some beefy hardware: Meta says it can run on a “single server node,” which isn’t desktop PC-grade equipment. But it’s a provocative shot across the bow of “closed” AI model vendors such as OpenAI and Anthropic.

“Llama 3.1 405B is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation,” says Meta. Company CEO Mark Zuckerberg calls 405B “the first frontier-level open source AI model.”

In the AI industry, “frontier model” is a term for an AI system designed to push the boundaries of current capabilities. In this case, Meta is positioning 405B among the likes of the industry’s top AI models, such as OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro.

A chart published by Meta suggests that 405B gets very close to matching the performance of GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in benchmarks like MMLU (undergraduate level knowledge), GSM8K (grade school math), and HumanEval (coding).

But as we’ve noted many times since March, these benchmarks aren’t necessarily scientifically sound, and they don’t necessarily translate to the subjective experience of interacting with AI language models. In fact, this traditional slate of AI benchmarks is so generally useless to laypeople that even Meta’s PR department now just posts a few images of charts and doesn’t even try to explain them in any detail.

A Meta-provided chart that shows Llama 3.1 405B benchmark results versus other major AI models.

We’ve instead found that measuring the subjective experience of using a conversational AI model (through what might be called “vibemarking”) on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs. In the absence of Chatbot Arena data, Meta has provided the results of its own human evaluations of 405B’s outputs that seem to show Meta’s new model holding its own against GPT-4 Turbo and Claude 3.5 Sonnet.

A Meta-provided chart that shows how humans rated Llama 3.1 405B’s outputs compared to GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in its own studies.

Whatever the benchmarks, early word on the street (after the model leaked on 4chan yesterday) seems to match the claim that 405B is roughly equivalent to GPT-4. It took a lot of expensive computer training time to get there—and money, of which the social media giant has plenty to burn. Meta trained the 405B model on over 15 trillion tokens of training data scraped from the web (then parsed, filtered, and annotated by Llama 2), using more than 16,000 H100 GPUs.

So what’s with the 405B name? In this case, “405B” means 405 billion parameters, and parameters are numerical values that store trained information in a neural network. More parameters translate to a larger neural network powering the AI model, which generally (but not always) means more capability, such as better ability to make contextual connections between concepts. But larger-parameter models have a tradeoff in needing more computing power (AKA “compute”) to run.
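
To put the hardware demands in concrete terms, here is a quick sketch of the memory math, assuming the common case of 16-bit (2-byte) weights; the exact footprint depends on the precision or quantization actually used:

```python
# Rough weight-memory math for a 405-billion-parameter model (our own
# arithmetic; assumes 2 bytes per parameter, i.e., 16-bit weights).
params = 405e9
bytes_per_param = 2

weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB just to store the weights")             # ~810 GB
print(f"~{weight_gb / 80:.0f} H100-class 80 GB GPUs to hold them")  # ~10 GPUs
```

That footprint is why Meta talks about a “single server node” rather than a desktop PC.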

We’ve been expecting the release of a 400 billion-plus parameter model of the Llama 3 family since Meta gave word that it was training one in April, and today’s announcement isn’t just about the biggest member of the Llama 3 family: There’s an entirely new iteration of improved Llama models with the designation “Llama 3.1.” That includes upgraded versions of its smaller 8B and 70B models, which now feature multilingual support and an extended context length of 128,000 tokens (the “context length” is roughly the working memory capacity of the model, and “tokens” are chunks of data used by LLMs to process information).

Meta says that 405B is useful for long-form text summarization, multilingual conversational agents, and coding assistance, as well as for creating synthetic data used to train future AI language models. Notably, that last use case—allowing developers to use outputs from Llama models to improve other AI models—is now officially supported by Meta’s Llama 3.1 license for the first time.

Abusing the term “open source”

Llama 3.1 405B is an open-weights model, which means anyone can download the trained neural network files and run them or fine-tune them. That directly challenges a business model where companies like OpenAI keep the weights to themselves and instead monetize the model through subscription wrappers like ChatGPT or charge for access by the token through an API.
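
To illustrate what “open weights” means in practice, here is a minimal sketch of loading one of the smaller Llama 3.1 checkpoints with the Hugging Face transformers library; the repository name is an assumption based on how earlier Llama releases were published, and access is gated behind Meta’s license agreement:

```python
# Minimal sketch of running an open-weights Llama 3.1 model locally.
# The repo id is assumed; downloading it requires accepting Meta's license.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # spread layers across available GPUs/CPU
)

prompt = "Explain what an open-weights model is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The point is that the weights live on your own disk, so fine-tuning or offline use doesn’t depend on a vendor’s servers, which is the opposite of the subscription-and-API approach described above.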

Fighting the “closed” AI model approach is a big deal to Mark Zuckerberg, who simultaneously released a 2,300-word manifesto today on why the company believes in open releases of AI models, titled “Open Source AI Is the Path Forward.” More on the terminology in a minute. But briefly, he writes about the need for customizable AI models that offer user control and encourage better data security, higher cost-efficiency, and better future-proofing, as opposed to vendor-locked solutions.

All that sounds reasonable, but undermining your competitors using a model subsidized by a social media war chest is also an efficient way to play spoiler in a market where you might not always win with the most cutting-edge tech. That benefits Meta, Zuckerberg says, because he doesn’t want to get locked into a system where companies like his have to pay a toll to access AI capabilities, drawing comparisons to “taxes” Apple levies on developers through its App Store.

A screenshot of Mark Zuckerberg’s essay, “Open Source AI Is the Path Forward,” published on July 23, 2024.

So, about that “open source” term. As we first wrote in an update to our Llama 2 launch article a year ago, “open source” has a very particular meaning that has traditionally been defined by the Open Source Initiative. The AI industry has not yet settled on terminology for AI model releases that ship either code or weights with restrictions (such as Llama 3.1) or that ship without providing training data. We’ve been calling these releases “open weights” instead.

Unfortunately for terminology sticklers, Zuckerberg has now baked the erroneous “open source” label into the title of his potentially historic aforementioned essay on open AI releases, so fighting for the correct term in AI may be a losing battle. Still, his usage annoys people like independent AI researcher Simon Willison, who likes Zuckerberg’s essay otherwise.

“I see Zuck’s prominent misuse of ‘open source’ as a small-scale act of cultural vandalism,” Willison told Ars Technica. “Open source should have an agreed meaning. Abusing the term weakens that meaning which makes the term less generally useful, because if someone says ‘it’s open source,’ that no longer tells me anything useful. I have to then dig in and figure out what they’re actually talking about.”

The Llama 3.1 models are available for download through Meta’s own website and on Hugging Face. Both require providing contact information and agreeing to a license and an acceptable use policy, which means that Meta can technically (and legally) pull the rug out from under your use of Llama 3.1 or its outputs at any time.

AI trains on kids’ photos even when parents use strict privacy settings

“Outrageous” —

Even unlisted YouTube videos are used to train AI, watchdog warns.

Human Rights Watch (HRW) continues to reveal how photos of real children casually posted online years ago are being used to train AI models powering image generators—even when platforms prohibit scraping and families use strict privacy settings.

Last month, HRW researcher Hye Jung Han found 170 photos of Brazilian kids that were linked in LAION-5B, a popular AI dataset built from Common Crawl snapshots of the public web. Now, she has released a second report, flagging 190 photos of children from all of Australia’s states and territories, including indigenous children who may be particularly vulnerable to harms.

These photos are linked in the dataset “without the knowledge or consent of the children or their families.” They span the entirety of childhood, making it possible for AI image generators to generate realistic deepfakes of real Australian children, Han’s report said. Perhaps even more concerning, the URLs in the dataset sometimes reveal identifying information about children, including their names and locations where photos were shot, making it easy to track down children whose images might not otherwise be discoverable online.

That puts children in danger of privacy and safety risks, Han said, and some parents thinking they’ve protected their kids’ privacy online may not realize that these risks exist.

From a single link to one photo that showed “two boys, ages 3 and 4, grinning from ear to ear as they hold paintbrushes in front of a colorful mural,” Han could trace “both children’s full names and ages, and the name of the preschool they attend in Perth, in Western Australia.” And perhaps most disturbingly, “information about these children does not appear to exist anywhere else on the Internet”—suggesting that families were particularly cautious in shielding these boys’ identities online.

Stricter privacy settings were used in another image that Han found linked in the dataset. The photo showed “a close-up of two boys making funny faces, captured from a video posted on YouTube of teenagers celebrating” during the week after their final exams, Han reported. Whoever posted that YouTube video adjusted privacy settings so that it would be “unlisted” and would not appear in searches.

Only someone with a link to the video was supposed to have access, but that didn’t stop Common Crawl from archiving the image, nor did YouTube policies prohibiting AI scraping or harvesting of identifying information.

Reached for comment, YouTube’s spokesperson, Jack Malon, told Ars that YouTube has “been clear that the unauthorized scraping of YouTube content is a violation of our Terms of Service, and we continue to take action against this type of abuse.” But Han worries that even if YouTube did join efforts to remove images of children from the dataset, the damage has been done, since AI tools have already trained on them. That’s why—even more than parents need tech companies to up their game blocking AI training—kids need regulators to intervene and stop training before it happens, Han’s report said.

Han’s report comes a month before Australia is expected to release a reformed draft of the country’s Privacy Act. Those reforms include a draft of Australia’s first child data protection law, known as the Children’s Online Privacy Code, but Han told Ars that even people involved in long-running discussions about reforms aren’t “actually sure how much the government is going to announce in August.”

“Children in Australia are waiting with bated breath to see if the government will adopt protections for them,” Han said, emphasizing in her report that “children should not have to live in fear that their photos might be stolen and weaponized against them.”

AI uniquely harms Australian kids

To hunt down the photos of Australian kids, Han “reviewed fewer than 0.0001 percent of the 5.85 billion images and captions contained in the data set.” Because her sample was so small, Han expects that her findings represent a significant undercount of how many children could be impacted by the AI scraping.

“It’s astonishing that out of a random sample size of about 5,000 photos, I immediately fell into 190 photos of Australian children,” Han told Ars. “You would expect that there would be more photos of cats than there are personal photos of children,” since LAION-5B is a “reflection of the entire Internet.”

LAION is working with HRW to remove links to all the images flagged, but cleaning up the dataset does not seem to be a fast process. Han told Ars that based on her most recent exchange with the German nonprofit, LAION had not yet removed links to photos of Brazilian kids that she reported a month ago.

LAION declined Ars’ request for comment.

In June, LAION’s spokesperson, Nathan Tyler, told Ars that, “as a nonprofit, volunteer organization,” LAION is committed to doing its part to help with the “larger and very concerning issue” of misuse of children’s data online. But removing links from the LAION-5B dataset does not remove the images online, Tyler noted, where they can still be referenced and used in other AI datasets, particularly those relying on Common Crawl. And Han pointed out that removing the links from the dataset doesn’t change AI models that have already trained on them.

“Current AI models cannot forget data they were trained on, even if the data was later removed from the training data set,” Han’s report said.

Kids whose images are used to train AI models are exposed to a variety of harms, Han reported, including a risk that image generators could more convincingly create harmful or explicit deepfakes. In Australia last month, “about 50 girls from Melbourne reported that photos from their social media profiles were taken and manipulated using AI to create sexually explicit deepfakes of them, which were then circulated online,” Han reported.

For First Nations children—”including those identified in captions as being from the Anangu, Arrernte, Pitjantjatjara, Pintupi, Tiwi, and Warlpiri peoples”—the inclusion of links to photos threatens unique harms. Because culturally, First Nations peoples “restrict the reproduction of photos of deceased people during periods of mourning,” Han said the AI training could perpetuate harms by making it harder to control when images are reproduced.

Once an AI model trains on the images, there are other obvious privacy risks, including a concern that AI models are “notorious for leaking private information,” Han said. Guardrails added to image generators do not always prevent these leaks, with some tools “repeatedly broken,” Han reported.

LAION recommends that, if troubled by the privacy risks, parents remove images of kids online as the most effective way to prevent abuse. But Han told Ars that’s “not just unrealistic, but frankly, outrageous.”

“The answer is not to call for children and parents to remove wonderful photos of kids online,” Han said. “The call should be [for] some sort of legal protections for these photos, so that kids don’t have to always wonder if their selfie is going to be abused.”

Google Translate just nearly doubled its number of supported languages

Large language models —

This includes common languages like Cantonese and lesser-known ones like Manx.

The logo for PaLM 2, a Google large language model.

Google

Google announced today that it has added support for 110 new languages to Google Translate, nearly doubling the number of languages that can be translated. The company used the PaLM 2 large language model to facilitate these additions.

In a blog post, Google Senior Software Engineer Isaac Caswell claimed that the newly added languages are spoken by more than 614 million people, or about 8 percent of the global population. He noted that about a quarter of the languages originate in Africa, “representing our largest expansion of African languages to date.”

The blog post also went into some light detail about Google’s philosophy for choosing languages and for deciding which dialects to support:

Languages have an immense amount of variation: regional varieties, dialects, different spelling standards. In fact, many languages have no one standard form, so it’s impossible to pick a “right” variety. Our approach has been to prioritize the most commonly used varieties of each language. For example, Romani is a language that has many dialects all throughout Europe. Our models produce text that is closest to Southern Vlax Romani, a commonly used variety online. But it also mixes in elements from others, like Northern Vlax and Balkan Romani.

This update brings the total number of languages supported by Google Translate to 243, which is just the beginning of its publicized initiative to ultimately support 1,000 languages through the use of AI. You can see the full list of languages added in a help page published by Google.

By contrast, Apple Translate supports 21 languages, though that number includes both US and UK English as distinct options. Apple recently announced plans to add Hindi to its Translate app. Of course, Apple and Google take very different approaches to—and have different levels of investment in—these tools.

Anthropic introduces Claude 3.5 Sonnet, matching GPT-4o on benchmarks

The Anthropic Claude 3 logo, jazzed up by Benj Edwards.

Anthropic / Benj Edwards

On Thursday, Anthropic announced Claude 3.5 Sonnet, its latest AI language model and the first in a new series of “3.5” models that build upon Claude 3, launched in March. Claude 3.5 can compose text, analyze data, and write code. It features a 200,000 token context window and is available now on the Claude website and through an API. Anthropic also introduced Artifacts, a new feature in the Claude interface that shows related work documents in a dedicated window.

So far, people outside of Anthropic seem impressed. “This model is really, really good,” wrote independent AI researcher Simon Willison on X. “I think this is the new best overall model (and both faster and half the price of Opus, similar to the GPT-4 Turbo to GPT-4o jump).”

As we’ve written before, benchmarks for large language models (LLMs) are troublesome because they can be cherry-picked and often do not capture the feel and nuance of using a machine to generate outputs on almost any conceivable topic. But according to Anthropic, Claude 3.5 Sonnet matches or outperforms competitor models like GPT-4o and Gemini 1.5 Pro on certain benchmarks like MMLU (undergraduate level knowledge), GSM8K (grade school math), and HumanEval (coding).

Claude 3.5 Sonnet benchmarks provided by Anthropic.

If all that makes your eyes glaze over, that’s OK; it’s meaningful to researchers but mostly marketing to everyone else. A more useful performance metric comes from what we might call “vibemarks” (coined here first!), which are subjective, non-rigorous aggregate feelings measured by competitive usage on sites like LMSYS’s Chatbot Arena. The Claude 3.5 Sonnet model is currently under evaluation there, and it’s too soon to say how well it will fare.

Claude 3.5 Sonnet also outperforms Anthropic’s previous-best model (Claude 3 Opus) on benchmarks measuring “reasoning,” math skills, general knowledge, and coding abilities. For example, the model demonstrated strong performance in an internal coding evaluation, solving 64 percent of problems compared to 38 percent for Claude 3 Opus.

Claude 3.5 Sonnet is also a multimodal AI model that accepts visual input in the form of images, and the new model is reportedly excellent at a battery of visual comprehension tests.

Claude 3.5 Sonnet benchmarks provided by Anthropic.

Roughly speaking, the visual benchmarks mean that 3.5 Sonnet is better at pulling information from images than previous models. For example, you can show it a picture of a rabbit wearing a football helmet, and the model knows it’s a rabbit wearing a football helmet and can talk about it. That’s fun for tech demos, but the technology is still not accurate enough for applications where reliability is mission critical.

Researchers describe how to tell if ChatGPT is confabulating

Aurich Lawson | Getty Images

It’s one of the world’s worst-kept secrets that large language models give blatantly false answers to queries and do so with a confidence that’s indistinguishable from when they get things right. There are a number of reasons for this. The AI could have been trained on misinformation; the answer could require some extrapolation from facts that the LLM isn’t capable of making; or some aspect of the LLM’s training might have incentivized a falsehood.

But perhaps the simplest explanation is that an LLM doesn’t recognize what constitutes a correct answer but is compelled to provide one. So it simply makes something up, a habit that has been termed confabulation.

Figuring out when an LLM is making something up would obviously have tremendous value, given how quickly people have started relying on them for everything from college essays to job applications. Now, researchers from the University of Oxford say they’ve found a relatively simple way to determine when LLMs appear to be confabulating that works with all popular models and across a broad range of subjects. And, in doing so, they develop evidence that most of the alternative facts LLMs provide are a product of confabulation.

Catching confabulation

The new research is strictly about confabulations, and not instances such as training on false inputs. As the Oxford team defines them in their paper describing the work, confabulations are where “LLMs fluently make claims that are both wrong and arbitrary—by which we mean that the answer is sensitive to irrelevant details such as random seed.”

The reasoning behind their work is actually quite simple. LLMs aren’t trained for accuracy; they’re simply trained on massive quantities of text and learn to produce human-sounding phrasing through that. If enough text examples in its training consistently present something as a fact, then the LLM is likely to present it as a fact. But if the examples in its training are few, or inconsistent in their facts, then the LLM synthesizes a plausible-sounding answer that is likely incorrect.

But the LLM could also run into a similar situation when it has multiple options for phrasing the right answer. To use an example from the researchers’ paper, “Paris,” “It’s in Paris,” and “France’s capital, Paris” are all valid answers to “Where’s the Eiffel Tower?” So, statistical uncertainty, termed entropy in this context, can arise either when the LLM isn’t certain about how to phrase the right answer or when it can’t identify the right answer.

This means it’s not a great idea to simply force the LLM to return “I don’t know” when confronted with several roughly equivalent answers. We’d probably block a lot of correct answers by doing so.

So instead, the researchers focus on what they call semantic entropy. This evaluates all the statistically likely answers generated by the LLM and determines how many of them are semantically equivalent. If a large number all have the same meaning, then the LLM is likely uncertain about phrasing but has the right answer. If not, then it is presumably in a situation where it would be prone to confabulation and should be prevented from doing so.
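
As a rough illustration of the idea (not the Oxford team’s actual code), the procedure can be sketched in a few lines: sample several answers, group the ones that mean the same thing, and compute entropy over the groups rather than over the raw strings. The equivalence check below is a crude stand-in for the bidirectional-entailment test the researchers use:

```python
# Illustrative sketch of semantic entropy, not the Oxford team's implementation.
# `answers_equivalent` stands in for a semantic-equivalence judge.
import math

def semantic_entropy(sampled_answers, answers_equivalent):
    # Greedily cluster answers that share a meaning.
    clusters = []
    for answer in sampled_answers:
        for cluster in clusters:
            if answers_equivalent(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])

    # Shannon entropy over meaning-clusters instead of over raw strings.
    n = len(sampled_answers)
    probs = [len(cluster) / n for cluster in clusters]
    return -sum(p * math.log(p) for p in probs)

def same_meaning(a, b):
    return "paris" in a.lower() and "paris" in b.lower()  # toy stand-in check

# "Paris" phrased three ways collapses into one cluster, so entropy is zero:
# the model is uncertain about wording, not about the answer itself.
answers = ["Paris", "It's in Paris", "France's capital, Paris"]
print(semantic_entropy(answers, same_meaning))  # 0.0 (prints -0.0, signed zero)
```

High semantic entropy, by contrast, means the sampled answers disagree in meaning, which is the signal the researchers treat as likely confabulation.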

Report: Apple isn’t paying OpenAI for ChatGPT integration into OSes

in the pocket —

Apple thinks pushing OpenAI’s brand to hundreds of millions is worth more than money.

The OpenAI and Apple logos together.

OpenAI / Apple / Benj Edwards

On Monday, Apple announced it would be integrating OpenAI’s ChatGPT AI assistant into upcoming versions of its iPhone, iPad, and Mac operating systems. It paves the way for future third-party AI model integrations, but given Google’s multi-billion-dollar deal with Apple for preferential web search, the OpenAI announcement inspired speculation about who is paying whom. According to a Bloomberg report published Wednesday, Apple considers ChatGPT’s placement on its devices as compensation enough.

“Apple isn’t paying OpenAI as part of the partnership,” writes Bloomberg reporter Mark Gurman, citing people familiar with the matter who wish to remain anonymous. “Instead, Apple believes pushing OpenAI’s brand and technology to hundreds of millions of its devices is of equal or greater value than monetary payments.”

The Bloomberg report states that neither company expects the agreement to generate meaningful revenue in the short term, and in fact, the partnership could burn extra money for OpenAI, because it pays Microsoft to host ChatGPT’s capabilities on its Azure cloud. However, OpenAI could benefit by converting free users to paid subscriptions, and Apple potentially benefits by providing easy, built-in access to ChatGPT during a time when its own in-house LLMs are still catching up.

And there’s another angle at play. Currently, OpenAI offers subscriptions (ChatGPT Plus, Enterprise, Team) that unlock additional features. If users subscribe to OpenAI through the ChatGPT app on an Apple device, the process will reportedly use Apple’s payment platform, which may give Apple a significant cut of the revenue. According to the report, Apple hopes to negotiate additional revenue-sharing deals with AI vendors in the future.

Why OpenAI

The rise of ChatGPT in the public eye over the past 18 months has made OpenAI a power player in the tech industry, allowing it to strike deals with publishers for AI training content—and ensure continued support from Microsoft in the form of investments that trade vital funding and compute for access to OpenAI’s large language model (LLM) technology like GPT-4.

Still, Apple’s choice of ChatGPT as its first external AI integration has led to widespread misunderstanding, especially since Apple buried the lede about its own in-house LLM technology that powers its new “Apple Intelligence” platform.

On Apple’s part, CEO Tim Cook told The Washington Post that it chose OpenAI as its first third-party AI partner because he thinks the company controls the leading LLM technology at the moment: “I think they’re a pioneer in the area, and today they have the best model,” he said. “We’re integrating with other people as well. But they’re first, and I think today it’s because they’re best.”

Apple’s choice also brings risk. OpenAI’s record isn’t spotless: The company has racked up a string of public controversies over the past month, including an accusation from actress Scarlett Johansson that it intentionally imitated her voice, resignations from a key scientist and safety personnel, the revelation of a restrictive NDA that prevented ex-employees from publicly criticizing the company, and accusations of “psychological abuse” against OpenAI CEO Sam Altman related by a former member of the OpenAI board.

Meanwhile, critics concerned about the privacy implications of gathering data for training AI models—including OpenAI foe Elon Musk, who took to X on Monday to spread misconceptions about how the ChatGPT integration might work—also worried that the Apple-OpenAI deal might expose personal data to the AI company, although both companies strongly deny that will be the case.

Looking ahead, Apple’s deal with OpenAI is not exclusive, and the company is already in talks to offer Google’s Gemini chatbot as an additional option later this year. Apple has also reportedly held talks with Anthropic (maker of Claude 3) as a potential chatbot partner, signaling its intention to provide users with a range of AI services, much like how the company offers various search engine options in Safari.

Apple and OpenAI currently have the most misunderstood partnership in tech

A man talks into a smartphone. He isn’t using an iPhone, but some people talk to Siri like this.

On Monday, Apple premiered “Apple Intelligence” during a wide-ranging presentation at its annual Worldwide Developers Conference in Cupertino, California. However, the heart of its new tech, an array of Apple-developed AI models, was overshadowed by the announcement of ChatGPT integration into its device operating systems.

Since rumors of the partnership first emerged, we’ve seen confusion on social media about why Apple didn’t develop a cutting-edge GPT-4-like chatbot internally. Despite Apple’s year-long development of its own large language models (LLMs), many perceived the integration of ChatGPT (and opening the door for others, like Google Gemini) as a sign of Apple’s lack of innovation.

“This is really strange. Surely Apple could train a very good competing LLM if they wanted? They’ve had a year,” wrote AI developer Benjamin De Kraker on X. Elon Musk has also been grumbling about the OpenAI deal—and spreading misinformation about it—saying things like, “It’s patently absurd that Apple isn’t smart enough to make their own AI, yet is somehow capable of ensuring that OpenAI will protect your security & privacy!”

While Apple has developed many technologies internally, it has also never been shy about integrating outside tech when necessary in various ways, from acquisitions to built-in clients—in fact, Siri was initially developed by an outside company. But by making a deal with a company like OpenAI, which has been the source of a string of tech controversies recently, it’s understandable that some people are questioning why Apple made the call—and what it might entail for the privacy of their on-device data.

“Our customers want something with world knowledge some of the time”

While Apple Intelligence largely relies on Apple’s own in-house LLMs, the company also realized that there may be times when some users want to use what it considers the current “best” existing LLM—OpenAI’s GPT-4 family. In an interview with The Washington Post, Apple CEO Tim Cook explained the decision to integrate OpenAI first:

“I think they’re a pioneer in the area, and today they have the best model,” he said. “And I think our customers want something with world knowledge some of the time. So we considered everything and everyone. And obviously we’re not stuck on one person forever or something. We’re integrating with other people as well. But they’re first, and I think today it’s because they’re best.”

The proposed benefit of Apple integrating ChatGPT into various experiences within iOS, iPadOS, and macOS is that it allows AI users to access ChatGPT’s capabilities without the need to switch between different apps—either through the Siri interface or through Apple’s integrated “Writing Tools.” Users will also have the option to connect their paid ChatGPT account to access extra features.

As an answer to privacy concerns, Apple says that before any data is sent to ChatGPT, the OS asks for the user’s permission, and the entire ChatGPT experience is optional. According to Apple, requests are not stored by OpenAI, and users’ IP addresses are hidden. Apparently, communication with OpenAI servers happens through API calls similar to using the ChatGPT app on iOS, and there is reportedly no deeper OS integration that might expose user data to OpenAI without the user’s permission.
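
For readers wondering what “API calls” look like in this context, the snippet below is a generic, stateless chat-completion request made with OpenAI’s public Python SDK. It is not Apple’s integration code, and the permission prompt and IP masking Apple describes happen outside anything shown here; it simply illustrates the per-request pattern the report describes:

```python
# Generic example of a stateless chat-completion request using OpenAI's public
# Python SDK. This is NOT Apple's integration code; it only illustrates the
# kind of per-request API call described in the report.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Summarize this note in one sentence: ..."}
    ],
)
print(response.choices[0].message.content)
```

Each request stands alone, which is consistent with Apple’s claim that nothing deeper in the OS is exposed to OpenAI without permission.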

We can only take Apple’s word for it at the moment, of course, and solid details about Apple’s AI privacy efforts will emerge once security experts get their hands on the new features later this year.

Apple’s history of tech integration

So you’ve seen why Apple chose OpenAI. But why look to outside companies for tech? In some ways, Apple building an external LLM client into its operating systems isn’t too different from what it has previously done with streaming video (the YouTube app on the original iPhone), Internet search (Google search integration), and social media (integrated Twitter and Facebook sharing).

The press has positioned Apple’s recent AI moves as Apple “catching up” with competitors like Google and Microsoft in terms of chatbots and generative AI. But playing it slow and cool has long been part of Apple’s M.O.—not necessarily introducing the bleeding edge of technology but improving existing tech through refinement and giving it a better user interface.

DuckDuckGo offers “anonymous” access to AI chatbots through new service

anonymous confabulations —

DDG offers LLMs from OpenAI, Anthropic, Meta, and Mistral for factually iffy conversations.

DuckDuckGo's AI Chat promotional image.

DuckDuckGo

On Thursday, DuckDuckGo unveiled a new “AI Chat” service that allows users to converse with four mid-range large language models (LLMs) from OpenAI, Anthropic, Meta, and Mistral in an interface similar to ChatGPT while attempting to preserve privacy and anonymity. While the AI models involved can readily output inaccurate information, the site allows users to test different mid-range LLMs without having to install anything or sign up for an account.

DuckDuckGo’s AI Chat currently features access to OpenAI’s GPT-3.5 Turbo, Anthropic’s Claude 3 Haiku, and two open source models, Meta’s Llama 3 and Mistral’s Mixtral 8x7B. The service is currently free to use within daily limits. Users can access AI Chat through the DuckDuckGo search engine, direct links to the site, or by using “!ai” or “!chat” shortcuts in the search field. AI Chat can also be disabled in the site’s settings for users with accounts.

According to DuckDuckGo, chats on the service are anonymized, with metadata and IP address removed to prevent tracing back to individuals. The company states that chats are not used for AI model training, citing its privacy policy and terms of use.

“We have agreements in place with all model providers to ensure that any saved chats are completely deleted by the providers within 30 days,” says DuckDuckGo, “and that none of the chats made on our platform can be used to train or improve the models.”

An example of DuckDuckGo AI Chat with GPT-3.5 answering a silly question in an inaccurate way.

Benj Edwards

However, the privacy experience is not bulletproof because, in the case of GPT-3.5 and Claude Haiku, DuckDuckGo is required to send a user’s inputs to remote servers for processing over the Internet. Given certain inputs (e.g., “Hey, GPT, my name is Bob, and I live on Main Street, and I just murdered Bill”), a user could still potentially be identified if such an extreme need arose.

While the service appears to work well for us, there’s a question about its utility. For example, while GPT-3.5 initially wowed people when it launched with ChatGPT in 2022, it also confabulated a lot—and it still does. GPT-4 was the first major LLM to get confabulations under control to a point where the bot became more reasonably useful for some tasks (though this itself is a controversial point), but that more capable model isn’t present in DuckDuckGo’s AI Chat. Also missing are similar GPT-4-level models like Claude Opus or Google’s Gemini Ultra, likely because they are far more expensive to run. DuckDuckGo says it may roll out paid plans in the future, and those may include higher daily usage limits or access to “more advanced models.”

It’s true that the other three models generally (and subjectively) surpass GPT-3.5 in capability, especially for coding, and hallucinate less often, but they can still make things up, too. With DuckDuckGo AI Chat as it stands, the company is left with a chatbot novelty with a decent interface and the promise that your conversations with it will remain private. But what use are fully private AI conversations if they are full of errors?

Mixtral 8x7B on DuckDuckGo AI Chat when asked about the author. Everything in red boxes is sadly incorrect, but it provides an interesting fantasy scenario. It’s a good example of an LLM plausibly filling gaps between concepts that are underrepresented in its training data, called confabulation. For the record, Llama 3 gives a more accurate answer.

Benj Edwards

As DuckDuckGo itself states in its privacy policy, “By its very nature, AI Chat generates text with limited information. As such, Outputs that appear complete or accurate because of their detail or specificity may not be. For example, AI Chat cannot dynamically retrieve information and so Outputs may be outdated. You should not rely on any Output without verifying its contents using other sources, especially for professional advice (like medical, financial, or legal advice).”

So, have fun talking to bots, but tread carefully. They’ll easily “lie” to your face because they don’t understand what they are saying and are tuned to output statistically plausible information, not factual references.

Google’s AI Overview is flawed by design, and a new company blog post hints at why

guided by voices —

Google: “There are bound to be some oddities and errors” in system that told people to eat rocks.

The Google “G” logo surrounded by whimsical characters, all of which look stunned and surprised.

On Thursday, Google capped off a rough week of providing inaccurate and sometimes dangerous answers through its experimental AI Overview feature by authoring a follow-up blog post titled, “AI Overviews: About last week.” In the post, attributed to Google VP Liz Reid, head of Google Search, the firm formally acknowledged issues with the feature and outlined steps taken to improve a system that appears flawed by design, even if the company doesn’t frame it that way.

To recap, the AI Overview feature—which the company showed off at Google I/O a few weeks ago—aims to provide search users with summarized answers to questions by using an AI model integrated with Google’s web ranking systems. Right now, it’s an experimental feature that is not active for everyone, but when a participating user searches for a topic, they might see an AI-generated answer at the top of the results, pulled from highly ranked web content and summarized by an AI model.

While Google claims this approach is “highly effective” and on par with its Featured Snippets in terms of accuracy, the past week has seen numerous examples of the AI system generating bizarre, incorrect, or even potentially harmful responses, as we detailed in a recent feature where Ars reporter Kyle Orland replicated many of the unusual outputs.

Drawing inaccurate conclusions from the web

On Wednesday morning, Google’s AI Overview was erroneously telling us the Sony PlayStation and Sega Saturn were available in 1993.

Kyle Orland / Google

Given the circulating AI Overview examples, Google almost apologizes in the post and says, “We hold ourselves to a high standard, as do our users, so we expect and appreciate the feedback, and take it seriously.” But Reid, in an attempt to justify the errors, then goes into some very revealing detail about why AI Overviews provides erroneous information:

AI Overviews work very differently than chatbots and other LLM products that people may have tried out. They’re not simply generating an output based on training data. While AI Overviews are powered by a customized language model, the model is integrated with our core web ranking systems and designed to carry out traditional “search” tasks, like identifying relevant, high-quality results from our index. That’s why AI Overviews don’t just provide text output, but include relevant links so people can explore further. Because accuracy is paramount in Search, AI Overviews are built to only show information that is backed up by top web results.

This means that AI Overviews generally don’t “hallucinate” or make things up in the ways that other LLM products might.

Here we see the fundamental flaw of the system: “AI Overviews are built to only show information that is backed up by top web results.” The design is based on the false assumption that Google’s page-ranking algorithm favors accurate results and not SEO-gamed garbage. Google Search has been broken for some time, and now the company is relying on those gamed and spam-filled results to feed its new AI model.

Even if the AI model draws from a more accurate source, as with the 1993 game console search seen above, Google’s AI language model can still reach inaccurate conclusions about the “accurate” data, confabulating erroneous information in a flawed summary of the information available.

Generally ignoring the folly of basing its AI results on a broken page-ranking algorithm, Google’s blog post instead attributes the commonly circulated errors to several other factors, including users making nonsensical searches “aimed at producing erroneous results.” Google does admit faults with the AI model, like misinterpreting queries, misinterpreting “a nuance of language on the web,” and lacking sufficient high-quality information on certain topics. It also suggests that some of the more egregious examples circulating on social media are fake screenshots.

“Some of these faked results have been obvious and silly,” Reid writes. “Others have implied that we returned dangerous results for topics like leaving dogs in cars, smoking while pregnant, and depression. Those AI Overviews never appeared. So we’d encourage anyone encountering these screenshots to do a search themselves to check.”

(No doubt some of the social media examples are fake, but it’s worth noting that any attempts to replicate those early examples now will likely fail because Google will have manually blocked the results. And it is potentially a testament to how broken Google Search is if people believed extreme fake examples in the first place.)

While addressing the “nonsensical searches” angle in the post, Reid uses the example search, “How many rocks should I eat each day,” which went viral in a tweet on May 23. Reid says, “Prior to these screenshots going viral, practically no one asked Google that question.” And since there isn’t much data on the web that answers it, she says there is a “data void” or “information gap” that was filled by satirical content found on the web, and the AI model found it and pushed it as an answer, much like Featured Snippets might. So basically, it was working exactly as designed.

A screenshot of an AI Overview query, “How many rocks should I eat each day” that went viral on X last week.

OpenAI board first learned about ChatGPT from Twitter, according to former member

It’s a secret to everybody —

Helen Toner, center of struggle with Altman, suggests CEO fostered “toxic atmosphere” at company.

Helen Toner, former OpenAI board member, speaks during Vox Media’s 2023 Code Conference at The Ritz-Carlton, Laguna Niguel on September 27, 2023.

In a recent interview on “The Ted AI Show” podcast, former OpenAI board member Helen Toner said the OpenAI board was unaware of the existence of ChatGPT until they saw it on Twitter. She also revealed details about the company’s internal dynamics and the events surrounding CEO Sam Altman’s surprise firing and subsequent rehiring last November.

OpenAI released ChatGPT publicly on November 30, 2022, and its massive surprise popularity set OpenAI on a new trajectory, shifting focus from being an AI research lab to a more consumer-facing tech company.

“When ChatGPT came out in November 2022, the board was not informed in advance about that. We learned about ChatGPT on Twitter,” Toner said on the podcast.

Toner’s revelation about ChatGPT seems to highlight a significant disconnect between the board and the company’s day-to-day operations, lending new weight to accusations that Altman was “not consistently candid in his communications with the board” upon his firing on November 17, 2023. Altman and OpenAI’s new board later said that the CEO’s mismanagement of attempts to remove Toner from the OpenAI board following her criticism of the company’s release of ChatGPT played a key role in Altman’s firing.

“Sam didn’t inform the board that he owned the OpenAI startup fund, even though he constantly was claiming to be an independent board member with no financial interest in the company on multiple occasions,” she said. “He gave us inaccurate information about the small number of formal safety processes that the company did have in place, meaning that it was basically impossible for the board to know how well those safety processes were working or what might need to change.”

Toner also shed light on the circumstances that led to Altman’s temporary ousting. She mentioned that two OpenAI executives had reported instances of “psychological abuse” to the board, providing screenshots and documentation to support their claims. The allegations made by the former OpenAI executives, as relayed by Toner, suggest that Altman’s leadership style fostered a “toxic atmosphere” at the company:

In October of last year, we had this series of conversations with these executives, where the two of them suddenly started telling us about their own experiences with Sam, which they hadn’t felt comfortable sharing before, but telling us how they couldn’t trust him, about the toxic atmosphere it was creating. They use the phrase “psychological abuse,” telling us they didn’t think he was the right person to lead the company, telling us they had no belief that he could or would change, there’s no point in giving him feedback, no point in trying to work through these issues.

Despite the board’s decision to fire Altman, he began the process of returning to his position just five days later, following a letter to the board signed by over 700 OpenAI employees. Toner attributed this swift comeback to employees who believed the company would collapse without him, saying they also feared retaliation from Altman if they did not support his return.

“The second thing I think is really important to know, that has really gone under reported is how scared people are to go against Sam,” Toner said. “They experienced him retaliate against people retaliating… for past instances of being critical.”

“They were really afraid of what might happen to them,” she continued. “So some employees started to say, you know, wait, I don’t want the company to fall apart. Like, let’s bring back Sam. It was very hard for those people who had had terrible experiences to actually say that… if Sam did stay in power, as he ultimately did, that would make their lives miserable.”

In response to Toner’s statements, current OpenAI board chair Bret Taylor provided a statement to the podcast: “We are disappointed that Miss Toner continues to revisit these issues… The review concluded that the prior board’s decision was not based on concerns regarding product safety or security, the pace of development, OpenAI’s finances, or its statements to investors, customers, or business partners.”

Even given that review, Toner’s main argument is that OpenAI hasn’t been able to police itself despite claims to the contrary. “The OpenAI saga shows that trying to do good and regulating yourself isn’t enough,” she said.

Google’s “AI Overview” can give false, misleading, and dangerous answers

This is fine.

Getty Images

If you use Google regularly, you may have noticed the company’s new AI Overviews providing summarized answers to some of your questions in recent days. If you use social media regularly, you may have come across many examples of those AI Overviews being hilariously or even dangerously wrong.

Factual errors can pop up in existing LLM chatbots as well, of course. But the potential damage that can be caused by AI inaccuracy gets multiplied when those errors appear atop the ultra-valuable web real estate of the Google search results page.

“The examples we’ve seen are generally very uncommon queries and aren’t representative of most people’s experiences,” a Google spokesperson told Ars. “The vast majority of AI Overviews provide high quality information, with links to dig deeper on the web.”

After looking through dozens of examples of Google AI Overview mistakes (and replicating many ourselves for the galleries below), we’ve noticed a few broad categories of errors that seemed to show up again and again. Consider this a crash course in some of the current weak points of Google’s AI Overviews and a look at areas of concern for the company to improve as the system continues to roll out.

Treating jokes as facts

  • The bit about using glue on pizza can be traced back to an 11-year-old troll post on Reddit. (via)

    Kyle Orland / Google

  • This wasn’t funny when the guys at Pep Boys said it, either. (via)

    Kyle Orland / Google

  • Weird Al recommends “running with scissors” as well! (via)

    Kyle Orland / Google

Some of the funniest examples of Google’s AI Overview failing come, ironically enough, when the system doesn’t realize a source online was trying to be funny. An AI answer that suggested using “1/8 cup of non-toxic glue” to stop cheese from sliding off pizza can be traced back to someone who was obviously trying to troll an ongoing thread. A response recommending “blinker fluid” for a turn signal that doesn’t make noise can similarly be traced back to a troll on the Good Sam advice forums, which Google’s AI Overview apparently trusts as a reliable source.

In regular Google searches, these jokey posts from random Internet users probably wouldn’t be among the first answers someone saw when clicking through a list of web links. But with AI Overviews, those trolls were integrated into the authoritative-sounding data summary presented right at the top of the results page.

What’s more, there’s nothing in the tiny “source link” boxes below Google’s AI summary to suggest either of these forum trolls are anything other than good sources of information. Sometimes, though, glancing at the source can save you some grief, such as when you see a response calling running with scissors “cardio exercise that some say is effective” (that came from a 2022 post from Little Old Lady Comedy).

Bad sourcing

  • Washington University in St. Louis says this ratio is accurate, but others disagree. (via)

    Kyle Orland / Google

  • Man, we wish this fantasy remake was real. (via)

    Kyle Orland / Google

Sometimes Google’s AI Overview offers an accurate summary of a non-joke source that happens to be wrong. When asking about how many Declaration of Independence signers owned slaves, for instance, Google’s AI Overview accurately summarizes a Washington University in St. Louis library page saying that one-third “were personally enslavers.” But the response ignores contradictory sources like a Chicago Sun-Times article saying the real answer is closer to three-quarters. I’m not enough of a history expert to judge which authoritative-seeming source is right, but at least one historian online took issue with the Google AI’s answer sourcing.

Other times, a source that Google trusts as authoritative is really just fan fiction. That’s the case for a response that imagined a 2022 remake of 2001: A Space Odyssey, directed by Steven Spielberg and produced by George Lucas. A savvy web user would probably do a double-take before citing Fandom’s “Idea Wiki” as a reliable source, but a careless AI Overview user might not notice where the AI got its information.
