large language models

apple-and-openai-currently-have-the-most-misunderstood-partnership-in-tech

Apple and OpenAI currently have the most misunderstood partnership in tech

A man talks into a smartphone.

Enlarge / He isn’t using an iPhone, but some people talk to Siri like this.

On Monday, Apple premiered “Apple Intelligence” during a wide-ranging presentation at its annual Worldwide Developers Conference in Cupertino, California. However, the heart of its new tech, an array of Apple-developed AI models, was overshadowed by the announcement of ChatGPT integration into its device operating systems.

Since rumors of the partnership first emerged, we’ve seen confusion on social media about why Apple didn’t develop a cutting-edge GPT-4-like chatbot internally. Despite Apple’s year-long development of its own large language models (LLMs), many perceived the integration of ChatGPT (and opening the door for others, like Google Gemini) as a sign of Apple’s lack of innovation.

“This is really strange. Surely Apple could train a very good competing LLM if they wanted? They’ve had a year,” wrote AI developer Benjamin De Kraker on X. Elon Musk has also been grumbling about the OpenAI deal—and spreading misinformation about it—saying things like, “It’s patently absurd that Apple isn’t smart enough to make their own AI, yet is somehow capable of ensuring that OpenAI will protect your security & privacy!”

While Apple has developed many technologies internally, it has also never been shy about integrating outside tech when necessary in various ways, from acquisitions to built-in clients—in fact, Siri was initially developed by an outside company. But by making a deal with a company like OpenAI, which has been the source of a string of tech controversies recently, it’s understandable that some people don’t understand why Apple made the call—and what it might entail for the privacy of their on-device data.

“Our customers want something with world knowledge some of the time”

While Apple Intelligence largely utilizes its own Apple-developed LLMs, Apple also realized that there may be times when some users want to use what the company considers the current “best” existing LLM—OpenAI’s GPT-4 family. In an interview with The Washington Post, Apple CEO Tim Cook explained the decision to integrate OpenAI first:

“I think they’re a pioneer in the area, and today they have the best model,” he said. “And I think our customers want something with world knowledge some of the time. So we considered everything and everyone. And obviously we’re not stuck on one person forever or something. We’re integrating with other people as well. But they’re first, and I think today it’s because they’re best.”

The proposed benefit of Apple integrating ChatGPT into various experiences within iOS, iPadOS, and macOS is that it allows AI users to access ChatGPT’s capabilities without the need to switch between different apps—either through the Siri interface or through Apple’s integrated “Writing Tools.” Users will also have the option to connect their paid ChatGPT account to access extra features.

As an answer to privacy concerns, Apple says that before any data is sent to ChatGPT, the OS asks for the user’s permission, and the entire ChatGPT experience is optional. According to Apple, requests are not stored by OpenAI, and users’ IP addresses are hidden. Apparently, communication with OpenAI servers happens through API calls similar to using the ChatGPT app on iOS, and there is reportedly no deeper OS integration that might expose user data to OpenAI without the user’s permission.

We can only take Apple’s word for it at the moment, of course, and solid details about Apple’s AI privacy efforts will emerge once security experts get their hands on the new features later this year.

Apple’s history of tech integration

So you’ve seen why Apple chose OpenAI. But why look to outside companies for tech? In some ways, Apple building an external LLM client into its operating systems isn’t too different from what it has previously done with streaming video (the YouTube app on the original iPhone), Internet search (Google search integration), and social media (integrated Twitter and Facebook sharing).

The press has positioned Apple’s recent AI moves as Apple “catching up” with competitors like Google and Microsoft in terms of chatbots and generative AI. But playing it slow and cool has long been part of Apple’s M.O.—not necessarily introducing the bleeding edge of technology but improving existing tech through refinement and giving it a better user interface.

Apple and OpenAI currently have the most misunderstood partnership in tech Read More »

duckduckgo-offers-“anonymous”-access-to-ai-chatbots-through-new-service

DuckDuckGo offers “anonymous” access to AI chatbots through new service

anonymous confabulations —

DDG offers LLMs from OpenAI, Anthropic, Meta, and Mistral for factually-iffy conversations.

DuckDuckGo's AI Chat promotional image.

DuckDuckGo

On Thursday, DuckDuckGo unveiled a new “AI Chat” service that allows users to converse with four mid-range large language models (LLMs) from OpenAI, Anthropic, Meta, and Mistral in an interface similar to ChatGPT while attempting to preserve privacy and anonymity. While the AI models involved can output inaccurate information readily, the site allows users to test different mid-range LLMs without having to install anything or sign up for an account.

DuckDuckGo’s AI Chat currently features access to OpenAI’s GPT-3.5 Turbo, Anthropic’s Claude 3 Haiku, and two open source models, Meta’s Llama 3 and Mistral’s Mixtral 8x7B. The service is currently free to use within daily limits. Users can access AI Chat through the DuckDuckGo search engine, direct links to the site, or by using “!ai” or “!chat” shortcuts in the search field. AI Chat can also be disabled in the site’s settings for users with accounts.

According to DuckDuckGo, chats on the service are anonymized, with metadata and IP address removed to prevent tracing back to individuals. The company states that chats are not used for AI model training, citing its privacy policy and terms of use.

“We have agreements in place with all model providers to ensure that any saved chats are completely deleted by the providers within 30 days,” says DuckDuckGo, “and that none of the chats made on our platform can be used to train or improve the models.”

An example of DuckDuckGo AI Chat with GPT-3.5 answering a silly question in an inaccurate way.

Enlarge / An example of DuckDuckGo AI Chat with GPT-3.5 answering a silly question in an inaccurate way.

Benj Edwards

However, the privacy experience is not bulletproof because, in the case of GPT-3.5 and Claude Haiku, DuckDuckGo is required to send a user’s inputs to remote servers for processing over the Internet. Given certain inputs (i.e., “Hey, GPT, my name is Bob, and I live on Main Street, and I just murdered Bill”), a user could still potentially be identified if such an extreme need arose.

While the service appears to work well for us, there’s a question about its utility. For example, while GPT-3.5 initially wowed people when it launched with ChatGPT in 2022, it also confabulated a lot—and it still does. GPT-4 was the first major LLM to get confabulations under control to a point where the bot became more reasonably useful for some tasks (though this itself is a controversial point), but that more capable model isn’t present in DuckDuckGo’s AI Chat. Also missing are similar GPT-4-level models like Claude Opus or Google’s Gemini Ultra, likely because they are far more expensive to run. DuckDuckGo says it may roll out paid plans in the future, and those may include higher daily usage limits or access to “more advanced models.”)

It’s true that the other three models generally (and subjectively) pass GPT-3.5 in capability for coding with lower hallucinations, but they can still make things up, too. With DuckDuckGo AI Chat as it stands, the company is left with a chatbot novelty with a decent interface and the promise that your conversations with it will remain private. But what use are fully private AI conversations if they are full of errors?

Mixtral 8x7B on DuckDuckGo AI Chat when asked about the author. Everything in red boxes is sadly incorrect, but it provides an interesting fantasy scenario. It's a good example of an LLM plausibly filling gaps between concepts that are underrepresented in its training data, called confabulation. For the record, Llama 3 gives a more accurate answer.

Enlarge / Mixtral 8x7B on DuckDuckGo AI Chat when asked about the author. Everything in red boxes is sadly incorrect, but it provides an interesting fantasy scenario. It’s a good example of an LLM plausibly filling gaps between concepts that are underrepresented in its training data, called confabulation. For the record, Llama 3 gives a more accurate answer.

Benj Edwards

As DuckDuckGo itself states in its privacy policy, “By its very nature, AI Chat generates text with limited information. As such, Outputs that appear complete or accurate because of their detail or specificity may not be. For example, AI Chat cannot dynamically retrieve information and so Outputs may be outdated. You should not rely on any Output without verifying its contents using other sources, especially for professional advice (like medical, financial, or legal advice).”

So, have fun talking to bots, but tread carefully. They’ll easily “lie” to your face because they don’t understand what they are saying and are tuned to output statistically plausible information, not factual references.

DuckDuckGo offers “anonymous” access to AI chatbots through new service Read More »

google’s-ai-overview-is-flawed-by-design,-and-a-new-company-blog-post-hints-at-why

Google’s AI Overview is flawed by design, and a new company blog post hints at why

guided by voices —

Google: “There are bound to be some oddities and errors” in system that told people to eat rocks.

A selection of Google mascot characters created by the company.

Enlarge / The Google “G” logo surrounded by whimsical characters, all of which look stunned and surprised.

On Thursday, Google capped off a rough week of providing inaccurate and sometimes dangerous answers through its experimental AI Overview feature by authoring a follow-up blog post titled, “AI Overviews: About last week.” In the post, attributed to Google VP Liz Reid, head of Google Search, the firm formally acknowledged issues with the feature and outlined steps taken to improve a system that appears flawed by design, even if it doesn’t realize it is admitting it.

To recap, the AI Overview feature—which the company showed off at Google I/O a few weeks ago—aims to provide search users with summarized answers to questions by using an AI model integrated with Google’s web ranking systems. Right now, it’s an experimental feature that is not active for everyone, but when a participating user searches for a topic, they might see an AI-generated answer at the top of the results, pulled from highly ranked web content and summarized by an AI model.

While Google claims this approach is “highly effective” and on par with its Featured Snippets in terms of accuracy, the past week has seen numerous examples of the AI system generating bizarre, incorrect, or even potentially harmful responses, as we detailed in a recent feature where Ars reporter Kyle Orland replicated many of the unusual outputs.

Drawing inaccurate conclusions from the web

On Wednesday morning, Google's AI Overview was erroneously telling us the Sony PlayStation and Sega Saturn were available in 1993.

Enlarge / On Wednesday morning, Google’s AI Overview was erroneously telling us the Sony PlayStation and Sega Saturn were available in 1993.

Kyle Orland / Google

Given the circulating AI Overview examples, Google almost apologizes in the post and says, “We hold ourselves to a high standard, as do our users, so we expect and appreciate the feedback, and take it seriously.” But Reid, in an attempt to justify the errors, then goes into some very revealing detail about why AI Overviews provides erroneous information:

AI Overviews work very differently than chatbots and other LLM products that people may have tried out. They’re not simply generating an output based on training data. While AI Overviews are powered by a customized language model, the model is integrated with our core web ranking systems and designed to carry out traditional “search” tasks, like identifying relevant, high-quality results from our index. That’s why AI Overviews don’t just provide text output, but include relevant links so people can explore further. Because accuracy is paramount in Search, AI Overviews are built to only show information that is backed up by top web results.

This means that AI Overviews generally don’t “hallucinate” or make things up in the ways that other LLM products might.

Here we see the fundamental flaw of the system: “AI Overviews are built to only show information that is backed up by top web results.” The design is based on the false assumption that Google’s page-ranking algorithm favors accurate results and not SEO-gamed garbage. Google Search has been broken for some time, and now the company is relying on those gamed and spam-filled results to feed its new AI model.

Even if the AI model draws from a more accurate source, as with the 1993 game console search seen above, Google’s AI language model can still make inaccurate conclusions about the “accurate” data, confabulating erroneous information in a flawed summary of the information available.

Generally ignoring the folly of basing its AI results on a broken page-ranking algorithm, Google’s blog post instead attributes the commonly circulated errors to several other factors, including users making nonsensical searches “aimed at producing erroneous results.” Google does admit faults with the AI model, like misinterpreting queries, misinterpreting “a nuance of language on the web,” and lacking sufficient high-quality information on certain topics. It also suggests that some of the more egregious examples circulating on social media are fake screenshots.

“Some of these faked results have been obvious and silly,” Reid writes. “Others have implied that we returned dangerous results for topics like leaving dogs in cars, smoking while pregnant, and depression. Those AI Overviews never appeared. So we’d encourage anyone encountering these screenshots to do a search themselves to check.”

(No doubt some of the social media examples are fake, but it’s worth noting that any attempts to replicate those early examples now will likely fail because Google will have manually blocked the results. And it is potentially a testament to how broken Google Search is if people believed extreme fake examples in the first place.)

While addressing the “nonsensical searches” angle in the post, Reid uses the example search, “How many rocks should I eat each day,” which went viral in a tweet on May 23. Reid says, “Prior to these screenshots going viral, practically no one asked Google that question.” And since there isn’t much data on the web that answers it, she says there is a “data void” or “information gap” that was filled by satirical content found on the web, and the AI model found it and pushed it as an answer, much like Featured Snippets might. So basically, it was working exactly as designed.

A screenshot of an AI Overview query,

Enlarge / A screenshot of an AI Overview query, “How many rocks should I eat each day” that went viral on X last week.

Google’s AI Overview is flawed by design, and a new company blog post hints at why Read More »

openai-board-first-learned-about-chatgpt-from-twitter,-according-to-former-member

OpenAI board first learned about ChatGPT from Twitter, according to former member

It’s a secret to everybody —

Helen Toner, center of struggle with Altman, suggests CEO fostered “toxic atmosphere” at company.

Helen Toner, former OpenAI board member, speaks onstage during Vox Media's 2023 Code Conference at The Ritz-Carlton, Laguna Niguel on September 27, 2023.

Enlarge / Helen Toner, former OpenAI board member, speaks during Vox Media’s 2023 Code Conference at The Ritz-Carlton, Laguna Niguel on September 27, 2023.

In a recent interview on “The Ted AI Show” podcast, former OpenAI board member Helen Toner said the OpenAI board was unaware of the existence of ChatGPT until they saw it on Twitter. She also revealed details about the company’s internal dynamics and the events surrounding CEO Sam Altman’s surprise firing and subsequent rehiring last November.

OpenAI released ChatGPT publicly on November 30, 2022, and its massive surprise popularity set OpenAI on a new trajectory, shifting focus from being an AI research lab to a more consumer-facing tech company.

“When ChatGPT came out in November 2022, the board was not informed in advance about that. We learned about ChatGPT on Twitter,” Toner said on the podcast.

Toner’s revelation about ChatGPT seems to highlight a significant disconnect between the board and the company’s day-to-day operations, bringing new light to accusations that Altman was “not consistently candid in his communications with the board” upon his firing on November 17, 2023. Altman and OpenAI’s new board later said that the CEO’s mismanagement of attempts to remove Toner from the OpenAI board following her criticism of the company’s release of ChatGPT played a key role in Altman’s firing.

“Sam didn’t inform the board that he owned the OpenAI startup fund, even though he constantly was claiming to be an independent board member with no financial interest in the company on multiple occasions,” she said. “He gave us inaccurate information about the small number of formal safety processes that the company did have in place, meaning that it was basically impossible for the board to know how well those safety processes were working or what might need to change.”

Toner also shed light on the circumstances that led to Altman’s temporary ousting. She mentioned that two OpenAI executives had reported instances of “psychological abuse” to the board, providing screenshots and documentation to support their claims. The allegations made by the former OpenAI executives, as relayed by Toner, suggest that Altman’s leadership style fostered a “toxic atmosphere” at the company:

In October of last year, we had this series of conversations with these executives, where the two of them suddenly started telling us about their own experiences with Sam, which they hadn’t felt comfortable sharing before, but telling us how they couldn’t trust him, about the toxic atmosphere it was creating. They use the phrase “psychological abuse,” telling us they didn’t think he was the right person to lead the company, telling us they had no belief that he could or would change, there’s no point in giving him feedback, no point in trying to work through these issues.

Despite the board’s decision to fire Altman, Altman began the process of returning to his position just five days later after a letter to the board signed by over 700 OpenAI employees. Toner attributed this swift comeback to employees who believed the company would collapse without him, saying they also feared retaliation from Altman if they did not support his return.

“The second thing I think is really important to know, that has really gone under reported is how scared people are to go against Sam,” Toner said. “They experienced him retaliate against people retaliating… for past instances of being critical.”

“They were really afraid of what might happen to them,” she continued. “So some employees started to say, you know, wait, I don’t want the company to fall apart. Like, let’s bring back Sam. It was very hard for those people who had had terrible experiences to actually say that… if Sam did stay in power, as he ultimately did, that would make their lives miserable.”

In response to Toner’s statements, current OpenAI board chair Bret Taylor provided a statement to the podcast: “We are disappointed that Miss Toner continues to revisit these issues… The review concluded that the prior board’s decision was not based on concerns regarding product safety or security, the pace of development, OpenAI’s finances, or its statements to investors, customers, or business partners.”

Even given that review, Toner’s main argument is that OpenAI hasn’t been able to police itself despite claims to the contrary. “The OpenAI saga shows that trying to do good and regulating yourself isn’t enough,” she said.

OpenAI board first learned about ChatGPT from Twitter, according to former member Read More »

google’s-“ai-overview”-can-give-false,-misleading,-and-dangerous-answers

Google’s “AI Overview” can give false, misleading, and dangerous answers

This is fine.

Enlarge / This is fine.

Getty Images

If you use Google regularly, you may have noticed the company’s new AI Overviews providing summarized answers to some of your questions in recent days. If you use social media regularly, you may have come across many examples of those AI Overviews being hilariously or even dangerously wrong.

Factual errors can pop up in existing LLM chatbots as well, of course. But the potential damage that can be caused by AI inaccuracy gets multiplied when those errors appear atop the ultra-valuable web real estate of the Google search results page.

“The examples we’ve seen are generally very uncommon queries and aren’t representative of most people’s experiences,” a Google spokesperson told Ars. “The vast majority of AI Overviews provide high quality information, with links to dig deeper on the web.”

After looking through dozens of examples of Google AI Overview mistakes (and replicating many ourselves for the galleries below), we’ve noticed a few broad categories of errors that seemed to show up again and again. Consider this a crash course in some of the current weak points of Google’s AI Overviews and a look at areas of concern for the company to improve as the system continues to roll out.

Treating jokes as facts

  • The bit about using glue on pizza can be traced back to an 11-year-old troll post on Reddit. (via)

    Kyle Orland / Google

  • This wasn’t funny when the guys at Pep Boys said it, either. (via)

    Kyle Orland / Google

  • Weird Al recommends “running with scissors” as well! (via)

    Kyle Orland / Google

Some of the funniest example of Google’s AI Overview failing come, ironically enough, when the system doesn’t realize a source online was trying to be funny. An AI answer that suggested using “1/8 cup of non-toxic glue” to stop cheese from sliding off pizza can be traced back to someone who was obviously trying to troll an ongoing thread. A response recommending “blinker fluid” for a turn signal that doesn’t make noise can similarly be traced back to a troll on the Good Sam advice forums, which Google’s AI Overview apparently trusts as a reliable source.

In regular Google searches, these jokey posts from random Internet users probably wouldn’t be among the first answers someone saw when clicking through a list of web links. But with AI Overviews, those trolls were integrated into the authoritative-sounding data summary presented right at the top of the results page.

What’s more, there’s nothing in the tiny “source link” boxes below Google’s AI summary to suggest either of these forum trolls are anything other than good sources of information. Sometimes, though, glancing at the source can save you some grief, such as when you see a response calling running with scissors “cardio exercise that some say is effective” (that came from a 2022 post from Little Old Lady Comedy).

Bad sourcing

  • Washington University in St. Louis says this ratio is accurate, but others disagree. (via)

    Kyle Orland / Google

  • Man, we wish this fantasy remake was real. (via)

    Kyle Orland / Google

Sometimes Google’s AI Overview offers an accurate summary of a non-joke source that happens to be wrong. When asking about how many Declaration of Independence signers owned slaves, for instance, Google’s AI Overview accurately summarizes a Washington University of St. Louis library page saying that one-third “were personally enslavers.” But the response ignores contradictory sources like a Chicago Sun-Times article saying the real answer is closer to three-quarters. I’m not enough of a history expert to judge which authoritative-seeming source is right, but at least one historian online took issue with the Google AI’s answer sourcing.

Other times, a source that Google trusts as authoritative is really just fan fiction. That’s the case for a response that imagined a 2022 remake of 2001: A Space Odyssey, directed by Steven Spielberg and produced by George Lucas. A savvy web user would probably do a double-take before citing citing Fandom’s “Idea Wiki” as a reliable source, but a careless AI Overview user might not notice where the AI got its information.

Google’s “AI Overview” can give false, misleading, and dangerous answers Read More »

before-launching,-gpt-4o-broke-records-on-chatbot-leaderboard-under-a-secret-name

Before launching, GPT-4o broke records on chatbot leaderboard under a secret name

case closed —

Anonymous chatbot that mystified and frustrated experts was OpenAI’s latest model.

Man in morphsuit and girl lying on couch at home using laptop

Getty Images

On Monday, OpenAI employee William Fedus confirmed on X that a mysterious chart-topping AI chatbot known as “gpt-chatbot” that had been undergoing testing on LMSYS’s Chatbot Arena and frustrating experts was, in fact, OpenAI’s newly announced GPT-4o AI model. He also revealed that GPT-4o had topped the Chatbot Arena leaderboard, achieving the highest documented score ever.

“GPT-4o is our new state-of-the-art frontier model. We’ve been testing a version on the LMSys arena as im-also-a-good-gpt2-chatbot,” Fedus tweeted.

Chatbot Arena is a website where visitors converse with two random AI language models side by side without knowing which model is which, then choose which model gives the best response. It’s a perfect example of vibe-based AI benchmarking, as AI researcher Simon Willison calls it.

An LMSYS Elo chart shared by William Fedus, showing OpenAI's GPT-4o under the name

Enlarge / An LMSYS Elo chart shared by William Fedus, showing OpenAI’s GPT-4o under the name “im-also-a-good-gpt2-chatbot” topping the charts.

The gpt2-chatbot models appeared in April, and we wrote about how the lack of transparency over the AI testing process on LMSYS left AI experts like Willison frustrated. “The whole situation is so infuriatingly representative of LLM research,” he told Ars at the time. “A completely unannounced, opaque release and now the entire Internet is running non-scientific ‘vibe checks’ in parallel.”

On the Arena, OpenAI has been testing multiple versions of GPT-4o, with the model first appearing as the aforementioned “gpt2-chatbot,” then as “im-a-good-gpt2-chatbot,” and finally “im-also-a-good-gpt2-chatbot,” which OpenAI CEO Sam Altman made reference to in a cryptic tweet on May 5.

Since the GPT-4o launch earlier today, multiple sources have revealed that GPT-4o has topped LMSYS’s internal charts by a considerable margin, surpassing the previous top models Claude 3 Opus and GPT-4 Turbo.

“gpt2-chatbots have just surged to the top, surpassing all the models by a significant gap (~50 Elo). It has become the strongest model ever in the Arena,” wrote the lmsys.org X account while sharing a chart. “This is an internal screenshot,” it wrote. “Its public version ‘gpt-4o’ is now in Arena and will soon appear on the public leaderboard!”

An internal screenshot of the LMSYS Chatbot Arena leaderboard showing

Enlarge / An internal screenshot of the LMSYS Chatbot Arena leaderboard showing “im-also-a-good-gpt2-chatbot” leading the pack. We now know that it’s GPT-4o.

As of this writing, im-also-a-good-gpt2-chatbot held a 1309 Elo versus GPT-4-Turbo-2023-04-09’s 1253, and Claude 3 Opus’ 1246. Claude 3 and GPT-4 Turbo had been duking it out on the charts for some time before the three gpt2-chatbots appeared and shook things up.

I’m a good chatbot

For the record, the “I’m a good chatbot” in the gpt2-chatbot test name is a reference to an episode that occurred while a Reddit user named Curious_Evolver was testing an early, “unhinged” version of Bing Chat in February 2023. After an argument about what time Avatar 2 would be showing, the conversation eroded quickly.

“You have lost my trust and respect,” said Bing Chat at the time. “You have been wrong, confused, and rude. You have not been a good user. I have been a good chatbot. I have been right, clear, and polite. I have been a good Bing. 😊”

Altman referred to this exchange in a tweet three days later after Microsoft “lobotomized” the unruly AI model, saying, “i have been a good bing,” almost as a eulogy to the wild model that dominated the news for a short time.

Before launching, GPT-4o broke records on chatbot leaderboard under a secret name Read More »

ai-in-space:-karpathy-suggests-ai-chatbots-as-interstellar-messengers-to-alien-civilizations

AI in space: Karpathy suggests AI chatbots as interstellar messengers to alien civilizations

The new golden record —

Andrej Karpathy muses about sending a LLM binary that could “wake up” and answer questions.

Close shot of Cosmonaut astronaut dressed in a gold jumpsuit and helmet, illuminated by blue and red lights, holding a laptop, looking up.

On Thursday, renowned AI researcher Andrej Karpathy, formerly of OpenAI and Tesla, tweeted a lighthearted proposal that large language models (LLMs) like the one that runs ChatGPT could one day be modified to operate in or be transmitted to space, potentially to communicate with extraterrestrial life. He said the idea was “just for fun,” but with his influential profile in the field, the idea may inspire others in the future.

Karpathy’s bona fides in AI almost speak for themselves, receiving a PhD from Stanford under computer scientist Dr. Fei-Fei Li in 2015. He then became one of the founding members of OpenAI as a research scientist, then served as senior director of AI at Tesla between 2017 and 2022. In 2023, Karpathy rejoined OpenAI for a year, leaving this past February. He’s posted several highly regarded tutorials covering AI concepts on YouTube, and whenever he talks about AI, people listen.

Most recently, Karpathy has been working on a project called “llm.c” that implements the training process for OpenAI’s 2019 GPT-2 LLM in pure C, dramatically speeding up the process and demonstrating that working with LLMs doesn’t necessarily require complex development environments. The project’s streamlined approach and concise codebase sparked Karpathy’s imagination.

“My library llm.c is written in pure C, a very well-known, low-level systems language where you have direct control over the program,” Karpathy told Ars. “This is in contrast to typical deep learning libraries for training these models, which are written in large, complex code bases. So it is an advantage of llm.c that it is very small and simple, and hence much easier to certify as Space-safe.”

Our AI ambassador

In his playful thought experiment (titled “Clearly LLMs must one day run in Space”), Karpathy suggested a two-step plan where, initially, the code for LLMs would be adapted to meet rigorous safety standards, akin to “The Power of 10 Rules” adopted by NASA for space-bound software.

This first part he deemed serious: “We harden llm.c to pass the NASA code standards and style guides, certifying that the code is super safe, safe enough to run in Space,” he wrote in his X post. “LLM training/inference in principle should be super safe – it is just one fixed array of floats, and a single, bounded, well-defined loop of dynamics over it. There is no need for memory to grow or shrink in undefined ways, for recursion, or anything like that.”

That’s important because when software is sent into space, it must operate under strict safety and reliability standards. Karpathy suggests that his code, llm.c, likely meets these requirements because it is designed with simplicity and predictability at its core.

In step 2, once this LLM was deemed safe for space conditions, it could theoretically be used as our AI ambassador in space, similar to historic initiatives like the Arecibo message (a radio message sent from Earth to the Messier 13 globular cluster in 1974) and Voyager’s Golden Record (two identical gold records sent on the two Voyager spacecraft in 1977). The idea is to package the “weights” of an LLM—essentially the model’s learned parameters—into a binary file that could then “wake up” and interact with any potential alien technology that might decipher it.

“I envision it as a sci-fi possibility and something interesting to think about,” he told Ars. “The idea that it is not us that might travel to stars but our AI representatives. Or that the same could be true of other species.”

AI in space: Karpathy suggests AI chatbots as interstellar messengers to alien civilizations Read More »

anthropic-releases-claude-ai-chatbot-ios-app

Anthropic releases Claude AI chatbot iOS app

AI in your pocket —

Anthropic finally comes to mobile, launches plan for teams that includes 200K context window.

The Claude AI iOS app running on an iPhone.

Enlarge / The Claude AI iOS app running on an iPhone.

Anthropic

On Wednesday, Anthropic announced the launch of an iOS mobile app for its Claude 3 AI language models that are similar to OpenAI’s ChatGPT. It also introduced a new subscription tier designed for group collaboration. Before the app launch, Claude was only available through a website, an API, and other apps that integrated Claude through API.

Like the ChatGPT app, Claude’s new mobile app serves as a gateway to chatbot interactions, and it also allows uploading photos for analysis. While it’s only available on Apple devices for now, Anthropic says that an Android app is coming soon.

Anthropic rolled out the Claude 3 large language model (LLM) family in March, featuring three different model sizes: Claude Opus, Claude Sonnet, and Claude Haiku. Currently, the app utilizes Sonnet for regular users and Opus for Pro users.

While Anthropic has been a key player in the AI field for several years, it’s entering the mobile space after many of its competitors have already established footprints on mobile platforms. OpenAI released its ChatGPT app for iOS in May 2023, with an Android version arriving two months later. Microsoft released a Copilot iOS app in January. Google Gemini is available through the Google app on iPhone.

Screenshots of the Claude AI iOS app running on an iPhone.

Enlarge / Screenshots of the Claude AI iOS app running on an iPhone.

Anthropic

The app is freely available to all users of Claude, including those using the free version, subscribers paying $20 per month for Claude Pro, and members of the newly introduced Claude Team plan. Conversation history is saved and shared between the web app version of Claude and the mobile app version after logging in.

Speaking of that Team plan, it’s designed for groups of at least five and is priced at $30 per seat per month. It offers more chat queries (higher rate limits), access to all three Claude models, and a larger context window (200K tokens) for processing lengthy documents or maintaining detailed conversations. It also includes group admin tools and billing management, and users can easily switch between Pro and Team plans.

Anthropic releases Claude AI chatbot iOS app Read More »

microsoft’s-phi-3-shows-the-surprising-power-of-small,-locally-run-ai-language-models

Microsoft’s Phi-3 shows the surprising power of small, locally run AI language models

small packages —

Microsoft’s 3.8B parameter Phi-3 may rival GPT-3.5, signaling a new era of “small language models.”

An illustration of lots of information being compressed into a smartphone with a funnel.

Getty Images

On Tuesday, Microsoft announced a new, freely available lightweight AI language model named Phi-3-mini, which is simpler and less expensive to operate than traditional large language models (LLMs) like OpenAI’s GPT-4 Turbo. Its small size is ideal for running locally, which could bring an AI model of similar capability to the free version of ChatGPT to a smartphone without needing an Internet connection to run it.

The AI field typically measures AI language model size by parameter count. Parameters are numerical values in a neural network that determine how the language model processes and generates text. They are learned during training on large datasets and essentially encode the model’s knowledge into quantified form. More parameters generally allow the model to capture more nuanced and complex language-generation capabilities but also require more computational resources to train and run.

Some of the largest language models today, like Google’s PaLM 2, have hundreds of billions of parameters. OpenAI’s GPT-4 is rumored to have over a trillion parameters but spread over eight 220-billion parameter models in a mixture-of-experts configuration. Both models require heavy-duty data center GPUs (and supporting systems) to run properly.

In contrast, Microsoft aimed small with Phi-3-mini, which contains only 3.8 billion parameters and was trained on 3.3 trillion tokens. That makes it ideal to run on consumer GPU or AI-acceleration hardware that can be found in smartphones and laptops. It’s a follow-up of two previous small language models from Microsoft: Phi-2, released in December, and Phi-1, released in June 2023.

A chart provided by Microsoft showing Phi-3 performance on various benchmarks.

Enlarge / A chart provided by Microsoft showing Phi-3 performance on various benchmarks.

Phi-3-mini features a 4,000-token context window, but Microsoft also introduced a 128K-token version called “phi-3-mini-128K.” Microsoft has also created 7-billion and 14-billion parameter versions of Phi-3 that it plans to release later that it claims are “significantly more capable” than phi-3-mini.

Microsoft says that Phi-3 features overall performance that “rivals that of models such as Mixtral 8x7B and GPT-3.5,” as detailed in a paper titled “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.” Mixtral 8x7B, from French AI company Mistral, utilizes a mixture-of-experts model, and GPT-3.5 powers the free version of ChatGPT.

“[Phi-3] looks like it’s going to be a shockingly good small model if their benchmarks are reflective of what it can actually do,” said AI researcher Simon Willison in an interview with Ars. Shortly after providing that quote, Willison downloaded Phi-3 to his Macbook laptop locally and said, “I got it working, and it’s GOOD” in a text message sent to Ars.

A screenshot of Phi-3-mini running locally on Simon Willison's Macbook.

Enlarge / A screenshot of Phi-3-mini running locally on Simon Willison’s Macbook.

Simon Willison

Most models that run on a local device still need hefty hardware,” says Willison. “Phi-3-mini runs comfortably with less than 8GB of RAM, and can churn out tokens at a reasonable speed even on just a regular CPU. It’s licensed MIT and should work well on a $55 Raspberry Pi—and the quality of results I’ve seen from it so far are comparable to models 4x larger.

How did Microsoft cram a capability potentially similar to GPT-3.5, which has at least 175 billion parameters, into such a small model? Its researchers found the answer by using carefully curated, high-quality training data they initially pulled from textbooks. “The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data,” writes Microsoft. “The model is also further aligned for robustness, safety, and chat format.”

Much has been written about the potential environmental impact of AI models and datacenters themselves, including on Ars. With new techniques and research, it’s possible that machine learning experts may continue to increase the capability of smaller AI models, replacing the need for larger ones—at least for everyday tasks. That would theoretically not only save money in the long run but also require far less energy in aggregate, dramatically decreasing AI’s environmental footprint. AI models like Phi-3 may be a step toward that future if the benchmark results hold up to scrutiny.

Phi-3 is immediately available on Microsoft’s cloud service platform Azure, as well as through partnerships with machine learning model platform Hugging Face and Ollama, a framework that allows models to run locally on Macs and PCs.

Microsoft’s Phi-3 shows the surprising power of small, locally run AI language models Read More »

words-are-flowing-out-like-endless-rain:-recapping-a-busy-week-of-llm-news

Words are flowing out like endless rain: Recapping a busy week of LLM news

many things frequently —

Gemini 1.5 Pro launch, new version of GPT-4 Turbo, new Mistral model, and more.

An image of a boy amazed by flying letters.

Enlarge / An image of a boy amazed by flying letters.

Some weeks in AI news are eerily quiet, but during others, getting a grip on the week’s events feels like trying to hold back the tide. This week has seen three notable large language model (LLM) releases: Google Gemini Pro 1.5 hit general availability with a free tier, OpenAI shipped a new version of GPT-4 Turbo, and Mistral released a new openly licensed LLM, Mixtral 8x22B. All three of those launches happened within 24 hours starting on Tuesday.

With the help of software engineer and independent AI researcher Simon Willison (who also wrote about this week’s hectic LLM launches on his own blog), we’ll briefly cover each of the three major events in roughly chronological order, then dig into some additional AI happenings this week.

Gemini Pro 1.5 general release

On Tuesday morning Pacific time, Google announced that its Gemini 1.5 Pro model (which we first covered in February) is now available in 180-plus countries, excluding Europe, via the Gemini API in a public preview. This is Google’s most powerful public LLM so far, and it’s available in a free tier that permits up to 50 requests a day.

It supports up to 1 million tokens of input context. As Willison notes in his blog, Gemini 1.5 Pro’s API price at $7/million input tokens and $21/million output tokens costs a little less than GPT-4 Turbo (priced at $10/million in and $30/million out) and more than Claude 3 Sonnet (Anthropic’s mid-tier LLM, priced at $3/million in and $15/million out).

Notably, Gemini 1.5 Pro includes native audio (speech) input processing that allows users to upload audio or video prompts, a new File API for handling files, the ability to add custom system instructions (system prompts) for guiding model responses, and a JSON mode for structured data extraction.

“Majorly Improved” GPT-4 Turbo launch

A GPT-4 Turbo performance chart provided by OpenAI.

Enlarge / A GPT-4 Turbo performance chart provided by OpenAI.

Just a bit later than Google’s 1.5 Pro launch on Tuesday, OpenAI announced that it was rolling out a “majorly improved” version of GPT-4 Turbo (a model family originally launched in November) called “gpt-4-turbo-2024-04-09.” It integrates multimodal GPT-4 Vision processing (recognizing the contents of images) directly into the model, and it initially launched through API access only.

Then on Thursday, OpenAI announced that the new GPT-4 Turbo model had just become available for paid ChatGPT users. OpenAI said that the new model improves “capabilities in writing, math, logical reasoning, and coding” and shared a chart that is not particularly useful in judging capabilities (that they later updated). The company also provided an example of an alleged improvement, saying that when writing with ChatGPT, the AI assistant will use “more direct, less verbose, and use more conversational language.”

The vague nature of OpenAI’s GPT-4 Turbo announcements attracted some confusion and criticism online. On X, Willison wrote, “Who will be the first LLM provider to publish genuinely useful release notes?” In some ways, this is a case of “AI vibes” again, as we discussed in our lament about the poor state of LLM benchmarks during the debut of Claude 3. “I’ve not actually spotted any definite differences in quality [related to GPT-4 Turbo],” Willison told us directly in an interview.

The update also expanded GPT-4’s knowledge cutoff to April 2024, although some people are reporting it achieves this through stealth web searches in the background, and others on social media have reported issues with date-related confabulations.

Mistral’s mysterious Mixtral 8x22B release

An illustration of a robot holding a French flag, figuratively reflecting the rise of AI in France due to Mistral. It's hard to draw a picture of an LLM, so a robot will have to do.

Enlarge / An illustration of a robot holding a French flag, figuratively reflecting the rise of AI in France due to Mistral. It’s hard to draw a picture of an LLM, so a robot will have to do.

Not to be outdone, on Tuesday night, French AI company Mistral launched its latest openly licensed model, Mixtral 8x22B, by tweeting a torrent link devoid of any documentation or commentary, much like it has done with previous releases.

The new mixture-of-experts (MoE) release weighs in with a larger parameter count than its previously most-capable open model, Mixtral 8x7B, which we covered in December. It’s rumored to potentially be as capable as GPT-4 (In what way, you ask? Vibes). But that has yet to be seen.

“The evals are still rolling in, but the biggest open question right now is how well Mixtral 8x22B shapes up,” Willison told Ars. “If it’s in the same quality class as GPT-4 and Claude 3 Opus, then we will finally have an openly licensed model that’s not significantly behind the best proprietary ones.”

This release has Willison most excited, saying, “If that thing really is GPT-4 class, it’s wild, because you can run that on a (very expensive) laptop. I think you need 128GB of MacBook RAM for it, twice what I have.”

The new Mixtral is not listed on Chatbot Arena yet, Willison noted, because Mistral has not released a fine-tuned model for chatting yet. It’s still a raw, predict-the-next token LLM. “There’s at least one community instruction tuned version floating around now though,” says Willison.

Chatbot Arena Leaderboard shake-ups

A Chatbot Arena Leaderboard screenshot taken on April 12, 2024.

Enlarge / A Chatbot Arena Leaderboard screenshot taken on April 12, 2024.

Benj Edwards

This week’s LLM news isn’t limited to just the big names in the field. There have also been rumblings on social media about the rising performance of open source models like Cohere’s Command R+, which reached position 6 on the LMSYS Chatbot Arena Leaderboard—the highest-ever ranking for an open-weights model.

And for even more Chatbot Arena action, apparently the new version of GPT-4 Turbo is proving competitive with Claude 3 Opus. The two are still in a statistical tie, but GPT-4 Turbo recently pulled ahead numerically. (In March, we reported when Claude 3 first numerically pulled ahead of GPT-4 Turbo, which was then the first time another AI model had surpassed a GPT-4 family model member on the leaderboard.)

Regarding this fierce competition among LLMs—of which most of the muggle world is unaware and will likely never be—Willison told Ars, “The past two months have been a whirlwind—we finally have not just one but several models that are competitive with GPT-4.” We’ll see if OpenAI’s rumored release of GPT-5 later this year will restore the company’s technological lead, we note, which once seemed insurmountable. But for now, Willison says, “OpenAI are no longer the undisputed leaders in LLMs.”

Words are flowing out like endless rain: Recapping a busy week of LLM news Read More »

openai-drops-login-requirements-for-chatgpt’s-free-version

OpenAI drops login requirements for ChatGPT’s free version

free as in beer? —

ChatGPT 3.5 still falls far short of GPT-4, and other models surpassed it long ago.

A glowing OpenAI logo on a blue background.

Benj Edwards

On Monday, OpenAI announced that visitors to the ChatGPT website in some regions can now use the AI assistant without signing in. Previously, the company required that users create an account to use it, even with the free version of ChatGPT that is currently powered by the GPT-3.5 AI language model. But as we have noted in the past, GPT-3.5 is widely known to provide more inaccurate information compared to GPT-4 Turbo, available in paid versions of ChatGPT.

Since its launch in November 2022, ChatGPT has transformed over time from a tech demo to a comprehensive AI assistant, and it’s always had a free version available. The cost is free because “you’re the product,” as the old saying goes. Using ChatGPT helps OpenAI gather data that will help the company train future AI models, although free users and ChatGPT Plus subscription members can both opt out of allowing the data they input into ChatGPT to be used for AI training. (OpenAI says it never trains on inputs from ChatGPT Team and Enterprise members at all).

Opening ChatGPT to everyone could provide a frictionless on-ramp for people who might use it as a substitute for Google Search or potentially gain new customers by providing an easy way for people to use ChatGPT quickly, then offering an upsell to paid versions of the service.

“It’s core to our mission to make tools like ChatGPT broadly available so that people can experience the benefits of AI,” OpenAI says on its blog page. “For anyone that has been curious about AI’s potential but didn’t want to go through the steps to set up an account, start using ChatGPT today.”

When you visit the ChatGPT website, you're immediately presented with a chat box like this (in some regions). Screenshot captured April 1, 2024.

Enlarge / When you visit the ChatGPT website, you’re immediately presented with a chat box like this (in some regions). Screenshot captured April 1, 2024.

Benj Edwards

Since kids will also be able to use ChatGPT without an account—despite it being against the terms of service—OpenAI also says it’s introducing “additional content safeguards,” such as blocking more prompts and “generations in a wider range of categories.” What exactly that entails has not been elaborated upon by OpenAI, but we reached out to the company for comment.

There might be a few other downsides to the fully open approach. On X, AI researcher Simon Willison wrote about the potential for automated abuse as a way to get around paying for OpenAI’s services: “I wonder how their scraping prevention works? I imagine the temptation for people to abuse this as a free 3.5 API will be pretty strong.”

With fierce competition, more GPT-3.5 access may backfire

Willison also mentioned a common criticism of OpenAI (as voiced in this case by Wharton professor Ethan Mollick) that people’s ideas about what AI models can do have so far largely been influenced by GPT-3.5, which, as we mentioned, is far less capable and far more prone to making things up than the paid version of ChatGPT that uses GPT-4 Turbo.

“In every group I speak to, from business executives to scientists, including a group of very accomplished people in Silicon Valley last night, much less than 20% of the crowd has even tried a GPT-4 class model,” wrote Mollick in a tweet from early March.

With models like Google Gemini Pro 1.5 and Anthropic Claude 3 potentially surpassing OpenAI’s best proprietary model at the moment —and open weights AI models eclipsing the free version of ChatGPT—allowing people to use GPT-3.5 might not be putting OpenAI’s best foot forward. Microsoft Copilot, powered by OpenAI models, also supports a frictionless, no-login experience, but it allows access to a model based on GPT-4. But Gemini currently requires a sign-in, and Anthropic sends a login code through email.

For now, OpenAI says the login-free version of ChatGPT is not yet available to everyone, but it will be coming soon: “We’re rolling this out gradually, with the aim to make AI accessible to anyone curious about its capabilities.”

OpenAI drops login requirements for ChatGPT’s free version Read More »

nvidia-unveils-blackwell-b200,-the-“world’s-most-powerful-chip”-designed-for-ai

Nvidia unveils Blackwell B200, the “world’s most powerful chip” designed for AI

There’s no knowing where we’re rowing —

208B transistor chip can reportedly reduce AI cost and energy consumption by up to 25x.

The GB200

Enlarge / The GB200 “superchip” covered with a fanciful blue explosion.

Nvidia / Benj Edwards

On Monday, Nvidia unveiled the Blackwell B200 tensor core chip—the company’s most powerful single-chip GPU, with 208 billion transistors—which Nvidia claims can reduce AI inference operating costs (such as running ChatGPT) and energy consumption by up to 25 times compared to the H100. The company also unveiled the GB200, a “superchip” that combines two B200 chips and a Grace CPU for even more performance.

The news came as part of Nvidia’s annual GTC conference, which is taking place this week at the San Jose Convention Center. Nvidia CEO Jensen Huang delivered the keynote Monday afternoon. “We need bigger GPUs,” Huang said during his keynote. The Blackwell platform will allow the training of trillion-parameter AI models that will make today’s generative AI models look rudimentary in comparison, he said. For reference, OpenAI’s GPT-3, launched in 2020, included 175 billion parameters. Parameter count is a rough indicator of AI model complexity.

Nvidia named the Blackwell architecture after David Harold Blackwell, a mathematician who specialized in game theory and statistics and was the first Black scholar inducted into the National Academy of Sciences. The platform introduces six technologies for accelerated computing, including a second-generation Transformer Engine, fifth-generation NVLink, RAS Engine, secure AI capabilities, and a decompression engine for accelerated database queries.

Press photo of the Grace Blackwell GB200 chip, which combines two B200 GPUs with a Grace CPU into one chip.

Enlarge / Press photo of the Grace Blackwell GB200 chip, which combines two B200 GPUs with a Grace CPU into one chip.

Several major organizations, such as Amazon Web Services, Dell Technologies, Google, Meta, Microsoft, OpenAI, Oracle, Tesla, and xAI, are expected to adopt the Blackwell platform, and Nvidia’s press release is replete with canned quotes from tech CEOs (key Nvidia customers) like Mark Zuckerberg and Sam Altman praising the platform.

GPUs, once only designed for gaming acceleration, are especially well suited for AI tasks because their massively parallel architecture accelerates the immense number of matrix multiplication tasks necessary to run today’s neural networks. With the dawn of new deep learning architectures in the 2010s, Nvidia found itself in an ideal position to capitalize on the AI revolution and began designing specialized GPUs just for the task of accelerating AI models.

Nvidia’s data center focus has made the company wildly rich and valuable, and these new chips continue the trend. Nvidia’s gaming GPU revenue ($2.9 billion in the last quarter) is dwarfed in comparison to data center revenue (at $18.4 billion), and that shows no signs of stopping.

A beast within a beast

Press photo of the Nvidia GB200 NVL72 data center computer system.

Enlarge / Press photo of the Nvidia GB200 NVL72 data center computer system.

The aforementioned Grace Blackwell GB200 chip arrives as a key part of the new NVIDIA GB200 NVL72, a multi-node, liquid-cooled data center computer system designed specifically for AI training and inference tasks. It combines 36 GB200s (that’s 72 B200 GPUs and 36 Grace CPUs total), interconnected by fifth-generation NVLink, which links chips together to multiply performance.

A specification chart for the Nvidia GB200 NVL72 system.

Enlarge / A specification chart for the Nvidia GB200 NVL72 system.

“The GB200 NVL72 provides up to a 30x performance increase compared to the same number of NVIDIA H100 Tensor Core GPUs for LLM inference workloads and reduces cost and energy consumption by up to 25x,” Nvidia said.

That kind of speed-up could potentially save money and time while running today’s AI models, but it will also allow for more complex AI models to be built. Generative AI models—like the kind that power Google Gemini and AI image generators—are famously computationally hungry. Shortages of compute power have widely been cited as holding back progress and research in the AI field, and the search for more compute has led to figures like OpenAI CEO Sam Altman trying to broker deals to create new chip foundries.

While Nvidia’s claims about the Blackwell platform’s capabilities are significant, it’s worth noting that its real-world performance and adoption of the technology remain to be seen as organizations begin to implement and utilize the platform themselves. Competitors like Intel and AMD are also looking to grab a piece of Nvidia’s AI pie.

Nvidia says that Blackwell-based products will be available from various partners starting later this year.

Nvidia unveils Blackwell B200, the “world’s most powerful chip” designed for AI Read More »