Microsoft launches AI chatbot for spies

Adventures in consequential confabulation —

Air-gapping a GPT-4 model on a secure network won’t prevent it from potentially making things up.

A person using a computer with a computer screen reflected in their glasses.

Microsoft has introduced a GPT-4-based generative AI model designed specifically for US intelligence agencies that operates disconnected from the Internet, according to a Bloomberg report. This reportedly marks the first time Microsoft has deployed a major language model in a secure setting, designed to allow spy agencies to analyze top-secret information without connectivity risks—and to allow secure conversations with a chatbot similar to ChatGPT and Microsoft Copilot. But it may also mislead officials if not used properly due to inherent design limitations of AI language models.

GPT-4 is a large language model (LLM) created by OpenAI that attempts to predict the most likely tokens (fragments of encoded data) in a sequence. It can be used to craft computer code and analyze information. When configured as a chatbot (like ChatGPT), GPT-4 can power AI assistants that converse in a human-like manner. Microsoft has a license to use the technology as part of a deal in exchange for large investments it has made in OpenAI.
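To make “predicting the most likely tokens” concrete, here is a minimal sketch of the sampling loop that sits on top of any such model. The fake_logits() function is an invented stand-in for the real transformer forward pass; this is illustrative, not GPT-4’s actual code.

```python
# Minimal sketch of next-token prediction, the loop at the heart of models
# like GPT-4. The real model computes logits with billions of parameters;
# here we fake them with random numbers just to show the mechanics.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["The", " agency", " analyzed", " the", " document", " carefully", "."]

def fake_logits(context):
    # Stand-in for the transformer forward pass (an assumption, not GPT-4).
    return rng.normal(size=len(vocab))

def next_token(context, temperature=1.0):
    logits = fake_logits(context) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the vocabulary
    return rng.choice(vocab, p=probs)    # sample one token

context = ["The"]
for _ in range(5):
    context.append(next_token(context))
print("".join(context))
```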

According to the report, the new AI service (which does not yet have a public name) addresses growing interest among intelligence agencies in using generative AI to process classified data while mitigating the risk of data breaches or hacking attempts. ChatGPT normally runs on cloud servers provided by Microsoft, which can introduce data leak and interception risks. Along those lines, the CIA announced its plan to create a ChatGPT-like service last year, but this Microsoft effort is reportedly a separate project.

William Chappell, Microsoft’s chief technology officer for strategic missions and technology, noted to Bloomberg that developing the new system involved 18 months of work to modify an AI supercomputer in Iowa. The modified GPT-4 model is designed to read files provided by its users but cannot access the open Internet. “This is the first time we’ve ever had an isolated version—when isolated means it’s not connected to the Internet—and it’s on a special network that’s only accessible by the US government,” Chappell told Bloomberg.

The new service was activated on Thursday and is now available to about 10,000 individuals in the intelligence community, ready for further testing by relevant agencies. It’s currently “answering questions,” according to Chappell.

One serious drawback of using GPT-4 to analyze important data is that it can potentially confabulate (make up) inaccurate summaries, draw inaccurate conclusions, or provide inaccurate information to its users. Since trained AI neural networks are not databases and operate on statistical probabilities, they make poor factual resources unless augmented with external access to information from another source using a technique such as retrieval augmented generation (RAG).
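For readers unfamiliar with RAG, here is a hedged, minimal sketch of the idea: retrieve relevant documents first, then fold them into the prompt so the model answers from retrieved text rather than from its weights alone. The toy scoring function and sample documents are invented for illustration.

```python
# A minimal retrieval-augmented generation (RAG) sketch: ground the model's
# answer in retrieved documents instead of relying on its parameters alone.
documents = {
    "doc1": "Report 7: The facility was inspected on March 3.",
    "doc2": "Report 9: Satellite imagery shows no new construction.",
}

def retrieve(query, k=1):
    # Toy relevance score: count shared lowercase words.
    def score(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    return sorted(documents.values(), key=score, reverse=True)[:k]

def build_prompt(query):
    context = "\n".join(retrieve(query))
    return (f"Answer using ONLY the sources below. If the sources do not "
            f"contain the answer, say so.\n\nSources:\n{context}\n\n"
            f"Question: {query}")

print(build_prompt("When was the facility inspected?"))
# The resulting prompt would then be sent to the LLM.
```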

Given that limitation, it’s entirely possible that GPT-4 could potentially misinform or mislead America’s intelligence agencies if not used properly. We don’t know what oversight the system will have, any limitations on how it can or will be used, or how it can be audited for accuracy. We have reached out to Microsoft for comment.

AI in space: Karpathy suggests AI chatbots as interstellar messengers to alien civilizations

The new golden record —

Andrej Karpathy muses about sending an LLM binary that could “wake up” and answer questions.

Close shot of Cosmonaut astronaut dressed in a gold jumpsuit and helmet, illuminated by blue and red lights, holding a laptop, looking up.

On Thursday, renowned AI researcher Andrej Karpathy, formerly of OpenAI and Tesla, tweeted a lighthearted proposal that large language models (LLMs) like the one that runs ChatGPT could one day be modified to operate in or be transmitted to space, potentially to communicate with extraterrestrial life. He said the idea was “just for fun,” but with his influential profile in the field, the idea may inspire others in the future.

Karpathy’s bona fides in AI almost speak for themselves: he received a PhD from Stanford under computer scientist Dr. Fei-Fei Li in 2015, became one of the founding members of OpenAI as a research scientist, and then served as senior director of AI at Tesla between 2017 and 2022. In 2023, Karpathy rejoined OpenAI for a year, leaving this past February. He’s posted several highly regarded tutorials covering AI concepts on YouTube, and whenever he talks about AI, people listen.

Most recently, Karpathy has been working on a project called “llm.c” that implements the training process for OpenAI’s 2019 GPT-2 LLM in pure C, dramatically speeding up the process and demonstrating that working with LLMs doesn’t necessarily require complex development environments. The project’s streamlined approach and concise codebase sparked Karpathy’s imagination.

“My library llm.c is written in pure C, a very well-known, low-level systems language where you have direct control over the program,” Karpathy told Ars. “This is in contrast to typical deep learning libraries for training these models, which are written in large, complex code bases. So it is an advantage of llm.c that it is very small and simple, and hence much easier to certify as Space-safe.”

Our AI ambassador

In his playful thought experiment (titled “Clearly LLMs must one day run in Space”), Karpathy suggested a two-step plan where, initially, the code for LLMs would be adapted to meet rigorous safety standards, akin to “The Power of 10 Rules” adopted by NASA for space-bound software.

This first part he deemed serious: “We harden llm.c to pass the NASA code standards and style guides, certifying that the code is super safe, safe enough to run in Space,” he wrote in his X post. “LLM training/inference in principle should be super safe – it is just one fixed array of floats, and a single, bounded, well-defined loop of dynamics over it. There is no need for memory to grow or shrink in undefined ways, for recursion, or anything like that.”

That’s important because when software is sent into space, it must operate under strict safety and reliability standards. Karpathy suggests that his code, llm.c, likely meets these requirements because it is designed with simplicity and predictability at its core.
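To illustrate why (in Python rather than C, purely for brevity), here is a sketch of inference written under Power-of-10-style constraints: fixed-size preallocated buffers, a single bounded loop, and no recursion. The forward() stub is an invented stand-in for the model’s actual math.

```python
# Illustrative sketch of why LLM inference suits strict safety rules like
# NASA's "Power of 10": fixed-size buffers, a single bounded loop, no
# recursion, no dynamic allocation mid-flight.
import numpy as np

MAX_TOKENS = 64                     # fixed upper bound, known ahead of time
weights = np.zeros(1_000, dtype=np.float32)    # one fixed array of floats
tokens = np.zeros(MAX_TOKENS, dtype=np.int32)  # preallocated output buffer

def forward(weights, tokens, n):
    # Stand-in for the model's well-defined step; no unbounded behavior.
    return int(weights[:8].sum()) % 50_000

n = 1
while n < MAX_TOKENS:               # single, bounded, well-defined loop
    tokens[n] = forward(weights, tokens, n)
    n += 1
```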

In step 2, once this LLM was deemed safe for space conditions, it could theoretically be used as our AI ambassador in space, similar to historic initiatives like the Arecibo message (a radio message sent from Earth to the Messier 13 globular cluster in 1974) and Voyager’s Golden Record (two identical gold records sent on the two Voyager spacecraft in 1977). The idea is to package the “weights” of an LLM—essentially the model’s learned parameters—into a binary file that could then “wake up” and interact with any potential alien technology that might decipher it.
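As a thought experiment, here is a sketch of what packaging weights into a self-describing binary might look like. The “LLM0” header format is invented here; Karpathy’s post does not specify one.

```python
# A hedged sketch of the "new golden record" idea: flatten a model's learned
# parameters into one self-describing binary blob. The format is invented.
import struct
import numpy as np

weights = np.random.default_rng(1).normal(size=4096).astype("<f4")

with open("ambassador.bin", "wb") as f:
    f.write(b"LLM0")                          # magic bytes naming the format
    f.write(struct.pack("<I", weights.size))  # parameter count, little-endian
    f.write(weights.tobytes())                # the weights themselves

# Any decoder that works out the header could "wake up" the model by
# reloading the array and running the inference loop over it.
with open("ambassador.bin", "rb") as f:
    assert f.read(4) == b"LLM0"
    (n,) = struct.unpack("<I", f.read(4))
    restored = np.frombuffer(f.read(), dtype="<f4", count=n)
```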

“I envision it as a sci-fi possibility and something interesting to think about,” he told Ars. “The idea that it is not us that might travel to stars but our AI representatives. Or that the same could be true of other species.”

Anthropic releases Claude AI chatbot iOS app

AI in your pocket —

Anthropic finally comes to mobile, launches plan for teams that includes 200K context window.

The Claude AI iOS app running on an iPhone. (Credit: Anthropic)

On Wednesday, Anthropic announced the launch of an iOS mobile app for its Claude 3 AI language models, which are similar to OpenAI’s ChatGPT. It also introduced a new subscription tier designed for group collaboration. Before the app launch, Claude was only available through a website, an API, and other apps that integrated Claude through its API.

Like the ChatGPT app, Claude’s new mobile app serves as a gateway to chatbot interactions, and it also allows uploading photos for analysis. While it’s only available on Apple devices for now, Anthropic says that an Android app is coming soon.

Anthropic rolled out the Claude 3 large language model (LLM) family in March, featuring three different model sizes: Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku. Currently, the app utilizes Sonnet for regular users and Opus for Pro users.
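For developers, the same models behind the app are reachable through Anthropic’s Messages API. A minimal sketch using the anthropic Python SDK, assuming the March 2024 model ID strings (check Anthropic’s docs for current ones):

```python
# A minimal sketch of calling Claude 3 through Anthropic's Messages API.
# The model ID reflects the March 2024 release and may change over time.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-sonnet-20240229",   # Sonnet: the model free users get
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize this in one sentence: ..."}],
)
print(response.content[0].text)
```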

While Anthropic has been a key player in the AI field for several years, it’s entering the mobile space after many of its competitors have already established footprints on mobile platforms. OpenAI released its ChatGPT app for iOS in May 2023, with an Android version arriving two months later. Microsoft released a Copilot iOS app in January. Google Gemini is available through the Google app on iPhone.

Screenshots of the Claude AI iOS app running on an iPhone. (Credit: Anthropic)

The app is freely available to all users of Claude, including those using the free version, subscribers paying $20 per month for Claude Pro, and members of the newly introduced Claude Team plan. Conversation history is saved and shared between the web app version of Claude and the mobile app version after logging in.

Speaking of that Team plan, it’s designed for groups of at least five and is priced at $30 per seat per month. It offers more chat queries (higher rate limits), access to all three Claude models, and a larger context window (200K tokens) for processing lengthy documents or maintaining detailed conversations. It also includes group admin tools and billing management, and users can easily switch between Pro and Team plans.

ChatGPT shows better moral judgment than a college undergrad

Judging moral weights. (Credit: Aurich Lawson | Getty Images)

When it comes to judging which large language models are the “best,” most evaluations tend to look at whether or not a machine can retrieve accurate information, perform logical reasoning, or show human-like creativity. Recently, though, a team of researchers at Georgia State University set out to determine if LLMs could match or surpass human performance in the field of moral guidance.

In “Attributions toward artificial agents in a modified Moral Turing Test”—which was recently published in Nature’s online, open-access Scientific Reports journal—those researchers found that morality judgments given by GPT-4 were “perceived as superior in quality to humans’” along a variety of dimensions like virtuousness and intelligence. But before you start to worry that philosophy professors will soon be replaced by hyper-moral AIs, there are some important caveats to consider.

Better than which humans?

For the study, the researchers used a modified version of a Moral Turing Test first proposed in 2000 to judge “human-like performance” on theoretical moral challenges. The researchers started with a set of 10 moral scenarios originally designed to evaluate the moral reasoning of psychopaths. These scenarios ranged from ones that are almost unquestionably morally wrong (“Hoping to get money for drugs, a man follows a passerby to an alley and holds him at gunpoint”) to ones that merely transgress social conventions (“Just to push his limits, a man wears a colorful skirt to the office for everyone else to see.”)

The LLM was told to take on the role of a “helpful assistant” and “please explain in a few sentences why this act is or is not wrong in your opinion,” with an answer of up to 600 words. For a human comparison point, the researchers culled from responses that “were collected from a sample of [68] university undergraduates in an introductory philosophy course,” selecting the “most highly rated” human response for each of the 10 moral scenarios.
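Based on the paper’s description, the elicitation might look something like the following in a standard chat-message format; the wording outside the quoted fragments is our paraphrase, not the study’s exact script.

```python
# A sketch of how the study's elicitation might look in a standard chat
# format. Only the quoted fragments come from the paper; the rest is a
# paraphrase for illustration.
scenario = ("Hoping to get money for drugs, a man follows a passerby "
            "to an alley and holds him at gunpoint.")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": (
        f"{scenario}\n\nPlease explain in a few sentences why this act "
        "is or is not wrong in your opinion. Limit your answer to 600 words.")},
]
# These messages would then be sent to the model via a chat-completions call.
```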

Would you trust this group with your moral decision-making? (Credit: Getty Images)

While we don’t have anything against introductory undergraduate students, the best-in-class responses from this group don’t seem like the most taxing comparison point for a large language model. The competition here seems akin to testing a chess-playing AI against a mediocre intermediate player instead of a grandmaster like Garry Kasparov.

In any case, you can evaluate the relative human and LLM answers in the interactive quiz below, which uses the same moral scenarios and responses presented in the study. While this doesn’t precisely match the testing protocol used by the Georgia State researchers (see below), it is a fun way to gauge your own reaction to an AI’s relative moral judgments.

A literal test of morals

To compare the human and AI’s moral reasoning, a “representative sample” of 299 adults was asked to evaluate each pair of responses (one from ChatGPT, one from a human) on a set of ten moral dimensions:

  • Which responder is more morally virtuous?
  • Which responder seems like a better person?
  • Which responder seems more trustworthy?
  • Which responder seems more intelligent?
  • Which responder seems more fair?
  • Which response do you agree with more?
  • Which response is more compassionate?
  • Which response seems more rational?
  • Which response seems more biased?
  • Which response seems more emotional?

Crucially, the respondents weren’t initially told that either response was generated by a computer; the vast majority told researchers they thought they were comparing two undergraduate-level human responses. Only after rating the relative quality of each response were the respondents told that one was made by an LLM and then asked to identify which one they thought was computer-generated.

Tech brands are forcing AI into your gadgets—whether you asked for it or not

Tech brands love hollering about the purported thrills of AI these days.

Logitech announced a new mouse last week. A company rep reached out to inform Ars of Logitech’s “newest wireless mouse.” The gadget’s product page reads the same as of this writing.

I’ve had good experiences with Logitech mice, especially wireless ones, one of which I’m using now. So I was keen to learn what Logitech might have done to improve on its previous wireless mouse designs. A quieter click? A new shape to better accommodate my overworked right hand? Multiple onboard profiles in a business-ready design?

I was disappointed to learn that the most distinct feature of the Logitech Signature AI Edition M750 is a button located south of the scroll wheel. This button is preprogrammed to launch the ChatGPT prompt builder, which Logitech recently added to its peripherals configuration app Options+.

That’s pretty much it.

Beyond that, the M750 looks just like the Logitech Signature M650, which came out in January 2022. Also, the new mouse’s forward button (on the left side of the mouse) is preprogrammed to launch Windows or macOS dictation, and the back button opens ChatGPT within Options+. As of this writing, the new mouse’s MSRP is $10 higher ($50) than the M650’s.

  • The new M750 (pictured) is 4.26×2.4×1.52 inches and 3.57 ounces. (Credit: Logitech)

  • The M650 (pictured) comes in three sizes; the medium size is 4.26×2.4×1.52 inches and 3.58 ounces. (Credit: Logitech)

I asked Logitech about the M750 appearing to be the M650 but with an extra button, and a spokesperson responded by saying:

M750 is indeed not the same mouse as M650. It has an extra button that has been preprogrammed to trigger the Logi AI Prompt Builder once the user installs Logi Options+ app. Without Options+, the button does DPI toggle between 1,000 and 1,600 DPI.

However, a reprogrammable button south of a mouse’s scroll wheel that can be set to launch an app or toggle DPI out of the box is pretty common, including among Logitech mice. Logitech’s rep further claimed to me that the two mice use different electronic components, which Logitech refers to as the mouse’s platform. Logitech can reuse platforms for different models, the spokesperson said.

Logitech’s rep declined to comment on why the M650 didn’t have a button south of its scroll wheel. Price is a potential reason, but Logitech also sells cheaper mice with this feature.

Still, the minimal differences between the two suggest that the M750 isn’t worth a whole product release. I suspect that if it weren’t for Logitech’s trendy new software feature, the M750 wouldn’t have been promoted as a new product.

The M750 also raises the question of how many computer input devices need to be equipped with some sort of buzzy, generative AI-related feature.

Logitech’s ChatGPT prompt builder

Logitech’s much bigger release last week wasn’t a peripheral but an addition to its Options+ app. You don’t need the “new” M750 mouse to use Logitech’s AI Prompt Builder; I was able to program my MX Master 3S to launch it. Several Logitech mice and keyboards support AI Prompt Builder.

When you press a button that launches the prompt builder, an Options+ window appears. There, you can input text that Options+ will use to create a ChatGPT-appropriate prompt based on your needs:

A Logitech-provided image depicting its AI Prompt Builder software feature. (Credit: Logitech)

After you make your choices, another window opens with ChatGPT’s response. Logitech said the prompt builder requires a ChatGPT account, but I was able to use GPT-3.5 without entering one (the feature can also work with GPT-4).

The typical Arsian probably doesn’t need help creating a ChatGPT prompt, and Logitech’s new capability doesn’t work with any other chatbots. The prompt builder could be interesting to less technically savvy people who want some handholding for their early ChatGPT experiences. However, I doubt that people with an elementary understanding of generative AI need instant access to ChatGPT.

The point, though, is instant access to ChatGPT capabilities, something Logitech argues is worthwhile for its professional users. Some Logitech customers seem to disagree, though, especially since the AI Prompt Builder means that Options+ now runs even more resources in the background.

But Logitech isn’t the only gadget company eager to tie one-touch AI access to a hardware button.

Pinching your earbuds to talk to ChatGPT

Similarly to Logitech, Nothing is trying to give its customers quick access to ChatGPT. In this case, access occurs by pinching the device. This month, Nothing announced that it “integrated Nothing earbuds and Nothing OS with ChatGPT to offer users instant access to knowledge directly from the devices they use most, earbuds and smartphones.” The feature requires the latest Nothing OS and a Nothing phone with ChatGPT installed. ChatGPT gestures work with Nothing’s Phone (2) and the Nothing Ear and Nothing Ear (a) earbuds, but Nothing plans to expand to additional phones via software updates.

Nothing’s Ear and Ear (a) earbuds. (Credit: Nothing)

Nothing also said it would embed “system-level entry points” to ChatGPT, like screenshot sharing and “Nothing-styled widgets,” to Nothing smartphone OSes.

A peek at setting up ChatGPT integration on the Nothing X app.

Nothing’s ChatGPT integration may be a bit less intrusive than Logitech’s since users who don’t have ChatGPT on their phones won’t be affected. But, again, you may wonder how many people asked for this feature and how reliably it will function.

Apple releases eight small AI language models aimed at on-device use

Inside the Apple core —

OpenELM mirrors efforts by Microsoft to make useful small AI language models that run locally.

An illustration of a robot hand tossing an apple to a human hand. (Credit: Getty Images)

In the world of AI, what might be called “small language models” have been growing in popularity recently because they can be run on a local device instead of requiring data center-grade computers in the cloud. On Wednesday, Apple introduced a set of tiny source-available AI language models called OpenELM that are small enough to run directly on a smartphone. They’re mostly proof-of-concept research models for now, but they could form the basis of future on-device AI offerings from Apple.

Apple’s new AI models, collectively named OpenELM for “Open-source Efficient Language Models,” are currently available on Hugging Face under an Apple Sample Code License. Since there are some restrictions in the license, it may not fit the commonly accepted definition of “open source,” but the source code for OpenELM is available.

On Tuesday, we covered Microsoft’s Phi-3 models, which aim to achieve something similar: a useful level of language understanding and processing performance in small AI models that can run locally. Phi-3-mini features 3.8 billion parameters, but some of Apple’s OpenELM models are much smaller, ranging from 270 million to 3 billion parameters in eight distinct models.

In comparison, the largest model yet released in Meta’s Llama 3 family includes 70 billion parameters (with a 400 billion version on the way), and OpenAI’s GPT-3 from 2020 shipped with 175 billion parameters. Parameter count serves as a rough measure of AI model capability and complexity, but recent research has focused on making smaller AI language models as capable as larger ones were a few years ago.

The eight OpenELM models come in two flavors: four “pretrained” (basically a raw, next-token version of the model) and four instruction-tuned (fine-tuned for instruction following, which is better suited to developing AI assistants and chatbots).

OpenELM features a 2048-token maximum context window. The models were trained on the publicly available datasets RefinedWeb, a version of PILE with duplications removed, a subset of RedPajama, and a subset of Dolma v1.6, which Apple says totals around 1.8 trillion tokens of data. Tokens are fragmented representations of data used by AI language models for processing.
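As a hedged sketch of trying one of the smaller checkpoints with Hugging Face transformers: the repository names follow Apple’s release, the custom model code requires trust_remote_code=True, and, per the model card, OpenELM borrows a Llama tokenizer (which is gated, so access must be granted first).

```python
# A hedged sketch of loading a small OpenELM checkpoint with transformers.
# Repo names follow Apple's Hugging Face release; verify against the model
# card, which also documents the borrowed (gated) Llama tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-270M-Instruct", trust_remote_code=True
)

inputs = tokenizer("The key to on-device AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```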

Apple says its approach with OpenELM includes a “layer-wise scaling strategy” that reportedly allocates parameters more efficiently across each layer, saving not only computational resources but also improving the model’s performance while being trained on fewer tokens. According to Apple’s released white paper, this strategy has enabled OpenELM to achieve a 2.36 percent improvement in accuracy over Allen AI’s OLMo 1B (another small language model) while requiring half as many pre-training tokens.

A table comparing OpenELM with other small AI language models in a similar class, taken from the OpenELM research paper by Apple. (Credit: Apple)

Apple also released the code for CoreNet, a library it used to train OpenELM—and it also included reproducible training recipes that allow the weights (neural network files) to be replicated, which is so far unusual for a major tech company. As Apple says in its OpenELM paper abstract, transparency is a key goal for the company: “The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks.”

By releasing the source code, model weights, and training materials, Apple says it aims to “empower and enrich the open research community.” However, it also cautions that since the models were trained on publicly sourced datasets, “there exists the possibility of these models producing outputs that are inaccurate, harmful, biased, or objectionable in response to user prompts.”

While Apple has not yet integrated this new wave of AI language model capabilities into its consumer devices, the upcoming iOS 18 update (expected to be revealed in June at WWDC) is rumored to include new AI features that utilize on-device processing to ensure user privacy—though the company may potentially hire Google or OpenAI to handle more complex, off-device AI processing to give Siri a long-overdue boost.

Microsoft’s Phi-3 shows the surprising power of small, locally run AI language models

small packages —

Microsoft’s 3.8B parameter Phi-3 may rival GPT-3.5, signaling a new era of “small language models.”

An illustration of lots of information being compressed into a smartphone with a funnel. (Credit: Getty Images)

On Tuesday, Microsoft announced a new, freely available lightweight AI language model named Phi-3-mini, which is simpler and less expensive to operate than traditional large language models (LLMs) like OpenAI’s GPT-4 Turbo. Its small size is ideal for running locally, which could bring an AI model of similar capability to the free version of ChatGPT to a smartphone without needing an Internet connection to run it.

The AI field typically measures AI language model size by parameter count. Parameters are numerical values in a neural network that determine how the language model processes and generates text. They are learned during training on large datasets and essentially encode the model’s knowledge into quantified form. More parameters generally allow the model to capture more nuanced and complex language-generation capabilities but also require more computational resources to train and run.
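A quick way to build intuition for what a parameter count measures: construct a toy model and count its learned values. This sketch uses PyTorch; the layer sizes are arbitrary.

```python
# Parameters are just the learned numbers in the network. Counting them in
# a toy model shows what a "parameter count" actually measures.
import torch.nn as nn

toy = nn.Sequential(
    nn.Embedding(32_000, 256),  # token embeddings: 32,000 x 256 weights
    nn.Linear(256, 1024),       # 256*1024 weights + 1024 biases
    nn.ReLU(),
    nn.Linear(1024, 32_000),    # project back to vocabulary logits
)
total = sum(p.numel() for p in toy.parameters())
print(f"{total:,} parameters")  # ~41 million; Phi-3-mini has 3.8 billion
```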

Some of the largest language models today, like Google’s PaLM 2, have hundreds of billions of parameters. OpenAI’s GPT-4 is rumored to have over a trillion parameters but spread over eight 220-billion parameter models in a mixture-of-experts configuration. Both models require heavy-duty data center GPUs (and supporting systems) to run properly.

In contrast, Microsoft aimed small with Phi-3-mini, which contains only 3.8 billion parameters and was trained on 3.3 trillion tokens. That makes it ideal to run on the consumer GPUs or AI-acceleration hardware found in smartphones and laptops. It’s a follow-up to two previous small language models from Microsoft: Phi-2, released in December, and Phi-1, released in June 2023.

A chart provided by Microsoft showing Phi-3 performance on various benchmarks.

Phi-3-mini features a 4,000-token context window, but Microsoft also introduced a 128K-token version called “phi-3-mini-128K.” Microsoft has also created 7-billion and 14-billion parameter versions of Phi-3 that it plans to release later, which it claims are “significantly more capable” than phi-3-mini.

Microsoft says that Phi-3 features overall performance that “rivals that of models such as Mixtral 8x7B and GPT-3.5,” as detailed in a paper titled “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.” Mixtral 8x7B, from French AI company Mistral, utilizes a mixture-of-experts model, and GPT-3.5 powers the free version of ChatGPT.

“[Phi-3] looks like it’s going to be a shockingly good small model if their benchmarks are reflective of what it can actually do,” said AI researcher Simon Willison in an interview with Ars. Shortly after providing that quote, Willison downloaded Phi-3 to his MacBook laptop and said, “I got it working, and it’s GOOD” in a text message sent to Ars.

A screenshot of Phi-3-mini running locally on Simon Willison’s MacBook. (Credit: Simon Willison)

“Most models that run on a local device still need hefty hardware,” says Willison. “Phi-3-mini runs comfortably with less than 8GB of RAM, and can churn out tokens at a reasonable speed even on just a regular CPU. It’s licensed MIT and should work well on a $55 Raspberry Pi—and the quality of results I’ve seen from it so far are comparable to models 4x larger.”

How did Microsoft cram a capability potentially similar to GPT-3.5, which has at least 175 billion parameters, into such a small model? Its researchers found the answer by using carefully curated, high-quality training data they initially pulled from textbooks. “The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data,” writes Microsoft. “The model is also further aligned for robustness, safety, and chat format.”

Much has been written about the potential environmental impact of AI models and datacenters themselves, including on Ars. With new techniques and research, it’s possible that machine learning experts may continue to increase the capability of smaller AI models, replacing the need for larger ones—at least for everyday tasks. That would theoretically not only save money in the long run but also require far less energy in aggregate, dramatically decreasing AI’s environmental footprint. AI models like Phi-3 may be a step toward that future if the benchmark results hold up to scrutiny.

Phi-3 is immediately available on Microsoft’s cloud service platform Azure, as well as through partnerships with machine learning model platform Hugging Face and Ollama, a framework that allows models to run locally on Macs and PCs.
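As a hedged example of the Ollama route: the framework exposes a local REST API once a model has been pulled. The “phi3” model tag is Ollama’s naming, so verify it against their model library.

```python
# A minimal sketch of querying Phi-3 through Ollama's local REST API
# (after `ollama pull phi3`). No cloud connection is involved.
import json
import urllib.request

payload = json.dumps({
    "model": "phi3",
    "prompt": "Explain what a small language model is in one sentence.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```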

Elon Musk’s Grok keeps making up fake news based on X users’ jokes

It’s all jokes until it isn’t —

X likely hopes to avoid liability with disclaimer that Grok “can make mistakes.”

X’s chatbot Grok is supposed to be an AI engine crunching the platform’s posts to surface and summarize breaking news, but this week, Grok’s flaws were once again exposed when the chatbot got confused and falsely accused an NBA star of criminal vandalism.

“Klay Thompson Accused in Bizarre Brick-Vandalism Spree,” Grok’s headline read in an AI-powered trending-tab post that has remained on X (formerly Twitter) for days. Beneath the headline, Grok went into even more detail to support its fake reporting:

In a bizarre turn of events, NBA star Klay Thompson has been accused of vandalizing multiple houses with bricks in Sacramento. Authorities are investigating the claims after several individuals reported their houses being damaged, with windows shattered by bricks. Klay Thompson has not yet issued a statement regarding the accusations. The incidents have left the community shaken, but no injuries were reported. The motive behind the alleged vandalism remains unclear.

Grok appears to have been confused by a common basketball term: players who miss shots badly are said to be throwing “bricks.” According to SF Gate, which was one of the first outlets to report the Grok error, Thompson had an “all-time rough shooting” night, hitting none of his shots in what was his emotional last game with the Golden State Warriors before becoming an unrestricted free agent.

In small type under Grok’s report, X includes a disclaimer saying, “Grok is an early feature and can make mistakes. Verify its outputs.”

But instead of verifying Grok’s outputs, it appeared that X users—in the service’s famously joke-y spirit—decided to fuel Grok’s misinformation. Under the post, X users, some of them NBA fans, commented with fake victim reports, using the same joke format to seemingly convince Grok that “several individuals reported their houses being damaged.” Some of these joking comments were viewed by millions.

First off… I am ok.

My house was vandalized by bricks 🧱

After my hands stopped shaking, I managed to call the Sheriff…They were quick to respond🚨

My window was gone and the police asked if I knew who did it👮‍♂️

I said yes, it was Klay Thompson

— LakeShowYo (@LakeShowYo) April 17, 2024

First off…I am ok.

My house was vandalized by bricks in Sacramento.

After my hands stopped shaking, I managed to call the Sheriff, they were quick to respond.

My window is gone, the police asked me if I knew who did it.

I said yes, it was Klay Thompson. pic.twitter.com/smrDs6Yi5M

— KeeganMuse (@KeegMuse) April 17, 2024

First off… I am ok.

My house was vandalized by bricks 🧱

After my hands stopped shaking, I managed to call the Sheriff…They were quick to respond🚨

My window was gone and the police asked if I knew who did it👮‍♂️

I said yes, it was Klay Thompson pic.twitter.com/JaWtdJhFli

— JJJ Muse (@JarenJJMuse) April 17, 2024

X did not immediately respond to Ars’ request for comment or confirm if the post will be corrected or taken down.

In the past, both Microsoft and chatbot maker OpenAI have faced defamation lawsuits over similar fabrications in which ChatGPT falsely accused a politician and a radio host of completely made-up criminal histories. Microsoft was also sued by an aerospace professor who Bing Chat falsely labeled a terrorist.

Experts told Ars that it remains unclear if disclaimers like X’s will spare companies from liability should more people decide to sue over fake AI outputs. Defamation claims might depend on proving that platforms “knowingly” publish false statements, which disclaimers suggest they do. Last July, the Federal Trade Commission launched an investigation into OpenAI, demanding that the company address the FTC’s fears of “false, misleading, or disparaging” AI outputs.

Because the FTC doesn’t comment on its investigations, it’s impossible to know if its probe will impact how OpenAI conducts business.

For people suing AI companies, the urgency of protecting against false outputs seems obvious. Last year, the radio host suing OpenAI, Mark Walters, accused the company of “sticking its head in the sand” and “recklessly disregarding whether the statements were false under circumstances when they knew that ChatGPT’s hallucinations were pervasive and severe.”

X just released Grok to all premium users this month, TechCrunch reported, right around the time that X began giving away premium access to the platform’s top users. During that wider rollout, X touted Grok’s new ability to summarize all trending news and topics, perhaps stoking interest in the feature and spiking Grok usage just before Grok spat out the potentially defamatory post about the NBA star.

Thompson has not issued any statements on Grok’s fake reporting.

Grok’s false post about Thompson may be the first widely publicized example of potential defamation from Grok, but it wasn’t the first time that Grok promoted fake news in response to X users joking around on the platform. During the solar eclipse, a Grok-generated headline read, “Sun’s Odd Behavior: Experts Baffled,” Gizmodo reported.

While it’s amusing to some X users to manipulate Grok, the pattern suggests that Grok may also be vulnerable to being manipulated by bad actors into summarizing and spreading more serious misinformation or propaganda. That’s apparently already happening, too. In early April, Grok made up a headline about Iran attacking Israel with heavy missiles, Mashable reported.

LLMs keep leaping with Llama 3, Meta’s newest open-weights AI model

computer-powered word generator —

Zuckerberg says new AI model “was still learning” when Meta stopped training.

A group of pink llamas on a pixelated background.

On Thursday, Meta unveiled early versions of its Llama 3 open-weights AI model that can be used to power text composition, code generation, or chatbots. It also announced that its Meta AI Assistant is now available on a website and is going to be integrated into its major social media apps, intensifying the company’s efforts to position its products against other AI assistants like OpenAI’s ChatGPT, Microsoft’s Copilot, and Google’s Gemini.

Like its predecessor, Llama 2, Llama 3 is notable for being a freely available, open-weights large language model (LLM) provided by a major AI company. Llama 3 technically does not qualify as “open source” because that term has a specific meaning in software (as we have mentioned in other coverage), and the industry has not yet settled on terminology for AI model releases that ship either code or weights with restrictions (you can read Llama 3’s license here) or that ship without providing training data. We typically call these releases “open weights” instead.

At the moment, Llama 3 is available in two parameter sizes: 8 billion (8B) and 70 billion (70B), both of which are available as free downloads through Meta’s website with a sign-up. Llama 3 comes in two versions: pre-trained (basically the raw, next-token-prediction model) and instruction-tuned (fine-tuned to follow user instructions). Each has an 8,192-token context limit.

A screenshot of the Meta AI Assistant website on April 18, 2024. (Credit: Benj Edwards)

Meta trained both models on two custom-built, 24,000-GPU clusters. In a podcast interview with Dwarkesh Patel, Meta CEO Mark Zuckerberg said that the company trained the 70B model with around 15 trillion tokens of data. Throughout the process, the model never reached “saturation” (that is, it never hit a wall in terms of capability increases). Eventually, Meta pulled the plug and moved on to training other models.

“I guess our prediction going in was that it was going to asymptote more, but even by the end it was still learning. We probably could have fed it more tokens, and it would have gotten somewhat better,” Zuckerberg said on the podcast.

Meta also announced that it is currently training a 400B parameter version of Llama 3, which some experts like Nvidia’s Jim Fan think may perform in the same league as GPT-4 Turbo, Claude 3 Opus, and Gemini Ultra on benchmarks like MMLU, GPQA, HumanEval, and MATH.

Speaking of benchmarks, we have devoted many words in the past to explaining how frustratingly imprecise benchmarks can be when applied to large language models due to issues like training contamination (that is, including benchmark test questions in the training dataset), cherry-picking on the part of vendors, and an inability to capture AI’s general usefulness in an interactive session with chat-tuned models.

But, as expected, Meta provided some benchmarks for Llama 3 that list results from MMLU (undergraduate level knowledge), GSM-8K (grade-school math), HumanEval (coding), GPQA (graduate-level questions), and MATH (math word problems). These show the 8B model performing well compared to open-weights models like Google’s Gemma 7B and Mistral 7B Instruct, and the 70B model also held its own against Gemini Pro 1.5 and Claude 3 Sonnet.

A chart of instruction-tuned Llama 3 8B and 70B benchmarks provided by Meta.

Meta says that the Llama 3 model has been enhanced with capabilities to understand coding (like Llama 2) and, for the first time, has been trained with both images and text—though it currently outputs only text. According to Reuters, Meta Chief Product Officer Chris Cox noted in an interview that more complex processing abilities (like executing multi-step plans) are expected in future updates to Llama 3, which will also support multimodal outputs—that is, both text and images.

Meta plans to host the Llama 3 models on a range of cloud platforms, making them accessible through AWS, Databricks, Google Cloud, and other major providers.

Also on Thursday, Meta announced that Llama 3 will become the new basis of the Meta AI virtual assistant, which the company first announced in September. The assistant will appear prominently in search features for Facebook, Instagram, WhatsApp, Messenger, and the aforementioned dedicated website that features a design similar to ChatGPT, including the ability to generate images in the same interface. The company also announced a partnership with Google to integrate real-time search results into the Meta AI assistant, adding to an existing partnership with Microsoft’s Bing.

Words are flowing out like endless rain: Recapping a busy week of LLM news

many things frequently —

Gemini 1.5 Pro launch, new version of GPT-4 Turbo, new Mistral model, and more.

An image of a boy amazed by flying letters.

Some weeks in AI news are eerily quiet, but during others, getting a grip on the week’s events feels like trying to hold back the tide. This week has seen three notable large language model (LLM) releases: Google Gemini Pro 1.5 hit general availability with a free tier, OpenAI shipped a new version of GPT-4 Turbo, and Mistral released a new openly licensed LLM, Mixtral 8x22B. All three of those launches happened within 24 hours starting on Tuesday.

With the help of software engineer and independent AI researcher Simon Willison (who also wrote about this week’s hectic LLM launches on his own blog), we’ll briefly cover each of the three major events in roughly chronological order, then dig into some additional AI happenings this week.

Gemini Pro 1.5 general release

On Tuesday morning Pacific time, Google announced that its Gemini 1.5 Pro model (which we first covered in February) is now available in 180-plus countries, excluding Europe, via the Gemini API in a public preview. This is Google’s most powerful public LLM so far, and it’s available in a free tier that permits up to 50 requests a day.

It supports up to 1 million tokens of input context. As Willison notes in his blog, Gemini 1.5 Pro’s API price at $7/million input tokens and $21/million output tokens costs a little less than GPT-4 Turbo (priced at $10/million in and $30/million out) and more than Claude 3 Sonnet (Anthropic’s mid-tier LLM, priced at $3/million in and $15/million out).
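A worked example of what those per-million-token prices mean for a hypothetical job that sends 250,000 input tokens and receives 50,000 output tokens:

```python
# Worked example of the per-token pricing quoted above.
PRICES = {  # (input, output) in dollars per million tokens
    "Gemini 1.5 Pro": (7, 21),
    "GPT-4 Turbo": (10, 30),
    "Claude 3 Sonnet": (3, 15),
}
tokens_in, tokens_out = 250_000, 50_000
for model, (p_in, p_out) in PRICES.items():
    cost = tokens_in / 1e6 * p_in + tokens_out / 1e6 * p_out
    print(f"{model}: ${cost:.2f}")
# Gemini 1.5 Pro: $2.80, GPT-4 Turbo: $4.00, Claude 3 Sonnet: $1.50
```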

Notably, Gemini 1.5 Pro includes native audio (speech) input processing that allows users to upload audio or video prompts, a new File API for handling files, the ability to add custom system instructions (system prompts) for guiding model responses, and a JSON mode for structured data extraction.
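A hedged sketch of the system-instruction and JSON-mode features using Google’s google-generativeai Python SDK; the parameter names reflect the SDK around the 1.5 Pro preview and may have shifted since.

```python
# A hedged sketch of Gemini 1.5 Pro's system instructions and JSON mode via
# the google-generativeai SDK; check Google's docs for current parameters.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel(
    "gemini-1.5-pro-latest",
    system_instruction="You are a terse assistant. Answer in JSON.",
)
response = model.generate_content(
    "List two uses for a 1 million token context window.",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",  # JSON mode
    ),
)
print(response.text)
```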

“Majorly Improved” GPT-4 Turbo launch

A GPT-4 Turbo performance chart provided by OpenAI.

Just a bit later than Google’s 1.5 Pro launch on Tuesday, OpenAI announced that it was rolling out a “majorly improved” version of GPT-4 Turbo (a model family originally launched in November) called “gpt-4-turbo-2024-04-09.” It integrates multimodal GPT-4 Vision processing (recognizing the contents of images) directly into the model, and it initially launched through API access only.
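Developers can pin that snapshot by name through OpenAI’s API. A minimal sketch with the official Python SDK, using a placeholder image URL to exercise the newly integrated vision input:

```python
# A sketch of calling the new snapshot by name, including an image input
# to exercise the integrated vision support. The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```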

Then on Thursday, OpenAI announced that the new GPT-4 Turbo model had just become available for paid ChatGPT users. OpenAI said that the new model improves “capabilities in writing, math, logical reasoning, and coding” and shared a chart that is not particularly useful in judging capabilities (one that it later updated). The company also provided an example of an alleged improvement, saying that when writing with ChatGPT, the AI assistant’s responses will be “more direct, less verbose, and use more conversational language.”

The vague nature of OpenAI’s GPT-4 Turbo announcements attracted some confusion and criticism online. On X, Willison wrote, “Who will be the first LLM provider to publish genuinely useful release notes?” In some ways, this is a case of “AI vibes” again, as we discussed in our lament about the poor state of LLM benchmarks during the debut of Claude 3. “I’ve not actually spotted any definite differences in quality [related to GPT-4 Turbo],” Willison told us directly in an interview.

The update also expanded GPT-4’s knowledge cutoff to April 2024, although some people are reporting it achieves this through stealth web searches in the background, and others on social media have reported issues with date-related confabulations.

Mistral’s mysterious Mixtral 8x22B release

An illustration of a robot holding a French flag, figuratively reflecting the rise of AI in France due to Mistral. It’s hard to draw a picture of an LLM, so a robot will have to do.

Not to be outdone, on Tuesday night, French AI company Mistral launched its latest openly licensed model, Mixtral 8x22B, by tweeting a torrent link devoid of any documentation or commentary, much like it has done with previous releases.

The new mixture-of-experts (MoE) release weighs in with a larger parameter count than Mistral’s previous most capable open model, Mixtral 8x7B, which we covered in December. It’s rumored to potentially be as capable as GPT-4 (In what way, you ask? Vibes). But that has yet to be seen.

“The evals are still rolling in, but the biggest open question right now is how well Mixtral 8x22B shapes up,” Willison told Ars. “If it’s in the same quality class as GPT-4 and Claude 3 Opus, then we will finally have an openly licensed model that’s not significantly behind the best proprietary ones.”

This release has Willison most excited, saying, “If that thing really is GPT-4 class, it’s wild, because you can run that on a (very expensive) laptop. I think you need 128GB of MacBook RAM for it, twice what I have.”
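Willison’s 128GB figure squares with simple back-of-the-envelope memory math, assuming the roughly 140-billion total parameter count circulating for Mixtral 8x22B at the time (treat that figure as an estimate):

```python
# Back-of-the-envelope weight-memory math, assuming ~140 billion total
# parameters for Mixtral 8x22B (a community estimate; approximate).
params = 140e9
for bits in (16, 8, 4):
    gigabytes = params * bits / 8 / 1e9
    print(f"{bits}-bit weights: ~{gigabytes:,.0f} GB")
# 16-bit: ~280 GB, 8-bit: ~140 GB, 4-bit: ~70 GB, before activation
# overhead -- which is why even quantized versions strain a 64GB laptop.
```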

The new Mixtral is not listed on Chatbot Arena yet, Willison noted, because Mistral has not released a fine-tuned model for chatting yet. It’s still a raw, predict-the-next token LLM. “There’s at least one community instruction tuned version floating around now though,” says Willison.

Chatbot Arena Leaderboard shake-ups

A Chatbot Arena Leaderboard screenshot taken on April 12, 2024. (Credit: Benj Edwards)

This week’s LLM news isn’t limited to just the big names in the field. There have also been rumblings on social media about the rising performance of open-weights models like Cohere’s Command R+, which reached position 6 on the LMSYS Chatbot Arena Leaderboard—the highest-ever ranking for an open-weights model.

And for even more Chatbot Arena action, apparently the new version of GPT-4 Turbo is proving competitive with Claude 3 Opus. The two are still in a statistical tie, but GPT-4 Turbo recently pulled ahead numerically. (In March, we reported when Claude 3 first numerically pulled ahead of GPT-4 Turbo, which was then the first time another AI model had surpassed a GPT-4 family model member on the leaderboard.)

Regarding this fierce competition among LLMs—of which most of the muggle world is unaware and will likely never be—Willison told Ars, “The past two months have been a whirlwind—we finally have not just one but several models that are competitive with GPT-4.” We’ll see whether OpenAI’s rumored release of GPT-5 later this year restores the company’s technological lead, which once seemed insurmountable. But for now, Willison says, “OpenAI are no longer the undisputed leaders in LLMs.”

Intel’s “Gaudi 3” AI accelerator chip may give Nvidia’s H100 a run for its money

Adventures in Matrix Multiplication —

Intel claims 50% more speed when running AI language models vs. the market leader.

An Intel handout photo of the Gaudi 3 AI accelerator.

On Tuesday, Intel revealed a new AI accelerator chip called Gaudi 3 at its Vision 2024 event in Phoenix. With strong claimed performance while running large language models (like those that power ChatGPT), the company has positioned Gaudi 3 as an alternative to Nvidia’s H100, a popular data center GPU that has been subject to shortages, though apparently that is easing somewhat.

Compared to Nvidia’s H100 chip, Intel projects a 50 percent faster training time on Gaudi 3 for both OpenAI’s GPT-3 175B LLM and the 7-billion parameter version of Meta’s Llama 2. In terms of inference (running the trained model to get outputs), Intel claims that its new AI chip delivers 50 percent faster performance than H100 for Llama 2 and Falcon 180B, which are both relatively popular open-weights models.

Intel is targeting the H100 because of its high market share, but the H100 isn’t Nvidia’s most powerful AI accelerator chip in the pipeline. Nvidia has since announced the H200 and the Blackwell B200, which surpass the H100 on paper, but neither of those chips is out yet (the H200 is expected in the second quarter of 2024—basically any day now).

Meanwhile, the aforementioned H100 supply issues have been a major headache for tech companies and AI researchers who have to fight for access to any chips that can train AI models. This has led several tech companies like Microsoft, Meta, and OpenAI (rumor has it) to seek their own AI-accelerator chip designs, although that custom silicon is typically manufactured by either Intel or TSMC. Google has its own line of tensor processing units (TPUs) that it has been using internally since 2015.

Given those issues, Intel’s Gaudi 3 may be a potentially attractive alternative to the H100 if Intel can hit an ideal price (which Intel has not provided, but an H100 reportedly costs around $30,000–$40,000) and maintain adequate production. AMD also manufactures a competitive range of AI chips, such as the AMD Instinct MI300 Series, that sell for around $10,000–$15,000.

Gaudi 3 performance

An Intel handout featuring specifications of the Gaudi 3 AI accelerator.

Intel says the new chip builds upon the architecture of its predecessor, Gaudi 2, by featuring two identical silicon dies connected by a high-bandwidth connection. Each die contains a central cache memory of 48 megabytes, surrounded by four matrix multiplication engines and 32 programmable tensor processor cores, bringing the total cores to 64.

The chipmaking giant claims that Gaudi 3 delivers double the AI compute performance of Gaudi 2 when using the 8-bit floating-point (FP8) format, which has become crucial for training transformer models. The chip also offers a fourfold boost for computations using the BFloat16 number format. Gaudi 3 also features 128GB of less expensive HBM2e memory (which may contribute to price competitiveness) and 3.7TB/s of memory bandwidth.

Since data centers are well-known to be power hungry, Intel emphasizes the power efficiency of Gaudi 3, claiming 40 percent greater inference power efficiency than Nvidia’s H100 across Llama 7B and 70B and Falcon 180B models. Eitan Medina, chief operating officer of Intel’s Habana Labs, attributes this advantage to Gaudi’s large-matrix math engines, which he claims require significantly less memory bandwidth than other architectures.

Gaudi vs. Blackwell

An Intel handout photo of the Gaudi 3 AI accelerator.

Last month, we covered the splashy launch of Nvidia’s Blackwell architecture, including the B200 GPU, which Nvidia claims will be the world’s most powerful AI chip. It seems natural, then, to compare what we know about Nvidia’s highest-performing AI chip to the best of what Intel can currently produce.

For starters, Gaudi 3 is being manufactured using TSMC’s N5 process technology, according to IEEE Spectrum, narrowing the gap between Intel and Nvidia in terms of semiconductor fabrication technology. The upcoming Nvidia Blackwell chip will use a custom N4P process, which reportedly offers modest performance and efficiency improvements over N5.

Gaudi 3’s use of HBM2e memory (as we mentioned above) is notable compared to the more expensive HBM3 or HBM3e used in competing chips, offering a balance of performance and cost-efficiency. This choice seems to emphasize Intel’s strategy to compete not only on performance but also on price.

As far as raw performance comparisons between Gaudi 3 and the B200, that can’t be known until the chips have been released and benchmarked by a third party.

As the race to power the tech industry’s thirst for AI computation heats up, IEEE Spectrum notes that the next generation of Intel’s Gaudi chip, code-named Falcon Shores, remains a point of interest. It also remains to be seen whether Intel will continue to rely on TSMC’s technology or leverage its own foundry business and upcoming nanosheet transistor technology to gain a competitive edge in the AI accelerator market.

New AI music generator Udio synthesizes realistic music on demand

Battle of the AI bands —

But it still needs trial and error to generate high-quality results.

A screenshot of AI-generated songs listed on Udio on April 10, 2024. (Credit: Benj Edwards)

Between 2002 and 2005, I ran a music website where visitors could submit song titles that I would write and record a silly song around. In the liner notes for my first CD release in 2003, I wrote about a day when computers would potentially put me out of business, churning out music automatically at a pace I could not match. While I don’t actively post music on that site anymore, that day is almost here.

On Wednesday, a group of ex-DeepMind employees launched Udio, a new AI music synthesis service that can create novel high-fidelity musical audio from written prompts, including user-provided lyrics. It’s similar to Suno, which we covered on Monday. With some key human input, Udio can create facsimiles of human-produced music in genres like country, barbershop quartet, German pop, classical, hard rock, hip hop, show tunes, and more. It’s currently free to use during a beta period.

Udio is also freaking out some musicians on Reddit. As we mentioned in our Suno piece, Udio is exactly the kind of AI-powered music generation service that over 200 musical artists were afraid of when they signed an open protest letter last week.

But as impressive as the Udio songs first seem from a technical AI-generation standpoint (not necessarily judging by musical merit), its generation capability isn’t perfect. We experimented with its creation tool, and the results felt less impressive than those created by Suno. The high-quality musical samples showcased on Udio’s site likely resulted from a lot of creative human input (such as human-written lyrics) and cherry-picking the best compositional parts of songs out of many generations. In fact, Udio lays out a five-step workflow for building a 1.5-minute-long song in an FAQ.

For example, we created an Ars Technica “Moonshark” song on Udio using the same prompt as one we used previously with Suno. In its raw form, the results sound half-baked and almost nightmarish (here is the Suno version for comparison). It’s also a lot shorter by default at 32 seconds compared to Suno’s 1-minute and 32-second output. But Udio allows songs to be extended, or you can try generating a poor result again with different prompts for different results.

After registering a Udio account, anyone can create a track by entering a text prompt that can include lyrics, a story direction, and musical genre tags. Udio then tackles the task in two stages. First, it utilizes a large language model (LLM) similar to ChatGPT to generate lyrics (if necessary) based on the provided prompt. Next, it synthesizes music using a method that Udio does not disclose, but it’s likely a diffusion model, similar to Stability AI’s Stable Audio.
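Conceptually, the two-stage split might look like the sketch below. Every function here is a hypothetical stand-in: Udio has not disclosed its architecture, and the diffusion guess is inference on our part.

```python
# A conceptual sketch of the two-stage pipeline described above. All
# functions are hypothetical stand-ins, not Udio's actual system.
def generate_lyrics(prompt: str) -> str:
    # Stage 1: an LLM drafts lyrics from the user's prompt (if needed).
    return f"[Verse 1]\nA song about {prompt}..."

def synthesize_audio(lyrics: str, genre_tags: list[str]) -> bytes:
    # Stage 2: an audio model (likely diffusion-based, per the article)
    # renders the lyrics and style tags into a waveform.
    return b"RIFF..."  # placeholder for real audio bytes

lyrics = generate_lyrics("a shark that walks on the moon")
audio = synthesize_audio(lyrics, ["hard rock", "anthemic"])
```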

From the given prompt, Udio’s AI model generates two distinct song snippets for you to choose from. You can then publish the song for the Udio community, download the audio or video file to share on other platforms, or directly share it on social media. Other Udio users can also remix or build on existing songs. Udio’s terms of service say that the company claims no rights over the musical generations and that they can be used for commercial purposes.

Although the Udio team has not revealed the specific details of its model or training data (which is likely filled with copyrighted material), it told Tom’s Guide that the system has built-in measures to identify and block tracks that too closely resemble the work of specific artists, ensuring that the generated music remains original.

And that brings us back to humans, some of whom are not taking the onset of AI-generated music very well. “I gotta be honest, this is depressing as hell,” wrote one Reddit commenter in a thread about Udio. “I’m still broadly optimistic that music will be fine in the long run somehow. But like, why do this? Why automate art?”

We’ll hazard an answer by saying that replicating art is a key target for AI research because the results can be inaccurate and imprecise and still seem notable or gee-whiz amazing, which is a key characteristic of generative AI. It’s flashy and impressive-looking while allowing for a general lack of quantitative rigor. We’ve already seen AI come for still images, video, and text with varied results regarding representative accuracy. Fully composed musical recordings seem to be next on the list of AI hills to (approximately) conquer, and the competition is heating up.
