machine learning

Hidden AI instructions reveal how Anthropic controls Claude 4

Willison, who coined the term “prompt injection” in 2022, is always on the lookout for LLM vulnerabilities. In his post, he notes that reading system prompts reminds him of warning signs in the real world that hint at past problems. “A system prompt can often be interpreted as a detailed list of all of the things the model used to do before it was told not to do them,” he writes.

Fighting the flattery problem

An illustrated robot holds four red hearts with its four robotic arms.

Willison’s analysis comes as AI companies grapple with sycophantic behavior in their models. As we reported in April, ChatGPT users have complained about GPT-4o’s “relentlessly positive tone” and excessive flattery since OpenAI’s March update. Users described feeling “buttered up” by responses like “Good question! You’re very astute to ask that,” with software engineer Craig Weiss tweeting that “ChatGPT is suddenly the biggest suckup I’ve ever met.”

The issue stems from how companies collect user feedback during training—people tend to prefer responses that make them feel good, creating a feedback loop where models learn that enthusiasm leads to higher ratings from humans. In response to the backlash, OpenAI later rolled back the GPT-4o update and altered the model’s system prompt, something we reported on and Willison also analyzed at the time.

One of Willison’s most interesting findings about Claude 4 relates to how Anthropic has guided both Claude models to avoid sycophantic behavior. “Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective,” Anthropic writes in the prompt. “It skips the flattery and responds directly.”

Other system prompt highlights

The Claude 4 system prompt also includes extensive instructions on when Claude should or shouldn’t use bullet points and lists, with multiple paragraphs dedicated to discouraging frequent list-making in casual conversation. “Claude should not use bullet points or numbered lists for reports, documents, explanations, or unless the user explicitly asks for a list or ranking,” the prompt states.

Google’s Will Smith double is better at eating AI spaghetti … but it’s crunchy?

On Tuesday, Google launched Veo 3, a new AI video synthesis model that can do something no major AI video generator has been able to do before: create a synchronized audio track. While we saw early steps in AI video generation from 2022 to 2024, those videos were silent and usually very short. Now you can hear voices, dialog, and sound effects in eight-second high-definition video clips.

Shortly after the new launch, people began asking the most obvious benchmarking question: How good is Veo 3 at faking Oscar-winning actor Will Smith at eating spaghetti?

First, a brief recap. The spaghetti benchmark in AI video traces its origins back to March 2023, when we first covered an early example of horrific AI-generated video using an open source video synthesis model called ModelScope. The spaghetti example later became well-known enough that Smith parodied it almost a year later in February 2024.

Here’s what the original viral video looked like:

One thing people forget is that at the time, ModelScope wasn’t the best AI video generator out there—a video synthesis model called Gen-2 from Runway had already achieved superior results (though it was not yet publicly accessible). But the ModelScope result was funny and weird enough to stick in people’s memories as an early poor example of video synthesis, handy for future comparisons as AI models progressed.

AI app developer Javi Lopez first came to the rescue for curious spaghetti fans earlier this week with Veo 3, performing the Smith test and posting the results on X. But as you’ll notice below when you watch, the soundtrack has a curious quality: The faux Smith appears to be crunching on the spaghetti.

On X, Javi Lopez ran “Will Smith eating spaghetti” in Google’s Veo 3 AI video generator and received this result.

It’s a glitch in Veo 3’s experimental ability to apply sound effects to video, likely because the training data used to create Google’s AI models featured many examples of chewing mouths with crunching sound effects. Generative AI models are pattern-matching prediction machines, and they need to be shown enough examples of various types of media to generate convincing new outputs. If a concept is over-represented or under-represented in the training data, you’ll see unusual generation results, such as jabberwockies.

New Claude 4 AI model refactored code for 7 hours straight


Anthropic says Claude 4 beats Gemini on coding benchmarks; works autonomously for hours.

The Claude 4 logo, created by Anthropic. Credit: Anthropic

On Thursday, Anthropic released Claude Opus 4 and Claude Sonnet 4, marking the company’s return to larger model releases after primarily focusing on mid-range Sonnet variants since June of last year. The new models represent what the company calls its most capable coding models yet, with Opus 4 designed for complex, long-running tasks that can operate autonomously for hours.

Alex Albert, Anthropic’s head of Claude Relations, told Ars Technica that the company chose to revive the Opus line because of growing demand for agentic AI applications. “Across all the companies out there that are building things, there’s a really large wave of these agentic applications springing up, and a very high demand and premium being placed on intelligence,” Albert said. “I think Opus is going to fit that groove perfectly.”

Before we go further, a brief refresher on Claude’s three AI model “size” names (first introduced in March 2024) is probably warranted. Haiku, Sonnet, and Opus offer a tradeoff between price (in the API), speed, and capability.

Haiku models are the smallest, least expensive to run, and least capable in terms of what you might call “context depth” (considering conceptual relationships in the prompt) and encoded knowledge. Owing to their smaller parameter count, Haiku models retain fewer concrete facts and thus tend to confabulate more frequently (plausibly making up answers when they lack the underlying data), but they are much faster at basic tasks than larger models. Sonnet is traditionally a mid-range model that strikes a balance between cost and capability, and Opus models have always been the largest and slowest to run. However, Opus models process context more deeply and are hypothetically better suited for running deep logical tasks.

A screenshot of the Claude web interface with Opus 4 and Sonnet 4 options shown. Credit: Anthropic

There is no Claude 4 Haiku just yet, but the new Sonnet and Opus models can reportedly handle tasks that previous versions could not. In our interview with Albert, he described testing scenarios where Opus 4 worked coherently for up to 24 hours on tasks like playing Pokémon, while code-refactoring tasks in Claude Code ran for seven hours without interruption. Earlier Claude models typically lasted only one to two hours before losing coherence, Albert said, meaning that the models could only produce useful self-referencing outputs for that long before beginning to output too many errors.

In particular, that marathon refactoring claim reportedly comes from Rakuten, a Japanese tech services conglomerate that “validated [Claude’s] capabilities with a demanding open-source refactor running independently for 7 hours with sustained performance,” Anthropic said in a news release.

Whether you’d want to leave an AI model unsupervised for that long is another question entirely because even the most capable AI models can introduce subtle bugs, go down unproductive rabbit holes, or make choices that seem logical to the model but miss important context that a human developer would catch. While many people now use Claude for easy-going vibe coding, as we covered in March, the human-powered (and ironically-named) “vibe debugging” that often results from long AI coding sessions is also a very real thing. More on that below.

To shore up some of those shortcomings, Anthropic built memory capabilities into both new Claude 4 models, allowing them to maintain external files for storing key information across long sessions. When developers provide access to local files, the models can create and update “memory files” to track progress and things they deem important over time. Albert compared this to how humans take notes during extended work sessions.
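Anthropic hasn’t published implementation details here, but conceptually the feature resembles giving the model a read/write file tool it can call during a session. The sketch below is a hypothetical illustration of that idea, not Anthropic’s actual mechanism; the tool name, schema, and file path are invented for demonstration.

```python
# Hypothetical "memory file" tool a developer might expose to Claude.
# Illustrative only; this is not Anthropic's implementation of the feature.
import json
from pathlib import Path

MEMORY_PATH = Path("claude_memory.json")  # invented local file for this sketch

# Tool definition in the general shape the Anthropic Messages API expects.
memory_tool = {
    "name": "update_memory",
    "description": "Save a note the model wants to remember across a long session.",
    "input_schema": {
        "type": "object",
        "properties": {"note": {"type": "string"}},
        "required": ["note"],
    },
}

def handle_update_memory(tool_input: dict) -> str:
    """Runs locally when the model calls the tool; appends the note to disk."""
    notes = json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else []
    notes.append(tool_input["note"])
    MEMORY_PATH.write_text(json.dumps(notes, indent=2))
    return f"Saved. {len(notes)} notes stored."
```

In a setup like this, the notes the model writes can be fed back into later prompts, which is roughly how progress could persist across a multi-hour session.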

Extended thinking meets tool use

Both Claude 4 models introduce what Anthropic calls “extended thinking with tool use,” a new beta feature allowing the models to alternate between simulated reasoning and using external tools like web search, similar to what OpenAI’s o3 and o4-mini-high AI models currently do in ChatGPT. While Claude 3.7 Sonnet already had strong tool use capabilities, the new models can now interleave simulated reasoning and tool calling in a single response.

“So now we can actually think, call a tool, process the results, think some more, call another tool, and repeat until it gets to a final answer,” Albert explained to Ars. The models self-determine when they have reached a useful conclusion, a capability picked up through training rather than governed by explicit human programming.
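For developers, the request shape looks roughly like the sketch below, which uses the Anthropic Python SDK. It is a minimal illustration rather than a verified recipe: the model ID, token budgets, and the toy web-search tool are assumptions, and the exact flags required for interleaved thinking in the Claude 4 beta may differ from what is shown.

```python
# Minimal sketch: simulated reasoning ("extended thinking") plus tool use
# with the Anthropic Python SDK. Model ID, budgets, and the example tool
# are assumptions for illustration; check Anthropic's docs for specifics.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model identifier
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},  # reasoning budget
    tools=[{
        "name": "web_search",  # hypothetical tool implemented by the caller
        "description": "Search the web and return short text snippets.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }],
    messages=[{"role": "user", "content": "Summarize today's Claude 4 news."}],
)

# The response can mix thinking, text, and tool_use blocks; a real agent loop
# would execute each tool call, send back a tool_result, and continue.
for block in response.content:
    print(block.type)
```

In practice, the calling code is responsible for actually running the tool and returning its output, which is what lets the model alternate between reasoning and tool results in the way Albert describes.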

General Claude 4 benchmark results, provided by Anthropic. Credit: Anthropic

In practice, we’ve anecdotally found parallel tool use capability very useful in AI assistants like OpenAI o3, since they don’t have to rely on what is trained in their neural network to provide accurate answers. Instead, these more agentic models can iteratively search the web, parse the results, analyze images, and spin up coding tasks for analysis in ways that can avoid falling into a confabulation trap by relying solely on pure LLM outputs.

“The world’s best coding model”

Anthropic says Opus 4 leads industry benchmarks for coding tasks, achieving 72.5 percent on SWE-bench and 43.2 percent on Terminal-bench, calling it “the world’s best coding model.” According to Anthropic, companies using early versions report improvements. Cursor described it as “state-of-the-art for coding and a leap forward in complex codebase understanding,” while Replit noted “improved precision and dramatic advancements for complex changes across multiple files.”

In fact, GitHub announced it will use Sonnet 4 as the base model for its new coding agent in GitHub Copilot, citing the model’s performance in “agentic scenarios” in Anthropic’s news release. Sonnet 4 scored 72.7 percent on SWE-bench while maintaining faster response times than Opus 4. The fact that GitHub is betting on Claude rather than a model from its parent company Microsoft (which has close ties to OpenAI) suggests Anthropic has built something genuinely competitive.

Software engineering benchmark results, provided by Anthropic. Credit: Anthropic

Anthropic says it has addressed a persistent issue with Claude 3.7 Sonnet in which users complained that the model would take unauthorized actions or provide excessive output. Albert said the company reduced this “reward hacking behavior” by approximately 80 percent in the new models through training adjustments. An 80 percent reduction in unwanted behavior sounds impressive, but that also suggests that 20 percent of the problem behavior remains—a big concern when we’re talking about AI models that might be performing autonomous tasks for hours.

When we asked about code accuracy, Albert said that human code review is still an important part of shipping any production code. “There’s a human parallel, right? So this is just a problem we’ve had to deal with throughout the whole nature of software engineering. And this is why the code review process exists, so that you can catch these things. We don’t anticipate that going away with models either,” Albert said. “If anything, the human review will become more important, and more of your job as developer will be in this review than it will be in the generation part.”

Pricing and availability

Both Claude 4 models maintain the same pricing structure as their predecessors: Opus 4 costs $15 per million tokens for input and $75 per million for output, while Sonnet 4 remains at $3 and $15. The models offer two response modes: traditional LLM and simulated reasoning (“extended thinking”) for complex problems. Given that some Claude Code sessions can apparently run for hours, those per-token costs will likely add up very quickly for users who let the models run wild.
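To make those rates concrete, here is a back-of-the-envelope estimate for a hypothetical long Opus 4 session; the token counts are invented for illustration, while the per-million-token prices are the ones quoted above.

```python
# Rough cost estimate for a long Claude Opus 4 session at the quoted prices
# ($15 per million input tokens, $75 per million output tokens).
INPUT_RATE = 15 / 1_000_000    # dollars per input token
OUTPUT_RATE = 75 / 1_000_000   # dollars per output token

# Hypothetical cumulative usage for a multi-hour agentic coding session.
input_tokens = 5_000_000       # prompts, re-read files, tool results
output_tokens = 1_000_000      # generated code, reasoning, and explanations

cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
print(f"Estimated cost: ${cost:,.2f}")  # -> Estimated cost: $150.00
```

The same hypothetical usage at Sonnet 4’s $3/$15 rates would come to $30, one reason the smaller model remains the more economical choice for routine work.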

Anthropic made both models available through its API, Amazon Bedrock, and Google Cloud Vertex AI. Sonnet 4 remains accessible to free users, while Opus 4 requires a paid subscription.

The Claude 4 models also debut Claude Code (first introduced in February) as a generally available product after months of preview testing. Anthropic says the coding environment now integrates with VS Code and JetBrains IDEs, showing proposed edits directly in files. A new SDK allows developers to build custom agents using the same framework.

A screenshot of “Claude Plays Pokemon,” a custom application where Claude 4 attempts to beat the classic Game Boy game. Credit: Anthropic

Even with Anthropic’s future riding on the capability of these new models, when we asked how the company guides Claude’s behavior through fine-tuning, Albert acknowledged that the inherent unpredictability of these systems presents ongoing challenges for both Anthropic and developers. “In the realm and the world of software for the past 40, 50 years, we’ve been running on deterministic systems, and now all of a sudden, it’s non-deterministic, and that changes how we build,” he said.

“I empathize with a lot of people out there trying to use our APIs and language models generally because they have to almost shift their perspective on what it means for reliability, what it means for powering a core of your application in a non-deterministic way,” Albert added. “These are general oddities that have kind of just been flipped, and it definitely makes things more difficult, but I think it opens up a lot of possibilities as well.”

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Chicago Sun-Times prints summer reading list full of fake books

Photo of the Chicago Sun-Times “Summer reading list for 2025” supplement. Credit: Rachael King / Bluesky

Novelist Rachael King initially called attention to the error on Bluesky Tuesday morning. “The Chicago Sun-Times obviously gets ChatGPT to write a ‘summer reads’ feature almost entirely made up of real authors but completely fake books. What are we coming to?” King wrote.

So far, community reaction to the list has been largely negative online, but others have expressed sympathy for the publication. Freelance journalist Joshua J. Friedman noted on Bluesky that the reading list was “part of a ~60-page summer supplement” published on May 18, suggesting it might be “transparent filler” possibly created by “the lone freelancer apparently saddled with producing it.”

The staffing connection

The reading list appeared in a 64-page supplement called “Heat Index,” which was a promotional section not specific to Chicago. Marco Buscaglia, the freelance writer who produced the section, told 404 Media the content was meant to be “generic and national” and would be inserted into newspapers around the country. “We never get a list of where things ran,” he said.

The publication error comes two months after the Chicago Sun-Times lost 20 percent of its staff through a buyout program. In March, the newspaper’s nonprofit owner, Chicago Public Media, announced that 30 Sun-Times employees—including 23 from the newsroom—had accepted buyout offers amid financial struggles.

A March report on the buyout in the Sun-Times described the staff reduction as “the most drastic the oft-imperiled Sun-Times has faced in several years.” The departures included columnists, editorial writers, and editors with decades of experience.

Melissa Bell, CEO of Chicago Public Media, stated at the time that the exits would save the company $4.2 million annually. The company offered buyouts as it prepared for an expected expiration of grant support at the end of 2026.

Even with those pressures in the media, one Reddit user expressed disapproval of the apparent use of AI in the newspaper, even in a supplement that might not have been produced by staff. “As a subscriber, I am livid! What is the point of subscribing to a hard copy paper if they are just going to include AI slop too!?” wrote Reddit user xxxlovelit, who shared the reading list. “The Sun Times needs to answer for this, and there should be a reporter fired.”

This article was updated on May 20, 2025 at 11:02 AM to include information on Marco Buscaglia from 404 Media.

Labor dispute erupts over AI-voiced Darth Vader in Fortnite

For voice actors who previously portrayed Darth Vader in video games, the Fortnite feature starkly illustrates how AI voice synthesis could reshape their profession. While James Earl Jones created the iconic voice for films, at least 54 voice actors have performed as Vader in various games and other media over the years when Jones wasn’t available—work that could vanish if AI replicas become the industry standard.

The union strikes back

SAG-AFTRA’s labor complaint (which can be read online here) doesn’t focus on the AI feature’s technical problems or on permission from the Jones estate, which explicitly authorized the use of a synthesized version of his voice for the character in Fortnite. Before his death in 2024, Jones had signed over the rights to his Darth Vader voice.

Instead, the union’s grievance centers on labor rights and collective bargaining. In the NLRB filing, SAG-AFTRA alleges that Llama Productions “failed and refused to bargain in good faith with the union by making unilateral changes to terms and conditions of employment, without providing notice to the union or the opportunity to bargain, by utilizing AI-generated voices to replace bargaining unit work on the Interactive Program Fortnite.”

The action comes amid SAG-AFTRA’s ongoing interactive media strike, which began in July 2024 after negotiations with video game producers stalled primarily over AI protections. The strike continues, with more than 100 games signing interim agreements, while others, including those from major publishers like Epic, remain in dispute.

The empire strikes back with F-bombs: AI Darth Vader goes rogue with profanity, slurs

In that sense, the vulgar Vader situation creates a touchy dilemma for Epic Games and Disney, which likely invested substantially in this high-profile collaboration. While Epic acted swiftly in response, maintaining the feature while preventing further Jedi mind tricks from players presents ongoing technical challenges for interactive AI speech of any kind.

AI language models like the one used to construct responses for Vader (Google’s Gemini 2.0 Flash in this case, according to Epic) are fairly easy to trick with exploits like prompt injections and jailbreaks, and that has limited their usefulness in some applications. Imagine a truly ChatGPT-like Siri or Alexa, for example, that could be tricked into saying racist things on behalf of Apple or Amazon.

David Prowse as Darth Vader and Carrie Fisher as Princess Leia filming the original Star Wars. Credit: Sunset Boulevard/Corbis via Getty Images

Beyond language models, the AI voice technology behind the AI Darth Vader voice in Fortnite comes from ElevenLabs’ Flash v2.5 model, trained on examples of speech from James Earl Jones so it can synthesize new speech in the same style.

Previously, Lucasfilm worked with Respeecher, a Ukrainian startup we covered in 2022, to recreate Darth Vader’s voice performance for Obi-Wan Kenobi using a different AI voice model; Respeecher’s technology isn’t used in Fortnite.

According to Variety, Jones’ family supported the new Fortnite collaboration, stating: “James Earl felt that the voice of Darth Vader was inseparable from the story of Star Wars, and he always wanted fans of all ages to continue to experience it. We hope that this collaboration with Fortnite will allow both longtime fans of Darth Vader and newer generations to share in the enjoyment of this iconic character.”

This article was updated on May 16, 2025 at 4:25 PM to include information about an email sent out from Epic Games to parents. This article was updated again on May 17, 2025 at 10:10 AM to correctly attribute ElevenLabs Flash v2.5 as the source of the Darth Vader audio model in Fortnite. The article previously incorrectly stated that Respeecher had been used for the game.

OpenAI adds GPT-4.1 to ChatGPT amid complaints over confusing model lineup

The release comes just two weeks after OpenAI made GPT-4 unavailable in ChatGPT on April 30. That earlier model, which launched in March 2023, once sparked widespread hype about AI capabilities. Compared to that hyperbolic launch, GPT-4.1’s rollout has been a fairly understated affair—probably because it’s tricky to convey the subtle differences between all of the available OpenAI models.

As if 4.1’s launch wasn’t confusing enough, the release also roughly coincides with OpenAI’s July 2025 deadline for retiring the GPT-4.5 Preview from the API, a model one AI expert called a “lemon.” Developers must migrate to other options, OpenAI says, although GPT-4.5 will remain available in ChatGPT for now.

A confusing addition to OpenAI’s model lineup

In February, OpenAI CEO Sam Altman acknowledged on X his company’s confusing AI model naming practices, writing, “We realize how complicated our model and product offerings have gotten.” He promised that a forthcoming “GPT-5” model would consolidate the o-series and GPT-series models into a unified branding structure. But the addition of GPT-4.1 to ChatGPT appears to contradict that simplification goal.

So, if you use ChatGPT, which model should you use? If you’re a developer using the models through the API, the consideration is more of a trade-off between capability, speed, and cost. But in ChatGPT, your choice might be limited more by personal taste in behavioral style and what you’d like to accomplish. Some of the “more capable” models have lower usage limits as well because they cost more for OpenAI to run.

For now, OpenAI is keeping GPT-4o as the default ChatGPT model, likely due to its general versatility, balance between speed and capability, and personable style (conditioned using reinforcement learning and a specialized system prompt). The simulated reasoning models like o3 and o4-mini-high are slower to execute but can consider analytical-style problems more systematically and perform comprehensive web research that sometimes feels genuinely useful when it surfaces relevant (non-confabulated) web links. Compared to those, OpenAI is largely positioning GPT-4.1 as a speedier AI model for coding assistance.

Just remember that all of the AI models are prone to confabulations, meaning that they tend to make up authoritative-sounding information when they encounter gaps in their trained “knowledge.” So you’ll need to double-check all of the outputs with other sources of information if you’re hoping to use these AI models to assist with an important task.

GOP sneaks decade-long AI regulation ban into spending bill

The reconciliation bill primarily focuses on cuts to Medicaid access and increased health care fees for millions of Americans. The AI provision appears as an addition to these broader health care changes, potentially limiting debate on the technology’s policy implications.

The move is already inspiring backlash. On Monday, tech safety groups and at least one Democrat criticized the proposal, reports The Hill. Rep. Jan Schakowsky (D-Ill.), the ranking member on the Commerce, Manufacturing and Trade Subcommittee, called the proposal a “giant gift to Big Tech,” while nonprofit groups like the Tech Oversight Project and Consumer Reports warned it would leave consumers unprotected from AI harms like deepfakes and bias.

Big Tech’s White House connections

President Trump has already reversed several Biden-era executive orders on AI safety and risk mitigation. The push to prevent state-level AI regulation represents an escalation in the administration’s industry-friendly approach to AI policy.

Perhaps it’s no surprise, as the AI industry has cultivated close ties with the Trump administration since before the president took office. For example, Tesla CEO Elon Musk serves in the Department of Government Efficiency (DOGE), while entrepreneur David Sacks acts as “AI czar,” and venture capitalist Marc Andreessen reportedly advises the administration. OpenAI CEO Sam Altman appeared with Trump in an AI datacenter development plan announcement in January.

By limiting states’ authority over AI regulation, the provision could prevent state governments from using federal funds to develop AI oversight programs or support initiatives that diverge from the administration’s deregulatory stance. This restriction would extend beyond enforcement to potentially affect how states design and fund their own AI governance frameworks.

New pope chose his name based on AI’s threats to “human dignity”

“Like any product of human creativity, AI can be directed toward positive or negative ends,” Francis said in January. “When used in ways that respect human dignity and promote the well-being of individuals and communities, it can contribute positively to the human vocation. Yet, as in all areas where humans are called to make decisions, the shadow of evil also looms here. Where human freedom allows for the possibility of choosing what is wrong, the moral evaluation of this technology will need to take into account how it is directed and used.”

History repeats with new technology

While Pope Francis led the call for respecting human dignity in the face of AI, it’s worth looking a little deeper into the historical inspiration for Leo XIV’s name choice.

In the 1891 encyclical Rerum Novarum, the earlier Leo XIII directly confronted the labor upheaval of the Industrial Revolution, which generated unprecedented wealth and productive capacity but came with severe human costs. At the time, factory conditions had created what the pope called “the misery and wretchedness pressing so unjustly on the majority of the working class.” Workers faced 16-hour days, child labor, dangerous machinery, and wages that barely sustained life.

The 1891 encyclical rejected both unchecked capitalism and socialism, instead proposing Catholic social doctrine that defended workers’ rights to form unions, earn living wages, and rest on Sundays. Leo XIII argued that labor possessed inherent dignity and that employers held moral obligations to their workers. The document shaped modern Catholic social teaching and influenced labor movements worldwide, establishing the church as an advocate for workers caught between industrial capital and revolutionary socialism.

Just as mechanization disrupted traditional labor in the 1890s, artificial intelligence now potentially threatens employment patterns and human dignity in ways that Pope Leo XIV believes demand similar moral leadership from the church.

“In our own day,” Leo XIV concluded in his formal address on Saturday, “the Church offers to everyone the treasury of her social teaching in response to another industrial revolution and to developments in the field of artificial intelligence that pose new challenges for the defense of human dignity, justice, and labor.”

New Lego-building AI creates models that actually stand up in real life

The LegoGPT system works in three parts, shown in this diagram. Credit: Pun et al.

The researchers also expanded the system’s abilities by adding texture and color options. For example, using an appearance prompt like “Electric guitar in metallic purple,” LegoGPT can generate a guitar model, with bricks assigned a purple color.

Testing with robots and humans

To prove their designs worked in real life, the researchers had robots assemble the AI-created Lego models. They used a dual-robot arm system with force sensors to pick up and place bricks according to the AI-generated instructions.

Human testers also built some of the designs by hand, showing that the AI creates genuinely buildable models. “Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing Lego designs that align closely with the input text prompts,” the team noted in its paper.

When tested against other AI systems for 3D creation, LegoGPT stands out through its focus on structural integrity. The team tested against several alternatives, including LLaMA-Mesh and other 3D generation models, and found its approach produced the highest percentage of stable structures.
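The researchers’ actual stability analysis is considerably more involved, but the core idea, rejecting designs with floating or unsupported bricks, can be illustrated with a toy check. The sketch below is a simplified stand-in, not the paper’s method: it only verifies that each brick above the baseplate overlaps a brick in the layer directly beneath it, and the brick representation is invented for this example.

```python
# Toy support check, loosely inspired by LegoGPT's goal of physical stability.
# NOT the paper's analysis: it only confirms that every brick above ground
# level overlaps at least one brick in the layer below it.
from typing import NamedTuple

class Brick(NamedTuple):
    x: int      # stud position of the lower-left corner
    y: int
    z: int      # layer height (0 = baseplate)
    w: int      # footprint width in studs
    d: int      # footprint depth in studs

def footprint(b: Brick) -> set[tuple[int, int]]:
    """All (x, y) stud cells covered by the brick."""
    return {(b.x + i, b.y + j) for i in range(b.w) for j in range(b.d)}

def is_supported(bricks: list[Brick]) -> bool:
    for b in bricks:
        if b.z == 0:
            continue  # resting directly on the baseplate
        below = {cell for other in bricks if other.z == b.z - 1
                 for cell in footprint(other)}
        if not footprint(b) & below:
            return False  # floating brick with nothing underneath
    return True

# A 2x4 brick on the baseplate supporting a 2x2 brick one layer up: stable.
print(is_supported([Brick(0, 0, 0, 4, 2), Brick(1, 0, 1, 2, 2)]))  # True
```

A real stability check also has to account for forces, connection strength, and cantilevered sub-assemblies, which is why a naive filter like this would still pass some structures that collapse in practice.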

A video of two robot arms building a LegoGPT creation, provided by the researchers.

Still, there are some limitations. The current version of LegoGPT only works within a 20×20×20 building space and uses a mere eight standard brick types. “Our method currently supports a fixed set of commonly used Lego bricks,” the team acknowledged. “In future work, we plan to expand the brick library to include a broader range of dimensions and brick types, such as slopes and tiles.”

The researchers also hope to scale up their training dataset to include more objects than the 21 categories currently available. Meanwhile, others can literally build on their work—the researchers released their dataset, code, and models on their project website and GitHub.

AI use damages professional reputation, study suggests

Using AI can be a double-edged sword, according to new research from Duke University. While generative AI tools may boost productivity for some, they might also secretly damage your professional reputation.

On Thursday, the Proceedings of the National Academy of Sciences (PNAS) published a study showing that employees who use AI tools like ChatGPT, Claude, and Gemini at work face negative judgments about their competence and motivation from colleagues and managers.

“Our findings reveal a dilemma for people considering adopting AI tools: Although AI can enhance productivity, its use carries social costs,” write researchers Jessica A. Reif, Richard P. Larrick, and Jack B. Soll of Duke’s Fuqua School of Business.

The Duke team conducted four experiments with over 4,400 participants to examine both anticipated and actual evaluations of AI tool users. Their findings, presented in a paper titled “Evidence of a social evaluation penalty for using AI,” reveal a consistent pattern of bias against those who receive help from AI.

What made this penalty particularly concerning for the researchers was its consistency across demographics. They found that the social stigma against AI use wasn’t limited to specific groups.

Fig. 1. Effect sizes for differences in expected perceptions and disclosure to others (Study 1). Note: Positive d values indicate higher values in the AI Tool condition, while negative d values indicate lower values in the AI Tool condition. N = 497. Error bars represent 95% CI. Correlations among variables range from |r| = 0.53 to 0.88.

Fig. 1 from the paper “Evidence of a social evaluation penalty for using AI.” Credit: Reif et al.

“Testing a broad range of stimuli enabled us to examine whether the target’s age, gender, or occupation qualifies the effect of receiving help from AI on these evaluations,” the authors wrote in the paper. “We found that none of these target demographic attributes influences the effect of receiving AI help on perceptions of laziness, diligence, competence, independence, or self-assuredness. This suggests that the social stigmatization of AI use is not limited to its use among particular demographic groups. The result appears to be a general one.”
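For readers parsing Fig. 1, the d values are standardized mean differences (Cohen’s d, by the usual convention): the gap between the two conditions’ means divided by their pooled standard deviation. A minimal illustration with invented ratings:

```python
# Cohen's d for two independent groups, the standardized effect size behind
# the d values reported in the study. The ratings below are invented.
import statistics

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    na, nb = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)   # sample variances
    var_b = statistics.variance(group_b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical laziness ratings: AI-tool condition vs. dashboard-tool condition.
print(round(cohens_d([4.1, 3.8, 4.5, 4.0], [3.2, 3.0, 3.6, 3.4]), 2))  # 2.89
```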

The hidden social cost of AI adoption

In the first experiment conducted by the team from Duke, participants imagined using either an AI tool or a dashboard creation tool at work. It revealed that those in the AI group expected to be judged as lazier, less competent, less diligent, and more replaceable than those using conventional technology. They also reported less willingness to disclose their AI use to colleagues and managers.

The second experiment confirmed these fears were justified. When evaluating descriptions of employees, participants consistently rated those receiving AI help as lazier, less competent, less diligent, less independent, and less self-assured than those receiving similar help from non-AI sources or no help at all.

Fidji Simo joins OpenAI as new CEO of Applications

In the message, Altman described Simo as bringing “a rare blend of leadership, product and operational expertise” and expressed that her addition to the team makes him “even more optimistic about our future as we continue advancing toward becoming the superintelligence company.”

Simo becomes the newest high-profile female executive at OpenAI following the departure of Chief Technology Officer Mira Murati in September. Murati, who had been with the company since 2018 and helped launch ChatGPT, left alongside two other senior leaders and founded Thinking Machines Lab in February.

OpenAI’s evolving structure

The leadership addition comes as OpenAI continues to evolve beyond its origins as a research lab. In his announcement, Altman described how the company now operates in three distinct areas: as a research lab focused on artificial general intelligence (AGI), as a “global product company serving hundreds of millions of users,” and as an “infrastructure company” building systems that advance research and deliver AI tools “at unprecedented scale.”

Altman mentioned that as CEO of OpenAI, he will “continue to directly oversee success across all pillars,” including Research, Compute, and Applications, while staying “closely involved with key company decisions.”

The announcement follows recent news that OpenAI abandoned its original plan to cede control of its nonprofit branch to a for-profit entity. The company began as a nonprofit research lab in 2015 before creating a for-profit subsidiary in 2019, maintaining its original mission “to ensure artificial general intelligence benefits everyone.”
