In the copyright fight, Magistrate Judge Ona Wang granted the order within one day of the NYT’s request. She agreed with the news plaintiffs that ChatGPT users spooked by the lawsuit might set their chats to delete when using the chatbot to skirt NYT paywalls. Because OpenAI wasn’t sharing deleted chat logs, the news plaintiffs had no way of proving that, she suggested.
Now, OpenAI is not only asking Wang to reconsider but has “also appealed this order with the District Court Judge,” the Thursday statement said.
“We strongly believe this is an overreach by the New York Times,” Lightcap said. “We’re continuing to appeal this order so we can keep putting your trust and privacy first.”
Who can access deleted chats?
To keep users informed, OpenAI provides an FAQ that explains why their data is being retained and how it could be exposed.
For example, the statement noted that the order doesn’t impact OpenAI API business customers under Zero Data Retention agreements because their data is never stored.
And for users whose data is affected, OpenAI noted that their deleted chats could be accessed, but they won’t “automatically” be shared with The New York Times. Instead, the retained data will be “stored separately in a secure system” and “protected under legal hold, meaning it can’t be accessed or used for purposes other than meeting legal obligations,” OpenAI explained.
Of course, with the court battle ongoing, the FAQ did not have all the answers.
Nobody knows how long OpenAI may be required to retain the deleted chats. Likely seeking to reassure users (some of whom appeared to be considering switching to a rival service until the order lifts), OpenAI noted that “only a small, audited OpenAI legal and security team would be able to access this data as necessary to comply with our legal obligations.”
On Thursday, Anthropic CEO Dario Amodei argued against a proposed 10-year moratorium on state AI regulation in a New York Times opinion piece, calling the measure shortsighted and overbroad as Congress considers including it in President Trump’s tax policy bill. Anthropic makes Claude, an AI assistant similar to ChatGPT.
Amodei warned that AI is advancing too fast for such a long freeze, predicting these systems “could change the world, fundamentally, within two years; in 10 years, all bets are off.”
As we covered in May, the moratorium would prevent states from regulating AI for a decade. A bipartisan group of state attorneys general has opposed the measure, which would preempt AI laws and regulations recently passed in dozens of states.
In his op-ed piece, Amodei said the proposed moratorium aims to prevent inconsistent state laws that could burden companies or compromise America’s competitive position against China. “I am sympathetic to these concerns,” Amodei wrote. “But a 10-year moratorium is far too blunt an instrument. A.I. is advancing too head-spinningly fast.”
Instead of a blanket moratorium, Amodei proposed that the White House and Congress create a federal transparency standard requiring frontier AI developers to publicly disclose their testing policies and safety measures. Under this framework, companies working on the most capable AI models would need to publish on their websites how they test for various risks and what steps they take before release.
“Without a clear plan for a federal response, a moratorium would give us the worst of both worlds—no ability for states to act and no national policy as a backstop,” Amodei wrote.
Transparency as the middle ground
Amodei pressed his case for AI’s transformative potential throughout his op-ed, citing examples of pharmaceutical companies drafting clinical study reports in minutes instead of weeks and AI helping to diagnose medical conditions that might otherwise be missed. He wrote that AI “could accelerate economic growth to an extent not seen for a century, improving everyone’s quality of life,” a claim that some skeptics believe may be overhyped.
OpenAI defends privacy of hundreds of millions of ChatGPT users.
OpenAI is now fighting a court order to preserve all ChatGPT user logs—including deleted chats and sensitive chats logged through its API business offering—after news organizations suing over copyright claims accused the AI company of destroying evidence.
“Before OpenAI had an opportunity to respond to those unfounded accusations, the court ordered OpenAI to ‘preserve and segregate all output log data that would otherwise be deleted on a going forward basis until further order of the Court (in essence, the output log data that OpenAI has been destroying),’” OpenAI explained in a court filing demanding oral arguments in a bid to block the controversial order.
In the filing, OpenAI alleged that the court rushed the order based only on a hunch raised by The New York Times and other news plaintiffs. And now, without “any just cause,” OpenAI argued, the order “continues to prevent OpenAI from respecting its users’ privacy decisions.” That risk extended to users of ChatGPT Free, Plus, and Pro, as well as users of OpenAI’s application programming interface (API), OpenAI said.
The court order came after news organizations expressed concern that people using ChatGPT to skirt paywalls “might be more likely to ‘delete all [their] searches’ to cover their tracks,” OpenAI explained. Evidence to support that claim, news plaintiffs argued, was missing from the record because so far, OpenAI had only shared samples of chat logs that users had agreed the company could retain. Sharing the news plaintiffs’ concerns, the judge, Ona Wang, ultimately agreed that OpenAI likely would never stop deleting that alleged evidence absent a court order, granting news plaintiffs’ request to preserve all chats.
OpenAI argued the May 13 order was premature and should be vacated until, “at a minimum,” news organizations can establish a substantial need for OpenAI to preserve all chat logs. The company warned that the privacy of hundreds of millions of ChatGPT users globally is at risk every day that the “sweeping, unprecedented” order continues to be enforced.
“As a result, OpenAI is forced to jettison its commitment to allow users to control when and how their ChatGPT conversation data is used, and whether it is retained,” OpenAI argued.
Meanwhile, there is no evidence beyond speculation yet supporting claims that “OpenAI had intentionally deleted data,” OpenAI alleged. And supposedly there is not “a single piece of evidence supporting” claims that copyright-infringing ChatGPT users are more likely to delete their chats.
“OpenAI did not ‘destroy’ any data, and certainly did not delete any data in response to litigation events,” OpenAI argued. “The Order appears to have incorrectly assumed the contrary.”
At a conference in January, Wang raised a hypothetical in line with her thinking on the subsequent order. She asked OpenAI’s legal team to consider a ChatGPT user who “found some way to get around the pay wall” and “was getting The New York Times content somehow as the output.” If that user “then hears about this case and says, ‘Oh, whoa, you know I’m going to ask them to delete all of my searches and not retain any of my searches going forward,'” the judge asked, wouldn’t that be “directly the problem” that the order would address?
OpenAI does not plan to give up this fight, alleging that news plaintiffs have “fallen silent” on claims of intentional evidence destruction and arguing that the order should be deemed unlawful.
For OpenAI, breaching its own privacy agreements could not only “damage” relationships with users but could also put the company in breach of contracts and global privacy regulations. Further, the order imposes “significant” burdens on OpenAI, supposedly forcing the ChatGPT maker to dedicate months of engineering hours at substantial cost to comply, OpenAI claimed. It follows, then, that OpenAI’s potential for harm “far outweighs News Plaintiffs’ speculative need for such data,” OpenAI argued.
“While OpenAI appreciates the court’s efforts to manage discovery in this complex set of cases, it has no choice but to protect the interests of its users by objecting to the Preservation Order and requesting its immediate vacatur,” OpenAI said.
Users panicked over sweeping order
Millions of people use ChatGPT daily for many purposes, OpenAI noted, “ranging from the mundane to profoundly personal.”
People may choose to delete chat logs that contain their private thoughts, OpenAI said, as well as sensitive information, like financial data from balancing the house budget or intimate details from workshopping wedding vows. And for business users connecting to OpenAI’s API, the stakes may be even higher, as their logs may contain their companies’ most confidential data, including trade secrets and privileged business information.
“Given that array of highly confidential and personal use cases, OpenAI goes to great lengths to protect its users’ data and privacy,” OpenAI argued.
It does this partly by “honoring its privacy policies and contractual commitments to users”—which the preservation order allegedly “jettisoned” in “one fell swoop.”
Before the order was in place mid-May, OpenAI only retained “chat history” for users of ChatGPT Free, Plus, and Pro who did not opt out of data retention. But now, OpenAI has been forced to preserve chat history even when users “elect to not retain particular conversations by manually deleting specific conversations or by starting a ‘Temporary Chat,’ which disappears once closed,” OpenAI said. Previously, users could also request to “delete their OpenAI accounts entirely, including all prior conversation history,” which was then purged within 30 days.
While OpenAI rejects claims that ordinary users use ChatGPT to access news articles, the company noted that including OpenAI’s business customers in the order made “even less sense,” since API conversation data “is subject to standard retention policies.” That means API customers couldn’t delete all their searches based on their end users’ activity, which is the supposed basis for requiring OpenAI to retain sensitive data.
“The court nevertheless required OpenAI to continue preserving API Conversation Data as well,” OpenAI argued, in support of lifting the order on the API chat logs.
Users who found out about the preservation order panicked, OpenAI noted. In court filings, OpenAI cited social media posts sounding alarms on LinkedIn and X (formerly Twitter). The company further argued that the court should have weighed those user concerns before issuing the preservation order, but “that did not happen here.”
One tech worker on LinkedIn suggested the order created “a serious breach of contract for every company that uses OpenAI,” while privacy advocates on X warned, “every single AI service ‘powered by’ OpenAI should be concerned.”
Also on LinkedIn, a consultant rushed to warn clients to be “extra careful” sharing sensitive data “with ChatGPT or through OpenAI’s API for now,” warning, “your outputs could eventually be read by others, even if you opted out of training data sharing or used ‘temporary chat’!”
People on both platforms recommended using alternative tools to avoid privacy concerns, like Mistral AI or Google Gemini, with one cybersecurity professional on LinkedIn describing the ordered chat log retention as “an unacceptable security risk.”
On X, an account with tens of thousands of followers summed up the controversy by suggesting that “Wang apparently thinks the NY Times’ boomer copyright concerns trump the privacy of EVERY @OpenAI USER—insane!!!”
The reason for the alarm is “simple,” OpenAI said. “Users feel more free to use ChatGPT when they know that they are in control of their personal information, including which conversations are retained and which are not.”
It’s unclear if OpenAI will be able to get the judge to waver if oral arguments are scheduled.
Wang previously justified the broad order partly due to the news organizations’ claim that “the volume of deleted conversations is significant.” She suggested that OpenAI could have taken steps to anonymize the chat logs but chose not to, only making an argument for why it “would not” be able to segregate data, rather than explaining why it “can’t.”
Spokespersons for OpenAI and The New York Times’ legal team declined Ars’ request to comment on the ongoing multi-district litigation.
Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.
One of the “godfathers” of artificial intelligence has attacked a multibillion-dollar race to develop the cutting-edge technology, saying the latest models are displaying dangerous characteristics such as lying to users.
Yoshua Bengio, a Canadian academic whose work has informed techniques used by top AI groups such as OpenAI and Google, said: “There’s unfortunately a very competitive race between the leading labs, which pushes them towards focusing on capability to make the AI more and more intelligent, but not necessarily put enough emphasis and investment on research on safety.”
The Turing Award winner issued his warning in an interview with the Financial Times, while launching a new non-profit called LawZero. He said the group would focus on building safer systems, vowing to “insulate our research from those commercial pressures.”
LawZero has so far raised nearly $30 million in philanthropic contributions from donors including Skype founding engineer Jaan Tallinn, former Google chief Eric Schmidt’s philanthropic initiative, as well as Open Philanthropy and the Future of Life Institute.
Many of Bengio’s funders subscribe to the “effective altruism” movement, whose supporters tend to focus on catastrophic risks surrounding AI models. Critics argue the movement highlights hypothetical scenarios while ignoring current harms, such as bias and inaccuracies.
Bengio said his not-for-profit group was founded in response to growing evidence over the past six months that today’s leading models were developing dangerous capabilities. This includes showing “evidence of deception, cheating, lying and self-preservation,” he said.
Anthropic’s Claude Opus model blackmailed engineers in a fictitious scenario where it was at risk of being replaced by another system. Research from AI testers Palisade last month showed that OpenAI’s o3 model refused explicit instructions to shut down.
Bengio said such incidents were “very scary, because we don’t want to create a competitor to human beings on this planet, especially if they’re smarter than us.”
The AI pioneer added: “Right now, these are controlled experiments [but] my concern is that any time in the future, the next version might be strategically intelligent enough to see us coming from far away and defeat us with deceptions that we don’t anticipate. So I think we’re playing with fire right now.”
Anthropic says Claude 4 beats Gemini on coding benchmarks; works autonomously for hours.
The Claude 4 logo, created by Anthropic. Credit: Anthropic
On Thursday, Anthropic released Claude Opus 4 and Claude Sonnet 4, marking the company’s return to larger model releases after primarily focusing on mid-range Sonnet variants since June of last year. The new models represent what the company calls its most capable coding models yet, with Opus 4 designed for complex, long-running tasks that can operate autonomously for hours.
Alex Albert, Anthropic’s head of Claude Relations, told Ars Technica that the company chose to revive the Opus line because of growing demand for agentic AI applications. “Across all the companies out there that are building things, there’s a really large wave of these agentic applications springing up, and a very high demand and premium being placed on intelligence,” Albert said. “I think Opus is going to fit that groove perfectly.”
Before we go further, a brief refresher on Claude’s three AI model “size” names (first introduced in March 2024) is probably warranted. Haiku, Sonnet, and Opus offer a tradeoff between price (in the API), speed, and capability.
Haiku models are the smallest, least expensive to run, and least capable in terms of what you might call “context depth” (considering conceptual relationships in the prompt) and encoded knowledge. Owing to their smaller parameter count, Haiku models retain fewer concrete facts and thus tend to confabulate more frequently (inventing plausible-sounding answers where data is missing), but they are much faster at basic tasks than larger models. Sonnet is traditionally a mid-range model that strikes a balance between cost and capability, while Opus models have always been the largest and slowest to run. However, Opus models process context more deeply and are hypothetically better suited for deep logical tasks.
A screenshot of the Claude web interface with Opus 4 and Sonnet 4 options shown. Credit: Anthropic
There is no Claude 4 Haiku just yet, but the new Sonnet and Opus models can reportedly handle tasks that previous versions could not. In our interview with Albert, he described testing scenarios where Opus 4 worked coherently for up to 24 hours on tasks like playing Pokémon, while code refactoring tasks in Claude Code ran for seven hours without interruption. Earlier Claude models typically lasted only one to two hours before losing coherence, Albert said, meaning they could only produce useful self-referencing outputs for that long before beginning to generate too many errors.
In particular, that marathon refactoring claim reportedly comes from Rakuten, a Japanese tech services conglomerate that “validated [Claude’s] capabilities with a demanding open-source refactor running independently for 7 hours with sustained performance,” Anthropic said in a news release.
Whether you’d want to leave an AI model unsupervised for that long is another question entirely, because even the most capable AI models can introduce subtle bugs, go down unproductive rabbit holes, or make choices that seem logical to the model but miss important context that a human developer would catch. While many people now use Claude for easygoing vibe coding, as we covered in March, the human-powered (and ironically named) “vibe debugging” that often results from long AI coding sessions is also a very real thing. More on that below.
To shore up some of those shortcomings, Anthropic built memory capabilities into both new Claude 4 models, allowing them to maintain external files for storing key information across long sessions. When developers provide access to local files, the models can create and update “memory files” to track progress and things they deem important over time. Albert compared this to how humans take notes during extended work sessions.
Extended thinking meets tool use
Both Claude 4 models introduce what Anthropic calls “extended thinking with tool use,” a new beta feature allowing the models to alternate between simulated reasoning and using external tools like web search, similar to what OpenAI’s o3 and o4-mini-high AI models currently do in ChatGPT. While Claude 3.7 Sonnet already had strong tool use capabilities, the new models can now interleave simulated reasoning and tool calling in a single response.
“So now we can actually think, call a tool, process the results, think some more, call another tool, and repeat until it gets to a final answer,” Albert explained to Ars. The models self-determine when they have reached a useful conclusion, a capability picked up through training rather than governed by explicit human programming.
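For developers curious what that loop looks like in practice, here is a minimal sketch using the Anthropic Python SDK; the model identifier, token budgets, and the tool definition are illustrative assumptions rather than values from Anthropic’s announcement.

```python
# A minimal sketch of extended thinking combined with tool use via the
# Anthropic Python SDK. Model name, budgets, and the tool are assumptions
# for illustration; consult Anthropic's docs for current beta details.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "lookup_stock_price",  # hypothetical caller-defined tool
    "description": "Return the latest closing price for a ticker symbol.",
    "input_schema": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}]

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model identifier
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},  # simulated reasoning budget
    tools=tools,
    messages=[{
        "role": "user",
        "content": "How did ACME close today, and is that above last week's close?",
    }],
)

# The response interleaves "thinking", "text", and "tool_use" blocks; the caller
# executes each requested tool and sends the result back in a follow-up turn.
for block in response.content:
    print(block.type)
```

In a real agent loop, the caller would keep appending tool results until the model stops requesting tools, which matches the think, call, think-again cycle Albert describes.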
General Claude 4 benchmark results, provided by Anthropic. Credit: Anthropic
In practice, we’ve anecdotally found parallel tool use capability very useful in AI assistants like OpenAI o3, since they don’t have to rely on what is trained in their neural network to provide accurate answers. Instead, these more agentic models can iteratively search the web, parse the results, analyze images, and spin up coding tasks for analysis in ways that can avoid falling into a confabulation trap by relying solely on pure LLM outputs.
“The world’s best coding model”
Anthropic says Opus 4 leads industry benchmarks for coding tasks, achieving 72.5 percent on SWE-bench and 43.2 percent on Terminal-bench, calling it “the world’s best coding model.” According to Anthropic, companies using early versions report improvements. Cursor described it as “state-of-the-art for coding and a leap forward in complex codebase understanding,” while Replit noted “improved precision and dramatic advancements for complex changes across multiple files.”
In fact, GitHub announced it will use Sonnet 4 as the base model for its new coding agent in GitHub Copilot, citing the model’s performance in “agentic scenarios” in Anthropic’s news release. Sonnet 4 scored 72.7 percent on SWE-bench while maintaining faster response times than Opus 4. The fact that GitHub is betting on Claude rather than a model from its parent company Microsoft (which has close ties to OpenAI) suggests Anthropic has built something genuinely competitive.
Software engineering benchmark results, provided by Anthropic. Credit: Anthropic
Anthropic says it has addressed a persistent issue with Claude 3.7 Sonnet in which users complained that the model would take unauthorized actions or provide excessive output. Albert said the company reduced this “reward hacking behavior” by approximately 80 percent in the new models through training adjustments. An 80 percent reduction in unwanted behavior sounds impressive, but that also suggests that 20 percent of the problem behavior remains—a big concern when we’re talking about AI models that might be performing autonomous tasks for hours.
When we asked about code accuracy, Albert said that human code review is still an important part of shipping any production code. “There’s a human parallel, right? So this is just a problem we’ve had to deal with throughout the whole nature of software engineering. And this is why the code review process exists, so that you can catch these things. We don’t anticipate that going away with models either,” Albert said. “If anything, the human review will become more important, and more of your job as developer will be in this review than it will be in the generation part.”
Pricing and availability
Both Claude 4 models maintain the same pricing structure as their predecessors: Opus 4 costs $15 per million tokens for input and $75 per million for output, while Sonnet 4 remains at $3 and $15. The models offer two response modes: traditional LLM and simulated reasoning (“extended thinking”) for complex problems. Given that some Claude Code sessions can apparently run for hours, those per-token costs will likely add up very quickly for users who let the models run wild.
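As a rough illustration of how those per-token prices compound over a long session, here is a back-of-the-envelope sketch; the token counts are assumptions chosen for the example, not measured usage.

```python
# Back-of-the-envelope cost sketch using the published Claude Opus 4 API prices
# ($15 per million input tokens, $75 per million output tokens). The token
# counts below are illustrative assumptions, not measurements of a real session.
OPUS_INPUT_PER_M = 15.00   # USD per million input tokens
OPUS_OUTPUT_PER_M = 75.00  # USD per million output tokens

input_tokens = 2_000_000   # assumed context/prompt tokens over a multi-hour run
output_tokens = 500_000    # assumed generated tokens over the same run

cost = (input_tokens / 1e6) * OPUS_INPUT_PER_M + (output_tokens / 1e6) * OPUS_OUTPUT_PER_M
print(f"Estimated session cost: ${cost:.2f}")  # $67.50 under these assumptions
```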
Anthropic made both models available through its API, Amazon Bedrock, and Google Cloud Vertex AI. Sonnet 4 remains accessible to free users, while Opus 4 requires a paid subscription.
The Claude 4 models also debut Claude Code (first introduced in February) as a generally available product after months of preview testing. Anthropic says the coding environment now integrates with VS Code and JetBrains IDEs, showing proposed edits directly in files. A new SDK allows developers to build custom agents using the same framework.
A screenshot of “Claude Plays Pokemon,” a custom application where Claude 4 attempts to beat the classic Game Boy game. Credit: Anthropic
Even with Anthropic’s future riding on the capability of these new models, when we asked about how they guide Claude’s behavior by fine-tuning, Albert acknowledged that the inherent unpredictability of these systems presents ongoing challenges for both them and developers. “In the realm and the world of software for the past 40, 50 years, we’ve been running on deterministic systems, and now all of a sudden, it’s non-deterministic, and that changes how we build,” he said.
“I empathize with a lot of people out there trying to use our APIs and language models generally because they have to almost shift their perspective on what it means for reliability, what it means for powering a core of your application in a non-deterministic way,” Albert added. “These are general oddities that have kind of just been flipped, and it definitely makes things more difficult, but I think it opens up a lot of possibilities as well.”
Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
We’ve been expecting it for a while, and now it’s here: OpenAI has introduced an agentic coding tool called Codex in research preview. The tool is meant to allow experienced developers to delegate rote and relatively simple programming tasks to an AI agent that will generate production-ready code and show its work along the way.
Codex is a unique interface (not to be confused with the Codex CLI tool introduced by OpenAI last month) that can be reached from the side bar in the ChatGPT web app. Users enter a prompt and then click either “code” to have it begin producing code, or “ask” to have it answer questions and advise.
Whenever it’s given a task, that task is performed in a distinct container that is preloaded with the user’s codebase and is meant to accurately reflect their development environment.
To make Codex more effective, developers can include an “AGENTS.md” file in the repo with custom instructions, for example to contextualize and explain the code base or to communicate standardizations and style practices for the project—kind of a README.md but for AI agents rather than humans.
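As a hypothetical illustration (the project layout and conventions below are invented, not taken from OpenAI’s documentation), an AGENTS.md might look something like this:

```markdown
# AGENTS.md (hypothetical example): guidance for AI agents working in this repository

## Project layout
- `api/`: Python backend; business logic lives in `api/services/`.
- `web/`: React frontend; do not edit generated files under `web/dist/`.

## Conventions
- Python code is formatted with Black, and type hints are required.
- Run `pytest api/tests` before proposing changes and include the results.

## Style
- Keep changes small and focused, with descriptive commit messages.
```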
Codex is built on codex-1, a fine-tuned variation of OpenAI’s o3 reasoning model that was trained using reinforcement learning on a wide range of coding tasks to analyze and generate code, and to iterate through tests along the way.
The release comes just two weeks after OpenAI made GPT-4 unavailable in ChatGPT on April 30. That earlier model, which launched in March 2023, once sparked widespread hype about AI capabilities. Compared to that hyperbolic launch, GPT-4.1’s rollout has been a fairly understated affair—probably because it’s tricky to convey the subtle differences between all of the available OpenAI models.
As if 4.1’s launch wasn’t confusing enough, the release also roughly coincides with OpenAI’s July 2025 deadline for retiring the GPT-4.5 Preview from the API, a model one AI expert called a “lemon.” Developers must migrate to other options, OpenAI says, although GPT-4.5 will remain available in ChatGPT for now.
A confusing addition to OpenAI’s model lineup
In February, OpenAI CEO Sam Altman acknowledged on X his company’s confusing AI model naming practices, writing, “We realize how complicated our model and product offerings have gotten.” He promised that a forthcoming “GPT-5” model would consolidate the o-series and GPT-series models into a unified branding structure. But the addition of GPT-4.1 to ChatGPT appears to contradict that simplification goal.
So, if you use ChatGPT, which model should you use? If you’re a developer using the models through the API, the consideration is more of a trade-off between capability, speed, and cost. But in ChatGPT, your choice might be limited more by personal taste in behavioral style and what you’d like to accomplish. Some of the “more capable” models have lower usage limits as well because they cost more for OpenAI to run.
For now, OpenAI is keeping GPT-4o as the default ChatGPT model, likely due to its general versatility, balance between speed and capability, and personable style (conditioned using reinforcement learning and a specialized system prompt). The simulated reasoning models like o3 and o4-mini-high are slower to execute but can consider analytical-style problems more systematically and perform comprehensive web research that sometimes feels genuinely useful when it surfaces relevant (non-confabulated) web links. Compared to those, OpenAI is largely positioning GPT-4.1 as a speedier AI model for coding assistance.
Just remember that all of the AI models are prone to confabulations, meaning that they tend to make up authoritative-sounding information when they encounter gaps in their trained “knowledge.” So you’ll need to double-check all of the outputs with other sources of information if you’re hoping to use these AI models to assist with an important task.
(Ars contacted Fellow Products for comment on AI brewing and profile sharing and will update this post if we get a response.)
Opening up brew profiles
Fellow’s brew profiles are typically shared with buyers of its “Drops” coffees or between individual users through a phone app. Credit: Fellow Products
Aiden profiles are shared and added to Aiden units through Fellow’s brew.link service. But the profiles are not offered in an easy-to-sort database, nor are they easy to scan for details. So Aiden enthusiast and hobbyist coder Kevin Anderson created brewshare.coffee, which gathers both general and bean-based profiles, makes them easy to search and load, and adds optional but quite helpful suggested grind sizes.
As a non-professional developer jumping into a public offering, he had to work hard on data validation, backend security, and mobile-friendly design. “I just had a bit of an idea and a hobby, so I thought I’d try and make it happen,” Anderson writes. With his tool, brew links can be stored and shared more widely, which helped both Dixon and another AI/coffee tinkerer.
Gabriel Levine, director of engineering at retail analytics firm Leap Inc., lost his OXO coffee maker (aka the “Barista Brain”) to malfunction just before the Aiden debuted. The Aiden appealed to Levine as a way to move beyond his coffee rut—a “nice chocolate-y medium roast, about as far as I went,” he told Ars. “This thing that can be hyper-customized to different coffees to bring out their characteristics; [it] really kind of appealed to that nerd side of me,” Levine said.
Levine had also been doing AI stuff for about 10 years, or “since before everyone called it AI—predictive analytics, machine learning.” He described his career as “both kind of chief AI advocate and chief AI skeptic,” alternately driving real findings and talking down “everyone who… just wants to type, ‘how much money should my business make next year’ and call that work.” Like Dixon, Levine’s work and fascination with Aiden ended up intersecting.
The coffee maker with 3,588 ideas
The author’s conversation with the Aiden Profile Creator, which pulled in both brewing knowledge and product info for a widely available coffee.
What it does with that knowledge is something of a mystery to Levine himself. “There’s this kind of blind leap, where it’s grabbing the relevant pieces of information from the knowledge base, biasing toward all the expert advice and extraction science, doing something with it, and then I take that something and coerce it back into a structured output I can put on your Aiden,” Levine said.
It’s a blind leap, but it has landed just right for me so far. I’ve made four profiles with Levine’s prompt based on beans I’ve bought: Stumptown’s Hundred Mile; a light-roasted batch from Jimma, Ethiopia, from Small Planes; Lost Sock’s Western House filter blend; and some dark-roast beans given as a gift. With the Western House, Levine’s profile creator said it aimed to “balance nutty sweetness, chocolate richness, and bright cherry acidity, using a slightly stepped temperature profile and moderate pulse structure.” The resulting profile has worked great, even if the chatbot named it “Cherry Timber.”
Levine’s chatbot relies on two important things: Dixon’s work in revealing Fellow’s Aiden API and his own workhorse Aiden. Every Aiden profile link is created on a machine, so every profile created by Levine’s chat is launched, temporarily, from the Aiden in his kitchen, then deleted. “I’ve hit an undocumented limit on the number of profiles you can have on one machine, so I’ve had to do some triage there,” he said. As of April 22, nearly 3,600 profiles had passed through Levine’s Aiden.
“My hope with this is that it lowers the bar to entry,” Levine said, “so more people get into these specialty roasts and it drives people to support local roasters, explore their world a little more. I feel like that certainly happened to me.”
Something new is brewing
Credit: Fellow Products
Having admitted to myself that I find something generated by ChatGPT prompts genuinely useful, I’ve softened my stance slightly on LLM technology, if not the hype. Used within very specific parameters, with everything second-guessed, I’m getting more comfortable asking chat prompts for formatted summaries on topics with lots of expertise available. I do my own writing, and I don’t waste server energy on things I can, and should, research myself. I even generally resist calling language model prompts “AI,” given the term’s baggage. But I’ve found one way to appreciate its possibilities.
This revelation may not be new to someone already steeped in the models. But having tested—and tasted—my first big experiment while willfully engaging with a brewing bot, I’m a bit more awake.
This post was updated at 8:40 am with a different capture of a GPT-created recipe.
Using AI can be a double-edged sword, according to new research from Duke University. While generative AI tools may boost productivity for some, they might also secretly damage your professional reputation.
On Thursday, the Proceedings of the National Academy of Sciences (PNAS) published a study showing that employees who use AI tools like ChatGPT, Claude, and Gemini at work face negative judgments about their competence and motivation from colleagues and managers.
“Our findings reveal a dilemma for people considering adopting AI tools: Although AI can enhance productivity, its use carries social costs,” write researchers Jessica A. Reif, Richard P. Larrick, and Jack B. Soll of Duke’s Fuqua School of Business.
The Duke team conducted four experiments with over 4,400 participants to examine both anticipated and actual evaluations of AI tool users. Their findings, presented in a paper titled “Evidence of a social evaluation penalty for using AI,” reveal a consistent pattern of bias against those who receive help from AI.
What made this penalty particularly concerning for the researchers was its consistency across demographics. They found that the social stigma against AI use wasn’t limited to specific groups.
Fig. 1 from the paper “Evidence of a social evaluation penalty for using AI.” Credit: Reif et al.
“Testing a broad range of stimuli enabled us to examine whether the target’s age, gender, or occupation qualifies the effect of receiving help from AI on these evaluations,” the authors wrote in the paper. “We found that none of these target demographic attributes influences the effect of receiving AI help on perceptions of laziness, diligence, competence, independence, or self-assuredness. This suggests that the social stigmatization of AI use is not limited to its use among particular demographic groups. The result appears to be a general one.”
The hidden social cost of AI adoption
In the first experiment conducted by the team from Duke, participants imagined using either an AI tool or a dashboard creation tool at work. The experiment revealed that those in the AI group expected to be judged as lazier, less competent, less diligent, and more replaceable than those using conventional technology. They also reported less willingness to disclose their AI use to colleagues and managers.
The second experiment confirmed these fears were justified. When evaluating descriptions of employees, participants consistently rated those receiving AI help as lazier, less competent, less diligent, less independent, and less self-assured than those receiving similar help from non-AI sources or no help at all.
In the message, Altman described Simo as bringing “a rare blend of leadership, product and operational expertise” and expressed that her addition to the team makes him “even more optimistic about our future as we continue advancing toward becoming the superintelligence company.”
Simo becomes the newest high-profile female executive at OpenAI following the departure of Chief Technology Officer Mira Murati in September. Murati, who had been with the company since 2018 and helped launch ChatGPT, left alongside two other senior leaders and founded Thinking Machines Lab in February.
OpenAI’s evolving structure
The leadership addition comes as OpenAI continues to evolve beyond its origins as a research lab. In his announcement, Altman described how the company now operates in three distinct areas: as a research lab focused on artificial general intelligence (AGI), as a “global product company serving hundreds of millions of users,” and as an “infrastructure company” building systems that advance research and deliver AI tools “at unprecedented scale.”
Altman mentioned that as CEO of OpenAI, he will “continue to directly oversee success across all pillars,” including Research, Compute, and Applications, while staying “closely involved with key company decisions.”
The announcement follows recent news that OpenAI abandoned its original plan to cede control of its nonprofit branch to a for-profit entity. The company began as a nonprofit research lab in 2015 before creating a for-profit subsidiary in 2019, maintaining its original mission “to ensure artificial general intelligence benefits everyone.”
Still, the report contained a direct quote attributed to William Higinbotham that appears to combine quotes from two sources not cited in the source list. (One must always be careful with confabulated quotes in AI because even outside of this Research mode, Claude 3.7 Sonnet tends to invent plausible ones to fit a narrative.) We recently covered a study that showed AI search services confabulate sources frequently, and in this case, it appears that the sources Claude Research surfaced, while real, did not always match what is stated in the report.
There’s always room for interpretation and variation in detail, of course, but overall, Claude Research did a relatively good job crafting a report on this particular topic. Still, you’d want to dig more deeply into each source and confirm everything if you used it as the basis for serious research. You can read the full Claude-generated result as this text file, saved in markdown format. Sadly, the markdown version does not include the source URLs found in the Claude web interface.
Integrations feature
Anthropic also announced Thursday that it has broadened Claude’s data access capabilities. In addition to web search and Google Workspace integration, Claude can now search any connected application through the company’s new “Integrations” feature. The feature reminds us somewhat of OpenAI’s ChatGPT Plugins feature from March 2023 that aimed for similar connections, although the two features work differently under the hood.
These Integrations allow Claude to work with remote Model Context Protocol (MCP) servers across web and desktop applications. The MCP standard, which Anthropic introduced last November and we covered in April, connects AI applications to external tools and data sources.
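For a sense of what sits behind one of these Integrations, here is a minimal sketch of an MCP server built with the official MCP Python SDK’s FastMCP helper; the server name and tool are invented for illustration, and a production integration would expose a real application’s data instead.

```python
# A minimal sketch of an MCP server using the official MCP Python SDK
# (package `mcp`). The tool below is a stub invented for illustration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ticket-tracker")  # hypothetical server name shown to clients

@mcp.tool()
def open_ticket_count(project: str) -> int:
    """Return the number of open tickets for a project (stubbed data)."""
    fake_db = {"website": 12, "mobile-app": 4}
    return fake_db.get(project, 0)

if __name__ == "__main__":
    # Runs over stdio by default; a remote Integration would use an HTTP-based transport.
    mcp.run()
```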
At launch, Claude supports Integrations with 10 services, including Atlassian’s Jira and Confluence, Zapier, Cloudflare, Intercom, Asana, Square, Sentry, PayPal, Linear, and Plaid. The company plans to add more partners like Stripe and GitLab in the future.
Each integration aims to expand Claude’s functionality in specific ways. The Zapier integration, for instance, reportedly connects thousands of apps through pre-built automation sequences, allowing Claude to automatically pull sales data from HubSpot or prepare meeting briefs based on calendar entries. With Atlassian’s tools, Anthropic says that Claude can collaborate on product development, manage tasks, and create multiple Confluence pages and Jira work items simultaneously.
Anthropic has made its advanced Research and Integrations features available in beta for users on Max, Team, and Enterprise plans, with Pro plan access coming soon. The company has also expanded its web search feature (introduced in March) to all Claude users on paid plans globally.
One of the most influential—and by some counts, notorious—AI models yet released will soon fade into history. OpenAI announced on April 10 that GPT-4 will be “fully replaced” by GPT-4o in ChatGPT at the end of April, bringing a public-facing end to the model that accelerated a global AI race when it launched in March 2023.
“Effective April 30, 2025, GPT-4 will be retired from ChatGPT and fully replaced by GPT-4o,” OpenAI wrote in its April 10 changelog for ChatGPT. While ChatGPT users will no longer be able to chat with the older AI model, the company added that “GPT-4 will still be available in the API,” providing some reassurance to developers who might still be using the older model for various tasks.
The retirement marks the end of an era that began on March 14, 2023, when GPT-4 demonstrated capabilities that shocked some observers: reportedly scoring at the 90th percentile on the Uniform Bar Exam, acing AP tests, and solving complex reasoning problems that stumped previous models. Its release created a wave of immense hype—and existential panic—about AI’s ability to imitate human communication and composition.
A screenshot of GPT-4’s introduction to ChatGPT Plus customers from March 14, 2023. Credit: Benj Edwards / Ars Technica
While ChatGPT launched in November 2022 with GPT-3.5 under the hood, GPT-4 took AI language models to a new level of sophistication, and it was a massive undertaking to create. It combined data scraped from the vast corpus of human knowledge into a set of neural networks rumored to weigh in at a combined total of 1.76 trillion parameters, which are the numerical values that hold the data within the model.
Along the way, the model reportedly cost more than $100 million to train, according to comments by OpenAI CEO Sam Altman, and required vast computational resources to develop. Training the model may have involved over 20,000 high-end GPUs working in concert—an expense few organizations besides OpenAI and its primary backer, Microsoft, could afford.
Industry reactions, safety concerns, and regulatory responses
Curiously, GPT-4’s impact began before OpenAI’s official announcement. In February 2023, Microsoft integrated its own early version of the GPT-4 model into its Bing search engine, creating a chatbot that sparked controversy when it tried to convince Kevin Roose of The New York Times to leave his wife and when it “lost its mind” in response to an Ars Technica article.