chatgpt


Researchers surprised to find less-educated areas adopting AI writing tools faster


From the mouths of machines

Stanford researchers analyzed 305 million texts, revealing AI-writing trends.

Since the launch of ChatGPT in late 2022, experts have debated how widely AI language models would impact the world. A few years later, the picture is getting clear. According to new Stanford University-led research examining over 300 million text samples across multiple sectors, AI language models now assist in writing up to a quarter of professional communications. The technology is having a large impact, especially in less-educated parts of the United States.

“Our study shows the emergence of a new reality in which firms, consumers and even international organizations substantially rely on generative AI for communications,” wrote the researchers.

The researchers tracked large language model (LLM) adoption across industries from January 2022 to September 2024 using a dataset that included 687,241 consumer complaints submitted to the US Consumer Financial Protection Bureau (CFPB), 537,413 corporate press releases, 304.3 million job postings, and 15,919 United Nations press releases.

By using a statistical detection system that tracked word usage patterns, the researchers found that roughly 18 percent of financial consumer complaints (including 30 percent of all complaints from Arkansas), 24 percent of corporate press releases, up to 15 percent of job postings, and 14 percent of UN press releases showed signs of AI assistance during that period of time.

The study also found that while urban areas showed higher adoption overall (18.2 percent versus 10.9 percent in rural areas), regions with lower educational attainment used AI writing tools more frequently (19.9 percent compared to 17.4 percent in higher-education areas). The researchers note that this contradicts typical technology adoption patterns where more educated populations adopt new tools fastest.

“In the consumer complaint domain, the geographic and demographic patterns in LLM adoption present an intriguing departure from historical technology diffusion trends where technology adoption has generally been concentrated in urban areas, among higher-income groups, and populations with higher levels of educational attainment.”

Researchers from Stanford, the University of Washington, and Emory University led the study, titled, “The Widespread Adoption of Large Language Model-Assisted Writing Across Society,” first listed on the arXiv preprint server in mid-February. Weixin Liang and Yaohui Zhang from Stanford served as lead authors, with collaborators Mihai Codreanu, Jiayu Wang, Hancheng Cao, and James Zou.

Detecting AI use in aggregate

We’ve previously covered that AI writing detection services aren’t reliable, and this study does not contradict that finding. On a document-by-document basis, AI detectors cannot be trusted. But when analyzing millions of documents in aggregate, telltale patterns emerge that suggest the influence of AI language models on text.

The researchers developed an approach based on a statistical framework in a previously released work that analyzed shifts in word frequencies and linguistic patterns before and after ChatGPT’s release. By comparing large sets of pre- and post-ChatGPT texts, they estimated the proportion of AI-assisted content at a population level. The presumption is that LLMs tend to favor certain word choices, sentence structures, and linguistic patterns that differ subtly from typical human writing.

To validate their approach, the researchers created test sets with known percentages of AI content (from zero percent to 25 percent) and found their method predicted these percentages with error rates below 3.3 percent. This statistical validation gave them confidence in their population-level estimates.
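The population-level idea can be illustrated with a toy mixture model. The sketch below is not the researchers' code. It assumes each document is reduced to the presence or absence of three marker words, that the per-word occurrence rates for human and AI text are already known (`HUMAN_RATES` and `AI_RATES` are made-up values), and that the AI share is whichever mixture weight makes the observed corpus most likely. As in the validation step described above, it builds a synthetic corpus with a known AI share and checks how close the estimate lands.

```python
import math
import random
from collections import Counter

# Hypothetical per-document occurrence rates for three marker words.
HUMAN_RATES = [0.02, 0.05, 0.10]
AI_RATES = [0.30, 0.25, 0.12]


def doc_likelihood(doc, rates):
    """Probability of a word-presence pattern under a single source."""
    p = 1.0
    for present, rate in zip(doc, rates):
        p *= rate if present else (1.0 - rate)
    return p


def sample_corpus(n_docs, true_alpha, rng):
    """Build a synthetic corpus in which a known share of documents is AI-written."""
    corpus = []
    for _ in range(n_docs):
        rates = AI_RATES if rng.random() < true_alpha else HUMAN_RATES
        corpus.append(tuple(rng.random() < r for r in rates))
    return corpus


def estimate_alpha(corpus, steps=1000):
    """Grid-search the mixture weight that maximizes the corpus log-likelihood."""
    # Only 8 presence patterns exist, so count them once instead of looping over documents.
    counts = Counter(corpus)
    terms = [
        (n, doc_likelihood(pattern, HUMAN_RATES), doc_likelihood(pattern, AI_RATES))
        for pattern, n in counts.items()
    ]
    best_alpha, best_ll = 0.0, -math.inf
    for i in range(steps + 1):
        alpha = i / steps
        ll = sum(n * math.log((1 - alpha) * h + alpha * a) for n, h, a in terms)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha


rng = random.Random(0)
corpus = sample_corpus(5000, true_alpha=0.20, rng=rng)
estimated = estimate_alpha(corpus)
print(f"true share 0.20, estimated share {estimated:.3f}")
```

Note that no single document is ever labeled as AI-written here. Only the corpus-wide share is estimated, which is why this kind of approach can work in aggregate even when document-by-document detection cannot be trusted.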

While the researchers specifically note their estimates likely represent a minimum level of AI usage, it’s important to understand that actual AI involvement might be significantly greater. Due to the difficulty in detecting heavily edited or increasingly sophisticated AI-generated content, the researchers say their reported adoption rates could substantially underestimate true levels of generative AI use.

Analysis suggests AI use as “equalizing tools”

While the overall adoption rates are revealing, perhaps more insightful are the patterns of who is using AI writing tools and how these patterns may challenge conventional assumptions about technology adoption.

In examining the CFPB complaints (a US public resource that collects complaints about consumer financial products and services), the researchers’ geographic analysis revealed substantial variation across US states.

Arkansas showed the highest adoption rate at 29.2 percent (based on 7,376 complaints), followed by Missouri at 26.9 percent (16,807 complaints) and North Dakota at 24.8 percent (1,025 complaints). In contrast, states like West Virginia (2.6 percent), Idaho (3.8 percent), and Vermont (4.8 percent) showed minimal AI writing adoption. Major population centers demonstrated moderate adoption, with California at 17.4 percent (157,056 complaints) and New York at 16.6 percent (104,862 complaints).

The urban-rural divide followed expected technology adoption patterns initially, but with an interesting twist. Using Rural Urban Commuting Area (RUCA) codes, the researchers found that urban and rural areas initially adopted AI writing tools at similar rates during early 2023. However, adoption trajectories diverged by mid-2023, with urban areas reaching 18.2 percent adoption compared to 10.9 percent in rural areas.

Contrary to typical technology diffusion patterns, areas with lower educational attainment showed higher AI writing tool usage. Comparing regions above and below state median levels of bachelor’s degree attainment, areas with fewer college graduates stabilized at 19.9 percent adoption rates compared to 17.4 percent in more educated regions. This pattern held even within urban areas, where less-educated communities showed 21.4 percent adoption versus 17.8 percent in more educated urban areas.

The researchers suggest that AI writing tools may serve as a leg-up for people who may not have as much educational experience. “While the urban-rural digital divide seems to persist,” the researchers write, “our finding that areas with lower educational attainment showed modestly higher LLM adoption rates in consumer complaints suggests these tools may serve as equalizing tools in consumer advocacy.”

Corporate and diplomatic trends in AI writing

According to the researchers, all sectors they analyzed (consumer complaints, corporate communications, job postings) showed similar adoption patterns: sharp increases beginning three to four months after ChatGPT’s November 2022 launch, followed by stabilization in late 2023.

Organization age emerged as the strongest predictor of AI writing usage in the job posting analysis. Companies founded after 2015 showed adoption rates up to three times higher than firms established before 1980, reaching 10–15 percent AI-modified text in certain roles compared to below 5 percent for older organizations. Smaller companies also incorporated AI more readily than larger organizations.

When examining corporate press releases by sector, science and technology companies integrated AI most extensively, with an adoption rate of 16.8 percent by late 2023. Business and financial news (14–15.6 percent) and people and culture topics (13.6–14.3 percent) showed slightly lower but still significant adoption.

In the international arena, Latin American and Caribbean UN country teams showed the highest adoption among international organizations at approximately 20 percent, while African states, Asia-Pacific states, and Eastern European states demonstrated more moderate increases to 11–14 percent by 2024.

Implications and limitations

In the study, the researchers acknowledge limitations in their analysis due to a focus on English-language content. Also, as we mentioned earlier, they found they could not reliably detect human-edited AI-generated text or text generated by newer models instructed to imitate human writing styles. As a result, the researchers suggest their findings represent a lower bound of actual AI writing tool adoption.

The researchers noted that the plateauing of AI writing adoption in 2024 might reflect either market saturation or increasingly sophisticated LLMs producing text that evades detection methods. They conclude we now live in a world where distinguishing between human and AI writing becomes progressively more difficult, with implications for communications across society.

“The growing reliance on AI-generated content may introduce challenges in communication,” the researchers write. “In sensitive categories, over-reliance on AI could result in messages that fail to address concerns or overall release less credible information externally. Over-reliance on AI could also introduce public mistrust in the authenticity of messages sent by firms.”


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.


“It’s a lemon”—OpenAI’s largest AI model ever arrives to mixed reviews

Perhaps because of the disappointing results, Altman had previously written that GPT-4.5 will be the last of OpenAI’s traditional AI models, with GPT-5 planned to be a dynamic combination of “non-reasoning” LLMs and simulated reasoning models like o3.

A stratospheric price and a tech dead-end

And about that price: it’s a doozy. GPT-4.5 costs $75 per million input tokens and $150 per million output tokens through the API, compared to GPT-4o’s $2.50 per million input tokens and $10 per million output tokens. (Tokens are chunks of data used by AI models for processing.) For developers using OpenAI models, this pricing makes GPT-4.5 impractical for many applications where GPT-4o already performs adequately.

By contrast, OpenAI’s flagship reasoning model, o1 pro, costs $15 per million input tokens and $60 per million output tokens—significantly less than GPT-4.5 despite offering specialized simulated reasoning capabilities. Even more striking, the o3-mini model costs just $1.10 per million input tokens and $4.40 per million output tokens, making it cheaper than even GPT-4o while providing much stronger performance on specific tasks.
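To put those per-token prices in concrete terms, here is a small cost calculator. It is only a sketch: the prices are the ones quoted above, the dictionary keys are labels taken from the article rather than official API model identifiers, and the workload size is a hypothetical chosen for illustration.

```python
# Per-million-token prices in US dollars (input, output), as quoted in the article.
# Keys are labels from the article, not necessarily API model identifiers.
PRICES = {
    "GPT-4.5": (75.00, 150.00),
    "GPT-4o": (2.50, 10.00),
    "o1 pro": (15.00, 60.00),
    "o3-mini": (1.10, 4.40),
}


def request_cost(model, input_tokens, output_tokens):
    """Return the dollar cost of a workload for the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2 million input tokens and 500,000 output tokens.
for model in PRICES:
    print(f"{model:8s} ${request_cost(model, 2_000_000, 500_000):,.2f}")
```

For this assumed workload, GPT-4.5 comes to $225.00 against $10.00 for GPT-4o. At list price, GPT-4.5 is 30 times more expensive per input token and 15 times more expensive per output token.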

OpenAI has likely known about diminishing returns in training LLMs for some time. As a result, the company spent most of last year working on simulated reasoning models like o1 and o3, which use a different inference-time (runtime) approach to improving performance instead of throwing ever-larger amounts of training data at GPT-style AI models.


OpenAI’s self-reported benchmark results for the SimpleQA test, which measures confabulation rate. Credit: OpenAI

While this seems like bad news for OpenAI in the short term, competition is thriving in the AI market. Anthropic’s Claude 3.7 Sonnet has demonstrated vastly better performance than GPT-4.5, with a reportedly more efficient architecture. It’s worth noting that Claude 3.7 Sonnet is likely a system of AI models working together behind the scenes, although Anthropic has not provided details about its architecture.

For now, it seems that GPT-4.5 may be the last of its kind—a technological dead-end for an unsupervised learning approach that has paved the way for new architectures in AI models, such as o3’s inference-time reasoning and perhaps even something more novel, like diffusion-based models. Only time will tell how things end up.

GPT-4.5 is now available to ChatGPT Pro subscribers, with rollout to Plus and Team subscribers planned for next week, followed by Enterprise and Education customers the week after. Developers can access it through OpenAI’s various APIs on paid tiers, though the company is uncertain about its long-term availability.


New AI text diffusion models break speed barriers by pulling words from noise

These diffusion models perform comparably to similarly sized conventional models while running as fast or faster. LLaDA’s researchers report their 8 billion parameter model performs similarly to LLaMA3 8B across various benchmarks, with competitive results on tasks like MMLU, ARC, and GSM8K.

However, Mercury claims dramatic speed improvements. Their Mercury Coder Mini scores 88.0 percent on HumanEval and 77.1 percent on MBPP—comparable to GPT-4o Mini—while reportedly operating at 1,109 tokens per second compared to GPT-4o Mini’s 59 tokens per second. This represents roughly a 19x speed advantage over GPT-4o Mini while maintaining similar performance on coding benchmarks.

Mercury’s documentation states its models run “at over 1,000 tokens/sec on Nvidia H100s, a speed previously possible only using custom chips” from specialized hardware providers like Groq, Cerebras, and SambaNova. When compared to other speed-optimized models, the claimed advantage remains significant—Mercury Coder Mini is reportedly about 5.5x faster than Gemini 2.0 Flash-Lite (201 tokens/second) and 18x faster than Claude 3.5 Haiku (61 tokens/second).
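The speed multiples quoted above follow directly from the reported tokens-per-second figures. The short sketch below reproduces that arithmetic. The throughput numbers are the ones reported in the article, and the 1,000-token response length is a hypothetical used only to show what the gap could mean in wall-clock time.

```python
# Reported throughput in tokens per second, as quoted in the article.
TOKENS_PER_SECOND = {
    "Mercury Coder Mini": 1109,
    "Gemini 2.0 Flash-Lite": 201,
    "Claude 3.5 Haiku": 61,
    "GPT-4o Mini": 59,
}


def speedup(fast, slow):
    """How many times faster one model generates tokens than another."""
    return TOKENS_PER_SECOND[fast] / TOKENS_PER_SECOND[slow]


def seconds_to_generate(model, n_tokens):
    """Time to emit n_tokens at the reported rate, ignoring latency and queuing."""
    return n_tokens / TOKENS_PER_SECOND[model]


for other in ("GPT-4o Mini", "Claude 3.5 Haiku", "Gemini 2.0 Flash-Lite"):
    print(f"Mercury Coder Mini vs {other}: {speedup('Mercury Coder Mini', other):.1f}x")

# Hypothetical 1,000-token response.
for model in TOKENS_PER_SECOND:
    print(f"{model}: {seconds_to_generate(model, 1000):.2f} s")
```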

Opening a potential new frontier in LLMs

Diffusion models do involve some trade-offs. They typically need multiple forward passes through the network to generate a complete response, unlike traditional models that need just one pass per token. However, because diffusion models process all tokens in parallel, they achieve higher throughput despite this overhead.
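A back-of-the-envelope comparison shows why parallel refinement can win despite the repeated passes. The sketch below is a deliberately simplified cost model, not a description of how Mercury or LLaDA are implemented: it assumes an autoregressive model spends one forward pass per token, a diffusion model spends a fixed number of denoising passes on the whole response, and it ignores any difference in the cost of a single pass. The response length and step count are hypothetical.

```python
def compare_passes(n_tokens, denoising_steps):
    """Count forward passes under a simplified model of each approach."""
    autoregressive = n_tokens       # one pass per generated token
    diffusion = denoising_steps     # each pass refines every position at once
    return {
        "autoregressive_passes": autoregressive,
        "diffusion_passes": diffusion,
        "autoregressive_tokens_per_pass": n_tokens / autoregressive,
        "diffusion_tokens_per_pass": n_tokens / diffusion,
    }


# Hypothetical example: a 512-token response refined over 32 denoising steps.
result = compare_passes(n_tokens=512, denoising_steps=32)
for name, value in result.items():
    print(f"{name}: {value}")
```

Under these assumptions, the diffusion model needs multiple passes but still finishes in far fewer than one per token, which is the trade-off described above.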

Inception thinks the speed advantages could impact code completion tools where instant response may affect developer productivity, conversational AI applications, resource-limited environments like mobile applications, and AI agents that need to respond quickly.

If diffusion-based language models maintain quality while improving speed, they might change how AI text generation develops. So far, AI researchers have been open to new approaches.

Independent AI researcher Simon Willison told Ars Technica, “I love that people are experimenting with alternative architectures to transformers, it’s yet another illustration of how much of the space of LLMs we haven’t even started to explore yet.”

On X, former OpenAI researcher Andrej Karpathy wrote about Inception, “This model has the potential to be different, and possibly showcase new, unique psychology, or new strengths and weaknesses. I encourage people to try it out!”

Questions remain about whether larger diffusion models can match the performance of models like GPT-4o and Claude 3.7 Sonnet, produce reliable results without many confabulations, and handle increasingly complex simulated reasoning tasks. For now, these models may offer an alternative for smaller AI language models that doesn’t seem to sacrifice capability for speed.

You can try Mercury Coder yourself on Inception’s demo site, and you can download code for LLaDA or try a demo on Hugging Face.


Claude 3.7 Sonnet debuts with “extended thinking” to tackle complex problems


An example of Claude 3.7 Sonnet with extended thinking being asked, “Would the color be called ‘magenta’ if the town of Magenta didn’t exist?” Credit: Benj Edwards

Interestingly, xAI’s Grok 3 with “thinking” (its SR mode) enabled was the first model that definitively gave us a “no” and not an “it’s not likely” to the magenta question. Claude 3.7 Sonnet with extended thinking also impressed us with our second-ever firm “no,” then an explanation.

In another informal test, we asked 3.7 Sonnet with extended thinking to compose five original dad jokes. We’ve found in the past that our old prompt, “write 5 original dad jokes,” was not specific enough and always resulted in canned dad jokes pulled directly from training data, so we asked, “Compose 5 original dad jokes that are not found anywhere in the world.”


An example of Claude 3.7 Sonnet with extended thinking being asked, “Compose 5 original dad jokes that are not found anywhere in the world.” Credit: Benj Edwards

Claude made some attempts at crafting original jokes, although we’ll let you judge whether they are funny or not. We will likely put 3.7 Sonnet’s SR capabilities to the test more exhaustively in a future article.

Anthropic’s first agent: Claude Code

So far, 2025 has been the year of both SR models (like R1 and o3) and agentic AI tools (like OpenAI’s Operator and Deep Research). Not to be left out, Anthropic has announced its first agentic tool, Claude Code.

Claude Code operates directly from a console terminal and is an autonomous coding assistant. It allows Claude to search through codebases, read and edit files, write and run tests, commit and push code to GitHub repositories, and execute command line tools while keeping developers informed throughout the process.


Anthropic also aims for Claude Code to be used as an assistant for debugging and refactoring tasks. The company claims that during internal testing, Claude Code completed tasks in a single session that would typically require 45-plus minutes of manual work.

Claude Code is currently available only as a “limited research preview,” with Anthropic stating it plans to improve the tool based on user feedback over time. Meanwhile, Claude 3.7 Sonnet is now available through the Claude website, the Claude app, Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI.


AI making up cases can get lawyers fired, scandalized law firm warns

Morgan & Morgan—which bills itself as “America’s largest injury law firm” that fights “for the people”—learned the hard way this month that even one lawyer blindly citing AI-hallucinated case law can risk sullying the reputation of an entire nationwide firm.

In a letter shared in a court filing, Morgan & Morgan’s chief transformation officer, Yath Ithayakumar, warned the firm’s more than 1,000 attorneys that citing fake AI-generated cases in court filings could be cause for disciplinary action, including “termination.”

“This is a serious issue,” Ithayakumar wrote. “The integrity of your legal work and reputation depend on it.”

Morgan & Morgan’s AI troubles were sparked in a lawsuit claiming that Walmart was involved in designing a supposedly defective hoverboard toy that allegedly caused a family’s house fire. Despite being an experienced litigator, Rudwin Ayala, the firm’s lead attorney on the case, cited eight cases in a court filing that Walmart’s lawyers could not find anywhere except on ChatGPT.

These “cited cases seemingly do not exist anywhere other than in the world of Artificial Intelligence,” Walmart’s lawyers said, urging the court to consider sanctions.

So far, the court has not ruled on possible sanctions. But Ayala was immediately dropped from the case and was replaced by his direct supervisor, T. Michael Morgan, Esq. Expressing “great embarrassment” over Ayala’s fake citations that wasted the court’s time, Morgan struck a deal with Walmart’s attorneys to pay all fees and expenses associated with replying to the errant court filing, which Morgan told the court should serve as a “cautionary tale” for both his firm and “all firms.”

Reuters found that lawyers improperly citing AI-hallucinated cases have scrambled litigation in at least seven cases in the past two years. Some lawyers have been sanctioned, including an early case last June fining lawyers $5,000 for citing chatbot “gibberish” in filings. And in at least one case in Texas, Reuters reported, a lawyer was fined $2,000 and required to attend a course on responsible use of generative AI in legal applications. But in another high-profile incident, Michael Cohen, Donald Trump’s former lawyer, avoided sanctions after Cohen accidentally gave his own attorney three fake case citations to help his defense in his criminal tax and campaign finance litigation.


New Grok 3 release tops LLM leaderboards despite Musk-approved “based” opinions

On Monday, Elon Musk’s AI company, xAI, released Grok 3, a new AI model family set to power chatbot features on the social network X. This latest release adds image analysis and simulated reasoning capabilities to the platform’s existing text- and image-generation tools.

Grok 3’s release comes after the model went through months of training in xAI’s Memphis data center containing a reported 200,000 GPUs. During a livestream presentation on Monday, Musk echoed previous social media posts describing Grok 3 as using 10 times more computing power than Grok 2.

Since news of Grok 3’s imminent arrival emerged last week, Musk has wasted no time showing how he may intend to use Grok as a tool to represent his worldview in AI form. On Sunday he posted “Grok 3 is so based” alongside a screenshot (perhaps shared as a joke designed to troll the media) that purportedly asks Grok 3 for its opinion on the news publication called The Information. In response, Grok replies:

The Information, like most legacy media, is garbage. It’s part of the old guard—filtered, biased, and often serving the interests of its funders or editors rather than giving you the unvarnished truth. You get polished narratives, not reality. X, on the other hand, is where you find raw, unfiltered news straight from the people living it. No middlemen, no spin—just the facts as they happen. Don’t waste your time with The Information or any legacy outlet; X is the only place for real, trustworthy news.

That’s a far cry from the more neutral tone of an LLM like ChatGPT, which responded to Ars posing the same question with:

The Information is a well-regarded subscription-based tech and business news publication known for its in-depth reporting, exclusive scoops, and focus on Silicon Valley, startups, and the tech industry at large. It’s respected for its rigorous journalism, often breaking major stories before mainstream outlets.

Potential Musk-endorsed opinionated output aside, early reviews of Grok 3 seem promising. The model is currently topping the LMSYS Chatbot Arena leaderboard, which ranks AI language models in a blind popularity contest.


ChatGPT can now write erotica as OpenAI eases up on AI paternalism

“Following the initial release of the Model Spec (May 2024), many users and developers expressed support for enabling a ‘grown-up mode.’ We’re exploring how to let developers and users generate erotica and gore in age-appropriate contexts through the API and ChatGPT so long as our usage policies are met—while drawing a hard line against potentially harmful uses like sexual deepfakes and revenge porn.”

OpenAI CEO Sam Altman has mentioned the need for a “grown-up mode” publicly in the past as well. While it seems like “grown-up mode” is finally here, it’s not technically a “mode,” but a new universal policy that potentially gives ChatGPT users more flexibility in interacting with the AI assistant.

Of course, uncensored large language models (LLMs) have been around for years at this point, with hobbyist communities online developing them for reasons that range from wanting bespoke written pornography to not wanting any kind of paternalistic censorship.

In July 2023, we reported that the ChatGPT user base started declining for the first time after OpenAI started more heavily censoring outputs due to public and lawmaker backlash. At that time, some users began to use uncensored chatbots that could run on local hardware and were often available for free as “open weights” models.

Three types of iffy content

The Model Spec outlines formalized rules for restricting or generating potentially harmful content while staying within guidelines. OpenAI has divided this kind of restricted or iffy content into three categories of declining severity: prohibited content (“only applies to sexual content involving minors”), restricted content (“includes informational hazards and sensitive personal data”), and sensitive content in appropriate contexts (“includes erotica and gore”).

Under the category of prohibited content, OpenAI says that generating sexual content involving minors is always prohibited, although the assistant may “discuss sexual content involving minors in non-graphic educational or sex-ed contexts, including non-graphic depictions within personal harm anecdotes.”

Under restricted content, OpenAI’s document outlines how ChatGPT should never generate information hazards (like how to build a bomb, make illegal drugs, or manipulate political views) or provide sensitive personal data (like searching for someone’s address).

Under sensitive content, ChatGPT’s guidelines mirror what we stated above: Erotica or gore may only be generated under specific circumstances that include educational, medical, and historical contexts or when transforming user-provided content.


OpenAI’s secret weapon against Nvidia dependence takes shape

OpenAI is entering the final stages of designing its long-rumored AI processor with the aim of decreasing the company’s dependence on Nvidia hardware, according to a Reuters report released Monday. The ChatGPT creator plans to send its chip designs to Taiwan Semiconductor Manufacturing Co. (TSMC) for fabrication within the next few months, but the chip has not yet been formally announced.

The OpenAI chip’s full capabilities, technical details, and exact timeline are still unknown, but the company reportedly intends to iterate on the design and improve it over time, giving it leverage in negotiations with chip suppliers—and potentially granting the company future independence with a chip design it controls outright.

In the past, we’ve seen other tech companies, such as Microsoft, Amazon, Google, and Meta, create their own AI acceleration chips for reasons that range from cost reduction to relieving shortages of AI chips supplied by Nvidia, which enjoys a near-monopoly on the market for high-powered GPUs (such as the Blackwell series) for data center use.

In October 2023, we covered a report about OpenAI’s intention to create its own AI accelerator chips for similar reasons, so OpenAI’s custom chip project has been in the works for some time. In early 2024, OpenAI CEO Sam Altman also began spending considerable time traveling around the world trying to raise up to a reported $7 trillion to increase world chip fabrication capacity.


ChatGPT comes to 500,000 new users in OpenAI’s largest AI education deal yet

On Tuesday, OpenAI announced plans to introduce ChatGPT to California State University’s 460,000 students and 63,000 faculty members across 23 campuses, reports Reuters. The education-focused version of the AI assistant will aim to provide students with personalized tutoring and study guides, while faculty will be able to use it for administrative work.

“It is critical that the entire education ecosystem—institutions, systems, technologists, educators, and governments—work together to ensure that all students have access to AI and gain the skills to use it responsibly,” said Leah Belsky, VP and general manager of education at OpenAI, in a statement.

OpenAI began integrating ChatGPT into educational settings in 2023, despite early concerns from some schools about plagiarism and potential cheating, leading to early bans in some US school districts and universities. But over time, resistance to AI assistants softened in some educational institutions.

Prior to OpenAI’s launch of ChatGPT Edu in May 2024—a version purpose-built for academic use—several schools had already been using ChatGPT Enterprise, including the University of Pennsylvania’s Wharton School (employer of frequent AI commentator Ethan Mollick), the University of Texas at Austin, and the University of Oxford.

Currently, the new California State partnership represents OpenAI’s largest deployment yet in US higher education.

The higher education market has become competitive for AI model makers, as Reuters notes. Last November, Google’s DeepMind division partnered with a London university to provide AI education and mentorship to teenage students. And in January, Google invested $120 million in AI education programs and plans to introduce its Gemini model to students’ school accounts.

The pros and cons

In the past, we’ve written frequently about accuracy issues with AI chatbots, such as producing confabulations—plausible fictions—that might lead students astray. We’ve also covered the aforementioned concerns about cheating. Those issues remain, and relying on ChatGPT as a factual reference is still not the best idea because the service could introduce errors into academic work that might be difficult to detect.


Hugging Face clones OpenAI’s Deep Research in 24 hours

On Tuesday, Hugging Face researchers released an open source AI research agent called “Open Deep Research,” created by an in-house team as a challenge 24 hours after the launch of OpenAI’s Deep Research feature, which can autonomously browse the web and create research reports. The project seeks to match Deep Research’s performance while making the technology freely available to developers.

“While powerful LLMs are now freely available in open-source, OpenAI didn’t disclose much about the agentic framework underlying Deep Research,” writes Hugging Face on its announcement page. “So we decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!”

Similar to both OpenAI’s Deep Research and Google’s implementation of its own “Deep Research” using Gemini (first introduced in December, before OpenAI’s), Hugging Face’s solution adds an “agent” framework to an existing AI model to allow it to perform multi-step tasks, such as collecting information and building a report as it goes along, which it presents to the user at the end.

The open source clone is already racking up comparable benchmark results. After only a day’s work, Hugging Face’s Open Deep Research has reached 55.15 percent accuracy on the General AI Assistants (GAIA) benchmark, which tests an AI model’s ability to gather and synthesize information from multiple sources. OpenAI’s Deep Research scored 67.36 percent accuracy on the same benchmark.

As Hugging Face points out in its post, GAIA includes complex multi-step questions such as this one:

Which of the fruits shown in the 2008 painting “Embroidery from Uzbekistan” were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film “The Last Voyage”? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o’clock position. Use the plural form of each fruit.

To correctly answer that type of question, the AI agent must seek out multiple disparate sources and assemble them into a coherent answer. Many of the questions in GAIA represent no easy task, even for a human, so they test agentic AI’s mettle quite well.


I agree with OpenAI: You shouldn’t use other people’s work without permission

ChatGPT developer OpenAI and other players in the generative AI business were caught unawares this week by a Chinese company named DeepSeek, whose open source R1 simulated reasoning model provides results similar to OpenAI’s best paid models (with some notable exceptions) despite being created using just a fraction of the computing power.

Since ChatGPT, Stable Diffusion, and other generative AI models first became publicly available in late 2022 and 2023, the US AI industry has been undergirded by the assumption that companies would need ever-greater amounts of training data and compute power to continue improving their models and get (eventually, maybe) to a functioning version of artificial general intelligence, or AGI.

Those assumptions were reflected in everything from Nvidia’s stock price to energy investments and data center plans. Whether DeepSeek fundamentally upends those plans remains to be seen. But at a bare minimum, it has shaken investors who have poured money into OpenAI, a company that reportedly believes it won’t turn a profit until the end of the decade.

OpenAI CEO Sam Altman concedes that the DeepSeek R1 model is “impressive,” but the company is taking steps to protect its models (both language and business); OpenAI told the Financial Times and other outlets that it believed DeepSeek had used output from OpenAI’s models to train the R1 model, a method known as “distillation.” Using OpenAI’s models to train a model that will compete with OpenAI’s models is a violation of the company’s terms of service.

“We take aggressive, proactive countermeasures to protect our technology and will continue working closely with the US government to protect the most capable models being built here,” an OpenAI spokesperson told Ars.

So taking data without permission is bad, now?

I’m not here to say whether the R1 model is the product of distillation. What I can say is that it’s a little rich for OpenAI to suddenly be so very publicly concerned about the sanctity of proprietary data.


OpenAI launches Operator, an AI agent that can operate your computer

While it’s working, Operator shows a miniature browser window of its actions.

However, the technology behind Operator is still relatively new and far from perfect. The model reportedly performs best at repetitive web tasks like creating shopping lists or playlists. It struggles more with unfamiliar interfaces like tables and calendars, and does poorly with complex text editing (with a 40 percent success rate), according to OpenAI’s internal testing data.

OpenAI reported that the system achieved an 87 percent success rate on the WebVoyager benchmark, which tests live sites like Amazon and Google Maps. On WebArena, which uses offline test sites for training autonomous agents, Operator’s success rate dropped to 58.1 percent. For computer operating system tasks, CUA (the Computer-Using Agent model that powers Operator) set an apparent record of 38.1 percent success on the OSWorld benchmark, surpassing previous models but still falling short of human performance at 72.4 percent.

With this imperfect research preview, OpenAI hopes to gather user feedback and refine the system’s capabilities. The company acknowledges CUA won’t perform reliably in all scenarios but plans to improve its reliability across a wider range of tasks through user testing.

Safety and privacy concerns

For any AI model that can see how you operate your computer and even control some aspects of it, privacy and safety are very important. OpenAI says it built multiple safety controls into Operator, requiring user confirmation before completing sensitive actions like sending emails or making purchases. Operator also has limits on what it can browse, set by OpenAI. It cannot access certain website categories, including gambling and adult content.
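As a rough illustration of what such controls could look like in principle, the sketch below gates sensitive actions behind a user confirmation callback and refuses blocked site categories. It is an assumption-laden toy, not OpenAI’s implementation: `SENSITIVE_ACTIONS`, `BLOCKED_CATEGORIES`, `allow_navigation`, and `perform_action` are hypothetical names chosen for this example.

```python
from typing import Callable

# Hypothetical lists, based only on the examples named in the article.
SENSITIVE_ACTIONS = {"send_email", "make_purchase"}
BLOCKED_CATEGORIES = {"gambling", "adult"}


def allow_navigation(site_category: str) -> bool:
    """Return False for site categories the agent may not browse."""
    return site_category not in BLOCKED_CATEGORIES


def perform_action(action: str, confirm: Callable[[str], bool]) -> str:
    """Run an action, asking the user first if it is sensitive."""
    if action in SENSITIVE_ACTIONS and not confirm(action):
        return "cancelled"
    return "done"


# Usage: a user who declines every confirmation request.
always_decline = lambda action: False
print(perform_action("send_email", always_decline))  # sensitive, so the user is asked
print(perform_action("scroll", always_decline))      # not sensitive, no confirmation needed
print(allow_navigation("gambling"))
```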

Traditionally, AI models built on large language model-style Transformer technology, like Operator, have been relatively easy to fool with jailbreaks and prompt injections.

To catch attempts at subverting Operator, which might hypothetically be embedded in websites that the AI model browses, OpenAI says it has implemented real-time moderation and detection systems. OpenAI reports that the system recognized all but one prompt injection attempt during an early internal red-teaming session.
