Author name: Kelly Newman

Trump’s tariffs trigger price hikes at large online retailers

Popular online shopping meccas Temu and Shein have finally broken their silence, warning of potential price hikes starting next week due to Donald Trump’s tariffs.

Temu is a China-based e-commerce platform that has grown as popular as Amazon for global shoppers making cross-border purchases, according to 2024 Statista data. Its tagline, “Shop like a billionaire,” is inextricably linked to the affordability of items on its platform. And although Shein—which vows to make global fashion “accessible to all” by selling inexpensive, stylish clothing—moved its headquarters from China to Singapore in 2022, most of its products are still manufactured in China, a continuing source of controversy, the BBC reported.

For weeks, both sides of the US-China trade war have been ratcheting up tariffs. In the US, the White House last night crunched the numbers and confirmed that China now faces tariffs of up to 245 percent, The Wall Street Journal reported. That figure combines the new tariffs Trump has imposed, which tax all Chinese goods at 145 percent, with prior 100 percent tariffs imposed by the Biden administration that remain in effect on EVs and Chinese syringes.
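As a quick sanity check, the 245 percent ceiling reported by the Journal is simply the two levies stacked; a minimal sketch using only the numbers in the article:

```python
# Decomposition of the reported maximum tariff on some Chinese goods.
new_trump_tariff = 145    # new across-the-board tariff on Chinese goods (percent)
prior_biden_tariff = 100  # earlier tariff still in effect on EVs and syringes (percent)

max_combined = new_trump_tariff + prior_biden_tariff
print(f"Maximum combined tariff: {max_combined} percent")  # → 245 percent
```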

Last week, China announced that it would stop escalating its retaliatory tariffs, CNBC reported, but only after rolling out 125 percent tariffs on US goods. China has since accused Trump of weaponizing tariffs to “an irrational level,” and its other retaliatory measures have included progressively cutting off US access to critical minerals used in tech manufacturing and launching antitrust probes into US companies.

For global retailers, the tit-for-tat tariffs have immediately scrambled business plans. For Temu and Shein in particular, Trump’s decision to end the “de minimis” exemption on May 2—which allowed shipments valued under $800 to be imported duty-free—will soon hit hard, exposing them to 90 percent tariffs and prompting next week’s planned price changes. According to The Guardian, starting on June 1, retailers will have to pay $150 tariffs on each individual package.

AP: Trump admin to kill IRS free tax-filing service that Intuit lobbied against

Sen. Elizabeth Warren (D-Mass.) criticized Intuit’s lobbying against Direct File and told the AP that Trump and Musk “are going after Direct File because it stops giant tax prep companies from ripping taxpayers off for services that should be free. Americans want a free and easy way to file their taxes—Trump and Musk want to take that away.”

Intuit’s TurboTax offers free filing for simple returns but has faced lawsuits alleging that its ads misled consumers who had to pay. In 2022, Intuit agreed to pay $141 million in restitution to millions of consumers and stop a specific ad campaign that promised free filing.

The Federal Trade Commission ruled last year that Intuit violated US law with deceptive advertising and ordered the company to stop telling consumers that TurboTax is free without more obvious disclaimers. Intuit responded by suing the FTC in a case that is still pending at the US Court of Appeals for the 5th Circuit.

The free IRS filing program is also limited to simple returns, but there was hope of expanding its usefulness. The program accepted returns from 140,803 taxpayers in the 12-state 2024 pilot, which was followed by a May 2024 announcement that Direct File would become “a permanent option for filing federal tax returns starting in the 2025 tax season.”

The IRS said in the 2024 announcement that it was looking for ways to cover more complicated tax returns. “Over the coming years, the agency’s goal is to expand Direct File to support most common tax situations, with a particular focus on those situations that impact working families,” the IRS said at the time. The Treasury Department estimated that over 30 million taxpayers were eligible for Direct File this year but hasn’t said yet how many people used it.

House Republicans urged Trump to act even more quickly to kill the program, saying in a December 2024 letter that he should issue “a day-one executive order to end the Internal Revenue Service’s (IRS) unauthorized and wasteful Direct File pilot program.”

CT scans could cause 5% of cancers, study finds; experts note uncertainty

Uncertainty and balancing

“The estimates, while based on the best models available to the authors, are indirect, so there is considerable uncertainty about the estimates,” Stephen Duffy, emeritus professor of Cancer Screening at Queen Mary University of London, said in a statement. “Thus, I would say to patients that if you are recommended to have a CT scan, it would be wise to do so.”

Duffy also highlighted that in the context of a person’s overall risk of cancer, CT scans don’t move the needle much. There were a little over 100,000 cancers linked to 93 million scans. “This amounts to around a 0.1 percent increase in cancer risk over the patient’s lifetime per CT examination,” he said. The lifetime risk of cancer in the US population is around 40 percent. Thus, the additional risk from CT scans “is small.” Overall, when a CT scan is deemed necessary, the “likely benefit in diagnosis and subsequent treatment of disease outweighs the very small increase in cancer risk.”
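Duffy’s 0.1 percent figure can be checked directly from the article’s round numbers; a minimal back-of-the-envelope sketch:

```python
# Per-scan cancer risk implied by the study's headline figures
# (~100,000 projected cancers from 93 million CT scans).
projected_cancers = 100_000
scans = 93_000_000

per_scan_risk = projected_cancers / scans  # fraction per CT examination
lifetime_baseline = 0.40                   # ~40% lifetime cancer risk in the US

print(f"Added risk per CT exam: {per_scan_risk:.2%}")  # ≈ 0.11%, i.e. roughly 0.1%
print(f"As a share of the 40% lifetime baseline: {per_scan_risk / lifetime_baseline:.2%}")
```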

Doreen Lau, a cancer biology expert at Brunel University of London, agreed: “The findings don’t mean that people should avoid CT scans when recommended by a doctor. In most cases, the benefit of detecting or ruling out serious illness far outweighs the very small risk of harm.”

Still, the rise in CT scans in recent years may suggest that doctors could cut back on their use. In an accompanying editorial, Ilana Richman of Yale University and Mitchell Katz of NYC Health and Hospitals discussed ways that doctors could make sure they’re balancing risks and benefits before using CT scans, including using diagnostic algorithms and offering alternative imaging options, such as ultrasounds and magnetic resonance imaging (MRIs).

“As with all complex problems, there will be no simple solution,” they write. But, “educating clinicians about avoiding low-value testing and, in circumstances where alternatives are readily available, involving patients in the decision to do a CT scan may help shift culture and practice.”

HBO’s The Last of Us is back for season 2, and so are we

New episodes of season 2 of The Last of Us are premiering on HBO every Sunday night, and Ars’ Kyle Orland (who’s played the games) and Andrew Cunningham (who hasn’t) will be talking about them here every Monday morning. While these recaps don’t delve into every single plot point of the episode, there are obviously heavy spoilers contained within, so go watch the episode first if you want to go in fresh.

Kyle: To start us off as we return to the world of The Last of Us: as the non-game-player here, maybe recap what you remember from the first season and what you’ve heard about the second.

Andrew: Going into the first season, I’d been aware of The Last of Us, the video game, as a story about an older guy and a kid trying to navigate a post-apocalyptic world. And the show was also mostly that: It’s Joel and Ellie against the world, and who knows, maybe this spunky young girl with an apparent immunity to the society-ravaging fungal infection could hold the key to a cure!

Things fell apart at the end of last season when the Fireflies (a group of survivalists/doctors/scientists/etc.) may or may not have been threatening to kill Ellie in order to research their cure, which made Joel go on a murder rampage, which he then lied to Ellie about. We fade to black as they make their way back toward the one semi-functioning human settlement they’d visited on their travels, where Joel’s brother and his family also happen to live.

Going into this season: I know nothing. I don’t really engage in TV show fandoms or keep up with casting announcements or plot speculation. And the only thing I know about the second game going into this is a vague sense that it wasn’t as well-received as the first. In short, I am as a newborn baby, ready to take in the second season of a show I kind of like with the freshest possible eyes.

Kyle: I may be to blame for that vague sense you have. I fell in love with the first game, especially the relationship between Joel and Ellie, and I thought the first season of the show captured that quite well. I thought the endings to both the game and season 1 of the show were just about perfect and that any continuation after that was gonna struggle to justify itself.

Without giving too much away, I think the second game misses a lot of what made the narrative of the first one special and gets sidetracked in a lot of frankly gratuitous directions. That said, this premiere episode of the second season drew me in more than I expected.

One jarring thing (in a good way) about both the second game and the second season is suddenly seeing Joel and Ellie just existing in a thriving community with electric lights, music, alcohol, decent food, laughter, etc., etc. After the near-constant precarity and danger they’ve faced in the recent past, it really throws you for a loop.

Andrew: Unfortunately but predictably, you see both of them struggling to adapt in different ways; these are two extremely individualistic, out-for-number-one people. Ellie (now a 19-year-old, after a five-year time jump) never met a rule she couldn’t break, even when it endangers her friends and other community members.

And while Joel will happily fix your circuit breaker or re-string your guitar, he emphatically rejected a needs-of-the-many-outweigh-the-needs-of-the-few approach at the end of last season. When stuff breaks bad (and I feel confident that it will, that’s the show that it is) these may not be the best people to have in your corner.

My only real Game Question for you at the outset is the big one: Is season 2 adapting The Last of Us Part II or is it doing its own thing or are we somewhere in between or is it too early to say?

“Oh, dang, is that Catherine O’Hara?”

Kyle: From what I have heard it will be adapting the first section of the second game (it’s a long game) and making some changes and digressions that expand on the game’s story (like the well-received Nick Offerman episode last season). Already, I can tell you that Joel’s therapy scene was created for the TV show, and I think it improves on a somewhat similar “Joel pours his heart out” scene from early in the game.

The debut episode is also already showing a willingness to move around scenes from the game to make them fit better in chronological order, which I’m already appreciating.

One thing I think the show is already doing well, too, is showing 19-year-old Ellie “acting like every 19-year-old ever” (as one character puts it) to father figure Joel. Even in a zombie apocalypse, it’s a relatable bit of character-building for anyone who’s been a teenager or raised a teenager.

Andrew: Joel’s therapist, played by the wonderful Catherine O’Hara. (See, that’s why you don’t follow casting announcements, so you can watch a show and be like, “Oh, dang, is that Catherine O’Hara?”)

I didn’t know if it was a direct adaptation, but I did notice that the show’s video gamey storytelling reflexes were still fully intact. We almost instantly end up in a ruined grocery store chock-full of environmental storytelling (Ellie notes a happy birthday banner and 2003’s Employee of the Year wall).

And like in any new game (er, new season of a TV show), we quickly run into a whole new variant of mushroom monster, one that retains some of its strategic instincts and can take cover rather than blindly rushing at you. Some of the jump scares were so much like quick-time events that I almost grabbed my controller so I could press X and help Ellie out.

Kyle: Yeah, it’s pretty easy to see that the semi-stealthy assault on the abandoned market came directly from the game. I felt like there was some implication that the “strategic” zombie still had a little more humanity left in her that was struggling to fight against the fungus’ pull, which was pretty chilling in the way it was presented.

Andrew: Yes! Fungus is still a maximally creepy and visually interesting way for an infection to spread, and it’s a visual note that helps TLoU stand out from other zombie stories.

It does seem like we’re moving into Phase 2 of most zombie apocalypse fiction. Phase 1 is: There’s an infection! Society collapses. Phase 2 is: Humanity attempts to rebuild. But maybe the scariest monster of all… is humankind??

I’ve always found Phase 2 to be inherently less interesting because I can watch all kinds of shows where people are the antagonists, but Joel and Ellie remain unique and compelling enough as characters that maybe they’ll carry me through.

A teenager should have some hobbies.

Credit: Warner Bros. Discovery

Kyle: The first game already established a lot in the way of “humans are the real monsters” vignettes. And while I still don’t want to give too much away, I will say that human-versus-human drama is definitely going to be an increasingly central part of the narrative going forward.

Speaking of which, I wondered what you made of the brief scenes we get with Abby leading a reluctant but willing band of revenge-seekers that see doctor-murdering Joel as an unalloyed evil (somewhat justifiably, especially from their point of view).

Andrew: My first thought was “look at all these clean, hot, well-coiffed apocalypse survivors.” At least Joel and Ellie both look a little weathered.

But in seriousness, yes, it’s obvious that What Joel Did is a bomb that’s going to go off sooner rather than later. Trying to address it without addressing it has pushed taciturn, closed-off Joel into therapy, where he insists to a woman whose (presumably infected) husband he killed that he’s a “good guy.” And it seems clear to me that Ellie’s shunning of Joel is coming from her sense that something is amiss, just as much as it is about a 19-year-old rebelling against her would-be father figure.

In Joel’s case, it’s telling that it seems like lying to Ellie is weighing on him more than the murder-rampage itself. But having these improbably fresh-faced Firefly remnants chasing him down will mean that he might end up paying for both.

Kyle: I think Joel can live with sacrificing the entire world to save Ellie. I don’t think he can live with Ellie knowing he did that pretty much against her explicit wishes.

Andrew: Oops!! Pobody’s nerfect!

Kyle: I’m sure Abby will understand if Joel just says he made an oopsie.

Andrew: Seriously. Can’t believe they’re still mad even after a five-year time jump. Can’t we all just move on?

As we close, and while at least trying to avoid spoilers, are there any game moments you’re looking forward to seeing? Or are you just hoping that this season can “fix” a story that didn’t work as well for you in video game form?

How can you stay mad at this man?

Credit: Warner Bros. Discovery

Kyle: Actually, I don’t have to spoil anything to say that the scene at the dance was one I was looking forward to seeing in both the game and the show. That’s because a large chunk of it was the first bit of the game Sony ever showed, during a memorable E3 2018 press conference that would end up being the company’s last official E3 presentation.

Besides making me an instant fan of the song “.44 Pistol,” that scene had me very excited to see how the social adventures of “All Growed Up” Ellie might develop. And while I don’t feel like the game really delivered a very satisfying or believable version of Ellie’s evolution, I’m hopeful the show might be able to smooth out some of the rough storytelling edges and give a more compelling version of the character.

Andrew: Yeah. Video games get remastered, but they mostly seek to preserve the original game rather than overhauling it. A well-funded multiseason TV adaptation is a rare opportunity for a redo.

Kyle: The way HBO handled the first season gives me hope that they can once again embrace the excellent world-building of the games while adding some prestige TV polish to the plot.

Apple silent as Trump promises “impossible” US-made iPhones


How does Apple solve a problem like Trump’s trade war?

Despite a recent pause on some tariffs, Apple remains in a particularly thorny spot as Donald Trump’s trade war spikes costs in the tech company’s iPhone manufacturing hub, China.

Analysts predict that Apple has no clear short-term options to shake up its supply chain to avoid tariffs entirely, and even if Trump grants Apple an exemption, iPhone prices may increase not just in the US but globally.

The US Trade Representative, which has previously granted Apple an exemption on a particular product, did not respond to Ars’ request to comment on whether any requests for exemptions have been submitted in 2025.

Currently, the US imposes a 145 percent tariff on Chinese imports, while China has raised tariffs on US imports to 125 percent.

Neither side seems ready to back down, and Trump’s TikTok deal—which must be approved by the Chinese government—risks further delays the longer negotiations and retaliations drag on. Trump has faced criticism for delaying the TikTok deal, with Senate Intelligence Committee Vice Chair Mark Warner (D-Va.) telling The Verge last week that the delay was “against the law” and threatened US national security. Meanwhile, China seems to expect more business to flow into China rather than into the US as a result of Trump’s tough stance on global trade.

With the economy and national security at risk, Trump is claiming that tariffs will drive manufacturing into the US, create jobs, and benefit the economy. Getting the world’s most valuable company, Apple, to manufacture its most popular product, the iPhone, in the US is clearly part of Trump’s vision. White House Press Secretary Karoline Leavitt told reporters this week that Apple’s commitment to invest $500 billion in the US over the next four years was supposedly a clear indicator that Apple believed it was feasible to build iPhones here, Bloomberg reported.

“If Apple didn’t think the United States could do it, they probably wouldn’t have put up that big chunk of change,” Leavitt said.

Apple did not respond to Ars’ request to comment, and so far, it has been silent on how tariffs are impacting its business.

iPhone price increases expected globally

Even if Apple can build products for the US market in India, where tariffs remain lower, Trump’s negotiations with China “remain the most important variable for Apple” to retain its global dominance.

Dan Ives, global head of technology research at Wedbush Securities, told CNBC that “Apple could be set back many years by these tariffs.” Although Apple reportedly stockpiled phones to sell in the US market, that supply will likely dwindle fast as customers move to purchase phones before prices spike. In the medium-term, consultancy firm Omdia forecasted, Apple will likely “focus on increasing iPhone production and exports from India” rather than pushing its business into the US, as Trump desires.

But Apple will still incur additional costs from tariffs on India unless that country negotiates a more favorable trade deal. And any exemption Apple may secure thanks to its US investment promise, or any moderation of the China tariffs, “may not be enough for Apple to avoid adverse business effects,” Craig Moffett, co-founder and senior analyst at equity research publisher MoffettNathanson, suggested to CNBC.

And if Apple is forced to increase prices, it likely won’t be limited to just the US, Bank of America Securities analyst Wamsi Mohan suggested, as reported by The Guardian. To ensure that Apple’s largest market isn’t the hardest hit, Apple may increase prices “across the board geographically,” he forecasted.

“While Apple has not commented on this, we expect prices will be changed globally to prevent arbitrage,” Mohan said.

Apple may even choose to increase prices everywhere but the US, Dipanjan Chatterjee, vice president at Forrester Research, explained in The Guardian’s report.

“If there is a cost impact in the US for certain products,” Chatterjee said, Apple may not increase US prices because “the market is far more competitive there.” Instead, “the company may choose to keep prices flat in the US while recovering the lost margin elsewhere in its global portfolio,” Chatterjee said.

Trump’s US-made iPhone may be an impossible dream

Analysts have said that Trump’s dream that a “made-in-the-USA” iPhone could be coming soon is divorced from reality. Not only do analysts estimate that more than 80 percent of Apple products are currently made in China, but so are many individual parts. So even if Apple built an iPhone factory in the US, it would still have to pay tariffs on individual parts, unless Trump agreed to a seemingly wide range of exemptions. Mohan estimated it would “likely take many years” to move the “entire iPhone supply chain,” if that’s “even possible.”

Further, Apple’s $500 billion commitment covered “building servers for its artificial intelligence products, Apple TV productions and 20,000 new jobs in research and development—not a promise to make the iPhone stateside,” The Guardian noted.

For Apple, it would likely take years to build a US factory and attract talent, all without knowing how tariffs might change. A former Apple manufacturing engineer, Matthew Moore, told Bloomberg that “there are millions of people employed by the Apple supply chain in China,” and Apple has long insisted that the US talent pool is too small to easily replace them.

“What city in America is going to put everything down and build only iPhones?” Moore said. “Boston is over 500,000 people. The whole city would need to stop everything and start assembling iPhones.”

In a CBS interview, Commerce Secretary Howard Lutnick suggested that the “army of millions and millions of human beings” could be automated, Bloomberg reported. But China has never been able to make low-cost automation work, so it’s unclear how the US could achieve that goal without serious investment.

“That’s not yet realistic,” people who have worked on Apple’s product manufacturing told Bloomberg, especially since each new iPhone model requires retooling of assembly, which typically requires manual labor. Other analysts agreed, CNBC reported, concluding that “the idea of an American-made iPhone is impossible at worst and highly expensive at best.”

For consumers, CNBC noted, a US-made iPhone would cost at least 25 percent more than today’s $1,199 price point, raising the floor to about $1,500, with Wall Street analysts forecasting as much as $3,500 at the high end.
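The low end of that range follows directly from the article’s numbers; a quick sketch:

```python
# Price range for a hypothetical US-made iPhone, per the analyst
# estimates cited by CNBC.
current_price = 1199            # today's top-end price point (dollars)
low_estimate = current_price * 1.25  # "25 percent more" floor
high_estimate = 3500                 # Wall Street's upper bound

print(f"Low-end estimate: ${low_estimate:,.0f}")   # ≈ $1,499, i.e. "about $1,500"
print(f"High-end estimate: ${high_estimate:,}")
```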

It took Apple a decade to build its factory in India, which Apple reportedly intends to use to avoid tariffs where possible. That factory “only began producing Apple’s top-of-the-line Pro and Pro Max iPhone models for the first time last year,” CNBC reported.

Analysts told CNBC that it would take years to launch a similar manufacturing process in the US, while “there’s no guarantee that US trade policy might not change yet again in a way to make the factory less useful.”

Apple CEO’s potential game plan to navigate tariffs

It appears that there’s not much Apple can do to avoid maximum pain through US-China negotiations. But Apple’s CEO Tim Cook—who is considered “a supply chain whisperer”—may be “uniquely suited” to navigate Trump’s trade war, Fortune reported.

After Cook arrived at Apple in 1998, he “redesigned Apple’s sprawling supply chain” and perhaps is game to do that again, Fortune reported. Jeremy Friedman, associate professor of business and geopolitics at Harvard Business School, told Fortune that rather than being stuck in the middle, Cook may turn out to be a key intermediary, helping the US and China iron out a deal.

During Trump’s last term, Cook mounted a successful “charm offensive” that secured tariff exemptions without caving to Trump’s demand to build iPhones in the US, CNBC reported, and he’s likely betting that Apple’s recent $500 billion commitment will lead to similar outcomes, even if Apple never delivers a US-made iPhone.

Back in 2017, Trump announced that Apple partner Foxconn would be building three “big beautiful plants” in the US and claimed that they would be Apple plants, CNBC reported. But the pandemic disrupted construction, and most of those plans were abandoned, with one facility only briefly serving to make face masks, not Apple products. In 2019, Apple committed to building a Texas factory that Trump toured. While Trump insisted that a US-made iPhone was on the horizon due to Apple moving some business into the US, that factory only committed to assembling the MacBook Pro, CNBC noted.

Morgan Stanley analyst Erik Woodring suggested that Apple may “commit to some small-volume production in the US (HomePod? AirTags?)” to secure an exemption in 2025, rather than committing to building iPhones, CNBC reported.

Although this perhaps sounds like a tried-and-true game plan, for Cook, Apple’s logistics have likely never been so complicated. However, analysts told Fortune that experienced logistics masterminds understand that flexibility is the priority, and Cook has already shown that he can anticipate Trump’s moves by stockpiling iPhones and redirecting US-bound iPhones through its factory in India.

While Trump negotiates with China, Apple hopes that an estimated 35 million iPhones it makes annually in India can “cover a large portion of its needs in the US,” Bloomberg reported. These moves, analysts said, prove that Cook may be the man for the job when it comes to steering Apple through the trade war chaos.

But to keep up with global demand—selling more than 220 million iPhones annually—Apple will struggle to quickly distance itself from China, where there’s abundant talent to scale production that Apple says just doesn’t exist in the US. For example, CNBC noted that Foxconn hired 50,000 additional workers last fall at its largest China plant just to build enough iPhones to meet demand during the latest September launches.

As Apple remains dependent on China, Cook will likely need to remain at the table, seeking friendlier terms on both sides to ensure its business isn’t upended for years.

“One can imagine, if there is some sort of grand bargain between US and China coming in the next year or two,” Friedman said, “Tim Cook might as soon as anybody play an intermediary role.”

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Quantum hardware may be a good match for AI

Quantum computers don’t have that sort of separation. While they could include some quantum memory, the data is generally housed directly in the qubits, while computation involves performing operations, called gates, directly on the qubits themselves. In fact, there has been a demonstration that, for supervised machine learning, where a system can learn to classify items after training on pre-classified data, a quantum system can outperform classical ones, even when the data being processed is housed on classical hardware.

This form of machine learning relies on what are called variational quantum circuits. These are built from two-qubit gate operations that take an additional factor, one that can be held on the classical side of the hardware and imparted to the qubits via the control signals that trigger the gate operation. You can think of this as analogous to the communications involved in a neural network, with the two-qubit gate operation equivalent to the passing of information between two artificial neurons and the factor analogous to the weight given to the signal.
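As an illustration only (this is not the Honda/Blue Qubit system), a tiny statevector simulation can show how a classically held parameter enters a quantum circuit through the gate it controls. The specific gates (an RY rotation followed by a CNOT) and the measured observable are my own toy choices:

```python
import numpy as np

def ry(theta):
    """Single-qubit RY rotation; theta is the classically held 'weight'."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# CNOT with qubit 0 as control, qubit 1 as target (basis order |q0 q1>).
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def circuit_expectation(theta):
    """Apply RY(theta) to qubit 0, then CNOT, and return <Z> on qubit 0."""
    state = np.zeros(4)
    state[0] = 1.0                                   # start in |00>
    state = np.kron(ry(theta), np.eye(2)) @ state    # parameterized rotation
    state = CNOT @ state                             # entangling two-qubit gate
    z0 = np.diag([1.0, 1.0, -1.0, -1.0])             # Z observable on qubit 0
    return float(state @ z0 @ state)

print(circuit_expectation(0.0))    # → 1.0 (no rotation)
print(circuit_expectation(np.pi))  # → -1.0 (full flip)
```

Training such a circuit means adjusting `theta` classically, based on measured expectation values, much as a neural network adjusts its weights.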

That’s exactly the system that a team from the Honda Research Institute worked on in collaboration with a quantum software company called Blue Qubit.

Pixels to qubits

The focus of the new work was mostly on how to get data from the classical world into the quantum system for characterization. But the researchers ended up testing the results on two different quantum processors.

The problem they were testing is one of image classification. The raw material was from the Honda Scenes dataset, which has images taken from roughly 80 hours of driving in Northern California; the images are tagged with information about what’s in the scene. And the question the researchers wanted the machine learning to handle was a simple one: Is it snowing in the scene?

Experimental drug looks to be gastric bypass surgery in pill form

In rats, the drug produced a consistent 1 percent weekly weight loss over a six-week study period while preserving 100 percent of lean muscle mass.
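For scale, a steady 1 percent weekly loss adds up to roughly 6 percent of body weight over the six-week study; a minimal sketch (the week-to-week compounding assumption is mine, not the study’s):

```python
# Cumulative effect of a steady 1 percent weekly weight loss,
# assuming each week's loss compounds on the previous week's weight.
weight = 1.0  # starting body weight, normalized
for week in range(6):
    weight *= 0.99  # lose 1 percent of current weight each week

total_loss = 1.0 - weight
print(f"Cumulative loss after 6 weeks: {total_loss:.1%}")  # ≈ 5.9%
```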

In a first-in-human pilot study of nine participants, the drug was safe with no adverse effects. Tissue samples taken from the intestine were used to confirm that the coating formed and was also cleared from the body within 24 hours. The study wasn’t designed to assess weight loss, but blood testing showed that after the drug was given, glucose levels and the “hunger hormone” ghrelin were lower while the levels of leptin, an appetite-regulating hormone, were higher.

“When nutrients are redirected to later in the intestine, you’re activating pathways that lead towards satiety, energy expenditure, and overall healthy, sustainable weight loss,” Dhanda says.

Syntis Bio’s findings in animals also hint at the drug’s potential for weight loss without compromising muscle mass, one of the concerns with current GLP-1 drugs. While weight loss in general is associated with numerous health benefits, there’s growing evidence that the kind of drastic weight loss that GLP-1s induce can also lead to a loss of lean muscle mass.

Louis Aronne, an obesity medicine specialist and professor of metabolic research at Weill-Cornell Medical College, says that while GLP-1s are wildly popular, they may not be right for everyone. He predicts that in the not-so-distant future there will be many drugs for obesity, and treatment will be more personalized. “I think Syntis’ compound fits in perfectly as a treatment that could be used early on. It’s a kind of thing you could use as a first-line medication,” he says. Aronne serves as a clinical adviser to the company.

Vladimir Kushnir, professor of medicine and director of bariatric endoscopy at Washington University in St. Louis, who isn’t involved with Syntis, says the early pilot data is encouraging, but it’s hard to draw any conclusions from such a small study. He expects that the drug will make people feel fuller but could also have some of the same side effects as gastric bypass surgery. “My anticipation is that this is going to have some digestive side effects like bloating and abdominal cramping, as well as potentially some diarrhea and nausea once it gets into a bigger study,” he says.

It’s early days for this novel technique, but if it proves effective, it could one day be an alternative or add-on drug to GLP-1 medications.

This story originally appeared on wired.com.

Trump administration’s attack on university research accelerates

Since shortly after the inauguration, the Trump administration has made no secret that it isn’t especially interested in funding research. Before January’s end, major science agencies had instituted pauses on research funding, and grant funding has not been restored to previous levels since. Many individual grants have been targeted on ideological grounds, and agencies like the National Science Foundation are expected to see significant cuts. Since then, individual universities have been targeted, starting with an ongoing fight with Columbia University over $400 million in research funding.

This week, however, it appears that the targeting of university research has entered overdrive, with multiple announcements of funding freezes targeting several universities. Should these last for any considerable amount of time, they will likely cripple research at the targeted universities.

On Wednesday, Science learned that the National Institutes of Health has frozen all of its research funding to Columbia, despite the university agreeing to steps previously demanded by the administration and the resignation of its acting president. In 2024, Columbia had received nearly $700 million in grants from the NIH, with the money largely going to the university’s prestigious medical and public health schools.

But the attack goes well beyond a single university. On Tuesday, the Trump administration announced a hold on all research funding to Northwestern University (nearly $800 million) and Cornell University ($1 billion). These involved money granted by multiple government agencies, including a significant amount from the Department of Defense in Cornell’s case. Ostensibly, all of these actions were taken because of the university administrators’ approach to protests about the conflict in Gaza, which the administration has characterized as allowing antisemitism.

Google announces faster, more efficient Gemini AI model

We recently spoke with Google’s Tulsee Doshi, who noted that the 2.5 Pro (Experimental) release was still prone to “overthinking” its responses to simple queries. However, the plan was to further improve dynamic thinking for the final release, and the team also hoped to give developers more control over the feature. That appears to be happening with Gemini 2.5 Flash, which includes “dynamic and controllable reasoning.”

The newest Gemini models will choose a “thinking budget” based on the complexity of the prompt. This helps reduce wait times and processing for 2.5 Flash. Developers even get granular control over the budget to lower costs and speed things along where appropriate. Gemini 2.5 models are also getting supervised tuning and context caching for Vertex AI in the coming weeks.

In addition to the arrival of Gemini 2.5 Flash, the larger Pro model has picked up a new gig. Google’s largest Gemini model is now powering its Deep Research tool, which was previously running Gemini 2.0 Pro. Deep Research lets you explore a topic in greater detail simply by entering a prompt. The agent then goes out into the Internet to collect data and synthesize a lengthy report.

Gemini vs. ChatGPT chart

Credit: Google

Google says that the move to Gemini 2.5 has boosted the accuracy and usefulness of Deep Research. The graphic above shows Google’s alleged advantage compared to OpenAI’s deep research tool. These stats are based on user evaluations (not synthetic benchmarks) and show a greater than 2-to-1 preference for Gemini 2.5 Pro reports.

Deep Research is available for limited use on non-paid accounts, but you won’t get the latest model. Deep Research with 2.5 Pro is currently limited to Gemini Advanced subscribers. However, we expect before long that all models in the Gemini app will move to the 2.5 branch. With dynamic reasoning and new TPUs, Google could begin lowering the sky-high costs that have thus far made generative AI unprofitable.

Llama Does Not Look Good 4 Anything

Llama Scout (17B active parameters, 16 experts, 109B total) and Llama Maverick (17B active parameters, 128 experts, 400B total), released on Saturday, look deeply disappointing. They are disappointing on the level of ‘people think they have to be misconfigured to be this bad,’ with people wondering and debating how aggressively the benchmarks were gamed.

This was by far the most negative reaction I have seen to a model release, the opposite of the reaction to Gemini 2.5 Pro. I have seen similarly deeply disappointing and misleading releases, but they were non-American models from labs whose benchmarks and claims we have learned not to take as representing model capabilities.

After this release, I am placing Meta in that category of AI labs whose pronouncements about model capabilities are not to be trusted, that cannot be relied upon to follow industry norms, and which are clearly not on the frontier. Until they show otherwise, they clearly do not belong in the category that includes OpenAI, Anthropic, Google, xAI and DeepSeek.

Techikansh: I am just gonna leave this here…

  1. Llama We Doing This Again.

  2. Llama the License Favors Bad Actors.

  3. Llama You Do It This Way.

  4. Llama Fight in the Arena.

  5. Llama Would You Cheat on Other Benchmarks.

  6. Llama It So Bad on Independent Benchmarks.

  7. Llama You Don’t Like It.

  8. Llama Should We Care.

Meta released the first two Llama 4 models last Saturday, and there is a code change indicating that the original plan was to do it Monday and it got moved up. In general, releasing on a Saturday is such bad strategy it simply isn’t done. Zuck says ‘that’s when it was ready’ but that is not an explanation.

People are wondering why Meta made an exception and did it anyway. I have two hypotheses for what happened (note: I do not have any private information here).

  1. They moved it up because the tariffs were about to potentially cause a Black Monday stock market crash, and Meta wanted to get ahead of that to protect themselves and also to not have the release buried under other news. This seems entirely reasonable under the circumstances.

  2. They released on Saturday to bury it, because it isn’t any good.

Those two look to be at cross-purposes, but I’m not so sure. Suppose, for the sake of argument here, that Llama-4 sucks.

  1. Investors can’t really tell the difference, especially not by Monday.

  2. Those who can tell the difference would be less likely to notice or talk about it.

Who knows. That’s all speculation.

What I do know is that the Llama 4 models released so far seem to not be any good.

You can download Llama 4 Scout and Maverick at Hugging Face or from llama.com. You can try it on the web, or within Meta’s products.

They offer a Llama license, which is rather obnoxious, restricting large companies from using it and requiring rather prominent acknowledgment of Llama’s use, including putting ‘Llama’ in the title and adhering to the ‘acceptable use policy.’

Putting such requirements on otherwise open weight models gives an advantage to overseas companies and governments, especially the PRC, that can and will simply ignore such rules, while handicapping American companies.

European companies are of course handicapped even more: they literally are not given a license at all. Blame whoever you want for that part.

Lech Mazur: Large, it will be tough for enthusiasts to run them locally. The license is still quite restrictive. I can see why some might think it doesn’t qualify as open source.

Not cool. Be open, or be closed.

This may be part of a consistent pattern. We just saw this story by Allan Smith that Sarah Wynn-Williams, a former Facebook employee, will testify before Congress today that Meta executives undermined U.S. national security and briefed Chinese officials on emerging technologies like artificial intelligence. I don’t know if this is true, but ‘Meta has been cooperating with China for ordinary business reasons’ might be the explanation for a lot of its AI decisions.

If the models were good, this would potentially be a rather big deal.

In terms of techniques used, I take their announcement post to be ‘I hear you like mixture-of-expert LLMs and scaling up so I got you some scaled up MoEs to go with your scaled up MoEs.’ This includes the size in parameters and also amount of data.

I would take Meta’s outright statement of ‘newest model suite offering unrivaled speed and efficiency’ as an almost certainly false claim, as is the following quote from them. As in, they are sufficiently false as to downgrade my trust in Meta’s claims, which was never all that high.

Meta: Llama 4 Maverick, a 17 billion active parameter model with 128 experts, is the best multimodal model in its class, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks, while achieving comparable results to the new DeepSeek v3 on reasoning and coding—at less than half the active parameters.

That’s a bold claim. Feedback does not back this up.

The two features they do offer are support for 200 languages and, in theory, a long context window. I say in theory because it’s easy to offer long context so you can tout it, and hard to make that long context do anything useful while preserving performance. Needle-in-a-haystack is not a good measure of practical use here. On one private benchmark that does try to use that long context, Fiction.live, the results are historically bad, the worst they’ve ever seen, even at 60k.

Meta offer some benchmarks, which many noted seem selected, and they also select their competition.

Anyone keeping up with LLM progress can see the choices here are a little suspicious.

Artificial Analysis confirms the scores, but only on the benchmarks Meta chose.

The Llama models are giant mixture of experts (MoE) models, similar to (and presumably because of and copying) DeepSeek’s v3 and r1. Scout is 17B active parameters, 16 experts, 109B total. Maverick is 17B active, 128 experts, 400B total. The unreleased Behemoth is huge, 288B active, 16 experts and 2T total parameters.
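To make those sizes concrete, here is a back-of-the-envelope sketch of the mixture-of-experts arithmetic, using only the publicly stated totals (the per-token compute versus memory framing is the generic MoE tradeoff, not a Meta-specific detail):

```python
# Rough mixture-of-experts (MoE) arithmetic for the stated Llama 4 sizes.
# With MoE, only a subset of experts runs per token, so per-token compute
# tracks *active* parameters while memory must hold *total* parameters.

models = {
    # name: (active params in B, expert count, total params in B)
    "Scout":    (17, 16, 109),
    "Maverick": (17, 128, 400),
    "Behemoth": (288, 16, 2000),  # unreleased
}

for name, (active, experts, total) in models.items():
    ratio = total / active
    print(f"{name}: {total}B weights held in memory, {active}B used per token "
          f"({experts} experts, {ratio:.1f}x total/active)")
```

The total/active gap is why Maverick can be “only” 17B per token yet still demand datacenter-class memory.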

That means that while they are optimized to run fast on an H100, they can’t be run at all on a 4090 GPU or other similar consumer hardware, which negates one of the big advantages of open models. I presume you can run Scout and Maverick (quantized) on my Mac Studio, and I might well do that, but that’s a hefty ask.
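A quick weights-only memory estimate shows why. This is a sketch under simple assumptions: it ignores KV cache, activations, and quantization overhead, and takes 24 GB as the 4090’s VRAM:

```python
# Weights-only VRAM estimate: 1e9 * params_billions weights, each taking
# bits_per_weight / 8 bytes, expressed in GB (the 1e9s cancel).
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """GB needed just to hold the weights at a given quantization level."""
    return params_billions * bits_per_weight / 8

RTX_4090_GB = 24

for model, params in [("Scout", 109), ("Maverick", 400)]:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        verdict = "fits" if gb <= RTX_4090_GB else "does not fit"
        print(f"{model} @ {bits}-bit: ~{gb:.0f} GB of weights ({verdict} in 24 GB)")
```

Even 4-bit Scout needs roughly 55 GB of weights alone, which is why the discussion jumps straight to multi-GPU rigs and large unified-memory Macs.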

Jeff Dean: Sure, but you can run it on 4 or 8 of them, no?

Jeremy Howard: Yes I can; as can you. But I’m primarily interested in what’s widely available in the community, where a single 4090 GPU machine is already a very rich investment.

Remember also that 3090s were the last consumer card with nvlink, so 4090 and 5090 cards aren’t good at multi gpu

Jeff Dean: Fwiw, this exact reason is why we made the Gemma 3 open source models something that developers could easily run on a single GPU or TPU.

And if you have only one or two GPUs and you want to run the model as fast as you can, here’s an RL algorithm that can help figure out how to use those GPU(s) plus your CPU to go as fast as you can with whatever hardware you have

Luke Metro: Apple Silicon’s using its large amount of unified memory for big on-device AI models might be the hardware coup of the decade if Apple Intelligence is able to get its sh*t together.

The strongest data point in Llama 4’s favor is the Arena ranking of 1417. That is good for second place, which is indeed impressive if it is reflective of general performance.

Alas, as we all know by now, Arena is being used as an optimization target. Was that done here? We don’t know.

Other signs, like the selective benchmarks they released, are suggestive of such a strategy, and they would be far from the only ones. Janus asks what, other than Goodharting, explains the rise in Arena ratings for new models; I think that’s definitely a lot of it, whether for Arena itself or for things that aren’t actually Arena but are highly correlated with it.

What does Arena optimize for? A random internet user prefers your response to another model’s response.

What makes people prefer one response to another? We can also look at the actual responses, and see, now that Arena has released answers for review.

Morgan: i probably arrive too late but the lmsys voter’s preference for sycophantic yapping is particularly clear this time

Wh: These examples are extremely damning on the utility of Chatbot arena as a serious benchmark. Look through all the examples that Maverick won, and it’s slop after slop after slop. This is the nonsense you are optimizing for if you are trying to goodhart lmsys. Let’s be serious.

This is the clearest evidence that no one should take these rankings seriously.

In this example it’s super yappy and factually inaccurate, and yet the user voted for Llama 4. The rest aren’t any better.

Always start by profusely telling the user how smart they are.

TDM: Struggling to find a single answer in this that is less than 100 lines and doesn’t make me throw up.

AKR: Llama 4 Maverick Experimental vs Claude 3.7 Sonnet

Prompt: Create a web page that shows the current the current month as a table, with no border lines, and has button to move to the previous and next month. It also has the ability to show a bar that can go horizontally across the days in a week to indicate a daily streak.

3.7 Sonnet won easily because of the “Add Streak for Current Week” button which is clearly what’s needed as the prompt. It also better UI imo.

But on the LMArena Experimental Battles UI, the user selected the Llama 4 Mav Exp as the better model 🤦‍♂️

Goes to show that you should never believe these benchmarks unless you really try it out yourself.

Hasan Can: When I said [a well-known AI company is clearly manipulating Arena via watermarking] back on March 28th, nobody offered support. Now, time has come to put a final nail in lmarena’s coffin.

These answers by Maverick, that users voted for, seem absurdly obnoxious and bad. I originally wrote ‘these make me want to puke,’ erased it, but now that I see TDM saying the same thing I’m putting that observation back in. This is the opposite of what I want.

And indeed, this also potentially explains Claude Sonnet 3.7’s low Arena ranking. What if people really do prefer sycophancy and lengthy slop? It exists for a reason.

It’s clear Llama-4 fell victim to Goodhart’s Law, either to Arena rankings directly or to a similar other ranking process they used in fine tuning.

We also know that this version of Maverick on Arena is not the same as the one they released, and it seems, shall we say, ‘slopified.’

The question is, is that all that happened? Did they also outright cheat to get this Arena ranking? I opened a Manifold market; unfortunately we will likely never know for sure, but I figured something was better than nothing here. Suggestions for better resolution methods are welcome. When I say ‘cheating’ I mean something beyond ‘a version optimized to do well on Arena.’ I mean actual outright cheating.

Did they flat out cheat?

Peter Wildeford: According to The Information, delays were due to the model underperforming on technical benchmarks. In my opinion, it still seems like Meta was pretty selective about the metrics they chose to use (and the metrics they didn’t) and how they did the comparisons, suggesting the model may not be that good.

Satya Benson: The interesting story here is the allegations of cheating on the benchmarks. I’d love to get better sense of to what extent this really happened and how bad the cheating is relative to other models.

First Worldist: My understanding is they tested “experimental” models without disclosing these models were trained specifically for the benchmarks

There’s at least one claim that they did fix that partly via cheating, obviously take with tons of salt given the sourcing.

I wouldn’t think Meta would go this far, for the same reasons as Peter, so I doubt it happened. Nor would they have had to go this far. You actually have to work hard to not accidentally de facto train on benchmarks when using 22T+ tokens.

So while I’m quoting the post for posterity, I assume this accusation is probably false.

Peter Wildeford: I don’t believe the conspiracy theories about training on the test set, but I do think they’ve been highly selective in which metrics they picked in order to pretend to be better than they are.

The fact that the Chatbot Arena is a different bot than the ones getting the math scores is also telling.

Leo: It’s a pretty big no-no in ML, and seems unlikely that Meta researchers would torch their reputation risking something like this. Would need strong evidence to be convinced otherwise.

Peter Wildeford: Agreed. Accusation seems unlikely on priors and the evidence isn’t sufficient to move me enough.

Rrryougi (I doubt the claims here are true, but they seem too important not to include in the record): Original post is in Chinese that can be found here. Please take the following with a grain of salt.

Content:

Despite repeated training efforts, the internal model’s performance still falls short of open-source SOTA benchmarks, lagging significantly behind. Company leadership suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics and produce a “presentable” result. Failure to achieve this goal by the end-of-April deadline would lead to dire consequences. Following yesterday’s release of Llama 4, many users on X and Reddit have already reported extremely poor real-world test results.

As someone currently in academia, I find this approach utterly unacceptable. Consequently, I have submitted my resignation and explicitly requested that my name be excluded from the technical report of Llama 4. Notably, the VP of AI at Meta also resigned for similar reasons.

Ortegaalfredo: “Meta’s head of AI research announces departure – Published Tue, Apr 1 2025”

At least that part is true. Ouch.

There is however this:

Hasan Can: This [below] might potentially constitute first solid evidence suggesting Llama 4 was actually trained on benchmarks.

Kaixuan Huang: Just tested Llama4-Scout on our MATH-Perturb benchmark. There is a surprising 18% gap between Original and MATH-P-Simple, making it unique among the 20+ models that came out after 2024. 😂😂

It doesn’t look great. Here it is in an easier-to-read form:

That sure looks like cheating. Again, it doesn’t mean they intentionally train on the test set. If you have 22T+ tokens and throw the entire internet at your model, there’s going to be contamination. All you have to do is not sufficiently care about not training on benchmarks. Alternatively, you can hill climb on your test scores.

Previously, I would have doubted Meta would let this happen. Now, I have less doubt.

This would not be the first time Meta has broken similar norms.

Holly Elmore: I don’t want to speak out of turn but it doesn’t seem out of character for Meta to me. They knowingly stole libgen and downloaded it via Tor bc they knew it would look bad. The informal ethics of ML are unfortunately not the reassurance I was hoping for.

Those sources seem rather illegal. Meta don’t care. What are you going to do about it?

It is 2025. In general, ‘[X] would go against norms’ is no longer seen as so strong an argument against doing [X]. The question is now: if I do [X], yes it is against norms, but even if you figure out that I did that, what are you going to do about it?

That goes double for ‘not doing enough to prevent [X] would go against norms.’

This is everything I could find that plausibly counts as a benchmark. There are some benchmarks where Maverick is mid, others where it is less than mid.

I don’t know if ARC-AGI counts as ‘independent benchmarks’ but Maverick scored 4.38% and Scout 0.5% on ARC-AGI-1 and both got 0.00% on ARC-AGI-2.

On Livebench, Llama 4 Maverick does relatively okay with a 54.38, right behind DeepSeek R1 Distill Llama 70B and Gemini 2.0 Flash.

Here are the Lech Mazur benchmarks.

Extended Word Connections (which is de facto a reasoning benchmark):

Confabulations, it gets a 22.6 here, which is rather not good:

On Creative Writing Llama Maverick bombs super hard, Llama are the three bars on the left:

In the Elimination game, things again don’t go great.

It also does not do well in Thematic Generation or Step-Game Battles where even Llama 3.3 70B kicks its ass, as does almost everything else.

BigCodeBench didn’t go great, although Llama-4-Maverick did marginally beat out Gemma-3-27B.

Markus Zimmerman reports results for DevQualityEval v1.0, and they ‘do not look good,’ they are more than halfway down a very long chart of only open models.

Harvard Ihle is here with WeirdML, Maverick is in the middle, doing pretty well relative to other benchmarks.

In general, if you have your own benchmark, it doesn’t look good:

George: the most complimentary informed takes have come from shrek, eg.

the most damning critical takes (imo) have come from curators of lesser known benchmarks, on which the new models are not performing well. The EQBench site has a couple (/they bombed), bigcodebench had Maverick coming in well below DSv2 (not a typo). Aider Polyglot bench was similarly bleak.

And here by “most damning” I am intentionally excluding takes informed by the sloptimized version that was sent to lmsys. Meta folks are chalking some of the poor results up to implementation issues, but on at least one benchmark (long context fiction) the proprietors have tried three different implementations and netted similarly disappointing scores each time.

This was Aider polyglot:

Here’s that positive viewpoint, from xjdr, clearly in the context of open models only, essentially saying that Maverick is a specialized model that is good in particular for agentic and tool-calling work:

xjdr: my detailed personal benchmarks ran overnight.

– Scout is best at summarization and function calling. exactly what you want from a cheap long ctx model. this is going to be a workhorse in coding flows and RAG applications. the single shot ICL recall is very very good.

– Maverick was built for replacing developers and doing agenic / tool calling work. it is very consistent in instruction following, very long context ICL and parallel multi tool calls. this is EXACTLY the model and capabilities i want in my coder style flows. it is not creative, i have V3 and R1 for that tho. multimodal is very good at OCR and charts and graphs outperforming both 4o and qwen 2.5 VL 72 in my typical tests. the only thing i haven’t tested is computer use but i doubt it will beat sonnet or qwen at that as both models were explicitly trained for it. The output is kind of bland (hence the constant 4o comparisons) with little personality, which is totally fine. this is a professional tool built for professional work (testing it on RP or the like will lead to terrible results). Im not sure what more you could ask for in a agent focused model.

– V3-0324 is not consistent enough with tool calling output to be useful but when it gets it right, it is always the clear and best choice. however, it excels at creativity, problem solving and multi-turn interactions. this will continue to be my non-function calling workhorse. the 131k ctx feels remarkably restrictive now tho. i am going to do some more long ctx testing on V3 cause im almost positive i can get more out of it (200k – 300k ideally), but i think this is where MLA is going to show its tradeoffs. FIM and completion are also huge V3 specific wins here and places where it not only excels but is really in a league of its own.

– R1 continues to be the smartest and most creative model available when used single shot, single turn and when prompted correctly. its the genius in the corner who cant make eye contact but if you properly specify a problem it will be solved with an incredibly high degree of confidence. Function calling (really all of the V3 features) work as expected but the formatting is a bit 1/2 baked and doubly so when you use them with tool use. however, with proper parsing and sampling effort, its a truly remarkable model.

– All of these models benefit tremendously from proper sampling and lovingly crafted matmuls and accumulations. they are all much better and smarter than what is generally available from lmsys or openrouter.

I am incredibly bullish on Behemoth and R2 and cannot wait to fold them into my daily workflow. I have never been happier about the state of open source models and since the R1 launch and when used correctly they provide a viable alternative to frontier models for the first time. I am happy to answer and specific questions but this is probably my last general post on this. i gotta get back to work …

I suppose that is possible. Perhaps it has its niche and will be good at that niche once people adapt to it and scaffold it well. But that’s definitely not how Meta is presenting Maverick or the future Behemoth.

It’s weird to call it a ‘benchmark’ but worth noting that Llama 4 Scout and Maverick did not exhibit alignment faking in a new test.

Another sort-of benchmark would be red teaming, done here by Virtue AI. Alas, their tests seem to be against mundane risks only. They find that Llama 4 is significantly less compliant with AI regulations than Claude 3.7 or GPT-4.5, ‘lagging behind peers,’ and evaluations show ‘noticeable weaknesses’ against mundane harms, despite what they call ‘Maverick’s caution dilemma’ and false refusals.

That is distinct from asking about misuse, malicious fine-tuning or other sources of potential catastrophic risk from an open weights model – as always, ‘the license says you cannot do that’ is going to get ignored here. One presumes that the main defense is that these models lack the capability to cause new trouble here, at least in the absence of Behemoth.

Or, here is what people are saying in other realms.

Yair Halberstadt: Reviews on Reddit were that it was total trash, so bad they assume it must be misconfigured or something.

I’ve had confirmation of Yair’s statement from other reliable sources.

Murat: just tried llama 4 scout on groq cloud. 512 tok/s is great

however just like all the other eval-optimized models (like claude 3.7, o3-mini etc.) it doesn’t follow instructions properly. i can’t use it as drop-in replacement for my existing prompt pipelines.

just tried llama maverick. same thing. unimpressed.

grok lacks api so sonnet 3.5 is still my main squeeze.

Medo 42: Personal toy benchmark (a coding task I give to every new model): Not good at all. Shares last place with Gemini 2.0 Pro 02-07 now.

Roughly: “The code returned an array of objects in the right shape and one of the fields of the objects had the right value most of the time”

Scaling01: Llama-4-Yapper strikes again

I can’t even run tic-tac-toe bench properly because Llama-4-400B can’t shut up and just answer with 1 number.

Llama-4-109B can for some reason.

Who was the biggest cheerleader that doesn’t work at Meta?

AI and crypto czar David Sacks: Congrats to the @AIatMeta team on the launch of their new Llama 4 open-weights models. For the U.S. to win the AI race, we have to win in open source too, and Llama 4 puts us back in the lead.

Peter Wildeford: Google is so bad at marketing that @davidsacks47 doesn’t praise Gemma 3.

Failure to mention Gemma 3 feels like strong mood affiliation, on top of the marketing issues. Google is known as a closed lab; Meta is known as open. But mainly yes, Google’s marketing is atrocious. Still, a claim that Gemma 3 put us back in the lead would have been a lot more defensible than one about Llama 4.

The Llama tokenizer is a place you might fear to tread.

Kalomaze: if at any point someone on your team says

“yeah we need 10 special tokens for reasoning and 10 for vision and another 10 for image generation and 10 agent tokens and 10 post tr-“

you should have slapped them

this is what happens when that doesn’t happen

Minh Nhat Nguyen: do not go into the llama tokenizer dot json. worst mistake of my life.

tbf i think the reserved llama tokens are nice for ablation experiments, but they rly go overboard with it

Jim Fan says ‘Llama-4 doesn’t disappoint’ but his response seems entirely based on Meta’s claims and reports rather than any independent assessment of performance.

All general reports on feedback say that people are disappointed. It was so disappointing that mostly people treated it as a non-event until asked.

Mena Fleischman: I haven’t seen anything particularly complimentary. They held off on dropping Behemoth which was supposed to be the real showcase of something SOTA, and next-best Maverick in their own stats got mostly beat by Deepseek, who was already beaten on release.

Very weak showing.

Andriy Burkov: If today’s disappointing release of Llama 4 tells us something, it’s that even 30 trillion training tokens and 2 trillion parameters don’t make your non-reasoning model better than smaller reasoning models.

Model and data size scaling are over.

Along similar lines, Alexander Doria doesn’t see much point in giving 40T tokens to Llama-4 Scout, and 22T to Llama-4 Maverick.

I don’t think this means model and data size scaling are over. I think it means that if you do not know how to execute, sheer size will not save you, and probably gives you smaller marginal gains than if you executed well.

The big takeaway is that we have to downgrade expectations for Meta in AI, and also our expectations for how much we can trust Meta.

Despite vastly superior resources, Meta now seems to be trying to copy DeepSeek and coming up short. Exactly how short depends on who you ask. And Meta is, to an unknown degree, making a deliberate effort to make its models look good on benchmarks in ways that violate norms.

It is hard to count out a top tech company with tons of compute and almost endless capital. They could still turn this ship around. But they’re going to have to turn this ship around, and do it fast, if they want to be competitive.

Right now, America’s open model champion isn’t Meta. It is Google with Gemma 3, and soon it may also be OpenAI, which is planning an open reasoning model soon. I realize that causes some dissonance, but that’s where we are. Beware mood affiliation.

Fewer beans = great coffee if you get the pour height right

Based on their findings, the authors recommend pouring hot water over your coffee grounds slowly to give the beans more time immersed in the water. But pour the water too slowly and the resulting jet will stick to the spout (the “teapot effect”) and there won’t be sufficient mixing of the grounds; they’ll just settle to the bottom instead, decreasing extraction yield. “If you have a thin jet, then it tends to break up into droplets,” said co-author Margot Young. “That’s what you want to avoid in these pour-overs, because that means the jet cannot mix the coffee grounds effectively.”

Smaller jet diameter impact on dynamics. Credit: E. Park et al., 2025

That’s where increasing the height from which you pour comes in. This imparts more energy from gravity, per the authors, increasing the mixing of the granular coffee grounds. But again, there’s such a thing as pouring from too great a height, causing the water jet to break apart. The ideal height is no more than 50 centimeters (about 20 inches) above the filter. The classic goosenecked tea kettle turns out to be ideal for achieving that optimal height. Future research might explore the effects of varying the grain size of the coffee grounds.
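The gravity argument can be made concrete with a free-fall estimate. This is a sketch I am adding for intuition only; it ignores the jet’s exit speed, drag, and breakup, which the study models more carefully:

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def impact_speed(height_m: float) -> float:
    """Free-fall speed of the water jet after dropping height_m meters."""
    return math.sqrt(2 * G * height_m)

for h_cm in (10, 30, 50):
    h = h_cm / 100
    # Kinetic energy per unit mass grows linearly with height: E/m = G * h.
    print(f"{h_cm} cm pour: ~{impact_speed(h):.1f} m/s at impact, "
          f"~{G * h:.1f} J/kg of mixing energy")
```

Height buys mixing energy linearly, but past roughly 50 centimeters the jet fragments into droplets, which is exactly the failure mode the authors warn about.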

Increasing extraction yields and, by extension, reducing how much coffee grounds one uses matters because it is becoming increasingly difficult to cultivate the most common species of coffee because of ongoing climate change. “Coffee is getting harder to grow, and so, because of that, prices for coffee will likely increase in coming years,” co-author Arnold Mathijssen told New Scientist. “The idea for this research was really to see if we could help do something by reducing the amount of coffee beans that are needed while still keeping the same amount of extraction, so that you get the same strength of coffee.”

But the potential applications aren’t limited to brewing coffee. The authors note that this same liquid jet/submerged granular bed interplay is also involved in soil erosion from waterfalls, for example, as well as wastewater treatment—using liquid jets to aerate wastewater to enhance biodegradation of organic matter—and dam scouring, where the solid ground behind a dam is slowly worn away by water jets. “Although dams operate on a much larger scale, they may undergo similar dynamics, and finding ways to decrease the jet height in dams may decrease erosion and elongate dam health,” they wrote.

Physics of Fluids, 2025. DOI: 10.1063/5.0257924 (About DOIs).

Fewer beans = great coffee if you get the pour height right Read More »

framework-“temporarily-pausing”-some-laptop-sales-because-of-new-tariffs

Framework “temporarily pausing” some laptop sales because of new tariffs

Framework, the designers and sellers of the modular and repairable Framework Laptop 13 and other products, announced today that it would be “temporarily pausing US sales” on some of its laptop configurations as a result of new tariffs put on Taiwanese imports by the Trump administration. The affected models will be removed from Framework’s online store for now, and there’s no word on when buyers can expect them to come back.

“We priced our laptops when tariffs on imports from Taiwan were 0 percent,” the company responded to a post asking why it was pausing sales. “At a 10 percent tariff, we would have to sell the lowest-end SKUs at a loss.”
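Framework's math is easy to reproduce in the abstract. The numbers below are made up for illustration (Framework has not disclosed its costs); the point is only that a SKU priced against a thin margin at a 0 percent tariff goes underwater once a 10 percent duty lands on the import cost:

```python
def post_tariff_margin(retail_price: float, import_cost: float,
                       tariff_rate: float) -> float:
    """Gross margin left after a tariff is applied to the import cost.

    Illustrative sketch only -- these are hypothetical numbers,
    not Framework's actual cost structure.
    """
    landed_cost = import_cost * (1 + tariff_rate)
    return retail_price - landed_cost

# A low-end SKU priced with a slim margin when tariffs were 0 percent:
price, cost = 899.0, 840.0
print(post_tariff_margin(price, cost, 0.00))  # small but positive margin
print(post_tariff_margin(price, cost, 0.10))  # negative: sold at a loss
```

A 10 percent duty on the full import cost can easily exceed the entire gross margin on entry-level hardware, which is why the cheapest SKUs are the ones being pulled.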

“Other consumer goods makers have performed the same calculations and taken the same actions, though most have not been open about it,” Framework said. Nintendo also paused US preorders for its upcoming Switch 2 console last week after the tariffs were announced.

For now, Framework’s sales pause affects at least two specific laptop configurations: the Intel Core Ultra 5 125H and AMD Ryzen 5 7640U versions of the Framework Laptop 13. As of April 1, Framework was selling pre-built versions of those laptops for $999 and $899, respectively. Without those options, the cheapest versions of those laptops start at $1,399 and $1,499.

Framework “temporarily pausing” some laptop sales because of new tariffs Read More »