Google Gemini

Gemini “coming together in really awesome ways,” Google says after 2.5 Pro release


Google’s Tulsee Doshi talks vibes and efficiency in Gemini 2.5 Pro.

Google was caught flat-footed by the sudden skyrocketing interest in generative AI despite its role in developing the underlying technology. This prompted the company to refocus its considerable resources on catching up to OpenAI. Since then, we’ve seen the detail-flubbing Bard and numerous versions of the multimodal Gemini models. While Gemini has struggled to make progress in benchmarks and user experience, that could be changing with the new 2.5 Pro (Experimental) release. With big gains in benchmarks and vibes, this might be the first Google model that can make a dent in ChatGPT’s dominance.

We recently spoke to Google’s Tulsee Doshi, director of product management for Gemini, to talk about the process of releasing Gemini 2.5, as well as where Google’s AI models are going in the future.

Welcome to the vibes era

Google may have had a slow start in building generative AI products, but the Gemini team has picked up the pace in recent months. The company released Gemini 2.0 in December, showing a modest improvement over the 1.5 branch. It only took three months to reach 2.5, meaning Gemini 2.0 Pro wasn’t even out of the experimental stage yet. To hear Doshi tell it, this was the result of Google’s long-term investments in Gemini.

“A big part of it is honestly that a lot of the pieces and the fundamentals we’ve been building are now coming together in really awesome ways,” Doshi said. “And so we feel like we’re able to pick up the pace here.”

The process of releasing a new model involves testing a lot of candidates. According to Doshi, Google takes a multilayered approach to inspecting those models, starting with benchmarks. “We have a set of evals, both external academic benchmarks as well as internal evals that we created for use cases that we care about,” she said.

Credit: Google

The team also uses these tests to work on safety, which, as Google points out at every given opportunity, is still a core part of how it develops Gemini. Doshi noted that making a model safe and ready for wide release involves adversarial testing and lots of hands-on time.

But we can’t forget the vibes, which have become an increasingly important part of AI models. There’s great focus on the vibe of outputs—how engaging and useful they are. There’s also the emerging trend of vibe coding, in which you use AI prompts to build things instead of typing the code yourself. For the Gemini team, these concepts are connected. The team uses product and user feedback to understand the “vibes” of the output, be that code or just an answer to a question.

Google has noted on a few occasions that Gemini 2.5 is at the top of the LM Arena leaderboard, which shows that people who have used the model prefer the output by a considerable margin—it has good vibes. That’s certainly a positive place for Gemini to be after a long climb, but there is some concern in the field that too much emphasis on vibes could push us toward models that make us feel good regardless of whether the output is good, a property known as sycophancy.

If the Gemini team has concerns about feel-good models, they’re not letting it show. Doshi mentioned the team’s focus on code generation, which she noted can be optimized for “delightful experiences” without stoking the user’s ego. “I think about vibe less as a certain type of personality trait that we’re trying to work towards,” Doshi said.

Hallucinations are another area of concern with generative AI models. Google has had plenty of embarrassing experiences with Gemini and Bard making things up, but the Gemini team believes they’re on the right path. Gemini 2.5 apparently has set a high-water mark in the team’s factuality metrics. But will hallucinations ever be reduced to the point we can fully trust the AI? No comment on that front.

Don’t overthink it

Perhaps the most interesting thing you’ll notice when using Gemini 2.5 is that it’s very fast compared to other models that use simulated reasoning. Google says it’s building this “thinking” capability into all of its models going forward, which should lead to improved outputs. The expansion of reasoning in large language models in 2024 resulted in a noticeable improvement in the quality of these tools. It also made them even more expensive to run, exacerbating an already serious problem with generative AI.

The larger and more complex an LLM becomes, the more expensive it is to run. Google hasn’t released technical data like parameter count on its newer models—you’ll have to go back to the 1.5 branch to get that kind of detail. However, Doshi explained that Gemini 2.5 is not a substantially larger model than Google’s last iteration, calling it “comparable” in size to 2.0.

Gemini 2.5 is more efficient in one key area: the chain of thought. It’s Google’s first public model to support a feature called Dynamic Thinking, which allows the model to modulate the amount of reasoning that goes into an output. This is just the first step, though.

“I think right now, the 2.5 Pro model we ship still does overthink for simpler prompts in a way that we’re hoping to continue to improve,” Doshi said. “So one big area we are investing in is Dynamic Thinking as a way to get towards our [general availability] version of 2.5 Pro where it thinks even less for simpler prompts.”

Gemini models on phone

Credit: Ryan Whitwam

Google doesn’t break out earnings from its new AI ventures, but we can safely assume there’s no profit to be had. No one has managed to turn these huge LLMs into a viable business yet. OpenAI, which has the largest user base with ChatGPT, loses money even on the users paying for its $200 Pro plan. Google is planning to spend $75 billion on AI infrastructure in 2025, so it will be crucial to make the most of this very expensive hardware. Building models that don’t waste cycles on overthinking “Hi, how are you?” could be a big help.

Missing technical details

Google plays it close to the chest with Gemini, but the 2.5 Pro release has offered more insight into where the company plans to go than ever before. To really understand this model, though, we’ll need to see the technical report. Google last released such a document for Gemini 1.5. We still haven’t seen the 2.0 version, and we may never see that document now that 2.5 has supplanted 2.0.

Doshi notes that 2.5 Pro is still an experimental model, so don’t expect full evaluation reports right away. A Google spokesperson clarified that a full technical evaluation report on the 2.5 branch is planned, but there is no firm timeline. Google hasn’t even released updated model cards for Gemini 2.0, let alone 2.5. These documents are brief one-page summaries of a model’s training, intended use, evaluation data, and more. They’re essentially LLM nutrition labels. A model card is much less detailed than a technical report, but it’s better than nothing. Google confirms model cards are on the way for Gemini 2.0 and 2.5.

Given the recent rapid pace of releases, it’s possible Gemini 2.5 Pro could be rolling out more widely around Google I/O in May. We certainly hope Google has more details when the 2.5 branch expands. As Gemini development picks up steam, transparency shouldn’t fall by the wayside.

Ryan Whitwam is a senior technology reporter at Ars Technica, covering the ways Google, AI, and mobile technology continue to change the world. Over his 20-year career, he’s written for Android Police, ExtremeTech, Wirecutter, NY Times, and more. He has reviewed more phones than most people will ever own. You can follow him on Bluesky, where you will see photos of his dozens of mechanical keyboards.

Google shakes up Gemini leadership, Google Labs head taking the reins

On the heels of releasing its most capable AI model yet, Google is making some changes to the Gemini team. A new report from Semafor reveals that longtime Googler Sissie Hsiao will step down from her role leading the Gemini team effective immediately. In her place, Google is appointing Josh Woodward, who currently leads Google Labs.

According to a memo from DeepMind CEO Demis Hassabis, this change is designed to “sharpen our focus on the next evolution of the Gemini app.” This new responsibility won’t take Woodward away from his role at Google Labs—he will remain in charge of that division while leading the Gemini team.

Meanwhile, Hsiao says in a message to employees that she is happy with “Chapter 1” of the Bard story and is optimistic for Woodward’s “Chapter 2.” Hsiao won’t be involved in Google’s AI efforts for now—she’s opted to take some time off before returning to Google in a new role.

Hsiao has been at Google for 19 years and was tasked with building Google’s chatbot in 2022. At the time, Google was reeling after ChatGPT took the world by storm using the very transformer architecture that Google originally invented. Initially, the team’s chatbot efforts were known as Bard before being unified under the Gemini brand at the end of 2023.

This process has been a bit of a slog, with Google’s models improving slowly while simultaneously worming their way into many beloved products. However, the sense inside the company is that Gemini has turned a corner with 2.5 Pro. While this model is still in the experimental stage, it has bested other models in academic benchmarks and has blown right past them in all-important vibemarks like LM Arena.

Gemini 2.5 Pro is here with bigger numbers and great vibes

Just a few months after releasing its first Gemini 2.0 AI models, Google is upgrading again. The company says the new Gemini 2.5 Pro Experimental is its “most intelligent” model yet, offering a massive context window, multimodality, and reasoning capabilities. Google points to a raft of benchmarks that show the new Gemini clobbering other large language models (LLMs), and our testing seems to back that up—Gemini 2.5 Pro is one of the most impressive generative AI models we’ve seen.

Gemini 2.5, like all Google’s models going forward, has reasoning built in. The AI essentially fact-checks itself along the way to generating an output. We like to call this “simulated reasoning,” as there’s no evidence that this process is akin to human reasoning. However, it can go a long way to improving LLM outputs. Google specifically cites the model’s “agentic” coding capabilities as a beneficiary of this process. Gemini 2.5 Pro Experimental can, for example, generate a full working video game from a single prompt. We’ve tested this, and it works with the publicly available version of the model.

Gemini 2.5 Pro builds a game in one step.

Google says a lot of things about Gemini 2.5 Pro; it’s smarter, it’s context-aware, it thinks—but it’s hard to quantify what constitutes improvement in generative AI bots. There are some clear technical upsides, though. Gemini 2.5 Pro comes with a 1 million token context window, which is common for the big Gemini models but massive compared to competing models like OpenAI GPT or Anthropic Claude. You could feed multiple very long books to Gemini 2.5 Pro in a single prompt, and the output maxes out at 64,000 tokens. That’s the same as Flash 2.0, but it’s still objectively a lot of tokens compared to other LLMs.
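To put the 1 million token figure in perspective, here is a rough sketch of a context budget check. The window and output limits come from the text above; the tokens-per-word ratio and the book lengths are assumptions for illustration only, since real token counts depend on the tokenizer and the text.

```python
# Limits quoted in the article.
CONTEXT_WINDOW_TOKENS = 1_000_000
MAX_OUTPUT_TOKENS = 64_000

# ASSUMPTION: roughly 130 tokens per 100 English words. This is a common
# rule of thumb, not a Gemini-specific figure.
ASSUMED_TOKENS_PER_100_WORDS = 130


def estimated_tokens(word_count: int) -> int:
    """Estimate the token count for a text of the given length in words."""
    return word_count * ASSUMED_TOKENS_PER_100_WORDS // 100


def fits_in_context(word_counts: list[int]) -> bool:
    """Return True if all texts together fit in the input context window."""
    return sum(estimated_tokens(w) for w in word_counts) <= CONTEXT_WINDOW_TOKENS


# Hypothetical word counts for three very long books.
books = [200_000, 250_000, 180_000]
total_tokens = sum(estimated_tokens(w) for w in books)
print(total_tokens, fits_in_context(books))
```

Under these assumptions, three long novels use about four-fifths of the window, which is the sense in which "multiple very long books" can go into a single prompt.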

Naturally, Google has run Gemini 2.5 Pro Experimental through a battery of benchmarks, in which it scores a bit higher than other AI systems. For example, it squeaks past OpenAI’s o3-mini in GPQA and AIME 2025, which measure how well the AI answers complex questions about science and math, respectively. It also set a new record in the Humanity’s Last Exam benchmark, which consists of 3,000 questions curated by domain experts. Google’s new AI managed a score of 18.8 percent to OpenAI’s 14 percent.

Gemini gets new coding and writing tools, plus AI-generated “podcasts”

On the heels of its release of new Gemini models last week, Google has announced a pair of new features for its flagship AI product. Starting today, Gemini has a new Canvas feature that lets you draft, edit, and refine documents or code. Gemini is also getting Audio Overviews, a neat capability that first appeared in the company’s NotebookLM product, but it’s getting even more useful as part of Gemini.

Canvas is similar (confusingly) to the OpenAI product of the same name. Canvas is available in the Gemini prompt bar on the web and mobile app. Simply upload a document and tell Gemini what you need to do with it. In Google’s example, the user asks for a speech based on a PDF containing class notes. And just like that, Gemini spits out a document.

Canvas lets you refine the AI-generated documents right inside Gemini. The writing tools available across the Google ecosystem, with options like suggested edits and different tones, are available inside the Gemini-based editor. If you want to do more edits or collaborate with others, you can export the document to Google Docs with a single click.

Gemini Canvas with tic-tac-toe game

Credit: Google

Canvas is also adept at coding. Just ask, and Canvas can generate prototype web apps, Python scripts, HTML, and more. You can ask Gemini about the code, make alterations, and even preview your results in real time inside Gemini as you (or the AI) make changes.

Farewell Photoshop? Google’s new AI lets you edit images by asking.


New AI allows no-skill photo editing, including adding objects and removing watermarks.

A collection of images either generated or modified by Gemini 2.0 Flash (Image Generation) Experimental. Credit: Google / Ars Technica

There’s a new Google AI model in town, and it can generate or edit images as easily as it can create text—as part of its chatbot conversation. The results aren’t perfect, but it’s quite possible everyone in the near future will be able to manipulate images this way.

Last Wednesday, Google expanded access to Gemini 2.0 Flash’s native image-generation capabilities, making the experimental feature available to anyone using Google AI Studio. Previously limited to testers since December, the multimodal technology integrates both native text and image processing capabilities into one AI model.

The new model, titled “Gemini 2.0 Flash (Image Generation) Experimental,” flew somewhat under the radar last week, but it has been garnering more attention over the past few days due to its ability to remove watermarks from images, albeit with artifacts and a reduction in image quality.

That’s not the only trick. Gemini 2.0 Flash can add objects, remove objects, modify scenery, change lighting, attempt to change image angles, zoom in or out, and perform other transformations—all to varying levels of success depending on the subject matter, style, and image in question.

To pull it off, Google trained Gemini 2.0 on a large dataset of images (converted into tokens) and text. The model’s “knowledge” about images occupies the same neural network space as its knowledge about world concepts from text sources, so it can directly output image tokens that get converted back into images and fed to the user.

Adding a water-skiing barbarian to a photograph with Gemini 2.0 Flash. Credit: Google / Benj Edwards

Incorporating image generation into an AI chat isn’t itself new—OpenAI integrated its image-generator DALL-E 3 into ChatGPT last September, and other tech companies like xAI followed suit. But until now, every one of those AI chat assistants called on a separate diffusion-based AI model (which uses a different synthesis principle than LLMs) to generate images, which were then returned to the user within the chat interface. In this case, Gemini 2.0 Flash is both the large language model (LLM) and AI image generator rolled into one system.

Interestingly, OpenAI’s GPT-4o is capable of native image output as well (and OpenAI President Greg Brockman teased the feature at one point on X last year), but that company has yet to release true multimodal image output capability. One possible reason is that true multimodal image output is very computationally expensive, since each image either inputted or generated is composed of tokens that become part of the context that runs through the image model again and again with each successive prompt. And given the compute needs and size of the training data required to create a truly visually comprehensive multimodal model, the output quality of the images isn’t necessarily as good as diffusion models just yet.
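The cost argument can be made concrete with a small sketch. It assumes every prompt and every generated image stays in the conversation context, and it uses made-up token counts (1,000 tokens per image, 50 per prompt). Neither Google nor OpenAI has published these figures, so treat the numbers as placeholders.

```python
def context_sizes(turns: int, tokens_per_image: int, tokens_per_prompt: int) -> list[int]:
    """Context length the model must read at each turn of an image-editing chat.

    ASSUMPTION: each turn is one text prompt followed by one generated image,
    and everything produced so far remains in the context.
    """
    sizes = []
    context = 0
    for _ in range(turns):
        context += tokens_per_prompt   # the user's new request joins the context
        sizes.append(context)          # tokens the model reads to answer this turn
        context += tokens_per_image    # the generated image joins the context too
    return sizes


# Hypothetical token counts, chosen only to show the growth pattern.
sizes = context_sizes(turns=4, tokens_per_image=1000, tokens_per_prompt=50)
total_processed = sum(sizes)
print(sizes, total_processed)
```

The per-turn context grows linearly, so the total number of tokens read over a session grows roughly with the square of the number of edits. That is one way to see why iterative image editing in a single model gets expensive.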

Creating another angle of a person with Gemini 2.0 Flash. Credit: Google / Benj Edwards

Another reason OpenAI has held back may be “safety”-related: In a similar way to how multimodal models trained on audio can absorb a short clip of a sample person’s voice and then imitate it flawlessly (this is how ChatGPT’s Advanced Voice Mode works, with a clip of a voice actor it is authorized to imitate), multimodal image output models are capable of faking media reality in a relatively effortless and convincing way, given proper training data and compute behind it. With a good enough multimodal model, potentially life-wrecking deepfakes and photo manipulations could become even more trivial to produce than they are now.

Putting it to the test

So, what exactly can Gemini 2.0 Flash do? Notably, its support for conversational image editing allows users to iteratively refine images through natural language dialogue across multiple successive prompts. You can talk to it and tell it what you want to add, remove, or change. It’s imperfect, but it’s the beginning of a new type of native image editing capability in the tech world.

We gave Gemini Flash 2.0 a battery of informal AI image-editing tests, and you’ll see the results below. For example, we removed a rabbit from an image in a grassy yard. We also removed a chicken from a messy garage. Gemini fills in the background with its best guess. No need for a clone brush—watch out, Photoshop!

We also tried adding synthesized objects to images. Being always wary of the collapse of media reality, called the “cultural singularity,” we added a UFO to a photo the author took from an airplane window. Then we tried adding a Sasquatch and a ghost. The results were unrealistic, but this model was also trained on a limited image dataset (more on that below).

Adding a UFO to a photograph with Gemini 2.0 Flash. Credit: Google / Benj Edwards

We then added a video game character to a photo of an Atari 800 screen (Wizard of Wor), resulting in perhaps the most realistic image synthesis result in the set. You might not see it here, but Gemini added realistic CRT scanlines that matched the monitor’s characteristics pretty well.

Adding a monster to an Atari video game with Gemini 2.0 Flash. Credit: Google / Benj Edwards

Gemini can also warp an image in novel ways, like “zooming out” of an image into a fictional setting or giving an EGA-palette character a body, then sticking him into an adventure game.

“Zooming out” on an image with Gemini 2.0 Flash. Credit: Google / Benj Edwards

And yes, you can remove watermarks. We tried removing a watermark from a Getty Images image, and it worked, although the resulting image is nowhere near the resolution or detail quality of the original. Ultimately, if your brain can picture what an image is like without a watermark, so can an AI model. It fills in the watermark space with the most plausible result based on its training data.

Removing a watermark with Gemini 2.0 Flash. Credit: Nomadsoul1 via Getty Images

And finally, we know you’ve likely missed seeing barbarians beside TV sets (as per tradition), so we gave that a shot. Originally, Gemini didn’t add a CRT TV set to the barbarian image, so we asked for one.

Adding a TV set to a barbarian image with Gemini 2.0 Flash. Credit: Google / Benj Edwards

Then we set the TV on fire.

Setting the TV set on fire with Gemini 2.0 Flash. Credit: Google / Benj Edwards

All in all, it doesn’t produce images of pristine quality or detail, but we literally did no editing work on these images other than typing requests. Adobe Photoshop currently lets users manipulate images using AI synthesis based on written prompts with “Generative Fill,” but it’s not quite as natural as this. We could see Adobe adding a more conversational AI image-editing flow like this one in the future.

Multimodal output opens up new possibilities

Having true multimodal output opens up interesting new possibilities in chatbots. For example, Gemini 2.0 Flash can play interactive graphical games or generate stories with consistent illustrations, maintaining character and setting continuity throughout multiple images. It’s far from perfect, but character consistency is a new capability in AI assistants. We tried it out and it was pretty wild—especially when it generated a view of a photo we provided from another angle.

Creating a multi-image story with Gemini 2.0 Flash, part 1. Credit: Google / Benj Edwards

Text rendering represents another potential strength of the model. Google claims that internal benchmarks show Gemini 2.0 Flash performs better than “leading competitive models” when generating images containing text, making it potentially suitable for creating content with integrated text. From our experience, the results weren’t that exciting, but they were legible.

An example of in-image text rendering generated with Gemini 2.0 Flash. Credit: Google / Ars Technica

Despite Gemini 2.0 Flash’s shortcomings so far, the emergence of true multimodal image output feels like a notable moment in AI history because of what it suggests if the technology continues to improve. If you imagine a future, say 10 years from now, where a sufficiently complex AI model could generate any type of media in real time—text, images, audio, video, 3D graphics, 3D-printed physical objects, and interactive experiences—you basically have a holodeck, but without the matter replication.

Coming back to reality, it’s still “early days” for multimodal image output, and Google recognizes that. Recall that Flash 2.0 is intended to be a smaller AI model that is faster and cheaper to run, so it hasn’t absorbed the entire breadth of the Internet. All that information takes a lot of space in terms of parameter count, and more parameters means more compute. Instead, Google trained Gemini 2.0 Flash by feeding it a curated dataset that also likely included targeted synthetic data. As a result, the model does not “know” everything visual about the world, and Google itself says the training data is “broad and general, not absolute or complete.”

That’s just a fancy way of saying that the image output quality isn’t perfect—yet. But there is plenty of room for improvement in the future to incorporate more visual “knowledge” as training techniques advance and compute drops in cost. If the process becomes anything like we’ve seen with diffusion-based AI image generators like Stable Diffusion, Midjourney, and Flux, multimodal image output quality may improve rapidly over a short period of time. Get ready for a completely fluid media reality.

Benj Edwards is Ars Technica’s Senior AI Reporter and founded the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

End of Life: Gemini will completely replace Google Assistant later this year

Not all devices can simply download an updated app—after almost a decade, Assistant is baked into many Google products. The company says Google-powered cars, watches, headphones, and other devices that use Assistant will receive updates that transition them to Gemini. It’s unclear if all Assistant-powered gadgets will be part of the migration. Most of these devices connect to your phone, so the update should be relatively straightforward, even for accessories that launched early in the Assistant era.

There are also plenty of standalone devices that run Assistant, like TVs and smart speakers. Google says it’s working on updated Gemini experiences for those devices. For example, there’s a Gemini preview program for select Google Nest speakers. It’s unclear if all these devices will get updates. Google says there will be more details on this in the coming months.

Meanwhile, Gemini still has some ground to make up. There are basic features that work fine in Assistant, like setting timers and alarms, that can go sideways with Gemini. On the other hand, Assistant had its fair share of problems and didn’t exactly win a lot of fans. Regardless, this transition could be fraught with danger for Google as it upends how people interact with their devices.

AI search engines cite incorrect sources at an alarming 60% rate, study says

A new study from Columbia Journalism Review’s Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news sources.

Researchers Klaudia Jaźwińska and Aisvarya Chandrasekar noted in their report that roughly 1 in 4 Americans now use AI models as alternatives to traditional search engines. This raises serious concerns about reliability, given the substantial error rate uncovered in the study.

Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.

A graph from CJR shows “confidently wrong” search results. Credit: CJR

For the tests, researchers fed direct excerpts from actual news articles to the AI models, then asked each model to identify the article’s headline, original publisher, publication date, and URL. They ran 1,600 queries across the eight different generative search tools.
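The figures quoted in this story are easy to cross-check. The short sketch below assumes the 1,600 queries were split evenly across the eight tools, which matches the "134 out of 200" figure cited for ChatGPT Search; the function name is ours, not CJR's.

```python
# Figures quoted in the article.
TOTAL_QUERIES = 1600
NUM_TOOLS = 8

# ASSUMPTION: queries were divided evenly among the tools.
queries_per_tool = TOTAL_QUERIES // NUM_TOOLS


def error_rate_percent(incorrect: int, total: int) -> float:
    """Share of queries answered incorrectly, as a percentage."""
    return 100 * incorrect / total


# ChatGPT Search: 134 incorrect identifications out of its 200 queries.
chatgpt_rate = error_rate_percent(134, queries_per_tool)
print(queries_per_tool, chatgpt_rate)
```

This works out to 200 queries per tool and a 67 percent error rate for ChatGPT Search, consistent with the numbers reported above.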

The study highlighted a common trend among these AI models: rather than declining to respond when they lacked reliable information, the models frequently provided confabulations—plausible-sounding incorrect or speculative answers. The researchers emphasized that this behavior was consistent across all tested models, not limited to just one tool.

Surprisingly, premium paid versions of these AI search tools fared even worse in certain respects. Perplexity Pro ($20/month) and Grok 3’s premium service ($40/month) confidently delivered incorrect responses more often than their free counterparts. Though these premium models correctly answered a higher number of prompts, their reluctance to decline uncertain responses drove higher overall error rates.

Issues with citations and publisher control

The CJR researchers also uncovered evidence suggesting some AI tools ignored Robots Exclusion Protocol settings, which publishers use to tell crawlers which content is off-limits. For example, Perplexity’s free version correctly identified all 10 excerpts from paywalled National Geographic content, despite National Geographic explicitly disallowing Perplexity’s web crawlers.
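For readers unfamiliar with the protocol, here is a minimal sketch of how a well-behaved crawler would consult a robots.txt file, using Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URL, and the `may_fetch` helper are illustrative assumptions; this is not National Geographic's actual file or any vendor's crawler code.

```python
from urllib.robotparser import RobotFileParser

# HYPOTHETICAL robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed robots.txt permits this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("PerplexityBot", "https://example.com/article"))
print(may_fetch("SomeOtherBot", "https://example.com/article"))
```

Note that the protocol is purely advisory. The check only reports what the publisher has asked for, and nothing in it stops a crawler that chooses not to run it, which is the behavior the researchers say they observed.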

Gmail gains Gemini-powered “Add to calendar” button

Google has a new mission in the AI era: to add Gemini to as many of the company’s products as possible. We’ve already seen Gemini appear in search results, text messages, and more. In Google’s latest update to Workspace, Gemini will be able to add calendar appointments from Gmail with a single click. Well, assuming Gemini gets it right the first time, which is far from certain.

The new calendar button will appear at the top of emails, right next to the summarize button that arrived last year. The calendar option will show up in Gmail threads with actionable meeting chit-chat, allowing you to mash that button to create an appointment in one step. The Gemini sidebar will open to confirm the appointment was made, which is a good opportunity to double-check the robot. There will be a handy edit button in the Gemini window in the event it makes a mistake. However, the robot can’t invite people to these events yet.

The effect of using the button is the same as opening the Gemini panel and asking it to create an appointment. The new functionality is simply detecting events and offering the button as a shortcut of sorts. You should not expect to see this button appear on messages that already have calendar integration, like dining reservations and flights. Those already pop up in Google Calendar without AI.

Google’s AI-powered Pixel Sense app could gobble up all your Pixel 10 data

Google’s AI ambitions know no bounds. A new report claims Google’s next phones will herald the arrival of a feature called Pixel Sense that will ingest data from virtually every Google app on your phone, fueling a new personalized experience. This app could be the premiere feature of the Pixel 10 series expected out late this year.

According to a report from Android Authority, Pixel Sense is the new name for Pixie, an AI that was supposed to integrate with Google Assistant before Gemini became the center of Google’s universe. In late 2023, it looked as though Pixie would be launched on the Pixel 9 series, but that never happened. Now, it’s reportedly coming back as Pixel Sense, and we have more details on how it might work.

Pixel Sense will apparently be able to leverage data you create in apps like Calendar, Gmail, Docs, Maps, Keep Notes, Recorder, Wallet, and almost every other Google app. It can also process media files like screenshots in the same way the Pixel Screenshots app currently does. The goal of collecting all this data is to help you complete tasks faster by suggesting content, products, and names by understanding the context of how you use the phone. Pixel Sense will essentially try to predict what you need without being prompted.

Samsung is pursuing an ostensibly similar goal with Now Brief, a new AI feature available on the Galaxy S25 series. Now Brief collects data from a handful of apps like Samsung Health, Samsung Calendar, and YouTube to distill your important data with AI. However, it rarely offers anything of use with its morning, noon, and night “Now Bar” updates.

Pixel Sense sounds like a more expansive version of this same approach to processing user data—and perhaps the fulfillment of Google Now’s decade-old promise. The supposed list of supported apps is much larger, and they’re apps people actually use. If pouring more and more data into a large language model leads to better insights into your activities, Pixel Sense should be better at guessing what you’ll need. Admittedly, that’s a big “if.”

Gemini Live will learn to peer through your camera lens in a few weeks

At Mobile World Congress, Google confirmed that a long-awaited Gemini AI feature it first teased nearly a year ago is ready for launch. The company’s conversational Gemini Live will soon be able to view live video and screen sharing, a feature Google previously demoed as Project Astra. When Gemini’s video capabilities arrive, you’ll be able to simply show the robot something instead of telling it.

Right now, Google’s multimodal AI can process text, images, and various kinds of documents. However, its ability to accept video as an input is spotty at best—sometimes it can summarize a YouTube video, and sometimes it can’t, for unknown reasons. Later in March, the Gemini app on Android will get a major update to its video functionality. You’ll be able to open your camera to provide Gemini Live a video stream or share your screen as a live video, thus allowing you to pepper Gemini with questions about what it sees.

Gemini Live with video.

It can be hard to keep track of which Google AI project is which; the 2024 Google I/O was largely a celebration of all things Gemini AI. The Astra demo made waves as it demonstrated a more natural way to interact with the AI. In the original video, Google showed how Gemini Live could answer questions in real time as the user swept a phone around a room. It had things to say about code on a computer screen, how speakers work, and a network diagram on a whiteboard. It even remembered where the user left their glasses from an earlier part of the video.

Microsoft’s new AI agent can control software and robots

The researchers’ explanations about how “Set-of-Mark” and “Trace-of-Mark” work. Credit: Microsoft Research

The Magma model introduces two technical components: Set-of-Mark, which identifies objects that can be manipulated in an environment by assigning numeric labels to interactive elements, such as clickable buttons in a UI or graspable objects in a robotic workspace, and Trace-of-Mark, which learns movement patterns from video data. Microsoft says those features allow the model to complete tasks like navigating user interfaces or directing robotic arms to grasp objects.
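The labeling idea behind Set-of-Mark can be illustrated with a toy sketch. This is not Magma's code: the `Element` class, its fields, and the `set_of_mark` function are hypothetical names invented for illustration, and the candidate elements are supplied by hand rather than detected by a vision model.

```python
from dataclasses import dataclass

@dataclass
class Element:
    name: str          # hypothetical description of a detected region
    x: int             # left edge in pixels
    y: int             # top edge in pixels
    interactive: bool  # True if the element can be clicked or grasped

def set_of_mark(elements):
    """Assign numeric marks 1..N to interactive elements in reading order."""
    candidates = [e for e in elements if e.interactive]
    # Reading order: top to bottom, then left to right.
    candidates.sort(key=lambda e: (e.y, e.x))
    return {mark: e for mark, e in enumerate(candidates, start=1)}

# A hand-made "screen"; a real system would get these from perception.
screen = [
    Element("logo", 0, 0, False),
    Element("search box", 120, 0, True),
    Element("submit button", 430, 0, True),
    Element("settings link", 0, 60, True),
]

marks = set_of_mark(screen)
for mark, element in marks.items():
    print(mark, element.name)
```

With numbered marks in place, an action can be expressed as a reference to a label (for example, "click mark 2") rather than as raw pixel coordinates.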

Microsoft Magma researcher Jianwei Yang wrote in a Hacker News comment that the name “Magma” stands for “M(ultimodal) Ag(entic) M(odel) at Microsoft (Rese)A(rch),” after some people noted that “Magma” already belongs to an existing matrix algebra library, which could create some confusion in technical discussions.

Reported improvements over previous models

In its Magma write-up, Microsoft claims Magma-8B performs competitively across benchmarks, showing strong results in UI navigation and robot manipulation tasks.

For example, it scored 80.0 on the VQAv2 visual question-answering benchmark, higher than GPT-4V’s 77.2 but lower than LLaVA-Next’s 81.8. Its POPE score of 87.4 leads all models in the comparison. Magma also reportedly outperforms OpenVLA, an open source vision-language-action model, in multiple robot manipulation tasks.

Magma’s agentic benchmarks, as reported by the researchers. Credit: Microsoft Research

As always, we take AI benchmarks with a grain of salt since many have not been scientifically validated as being able to measure useful properties of AI models. External verification of Microsoft’s benchmark results will become possible once other researchers can access the public code release.

Like all AI models, Magma is not perfect. According to Microsoft’s documentation, it still struggles with complex decision-making that requires multiple steps over time. The company says it continues to work on improving these capabilities through ongoing research.

Yang says Microsoft will release Magma’s training and inference code on GitHub next week, allowing external researchers to build on the work. If Magma delivers on its promise, it could push Microsoft’s AI assistants beyond limited text interactions, enabling them to operate software autonomously and execute real-world tasks through robotics.

Magma is also a sign of how quickly the culture around AI can change. Just a few years ago, this kind of agentic talk scared many people who feared it might lead to AI taking over the world. While some people still fear that outcome, in 2025, AI agents are a common topic of mainstream AI research that regularly takes place without triggering calls to pause all of AI development.

Hugging Face clones OpenAI’s Deep Research in 24 hours

On Tuesday, Hugging Face researchers released an open source AI research agent called “Open Deep Research,” created by an in-house team as a challenge 24 hours after the launch of OpenAI’s Deep Research feature, which can autonomously browse the web and create research reports. The project seeks to match Deep Research’s performance while making the technology freely available to developers.

“While powerful LLMs are now freely available in open-source, OpenAI didn’t disclose much about the agentic framework underlying Deep Research,” writes Hugging Face on its announcement page. “So we decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!”

Similar to both OpenAI’s Deep Research and Google’s own “Deep Research” implementation using Gemini (first introduced in December, before OpenAI’s), Hugging Face’s solution adds an “agent” framework to an existing AI model, allowing it to perform multi-step tasks such as collecting information and building a report as it goes, which it presents to the user at the end.
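To make the "agent framework" idea concrete, here is a minimal sketch of a multi-step loop. It is not Hugging Face's implementation: `run_agent`, `scripted_policy`, and the two tools are hypothetical names, and a scripted stub stands in for the AI model that would normally choose each step.

```python
def run_agent(question, choose_action, tools, max_steps=5):
    """Repeatedly pick a tool, run it, and collect notes into a report."""
    notes = []
    for _ in range(max_steps):
        action, argument = choose_action(question, notes)
        if action == "finish":
            break
        notes.append(tools[action](argument))
    # The accumulated notes are the report shown to the user at the end.
    return "\n".join(notes)

def scripted_policy(question, notes):
    """Stand-in for the model: search first, then read, then stop."""
    if not notes:
        return "search", question
    if len(notes) == 1:
        return "read", notes[0]
    return "finish", None

# Stub tools; a real agent would call a search engine and a web browser.
tools = {
    "search": lambda query: f"result for {query}",
    "read": lambda source: f"summary of {source}",
}

report = run_agent("GAIA question", scripted_policy, tools)
print(report)
```

The `max_steps` cap is a design choice in this sketch: it keeps the loop from running forever if the policy never decides to finish.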

The open source clone is already racking up comparable benchmark results. After only a day’s work, Hugging Face’s Open Deep Research has reached 55.15 percent accuracy on the General AI Assistants (GAIA) benchmark, which tests an AI model’s ability to gather and synthesize information from multiple sources. OpenAI’s Deep Research scored 67.36 percent accuracy on the same benchmark.

As Hugging Face points out in its post, GAIA includes complex multi-step questions such as this one:

Which of the fruits shown in the 2008 painting “Embroidery from Uzbekistan” were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film “The Last Voyage”? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o’clock position. Use the plural form of each fruit.

To correctly answer that type of question, the AI agent must seek out multiple disparate sources and assemble them into a coherent answer. Many of the questions in GAIA represent no easy task, even for a human, so they test agentic AI’s mettle quite well.
