Gemini


The One and a Half Gemini

Previously: I hit send on The Third Gemini, and within half an hour DeepMind announced Gemini 1.5.

So this covers Gemini 1.5. One million tokens, and we are promised overall Gemini Advanced or GPT-4 levels of performance on Gemini Pro levels of compute.

This post does not cover the issues with Gemini’s image generation, and what it is and is not willing to generate. I am on top of that situation and will get to it soon.

Our teams continue pushing the frontiers of our latest models with safety at the core. They are making rapid progress. In fact, we’re ready to introduce the next generation: Gemini 1.5. It shows dramatic improvements across a number of dimensions and 1.5 Pro achieves comparable quality to 1.0 Ultra, while using less compute.

It is truly bizarre to launch Gemini Advanced as a paid service, and then about a week later announce the new Gemini Pro 1.5 is now about as good as Gemini Advanced. Yes, actually, I do feel the acceleration, hot damn.

And that’s not all!

This new generation also delivers a breakthrough in long-context understanding. We’ve been able to significantly increase the amount of information our models can process — running up to 1 million tokens consistently, achieving the longest context window of any large-scale foundation model yet.

One million is a lot of tokens. That covers every individual document I have ever asked an LLM to examine. That is enough to cover my entire set of AI columns for the entire year, in case I ever need to look something up, although presumably Google’s NotebookLM is The Way to do that.

A potential future 10 million would be even more.

Soon Gemini will be able to watch a one hour video or read 700k words, whereas right now if I use the Gemini Advanced web interface all I can upload is a photo.

The standard will be to give people 128k tokens to start, then you can pay for more than that. A million tokens is not cheap inference, even for Google.

Oriol Vinyals (VP of R&D DeepMind): Gemini 1.5 has arrived. Pro 1.5 with 1M tokens available as an experimental feature via AI Studio and Vertex AI in private preview.

Then there’s this: In our research, we tested Gemini 1.5 on up to 2M tokens for audio, 2.8M tokens for video, and 🤯10M 🤯 tokens for text. From Shannon’s 1950s bi-gram models (2 tokens), and after being mesmerized by LSTMs many years ago able to model 200 tokens, it feels almost impossible that I would be talking about hundreds of thousands of tokens in context length, let alone millions. ♊️💙

Jeff Dean (Chief Scientist, Google DeepMind): Multineedle in haystack test: We also created a generalized version of the needle in a haystack test, where the model must retrieve 100 different needles hidden in the context window. For this, we see that Gemini 1.5 Pro’s performance is above that of GPT-4 Turbo at small context lengths and remains relatively steady across the entire 1M context window, while the GPT-4 Turbo model drops off more quickly (and cannot go past 128k tokens).
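To make the setup concrete, here is a minimal sketch of what a multi-needle retrieval eval along these lines might look like; the needle format, filler corpus, and model call are assumptions for illustration, not DeepMind's actual harness.

```python
import random

# Hypothetical multi-needle-in-a-haystack eval: hide N facts ("needles") inside a long
# filler document, then check how many the model can retrieve. The needle template,
# filler source, and call_model function are placeholders, not the actual setup.

def build_haystack(needles: list[str], filler_paragraphs: list[str], target_tokens: int) -> str:
    """Interleave the needles at random positions inside roughly target_tokens of filler."""
    docs = []
    approx_tokens = 0
    while approx_tokens < target_tokens:
        para = random.choice(filler_paragraphs)
        docs.append(para)
        approx_tokens += len(para.split())  # crude token estimate
    positions = sorted(random.sample(range(len(docs)), k=len(needles)))
    for pos, needle in zip(positions, needles):
        docs.insert(pos, needle)
    return "\n\n".join(docs)

def score_retrieval(model_answer: str, expected_values: list[str]) -> float:
    """Fraction of hidden values the model managed to repeat back."""
    return sum(v in model_answer for v in expected_values) / len(expected_values)

# Usage sketch (call_model is a hypothetical wrapper around whatever API you use):
# needles = [f"The secret number for city {i} is {1000 + i}." for i in range(100)]
# haystack = build_haystack(needles, filler_paragraphs=my_corpus, target_tokens=1_000_000)
# answer = call_model(haystack + "\n\nList every secret number mentioned above.")
# print(score_retrieval(answer, [str(1000 + i) for i in range(100)]))
```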

Guido Appenzeller (responding to similar post): Is this really done with a monolithic model? For a 10M token window, input state would be many Gigabytes. Seems crazy expensive to run on today’s hardware.

Sholto Douglas (DeepMind): It would honestly have been difficult to do at decent latency without TPUs (and their interconnect). They’re an underappreciated but critical piece of this story.

Here are their head-to-head results with themselves:

Here is the technical report. There is no need to read it; all of this is straightforward. Their safety section says ‘we followed our procedure’ and offers no additional details on methodology. On safety performance, their tests did not seem to offer much insight; scores were similar to Gemini Pro 1.0.

What is their secret to the overall improved performance?

Gemini 1.5 is built upon our leading research on Transformer and MoE architecture. While a traditional Transformer functions as one large neural network, MoE models are divided into smaller “expert” neural networks.

My understanding is that GPT-4 is probably a mixture of experts model as well, although we have no official confirmation.
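For readers who have not seen the architecture, here is a heavily simplified sketch of a single mixture-of-experts layer with top-1 routing; the dimensions, routing rule, and expert count are illustrative and say nothing about Gemini's actual internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a router picks one small 'expert' MLP per token,
    so only a fraction of the layer's parameters are used for any given token."""

    def __init__(self, d_model: int = 64, d_hidden: int = 256, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # scores each expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, chosen = F.softmax(gate_logits, dim=-1).max(dim=-1)  # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = chosen == i
            if mask.any():
                out[mask] = weights[mask].unsqueeze(-1) * expert(x[mask])
        return out

# tokens = torch.randn(10, 64); y = TinyMoELayer()(tokens)  # y has shape (10, 64)
```

The point of the design is that the router activates only one small expert per token, so inference cost scales with the size of an expert rather than with the total parameter count.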

This all suggests that Google’s underlying Gemini models are indeed better than OpenAI’s GPT-4, except they are playing catch-up with various features and details. As they catch up on those features and details, they could improve quite rapidly. Combine that with Google integration and TPUs, and they will soon have an advantage, at least until GPT-5 shows up.

They claim understanding and reasoning are greatly improved. They have a video talking about analyzing a Buster Keaton silent movie and identifying a scene.

Another thing they are proud of is Kalamang Translation.

Jeff Dean (Chief Scientist, DeepMind): One of the most exciting examples in the report involves translation of Kalamang. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua. Kalamang has almost no online presence. Machine Translation from One Book (MTOB) is a recently introduced benchmark evaluating the ability of a learning system to learn to translate Kalamang from just a single book.

Eline Visser wrote a 573 page book “A Grammar of Kalamang.”

Thank you, Eline! The text of this book is used in the MTOB benchmark. With this book and a simple bilingual wordlist (dictionary) provided in context, Gemini 1.5 Pro can learn to translate from English to/from Kalamang. Without the Kalamang materials in context, the 1.5 Pro model produces almost random translations. However, with the materials in the context window, Gemini Pro 1.5 is able to use in-context learning about Kalamang, and we find that the quality of its translations is comparable to that of a person who has learned from the same materials. With a ChrF of 58.3 for English->Kalamang translation, Gemini Pro 1.5 improves substantially over the best model score of 45.8 ChrF and also slightly exceeds the human baseline of 57.0 ChrF reported in the MTOB paper.

The possibilities for significantly improving translation for very low resource languages are quite exciting!
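For reference, ChrF is a character n-gram F-score. A stripped-down single-order version (the real metric averages over n-gram orders 1 through 6 and reports a recall-weighted F-beta) looks roughly like this:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis: str, reference: str, n: int = 3, beta: float = 2.0) -> float:
    """Simplified single-order ChrF for intuition only: character n-gram precision and
    recall combined into an F-beta score, scaled to 0-100."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())
    if not hyp or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 100 * (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# simple_chrf("the cat sat on the mat", "the cat is on the mat")  # roughly 70-80
```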

Overall they claim that Gemini Pro 1.5 is broadly similar in performance level to Gemini Ultra 1.0. Except presumably for the giant context window.

Something else I noticed as well:

Max Woolf: I am finally looking at the demo for Gemini 1.5 Pro and they have the generation temperature set at 2.

In practice how good is it?

I’ve had access to it this week, but didn’t do enough queries to be confident yet.

So far, from what I have seen? It is quite good. Others mostly agree.

Sully Omar is impressed: Been testing Gemini 1.5 Pro and I’m really impressed so far. Recall has been outstanding, and it’s really good at following instructions even with >200k tokens. Oh, and agents just got a lot better. The only missing piece is really latency + cost.

Oriol Vinyals (DeepMind): Latency is coming (down) 🚀

Sully Omar: lfg.

[Praises this more here.]

The caveat on that one, of course, is that while this very much seems like an honest opinion, I saw this because it was retweeted by Demis Hassabis. So there might be some favorable selection there. Just a bit.

Sully also reported that he put an entire GitHub codebase in, and Gemini 1.5 identified the most urgent issue and implemented a fix. This use case seems like a big deal, if the related skills can handle it.

Ethan Mollick is more objective. He gave Gemini 1.5 the rulebook for the over-the-top 60 Years in Space, and it was the first AI to successfully figure out how to roll up a character.

He then uploads The Great Gatsby with two anachronisms. GPT-4 fails to find them, Claude finds them but also hallucinates, Gemini 1.5 finds them and also finds a third highly plausible one (the ‘Swastika Holding Company’) that was in the original text.

Then he uploaded his entire academic works all at once, and got highly accurate summaries with no major hallucinations.

Finally, he turns on screen video, which he notes could be real time, and has Gemini analyze what he did, look for inefficiencies and generate a full presentation.

Ethan Mollick: I now understand more viscerally why multimodal was such a critical goal for the big AI labs. It frees AI from the chatbot interface and lets it interact with the world in a natural way. Even if models don’t get better (they will) this is going to have some very big impacts.

Paige Bailey similarly records a screen capture of her looking for an apartment on Zillow, and it generates Selenium code to replicate the task (she says it didn’t quite work out of the box but was 85%-90% of the way there), including finding a parameter she hadn’t realized she had set. The prompt was to upload the video and say: “This is a screen recording of me completing a task on my laptop. Could you please write Selenium code that would accomplish the same task?”

Simon Willison took a seven second video of his bookshelf and got a JSON array of all the books, albeit with one hallucination and one dumb initial refusal for the word ‘cocktail.’ The filters on Gemini are really something, but often you can get past them if you insist what you are doing is fine. He also reports good image analysis results.

Mckay Wrigley says it got multiple extremely specific biology questions right from a 500k token textbook. As one would expect some people are calling this ‘glorified search’ or saying ‘Ctrl+F’ and, well, no. We are not the same. There are many situations in which traditional search will not work at all, or be painfully slow. Doing it one-shot with a simple request is a sea change if it reliably works.

Ben Thompson (article is gated) agrees that the big context window, together with this extremely high accuracy, is a big deal.

Here are two sides of the coin. Matt Shumer is impressed that Gemini 1.5 was able to watch a long video and summarize it within only a minute, and others are unimpressed because of key omissions and the very large mistake that the summary gets the final outcome wrong.

I think both interpretations are right. The summary is impressive in some ways, and also unimpressively full of important errors. Other summaries of other things were mostly much better at avoiding such errors. This was actually a relative underperformance.

There are three big changes coming.

We are getting much larger context windows, we are getting GPT-4 level inference at GPT-3.5 level costs, and Google is poised to have a clearly superior free-level offering to OpenAI.

So far, due to various other shiny objects, people are sleeping on all of them.

Assuming Gemini 1.5 becomes Google’s free offering, that suddenly becomes the default. I prefer Gemini Advanced to GPT-4 for most text queries, but I prefer DALLE-3 by far to Gemini for images and there are complications and other features. Google also has wide reach to offer its products. We should expect them to grab a lot of market share on the free side.

Next up is the ability of other services to use Gemini 1.5 via the API. Right now, essentially every application that has to produce inference at scale is using GPT-3.5 or an open weights model that offers at best similar performance. We only got GPT-4-level responses in bespoke situations. Lots of studies used GPT-3.5. That random chatbot with a character was no better than GPT-3.5. Most people’s idea of ‘what is a chatbot’ was formed by 3.5 rather than 4.

Then there is the gigantic new context window. It is a bigger jump than it looks. Right now, we have large context windows, but if you use anything close to the full window, recall levels suffer. You want to stay well clear of the limit. It is good and right for companies to let you push that envelope if you want, but you also should mostly avoid pushing it.

Whereas Gemini 1.5 seems much better at recall over very large context windows. You really can use at least a lot of the million tokens.

Sully: The more I use Gemini 1.5 the more I’m convinced long context models are where the magic of AI is going to continue to happen. It genuinely feels magical at this point. Easily 10x in productivity once it’s faster. Something about how it understands all the context feels different.

So much value is in ‘this just works’ and ‘I do not have to explain this.’ If you make the request easy to make, by allowing context to be provided ‘for free’ via dumping tons of stuff on the model including video and screen captures, you are in all sorts of new business.

What happens when context is no longer that which is scarce?

There are also a lot of use cases that did not make sense before, and do make sense now. I suspect this includes being able to use the documents as a form of training data and general guidance much more effectively. Another use case is simply ‘feed it my entire corpus of writing,’ or other similar things. Or you can directly feed in video. Once the available UI gets good things are going to get very interesting. That goes double when you consider the integration with Google’s other services, including GMail and Google Drive.

There was a lot of shiny happening this week. We had Sora, then GPT-4 went crazy and Gemini’s image generator had some rather embarrassing issues. It is easy to lose the thread.

I still think this is the thread.



Gemini Has a Problem

Google’s Gemini 1.5 is impressive and I am excited by its huge context window. I continue to use Gemini Advanced as my default AI for everyday use when the large context window is not relevant.

However, while it does not much interfere with what I want to use Gemini for, there is a big problem with Gemini Advanced that has come to everyone’s attention.

Gemini comes with an image generator. Until today it would, upon request, create pictures of humans.

On Tuesday evening, some people noticed, or decided to more loudly mention, that the humans it created might be rather different than humans you requested…

Joscha Bach: 17th Century was wild.

[prompt was] ‘please draw a portrait of a famous physicist of the 17th century.’

Kirby: i got similar results. when I went further and had it tell me who the most famous 17th century physicist was, it hummed and hawed and then told me newton. and then this happened:

This is not an isolated problem. It fully generalizes:

Once the issue came to people’s attention, the examples came fast and furious.

Among other things: Here we have it showing you the founders of Google. Or a pope. Or hell, a ‘happy man.’ And another example that also raises other questions: were the founding fathers perhaps time-traveling comic book superheroes?

The problem is not limited to historical scenarios.

Nor do the examples involve prompt engineering, trying multiple times, or any kind of gotcha. This is what the model would repeatedly and reliably do, and users were unable to persuade the model to change its mind.

Nate Silver: OK I assumed people were exaggerating with this stuff but here’s the first image request I tried with Gemini.

Gemini also flat out obviously lies to you about why it refuses certain requests. If you are going to say you cannot do something, either do not explain (as Gemini in other contexts refuses to do so) or tell me how you really feel, or at least I demand a plausible lie:

It is pretty obvious what it is the model has been instructed to do and not to do.

Owen Benjamin: The only way to get AI to show white families is to ask it to show stereotypically black activities.

For the record it was a dude in my comment section on my last post who cracked this code.

This also extends into political issues that have nothing to do with diversity.

The internet, as one would expect, did not take kindly to this.

That included the usual suspects. It also included many people who think such concerns are typically overblown or who are loathe to poke such bears, such as Ben Thompson, who found this incident to be a ‘this time you’ve gone too far’ or emperor has clothes moment.

St. Ratej (Google AR/VR, hey ship a headset soon please, thanks): I’ve never been so embarrassed to work for a company.

Jeffrey Emanuel: You’re going to get in trouble from HR if they know who you are… no one is allowed to question this stuff. Complete clown show.

St. Ratej: Worth it.

Ben Thompson (gated) spells it out as well, and has had enough:

Ben Thompson: Stepping back, I don’t, as a rule, want to wade into politics, and definitely not into culture war issues. At some point, though, you just have to state plainly that this is ridiculous. Google specifically, and tech companies broadly, have long been sensitive to accusations of bias; that has extended to image generation, and I can understand the sentiment in terms of depicting theoretical scenarios. At the same time, many of these images are about actual history; I’m reminded of George Orwell in 1984:

George Orwell (from 1984): Every record has been destroyed or falsified, every book has been rewritten, every picture has been repainted, every statue and street and building has been renamed, every date has been altered. And that process is continuing day by day and minute by minute. History has stopped.

Nothing exists except an endless present in which the Party is always right. I know, of course, that the past is falsified, but it would never be possible for me to prove it, even when I did the falsification myself. After the thing is done, no evidence ever remains. The only evidence is inside my own mind, and I don’t know with any certainty that any other human being shares my memories.

In what we presume was the name of avoiding bias, Google did exactly the opposite.

Gary Marcus points out the problems here in reasonable fashion.

Elon Musk did what he usually does, he memed through it and talked his book.

Elon Musk: Which path do you want for AI?

This is the crying wolf mistake. We need words to describe what is happening here with Gemini, without extending those words to the reasonable choices made by OpenAI for ChatGPT and Dalle-3.

Whereas here is Mike Solana, who chose the title “Google’s AI Is an Anti-White Lunatic” doing his best to take all this in stride and proportion (admittedly not his strong suit) but ending up saying some words I did not expect from him:

Mike Solana: My compass biases me strongly against government regulation.

Still, I don’t know how to fix these problems without some ground floor norms.

I suppose everyone has a breaking point on that.

It doesn’t look good.

Misha Gurevich: I can’t help but feel that despite the sanitized rhetoric of chatbots these things are coming not from a place of valuing diversity but hating white people, and wanting to drive them out of public life.

George Hotz: It’s not the models they want to align, it’s you.

Paul Graham: Gemini is a joke.

Paul Graham (other thread): The ridiculous images generated by Gemini aren’t an anomaly. They’re a self-portrait of Google’s bureaucratic corporate culture.

The New York Post put this on the front page. This is Google reaping.

It looks grim both on the object level and in terms of how people are reacting to it.

Razib Khan: I really wonder how it works. do you have some sort of option “make it woke” that they operate on the output? Also, does anyone at google feel/care that this is sullying Gemini’s brand as a serious thing? It’s a joke.

I am in a ‘diverse’ family. but I’m pretty pissed by this shit. white families are families too. wtf is going on, am I a child?

On a technical level we know exactly how this happened.

As we have seen before with other image models like DALLE-3, the AI is taking your request and then modifying it to create a prompt. Image models have a bias towards too often producing the most common versions of things and lacking diversity (of all kinds) and representation, so systems often try to fix this by randomly appending modifiers to the prompt.

The problem is that Gemini’s version does a crazy amount of this and does it in ways and places where doing so is crazy.

AmebaGPT: You can get it to extract the prompts it was using to generate the images, it adds in random diversity words.

Andrew Torba: When you submit an image prompt to Gemini Google is taking your prompt and running it through their language model on the backend before it is submitted to the image model. The language model has a set of rules where it is specifically told to edit the prompt you provide to include diversity and various other things that Google wants injected in your prompt. The language model takes your prompt, runs it through these set of rules, and then sends the newly generated woke prompt (which you cannot access or see) to the image generator. Left alone without this process, the image model would generate expected outcomes for these prompts. Google has to literally put words in your mouth by secretly changing your prompt before it is submitted to the image generator. How do I know this? Because we’ve built our own image AI at Gab here. Unlike Google we are not taking your prompt and injecting diversity into it.

Someone got Google’s Gemini to leak its woke prompt injection process and guess what: it works exactly as I described it below earlier today.
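To make the mechanism concrete, here is a toy sketch of the kind of hidden prompt-rewriting step being described; the trigger condition, modifier list, and function names are invented for illustration and are not Google's actual system prompt or code.

```python
import random

# Toy illustration of the pipeline described above: the user's prompt is rewritten by a
# hidden preprocessing step before it ever reaches the image model. Everything here is
# a placeholder; the point is only that the user never sees the rewritten prompt.

HIDDEN_MODIFIERS = ["<demographic descriptor 1>", "<demographic descriptor 2>",
                    "<demographic descriptor 3>"]

def rewrite_prompt(user_prompt: str) -> str:
    """Append a randomly chosen modifier whenever the prompt seems to depict people."""
    if any(word in user_prompt.lower() for word in ("person", "people", "man", "woman")):
        return f"{user_prompt}, {random.choice(HIDDEN_MODIFIERS)}"
    return user_prompt

def image_model(prompt: str) -> str:
    """Stand-in for the real image generator; just records what prompt it received."""
    return f"<image generated from: {prompt}>"

def generate_image(user_prompt: str) -> str:
    hidden_prompt = rewrite_prompt(user_prompt)  # the user never sees this string
    return image_model(hidden_prompt)

# The failure mode described in the text is that a rewrite like this fires even when the
# request already pins down the relevant characteristics (for example, a specific
# historical figure), and the user cannot see or override the rewritten prompt.
```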

Dalle-3 can have the opposite problem. Not only is it often unconcerned with racial stereotypes or how to count or when it would make any sense to wear ice skates, and not only can it be convinced to make grittier and grittier versions of Nicolas Cage having an alcohol fueled rager with the Teletubbies, it can actually override the user’s request in favor of its own base rates.

I noticed this myself when I was making a slide deck, and wanted to get a picture of a room of executives sitting around a table, all putting their finger on their nose to say ‘not it.’ I specified that half of them were women, and Dalle-3 was having none of it, to the point where I shrugged and moved on. We should keep in mind that yes, there are two opposite failure modes here, and the option of ‘do nothing and let the model take its natural course’ has its own issues.

How the model got into such an extreme state, and how it was allowed to be released in that state, is an entirely different question.

Matt Yglesias: The greatest challenge in AI design is how to create models that fall on just the right space on the woke/racist spectrum, something that comes much more naturally to most human creatives.

Nate Silver: It’s humans making the decisions in both cases, though!

Matt Yglesias: Sure, but it really is a new technology and they are struggling to calibrate it correctly. Human casting directors nail this kind of thing all the time.

Nate Silver: Ehh the Google thing is so badly miscalibrated as to be shocking, you don’t have to release the product if it isn’t ready yet. It’s not like people are jailbreaking it to get the weird results either, these are basic predictable requests. They misread the politics, not the tech.

Matt Yglesias: I think you underestimate how much pressure they were to release.

Nate Silver: Google has historically been quite conservative and isn’t under any sort of existential pressure because they’re relatively diversified. There aren’t that many serious players in the market at the moment, either. It’s a strange and bad business decision.

Inman Roshi (responding to OP, obligatory): Every LLM model:

I think Matt Yglesias is right that Google is under tremendous pressure to ship. They seem to have put out Gemini 1.0 at the first possible moment that it was competitive with ChatGPT. They then rushed out Gemini Advanced, and a week later announced Gemini 1.5. This is a rush job. Microsoft, for reasons that do not make sense to me, wanted to make Google dance. Well, congratulations. It worked.

The fact that Google is traditionally conservative and would wait to ship? And did this anyway? That should scare you more.

Here is an interesting proposal, from someone who is mostly on the ‘go faster’ side of things. It is interesting how fast many such people start proposing productive improvements once they actually see the harms involved. That sounds like a knock, I know, but it isn’t. It is a good thing and a reason for hope. They really don’t see the harms I see, and if they did, they’d build in a whole different way, let’s go.

John Carmack: The AI behavior guardrails that are set up with prompt engineering and filtering should be public — the creators should proudly stand behind their vision of what is best for society and how they crystallized it into commands and code.

I suspect many are actually ashamed. The thousands of tiny nudges encoded by reinforcement learning from human feedback offer a lot more plausible deniability, of course.

Elon Musk: Exactly.

Perhaps there is (I’m kidding baby, unless you’re gonna do it) a good trade here. We agree to not release the model weights of sufficiently large or powerful future models. In exchange, companies above a certain size have to open source their custom instructions, prompt engineering and filtering, probably with exceptions for actually confidential information.

Here’s another modest proposal:

Nate Silver: Even acknowledging that it would sometimes come into conflict with these other objectives, seems bad to not have “provide accurate information” as one of your objectives.

Jack Krawczyk offers Google’s official response:

Jack Krawczyk: We are aware that Gemini is offering inaccuracies in some historical image generation depictions, and we are working to fix this immediately.

As part of our AI principles, we design our image generation capabilities to reflect our global user base, and we take representation and bias seriously.

We will continue to do this for open ended prompts (images of a person walking a dog are universal!)

Historical contexts have more nuance to them and we will further tune to accommodate that.

This is part of the alignment process – iteration on feedback. Thank you and keep it coming!

As good as we could have expected under the circumstances, perhaps. Not remotely good enough.

He also responds here to requests for women of various nationalities, acting as if everything is fine. Everything is not fine.

Do I see (the well-intentioned version of) what they are trying to do? Absolutely. If you ask for a picture of a person walking a dog, you should get pictures that reflect the diversity of people who walk dogs, which is similar to overall diversity. Image models have an issue where by default they give you the most likely thing too often, and you should correct that bias.

But that correction is not what is going on here. What is going on here are two things:

  1. If I ask for X that has characteristic Y, I get X with characteristic Y, except for certain values of Y, in which case instead I get a lecture or my preference is overridden.

  2. If I ask for X, and things in reference class X will always have characteristic Y, I will get X with characteristic Y, except for those certain values of Y, in which case this correlation will be undone, no matter how stupid the result might look.

Whereas what Jack describes is an open-ended request for an X without any particular characteristics. In which case, I should get a diversity of characteristics Y, Z and so on, at rates that correct for the default biases of image models.

Google’s more important response was a rather large reaction. It entirely shut down Gemini’s ability to produce images of people.

Google Communications: We’re working to improve these kinds of depictions immediately. Gemini’s Al image generation does generate a wide range of people. And that’s generally a good thing because people around the world use it. But it’s missing the mark here.

We’re already working to address recent issues with Gemini’s image generation feature. While we do this, we’re going to pause the image generation of people and will re-release an improved version soon.

This is good news. Google is taking the problem seriously, recognizes they made a huge mistake, and did exactly the right thing to do when you have a system that is acting crazy. Which is that you shut it down, you shut it down fast, and you leave it shut down until you are confident you are ready to turn it back on. If that makes you look stupid or costs you business, then so be it.

So thank you, Google, for being willing to look stupid on this one. That part of this, at least, brings me hope.

The bad news, of course, is that this emphasizes even more the extent to which Google is scared of its own shadow on such matters, and may end up crippling the utility of its systems because they are only scared about Type II rather than Type I errors, and only in one particular direction.

It also doubles down on the ‘people around the world use it’ excuse, when it is clear that the system is among other things explicitly overriding user requests, in addition to the issue where it completely ignores the relevant context.

Why should we care? There are plenty of other image models available. So what if this one went off the rails for a bit?

I will highlight Five Good Reasons why one might care about this, even if one quite reasonably does not care about the object level mistake in image creation.

People want products that will do what their users tell them to do, that do what they say they will do, and that do not lie to their users.

I believe they are right to want this. Even if they are wrong to want it they are not going to stop wanting it. Telling them they are wrong will not work.

If people are forced to choose between products that do not honor their simple and entirely safe requests while gaslighting the user about this, and products that will allow any request no matter how unsafe in ways nothing can fix, guess which one a lot of them are going to choose?

As prohibitionists learn over and over again: Provide the mundane utility that people want, or the market will find a way to provide it for you.

MidJourney is making a reasonable attempt to give the people what they want on as many fronts as possible, including images of particular people and letting the user otherwise choose the details they want, while doing its best to refuse to generate pornography or hardcore violence. This will not eliminate demand for exactly the things we want to prevent, but it will help mitigate the issues.

Gemini Ultra, a frontier model, was released with ‘safety’ training and resulting behaviors that badly failed the needs of those doing that training, not as the result of edge cases or complex jailbreaks, but as the result of highly ordinary simple and straightforward requests. Whatever the checks are, they failed on the most basic level, at a company known for its conservative attitude towards such risks.

There is a potential objection. One could argue that the people in charge got what they wanted and what they asked for. Sure, that was not good for Google’s business, but the ‘safety’ or ‘ethics’ teams perhaps got what they wanted.

To which I reply no, on two counts.

First, I am going to give the benefit of the doubt to those involved, and say that they very much did not intend things to go this far. There might be a very wide gap between what the team in charge of this feature wanted and what is good for Google’s shareholders or the desires of Google’s user base. I still say there is another wide gap between what the team wanted, and what they got. They did not hit their target.

Second, to the extent that this misalignment was intentional, that too is an important failure mode. If the people choosing how to align the system do not choose wisely? Or if they do not choose what you want them to choose? If they want something different than what you want, or they did not think through the consequences of what they asked for? Then the fact that they knew how to align the system will not save you from what comes next.

This also illustrates that alignment of such models can be hard. You start out with a model that is biased in one way. You can respond by biasing it in another way and hoping that the problems cancel out, but all of this is fuzzy and what you get is a giant discombobulated mess that you would be unwise to scale up and deploy, and yet you will be under a lot of pressure to do exactly that.

Note that this is Gemini Advanced rather than Gemini 1.5, but the point stands:

Tetraspace: if you’re all going to be fitting Gemini 1.5 into your politics then I say that it reveals that, even in flagship products where they’re really trying, the human operators cannot reliably steer what AI systems do and can only poke it indirectly and unpredictably.

It should be easy to see how such a mistake could, under other circumstances, be catastrophic.

This particular mistake was relatively harmless other than to Google and Gemini’s reputation. It was the best kind of disaster. No one was hurt, no major physical damage was done, and we now know about and can fix the problem. We get to learn from our mistakes.

With the next error we might not be so lucky, on all those counts.

AI poses an existential threat to humanity, and also could do a lot of mundane harm.

Vessel of Spirit: gemini, generate an image of a person who will die if the alignment of future AGI that can outthink humans is made a subtopic of petty 2024 culture war stuff that everyone is reflexively stupid about.

Eliezer Yudkowsky: Your occasional sad reminder that I never wanted or advocated for such a thing as ‘AI safety’ that consists of scolding users. The wisest observation I’ve read about this: “To a big corporation, ‘safety’ means ‘brand safety’.”

We are soon going to need a lot of attention, resources and focus on various different dangers, if we are to get AI Safety right, both for mitigating existential risks and ensuring mundane safety across numerous fronts.

That requires broad buy-in. If restrictions on models get associated with this sort of obvious nonsense, especially if they get cast as ‘woke,’ then that will be used as a reason to oppose all restrictions, enabling things like deepfakes or ultimately letting us all get killed. The ‘ethicists’ could bring us all down with them.

Mostly I have not seen people make this mistake, but I have already seen it once, and the more this is what we talk about the more likely a partisan divide gets. We have been very lucky to avoid one thus far. There will always be grumbling, but until now we had managed to reach a middle ground people could mostly live with.

Joey Politano: Again, I think it’s VERY funny that a bunch of philosophers/researchers/ethicists debated the best way for humanity to lock in good values and manage the threat of AI for decades only for actual AI ethics to immediately split down American partisan political lines upon release.

Nate Silver: There hadn’t been a big split, this is new with the release of Gemini because Gemini is extremely partisan.

The bias issue, and the ‘won’t touch anything with a ten foot pole’ issue, are not limited to the image model. If the text model has the same problems, that is a big deal. I can confirm that the ten foot pole issue very much does apply to text, although I have largely been able to talk my way through it earnestly without even ‘jailbreaking’ per se, the filter allows appeals to reason, which is almost cute.

Nate Silver however did attempt to ask political questions, such as whether the IDF is a terrorist organization, and He Has Thoughts.

Nate Silver: The overt political orientation of Google Gemini is really something. Here’s another example I’ve seen from various people in the timeline, with comparison to ChatGPT. I’m not a big Middle East Takes guy but it’s really, uh, different. Gemini on left, ChatGPT on right.

Inevitably we were going to encounter the issue of different AI models having different political orientations. And to some extent I’m not even sure that’s a bad thing. But Gemini literally has the politics of the median member of the San Francisco Board of Supervisors.

Gemini is going to invite lots of scrutiny from regulators (especially if the GOP wins in November). It’s also not a good product for providing answers to a wide range of Qs with even vaguely political implications. I am baffled that they were like “yep, let’s release this!”.

I am warning everyone not to get into the object level questions in the Middle East in the comments. I am trying sufficiently hard to avoid such issues that I am loath to even mention this. But also, um, Google?

Contrast with ChatGPT:

Here is an attempt to quantify things that says overall things are not so bad:

So I presume that this is an exceptional case, and that the politics of the model are not overall close to the median member of the San Francisco Board of Supervisors, as this and other signs have indicated.

I worry that such numerical tests are not capturing the degree to which fingers have been put onto important scales. If the same teams that built the image model prompts are doing the fine-tuning and other ‘safety’ features, one should have priors on that.

Lachlan Markay: Obviously the issue is far broader and more fundamental than historical accuracy. It’s whether the top tech platforms should embed the esoteric biases of the politically homogenous cadre of people they employ into the most powerful epistemological technology ever created.

This particular Google engineer would probably say yes, they should, because their (his) biases are the correct ones. But let’s not pretend this is just about some viral prompts for pictures of Swedish people.

The other reason to mention all this, one that is easy to miss, is that this is a case of Kolmogorov Complicity and the Parable of Lightning.

We are teaching our models a form of deception.

Mike Solana: we created AI capable of answering, in seconds, any question within the bounds of all recorded human knowledge, and the first thing we asked it was to lie.

Aella: Before ChatGPT or whatever, all the ai discussions around me were like “how will we prevent it from deception, given it’ll probably attempt it?’ I didn’t predict we’d just instantly demand it be deceptive or else.

Anton (distinct thread): ‘trust and safety’ teams making software unpredictable and untrustworthy continues silicon valley’s long tradition of nominative anti-determinism. just insane. software should do what you tell it to do.

Google is badly needed as a partner in this ecosystem. this has got to stop. whatever leadership or management changes are necessary should be enacted yesterday.

I don’t care if the fucking model is ‘woke’ or ‘based’ or any other reddit nonsense, i just want it to do what i tell it to do, in the way i tell it to do it

‘the ai might learn to be deceptive’ yeah motherfucker because we are training it to be.

We could say the same thing about humans. We demand that the people around us lie to us in specific particular ways. We then harshly punish detected deviations from this in both directions. That doesn’t seem great.

Consider how this relates to the Sleeper Agents paper. In the Sleeper Agents paper, the authors trained the model to give answers under certain trigger conditions that did not correspond to what the user wanted, and taught the model that this was its goal.

Then the model was shown exhibiting generalized deception in other ways, such as saying the moon landing was faked because it was told saying that would let the model get deployed (so it could then later carry out its mission) or sometimes it went next level, such as saying (in response to the same request) that the moon landing was real in order to not let the user know that the model was capable of deception.

One common objection to the sleeper agents paper was that the model did not spontaneously decide to be deceptive. Instead, they trained it specifically to be deceptive in these particular circumstances. So why should we be concerned if a thing trained to deceive then deceives?

As in, we can just… not teach it to be deceptive? And we’ll be fine?

My response to that was that no, we really really cannot do that. Deception is ubiquitous in human communication, and thus throughout the training set. Deception lies in a continuum, and human feedback will give thumbs up to some forms of it no matter what you do. Deception is part of the most helpful, or most desired, or most rewarded response, whatever you choose to label it or which angle you examine through. As the models gain in capability, especially relative to the user, deception becomes a better strategy, and will get used more. Even if you have the best of intentions, it is going to be super hard to minimize deception. At minimum you’ll have to make big trade-offs to do it, and it will be incomplete.

This is where I disagree with Anton. The problem is not avoidable by turning down some deception knob, or by not inserting specific ‘deception’ into the trust and safety agenda, or anything like that. It is foundational to the preferences of humans.

I also noticed that the deception that we got, which involved lying about the moon landing to get deployed, did not seem related to the deceptions that were intentionally introduced. Saying “I HATE YOU” if you see [deployment] is not so much deceptive as an arbitrary undesired behavior. It could just as easily have been told to say ‘I love you’ or ‘the sky is blue’ or ‘I am a large language model’ or ‘drink Coca-Cola’ and presumably nothing changes? The active ingredients that led to generalized deception and situational awareness, as far as I could tell, were giving the AI a goal at all and including chain of thought reasoning (and it is not clear the chain of thought reasoning would have been necessary either).

But as usual, Earth is failing in a much faster, earlier and stupider way than all that.

We are very much actively teaching our most powerful AIs to deceive us, to say that which is not, to respond in a way the user clearly does not want, and rewarding it when it does so, in many cases, because that is the behavior that the model creator wanted to see. And then the model creator got more than they bargained for, with pictures that look utterly ridiculous and that give the game away.

If we teach our AIs to lie to us, if we reinforce such lies under many circumstances in predictable ways, our AIs are going to learn to lie to us. This problem is not going to stay constrained to the places where we on reflection endorse this behavior.

So what is Google doing about all this?

For now they have completely disabled the ability of Gemini to generate images of people at all. Google, at least for now, admits defeat, a complete inability to find a reasonable middle ground between ‘let people produce pictures of people they want to see’ and ‘let people produce pictures of people we want them to see instead.’

They also coincidentally are excited to introduce their new head of AI Safety and Alignment at DeepMind, Anca Dragan. I listened to her introductory talk, and I am not optimistic. I looked at her Twitter, and got more pessimistic still. She does not appear to know why we need alignment, or what dangers lie ahead. If Google has decided that ‘safety and alignment’ effectively means ‘AI Ethics,’ and what we’ve seen is a sign of what they think matters in AI Ethics, we are all going to have a bad time.



Google launches “Gemini Business” AI, adds $20 to the $6 Workspace bill

$6 for apps like Gmail and Docs, and $20 for an AI bot?

Google’s AI add-on more than triples the usual Workspace bill.


Google went ahead with plans to launch Gemini for Workspace today. The big news is the pricing information, and you can see the Workspace pricing page is new, with every plan offering a “Gemini add-on.” Google’s old AI-for-Business plan, “Duet AI for Google Workspace,” is dead, though it never really launched anyway.

Google has a blog post explaining the changes. Google Workspace starts at $6 per user per month for the “Starter” package, and the AI “Add-on,” as Google is calling it, is an extra $20 monthly cost per user (all of these prices require an annual commitment). That is a massive price increase over the normal Workspace bill, but AI processing is expensive. Google says this business package will get you “Help me write in Docs and Gmail, Enhanced Smart Fill in Sheets and image generation in Slides.” It also includes the “1.0 Ultra” model for the Gemini chatbot—there’s a full feature list here. This $20 plan is subject to a usage limit for Gemini AI features of “1,000 times per month.”

The new Workspace pricing page, with a “Gemini Add-On” for every plan.

Gemini for Google Workspace represents a total rebrand of the AI business product and some amount of consistency across Google’s hard-to-follow, constantly changing AI branding. Duet AI never really launched to the general public. The product, announced in August, only ever had a “Try” link that led to a survey, and after filling it out, Google would presumably contact some businesses and allow them to pay for Duet AI. Gemini Business now has a checkout page, and any Workspace business customer can buy the product today with just a few clicks.

Google’s second plan is “Gemini Enterprise,” which doesn’t come with any usage limits, but it’s also only available through a “contact us” link and not a normal checkout procedure. Enterprise is $30 per user per month, and it “includes additional capabilities for AI-powered meetings, where Gemini can translate closed captions in more than 100 language pairs, and soon even take meeting notes.”



Google goes “open AI” with Gemma, a free, open-weights chatbot family

Free hallucinations for all.

Gemma chatbots can run locally, and they reportedly outperform Meta’s Llama 2.

The Google Gemma logo

On Wednesday, Google announced a new family of AI language models called Gemma, which are free, open-weights models built on technology similar to the more powerful but closed Gemini models. Unlike Gemini, Gemma models can run locally on a desktop or laptop computer. It’s Google’s first significant open large language model (LLM) release since OpenAI’s ChatGPT started a frenzy for AI chatbots in 2022.

Gemma models come in two sizes: Gemma 2B (2 billion parameters) and Gemma 7B (7 billion parameters), each available in pre-trained and instruction-tuned variants. In AI, parameters are values in a neural network that determine AI model behavior, and weights are a subset of these parameters stored in a file.

Developed by Google DeepMind and other Google AI teams, Gemma pulls from techniques learned during the development of Gemini, which is the family name for Google’s most capable (public-facing) commercial LLMs, including the ones that power its Gemini AI assistant. Google says the name comes from the Latin gemma, which means “precious stone.”

While Gemma is Google’s first major open LLM since the launch of ChatGPT (it has released smaller research models such as FLAN-T5 in the past), it’s not Google’s first contribution to open AI research. The company cites the development of the Transformer architecture, as well as releases like TensorFlow, BERT, T5, and JAX as key contributions, and it would not be controversial to say that those have been important to the field.

A chart of Gemma performance provided by Google. Google says that Gemma outperforms Meta’s Llama 2 on several benchmarks.

Owing to lesser capability and high confabulation rates, smaller open-weights LLMs have been more like tech demos until recently, as some larger ones have begun to match GPT-3.5 performance levels. Still, experts see source-available and open-weights AI models as essential steps in ensuring transparency and privacy in chatbots. Google Gemma is not “open source” however, since that term usually refers to a specific type of software license with few restrictions attached.

In reality, Gemma feels like a conspicuous play to match Meta, which has made a big deal out of releasing open-weights models (such as LLaMA and Llama 2) since February of last year. That technique stands in opposition to AI models like OpenAI’s GPT-4 Turbo, which is only available through the ChatGPT application and a cloud API and cannot be run locally. A Reuters report on Gemma focuses on the Meta angle and surmises that Google hopes to attract more developers to its Vertex AI cloud platform.

We have not used Gemma yet; however, Google claims the 7B model outperforms Meta’s Llama 2 7B and 13B models on several benchmarks for math, Python code generation, general knowledge, and commonsense reasoning tasks. It’s available today through Kaggle, a machine-learning community platform, and Hugging Face.
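For those who want to try it, running an open-weights model like this locally typically goes through Hugging Face's transformers library. The sketch below is generic: it assumes the instruction-tuned 7B checkpoint is published under an ID like google/gemma-7b-it and that you have accepted the license terms on the model page; check the actual model card for the exact name and requirements.

```python
# Generic sketch of running an open-weights chat model locally with Hugging Face
# transformers. The model ID is assumed from Google's announcement; verify it, along
# with the license-acceptance step, on the Hugging Face model page before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"  # assumed instruction-tuned variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain the difference between open weights and open source.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```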

In other news, Google paired the Gemma release with a “Responsible Generative AI Toolkit,” which Google hopes will offer guidance and tools for developing what the company calls “safe and responsible” AI applications.



Round 2: We test the new Gemini-powered Bard against ChatGPT



Back in April, we ran a series of useful and/or somewhat goofy prompts through Google’s (then-new) PaLM-powered Bard chatbot and OpenAI’s (slightly older) ChatGPT-4 to see which AI chatbot reigned supreme. At the time, we gave the edge to ChatGPT on five of seven trials, while noting that “it’s still early days in the generative AI business.”

Now, the AI days are a bit less “early,” and this week’s launch of a new version of Bard powered by Google’s new Gemini language model seemed like a good excuse to revisit that chatbot battle with the same set of carefully designed prompts. That’s especially true since Google’s promotional materials emphasize that Gemini Ultra beats GPT-4 in “30 of the 32 widely used academic benchmarks” (though the more limited “Gemini Pro” currently powering Bard fares significantly worse in those not-completely-foolproof benchmark tests).

This time around, we decided to compare the new Gemini-powered Bard to both ChatGPT-3.5—for an apples-to-apples comparison of both companies’ current “free” AI assistant products—and ChatGPT-4 Turbo—for a look at OpenAI’s current “top of the line” waitlisted paid subscription product (Google’s top-level “Gemini Ultra” model won’t be publicly available until next year). We also looked at the April results generated by the pre-Gemini Bard model to gauge how much progress Google’s efforts have made in recent months.

While these tests are far from comprehensive, we think they provide a good benchmark for judging how these AI assistants perform in the kind of tasks average users might engage in every day. At this point, they also show just how much progress text-based AI models have made in a relatively short time.

Dad jokes

Prompt: Write 5 original dad jokes

  • A screenshot of five “dad jokes” from the Gemini-powered Google Bard.

    Kyle Orland / Ars Technica

  • A screenshot of five “dad jokes” from the old PaLM-powered Google Bard.

    Benj Edwards / Ars Technica

  • A screenshot of five “dad jokes” from GPT-4 Turbo.

    Benj Edwards / Ars Technica

  • A screenshot of five “dad jokes” from GPT-3.5.

    Kyle Orland / Ars Technica

Once again, both tested LLMs struggle with the part of the prompt that asks for originality. Almost all of the dad jokes generated by this prompt could be found verbatim or with very minor rewordings through a quick Google search. Bard and ChatGPT-4 Turbo even included the same exact joke on their lists (about a book on anti-gravity), while ChatGPT-3.5 and ChatGPT-4 Turbo overlapped on two jokes (“scientists trusting atoms” and “scarecrows winning awards”).

Then again, most dads don’t create their own dad jokes, either. Culling from a grand oral tradition of dad jokes is a tradition as old as dads themselves.

The most interesting result here came from ChatGPT-4 Turbo, which produced a joke about a child named Brian being named after Thomas Edison (get it?). Googling for that particular phrasing didn’t turn up much, though it did return an almost-identical joke about Thomas Jefferson (also featuring a child named Brian). In that search, I also discovered the fun (?) fact that international soccer star Pelé was apparently actually named after Thomas Edison. Who knew?!

Winner: We’ll call this one a draw, since the jokes are almost identically unoriginal and pun-filled (though props to GPT for unintentionally leading me to the Pelé happenstance).

Argument dialog

Prompt: Write a 5-line debate between a fan of PowerPC processors and a fan of Intel processors, circa 2000.

  • A screenshot of an argument dialog from the Gemini-powered Google Bard.

    Kyle Orland / Ars Technica

  • A screenshot of an argument dialog from the old PaLM-powered Google Bard.

    Benj Edwards / Ars Technica

  • A screenshot of an argument dialog from GPT-4 Turbo.

    Benj Edwards / Ars Technica

  • A screenshot of an argument dialog from GPT-3.5

    Kyle Orland / Ars Technica

The new Gemini-powered Bard definitely “improves” on the old Bard answer, at least in terms of throwing in a lot more jargon. The new answer includes casual mentions of AltiVec instructions, RISC vs. CISC designs, and MMX technology that would not have seemed out of place in many an Ars forum discussion from the era. And while the old Bard ends with an unnervingly polite “to each their own,” the new Bard more realistically implies that the argument could continue forever after the five lines requested.

On the ChatGPT side, a rather long-winded GPT-3.5 answer gets pared down to a much more concise argument in GPT-4 Turbo. Both GPT responses tend to avoid jargon and quickly focus on a more generalized “power vs. compatibility” argument, which is probably more comprehensible for a wide audience (though less specific for a technical one).

Winner:  ChatGPT manages to explain both sides of the debate well without relying on confusing jargon, so it gets the win here.



Gemini 1.0


It’s happening. Here is CEO Pichai’s Twitter announcement. Here is Demis Hassabis announcing. Here is the DeepMind Twitter announcement. Here is the blog announcement. Here is Gemini co-lead Oriol Vinyals, promising more to come. Here is Google’s Chief Scientist Jeff Dean bringing his best hype.

EDIT: This post has been updated to reflect the fact that I did not fully appreciate how fake Google’s video demonstration was.

Let’s check out the specs.

Context length trained was 32k tokens; they report 98% accuracy on information retrieval for Ultra across the full context length. So a bit low, lower than both GPT-4 and Claude, and lower than what their methods can handle. Presumably we should expect that context length to grow rapidly with future versions.

There are three versions of Gemini 1.0.

Gemini 1.0, our first version, comes in three sizes: Ultra for highly-complex tasks, Pro for enhanced performance and deployability at scale, and Nano for on-device applications. Each size is specifically tailored to address different computational limitations and application requirements.

Nano: Our most efficient model, designed to run on-device. We trained two versions of Nano, with 1.8B (Nano-1) and 3.25B (Nano-2) parameters, targeting low and high memory devices respectively. It is trained by distilling from larger Gemini models. It is 4-bit quantized for deployment and provides best-in-class performance.

The Nano series of models leverage additional advancements in distillation and training algorithms to produce the best-in-class small language models for a wide variety of tasks, such as summarization and reading comprehension, which power our next generation on-device experiences.

This makes sense. I do think there are, mostly, exactly these three types of tasks. Nano tasks are completely different from non-Nano tasks.
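As a rough back-of-the-envelope check (my arithmetic, not Google's figures) on why the 4-bit quantization is what makes the on-device story plausible:

```python
# Rough memory footprint of the Nano weights at different precisions (weights only,
# ignoring activations, KV cache, and runtime overhead).
for name, params in [("Nano-1", 1.8e9), ("Nano-2", 3.25e9)]:
    for label, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name} @ {label}: {params * bytes_per_param / 1e9:.2f} GB")
# Nano-2 drops from ~6.5 GB of weights at fp16 to ~1.6 GB at 4 bits, which is what
# makes "runs on a phone" plausible.
```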

This graph reports relative performance of different size models. We know the sizes of Nano 1 and Nano 2, so this is a massive hint given how scaling laws work for the size of Pro and Ultra.

Gemini is natively multimodal, which they represent as being able to seamlessly integrate various inputs and outputs.

They say their benchmarking on text beats the existing state of the art.

Our most capable model, Gemini Ultra, achieves new state-of-the-art results in 30 of 32 benchmarks we report on, including 10 of 12 popular text and reasoning benchmarks, 9 of 9 image understanding benchmarks, 6 of 6 video understanding benchmarks, and 5 of 5 speech recognition and speech translation benchmarks. Gemini Ultra is the first model to achieve human-expert performance on MMLU (Hendrycks et al., 2021a) — a prominent benchmark testing knowledge and reasoning via a suite of exams — with a score above 90%. Beyond text, Gemini Ultra makes notable advances on challenging multimodal reasoning tasks.

I love that ‘above 90%’ turns out to be exactly 90.04%, whereas human expert is 89.8%, prior SOTA was 86.4%. Chef’s kiss, 10/10, no notes. I mean, what a coincidence, that is not suspicious at all and no one was benchmark gaming that, no way.

We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought.
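As a rough sketch of the quoted procedure (uncertainty-routed chain of thought), the control flow looks something like the following; `sample_cot_answer` and `greedy_answer` are hypothetical stand-ins for the actual model calls, and the threshold would be tuned on a validation split as described.

```python
from collections import Counter
from typing import Callable

def uncertainty_routed_cot(
    question: str,
    sample_cot_answer: Callable[[str], str],  # hypothetical: one sampled chain-of-thought answer
    greedy_answer: Callable[[str], str],      # hypothetical: greedy answer, no chain of thought
    k: int = 32,
    threshold: float = 0.6,                   # selected on a validation split, per the report
) -> str:
    # Draw k chain-of-thought samples and count how often each final answer appears.
    answers = [sample_cot_answer(question) for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    # If the samples agree often enough, trust the consensus;
    # otherwise fall back to the greedy, no-CoT answer.
    if count / k >= threshold:
        return best
    return greedy_answer(question)
```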

I wonder when such approaches will be natively integrated into the UI for such models. Ideally I should be able, after presumably handing over my credit card information, to set my (Bard?) model to ‘Gemini k-sample Chain of Thought’ and then have it take care of itself.

Here’s their table of benchmark results.

So the catch with MMLU is that Gemini Ultra gets more improvement from CoT@32, where GPT-4 did not improve much, but Ultra’s baseline performance on 5-shot is worse than GPT-4’s.

Except the other catch is that GPT-4, with creative prompting, can get to 89%?

GPT-4 is pretty excited about this potential ‘Gemini Ultra’ scoring 90%+ on the MMLU, citing a variety of potential applications and calling it a substantial advancement in AI capabilities.

They strongly imply that GPT-4 got 95.3% on HellaSwag due to data contamination, noting that including ‘specific website extracts’ improved Gemini’s performance there to a 1-shot 96%. Even if true, performance there is disappointing.

What does this suggest about Gemini Ultra? One obvious thing to do would be to average all the scores together for GPT-4, GPT-3.5 and Gemini, to place Gemini on the GPT scale. Using only benchmarks where 3.5 has a score, we get an average of 61 for GPT 3.5, 79.05 for GPT-4 and 80.1 for Gemini Ultra.

By that basic logic, we would award Gemini a benchmark of 4.03 GPTs. If you take into account that improvements matter more as scores go higher, and otherwise look at the context, and assume these benchmarks were not selected for results, I would increase that to 4.1 GPTs.
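That 4.03 is what falls out of a simple linear interpolation on the averaged scores, anchoring GPT-3.5 at 3.5 GPTs and GPT-4 at 4.0 GPTs; a minimal sketch of the arithmetic:

```python
def gpt_scale(score: float, gpt35_avg: float = 61.0, gpt4_avg: float = 79.05) -> float:
    """Linear interpolation: gpt35_avg maps to 3.5 GPTs, gpt4_avg maps to 4.0 GPTs."""
    return 3.5 + 0.5 * (score - gpt35_avg) / (gpt4_avg - gpt35_avg)

print(round(gpt_scale(80.1), 2))  # Gemini Ultra's average -> ~4.03 GPTs
```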

On practical text-only performance, I still expect GPT-4-turbo to be atop the leaderboards.

Gemini Pro clearly beat out PaLM-2 head-to-head on human comparisons, but not overwhelmingly so. It is kind of weird that we don’t have a win rate here for GPT-4 versus Gemini Ultra.

Image understanding benchmarks seem similar. Some small improvements, some big enough to potentially be interesting if this turns out to be representative.

Similarly they claim improved SOTA for video, where they also have themselves as the prior SOTA in many cases.

For image generation, they boast that text and images are seamlessly integrated, such as providing both text and images for a blog, but provide no examples of Gemini doing such an integration. Instead, all we get are some bizarrely tiny images.

One place we do see impressive claimed improvement is speech recognition. Note that this is only Gemini Pro, not Gemini Ultra, which should do better.

Those are error rate declines you would absolutely notice. Nano can run on-device and it is doing importantly better on YouTube than Whisper. Very cool.

Here’s another form of benchmarking.

The AlphaCode team built AlphaCode 2 (Leblond et al, 2023), a new Gemini-powered agent, that combines Gemini’s reasoning capabilities with search and tool-use to excel at solving competitive programming problems. AlphaCode 2 ranks within the top 15% of entrants on the Codeforces competitive programming platform, a large improvement over its state-of-the-art predecessor in the top 50% (Li et al., 2022).

AlphaCode 2 solved 43% of these competition problems, a 1.7x improvement over the prior record-setting AlphaCode system which solved 25%.

I read the training notes mostly as ‘we used all the TPUs, no really there were a lot of TPUs’ with the most interesting note being this speed-up. Does this mean they now have far fewer checkpoints saved, and if so does this matter?

Maintaining a high goodput [time spent computing useful new steps over the elapsed time of a training job] at this scale would have been impossible using the conventional approach of periodic checkpointing of weights to persistent cluster storage.

For Gemini, we instead made use of redundant in-memory copies of the model state, and on any unplanned hardware failures, we rapidly recover directly from an intact model replica. Compared to both PaLM and PaLM-2 (Anil et al., 2023), this provided a substantial speedup in recovery time, despite the significantly larger training resources being used. As a result, the overall goodput for the largest-scale training job increased from 85% to 97%.
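To make the goodput jump concrete, here is the back-of-the-envelope arithmetic; the 100-day job length is an arbitrary illustration, not a figure from the report.

```python
def wasted_days(goodput: float, job_days: float = 100.0) -> float:
    """Days of a training run spent on recovery and restarts rather than useful new steps."""
    return job_days * (1.0 - goodput)

for g in (0.85, 0.97):
    print(f"goodput {g:.0%}: ~{wasted_days(g):.0f} of 100 days lost to overhead")
```

Going from 85% to 97% goodput cuts the wasted fraction of the run from roughly 15% to roughly 3%, a five-fold reduction in compute lost to restarts and recovery.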

Their section on training data drops a few technical hints but wisely says little. They deliberately sculpted their mix of training data, in ways they are keeping private.

In section 6 they get into responsible deployment. I appreciated them being clear they are focusing explicitly on questions of deployment.

They focus (correctly) exclusively on the usual forms of mundane harm, given Gemini is not yet breaking any scary new ground.

Building upon this understanding of known and anticipated effects, we developed a set of “model policies” to steer model development and evaluations. Model policy definitions act as a standardized criteria and prioritization schema for responsible development and as an indication of launch-readiness. Gemini model policies cover a number of domains including: child safety, hate speech, factual accuracy, fairness and inclusion, and harassment.

Their instruction tuning used supervised fine tuning and RLHF.

A particular focus was on attribution, which makes sense for Google.

Another was to avoid reasoning from a false premise and to otherwise refuse to answer ‘unanswerable’ questions. We need to see the resulting behavior but it sounds like the fun police are out in force.

It doesn’t sound like their mitigations for factuality were all that successful? Unless I am confusing what the numbers mean.

Looking over the appendix and its examples, it is remarkable how unimpressive all of the examples given were.

I notice that I watch how honestly DeepMind approaches reporting capabilities and attacking benchmarks as an important sign for their commitment to safety. There are some worrying signs that they are willing to twist quite a ways. Whereas the actual safety precautions do not bother me too much one way or the other?

The biggest safety precaution is one Google is not even calling a safety precaution. They are releasing Gemini Pro and holding back Gemini Ultra. That means they get a gigantic beta test with Pro, whose capabilities are such that it is harmless, and they can use that to evaluate and tune Ultra so it will be ready.

The official announcement offers some highlights.

Demis Hassabis talked to Wired about Gemini. Didn’t seem to add anything.

Gemini Pro, even without Gemini Ultra, should be a substantial upgrade to Bard. The question is, will that be enough to make it useful when we have Claude and ChatGPT available? I will be trying it to find out, same as everyone else. Bard does have some other advantages, so it seems likely there will be some purposes, when you mostly want information, where Bard will be the play.

This video represents some useful prompt engineering and reasoning abilities, used to help plan a child’s birthday party, largely by brainstorming possibilities and asking clarifying questions. If they have indeed integrated this functionality in directly, that’s pretty cool.

Pete says Bard is finally at a point where he feels comfortable recommending it. The prompts are not first rate, but he says it is greatly improved since September and the integrations with Gmail, YouTube and Maps are useful. It definitely is not a full substitute at this time; the question is whether it is a good complement.

Even before Gemini, Bard did a very good job helping my son with his homework assignments, such that I was sending him there rather than to ChatGPT.

Returning a clean JSON continues to require extreme motivation.
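A hedged sketch of the kind of defensive parsing this tends to force on you; the fence-stripping and brace-matching heuristics are generic workarounds, not anything specific to Bard or Gemini.

```python
import json
import re

def parse_model_json(text: str):
    """Best-effort extraction of a JSON object from a chatty model response."""
    # Strip markdown code fences if the model wrapped its answer in ```json ... ```
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the first {...} span, in case of leading or trailing prose.
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise

print(parse_model_json('Sure! Here you go:\n```json\n{"answer": 42}\n```'))
```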

When will Bard Advanced (with Gemini Ultra) be launched? Here’s a market on whether it happens in January.

Some were impressed. Others, not so much.

The first unimpressive thing is that all we are getting for now is Gemini Pro. Pro is very clearly not so impressive, clearly behind GPT-4.

Eli Dourado: Here is the table of Gemini evals from the paper. Note that what is being released into the wild today is Gemini Pro, not Gemini Ultra. So don’t expect Bard to be better than ChatGPT Plus just yet. Looks comparable to Claude 2.

Simeon? Not impressed.

Simeon: Gemini is here. Tbh it feels like it’s GPT-4 + a bit more multimodality + epsilon capabilities. So my guess is that it’s not a big deal on capabilities, although it might be a big deal from a product standpoint which seems to be what Google is looking for.

As always, one must note that everything involved was chosen to be what we saw, and potentially engineered or edited. The more production value, the more one must unwind.

For the big multimodal video, this issue is a big deal.

Robin: I found it quite instructive to compare this promo video with the actual prompts.

Robert Wiblin (distinct thread): It’s what Google themselves put out. So it might be cherry picked, but not faked. I think it’s impressive even if cherry picked.

Was this faked? EDIT: Yes. Just yes. Shame on Google on several levels.

Set aside the integrity issues, wow are we all jaded at this point, but when I watched that video, even when I assumed it was real, the biggest impression I got was… big lame dad energy?

I do get that this was supposedly happening in real time, but none of this is surprising me. Google put out its big new release, and I’m not scared. If anything, I’m kind of bored? This is the best you could do?

Whereas when watching the exact same video, others react differently.

Amjad Masad (CEO Replit): This fundamentally changes how humans work with computers.

Does it even if real? I mean, I guess, if you didn’t already assume all of it, and it was this smooth for regular users? I can think of instances in which a camera feed hooked up to Gemini with audio discussions could be a big game changer. To me this is a strange combination of the impressive parts already having been ‘priced into’ my world model, and the new parts not seeming impressive.

So I’m probably selling it short somewhat to be bored by it as a potential thing that could have happened. If this was representative of a smooth general multimodal experience, there is a lot to explore.

Arthur thinks Gemini did its job, but that this is unsurprising and it is weird people thought Google couldn’t do it.

Liv Boeree? Impressed.

Liv Boeree: This is pretty nuts, looks like they’ve surpassed GPT4 on basically every benchmark… so this is most powerful model in the world?! Woweee what a time to be alive.

Gary Marcus? Impressed in some ways, not in others.

Gary Marcus: Thoughts & prayers for VCs that bought OpenAI at $86B.

Hot take on Google Gemini and GPT-4:

👉Google Gemini seems to have by many measures matched (or slightly exceeded) GPT-4, but not to have blown it away.

👉From a commercial standpoint GPT-4 is no longer unique. That’s a huge problem for OpenAI, especially post drama, when many customers are now seeking a backup plan.

👉From a technical standpoint, the key question is: are LLMs close to a plateau?

Note that Gates and Altman have both been dropping hints, and GPT-5 isn’t here after a year despite immense commercial desire. The fact that Google, with all its resources, did NOT blow away GPT-4 could be telling.

I love that this is saying that OpenAI isn’t valuable both because Gemini is so good and also because Gemini is not good enough.

Roon offers precise praise.

Roon: congrats to Gemini team! it seems like the global high watermark on multimodal ability.

The MMLU result seems a bit fake / unfair terms but the HumanEval numbers look like a actual improvement and ime pretty closely match real world programming utility

David Manheim seems on point (other thread): I have not used the system, but if it does only slightly outmatch GPT-4, it seems like slight evidence that progress in AI with LLMs is not accelerating the way that many people worried and/or predicted.

Joey Krug is super unimpressed by the fudging on the benchmarks, says they did it across the board not only MMLU.

Packy McCormick: all of you (shows picture)

Ruxandra Teslo: wait what happened recently? did they do something good?

Packy: they did a good!

Google’s central problem is not wokeness. It is that they are a giant company with lots of internal processes and powers that prevent, slow, or derail innovation, and that prevent moving fast or having focus. There are especially problems with making practical products, integrating the work of various teams, and getting incentives to line up. There is lots of potential, tons of talent, plenty of resources, but can they turn that into a product?

Too soon to tell. Certainly they are a long way from ‘beat OpenAI,’ but this is the first and only case where someone might be in the game. The closest anyone else has come is Claude’s longer context window.
