AI image generator

openai’s-new-ai-image-generator-is-potent-and-bound-to-provoke

OpenAI’s new AI image generator is potent and bound to provoke


The visual apocalypse is probably nigh, but perhaps seeing was never believing.

A trio of AI-generated images created using OpenAI’s 4o Image Generation model in ChatGPT. Credit: OpenAI

The arrival of OpenAI’s DALL-E 2 in the spring of 2022 marked a turning point in AI when text-to-image generation suddenly became accessible to a select group of users, creating a community of digital explorers who experienced wonder and controversy as the technology automated the act of visual creation.

But like many early AI systems, DALL-E 2 struggled with consistent text rendering, often producing garbled words and phrases within images. It also had limitations in following complex prompts with multiple elements, sometimes missing key details or misinterpreting instructions. These shortcomings left room for improvement that OpenAI would address in subsequent iterations, such as DALL-E 3 in 2023.

On Tuesday, OpenAI announced new multimodal image generation capabilities that are directly integrated into its GPT-4o AI language model, making it the default image generator within the ChatGPT interface. The integration, called “4o Image Generation” (which we’ll call “4o IG” for short), allows the model to follow prompts more accurately (with better text rendering than DALL-E 3) and respond to chat context for image modification instructions.

An AI-generated cat in a car drinking a can of beer created by OpenAI’s 4o Image Generation model. OpenAI

The new image generation feature began rolling out Tuesday to ChatGPT Free, Plus, Pro, and Team users, with Enterprise and Education access coming later. The capability is also available within OpenAI’s Sora video generation tool. OpenAI told Ars that the image generation when GPT-4.5 is selected calls upon the same 4o-based image generation model as when GPT-4o is selected in the ChatGPT interface.

Like DALL-E 2 before it, 4o IG is bound to provoke debate as it enables sophisticated media manipulation capabilities that were once the domain of sci-fi and skilled human creators into an accessible AI tool that people can use through simple text prompts. It will also likely ignite a new round of controversy over artistic styles and copyright—but more on that below.

Some users on social media initially reported confusion since there’s no UI indication of which image generator is active, but you’ll know it’s the new model if the generation is ultra slow and proceeds from top to bottom. The previous DALL-E model remains available through a dedicated “DALL-E GPT” interface, while API access to GPT-4o image generation is expected within weeks.

Truly multimodal output

4o IG represents a shift to “native multimodal image generation,” where the large language model processes and outputs image data directly as tokens. That’s a big deal, because it means image tokens and text tokens share the same neural network. It leads to new flexibility in image creation and modification.

Despite baking-in multimodal image generation capabilities when GPT-4o launched in May 2024—when the “o” in GPT-4o was touted as standing for “omni” to highlight its ability to both understand and generate text, images, and audio—OpenAI has taken over 10 months to deliver the functionality to users, despite OpenAI president Greg Brock teasing the feature on X last year.

OpenAI was likely goaded by the release of Google’s multimodal LLM-based image generator called “Gemini 2.0 Flash (Image Generation) Experimental,” last week. The tech giants continue their AI arms race, with each attempting to one-up the other.

And perhaps we know why OpenAI waited: At a reasonable resolution and level of detail, the new 4o IG process is extremely slow, taking anywhere from 30 seconds to one minute (or longer) for each image.

Even if it’s slow (for now), the ability to generate images using a purely autoregressive approach is arguably a major leap for OpenAI due to its flexibility. But it’s also very compute-intensive, since the model generates the image token by token, building it sequentially. This contrasts with diffusion-based methods like DALL-E 3, which start with random noise and gradually refine an entire image over many iterative steps.

Conversational image editing

In a blog post, OpenAI positions 4o Image Generation as moving beyond generating “surreal, breathtaking scenes” seen with earlier AI image generators and toward creating “workhorse imagery” like logos and diagrams used for communication.

The company particularly notes improved text rendering within images, a capability where previous text-to-image models often spectacularly failed, often turning “Happy Birthday” into something resembling alien hieroglyphics.

OpenAI claims several key improvements: users can refine images through conversation while maintaining visual consistency; the system can analyze uploaded images and incorporate their details into new generations; and it offers stronger photorealism—although what constitutes photorealism (for example, imitations of HDR camera features, detail level, and image contrast) can be subjective.

A screenshot of OpenAI's 4o Image Generation model in ChatGPT. We see an existing AI-generated image of a barbarian and a TV set, then a request to set the TV set on fire.

A screenshot of OpenAI’s 4o Image Generation model in ChatGPT. We see an existing AI-generated image of a barbarian and a TV set, then a request to set the TV set on fire. Credit: OpenAI / Benj Edwards

In its blog post, OpenAI provided examples of intended uses for the image generator, including creating diagrams, infographics, social media graphics using specific color codes, logos, instruction posters, business cards, custom stock photos with transparent backgrounds, editing user photos, or visualizing concepts discussed earlier in a chat conversation.

Notably absent: Any mention of the artists and graphic designers whose jobs might be affected by this technology. As we covered throughout 2022 and 2023, job impact is still a top concern among critics of AI-generated graphics.

Fluid media manipulation

Shortly after OpenAI launched 4o Image Generation, the AI community on X put the feature through its paces, finding that it is quite capable at inserting someone’s face into an existing image, creating fake screenshots, and converting meme photos into the style of Studio Ghibli, South Park, felt, Muppets, Rick and Morty, Family Guy, and much more.

It seems like we’re entering a completely fluid media “reality” courtesy of a tool that can effortlessly convert visual media between styles. The styles also potentially encroach upon protected intellectual property. Given what Studio Ghibli co-founder Hayao Miyazaki has previously said about AI-generated artwork (“I strongly feel that this is an insult to life itself.”), it seems he’d be unlikely to appreciate the current AI-generated Ghibli fad on X at the moment.

To get a sense of what 4o IG can do ourselves, we ran some informal tests, including some of the usual CRT barbarians, queens of the universe, and beer-drinking cats, which you’ve already seen above (and of course, the plate of pickles.)

The ChatGPT interface with the new 4o image model is conversational (like before with DALL-E 3), but you can suggest changes over time. For example, we took the author’s EGA pixel bio (as we did with Google’s model last week) and attempted to give it a full body. Arguably, Google’s more limited image model did a far better job than 4o IG.

Giving the author's pixel avatar a body using OpenAI's 4o Image Generation model in ChatGPT.

Giving the author’s pixel avatar a body using OpenAI’s 4o Image Generation model in ChatGPT. Credit: OpenAI / Benj Edwards

While my pixel avatar was commissioned from the very human (and talented) Julia Minamata in 2020, I also tried to convert the inspiration image for my avatar (which features me and legendary video game engineer Ed Smith) into EGA pixel style to see what would happen. In my opinion, the result proves the continued superiority of human artistry and attention to detail.

Converting a photo of Benj Edwards and video game legend Ed Smith into “EGA pixel art” using OpenAI’s 4o Image Generation model in ChatGPT. Credit: OpenAI / Benj Edwards

We also tried to see how many objects 4o Image Generation could cram into an image, inspired by a 2023 tweet by Nathan Shipley when he was evaluating DALL-E 3 shortly after its release. We did not account for every object, but it looks like most of them are there.

Generating an image of a surfer holding tons of items, inspired by a 2023 Twitter post from Nathan Shipley.

Generating an image of a surfer holding tons of items, inspired by a 2023 Twitter post from Nathan Shipley. Credit: OpenAI / Benj Edwards

On social media, other people have manipulated images using 4o IG (like Simon Willison’s bear selfie), so we tried changing an AI-generated note featured in an article last year. It worked fairly well, though it did not really imitate the handwriting style as requested.

Modifying text in an image using OpenAI's 4o Image Generation model in ChatGPT.

Modifying text in an image using OpenAI’s 4o Image Generation model in ChatGPT. Credit: OpenAI / Benj Edwards

To take text generation a little further, we generated a poem about barbarians using ChatGPT, then fed it into an image prompt. The result feels roughly equivalent to diffusion-based Flux in capability—maybe slightly better—but there are still some obvious mistakes here and there, such as repeated letters.

Testing text generation using OpenAI's 4o Image Generation model in ChatGPT.

Testing text generation using OpenAI’s 4o Image Generation model in ChatGPT. Credit: OpenAI / Benj Edwards

We also tested the model’s ability to create logos featuring our favorite fictional Moonshark brand. One of the logos not pictured here was delivered as a transparent PNG file with an alpha channel. This may be a useful capability for some people in a pinch, but to the extent that the model may produce “good enough” (not exceptional, but looks OK at a glance) logos for the price of $o (not including an OpenAI subscription), it may end up competing with some human logo designers, and that will likely cause some consternation among professional artists.

Generating a

Generating a “Moonshark Moon Pies” logo using OpenAI’s 4o Image Generation model in ChatGPT. Credit: OpenAI / Benj Edwards

Frankly, this model is so slow we didn’t have time to test everything before we needed to get this article out the door. It can do much more than we have shown here—such as adding items to scenes or removing them. We may explore more capabilities in a future article.

Limitations

By now, you’ve seen that, like previous AI image generators, 4o IG is not perfect in quality: It consistently renders the author’s nose at an incorrect size.

Other than that, while this is one of the most capable AI image generators ever created, OpenAI openly acknowledges significant limitations of the model. For example, 4o IG sometimes crops images too tightly or includes inaccurate information (confabulations) with vague prompts or when rendering topics it hasn’t encountered in its training data.

The model also tends to fail when rendering more than 10–20 objects or concepts simultaneously (making tasks like generating an accurate periodic table currently impossible) and struggles with non-Latin text fonts. Image editing is currently unreliable over many multiple passes, with a specific bug affecting face editing consistency that OpenAI says it plans to fix soon. And it’s not great with dense charts or accurately rendering graphs or technical diagrams. In our testing, 4o Image Generation produced mostly accurate but flawed electronic circuit schematics.

Move fast and break everything

Even with those limitations, multimodal image generators are an early step into a much larger world of completely plastic media reality where any pixel can be manipulated on demand with no particular photo editing skill required. That brings with it potential benefits, ethical pitfalls, and the potential for terrible abuse.

In a notable shift from DALL-E, OpenAI now allows 4o IG to generate adult public figures (not children) with certain safeguards, while letting public figures opt out if desired. Like DALL-E, the model still blocks policy-violating content requests (such as graphic violence, nudity, and sex).

The ability for 4o Image Generation to imitate celebrity likenesses, brand logos, and Studio Ghibli films reinforces and reminds us how GPT-4o is partly (aside from some licensed content) a product of a massive scrape of the Internet without regard to copyright or consent from artists. That mass-scraping practice has resulted in lawsuits against OpenAI in the past, and we would not be surprised to see more lawsuits or at least public complaints from celebrities (or their estates) about their likenesses potentially being misused.

On X, OpenAI CEO Sam Altman wrote about the company’s somewhat devil-may-care position about 4o IG: “This represents a new high-water mark for us in allowing creative freedom. People are going to create some really amazing stuff and some stuff that may offend people; what we’d like to aim for is that the tool doesn’t create offensive stuff unless you want it to, in which case within reason it does.”

An original photo of the author beside AI-generated images created by OpenAI's 4o Image Generation model. From left to right: Studio Ghibli style, Muppet style, and pasta style.

An original photo of the author beside AI-generated images created by OpenAI’s 4o Image Generation model. From second left to right: Studio Ghibli style, Muppet style, and pasta style. Credit: OpenAI / Benj Edwards

Zooming out, GPT-4o’s image generation model (and the technology behind it, once open source) feels like it further erodes trust in remotely produced media. While we’ve always needed to verify important media through context and trusted sources, these new tools may further expand the “deep doubt” media skepticism that’s become necessary in the age of AI. By opening up photorealistic image manipulation to the masses, more people than ever can create or alter visual media without specialized skills.

While OpenAI includes C2PA metadata in all generated images, that data can be stripped away and might not matter much in the context of a deceptive social media post. But 4o IG doesn’t change what has always been true: We judge information primarily by the reputation of its messenger, not by the pixels themselves. Forgery existed long before AI. It reinforces that everyone needs media literacy skills—understanding that context and source verification have always been the best arbiters of media authenticity.

For now, Altman is ready to take on the risks of releasing the technology into the world. “As we talk about in our model spec, we think putting this intellectual freedom and control in the hands of users is the right thing to do, but we will observe how it goes and listen to society,” Altman wrote on X. “We think respecting the very wide bounds society will eventually choose to set for AI is the right thing to do, and increasingly important as we get closer to AGI. Thanks in advance for the understanding as we work through this.”

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

OpenAI’s new AI image generator is potent and bound to provoke Read More »

new-zemeckis-film-used-ai-to-de-age-tom-hanks-and-robin-wright

New Zemeckis film used AI to de-age Tom Hanks and Robin Wright

On Friday, TriStar Pictures released Here, a $50 million Robert Zemeckis-directed film that used real time generative AI face transformation techniques to portray actors Tom Hanks and Robin Wright across a 60-year span, marking one of Hollywood’s first full-length features built around AI-powered visual effects.

The film adapts a 2014 graphic novel set primarily in a New Jersey living room across multiple time periods. Rather than cast different actors for various ages, the production used AI to modify Hanks’ and Wright’s appearances throughout.

The de-aging technology comes from Metaphysic, a visual effects company that creates real time face swapping and aging effects. During filming, the crew watched two monitors simultaneously: one showing the actors’ actual appearances and another displaying them at whatever age the scene required.

Here – Official Trailer (HD)

Metaphysic developed the facial modification system by training custom machine-learning models on frames of Hanks’ and Wright’s previous films. This included a large dataset of facial movements, skin textures, and appearances under varied lighting conditions and camera angles. The resulting models can generate instant face transformations without the months of manual post-production work traditional CGI requires.

Unlike previous aging effects that relied on frame-by-frame manipulation, Metaphysic’s approach generates transformations instantly by analyzing facial landmarks and mapping them to trained age variations.

“You couldn’t have made this movie three years ago,” Zemeckis told The New York Times in a detailed feature about the film. Traditional visual effects for this level of face modification would reportedly require hundreds of artists and a substantially larger budget closer to standard Marvel movie costs.

This isn’t the first film that has used AI techniques to de-age actors. ILM’s approach to de-aging Harrison Ford in 2023’s Indiana Jones and the Dial of Destiny used a proprietary system called Flux with infrared cameras to capture facial data during filming, then old images of Ford to de-age him in post-production. By contrast, Metaphysic’s AI models process transformations without additional hardware and show results during filming.

New Zemeckis film used AI to de-age Tom Hanks and Robin Wright Read More »

flux:-this-new-ai-image-generator-is-eerily-good-at-creating-human-hands

FLUX: This new AI image generator is eerily good at creating human hands

five-finger salute —

FLUX.1 is the open-weights heir apparent to Stable Diffusion, turning text into images.

AI-generated image by FLUX.1 dev:

Enlarge / AI-generated image by FLUX.1 dev: “A beautiful queen of the universe holding up her hands, face in the background.”

FLUX.1

On Thursday, AI-startup Black Forest Labs announced the launch of its company and the release of its first suite of text-to-image AI models, called FLUX.1. The German-based company, founded by researchers who developed the technology behind Stable Diffusion and invented the latent diffusion technique, aims to create advanced generative AI for images and videos.

The launch of FLUX.1 comes about seven weeks after Stability AI’s troubled release of Stable Diffusion 3 Medium in mid-June. Stability AI’s offering faced widespread criticism among image-synthesis hobbyists for its poor performance in generating human anatomy, with users sharing examples of distorted limbs and bodies across social media. That problematic launch followed the earlier departure of three key engineers from Stability AI—Robin Rombach, Andreas Blattmann, and Dominik Lorenz—who went on to found Black Forest Labs along with latent diffusion co-developer Patrick Esser and others.

Black Forest Labs launched with the release of three FLUX.1 text-to-image models: a high-end commercial “pro” version, a mid-range “dev” version with open weights for non-commercial use, and a faster open-weights “schnell” version (“schnell” means quick or fast in German). Black Forest Labs claims its models outperform existing options like Midjourney and DALL-E in areas such as image quality and adherence to text prompts.

  • AI-generated image by FLUX.1 dev: “A close-up photo of a pair of hands holding a plate full of pickles.”

    FLUX.1

  • AI-generated image by FLUX.1 dev: A hand holding up five fingers with a starry background.

    FLUX.1

  • AI-generated image by FLUX.1 dev: “An Ars Technica reader sitting in front of a computer monitor. The screen shows the Ars Technica website.”

    FLUX.1

  • AI-generated image by FLUX.1 dev: “a boxer posing with fists raised, no gloves.”

    FLUX.1

  • AI-generated image by FLUX.1 dev: “An advertisement for ‘Frosted Prick’ cereal.”

    FLUX.1

  • AI-generated image of a happy woman in a bakery baking a cake by FLUX.1 dev.

    FLUX.1

  • AI-generated image by FLUX.1 dev: “An advertisement for ‘Marshmallow Menace’ cereal.”

    FLUX.1

  • AI-generated image of “A handsome Asian influencer on top of the Empire State Building, instagram” by FLUX.1 dev.

    FLUX.1

In our experience, the outputs of the two higher-end FLUX.1 models are generally comparable with OpenAI’s DALL-E 3 in prompt fidelity, with photorealism that seems close to Midjourney 6. They represent a significant improvement over Stable Diffusion XL, the team’s last major release under Stability (if you don’t count SDXL Turbo).

The FLUX.1 models use what the company calls a “hybrid architecture” combining transformer and diffusion techniques, scaled up to 12 billion parameters. Black Forest Labs said it improves on previous diffusion models by incorporating flow matching and other optimizations.

FLUX.1 seems competent at generating human hands, which was a weak spot in earlier image-synthesis models like Stable Diffusion 1.5 due to a lack of training images that focused on hands. Since those early days, other AI image generators like Midjourney have mastered hands as well, but it’s notable to see an open-weights model that renders hands relatively accurately in various poses.

We downloaded the weights file to the FLUX.1 dev model from GitHub, but at 23GB, it won’t fit in the 12GB VRAM of our RTX 3060 card, so it will need quantization to run locally (reducing its size), which reportedly (through chatter on Reddit) some people have already had success with.

Instead, we experimented with FLUX.1 models on AI cloud-hosting platforms Fal and Replicate, which cost money to use, though Fal offers some free credits to start.

Black Forest looks ahead

Black Forest Labs may be a new company, but it’s already attracting funding from investors. It recently closed a $31 million Series Seed funding round led by Andreessen Horowitz, with additional investments from General Catalyst and MätchVC. The company also brought on high-profile advisers, including entertainment executive and former Disney President Michael Ovitz and AI researcher Matthias Bethge.

“We believe that generative AI will be a fundamental building block of all future technologies,” the company stated in its announcement. “By making our models available to a wide audience, we want to bring its benefits to everyone, educate the public and enhance trust in the safety of these models.”

  • AI-generated image by FLUX.1 dev: A cat in a car holding a can of beer that reads, ‘AI Slop.’

    FLUX.1

  • AI-generated image by FLUX.1 dev: Mickey Mouse and Spider-Man singing to each other.

    FLUX.1

  • AI-generated image by FLUX.1 dev: “a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting.”

    FLUX.1

  • AI-generated image of a flaming cheeseburger created by FLUX.1 dev.

    FLUX.1

  • AI-generated image by FLUX.1 dev: “Will Smith eating spaghetti.”

    FLUX.1

  • AI-generated image by FLUX.1 dev: “a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting. The screen reads ‘Ars Technica.'”

    FLUX.1

  • AI-generated image by FLUX.1 dev: “An advertisement for ‘Burt’s Grenades’ cereal.”

    FLUX.1

  • AI-generated image by FLUX.1 dev: “A close-up photo of a pair of hands holding a plate that contains a portrait of the queen of the universe”

    FLUX.1

Speaking of “trust and safety,” the company did not mention where it obtained the training data that taught the FLUX.1 models how to generate images. Judging by the outputs we could produce with the model that included depictions of copyrighted characters, Black Forest Labs likely used a huge unauthorized image scrape of the Internet, possibly collected by LAION, an organization that collected the datasets that trained Stable Diffusion. This is speculation at this point. While the underlying technological achievement of FLUX.1 is notable, it feels likely that the team is playing fast and loose with the ethics of “fair use” image scraping much like Stability AI did. That practice may eventually attract lawsuits like those filed against Stability AI.

Though text-to-image generation is Black Forest’s current focus, the company plans to expand into video generation next, saying that FLUX.1 will serve as the foundation of a new text-to-video model in development, which will compete with OpenAI’s Sora, Runway’s Gen-3 Alpha, and Kuaishou’s Kling in a contest to warp media reality on demand. “Our video models will unlock precise creation and editing at high definition and unprecedented speed,” the Black Forest announcement claims.

FLUX: This new AI image generator is eerily good at creating human hands Read More »

ridiculed-stable-diffusion-3-release-excels-at-ai-generated-body-horror

Ridiculed Stable Diffusion 3 release excels at AI-generated body horror

unstable diffusion —

Users react to mangled SD3 generations and ask, “Is this release supposed to be a joke?”

An AI-generated image created using Stable Diffusion 3 of a girl lying in the grass.

Enlarge / An AI-generated image created using Stable Diffusion 3 of a girl lying in the grass.

On Wednesday, Stability AI released weights for Stable Diffusion 3 Medium, an AI image-synthesis model that turns text prompts into AI-generated images. Its arrival has been ridiculed online, however, because it generates images of humans in a way that seems like a step backward from other state-of-the-art image-synthesis models like Midjourney or DALL-E 3. As a result, it can churn out wild anatomically incorrect visual abominations with ease.

A thread on Reddit, titled, “Is this release supposed to be a joke? [SD3-2B],” details the spectacular failures of SD3 Medium at rendering humans, especially human limbs like hands and feet. Another thread, titled, “Why is SD3 so bad at generating girls lying on the grass?” shows similar issues, but for entire human bodies.

Hands have traditionally been a challenge for AI image generators due to lack of good examples in early training data sets, but more recently, several image-synthesis models seemed to have overcome the issue. In that sense, SD3 appears to be a huge step backward for the image-synthesis enthusiasts that gather on Reddit—especially compared to recent Stability releases like SD XL Turbo in November.

“It wasn’t too long ago that StableDiffusion was competing with Midjourney, now it just looks like a joke in comparison. At least our datasets are safe and ethical!” wrote one Reddit user.

  • An AI-generated image created using Stable Diffusion 3 Medium.

  • An AI-generated image created using Stable Diffusion 3 of a girl lying in the grass.

  • An AI-generated image created using Stable Diffusion 3 that shows mangled hands.

  • An AI-generated image created using Stable Diffusion 3 of a girl lying in the grass.

  • An AI-generated image created using Stable Diffusion 3 that shows mangled hands.

  • An AI-generated SD3 Medium image a Reddit user made with the prompt “woman wearing a dress on the beach.”

  • An AI-generated SD3 Medium image a Reddit user made with the prompt “photograph of a person napping in a living room.”

AI image fans are so far blaming the Stable Diffusion 3’s anatomy fails on Stability’s insistence on filtering out adult content (often called “NSFW” content) from the SD3 training data that teaches the model how to generate images. “Believe it or not, heavily censoring a model also gets rid of human anatomy, so… that’s what happened,” wrote one Reddit user in the thread.

Basically, any time a user prompt homes in on a concept that isn’t represented well in the AI model’s training dataset, the image-synthesis model will confabulate its best interpretation of what the user is asking for. And sometimes that can be completely terrifying.

The release of Stable Diffusion 2.0 in 2022 suffered from similar problems in depicting humans well, and AI researchers soon discovered that censoring adult content that contains nudity can severely hamper an AI model’s ability to generate accurate human anatomy. At the time, Stability AI reversed course with SD 2.1 and SD XL, regaining some abilities lost by strongly filtering NSFW content.

Another issue that can occur during model pre-training is that sometimes the NSFW filter researchers use remove adult images from the dataset is too picky, accidentally removing images that might not be offensive and depriving the model of depictions of humans in certain situations. “[SD3] works fine as long as there are no humans in the picture, I think their improved nsfw filter for filtering training data decided anything humanoid is nsfw,” wrote one Redditor on the topic.

Using a free online demo of SD3 on Hugging Face, we ran prompts and saw similar results to those being reported by others. For example, the prompt “a man showing his hands” returned an image of a man holding up two giant-sized backward hands, although each hand at least had five fingers.

  • A SD3 Medium example we generated with the prompt “A woman lying on the beach.”

  • A SD3 Medium example we generated with the prompt “A man showing his hands.”

    Stability AI

  • A SD3 Medium example we generated with the prompt “A woman showing her hands.”

    Stability AI

  • A SD3 Medium example we generated with the prompt “a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting.”

  • A SD3 Medium example we generated with the prompt “A cat in a car holding a can of beer.”

Stability first announced Stable Diffusion 3 in February, and the company has planned to make it available in a variety of different model sizes. Today’s release is for the “Medium” version, which is a 2 billion-parameter model. In addition to the weights being available on Hugging Face, they are also available for experimentation through the company’s Stability Platform. The weights are available for download and use for free under a non-commercial license only.

Soon after its February announcement, delays in releasing the SD3 model weights inspired rumors that the release was being held back due to technical issues or mismanagement. Stability AI as a company fell into a tailspin recently with the resignation of its founder and CEO, Emad Mostaque, in March and then a series of layoffs. Just prior to that, three key engineers—Robin Rombach, Andreas Blattmann, and Dominik Lorenz—left the company. And its troubles go back even farther, with news of the company’s dire financial position lingering since 2023.

To some Stable Diffusion fans, the failures with Stable Diffusion 3 Medium are a visual manifestation of the company’s mismanagement—and an obvious sign of things falling apart. Although the company has not filed for bankruptcy, some users made dark jokes about the possibility after seeing SD3 Medium:

“I guess now they can go bankrupt in a safe and ethically [sic] way, after all.”

Ridiculed Stable Diffusion 3 release excels at AI-generated body horror Read More »