ERNIE Image: Baidu's Open-Weight Text-to-Image Model
ERNIE Image is an open text-to-image model built on an 8B single-stream Diffusion Transformer. It's specifically trained for the cases that trip up most image generators — legible in-image text, structured layouts, and complex multi-object prompts — and runs on a single consumer GPU.
The straight answer — model architecture, what it's built for, and where it fits.
ERNIE Image is an open-source text-to-image generation model developed by the ERNIE team at Baidu. It uses an 8-billion-parameter single-stream Diffusion Transformer (DiT) and ships with a lightweight Prompt Enhancer that expands short user inputs into richer, structured descriptions before generation.
The model is designed for practical deployment: it runs on a single consumer GPU with 24 GB of VRAM, not a cluster. Despite the compact parameter count, it reaches state-of-the-art performance among open-weight text-to-image models across several benchmarks.
It's released under Apache 2.0. That means the weights are free to download, use commercially, fine-tune, and redistribute — with no API dependency and no usage quota.
Six capabilities from the official model documentation — and what each one means in practice.
Render Dense, Layout-Sensitive Text Inside Images
ERNIE Image performs particularly well on long-form, layout-sensitive text — the kind that breaks most diffusion models. Posters with real headlines, infographics with data labels, and UI-like mockups with readable copy all come out clean.
LongTextBench 0.9733
Follow Complex Prompts Involving Multiple Objects
The model handles prompts with multiple objects, detailed spatial relationships, and knowledge-intensive descriptions — and doesn't collapse them into a generic output. GENEval 0.8856 puts it ahead of Qwen-Image and competitive with larger open-weight models.
GENEval 0.8856
Generate Posters, Comics, and Multi-Panel Compositions
Structured visual tasks are where ERNIE Image stands out among open-weight models. Posters, comic panels, storyboards, and multi-panel compositions come out with consistent layout logic — not just good-looking subjects dropped onto a canvas.
Cover Realistic, Design-Oriented, and Stylized Outputs
The model isn’t locked into one visual register. Realistic photography, clean design-oriented imagery, and distinctive stylized aesthetics are all within its range. You’re not choosing between “photo” and “art” mode — it handles both.
Run on a Consumer GPU — No Cloud Required
The full model runs on a single GPU with 24 GB of VRAM: RTX 3090, RTX 4090, or A10G. That's local inference with no API dependency and no per-image cost. Self-hosting the checkpoint also means you control the data pipeline end to end.
24 GB VRAM · Single GPU
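The 24 GB figure is consistent with the parameter count. A back-of-envelope sketch, assuming half-precision weight storage (an assumption; the source does not state the dtype):

```python
# Rough memory estimate for the DiT weights alone (illustrative;
# assumes 2-byte bf16/fp16 storage, which the source does not specify).
params = 8e9                  # 8B-parameter Diffusion Transformer
bytes_per_param = 2           # bf16 / fp16
weights_gb = params * bytes_per_param / 1024**3
print(f"weights = {weights_gb:.1f} GB")  # → weights = 14.9 GB
```

Roughly 15 GB of weights leaves headroom on a 24 GB card for activations and the text encoder, which is why a single RTX 3090 or 4090 suffices.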
Expand Short Prompts with the Built-In Prompt Enhancer
A lightweight Prompt Enhancer ships alongside the main DiT. It takes brief user inputs and rewrites them into richer, structured descriptions before the model generates. The result: better output from short prompts, without prompt engineering overhead.
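The real Prompt Enhancer is a learned model that ships with the weights, and its API is not documented here. Purely as an illustration of the rewriting step (short prompt in, structured description out), a toy stand-in might look like:

```python
def enhance(prompt: str) -> str:
    """Toy stand-in for the Prompt Enhancer: expand a terse prompt into a
    richer, structured description before it reaches the DiT. The real
    enhancer is a trained model, not a template."""
    return (
        f"{prompt}, detailed composition, coherent layout, "
        "balanced lighting, legible in-image text where relevant"
    )

expanded = enhance("poster for a jazz festival")
print(expanded)
```

The point is only the shape of the transform: generation always sees the expanded description, so short user prompts still produce structured output.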
Where to Download and Run ERNIE Image
Official weights on Hugging Face, ComfyUI workflow on GitHub — both under Apache 2.0.
Download from Hugging Face — Official Weights
The official checkpoint is hosted at baidu/ERNIE-Image on Hugging Face under Apache 2.0. Both the main SFT model and the Turbo variant are available. The Prompt Enhancer ships as a separate safetensors file in the same repository.
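A minimal fetch sketch using the standard `huggingface_hub` API. The repo id is the one stated above; individual file names inside the repo are not assumed:

```python
REPO_ID = "baidu/ERNIE-Image"  # repo id as stated on the model card

def fetch_weights(local_dir: str = "./ernie-image") -> str:
    """Download the full repo snapshot (SFT and Turbo checkpoints plus
    the Prompt Enhancer safetensors) and return the local path."""
    # Imported lazily so the sketch is readable without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `fetch_weights()` pulls every file in the repository; pass `allow_patterns` to `snapshot_download` if you only want one variant.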
Run in ComfyUI with the Official Workflow Template
ComfyUI added Day-0 support for ERNIE Image in April 2026. Load the safetensors checkpoint, add the Prompt Enhancer node, and it integrates with any standard ComfyUI pipeline. The workflow template is published on GitHub.
Two variants ship in the same release. Here's what's different and when to pick each one.
ERNIE Image SFT — Full Quality, 50-Step Generation
The SFT model is the standard release — 50 denoising steps, full instruction fidelity, and the strongest benchmark scores. Use it for final renders where text accuracy, layout precision, and output quality are non-negotiable.
Steps 50 · GENEval 0.8856 · LongTextBench 0.9733 · Best for: final renders
ERNIE Image Turbo — 8-Step Drafts for Fast Iteration
ERNIE Image Turbo is a distilled variant trained with DMD (Distribution Matching Distillation) and reinforcement learning. It cuts generation down to 8 steps — fast enough to preview 20+ compositions before committing to a final render. Output quality is lower than SFT but sufficient for client reviews and direction exploration.
Steps 8 · Speed ~6× faster · Training DMD + RL · Best for: drafts, iteration
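The "~6×" figure follows directly from the step counts, assuming per-image latency scales roughly linearly with denoising steps (scheduler overhead and the Prompt Enhancer pass are ignored in this approximation):

```python
# 50-step SFT vs 8-step Turbo: step-count ratio approximates the speedup.
sft_steps, turbo_steps = 50, 8
speedup = sft_steps / turbo_steps
print(f"~{speedup:.2f}x fewer denoising steps")  # → ~6.25x fewer denoising steps
```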
ERNIE Image SFT vs Turbo — feature comparison
                 ERNIE Image SFT    ERNIE Image Turbo
Steps            50                 8
Speed            Baseline           ~6× faster
Best for         Final renders      Drafts, iteration
GENEval          0.8856             Lower
LongTextBench    0.9733             Lower
Training method  SFT                DMD + Reinforcement Learning
Available on     Hugging Face       Hugging Face
Real Images Generated with ERNIE Image
Every image below was generated from a text prompt using ERNIE Image — from cinematic portraits to structured posters and bilingual compositions.
Female Dark Knight
Car at Sunset Field
Celestial Moon Messenger
Rooftop Assassin in Rain
Desert Nomad Hunter
Sea Witch Cave
Rainy Izakaya Street — bilingual text rendering
Japanese Summer Park
Phone Illustration Blend
Browser LLM Intro — structured layout
Alphabet of Careers — poster with dense text rendering
Is ERNIE Image free?
Yes. ERNIE Image is released under Apache 2.0 — the weights are free to download and use. Commercial use, fine-tuning, and redistribution are all permitted with no separate license purchase. There is no API quota or usage cap on the self-hosted version.
How does ERNIE Image compare to FLUX.1 or Midjourney?
For in-image text and structured layout generation, ERNIE Image leads most open-weight competitors, and its GENEval score of 0.8856 makes it competitive with FLUX.1 on instruction following. Midjourney produces stronger stylized aesthetics but is closed-source with no self-hosting option. ERNIE Image is the stronger choice when text accuracy and layout control matter more than visual style range.
Can I use ERNIE Image outputs commercially?
Yes. The Apache 2.0 license permits commercial use of both the model weights and generated outputs. Ads, product imagery, print, and resale are all allowed. No additional commercial license is needed.
What GPU do I need to run ERNIE Image locally?
ERNIE Image requires a GPU with 24 GB of VRAM for the full SFT model; the RTX 3090, RTX 4090, and A10G all work. The Turbo variant is faster and has lower memory requirements. The model runs on a single GPU; no multi-GPU setup is needed.
Does ERNIE Image work with ComfyUI?
Yes. ComfyUI added official Day-0 support for ERNIE Image in April 2026. The model loads as a standard safetensors checkpoint. Baidu published a workflow template on GitHub that includes the Prompt Enhancer node. It's compatible with standard ComfyUI custom nodes.
What languages can I use for prompts?
ERNIE Image supports English, Chinese, and Japanese prompts. In-image text renders cleanly in English and Chinese within the same generation pass. Benchmark scores are comparable across languages (OneIG-EN 0.5750 vs OneIG-ZH 0.5543), so there is no meaningful quality gap between English and Chinese.
Official ERNIE Image Resources
Everything in one place — model weights, code, documentation, and the online demo.