Open-Source · Apache 2.0

ERNIE Image: Baidu's Open-Weight
Text-to-Image Model

ERNIE Image is an open text-to-image model built on an 8B single-stream Diffusion Transformer. It's specifically trained for the cases that trip up most image generators — legible in-image text, structured layouts, and complex multi-object prompts — and runs on a single consumer GPU.

8B
DiT Parameters
Single-stream architecture
0.8856
GENEval
Instruction following
0.9733
LongTextBench
Text accuracy benchmark
24 GB
Min. VRAM
Consumer GPU compatible
ERNIE Image architecture: text prompt → Prompt Enhancer (LLM) → 8B Diffusion Transformer → Generated Image (PNG)

What Is ERNIE Image?

The straight answer — model architecture, what it's built for, and where it fits.

ERNIE Image is an open-source text-to-image generation model developed by the ERNIE team at Baidu. It uses an 8-billion-parameter single-stream Diffusion Transformer (DiT) and ships with a lightweight Prompt Enhancer that expands short user inputs into richer, structured descriptions before generation.

The model is designed for practical deployment — it runs on a single consumer GPU with 24 GB of VRAM, not a cluster. Despite the compact parameter count, it reaches state-of-the-art performance among open-weight text-to-image models across several benchmarks.

It's released under Apache 2.0. That means the weights are free to download, use commercially, fine-tune, and redistribute — with no API dependency and no usage quota.

Apache 2.0 · Free for commercial use · No API quota

What ERNIE Image Is Built For

Six capabilities from the official model documentation — and what each one means in practice.

Render Dense, Layout-Sensitive Text Inside Images

ERNIE Image performs particularly well on long-form, layout-sensitive text — the kind that breaks most diffusion models. Posters with real headlines, infographics with data labels, and UI-like mockups with readable copy all come out clean.

LongTextBench 0.9733

Follow Complex Prompts Involving Multiple Objects

The model handles prompts with multiple objects, detailed spatial relationships, and knowledge-intensive descriptions — and doesn't collapse them into a generic output. GENEval 0.8856 puts it ahead of Qwen-Image and competitive with larger open-weight models.

GENEval 0.8856

Generate Posters, Comics, and Multi-Panel Compositions

Structured visual tasks are where ERNIE Image stands out among open-weight models. Posters, comic panels, storyboards, and multi-panel compositions come out with consistent layout logic — not just good-looking subjects dropped onto a canvas.

Cover Realistic, Design-Oriented, and Stylized Outputs

The model isn’t locked into one visual register. Realistic photography, clean design-oriented imagery, and distinctive stylized aesthetics are all within its range. You’re not choosing between “photo” and “art” mode — it handles both.

Run on a Consumer GPU — No Cloud Required

The full model runs on a single GPU with 24 GB of VRAM, such as an RTX 3090, RTX 4090, or A10G. That's local inference with no API dependency and no per-image cost. Self-hosting the checkpoint also means you control the data pipeline end to end.

24 GB VRAM · Single GPU

Expand Short Prompts with the Built-In Prompt Enhancer

A lightweight Prompt Enhancer ships alongside the main DiT. It takes brief user inputs and rewrites them into richer, structured descriptions before the model generates. The result: better output from short prompts, without prompt engineering overhead.
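To make the Prompt Enhancer's role concrete, here is a purely illustrative sketch of its input/output shape. The real enhancer is an LLM shipped with the model, not a template; the function name, fields, and wording below are all invented for illustration.

```python
# Illustrative only: mimics the Prompt Enhancer's contract (short prompt in,
# richer structured description out). The real component is an LLM.
def enhance_prompt(short_prompt: str, style: str = "photorealistic") -> str:
    """Expand a brief user prompt into a fuller, structured description."""
    return (
        f"Subject: {short_prompt}. "
        f"Style: {style}, high detail. "
        "Composition: balanced layout with a clear focal point. "
        "Text elements: render any quoted text legibly."
    )

print(enhance_prompt("a cat on a poster"))
```

The point is the workflow, not the template: the user types a few words, the enhancer produces the detailed description, and only the expanded text reaches the DiT.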

Where to Download and Run ERNIE Image

Official weights on Hugging Face, ComfyUI workflow on GitHub — both under Apache 2.0.

Download from Hugging Face — Official Weights

The official checkpoint is hosted at baidu/ERNIE-Image on Hugging Face under Apache 2.0. Both the main SFT model and the Turbo variant are available. The Prompt Enhancer ships as a separate safetensors file in the same repository.
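One way to fetch the weights locally is the Hugging Face CLI. A minimal sketch, assuming the repo id quoted above resolves publicly; the target directory name is an arbitrary choice, not an official path:

```shell
# Install the Hugging Face CLI, then pull the full checkpoint
# (repo id from this article; --local-dir is our own choice).
pip install -U "huggingface_hub[cli]"
huggingface-cli download baidu/ERNIE-Image --local-dir ./ernie-image
```

The same command works for the Turbo variant if it lives in the same repository; check the repo's file listing before downloading, since the full SFT checkpoint is a multi-gigabyte pull.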

huggingface.co/baidu/ERNIE-Image

Run in ComfyUI with the Official Workflow Template

ComfyUI added Day-0 support for ERNIE Image in April 2026. Load the safetensors checkpoint, add the Prompt Enhancer node, and it integrates with any standard ComfyUI pipeline. The workflow template is published on GitHub.

github.com/baidu/ernie-image

ERNIE Image SFT vs Turbo — Which Version to Use

Two variants ship in the same release. Here's what's different and when to pick each one.

Standard

ERNIE Image SFT — Full Quality, 50-Step Generation

The SFT model is the standard release — 50 denoising steps, full instruction fidelity, and the strongest benchmark scores. Use it for final renders where text accuracy, layout precision, and output quality are non-negotiable.

Steps
50
GENEval
0.8856
LongTextBench
0.9733
Best for
Final renders
Fast

ERNIE Image Turbo — 8-Step Drafts for Fast Iteration

ERNIE Image Turbo is a distilled variant trained with DMD (Distribution Matching Distillation) and reinforcement learning. It cuts generation down to 8 steps — fast enough to preview 20+ compositions before committing to a final render. Output quality is lower than SFT but sufficient for client reviews and direction exploration.

Steps
8
Speed
~6× faster
Training
DMD + RL
Best for
Drafts, iteration
ERNIE Image SFT vs Turbo — feature comparison
                   ERNIE Image SFT     ERNIE Image Turbo
Steps              50                  8
Speed              Baseline            ~6× faster
Best for           Final renders       Drafts, iteration
GENEval            0.8856              Lower
LongTextBench      0.9733              Lower
Training method    SFT                 DMD + Reinforcement Learning
Available on       Hugging Face        Hugging Face
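The "~6× faster" figure follows directly from the step counts, assuming per-step cost is roughly constant between the two variants (an approximation — the distilled Turbo model may differ slightly per step):

```python
# Back-of-the-envelope check: with roughly constant per-step cost,
# the step-count ratio predicts the speedup.
sft_steps = 50
turbo_steps = 8
speedup = sft_steps / turbo_steps
print(f"{speedup:.2f}x")  # prints 6.25x, consistent with the quoted ~6x
```

In practice, total wall-clock time also includes text encoding and VAE decoding, which don't shrink with step count, so real-world speedups land a bit under the pure step ratio.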
Is ERNIE Image free?

Yes. ERNIE Image is released under Apache 2.0 — the weights are free to download and use. Commercial use, fine-tuning, and redistribution are all permitted with no separate license purchase. There is no API quota or usage cap on the self-hosted version.

How does ERNIE Image compare to FLUX.1 or Midjourney?

For in-image text and structured layout generation, ERNIE Image leads most open-weight competitors. Its GENEval score of 0.8856 makes it competitive with FLUX.1 on instruction following. Midjourney produces stronger stylized aesthetics but is closed-source with no self-hosting option. ERNIE Image is the stronger choice when text accuracy and layout control matter more than visual style range.

Can I use ERNIE Image outputs commercially?

Yes. The Apache 2.0 license permits commercial use of both the model weights and generated outputs. Ads, product imagery, print, and resale are all allowed. No additional commercial license is needed.

What GPU do I need to run ERNIE Image locally?

ERNIE Image requires a GPU with 24 GB of VRAM for the full SFT model — the RTX 3090, RTX 4090, and A10G all work. The Turbo variant is faster and has lower memory requirements. The model runs on a single GPU; no multi-GPU setup is needed.

Does ERNIE Image work with ComfyUI?

Yes. ComfyUI added official Day-0 support for ERNIE Image in April 2026. The model loads as a standard safetensors checkpoint. Baidu published a workflow template on GitHub that includes the Prompt Enhancer node. It's compatible with standard ComfyUI custom nodes.

What languages can I use for prompts?

ERNIE Image supports English, Chinese, and Japanese prompts. In-image text renders cleanly in both English and Chinese within the same generation pass. Benchmark scores are comparable across languages (OneIG-EN 0.5750 vs. OneIG-ZH 0.5543), so there is no meaningful quality gap between English and Chinese.

Official ERNIE Image Resources

Everything in one place — model weights, code, documentation, and the online demo.