AuDiffusion: multi-agent controlled text-to-image generation with attention-enhanced mamba blocks
DZ An and WY Zhang and SC Zhang and J Lu, COMPLEX & INTELLIGENT SYSTEMS, 12, 78 (2025).
DOI: 10.1007/s40747-025-02211-1
We present AuDiffusion, a diffusion framework that introduces a multi-agent design to improve controllability, semantic alignment, and efficiency in text-to-image generation. The system comprises three cooperating agents responsible for enriching textual input, selecting suitable structural constraints, and performing image synthesis with an enhanced diffusion backbone. This modular design provides more explicit structural control and adaptive decision-making compared with conventional monolithic pipelines, while retaining strong global context modeling. We evaluate AuDiffusion on the ImageNet 256 × 256 and 512 × 512 benchmarks. At 256 × 256 resolution, AuDiffusion achieves a FID of 2.21, IS of 274.12, Precision of 0.85, and Recall of 0.59; at 512 × 512, it attains a FID of 3.02 and IS of 268.31 while remaining computationally efficient.
These results indicate that a multi-agent diffusion framework can improve controllability and image quality without incurring prohibitive computational overhead, making AuDiffusion a practical candidate for applications such as visual prototyping in creative workflows.