CTIGEN-CDM: Controlled Text-to-Image Generation Using Cropped Diffusion Models
YP Liu and JJ Huang and SP Wen and X He and W Zhang and Z Feng, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 35, 11849-11862 (2025).
DOI: 10.1109/TCSVT.2025.3585688
Diffusion-based text-to-image models can generate highly realistic images from text descriptions. In practice, however, the generated images often fail to satisfy user requirements on position and structure, because text descriptions lack detailed spatial information and cannot express complex structural demands. Introducing additional control conditions, such as keypoint annotations or semantic segmentation maps, has therefore become an important research direction for improving positional and structural accuracy. This paper proposes CTIGEN-CDM, a novel method built on a lightweight pre-trained diffusion model. The model reduces computational cost by pruning the denoising network of the diffusion model and integrates control conditions into the denoising process through a gating mechanism that guides image generation. Supported control conditions include Canny edges, HED edges, depth maps, keypoints, and semantic segmentation. Experimental results show that CTIGEN-CDM achieves high generation quality and broad applicability: it produces images with precise position and structure while substantially reducing computational cost, offering a promising new solution for controlled text-to-image generation.
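The gated injection of control conditions described above can be illustrated with a minimal sketch. The module below is an assumption for illustration only, not the paper's exact architecture: it fuses an encoded control-condition feature map (e.g. from Canny edges or a depth map) into a denoising-network feature map through a learned sigmoid gate, so the network can modulate how strongly the control signal steers generation at each spatial location.

```python
import torch
import torch.nn as nn

class GatedControlInjection(nn.Module):
    """Hypothetical gating block (not the paper's exact design): fuses an
    encoded control condition into a denoising feature map via a learned
    gate, added as a gated residual."""

    def __init__(self, channels: int):
        super().__init__()
        # Gate computed from the concatenation of both feature maps;
        # sigmoid keeps it in (0, 1) so control strength is modulated per pixel.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Projection of the control features before injection.
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: denoising-network feature map, c: encoded control condition
        # (same spatial size and channel count assumed for simplicity).
        g = self.gate(torch.cat([h, c], dim=1))
        return h + g * self.proj(c)  # gated residual injection

# Toy usage: 8-channel features on a 16x16 grid.
block = GatedControlInjection(8)
h = torch.randn(1, 8, 16, 16)
c = torch.randn(1, 8, 16, 16)
out = block(h, c)
```

Because the gate is learned rather than fixed, the model can suppress the control signal where it conflicts with the text prompt and amplify it where precise structure is required.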