A Language-Image Pre-training Model Based on Context Optimization and Region of Interest
R. Jin, T. Jin, Z. A. Li, T. Z. Wu, Y. Wang, Q. H. Zhu, M. Luo, International Journal of Computational Intelligence Systems, 18, 322 (2025).
DOI: 10.1007/s44196-025-01061-6
With the rapid development and increasing sophistication of multimedia technologies, large-scale vision-language pre-training has yielded strong results on downstream tasks. Current methods mostly rest on the assumption that image-text pairs sourced from the Internet form flawless one-to-one correspondences. The emergence of powerful pre-training models in recent years has spurred a line of research on transferring them to downstream tasks. Inspired by these techniques, this paper takes the CLIP (Contrastive Language-Image Pre-Training) model as the backbone and proposes a new method named CORCLIP (Context Optimization and Region of Interest Contrastive Language-Image Pre-Training). CORCLIP replaces time-consuming manual prompt-word refinement with context optimization, and introduces an attention mechanism that guides ROI (Region of Interest) selection for the image segmentation task, dividing an image into distinct regions or objects. By selecting regions of interest, processing is focused on key regions, improving segmentation efficiency. Through comprehensive experiments and analysis on the Flickr30K and MSCOCO datasets, we show that the proposed method outperforms the compared approaches.
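The context-optimization idea referenced in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' CORCLIP code: the `PromptLearner` class, the toy embedding sizes, and the stand-in encoders are all assumptions for illustration. The core pattern (shared with CoOp-style methods) is to replace hand-crafted prompt words with a small set of learnable context vectors prepended to each class-name embedding, while the pre-trained encoders stay frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    """Learnable context vectors shared across classes (CoOp-style sketch)."""

    def __init__(self, n_ctx: int, embed_dim: int, class_embeds: torch.Tensor):
        super().__init__()
        # n_ctx learnable context token embeddings; only these are trained
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # frozen per-class name embeddings: (n_classes, n_name_tokens, embed_dim)
        self.register_buffer("class_embeds", class_embeds)

    def forward(self) -> torch.Tensor:
        n_classes = self.class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # prompt = [learnable context tokens] + [class-name tokens]
        return torch.cat([ctx, self.class_embeds], dim=1)

def clip_style_logits(image_feats: torch.Tensor,
                      text_feats: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    # CLIP-style scores: cosine similarity of L2-normalized features
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feats @ text_feats.t() / temperature

# Toy usage: 4 classes, 2 name tokens each, 32-dim embeddings
class_embeds = torch.randn(4, 2, 32)
learner = PromptLearner(n_ctx=4, embed_dim=32, class_embeds=class_embeds)
prompts = learner()                 # shape (4, 4 + 2, 32)
text_feats = prompts.mean(dim=1)    # stand-in for a frozen text encoder
logits = clip_style_logits(torch.randn(8, 32), text_feats)  # shape (8, 4)
```

In a real setup, `prompts` would pass through the frozen CLIP text encoder and the logits would drive a cross-entropy loss that updates only `learner.ctx`, avoiding manual prompt engineering.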