Cross-Modal Synthesis: Generating Semantic Segmentation Masks from Image Captions and Vice-Versa using Multi-Modal Transformers
  • Author(s): Subham Sahoo; Rajat Gupta; Chintan Tibrewala
  • Paper ID: 1711964
  • Page: 2418-2423
  • Published Date: 08-09-2025
  • Published In: Iconic Research And Engineering Journals
  • Publisher: IRE Journals
  • e-ISSN: 2456-8880
  • Volume/Issue: Volume 8 Issue 11 May-2025
Abstract

The integration of visual and linguistic modalities has transformed computer vision and natural language processing research. Cross-modal synthesis, which seeks to generate segmentation masks from text and captions from images, presents significant opportunities for semantic understanding of visual scenes. In this work, we propose a unified transformer-based framework that performs bi-directional synthesis between image captions and semantic segmentation masks. The model, built on a multi-modal transformer encoder-decoder, learns shared latent representations that enable seamless translation between visual regions and linguistic tokens. Extensive experiments on the COCO-Stuff and ADE20K datasets demonstrate that our method outperforms baseline models by 16% mIoU on caption-to-mask synthesis and by 14% BLEU on mask-to-caption generation, establishing a new benchmark for multi-modal reasoning.
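The abstract's core idea, that both modalities are encoded into a shared latent space and decoded into the opposite modality, can be illustrated with a minimal sketch. Everything below is hypothetical: the linear "encoders" and "decoders", the dimension names (`D_LATENT`, `D_TEXT`, `D_MASK`), and the function names stand in for the paper's multi-modal transformer and do not reproduce the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D_LATENT = 64            # width of the shared latent space (assumed)
D_TEXT, D_MASK = 32, 48  # per-modality feature widths (assumed)

# Modality-specific encoders/decoders, sketched here as fixed linear
# maps; in the paper these roles are played by transformer blocks.
enc_text = rng.standard_normal((D_TEXT, D_LATENT))
enc_mask = rng.standard_normal((D_MASK, D_LATENT))
dec_text = rng.standard_normal((D_LATENT, D_TEXT))
dec_mask = rng.standard_normal((D_LATENT, D_MASK))

def caption_to_mask(caption_feats: np.ndarray) -> np.ndarray:
    """Encode caption tokens into the shared latent, decode as mask features."""
    z = caption_feats @ enc_text      # (n_tokens, D_LATENT)
    return z @ dec_mask               # (n_tokens, D_MASK)

def mask_to_caption(mask_feats: np.ndarray) -> np.ndarray:
    """Encode mask patches into the shared latent, decode as caption features."""
    z = mask_feats @ enc_mask         # (n_patches, D_LATENT)
    return z @ dec_text               # (n_patches, D_TEXT)

caption = rng.standard_normal((10, D_TEXT))  # 10 caption token features
mask = rng.standard_normal((25, D_MASK))     # 25 mask patch features

print(caption_to_mask(caption).shape)  # (10, 48)
print(mask_to_caption(mask).shape)     # (25, 32)
```

The point of the sketch is the symmetry: both directions pass through the same latent width, which is what lets one model serve both caption-to-mask and mask-to-caption synthesis.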

Citations

IRE Journals:
Subham Sahoo, Rajat Gupta, Chintan Tibrewala, "Cross-Modal Synthesis: Generating Semantic Segmentation Masks from Image Captions and Vice-Versa using Multi-Modal Transformers," Iconic Research And Engineering Journals, Volume 8, Issue 11, 2025, Page 2418-2423. https://doi.org/10.64388/IREV8I11-1711964

IEEE:
S. Sahoo, R. Gupta, and C. Tibrewala, "Cross-Modal Synthesis: Generating Semantic Segmentation Masks from Image Captions and Vice-Versa using Multi-Modal Transformers," Iconic Research And Engineering Journals, vol. 8, no. 11, pp. 2418-2423, 2025, doi: 10.64388/IREV8I11-1711964.