DiffX: Guide Your Layout to Cross-Modal Generative Modeling  

Zeyu Wang*1 Jingyu Lin*2 Yifei Qian3 Yi Huang4 Shicen Tian1
Bosong Chai1 Juncan Deng1 Qu Yang5 Lan Du2 Cunjian Chen2 Kejie Huang1
*Equal Contribution. Corresponding Author.
1Zhejiang University 2Monash University 3University of Nottingham 4Peking University 5Wuhan University

[Paper]     [Github]    


[Figure: six layout-guided generation examples. The first four each show a layout image, the generated RGB image, and the generated infrared image; the last two each show a layout image, the generated RGB image, and the generated depth map.]

Cross-Modal Dataset Construction

(a) The process of constructing the image captions. (b) Examples of cross-modal images, labels, and prepared captions.

Workflow

Workflow of our Multi-Path Variational AutoEncoder (MP-VAE). RGB+X modal encoding is used here for illustration; the framework can also support additional modal inputs and outputs.
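To make the "one path per modality, one shared latent" idea concrete, here is a minimal toy sketch. It is not the MP-VAE architecture: the class name `MultiPathAutoencoder`, the untrained linear maps, the averaging fusion, and the omission of variational sampling are all assumptions made purely for illustration.

```python
import numpy as np


class MultiPathAutoencoder:
    """Toy multi-path autoencoder (hypothetical, untrained, linear paths)."""

    def __init__(self, modal_dims, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One encoder path and one decoder path per modality.
        self.enc = {
            name: rng.normal(scale=0.1, size=(dim, latent_dim))
            for name, dim in modal_dims.items()
        }
        self.dec = {
            name: rng.normal(scale=0.1, size=(latent_dim, dim))
            for name, dim in modal_dims.items()
        }

    def encode(self, inputs):
        # Assumption: fuse the per-modality projections by averaging them
        # into a single shared latent.
        parts = [inputs[name] @ self.enc[name] for name in inputs]
        return np.mean(parts, axis=0)

    def decode(self, latent):
        # Every modality is decoded from the same shared latent.
        return {name: latent @ weight for name, weight in self.dec.items()}


rng = np.random.default_rng(1)
model = MultiPathAutoencoder({"rgb": 48, "x": 16}, latent_dim=8)
inputs = {"rgb": rng.normal(size=(2, 48)), "x": rng.normal(size=(2, 16))}
latent = model.encode(inputs)
outputs = model.decode(latent)
```

In this toy setup, supporting a further modality only requires another entry in `modal_dims`, which mirrors the claim that the framework can take additional modal inputs and outputs.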

Methodology

Diffusion models have made significant strides in language-driven and layout-driven image generation. However, most diffusion models are limited to generating visible RGB images. In fact, human perception of the world is enriched by diverse viewpoints, such as chromatic contrast, thermal illumination, and depth information. In this paper, we introduce DiffX, a novel diffusion model for general layout-guided cross-modal generation. DiffX presents a simple yet effective cross-modal generative modeling pipeline that conducts the diffusion and denoising processes in a modality-shared latent space. Moreover, we introduce the Joint-Modality Embedder (JME), which incorporates a gated attention mechanism to enhance the interaction between layout and text conditions. To facilitate user-instructed training, we construct cross-modal image datasets with detailed text captions, produced by a Large Multimodal Model (LMM) and refined through our human-in-the-loop process. Through extensive experiments, DiffX demonstrates robustness in cross-modal “RGB+X” image generation on the FLIR, MFNet, and COME15K datasets, guided by various layout conditions. It also shows potential for the adaptive generation of “RGB+X+Y(+Z)” images, or more diverse modalities, on the COME15K and MCXFace datasets.
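The text above does not spell out the gating used in the JME, so the following is only a sketch of one common form of gated attention: a residual attention update scaled by `tanh` of a learnable scalar. The function names, the identity query/key/value projections, and the scalar gate are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention over 2-D (tokens, dim) arrays.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def gated_attention(visual, cond, gate):
    """Residual attention over [visual; cond] tokens, scaled by tanh(gate).

    visual: (N, d) visual tokens; cond: (M, d) condition tokens
    (for example layout or text embeddings); gate: scalar.
    Identity projections are used for brevity (an assumption).
    """
    tokens = np.concatenate([visual, cond], axis=0)
    # Keep only the outputs that correspond to the visual tokens.
    attended = attention(tokens, tokens, tokens)[: visual.shape[0]]
    return visual + np.tanh(gate) * attended


rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 8))
cond = rng.normal(size=(4, 8))
out_closed = gated_attention(visual, cond, gate=0.0)  # gate closed
out_open = gated_attention(visual, cond, gate=1.0)    # gate partly open
```

With the gate at zero the layer is an identity map on the visual tokens, so a gate initialized this way lets the conditions be blended in gradually during training.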

Comparisons

As shown in the figure below, DiffX produces qualitatively better results than the baseline methods. This supports the effectiveness of employing Long-CLIP for caption embedding and our JME for joint-modal connection.

The results in the table below show that DiffX outperforms all baseline models on every reported metric.

Analysis

The ablation studies and potential applications are presented here.

Ablation Study

First, we conduct an ablation study on the Laplacian Pyramid (LP) in our MP-VAE. Second, we compare unique-modal generation with the cross-modal RGB+X generation of DiffX.
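For readers unfamiliar with the Laplacian pyramid, the sketch below shows the generic decomposition: each level stores the detail lost by downsampling, and the image is recovered by adding the details back. It uses 2x2 average pooling and nearest-neighbour upsampling instead of the usual Gaussian filtering, and it does not reflect how the LP is wired into MP-VAE; all function names are illustrative.

```python
import numpy as np


def downsample(img):
    # 2x2 average pooling; assumes even height and width.
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample(img):
    # Nearest-neighbour upsampling by a factor of 2 along both axes.
    return img.repeat(2, axis=0).repeat(2, axis=1)


def build_pyramid(img, levels):
    """Return (detail bands from fine to coarse, low-resolution residual)."""
    bands = []
    current = img
    for _ in range(levels):
        smaller = downsample(current)
        # Detail band: what is lost when going down one level.
        bands.append(current - upsample(smaller))
        current = smaller
    return bands, current


def reconstruct(bands, residual):
    # Walk back up from the coarsest level, restoring detail at each step.
    current = residual
    for band in reversed(bands):
        current = upsample(current) + band
    return current


rng = np.random.default_rng(0)
image = rng.random((32, 32))
bands, residual = build_pyramid(image, levels=3)
restored = reconstruct(bands, residual)
```

Because every band is defined as the exact difference to the upsampled coarser level, the reconstruction recovers the input up to floating-point error.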


Effectiveness of long text captions

We conduct an ablation study on the impact of text captions on the SOD → RGB+D and Seg. → RGB+T tasks. The qualitative comparison in the figure below shows that DiffX effectively captures the key content of the captions, whereas the variant without caption embeddings generates broken or misaligned images, which degrades image quality.


Adaptation to Diverse-Modal Generation

Given that DiffX can generate cross-modal “RGB+X” images, we also ask whether the framework can be applied to robust, controllable, and versatile generation across more diverse modalities. We therefore conduct experiments on the COME15K and MCXFace datasets for “SOD → RGB+D+Edge” and “3DDFA → RGB+NIR+SWIR+T”, respectively. The qualitative results for both tasks are shown in the figures below.

Gallery

More results generated by our method are shown here.