DiffX: Guide Your Layout to Cross-Modal Generative Modeling

Zeyu Wang*¹ Jingyu Lin*² Yifei Qian³ Yi Huang⁴ Shicen Tian¹
Bosong Chai¹ Juncan Deng¹ Qu Yang⁵ Lan Du² Cunjian Chen✝² Kejie Huang✝¹
*Equal Contribution. ✝Corresponding Author.
¹Zhejiang University ²Monash University ³University of Nottingham ⁴Peking University ⁵Wuhan University

[Paper] [Github]

Teaser: six layout images, each shown with its generated RGB image and either a generated infrared image (four examples) or a generated depth map (two examples).

Cross-Modal Dataset Construction

(a) The process of constructing the image captions. (b) Examples of cross-modal images, labels, and prepared captions.

Workflow

Workflow of our Multi-Path Variational AutoEncoder (MP-VAE). RGB+X modal encoding is used here for illustration, but the framework can support additional modal inputs and outputs.

MP-VAE

Workflow in Latent Space

Methodology

Diffusion models have made significant strides in language-driven and layout-driven image generation. However, most diffusion models are limited to visible RGB image generation, whereas human perception of the world is enriched by diverse viewpoints, such as chromatic contrast, thermal illumination, and depth information. In this paper, we introduce DiffX, a novel diffusion model for general layout-guided cross-modal generation. DiffX presents a simple yet effective cross-modal generative modeling pipeline that conducts the diffusion and denoising processes in a modality-shared latent space. Moreover, we introduce the Joint-Modality Embedder (JME), which incorporates a gated attention mechanism to enhance the interaction between layout and text conditions. To facilitate user-instructed training, we construct cross-modal image datasets with detailed text captions, produced by a Large Multimodal Model (LMM) and refined with a human-in-the-loop procedure. Through extensive experiments, DiffX demonstrates robustness in cross-modal “RGB+X” image generation on the FLIR, MFNet, and COME15K datasets, guided by various layout conditions. It also shows potential for the adaptive generation of “RGB+X+Y(+Z)” images, or of more diverse modalities, on the COME15K and MCXFace datasets.
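The page does not spell out how the JME's gated attention is built, so the following is only a minimal sketch of one common form of gating: text tokens attend to layout tokens, and a scalar gate passed through `tanh` scales how much of the attended result is added back. The function name, the projection matrices `w_q`, `w_k`, `w_v`, the token shapes, and the `tanh` residual form are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text_tokens, layout_tokens, w_q, w_k, w_v, gate):
    """Hypothetical gated fusion: text tokens query layout tokens.

    gate is a scalar (learnable in a real model); tanh(gate) = 0 leaves
    the text tokens unchanged, larger values mix in more layout context.
    """
    q = text_tokens @ w_q                      # (n_text, d)
    k = layout_tokens @ w_k                    # (n_layout, d)
    v = layout_tokens @ w_v                    # (n_layout, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    attended = softmax(scores, axis=-1) @ v    # (n_text, d)
    return text_tokens + np.tanh(gate) * attended

# Toy inputs: 8 text tokens and 4 layout tokens of width 16 (assumed sizes).
rng = np.random.default_rng(0)
d = 16
text = rng.standard_normal((8, d))
layout = rng.standard_normal((4, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

closed = gated_cross_attention(text, layout, w_q, w_k, w_v, gate=0.0)
opened = gated_cross_attention(text, layout, w_q, w_k, w_v, gate=1.0)
```

With the gate at zero the text tokens pass through untouched, which is why this style of gating is often initialised closed: the layout signal is blended in gradually as the gate is learned.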

Comparisons

As shown in the figure below, DiffX produces visibly better qualitative results than the baseline methods. This supports the effectiveness of employing Long-CLIP for caption embedding and our JME for joint-modal connection.

The results in the table below show that DiffX outperforms all baseline models on all metrics.

Qualitative Comparison

Quantitative Comparison

Analysis

The ablation studies and potential applications are presented here.

Ablation Study

First, we conduct an ablation study on the Laplacian Pyramid (LP) in our MP-VAE. Second, we compare unique-modal generation with the cross-modal RGB+X generation by DiffX.
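For readers unfamiliar with the term, a Laplacian pyramid splits an image into band-pass detail layers plus a low-resolution residual, and the original can be rebuilt exactly from those parts. The sketch below shows only this generic decomposition; how the LP is wired into MP-VAE is not described on this page. The 2x2 average pooling and nearest-neighbour upsampling are simplifying assumptions (a Gaussian blur is more usual), and the helper names are hypothetical.

```python
import numpy as np

def downsample(img):
    # Halve the resolution by averaging each 2x2 block.
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    # Double the resolution by repeating each pixel (nearest neighbour).
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_laplacian_pyramid(img, levels):
    """Return [detail_0, ..., detail_{levels-1}, low_res_residual].

    Height and width must be divisible by 2 ** levels.
    """
    pyramid = []
    current = img
    for _ in range(levels):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller))  # high-frequency band
        current = smaller
    pyramid.append(current)                          # coarsest residual
    return pyramid

def reconstruct(pyramid):
    # Invert the decomposition from the coarsest level upwards.
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = upsample(current) + detail
    return current

# Toy single-channel image standing in for one modality.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
pyr = build_laplacian_pyramid(image, levels=3)
restored = reconstruct(pyr)
```

Because each detail layer stores exactly what the downsampling step discarded, `restored` matches `image` up to floating-point error.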

Ablation study on the Laplacian Pyramid

Comparison between the unique- and cross-modal

Effectiveness of long text captions

We conduct an ablation study on the impact of text captions on the SOD → RGB+D and Seg. → RGB+T tasks. The qualitative comparison in the figure below shows that DiffX effectively captures the key content of the captions, while the variant model without caption embeddings generates broken or misaligned images, which degrades image quality.

Impact of text captions in cross-modal generation

Adaptation to Diverse-Modal Generation

Given that DiffX can generate cross-modal “RGB+X” images, we also ask whether the framework can be applied to robust, controllable, and versatile generation across more diverse modalities. We therefore conduct experiments on the COME15K and MCXFace datasets for the “SOD → RGB+D+Edge” and “3DDFA → RGB+NIR+SWIR+T” tasks, respectively. The qualitative results are shown in the figures below.

Adaptation to “SOD→RGB+D+Edge”

Adaptation to “3DDFA→RGB+NIR+SWIR+T”

Gallery

More results generated by our method are shown here.

“SOD → RGB+D+Edge” task

“3DDFA → RGB+NIR+SWIR+T” task
