Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision
Researchers propose a hybrid approach that pairs an energy-based model (EBM) with a multimodal variational auto-encoder (VAE) to address a core limitation in multimodal generative modeling: capturing complex cross-modal dependencies. Standard multimodal VAEs rely on unimodal Gaussian posteriors, which cannot represent intricate inter-modal structure, while EBMs struggle with MCMC sampling in high-dimensional joint spaces. The proposed method uses the VAE to guide MCMC revision during EBM training, which could yield more coherent joint representations across text, image, and audio domains. The technique is relevant to practitioners building systems that must reason across modalities without falling back on oversimplified latent assumptions.
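
To make the idea of "MCMC revision" concrete, here is a minimal toy sketch, not the authors' implementation. It assumes that revision means starting a short Langevin chain from samples produced by a cheaper generator and moving them toward low-energy regions of a joint energy; the summary above does not spell out the exact procedure. The quadratic energy, the constants `MU` and `P`, the function names, and the standard-normal stand-in for generator samples are all hypothetical choices made for illustration.

```python
import numpy as np

# Toy joint energy over two 1-D "modalities" x = (x_a, x_b):
#   E(x) = 0.5 * (x - MU)^T P (x - MU)
# The off-diagonal entries of P stand in for cross-modal dependence.
# MU and P are illustrative values, not taken from the paper.
MU = np.array([2.0, 2.0])
P = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))


def energy(x):
    """Energy of each row of x (shape: [n, 2])."""
    d = x - MU
    return 0.5 * np.einsum("ni,ij,nj->n", d, P, d)


def energy_grad(x):
    """Gradient of the energy for each row of x (P is symmetric)."""
    return (x - MU) @ P


def langevin_revise(x0, n_steps=200, step=0.05, rng=None):
    """Short-run Langevin dynamics started from x0.

    Update: x <- x - (step / 2) * grad E(x) + sqrt(step) * noise
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step * energy_grad(x) + np.sqrt(step) * noise
    return x


rng = np.random.default_rng(0)
# Stand-in for generator samples: an uncorrelated Gaussian that ignores
# the cross-modal structure (assumption, used only for illustration).
x_init = rng.standard_normal((1000, 2))
x_rev = langevin_revise(x_init, rng=rng)

print("mean energy before revision:", energy(x_init).mean())
print("mean energy after revision: ", energy(x_rev).mean())
```

In this toy setting the initial samples treat the two coordinates as independent, while the revised samples should pick up the correlation encoded in the energy. In a real system the energy would be a neural network over high-dimensional inputs and the gradient would come from automatic differentiation rather than a closed form.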

















