AstroAI Workshop 2026

Yang Cheng

A Multi-modal Learning Framework for JWST Imaging and Low-Resolution Spectra

Presenter: Yang Cheng (Max Planck Institute for Astronomy)

Title: A Multi-modal Learning Framework for JWST Imaging and Low-Resolution Spectra

Date/Time: Monday, June 15, 4:00 PM - 5:30 PM

Abstract: Self-supervised multimodal models have transformed natural language processing and computer vision, and are beginning to make an impact in astronomy through emerging foundation models. To date, most such models in astronomy have been developed on ground-based surveys with relatively rich and homogeneous paired modalities, such as SDSS and DESI. In contrast, space-based observatories such as JWST remain largely unexplored in self-supervised multimodal representation learning, owing to smaller data volumes and more heterogeneous instrument configurations.

We present a flexible self-supervised multimodal framework for learning unified representations of JWST imaging and low-resolution spectra. Our framework is explicitly designed to treat each imaging band and each spectrum as an independent modality, enabling robustness to missing inputs during both training and inference. We train a ViT-based encoder for imaging and a 1D BERT-style transformer autoencoder with masked modeling for spectra. Modality-specific embeddings are projected into a shared latent space and aligned with a pairwise contrastive objective, then fused through cross-attention token pooling to produce a unified representation that can reweight informative inputs while naturally down-weighting missing or uninformative ones. Preliminary results on JWST F150W imaging and NIRSpec Prism spectra show that the learned representations preserve both morphological and spectral information, achieving strong morphology classification performance (average F1 = 0.93) and accurate spectrum reconstruction (R² = 0.89). Using a single imaging band together with spectroscopy, we further obtain promising downstream performance for physical parameter inference from aligned multimodal embeddings, including redshift estimation with σNMAD(Δz/(1+z)) = 0.13 and stellar mass prediction with σlog M* = 0.32 dex. We expect these downstream capabilities to improve further as additional modalities are incorporated.
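For readers unfamiliar with the redshift metric quoted above, the following is a minimal sketch of how σNMAD is commonly computed. It assumes the widely used convention σNMAD = 1.4826 × median(|Δz − median(Δz)| / (1 + z_true)); the abstract does not state which variant was used, so the median subtraction and the 1.4826 scale factor are assumptions. The function name `sigma_nmad` and the toy redshift values are illustrative only and are not taken from the presented work.

```python
from statistics import median


def sigma_nmad(z_pred, z_true):
    """Normalized median absolute deviation of redshift residuals.

    Assumed convention (not confirmed by the abstract):
        1.4826 * median(|dz - median(dz)| / (1 + z_true)),
    where dz = z_pred - z_true.
    """
    if len(z_pred) != len(z_true) or len(z_true) == 0:
        raise ValueError("z_pred and z_true must be non-empty and equal in length")

    # Raw residuals between predicted and reference redshifts.
    dz = [p - t for p, t in zip(z_pred, z_true)]
    dz_median = median(dz)

    # Subtract the median offset, then scale each residual by (1 + z_true).
    scaled = [abs(d - dz_median) / (1.0 + t) for d, t in zip(dz, z_true)]

    # 1.4826 makes the statistic match the standard deviation
    # for normally distributed residuals.
    return 1.4826 * median(scaled)


# Toy values for illustration only (not results from the talk).
z_true = [1.0, 2.0, 3.0, 4.0, 5.0]
z_pred = [1.1, 2.0, 2.8, 4.4, 5.0]
print(sigma_nmad(z_pred, z_true))
```

Because the statistic is built from medians, a few catastrophic redshift outliers barely change it, which is why it is usually preferred over a plain standard deviation for this task.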
