AstroAI Workshop 2026
Atal Agrawal
Beyond Pattern Matching: Bridging the Sim-to-Real Gap in Transient Flare Verification Using Vision Language Models
Presenter: Atal Agrawal (Indian Institute)
Title: Beyond Pattern Matching: Bridging the Sim-to-Real Gap in Transient Flare Verification Using Vision Language Models
Date/Time: Monday, June 15, 4:00 PM - 5:30 PM
Abstract: Detecting transient flares in quasar light curves is fundamentally an anomaly detection challenge. While statistical pipelines excel at flagging candidate events, their output still requires visual verification to differentiate genuine astrophysical transients from instrumental or stochastic noise artifacts. In the era of large-scale surveys, this manual inspection forms a critical operational bottleneck. Automating this verification is severely constrained by a stark data limitation: no labeled real flare data exist for the survey under analysis. Furthermore, supervised cross-survey transfer is infeasible because of inherent differences in observational cadence, photometric depth, and noise properties. To overcome this constraint, we evaluate two contrasting paradigms for verifying flare candidates within the SDSS Stripe 82 legacy survey: a supervised Convolutional Neural Network (CNN) trained on synthetic data, and Vision Language Models (VLMs) deployed in a zero-shot framework. Our statistical front-end models baseline quasar variability using an iterative Ornstein-Uhlenbeck process, applying Extreme Value Theory to isolate 92 anomalous real candidates from a sample of 9,258 spectroscopically confirmed quasars observed over a ~10-year baseline. To train the supervised model, we generate a synthetic dataset of 37,024 light curves by drawing Damped Random Walk (DRW) realizations on the exact observed Modified Julian Date (MJD) timestamps, injecting empirical photometric errors, and overlaying three morphologically distinct synthetic flare profiles (FRED, Gaussian, and Gamma) alongside pure DRW baselines and single-point spike artifacts. These light curves are rendered as minimalist scatter plots devoid of axes or labels to force classification based strictly on geometric and morphological reasoning. A ResNet-18 trained from scratch on this synthetic corpus achieves a five-class accuracy of 70.59% on held-out synthetic test data.
However, when deployed on the 92 real Stripe 82 candidates, the network classifies every single event as a flare, a complete generalization failure that underscores a profound synthetic-to-real domain gap. Conversely, while individual state-of-the-art VLMs underperform on the synthetic test set, with GPT-5 reaching only 42.8% five-class accuracy, our optimized VLM recognition engine successfully bridges the domain gap. This architecture utilizes a dual-classifier framework, pairing Grok-4.1-fast for high recall and Qwen-3.5-plus for high precision, with GPT-5 acting as a meta-evaluator. These models were selected from a comprehensive benchmark of 12 proprietary and open-weight models (including Claude 3 Haiku, Mistral Large, Kimi-k2.5, and variants of Qwen2.5VL and Qwen3VL). Our VLM engine achieves 55.2% precision and 59.3% recall on the real candidates. These results demonstrate that while supervised networks succumb to synthetic pattern memorization during training, VLMs leverage zero-shot morphological reasoning alongside embedded astronomical priors to generalize to real data. This work establishes a scalable, survey-agnostic paradigm for automated transient verification, holding immediate utility for upcoming massive data streams such as the Vera C. Rubin Observatory’s Legacy Survey of Space and Time (LSST).
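Illustrative sketch (not the presenter's code): the synthetic light-curve recipe described in the abstract, a DRW realization drawn on the observed timestamps with photometric noise and an injected flare, might look roughly like the following. It assumes NumPy, uses the standard exact Ornstein-Uhlenbeck update between irregularly spaced epochs, and models the FRED shape as a two-sided exponential. The function names and every default parameter value (tau, sigma, flare amplitude, rise and decay times) are hypothetical placeholders, not values from the talk.

```python
import numpy as np


def simulate_drw(times, tau, sigma, mean, rng):
    """Draw one DRW (Ornstein-Uhlenbeck) realization on irregular timestamps.

    tau is the damping timescale (same units as times); sigma is the
    asymptotic standard deviation of the process.
    """
    times = np.asarray(times, dtype=float)
    x = np.empty_like(times)
    x[0] = mean + sigma * rng.standard_normal()
    for i in range(1, len(times)):
        # Exact conditional update of an OU process over the gap dt.
        decay = np.exp(-(times[i] - times[i - 1]) / tau)
        step_std = sigma * np.sqrt(1.0 - decay**2)
        x[i] = mean + (x[i - 1] - mean) * decay + step_std * rng.standard_normal()
    return x


def fred_profile(times, t_peak, amplitude, tau_rise, tau_decay):
    """Fast-rise, exponential-decay flare (assumed two-sided exponential form)."""
    dt = np.asarray(times, dtype=float) - t_peak
    # Use the short timescale before the peak and the long one after it.
    timescale = np.where(dt < 0, tau_rise, tau_decay)
    return amplitude * np.exp(-np.abs(dt) / timescale)


def make_synthetic_curve(times, mag_err, rng, tau=200.0, sigma=0.15,
                         mean=19.0, flare_amp=0.5):
    """Combine a DRW baseline, one FRED flare, and per-epoch Gaussian noise."""
    times = np.asarray(times, dtype=float)
    baseline = simulate_drw(times, tau, sigma, mean, rng)
    t_peak = rng.uniform(times.min(), times.max())
    flare = fred_profile(times, t_peak, flare_amp, tau_rise=10.0, tau_decay=60.0)
    noise = rng.normal(0.0, mag_err)
    # A brightening flare lowers the magnitude, hence the subtraction.
    return baseline - flare + noise
```

Called with the real MJD epochs and per-epoch magnitude errors of one quasar, `make_synthetic_curve(mjd, mag_err, np.random.default_rng(0))` returns one synthetic magnitude series on the same sampling. The Gaussian and Gamma profiles, pure-DRW baselines, and single-point spikes named in the abstract would be further variants of the same pattern.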