Benchmarking Commonsense Visual Reasoning for Vision-Language Models
Co-author · Research with Prof. Ernest Davis (NYU) · 2025 · Manuscript in preparation
A diagnostic study of commonsense visual reasoning in vision-language models, focusing on visibility, occlusion, and viewpoint shifts.
My role
- Built a diagnostic benchmark (100 base + 100 counterfactual images; 100 questions + 100 counterfactual flips) along with automatic graders
- Evaluated six VLMs (including ChatGPT, Claude, and LLaVA) and analyzed their hallucination and abstention behavior