PI: Grant Van Horn · University of Massachusetts Amherst
Fine-grained visual classification (FGVC) has flourished thanks to large-scale image datasets such as iNaturalist and to algorithmic contributions such as ResNet, ViT, Vision-Language Models (VLMs) like CLIP, and Multimodal Large Language Models (MLLMs). VLMs and MLLMs have nonetheless drawn increased interest in FGVC because, surprisingly, they underperform more classical, simpler, and smaller approaches. We are currently investigating the root causes of this underperformance and possible solutions, focusing on three goals: ensuring that VLM/MLLM responses are visually grounded (e.g., through fine-tuning), developing more faithful ways to evaluate their responses (evaluation procedures and benchmarks), and enabling predictions to be steered with expert knowledge (Visipedia).
Cumulative usage metrics (Jul 2, 2025 – Jul 2, 2026): data delivered over the OSDF, jobs, files via OSDF, CPU hours, GPU hours.