GEOBench-VLM

Benchmarking Vision-Language Models for Geospatial Tasks

Overview

GEOBench-VLM is a comprehensive benchmarking framework designed to evaluate vision-language models on the unique challenges of geospatial data. Unlike traditional computer vision benchmarks, GEOBench-VLM addresses domain-specific requirements including temporal analysis, fine-grained object detection in satellite imagery, damage assessment, and complex spatial reasoning tasks.

The benchmark provides a standardized evaluation methodology for assessing model capabilities across diverse Earth observation scenarios. It helps researchers and practitioners understand model strengths and limitations, and it guides future development of geospatial AI systems. GEOBench-VLM has evolved through multiple iterations, incorporating community feedback and expanding task coverage.

Related Research

Geo-bench-2

An evolution of the benchmarking framework that shifts focus from pure performance metrics to capability assessment. Geo-bench-2 provides deeper insights into what geospatial AI models can actually do, moving beyond simple accuracy scores to understand functional capabilities and limitations across diverse Earth observation tasks.

[arXiv 2025] Simumba, Naomi, et al. "Geo-bench-2: From performance to capability, rethinking evaluation in geospatial AI." arXiv preprint arXiv:2511.15658 (2025).

GEOBench-VLM

The foundational benchmark for evaluating vision-language models on geospatial tasks. GEOBench-VLM introduces comprehensive evaluation protocols for temporal analysis, object detection, damage assessment, and spatial reasoning, establishing standards for measuring VLM performance in Earth observation applications.

[ICCV 2025] Danish, Muhammad, et al. "GEOBench-VLM: Benchmarking vision-language models for geospatial tasks." Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2025.

EarthDial

A practical application demonstrating the capabilities measured by GEOBench-VLM. EarthDial transforms multi-sensory Earth observations into interactive dialogues, showcasing how vision-language models can enable natural language interfaces for complex geospatial analysis and decision-making tasks.

[CVPR 2025] Soni, Sagar, et al. "EarthDial: Turning multi-sensory earth observations to interactive dialogues." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2025.