
Geospot

Teaching models to geoguess

Author: Surya Dantuluri

Geospot is a line of work on visual geolocation. It began with an app built on image embeddings trained on street-view data, and later expanded into online reinforcement learning, image-to-GPS embedding models, and vision-language models for geoguessing.

July 2024 — Geospot app

The first version was a SwiftUI app that guessed where you are in the world from a single image using image embeddings trained on street-view data. It spread quickly and capped out at the 10,000-person TestFlight limit overnight. The app was less important than the question behind it: whether geolocation could be a useful setting for training visual systems.
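The core idea can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the shipped SwiftUI code: embed the query photo, then rank a bank of street-view embeddings with known GPS coordinates by cosine similarity.

```python
# Hypothetical sketch of embedding-based geolocation: find the
# reference images whose embeddings are closest to the query's.
import numpy as np

def guess_location(query_emb, bank_embs, bank_coords, k=5):
    """Average the (lat, lon) of the k most similar reference images."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    top = np.argsort(-(b @ q))[:k]          # cosine-similarity ranking
    return bank_coords[top].mean(axis=0)

# Toy bank: three reference embeddings with known coordinates.
bank = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
coords = np.array([[48.85, 2.35], [35.68, 139.69], [40.71, -74.01]])
print(guess_location(np.array([0.9, 0.1]), bank, coords, k=1))
```

At production scale the linear scan would be replaced by an approximate nearest-neighbor index, but the retrieval framing is the same.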

October 2025 — Geospot Infinity

Geospot Infinity explored online reinforcement learning for geolocation. A 330k-parameter policy consumed embeddings from a 450M-parameter GeoCLIP model and updated online from user interaction. It improved at first, but the setup exposed a basic limit: online RL is bottlenecked by the quality of the users supplying feedback, and in this case the user distribution was not strong enough to carry the system further.
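The shape of that setup can be sketched as follows. This is a hedged toy version, not the Geospot Infinity code: a small linear policy over frozen embeddings, updated with a REINFORCE-style rule in which user feedback plays the role of the reward. All sizes and the feedback rule here are made up for illustration.

```python
# Toy online-RL loop: a linear policy over frozen image embeddings,
# nudged by a policy-gradient step after each interaction.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, N_CELLS = 8, 2            # toy sizes, not the real ~330k params
W = np.zeros((N_CELLS, EMB_DIM))   # policy logits = W @ embedding
lr = 0.1

def act(emb):
    logits = W @ emb
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(N_CELLS, p=p), p

def update(emb, cell, p, reward):
    """REINFORCE step: scale the gradient of log pi(cell|emb) by reward."""
    global W
    grad = -np.outer(p, emb)       # d log pi / d W over all cells
    grad[cell] += emb              # plus the chosen cell's term
    W += lr * reward * grad

# Simulated interactions: the "right" cell depends on the embedding,
# and feedback rewards correct guesses and penalizes wrong ones.
for _ in range(2000):
    emb = rng.normal(size=EMB_DIM)
    target = 0 if emb[0] > 0 else 1
    cell, p = act(emb)
    update(emb, cell, p, reward=1.0 if cell == target else -0.1)
```

The toy works because the simulated feedback is consistent; the real bottleneck described above is exactly what happens when the reward signal coming from users is not.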

December 2025 — geospot-base

The next step was to move from the app and the online policy to the underlying model. After downloading 51 million street-view images and training 449 embedding runs, I released geospot-base, a 400M-parameter image-to-GPS embedding model trained on public Flickr, Mapillary, and related data. A larger variant, geospot-pro, reached state of the art on the 750 km benchmark.
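My reading of the GeoCLIP-style objective behind an image-to-GPS embedding model, as a hedged sketch rather than the released training code: an image encoder and a GPS encoder are trained with a symmetric contrastive (InfoNCE) loss so that matching (image, location) pairs outscore in-batch negatives.

```python
# Symmetric InfoNCE over a batch of (image, GPS) embedding pairs.
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def info_nce(img_embs, gps_embs, temperature=0.07):
    """Contrastive loss: row i's positive is column i, rest are negatives."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    gps = gps_embs / np.linalg.norm(gps_embs, axis=1, keepdims=True)
    logits = img @ gps.T / temperature      # (B, B) similarity matrix
    diag = np.arange(len(logits))
    i2g = (logits - logsumexp(logits, axis=1))[diag, diag]  # image -> GPS
    g2i = (logits - logsumexp(logits, axis=0))[diag, diag]  # GPS -> image
    return -(i2g + g2i).mean() / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
print(info_nce(img, img))        # aligned pairs give a low loss
print(info_nce(img, img[::-1]))  # mismatched pairs give a higher loss
```

In a real pipeline the GPS side is typically encoded with positional features of latitude and longitude rather than raw coordinates, and the encoders are deep networks; the loss itself is the part sketched here.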

February 2026 — GeoVLM

GeoVLM moved the project from embeddings to tool-using vision-language models. It post-trained Qwen3VL-30B for geoguessing in baseline, thinking, and agent variants. The part I cared about was the zoom-in tool. Instead of making a guess from one full image, the model could inspect a sign, a road marking, a storefront, a power line, or a patch of vegetation and keep looking when something seemed useful. That made the system feel less like retrieval and more like curiosity: see something small, zoom in, update the guess, and keep going.
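That loop can be rendered as a sketch. Everything here is hypothetical scaffolding: the real agent runs on Qwen3VL-30B, and `ScriptedVLM` and `FakeImage` are stand-ins I made up so the control flow is runnable.

```python
# Agent loop: alternate between zooming into a region and refining
# the coordinate guess, until the model decides to stop.

def geoguess(vlm, image, max_steps=6):
    guess, notes = None, []
    for _ in range(max_steps):
        action = vlm.decide(image, notes, guess)
        if action["type"] == "zoom":
            # Inspect a small region: a sign, a road marking, a storefront.
            notes.append(vlm.describe(image.crop(action["box"])))
        elif action["type"] == "guess":
            guess = action["latlon"]     # refine the running estimate
        else:                            # "stop": confident enough
            break
    return guess

class ScriptedVLM:
    """Stand-in model that replays a fixed action script."""
    def __init__(self, script):
        self.script = list(script)
    def decide(self, image, notes, guess):
        return self.script.pop(0)
    def describe(self, crop):
        return f"inspected {crop}"

class FakeImage:
    def crop(self, box):
        return box

vlm = ScriptedVLM([
    {"type": "zoom", "box": (120, 40, 260, 90)},   # e.g. a street sign
    {"type": "guess", "latlon": (46.20, 6.14)},
    {"type": "stop"},
])
print(geoguess(vlm, FakeImage()))   # (46.2, 6.14)
```

The interesting design question is entirely inside `decide`: what makes a patch worth a closer look, and when is the running guess good enough to stop.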

What Geospot became

Geospot became a way to return to the same question from different angles: app, online RL, embedding models, data pipeline, visualizer, and VLMs. What kept pulling me back was that geolocation only looks simple from far away. A single image can contain language, infrastructure, weather, terrain, architecture, and local habits all at once. To geoguess well, a model has to decide what matters, what deserves a closer look, and what can be ignored.