I'm working on a new version of this — bigger dataset, cleaner agent loop. If that's interesting to you, reach out: [@sdand](https://x.com/sdand) or [email protected].
Vista3 is Qwen3VL-30B post-trained to geo-guess. On IM2GPS3K it scores below GPT-5.2 and Opus-4.5 but closes most of the gap at the coarser distance thresholds. I'd taken a run at this a year earlier with a GeoCLIP-based app, but that was a contrastive embedding model, and I ended up pouring most of the effort into the app rather than the model. This time I started from the model itself, reimplementing Qwen3VL by hand in JAX as an RL gym, vlm-gym.
The training data came from scraping 50M+ Street View images with Modal and narrowing them down to roughly 50k curated pairs. I then post-trained on those pairs with Tinker in two stages: first SFT to teach the format (how to think through a guess and how to call the zoom-in tool baked into Qwen3VL), then RL to predict country, region, and coordinates against a log-distance reward. That left me with three variants: baseline, thinking, and agent. The agent learned to spend test-time compute iterating through the same loop a human does: thinking, zooming into a region, and thinking again before it commits to a guess.
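For a sense of what a log-distance reward can look like, here is a minimal sketch. It is an illustration under assumptions, not the exact reward used for Vista3: the names `haversine_km` and `log_distance_reward`, the normalization to the range 0 to 1, and the choice of the antipodal distance as the worst case are all hypothetical. The idea is that the reward falls with the logarithm of the great-circle error, so moving from 1,000 km to 100 km off earns about as much as moving from 100 km to 10 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
MAX_DISTANCE_KM = math.pi * EARTH_RADIUS_KM  # antipodal distance, roughly 20,015 km


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def log_distance_reward(pred, target):
    """Hypothetical reward: 1.0 for a perfect guess, 0.0 at the antipode.

    `pred` and `target` are (lat, lon) tuples in degrees. The shape is an
    assumption made for illustration, not the reward from the post.
    """
    d = haversine_km(pred[0], pred[1], target[0], target[1])
    return 1.0 - math.log1p(d) / math.log1p(MAX_DISTANCE_KM)


if __name__ == "__main__":
    paris = (48.8566, 2.3522)
    london_guess = (51.5074, -0.1278)
    tokyo_guess = (35.6762, 139.6503)
    # The closer guess should earn the higher reward.
    print(log_distance_reward(london_guess, paris))
    print(log_distance_reward(tokyo_guess, paris))
```

Using `log1p` rather than `log` keeps the reward finite when the guess is exact. A reward that also scores country and region, as the RL stage does, would need extra terms that this sketch leaves out.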
You can watch the thinking variant reason through guesses live — each marker is a rollout on the globe, and the panel shows the model thinking, calling the zoom-in tool, and committing to coordinates.



