Post-trained Qwen3VL-30B with Tinker on ~50k curated Street View pairs to imitate how professional geo-guessers work: think, zoom in, think more, then predict. The agent variant repeats the reason-and-zoom step on regions of interest before committing to a prediction.
Used Modal to scrape 50M+ images for training data. The model first learns via SFT to use a zoom-in tool baked into the Qwen3VL architecture, then is refined with RL.
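
To make the think / zoom / predict loop concrete, here is a minimal sketch of how such an episode could be structured. It is illustrative only and is not the project's actual code: `zoom_box`, `Zoom`, `Guess`, and `run_episode` are hypothetical names, the normalized `(x0, y0, x1, y1)` box format is an assumption, and the `policy` callable stands in for the model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def zoom_box(width: int, height: int,
             bbox_norm: Tuple[float, float, float, float]) -> Box:
    """Turn a normalized (x0, y0, x1, y1) box into a clamped pixel crop box.

    Assumed format: coordinates in [0, 1] relative to the current view.
    """
    x0, y0, x1, y1 = (min(max(v, 0.0), 1.0) for v in bbox_norm)
    left, right = sorted((round(x0 * width), round(x1 * width)))
    top, bottom = sorted((round(y0 * height), round(y1 * height)))
    # Guarantee a crop of at least 1x1 pixel that stays inside the view.
    left, top = min(left, width - 1), min(top, height - 1)
    right = min(max(right, left + 1), width)
    bottom = min(max(bottom, top + 1), height)
    return (left, top, right, bottom)


@dataclass
class Zoom:
    """Hypothetical tool call: zoom into a region of the current view."""
    bbox_norm: Tuple[float, float, float, float]


@dataclass
class Guess:
    """Hypothetical final answer: predicted coordinates."""
    lat: float
    lon: float


Action = Union[Zoom, Guess]


def run_episode(policy: Callable[[List[Box]], Action], width: int, height: int,
                max_steps: int = 4) -> Tuple[Optional[Guess], List[Box]]:
    """Alternate between policy decisions and zooms until a guess is made."""
    views: List[Box] = [(0, 0, width, height)]  # start from the full image
    for _ in range(max_steps):
        action = policy(views)
        if isinstance(action, Guess):
            return action, views
        # Each zoom is interpreted relative to the most recent view.
        l, t, r, b = views[-1]
        zl, zt, zr, zb = zoom_box(r - l, b - t, action.bbox_norm)
        views.append((l + zl, t + zt, l + zr, t + zb))
    return None, views  # step budget exhausted without a prediction


def scripted_policy(views: List[Box]) -> Action:
    """Stand-in for the model: zoom once into the top-right, then guess."""
    if len(views) < 2:
        return Zoom((0.5, 0.0, 1.0, 0.5))
    return Guess(lat=48.85, lon=2.35)
```

Calling `run_episode(scripted_policy, 1000, 500)` walks through one zoom and one guess. In a real pipeline the pixel box would be passed to an image library to crop and upscale the region before it is fed back to the model.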



