I'm working on a new version of this, with a bigger dataset and a cleaner agent loop. If that's interesting to you, reach out: [@sdand](https://x.com/sdand) or [email protected].
Post-trained Qwen3VL-30B on ~50k curated Street View pairs with Tinker to imitate how professionals geo-guess: think, zoom in, think more, then predict. The agent variant iteratively reasons and zooms into regions of interest.
Used Modal to scrape 50M+ images for training data. The model first learns, via SFT, to use a zoom-in tool baked into the Qwen3VL architecture, and is then refined with RL.
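
To make the think, zoom, predict loop concrete, here is a minimal sketch in plain Python. It is not the project's code: `Zoom`, `Predict`, `zoom_in`, `run_agent`, and `scripted_policy` are hypothetical names, the image is a toy grid of integers rather than a Street View frame, and the scripted policy stands in for the model. The real tool-call format and step budget are not described here, so both are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

# Assumption: a zoom request is a normalized box (x0, y0, x1, y1) in [0, 1].
Box = Tuple[float, float, float, float]
# Toy stand-in for an image: rows of pixel values.
Image = List[List[int]]


@dataclass
class Zoom:
    box: Box


@dataclass
class Predict:
    lat: float
    lon: float


Action = Union[Zoom, Predict]


def zoom_in(image: Image, box: Box) -> Image:
    """Crop the normalized box out of the image (hypothetical zoom tool)."""
    h, w = len(image), len(image[0])
    # Clamp the box to the image so a sloppy request still yields a valid crop.
    x0, y0, x1, y1 = (min(max(v, 0.0), 1.0) for v in box)
    left = min(int(x0 * w), w - 1)
    top = min(int(y0 * h), h - 1)
    # Guarantee a crop of at least one pixel in each dimension.
    right = min(max(int(x1 * w), left + 1), w)
    bottom = min(max(int(y1 * h), top + 1), h)
    return [row[left:right] for row in image[top:bottom]]


def run_agent(
    image: Image,
    policy: Callable[[Image, int], Action],
    max_steps: int = 4,
) -> Tuple[Predict, List[Action]]:
    """Alternate between policy decisions and zooms until a prediction is made."""
    view = image
    trace: List[Action] = []
    for step in range(max_steps):
        steps_left = max_steps - step - 1
        action = policy(view, steps_left)
        trace.append(action)
        if isinstance(action, Predict):
            return action, trace
        # Each zoom is applied to the current view, so zooms compound.
        view = zoom_in(view, action.box)
    raise RuntimeError("policy used its whole step budget without predicting")


def scripted_policy(view: Image, steps_left: int) -> Action:
    """Stand-in for the model: zoom into the lower-right quadrant, then guess."""
    if steps_left > 0 and len(view) > 2:
        return Zoom(box=(0.5, 0.5, 1.0, 1.0))
    return Predict(lat=48.8566, lon=2.3522)  # arbitrary placeholder coordinates
```

Calling `run_agent(image, scripted_policy)` returns the final `Predict` together with the full action trace. In a setup like the one described, a trace of this kind is what SFT would imitate and what RL would score, though the actual training format may differ.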



