From-scratch JAX implementation of Qwen3-VL. I reimplemented mRoPE, DeepStack, the KV cache, and the vision towers instead of wrapping the Hugging Face implementation.
Built as the foundation for vlm-gym and the geoguessing RL pipeline. The point was a lean codebase I could run on TPUs and modify without fighting a giant library stack.
