GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics

Stanford University



Abstract

Autonomous drones capable of interpreting and executing high-level language instructions in unstructured environments remain a long-standing goal. Yet existing approaches are constrained by their dependence on hand-crafted skills, extensive parameter tuning, or computationally intensive models unsuitable for onboard use. We introduce GRaD-Nav++, a lightweight Vision-Language-Action (VLA) framework that runs fully onboard and follows natural-language commands in real time. Our policy is trained in a photorealistic 3D Gaussian Splatting (3DGS) simulator via Differentiable Reinforcement Learning (DiffRL), enabling efficient learning of low-level control from visual and linguistic inputs. At its core is a Mixture-of-Experts (MoE) action head, which adaptively routes computation to improve generalization while mitigating forgetting. In multi-task generalization experiments, GRaD-Nav++ achieves a success rate of 83% on trained tasks and 75% on unseen tasks in simulation. When deployed on real hardware, it attains 67% success on trained tasks and 50% on unseen ones. In multi-environment adaptation experiments, GRaD-Nav++ achieves an average success rate of 81% across diverse simulated environments and 67% across varied real-world settings. These results establish a new benchmark for fully onboard VLA flight and demonstrate that compact, efficient models can enable reliable, language-guided navigation without relying on external infrastructure.


Overview of GRaD-Nav++

[Figure: GRaD-Nav++ architecture overview]

Our GRaD-Nav++ architecture leverages a Vision-Language Model (VLM) to provide the downstream policy network with an informative, semantically rich representation of the environment conditioned on high-level language commands. The Mixture-of-Experts (MoE) action head adaptively routes computation through specialized experts to enhance generalization and mitigate forgetting when learning multiple tasks. The entire model is trained end-to-end via Differentiable Reinforcement Learning (DiffRL) in a photorealistic 3D Gaussian Splatting (3DGS) simulator, enabling efficient learning of low-level control from visual and linguistic inputs.
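To make the MoE routing concrete, below is a minimal PyTorch sketch of an MoE action head with a learned gating network. The expert count, feature and action dimensions, and module names (e.g. MoEActionHead) are illustrative assumptions, not the exact GRaD-Nav++ implementation.

# Minimal sketch of a mixture-of-experts (MoE) action head (PyTorch).
# Dimensions, expert count, and naming are illustrative assumptions,
# not the exact GRaD-Nav++ implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEActionHead(nn.Module):
    def __init__(self, feat_dim=256, action_dim=4, num_experts=4, hidden=128):
        super().__init__()
        # Each expert is a small MLP mapping fused vision-language features to actions.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ELU(),
                          nn.Linear(hidden, action_dim))
            for _ in range(num_experts)
        ])
        # Gating network produces per-expert routing weights from the same features.
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, features):
        weights = F.softmax(self.gate(features), dim=-1)                       # (B, E)
        expert_out = torch.stack([e(features) for e in self.experts], dim=1)   # (B, E, A)
        action = (weights.unsqueeze(-1) * expert_out).sum(dim=1)               # weighted mixture
        return action, weights  # weights can be logged as per-expert "loading"

# Example: a batch of fused vision-language features -> low-level action commands.
head = MoEActionHead()
feats = torch.randn(8, 256)
actions, gate_weights = head(feats)

Because the gate and the experts are ordinary differentiable modules, gradients from a differentiable-simulation objective can flow through the whole head during end-to-end training, and the gating weights double as the per-expert usage signal discussed in the MoE Loading section below.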


Results of GRaD-Nav++

Multi-Task Long-Horizon Generalization Experiment

[Figure: example trajectories on unseen long-horizon tasks]

Example trajectories of untrained long-horizon tasks. The instructions are “GO THROUGH gate then STOP over CART” (left) and “FLY past the RIGHT side of the gate then STOP over MONITOR” (right).

Multi-Environment Adaptation Experiment

[Figure: multi-environment adaptation, real-world frame overlays]

Demonstration of multi-environment adaptation in real-world experiments using video frame overlay visualization. The top row shows the drone flying to the left of, through, above, and to the right of the gate in the middle-gate environment. The bottom row shows the drone executing the same directional tasks in the left-gate environment. Red arrowed curves illustrate the approximate flight trajectories. The learned policy demonstrates robust generalization to varying environments, adapting to changes in gate positions and the presence of distractor objects.

Task Shift Experiment

[Figure: task-switching experiment]

Task-switching experiment with instruction change at step 100. Left: Experiment scene showing the drone flying past the gate. Right: Normalized cosine similarity between text and visual embeddings over time, reflecting the VLM's ability to re-ground the new instruction.
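As a rough illustration of the grounding signal plotted on the right, here is a hedged sketch of how a normalized cosine similarity between the current text embedding and visual embedding could be computed per step. The embedding dimension and the rescaling to [0, 1] are assumptions made for plotting convenience, not necessarily the paper's exact normalization.

# Sketch: per-step normalized cosine similarity between a text embedding and
# the current visual embedding, used as a grounding signal when the instruction changes.
# Embeddings are assumed to come from the VLM's text/image encoders (shapes illustrative).
import torch
import torch.nn.functional as F

def grounding_score(text_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity in [-1, 1], rescaled to [0, 1] for plotting."""
    sim = F.cosine_similarity(text_emb, visual_emb, dim=-1)
    return 0.5 * (sim + 1.0)

# Example: the score would be expected to dip when a new instruction arrives
# (e.g. at step 100) and recover as the VLM re-grounds it in the current view.
text_emb = F.normalize(torch.randn(512), dim=-1)
visual_stream = F.normalize(torch.randn(200, 512), dim=-1)  # fake 200-step rollout
scores = grounding_score(text_emb.expand_as(visual_stream), visual_stream)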

MoE Expert Loading

[Figure: expert usage intensity across environments]

Experts' usage intensity when executing the same task (“GO THROUGH gate”) in different surrounding environments (top: left-gate environment; bottom: middle-gate environment). The results demonstrate that the MoE architecture adaptively allocates expert resources according to the specific demands of each environment.
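One simple way such a usage-intensity plot could be produced is by averaging the gating weights logged over a rollout. The sketch below assumes gate weights of shape (T, E), like those returned by the MoE head sketched earlier, and is purely illustrative.

# Sketch: aggregate per-step gating weights into per-expert usage intensity
# over a rollout, the kind of quantity visualized in the figure above.
# Assumes gate weights of shape (T, E) logged by an MoE head like the one sketched earlier.
import torch

def expert_usage(gate_weights: torch.Tensor) -> torch.Tensor:
    """Mean routing weight per expert over a trajectory of T steps."""
    return gate_weights.mean(dim=0)  # (E,)

rollout_weights = torch.softmax(torch.randn(150, 4), dim=-1)  # fake 150-step rollout
print(expert_usage(rollout_weights))  # higher values = more heavily used experts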


Citations

If you find our work useful in your research, please consider citing:


@article{chen2025grad,
  title={GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics},
  author={Chen, Qianzhong and Gao, Naixiang and Huang, Suning and Low, JunEn and Chen, Timothy and Sun, Jiankai and Schwager, Mac},
  journal={arXiv preprint arXiv:2506.14009},
  year={2025}
}