LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding

1University of Michigan, 2Fundamental AI Research (FAIR), Meta, 3Carnegie Mellon University, 4Stanford University

Overview

TL;DR: We train 3D vision-language grounding (3D VLG) models that are supervised only in 2D, using 2D losses and differentiable rendering.

Our approach to 3D vision-language understanding is to train a feedforward model that makes predictions in 3D but never requires 3D labels: it is supervised entirely in 2D, using 2D losses and differentiable rendering. This render-supervised formulation is new for vision-language understanding. By treating the reconstruction as a "latent variable", we can render the model's outputs without placing unnecessary constraints on the network architecture (e.g., it can be used with decoder-only models). Training requires only images, camera poses, and 2D labels, and we show that even the 2D labels can be removed by using pseudo-labels from pretrained 2D models. We use this pipeline to pretrain a network, then finetune it for 3D vision-language understanding tasks. This approach outperforms state-of-the-art baselines for 3D vision-language grounding, and also outperforms other 3D pretraining techniques.
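To make the idea concrete, here is a minimal sketch of one render-supervised training step. This is illustrative pseudocode in PyTorch style, not the released implementation: the names (train_step, renderer, the batch keys) are our own, and we assume a splatting-based differentiable renderer that returns RGB and mask channels.

import torch
import torch.nn.functional as F

def train_step(model, renderer, batch, optimizer):
    # Feedforward prediction in 3D. The predicted Gaussians (positions,
    # colors, opacities, grounding-mask features) are never compared to
    # 3D labels -- the reconstruction acts as a latent variable.
    gaussians = model(batch["point_cloud"], batch["text_query"])

    loss = 0.0
    for view in batch["views"]:
        # Differentiable Gaussian-splatting render into a posed camera.
        rendered = renderer(gaussians, view["camera"])

        # All supervision is 2D: a photometric loss against the posed
        # image, and a mask loss against 2D labels (or pseudo-labels).
        loss = loss + F.l1_loss(rendered["rgb"], view["image"])
        loss = loss + F.binary_cross_entropy_with_logits(
            rendered["mask_logits"], view["mask_2d"]
        )

    optimizer.zero_grad()
    loss.backward()  # gradients flow through the renderer into the 3D model
    optimizer.step()
    return float(loss)

Because the renderer is the only bridge between the 3D prediction and the 2D losses, any architecture whose outputs are renderable can be dropped into model unchanged.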


3D Referential Grounding: Taking a point cloud and language queries as input, LIFT-GS reconstructs the scene as Gaussian splats and grounds the queried nouns in the 3D scene.

Reconstruction, Recognition, and Reorganization (The Three R's)

Without any 3D supervision, our model performs 3D reconstruction and open-vocabulary recognition, which together enable 3D reorganization. Below we show 3D scenes reconstructed from sparse point-cloud inputs, along with open-vocabulary segmentation results for language queries.


Learning from 2D Foundation Models

We leverage 2D foundation models to generate pseudo-labels for training our 3D VLG models, and render the model's outputs to 2D via Gaussian splatting for supervision. In principle, this pipeline is task-agnostic and architecture-agnostic: it can train any 3D model whose outputs are renderable to 2D.
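As a hedged sketch of the pseudo-labeling step (the 2D model interface and vocabulary handling here are our assumptions, not the paper's exact recipe), an off-the-shelf open-vocabulary 2D segmenter can be run over the posed training images:

def make_pseudo_labels(images, segmenter_2d, vocabulary):
    # Run a pretrained 2D foundation model over each posed RGB image to
    # produce per-noun binary masks. No human annotation is involved.
    pseudo_labels = []
    for image in images:
        masks = segmenter_2d(image, vocabulary)  # e.g. {"chair": HxW mask, ...}
        pseudo_labels.append(masks)
    return pseudo_labels

These masks then stand in for the human 2D labels (the view["mask_2d"] targets in the training-step sketch above), making the whole pipeline free of manual labels.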

Data Engineering

As a Pretraining Pipeline

When finetuned with limited 3D data, our pretrained model significantly outperforms the same model trained from scratch on that 3D data.

Pseudo-Labeling

Scaling


Finetuning Data Scaling



We finetune the pretrained model with different amounts of 3D data and find that pretraining effectively multiplies the finetuning dataset. This phenomenon is consistent with the observations in "Scaling Laws for Transfer" (Hernandez et al., 2021).
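One way to read this multiplier (our notation, not the paper's): if a from-scratch model needs D_scratch labeled 3D examples to reach the same score that the pretrained model reaches after finetuning on D_ft examples, then pretraining acts as an effective data multiplier

    m = D_scratch / D_ft,

i.e., finetuning on D_ft examples behaves like training from scratch on m x D_ft examples.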

Pretraining Data Scaling

Adding more pretraining data consistently improves downstream finetuning performance.

2D Model Scaling

Our pipeline allows flexible use of 2D foundation models, and shows performance gains from larger 2D foundation models and better pseudo-labeling designs.

BibTeX

@article{liftgs2025,
  author    = {Cao, Ang and Arnaud, Sergio and Maksymets, Oleksandr and Yang, Jianing and Jain, Ayush and Yenamandra, Sriram and Martin, Ada and Berges, Vincent-Pierre and McVay, Paul and Partsey, Ruslan and Rajeswaran, Aravind and Meier, Franziska and Johnson, Justin and Park, Jeong Joon and Sax, Alexander},
  title     = {LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding},
  year      = {2025},
}