TL;DR: We observe that the primary bottleneck in 3D semantic scene graph prediction lies in object representation, which directly impacts the accuracy of predicate reasoning. (a) The baseline (VL-SAT) embeds object features non-discriminatively, leading to low-confidence predictions and frequent object misclassifications that degrade relationship accuracy. In contrast, (b) our method embeds object features in a more discriminative manner, yielding high-confidence scores and more accurate object classifications. Consequently, relationship predictions are significantly improved, resulting in a more coherent and semantically accurate scene graph.
Abstract
Video (To be updated)
Core Contributions
- We identify the overlooked importance of object representation in prior 3DSSG methods and propose a Discriminative Object Feature Encoder, pretrained independently to serve as a robust semantic foundation, improving not only our model but also enhancing performance when integrated into existing frameworks.
- A novel Relationship Feature Encoder that combines object-pair embeddings with geometric relationship information, enhanced by Local Spatial Enhancement (LSE); a minimal sketch follows this list.
- A Bidirectional Edge Gating (BEG) mechanism that explicitly models subject-object asymmetry, along with a Global Spatial Enhancement (GSE) module that incorporates holistic spatial context.
- We validate our approach through extensive experiments, achieving significant performance improvements over state-of-the-art 3DSSG methods.
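As a rough illustration of the second contribution, the sketch below fuses a subject/object embedding pair with a pairwise geometric descriptor. The descriptor contents, layer widths, and module name are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RelationshipFeatureEncoder(nn.Module):
    """Illustrative sketch: fuse subject/object embeddings with a pairwise
    geometric descriptor (e.g. relative centroid offset, box-size ratios).
    Dimensions and layers are assumptions, not the paper's specification."""

    def __init__(self, obj_dim=256, geo_dim=9, rel_dim=256):
        super().__init__()
        self.geo_proj = nn.Sequential(nn.Linear(geo_dim, rel_dim), nn.ReLU())
        self.fuse = nn.Sequential(
            nn.Linear(2 * obj_dim + rel_dim, rel_dim), nn.ReLU(),
            nn.Linear(rel_dim, rel_dim),
        )

    def forward(self, subj_feat, obj_feat, geo_feat):
        # subj_feat, obj_feat: (E, obj_dim) features of each edge's endpoints
        # geo_feat: (E, geo_dim) pairwise geometric descriptor
        geo = self.geo_proj(geo_feat)
        return self.fuse(torch.cat([subj_feat, obj_feat, geo], dim=-1))
```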
 
Proposed Method
We propose a two-stage training framework for 3D semantic scene graph prediction. In the first stage, we pretrain an object encoder to learn discriminative object-level feature representations. In the second stage, we predict the full 3D scene graph using a graph neural network equipped with our proposed components: Bidirectional Edge Gating (BEG), Local Spatial Enhancement (LSE), and Global Spatial Enhancement (GSE).
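The data flow of the two stages can be summarized as follows. All module names are placeholders standing in for the components described above; the snippet reflects only the order of operations, not the actual implementation.

```python
def predict_scene_graph(points, boxes, obj_encoder, gse, lse, bgat, obj_head, rel_head):
    # Stage 1 (pretrained): discriminative per-object features from point clouds.
    obj_feats = obj_encoder(points)
    # Stage 2: Global Spatial Enhancement refines object features using inter-object distances.
    obj_feats = gse(obj_feats, boxes)
    # Local Spatial Enhancement builds pairwise relationship features that preserve local geometry.
    rel_feats = lse(obj_feats, boxes)
    # Bidirectional Edge Gating inside the graph attention network modulates reverse-edge messages.
    obj_feats, rel_feats = bgat(obj_feats, rel_feats)
    # Per-node object logits and per-edge predicate logits.
    return obj_head(obj_feats), rel_head(rel_feats)
```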

Architecture of the first stage (object encoder pretraining): The encoder extracts object embeddings from point clouds via an affine transformation and aligns them with CLIP text/visual features.
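A hedged sketch of such a pretraining objective is shown below: each point-cloud object embedding is pulled toward the CLIP text embedding of its class and the CLIP visual embedding of its image crop. The specific losses, temperature, and weighting are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(obj_emb, clip_text_emb, clip_vis_emb, labels, tau=0.07):
    """Illustrative objective (assumed, not the paper's exact loss):
    InfoNCE over class text embeddings plus a cosine term to visual crops."""
    obj = F.normalize(obj_emb, dim=-1)        # (B, D) point-cloud object embeddings
    txt = F.normalize(clip_text_emb, dim=-1)  # (C, D) one CLIP text embedding per class
    vis = F.normalize(clip_vis_emb, dim=-1)   # (B, D) CLIP visual embedding per object crop
    logits = obj @ txt.t() / tau              # (B, C) similarity to every class prompt
    text_loss = F.cross_entropy(logits, labels)
    visual_loss = (1.0 - (obj * vis).sum(dim=-1)).mean()
    return text_loss + visual_loss
```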

Architecture of the second stage (3D semantic scene graph prediction): Object features are refined via Global Spatial Enhancement to incorporate global spatial context based on inter-object distances, producing enhanced features. Simultaneously, Local Spatial Enhancement preserves local geometric relationships between object pairs. The Bidirectional Gated Graph Attention Network then selectively modulates the information carried by reverse edges, effectively capturing asymmetric relationships between objects.
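The reverse-edge modulation can be pictured with the minimal gating sketch below; the gate's inputs and dimensions are illustrative assumptions rather than the exact BEG design.

```python
import torch
import torch.nn as nn

class BidirectionalEdgeGate(nn.Module):
    """Sketch of the gating idea: the message carried by a reverse edge
    (object -> subject) is scaled by a gate computed from both directions'
    edge features, so the two directions influence the graph asymmetrically.
    Layer sizes and gate inputs are illustrative assumptions."""

    def __init__(self, edge_dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * edge_dim, edge_dim), nn.Sigmoid())

    def forward(self, fwd_edge, rev_edge):
        # fwd_edge: (E, D) features of edges subject -> object
        # rev_edge: (E, D) features of the corresponding reverse edges
        g = self.gate(torch.cat([fwd_edge, rev_edge], dim=-1))
        return fwd_edge, g * rev_edge  # reverse messages are selectively modulated
```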
3D Scene Graph Prediction

Qualitative comparison between the baseline (VL-SAT) and ours: VL-SAT frequently misclassifies visually similar but semantically distinct objects, such as cabinet vs. chair or stool vs. garbage bin, which leads to erroneous relationship predictions. In contrast, our method correctly identifies object categories, thereby facilitating accurate and consistent relationship prediction.
Object Feature Representation
We visualize the learned object embedding space using t-SNE for the ten most frequent object categories in the dataset. Compared to the baseline (VL-SAT), our approach yields more compact and well-separated clusters, particularly for structurally similar object pairs such as ceiling–floor, wall–door, and curtain–window. These results suggest that our object encoder learns more discriminative features, which provide a semantically stronger foundation for subsequent relationship classification.
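A minimal way to reproduce this kind of visualization, assuming object embeddings and integer class labels are available as NumPy arrays (perplexity and other settings here are illustrative choices, not the paper's):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_object_embeddings(embeddings, labels, class_names):
    # Keep only the ten most frequent categories, as in the figure above.
    top10 = np.argsort(np.bincount(labels))[-10:]
    mask = np.isin(labels, top10)
    # Project the selected embeddings to 2D with t-SNE.
    pts = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings[mask])
    for c in top10:
        sel = labels[mask] == c
        plt.scatter(pts[sel, 0], pts[sel, 1], s=4, label=class_names[c])
    plt.legend(markerscale=3)
    plt.savefig("tsne_objects.png", dpi=200)
```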