SSync: Selective Synergistic Learning for Video Object-Centric Learning

Abstract

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder–decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module—noisy encoder predictions and blurred decoder boundaries—while incurring a cost quadratic in the number of spatio-temporal patches.

Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising, realized via a pseudo-labeling scheme with linear complexity. To prevent the reinforcement of architectural biases like slot redundancy, we further introduce a transitive pseudo-label merging that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies show that SSync improves decomposition quality, serves as a versatile plug-and-play module, and remains exceptionally robust to slot configurations.

How SSync Works

SSync overview (paper Figure 1) — **Overview.** Slot-based VOCL is guided by two spatial maps—the encoder's **attention map** (sharp but noisy) and the decoder's **object map** (coherent but blurry). Rather than forcing them to agree everywhere, SSync distills the encoder's crisp **boundaries** into the decoder and the decoder's clean **interiors** into the encoder, a selective cross-distillation realized by a simple, linear-time pseudo-labeling scheme.
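The selective cross-distillation described above can be illustrated with a minimal NumPy sketch for a single frame. This is not the authors' implementation: the 4-neighbour boundary rule, the `ignore` label, and the names `boundary_mask` and `selective_targets` are assumptions made for illustration. Each patch is visited a constant number of times, so the cost is linear in the number of patches.

```python
import numpy as np

def boundary_mask(labels: np.ndarray) -> np.ndarray:
    """True where a patch has a 4-neighbour carrying a different slot label."""
    b = np.zeros(labels.shape, dtype=bool)
    diff_v = labels[1:, :] != labels[:-1, :]   # vertically adjacent patches
    diff_h = labels[:, 1:] != labels[:, :-1]   # horizontally adjacent patches
    b[1:, :] |= diff_v
    b[:-1, :] |= diff_v
    b[:, 1:] |= diff_h
    b[:, :-1] |= diff_h
    return b

def selective_targets(attn: np.ndarray, obj: np.ndarray, ignore: int = -1):
    """attn, obj: (K, H, W) encoder attention map and decoder object map.

    Returns two (H, W) pseudo-label maps; `ignore` marks unsupervised patches.
    Assumed rule: encoder labels supervise the decoder only at encoder boundaries,
    decoder labels supervise the encoder only at decoder interiors.
    """
    enc_labels = attn.argmax(axis=0)
    dec_labels = obj.argmax(axis=0)
    enc_boundary = boundary_mask(enc_labels)
    dec_interior = ~boundary_mask(dec_labels)
    target_for_decoder = np.where(enc_boundary, enc_labels, ignore)
    target_for_encoder = np.where(dec_interior, dec_labels, ignore)
    return target_for_decoder, target_for_encoder
```

For a video, the same function could be applied frame by frame; how the paper actually detects boundary and interior regions is not specified on this page.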

SSync qualitative mechanism (paper Figure 2): **Selective supervision in practice.** Given the encoder attention map and the decoder object map, a transitive merging step first consolidates redundant slots that split a single object, which stabilizes the pseudo-labels throughout training. Boundary regions detected from the attention map and interior regions detected from the object map then provide reliable, region-specific supervision.
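The transitive merging step can be sketched with a union-find over slots. The overlap criterion used here (IoU of thresholded activations over the whole spatio-temporal volume) and the values of `act_thresh` and `iou_thresh` are assumptions for illustration, not the paper's stated rule. The function names are hypothetical as well.

```python
import numpy as np

def transitive_merge(maps: np.ndarray, act_thresh: float = 0.3,
                     iou_thresh: float = 0.5) -> np.ndarray:
    """maps: (K, T, N) soft slot activations over T frames and N patches.

    Returns a (K,) array mapping each slot to its group id (the smallest
    slot index in its group). Merging is transitive: if A overlaps B and
    B overlaps C, all three share one group even if A and C do not overlap.
    """
    K = maps.shape[0]
    active = (maps.reshape(K, -1) > act_thresh).astype(np.int64)  # (K, T*N)
    inter = active @ active.T                  # pairwise co-activation counts
    area = active.sum(axis=1)
    union = area[:, None] + area[None, :] - inter
    iou = inter / np.maximum(union, 1)         # avoid division by zero

    parent = list(range(K))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i

    for i in range(K):
        for j in range(i + 1, K):
            if iou[i, j] > iou_thresh:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)
    return np.array([find(i) for i in range(K)])

def merged_pseudo_labels(maps: np.ndarray, **kwargs) -> np.ndarray:
    """Hard (T, N) pseudo-labels in which redundant slots share one id."""
    group = transitive_merge(maps, **kwargs)
    return group[maps.argmax(axis=0)]
```

The pairwise comparison runs over slots, whose number is small, so the cost stays linear in the number of spatio-temporal patches.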

Quantitative Results

Object discovery on three VOCL benchmarks (averaged over 3 runs). Best per column in bold.

Method	Venue	MOVi-C (336×336)		MOVi-E (336×336)		YouTube-VIS (518×518)
Method	Venue	FG-ARI↑	mBO↑	FG-ARI↑	mBO↑	FG-ARI↑	mBO↑
SAVi	ICLR'22	22.2	13.6	42.8	16.0	–	–
STEVE	NeurIPS'22	36.1	26.5	50.6	26.6	15.0	19.1
VideoSAUR	NeurIPS'23	64.8	38.9	73.9	**35.6**	28.9	26.3
VideoSAURv2	NeurIPS'23	–	–	77.1	34.4	31.2	29.7
SlotContrast	CVPR'25	69.3	32.7	82.9	29.2	38.0	33.7
SRL	ICLR'26	74.3	34.5	81.9	29.3	42.9	35.6
SlotCurri	CVPR'26	77.6	32.8	83.7	28.9	**44.8**	35.5
SSync (Ours)	ECCV'26	**79.4**	**39.5**	**84.0**	34.8	42.6	**38.7**
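For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to compute them from integer segmentation maps. It is an illustration under stated assumptions, not the evaluation code behind these numbers: the background id of 0, the handling of degenerate cases, and the function names `fg_ari` and `mbo` are all assumptions. Scores are returned in [0, 1], whereas the table reports them multiplied by 100.

```python
import numpy as np

def _comb2(x):
    """Number of unordered pairs, x choose 2, applied elementwise."""
    return x * (x - 1) / 2.0

def fg_ari(gt: np.ndarray, pred: np.ndarray, background: int = 0) -> float:
    """Adjusted Rand index restricted to ground-truth foreground pixels."""
    gt, pred = gt.ravel(), pred.ravel()
    fg = gt != background
    gt, pred = gt[fg], pred[fg]
    n = gt.size
    if n < 2:
        return float("nan")                      # undefined without pixel pairs
    _, gi = np.unique(gt, return_inverse=True)
    _, pi = np.unique(pred, return_inverse=True)
    table = np.zeros((gi.max() + 1, pi.max() + 1))
    np.add.at(table, (gi, pi), 1)                # contingency table
    index = _comb2(table).sum()
    a = _comb2(table.sum(axis=1)).sum()
    b = _comb2(table.sum(axis=0)).sum()
    expected = a * b / _comb2(n)
    max_index = (a + b) / 2.0
    if max_index == expected:                    # both partitions trivial
        return 1.0
    return float((index - expected) / (max_index - expected))

def mbo(gt: np.ndarray, pred: np.ndarray, background: int = 0) -> float:
    """Mean best overlap: for each ground-truth object, the best IoU
    against any predicted segment, averaged over objects."""
    best_ious = []
    for g in np.unique(gt):
        if g == background:
            continue
        g_mask = gt == g
        best = 0.0
        for p in np.unique(pred):
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            best = max(best, inter / union)
        best_ious.append(best)
    return float(np.mean(best_ious)) if best_ious else float("nan")
```

For video, both functions can be applied to label arrays that cover the full spatio-temporal volume, so that a slot must track the same object across frames to score well.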

Robustness to Slot Configuration

On MOVi-C, prior methods degrade as the slot count grows: from 7 to 15 slots, SlotContrast drops from 74.9 to 61.8 FG-ARI and SRL from 76.5 to 72.8. SSync stays between 76.9 and 79.4 FG-ARI over the same range, thanks to transitive pseudo-label merging. Best per column in bold.

Method	7 slots		11 slots		15 slots
Method	FG-ARI↑	mBO↑	FG-ARI↑	mBO↑	FG-ARI↑	mBO↑
SlotContrast	74.9	27.9	69.3	32.7	61.8	31.2
SRL	76.5	31.6	74.3	34.5	72.8	31.1
SSync (Ours)	**76.9**	**39.8**	**79.4**	**39.5**	**78.8**	**41.0**

Object Dynamics Prediction (Future Rollout)

Transferability of the learned slots to downstream future prediction: a frozen object-centric model combined with SlotFormer (SF) autoregressively predicts future slots. SSync achieves the best score in five of the six columns, trailing SRL only on YouTube-VIS FG-ARI (32.1 vs. 32.2). Best per column in bold.

Method	MOVi-C		MOVi-E		YouTube-VIS
Method	FG-ARI↑	mBO↑	FG-ARI↑	mBO↑	FG-ARI↑	mBO↑
Reconstruction + SF	50.7	25.9	70.6	24.3	27.4	28.9
SlotContrast + SF	63.8	26.1	70.5	24.9	29.2	29.6
SRL + SF	68.9	27.4	70.4	24.9	**32.2**	30.0
SSync (Ours) + SF	**69.1**	**29.0**	**72.1**	**27.1**	32.1	**30.9**
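The autoregressive rollout protocol can be sketched generically as follows. The `predictor` callable is a hypothetical stand-in for a trained dynamics model and does not reflect SlotFormer's actual interface. The sliding context window equal to the burn-in length is likewise an assumption.

```python
import numpy as np

def rollout(burn_in: np.ndarray, predictor, horizon: int) -> np.ndarray:
    """Autoregressively predict future slots.

    burn_in:   (T0, K, D) slots extracted by the frozen object-centric model.
    predictor: callable mapping a (T0, K, D) context to the next (K, D) slots
               (hypothetical stand-in for the dynamics model).
    Returns a (horizon, K, D) array of predicted slots.
    """
    window = len(burn_in)
    history = list(burn_in)
    for _ in range(horizon):
        context = np.stack(history[-window:])   # most recent `window` steps
        history.append(predictor(context))      # feed predictions back in
    return np.stack(history[window:])
```

Because each prediction is fed back as input, errors compound over the horizon, which is why the quality of the initial slots matters for the scores in the table.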

Plug-and-Play under the RandSF.Q Protocol

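As a plug-and-play module, SSync added on top of two RandSF.Q variants (t_sim from VideoSAUR, SSC from SlotContrast) improves 22 of the 24 reported metrics across MOVi-C, MOVi-D, and HQ-YTVIS. The two exceptions are on MOVi-D with the SSC variant, where mBO dips from 28.0 to 27.9 and mIoU is unchanged at 26.8.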

Method	MOVi-C (224×224)				MOVi-D (224×224)				HQ-YTVIS (224×224)
Method	ARI↑	FG-ARI↑	mBO↑	mIoU↑	ARI↑	FG-ARI↑	mBO↑	mIoU↑	ARI↑	FG-ARI↑	mBO↑	mIoU↑
RandSF.Q (t_sim)	70.7	63.3	31.1	28.1	39.3	70.5	25.6	24.3	39.2	56.3	37.2	37.0
+ SSync	73.0	67.3	32.3	29.9	45.5	71.2	27.6	25.7	42.9	58.6	39.4	39.2
RandSF.Q (SSC)	52.7	67.8	24.2	22.1	37.3	85.8	28.0	26.8	40.7	57.2	38.3	37.8
+ SSync	55.5	71.4	25.1	23.1	39.4	86.5	27.9	26.8	48.6	57.5	42.1	41.9

Image-level Object-Centric Learning

Beyond video, SSync also improves image object-centric learning on MOVi-E and real-world COCO 2017. Best in bold.

MOVi-E

Method	FG-ARI↑
VideoSAUR	78.4
SOLV	80.8
SlotContrast	84.8
SlotCurri	84.9
SSync (Ours)	**86.0**

COCO 2017

Method	FG-ARI↑	mBO↑
Baseline	40.5	28.8
SRL	42.8	29.4
SlotCurri	43.4	28.9
SSync (Ours)	**47.9**	**33.1**

Qualitative Results

Qualitative comparisons are provided for MOVi-C, MOVi-E, and YouTube-VIS 2021. For each dataset, the input video is shown against the predicted slot map, with each slot rendered in a distinct color.

BibTeX

@inproceedings{moon2026ssync,
  title     = {Selective Synergistic Learning for Video Object-Centric Learning},
  author    = {Moon, WonJun and Heo, Jae-Pil},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

@inproceedings{moon2026reconstruction,
  title     = {Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning},
  author    = {Moon, WonJun and Seong, Hyun Seok and Heo, Jae-Pil},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

@inproceedings{seong2026synergistic,
  title     = {From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning},
  author    = {Seong, Hyun Seok and Moon, WonJun and Heo, Jae-Pil},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}


SSync aligns the encoder's sharp boundaries with the decoder's coherent interiors, yielding clean and temporally consistent object decomposition.
