SSync: Selective Synergistic Learning
for Video Object-Centric Learning

WonJun Moon¹, Jae-Pil Heo²
¹KAIST, ²Sungkyunkwan University
Work done at Sungkyunkwan University   Corresponding author
ECCV 2026

SSync aligns the encoder's sharp boundaries with the decoder's coherent interiors, yielding clean and temporally consistent object decomposition.

Abstract

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder–decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module—noisy encoder predictions and blurred decoder boundaries—while incurring a cost quadratic in the number of spatio-temporal patches.

Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: it uses the encoder strictly for boundary refinement and the decoder strictly for interior denoising, realized via a pseudo-labeling scheme with linear complexity. To avoid reinforcing architectural biases such as slot redundancy, we further introduce a transitive pseudo-label merging step that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies show that SSync improves decomposition quality, serves as a plug-and-play module, and stays robust as the number of slots changes.
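To give a rough sense of the quadratic versus linear cost contrast described above, the sketch below counts operations for both strategies. The frame count, patch grid, and slot count are illustrative assumptions, not values from the paper, and the assumption that pseudo-labeling costs one comparison per patch per slot is ours.

```python
def dense_alignment_pairs(num_patches: int) -> int:
    """Comparisons when every spatio-temporal patch is aligned with every other."""
    return num_patches * num_patches


def pseudo_label_ops(num_patches: int, num_slots: int) -> int:
    """Assumed cost of pseudo-labeling: each patch is scored against each slot once."""
    return num_patches * num_slots


# Illustrative setting (assumed): 4 frames, a 24x24 patch grid, 11 slots.
frames, grid, slots = 4, 24, 11
n = frames * grid * grid
print("patches:", n)
print("dense alignment pairs:", dense_alignment_pairs(n))
print("pseudo-label operations:", pseudo_label_ops(n, slots))
```

Under these assumed sizes the dense scheme needs roughly two hundred times more comparisons, and the gap widens as clips get longer or the patch grid gets finer.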

How SSync Works

SSync overview (paper Figure 1)
Overview. Slot-based VOCL is guided by two spatial maps—the encoder's attention map (sharp but noisy) and the decoder's object map (coherent but blurry). Rather than forcing them to agree everywhere, SSync distills the encoder's crisp boundaries into the decoder and the decoder's clean interiors into the encoder, a selective cross-distillation realized by a simple, linear-time pseudo-labeling scheme.
SSync qualitative mechanism (paper Figure 2)
Selective supervision in practice. Given the encoder attention map and the decoder object map, a transitive merging step first consolidates redundant slots that split a single object, which stabilizes the pseudo-labels throughout training. Boundary regions detected from the attention map and interior regions detected from the object map then provide reliable, region-specific supervision.
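The region-specific supervision described above can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of hard argmax labels, and the rule "a pixel is a boundary if a 4-connected neighbour has a different label" are all hypothetical choices made for the example.

```python
import numpy as np


def label_boundary(labels: np.ndarray) -> np.ndarray:
    """True where a pixel's label differs from any 4-connected neighbour."""
    boundary = np.zeros(labels.shape, dtype=bool)
    diff_v = labels[1:, :] != labels[:-1, :]   # vertical neighbours
    diff_h = labels[:, 1:] != labels[:, :-1]   # horizontal neighbours
    boundary[1:, :] |= diff_v
    boundary[:-1, :] |= diff_v
    boundary[:, 1:] |= diff_h
    boundary[:, :-1] |= diff_h
    return boundary


def selective_targets(attn: np.ndarray, obj: np.ndarray) -> dict:
    """attn, obj: (K, H, W) encoder attention and decoder object maps.

    Assumed rule: the encoder supervises the decoder only on encoder
    boundaries; the decoder supervises the encoder only on decoder interiors.
    """
    enc_labels = attn.argmax(axis=0)
    dec_labels = obj.argmax(axis=0)
    return {
        "decoder_target": enc_labels,
        "decoder_mask": label_boundary(enc_labels),
        "encoder_target": dec_labels,
        "encoder_mask": ~label_boundary(dec_labels),
    }


# Toy frame: slot 0 covers the left half, slot 1 the right half.
attn = np.zeros((2, 4, 4))
attn[0, :, :2] = 1.0
attn[1, :, 2:] = 1.0
targets = selective_targets(attn, attn.copy())
print(targets["decoder_mask"].astype(int))
print(targets["encoder_mask"].astype(int))
```

Each map is visited a constant number of times, so the cost of building the targets grows linearly with the number of patches. A loss would then be applied only where the corresponding mask is true.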

Quantitative Results

Object discovery on three VOCL benchmarks (averaged over 3 runs). Best per column in bold.

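MOVi-C and MOVi-E are evaluated at 336×336, YouTube-VIS at 518×518. A dash marks a result that is not reported.

| Method | Venue | MOVi-C FG-ARI↑ | MOVi-C mBO↑ | MOVi-E FG-ARI↑ | MOVi-E mBO↑ | YouTube-VIS FG-ARI↑ | YouTube-VIS mBO↑ |
|---|---|---|---|---|---|---|---|
| SAVi | ICLR'22 | 22.2 | 13.6 | 42.8 | 16.0 | - | - |
| STEVE | NeurIPS'22 | 36.1 | 26.5 | 50.6 | 26.6 | 15.0 | 19.1 |
| VideoSAUR | NeurIPS'23 | 64.8 | 38.9 | 73.9 | **35.6** | 28.9 | 26.3 |
| VideoSAURv2 | NeurIPS'23 | - | - | 77.1 | 34.4 | 31.2 | 29.7 |
| SlotContrast | CVPR'25 | 69.3 | 32.7 | 82.9 | 29.2 | 38.0 | 33.7 |
| SRL | ICLR'26 | 74.3 | 34.5 | 81.9 | 29.3 | 42.9 | 35.6 |
| SlotCurri | CVPR'26 | 77.6 | 32.8 | 83.7 | 28.9 | **44.8** | 35.5 |
| SSync (Ours) | ECCV'26 | **79.4** | **39.5** | **84.0** | 34.8 | 42.6 | **38.7** |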
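For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to compute them for a single frame: FG-ARI as the adjusted Rand index restricted to foreground pixels, and mBO as the mean, over ground-truth objects, of the best IoU with any predicted segment. The function names and the convention that label 0 is background are assumptions for this example; benchmark evaluation code may differ in details such as aggregating over whole videos, and the table reports values scaled by 100.

```python
import numpy as np


def adjusted_rand_index(true: np.ndarray, pred: np.ndarray) -> float:
    true, pred = np.asarray(true).ravel(), np.asarray(pred).ravel()
    if true.size < 2:
        return 1.0
    _, t = np.unique(true, return_inverse=True)
    _, p = np.unique(pred, return_inverse=True)
    cont = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(cont, (t, p), 1)  # contingency table of label co-occurrences

    def comb2(x):
        return x * (x - 1) / 2.0

    sum_ij = comb2(cont).sum()
    sum_a = comb2(cont.sum(axis=1)).sum()
    sum_b = comb2(cont.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(true.size)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0
    return float((sum_ij - expected) / (max_index - expected))


def fg_ari(true: np.ndarray, pred: np.ndarray, background: int = 0) -> float:
    """ARI computed only on pixels whose ground-truth label is not background."""
    fg = true != background
    return adjusted_rand_index(true[fg], pred[fg])


def mean_best_overlap(true: np.ndarray, pred: np.ndarray, background: int = 0) -> float:
    """Mean over ground-truth objects of the best IoU with any predicted segment."""
    scores = []
    for obj_id in np.unique(true):
        if obj_id == background:
            continue
        gt = true == obj_id
        ious = [(gt & (pred == s)).sum() / (gt | (pred == s)).sum() for s in np.unique(pred)]
        scores.append(max(ious))
    return float(np.mean(scores)) if scores else 0.0


true = np.array([[0, 1, 1], [0, 2, 2]])
perfect = np.array([[5, 7, 7], [5, 9, 9]])   # same partition, different ids
merged = np.array([[5, 7, 7], [5, 7, 7]])    # both objects in one slot
print(fg_ari(true, perfect), mean_best_overlap(true, perfect))
print(fg_ari(true, merged), mean_best_overlap(true, merged))
```

FG-ARI ignores background pixels and only rewards correct grouping, while mBO also penalizes masks that leak past object boundaries, which is why a method can rank differently on the two columns.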

Robustness to Slot Configuration

On MOVi-C, prior methods lose accuracy as the slot count grows (SlotContrast falls from 74.9 to 61.8 FG-ARI between 7 and 15 slots), whereas SSync stays between 76.9 and 79.4 FG-ARI thanks to transitive pseudo-label merging. Best per column in bold.

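| Method | 7 slots FG-ARI↑ | 7 slots mBO↑ | 11 slots FG-ARI↑ | 11 slots mBO↑ | 15 slots FG-ARI↑ | 15 slots mBO↑ |
|---|---|---|---|---|---|---|
| SlotContrast | 74.9 | 27.9 | 69.3 | 32.7 | 61.8 | 31.2 |
| SRL | 76.5 | 31.6 | 74.3 | 34.5 | 72.8 | 31.1 |
| SSync (Ours) | **76.9** | **39.8** | **79.4** | **39.5** | **78.8** | **41.0** |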
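The transitive merging credited above for slot-count robustness can be illustrated with a union-find over slots. This is a sketch under our own assumptions rather than the paper's procedure: the similarity measure (cosine similarity of flattened spatio-temporal activations), the threshold, and the function names are hypothetical.

```python
import numpy as np


def merge_slots(maps: np.ndarray, threshold: float = 0.4) -> np.ndarray:
    """maps: (K, T, H, W) soft slot activations over a clip.

    Returns a (K,) array mapping each slot to its merged group id
    (the smallest slot index in the group).
    """
    k = maps.shape[0]
    flat = maps.reshape(k, -1).astype(np.float64)
    flat /= np.maximum(np.linalg.norm(flat, axis=1, keepdims=True), 1e-12)
    sim = flat @ flat.T  # pairwise cosine similarity between slots

    parent = list(range(k))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(k):
        for j in range(i + 1, k):
            if sim[i, j] > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)
    return np.array([find(i) for i in range(k)])


def merge_labels(labels: np.ndarray, groups: np.ndarray) -> np.ndarray:
    """Rewrite per-patch slot labels so that merged slots share one id."""
    return groups[labels]


# Slots 0-1 and 1-2 overlap, 0-2 do not; slot 3 is separate.
maps = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
], dtype=float).reshape(4, 1, 1, 6)
groups = merge_slots(maps)
print(groups)
print(merge_labels(np.array([0, 1, 2, 3, 2]), groups))
```

The transitive part is what lets slots 0 and 2 end up in the same group even though they never overlap directly: both are linked through slot 1. With more slots than objects, this kind of consolidation keeps surplus slots from being treated as separate pseudo-label classes.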

Object Dynamics Prediction (Future Rollout)

Transferability of the learned slots to downstream future prediction: a frozen object-centric model combined with SlotFormer (SF) autoregressively predicts future slots. SSync gives the best rollout score on every metric except YouTube-VIS FG-ARI, where SRL is ahead by 0.1. Best per column in bold.

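| Method | MOVi-C FG-ARI↑ | MOVi-C mBO↑ | MOVi-E FG-ARI↑ | MOVi-E mBO↑ | YouTube-VIS FG-ARI↑ | YouTube-VIS mBO↑ |
|---|---|---|---|---|---|---|
| Reconstruction + SF | 50.7 | 25.9 | 70.6 | 24.3 | 27.4 | 28.9 |
| SlotContrast + SF | 63.8 | 26.1 | 70.5 | 24.9 | 29.2 | 29.6 |
| SRL + SF | 68.9 | 27.4 | 70.4 | 24.9 | **32.2** | 30.0 |
| SSync (Ours) + SF | **69.1** | **29.0** | **72.1** | **27.1** | 32.1 | **30.9** |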
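The autoregressive rollout used in this evaluation can be sketched generically: slots from the frozen model seed a context window, and each predicted slot set is fed back in to predict the next. The `rollout` function, its parameters, and the linear-extrapolation stand-in for the predictor are hypothetical; SlotFormer itself is a learned Transformer and is not reproduced here.

```python
from typing import Callable

import numpy as np


def rollout(
    history: np.ndarray,
    predictor: Callable[[np.ndarray], np.ndarray],
    steps: int,
    context: int,
) -> np.ndarray:
    """history: (T, K, D) slots from the frozen object-centric model.

    predictor maps a (context, K, D) window to the next (K, D) slots.
    Returns the (steps, K, D) predicted future slots.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    window = list(history[-context:])
    predictions = []
    for _ in range(steps):
        nxt = predictor(np.stack(window))
        predictions.append(nxt)
        window = window[1:] + [nxt]  # slide the window, feeding the prediction back in
    return np.stack(predictions)


def linear_extrapolation(window: np.ndarray) -> np.ndarray:
    """Stand-in predictor (assumed): continue the last observed change."""
    return 2 * window[-1] - window[-2]


# Four observed steps, 3 slots of dimension 2, with values growing by 1 per step.
history = np.stack([np.full((3, 2), float(t)) for t in range(4)])
future = rollout(history, linear_extrapolation, steps=3, context=2)
print(future[:, 0, 0])
```

Because every step consumes earlier predictions, errors compound over the rollout, so slot representations that are temporally consistent are easier to predict. The predicted slots would then be decoded into masks and scored with FG-ARI and mBO.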

Plug-and-Play under the RandSF.Q Protocol

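As a plug-and-play module, SSync is added on top of two RandSF.Q variants (tsim from VideoSAUR, SSC from SlotContrast) across MOVi-C, MOVi-D, and HQ-YTVIS. It improves every metric for the tsim variant and all but two for the SSC variant: on MOVi-D, mBO drops by 0.1 and mIoU is unchanged.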

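All datasets are evaluated at 224×224.

| Method | MOVi-C ARI↑ | MOVi-C FG-ARI↑ | MOVi-C mBO↑ | MOVi-C mIoU↑ | MOVi-D ARI↑ | MOVi-D FG-ARI↑ | MOVi-D mBO↑ | MOVi-D mIoU↑ | HQ-YTVIS ARI↑ | HQ-YTVIS FG-ARI↑ | HQ-YTVIS mBO↑ | HQ-YTVIS mIoU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RandSF.Q (tsim) | 70.7 | 63.3 | 31.1 | 28.1 | 39.3 | 70.5 | 25.6 | 24.3 | 39.2 | 56.3 | 37.2 | 37.0 |
| + SSync | 73.0 | 67.3 | 32.3 | 29.9 | 45.5 | 71.2 | 27.6 | 25.7 | 42.9 | 58.6 | 39.4 | 39.2 |
| RandSF.Q (SSC) | 52.7 | 67.8 | 24.2 | 22.1 | 37.3 | 85.8 | 28.0 | 26.8 | 40.7 | 57.2 | 38.3 | 37.8 |
| + SSync | 55.5 | 71.4 | 25.1 | 23.1 | 39.4 | 86.5 | 27.9 | 26.8 | 48.6 | 57.5 | 42.1 | 41.9 |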

Image-level Object-Centric Learning

Beyond video, SSync also improves image object-centric learning on MOVi-E and real-world COCO 2017. Best in bold.

MOVi-E

| Method | FG-ARI↑ |
|---|---|
| VideoSAUR | 78.4 |
| SOLV | 80.8 |
| SlotContrast | 84.8 |
| SlotCurri | 84.9 |
| SSync (Ours) | **86.0** |

COCO 2017

| Method | FG-ARI↑ | mBO↑ |
|---|---|---|
| Baseline | 40.5 | 28.8 |
| SRL | 42.8 | 29.4 |
| SlotCurri | 43.4 | 28.9 |
| SSync (Ours) | **47.9** | **33.1** |

Qualitative Results

[Interactive viewer: the input video blended with the predicted slot map, each slot shown in a distinct color, for MOVi-C, MOVi-E, and YouTube-VIS 2021.]

All Results

[Video gallery: every evaluated clip, shown as the input overlaid with predicted slot masks.]

BibTeX

@inproceedings{moon2026ssync,
  title     = {Selective Synergistic Learning for Video Object-Centric Learning},
  author    = {Moon, WonJun and Heo, Jae-Pil},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

@inproceedings{moon2026reconstruction,
  title     = {Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning},
  author    = {Moon, WonJun and Seong, Hyun Seok and Heo, Jae-Pil},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

@inproceedings{seong2026synergistic,
  title     = {From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning},
  author    = {Seong, Hyun Seok and Moon, WonJun and Heo, Jae-Pil},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}