RefineVAD follows the standard multiple-instance learning (MIL) setup: each video is split into T segments, and segment-level features are extracted.
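As a concrete illustration of this preprocessing step, the sketch below pools per-frame backbone features into T fixed segments; the frame count, feature dimension, and helper name are assumptions, not details from the paper.

```python
import torch

def segment_features(frame_feats: torch.Tensor, T: int = 32) -> torch.Tensor:
    """Pool per-frame features into T fixed segments (common MIL preprocessing).

    frame_feats: (N, D) features for N frames, e.g. from a frozen backbone.
    Returns: (T, D) segment-level features.
    """
    N, _ = frame_feats.shape
    # Split the N frame indices into T roughly equal chunks, mean-pool each.
    bounds = torch.linspace(0, N, T + 1).long()
    segs = [frame_feats[bounds[t]:bounds[t + 1]].mean(dim=0) for t in range(T)]
    return torch.stack(segs)

video = torch.randn(200, 1024)   # hypothetical per-frame backbone features
feats = segment_features(video)  # (32, 1024) segment features
```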
MoTAR first recalibrates temporal cues through motion-salience-guided adaptive shifting and global attention, sharpening when and where anomaly evidence concentrates along the temporal axis.
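The recalibration described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the paper's exact design: motion salience is approximated by the first-order temporal feature difference, the adaptive shift rolls a fraction of channels one step in time gated by that salience, and global attention is plain multi-head self-attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAwareRecalibration(nn.Module):
    """MoTAR-style recalibration sketch (module name and details assumed):
    motion salience gates a temporal channel shift, then global self-attention
    mixes evidence across all T segments."""

    def __init__(self, dim: int, heads: int = 4, shift_frac: float = 0.25):
        super().__init__()
        self.shift_ch = int(dim * shift_frac)  # channels shifted along time
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        # Motion salience: magnitude of the first-order temporal difference.
        diff = x[:, 1:] - x[:, :-1]
        sal = F.pad(diff.norm(dim=-1, keepdim=True), (0, 0, 1, 0))  # (B, T, 1)
        sal = torch.sigmoid(sal - sal.mean(dim=1, keepdim=True))

        # Adaptive shift: roll a slice of channels one step forward in time,
        # blended per segment in proportion to its motion salience.
        shifted = x.clone()
        shifted[..., :self.shift_ch] = torch.roll(x[..., :self.shift_ch], 1, dims=1)
        x = sal * shifted + (1 - sal) * x

        # Global attention over the full temporal extent (residual connection).
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out

motar = MotionAwareRecalibration(dim=1024)
y = motar(torch.randn(2, 32, 1024))  # (2, 32, 1024)
```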
CORE then performs category-oriented refinement with learnable prototypes.
Specifically, it predicts a soft anomaly-type distribution at the video level, uses it to form a weighted prototype embedding, and injects this semantic prior into segment features via cross-attention.
This mechanism enriches segment representations with anomaly-type semantics and makes snippet-level scoring more discriminative, especially when the anomaly space is diverse.
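The three steps of the refinement described above (soft type distribution, weighted prototype, cross-attention injection) can be sketched as below. The number of anomaly types, head counts, and layer choices are assumptions for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn

class CategoryOrientedRefinement(nn.Module):
    """CORE-style refinement sketch (hyperparameters assumed): a video-level
    soft distribution over K anomaly types weights learnable prototypes, and
    the resulting semantic prior is injected into segment features via
    cross-attention."""

    def __init__(self, dim: int, num_types: int = 13, heads: int = 4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_types, dim) * 0.02)
        self.type_head = nn.Linear(dim, num_types)  # video-level type logits
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor):  # x: (B, T, D) segment features
        # 1) Predict a soft anomaly-type distribution from pooled video features.
        p = self.type_head(x.mean(dim=1)).softmax(dim=-1)  # (B, K)
        # 2) Form the type-weighted prototype embedding.
        proto = p @ self.prototypes                        # (B, D)
        # 3) Inject the semantic prior: segments (queries) attend to the
        #    prototype token (key/value), with a residual connection.
        kv = proto.unsqueeze(1)                            # (B, 1, D)
        out, _ = self.cross_attn(self.norm(x), kv, kv)
        return x + out, p

core = CategoryOrientedRefinement(dim=1024)
refined, types = core(torch.randn(2, 32, 1024))
```

With a single prototype token as key/value, the cross-attention reduces to adding a segment-wise gated copy of the semantic prior, which is one simple way to realize the injection the text describes.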
In short, CORE turns anomaly-type recognition into a scoring-relevant signal, closing the gap between “recognizing the type” and “detecting the anomaly.”