RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection

Junhee Lee*1, ChaeBeen Bang*1, MyoungChul Kim*2, MyeongAh Cho†2
1Department of Artificial Intelligence, Kyung Hee University
2Department of Software Convergence, Kyung Hee University
*Equal Contribution    †Corresponding Author

RefineVAD refines segment features with a dual process: MoTAR improves temporal localization by motion-aware recalibration, and CORE strengthens detection by making anomaly-type semantics directly useful for scoring.

RefineVAD teaser

Teaser: an overview of RefineVAD. MoTAR refines when anomalies occur via motion-aware temporal attention, while CORE refines what the anomaly resembles by injecting soft category priors into the features used for scoring.

Abstract

Weakly-Supervised Video Anomaly Detection (WVAD) aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, jointly interpreting temporal motion patterns and the semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly-category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, RefineVAD explicitly models both "how" motion evolves and "what" semantic anomaly category it resembles, enabling category cues to support more discriminative scoring under weak supervision. Extensive experiments on WVAD benchmarks validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.

Core Contributions

  1. We revisit WVAD from a human-inspired perspective and propose dual-process refinement that combines temporal dynamics and semantic structure for anomaly localization.
  2. MoTAR: motion-aware temporal attention that adaptively recalibrates temporal feature flows based on motion salience, improving temporal focus and long-range dependency modeling.
  3. CORE: a category-oriented refinement module that learns anomaly-type priors and feeds them back into the detection pathway via prototype-based cross-attention, improving snippet-level representation for scoring.
  4. Extensive experiments (benchmark results, fine-grained localization, ablations, and transfer analysis) demonstrate consistent gains and validate the role of semantic guidance.

Why Category-Aware WVAD?

Prior WVAD paradigms largely focus on learning a single anomaly score from weak labels. While some methods additionally train a category classifier, the category signal often remains auxiliary and does not meaningfully influence snippet scoring. RefineVAD addresses this limitation by (i) improving temporal evidence with MoTAR and (ii) leveraging semantic evidence with CORE.

Key point: CORE is designed to resolve the “decoupling” issue—anomaly-type identification is not separate from detection, but is explicitly injected into the features used for anomaly scoring.

WVAD paradigms comparison

WVAD paradigms: (a) coarse snippet scoring, (b) category recognition is learned but does not directly guide scoring, (c) RefineVAD integrates temporal refinement (MoTAR) and semantic refinement (CORE).

Proposed Method

RefineVAD follows the MIL setup by splitting each video into T segments and extracting segment features. MoTAR first recalibrates temporal cues using motion salience (adaptive shifting) and global attention, sharpening "when" along the timeline the anomaly evidence concentrates.
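The MIL-style preprocessing can be sketched as below. This is a minimal NumPy illustration of the standard "split into T segments and pool" step, not the authors' exact feature-extraction pipeline; the function name and shapes are assumptions.

```python
import numpy as np

def segment_features(frame_feats: np.ndarray, T: int = 32) -> np.ndarray:
    """Mean-pool per-frame (or per-clip) features into T contiguous segments,
    as in the usual MIL setup for WVAD. frame_feats: (num_frames, dim)."""
    n = frame_feats.shape[0]
    bounds = np.linspace(0, n, T + 1).astype(int)  # segment boundaries
    return np.stack([frame_feats[bounds[t]:bounds[t + 1]].mean(axis=0)
                     for t in range(T)])

# e.g. 300 clip-level features of dimension 512 -> 32 segment features
feats = np.random.randn(300, 512).astype(np.float32)
segs = segment_features(feats, T=32)
print(segs.shape)  # (32, 512)
```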

CORE then performs category-oriented refinement with learnable prototypes. Specifically, it predicts a soft anomaly-type distribution at the video level, uses it to form a weighted prototype embedding, and injects this semantic prior into segment features via cross-attention. This mechanism enriches segment representations with anomaly-type semantics and makes snippet-level scoring more discriminative—especially when the anomaly space is diverse.

In short, CORE turns anomaly-type recognition into a scoring-relevant signal, closing the gap between “recognizing the type” and “detecting the anomaly.”

Overall architecture of RefineVAD

Overall architecture of RefineVAD (MoTAR + CORE).


MoTAR: Motion-aware Temporal Attention & Recalibration

  • Estimate motion salience from feature differences and compute an adaptive shift ratio.
  • Apply channel shifting proportionally to motion intensity, then model long-range dependencies with a lightweight Transformer.
MoTAR overview

MoTAR overview: motion variance guides adaptive channel shifting + global attention.
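The motion-guided shifting idea in the bullets above can be sketched as follows. All names, the salience normalization, and the exact shift rule are illustrative assumptions, and the subsequent lightweight Transformer for long-range modeling is omitted.

```python
import numpy as np

def motar_recalibrate(x: np.ndarray, max_shift_frac: float = 0.25) -> np.ndarray:
    """Illustrative MoTAR-style recalibration (not the authors' code).
    x: (T, D) segment features.
    1) Motion salience = norm of temporal feature differences.
    2) Adaptive shift ratio: more channels shift where motion is high.
    3) Shifted channels mix each segment with its temporal neighbors."""
    T, D = x.shape
    diff = np.diff(x, axis=0, prepend=x[:1])    # temporal differences
    salience = np.linalg.norm(diff, axis=1)     # (T,) motion magnitude
    ratio = salience / (salience.max() + 1e-8)  # normalize to [0, 1]

    out = x.copy()
    for t in range(T):
        k = int(ratio[t] * max_shift_frac * D)  # number of channels to shift
        if k == 0:
            continue
        half = k // 2
        if t > 0:                               # pull first half from the past
            out[t, :half] = x[t - 1, :half]
        if t < T - 1:                           # pull second half from the future
            out[t, half:k] = x[t + 1, half:k]
    return out

x = np.random.randn(32, 128)
y = motar_recalibrate(x)
print(y.shape)  # (32, 128)
```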

CORE: Category-Oriented Refinement

  • Predict a video-level soft distribution over anomaly categories and compute a weighted prototype prior.
  • Inject the prototype prior into segment features via cross-attention, refining snippet representations for scoring.
  • Improves both fine-grained localization and cross-dataset semantic transfer by learning a structured semantic space.
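The soft-distribution, weighted-prototype, and cross-attention steps above can be sketched in a few lines. Shapes, parameter names, and the single-layer dot-product attention are simplifying assumptions, not the released implementation.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def core_refine(segs: np.ndarray, prototypes: np.ndarray, W_cls: np.ndarray):
    """Illustrative CORE-style refinement (hypothetical names/shapes).
    segs: (T, D) segment features; prototypes: (C, D) learnable category
    prototypes; W_cls: (D, C) video-level classifier weights."""
    T, D = segs.shape
    video_feat = segs.mean(axis=0)             # video-level pooling
    p = softmax(video_feat @ W_cls)            # (C,) soft anomaly-type distribution
    keys = prototypes * p[:, None]             # category-weighted prototype prior
    attn = softmax(segs @ keys.T / np.sqrt(D), axis=-1)  # (T, C) cross-attention
    refined = segs + attn @ prototypes         # inject semantic prior into segments
    return refined, p

T, D, C = 32, 128, 13  # UCF-Crime defines 13 anomaly classes
rng = np.random.default_rng(0)
refined, p = core_refine(rng.normal(size=(T, D)),
                         rng.normal(size=(C, D)),
                         rng.normal(size=(D, C)))
print(refined.shape, round(float(p.sum()), 6))  # (32, 128) 1.0
```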

CORE is emphasized in our design because WVAD anomalies are semantically diverse; explicitly modeling anomaly types helps prevent collapsing all abnormal events into a single coarse score.

Results on WVAD Benchmarks

RefineVAD reports 88.92% AUC on UCF-Crime and 88.66% AP on XD-Violence (Table 1), showing strong performance across two widely-used WVAD benchmarks. Improvements come from the complementary effects of MoTAR (temporal recalibration) and CORE (semantic refinement for scoring).

| Setting | Method | Source | UCF-AUC (%) | XD-AP (%) |
| --- | --- | --- | ---: | ---: |
| semi-sup. | SVM baseline | - | 50.10 | 50.80 |
| | OCSVM (1999) | NeurIPS | 63.20 | 28.63 |
| | Hasan et al. (2016) | CVPR | 51.20 | 31.25 |
| weakly-sup. | Ju et al. (2022) | ECCV | 84.72 | 76.57 |
| | Sultani et al. (2018) | CVPR | 84.14 | 75.18 |
| | Wu et al. (2020) | ECCV | 84.57 | 80.00 |
| | AVVD (2022) | TMM | 82.45 | 78.10 |
| | RTFM (2021) | ICCV | 85.66 | 78.27 |
| | DMU (2023) | AAAI | 86.75 | 81.66 |
| | UMIL (2023) | CVPR | 86.75 | - |
| | CLIP-TSA (2023) | ICIP | 87.58 | 82.17 |
| | HSN (2024) | CVIU | 85.45 | - |
| | IFS-VAD (2024) | TCSVT | 85.47 | 83.14 |
| | VadCLIP (2024) | AAAI | 88.02 | 84.57 |
| | PEMIL (2024) | CVPR | 86.83 | 88.21 |
| | ReFLIP (2024) | TCSVT | 88.57 | 85.81 |
| | CMHKF (2025a) | ACL | - | 86.57 |
| | Ex-VAD (2025) | ICML | 88.29 | 86.52 |
| | π-VAD (2025) | CVPR | 90.33 | 85.37 |
| | RefineVAD (Ours) | - | 88.92 | 88.66 |

Fine-grained Evaluation & Ablations

Fine-grained comparisons on UCF-Crime (mAP@IoU)

| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | AVG |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Random Baseline | 0.21 | 0.14 | 0.04 | 0.02 | 0.01 | 0.08 |
| Sultani et al. (2018) | 5.73 | 4.41 | 2.69 | 1.93 | 1.44 | 3.24 |
| AVVD (2022) | 10.27 | 7.01 | 6.25 | 3.42 | 3.29 | 6.05 |
| VadCLIP (2024) | 11.72 | 7.83 | 6.40 | 4.53 | 2.93 | 6.68 |
| ITC (2024) | 13.54 | 9.24 | 7.45 | 5.46 | 3.79 | 7.90 |
| ReFLIP (2024) | 14.23 | 10.34 | 9.32 | 7.54 | 6.81 | 9.62 |
| Ex-VAD (2025) | 16.51 | 12.35 | 9.41 | 7.82 | 4.65 | 10.15 |
| RefineVAD (Ours) | 20.90 | 13.17 | 8.14 | 4.41 | 3.03 | 9.93 |

Ablation on UCF-Crime (AUC %)

Ablation shows that both temporal recalibration (MoTAR) and semantic refinement (CORE) contribute, with notable gains coming from CORE’s category injection and soft classification.

| MoTAR | Category Injection | Soft Classification | AUC (%) |
| :---: | :---: | :---: | ---: |
| | | | 84.60 |
| | | | 85.43 |
| | | | 87.28 |
| | | | 87.85 |
| | | | 88.89 |

Cross-Dataset Semantic Transfer

CORE’s category classifier and prototypes learned on UCF-Crime remain effective on XD-Violence, supporting transferability of the learned semantic space.

| Setting | Category GT | AP (%) |
| --- | :---: | ---: |
| Full Training | | 88.66 |
| Freeze CORE (Classifier & Embeddings trained only on UCF) | | 87.52 |
| Direct Cross-Domain Transfer (Train: UCF, Test: XD, no fine-tuning) | | 77.56 |

t-SNE of category logits

t-SNE visualization of category logit features: semantically similar categories form meaningful clusters.

Qualitative Results

Example predicted scores with GT

Example frame and predicted anomaly scores over time (blue), with ground-truth anomalous intervals (red shaded).

Citation

@article{lee2025refinevad,
  title   = {RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection},
  author  = {Junhee Lee and ChaeBeen Bang and MyoungChul Kim and MyeongAh Cho},
  journal = {arXiv preprint arXiv:2511.13204},
  year    = {2025}
}