UniGuard-Cascade: A Unified LLM-Driven Safety Scoring and Multi-Stage Audit Framework for Cross-Platform User Comments

Authors

  • Hao Tan, Guangdong University of Technology, Guangzhou, China
  • Ziming Chen, Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA

DOI:

https://doi.org/10.71222/n1xmg294

Keywords:

content moderation, large language models, DistilBERT, cross-platform comment safety, unified label taxonomy, RAG, cascading model, Trust & Safety

Abstract

User-generated comments across social media, e-commerce platforms, and online forums pose increasing challenges for automated content safety management. Existing moderation systems typically target a single platform, rely on inconsistent labeling standards, and struggle to balance accuracy with operational cost. To address these limitations, we propose UniGuard-Cascade, a unified safety scoring and cascading moderation framework that integrates large language models (LLMs), lightweight classifiers, and retrieval-augmented verification. UniGuard-Cascade introduces a platform-agnostic unified safety taxonomy covering toxicity, hate speech, spam, sexual/NSFW content, and misinformation, generated through LLM-based label alignment and normalization. The system operates as a three-stage cascade: (1) Fast Path Screening, a DistilBERT-based multi-label classifier for low-cost filtering; (2) Slow Path LLM Examination, which provides refined labels and natural-language explanations; and (3) RAG-Enhanced Misinformation Verification, which retrieves external evidence to validate factuality. Experiments conducted on four real-world datasets (Twitter, Amazon Reviews, Reddit, and YouTube Comments) show that UniGuard-Cascade consistently outperforms existing moderation baselines, achieving a Micro-F1 of 0.91 and a ROC-AUC of 0.97. The framework further reduces LLM usage by 87% compared with single-stage GPT-4 moderation, yielding a 5.7× reduction in overall inference cost while maintaining state-of-the-art multi-platform safety performance.
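
To make the cascade concrete, the Python sketch below illustrates the routing logic the abstract describes. It is an illustration only, not the authors' implementation: the thresholds (0.10 and 0.90), the Verdict container, and the helpers fast_path_scores, llm_examine, and rag_verify are hypothetical stand-ins for the DistilBERT classifier, the slow-path LLM, and the retrieval step.

from dataclasses import dataclass, field

# The platform-agnostic unified taxonomy named in the abstract.
LABELS = ["toxicity", "hate_speech", "spam", "sexual_nsfw", "misinformation"]

@dataclass
class Verdict:
    scores: dict          # per-label probability in [0, 1]
    stage: str            # which cascade stage produced the decision
    explanation: str = ""                          # slow path only
    evidence: list = field(default_factory=list)   # RAG stage only

def fast_path_scores(comment: str) -> dict:
    # Stage 1 stand-in: a DistilBERT-style multi-label classifier
    # would emit one probability per unified label.
    return {label: 0.0 for label in LABELS}

def llm_examine(comment: str) -> tuple[dict, str]:
    # Stage 2 stand-in: an LLM refines the labels and explains them.
    return {label: 0.0 for label in LABELS}, "stub explanation"

def rag_verify(comment: str) -> tuple[float, list]:
    # Stage 3 stand-in: retrieve external evidence, score factuality.
    return 1.0, []

def moderate(comment: str, low: float = 0.10, high: float = 0.90) -> Verdict:
    scores = fast_path_scores(comment)
    # Fast path: if every label is confidently safe (<= low) or
    # confidently unsafe (>= high), decide cheaply and stop here.
    if all(s <= low or s >= high for s in scores.values()):
        return Verdict(scores, stage="fast_path")
    scores, explanation = llm_examine(comment)
    # Misinformation is the one label that needs external grounding,
    # so only suspected-misinformation cases pay for retrieval.
    if scores["misinformation"] > low:
        factuality, evidence = rag_verify(comment)
        scores["misinformation"] = 1.0 - factuality
        return Verdict(scores, stage="rag_verification",
                       explanation=explanation, evidence=evidence)
    return Verdict(scores, stage="slow_path", explanation=explanation)

In this sketch, only comments whose fast-path scores fall in the gray zone between the two thresholds ever reach an LLM; that gating is the mechanism behind the reported 87% reduction in LLM calls, though the paper's actual escalation rule is not specified in the abstract.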


Published

14 January 2026

Section

Article

How to Cite

Tan, H., & Chen, Z. (2026). UniGuard-Cascade: A Unified LLM-Driven Safety Scoring and Multi-Stage Audit Framework for Cross-Platform User Comments. Journal of Computer, Signal, and System Research, 3(1), 42-50. https://doi.org/10.71222/n1xmg294