A Secure Federated Learning Algorithm for Emotion Recognition Towards Multimodal Speaker Signals on the Client Side
DOI: https://doi.org/10.71222/gqqywa40
Keywords: multimodal emotion recognition, federated learning, transformer, attention scores, heterogeneity
Abstract
With the widespread application of multimodal data in dialogue emotion recognition, effectively integrating text, audio, and visual information while addressing data heterogeneity across clients and preserving user privacy has become a key research challenge. This paper proposes a multimodal emotion recognition framework that combines a Transformer self-distillation model guided by attention scores with a federated learning algorithm. The framework employs intra-modal and inter-modal Transformers to capture multimodal interactions, enhances modality representations through attention weights, and adopts a federated learning structure to safeguard data privacy. A global model distance-weighted aggregation strategy is introduced to mitigate the model bias caused by heterogeneous data. Experimental results on the IEMOCAP dataset demonstrate that the proposed framework achieves higher emotion recognition accuracy and more stable model convergence than existing baseline models.
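The abstract names a global model distance-weighted aggregation strategy but does not give its formula. The sketch below is a minimal illustration of one plausible realization, assuming that each client's aggregation weight decays with the L2 distance between its locally trained parameters and the previous global model, so that strongly drifted (heterogeneous) updates contribute less. The function name, the softmax weighting, and the temperature parameter are illustrative assumptions, not the paper's stated method.

# Hypothetical sketch of a distance-weighted federated aggregation step
# (assumed form; the paper does not specify the exact weighting rule).
import numpy as np

def distance_weighted_aggregate(global_params, client_params, temperature=1.0):
    """Aggregate client parameter vectors into a new global parameter vector.

    global_params : np.ndarray, flattened previous global model parameters
    client_params : list[np.ndarray], flattened locally trained client parameters
    temperature   : softmax temperature controlling how sharply distant
                    clients are down-weighted (an illustrative knob)
    """
    # L2 distance of each client update from the previous global model
    dists = np.array([np.linalg.norm(p - global_params) for p in client_params])

    # Smaller distance -> larger weight (softmax over negative distances)
    logits = -dists / temperature
    logits -= logits.max()                      # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()

    # Weighted average of client parameters becomes the new global model
    new_global = sum(w * p for w, p in zip(weights, client_params))
    return new_global, weights

# Minimal usage example with random stand-in parameter vectors
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.normal(size=1000)
    clients = [g + rng.normal(scale=s, size=1000) for s in (0.1, 0.1, 1.5)]
    new_g, w = distance_weighted_aggregate(g, clients)
    print("aggregation weights:", np.round(w, 3))  # the drifted client gets the smallest weight

Under this assumed rule, a client whose local model drifts far from the global model (for example because of non-IID data) is down-weighted at aggregation time, which is one common way to reduce heterogeneity-induced bias in federated averaging.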
License
Copyright (c) 2026 Xin Wang, Longlong Qiao, Guangxin Dai, Quanping Chen, Yan Zhang, Wensong Li (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.







