Research on Feature Fusion and Multimodal Patent Text Based on Graph Attention Network
DOI: https://doi.org/10.71222/m3pbw568

Keywords: hierarchical comparative learning, multimodal graph attention networks, multi-granularity sparse attention, patent semantic mining

Abstract
To address the challenges of cross-modal feature fusion, the low computational efficiency of long patent-text modeling, and insufficient hierarchical semantic coherence in patent text semantic mining, this study proposes a novel deep learning framework termed HGM-Net. The framework integrates Hierarchical Comparative Learning (HCL), a Multimodal Graph Attention Network (M-GAT), and Multi-Granularity Sparse Attention (MSA) to achieve robust, efficient, and semantically consistent patent representation learning. Specifically, HCL introduces dynamic masking, contrastive learning, and cross-structural similarity constraints across word-, sentence-, and paragraph-level hierarchies, enabling the model to jointly capture fine-grained local semantics and high-level thematic consistency. Contrastive and cross-structural similarity constraints are enforced in particular at the word and paragraph levels, enhancing semantic discrimination and global coherence within complex patent documents. Furthermore, M-GAT models patent classification codes, citation relationships, and textual semantics as heterogeneous graph structures, and employs cross-modal gated attention mechanisms to dynamically fuse multi-source, multimodal features, thereby improving the completeness and robustness of the learned representations. To address the high computational cost of long-text processing, MSA adopts a hierarchical sparse attention strategy that selectively allocates attention across multiple granularities, including words, phrases, sentences, and paragraphs, significantly reducing computational overhead while preserving critical semantic information. Extensive experimental evaluations on patent classification and similarity matching tasks demonstrate that HGM-Net consistently outperforms existing state-of-the-art deep learning approaches.
The results validate the effectiveness and generalization capability of the proposed framework, highlighting its theoretical innovation and practical value for improving patent examination efficiency and enabling large-scale technology relevance mining.
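To make the cross-modal gated attention idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: a learned sigmoid gate decides, per feature dimension, how much of the textual embedding versus the graph-side (classification-code/citation) embedding to retain. The function name `gated_fusion` and the randomly initialized parameters `W` and `b` are placeholders for what would be learned end to end.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(text_feat, graph_feat, W, b):
    """Cross-modal gated fusion (sketch): a gate g in (0, 1)^d blends the
    text embedding and the graph embedding as g * text + (1 - g) * graph.
    W and b stand in for parameters that would be learned end to end."""
    z = np.concatenate([text_feat, graph_feat])  # joint conditioning vector
    gate = sigmoid(W @ z + b)                    # per-dimension gate values
    return gate * text_feat + (1.0 - gate) * graph_feat

# Toy usage with random "learned" parameters.
rng = np.random.default_rng(0)
d = 4
text_emb = rng.normal(size=d)        # stand-in text embedding
graph_emb = rng.normal(size=d)       # stand-in graph embedding
W = rng.normal(size=(d, 2 * d))
b = np.zeros(d)
fused = gated_fusion(text_emb, graph_emb, W, b)
print(fused.shape)  # (4,)
```

Because the gate lies strictly in (0, 1), each fused coordinate is a convex combination of the corresponding text and graph coordinates, so neither modality can be discarded entirely; this is one common design choice for gated fusion, though the paper's actual formulation may differ.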
License
Copyright (c) 2026 Zhenzhen Song, Ziwei Liu, Hongji Li (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.