Cross-Region Few-Shot Remote Sensing Image Captioning via Adaptive Vision-Language Feature Fusion
DOI: https://doi.org/10.71222/kt0ew220

Keywords: remote sensing, image captioning, few-shot learning, vision-language fusion, cross-region adaptation

Abstract
Remote Sensing Image Captioning (RSIC) enables automated interpretation of aerial imagery by converting complex visual scenes into coherent natural language descriptions. Two key challenges in RSIC are the scarcity of annotated data and significant domain shifts across geographic regions: models trained on one region's visual characteristics often degrade when applied to visually distinct landscapes such as agricultural or coastal areas. To address this, we propose the Adaptive Vision-Language Feature Fusion (AVLF) network, a few-shot learning framework designed to achieve robust cross-region transfer with minimal data. AVLF bridges the semantic gap between visual and linguistic representations through an adaptive gating mechanism that dynamically balances visual and language features during caption generation. Extensive experiments on cross-region splits of multiple remote sensing datasets demonstrate that AVLF achieves state-of-the-art performance, maintains high captioning quality with limited support sets, generalizes effectively to unseen semantic categories, and incurs minimal computational overhead. Feature-space visualizations show well-separated class distributions, while attention maps confirm that the model attends to semantically relevant geographic objects. Ablation studies further highlight the importance of the adaptive fusion strategy in overcoming domain discrepancies and enhancing few-shot learning capability.
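The gating mechanism described above follows a well-established gated-fusion pattern: a learned sigmoid gate weighs visual evidence against the language prior at each decoding step. The sketch below is a minimal PyTorch illustration of that generic pattern, not the AVLF implementation; the class name, feature dimension, and single-layer gate are assumptions, since the abstract gives no architectural details.

import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    # Hypothetical illustration of sigmoid-gated vision-language fusion;
    # AVLF's actual gate parameterization is not specified in the abstract.
    def __init__(self, dim: int):
        super().__init__()
        # The gate is predicted from the concatenated modalities.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, language: torch.Tensor) -> torch.Tensor:
        # visual, language: (batch, dim) features at one decoding step.
        g = self.gate(torch.cat([visual, language], dim=-1))
        # Per-dimension convex blend: g near 1 favors vision, near 0 favors language.
        return g * visual + (1.0 - g) * language

# Usage: fuse the two modalities before the word predictor.
fusion = AdaptiveGatedFusion(dim=512)
v = torch.randn(4, 512)  # e.g., image-encoder features
l = torch.randn(4, 512)  # e.g., decoder hidden state
print(fusion(v, l).shape)  # torch.Size([4, 512])

Because the gate output lies in (0, 1) per dimension, the fusion is a learned convex combination, which lets a decoder lean on language context when visual cues are ambiguous, and on visual evidence when they are not.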
License
Copyright (c) 2026 Qikun Zuo (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.