Optimization of Large Models for Efficient Inference: Algorithm, Compiler, and System Co-Design
DOI: https://doi.org/10.71222/078xh379

Keywords: large language models, inference optimization, algorithm-compiler-system co-design, computational efficiency, memory management, parallel execution

Abstract
Large language models have achieved remarkable capabilities in natural language processing, yet their inference cost remains a significant challenge due to high computation, memory usage, and latency. This study presents a cross-layer co-optimization framework that integrates algorithmic, compiler, and system-level strategies to enhance inference efficiency. Algorithmic techniques, including structured pruning, sparsity, quantization, and dynamic inference, reduce computational workload and memory footprint. Compiler optimizations, such as operator fusion, graph rewriting, and layout specialization, translate algorithmic improvements into hardware-efficient execution. System-level strategies, encompassing parallel execution, memory management, and KV cache optimization, further improve resource utilization and reduce latency. The framework synergistically coordinates these layers, providing a theoretically grounded approach for reducing FLOPs, memory consumption, and inference latency. Its adaptability extends to cloud, edge, and interactive deployment scenarios, offering a unified methodology for efficient and scalable large-model inference. This work contributes a systematic and extensible pathway for accelerating model inference without relying on empirical performance measurements.
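As a concrete illustration of one algorithmic technique named above, the following is a minimal NumPy sketch of symmetric per-channel INT8 weight quantization for a single linear layer, showing how weight storage drops to roughly one quarter of FP32 at a small approximation error. The function names, shapes, and the per-output-channel scaling scheme are illustrative assumptions for this sketch, not an implementation taken from the paper.

```python
import numpy as np

def quantize_int8_per_channel(weights: np.ndarray):
    """Symmetric per-output-channel INT8 quantization (illustrative sketch).

    weights: float32 array of shape (out_features, in_features).
    Returns (q_weights, scales) such that q_weights * scales approximates weights.
    """
    # One scale per output channel: the largest |w| in the row maps to 127.
    max_abs = np.max(np.abs(weights), axis=1, keepdims=True)      # (out, 1)
    scales = max_abs / 127.0
    scales = np.where(scales == 0.0, 1.0, scales)                 # avoid divide-by-zero
    q_weights = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q_weights, scales.astype(np.float32)

def dequantized_matmul(x: np.ndarray, q_weights: np.ndarray, scales: np.ndarray):
    """Compute x @ W^T from the INT8 weights.

    A production engine would run the GEMM in integer arithmetic and apply the
    scales to the accumulator; here we simply dequantize for clarity.
    """
    return x @ (q_weights.astype(np.float32) * scales).T

# Example: a 512x2048 projection stored in 1 byte per weight instead of 4.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 2048)).astype(np.float32)
qw, s = quantize_int8_per_channel(w)
x = rng.standard_normal((1, 2048)).astype(np.float32)
print(np.max(np.abs(dequantized_matmul(x, qw, s) - x @ w.T)))    # small quantization error
```

Per-channel scales are used here rather than a single per-tensor scale because rows of a projection matrix can differ widely in magnitude; the same sketch works with a scalar scale if one prefers simpler kernel support.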
License
Copyright (c) 2025 Shengyi Gao (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.