Open Access
Review
Issue
Natl Sci Open
Volume 5, Number 3, 2026
Article Number 20260016
Number of page(s) 39
Section Information Sciences
DOI https://doi.org/10.1360/nso/20260016
Published online 29 April 2026
  • Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Proceedings of the Annual Conference on Neural Information Processing Systems. Long Beach, CA, 2017, 5998–6008. [Google Scholar]
  • Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Research 2020; 21: 1–67. [Google Scholar]
  • Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training. OpenAI Technical Report 2018. https://www.mikecaptain.com/resources/pdf/GPT-1.pdf. [Google Scholar]
  • Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1: 8–9. [Google Scholar]
  • Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. In: Proceedings of the International Conference on Neural Information Processing Systems. Vancouver BC, 2020, 1877–1901. [Google Scholar]
  • Devvlin J, Chang M-W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, MN, 2019, 4171–4186. [Google Scholar]
  • Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision. In: Proceedings of the International Conference on Machine Learning. Virtual, 2021, 8748–8763. [Google Scholar]
  • Li J, Li D, Xiong C, et al. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Proceedings of the International Conference on Machine Learning. Baltimore, 2022. [Google Scholar]
  • Zhang C, Yang Z, He X, et al. Multimodal intelligence: Representation learning, information fusion, and applications. IEEE J Sel Top Signal Process 2020; 14: 478–493.[Article] [Google Scholar]
  • Zhao F, Zhang C, Geng B. Deep multimodal data fusion. arXiv: https://arxiv.org/abs/2312.11805. [Google Scholar]
  • Li Y, Jiang S, Hu B, et al. Uni-MoE: Scaling unified multimodal LLMs with mixture of experts. IEEE Trans Pattern Anal Mach Intell 2025; 47: 3424–3439.[Article] [Google Scholar]
  • Ye Q, Xu H, Ye J, et al. mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, 2024, 13040–13051. [Google Scholar]
  • Alayrac J-B, Donahue J, Luc P, et al. Flamingo: A visual language model for few-shot learning. In: Proceedings of the Annual Conference on Neural Information Processing Systems. New Orleans, LA, 2022, 23789–23803. [Google Scholar]
  • Driess D, Xia F, Sajjadi MSM, et al. PaLM-E: An embodied multimodal language model. In: Proceedings of the International Conference on Machine Learning. Honolulu, 2023. [Google Scholar]
  • Li J, Li D, Savarese S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Proceedings of the International Conference on Machine Learning. Honolulu, 2023. [Google Scholar]
  • Yang Z, Li L, Lin K, et al. The dawn of lmms: Preliminary explorations with GPT-4V (ision). arXiv: https://arxiv.org/abs/2309.17421. [Google Scholar]
  • Gemini Team, Anil R, Borgeaud S, et al. Gemini: A family of highly capable multimodal models. arXiv: https://arxiv.org/abs/2312.11805. [Google Scholar]
  • Bourechak A, Zedadra O, Kouahla MN, et al. At the confluence of artificial intelligence and edge computing in IoT-based applications: A review and new perspectives. Sensors 2023; 23: 1639.[Article] [Google Scholar]
  • Wang X, Garg S, Lin H, et al. A secure data aggregation strategy in edge computing and blockchain-empowered internet of things. IEEE Internet Things J 2020; 9: 14237–14246.[Article] [Google Scholar]
  • Alwarafy A, Al-Thelaya KA, Abdallah M, et al. A survey on security and privacy issues in edge-computing-assisted internet of things. IEEE Internet Things J 2020; 8: 4004–4022.[Article] [Google Scholar]
  • Al-Ansi A, Al-Ansi AM, Muthanna A, et al. Survey on intelligence edge computing in 6G: Characteristics, challenges, potential use cases, and market drivers. Future Internet 2021; 13: 118.[Article] [Google Scholar]
  • Yao Y, Yu T, Zhang A, et al. Efficient GPT-4V level multimodal large language model for deployment on edge devices. Nat Commun 2025; 16: 5509.[Article] [Google Scholar]
  • Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models. arXiv: https://arxiv.org/abs/2001.08361. [Google Scholar]
  • Gholami A, Yao Z, Kim S, et al. AI and memory wall. IEEE Micro 2024; 44: 33–39.[Article] [Google Scholar]
  • Zheng Y, Chen Y, Qian B, et al. A review on edge large language models: Design, execution, and applications. ACM Comput Surv 2025; 57: 1–35.[Article] [Google Scholar]
  • Li J, Li J, Yang G, et al. Applications of large language models and multimodal large models in autonomous driving: A comprehensive review. Drones 2025; 9: 238.[Article] [Google Scholar]
  • Bajpai K, Gupta V. EcoLLM: A joint optimization framework for ultra-low power, mixed-precision LLM inference on resource-constrained edge systems. 2025, 10.36227/techrxiv.176114085.56648160/v1. [Google Scholar]
  • Xiong ZB, Luo XR, Wang BN. Research on lightweight deployment of artificial intelligence models and intelligent network bandwidth optimization based on edge computing. In: Proceedings of the 2025 IEEE 8th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE). Shenyang, 2025. [Google Scholar]
  • Chen Y, Yan Y, Ge S, et al. Confidant: Customizing transformer-based LLMs via collaborative training on mobile devices. In: Proceedings of the 31st Annual International Conference on Mobile Computing and Networking. Hong Kong, 2025, 483–497. [Google Scholar]
  • Jin Y, Li J, Liu Y, et al. Efficient multimodal large language models: A survey. arXiv: https://arxiv.org/abs/2405.10739. [Google Scholar]
  • Zhou Z, Ning X, Hong K, et al. A survey on efficient inference for large language models. arXiv: https://arxiv.org/abs/2404.14294. [Google Scholar]
  • Sharshar A, Khan LU, Ullah W, et al. Vision-language models for edge networks: A comprehensive survey. IEEE Internet Things J 2025; 12: 32701–32724.[Article] [Google Scholar]
  • Li Y, Liu Z, Li Z, et al. Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv: https://arxiv.org/abs/2505.04921. [Google Scholar]
  • Han J, Gong K, Zhang Y, et al. Onellm: One framework to align all modalities with language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, 2024. [Google Scholar]
  • Xing S, Qian C, Wang Y, et al. OpenEMMA: Open-source multimodal model for end-to-end autonomous driving. In: Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Tucson, AZ, 2025, 911–919. [Google Scholar]
  • Luo R, Lin T-E, Zhang H, et al. OpenOmni: Advancing open-source omnimodal large language models with progressive multimodal alignment and real-time emotional speech synthesis. In: Proceedings of the Annual Conference on Neural Information Processing Systems. San Diego, CA, 2025. [Google Scholar]
  • Cantini R, Orsino A, Talia D. Xai-driven knowledge distillation of large language models for efficient deployment on low-resource devices. J Big Data 2024; 11: 63.[Article] [Google Scholar]
  • Liu H, Li C, Wu Q, et al. Visual instruction tuning. In: Proceedings of the Annual Conference on Neural Information Processing Systems. New Orleans, LA, 2023. [Google Scholar]
  • Chu X, Qiao L, Lin X, et al. Mobilevlm: A fast, strong and open vision language assistant for mobile devices. arXiv: https://arxiv.org/abs/2312.16886. [Google Scholar]
  • Zhu Y, Zhu M, Liu N, et al. LLaVA-Phi: Efficient multi-modal assistant with small language model. In: Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited. Melbourne VIC, 2024, 18–22. [Google Scholar]
  • Zhu M, Zhu Y, Liu X, et al. Mipha: A comprehensive overhaul of multimodal assistant with small language models. arXiv: https://arxiv.org/abs/2403.06199. [Google Scholar]
  • Javaheripi M, Bubeck S, Abdin M, et al. Phi-2: The surprising power of small language models. Microsoft Res Blog 2023, 1: 3–3. [Google Scholar]
  • Zhang P, Zeng G, Wang T, et al. Tinyllama: An open-source small language model. arXiv: https://arxiv.org/abs/2401.02385. [Google Scholar]
  • Yuan Z, Li Z, Huang W, et al. TinyGPT-V: Efficient multimodal large language model via small backbones. In: Proceedings of the 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ ICML 2024). Vienna, 2024. [Google Scholar]
  • Wei H, Kong L, Chen J, et al. Small language model meets with reinforced vision vocabulary. arXiv: https://arxiv.org/abs/2401.12503. [Google Scholar]
  • Google. Gemma 3n. 2025. https://ai.google.dev/gemma/docs/gemma-3n. [Google Scholar]
  • Yao Y, Yu T, Zhang A, et al. Minicpm-v: A GPT-4V level mllm on your phone. arXiv: https://arxiv.org/abs/2408.01800. [Google Scholar]
  • OpenBMB MiniCPM-o Team. Minicpm-o 2.6: A GPT-4O level mllm for vision, speech, and multimodal live streaming on your phone. 2025. https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9. [Google Scholar]
  • Yu T, Wang Z, Wang C, et al. Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe. arXiv: https://arxiv.org/abs/2509.18154. [Google Scholar]
  • ZhipuAI. GLM-Edge. 2024. https://github.com/zai-org/GLM-Edge. [Google Scholar]
  • Ning Z, Zhao J, Jin Q, et al. Inf-MLLM: Efficient streaming inference of multimodal large language models on a single GPU. arXiv: https://arxiv.org/abs/2409.09086. [Google Scholar]
  • Lin Z, Lin M, Lin L, et al. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In: Proceedings of the AAAI Conference on Artificial Intelligence. Philadelphia, 2025, 105–113. [Google Scholar]
  • Singh G, Wang X, Hu Y, et al. Efficiently serving large multimodal models using epd disaggregation. In: Proceedings of the International Conference on Machine Learning. Vancouver, 2025. [Google Scholar]
  • Wang H, Yu Z, Spadaro G, et al. Folder: Accelerating multi-modal large language models with enhanced performance. arXiv: https://arxiv.org/abs/2501.02430. [Google Scholar]
  • Xie X, Zhang X, Tang X, et al. MACTFusion: Lightweight cross transformer for adaptive multimodal medical image fusion. IEEE J Biomed Health Inform 2024; 29: 3317–3328.[Article] [Google Scholar]
  • Wang J, Chen H, Zhang X, et al. CaPaT: Cross-aware paired-affine transformation for multimodal data fusion network. IEEE Geosci Remote Sens Lett 2025; 22: 1–5.[Article] [Google Scholar]
  • Yu J, Zhou S, Yang D, et al. Mquant: Unleashing the inference potential of multimodal large language models via static quantization. In: Proceedings of the 33rd ACM International Conference on Multimedia. Dublin, 2025, 1783–1792. [Google Scholar]
  • Gagrani M, Goel R, Jeon W, et al. On speculative decoding for multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, 2024. [Google Scholar]
  • Shukor M, Cord M. Skipping computations in multimodal LLMs. arXiv: https://arxiv.org/abs/2410.09454. [Google Scholar]
  • Wan Z, Wu Z, Liu C, et al. Look-m: Look-once optimization in KV cache for efficient multimodal long-context inference. arXiv: https://arxiv.org/abs/2406.18139. [Google Scholar]
  • Huang W, Zhai Z, Shen Y, et al. Dynamic-LLaVA: Efficient multimodal large language models via dynamic vision-language context sparsification. In: Proceedings of the Thirteenth International Conference on Learning Representations. Singapore, 2025. [Google Scholar]
  • Pope R, Douglas S, Chowdhery A, et al. Efficiently scaling transformer inference. Proceed Mach Learn Sys 2023, 5: 606–624. [Google Scholar]
  • NVIDIA NIM LLMs Benchmarking. 2025. https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html. [Google Scholar]
  • Desislavov R, Martínez-Plumed F, Hernández-Orallo J. Trends in AI inference energy consumption: Beyond the performance-vs-parameter laws of deep learning. Sustain Computing-Inf Syst 2023; 38: 100857.[Article] [Google Scholar]
  • Kwon W, Li Z, Zhuang S, et al. Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th Symposium on Operating Systems Principles. Koblenz, 2023, 611–626. [Google Scholar]
  • Sheng Y, Zheng L, Yuan B, et al. Flexgen: High-throughput generative inference of large language models with a single GPU. In: Proceedings of the International Conference on Machine Learning. Honolulu 2023, 202: 31094–31116. [Google Scholar]
  • Fu C, Chen P, Shen Y, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv: https://arxiv.org/abs/2306.13394. [Google Scholar]
  • Liu Y, Duan H, Zhang Y, et al. MMBench: Is your multi-modal model an all-around player? In: Computer Vision—ECCV 2024. ECCV 2024. Lecture Notes in Computer Science. Cham: Springer, 2024. [Google Scholar]
  • Yue X, Ni Y, Zhang K, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, 2024. [Google Scholar]
  • Lu P, Bansal H, Xia T, et al. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, 2024. [Google Scholar]
  • Fu C, Dai Y, Luo Y, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, 2025. [Google Scholar]
  • Yang N, Wen J, Zhang M, et al. Generalizable pareto-optimal offloading with reinforcement learning in mobile edge computing. IEEE Trans Serv Comput 2025; 18: 3824–3836.[Article] [Google Scholar]
  • Shao K, Tao K, Zhang K, et al. When tokens talk too much: A survey of multimodal long-context token compression across images, videos, and audios. arXiv: https://arxiv.org/abs/2507.20198. [Google Scholar]
  • Ma Y, Abdelraouf A, Gupta R, et al. Video token sparsification for efficient multimodal LLMs in driving visual question answering. In: Proceedings of the 2025 IEEE Intelligent Vehicles Symposium (IV). Cluj-Napoca, 2025. [Google Scholar]
  • Gao Z, Wang Y, Chen J, et al. MMTSA: Multi-modal temporal segment attention network for efficient human activity recognition. In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies. 2023, 1–26. [Google Scholar]
  • Gao Z, Chen Z, Cui E, et al. Mini-InternVL: A flexible-transfer pocket multi-modal model with 5% parameters and 90% performance. Vis Intell 2024; 2: 32.[Article] [Google Scholar]
  • Wang P, Bai S, Tan S, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv: https://arxiv.org/abs/2409.12191. [Google Scholar]
  • Su J, Ahmed M, Lu Y, et al. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing 2024; 568: 127063.[Article] [Google Scholar]
  • Kimi Team, Du A, Yin B, et al. Kimi-VL technical report. arXiv: https://arxiv.org/abs/2504.07491. [Google Scholar]
  • Guo Z, Xu R, Yao Y, et al. LLaVA-UHD: An LMM perceiving any aspect ratio and high-resolution images. In: Computer Vision—ECCV 2024. ECCV 2024. Lecture Notes in Computer Science. Cham: Springer, 2024. [Google Scholar]
  • Dehghani M, Mustafa B, Djolonga J, et al. Patch n’Pack: Navit, a vision transformer for any aspect ratio and resolution. In: Proceedings of the Annual Conference on Neural Information Processing Systems. New Orleans, LA, 2023. [Google Scholar]
  • Zhai X, Mustafa B, Kolesnikov A, et al. Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris, 2023. [Google Scholar]
  • Li B, Zhang Y, Guo D, et al. Llava-onevision: Easy visual task transfer. arXiv: https://arxiv.org/abs/2408.03326. [Google Scholar]
  • Xiong B, Chen B, Wang C, et al. BlueLM-2.5-3B Technical Report. arXiv: https://arxiv.org/abs/2507.05934. [Google Scholar]
  • Vasu PKA, Faghri F, Li C-L, et al. Fastvlm: Efficient vision encoding for vision language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, 2025. [Google Scholar]
  • Bai S, Chen K, Liu X, et al. Qwen2.5-VL technical report. arXiv: https://arxiv.org/abs/2502.13923. [Google Scholar]
  • Huang M, Huang R, Shi H, et al. Efficient multi-modal large language models via visual token grouping. arXiv: https://arxiv.org/abs/2411.17773. [Google Scholar]
  • Marafioti A, Zohar O, Farré M, et al. Smolvlm: Redefining small and efficient multimodal models. arXiv: https://arxiv.org/abs/2504.05299. [Google Scholar]
  • Yang G, Yan X, Kou H, et al. TWDP: A vision transformer accelerator with token-weight dual-pruning strategy for edge device deployment. In: Proceedings of the 30th Asia and South Pacific Design Automation Conference. Tokyo, 2025, 177–182. [Google Scholar]
  • Zhao S, Wang Z, Juefei-Xu F, et al. Accelerating multimodal large language models by searching optimal vision token reduction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, 2025. [Google Scholar]
  • Tang Y, Han K, Wang Y, et al. Patch slimming for efficient vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, 2022. [Google Scholar]
  • Chen L, Zhao H, Liu T, et al. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In: Computer Vision—ECCV 2024. ECCV 2024. Lecture Notes in Computer Science. Cham: Springer, 2024. [Google Scholar]
  • Liu H, Li C, Li Y, et al. Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, 2024. [Google Scholar]
  • Bai J, Bai S, Yang S, et al. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv: https://arxiv.org/abs/2308.12966. [Google Scholar]
  • Han Y, Liu X, Ding P, et al. Rethinking token reduction in MLLMs: Towards a unified paradigm for training-free acceleration. arXiv: https://arxiv.org/abs/2411.17686. [Google Scholar]
  • Sun Y, Xin Y, Li H, et al. LVPruning: An effective yet simple language-guided vision token pruning approach for multi-modal large language models. In: Findings of the Association for Computational Linguistics: NAACL 2025. Albuquerque: Association for Computational Linguistics, 2025, 4299–4308. [Google Scholar]
  • Zhao Z, Li Y, Li Y. Learning free token reduction for multi-modal large language models. arXiv: https://arxiv.org/abs/2501.17391. [Google Scholar]
  • Lin B, Ye Y, Zhu B, et al. Video-LLaVA: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024. [Google Scholar]
  • Liu T, Shi L, Hong R, et al. Multi-stage vision token dropping: Towards efficient multimodal large language model. arXiv: https://arxiv.org/abs/2411.10803. [Google Scholar]
  • Liu H, Li C, Li Y, et al. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. 2024. https://llava-vl.github.io/blog/2024-01-30-llava-next/. [Google Scholar]
  • Singh A, Natarajan V, Shah M, et al. Towards VQA models that can read. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, 2019. [Google Scholar]
  • Gholami M, Akbari M, Cannons K, et al. CASP: Compression of large multimodal models based on attention sparsity. In: Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, 2025. [Google Scholar]
  • Chen F, He Y, Lin L, et al. Zipr1: Reinforcing token sparsity in MLLMs. arXiv: https://arxiv.org/abs/2504.18579. [Google Scholar]
  • Mitra C, Huang B, Chai T, et al. Enhancing few-shot vision-language classification with large multimodal model features. In: Proceedings of the 2025 IEEE/CVF International Conference on Computer Vision (ICCV). Honolulu, HI, 2025. [Google Scholar]
  • Chu Y F, Xu J, Yang Q, et al. Qwen2-audio technical report. arXiv: https://arxiv.org/abs/2407.10759. [Google Scholar]
  • Yang T, Ma F, Li X, et al. DTATrans: Leveraging dynamic token-based quantization with accuracy compensation mechanism for efficient transformer architecture. IEEE Trans Comput-Aided Des Integr Circuits Syst 2022; 42: 509–520.[Article] [Google Scholar]
  • Jia D, Guo J, Han K, et al. GeminiFusion: Efficient pixel-wise multimodal fusion for vision transformer. In: Proceedings of the International Conference on Machine Learning. Vienna, 2024. [Google Scholar]
  • Li B, Li Y, Li Z, et al. Megrez-omni technical report. arXiv: https://arxiv.org/abs/2502.15803. [Google Scholar]
  • Chu X, Qiao L, Zhang X, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv: https://arxiv.org/abs/2402.03766. [Google Scholar]
  • Zhang S, Fang Q, Yang Z, et al. Llava-mini: Efficient image and video large multimodal models with one vision token. In: Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025. Singapore, 2025. [Google Scholar]
  • Zhu D, Chen J, Shen X, et al. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, 2024. [Google Scholar]
  • Li W, Zhou H, Yu J, et al. Coupled mamba: Enhanced multimodal fusion with coupled state space model. In: Proceedings of the Annual Conference on Neural Information Processing Systems. Vancouver, 2024. [Google Scholar]
  • Sun X, Yang Z, Xie R, et al. LightVLP: A lightweight vision-language pre-training via gated interactive masked autoencoders. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, 2024, 10499–10510. [Google Scholar]
  • Hu Y, Fan Z, Wang X, et al. TinyAlign: Boosting lightweight vision-language models by mitigating modal alignment bottlenecks. arXiv: https://arxiv.org/abs/2505.12884. [Google Scholar]
  • Goyal Y, Khot T, Summers-Stay D, et al. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, 2017. [Google Scholar]
  • Cai Z, Cao M, Chen H, et al. InternLM2 technical report. arXiv: https://arxiv.org/abs/2403.17297. [Google Scholar]
  • Abdin M I, Jacobs S A, Awan A A, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv: https://arxiv.org/abs/2404.14219. [Google Scholar]
  • Bellagente M, Tow J, Mahan D, et al. Stable LM 2.1.6B technical report. arXiv: https://arxiv.org/abs/2402.17834. [Google Scholar]
  • Bai J, Bai S, Chu Y, et al. Qwen technical report. arXiv: https://arxiv.org/abs/2309.16609. [Google Scholar]
  • Biderman S, Schoelkopf H, Anthony QG, et al. Pythia: A suite for analyzing large language models across training and scaling. In: Proceedings of the International Conference on Machine Learning. Honolulu 2023. [Google Scholar]
  • Liu Y, Li Z, Huang M, et al. OCRBench: On the hidden mystery of OCR in large multimodal models. Sci China Inf Sci 2024; 67: 220102.[Article] [Google Scholar]
  • Mathew M, Karatzas D, Jawahar CV. DocVQA: A dataset for VQA on document images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Online, 2021. [Google Scholar]
  • Yu L, Poirson P, Yang S, et al. Modeling context in referring expressions. In: Computer Vision—ECCV 2016. ECCV 2016. Lecture Notes in Computer Science. Cham: Springer, 2016. [Google Scholar]
  • Zhang B, Sennrich R. Root mean square layer normalization. In: Proceedings of the Annual Conference on Neural Information Processing Systems. Vancouver, 2019. [Google Scholar]
  • Jiang A Q, Sablayrolles A, Roux A, et al. Mixtral of experts. arXiv: https://arxiv.org/abs/2401.04088. [Google Scholar]
  • Lin B, Tang Z, Ye Y, et al. MoE-LLaVA: Mixture of experts for large vision-language models. IEEE T Multimedia, 2026: 1–14. [Google Scholar]
  • Gu A, Dao T. MAMBA: Linear-time sequence modeling with selective state spaces. In: Proceedings of the First conference on language modeling. Philadelphia, 2024. [Google Scholar]
  • Zhao H, Zhang M, Zhao W, et al. Cobra: Extending mamba to multi-modal large language model for efficient inference. In: Proceedings of the AAAI Conference on Artificial Intelligence. Philadelphia, 2025, 118–126. [Google Scholar]
  • Huang W, Pan J, Tang J, et al. Ml-MAMBA: Efficient multi-modal large language model utilizing MAMBA-2. arXiv: https://arxiv.org/abs/2407.19832. [Google Scholar]
  • Dao T, Gu A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In: Proceedings of the 41st International Conference on Machine Learning (ICML). Vienna, 2024. [Google Scholar]
  • Thomas A, Massaroli S, Poli M. Liquid Edge Team. Convolutional multi-hybrids for edge devices. 2025. https://www.liquid.ai/research/convolutional-multi-hybrids-for-edge-devices. [Google Scholar]
  • Hou H, Zeng P, Ma F, et al. VisualRWKV: Exploring recurrent neural networks for visual language models. In: Proceedings of the 31st International Conference on Computational Linguistics. Abu Dhabi, 2025, 10423–10434. [Google Scholar]
  • Cai Y, Zhang J, He H, et al. LLaVA-KD: A framework of distilling multimodal large language models. In: Proceedings of the 2025 IEEE/CVF International Conference on Computer Vision (ICCV). Honolulu, 2025. [Google Scholar]
  • Yang A, Li A, Yang B, et al. Qwen3 technical report. arXiv: https://arxiv.org/abs/2505.09388. [Google Scholar]
  • Xu S, Li X, Yuan H, et al. LLaVADI: What matters for multimodal large language models distillation. arXiv: https://arxiv.org/abs/2407.19409. [Google Scholar]
  • Feng Q, Li W, Lin T, et al. Align-KD: Distilling cross-modal alignment knowledge for mobile vision-language large model enhancement. In: Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, 2025. [Google Scholar]
  • Cai K, Duan Z, Liu G, et al. Self-adapting large visual-language models to edge devices across visual modalities. In: Computer Vision—ECCV 2024. Lecture Notes in Computer Science. Cham: Springer, 2024. [Google Scholar]
  • Liao B, Tao H, Zhang Q, et al. Multimodal mamba: Decoder-only multimodal state space model via quadratic to linear distillation. arXiv: https://arxiv.org/abs/2502.13145. [Google Scholar]
  • Gerganov G. GGML. 2024. https://github.com/ggerganov/ggml. [Google Scholar]
  • Koska B, Horváth M. Towards multi-modal mastery: A 4.5B parameter truly multi-modal small language model. In: Proceedings of the 2024 2nd International Conference on Foundation and Large Language Models (FLLM). Dubai, 2024. [Google Scholar]
  • Chen Z, Wang W, Cao Y, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv: https://arxiv.org/abs/2412.05271. [Google Scholar]
  • Lin H, Bai H, Liu Z, et al. Mope-clip: Structured pruning for efficient vision-language models with module-wise pruning error metric. In: Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, 2024. [Google Scholar]
  • Wu Z, Chen J, Wang Y. Unified knowledge maintenance pruning and progressive recovery with weight recalling for large vision-language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. Philadelphia, 2025. [Google Scholar]
  • Liang Y, Wang Z, Xu X, et al. Efficientllava: Generalizable auto-pruning for large vision-language models. In: Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, 2025. [Google Scholar]
  • Zhang Z, Pan X, Wei H, et al. LOP: Learning optimal pruning for efficient on-demand MLLMs scaling. arXiv: https://arxiv.org/abs/2506.12826. [Google Scholar]
  • Huang Y, Thede L, Mancini M, et al. Investigating structural pruning and recovery techniques for compressing multimodal large language models: An empirical study. Pattern Recognition. In: DAGM GCPR 2025. Lecture Notes in Computer Science. Cham: Springer, 2026. [Google Scholar]
  • Xiao G, Tian Y, Chen B, et al. Efficient streaming language models with attention sinks. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, 2024. [Google Scholar]
  • Han I, Zhang Z, Wang Z, et al. CalibQuant: 1-Bit KV cache quantization for multimodal LLMs. In: Proceedings of the ICML 2025 Workshop on Long-Context Foundation Models. Vancouver, 2025. [Google Scholar]
  • Tillet P, Kung H-T, Cox D. Triton: An intermediate language and compiler for tiled neural network computations. In: Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. Phoenix, AZ, 2019, 10–19. [Google Scholar]
  • Leviathan Y, Kalman M, Matias Y. Fast inference from transformers via speculative decoding. In: Proceedings of the International Conference on Machine Learning. Honolulu, 2023. [Google Scholar]
  • Miao X, Oliaro G, Zhang Z, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. La Jolla, CA, 2024, 932–949. [Google Scholar]
  • Lin L, Lin Z, Zeng Z, et al. Speculative decoding reimagined for multimodal large language models. arXiv: https://arxiv.org/abs/2505.14260. [Google Scholar]
  • Lu X, Chen Y, Chen C, et al. BlueLM-V-3B: Algorithm and system co-design for multimodal large language models on mobile devices. In: Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, 2025. [Google Scholar]
  • Dong X, Liu T, Zeng Y, et al. HydraInfer: Hybrid disaggregated scheduling for multimodal large language model serving. arXiv: https://arxiv.org/abs/2505.12658. [Google Scholar]
  • Elhoushi M, Shrivastava A, Liskovich D, et al. Layerskip: Enabling early exit inference and self-speculative decoding. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, 2024, 923–938. [Google Scholar]
  • Devvrit F, Kudugunta S, Kusupati A, et al. Matformer: Nested transformer for elastic inference. In: Proceedings of the Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023), New Orleans, 2023. [Google Scholar]
  • Yue Y, Wang Y, Kang B, et al. Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. Vancouver, 2024. [Google Scholar]
  • Kwon W, Li Z, Zhuang S, et al. vLLM. 2023. https://github.com/vllm-project/vllm. [Google Scholar]
  • Zheng L, Yin L, Xie Z, et al. Sglang: Efficient execution of structured language model programs. In: Proceedings of the Annual Conference on Neural Information Processing Systems. Vancouver, 2024. [Google Scholar]
  • Gerganov G. llama.cpp. 2023. https://github.com/ggerganov/llama.cpp. [Google Scholar]
  • NVIDIA. TensorRT-LLM. 2025. https://nvda.org.cn/TensorRT-LLM/. [Google Scholar]
  • MLC Team. MLC LLM. 2025. https://github.com/mlc-ai/mlc-llm. [Google Scholar]
  • Lv C, Niu C, Gu R, et al. Walle: An end-to-end, general-purpose, and large-scale production system for device-cloud collaborative machine learning. In: Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA, 2022. [Google Scholar]
  • Shao Z, Yu Z, Yu J, et al. Imp: Highly capable large multimodal models for mobile devices. IEEE Trans Multimedia 2025; 27: 2961–2974.[Article] [Google Scholar]
  • Abid A, Abdalla A, Abid A, et al. Gradio: Hassle-free sharing and testing of ML models in the wild. arXiv: https://arxiv.org/abs/1906.02569. [Google Scholar]
  • Huang M, Shen A, Li K, et al. EdgeLLM: A highly efficient CPU-FPGA heterogeneous edge accelerator for large language models. IEEE Trans Circuits Syst I 2025; 72: 3352–3365.[Article] [Google Scholar]
  • Kim H, Ye G, Wang N, et al. Exploiting Intel advanced matrix extensions (AMX) for large language model inference. IEEE Comput Arch Lett 2024; 23: 117–120. [Article] [Google Scholar]
  • Zhu Y, Lu H. Edge-side NPU inference optimization: Adaptation research of multimodal large models on Qualcomm platforms. Intell Data Anal 2025; 30: 544–568. [Google Scholar]
  • Bai K, Ye L, Huang R, et al. EdgeMM: Multi-core CPU with heterogeneous AI-extension and activation-aware weight pruning for multimodal LLMs at edge. arXiv: https://arxiv.org/abs/2505.10782. [Google Scholar]
  • Khronos Group. OpenCL—The Open Standard for Parallel Programming of Heterogeneous Systems. 2025. https://www.khronos.org/opencl/. [Google Scholar]
  • Fu Z, Ren J, Liu Y, et al. Hyperion: A generic and distributed mobile offloading framework on OpenCL. In: Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems. Boston, MA, 2022, 607–621. [Google Scholar]
  • Jia F, Zhang D, Cao T, et al. CoDL: Efficient CPU-GPU co-execution for deep learning inference on mobile devices. In: Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services (MobiSys’22). Portland, 2022, 209–221. [Google Scholar]
  • Apple Inc. Apple unleashes M1. 2020. https://www.apple.com/newsroom/2020/11/apple-unleashes-m1/. [Google Scholar]
  • Shen Y, Wang ZC, Wang TY, et al. Hetero2Pipe: Pipelining multi-DNN inference on heterogeneous mobile processors under co-execution slowdown. In: Proceedings of the 2025 IEEE 45th International Conference on Distributed Computing Systems (ICDCS). Glasgow, 2025. [Google Scholar]
  • Qualcomm Team. Qualcomm Snapdragon 8s Gen 3. 2024. https://www.qualcomm.com/news/releases/2024/03/qualcomm-brings-the-best-of-on-device-ai-to-more-smartphones-wit. [Google Scholar]
  • Lin J, Tang J M, Tang H T, et al. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In: Proceedings of the Seventh Annual Conference on Machine Learning and Systems. Santa Clara, 2024, 87–100. [Google Scholar]
  • Xiao G, Lin J, Seznec M, et al. Smoothquant: Accurate and efficient post-training quantization for large language models. In: Proceedings of the International Conference on Machine Learning. Honolulu, 2023, 38087–38099. [Google Scholar]
  • Ma X Y, Fang G F, Wang X C. LLM-Pruner: On the structural pruning of large language models. In: Proceedings of the Annual Conference on Neural Information Processing Systems. New Orleans, LA, 2023. [Google Scholar]
  • Gokhale S, Das D, Patwari R, et al. KV Pareto: Systems-level optimization of KV cache and model compression for long context inference. In: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track). Rabat, 2026. [Google Scholar]
  • Rjoub G, Elmekki H, Islam S, et al. A hybrid swarm intelligence approach for optimizing multimodal large language models deployment in edge-cloud-based federated learning environments. Comput Commun 2025; 237: 108152.[Article] [Google Scholar]
  • Li Y, Gumaste D, Turkcan M K, et al. Distributed VLMs: Efficient vision-language processing through cloud-edge collaboration. In: Proceedings of the 2025 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops). Washington DC, 2025. [Google Scholar]
  • Lou S, Ge S, Yu J, et al. TinyVision: Distributed vision-language model with efficiency and privacy for edge deployment. In: Advanced Intelligent Computing Technology and Applications. ICIC 2025. Lecture Notes in Computer Science. Singapore: Springer, 2025. [Google Scholar]
  • Hu Y, Ye D, Kang J, et al. A cloud-edge collaborative architecture for multimodal LLM-based advanced driver assistance systems in IoT networks. IEEE Internet Things J 2025; 12: 13208–13221.[Article] [Google Scholar]
  • Yi B, Hu X, Chen Y, et al. EcoAgent: An efficient edge-cloud collaborative multi-agent framework for mobile automation. arXiv: https://arxiv.org/abs/2505.05440. [Google Scholar]
  • Wang G, Liu J, Li C, et al. Cloud-device collaborative learning for multimodal large language models. In: Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, 2024. [Google Scholar]
  • Xu W, Liu Y, He L, et al. Xmodel-vlm: A simple baseline for multimodal vision language model. arXiv: https://arxiv.org/abs/2405.09215. [Google Scholar]
  • Gao Z, Zhang B, Li P, et al. Multi-modal agent tuning: Building a VLM-driven agent for efficient tool usage. In: Proceedings of the Thirteenth International Conference on Learning Representations. Singapore, 2025. [Google Scholar]
  • Lv T, Huang Y, Chen J, et al. Kosmos-2.5: A multimodal literate model. arXiv: https://arxiv.org/abs/2309.11419. [Google Scholar]
  • Zhang C, Yang Z, Liu J, et al. AppAgent: Multimodal agents as smartphone users. In: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. Yokohama, 2025, 1–20. [Google Scholar]
  • Li Y, Zhang C, Yang W, et al. Appagent v2: Advanced agent for flexible mobile interactions. arXiv: https://arxiv.org/abs/2408.11824. [Google Scholar]
  • Wang J, Xu H, Ye J, et al. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception. In: Proceedings of the ICLR 2024 Workshop on Large Language Model (LLM) Agents. Vienna, 2024. [Google Scholar]
  • Song Z, Li Y, Fang M, et al. Mmac-copilot: Multi-modal agent collaboration operating system copilot. arXiv: https://arxiv.org/abs/2404.18074. [Google Scholar]
  • Yan Y, Jiang S, Cao T, et al. AVA: Towards agentic video analytics with vision language models. arXiv: https://arxiv.org/abs/2505.00254. [Google Scholar]
  • Li X, Ma Y, Chen Y, et al. Priority optimization for autonomous driving systems to meet end-to-end latency constraints. In: Proceedings of the 2024 IEEE Real-Time Systems Symposium (RTSS). York, 2024. [Google Scholar]
  • Gopalkrishnan A, Greer R, Trivedi M. Multi-frame, lightweight & efficient vision-language models for question answering in autonomous driving. arXiv: https://arxiv.org/abs/2403.19838. [Google Scholar]
  • Xu Z, Zhang Y, Xie E, et al. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. IEEE Robot Autom Lett 2024; 9: 8186–8193.[Article] [Google Scholar]
  • Xu Z, Bai Y, Zhang Y, et al. DriveGPT4-V2: Harnessing large language model capabilities for enhanced closed-loop autonomous driving. In: Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, 2025. [Google Scholar]
  • Zheng Y, Xing Z, Zhang Q, et al. Planagent: A multi-modal large language agent for closed-loop vehicle motion planning. arXiv: https://arxiv.org/abs/2406.01587. [Google Scholar]
  • Tong X, Ding P, Fan Y, et al. Quart-Online: Latency-free multimodal large language model for quadruped robot learning. In: Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA). Atlanta, GA, 2025. [Google Scholar]
  • Yan F, Liu F, Huang Y, et al. RoboTron-Mani: All-in-one multimodal large model for robotic manipulation. In: Proceedings of the 2025 IEEE/CVF International Conference on Computer Vision (ICCV). Honolulu, HI, 2025. [Google Scholar]
  • Liu J, Li C, Wang G, et al. Self-corrected multimodal large language model for end-to-end robot manipulation. arXiv: https://arxiv.org/abs/2405.17418. [Google Scholar]
  • Yang J, Tan R, Wu Q, et al. Magma: A foundation model for multimodal AI agents. In: Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, 2025. [Google Scholar]
  • Chen J, Liang H, Du L, et al. OWMM-Agent: Open world mobile manipulation with multi-modal agentic data synthesis. In: Proceedings of the RSS 2025 Workshop: Mobile Manipulation: Emerging Opportunities & Contemporary Challenges. Los Angeles, CA, 2025. [Google Scholar]
  • Luo G, Yang G, Gong Z, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces. arXiv: https://arxiv.org/abs/2506.00123. [Google Scholar]
  • Lin Z, Qu G, Chen Q, et al. Pushing large language models to the 6G edge: Vision, challenges, and opportunities. IEEE Commun Mag 2025; 63: 52–59.[Article] [Google Scholar]
  • Xu M, Niyato D, Kang J, et al. When large language model agents meet 6G networks: Perception, grounding, and alignment. IEEE Wireless Commun 2024; 31: 63–71.[Article] [Google Scholar]
  • Msuya H, Maiseli BJ. Deep learning model compression techniques: Advances, opportunities, and perspective. Tanzania J Eng Technol 2023; 42: 65–83.[Article] [Google Scholar]
  • Li D, Liu Y, Wu H, et al. Aria: An open multimodal native mixture-of-experts model. arXiv: https://arxiv.org/abs/2410.05993. [Google Scholar]
  • Huang R, Yu M, Tsoi M, et al. MMEdge: Accelerating on-device multimodal inference via pipelined sensing and encoding. arXiv: https://arxiv.org/abs/2510.25327. [Google Scholar]
  • Moradifirouzabadi A, Kang M. End-to-end acceleration of generative models with runtime regularized KV cache management. IEEE J Emerg Sel Top Circuits Syst 2025; 15: 217–230.[Article] [Google Scholar]
  • Lee J, Ha S. Empowering edge devices with processing-in-memory for on-device language inference. IEEE Embedded Syst Lett 2025; 17: 244–247.[Article] [Google Scholar]
  • Oliveira GF, Gomez-Luna J, Ghose S, et al. Accelerating neural network inference with processing-in-DRAM: From the edge to the cloud. IEEE Micro 2022; 42: 25–38.[Article] [Google Scholar]
  • Xu D, Zhang H, Yang L, et al. Fast on-device LLM inference with NPUs. In: Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. Rotterdam, 2025, 445–462. [Google Scholar]
  • Tuli S, Jha N K. EdgeTran: Co-designing transformers for efficient inference on mobile edge platforms. arXiv: https://arxiv.org/abs/2303.13745. [Google Scholar]
