A survey on edge multimodal large models: Compression, inference acceleration, and applications

Open Access

Review

Table 2

Comparative summary of MLLM compression techniques^a)

Sec.	Technique	Core idea	Reduction type	Typical gain	Hardware
Modality Enc.	Visual Token Comp.	Resample / merge visual tokens before LLM	Token ↓	4x–16x tokens	GPU
	Token Pruning	Remove redundant tokens via saliency/attention	Token ↓	3x–18x tokens; 1.2x–1.8x latency	GPU
	Sparse Attention	Restrict attention patterns (local/global)	Complexity ↓	Moderate speedup	Custom kernels preferred
	Audio Modality	Downsample or compress temporal frames	Token ↓	2x–10x tokens	GPU
Cross-Modal Conn.	Linear Projection	Linear mapping for modality alignment	Token/Dim	~1x–4x	Universal
	Query-based	Bottleneck queries (e.g., Q-Former)	Token ↓	8x–20x tokens	Universal
	Other Connectors	Pooling/routing variants	Token/Dim ↓	2x–10x	Universal
LLM Backbone	Small LMs	Replace backbone with smaller LLM	Params ↓	2x–10x params	Edge/GPU
	MoE	Activate subset of experts per token	Compute ↓	Throughput ↑	High bandwidth
	Non-typical	RNN / state-space architectures	Complexity ↓	Long-seq speedup	Kernel support helpful
General Opt.	Distillation	Transfer knowledge to smaller model	Params ↓	Variable	Universal
	Quantization	Low-bit weights (INT8/4, NF4)	Memory ↓	2x–4x memory	HW-dependent
	Pruning	Remove weights / channels	Params ↓	≤2x	Sparse kernels

a) (1) “Reduction type” distinguishes token, parameter, memory, or computational complexity reduction. (2) Sparse attention, MoE, and state-space models are efficiency-oriented rather than strict compression. (3) Reported gains are component-level; end-to-end latency depends on system implementation.
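As a rough illustration of how the "Typical gain" column and note (3) can be read, the sketch below works through the arithmetic. It is not taken from the survey: the function names are hypothetical, the 16-bit baseline for quantization is an assumption (the table does not state one), the 576-to-32 token example is only an illustrative configuration, and Amdahl's law is used here as a simple stand-in for "end-to-end latency depends on system implementation".

```python
def memory_ratio(baseline_bits: int, quantized_bits: int) -> float:
    """Weight-memory reduction factor for a lower bit width.

    Ignores quantization metadata (scales, zero points), so real
    savings are somewhat smaller than this ideal ratio.
    """
    return baseline_bits / quantized_bits


def token_ratio(tokens_in: int, tokens_out: int) -> float:
    """Token reduction factor, e.g. for a query-based connector."""
    return tokens_in / tokens_out


def end_to_end_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: overall speedup when only `fraction` of the
    runtime benefits from a component-level speedup."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


if __name__ == "__main__":
    # Assumed 16-bit baseline: INT8 and INT4 match the 2x-4x memory range.
    print(memory_ratio(16, 8))   # 2.0
    print(memory_ratio(16, 4))   # 4.0

    # Illustrative: 576 visual tokens compressed to 32 query tokens.
    print(token_ratio(576, 32))  # 18.0

    # A 4x component speedup on half of the runtime yields only 1.6x overall.
    print(round(end_to_end_speedup(0.5, 4.0), 2))  # 1.6
```

The last call shows why component-level gains should not be read as end-to-end gains: the unaccelerated part of the pipeline bounds the overall speedup.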
