Open Access
Review

Table 2

Comparative summary of MLLM compression techniques a)

| Sec. | Technique | Core idea | Reduction type | Typical gain | Hardware |
|---|---|---|---|---|---|
| Modality Enc. | Visual Token Comp. | Resample / merge visual tokens before LLM | Token ↓ | 4x–16x tokens | GPU |
| | Token Pruning | Remove redundant tokens via saliency/attention | Token ↓ | 3x–18x tokens; 1.2x–1.8x latency | GPU |
| | Sparse Attention | Restrict attention patterns (local/global) | Complexity ↓ | Moderate speedup | Custom kernels preferred |
| | Audio Modality | Downsample or compress temporal frames | Token ↓ | 2x–10x tokens | GPU |
| Cross-Modal Conn. | Linear Projection | Linear mapping for modality alignment | Token/Dim | ~1x–4x | Universal |
| | Query-based | Bottleneck queries (e.g., Q-Former) | Token ↓ | 8x–20x tokens | Universal |
| | Other Connectors | Pooling/routing variants | Token/Dim ↓ | 2x–10x | Universal |
| LLM Backbone | Small LMs | Replace backbone with smaller LLM | Params ↓ | 2x–10x params | Edge/GPU |
| | MoE | Activate subset of experts per token | Compute ↓ | Throughput ↑ | High bandwidth |
| | Non-typical | RNN / state-space architectures | Complexity ↓ | Long-seq speedup | Kernel support helpful |
| General Opt. | Distillation | Transfer knowledge to smaller model | Params ↓ | Variable | Universal |
| | Quantization | Low-bit weights (INT8/4, NF4) | Memory ↓ | 2x–4x memory | HW-dependent |
| | Pruning | Remove weights / channels | Params ↓ | ≤2x | Sparse kernels |

a) (1) “Reduction type” distinguishes reductions in token count, parameter count, memory, or computational complexity. (2) Sparse attention, MoE, and state-space models are efficiency-oriented techniques rather than strict compression. (3) Reported gains are component-level; end-to-end latency depends on the system implementation.
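To make the "Typical gain" column concrete, the following is a minimal sketch of how two of the ratios can arise. It is an illustration under stated assumptions, not a method from any surveyed work: `merge_tokens` and `weight_memory_bytes` are hypothetical helpers, the 576-token input and the 7-billion-parameter model are made-up example sizes, and an FP16 baseline is assumed for the memory comparison. Merging is done by plain average pooling of consecutive tokens, which is only one of many possible merge rules.

```python
def merge_tokens(tokens, group):
    """Average every `group` consecutive token vectors into one.

    Hypothetical helper: a simple stand-in for visual token merging.
    """
    if group < 1:
        raise ValueError("group must be >= 1")
    merged = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        # Element-wise mean over the vectors in this chunk.
        merged.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return merged


def weight_memory_bytes(num_params, bits_per_weight):
    """Weight storage only; ignores scales, zero points, activations, and KV cache."""
    return num_params * bits_per_weight / 8


# Assumed example sizes (not taken from the table).
tokens = [[float(i)] * 4 for i in range(576)]   # 576 visual tokens, dimension 4
merged = merge_tokens(tokens, group=4)
token_reduction = len(tokens) / len(merged)      # 576 -> 144 tokens, i.e. 4x

num_params = 7_000_000_000                       # assumed 7B-parameter backbone
fp16 = weight_memory_bytes(num_params, 16)
int8 = weight_memory_bytes(num_params, 8)
int4 = weight_memory_bytes(num_params, 4)
memory_gain_int8 = fp16 / int8                   # 2x
memory_gain_int4 = fp16 / int4                   # 4x
```

Under these assumptions the sketch reproduces the low end of the 4x–16x token range and the 2x–4x memory range. In line with note (3), these are component-level ratios and say nothing about end-to-end latency.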
