Multimodal Large Language Models (MLLMs) have shown impressive capabilities in reasoning, yet come with substantial computational cost, limiting their deployment in resource-constrained environments. Despite some recent efforts on improving the efficiency of MLLMs, prior solutions yield models with a static accuracy-latency footprint, and thus fall short in responding to varying runtime conditions, including changing resource availability (e.g., contention due to other programs on the device). To bridge this gap, we introduce AdaLLaVA---an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We perform extensive experiments across multimodal benchmarks involving question answering, reasoning, and hallucination. Our results show that AdaLLaVA can adhere to the input latency budget and achieve varying accuracy-latency trade-offs at runtime.
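To make the idea of budget-conditioned execution concrete, below is a minimal sketch in PyTorch, assuming a hypothetical `BudgetScheduler` that reads the input tokens and a latency budget and gates which decoder blocks run. The class names, shapes, and hard-threshold gating rule are illustrative assumptions and do not reflect the released AdaLLaVA implementation.

```python
# Minimal sketch (not the released AdaLLaVA code): a scheduler network reads the
# input features and a latency budget, then gates which decoder blocks execute.
import torch
import torch.nn as nn

class BudgetScheduler(nn.Module):
    """Hypothetical scheduler: maps (pooled input features, budget) to per-block keep probabilities."""
    def __init__(self, hidden_dim: int, num_blocks: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + 1, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, num_blocks),
        )

    def forward(self, feats: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        pooled = feats.mean(dim=1)                      # (B, D) summary of the input tokens
        x = torch.cat([pooled, budget[:, None]], dim=-1)
        return torch.sigmoid(self.mlp(x))               # (B, num_blocks) execution probabilities

class AdaptiveDecoder(nn.Module):
    """Hypothetical budget-aware decoder: skips blocks whose keep probability is low."""
    def __init__(self, hidden_dim: int = 256, num_blocks: int = 8, num_heads: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
            for _ in range(num_blocks)
        )
        self.scheduler = BudgetScheduler(hidden_dim, num_blocks)

    def forward(self, tokens: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        keep = self.scheduler(tokens, budget)           # decide once, before decoding
        for i, block in enumerate(self.blocks):
            if keep[:, i].mean() > 0.5:                 # hard skip at inference time
                tokens = block(tokens)
        return tokens

# With a *trained* scheduler, a tighter budget (0.3) would execute fewer blocks
# than a loose one (0.9); the randomly initialized weights here only show the API.
tokens = torch.randn(1, 64, 256)                        # e.g. fused visual + text tokens
out_fast = AdaptiveDecoder()(tokens, torch.tensor([0.3]))
out_full = AdaptiveDecoder()(tokens, torch.tensor([0.9]))
```

In practice such a scheduler would be trained jointly with the backbone so that the selected subset of operations respects the given budget; the fixed 0.5 threshold above is only a stand-in for a learned, budget-aware selection rule.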
Our key contributions are threefold.
During inference, the model is given the input data and a latency budget.
Overview of AdaLLaVA:
AdaLLaVA achieves comparable performance with reduced computational requirements. AdaLLaVA is also compatible with token selection techniques.
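As a rough illustration of how token selection can be layered on top of latency-aware inference, the snippet below prunes visual tokens with a generic top-k rule before decoding. The L2-norm saliency proxy and the 576-token patch grid are assumptions for the sketch, not the specific token selection technique evaluated with AdaLLaVA.

```python
# Sketch of combining token selection with budget-aware decoding: generic top-k
# pruning of visual tokens by a simple saliency proxy (L2 norm).
import torch

def select_visual_tokens(visual_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top-k visual tokens ranked by token norm (a stand-in saliency score)."""
    B, N, D = visual_tokens.shape
    k = max(1, int(N * keep_ratio))
    scores = visual_tokens.norm(dim=-1)                 # (B, N) saliency proxy
    idx = scores.topk(k, dim=-1).indices                # indices of retained tokens
    return torch.gather(visual_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

visual = torch.randn(1, 576, 256)                       # e.g. a 24x24 grid of patch tokens
pruned = select_visual_tokens(visual, keep_ratio=0.5)   # half the tokens remain
print(pruned.shape)                                     # torch.Size([1, 288, 256])
```

The pruned token sequence can then be passed to a budget-aware decoder such as the sketch above, so that token-level and operation-level savings compose.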
Balance between latency and accuracy. AdaLLaVA exhibits strong adaptability to different latency budgets, effectively trading accuracy for speed during inference, particularly in extremely latency-constrained settings.
The latency token behaves differently across inputs. The key-query attention scores between the latency token and the input visual tokens vary with the text question, showing that our model dynamically adjusts its computational focus based on the query type.
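The snippet below sketches how such latency-token attention could be inspected: it computes scaled key-query scores between a latency token and the visual tokens of one image under two different question conditionings. All tensors, projection matrices, and dimensions are random placeholders rather than the model's actual weights or attention layout.

```python
# Sketch of inspecting how a latency token attends to visual tokens (hypothetical
# shapes and projections; the real layout follows the LLM backbone's attention).
import torch
import torch.nn.functional as F

def latency_to_visual_attention(latency_tok, visual_toks, W_q, W_k):
    """Softmax-normalized key-query scores of the latency token over visual tokens."""
    q = latency_tok @ W_q                               # (1, D) query from the latency token
    k = visual_toks @ W_k                               # (N, D) keys from the visual tokens
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)           # scaled dot-product scores, (1, N)
    return F.softmax(scores, dim=-1)

D = 256
W_q, W_k = torch.randn(D, D), torch.randn(D, D)
visual = torch.randn(576, D)                            # patch tokens of one image
# Different questions condition the latency token differently, so its attention
# over the same image's visual tokens shifts with the query.
latency_q1 = torch.randn(1, D)                          # latency token state under question 1
latency_q2 = torch.randn(1, D)                          # latency token state under question 2
attn1 = latency_to_visual_attention(latency_q1, visual, W_q, W_k)
attn2 = latency_to_visual_attention(latency_q2, visual, W_q, W_k)
print((attn1 - attn2).abs().mean())                     # attention patterns differ across questions
```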
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and open-source projects, including Alpaca and Vicuna.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.
Related Links: [CLIP] [LLaVA] [Instruction Tuning with GPT-4]