AdaLLaVA

Learning to Inference Adaptively for Multimodal
Large Language Models

arXiv 2024
*Equal Contribution
University of Wisconsin-Madison · Purdue University · The University of Hong Kong

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in reasoning, yet they come with substantial computational cost, limiting their deployment in resource-constrained environments. Despite recent efforts to improve the efficiency of MLLMs, prior solutions yield models with a static accuracy and latency footprint, and thus fall short in responding to varying runtime conditions, including changing resource availability (e.g., contention due to other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We perform extensive experiments across multimodal benchmarks involving question answering, reasoning, and hallucination. Our results show that AdaLLaVA adheres to the input latency budget and achieves varying accuracy and latency trade-offs at runtime.

Our key contributions are threefold.

  1. We present AdaLLaVA, a novel adaptive inference framework for MLLMs. Our method, for the first time, enables dynamic model execution based on a latency budget and input content at inference time.
  2. We design a latency-aware scheduler that reconfigures a base MLLM at inference time, along with a probabilistic modeling approach that incorporates hard latency constraints during MLLM training.
  3. Through extensive experiments, we show that AdaLLaVA can adapt to a range of latency requirements while preserving the performance of the base model, and that AdaLLaVA can be integrated with token selection techniques to further enhance efficiency.

Adaptive inference based on latency budget and content

During inference, the model is given:

  1. Content. An image-query pair.
  2. Latency constraint. The total latency budget the model should meet (e.g., measured in FLOPs or wall-clock time).

AdaLLaVA learns to generate appropriate responses while adapting to varying computational budgets.
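
For intuition, here is a minimal sketch, assuming a simple MLP encoder, of how such a scalar budget could be turned into a "latency token" that the model conditions on (the actual encoder is described in the overview below). The class name `LatencyEncoder` and all dimensions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class LatencyEncoder(nn.Module):
    """Toy latency encoder: maps a scalar budget in [0, 1] to a token embedding."""
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, budget: torch.Tensor) -> torch.Tensor:
        # budget: (batch,) with 1.0 = full compute, smaller = tighter budget
        return self.mlp(budget.unsqueeze(-1))          # (batch, hidden_dim)

# Append the latency token to the (visual + text) token sequence.
tokens = torch.randn(1, 600, 4096)                     # placeholder multimodal tokens
budget = torch.tensor([0.6])                           # 60% of the full latency budget
latency_token = LatencyEncoder()(budget)               # (1, 4096)
sequence = torch.cat([tokens, latency_token.unsqueeze(1)], dim=1)   # (1, 601, 4096)
```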

AdaLLaVA: Adaptive Multimodal Large Language Models

Overview of AdaLLaVA:

  1. (a) Learning-based latency encoder and scheduler. The encoder embeds the latency budget into an additional latency token. This token's embedding is extracted from specific intermediate layers and fed to the scheduler, which determines execution plans for components in subsequent layers. These plans can control either complete layers or specific subsets within layers.
  2. (b) Within each layer, our design focuses on two primary components: attention heads and MLP neurons, specifically their activation values. Control over MLP neurons can be achieved by using a subset of the weight matrix (see the sketch after this list).
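
A minimal sketch of this gating idea, under stated assumptions: the scheduler reads the latency token's embedding and outputs binary masks over attention heads and MLP neurons. The class name `LatencyAwareScheduler`, the layer sizes, and the hard top-k rule are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LatencyAwareScheduler(nn.Module):
    """Toy scheduler: turns the latency token's embedding into binary gates
    over attention heads and MLP neurons for the remaining layers."""
    def __init__(self, hidden_dim: int = 4096, num_heads: int = 32, mlp_dim: int = 11008):
        super().__init__()
        self.head_logits = nn.Linear(hidden_dim, num_heads)
        self.neuron_logits = nn.Linear(hidden_dim, mlp_dim)

    def forward(self, latency_emb: torch.Tensor, budget: float):
        # Enforce the budget here with a hard top-k (keep roughly a `budget`
        # fraction of units); training would use a probabilistic relaxation.
        head_scores = self.head_logits(latency_emb)
        neuron_scores = self.neuron_logits(latency_emb)
        k_heads = max(1, int(budget * head_scores.shape[-1]))
        k_neurons = max(1, int(budget * neuron_scores.shape[-1]))
        head_mask = torch.zeros_like(head_scores).scatter(
            -1, head_scores.topk(k_heads, dim=-1).indices, 1.0)
        neuron_mask = torch.zeros_like(neuron_scores).scatter(
            -1, neuron_scores.topk(k_neurons, dim=-1).indices, 1.0)
        # Multiply these masks into head outputs / MLP activations downstream.
        return head_mask, neuron_mask

latency_emb = torch.randn(1, 4096)          # latency token embedding from a middle layer
head_mask, neuron_mask = LatencyAwareScheduler()(latency_emb, budget=0.5)
```

Keeping only a subset of MLP neurons is equivalent to multiplying with a subset of the weight matrix, which is why the masks above can translate directly into reduced computation.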

Performance

AdaLLaVA achieves performance comparable to the base model with reduced computational requirements. It is also compatible with token selection techniques.

Latency Awareness

AdaLLaVA balances latency and accuracy: it adapts effectively to different latency budgets, trading accuracy for speed during inference, particularly in extremely latency-constrained settings.

Content Awareness

The latency token behaves differently given different images: the key-query attention scores between the latency token and the input visual tokens change with the text question. Our model dynamically adjusts its computational focus based on the query type.
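
As a toy illustration of this analysis, the snippet below computes the attention scores between the latency token's query and the visual tokens' keys and reshapes them into a patch-grid heatmap. All tensors are random stand-ins for one attention head's projections, and the 24x24 grid size is an assumption.

```python
import torch

# Inspect content awareness: how strongly does the latency token attend
# to each visual token?
hidden_dim = 128
visual_keys = torch.randn(576, hidden_dim)       # e.g., a 24x24 grid of image patches
latency_query = torch.randn(1, hidden_dim)       # query vector of the latency token

attn = torch.softmax(latency_query @ visual_keys.T / hidden_dim ** 0.5, dim=-1)
heatmap = attn.view(24, 24)                      # reshape to the patch grid for display
```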

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and the open-source projects Alpaca and Vicuna.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Related Links: [CLIP] [LLaVA] [Instruction Tuning with GPT-4]