Enhancing Multi-Image Understanding through Delimiter Token Scaling

Abstract

Large Vision-Language Models achieve strong performance on single-image tasks, but their performance declines when the input contains multiple images, owing to cross-image information leakage. We propose scaling the hidden states of delimiter tokens to reinforce intra-image interactions and limit cross-image interactions. The method improves performance on multi-image benchmarks, requires no additional training, and adds no inference cost.
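
The core operation described in the abstract, multiplying the hidden states at delimiter positions by a constant, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say which tokens count as delimiters, which layers are affected, or what scale factor is used, so the function name `scale_delimiter_states`, the token ids `IMG_START` and `IMG_END`, and the factor `2.0` are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Set


def scale_delimiter_states(
    hidden_states: Sequence[Sequence[float]],
    token_ids: Sequence[int],
    delimiter_ids: Set[int],
    scale: float,
) -> List[List[float]]:
    """Return a copy of hidden_states with delimiter rows multiplied by scale.

    hidden_states[i] is the hidden vector of the token token_ids[i].
    Rows of non-delimiter tokens are copied unchanged.
    """
    if len(hidden_states) != len(token_ids):
        raise ValueError("hidden_states and token_ids must have the same length")
    return [
        [scale * value for value in row] if token in delimiter_ids else list(row)
        for row, token in zip(hidden_states, token_ids)
    ]


# Toy sequence: two images, each wrapped in delimiter tokens, then one text token.
# The ids below are hypothetical placeholders, not real vocabulary entries.
IMG_START, IMG_END = 1000, 1001
token_ids = [IMG_START, 5, 6, IMG_END, IMG_START, 7, IMG_END, 42]
hidden = [[1.0, -2.0] for _ in token_ids]

scaled = scale_delimiter_states(hidden, token_ids, {IMG_START, IMG_END}, scale=2.0)
```

In this toy example only the four delimiter positions are rescaled, and the input list is left unmodified. In a real model the same elementwise multiplication would presumably be applied to the delimiter positions of a hidden-state tensor, which is consistent with the abstract's claim that the method needs no extra training.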

Publication
International Conference on Learning Representations 2026