CLIP Models Generalize Less Than Compositional Benchmarks Suggest

Shuman Peng, Arnas Uselis, Darina Koishigarina, Martin Ester, Seong Joon Oh

Abstract

Compositional benchmarks may conflate generalisation to novel bindings with memorisation of bindings already seen during alignment training. A synthetic study with fully-seen, partially-unseen, and fully-unseen binding splits shows that accuracy drops monotonically from the fully-seen to the fully-unseen split across nine CLIP backbones. On ARO VG-A, positive captions overlap COCO bindings nearly twice as often as their attribute-swapped negatives (79.8% vs. 41.8%), and only 1.2% of samples have no COCO-overlapping bindings. Restricting evaluation to shortcut-free splits flips model rankings on ARO VG-A, and the accuracy drops broadly replicate on BiVLC and VisMin. Reported gains therefore likely overstate how much CLIP has learned to bind.
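The split definitions and the overlap statistic can be illustrated with a minimal sketch. It is not the authors' code. It assumes a binding is an (attribute, object) pair, that a caption "overlaps" the training data when at least one of its bindings appears in a reference set, and that the names `split_of` and `overlap_rate` and the toy data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: a binding is an (attribute, object) pair, e.g. ("red", "cube").
Binding = Tuple[str, str]


def split_of(sample_bindings: Iterable[Binding], seen: Set[Binding]) -> str:
    """Assign a sample to a split by how many of its bindings were seen."""
    bindings: List[Binding] = list(sample_bindings)
    if not bindings:
        raise ValueError("sample has no bindings")
    n_seen = sum(b in seen for b in bindings)
    if n_seen == len(bindings):
        return "fully-seen"
    if n_seen == 0:
        return "fully-unseen"
    return "partially-unseen"


def overlap_rate(captions: Iterable[Iterable[Binding]], seen: Set[Binding]) -> float:
    """Fraction of captions with at least one binding in the reference set."""
    caption_list = [list(c) for c in captions]
    if not caption_list:
        raise ValueError("no captions given")
    hits = sum(any(b in seen for b in c) for c in caption_list)
    return hits / len(caption_list)


# Toy reference set standing in for bindings extracted from training captions.
seen = {("red", "cube"), ("blue", "sphere")}

# Toy positive captions and their attribute-swapped negatives.
positives = [[("red", "cube")], [("blue", "sphere")], [("green", "cone")]]
negatives = [[("blue", "cube")], [("red", "sphere")], [("green", "cube")]]

pos_rate = overlap_rate(positives, seen)
neg_rate = overlap_rate(negatives, seen)
```

In this toy setup the positives overlap the reference set more often than the negatives, which is the kind of asymmetry a model could exploit without binding attributes to objects. Evaluating only on samples for which `split_of` returns "fully-unseen" corresponds, under these assumptions, to a shortcut-free split.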

Publication

ICML 2026 Workshop on Compositional Learning: Safety, Interpretability, and Agents (CompLearn)
