CLIP Models Generalize Less Than Compositional Benchmarks Suggest

Abstract

Compositional benchmarks may conflate generalisation to novel bindings with memorisation of bindings already seen during alignment training. In a synthetic study with fully-seen, partially-unseen, and fully-unseen binding splits, accuracy drops monotonically as bindings become less seen, across nine CLIP backbones. On ARO VG-A, positive captions overlap COCO bindings nearly twice as often as their attribute-swapped negatives (79.8% vs. 41.8%), and only 1.2% of samples have no COCO-overlapping bindings. Restricting evaluation to shortcut-free splits changes model rankings on ARO VG-A, and the accuracy drops broadly replicate on BiVLC and VisMin. Reported gains therefore likely overstate how much CLIP has learned to bind.
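
The split logic described in the abstract can be illustrated with a minimal sketch. It assumes a binding is an (attribute, object) pair and that a set of bindings seen during alignment training is available. The names `split_label` and `overlap_rate`, and the "at least one seen binding" definition of overlap, are illustrative assumptions rather than the paper's actual procedure.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: a binding is an (attribute, object) pair, e.g. ("red", "car").
Binding = Tuple[str, str]


def split_label(bindings: Iterable[Binding], seen: Set[Binding]) -> str:
    """Assign a caption to a split by how many of its bindings were seen."""
    items: List[Binding] = list(bindings)
    if not items:
        raise ValueError("caption has no bindings to classify")
    hits = sum(b in seen for b in items)
    if hits == len(items):
        return "fully-seen"
    if hits == 0:
        return "fully-unseen"
    return "partially-unseen"


def overlap_rate(captions: Iterable[Iterable[Binding]], seen: Set[Binding]) -> float:
    """Fraction of captions with at least one binding in the seen set.

    Assumption: "overlap" means at least one shared binding; the paper
    may define it differently.
    """
    caps = [list(c) for c in captions]
    if not caps:
        return 0.0
    return sum(any(b in seen for b in c) for c in caps) / len(caps)


# Toy data, not drawn from any benchmark.
seen_bindings = {("red", "car"), ("small", "dog")}
positives = [[("red", "car")], [("small", "dog"), ("blue", "sky")]]
negatives = [[("blue", "car")], [("small", "sky"), ("blue", "dog")]]

positive_rate = overlap_rate(positives, seen_bindings)
negative_rate = overlap_rate(negatives, seen_bindings)
```

A gap between `positive_rate` and `negative_rate` is the kind of shortcut the abstract describes: a model could prefer the positive caption simply because its bindings look familiar. Under this sketch, a shortcut-free evaluation would keep only samples whose `split_label` is "fully-unseen".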

Publication
ICML 2026 Workshop on Compositional Learning: Safety, Interpretability, and Agents (CompLearn)