
Compositional benchmarks may conflate generalisation to novel bindings with memorisation of bindings already seen during alignment training. In a synthetic study with fully-seen, partially-unseen, and fully-unseen binding splits, accuracy drops monotonically from the fully-seen to the fully-unseen split across nine CLIP backbones. On ARO VG-A, positive captions overlap COCO bindings nearly twice as often as their attribute-swapped negatives (79.8% vs. 41.8%), and only 1.2% of samples have no COCO-overlapping bindings. Restricting evaluation to shortcut-free splits reorders model rankings on ARO VG-A, and the resulting drops broadly replicate on BiVLC and VisMin. Reported gains therefore likely overstate how much CLIP has learned to bind.
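As an illustration of the split construction described above, the following minimal sketch assigns a sample to a fully-seen, partially-unseen, or fully-unseen split and computes an overlap rate against a reference set of bindings. It is a sketch under stated assumptions, not the procedure used in the study: bindings are assumed to be already extracted as (attribute, object) pairs, a caption is assumed to "overlap" when at least one of its bindings occurs in the reference set, and the names `Binding`, `split_label`, and `overlap_rate` are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple

# Hypothetical representation: one binding is an (attribute, object) pair,
# e.g. ("red", "car"). How bindings are extracted is out of scope here.
Binding = Tuple[str, str]


def split_label(sample_bindings: Iterable[Binding], seen: Set[Binding]) -> str:
    """Assign a sample to a split by how many of its bindings are in `seen`."""
    bindings = set(sample_bindings)
    if not bindings:
        raise ValueError("sample has no bindings")
    n_seen = sum(1 for b in bindings if b in seen)
    if n_seen == len(bindings):
        return "fully-seen"
    if n_seen == 0:
        return "fully-unseen"
    return "partially-unseen"


def overlap_rate(captions: Sequence[Iterable[Binding]], seen: Set[Binding]) -> float:
    """Fraction of captions with at least one binding in `seen`.

    Assumption: "overlap" means any shared binding; the source does not
    spell out its exact criterion.
    """
    if not captions:
        raise ValueError("no captions given")
    hits = sum(1 for caption in captions if any(b in seen for b in caption))
    return hits / len(captions)


if __name__ == "__main__":
    # Toy reference set standing in for bindings seen during training.
    seen = {("red", "car"), ("blue", "sky")}
    print(split_label([("red", "car"), ("blue", "sky")], seen))
    print(split_label([("red", "car"), ("green", "sky")], seen))
    print(split_label([("green", "car")], seen))
    print(overlap_rate([[("red", "car")], [("green", "car")]], seen))
```

Computing `overlap_rate` separately for positive captions and for their attribute-swapped negatives gives the kind of comparison reported for ARO VG-A, and keeping only samples labelled `fully-unseen` corresponds to a shortcut-free evaluation in this simplified setting.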