When we prompt a VLM with an image and text, we don't know to what extent the response depends on the image vs the text. These are clear illustrations of this concept. This is important to consider when looking at benchmarks for these models: quite often the answer comes simply from the text.
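One rough way to check this on a benchmark is to score the same model twice, once with the image and once with the image withheld, and look at the gap. The sketch below assumes a hypothetical `model(image, question)` callable that accepts `None` as the image and returns an answer string, and it uses simple exact-match scoring; neither is any particular library's API.

```python
from typing import Callable, Optional, Sequence, Tuple

# One benchmark item: (image, question, gold_answer).
# The image can be any object the model accepts.
Example = Tuple[object, str, str]

# Hypothetical model interface (an assumption for illustration):
# takes an optional image and a question, returns an answer string.
Model = Callable[[Optional[object], str], str]


def accuracy(model: Model, examples: Sequence[Example], use_image: bool) -> float:
    """Exact-match accuracy, passing None instead of the image if use_image is False."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for image, question, gold in examples:
        prediction = model(image if use_image else None, question)
        # Case-insensitive exact match; real benchmarks may need looser scoring.
        if prediction.strip().lower() == gold.strip().lower():
            correct += 1
    return correct / len(examples)


def image_dependence(model: Model, examples: Sequence[Example]) -> Tuple[float, float, float]:
    """Return (accuracy with image, text-only accuracy, gap between the two)."""
    with_image = accuracy(model, examples, use_image=True)
    text_only = accuracy(model, examples, use_image=False)
    return with_image, text_only, with_image - text_only
```

A small gap suggests that the text alone carries most of the answers on that benchmark; a large gap suggests the image is actually being used.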
about 1 year ago