𝗦𝗵𝗼𝘄𝗨𝗜: 𝗮 𝘀𝗺𝗮𝗹𝗹 𝗲𝗻𝗱-𝘁𝗼-𝗲𝗻𝗱 𝗮𝗴𝗲𝗻𝘁 𝘁𝗵𝗮𝘁 𝗰𝗮𝗻 𝗻𝗮𝘃𝗶𝗴𝗮𝘁𝗲 𝗮𝗻𝘆 𝗨𝗜 📲 and beats much larger VLMs!
New paper by NUS & Microsoft: an agent that acts on any UI (desktop, Android, web) from the screenshot alone, without needing additional text information.
One great idea: group image patches that belong to the same GUI region, to speed up and simplify processing.
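To make the patch-grouping idea concrete, here is a rough sketch, not the paper's actual implementation. It assumes each patch is summarized by its mean RGB color and that neighboring patches with near-identical colors belong to the same GUI region. The names `group_patches` and `tol` are made up for illustration.

```python
from typing import List, Tuple

Patch = Tuple[int, int, int]  # mean RGB of one patch (assumed representation)


def group_patches(grid: List[List[Patch]], tol: int = 0) -> List[List[int]]:
    """Give each patch a group id. 4-connected neighbours whose colours
    differ by at most `tol` per channel end up in the same group."""
    rows, cols = len(grid), len(grid[0])
    parent = list(range(rows * cols))  # union-find forest over patch indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    def similar(p: Patch, q: Patch) -> bool:
        return all(abs(x - y) <= tol for x, y in zip(p, q))

    for r in range(rows):
        for c in range(cols):
            # Compare with the right and bottom neighbours only;
            # that covers every adjacent pair exactly once.
            if c + 1 < cols and similar(grid[r][c], grid[r][c + 1]):
                union(r * cols + c, r * cols + c + 1)
            if r + 1 < rows and similar(grid[r][c], grid[r + 1][c]):
                union(r * cols + c, (r + 1) * cols + c)

    # Relabel roots as consecutive ids 0, 1, 2, ... in raster order.
    ids: dict = {}
    labels = []
    for r in range(rows):
        labels.append(
            [ids.setdefault(find(r * cols + c), len(ids)) for c in range(cols)]
        )
    return labels


# Toy "screenshot": white background (W) with a blue button (B).
W, B = (255, 255, 255), (0, 0, 255)
labels = group_patches([[W, W, B],
                        [W, B, B]])
num_groups = len({g for row in labels for g in row})
```

In this toy case, 6 patches collapse into 2 groups. A model could then keep only a few patches per group instead of all of them, which is where the speedup would come from. With `tol > 0`, groups can chain through gradual color changes, so keep the tolerance small.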