NVILA is a vision-language model (VLM) that extends VILA by first scaling up spatial and temporal resolution and then compressing the visual tokens, which lets it process high-resolution images and long videos efficiently. It cuts training costs by 4.5×, reduces memory use and latency, and outperforms top VLMs on benchmarks. Code and models will be released.
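To make the "scale, then compress" idea concrete, here is a minimal Python sketch. It is not NVILA's implementation: the patch size of 14, the 2×2 pooling window, and plain averaging as the compression step are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def token_counts(height, width, patch=14, pool=2):
    """Return (tokens before compression, tokens after compression).

    patch and pool are illustrative assumptions, not values from the source.
    """
    rows, cols = height // patch, width // patch
    return rows * cols, (rows // pool) * (cols // pool)


def average_pool_tokens(grid, pool=2):
    """Merge each pool x pool block of token vectors into one by averaging.

    grid is a rows x cols nested list of equal-length feature vectors.
    Averaging stands in for whatever compression a real model uses.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % pool or cols % pool:
        raise ValueError("grid dimensions must be divisible by pool")
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, rows, pool):
        pooled_row = []
        for c in range(0, cols, pool):
            # Collect the pool x pool neighbourhood of token vectors.
            block = [grid[r + i][c + j] for i in range(pool) for j in range(pool)]
            # Average the block component-wise into a single token.
            pooled_row.append(
                [sum(vec[k] for vec in block) / len(block) for k in range(dim)]
            )
        pooled.append(pooled_row)
    return pooled


if __name__ == "__main__":
    print(token_counts(448, 448))  # low-resolution input
    print(token_counts(896, 896))  # resolution doubled per side, then compressed
```

Under these assumptions, doubling the resolution per side and then pooling 2×2 leaves the compressed token count equal to the uncompressed count at the lower resolution, which is the kind of trade the summary describes.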