Humans learn from one continuous visual stream, but large video models have to be trained on billions of web videos.
We found that learning from such sequential streams is challenging for video models—and we introduce a family of "orthogonal optimizers" to bridge the gap!
6 months ago