𝚐𝔪𝟾𝚡𝚡𝟾 (@gm8xx8.bsky.social)

High-Resolution Image Synthesis via Next-Token Prediction
D-JEPA⋅T2I extends D-JEPA to high-resolution text-to-image generation, using a flow matching loss, multimodal transformers, and Visual Rotary Positional Embedding for efficient learning, and achieves state-of-the-art results with next-token prediction.

High-Resolution Image Synthesis via Next-Token Prediction
Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), an autoregressive model, has demonstrated outstanding performance in class-conditional image generation. However, the application of ...
https://arxiv.org/abs/2411.14808

about 1 year ago