Xu, Yang, Zhao, Zhang, Chen, Ma, Hou, Wu, Li, Hu, Guan, Li, Po: From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
https://arxiv.org/abs/2511.07738 https://arxiv.org/pdf/2511.07738 https://arxiv.org/html/2511.07738