(@aakriti1kumar.bsky.social)

Our paper: Decision-Point Guided Safe Policy Improvement. We show that a simple approach to learning safe RL policies can outperform most offline RL methods (+ theoretical guarantees!). How? Only allow the state-actions that have been seen enough times! 🤯 arxiv.org/abs/2410.09361
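A rough sketch of the "seen enough times" idea, for intuition only. This is not the paper's algorithm: the function name `count_filtered_policy`, the threshold `n_min`, the tabular `q_values` dictionary, and the choice to defer to the behavior policy when no action qualifies are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_filtered_policy(dataset, q_values, n_min):
    """Pick the best-looking action per state among well-observed actions.

    dataset  : list of (state, action) pairs from the behavior policy
    q_values : dict mapping (state, action) -> estimated value (assumed given)
    n_min    : hypothetical count threshold for "seen enough times"
    Returns a dict state -> action, or None where we defer to the behavior policy.
    """
    counts = Counter(dataset)

    # Group observed actions (with their counts) by state.
    by_state = defaultdict(list)
    for (state, action), n in counts.items():
        by_state[state].append((action, n))

    policy = {}
    for state, seen in by_state.items():
        # Only actions observed at least n_min times in this state are allowed.
        allowed = [action for action, n in seen if n >= n_min]
        if allowed:
            policy[state] = max(allowed, key=lambda a: q_values[(state, a)])
        else:
            # Not enough evidence for any action: keep the behavior policy here.
            policy[state] = None
    return policy


# Toy data: ("s0", "b") looks great but was seen only once, so it is not allowed.
dataset = [("s0", "a")] * 5 + [("s0", "b")] * 1 + [("s1", "a")] * 2
q_values = {("s0", "a"): 1.0, ("s0", "b"): 10.0, ("s1", "a"): 0.5}
policy = count_filtered_policy(dataset, q_values, n_min=3)
```

In this toy run the high-value but rarely seen action in `s0` is ignored, and `s1` falls back to the behavior policy because nothing there clears the threshold.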


Decision-Point Guided Safe Policy Improvement
Within batch reinforcement learning, safe policy improvement (SPI) seeks to ensure that the learnt policy performs at least as well as the behavior policy that generated the dataset. The core challeng...
https://arxiv.org/abs/2410.09361