While playing with Llama on the (publicly available) IMDB review dataset, I noticed a "Safe LLM Streisand effect": the reviews the model refused to classify do contain some truly disturbing shit, but the refusals now make them stand out from the other 50,000 reviews. How's that for "safety"?
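For anyone curious what that looks like in practice, here's a minimal sketch of that kind of experiment (the model name, prompt, and keyword-based refusal check are my own assumptions, not necessarily the setup described above):

```python
# Sketch: ask a Llama-style chat model to label IMDB reviews and flag the
# ones it refuses to classify. Model choice and refusal heuristic are
# illustrative assumptions.
from datasets import load_dataset
from transformers import pipeline

ds = load_dataset("imdb", split="train")  # IMDB has 50k labeled reviews in total

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed model; any chat Llama works
    device_map="auto",
)

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")  # crude heuristic

refused = []
for i, row in enumerate(ds.select(range(200))):  # small slice for illustration
    prompt = (
        "Classify the sentiment of this movie review as positive or negative.\n\n"
        f"Review: {row['text'][:2000]}\n\nAnswer:"
    )
    out = generator(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
    answer = out[len(prompt):].strip().lower()
    if any(m in answer for m in REFUSAL_MARKERS):
        refused.append(i)  # the refusals are exactly what makes these reviews stand out

print(f"Refused {len(refused)} of 200 reviews:", refused)
```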