Wouldn’t any awareness of the rules, or the ability to follow them, inherently mean the model also knows how to break them? How can a language model avoid saying bigoted things if it can’t recognize bigoted speech in the first place, given that recognizing it implies the capacity to reproduce it?