Examination of jailbreak attacks shows weaknesses in language model safety.
― 5 min read
A new framework assesses the effectiveness of image safety classifiers against harmful content.
― 10 min read