Discover how AVATAR cleverly disguises harmful intents in language models.
Yu Yan, Sheng Sun, Junqi Tong
― 6 min read