Do-Not-Answer: Evaluating Safeguards in LLMs
Document Type
Conference Proceeding
Publication Title
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2024
Abstract
With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. This requires developers to identify potential risks through the evaluation of “dangerous capabilities” in order to responsibly deploy LLMs. Here we aim to facilitate this process. In particular, we collect an open-source dataset to evaluate the safeguards in LLMs, to facilitate the deployment of safer open-source LLMs at a low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We assess the responses of six popular LLMs to these instructions, and we find that simple BERT-style classifiers can achieve results that are comparable to GPT-4 on automatic safety evaluation. Warning: This paper contains examples that may be offensive, harmful, or biased.
First Page
896
Last Page
911
Publication Date
1-1-2024
Recommended Citation
Y. Wang et al., "Do-Not-Answer: Evaluating Safeguards in LLMs," EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2024, pp. 896–911, Jan. 2024.