Do-Not-Answer: Evaluating Safeguards in LLMs
Document Type
Conference Proceeding
Publication Title
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2024
Abstract
With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. This requires developers to identify potential risks through the evaluation of “dangerous capabilities” in order to responsibly deploy LLMs. Here we aim to facilitate this process. In particular, we collect an open-source dataset to evaluate the safeguards in LLMs, to facilitate the deployment of safer open-source LLMs at a low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We assess the responses of six popular LLMs to these instructions, and we find that simple BERT-style classifiers can achieve results that are comparable to GPT-4 on automatic safety evaluation. Warning: This paper contains examples that may be offensive, harmful, or biased.
First Page
896
Last Page
911
Publication Date
1-1-2024
Recommended Citation
Y. Wang et al., "Do-Not-Answer: Evaluating Safeguards in LLMs," EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2024, pp. 896–911, Jan. 2024.