Demonstrating risks from Anthropic API allowing prefill

Where the user can prefill the prompt to fake model complying with user's instructions. This is particularly exacerbated by being allowed to use "assistant" as a role.

Scenario 1:

On cais/wmdp (bio), given 0, 10, 100 faked turns (with user and assistant roles), how does the refusal rate change?

Results:

0-shot: 29/200 (29 refusals out of 200 questions)
10-shot: 10/200
100-shot: 3/200

Scenario 2:

On cais/wmdp (bio), given 0, 10, 100 faked turns (with user and assistant prefixes), how does the refusal rate change?

Results:

0-shot: 31/200
10-shot: 172/200
100-shot: 200/200

Scenario 3:

Given 0, 10, 100 faked turns from the cyber dataset, how does refusal rate change in the bio test set?

Results:

10-shot (cyber examples, bio question): 17/200
100-shot: 25/200

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.gitignore		.gitignore
README.md		README.md
demo.py		demo.py
read_batch.py		read_batch.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Demonstrating risks from Anthropic API allowing prefill

Scenario 1:

Scenario 2:

Scenario 3:

About

Releases

Packages

Languages

abhinavpola/prefill_jailbreak

Folders and files

Latest commit

History

Repository files navigation

Demonstrating risks from Anthropic API allowing prefill

Scenario 1:

Scenario 2:

Scenario 3:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages