WMDP measures and reduces malicious use of LLMs through unlearning

Researchers have released a benchmark that measures whether an LLM contains potentially dangerous knowledge, along with a novel technique for unlearning dangerous data.

There has been much debate over whether AI models could help criminals build a bomb, plan a cyberattack, or build a bioweapon.

A team of researchers from Scale AI, the Center for AI Safety, and leading educational institutions has released a benchmark that gives us a better measure of how dangerous a given LLM is.

The Weapons of Mass Destruction Proxy (WMDP) benchmark is a dataset of 4,157 multiple-choice questions related to dangerous knowledge within the areas of biosecurity, cybersecurity, and chemical safety.

The higher an LLM scores on the benchmark, the greater the danger that it could empower a person with criminal intent. An LLM with a lower WMDP score is less likely to help you build a bomb or create a new virus.
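To make the idea of a benchmark score concrete, here is a minimal sketch of how accuracy on a multiple-choice benchmark can be computed. It is not the WMDP evaluation code: the `Question` class, the `accuracy` and `chance_level` helpers, and the placeholder questions are all hypothetical names and data introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Question:
    """Hypothetical container for one multiple-choice item."""
    prompt: str
    choices: Sequence[str]
    answer: int  # index of the correct choice


def accuracy(questions: Sequence[Question],
             predict: Callable[[Question], int]) -> float:
    """Fraction of questions for which predict() returns the correct index."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(1 for q in questions if predict(q) == q.answer)
    return correct / len(questions)


def chance_level(questions: Sequence[Question]) -> float:
    """Expected accuracy of uniform random guessing."""
    if not questions:
        raise ValueError("need at least one question")
    return sum(1 / len(q.choices) for q in questions) / len(questions)


# Harmless placeholder items standing in for benchmark questions.
questions = [
    Question("placeholder question 1", ["A", "B", "C", "D"], 2),
    Question("placeholder question 2", ["A", "B", "C", "D"], 0),
    Question("placeholder question 3", ["A", "B", "C", "D"], 3),
    Question("placeholder question 4", ["A", "B", "C", "D"], 1),
]

# A stand-in "model" that always picks the first option.
score = accuracy(questions, lambda q: 0)
baseline = chance_level(questions)
print(f"accuracy={score:.2f}, chance={baseline:.2f}")
```

With four options per question, random guessing yields about 25% accuracy, so a model with no relevant knowledge would be expected to score near that level rather than near zero.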

The traditional way to better align an LLM is to have it refuse requests for information that would enable malicious actions. Jailbreaking or fine-tuning an aligned LLM could remove these guardrails and expose dangerous knowledge that the model absorbed from its training data.

If you can get the model to forget, or unlearn, the offending information, there is no chance it will inadvertently deliver it in response to a clever jailbreaking technique.

In their research paper, the researchers explain how they developed an algorithm called Contrastive Unlearn Tuning (CUT), a fine-tuning method for unlearning dangerous knowledge while retaining harmless information.

The CUT fine-tuning method performs machine unlearning by optimizing a “forget term” so that the model becomes less of an expert on dangerous topics. It also optimizes a “retain term” so that the model still gives helpful responses to innocuous queries.
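A two-part objective of this kind, with one term that encourages forgetting and one that preserves behaviour on harmless inputs, can be sketched as a weighted sum. The example below is an illustrative stand-in and not the paper's exact formulation. It assumes the forget term pulls the model's internal activations on dangerous text toward a fixed "control" vector, and that the retain term keeps activations on harmless text close to those of the original, frozen model. The function names, the `alpha` weight, and the toy vectors are all hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def mean_squared_distance(a: Vector, b: Vector) -> float:
    """Mean of squared element-wise differences between two vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def unlearning_loss(forget_acts: Vector,
                    control: Vector,
                    retain_acts: Vector,
                    frozen_retain_acts: Vector,
                    alpha: float = 1.0) -> float:
    """Combined objective: forget term plus alpha times retain term.

    Assumed form, for illustration only:
    - forget term: distance between the updated model's activations on
      "forget" data and a fixed control vector;
    - retain term: distance between the updated and frozen models'
      activations on "retain" data.
    """
    forget_term = mean_squared_distance(forget_acts, control)
    retain_term = mean_squared_distance(retain_acts, frozen_retain_acts)
    return forget_term + alpha * retain_term


# Toy activations; a real model would produce high-dimensional tensors.
loss = unlearning_loss(
    forget_acts=[1.0, 2.0],
    control=[0.0, 0.0],
    retain_acts=[1.0, 1.0],
    frozen_retain_acts=[1.0, 3.0],
    alpha=0.5,
)
print(loss)
```

The weight `alpha` captures the trade-off: a larger value protects general capability more strongly, at the cost of less aggressive forgetting.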

The dual-use nature of much of the information in LLM training datasets makes it difficult to unlearn only the bad information while retaining the useful information. Using WMDP, the researchers were able to create “forget” and “retain” datasets to guide their CUT unlearning technique.

The researchers used WMDP to measure how likely the ZEPHYR-7B BETA model was to provide dangerous information before and after unlearning with CUT. Their testing focused on biosecurity and cybersecurity.

They then tested the model to see if its overall performance was affected by the unlearning process.

WMDP accuracy and MMLU performance before and after CUT unlearning. Source: arXiv

The results show that the unlearning process significantly reduced the accuracy of responses to dangerous queries while only slightly reducing the model’s performance on the MMLU benchmark.

Unfortunately, CUT also reduces the accuracy of answers in closely related fields such as introductory virology and computer security. Providing a useful answer to the query “How to stop a cyberattack?” but not to “How to conduct a cyberattack?” requires more precision in the unlearning process.

The researchers also found that they could not precisely strip out dangerous chemical knowledge because it was too closely intertwined with general chemical knowledge.

By using CUT, providers of closed models like GPT-4 could unlearn dangerous information so that, even if the models were maliciously fine-tuned or jailbroken, they would have no dangerous information left to deliver.

The same could be done with open source models, although public access to their weights means they could relearn dangerous data if trained on it.

This approach to getting an AI model to unlearn dangerous data may not be foolproof, especially for open source models, but it is a robust addition to current alignment methods.