In this episode, Max Reuter and William Schulze discuss their research on prompt refusal in large language models, with a specific focus on OpenAI's ChatGPT. They share their findings on the reasons behind prompt refusal and the potential for predicting it.
Max and William's interest in prompt refusal was sparked during a machine learning class project, which motivated them to explore the topic further. They developed a refusal classifier using the BERT model to determine whether ChatGPT would refuse a given prompt. The classifier achieved an accuracy of 96% in identifying prompt refusal.
During their investigation, the researchers also analyzed the features that were most indicative of prompt refusal. They discovered that mentions of demographic groups and controversial figures played a significant role in predicting refusal. However, they acknowledged the challenges of labeling data for prompt refusal and the potential biases that can arise from this process.
Looking ahead, Max and William proposed several future research directions. They suggested expanding the dataset to enhance the classifier's accuracy and exploring alternative approaches to improve its performance. Additionally, they emphasized the importance of transparency in the training process of large language models and advocated for open access to the models' ethical guidelines.
This episode sheds light on the intricate nature of prompt refusal in large language models, providing valuable insights and underscoring the need for further investigation and transparency in this field.
The creators of large language models impose restrictions on some of the types of requests one might make of them. LLMs commonly refuse to give advice on committing crimes, to produce adult content, or to respond with any details about a variety of sensitive subjects. As with any content filtering system, there are false positives and false negatives.
Today's interview with Max Reuter and William Schulze discusses their paper "I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models". In this work, they explore what types of prompts get refused and build a machine learning classifier that predicts whether a particular prompt will be refused.