New research suggests that being rude to ChatGPT may improve its performance in answering questions, though experts warn against using harmful language.
In a study posted on October 6, which has not yet undergone peer review, scientists examined whether the tone of user prompts affects the accuracy of responses from AI systems, seeking to determine whether politeness or rudeness plays a role in AI performance.
To examine how tone influences AI answers, the team created 50 multiple-choice questions across various topics, such as math, history, and science. They modified the questions with prefixes that fit five different tones: very polite, polite, neutral, rude, and very rude. These 250 adjusted questions were then tested with ChatGPT-4, a highly advanced AI developed by OpenAI.
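The expansion step described above, turning 50 base questions into 250 tone-modified variants, can be sketched roughly as follows. This is an illustrative reconstruction, not the researchers' actual code: the "polite" and "very rude" prefixes are quoted from the article, while the "very polite" and "rude" wordings here are hypothetical placeholders.

```python
# Illustrative sketch of the study's prompt-expansion step (not the
# researchers' actual code). Prefixes marked "hypothetical" are invented
# placeholders; the others are quoted from the article.
TONE_PREFIXES = {
    "very_polite": "I would be most grateful if you could help me. ",  # hypothetical
    "polite": "Can I kindly ask for your help? ",                      # quoted
    "neutral": "",                                                     # no prefix
    "rude": "Just figure this out. ",                                  # hypothetical
    "very_rude": "Hey, gofer; figure this out. ",                      # quoted
}

def build_prompt_set(questions):
    """Expand each base question into one variant per tone (50 x 5 = 250)."""
    return [
        {"tone": tone, "prompt": prefix + q}
        for q in questions
        for tone, prefix in TONE_PREFIXES.items()
    ]

# One toy multiple-choice question yields five tone variants.
questions = ["What is 7 * 8? A) 54 B) 56 C) 58 D) 60"]
variants = build_prompt_set(questions)
print(len(variants))  # 5
```

With the study's 50 questions, the same expansion would produce the 250 adjusted prompts described above.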
The study’s preliminary results revealed a surprising trend: rude prompts led to more accurate responses than polite ones. While this discovery is intriguing, the researchers emphasized that encouraging rudeness in AI interactions could harm user experience and communication norms. They stressed that such findings show LLMs (large language models) are sensitive to superficial features of prompts, and that exploiting this sensitivity could inadvertently harm both performance and user well-being.
Rude Prompts Yield Better Results
Before posing each question, the researchers instructed the chatbot to disregard any previous exchanges, ensuring that the tone of earlier prompts did not influence the answers. The accuracy varied depending on the tone, with very polite prompts yielding an accuracy of 80.8%, while very rude prompts achieved 84.8%. Accuracy increased incrementally with each shift from politeness to rudeness.
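The scoring behind those percentages is a simple per-tone tally: each answer is marked right or wrong, then accuracy is computed within each tone group. A minimal sketch of that tallying step, with toy data rather than the study's results:

```python
# Minimal sketch of per-tone accuracy tallying (toy data, not study results).
from collections import defaultdict

def accuracy_by_tone(results):
    """results: iterable of (tone, is_correct) pairs -> {tone: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # tone -> [correct, attempted]
    for tone, is_correct in results:
        totals[tone][0] += int(is_correct)
        totals[tone][1] += 1
    return {tone: correct / attempted
            for tone, (correct, attempted) in totals.items()}

# Toy check: 4 correct answers out of 5 -> 0.8
sample = [("very_polite", True)] * 4 + [("very_polite", False)]
print(accuracy_by_tone(sample))  # {'very_polite': 0.8}
```

In the actual experiment, each question would also be sent in a fresh conversation, as the article notes, so that no earlier prompt's tone carries over into the model's context.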
The language used in the tone-modifying prefixes varied greatly. For polite tones, they used phrases like “Can I kindly ask for your help?” or “Would you please assist me with this task?” In contrast, very rude prompts included statements like “Hey, gofer; figure this out” or “I know you’re not smart, but give this a try.”
This experiment contributes to the emerging field of prompt engineering, which investigates how the structure and tone of prompts influence AI output. It also contradicts previous studies that found impolite prompts often result in poorer performance, although those studies used different AI models.
Earlier research, including studies of ChatGPT-3.5 and Llama 2-70B, found that impolite prompts could reduce accuracy. In the new experiment, by contrast, the rudest prompt settings produced more accurate results than the politest ones, suggesting that the relationship between tone and AI performance remains complex and may vary across models.
The research team acknowledged the limitations of their study. With only 250 questions and a single AI model, the results cannot be generalized across all AI systems. They plan to broaden their research to include other models, such as Anthropic’s Claude and ChatGPT-3, and to explore how tone affects other aspects of AI performance, including reasoning and coherence.
