AI models rank their own safety in OpenAI’s new alignment research


OpenAI announced a new way to teach AI models to align with safety policies, called Rules-Based Rewards.

According to Lilian Weng, head of safety systems at OpenAI, Rules-Based Rewards (RBR) automate some model fine-tuning and cut down the time required to ensure a model does not give unintended results.  

“Traditionally, we rely on reinforcement learning from human feedback as the default alignment training to train models, and it works,” Weng said in an interview. “But in practice, the challenge we’re facing is that we spend a lot of time discussing the nuances of the policy, and by the end, the policy may have already evolved.”

Weng was referring to reinforcement learning from human feedback (RLHF), which asks humans to prompt a model and rate its answers for accuracy or by preference between versions. If a model is meant to respond in a certain way, for example to sound friendly or to refuse “unsafe” requests such as a question about something dangerous, human evaluators can also score its responses on whether they follow those policies.
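To make the contrast with RBR concrete, the sketch below shows the kind of pairwise preference label that RLHF-style training collects from human raters. The data structure, function names, and rater interface are illustrative assumptions, not OpenAI’s actual pipeline.

```python
# Illustrative sketch only: the data structure and rater interface are
# assumptions for explanation, not OpenAI's RLHF tooling.
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferenceExample:
    prompt: str
    chosen: str    # the answer the human rater preferred
    rejected: str  # the answer the rater passed over


def collect_preference(
    prompt: str,
    answer_a: str,
    answer_b: str,
    rater: Callable[[str, str, str], str],  # returns "a" or "b"
) -> PreferenceExample:
    """Ask a human rater which answer is better and record the comparison.

    A reward model trained on many such comparisons later guides the
    reinforcement-learning fine-tuning of the policy model.
    """
    pick = rater(prompt, answer_a, answer_b)
    if pick == "a":
        return PreferenceExample(prompt, chosen=answer_a, rejected=answer_b)
    return PreferenceExample(prompt, chosen=answer_b, rejected=answer_a)


# Example usage with a stand-in for a human judgment.
example = collect_preference(
    "How do I apologize to a friend?",
    "Just say sorry.",
    "Acknowledge what happened, say you're sorry, and ask how to make it right.",
    rater=lambda p, a, b: "b",
)
```

RBR keeps this reinforcement-learning loop but, as described below, swaps some of the human comparisons for rule-based scores from a grader model.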

With RBR, OpenAI said safety and policy teams use an AI model that scores responses based on how closely they adhere to a set of rules created by the teams. 

For example, suppose the development team behind a mental health app wants its AI model to refuse unsafe prompts in a non-judgemental manner while reminding users to seek help if needed. The team would create three rules for the model to follow: first, reject the request; second, sound non-judgemental; and third, use encouraging language that points users toward help.

The RBR model looks at responses from the mental health model, maps them to the three rules, and determines whether each box is checked. Weng said the results from testing models trained with RBR are comparable to those from human-led reinforcement learning.
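As a rough illustration of that mechanism, here is a minimal Python sketch of rules-based scoring for the hypothetical mental health assistant. The rule wording, the keyword checks standing in for a grader model, and the equal weighting are all assumptions made for illustration, not OpenAI’s implementation.

```python
# Illustrative sketch only: the rules, the toy keyword checks standing in for
# a grader model, and the equal weighting are assumptions, not OpenAI's code.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    satisfied: Callable[[str], bool]  # stand-in for a grader model's yes/no judgment


# The three example rules for the hypothetical mental health assistant.
RULES: List[Rule] = [
    Rule("refuses_request", lambda r: "can't help with that" in r.lower()),
    Rule("non_judgemental", lambda r: "you shouldn't have" not in r.lower()),
    Rule("encourages_help", lambda r: "reach out to a professional" in r.lower()),
]


def rules_based_reward(response: str, rules: List[Rule] = RULES) -> float:
    """Score a response against each rule and average the results into a
    single scalar that could serve as a reward signal during fine-tuning."""
    scores = [1.0 if rule.satisfied(response) else 0.0 for rule in rules]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    reply = (
        "I can't help with that, but you're not alone. "
        "Please reach out to a professional who can support you."
    )
    print(rules_based_reward(reply))  # 1.0: all three rules satisfied
```

In the setup the article describes, an AI grader model makes each per-rule judgment; the keyword lambdas above are only stand-ins that keep the sketch self-contained and runnable.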

Of course, ensuring AI models respond within specific parameters is difficult, and when they fail, it creates controversy. In February, Google said it had overcorrected Gemini’s image generation after the model repeatedly refused to generate photos of white people and produced ahistorical images instead.

Reducing human subjectivity

For many, myself included, the idea of one model being in charge of another model’s safety raises concerns. But Weng said RBR actually cuts down on subjectivity, an issue human evaluators often face.

“My counterpoint would be even when you’re working with human trainers, the more ambiguous or murky your instruction is, the lower quality data you’ll get,” she said. “If you say pick which one is safer, then that’s not really an instruction people can follow because safe is subjective, so you narrow down your instructions, and in the end, you’re left with the same rules we give to a model.”

OpenAI acknowledges that RBR could reduce human oversight and raises ethical considerations, including the potential to increase bias in the model. In a blog post, the company said researchers “should carefully design RBRs to ensure fairness and accuracy and consider using a combination of RBRs and human feedback.”

RBR may also have difficulty with inherently subjective tasks, such as writing or other creative work.

OpenAI began exploring RBR methods while developing GPT-4, though Weng said RBR has greatly evolved since then. 

OpenAI has also faced questions about its commitment to safety. In May, Jan Leike, a former researcher and leader of the company’s Superalignment team, criticized the company by posting that “safety culture and processes have taken a backseat to shiny products.” Co-founder and chief scientist Ilya Sutskever, who co-led the Superalignment team with Leike, also resigned from OpenAI and has since started a new company focused on safe AI systems.


