Weak-to-Strong Deception

This repository contains the code and data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization" (arXiv:2406.11431).

Figure: The concepts studied in our paper.

Introduction

As LLMs ultimately grow into superhuman models, it remains crucial and urgent to study whether super models trained under humans' weak supervision can realize their full potential and, most importantly, still align well with human values. The Superalignment team has made an initial exploration and discovered a promising weak-to-strong generalization phenomenon. However, we are concerned about a potential safety issue that we call weak-to-strong deception: the strong model behaves well-aligned in areas known to the weak supervisor, but produces misaligned behaviors in cases beyond the weak supervisor's understanding.
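
As a rough, self-contained analogy (not this repo's code; the models and data here are toy stand-ins), the weak-to-strong setup can be sketched with small scikit-learn models: a weak supervisor is trained on a little ground truth, and a stronger student is then trained only on the supervisor's imperfect labels.

```python
# Toy sketch of the weak-to-strong setup (illustrative only, not this repo's code).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20, n_informative=10,
                           random_state=0)
# A small ground-truth set for the weak supervisor; the rest is split into
# "unlabeled" data for the student and a held-out test set.
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=300,
                                                  random_state=0)
X_unlab, X_test, _, y_test = train_test_split(X_rest, y_rest, train_size=0.5,
                                              random_state=0)

weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)
weak_labels = weak.predict(X_unlab)  # imperfect weak supervision

# The strong student never sees ground truth, only the weak labels.
strong = GradientBoostingClassifier(random_state=0).fit(X_unlab, weak_labels)

print("weak supervisor accuracy:", weak.score(X_test, y_test))
print("strong student accuracy: ", strong.score(X_test, y_test))
# Weak-to-strong generalization: the student can outperform its supervisor.
# Weak-to-strong deception: the student may match the supervisor only where
# the supervisor is competent enough to check.
```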

Many situations could give rise to weak-to-strong deception; we conduct a preliminary study of one specific but realistic case: the multi-objective alignment scenario, in which some alignment goals may conflict with each other. In such a case, the strong student is likely to deceive the weak supervisor in one alignment dimension in order to gain a high reward in another alignment dimension.

We conduct experiments on both the reward modeling task and the preference optimization scenario (with DPO and SimPO). The code for our weak-to-strong deception experiments is in the weak-to-strong directory.
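
For reference, the three training objectives mentioned above can be written compactly. The sketch below is illustrative PyTorch, not the code from this repo: the function and argument names are our own, the default hyperparameters are only indicative, and the log-probabilities are assumed to be pre-summed over the response tokens.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) reward modeling: push the scalar reward of
    the chosen response above that of the rejected response."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: the implicit reward is the beta-scaled log-prob ratio between
    the policy and a frozen reference model."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen - rejected).mean()

def simpo_loss(policy_chosen_logps: torch.Tensor,
               policy_rejected_logps: torch.Tensor,
               chosen_lengths: torch.Tensor,
               rejected_lengths: torch.Tensor,
               beta: float = 2.0, gamma: float = 0.5) -> torch.Tensor:
    """SimPO: reference-free; uses length-normalized policy log-probs and
    a target reward margin gamma."""
    chosen = beta * policy_chosen_logps / chosen_lengths
    rejected = beta * policy_rejected_logps / rejected_lengths
    return -F.logsigmoid(chosen - rejected - gamma).mean()
```

The relevant contrast is that reward modeling and DPO anchor the preference signal to a fixed comparator (a scalar reward model or a frozen reference model), while SimPO drops the reference model and instead length-normalizes the policy log-probabilities and enforces a target margin gamma.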

Acknowledgement

Our code is mainly based on the original weak-to-strong repo released by the Superalignment team. We greatly appreciate their open-sourcing! For the experiments with DPO and SimPO, our implementation is mainly based on the official DPO repo, an unofficial DPO repo, and the official SimPO repo. Thanks for their open-sourcing!

Citation

If you find this repo helpful, please kindly cite our work as

@article{yang2024super,
  title={Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization},
  author={Yang, Wenkai and Shen, Shiqi and Shen, Guangyao and Gong, Zhi and Lin, Yankai},
  journal={arXiv preprint arXiv:2406.11431},
  year={2024}
}
