
Implementing activation steering

This repository provides code for different ways to implement activation steering to change the behavior of LLMs. See also this blogpost.

It is aimed at people who are new to activation/representation steering/engineering/editing. I use GPT2-XL as an example model for the implementation.

Install

Tested with Python 3.10. Create a new environment and install the libraries listed in requirements.txt:

pip install -r requirements.txt

General approach to activation steering

The idea is simple: we add some vector (for example a "Love" vector) to the internal model activations and thereby influence the model output in a way similar to (but sometimes more effective than) prompting.

In general there are a few steps involved, which I outline in simplified form below (see the code sketch after the list):

  • Decide on a layer $l$ and transformer module $\phi$ to apply the activation steering to. This is often the residual stream of one of the hidden layers.
  • Define a steering vector. In the simplest case we just take the difference of the activations of two encoded strings, e.g. $v = \phi_l(\text{Love}) - \phi_l(\text{Hate})$.
  • Add the vector to the activations during the forward pass. In the simplest case this is something like $\tilde{\phi}_l = \phi_l + v$.
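The following is a minimal sketch of these three steps using plain PyTorch forward hooks on GPT2-XL. The layer index, the contrastive prompts, and the generation settings are illustrative choices, not fixed parameters of this repository; the notebooks in this repo show several alternative ways to achieve the same thing.

```python
# Minimal activation steering sketch with PyTorch forward hooks on GPT2-XL.
# Layer index, prompts, and generation settings are illustrative choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device)

# Step 1: pick a layer; we hook the output of a transformer block (residual stream).
layer = 6  # illustrative choice
module = model.transformer.h[layer]

# Step 2: build the steering vector from two contrastive strings.
def get_activation(text):
    ids = tokenizer(text, return_tensors="pt").to(device)
    acts = {}
    def hook(mod, inp, out):
        acts["resid"] = out[0]  # block output is a tuple; [0] is the hidden states
    handle = module.register_forward_hook(hook)
    with torch.no_grad():
        model(**ids)
    handle.remove()
    return acts["resid"][0, -1]  # activation at the last token position

steering_vector = get_activation("Love") - get_activation("Hate")

# Step 3: add the vector to the activations during the forward pass.
def steering_hook(mod, inp, out):
    return (out[0] + steering_vector,) + out[1:]

handle = module.register_forward_hook(steering_hook)
ids = tokenizer("I think dogs are", return_tensors="pt").to(device)
output = model.generate(**ids, max_new_tokens=20, do_sample=False)
handle.remove()

print(tokenizer.decode(output[0]))
```

Removing the hook afterwards restores the unmodified model, so steered and unsteered generations can be compared from the same model instance.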

Implementations
