
Implementing activation steering

This repository provides code for different ways to implement activation steering to change the behavior of LLMs. See also this blogpost.

It is aimed at people who are new to activation/representation steering/engineering/editing. I use GPT2-XL as an example model for the implementation.

Install

Tested with Python 3.10. Create a new environment and install the libraries listed in requirements.txt:

pip install -r requirements.txt

General approach to activation steering

The idea is simple: we add some vector (for example a "Love" vector) to the internal model activations and thereby influence the model output in a way similar to (but sometimes more effective than) prompting.

In general there are a few steps involved, which I outline in simplified form below (see the code sketch after the list):

  • Decide on a layer $l$ and transformer module $\phi$ to apply the activation steering to. This is often the residual stream of one of the hidden layers.
  • Define a steering vector. In the simplest case we just take the difference of the activations of two encoded strings, e.g. $v = \phi_l(\text{Love}) - \phi_l(\text{Hate})$.
  • Add the vector to the activations during the forward pass. In the simplest case this is something like $\tilde{\phi}_l = \phi_l + v$.
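The following is a minimal sketch of these three steps using plain PyTorch forward hooks on GPT2-XL. The layer index, the contrastive prompts, and the generation settings are illustrative choices, not fixed parameters of this repository; the notebooks in this repo show several alternative ways to achieve the same thing.

```python
# Minimal activation steering sketch with PyTorch forward hooks on GPT2-XL.
# Layer index, prompts, and generation settings are illustrative choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device)

# Step 1: pick a layer; we hook the output of a transformer block (residual stream).
layer = 6  # illustrative choice
module = model.transformer.h[layer]

# Step 2: build the steering vector from two contrastive strings.
def get_activation(text):
    ids = tokenizer(text, return_tensors="pt").to(device)
    acts = {}
    def hook(mod, inp, out):
        acts["resid"] = out[0]  # block output is a tuple; [0] is the hidden states
    handle = module.register_forward_hook(hook)
    with torch.no_grad():
        model(**ids)
    handle.remove()
    return acts["resid"][0, -1]  # activation at the last token position

steering_vector = get_activation("Love") - get_activation("Hate")

# Step 3: add the vector to the activations during the forward pass.
def steering_hook(mod, inp, out):
    return (out[0] + steering_vector,) + out[1:]

handle = module.register_forward_hook(steering_hook)
ids = tokenizer("I think dogs are", return_tensors="pt").to(device)
output = model.generate(**ids, max_new_tokens=20, do_sample=False)
handle.remove()

print(tokenizer.decode(output[0]))
```

Removing the hook afterwards restores the unmodified model, so steered and unsteered generations can be compared from the same model instance.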

Implementations
