GitHub - Baidicoot/sparse_coding: Work on sparse coding, replicating and extending the sparse coding approach to taking transformer features out of superposition.

Sparse Coding

This repo contains code for applying sparse coding to activation vectors in language models. Work done with Logan Riggs and Aidan Ewart, advised by Lee Sharkey.

run.py contains a set of functions for generating datasets from Pile10k and then training sparse autoencoders on the resulting activations, to try to learn the features that the model uses for its computation. By default it is set up to run hyperparameter sweeps across dictionary size and L1 coefficient.
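
As a rough illustration of what such a sweep evaluates, here is a minimal NumPy sketch of a sparse autoencoder loss (reconstruction error plus an L1 penalty on the codes) computed over a small grid of dictionary sizes and L1 coefficients. The tied encoder/decoder weights, the function name `sae_loss`, the grid values, and the random stand-in activations are all assumptions made for illustration; the repo's own autoencoders and sweep code may be organised differently.

```python
import itertools

import numpy as np


def sae_loss(x, W_enc, b_enc, l1_coef):
    """Reconstruction + L1 loss for a tied-weight sparse autoencoder (sketch).

    x:      (batch, d_act) activation vectors
    W_enc:  (n_dict, d_act) dictionary; rows are candidate feature directions
    b_enc:  (n_dict,) encoder bias
    """
    # Normalise dictionary rows so the L1 penalty cannot be dodged by rescaling.
    D = W_enc / np.linalg.norm(W_enc, axis=1, keepdims=True)
    codes = np.maximum(x @ D.T + b_enc, 0.0)  # ReLU feature activations
    x_hat = codes @ D                         # decode with the same dictionary
    mse = np.mean((x - x_hat) ** 2)
    l1 = np.mean(np.sum(np.abs(codes), axis=1))
    return mse + l1_coef * l1, codes


rng = np.random.default_rng(0)
d_act = 16
x = rng.normal(size=(32, d_act))  # stand-in for real model activations

# Hypothetical sweep grid: dictionary size as a multiple of d_act, and L1 coefficient.
dict_ratios = [1, 2, 4]
l1_coefs = [1e-4, 1e-3, 1e-2]

results = {}
for ratio, l1_coef in itertools.product(dict_ratios, l1_coefs):
    n_dict = ratio * d_act
    W = rng.normal(size=(n_dict, d_act))  # untrained weights, for shape checking only
    b = np.zeros(n_dict)
    loss, codes = sae_loss(x, W, b, l1_coef)
    results[(n_dict, l1_coef)] = float(loss)
```

In a real sweep each grid point would be trained by gradient descent before its loss and sparsity are compared; the loop above only shows how the two hyperparameters enter the objective.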

Running python replicate_toy_models.py replicates the first half of the post "Taking features out of superposition with sparse autoencoders".
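
For readers unfamiliar with the toy setup, the sketch below shows one simplified way to build such data and score a learned dictionary against it. It assumes activations are sparse non-negative combinations of fixed random unit directions, and it scores recovery with a mean max cosine similarity in the spirit of the metric the post describes. The function names, sizes, and sampling scheme are illustrative assumptions, not the script's actual code.

```python
import numpy as np


def make_toy_data(n_samples, n_features, d_act, p_active, rng):
    """Sample activations as sparse non-negative combinations of fixed unit directions."""
    feats = rng.normal(size=(n_features, d_act))
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    # Each feature is switched on independently with probability p_active.
    active = rng.random((n_samples, n_features)) < p_active
    coeffs = active * rng.random((n_samples, n_features))
    return coeffs @ feats, feats


def mmcs(learned, truth):
    """Mean, over ground-truth features, of the max cosine similarity to any learned row."""
    learned = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    truth = truth / np.linalg.norm(truth, axis=1, keepdims=True)
    cos = truth @ learned.T  # (n_truth, n_learned) cosine similarities
    return float(cos.max(axis=1).mean())


rng = np.random.default_rng(0)
data, true_feats = make_toy_data(1000, 64, 32, 0.05, rng)

# A dictionary equal to the ground truth up to row order scores 1 (up to rounding);
# a random dictionary of the same shape scores well below that.
perfect = mmcs(true_feats[rng.permutation(64)], true_feats)
random_score = mmcs(rng.normal(size=(64, 32)), true_feats)
```

Because the score takes a max over learned rows, it is insensitive to the order in which features are learned, which is why a permuted copy of the ground truth still counts as a perfect recovery.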

The repo also contains utilities for running code on vast.ai machines, which can speed up these sweeps.

Automatic Interpretation

interpret.py contains tools to interpret learned dictionaries using OpenAI's automatic interpretation protocol. Set --load_interpret_autoencoder to the location of the autoencoder you want to test, and --model_name, --layer and --layer_loc to specify the activations that should be used. --activation_tranform should be set to feature_dict for interpreting a learned dictionary but there are many baselines that can also be run, including pca, ica, nmf, neuron_basis, and random.
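
To make the baselines concrete, here is a hedged NumPy sketch of how three of them (neuron_basis, random, and pca) could each be expressed as a matrix of directions that activations are projected onto, so they can be compared with a learned dictionary on equal terms. The helper names `baseline_dictionary` and `feature_activations` are hypothetical, and interpret.py may construct its baselines differently; ica and nmf are left out because they need an external fitting routine.

```python
import numpy as np


def baseline_dictionary(kind, activations, rng):
    """Return a (n_directions, d_act) matrix of unit directions for a baseline."""
    d_act = activations.shape[1]
    if kind == "neuron_basis":
        # Each direction is a single coordinate of the activation space.
        return np.eye(d_act)
    if kind == "random":
        dirs = rng.normal(size=(d_act, d_act))
        return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    if kind == "pca":
        centered = activations - activations.mean(axis=0)
        # Rows of vt are principal directions, ordered by explained variance.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return vt
    raise ValueError(f"unknown baseline: {kind}")


def feature_activations(activations, dictionary):
    """Project activations onto each dictionary direction."""
    return activations @ dictionary.T


rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 16))  # stand-in for real model activations
pca_feats = feature_activations(acts, baseline_dictionary("pca", acts, rng))
```

Treating every baseline as "a set of directions plus a projection" means the same interpretation pipeline can score each one without special cases.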

If you run interpret.py read_results --kwargs.. and select the --model_name, --layer and --layer_loc, this will produce a series of plots comparing

Training a custom small transformer

One part of replicating Conjecture's sparse coding work was to use a very small transformer for some early tests using sparse autoencoders to find features. There doesn't appear to be an open-source model of this kind, and the original model is proprietary, so below are the instructions I followed to create a similar small transformer.

Make sure you have >200GB disk space. Tested using a vast.ai RTX3090 and pytorch:latest docker image.

git clone https://github.com/karpathy/nanoGPT
cd nanoGPT
python -m venv .env
source .env/bin/activate
apt install -y build-essential
pip install torch numpy transformers datasets tiktoken wandb tqdm

Change config/train_gpt2.py to have:

import time
wandb_project = 'sparsecode'
wandb_run_name = 'supertiny-' + str(time.time())
n_layer = 6 # (same as train_shakespeare and Lee's work)
n_embd = 16 # (same as Lee's)
n_head = 8 # (needs to divide n_embd)
dropout = 0.2 # (used in shakespeare_char)
block_size = 256 # (just to make faster?)
batch_size = 64
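
The settings above can be sanity-checked before training. The sketch below verifies that n_head divides n_embd and gives a rough weight count. It assumes a vocabulary of 50304 tokens (the GPT-2 vocabulary padded up, which is taken here to be nanoGPT's default) and uses the common 12 * n_layer * n_embd^2 approximation for the transformer blocks, ignoring biases and layer norms. The function name is hypothetical.

```python
n_layer = 6
n_embd = 16
n_head = 8
block_size = 256
vocab_size = 50304  # assumption: padded GPT-2 vocabulary size

# Attention splits the embedding evenly across heads, so n_head must divide n_embd.
assert n_embd % n_head == 0, "n_head must divide n_embd"
head_dim = n_embd // n_head


def approx_param_count(n_layer, n_embd, block_size, vocab_size):
    """Rough weight count, ignoring biases and layer norms."""
    token_emb = vocab_size * n_embd    # shared with the output head if weights are tied
    pos_emb = block_size * n_embd
    blocks = 12 * n_layer * n_embd**2  # attention (4 d^2) + MLP (8 d^2) per layer
    return {"token_emb": token_emb, "pos_emb": pos_emb, "blocks": blocks,
            "total": token_emb + pos_emb + blocks}


counts = approx_param_count(n_layer, n_embd, block_size, vocab_size)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

Under these assumptions each attention head is only 2-dimensional, and the token embedding table is far larger than the transformer blocks themselves.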

To set up the dataset run:

python data/openwebtext/prepare.py

Then, if using multiple GPUs, run:

torchrun --standalone --nproc_per_node={N_GPU} train.py config/train_gpt2.py

Otherwise, simply run:

python train.py config/train_gpt2.py

Folders and files (296 commits in history)

autoencoders
experiments
interp_notebooks
plotting
sc_datasets
test
test_datasets
.gitignore
Python-3.11.3.tgz
README.md
activation_dataset.py
argparser.py
basic_l1_sweep.py
big_sweep.py
big_sweep_experiments.py
case_studies_loop.ipynb
cluster_runs.py
cmdutil.py
do_ioi_multiple_layers.sh
generate_test_data.py
inter_dict_connections.ipynb
interpret.py
minimal_feature_interp.ipynb
mypy.ini
replicate_toy_models.py
requirements.txt
sphered_comparison.py
standard_metrics.py
sweep_baselines.py
utils.py
