# linear_rep_geometry

In our paper, we formalize the linear representation hypothesis for large language models and define a causal inner product that respects the semantic structure of language.

We validate the theory on LLaMA-2 representations; this repository provides the code for those experiments.

## Data

In `word_pairs`, each `[___ - ___].txt` file contains counterfactual word pairs for one concept. We use these pairs to estimate the unembedding representations.
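
These files can be loaded with a few lines of Python. A minimal sketch, assuming each non-empty line stores one whitespace-separated pair; the filename below is a hypothetical placeholder for the `[___ - ___].txt` pattern:

```python
from pathlib import Path

def load_word_pairs(path):
    """Load counterfactual word pairs from a word_pairs/*.txt file.

    Assumes each non-empty line holds one whitespace-separated pair
    (base word, counterfactual word); adjust if the files differ.
    """
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs

# Hypothetical filename, for illustration only:
# pairs = load_word_pairs("word_pairs/[male - female].txt")
```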

In `paired_contexts`, each `__-__.jsonl` file contains context samples from Wikipedia in different languages. We use them for the measurement experiment (`3_measurement.ipynb`).
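
Since `.jsonl` stores one JSON object per line, the samples can be read in the standard way. A minimal sketch (the filename is a hypothetical placeholder, and the keys of each record depend on how the samples are stored):

```python
import json

def load_contexts(path):
    """Read context samples from a paired_contexts/*.jsonl file,
    one JSON object per line (standard .jsonl convention)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical filename, for illustration only:
# contexts = load_contexts("paired_contexts/English-French.jsonl")
```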

## Requirements

To run the code, install the Python packages `transformers`, `torch`, `numpy`, `seaborn`, `matplotlib`, and `tqdm` (`json` is part of the Python standard library and needs no installation). You will also need a GPU to run the code efficiently.

Create a `matrices` directory and run `store_matrices.py` before running any of the Jupyter notebooks.
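
For example, in a standard Python environment:

```bash
# Install the required packages (json ships with Python)
pip install transformers torch numpy seaborn matplotlib tqdm

# Precompute and store the matrices before opening the notebooks
mkdir matrices
python store_matrices.py
```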

## Experiments

- `1_subspace.ipynb`: We compare the projections of the differences between counterfactual pairs (vs. random pairs) onto their corresponding concept direction (see the sketch after this list).
- `2_heatmap.ipynb`: We compare the orthogonality of the unembedding representations of causally separable concepts under the causal inner product (vs. the Euclidean and a random inner product).
- `3_measurement.ipynb`: We confirm that the concept direction acts as a linear probe.
- `4_intervention.ipynb`: We confirm that intervening with the embedding representation changes the target concept without changing off-target concepts.
- `5_sanity_check.ipynb`: We verify that the causal inner product we find satisfies the uncorrelatedness condition in Assumption 3.3.
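
As a rough illustration of the first experiment (not the notebooks' actual implementation), the sketch below estimates a concept direction from counterfactual differences of unembedding vectors and compares the projections of counterfactual vs. random pair differences onto it; `gamma` stands in for the LLaMA-2 unembedding matrix, and all data here are toy:

```python
import numpy as np

def concept_direction(gamma, pair_idx):
    """Estimate a concept direction as the normalized mean difference
    of unembedding vectors over counterfactual pairs.

    gamma:    (vocab_size, d) unembedding matrix
    pair_idx: iterable of (base_token_id, counterfactual_token_id)
    """
    diffs = np.stack([gamma[j] - gamma[i] for i, j in pair_idx])
    direction = diffs.mean(axis=0)
    return direction / np.linalg.norm(direction)

def projections(gamma, pair_idx, direction):
    """Project each normalized pair difference onto the direction."""
    diffs = np.stack([gamma[j] - gamma[i] for i, j in pair_idx])
    unit = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    return unit @ direction

# Toy data; the real experiments use LLaMA-2 unembedding vectors.
rng = np.random.default_rng(0)
gamma = rng.normal(size=(1000, 64))
counterfactual = [(2 * k, 2 * k + 1) for k in range(50)]
random_pairs = [tuple(rng.choice(1000, size=2, replace=False)) for _ in range(50)]
d = concept_direction(gamma, counterfactual)
print(projections(gamma, counterfactual, d).mean())  # noticeably positive: these pairs define the direction
print(projections(gamma, random_pairs, d).mean())    # near zero for unrelated random pairs
```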