
prot-gpt


This is an implementation of a nano (femto?) GPT model trainable on protein sequences made of amino acids, inspired by Andrej Karpathy's original nanoGPT implementation. Compared to the original implementation, the main changes are:

  • This model trains on multiple independent sequences. That is, the context only contains the current sequence (and not the sequences appearing before it in the training set).
  • Since protein sequences have variable lengths, the model performs padding and masking: sequences are padded to the block size (for batching), and the attention weights corresponding to padded tokens are then masked out inside the transformer, so that no information flows to or from padded tokens (see the sketch after this list).
  • The training loop relies on PyTorch Lightning, which makes our lives a little easier.
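
Below is a minimal sketch of the padding/masking idea; PAD_ID and BLOCK_SIZE are hypothetical names used for illustration, and the actual token ids and mask handling live in this repository's model code:

import torch

PAD_ID = 0          # hypothetical token id reserved for padding
BLOCK_SIZE = 256    # hypothetical context length

def pad_batch(seqs):
    # Pad a list of token-id lists to BLOCK_SIZE and build a padding mask.
    batch = torch.full((len(seqs), BLOCK_SIZE), PAD_ID, dtype=torch.long)
    for i, s in enumerate(seqs):
        batch[i, : len(s)] = torch.tensor(s[:BLOCK_SIZE])
    pad_mask = batch.eq(PAD_ID)   # True where the position holds padding
    return batch, pad_mask

# Inside the attention layers, the padded positions are then excluded, e.g.
#   attn_scores.masked_fill_(pad_mask[:, None, None, :], float("-inf"))
# so that real tokens neither attend to nor receive information from padding.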

The default parameters in train_proteins.py build a ~10M-parameter model trainable in a few hours on a GPU with 8 GB of memory (e.g. an RTX 2080).


Procedure

  1. Prepare the Python environment:
$ pip install -r requirements.txt
  2. Download sequences from the PDB:
$ mkdir data && cd data
$ wget https://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
$ gzip -d pdb_seqres.txt.gz && cd ..
  3. Pre-process the sequences:
$ python preprocess_pdb_seqres.py

This creates the file data/prot_seqs.txt, which contains the mol:protein entries of the PDB sequence file (one entry per distinct molecule name).
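
For reference, here is a rough sketch of what this pre-processing can look like, assuming pdb_seqres.txt is FASTA-formatted with headers such as ">101m_A mol:protein length:154  MYOGLOBIN" (the actual preprocess_pdb_seqres.py is authoritative):

seen_names, sequences = set(), []
with open("data/pdb_seqres.txt") as f:
    lines = iter(f)
    for header in lines:
        seq = next(lines, "").strip()
        if " mol:protein " not in header:
            continue                               # keep only protein entries
        name = header.strip().split(None, 3)[-1]   # free-text molecule name
        if not name or name in seen_names:
            continue                               # one entry per distinct name
        seen_names.add(name)
        sequences.append(seq)

with open("data/prot_seqs.txt", "w") as out:
    out.write("\n".join(sequences) + "\n")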

  4. Choose hyper-parameters in train_proteins.py (illustrative settings below) and train the model:
$ python train_proteins.py
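
The variable names below are only illustrative (check train_proteins.py for the actual names and defaults); settings in roughly this range yield a model on the order of 10M parameters:

block_size    = 256     # maximum sequence length after padding
n_layer       = 6       # number of transformer blocks
n_head        = 6       # attention heads per block
n_embd        = 384     # embedding dimension
batch_size    = 64
learning_rate = 3e-4
max_epochs    = 10      # passed to the PyTorch Lightning Trainer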

You can launch a TensorBoard instance to watch training progress.
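For example, assuming the default PyTorch Lightning log directory lightning_logs/:
$ tensorboard --logdir lightning_logs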

At the end of training (or upon interrupting it with Ctrl+C), the path to the best model checkpoint is displayed.

  5. Generate 100 proteins using a checkpointed model:
$ python generate_proteins.py 100 path/to/checkpoint.ckpt

This writes the generated proteins to the file generated_proteins.txt.

  6. Visualise with AlphaFold: use the AlphaFold Colab with your own sequences!
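For example, to grab one generated sequence to paste into the Colab (assuming generated_proteins.txt stores one sequence per line):
$ head -n 1 generated_proteins.txt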