New TCR-Epitope Binding Affinity Prediction Task #141

annaweber209 · 2022-01-26T17:06:09Z

fixes #129.

The new TCR-Epitope Binding Affinity Prediction Task can be used via:

from tdc.multi_pred import TCREpitopeBinding
data = TCREpitopeBinding(name = 'weber', path = './data')

The 'weber' dataset contains the full TCR sequence and the CDR3 sequence, as well as the epitope sequence as amino acids or SMILES strings.

kexinhuang12345 · 2022-01-28T01:15:14Z

Thanks, Anna! It works perfectly! Could you also provide some more information about the task and dataset? We would like to highlight it on the website. An example would be https://tdcommons.ai/multi_pred_tasks/peptidemhc/

Particularly, for the task, it would be definition/impact/generalization/product/pipeline. And for the dataset, it would be dataset description/task description/dataset statistics/recommended splits/reference/license.

Any written draft would be great! If you are busy, let us know, we can also come up with something : )

kexinhuang12345 · 2022-02-05T01:27:29Z

Hi @annaweber209 and @jannisborn, I drafted the following text for the TDC website for this task. Could you double-check it and let me know if you want to edit anything? Thanks!!

Description: T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificities is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a TCR sequence and an epitope sequence.

Impact: An accurate model can help design TCRs for effective immunotherapy. It can also unlock a patient's TCR repertoire, which reflects their immune history and could provide information about past and current infectious diseases, vaccine effectiveness or autoimmune reactions.

Generalization: The models are expected to generalize to unseen TCR-epitope pairs and also to epitopes that have never been seen before.

Product: Immunotherapy.

Pipeline: Activity.

Dataset description: The dataset is from Weber et al., who assembled a large and diverse dataset from the VDJ database and the ImmuneCODE project. It uses all human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors exclude epitopes with fewer than 15 associated TCR sequences and downsample to a limit of 400 TCRs per epitope. The dataset contains amino acid and SMILES representation for epitope and the amino acid full sequence of CDR3 region sequence for TCR.
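Editorial sketch: the filtering and downsampling described above can be illustrated with a few lines of plain Python. This is not the authors' preprocessing code. The function name `filter_and_downsample`, its parameters and the `(tcr, epitope)` tuple layout are assumptions made for illustration; only the thresholds (15 and 400) come from the description.

```python
import random
from collections import defaultdict


def filter_and_downsample(pairs, min_tcrs=15, max_tcrs=400, seed=0):
    """Hypothetical helper: drop rare epitopes and cap frequent ones.

    pairs: list of (tcr_sequence, epitope_sequence) positive pairs.
    """
    rng = random.Random(seed)

    # Group the TCRs by the epitope they are reported to bind.
    by_epitope = defaultdict(set)
    for tcr, epitope in pairs:
        by_epitope[epitope].add(tcr)

    kept = []
    for epitope in sorted(by_epitope):
        tcrs = sorted(by_epitope[epitope])
        # Exclude epitopes with fewer than `min_tcrs` associated TCRs.
        if len(tcrs) < min_tcrs:
            continue
        # Downsample to at most `max_tcrs` TCRs per epitope.
        if len(tcrs) > max_tcrs:
            tcrs = rng.sample(tcrs, max_tcrs)
        kept.extend((tcr, epitope) for tcr in tcrs)
    return kept
```

Capping each epitope at a fixed number of TCRs keeps a handful of very well-studied epitopes from dominating the data, which is the imbalance the description refers to.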

Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.

Dataset Split: Random Split, Cold Epitope Split

from tdc.multi_pred import TCREpitopeBinding 
data = TCREpitopeBinding(name = 'weber', path = './data')
split = data.get_split()

References:
[1] Weber, Anna, Jannis Born, and María Rodriguez Martínez. "TITAN: T-cell receptor specificity prediction with bimodal attention networks." Bioinformatics 37.Supplement_1 (2021): i237-i244.

[2] Bagaev, Dmitry V., et al. "VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium." Nucleic Acids Research 48.D1 (2020): D1057-D1062.

[3] Dines, Jennifer N., et al. "The ImmuneRACE study: A prospective multicohort study of immune response action to COVID-19 events with the ImmuneCODE™ open access database." medRxiv (2020).

jannisborn · 2022-02-07T08:17:23Z

Hi @kexinhuang12345, thanks a lot for doing this. I think it's great; here are a few comments:

Dataset Split: Cold TCR Split, Cold TCR/Epitope Split

Comment: The Random split will be too easy to solve. In our paper, we use a Cold TCR split and a Cold TCR-Epitope split, where we split by both modalities together. A Cold Epitope split alone might also be very useful, since generalization toward new epitopes is the grand challenge.
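Editorial sketch: to make the difference between these splits concrete, here is a minimal pure-Python illustration. It is not the procedure from the paper and not the TDC implementation. The function `cold_split`, its flags, and the choice to drop "mixed" pairs (one side held out, the other not) in the joint split are all assumptions for illustration.

```python
import random


def cold_split(pairs, frac_test=0.2, cold_tcr=True, cold_epitope=False, seed=0):
    """Hypothetical helper: split (tcr, epitope) pairs so that held-out
    entities never appear in the training set."""
    if not (cold_tcr or cold_epitope):
        raise ValueError("enable cold_tcr, cold_epitope, or both")
    rng = random.Random(seed)

    def hold_out(values):
        # Pick a fraction of the unique entities to reserve for testing.
        unique = sorted(set(values))
        rng.shuffle(unique)
        n_test = max(1, round(frac_test * len(unique)))
        return set(unique[:n_test])

    test_tcrs = hold_out([t for t, _ in pairs]) if cold_tcr else set()
    test_epitopes = hold_out([e for _, e in pairs]) if cold_epitope else set()

    train, test = [], []
    for tcr, epitope in pairs:
        flags = []
        if cold_tcr:
            flags.append(tcr in test_tcrs)
        if cold_epitope:
            flags.append(epitope in test_epitopes)
        if all(flags):
            test.append((tcr, epitope))   # every split entity is unseen
        elif not any(flags):
            train.append((tcr, epitope))  # every split entity is seen
        # Mixed pairs are dropped so that nothing leaks into training.
    return train, test
```

With `cold_tcr=True, cold_epitope=True`, a test pair contains both an unseen TCR and an unseen epitope, which corresponds to the harder setting described above. The price is that mixed pairs are discarded, so the usable dataset shrinks.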

Generalization: The models are, at the very least, expected to generalize to unseen TCRs. But the main challenge of this dataset is to generalize to samples where both epitope and TCR are unseen.

Also I would replace this sentence

The dataset contains amino acid and SMILES representation for epitope and the amino acid full sequence of CDR3 region sequence for TCR.

with

The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are available as amino acid sequences. Since Weber et al. proposed to represent the peptides as SMILES strings (which reformulates the problem to protein-ligand binding prediction) the SMILES strings of the epitopes are also included.

Maybe @annaweber209 has some more thoughts?

kexinhuang12345 · 2022-02-07T14:56:27Z

Great! Thanks for the feedback! @annaweber209 let me know if there is any additional thought!

annaweber209 · 2022-02-09T12:14:03Z

Sorry for the late reply @kexinhuang12345 !

I would add a sentence to the dataset description saying: "50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind."

Another small comment is to remove the word "all" from "It uses all human TCR-beta chain sequences." since we excluded some of the TCR sequences during the downsampling process.
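Editorial sketch: the negative sampling by shuffling mentioned above could look roughly like the following. This is an illustration under stated assumptions, not the authors' code. The function name `add_shuffled_negatives`, the retry limit and the `(tcr, epitope, label)` output layout are hypothetical.

```python
import random


def add_shuffled_negatives(positives, seed=0, max_tries=100):
    """Hypothetical helper: add one negative per positive pair by
    re-pairing the TCR with an epitope it is not known to bind.

    positives: list of (tcr, epitope) pairs with observed binding.
    Returns a list of (tcr, epitope, label) with label 1 or 0.
    """
    rng = random.Random(seed)
    known = set(positives)
    epitopes = sorted({epitope for _, epitope in positives})

    labeled = [(tcr, epitope, 1) for tcr, epitope in positives]
    negatives = set()
    for tcr, _epitope in positives:
        for _ in range(max_tries):
            candidate = (tcr, rng.choice(epitopes))
            # Reject pairs that are known binders or already-used negatives.
            if candidate not in known and candidate not in negatives:
                negatives.add(candidate)
                labeled.append((candidate[0], candidate[1], 0))
                break
    return labeled
```

Note that negatives produced this way are only presumed non-binders: the absence of a reported interaction is not an experimental confirmation of non-binding.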

kexinhuang12345 · 2022-02-09T22:46:34Z

Awesome! Thanks so much! It is included on the website and will go live in 0.3.6, scheduled to be released this weekend. Let us know if you have any questions before then!

annaweber209 added 3 commits January 26, 2022 17:27

feat: added new TCR epitope interaction task.

5868e4c

wip: change names to match naming conventions

6f16772

wip: remove target list as dataset is single label

6e93dd1

kexinhuang12345 approved these changes Jan 28, 2022

kexinhuang12345 merged commit 6d443fc into mims-harvard:main Jan 28, 2022
