Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix VCF to protein HGVS conversions of insertions near splice sites #68

Merged
merged 3 commits into from
Feb 1, 2023

Conversation

alumi
Copy link
Member

@alumi alumi commented Jan 26, 2023

Summary

This PR fixes a problem in varity.vcf-to-hgvs/vcf-variant->protein-hgvs reported by #67.

Problem

The test case of #67 fails on master branch returning an incorrect value, p.R1335Qfs*954

{:chr "chr1", :pos 26773690, :ref "C", :alt "CGCAGCA"}

As ARID1A is on the :forward strand and there's a repeated sequence of GCA around the variant, 3'-rule realigns the variant to the following one:

{:chr "chr1", :pos 26773714, :ref "A", :alt "AGCAGCA"}

This can be illustrated as follows. The correct output should be p.Q1334_R1335insQQ.

pos  26773...                          POS              POS+6
     666 666 666 667 777 777 777 777 7 7         7 77            8 888 888
     899 999 999 990 000 000 000 111 1 1         1 11            0 000 000
     901 234 567 890 123 456 789 012 3 4         5 67            2 345 678
pref P Q Q Q Q Q Q Q ┌─────Q─────┐ R  intron  ─┐ H D
REF  CCG_CAG_CAG_CAG_CAG_CAG_CAG_CAG_C[A        ]A_CG  gt....ag  A_CAT_GAT
ALT  CCG_CAG_CAG_CAG_CAG_CAG_CAG_CAG_C[AG_CAG_CA]A_CG  gt....ag  A_CAT_GAT
palt P Q Q Q Q Q Q Q Q─┘ Q Q─┘ R          ─┘ H D
Cause

varity.vcf-to-hgvs.protein/->protein-variant splits REF and ALT AA sequences into three parts each:

  1. common prefix
  2. directly affected subsequence
  3. the remaining which may have a frame shift

[_ pref ref-prot-rest] (pstring/split-at
ref-prot-seq
[(dec base-ppos)
(case strand
:forward (protein-position (+ pos (count ref) -1) rg)
:reverse ppos)])
[_ palt alt-prot-rest] (pstring/split-at
alt-prot-seq*
[(min (dec base-ppos) (count alt-prot-seq*))
(min (case strand
:forward (protein-position (+ pos (count alt) -1) rg)
:reverse (protein-position (- pos (- (count alt) (count ref))) rg))
(count alt-prot-seq*))])

At L162, a genomic position (+ pos (count alt) -1) is converted into a transcript coordinate by varity.vcf-to-hgvs.protein/protein-position to calculate the end index of the direct effect (2.) in the ALT AA sequence.
But the position (+ pos (count alt) -1), which evaluates to be 26774720, is inside of an intron and thus be clamped to the start position of the subsequent exon: 26773802

pos (proton/clip pos (:cds-start rg) (:cds-end rg))
pos (->> (:exon-ranges rg)
(some (fn [[s e]]
(cond
(<= s pos e) pos
(< pos s) s))))

resulting in an incorrect splitting and an incorrect HGVS p.R1335Qfs*954

REF: ...QQQ QR HD...
ALT: ...QQQ QQ QRHD...
Fix

This PR fixes the coordinate converting by shifting feature positions in refGene entry by applying ALT.

:alt-rg (-> rg
(assoc :exon-ranges alt-exon-ranges*)
(update :cds-start apply-offset)
(update :cds-end apply-offset))}))

Along with :exon-ranges which is already calculated as alt-exon-ranges*, :cds-(start|end) are converted using the same fn: varity.vcf-to-hgvs.protein/alt-exon-ranges.
With these changes applied, (+ pos (count alt) -1) will no longer come in an intron, allowing the correct conversion of positions.

REF: ...QQQ Q   RHD...
ALT: ...QQQ QQQ RHD...
Known issues

Since the current implementation relies only on positions, it will fail if a realigned variant contains a boundary of exon and intron.

;; p.=
;; Two synonymous snvs on the same strand.
;; The intron in between remains unchanged.
"chr1" 26773714
"AACGGTGAGTAAAGCCTGGTCTCGGTGCTGCTATGGATCAGGCTTCGCCACTGCCCACCCTAATCCTGTGTTTCTTTGCCTCCTATAGACAT"
"AGCGGTGAGTAAAGCCTGGTCTCGGTGCTGCTATGGATCAGGCTTCGCCACTGCCCACCCTAATCCTGTGTTTCTTTGCCTCCTATAGACAC"
;; p.R1335delinsQRH
;; Two insertions (GCA at first, CAT at last) on the same strand.
;; The intron in between remains unchanged.
"chr1" 26773714
"AACGGTGAGTAAAGCCTGGTCTCGGTGCTGCTATGGATCAGGCTTCGCCACTGCCCACCCTAATCCTGTGTTTCTTTGCCTCCTATAGACAT"
"AGCAACGGTGAGTAAAGCCTGGTCTCGGTGCTGCTATGGATCAGGCTTCGCCACTGCCCACCCTAATCCTGTGTTTCTTTGCCTCCTATAGACATCAT"))))

This PR does not address the issue. When such a situation is detected, it just throws an exception instead of returning an incorrect result. 6f322e0

@codecov
Copy link

codecov bot commented Jan 26, 2023

Codecov Report

Merging #68 (6f322e0) into master (e401a36) will decrease coverage by 0.05%.
The diff coverage is 37.50%.

@@            Coverage Diff             @@
##           master      #68      +/-   ##
==========================================
- Coverage   45.67%   45.63%   -0.05%     
==========================================
  Files          16       16              
  Lines        1990     2003      +13     
  Branches       63       64       +1     
==========================================
+ Hits          909      914       +5     
- Misses       1018     1025       +7     
- Partials       63       64       +1     
Impacted Files Coverage Δ
src/varity/vcf_to_hgvs/protein.clj 28.21% <37.50%> (+0.38%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link

@ToshimitsuArai ToshimitsuArai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix. LGTM

Copy link
Contributor

@niyarin niyarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alumi alumi requested a review from nokara26 January 31, 2023 13:05
@alumi alumi marked this pull request as ready for review January 31, 2023 13:09
@alumi
Copy link
Member Author

alumi commented Jan 31, 2023

@ToshimitsuArai @niyarin Thank you for your reviews!

@nokara26 I marked this PR as ready for review. Would you take a look at it?
I'm afraid but I'm not certain if you are in charge of this, if not, please assign the appropriate person 🙇

Copy link
Contributor

@nokara26 nokara26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix! LGTM👍

@nokara26 nokara26 merged commit c603169 into master Feb 1, 2023
@nokara26 nokara26 deleted the fix/insertion-near-splice-site branch February 1, 2023 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants