Download from pypi:
pip install sortedintersect
To build from source:
git clone https://github.com/kcleal/sortedintersect
cd sortedintersect
pip install .
Check if a point or interval overlaps a reference interval set:
from sortedintersect import IntervalSet
# intervals without data
itv = IntervalSet()
itv.add(0, 10)
itv.search_point(1)
# >>> [(0, 10)]
# Return a boolean only (faster)
itv = IntervalSet(bool_only=True)
itv.add(0, 10)
itv.search_point(1)
# >>> True
# search intervals with additional data
itv = IntervalSet(with_data=True)
itv.add(0, 10, 'interval1')
itv.add(20, 30, {'a': 1})
itv.search_interval(1, 2)
# >>> [(0, 10, 'interval1')]
itv.search_point(20)
# >>> [(20, 30, {'a': 1})]
itv = IntervalSet()
itv.add(-1, 1)
itv.add(2, 4)
itv.add(9, 10)
assert tuple(itv.search_point(10)) == ((9, 10),)
assert tuple(itv.search_point(0)) == ((-1, 1),)
# set custom distance threshold
itv = IntervalSet(distance_threshold=500_000)
An interval tree is often used for this purpose relying on binary search. However, if the reference intervals are sorted ahead of time, and query intervals are roughly sorted, then a simple linear search can be used for a small speedup. Sorted-intersect uses a simple heuristic (a distance measure) to decide when to use binary-search and when to use a plane-sweep. When a query-point is tested, its position is recorded. When a subsequent query point is tested, if the distance to the previous point is below a threshold, then a plane sweep is used on the reference intervals instead of a binary search.
sortedintersect was compared to popular python implementations based on interval trees, see tests/benchmark.py for details.
random1 - randomly generated intervals using a python script random2 - 1million random intervals of 1kb in size over chromosome 1, generated by bedtools reads+genes - reads from a 2mb region converted to bed format, then intersected with ucsc genes from hg19
library | time (s) | intersections | test | shuffle |
---|---|---|---|---|
sortedintersect | 0.0116878 | 2947 | random1 | True |
quicksect | 0.014946 | 2947 | random1 | True |
cgranges | 0.0177999 | 2878 | random1 | True |
ncls | 0.027061 | 2947 | random1 | True |
sortedintersect | 0.740114 | 8001594 | random2 | True |
quicksect | 1.28441 | 8001594 | random2 | True |
cgranges | 1.00505 | 7993517 | random2 | True |
ncls | 1.352 | 8001594 | random2 | True |
sortedintersect | 0.111097 | 320618 | reads+genes | True |
quicksect | 0.125926 | 320618 | reads+genes | True |
cgranges | 0.15797 | 320610 | reads+genes | True |
ncls | 0.235777 | 320618 | reads+genes | True |
sortedintersect | 0.011116 | 2916 | random1 | False |
quicksect | 0.0142658 | 2916 | random1 | False |
cgranges | 0.0173151 | 2847 | random1 | False |
ncls | 0.0269301 | 2916 | random1 | False |
sortedintersect | 0.589765 | 8001594 | random2 | False |
quicksect | 0.644668 | 8001594 | random2 | False |
cgranges | 0.860898 | 7993517 | random2 | False |
ncls | 1.15578 | 8001594 | random2 | False |
sortedintersect | 0.0999217 | 320618 | reads+genes | False |
quicksect | 0.123064 | 320618 | reads+genes | False |
cgranges | 0.14915 | 320610 | reads+genes | False |
ncls | 0.230041 | 320618 | reads+genes | False |
shuffle here means that the query intervals have been shuffled.
- Reference queries must be added in sorted order
- Reference intervals can not be deleted
- Not recommended if you reference set of intervals are a mixture of small and very large intervals. Queries can turn in to linear searches!