vCPU Utilization ~33% ( and GIAB Concordance Work ) #435
Hi John, thanks for the kind words and feedback! @marcelm and I will have a meeting tomorrow and discuss this. Best,
Great!

Add'l Deets

It might be helpful to know that I'm running on 128- and 192-core (mem > 360G) Intel EC2 instances using Ubuntu 22.04.

GIAB b37 Runtime Data Ready

I should mention that I am not specifically tracking strobealign's runtime, but the combined strobealign -> samtools sort -> samtools view > {out.bam,out.bam.bai}. Results are in for
My Full Command

(not a fully optimized incantation)
```
OMP_NUM_THREADS=64 OMP_PROC_BIND=close OMP_PLACES=threads OMP_PROC_BIND=TRUE OMP_DYNAMIC=TRUE OMP_MAX_ACTIVE_LEVELS=1 OMP_SCHEDULE=dynamic OMP_WAIT_POLICY=ACTIVE avx2_compiled/strobealign \
    -t 128 \
    -v \
    --rg '@RG\tID:x_$epocsec\tSM:x\tLB:RIH0_ANA0-HG00119_DBC0_0_presumedNoAmpWGS\tPL:presumedILLUMINA\tPU:presumedCombinedLanes\tCN:CenterName\tPG:strobealigner' \
    --use-index \
    /fsx/data/genomic_data/organism_references/H_sapiens/b37/human_1kg_v37/human_1kg_v37.fasta \
    <(unpigz -c -q -- results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/RIH0_ANA0-HG001-19_DBC0_0.R1.fastq.gz ) \
    <(unpigz -c -q -- results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/RIH0_ANA0-HG001-19_DBC0_0.R2.fastq.gz ) \
    | samtools sort -l 0 -m 2G -@ 64 -T $tdir -O SAM - \
    | samtools view -b -@ 0 -O BAM --write-index -o results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/align/strobe/RIH0_ANA0-HG001-19_DBC0_0.strobe.sort.bam##idx##results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/align/strobe/RIH0_ANA0-HG001-19_DBC0_0.strobe.sort.bam.bai -
```
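One detail worth noting in the sort step: samtools sort's `-m` is a per-thread memory limit, so the total sort buffer scales with the `-@` thread count. A quick sanity check (treating 2G as GiB is an assumption about samtools' unit parsing):

```python
GIB = 1024 ** 3

def sort_buffer_bytes(threads, mem_per_thread_gib):
    """samtools sort's -m sets a *per-thread* memory limit, so the
    total in-memory sort buffer scales with the -@ thread count."""
    return threads * mem_per_thread_gib * GIB

# -@ 64 -m 2G as in the command above -> 128 GiB of sort buffers,
# which fits comfortably on the >360 GB instances mentioned.
total = sort_buffer_bytes(64, 2)
```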
note: paired reads, length 151bp

Variant Calling w/ DeepVariant (in progress)

One sample is complete, and compared against the bwa-mem2 calls, the concordance is quite favorable. For the remaining 6, I'll drop a table here tomorrow.
quite favorable is an understatement even :-) Given DeepVariant is certainly biased towards bwa alignments, I'm really excited to see what a bit of optimization and tuning will yield.
Hi, thanks for the feedback! We’ll be able to say more tomorrow, but here are a couple of comments already (not directly related to your main question).
It should; it may just not be that noticeable because it doesn’t print that much more info. With
Yes, strobealign prints the metrics only when stderr is connected to a terminal. It would make a lot of sense to always print the rate metrics once after all reads have been mapped. Would that be sufficient or would you like to be able to see the rate metrics while strobealign is running and even when stderr is not a terminal? Maybe we could print the metrics once a minute or so if stderr is a file (without the line overwriting behavior).
Do you really need
Which dataset is that exactly? I’d like to test running this myself. I’m only aware of the datasets at https://github.com/genome-in-a-bottle/giab_data_indexes.
A couple of comments regarding your command:
(I will absolutely respond to the open items above, but it might take me a few days to get back to you.)

In The Mean Time

Initial Results!

I'm planning to publish these results in some way, and will share the nitty-gritty details when I do. For now, however:
Sample Data

The NovaSeq ~30x no-amp WGS 2x150bp reads for HG001,2,3,4,5,6,7.

Reference

b37

Compute Situation

AWS ParallelCluster and a Snakemake pipeline orchestrator (but any orchestrator will do; they all do the same thing). I developed this framework that I use for this kind of work. Note: I'm presently tearing it apart and cleaning it up after some time away.

Pipeline Stages

Aligners

Dedup

Variant Calling
Concordance
Detailed Summary Results
^1 bwa-mem2 uses all 192 cores maximally; strobealign only uses ~33%. This suggests a substantial further speedup is still possible for strobealign. The numbers do not include the suggested change to samtools sort and samtools view, which could well save time as well.

^2 DeepVariant was trained on alignments largely produced by bwa mem; I imagine that if it were retrained with strobealign data, these numbers could improve significantly for strobealign. It might also be informative to try running Octopus on the strobealign BAMs, but using a strobealign random forest model first.

^3 I have not yet looked at the membership of variant calls across the bwa-mem2 and strobealign call sets. I'm curious to see what the three populations of calls look like.

TO DO
Further Suggestion
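The three populations mentioned in footnote ^3 are plain set operations over the two call sets. A minimal sketch; the `(chrom, pos, ref, alt)` key format is an assumption for illustration, not how the concordance tooling actually represents calls:

```python
def call_set_populations(bwa_calls, strobe_calls):
    """Partition two variant call sets into the three populations of
    interest: shared, bwa-mem2-only, and strobealign-only.

    Calls are any hashable keys, e.g. (chrom, pos, ref, alt) tuples.
    """
    bwa, strobe = set(bwa_calls), set(strobe_calls)
    return {
        "shared": bwa & strobe,        # called by both aligners' pipelines
        "bwa_only": bwa - strobe,      # only in the bwa-mem2 call set
        "strobe_only": strobe - bwa,   # only in the strobealign call set
    }
```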
Hi, I ran strobealign on a dataset with 100 million 150bp paired-end reads. The command was
Neither
The total wall-clock time includes a part at the beginning where strobealign reads the reference and indexes it; this is not fully parallelized. The CPU usage at that stage is at most around 600% due to the serial bits. This has a significant impact on the average CPU usage because I mapped relatively few reads in this experiment. To compensate, I added the 'mapping-only CPU usage' column, which is an estimate: the more reads the dataset contains, the closer the overall CPU usage should be to that number. I think the numbers are quite OK up to about 36 cores. The problem with more cores is likely that the gzip decompression is just too slow:
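The "mapping-only CPU usage" estimate described above amounts to subtracting the indexing phase from both the CPU time and the wall-clock time before taking the ratio. A minimal sketch; the numbers below are placeholders, not the actual measurements:

```python
def mapping_only_cpu_usage(total_cpu_s, total_wall_s, index_wall_s, index_cpu_cores):
    """Estimate CPU usage (in cores) during the mapping phase alone.

    Subtract the poorly parallelized indexing phase from both the
    total CPU time and the total wall-clock time, then take the ratio.
    index_cpu_cores is the average usage while indexing, e.g. 6.0 for 600%.
    """
    mapping_cpu_s = total_cpu_s - index_wall_s * index_cpu_cores
    mapping_wall_s = total_wall_s - index_wall_s
    return mapping_cpu_s / mapping_wall_s

# Placeholder numbers: 600 s wall total, of which 120 s is indexing at ~600%.
# mapping_only_cpu_usage(15120.0, 600.0, 120.0, 6.0) -> 30.0 cores
```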
And here are some measurements with the following differences:
So my suggestion would be to use
With the changes from PR #418, the numbers look like this:
And with those changes, it may make sense to use an instance with 128 cores (unless adding

I’m quite happy with this, TBH; this should make strobealign quite usable in settings with many cores.
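Whether a 128-core instance pays off is essentially an Amdahl's-law question, given the serial indexing phase discussed earlier. A quick sketch with a hypothetical serial fraction (the real fraction depends on reference size and read count):

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: upper bound on speedup when a fixed fraction of
    the work (here: reference indexing and other serial bits) cannot
    be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# With a hypothetical 5% serial fraction, no core count can exceed 20x,
# and the step from 36 to 128 cores yields diminishing returns.
s36 = amdahl_speedup(0.05, 36)
s128 = amdahl_speedup(0.05, 128)
```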
Ahoy!
I am quite excited to explore in more detail how strobealign behaves on b37 & hg38 across all of the GIAB samples (at least for NOVA reads, others if budget allows).
I have built the tool from source, and am working with:
This is running well for me, but not optimally. When I monitor the vCPU utilization, if I set -t to $nproc for the machine I'm using, I can only ever coerce the tool to occupy ~1/3 of the indicated threads/CPUs. Increasing the number wildly has little effect; reducing it has a smaller effect than I'd have expected.
Do you have guidance on whether this is expected behavior or not (and if not, what I might try to boost CPU utilization)?
Even with this inefficiency, I am getting runtimes that are roughly 30% of the paired bwa-mem2 runtimes for the same sample. At the moment, I am waiting for DeepVariant and concordance numbers to be generated, which I'll happily share.
I have a few other observations / questions:

- It isn't obvious that the `-v` flag actually changes verbosity.
- There doesn't seem to be an equivalent of `-Y`, which bwa and others provide. I use this flag so that I can save the complete read info in the BAM, which allows me to delete the input fastqs (because I can regenerate the fastqs from the BAM as needed, or more commonly stream them out to tools directly from the BAM). Both unmapped reads and soft-clipped alignments are needed to do this. Being unable to reconstitute fastqs from BAM would be a blocker for me in advocating wide adoption, since being able to delete the fastqs saves a substantial amount of $.

That's all. I'm really very impressed with how quickly the docs helped me get rolling, and how well the tool has behaved out of the gate. Thanks again!
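For reference, `samtools fastq` performs the reconstruction described here; the essential transformation is that BAM stores reverse-strand reads reverse-complemented (with reversed qualities), so recovery has to undo that. A minimal Python sketch of just that transformation (simplified field-level view, ignoring paired-end splitting and supplementary alignments, not samtools itself):

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def to_fastq_record(name, seq, qual, is_reverse):
    """Rebuild the original FASTQ record from aligned BAM fields.

    BAM stores reverse-strand reads as the reverse complement of the
    sequenced bases, with the quality string reversed; undo both here.
    This only recovers the full read if alignments are soft-clipped
    (not hard-clipped) and unmapped reads are kept, as noted above.
    """
    if is_reverse:
        seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement back
        qual = qual[::-1]                       # restore original order
    return f"@{name}\n{seq}\n+\n{qual}\n"
```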
John Major