
CPU inefficiency and ulimit / disk error #1236

Closed
tracychew opened this issue May 13, 2021 · 1 comment

Comments

@tracychew

Hi,

I am using STAR (v2.7.3a) to align >250 FASTQ pairs (human, ~80M read pairs) on a large HPC system. The code below worked (albeit inefficiently) for most of the samples but failed for 8.

For the 8 that didn't work, I get exit codes 102 (3 samples) and 1 (5 samples), both disk-related.

The stdout from one of the samples with exit 102 is below. It suggests the failure occurred while sorting the BAM, and advises checking disk space and ulimit -n. There is definitely enough disk space, and enough inodes are available. The ulimit -n on the system I am using is 16384, and this thread suggests that should be sufficient with --runThreadN 3 and the default --outBAMsortingBinsN 50.

Stdout for sample with exit 102

[tc6463@gadi-login-09 Logs]$ cat ./star_align_trimmed/AGRF_CAGRF20104118-1_HYNHMDSXY/nCIT098_2.oe
Mon May 10 13:10:58 AEST 2021 : Mapping with STAR 2.7.3a. Sample:nCIT098 R1:../AGRF_CAGRF20104118-1_HYNHMDSXY_trimmed/nCIT098_HYNHMDSXY_CCTCTACATG-AGGAGGTATC_L002_R1_trimmed.fastq.gz R2: ../AGRF_CAGRF20104118-1_HYNHMDSXY_trimmed/nCIT098_HYNHMDSXY_CCTCTACATG-AGGAGGTATC_L002_R2_trimmed.fastq.gz centre:AGRF platform:ILLUMINA library:1 lane:2 flowcell:HYNHMDSXY logfile:./Logs/star_align_trimmed/AGRF_CAGRF20104118-1_HYNHMDSXY/nCIT098_2.oe NCPUS:3
May 10 13:10:58 ..... started STAR run
May 10 13:10:59 ..... loading genome
May 10 13:12:29 ..... started mapping
May 10 14:14:56 ..... finished mapping
May 10 14:56:47 ..... started sorting BAM

EXITING because of fatal ERROR: could not open temporary bam file: ../AGRF_CAGRF20104118-1_HYNHMDSXY_STAR/nCIT098_2__STARtmp//BAMsort//b13
SOLUTION: check that the disk is not full, increase the max number of open files with Linux command ulimit -n before running STAR
May 10 15:16:38 ...... FATAL ERROR, exiting

Here is an example stdout from one of the samples that failed with exit code 1. STAR isn't able to read one of the temporary files it creates:

Mon May 10 13:10:58 AEST 2021 : Mapping with STAR 2.7.3a. Sample:nCIT092 R1:../AGRF_CAGRF20104118-1_HYNHMDSXY_trimmed/nCIT092_HYNHMDSXY_TGTCGCTGGT-AGTCAGACGA_L002_R1_trimmed.fastq.gz R2: ../AGRF_CAGRF20104118-1_HYNHMDSXY_trimmed/nCIT092_HYNHMDSXY_TGTCGCTGGT-AGTCAGACGA_L002_R2_trimmed.fastq.gz centre:AGRF platform:ILLUMINA library:1 lane:2 flowcell:HYNHMDSXY logfile:./Logs/star_align_trimmed/AGRF_CAGRF20104118-1_HYNHMDSXY/nCIT092_2.oe NCPUS:3
May 10 13:10:58 ..... started STAR run
May 10 13:10:59 ..... loading genome
May 10 13:12:29 ..... started mapping
May 10 14:07:16 ..... finished mapping
May 10 14:56:47 ..... started sorting BAM

EXITING because of FATAL ERROR: failed reading from temporary file: ../AGRF_CAGRF20104118-1_HYNHMDSXY_STAR/nCIT092_2__STARtmp//BAMsort/2/11
May 10 15:16:46 ...... FATAL ERROR, exiting
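As a quick sanity check on the open-file budget before digging into the script: a common rule of thumb (an assumption here, not STAR's documented formula) is that the number of temporary sort files open at once scales roughly with runThreadN times outBAMsortingBinsN, so the soft limit should comfortably exceed that product.

```shell
# Heuristic check (assumption: open temp files ~ threads * bins).
threads=3        # --runThreadN used in the failing jobs
bins=50          # --outBAMsortingBinsN (STAR default)
needed=$((threads * bins))
limit=$(ulimit -n)
echo "approx temp files needed: ${needed}; soft open-file limit: ${limit}"
```

With 3 threads and 50 bins this estimates ~150 files, well under 16384, which is why the ulimit explanation alone seemed unsatisfying here.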

Here is the code in star_align_paired_unmapped.sh, inputs are read as arguments:

#!/bin/bash

# Align to the reference genome using STAR

filename=$(echo "$1" | cut -d ',' -f 1)
dataset=$(echo "$1" | cut -d ',' -f 2)
sampleid=$(echo "$1" | cut -d ',' -f 3)
fastq1=$(echo "$1" | cut -d ',' -f 4)
fastq2=$(echo "$1" | cut -d ',' -f 5)
ref=$(echo "$1" | cut -d ',' -f 6)
seqcentre=$(echo "$1" | cut -d ',' -f 7)
platform=$(echo "$1" | cut -d ',' -f 8)
library=$(echo "$1" | cut -d ',' -f 9)
lane=$(echo "$1" | cut -d ',' -f 10)
flowcell=$(echo "$1" | cut -d ',' -f 11)
outdir=$(echo "$1" | cut -d ',' -f 12)
logfile=$(echo "$1" | cut -d ',' -f 13)
NCPUS=$(echo "$1" | cut -d ',' -f 14)

echo `date` ": Mapping with STAR 2.7.3a. Sample:$sampleid R1:$fastq1 R2: $fastq2 centre:$seqcentre platform:$platform library:$library lane:$lane flowcell:$flowcell logfile:$logfile NCPUS:$NCPUS" > ${logfile} 2>&1

# Mapping
STAR \
        --runThreadN ${NCPUS} \
        --genomeDir ${ref} \
        --quantMode GeneCounts \
        --readFilesCommand zcat \
        --readFilesIn ${fastq1} ${fastq2} \
        --outSAMattrRGline ID:${flowcell}:${lane} PU:${flowcell}.${lane}.${sampleid} SM:${sampleid} PL:${platform} CN:${seqcentre} LB:${library} \
        --outSAMtype BAM SortedByCoordinate \
        --outReadsUnmapped Fastx \
        --outSAMunmapped Within KeepPairs \
        --outFileNamePrefix ${outdir}/${sampleid}_${lane}_ >> ${logfile} 2>&1
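As an aside, the fourteen cut invocations above can be collapsed into a single read. A sketch with placeholder field values (not the author's code; the variable names match the script):

```shell
#!/bin/bash
# Parse all 14 comma-separated fields in one read instead of
# fourteen separate `cut` calls. The row below uses placeholder values.
row="run1.csv,ds1,sampleA,r1.fq.gz,r2.fq.gz,/ref/GRCh38,AGRF,ILLUMINA,1,2,FCID,./out,./log.oe,3"
IFS=',' read -r filename dataset sampleid fastq1 fastq2 ref seqcentre \
    platform library lane flowcell outdir logfile NCPUS <<EOF
$row
EOF
echo "sample=${sampleid} lane=${lane} threads=${NCPUS}"
```

This avoids re-spawning echo and cut per field and fails more obviously if a field is missing.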

My last question relates to the inefficiency: for the samples that did work, CPU efficiency was ~0.06 with --runThreadN 24, 0.16 with --runThreadN 6, and 0.5 with --runThreadN 3. The HPC system I am using has nodes with 48 CPUs and 190 GB of RAM. Could you recommend how to utilise the hardware more efficiently?

Thanks so much!

@tracychew
Author

Explicitly setting --outBAMsortingThreadN ${NCPUS} and --outBAMsortingBinsN 100 both improved efficiency and resolved the ulimit issue :)
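For reference, the sort-related flags in the fixed command would look something like this (a sketch only; the remaining options are unchanged from the script earlier in the thread):

```shell
# Sorting threads set explicitly, and the bin count doubled from
# the default of 50. Smaller, more numerous bins reduce per-bin memory
# pressure during the coordinate sort.
STAR \
        --runThreadN ${NCPUS} \
        --outBAMsortingThreadN ${NCPUS} \
        --outBAMsortingBinsN 100 \
        ...
```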
