Unknown errors with big datasets #229

Open
max-hence opened this issue Nov 5, 2024 · 1 comment

@max-hence

Hi,

I managed to make snpArcher work on a dataset with medium-sized genomes (400 Mb), but I get errors for bigger genomes (2 Gb), where jobs take much more time and resources. I think I set slurm/config.yaml properly to request large resources, and the cluster I'm using is supposed to handle such settings, but I still get this kind of error, for instance at the bwa_map rule:

Error in rule bwa_map:
    message: SLURM-job '13562883' failed, SLURM status is: 'NODE_FAIL'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 252
    input: results/GCA_902167145.1/data/genome/GCA_902167145.1.fna, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_1.fastq.gz, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_2.fastq.gz, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.sa, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.pac, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.bwt, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.ann, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.amb, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.fai
    output: results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam, results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.bai
    log: logs/GCA_902167145.1/bwa_mem/SAMN15515513/SRR12460375.txt, /scratch/mbrault/snpcalling/zmays_parviglumis_PRJNA641889/.snakemake/slurm_logs/rule_bwa_map/GCA_902167145.1_SAMN15515513_SRR12460375/13562883.log (check log file(s) for error details)
    conda-env: /scratch/mbrault/snpcalling/zmays_parviglumis_PRJNA641889/.snakemake/conda/8ca636c300f965c6ac864e051945e276_
    shell:
        bwa mem -M -t 8 -R '@RG\tID:6E8\tSM:SAMN15515513\tLB:6E8\tPL:ILLUMINA' results/GCA_902167145.1/data/genome/GCA_902167145.1.fna results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_1.fastq.gz results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_2.fastq.gz 2> logs/GCA_902167145.1/bwa_mem/SAMN15515513/SRR12460375.txt | samtools sort -o results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam - && samtools index results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.bai
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 13562883

And in the .snakemake/slurm_logs/rule_bwa_map/GCA_902167145.1_SAMN15515513_SRR12460375/13562883.log:

localrule bwa_map:
    input: results/GCA_902167145.1/data/genome/GCA_902167145.1.fna, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_1.fastq.gz, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_2.fastq.gz, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.sa, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.pac, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.bwt, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.ann, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.amb, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.fai
    output: results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam, results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.bai
    log: logs/GCA_902167145.1/bwa_mem/SAMN15515513/SRR12460375.txt
    jobid: 0
    benchmark: benchmarks/GCA_902167145.1/bwa_mem/SAMN15515513_SRR12460375.txt
    reason: Forced execution
    wildcards: refGenome=GCA_902167145.1, sample=SAMN15515513, run=SRR12460375
    threads: 32
    resources: mem_mb=100000, mem_mib=95368, disk_mb=43245, disk_mib=41242, tmpdir=/tmp, mem_mb_reduced=90000, slurm_partition=ecobio,genouest, slurm_account=mbrault, runtime=11520, cpus_per_task=32

Activating conda environment: .snakemake/conda/8ca636c300f965c6ac864e051945e276_

[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0000.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0001.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0002.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0003.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0004.bam" : File exists
etc...

But still, when I look at that particular job on the SLURM cluster, I find no errors:

JobID           JobName      State    Elapsed     ReqMem     MaxRSS  MaxVMSize  AllocCPUS 
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- 
13562883     e5af4995-+  COMPLETED   08:26:43    100000M                               32 
13562883.ba+      batch  COMPLETED   08:26:43               129620K   5013688K         32 
13562883.ex+     extern  COMPLETED   08:26:43                  912K    144572K         32 
13562883.0   python3.11  COMPLETED   08:26:05             26104628K  33043144K         32 
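
(A sacct query along these lines, using standard format fields and the external_jobid reported by Snakemake above, is one way to pull this kind of per-step summary:

    sacct -j 13562883 --format=JobID,JobName,State,Elapsed,ReqMem,MaxRSS,MaxVMSize,AllocCPUS
)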

Do you have any clue about what could cause such an error? I've attached the slurm/config.yaml in case it's needed.
config.yaml.txt

Thank you very much,

Max Brault

@tsackton
Contributor

tsackton commented Nov 5, 2024

It looks like the particular bwa_mem job whose log you posted is failing because an existing set of temp files from the samtools sort command is already present, most likely left over from a previous run that crashed before cleanup could finish. I would initially try deleting the "SRR12460375.bam.tmp.*.bam" files and rerunning.
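
As a minimal sketch of that cleanup, using the paths from the log above (adjust the glob if your layout differs):

    # remove the leftover samtools sort temp files for this run/sample
    rm -f results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.*.bam

Then rerun your usual snakemake invocation; the rule's .bam/.bai outputs should be regenerated when bwa_map is re-executed.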

This doesn't look like a slurm / resources error, although I'm not entirely sure why the error in the command is not being propagated to slurm.
