Unknown errors with big datasets #229

Open
max-hence opened this issue Nov 5, 2024 · 1 comment

@max-hence

Hi,

I managed to make snpArcher work on a dataset with medium-sized genomes (400 Mb), but I get errors for bigger genomes (2 Gb), where jobs take much more time and resources. I think I set slurm/config.yaml properly to request large resources, and the cluster I'm using is supposed to handle such settings, but I still get this kind of error, for instance at the bwa_map rule:

Error in rule bwa_map:
    message: SLURM-job '13562883' failed, SLURM status is: 'NODE_FAIL'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 252
    input: results/GCA_902167145.1/data/genome/GCA_902167145.1.fna, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_1.fastq.gz, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_2.fastq.gz, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.sa, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.pac, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.bwt, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.ann, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.amb, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.fai
    output: results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam, results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.bai
    log: logs/GCA_902167145.1/bwa_mem/SAMN15515513/SRR12460375.txt, /scratch/mbrault/snpcalling/zmays_parviglumis_PRJNA641889/.snakemake/slurm_logs/rule_bwa_map/GCA_902167145.1_SAMN15515513_SRR12460375/13562883.log (check log file(s) for error details)
    conda-env: /scratch/mbrault/snpcalling/zmays_parviglumis_PRJNA641889/.snakemake/conda/8ca636c300f965c6ac864e051945e276_
    shell:
        bwa mem -M -t 8 -R '@RG\tID:6E8\tSM:SAMN15515513\tLB:6E8\tPL:ILLUMINA' results/GCA_902167145.1/data/genome/GCA_902167145.1.fna results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_1.fastq.gz results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_2.fastq.gz 2> logs/GCA_902167145.1/bwa_mem/SAMN15515513/SRR12460375.txt | samtools sort -o results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam - && samtools index results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.bai
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 13562883

And in the .snakemake/slurm_logs/rule_bwa_map/GCA_902167145.1_SAMN15515513_SRR12460375/13562883.log:

localrule bwa_map:
    input: results/GCA_902167145.1/data/genome/GCA_902167145.1.fna, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_1.fastq.gz, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_2.fastq.gz, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.sa, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.pac, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.bwt, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.ann, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.amb, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.fai
    output: results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam, results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.bai
    log: logs/GCA_902167145.1/bwa_mem/SAMN15515513/SRR12460375.txt
    jobid: 0
    benchmark: benchmarks/GCA_902167145.1/bwa_mem/SAMN15515513_SRR12460375.txt
    reason: Forced execution
    wildcards: refGenome=GCA_902167145.1, sample=SAMN15515513, run=SRR12460375
    threads: 32
    resources: mem_mb=100000, mem_mib=95368, disk_mb=43245, disk_mib=41242, tmpdir=/tmp, mem_mb_reduced=90000, slurm_partition=ecobio,genouest, slurm_account=mbrault, runtime=11520, cpus_per_task=32

Activating conda environment: .snakemake/conda/8ca636c300f965c6ac864e051945e276_

[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0000.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0001.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0002.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0003.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0004.bam" : File exists
etc...

But still, when I look at that particular job on the SLURM cluster, I find no errors:

JobID           JobName      State    Elapsed     ReqMem     MaxRSS  MaxVMSize  AllocCPUS 
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- 
13562883     e5af4995-+  COMPLETED   08:26:43    100000M                               32 
13562883.ba+      batch  COMPLETED   08:26:43               129620K   5013688K         32 
13562883.ex+     extern  COMPLETED   08:26:43                  912K    144572K         32 
13562883.0   python3.11  COMPLETED   08:26:05             26104628K  33043144K         32 
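
(A sacct query along these lines, using standard format fields and the external_jobid reported by Snakemake above, is one way to pull this kind of per-step summary:

    sacct -j 13562883 --format=JobID,JobName,State,Elapsed,ReqMem,MaxRSS,MaxVMSize,AllocCPUS
)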

Do you have any clue about what could cause such an error? I've attached the slurm/config.yaml in case it's needed.
config.yaml.txt

Thank you very much,

Max Brault

@tsackton
Contributor

tsackton commented Nov 5, 2024

It looks like the particular bwa_mem job whose log you posted is failing because an existing set of temp files from the samtools sort command is already present, most likely left over from a previous run that crashed before cleanup could finish. I would initially try deleting the "SRR12460375.bam.tmp.*.bam" files and rerunning.
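
As a minimal sketch of that cleanup, using the paths from the log above (adjust the glob if your layout differs):

    # remove the leftover samtools sort temp files for this run/sample
    rm -f results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.*.bam

Then rerun your usual snakemake invocation; the rule's .bam/.bai outputs should be regenerated when bwa_map is re-executed.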

This doesn't look like a slurm / resources error, although I'm not entirely sure why the error in the command is not being propagated to slurm.
