Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DB2vcf / CombineGVCFs error out with MNPs #219

Open
shelbirussell opened this issue Sep 4, 2024 · 0 comments
Open

DB2vcf / CombineGVCFs error out with MNPs #219

shelbirussell opened this issue Sep 4, 2024 · 0 comments

Comments

@shelbirussell
Copy link

I am running into an error that I initially wanted some help in resolving, but now, I just want to notify you about its existence. The error involves an issue in CombineGVCFs that prevents processing Multi Nucleotide Polymorphisms.

I am running a small set of 10 samples on a slurm-managed cluster. Previously, eight of these samples (Drosophila simulans collected in 2023) ran successfully. When we added two more samples (D. simulans from 1984 and 1976), the workflow failed. These two samples have been previously mapped by bwa-mem for coverage calculations, so I had no reason to think the data are bad.... now I know that there might be an issue with that 1976 sample (e.g., two flies in the gDNA).

The workflow runs smoothly until the DB2vcf step, when GenotypeGVCFs is run, then it errors out. The log and error files are attached.

GATK throws exception:
"org.broadinstitute.hellbender.exceptions.GATKException: Error creating iterator over file gendb://results/Dsim_wRi_wMel_genome/genomics_db_import/DB_L0071"

"Caused by: java.io.IOException: GenomicsDB JNI Error: VariantQueryProcessorException : Unhandled overlapping variants at columns 127228851 and 127228852 for row 0"

0071.txt
5019711.log

I did a bit of research on this error and discovered that:
"CombineGVCFs will not work if your input file contains MNPs which would result in this type of error"

I found the reference position corresponding to columns 127228851 and 127228852 by looking at the offset column of the results/Dsim_wRi_wMel_genome/data/genome/Dsim_wRi_wMel_genome.fna.fai file:
NW_025416844.1 38922 127196239 80 81
NW_025416845.1 31969 127235789 80 81

127228851-127196239=> at position 32612

So, the potential MNP is in the back ~1/6 of NW_025416844.1.... I made some subsetted vcfs of the variants on this scaffold. There were only ~100-500 per sample. See the attached files.

Dsimulans_Mer23_wRi_line8A.NW_025416844.1.partial.vcf.txt
Dsimulans_w501.NW_025416844.1.partial.vcf.txt
Dsimulans_Riv84_wRi-3.NW_025416844.1.partial.vcf.txt

I don't see any variants at position 32612 (the max is 28557), but there are MNPs in the w501 sample. Position 9446 is a MNP in itself, with the call: GTTT,GTTTTT,<NON_REF>,G,GTTTTTTT. There are different SNP vs deletion calls between samples at position 9490.

Honestly, sample w501 looks problematic. Scaffold NW_025416844.1 is unplaced, but the rest of the genome looks enriched in "ExcessHet=3.0103" calls. I'm going to get rid of this sample for now - it might consist of two flies and I need to track this down.

I still think this error is of concern to you because someone will come across MNPs sometime, especially in polyploid systems. If I end up sequencing enough w501 flies to recover these MNPs among diploid individuals, it will become a problem for me again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant