Skip to content

Latest commit

 

History

History
28 lines (23 loc) · 1.69 KB

README.md

File metadata and controls

28 lines (23 loc) · 1.69 KB

rdeval-manuscript

Scripts used for the rdeval manuscript.

Compute raw reads summary statistics for the Vertebrate Genomes Project

The following commands show how to access SRA accession numbers and metadata from NCBI, download the raw reads and generate summary statistics with rdeval. The process is heavily parallelized, with approximately 4 cores used by each rdeval process, and 8 experiments processed in parallel (32 cores total). SRA samples can optionally be randomly sampled.

First, we can use NCBI datasets and jq to parse the VGP umbrella BioProject and collect all SRA accessions associated with its genomes (here is jq's manual):

datasets summary genome accession PRJNA489243 > vgp-metadata.json
cat vgp-metadata.json | jq -r '.reports[] | .accession + "," + .assembly_info.assembly_name + "," + (.assembly_info.biosample.sample_ids[] | select(.db=="SRA").value)' > raw_data_metadata.ls
cat raw_data_metadata.ls

Then we can download the data (requires fasterq-dump in $PATH) and compute summary statistics by generating .rd files (requires rdeval in $PATH):

bash rdeval_parallel.sh raw_data_metadata.ls 0.5 // sampling fraction, 1 for full data set

We aggregate metrics from multiple .rd files by either:

rdeval *.rd

Or if they don't fit in memory (i.e. in the hundreds):

for file in *.rd; do rdeval --tab $file | cut -f2 | sed "1q;d"; done | awk '{sum += $1}END{print sum}' # total number of reads
for file in *.rd; do rdeval --tab $file | cut -f2 | sed "2q;d"; done | awk '{sum += $1}END{print sum}' # total number of bases