reorganize table files
gwct committed Jan 16, 2025
1 parent d44fad9 commit 580218c
Showing 33 changed files with 88 additions and 42 deletions.
23 changes: 23 additions & 0 deletions data/glossary-tables/bioinformatics/assemblers.csv
@@ -0,0 +1,23 @@
Program,Author,Year,Contig assembly method,Scaffolding method,Use cases,Link,Paper,Used
ALLPATHS-LG,Gnerre,2011,de Bruijn graph and sequence graphs,Read pairs,Large genomes with both short and long reads,http://software.broadinstitute.org/allpaths-lg/blog/?page_id=12,https://doi.org/10.1073/pnas.1017351108,N
hifiasm,Cheng,2020,Error corrected overlap-layout to preserve haplotypes,NA,PacBio HiFi reads,https://github.com/chhylp123/hifiasm,https://arxiv.org/abs/2008.01237,N
SPAdes,Bankevich,2012,de Bruijn graph,Read pairs and small gap repeat resolution; can also use long reads or previously inferred contigs,Good for small genomes or targeted sequencing (e.g. exomes),https://github.com/ablab/spades,https://doi.org/10.1089/cmb.2012.0021,Y
Discovar de novo,Weisenfeld,2014,de Bruijn graph and lines,NA,Single 2x250bp library,https://software.broadinstitute.org/software/discovar/blog/,https://dx.doi.org/10.1038%2Fng.3121,N
Supernova,Weisenfeld,2017,de Bruijn graph and lines,Read pairs and linked read barcodes,Large genomes with linked reads for phased genome assemblies,https://github.com/10XGenomics/supernova,https://doi.org/10.1101/gr.235812.118,N
Canu,Koren,2017,Overlap-layout-consensus,NA,Long reads,https://github.com/marbl/canu,https://dx.doi.org/10.1101/gr.215087.116,N
HiCanu,Nurk,2020,Overlap-layout-consensus,NA,PacBio HiFi reads,,https://doi.org/10.1101/gr.263566.120,N
Flye,Kolmogorov,2019,Repeat graph,NA,Long reads,https://github.com/fenderglass/Flye/,https://doi.org/10.1038/s41587-019-0072-8,Y
platanus,Kajitani,2014,de Bruijn graph,Read pairs,Short reads for genomes with high heterozygosity,http://platanus.bio.titech.ac.jp/,https://dx.doi.org/10.1101%2Fgr.170720.113,N
opera-lg,Gao,2016,NA,Paired reads and long reads,Scaffolding of repeat-rich genomes,https://sourceforge.net/p/operasf/wiki/The%20OPERA%20wiki/,https://doi.org/10.1186/s13059-016-0951-y,N
agouti,Zhang,2016,NA,RNA-seq reads,Scaffolding of large genomes,https://github.com/svm-zhang/AGOUTI,https://doi.org/10.1186/s13742-016-0136-3,N
ABySS,Jackman,2017,de Bruijn graph; all possible k-mers,"Mate pairs, linked reads, or long reads",Short read libraries for genomes up to 100Mb; Transcriptomes with Trans-ABySS,https://github.com/bcgsc/abyss,https://doi.org/10.1101/gr.214346.116,N
Velvet,Zerbino and Birney,2008,de Bruijn graph,Read pairs,Short read assembly,https://github.com/dzerbino/velvet,https://doi.org/10.1101/gr.074492.107,N
SOAPdenovo2,Luo,2012,de Bruijn graph,Read pairs,Short read assembly,https://github.com/aquaskyline/SOAPdenovo2,https://doi.org/10.1186/2047-217X-1-18,N
MaSuRCA,Zimin,2013,Overlap-layout-consensus on unique super-reads,"Mate pairs, linked reads, or long reads",Mixed short read libraries of large genomes,https://github.com/alekseyzimin/masurca,https://doi.org/10.1093/bioinformatics/btt476,N
CABOG,NA,NA,NA,NA,NA,http://wgs-assembler.sourceforge.net/wiki/index.php?title=Main_Page,NA,N
Falcon,Chin,2016,String graph,NA,PacBio long reads for diploid genome assembly,https://github.com/PacificBiosciences/pb-assembly,https://dx.doi.org/10.1038%2Fnmeth.4035,N
Miniasm,Li,2016,Overlap-layout,NA,Long reads,https://github.com/lh3/miniasm,https://doi.org/10.1093/bioinformatics/btw152,N
HINGE,Kamath,2017,Overlap-layout-consensus with hinging,NA,Long reads,https://hingeassembler.github.io/,https://doi.org/10.1101/gr.216465.116,N
Abruijn,Lin,2016,A-Bruijn graph,NA,Long reads,https://github.com/bioreps/ABruijn,https://doi.org/10.1073/pnas.1604560113,N
MEGAHIT,Li,2015,de Bruijn graph,,Short read assembly of metagenomes,https://github.com/voutcn/megahit,https://doi.org/10.1093/bioinformatics/btv033,N
Peregrine,Chin and Khalak,2020,Overlap-layout-consensus with shimmer indexing,NA,Fast long reads assembly,https://github.com/cschin/Peregrine,https://doi.org/10.1101/705616,N
9 changes: 9 additions & 0 deletions data/glossary-tables/bioinformatics/formats.csv
@@ -0,0 +1,9 @@
Format,Use,Link,Specs
FASTA,Stores sequence data.,https://en.wikipedia.org/wiki/FASTA_format,NA
FASTQ,Stores sequence data and quality scores.,https://en.wikipedia.org/wiki/FASTQ_format,https://doi.org/10.1093/nar/gkp1137
SAM,Sequence Alignment Map format. Stores information about reads mapped to a reference genome.,https://en.wikipedia.org/wiki/SAM_(file_format),https://samtools.github.io/hts-specs/SAMv1.pdf
BAM,Binary Alignment Map format. The compressed binary version of SAM format.,https://en.wikipedia.org/wiki/SAM_(file_format),https://samtools.github.io/hts-specs/SAMv1.pdf
CRAM,Another compressed format to store read mapping information.,https://en.wikipedia.org/wiki/CRAM_(file_format),https://samtools.github.io/hts-specs/CRAMv3.pdf
VCF,Variant Call Format. Used to store information about variants inferred for a given sample(s).,https://en.wikipedia.org/wiki/Variant_Call_Format,https://samtools.github.io/hts-specs/VCFv4.2.pdf
BCF,Binary variant Call Format. The binary compressed version of a VCF.,https://en.wikipedia.org/wiki/Variant_Call_Format,https://samtools.github.io/hts-specs/VCFv4.2.pdf
BED,Stores coordinates of regions of interest,https://en.wikipedia.org/wiki/BED_(file_format),https://bedtools.readthedocs.io/en/latest/content/general-usage.html
7 changes: 7 additions & 0 deletions data/glossary-tables/bioinformatics/mappers.csv
@@ -0,0 +1,7 @@
Program,Author,Year,Use cases,Link,Paper,Used
BWA,Li and Durbin,2010,Short read alignment,http://bio-bwa.sourceforge.net/,https://doi.org/10.1093/bioinformatics/btp324,Y
TopHat2,Kim,2013,Mapping RNA-seq reads,https://ccb.jhu.edu/software/tophat/index.shtml,https://doi.org/10.1186/gb-2013-14-4-r36,N
Minimap2,Li,2018,Long read mapping and whole genome alignment,https://github.com/lh3/minimap2,https://doi.org/10.1093/bioinformatics/bty191,Y
bbmap,Bushnell,2014,Short and long read mapper with many extras,https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/,https://www.osti.gov/biblio/1241166,N
Bowtie2,Langmead and Salzberg,2012,Short read alignment,http://bowtie-bio.sourceforge.net/bowtie2/index.shtml,https://dx.doi.org/10.1038%2Fnmeth.1923,N
SOAP2,Li,2009,Short read alignment,https://sourceforge.net/projects/soapdenovo2/,https://doi.org/10.1093/bioinformatics/btp336,N
8 changes: 8 additions & 0 deletions data/glossary-tables/bioinformatics/other.csv
@@ -0,0 +1,8 @@
Program,Author,Year,Use cases,Link,Paper,Used
bedtools,Quinlan and Hall,2010,Perform operations on sets of genomic coordinates.,https://bedtools.readthedocs.io/en/latest/,https://doi.org/10.1093/bioinformatics/btq033,Y
bcftools,NA,NA,Perform operations on VCF and BCF formatted files.,http://samtools.github.io/bcftools/,NA,Y
samtools,Li,2009,Perform operations on SAM/BAM/CRAM formatted files.,http://www.htslib.org/download/,https://doi.org/10.1093/bioinformatics/btp352,Y
Picard tools,Broad Institute,2019,Performs many operations on SAM/BAM/CRAM and VCF files.,https://github.com/broadinstitute/picard,http://broadinstitute.github.io/picard/,Y
mosdepth,Pedersen and Quinlan,2018,Calculates read depth from mapped reads.,https://github.com/brentp/mosdepth,https://doi.org/10.1093/bioinformatics/btx699,N
pseudo-it,Sarver,2017,Iterative read mapping for pseudo-reference assembly.,https://github.com/goodest-goodlab/pseudo-it,https://doi.org/10.1093/gbe/evx034,Y
Referee,Thomas and Hahn,2018,Assign per-base quality scores to genome assemblies.,https://gwct.github.io/referee/,https://doi.org/10.1093/gbe/evz088,N
20 changes: 20 additions & 0 deletions data/glossary-tables/bioinformatics/terms.csv
@@ -0,0 +1,20 @@
Term,Definition
Sequence libraries,A sample of DNA that has been processed to be sequenced.
Reads,Fragmented and overlapping pieces of a DNA strand that are sequenced.
Phred quality scores,"A scaled probability that a given inference (usually a called base) is incorrect. The probability of error, P(e), is scaled as Q = -10 log10 P(e)."
Short reads,"Reads from first and second generation sequencing such as Sanger, Illumina, IonTorrent, etc. Short reads can range from 30-1000bp long."
Read pair,"Many short read sequencing technologies sequence from both ends of a DNA fragment, resulting in a pair of sequenced reads that come from said fragment."
Adapter,A short piece of DNA that is ligated to the short fragment to be sequenced. The adapter allows the fragment to be affixed to a physical medium (such as a flow cell) to facilitate amplification and sequencing.
Insert size,The size of the DNA fragment between the adapter sequences.
Mate pairs,Long-insert paired end reads prepared by circularizing longer DNA fragments.
Jumping libraries,Junction-fragment libraries. Mate pair libraries.
Long reads,Reads from single-molecule sequencing technology such as PacBio SMRT and Oxford Nanopore. Long reads can range from 1000-100000+bp long.
"Genome assembly, Assembly, de novo Assembly","1. The process by which small overlapping parts of the genome are reconstructed into longer contiguous sequences, 2. A sequence that has undergone the assembly process."
Contigs,Assembled reads. Contig assembly is usually done with a graph-based representation (e.g. de Bruijn graphs) of overlapping sequence reads.
Scaffolds,Contigs that have been joined together to form longer sequences. Scaffolding is usually done using read pair information or long reads.
Reference genome,An already assembled genome to which you can compare newly sequenced reads or genomes.
Read mapping,The process of aligning reads from a newly sequenced genome to a reference genome.
Mapping quality,A (usually Phred-scaled) probability that a given read has been mapped incorrectly.
Reference-guided assembly,"1. The process of using read mapping to reconstruct the genome from a set of reads, 2. A sequence that has undergone the reference-guided assembly process."
Reference bias,The phenomenon of a set of mapped reads appearing to resemble (through lower divergence) the reference genome more closely than they actually do because reads containing the most variation were not mapped.
Iterative mapping,"The process of mapping reads to a reference genome, generating a reference-guided assembly, and then repeating the process this time mapping to the new reference guided assembly. Done to reduce reference bias."
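The Phred scaling used in the "Phred quality scores" and "Mapping quality" definitions above can be illustrated with a small worked example. This is a minimal sketch, assuming the base-10 logarithm conventional for Phred scores; the function names `phred_from_error` and `error_from_phred` are hypothetical and are not part of this repository.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Convert an error probability P(e) to a Phred-scaled score Q = -10 * log10(P(e))."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("P(e) must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Invert the scaling: P(e) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# A 1-in-1000 chance of error corresponds to a score of about 30,
# and a score of 20 corresponds to about a 1-in-100 chance of error.
q30 = phred_from_error(0.001)
p_q20 = error_from_phred(20)
```

Each 10-point increase in the score corresponds to a tenfold drop in the probability of error, which is why the scale is convenient for both base calls and mapping quality.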
25 files renamed without changes.
14 changes: 0 additions & 14 deletions data/tables/formats.csv

This file was deleted.

7 changes: 0 additions & 7 deletions data/tables/other.csv

This file was deleted.

42 changes: 21 additions & 21 deletions docs/resources/glossary.md
@@ -35,57 +35,57 @@ Please feel free to suggest additions or edits.

\* These terms are used somewhat interchangeably colloquially

{{ read_csv('data/tables/general.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/computing-general.csv') }}

## General programming terms

{{ read_csv('data/tables/programming-general.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/programming-general.csv') }}

### Programming constructs

{{ read_csv('data/tables/programming-constructs.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/programming-constructs.csv') }}

### Data representation

{{ read_csv('data/tables/programming-data.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/programming-data.csv') }}

### Functions

{{ read_csv('data/tables/programming-functions.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/programming-functions.csv') }}

### Operators

{{ read_csv('data/tables/programming-operators.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/programming-operators.csv') }}

### Errors

{{ read_csv('data/tables/programming-errors.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/programming-errors.csv') }}

### Programming tools

{{ read_csv('data/tables/programming-tools.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/programming-tools.csv') }}

## Python terms

*Note that while we give some examples of syntax, the format of these tables does not lend itself to exact typing, so please read further documentation if needed and for more information on Python's syntax.*

{{ read_csv('data/tables/python.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/python.csv') }}

\* Note: While R is primarily a functional programming language and not inherently object-oriented, the subsequent tables use OOP terms and provide R examples because R can emulate OOP behavior.

### Python data types

{{ read_csv('data/tables/python-data-types.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/python-data-types.csv') }}

### Python data structures

{{ read_csv('data/tables/python-data-structures.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/python-data-structures.csv') }}

### Python operators

\* See below the table for examples of update operator usage in Python.

{{ read_csv('data/tables/python-operators.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/python-operators.csv') }}

\* Update operators are shortcuts to re-assign a variable to a new value based on the old one. For example, in Python one could add 3 to a number stored in a variable as follows:
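A minimal sketch of update-operator usage in Python follows. The variable name `count` and the values are illustrative assumptions and are not necessarily those used in the glossary itself.

```python
count = 5

# Explicit re-assignment: compute the new value from the old one.
count = count + 3

# Update operator: shorthand for the same re-assignment.
count += 3
```

Both statements add 3 to the value stored in `count`, so after running them in sequence the variable has been increased by 6 in total.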

@@ -113,23 +113,23 @@ This works for the other arithmetic operators as well. See the table for all ari

*Note that while we give some examples of syntax, the format of these tables does not lend itself to exact typing, so please read further documentation if needed and for more information on R's syntax.*

{{ read_csv('data/tables/r.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/r.csv') }}

### R data types

*Note: While these individual data types are not **iterable** in R, vectors made up of any data type inherit that type (*i.e.* a vector of numerics is itself numeric in type) and are iterable (see below)*

{{ read_csv('data/tables/r-data-types.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/r-data-types.csv') }}

### R data structures

{{ read_csv('data/tables/r-data-structures.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/r-data-structures.csv') }}

### R operators

*Note that R does not have **update operators** like Python does (see above).*

{{ read_csv('data/tables/r-operators.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/r-operators.csv') }}

## High performance computing (HPC) terms

@@ -138,25 +138,25 @@ particularly their page on <a href="https://docs.rc.fas.harvard.edu/kb/running-j

They also provide a <a href="https://docs.rc.fas.harvard.edu/kb/glossary/" target="_blank">more extensive glossary</a> for more term definitions.

{{ read_csv('data/tables/hpc.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/hpc.csv') }}

## Installing software

Installing software is a notoriously troublesome task, especially for beginners and when working on a server on which you don't have access to the **root** of the file system.

A couple of strategies have evolved to make this easier:

1. Environments: Portions of the *user's file system* that are adjusted so they can install and run software, giving the user full control.
2. Containers: Executable files that internally emulate the file system of the developer's computer, allowing the software in the container to be run without being explicitly installed on the user's computer.
1. Environments: Portions of the *user's file system* that are adjusted so they can install and run software, giving the user full control.
2. Containers: Executable files that internally emulate the file system of the developer's computer, allowing the software in the container to be run without being explicitly installed on the user's computer.

There are several ways to create environments and containers which are covered below. Additionally, different environment management systems may work with different package repositories and managers, so we go over some of those as well.

{{ read_csv('data/tables/installing-software.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/installing-software.csv') }}

## Git terms

<a href="https://git-scm.com/" target="_blank">Git</a> is a program that stores the history of files in any directory that has been initialized as a git repository. Used in conjunction with web-based platforms this makes for a powerful collaboration tool.
However, there are many terms associated with Git that may be confusing. In essence, many of these terms are simply other words for "a copy" or "copying" a directory, albeit with slight distinctions.
This table tries to define these terms clearly.

{{ read_csv('data/tables/git.csv') }}
{{ read_csv('data/glossary-tables/computing-programming/git.csv') }}
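Because this change moves every table and rewrites each `read_csv(...)` path in `docs/resources/glossary.md`, a quick check that all referenced files exist may help catch a missed rename. The following is a minimal sketch, not part of this commit: the helper names are hypothetical, and it assumes the paths inside `read_csv(...)` are resolved relative to the repository root, which depends on how the table-reading plugin is configured.

```python
import re
from pathlib import Path

# Matches read_csv('some/path.csv') or read_csv("some/path.csv") and captures the path.
READ_CSV_CALL = re.compile(r"read_csv\(\s*['\"]([^'\"]+)['\"]")


def referenced_tables(markdown_text: str) -> list[str]:
    """Return every path passed to read_csv(...) in the given Markdown text."""
    return READ_CSV_CALL.findall(markdown_text)


def missing_tables(markdown_text: str, root: Path) -> list[str]:
    """Return referenced paths that do not exist as files under `root` (assumed repo root)."""
    return [p for p in referenced_tables(markdown_text) if not (root / p).is_file()]
```

Run from the repository root, `missing_tables(Path("docs/resources/glossary.md").read_text(), Path("."))` would list any table path that no longer points at a file; an empty list suggests all references were updated.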
