mmariani123/nextflow_workflow

nextflow_workflow

I have written countless bioinformatics pipelines over the years, mostly using Bash scripting to organize program and analysis flow (using programs, packages, and libraries that I or others have written in C/C++, Perl, R, Python, Bash itself, or even MATLAB). I would generally organize these pipelines for submission to job schedulers (PBS/Torque, Slurm, etc.) on high-performance computing (HPC) clusters. I have found recently that Nextflow offers a well-organized, easy-to-use approach to crafting pipelines and employing containerized software dependencies, which avoids many of the versioning problems that often plague the bioinformatics and computational biology worlds. In addition, Nextflow automatically parallelizes and scales data processing, making it easier to run bioinformatics pipelines that often require substantial computational resources.

Here we have an example of how I, like many other computational biologists and bioinformaticians, have organized my numerous scripts and programs over the years according to function or project.

image

Using the Nextflow architecture, we can create and organize workflows based on these scripts to enhance code reusability and portability (e.g. among collaborators or across systems). Here I present my own end-to-end RNA-Seq analysis Nextflow workflow, built from scripts that I had created for various previous (bulk) RNA-Seq analyses. It uses the publicly available Mus musculus RNA-Seq data (.fastq format) from the now-classic "A Beginner’s Guide to Analysis of RNA Sequencing Data" (PMID: 29624415), available from the USA National Institutes of Health (NIH) National Library of Medicine (NLM) under accession PRJNA450151 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA450151).

nextfoww_example_workflow

Here we have my Nextflow workflow (file: nextflow_workflow.nf) running in the Ubuntu app on Windows 10 with Windows Subsystem for Linux 2 (WSL2) installed and enabled. Development and testing have been performed on an ASUS PC with 64GB RAM and a 1TB hard drive. Note that development takes place within a dedicated Miniconda environment: "nextflow_testing".

image

Additionally, one can use Gitpod to develop and test a Nextflow project and link it to the corresponding GitHub project page, in this case: https://github.com/mmariani123/nextflow_workflow . Changes to the Nextflow files (.nf), for example, can then be pushed to GitHub.

gitpod_commit_1

One can also use the file explorer in Gitpod to navigate through files generated by our Nextflow pipeline, such as the quality control output from FastQC.

image

It is important to note that the free tier of Gitpod only allows for a maximum of 8GB of RAM and 30GB of storage, quite low for today's bioinformatics needs. Additionally, GitHub itself has a maximum file size of 100MB; larger files require Git Large File Storage (Git LFS). Ideally, this Nextflow pipeline would be run on an HPC cluster (either physical or cloud) with adequate computational resources. At the moment, we can see output from the alignment of the raw FASTQ data from PRJNA450151 to the mm39 Mus musculus reference genome (.fasta) and refGene annotation file (.gtf).

star_alignment_output
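
As a rough sketch of what the alignment step runs for each sample, the Python snippet below assembles a STAR command line without executing it. The index directory, FASTQ file names, output prefix, and thread count are hypothetical placeholders, not the values used in nextflow_workflow.nf; the flags themselves are standard STAR options.

```python
from pathlib import Path


def build_star_command(genome_dir, fastq_files, out_prefix, threads=4):
    """Return a STAR alignment command as an argument list (nothing is executed)."""
    fastq_files = [str(f) for f in fastq_files]
    if not 1 <= len(fastq_files) <= 2:
        raise ValueError("expected one (single-end) or two (paired-end) FASTQ files")
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", str(genome_dir),        # directory holding the STAR index
        "--readFilesIn", *fastq_files,
        "--outFileNamePrefix", str(out_prefix),
        "--outSAMtype", "BAM", "SortedByCoordinate",
    ]
    # Gzipped FASTQ input has to be decompressed on the fly.
    if all(Path(f).suffix == ".gz" for f in fastq_files):
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(" ".join(build_star_command("star_index_mm39", ["sample_1.fastq.gz"], "sample_1_")))
```

Returning an argument list rather than a single string keeps the command safe to hand to `subprocess.run` and easy to inspect before anything is submitted to a scheduler.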

The pipeline concludes by counting how many reads align to each gene (as defined in the .gtf file) using the featureCounts program from the Subread software package. These ".counts" files can then be used as input to DESeq2 for differential gene expression analysis between the 0, 2, and 24 hour reperfusion groups (again, see PMID: 29624415 for the input RNA-Seq data used for this workflow). The statistically significant differentially expressed genes output by DESeq2 can then be used as input to the clusterProfiler software package to identify significantly up- and down-regulated gene pathways.
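
The hand-off from featureCounts to DESeq2 usually means combining the per-sample count files into one gene-by-sample matrix. The snippet below is a minimal sketch of that step in Python, assuming each ".counts" file is a default featureCounts table (a leading `#` comment line, then the columns Geneid, Chr, Start, End, Strand, Length, and one count column). The sample names and the tiny demo table are hypothetical, and this is not necessarily how the workflow itself merges counts.

```python
import csv
import io


def read_featurecounts(handle):
    """Parse one featureCounts table into {gene_id: count} using its first count column."""
    rows = csv.reader((line for line in handle if not line.startswith("#")), delimiter="\t")
    header = next(rows)
    if header[:6] != ["Geneid", "Chr", "Start", "End", "Strand", "Length"]:
        raise ValueError("unexpected featureCounts header")
    # Column 7 holds the counts for the first (here: only) BAM file.
    return {row[0]: int(row[6]) for row in rows if row}


def merge_counts(tables):
    """Combine {sample: {gene: count}} into (samples, {gene: [count per sample]})."""
    samples = list(tables)
    genes = sorted(set().union(*(t.keys() for t in tables.values())))
    # Genes missing from a sample's table are filled with zero.
    matrix = {g: [tables[s].get(g, 0) for s in samples] for g in genes}
    return samples, matrix


if __name__ == "__main__":
    # Synthetic table for illustration only, not real data from PRJNA450151.
    demo = (
        "# Program:featureCounts; Command:(hypothetical)\n"
        "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample_A.bam\n"
        "GeneX\tchr1\t100\t900\t+\t801\t12\n"
    )
    print(merge_counts({"sample_A": read_featurecounts(io.StringIO(demo))}))
```

The resulting matrix has the shape DESeq2 expects for its count input: one row per gene and one column per sample, with raw (unnormalized) integer counts.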

Below we see a gene expression heatmap across reperfusion conditions, followed by a principal component analysis (PCA) of the samples based on normalized gene expression.

deseq2_heatmap

deseq2_pca

Next we see the results of the above-mentioned pathway analysis (top 10 most significant pathways) for the significantly up- and down-regulated genes identified between the 2hr and 24hr reperfusion groups using DESeq2.

pathways
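
Before pathway analysis, the DESeq2 results have to be split into up- and down-regulated gene lists. The sketch below illustrates one way to do that in Python, assuming the results were exported to CSV with DESeq2's standard `log2FoldChange` and `padj` columns plus a gene identifier column named `gene` (a hypothetical name). The 0.05 adjusted p-value cutoff is likewise an assumption; the cutoff actually used in this workflow is not stated here.

```python
import csv
import io


def split_de_genes(handle, padj_cutoff=0.05, gene_column="gene"):
    """Return (up, down) gene lists from a DESeq2 results CSV."""
    up, down = [], []
    for row in csv.DictReader(handle):
        # DESeq2 reports NA for genes it filtered out or could not test.
        if row["padj"] in ("NA", "") or row["log2FoldChange"] in ("NA", ""):
            continue
        if float(row["padj"]) >= padj_cutoff:
            continue
        lfc = float(row["log2FoldChange"])
        if lfc > 0:
            up.append(row[gene_column])
        elif lfc < 0:
            down.append(row[gene_column])
    return up, down


if __name__ == "__main__":
    # Synthetic rows for illustration only.
    demo = "gene,log2FoldChange,padj\nGeneX,2.1,0.001\nGeneY,-1.4,0.02\nGeneZ,0.3,NA\n"
    print(split_de_genes(io.StringIO(demo)))
```

Each list can then be passed to an enrichment tool such as clusterProfiler separately, so that up- and down-regulated pathways are reported independently.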

Finally, we have the flowchart generated by Nextflow, which provides a map of our pipeline:

flowchart
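
Nextflow can write such a flowchart itself when the workflow is launched with its `-with-dag` option. The snippet below is a small Python sketch that builds the corresponding command line; the output file name `flowchart.html` is a hypothetical choice, and rendering to image formats such as PNG may additionally require Graphviz to be installed.

```python
def build_nextflow_command(script="nextflow_workflow.nf", dag_file="flowchart.html", resume=False):
    """Return a `nextflow run` command that also writes the workflow DAG to dag_file."""
    cmd = ["nextflow", "run", script, "-with-dag", dag_file]
    if resume:
        # Reuse cached results from a previous run where possible.
        cmd.append("-resume")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_nextflow_command()))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the environment where Nextflow is installed.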

Stay tuned for more updates!

About

Practicing Nextflow - Example RNA-Seq Nextflow workflow.
