nextflow_workflow

I have written countless bioinformatics pipelines over the years, mostly using Bash scripting to organize program and analysis flow (using programs, packages, and libraries that I or others have written in C/C++, Perl, R, Python, Bash itself, or even MATLAB). I would generally organize these pipelines for submission to job schedulers (PBS/Torque, Slurm, etc.) on high-performance computing (HPC) clusters. I have recently found that Nextflow offers a well-organized, easy-to-use approach to crafting pipelines and employing containerized software dependencies, which avoids many of the versioning problems that often plague bioinformatics and computational biology. In addition, Nextflow automatically scales and parallelizes data processing, making it easier to run bioinformatics pipelines, which often require substantial computational resources.

Here we have an example where I, like many other computational biologists and bioinformaticians, have organized my numerous scripts and programs over the years according to function or project.

Using the Nextflow architecture, we can create and organize workflows based on these scripts to enhance code reusability and portability (e.g., among collaborators or across systems). Here I present my own end-to-end RNA-Seq analysis Nextflow workflow, built from scripts that I created for various previous (bulk) RNA-Seq analyses. It uses the publicly available Mus musculus RNA-Seq data (.fastq format) from the now-classic "A Beginner’s Guide to Analysis of RNA Sequencing Data" (PMID: 29624415), available from the USA National Institutes of Health (NIH), National Library of Medicine (NLM) under accession PRJNA450151 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA450151).

Here we have my Nextflow workflow (file: nextflow_workflow.nf) running in the Ubuntu app on Windows 10 with Windows Subsystem for Linux 2 (WSL2) installed and enabled. Development and testing were performed on an ASUS PC with 64GB RAM and a 1TB hard drive, within a dedicated Miniconda environment named "nextflow_testing".

Additionally, one can use Gitpod to develop and test a Nextflow project and link it to the corresponding GitHub project page, in this case: https://github.com/mmariani123/nextflow_workflow . Changes to the Nextflow files (.nf), for example, can then be pushed to GitHub.

One can also use the file explorer in Gitpod to navigate through files generated by our Nextflow pipeline, such as the quality control output from FastQC.

It is important to note that the free tier of GitHub only allows for a maximum of 8GB of RAM and 30GB of storage, which is quite low for today's bioinformatics needs. Additionally, GitHub itself has a maximum file size of 100MB; for larger files, Git Large File Storage (LFS) will need to be employed. Ideally, this Nextflow pipeline would be run on an HPC cluster (either physical or cloud) with adequate computational resources. At the moment, we can see output from the alignment of the raw fastq data from PRJNA450151 to the mm39 Mus musculus reference genome (.fasta) and refGene annotation file (.gtf).
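Since pipeline outputs (alignments, references) can easily exceed the 100MB per-file limit mentioned above, it can help to scan the project directory before pushing. The following is a minimal sketch, not part of this repository: the function name `find_oversized_files` and the constant are hypothetical, and the limit is assumed to be 100 MiB.

```python
import os

# Assumption: the per-file limit is 100 MiB; adjust if your host differs.
FILE_LIMIT_BYTES = 100 * 1024 * 1024


def find_oversized_files(root, limit=FILE_LIMIT_BYTES):
    """Return (path, size_in_bytes) for files under `root` larger than `limit`.

    Results are sorted from largest to smallest. The .git directory and
    symbolic links are skipped.
    """
    oversized = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            if size > limit:
                oversized.append((path, size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Calling `find_oversized_files(".")` from the repository root lists candidates for Git LFS tracking or for exclusion via `.gitignore`.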

The pipeline concludes by counting how many fastq reads align to each gene (as defined in the .gtf file) using the featureCounts program from the Subread software package. These ".counts" files can then be used as input to DESeq2 for differential gene expression analysis between the 0, 2, and 24 hour reperfusion groups (again, see PMID: 29624415 for the input RNA-Seq data used for this workflow). The statistically significant differentially expressed genes output by DESeq2 can then be used as input to the clusterProfiler software package to identify significantly up- and down-regulated gene pathways.
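DESeq2 expects a single gene-by-sample matrix of raw counts, so the per-sample ".counts" files need to be combined first. The sketch below shows one way this could be done; it is not the script used in this repository. It assumes the standard featureCounts table layout (a `#` comment line, a header row, gene ID in the first column) with one sample per file so that the count sits in the last column, and it assumes that a gene missing from a file can be treated as zero. The function names, sample names, and toy tables are hypothetical.

```python
import csv
import io


def read_featurecounts(handle):
    """Parse one featureCounts table into {gene_id: count}.

    Assumes one sample per file, so the count is the last column.
    """
    # Skip the leading '#' comment line(s) that featureCounts writes.
    lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(lines, delimiter="\t")
    next(rows)  # header: Geneid, Chr, Start, End, Strand, Length, <bam>
    return {row[0]: int(row[-1]) for row in rows if row}


def merge_counts(tables):
    """Combine {sample: {gene: count}} into (samples, matrix_rows).

    Each row is [gene_id, count_sample1, count_sample2, ...]; genes absent
    from a sample are filled with 0 (an assumption of this sketch).
    """
    samples = list(tables)
    genes = sorted(set().union(*(t.keys() for t in tables.values())))
    matrix = [[g] + [tables[s].get(g, 0) for s in samples] for g in genes]
    return samples, matrix


# Toy tables for illustration only; these are not real results.
HEADER = "Geneid\tChr\tStart\tEnd\tStrand\tLength\t"
TOY_0H = (
    "# Program:featureCounts (toy example)\n"
    + HEADER + "sample_0h.bam\n"
    "GeneA\tchr1\t1\t100\t+\t100\t12\n"
    "GeneB\tchr2\t1\t200\t-\t200\t0\n"
)
TOY_2H = (
    "# Program:featureCounts (toy example)\n"
    + HEADER + "sample_2h.bam\n"
    "GeneA\tchr1\t1\t100\t+\t100\t7\n"
    "GeneB\tchr2\t1\t200\t-\t200\t3\n"
    "GeneC\tchr3\t1\t300\t+\t300\t5\n"
)

tables = {
    "sample_0h": read_featurecounts(io.StringIO(TOY_0H)),
    "sample_2h": read_featurecounts(io.StringIO(TOY_2H)),
}
samples, matrix = merge_counts(tables)
```

In practice the toy strings would be replaced by opened ".counts" files, and the resulting rows written out as a tab-separated table for loading into R.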

Below we see a gene expression heatmap across reperfusion conditions, followed by a principal component analysis (PCA) of the samples based on normalized gene expression.

Next we see the results of the above-mentioned pathway analysis (top 10 most significant pathways) for the significantly up- and down-regulated genes identified between the 2hr and 24hr reperfusion groups using DESeq2.

Finally, we have the flowchart generated by Nextflow, which provides a map of our pipeline:

Stay tuned for more updates!
