Repo for scripts and files for read mapping simulations.
Make sure Anaconda is installed and activate it if it hasn't been already:
source /path/to/anaconda3/bin/activate
Then, all dependencies (except one, see below), can be installed using the pre-set environment, mapping-sims
. Create that environment to install this software:
conda env create --file mapping-sims.yml
And then activate the environment:
conda activate mapping-sims
The only software not installed via the conda environment is the read simulation software, NEAT. NEAT is just a python script, so you can install it just by cloning the github repo, though this must be cloned in the correct spot so the snakemake pipeline can find it. First, while in the root project directory, make the directory pkgs
and navigate to it:
mkdir pkgs
cd pkgs
Then, clone the NEAT repo:
git clone https://github.com/ncsa/NEAT.git
That should be everything!
To make things easier, let's just switch to the scripts
directory.
cd scripts
If you'd rather not switch to scripts
, just be sure to add that to all paths listed below.
--dryrun
will show you the commands that snakemake will run without actually running them. This can let you address errors before you start the whole pipeline.
The commands for the pipeline are compiled in a snakemake workflow, mapping_sims.smk
To run a given set of simulations, you will have to create a config file with the parameters and inputs of that simulation. An example config file for the mouse genome is provided here: mm39-config.yaml
Take a look at that config file and make any changes, including paths to data directories and reference genomes, as well as simulation parameters.
After you have edited the config file, you can run the pipeline directly:
snakemake -j <number of jobs to run in parallel> \
-s mapping_sims.smk \
-p \
--configfile <your config file>
I like to include the -p
option that prints out the exact shell command run for each rule, which I find helpful to understand exactly what's going on.
For example, the exact command for the provided mm39 config file to launch 20 jobs would be:
snakemake -j 20 -s mapping_sims.smk -p --configfile mm30-config.yaml
Snakemake also allows jobs to be submitted to a cluster rather than run directly on the terminal. This is advantageous as it can allow the pipeline to access many more resources than using a single node with the command above. To do this, you must set-up a profile for your cluster. An example profile for the Harvard SLURM cluster is provided in scripts/profiles/slurm_profile/config.yaml
. Once your profile is set-up, you can launch the pipeline as follows:
snakemake -s mapping_sims.smk \
-p \
--configfile <your config file> \
--profile <path to directory containing config.yaml profile>
Similar, to the previous command, but without specifying the number of jobs with -j
, which is now done with the --profile
option. Note that the provided path must be to a directory containing your profile, which is saved as config.yaml
.
So, the command to run the pipeline with the provided config and profile would be:
snakemake -s mapping_sims.smk -p --configfile mm30-config.yaml --profile profiles/slurm_profile/