Primary analysis workflows
The primary analysis workflows transform raw data (FASTQ files) into usable results. For RNA-seq, the results are differentially expressed genes, along with comprehensive QC and analysis. For ChIP-seq, they are called peaks or differentially bound chromatin regions.
The primary analysis workflows are:
References
RNA-seq
ChIP-seq
These are each described further in their respective sections.
While the references workflow can be run stand-alone, it is usually run as a by-product of running the RNA-seq or ChIP-seq workflows. Here we focus on RNA-seq and ChIP-seq, which share some common properties.
Where possible, we prefer to have rules use the normal command-line syntax for tools (examples include rules calling samtools, deepTools bamCoverage, picard, salmon). However, in some cases we use wrapper scripts.
Situations where we use wrappers:
Ensuring various aligners (HISAT2, Bowtie2, STAR, bwa) behave uniformly. These wrappers call the aligner, followed by samtools sort and view. The end result is that FASTQs go in, and a sorted BAM comes out.
Tools with legacy dependencies, such as Python 2.7, that must be run in an independent environment (macs2, sicer, rseqc).
R analyses (particularly spp and dupradar, which build up an R script incrementally before calling it).
Tools that need complicated setup, or handling output files hard-coded by the tool (fastqc, fastq_screen).
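To make the aligner case concrete, here is a minimal sketch of the "FASTQs in, sorted BAM out" idea. It is not the wrappers' actual code. The function name, the template table, and the choice of single-end input are assumptions made for illustration, as is the order of the samtools steps (convert to BAM, then sort). STAR is left out because it writes its output to files by default and would need extra handling.

```python
import shlex

# Hypothetical aligner templates for illustration only; real rules will need
# tool-specific options (threads, paired-end input, read groups, and so on).
ALIGNER_TEMPLATES = {
    "bowtie2": "bowtie2 -x {index} -U {fastq}",
    "hisat2": "hisat2 -x {index} -U {fastq}",
    "bwa": "bwa mem {index} {fastq}",
}


def sorted_bam_command(aligner, index, fastq, bam):
    """Build a shell pipeline: aligner -> samtools view -> samtools sort."""
    if aligner not in ALIGNER_TEMPLATES:
        raise ValueError(f"unsupported aligner: {aligner}")
    align = ALIGNER_TEMPLATES[aligner].format(
        index=shlex.quote(index), fastq=shlex.quote(fastq)
    )
    # Convert the SAM stream to BAM, then coordinate-sort it into the output.
    return f"{align} | samtools view -b - | samtools sort -o {shlex.quote(bam)} -"


# Every aligner yields a command with the same inputs and the same output.
print(sorted_bam_command("bowtie2", "idx/genome", "sample1.fastq.gz", "sample1.bam"))
```

Because every aligner goes through the same function, a rule only needs to name the aligner; the input and output conventions stay identical.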
In all cases, search for the string NOTE: in the Snakefile to read notes on how to configure each rule, and make adjustments as necessary. You may see some comments that say # [TEST SETTINGS]; you can ignore these, and see A note about test settings for more info.
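One way to list those notes is sketched below. The helper name find_notes and the example Snakefile fragment are made up for illustration; only the NOTE: and # [TEST SETTINGS] markers come from the text above.

```python
def find_notes(snakefile_text, marker="NOTE:"):
    """Return (line_number, stripped_line) pairs for lines containing the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(snakefile_text.splitlines(), start=1)
        if marker in line
    ]


# Made-up Snakefile fragment, used only to demonstrate the search.
example = """\
rule example_rule:
    # NOTE: adjust the parameters below for your experiment
    # [TEST SETTINGS]
    shell:
        "example_tool {input} > {output}"
"""

for number, line in find_notes(example):
    print(f"{number}: {line}")
```

On a real project you would pass the contents of the workflow's Snakefile instead of the example string. Running grep -n "NOTE:" Snakefile from the workflow directory gives the same listing.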
Note
If you have two RNA-seq experiments from different species, they have to be run separately. However, if downstream analyses will use both of them, you will probably want to keep them in the same project. In this case, you can copy the workflows/rnaseq directory to two other directories:
cp -r workflows/rnaseq workflows/genome1-rnaseq
cp -r workflows/rnaseq workflows/genome2-rnaseq
This way, downstream analyses can link to and utilize results from these individual folders, while the whole project remains self-contained.