Overview of workflows

Note

These workflows are intended to be edited and customized by the user.

See Getting started for recommendations on setting up these workflows in your project directory.

Each workflow lives in its own directory:

├── references/
│   ├── Snakefile
│   └── ...
├── rnaseq/
│   ├── Snakefile
│   └── ...
├── chipseq/
│   ├── Snakefile
│   └── ...
├── colocalization/
│   ├── Snakefile
│   └── ...
├── external/
│   ├── Snakefile
│   └── ...
└── figures/
    ├── Snakefile
    └── ...

There are two general classes of workflows: primary analysis and integrative analysis.

Each workflow is driven by a Snakefile and is configured by plain-text YAML and TSV files (see Configuration for details).

Features common to workflows

In this section, we will take a higher-level look at the features common to the primary analysis workflows.

  • The lib module is imported in each Snakefile, allowing various helper functions to be used.

  • The config file is hard-coded to be config/config.yaml by default, but a custom config can be specified on the command line using snakemake --configfile <path to other config file>.

  • The config file is loaded using lib.common.load_config. This function resolves various paths (especially the references config section) and checks to see if the config is well-formatted.

  • The c object: To make it easier to work with the config, a SeqConfig object is created. It needs the parsed config file as well as the patterns file (see Patterns and targets for more on this). Creating this object reads the sample table, fills in the patterns with sample names, creates a reference dictionary (see common.references_dict) for easy access to reference files, and, for ChIP-seq, also fills in the filenames for the configured peak-calling runs. This object, called c for convenience, provides access to all sorts of information: c.sampletable, c.config, c.patterns, c.targets, and c.refdict are frequently used in rules throughout the Snakefiles.
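
To make the idea of "filling in the patterns with sample names" more concrete, here is a minimal, self-contained sketch. It is not the actual SeqConfig implementation: the fill_patterns function, the pattern keys, and the file paths are all hypothetical and chosen only for illustration.

```python
def fill_patterns(patterns, samples):
    """Replace each pattern string with a list of filenames, one per sample.

    Hypothetical helper for illustration only; nested dictionaries are
    handled recursively so the output mirrors the structure of the input.
    """
    filled = {}
    for key, value in patterns.items():
        if isinstance(value, dict):
            filled[key] = fill_patterns(value, samples)
        else:
            filled[key] = [value.format(sample=s) for s in samples]
    return filled


# Made-up patterns; a real patterns file would be loaded from YAML.
patterns = {
    "fastq": "data/{sample}/{sample}_R1.fastq.gz",
    "bam": "data/{sample}/{sample}.sorted.bam",
    "qc": {"fastqc": "data/{sample}/{sample}_fastqc.html"},
}

# Made-up sample names; normally these come from the sample table.
samples = ["sample1", "sample2"]

targets = fill_patterns(patterns, samples)
print(targets["bam"])
```

In this sketch, patterns plays the role of c.patterns (templates with a {sample} placeholder) and targets plays the role of c.targets (concrete filenames that rules can request).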

Primary analysis workflows

The primary analysis workflows are generally used for transforming raw data (fastq files) into usable results. For RNA-seq, that means differentially expressed genes (along with comprehensive QC and analysis). For ChIP-seq, that means called peaks or differentially bound chromatin regions.

The primary analysis workflows are:

  • References

  • RNA-seq

  • ChIP-seq

These are each described further in their respective sections.

While the references workflow can be run stand-alone, it is usually run as a by-product of running the RNA-seq or ChIP-seq workflows. Here we will focus on RNA-seq and ChIP-seq, which share some common properties.

Where possible, we prefer to have rules use the normal command-line syntax for tools (examples include rules calling samtools, deepTools bamCoverage, picard, and salmon). However, in some cases we use wrapper scripts.

Situations where we use wrappers:

  • Ensuring various aligners (HISAT2, Bowtie2, STAR, bwa) behave uniformly. These wrappers call the aligner, followed by samtools sort and view. The end result is that FASTQs go in, and a sorted BAM comes out.

  • Tools with legacy dependencies like Python 2.7 that must be run in an independent environment (macs2, sicer, rseqc).

  • R analyses (particularly spp and dupradar, which build up an R script incrementally before calling it).

  • Tools that need complicated setup, or handling output files hard-coded by the tool (fastqc, fastq_screen).
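
As a rough illustration of the "FASTQs go in, and a sorted BAM comes out" idea behind the aligner wrappers, the sketch below builds a shell pipeline that feeds an aligner's output through samtools view and samtools sort. This is not the code of the actual wrappers: the aligner_pipeline function and the file names are hypothetical, and it assumes the aligner writes SAM to standard output (as Bowtie2 does when no output file is given).

```python
import shlex


def aligner_pipeline(aligner_cmd, sorted_bam, threads=1):
    """Return a shell pipeline string: aligner SAM -> BAM -> sorted BAM.

    Hypothetical helper for illustration; `aligner_cmd` is a list of
    arguments for an aligner that writes SAM to standard output.
    """
    aligner = " ".join(shlex.quote(str(arg)) for arg in aligner_cmd)
    return (
        f"{aligner} "
        f"| samtools view -b - "
        f"| samtools sort -@ {int(threads)} -o {shlex.quote(sorted_bam)} -"
    )


# Made-up index prefix and file names.
bowtie2_cmd = ["bowtie2", "-x", "refs/genome", "-U", "sample1.fastq.gz", "--threads", "4"]
cmd = aligner_pipeline(bowtie2_cmd, "sample1.sorted.bam", threads=4)
print(cmd)
```

Because only the aligner-specific part of the command changes, downstream rules can expect the same kind of output file regardless of which aligner was configured.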

In all cases, search for the string NOTE: in the Snakefile to read notes on how to configure each rule, and make adjustments as necessary. You may see some comments that say # [TEST SETTINGS]; you can ignore these, and see A note about test settings for more info.
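
If you prefer to list these notes programmatically rather than searching in an editor, a small helper along the following lines would work. The find_notes function and the example Snakefile text are made up for illustration and are not part of the workflows.

```python
def find_notes(snakefile_text, marker="NOTE:"):
    """Return (line_number, stripped_line) pairs for lines containing `marker`."""
    return [
        (number, line.strip())
        for number, line in enumerate(snakefile_text.splitlines(), start=1)
        if marker in line
    ]


# Made-up Snakefile fragment, used only to demonstrate the helper.
example_text = """rule example_trim:
    # NOTE: adjust the adapter settings for your library prep
    threads: 1
rule example_align:
    # [TEST SETTINGS]
    threads: 6
"""

notes = find_notes(example_text)
for number, line in notes:
    print(f"{number}: {line}")
```

To scan a real file, read its contents first (for example with pathlib.Path("Snakefile").read_text()) and pass the resulting string to find_notes.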

Note

If you have two RNA-seq experiments from different species, they have to be run separately. However, if downstream analyses will use both of them, you will probably want to keep them in the same project. In this case, you can copy the workflows/rnaseq directory to two other directories:

cp -r workflows/rnaseq workflows/genome1-rnaseq
cp -r workflows/rnaseq workflows/genome2-rnaseq

This way, downstream analyses can link to and utilize results from these individual folders, while the whole project remains self-contained.

Integrative analysis workflows

The integrative analysis workflows take input from the primary workflows and tie them together.

The integrative analysis workflows are described in Integrative workflows:

  • Colocalization

  • “External”

  • Figures

These are each described in more detail in their respective sections.

Next Steps

Next we look at Configuration for details on how to configure specific workflows.