.. _workflows:

Overview of workflows
=====================

.. note::

    These workflows **are intended to be edited and customized by the user**.
    See :ref:`getting-started` for recommendations on setting up these
    workflows in your project directory.

Each workflow lives in its own directory:

::

    ├── references/
    │   ├── Snakefile
    │   └── ...
    ├── rnaseq/
    │   ├── Snakefile
    │   └── ...
    ├── chipseq/
    │   ├── Snakefile
    │   └── ...
    ├── colocalization/
    │   ├── Snakefile
    │   └── ...
    ├── external/
    │   ├── Snakefile
    │   └── ...
    └── figures/
        ├── Snakefile
        └── ...

There are two general classes of workflows: **primary analysis** and
**integrative analysis**.

Each workflow is driven by a ``Snakefile`` and is configured by plain-text
YAML and TSV files (see :ref:`config` for much more on this).

Features common to workflows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this section, we take a higher-level look at the features common to the
primary analysis workflows.

- The ``lib`` module is imported in each Snakefile, allowing various helper
  functions to be used.

- The config file is hard-coded to be ``config/config.yaml`` by default, but
  a custom config can be specified at the command line by passing its path to
  ``snakemake --configfile``.

- The config file is loaded using ``lib.common.load_config``. This function
  resolves various paths (especially the references config section) and
  checks whether the config is well-formatted.

- The ``c`` object: To make it easier to work with the config, a
  ``SeqConfig`` object is created. It needs the parsed config file as well as
  the patterns file (see :ref:`patterns-and-targets` for more on this).
  Creating this object reads the sample table, fills in the patterns with
  sample names, creates a reference dictionary (see
  ``common.references_dict``) for easy access to reference files, and, for
  ChIP-seq, also fills in the filenames for the configured peak-calling runs.

  This object, called ``c`` for convenience, can be accessed to get all sorts
  of information: ``c.sampletable``, ``c.config``, ``c.patterns``,
  ``c.targets``, and ``c.refdict`` are frequently used in rules throughout
  the Snakefiles.

Primary analysis workflows
~~~~~~~~~~~~~~~~~~~~~~~~~~

The primary analysis workflows are generally used for transforming raw data
(fastq files) into usable results. For RNA-seq, that's
differentially-expressed genes (along with comprehensive QC and analysis).
For ChIP-seq, that's called peaks or differentially bound chromatin regions.

The primary analysis workflows are:

- References
- RNA-seq
- ChIP-seq

These are each described further in their respective sections.

While the references workflow can be run stand-alone, it is usually run as a
by-product of running the RNA-seq or ChIP-seq workflows. Here we will focus
on RNA-seq and ChIP-seq, which share some common properties.

Where possible, we prefer to have rules use the normal command-line syntax
for tools (examples include rules calling samtools, deepTools bamCoverage,
picard, salmon). However, in some cases we use wrapper scripts. Situations
where we use wrappers:

- Ensuring various aligners (HISAT2, Bowtie2, STAR, bwa) behave uniformly.
  These wrappers call the aligner, followed by samtools sort and view. The
  end result is that FASTQs go in, and a sorted BAM comes out.

- Tools with legacy dependencies like Python 2.7 that must be run in an
  independent environment (macs2, sicer, rseqc).

- R analyses (particularly spp and dupradar, which build up an R script
  incrementally before calling it).

- Tools that need complicated setup, or handling of output files hard-coded
  by the tool (fastqc, fastq_screen).

In all cases, search for the string **NOTE:** in the Snakefile to read notes
on how to configure each rule, and make adjustments as necessary.

You may see some comments that say ``# [TEST SETTINGS]``; you can ignore
these, and see :ref:`test-settings` for more info.

.. note::

    If you have two RNA-seq experiments from different species, they have to
    be run separately. However, if downstream analyses will use both, you
    will probably want to keep them in the same project. In this case, you
    can copy the ``workflows/rnaseq`` directory to two other directories:

    .. code-block:: bash

        cp -r workflows/rnaseq workflows/genome1-rnaseq
        cp -r workflows/rnaseq workflows/genome2-rnaseq

    This way, downstream analyses can link to and use results from these
    individual folders, while the whole project remains self-contained.

Integrative analysis workflows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The integrative analysis workflows take input from the primary workflows and
tie them together.

The integrative analysis workflows are described in :ref:`integrative`:

- Colocalization
- "External"
- Figures

These are each described in more detail in their respective sections.

Next Steps
~~~~~~~~~~

Next we look at :ref:`config` for details on how to configure specific
workflows.