.. _config-yaml:

Config YAML
===========

This page details the various configuration options and describes how to
configure a new workflow.

Note that the ``references:`` section is detailed separately, at
:ref:`references-config`.

Config files are expected to be in a ``config`` directory next to the
Snakefile. For example, the RNA-seq workflow at
``workflows/rnaseq/Snakefile`` expects the config file
``workflows/rnaseq/config/config.yaml``.

While it is possible to use Snakemake mechanisms such as ``--config`` to
override a particular config value and ``--configfile`` to update the config
with a different file, it is easiest to edit the existing
``config/config.yaml`` in place. This has the additional benefit of
reproducibility because all of the config information is stored in one place.

The following table summarizes the config fields, which ones are used for
which workflow, and under what conditions, if any, they are required. Each
option links to a section below with more details on how to use it.

================================================================================== =================== ================ ================= =========
Field                                                                              Used for References Used for RNA-seq Used for ChIP-seq Required
================================================================================== =================== ================ ================= =========
:ref:`references <cfg-references>` and/or :ref:`include_references <cfg-inc-refs>` yes                 yes              yes               yes
:ref:`references_dir <cfg-references-dir>`                                         yes                 yes              yes               if `REFERENCES_DIR` env var not set
:ref:`sampletable <cfg-sampletable>`                                               .                   yes              yes               always
:ref:`organism <cfg-organism>`                                                     .                   yes              yes               always
:ref:`aligner <cfg-aligner>`                                                       .                   yes              yes               always
:ref:`stranded <cfg-stranded>`                                                     .                   yes              no                usually (see :ref:`stranded <cfg-stranded>`)
:ref:`fastq_screen <cfg-fastq-screen>`                                             .                   yes              yes               if using `fastq_screen`
:ref:`merged_bigwigs <cfg-merged-bigwigs>`                                         .                   yes              yes               if you want to merge bigwigs
:ref:`gtf <cfg-gtf>`                                                               .                   yes              .                 always for RNA-seq
:ref:`rrna <cfg-rrna>`                                                             .                   yes              .                 if rRNA screening desired
:ref:`salmon <cfg-salmon>`                                                         .                   yes              .                 if Salmon quantification will be run
:ref:`chipseq <cfg-chipseq>`                                                       .                   .                yes               always for ChIP-seq
================================================================================== =================== ================ ================= =========

Example configs
---------------

To provide an overview, here are some example config files. More detail is
provided later; this is just to provide some context:

RNA-seq
~~~~~~~

The config file for RNA-seq is expected to be in
``workflows/rnaseq/config/config.yaml``:

.. code-block:: yaml

    references_dir: "/data/references"

    sampletable: "config/sampletable.tsv"

    organism: 'human'

    aligner:
      tag: 'gencode-v25'
      index: 'hisat2'

    rrna:
      tag: 'rRNA'
      index: 'bowtie2'

    gtf:
      tag: 'gencode-v25'

    fastq_screen:
      - label: Human
        organism: human
        tag: gencode-v25
      - label: rRNA
        organism: human
        tag: rRNA

    # Portions have been omitted from "references" section below for
    # simplicity; see references config section for details.

    references:
      human:
        gencode-v25:
          genome:
            url: 'ftp://.../genome.fa.gz'
            indexes:
              - 'hisat2'
              - 'bowtie2'

          annotation:
            url: 'ftp://.../annotation.gtf.gz'

          transcriptome:
            indexes:
              - 'salmon'

        rRNA:
          genome:
            url: 'https://...'
            indexes:
              - 'bowtie2'

ChIP-seq
~~~~~~~~

The config file for ChIP-seq is expected to be in
``workflows/chipseq/config/config.yaml``.

The major differences between ChIP-seq and RNA-seq configs are:

- ChIP-seq has no ``gtf`` or ``rrna`` fields
- ChIP-seq has an additional section ``chipseq: peak_calling:``

.. code-block:: yaml

    sampletable: 'config/sampletable.tsv'
    organism: 'dmel'
    genome: 'dm6'

    aligner:
      index: 'bowtie2'
      tag: 'test'

    chipseq:
      peak_calling:

        - label: gaf-embryo-1
          algorithm: macs2
          ip:
            - gaf-embryo-1
          control:
            - input-embryo-1

        - label: gaf-embryo-1
          algorithm: spp
          ip:
            - gaf-embryo-1
          control:
            - input-embryo-1

        - label: gaf-wingdisc-pooled
          algorithm: macs2
          ip:
            - gaf-wingdisc-1
            - gaf-wingdisc-2
          control:
            - input-wingdisc-1
            - input-wingdisc-2

        - label: gaf-wingdisc-pooled
          algorithm: spp
          ip:
            - gaf-wingdisc-1
            - gaf-wingdisc-2
          control:
            - input-wingdisc-1
            - input-wingdisc-2

        - label: gaf-wingdisc-pooled-1
          algorithm: epic2
          ip:
            - gaf-wingdisc-1
          control:
            - input-wingdisc-1
          extra: ''

        - label: gaf-wingdisc-pooled-2
          algorithm: epic2
          ip:
            - gaf-wingdisc-2
          control:
            - input-wingdisc-2
          extra: ''

    fastq_screen:
      - label: Human
        organism: human
        tag: gencode-v25

    merged_bigwigs:
      input-wingdisc:
        - input-wingdisc-1
        - input-wingdisc-2
      gaf-wingdisc:
        - gaf-wingdisc-1
        - gaf-wingdisc-2
      gaf-embryo:
        - gaf-embryo-1

    # Portions have been omitted from "references" section below for
    # simplicity; see references config section for details.

    references:
      human:
        gencode-v25:
          genome:
            url: 'ftp://.../genome.fa.gz'
            indexes:
              - 'hisat2'
              - 'bowtie2'

          annotation:
            url: 'ftp://.../annotation.gtf.gz'

      fly:
        test:
          genome:
            url: "https://raw.githubusercontent.com/lcdb/lcdb-test-data/master/data/seq/dm6.small.fa"
            postprocess: 'lib.common.gzipped'
            indexes:
              - 'bowtie2'
              - 'hisat2'

Field descriptions
------------------

Required for references, RNA-seq and ChIP-seq
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _cfg-references:

``references``
``````````````

This section defines labels for references, where to get FASTA and GTF files
and (optionally) post-process them, and which indexes to build.

Briefly, the example above has a single organism configured ("human"). That
organism has two tags ("gencode-v25" and "rRNA").

This is the most complex section and is documented elsewhere (see
:ref:`references-config`).

.. _cfg-inc-refs:

``include_references``
``````````````````````

This section can be used to supplement the ``references`` section with other
reference sections stored elsewhere in files. It's a convenient way of
managing a large number of references without cluttering the config file.

See :ref:`references-config` for more.

.. _cfg-references-dir:

``references_dir``
``````````````````

Top-level directory in which to create references.

If not specified, uses the environment variable ``REFERENCES_DIR``. If both
are set, ``REFERENCES_DIR`` takes precedence.

This is useful when multiple people in a group share the same references, as
it avoids duplicating commonly-used references. Simply point
``references_dir`` to an existing references directory to avoid having to
rebuild references.

Required for RNA-seq and ChIP-seq
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _cfg-sampletable:

``sampletable`` field
`````````````````````

Path to the sampletable file which, at minimum, lists sample names and paths
to FASTQ files. The path is relative to the Snakefile. See :ref:`sampletable`
for more info on the expected contents of the file.

Example:

.. code-block:: yaml

    sampletable: "config/sampletable.tsv"

.. _cfg-organism:

``organism`` field
``````````````````

This field selects the top-level section of the ``references`` section that
will be used for the analysis. In the RNA-seq example above, "human" is the
only organism configured. In the ChIP-seq example, there is "human" as well as
"fly".

Example:

.. code-block:: yaml

    organism: "human"

.. _cfg-aligner:

``aligner`` config section
``````````````````````````

This field has two sub-fields, and automatically uses the configured
``organism`` to select the top-level entry in the references section. ``tag``
selects the tag from the organism to use, and ``index`` selects which aligner
index to use.
The relevant option from the example above would be "gencode-v25", which
configures both bowtie2 and hisat2 indexes to be built. For RNA-seq we would
likely choose "hisat2"; for ChIP-seq "bowtie2".

Currently-configured options are ``hisat2``, ``bowtie2``, and ``star``.

Example:

.. code-block:: yaml

    aligner:
      tag: "gencode-v25"
      index: "hisat2"

Required for RNA-seq
~~~~~~~~~~~~~~~~~~~~

.. _cfg-stranded:

``stranded`` field
``````````````````

This field specifies the strandedness of the library. This is used by various
rules to set the parameters correctly. For example, ``featureCounts`` will use
``-s0``, ``-s1``, or ``-s2`` accordingly; ``kallisto`` will use
``--fr-stranded`` if needed, and so on.

This field can take the following options:

=================== ===========
value               description
=================== ===========
``unstranded``      The strand that R1 reads align to has no information about the strand of the gene.
``fr-firststrand``  R1 reads from plus-strand genes align to the *minus* strand. Also called reverse stranded, dUTP-based.
``fr-secondstrand`` R1 reads from plus-strand genes align to the *plus* strand. Also called forward stranded.
=================== ===========

Example:

.. code-block:: yaml

    stranded: "fr-firststrand"

Rules that require information about strand will check the config file at run
time and raise an error if this field doesn't exist.

If you don't know the strandedness of the library, run the Snakefile in such
a way as to only run the ``strand_check`` rule:

.. code-block:: bash

    snakemake -j 2 strand_check

Or, when using the Slurm wrapper on a cluster:

.. code-block:: bash

    sbatch ../../include/WRAPPER_SLURM strand_check

This rule aligns the first 10,000 reads to the specified reference, runs
RSeQC's ``infer_experiment.py`` on the results, and then runs MultiQC on just
those output files. When complete, there will be a MultiQC HTML file in the
``strand_check/`` directory that you can inspect to make your choice.

.. versionadded:: 1.8

Optional fields
~~~~~~~~~~~~~~~

.. _cfg-fastq-screen:

``fastq_screen`` config section
```````````````````````````````

This section configures which Bowtie2 indexes should be used with
`fastq_screen`. It takes the form of a list of dictionaries. Each dictionary
has the keys:

- `label`: how to label the genome in the output
- `organism`: a configured organism. In the example above, there is only
  a single configured organism, "human".
- `tag`: a configured tag for that organism.

Each entry in the list must have a Bowtie2 index configured to be built.

Example:

.. code-block:: yaml

    fastq_screen:
      - label: Human
        organism: human
        tag: gencode-v25
      - label: rRNA
        organism: human
        tag: rRNA

The above example configures two different indexes to use for fastq_screen:
the human gencode-v25 reference, and the human rRNA reference.

.. _cfg-merged-bigwigs:

``merged_bigwigs`` config section
`````````````````````````````````

This section controls optional merging of signal files in bigWig format. Its
format differs depending on RNA-seq or ChIP-seq, due to how strands are
handled in those workflows.

Here is an RNA-seq example:

.. code-block:: yaml

    merged_bigwigs:
      arbitrary_label_to_use:
        pos:
          - 'sample1'
          - 'sample2'
        neg:
          - 'sample1'
          - 'sample2'

This will result in a single bigWig file called
`arbitrary_label_to_use.bigwig` in the directory
`data/rnaseq_aggregation/merged_bigwigs` (by default; this is configured using
``config/rnaseq_patterns.yaml``). That file merges together both the positive
and negative signal strands of two samples, `sample1` and `sample2`. The names
"sample1" and "sample2" are sample names defined in the
:ref:`sample table <sampletable>`.

In other words, if samples 1 and 2 are replicates for a condition, this gets
us a single merged (averaged) track for that condition.

Here's another RNA-seq example, where we merge the samples again but keep the
strands separate. This will result in two output bigwigs.

.. code-block:: yaml

    merged_bigwigs:
      merged_sense:
        sense:
          - 'sample1'
          - 'sample2'
      merged_antisense:
        antisense:
          - 'sample1'
          - 'sample2'

Here is a ChIP-seq example:

.. code-block:: yaml

    merged_bigwigs:
      arbitrary_label_to_use:
        - 'label1'
        - 'label2'

This will result in a single bigWig file called
`arbitrary_label_to_use.bigwig` in the directory
`data/chipseq_aggregation/merged_bigwigs` (by default; this is configured
using ``config/chipseq_patterns.yaml``) that merges together the "label1" and
"label2" bigwigs.

See :ref:`sampletable` for more info on the relationship between a *sample*
and a *label* when working with ChIP-seq.

RNA-seq-only fields
~~~~~~~~~~~~~~~~~~~

.. _cfg-rrna:

``rrna`` field
``````````````

This field selects the reference tag to use for screening rRNA reads. Similar
to the ``aligner`` field, it takes both a ``tag`` and ``index`` key. The
specified index must have been configured to be built for the specified tag.
It uses the already configured ``organism``.

Example:

.. code-block:: yaml

    rrna:
      tag: 'rRNA'
      index: 'bowtie2'

.. _cfg-gtf:

``gtf`` field
`````````````

This field selects the reference tag to use for counting reads in features.
The tag must have had a ``gtf:`` section specified; see
:ref:`references-config` for details.

The organism is inherited from the ``organism:`` field.

Example:

.. code-block:: yaml

    gtf:
      tag: "gencode-v25"

.. _cfg-salmon:

``salmon`` field
````````````````

This field selects the reference tag to use for the Salmon index (if used).
The tag must have had a FASTA configured, and an index for "salmon" must have
been configured to be built for the organism selected with the ``organism``
config option.

ChIP-seq-only fields
~~~~~~~~~~~~~~~~~~~~

.. _cfg-chipseq:

``chipseq`` config section
``````````````````````````

This section configures the peak-calling stage of the ChIP-seq workflow. It
currently expects a single key, ``peak_calling``, which is a list of
peak-calling runs.
A peak-calling run is a dictionary configuring a single execution of
a peak-caller which results in a single BED file of called peaks.
A peak-calling run is uniquely described by its ``label`` and ``algorithm``.
This way, we can use the same label (e.g., `gaf-embryo-1`) across multiple
peak-callers to help organize the output.

The currently-supported peak-callers are ``macs2``, ``spp``, and ``sicer``.
They each have corresponding wrappers in the ``wrappers`` directory. To add
other peak-callers, see :ref:`new-peak-caller`.

The track hubs will include all of these called peaks, which helps with
assessing the peak-calling performance.

Here is a minimal example of a peak-calling config section. It defines
a single peak-calling run using the `macs2` algorithm. Note that the ``ip:``
and ``control:`` keys are lists of **labels** from the ChIP-seq sample table's
``label`` column, **not sample IDs** from the first column.

.. code-block:: yaml

    chipseq:
      peak_calling:
        - label: gaf-embryo-1
          algorithm: macs2
          ip:
            - gaf-embryo-1
          control:
            - input-embryo-1

The above peak-calling config will result in a file
``data/chipseq_peaks/macs2/gaf-embryo-1/peaks.bed`` (that pattern is defined
in ``chipseq_patterns.yaml`` if you need to change it).

We can specify additional command-line arguments that are passed verbatim to
`macs2` with the ``extra:`` section, for example:

.. code-block:: yaml

    chipseq:
      peak_calling:
        - label: gaf-embryo-1
          algorithm: macs2
          ip:
            - gaf-embryo-1
          control:
            - input-embryo-1
          extra: '--nomodel --extsize 147'

`macs2` supports multiple IP and input files, which internally are merged by
`macs2`. We can supply multiple IP and input labels for biological replicates
to get a set of peaks called on pooled samples. Note that we give it
a different label so it doesn't overwrite the other peak-calling run we
already have configured.

.. code-block:: yaml

    chipseq:
      peak_calling:
        - label: gaf-embryo-1
          algorithm: macs2
          ip:
            - gaf-embryo-1
          control:
            - input-embryo-1
          extra: '--nomodel --extsize 147'

        - label: gaf-embryo-pooled
          algorithm: macs2
          ip:
            - gaf-embryo-1
            - gaf-embryo-2
          control:
            - input-embryo-1
            - input-embryo-2
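
As a rough illustration of why each ``label``/``algorithm`` pair must be
unique, the following Python sketch checks an already-parsed config for
duplicate runs and derives the expected output path of each run. It is not
part of the workflow: the function name ``peak_calling_outputs`` is
hypothetical, and the path layout is only assumed from the
``data/chipseq_peaks/macs2/gaf-embryo-1/peaks.bed`` example above (the actual
pattern lives in ``chipseq_patterns.yaml``).

.. code-block:: python

    from collections import Counter


    def peak_calling_outputs(config):
        """Map each (label, algorithm) pair to its assumed peaks.bed path.

        Hypothetical helper for illustration only. Raises ValueError if two
        runs share the same (label, algorithm) pair, because they would
        write to the same output file.
        """
        runs = config.get("chipseq", {}).get("peak_calling", [])
        keys = [(run["label"], run["algorithm"]) for run in runs]

        duplicates = [key for key, count in Counter(keys).items() if count > 1]
        if duplicates:
            raise ValueError(f"Duplicate peak-calling runs: {duplicates}")

        # Assumed layout, taken from the example path in this section.
        return {
            (label, algorithm): f"data/chipseq_peaks/{algorithm}/{label}/peaks.bed"
            for label, algorithm in keys
        }


    # The same two runs as in the pooled example, as a parsed dictionary.
    config = {
        "chipseq": {
            "peak_calling": [
                {
                    "label": "gaf-embryo-1",
                    "algorithm": "macs2",
                    "ip": ["gaf-embryo-1"],
                    "control": ["input-embryo-1"],
                },
                {
                    "label": "gaf-embryo-pooled",
                    "algorithm": "macs2",
                    "ip": ["gaf-embryo-1", "gaf-embryo-2"],
                    "control": ["input-embryo-1", "input-embryo-2"],
                },
            ]
        }
    }

    outputs = peak_calling_outputs(config)
    for key, path in outputs.items():
        print(key, path)

Because the label is part of the path, reusing ``gaf-embryo-1`` for the pooled
run with the same algorithm would collide with the single-replicate run, which
is why the pooled run gets its own label.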