Config YAML

This page details the various configuration options and describes how to configure a new workflow.

Note that the references: section is documented separately, in References config.

Config files are expected to be in a config directory next to the Snakefile. For example, the RNA-seq workflow at workflows/rnaseq/Snakefile expects the config file workflows/rnaseq/config/config.yaml.

While it is possible to use Snakemake mechanisms such as --config to override a particular config value and --configfile to update the config with a different file, it is easiest to edit the existing config/config.yaml in place. This has the additional benefit of reproducibility, because all of the config information is stored in one place.

The following table summarizes the config fields, which ones are used for which workflow, and under what conditions, if any, they are required. Each option links to a section below with more details on how to use it.

| Field | Used for References | Used for RNA-seq | Used for ChIP-seq | Required |
|---|---|---|---|---|
| references and/or include_references | yes | yes | yes | yes |
| references_dir | yes | yes | yes | if REFERENCES_DIR env var not set |
| sampletable | . | yes | yes | always |
| organism | . | yes | yes | always |
| aligner | . | yes | yes | always |
| stranded | . | yes | no | usually (see stranded) |
| fastq_screen | . | yes | yes | if using fastq_screen |
| merged_bigwigs | . | yes | yes | if you want to merge bigwigs |
| gtf | . | yes | . | always for RNA-seq |
| rrna | . | yes | . | if rRNA screening desired |
| salmon | . | yes | . | if Salmon quantification will be run |
| chipseq | . | . | yes | always for ChIP-seq |

Example configs

To provide an overview and some context, here are some example config files. Each field is described in more detail later on this page.

RNA-seq

The config file for RNA-seq is expected to be in workflows/rnaseq/config/config.yaml:

references_dir: "/data/references"
sampletable: "config/sampletable.tsv"
organism: 'human'
aligner:
  tag: 'gencode-v25'
  index: 'hisat2'
rrna:
  tag: 'rRNA'
  index: 'bowtie2'
gtf:
  tag: 'gencode-v25'

fastq_screen:
  - label: Human
    organism: human
    tag: gencode-v25
  - label: rRNA
    organism: human
    tag: rRNA

# Portions have been omitted from "references" section below for
# simplicity; see references config section for details.

references:
  human:
    gencode-v25:
      genome:
        url: 'ftp://.../genome.fa.gz'
        indexes:
          - 'hisat2'
          - 'bowtie2'
      annotation:
        url: 'ftp://.../annotation.gtf.gz'

      transcriptome:
        indexes:
          - 'salmon'

    rRNA:
      genome:
        url: 'https://...'
        indexes:
          - 'bowtie2'

ChIP-seq

The config file for ChIP-seq is expected to be in workflows/chipseq/config/config.yaml.

The major differences between ChIP-seq and RNA-seq configs are:

  • ChIP-seq has no gtf or rrna fields

  • ChIP-seq has an additional section, chipseq: peak_calling:

sampletable: 'config/sampletable.tsv'
organism: 'fly'
genome: 'dm6'

aligner:
  index: 'bowtie2'
  tag: 'test'

chipseq:
  peak_calling:

    - label: gaf-embryo-1
      algorithm: macs2
      ip:
        - gaf-embryo-1
      control:
        - input-embryo-1

    - label: gaf-embryo-1
      algorithm: spp
      ip:
        - gaf-embryo-1
      control:
        - input-embryo-1

    - label: gaf-wingdisc-pooled
      algorithm: macs2
      ip:
        - gaf-wingdisc-1
        - gaf-wingdisc-2
      control:
        - input-wingdisc-1
        - input-wingdisc-2

    - label: gaf-wingdisc-pooled
      algorithm: spp
      ip:
        - gaf-wingdisc-1
        - gaf-wingdisc-2
      control:
        - input-wingdisc-1
        - input-wingdisc-2

    - label: gaf-wingdisc-pooled-1
      algorithm: epic2
      ip:
        - gaf-wingdisc-1
      control:
        - input-wingdisc-1
      extra: ''

    - label: gaf-wingdisc-pooled-2
      algorithm: epic2
      ip:
        - gaf-wingdisc-2
      control:
        - input-wingdisc-2
      extra: ''

fastq_screen:
  - label: Human
    organism: human
    tag: gencode-v25

merged_bigwigs:
  input-wingdisc:
    - input-wingdisc-1
    - input-wingdisc-2
  gaf-wingdisc:
    - gaf-wingdisc-1
    - gaf-wingdisc-2
  gaf-embryo:
    - gaf-embryo-1


# Portions have been omitted from "references" section below for
# simplicity; see references config section for details.

references:
  human:
    gencode-v25:
      genome:
        url: 'ftp://.../genome.fa.gz'
        indexes:
          - 'hisat2'
          - 'bowtie2'
      annotation:
        url: 'ftp://.../annotation.gtf.gz'

  fly:
    test:
      genome:
        url: "https://raw.githubusercontent.com/lcdb/lcdb-test-data/master/data/seq/dm6.small.fa"
        postprocess: 'lib.common.gzipped'
        indexes:
          - 'bowtie2'
          - 'hisat2'

Field descriptions

Required for references, RNA-seq and ChIP-seq

references

This section defines labels for references, where to get FASTA and GTF files and (optionally) post-process them, and which indexes to build.

Briefly, the RNA-seq example above has a single organism configured (“human”). That organism has two tags (“gencode-v25” and “rRNA”).

This is the most complex section and is documented elsewhere (see References config).

include_references

This section can be used to supplement the references section with other reference sections stored in separate files. It’s a convenient way of managing a large number of references without cluttering the config file.

See References config for more.

references_dir

Top-level directory in which to create references.

If not specified, uses the environment variable REFERENCES_DIR.

If specified and REFERENCES_DIR also exists, REFERENCES_DIR takes precedence.

This is useful when multiple people in a group share the same references, because it avoids duplicating commonly used references. Simply point references_dir to an existing references directory to avoid having to rebuild them.
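The precedence rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow's actual code: the function name resolve_references_dir is hypothetical, and the config is represented as a plain dictionary rather than being loaded from YAML.

```python
import os


def resolve_references_dir(config, environ=None):
    """Return the references directory, preferring the REFERENCES_DIR env var.

    `config` is the parsed config file as a dict. `environ` defaults to
    os.environ; it is a parameter only to make the function easy to test.
    """
    environ = os.environ if environ is None else environ
    env_value = environ.get("REFERENCES_DIR")
    if env_value:
        # The environment variable takes precedence over the config value.
        return env_value
    if "references_dir" in config:
        return config["references_dir"]
    raise ValueError(
        "Set references_dir in the config or the REFERENCES_DIR env var"
    )


config = {"references_dir": "/data/references"}
print(resolve_references_dir(config, environ={}))
print(resolve_references_dir(config, environ={"REFERENCES_DIR": "/shared/refs"}))
```

The first call falls back to the config value; the second returns the (hypothetical) /shared/refs path because the environment variable is set.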

Required for RNA-seq and ChIP-seq

sampletable field

Path to the sampletable file which, at minimum, lists sample names and paths to FASTQ files. The path is relative to the Snakefile. See Sample tables for more info on the expected contents of the file.

Example:

sampletable: "config/sampletable.tsv"

organism field

This field selects the top-level section of the references section that will be used for the analysis. In the RNA-seq example above, “human” is the only organism configured. In the ChIP-seq example, there is “human” as well as “fly”.

Example:

organism: "human"

aligner config section

This field has two sub-fields, and automatically uses the configured organism to select the top-level entry in the references section. tag selects which of the organism’s tags to use, and index selects which aligner index to use. In the example above, the relevant tag is “gencode-v25”, which is configured to build both bowtie2 and hisat2 indexes. For RNA-seq we would likely choose “hisat2”; for ChIP-seq, “bowtie2”.

Currently-configured options are hisat2, bowtie2, and star.

Example:

aligner:
  tag: "gencode-v25"
  index: "hisat2"
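To make the relationship between organism, aligner and references concrete, here is a minimal Python sketch of the lookup described above. It is not the workflow's own validation code: check_index_configured is a hypothetical helper, the config is a hand-written dictionary mirroring a trimmed version of the RNA-seq example, and for simplicity any indexes list found under the tag is accepted.

```python
def check_index_configured(config, section):
    """Return (organism, tag, index) for config[section], or raise ValueError.

    Simplifying assumption: any "indexes" list under the tag counts,
    whether it sits under "genome" or "transcriptome".
    """
    organism = config["organism"]
    tag = config[section]["tag"]
    index = config[section]["index"]
    try:
        blocks = config["references"][organism][tag]
    except KeyError:
        raise ValueError(
            f"{organism}/{tag} is not in the references section"
        ) from None
    # Gather every index name requested anywhere under this tag.
    configured = {
        name for block in blocks.values() for name in block.get("indexes", [])
    }
    if index not in configured:
        raise ValueError(f"no {index} index configured for {organism}/{tag}")
    return organism, tag, index


# Trimmed-down version of the RNA-seq example config, as a Python dict.
config = {
    "organism": "human",
    "aligner": {"tag": "gencode-v25", "index": "hisat2"},
    "references": {
        "human": {
            "gencode-v25": {
                "genome": {"indexes": ["hisat2", "bowtie2"]},
                "transcriptome": {"indexes": ["salmon"]},
            },
        }
    },
}

print(check_index_configured(config, "aligner"))
```

Changing index to "star" in this dictionary would raise an error, because the example references section does not request a star index for that tag.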

Required for RNA-seq

stranded field

This field specifies the strandedness of the library. This is used by various rules to set the parameters correctly. For example, featureCounts will use -s0, -s1, or -s2 accordingly; kallisto will use --fr-stranded if needed, and so on.

This field can take the following options:

| value | description |
|---|---|
| unstranded | The strand that R1 reads align to has no information about the strand of the gene. |
| fr-firststrand | R1 reads from plus-strand genes align to the minus strand. Also called reverse stranded, dUTP-based. |
| fr-secondstrand | R1 reads from plus-strand genes align to the plus strand. Also called forward stranded. |

Example:

stranded: "fr-firststrand"

Rules that require information about strand will check the config file at run time and raise an error if this field doesn’t exist.
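As an illustration of how a rule might translate this field into tool parameters, the following Python sketch maps each allowed value to the featureCounts -s argument and raises an error when the field is missing. The names FEATURECOUNTS_STRAND_ARGS and featurecounts_strand_arg are hypothetical and not taken from the workflow; the mapping itself follows featureCounts' convention of 0 for unstranded, 1 for forward stranded and 2 for reverse stranded.

```python
# Config value -> featureCounts strand argument.
FEATURECOUNTS_STRAND_ARGS = {
    "unstranded": "-s0",
    "fr-firststrand": "-s2",   # reverse stranded, dUTP-based
    "fr-secondstrand": "-s1",  # forward stranded
}


def featurecounts_strand_arg(config):
    """Return the featureCounts strand argument for the configured library."""
    if "stranded" not in config:
        raise ValueError("config is missing the required 'stranded' field")
    value = config["stranded"]
    if value not in FEATURECOUNTS_STRAND_ARGS:
        allowed = sorted(FEATURECOUNTS_STRAND_ARGS)
        raise ValueError(f"stranded={value!r}; expected one of {allowed}")
    return FEATURECOUNTS_STRAND_ARGS[value]


print(featurecounts_strand_arg({"stranded": "fr-firststrand"}))  # -s2
```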

If you don’t know the strandedness of the library, you can run only the strand_check rule:

snakemake -j 2 strand_check

Or, when using the Slurm wrapper on a cluster:

sbatch ../../include/WRAPPER_SLURM strand_check

This aligns the first 10,000 reads to the specified reference, runs RSeQC’s infer_experiment.py on the results, and then runs MultiQC on just those output files.

When complete, there will be a MultiQC HTML file in the strand_check/ directory that you can inspect to make your choice.

New in version 1.8.

Optional fields

fastq_screen config section

This section configures which Bowtie2 indexes should be used with fastq_screen. It takes the form of a list of dictionaries. Each dictionary has the keys:

  • label: how to label the genome in the output

  • organism: a configured organism. In the example above, there is only a single configured organism, “human”.

  • tag: a configured tag for that organism.

Each entry in the list must have a Bowtie2 index configured to be built.

Example:

fastq_screen:
  - label: Human
    organism: human
    tag: gencode-v25
  - label: rRNA
    organism: human
    tag: rRNA

The above example configures two different indexes to use for fastq_screen: the human gencode-v25 reference, and the human rRNA reference.
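The requirement that every fastq_screen entry has a Bowtie2 index configured can be checked before running the workflow. The sketch below is a hypothetical pre-flight check, not part of the workflow itself: fastq_screen_problems is an invented name, the config is a hand-written dictionary, and it assumes the Bowtie2 index is requested under the tag's genome block, as in the examples on this page.

```python
def fastq_screen_problems(config):
    """Return a message for each fastq_screen entry lacking a bowtie2 index."""
    problems = []
    references = config.get("references", {})
    for entry in config.get("fastq_screen", []):
        organism, tag = entry["organism"], entry["tag"]
        genome = references.get(organism, {}).get(tag, {}).get("genome", {})
        if "bowtie2" not in genome.get("indexes", []):
            problems.append(
                f"{entry['label']}: no bowtie2 index for {organism}/{tag}"
            )
    return problems


config = {
    "fastq_screen": [
        {"label": "Human", "organism": "human", "tag": "gencode-v25"},
        {"label": "rRNA", "organism": "human", "tag": "rRNA"},
    ],
    "references": {
        "human": {
            "gencode-v25": {"genome": {"indexes": ["hisat2", "bowtie2"]}},
            "rRNA": {"genome": {"indexes": ["bowtie2"]}},
        }
    },
}

print(fastq_screen_problems(config))  # [] because both tags request bowtie2
```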

merged_bigwigs config section

This section controls optional merging of signal files in bigWig format. Its format differs between the RNA-seq and ChIP-seq workflows, due to how strands are handled in those workflows.

Here is an RNA-seq example:

merged_bigwigs:
  arbitrary_label_to_use:
    pos:
      - 'sample1'
      - 'sample2'
    neg:
      - 'sample1'
      - 'sample2'

This will result in a single bigWig file called arbitrary_label_to_use.bigwig in the directory data/rnaseq_aggregation/merged_bigwigs (by default; this is configured using config/rnaseq_patterns.yaml). That file merges together both the positive and negative signal strands of two samples, sample1 and sample2. The names “sample1” and “sample2” are sample names defined in the sample table.

In other words, if samples 1 and 2 are replicates for a condition, this gets us a single merged (averaged) track for that condition.

Here’s another RNA-seq example, where we merge the samples again but keep the strands separate. This will result in two output bigwigs.

merged_bigwigs:
  merged_sense:
    sense:
      - 'sample1'
      - 'sample2'
  merged_antisense:
    antisense:
      - 'sample1'
      - 'sample2'

Here is a ChIP-seq example:

merged_bigwigs:
  arbitrary_label_to_use:
    - 'label1'
    - 'label2'

This will result in a single bigWig file called arbitrary_label_to_use.bigwig in the directory data/chipseq_aggregation/merged_bigwigs (by default; this is configured using config/chipseq_patterns.yaml) that merges together the “label1” and “label2” bigwigs.

See Sample tables for more info on the relationship between a sample and a label when working with ChIP-seq.
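The two forms of merged_bigwigs can be compared side by side with a short Python sketch that lists which inputs feed each merged file. This is illustrative only: merged_bigwig_plan is a hypothetical helper, and the output directories are simply the defaults mentioned above passed in as arguments rather than read from the patterns files.

```python
def merged_bigwig_plan(merged_bigwigs, outdir):
    """Map each merged bigWig path to the inputs that are merged into it."""
    plan = {}
    for label, members in merged_bigwigs.items():
        if isinstance(members, dict):
            # RNA-seq form: {strand: [sample, ...]} -> (sample, strand) pairs
            inputs = [
                (sample, strand)
                for strand, samples in members.items()
                for sample in samples
            ]
        else:
            # ChIP-seq form: a flat list of labels
            inputs = list(members)
        plan[f"{outdir}/{label}.bigwig"] = inputs
    return plan


rnaseq = {
    "arbitrary_label_to_use": {
        "pos": ["sample1", "sample2"],
        "neg": ["sample1", "sample2"],
    }
}
chipseq = {"arbitrary_label_to_use": ["label1", "label2"]}

print(merged_bigwig_plan(rnaseq, "data/rnaseq_aggregation/merged_bigwigs"))
print(merged_bigwig_plan(chipseq, "data/chipseq_aggregation/merged_bigwigs"))
```

In the RNA-seq case the single output is built from four per-strand inputs (two samples, two strands); in the ChIP-seq case it is built from the two labels directly.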

RNA-seq-only fields

rrna field

This field selects the reference tag to use for screening rRNA reads. Similar to the aligner field, it takes both a tag key and an index key. The specified index must have been configured to be built for the specified tag. It uses the already configured organism.

Example:

rrna:
  tag: 'rRNA'
  index: 'bowtie2'

gtf field

This field selects the reference tag to use for counting reads in features. The tag must have had an annotation: section pointing to a GTF file specified, as in the examples above; see References config for details.

The organism is inherited from the organism: field.

Example:

gtf:
  tag: "gencode-v25"

salmon field

This field selects the reference tag to use for the Salmon index (if used). The tag must have had a FASTA configured, and an index for “salmon” must have been configured to be built for the organism selected with the organism config option.

ChIP-seq-only fields

chipseq config section

This section configures the peak-calling stage of the ChIP-seq workflow. It currently expects a single key, peak_calling, which is a list of peak-calling runs.

A peak-calling run is a dictionary configuring a single execution of a peak-caller which results in a single BED file of called peaks. A peak-calling run is uniquely described by its label and algorithm. This way, we can use the same label (e.g., gaf-embryo-1) across multiple peak-callers to help organize the output.

The currently-supported peak-callers are macs2, spp, and sicer. They each have corresponding wrappers in the wrappers directory. To add other peak-callers, see Adding a new peak-caller.

The track hubs will include all of these called peaks, which helps with assessing peak-calling performance.

Here is a minimal example of a peak-calling config section. It defines a single peak-calling run using the macs2 algorithm. Note that the ip: and control: keys are lists of labels from the ChIP-seq sample table’s label column, not sample IDs from the first column.

chipseq:
  peak_calling:

    - label: gaf-embryo-1
      algorithm: macs2
      ip:
        - gaf-embryo-1
      control:
        - input-embryo-1

The above peak-calling config will result in a file data/chipseq_peaks/macs2/gaf-embryo-1/peaks.bed (that pattern is defined in chipseq_patterns.yaml if you need to change it).
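Because a run is identified by its label and algorithm together, a quick check for duplicates and a preview of the expected output files can be sketched as follows. This is a hypothetical helper, not the workflow's own code, and the default pattern is an assumption inferred from the single example path above; the real pattern lives in chipseq_patterns.yaml.

```python
from collections import Counter


def peak_calling_outputs(
    config, pattern="data/chipseq_peaks/{algorithm}/{label}/peaks.bed"
):
    """Return one BED path per peak-calling run, rejecting duplicate runs."""
    runs = config["chipseq"]["peak_calling"]
    keys = [(run["label"], run["algorithm"]) for run in runs]
    # A run must be unique by (label, algorithm), or outputs would collide.
    duplicates = [key for key, count in Counter(keys).items() if count > 1]
    if duplicates:
        raise ValueError(f"duplicate peak-calling runs: {duplicates}")
    return [
        pattern.format(label=label, algorithm=algorithm)
        for label, algorithm in keys
    ]


config = {
    "chipseq": {
        "peak_calling": [
            {"label": "gaf-embryo-1", "algorithm": "macs2",
             "ip": ["gaf-embryo-1"], "control": ["input-embryo-1"]},
            {"label": "gaf-embryo-1", "algorithm": "spp",
             "ip": ["gaf-embryo-1"], "control": ["input-embryo-1"]},
        ]
    }
}

for path in peak_calling_outputs(config):
    print(path)
```

Reusing the label gaf-embryo-1 with two different algorithms is fine here; repeating the same label with the same algorithm would raise an error.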

We can specify additional command-line arguments that are passed verbatim to macs2 with the extra: section, for example:

chipseq:
  peak_calling:

    - label: gaf-embryo-1
      algorithm: macs2
      ip:
        - gaf-embryo-1
      control:
        - input-embryo-1
      extra: '--nomodel --extsize 147'

macs2 supports multiple IP and input files, which internally are merged by macs2. We can supply multiple IP and input labels for biological replicates to get a set of peaks called on pooled samples. Note that we give it a different label so it doesn’t overwrite the other peak-calling run we already have configured.

chipseq:
  peak_calling:

    - label: gaf-embryo-1
      algorithm: macs2
      ip:
        - gaf-embryo-1
      control:
        - input-embryo-1
      extra: '--nomodel --extsize 147'


    - label: gaf-embryo-pooled
      algorithm: macs2
      ip:
        - gaf-embryo-1
        - gaf-embryo-2
      control:
        - input-embryo-1
        - input-embryo-2