Config YAML¶
This page details the various configuration options and describes how to configure a new workflow. Note that the references: section is detailed separately, at References config.
Config files are expected to be in a config directory next to the Snakefile. For example, the RNA-seq workflow at workflows/rnaseq/Snakefile expects the config file workflows/rnaseq/config/config.yaml.
While it is possible to use Snakemake mechanisms such as --config to override a particular config value and --configfile to update the config with a different file, it is easiest to edit the existing config/config.yaml in place. This has the additional benefit of reproducibility, because all of the config information is stored in one place.
The following table summarizes the config fields, which ones are used for which workflow, and under what conditions, if any, they are required. Each option links to a section below with more details on how to use it.
Field | Used for References | Used for RNA-seq | Used for ChIP-seq | Required |
---|---|---|---|---|
references and/or include_references | yes | yes | yes | yes |
references_dir | yes | yes | yes | if REFERENCES_DIR env var not set |
sampletable | . | yes | yes | always |
organism | . | yes | yes | always |
aligner | . | yes | yes | always |
stranded | . | yes | no | usually (see stranded) |
fastq_screen | . | yes | yes | if using fastq_screen |
merged_bigwigs | . | yes | yes | if you want to merge bigwigs |
gtf | . | yes | . | always for RNA-seq |
rrna | . | yes | . | if rRNA screening desired |
salmon | . | yes | . | if Salmon quantification will be run |
chipseq | . | . | yes | always for ChIP-seq |
Example configs¶
To provide an overview, here are some example config files. More detail is provided later; this is just to provide some context:
RNA-seq¶
The config file for RNA-seq is expected to be in workflows/rnaseq/config/config.yaml:
references_dir: "/data/references"
sampletable: "config/sampletable.tsv"
organism: 'human'
aligner:
  tag: 'gencode-v25'
  index: 'hisat2'
rrna:
  tag: 'rRNA'
  index: 'bowtie2'
gtf:
  tag: 'gencode-v25'
fastq_screen:
  - label: Human
    organism: human
    tag: gencode-v25
  - label: rRNA
    organism: human
    tag: rRNA
# Portions have been omitted from "references" section below for
# simplicity; see references config section for details.
references:
  human:
    gencode-v25:
      genome:
        url: 'ftp://.../genome.fa.gz'
        indexes:
          - 'hisat2'
          - 'bowtie2'
      annotation:
        url: 'ftp://.../annotation.gtf.gz'
      transcriptome:
        indexes:
          - 'salmon'
    rRNA:
      genome:
        url: 'https://...'
        indexes:
          - 'bowtie2'
ChIP-seq¶
The config file for ChIP-seq is expected to be in workflows/chipseq/config/config.yaml.
The major differences between ChIP-seq and RNA-seq configs are:
ChIP-seq has no annotation or rrna fields.
ChIP-seq has an additional section, chipseq, with a peak_calling key.
sampletable: 'config/sampletable.tsv'
organism: 'dmel'
genome: 'dm6'
aligner:
  index: 'bowtie2'
  tag: 'test'
chipseq:
  peak_calling:
    - label: gaf-embryo-1
      algorithm: macs2
      ip:
        - gaf-embryo-1
      control:
        - input-embryo-1
    - label: gaf-embryo-1
      algorithm: spp
      ip:
        - gaf-embryo-1
      control:
        - input-embryo-1
    - label: gaf-wingdisc-pooled
      algorithm: macs2
      ip:
        - gaf-wingdisc-1
        - gaf-wingdisc-2
      control:
        - input-wingdisc-1
        - input-wingdisc-2
    - label: gaf-wingdisc-pooled
      algorithm: spp
      ip:
        - gaf-wingdisc-1
        - gaf-wingdisc-2
      control:
        - input-wingdisc-1
        - input-wingdisc-2
    - label: gaf-wingdisc-pooled-1
      algorithm: epic2
      ip:
        - gaf-wingdisc-1
      control:
        - input-wingdisc-1
      extra: ''
    - label: gaf-wingdisc-pooled-2
      algorithm: epic2
      ip:
        - gaf-wingdisc-2
      control:
        - input-wingdisc-2
      extra: ''
fastq_screen:
  - label: Human
    organism: human
    tag: gencode-v25
merged_bigwigs:
  input-wingdisc:
    - input-wingdisc-1
    - input-wingdisc-2
  gaf-wingdisc:
    - gaf-wingdisc-1
    - gaf-wingdisc-2
  gaf-embryo:
    - gaf-embryo-1
# Portions have been omitted from "references" section below for
# simplicity; see references config section for details.
references:
  human:
    gencode-v25:
      genome:
        url: 'ftp://.../genome.fa.gz'
        indexes:
          - 'hisat2'
          - 'bowtie2'
      annotation:
        url: 'ftp://.../annotation.gtf.gz'
  fly:
    test:
      genome:
        url: "https://raw.githubusercontent.com/lcdb/lcdb-test-data/master/data/seq/dm6.small.fa"
        postprocess: 'lib.common.gzipped'
        indexes:
          - 'bowtie2'
          - 'hisat2'
Field descriptions¶
Required for references, RNA-seq and ChIP-seq¶
references¶
This section defines labels for references, where to get FASTA and GTF files and (optionally) post-process them, and which indexes to build.
Briefly, the RNA-seq example above has a single organism configured (“human”). That organism has two tags (“gencode-v25” and “rRNA”).
This is the most complex section and is documented elsewhere (see References config).
include_references¶
This section can be used to supplement the references section with other reference sections stored elsewhere in files. It’s a convenient way of managing a large number of references without cluttering the config file.
See References config for more.
references_dir¶
Top-level directory in which to create references.
If not specified, the environment variable REFERENCES_DIR is used.
If specified and REFERENCES_DIR also exists, REFERENCES_DIR takes precedence.
A shared references directory is useful when multiple people in a group use the same references, because it avoids duplicating commonly-used references. Simply point references_dir to an existing references directory to avoid having to rebuild references.
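The precedence rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the workflow's actual code: the function name resolve_references_dir and its environ parameter are hypothetical, and config is assumed to be the parsed config file as a dictionary.

```python
import os


def resolve_references_dir(config, environ=None):
    """Pick the references directory using the precedence described above.

    Hypothetical helper: REFERENCES_DIR from the environment wins over the
    `references_dir` config field; an error is raised if neither is set.
    """
    environ = os.environ if environ is None else environ
    env_value = environ.get("REFERENCES_DIR")
    if env_value:
        return env_value
    if "references_dir" in config:
        return config["references_dir"]
    raise ValueError(
        "Set references_dir in the config or export REFERENCES_DIR"
    )
```

Passing environ explicitly makes the rule easy to check without modifying the real environment.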
Required for RNA-seq and ChIP-seq¶
sampletable field¶
Path to the sampletable file which, at minimum, lists sample names and paths to FASTQ files. The path is relative to the Snakefile. See Sample tables for more info on the expected contents of the file.
Example:
sampletable: "config/sampletable.tsv"
organism field¶
This field selects the top-level section of the references section that will be used for the analysis. In the RNA-seq example above, “human” is the only organism configured. In the ChIP-seq example, there is “human” as well as “fly”.
Example:
organism: "human"
aligner config section¶
This field has two sub-fields, and automatically uses the configured organism to select the top-level entry in the references section. tag selects the tag from the organism to use, and index selects which aligner index to use. The relevant option from the example above would be “gencode-v25”, which configures both bowtie2 and hisat2 indexes to be built. For RNA-seq we would likely choose “hisat2”; for ChIP-seq “bowtie2”.
Currently-configured options are hisat2, bowtie2, and star.
Example:
aligner:
  tag: "gencode-v25"
  index: "hisat2"
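To make the lookup concrete, here is a minimal Python sketch of how organism, aligner: tag: and aligner: index: could be checked against the references section. It assumes the nested layout shown in the example configs (organism, then tag, then blocks such as genome that may list indexes); the function names configured_indexes and check_aligner are hypothetical and not part of the workflow.

```python
def configured_indexes(config, tag):
    """List every index configured for `tag` under the selected organism."""
    organism = config["organism"]
    tag_block = config["references"][organism][tag]
    indexes = []
    # Blocks such as genome, annotation or transcriptome may each list indexes.
    for block in tag_block.values():
        indexes.extend(block.get("indexes", []))
    return indexes


def check_aligner(config):
    """Return (organism, tag, index) if the aligner index is configured."""
    tag = config["aligner"]["tag"]
    index = config["aligner"]["index"]
    available = configured_indexes(config, tag)
    if index not in available:
        raise ValueError(
            f"Index {index!r} is not configured for tag {tag!r}; "
            f"available: {available}"
        )
    return config["organism"], tag, index
```

With the RNA-seq example config, the aligner section resolves to the hisat2 index of human/gencode-v25, while asking for an index that the tag does not list raises an error.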
Required for RNA-seq¶
stranded field¶
This field specifies the strandedness of the library. This is used by various rules to set the parameters correctly. For example, featureCounts will use -s0, -s1, or -s2 accordingly; kallisto will use --fr-stranded if needed, and so on.
This field can take the following options:
value | description |
---|---|
unstranded | The strand that R1 reads align to has no information about the strand of the gene. |
fr-firststrand | R1 reads from plus-strand genes align to the minus strand. Also called reverse stranded, dUTP-based. |
fr-secondstrand | R1 reads from plus-strand genes align to the plus strand. Also called forward stranded. |
Example:
stranded: "fr-firststrand"
Rules that require information about strand will check the config file at run time and raise an error if this field doesn’t exist.
If you don’t know the strandedness of the library, run the Snakefile in such a way as to only run the strand_check rule:
snakemake -j 2 strand_check
Or, when using the Slurm wrapper on a cluster:
sbatch ../../include/WRAPPER_SLURM strand_check
When complete, there will be a MultiQC HTML file in the strand_check/ directory that you can inspect to make your choice. The strand_check rule aligns the first 10,000 reads to the specified reference, runs RSeQC’s infer_experiment.py on the results, and then runs MultiQC on just those output files.
Added in version 1.8.
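As an illustration of how a rule might translate this field into a tool parameter, the following Python sketch maps each value to a featureCounts strand argument and raises an error when the field is missing. The mapping follows featureCounts' convention (0 unstranded, 1 forward stranded, 2 reverse stranded); the names FEATURECOUNTS_STRAND_ARG and featurecounts_strand_arg are hypothetical and the workflow's own rules may be organized differently.

```python
# Assumed mapping from the `stranded` config value to featureCounts' -s option.
FEATURECOUNTS_STRAND_ARG = {
    "unstranded": "-s0",
    "fr-secondstrand": "-s1",  # forward stranded
    "fr-firststrand": "-s2",   # reverse stranded (dUTP-based)
}


def featurecounts_strand_arg(config):
    """Return the featureCounts strand argument for the configured library."""
    if "stranded" not in config:
        raise ValueError(
            "The config has no 'stranded' field; run the strand_check rule "
            "if the strandedness is unknown"
        )
    value = config["stranded"]
    if value not in FEATURECOUNTS_STRAND_ARG:
        raise ValueError(
            f"Unsupported stranded value {value!r}; "
            f"expected one of {sorted(FEATURECOUNTS_STRAND_ARG)}"
        )
    return FEATURECOUNTS_STRAND_ARG[value]
```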
Optional fields¶
fastq_screen config section¶
This section configures which Bowtie2 indexes should be used with fastq_screen. It takes the form of a list of dictionaries. Each dictionary has the keys:
label: how to label the genome in the output
organism: a configured organism. In the example above, there is only a single configured organism, “human”.
tag: a configured tag for that organism.
Each entry in the list must have a Bowtie2 index configured to be built.
Example:
fastq_screen:
  - label: Human
    organism: human
    tag: gencode-v25
  - label: rRNA
    organism: human
    tag: rRNA
The above example configures two different indexes to use for fastq_screen: the human gencode-v25 reference, and the human rRNA reference.
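The requirement that every fastq_screen entry has a Bowtie2 index configured can be checked up front. The sketch below is a hypothetical validator (fastq_screen_problems is not a workflow function) and assumes that indexes are listed under each tag's genome block, as in the example configs.

```python
def fastq_screen_problems(config):
    """Return a list of messages for fastq_screen entries lacking bowtie2."""
    problems = []
    references = config.get("references", {})
    for entry in config.get("fastq_screen", []):
        organism, tag = entry["organism"], entry["tag"]
        tag_block = references.get(organism, {}).get(tag)
        if tag_block is None:
            problems.append(
                f"{entry['label']}: {organism}/{tag} is not in references"
            )
            continue
        # Assumption: indexes are listed under the tag's genome block.
        indexes = tag_block.get("genome", {}).get("indexes", [])
        if "bowtie2" not in indexes:
            problems.append(
                f"{entry['label']}: no bowtie2 index for {organism}/{tag}"
            )
    return problems
```

An empty list means every entry points at an organism and tag with a Bowtie2 index; otherwise each message names the offending label.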
merged_bigwigs config section¶
This section controls optional merging of signal files in bigWig format. Its format differs depending on RNA-seq or ChIP-seq, due to how strands are handled in those workflows.
Here is an RNA-seq example:
merged_bigwigs:
  arbitrary_label_to_use:
    pos:
      - 'sample1'
      - 'sample2'
    neg:
      - 'sample1'
      - 'sample2'
This will result in a single bigWig file called arbitrary_label_to_use.bigwig in the directory data/rnaseq_aggregation/merged_bigwigs (by default; this is configured using config/rnaseq_patterns.yaml). That file merges together both the positive and negative signal strands of two samples, sample1 and sample2. The names “sample1” and “sample2” are sample names defined in the sample table.
In other words, if samples 1 and 2 are replicates for a condition, this gets us a single merged (averaged) track for that condition.
Here’s another RNA-seq example, where we merge the samples again but keep the strands separate. This will result in two output bigwigs.
merged_bigwigs:
  merged_sense:
    sense:
      - 'sample1'
      - 'sample2'
  merged_antisense:
    antisense:
      - 'sample1'
      - 'sample2'
Here is a ChIP-seq example:
merged_bigwigs:
  arbitrary_label_to_use:
    - 'label1'
    - 'label2'
This will result in a single bigWig file called arbitrary_label_to_use.bigwig in the directory data/chipseq_aggregation/merged_bigwigs (by default; this is configured using config/chipseq_patterns.yaml) that merges together the “label1” and “label2” bigwigs.
See Sample tables for more info on the relationship between a sample and a label when working with ChIP-seq.
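Because the section has two shapes (strand-keyed dictionaries for RNA-seq, plain lists for ChIP-seq), it can help to see both handled by one function. The following Python sketch, with the hypothetical name merged_bigwig_inputs, simply collects the sample names or labels that feed each merged bigWig; it does not reproduce how the workflow itself builds the files.

```python
def merged_bigwig_inputs(merged_bigwigs):
    """Map each merged bigWig label to the sorted names that feed into it."""
    inputs = {}
    for label, spec in merged_bigwigs.items():
        if isinstance(spec, dict):
            # RNA-seq form: strand name -> list of sample names.
            names = [name for strand in spec.values() for name in strand]
        else:
            # ChIP-seq form: a plain list of labels.
            names = list(spec)
        inputs[label] = sorted(set(names))
    return inputs
```

The result can be compared against the sample table to catch misspelled sample names or labels before running the workflow.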
RNA-seq-only fields¶
rrna field¶
This field selects the reference tag to use for screening rRNA reads. Similar to the aligner field, it takes both a tag and an index key. The specified index must have been configured to be built for the specified tag. It uses the already configured organism.
Example:
rrna:
  tag: 'rRNA'
  index: 'bowtie2'
gtf field¶
This field selects the reference tag to use for counting reads in features. The tag must have had a gtf: section specified; see References config for details.
The organism is inherited from the organism: field.
Example:
gtf:
  tag: "gencode-v25"
salmon field¶
This field selects the reference tag to use for the Salmon index (if used). The tag must have had a FASTA configured, and an index for “salmon” must have been configured to be built for the organism selected with the organism config option.
ChIP-seq-only fields¶
chipseq config section¶
This section configures the peak-calling stage of the ChIP-seq workflow. It currently expects a single key, peak_calling, which is a list of peak-calling runs.
A peak-calling run is a dictionary configuring a single execution of a peak-caller which results in a single BED file of called peaks. A peak-calling run is uniquely described by its label and algorithm. This way, we can use the same label (e.g., gaf-embryo-1) across multiple peak-callers to help organize the output.
The currently-supported peak-callers are macs2, spp, and sicer. They each have corresponding wrappers in the wrappers directory. To add other peak-callers, see Adding a new peak-caller.
The track hubs will include all of these called peaks, which helps with assessing the peak-calling performance.
Here is a minimal example of a peak-calling config section. It defines a single peak-calling run using the macs2 algorithm. Note that the ip: and control: keys are lists of labels from the ChIP-seq sample table’s label column, not sample IDs from the first column.
chipseq:
  peak_calling:
    - label: gaf-embryo-1
      algorithm: macs2
      ip:
        - gaf-embryo-1
      control:
        - input-embryo-1
The above peak-calling config will result in a file data/chipseq_peaks/macs2/gaf-embryo-1/peaks.bed (that pattern is defined in chipseq_patterns.yaml if you need to change it).
We can specify additional command-line arguments that are passed verbatim to macs2 with the extra: section, for example:
chipseq:
  peak_calling:
    - label: gaf-embryo-1
      algorithm: macs2
      ip:
        - gaf-embryo-1
      control:
        - input-embryo-1
      extra: '--nomodel --extsize 147'
macs2 supports multiple IP and input files, which internally are merged by macs2. We can supply multiple IP and input labels for biological replicates to get a set of peaks called on pooled samples. Note that we give it a different label so it doesn’t overwrite the other peak-calling run we already have configured.
chipseq:
  peak_calling:
    - label: gaf-embryo-1
      algorithm: macs2
      ip:
        - gaf-embryo-1
      control:
        - input-embryo-1
      extra: '--nomodel --extsize 147'
    - label: gaf-embryo-pooled
      algorithm: macs2
      ip:
        - gaf-embryo-1
        - gaf-embryo-2
      control:
        - input-embryo-1
        - input-embryo-2
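Since each run is identified by its label and algorithm, a quick check for accidental duplicates can save an overwritten result. The Python sketch below is a hypothetical helper (peak_calling_outputs is not part of the workflow) that rejects duplicate label and algorithm pairs and lists the expected BED files. The path template is an assumption generalized from the single data/chipseq_peaks/macs2/gaf-embryo-1/peaks.bed example above; the real pattern lives in chipseq_patterns.yaml.

```python
from collections import Counter


def peak_calling_outputs(config):
    """Return the expected peaks.bed path for each configured run.

    Raises ValueError if two runs share the same (label, algorithm) pair,
    since they would write to the same output file.
    """
    runs = config["chipseq"]["peak_calling"]
    keys = [(run["label"], run["algorithm"]) for run in runs]
    duplicates = [key for key, count in Counter(keys).items() if count > 1]
    if duplicates:
        raise ValueError(f"Duplicate peak-calling runs: {duplicates}")
    # Assumed path template, generalized from the documented example path.
    return [
        f"data/chipseq_peaks/{algorithm}/{label}/peaks.bed"
        for label, algorithm in keys
    ]
```

Reusing a label with a different algorithm is fine, because the algorithm is part of the output path.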