FAQs¶
This page serves as a catch-all for details of various topics.
Can I run multiple workflows at once?¶
Sometimes. While Snakemake creates a lockfile to prevent multiple instances running in the same directory, the fact that each workflow (rnaseq, chipseq, etc) are in their own subdirectory means that they will each have their own separate lockfiles and can be run.
Be careful though, because both ChIP-seq and RNA-seq workflows include the references workflow. This means that if you have not yet already set up the references, the RNA-seq and ChIP-seq workflows may both attempt to write the references, potentially corrupting it.
How do I handle multiple experiments in the same project?¶
It’s pretty common to have both RNA-seq and ChIP-seq experiments that need to be analyzed together. For example, we might have RNA-seq in two different mouse cell types, RNA-seq in a human cell type, ChIP-seq for different antibodies in those cell types, and all of this needs to be compared with publicly available data (say, other RNA-seq experiments from GEO). We need to make figures for the manuscript, so the figure-making code needs to be re-run if there are any changes to the primary analysis (new samples, parameter changes, etc).
lcdb-wf is designed to handle all of this. There are a couple of limitations to lcdb-wf that will determine how best to split up workflows, and the subsections below will help you decide if you should consider an experiment as part of a different workflow.
If an experiment needs to be considered as part of a different workflow, then
make copies of the relevant workflow directory after deploying. Taking the
above project as an example, immediately after deploying (using --flavor
full
so we get all supported workflows including RNA-seq and ChIP-seq), we
have:
workflows/
chipseq/
colocalization/
external/
figures/
rnaseq/
then we might rename the directory called “external” to match the GEO accession (to make it easier to remember), make copies of the RNA-seq directory for the mouse and human experiments, and clean up a little:
rm -r workflows/colocalization
mv workflows/external workflows/GSE00112233
cp -r workflows/rnaseq workflows/mouse-rnaseq
cp -r workflows/rnaseq workflows/human-rnaseq
rm -r workflows/rnaseq
then we would have:
workflows/
chipseq/
figures/
GSE00112233/
human-rnaseq/
mouse-rnaseq/
See below for advice on when to split experiments into separate workflows.
Samples need the same library layout¶
A single workflow must be either all single-end or all paired-end.
If you have the same RNA-seq samples that were sequenced once SE and again PE,
you’ll need to have two different copies of the workflows/rnaseq
directory
(say, workflows/rnaseq-se
and workflows/rnaseq-pe
). If they are to be
combined in the same differential expression analysis (e.g., by modeling layout
as a batch effect), then the adjust your downstream analysis in R to read in
both counts tables (see global_options for more).
Samples need to use the same parameters¶
There is no mechanism for specifying sample-specific parameters. For example, to use cutadapt to trim some 5’ bases from some samples but leave other samples alone. Samples that need to treated differently should be split off into a separate workflow, and the respective Snakefiles should be edited accordingly.
Note
A partial exception to this is that the peak-calling for ChIP-seq supports specifying custom parameters for each peak-calling run. For example, when running macs2 you can specify “–nomodel” for a single peak-calling run, or any other parameter supported by the peak-caller.
However, the BAM files used in peak-calling still need to have used uniform parameters across samples, so if alignment or trimming options differ, you should split each set of parameters into a different workflow.
Samples must use the same assembly¶
There is no mechanism for specifying per-sample assemblies. Samples from different species need to go in different workflows. If you want to compare, say, hg19 and hg38, then you would make a copy of the workflow for each assembly and adjust the reference config accordingly for each workflow separately.
How are low counts handled during differential expression analysis? Should we use a read-count threshold to filter genes?¶
Low count genes are handled during the normalization and analysis steps of DESeq2
with sophisticated statistical models. Genes with low counts across the board are flagged
as low count outliers, and the p-values are set to NA. Also genes with low counts
are penalized by shrinking the log2FoldChange
estimate. For example, a fold change of
4 that came from 4 reads in the treatment group vs 1 read in the control, will be shrunken,
as opposed to if the treatment had 2000 reads vs 500 in the control. As a result of this
low-count correction, the log2FoldChange
of genes clearing a false-discovery criterion
can be used as a reliable metric for prioritizing candidate genes for follow-up experiments.
In contrast, using an arbitrary fold-change cutoff could introduce biases that potentially
violate modeling assumptions and introduce variables that we could not predict or control for.
So, we do not recommend using count thresholds to filter differential expression analysis
results to determine candidate genes for follow up.
How do I troubleshoot failed jobs?¶
Many rules have an explicit log:
directive that defines where the log is
written. These are typically in the same directory as the output files the rule
creates, and this is the first place to check if something goes wrong.
Some rules do not explicitly redirect to log:
or may only redirect either
stdout or stderr. Where this output ends up depends on if you’re running
locally or on a cluster.
When running locally, stdout and stderr will be included in the output from Snakemake, so check there.
If running on a cluster, the default behavior is to send the main Snakemake
output to Snakefile.log
. The per-rule output depends on how it was sent to
the cluster. As described in the above section, by default stdout and stderr
are sent to the logs
directory, named after rule and job ID.
If a job fails on a cluster:
Open
Snakefile.log
and search forError
Recent versions of Snakemake report the
log:
file (if any) and thecluster_jobid:
. Keep track of these.If
log:
was defined for the rule, check there firstIf not, or if more information is needed, check
logs/<rulename>.{e,o}.<jobid>
(which is how stderr and stdout are configure when running with theinclude/WRAPPER_SLURM
wrapper).
For example, if we find the following error in Snakefile.log
:
[Tue Feb 6 20:06:30 2018] Error in rule rnaseq_rmarkdown:
[Tue Feb 6 20:06:30 2018] jobid: 156
[Tue Feb 6 20:06:30 2018] output: downstream/rnaseq.html
[Tue Feb 6 20:06:30 2018] cluster_jobid: 60894387
Then we would check logs/rnaseq_markdown.e.60894387
and
logs/rnaseq_markdown.o.60894387
for more information.
How do I update my deployment?¶
If there are additional fixes or features in the main lcdb-wf repo that you want to propagate to your existing projects, the best way to do this is to clone a recent version and do the manual diffs between the new version and what you have on disk.
To help narrow down the changes that have happened in the main lcdb-wf repo
since you deplyed to a project, Use the .lcdb-wf-deployment.json
file that
is created when deploying to a project to find the commit hash that the
deployment used.