Testing the installation¶
This section describes how to set up and run the example data, which is useful for verifying that everything is working correctly. It reproduces the steps performed during the automated tests on CircleCI, where you can see the latest test results.
The example run takes up about 360 MB of space and runs in about 15 mins on 2 cores.
Note
The deploy.py script specifically excludes the various test files, so the commands below must be run in a full clone of the repo, not in a directory in which lcdb-wf has been deployed.
Create conda envs¶
This assumes you have set up the bioconda channel properly.
mamba env create -p ./env --file env.yml
mamba env create -p ./env-r --file env-r.yml
We highly recommend using conda for isolating projects and for analysis reproducibility. If you are unfamiliar with conda, we provide a more detailed look at conda and conda envs in lcdb-wf.
Activate the main env¶
Depending on how you have set up conda, either
conda activate ./env
or
source activate ./env
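To confirm that activation worked, you can check that Snakemake resolves to the environment you just created (this assumes, as the workflows require, that the main env provides Snakemake):
which snakemake   # should point into ./env
snakemake --version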
Download example data¶
This will download the example data from our test data repository into the directories workflows/{references,rnaseq,chipseq}/data:
python ci/get-data.py
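As an optional sanity check once the download finishes, confirm that the data directories listed above exist and are populated:
ls workflows/references/data workflows/rnaseq/data workflows/chipseq/data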
A note about test settings¶
Warning
The default configuration assumes a machine with large amounts of RAM. Running the workflows as-is on a single machine with limited RAM may cause all RAM to be consumed! Use run_test.sh as described below to avoid this.
A major benefit of lcdb-wf is that the code undergoes automated testing on CircleCI. However, this test environment only has 2 cores and 2 GB of RAM. To accommodate this, we developed a small, representative test dataset from real-world data, which allows the workflows to run in their entirety in a reasonable time frame. We also needed to adjust specific settings in the workflows; for example, we set the Java VM memory to only 2 GB for Java tools like Picard and FastQC.
We had to make a design decision about the “default” state of the workflows: should they reflect production-ready (high-RAM) settings, or test-ready (low-RAM) settings? We chose to make the defaults the real-world, production-ready settings, because we want to minimize the edits required (and therefore the possibility of introducing errors!) when running on real data.
What this all means is that to run the tests, we need to use the run_test.sh script in each workflow directory to make the adjustments. This script runs a preprocessor, ci/preprocessor.py, which looks for specially-formatted comments in the workflows, swaps out production settings for test settings, and writes the result to a new Snakefile.test file that is then run. In production, especially when running on a cluster, there's no need to do this.
See the docstring in ci/preprocessor.py for details on how this works.
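One way to see exactly what the preprocessor changes is to compare the generated file with the original after a test run (this assumes the workflow's Snakefile uses the standard name Snakefile and that Snakefile.test is still present after run_test.sh finishes):
diff Snakefile Snakefile.test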
The run_test.sh script simply passes all arguments on to Snakemake. Take a look at the script to see what it's doing, and see the examples below for usage.
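Because everything is forwarded to Snakemake, any Snakemake argument can be appended. For example (an illustration only; -p and -k are standard Snakemake flags for printing shell commands and for continuing with independent jobs after a failure):
./run_test.sh -j 2 --use-conda -p -k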
Run the RNA-seq workflow with example data¶
With the lcdb-wf environment activated, change to the RNA-seq workflows directory:
cd workflows/rnaseq
First, run in dry-run mode, which will print out the jobs to be run. The arguments will be described later; this is just to get things running:
./run_test.sh -n --use-conda
If all goes well, you will get lots of output ending with a summary of the number of jobs that will be run. Then use the same command but remove the -n, and optionally include the -j argument to specify the number of cores to use, for example -j 8 if you have 8 cores on your machine (this example just uses 2 cores):
./run_test.sh -j 2 --use-conda
This will take ~15 minutes to run.
Then activate the R environment (this assumes you're still in the workflows/rnaseq subdirectory):
conda activate env-r # or source activate env-r
and run:
./run_downstream_test.sh
After the workflow runs, here are some useful points of interest in the output:
data/rnaseq_samples/*: sample-specific output. For example, individual BAMs and bigWig files can be found here.
data/aggregation/multiqc.html: MultiQC report.
downstream/rnaseq.html: differential expression results generated from running the downstream/rnaseq.Rmd RMarkdown file.
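As a quick sanity check from within workflows/rnaseq, you can confirm that these files were produced (purely optional; this just lists the outputs described above):
ls data/rnaseq_samples/
ls data/aggregation/multiqc.html downstream/rnaseq.html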
See RNA-seq workflow and Configuration for more details.
Run the ChIP-seq workflow with example data¶
To run the ChIP-seq workflow, follow the same steps as above but with the workflow directory updated to workflows/chipseq. The most notable difference here is that the downstream analysis in R (e.g. the rmarkdown::render step) is not run.
Points of interest after running the ChIP-seq workflow:
data/chipseq_samples/*: sample-specific output. Individual BAM files for a sample can be found here.
data/chipseq_merged/*: technical replicates merged and re-deduplicated, or, if there is only one technical replicate, symlinked to the BAM in the samples directory.
data/chipseq_peaks/*: peak-caller output, including BED files of called peaks and bedGraph files of signal as output by each algorithm.
data/chipseq_aggregation/multiqc.html: MultiQC report.
See ChIP-seq workflow for more details.
Exhaustive tests¶
The file .circleci/config.yml configures all of the tests that are run on CircleCI. There's a lot of configuration happening there, but look for the entries that have ./run_test.sh in them to see the commands that are run.
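A quick way to pull out just those commands is a simple search (nothing lcdb-wf-specific here, just grep):
grep -n "run_test.sh" .circleci/config.yml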