Testing the installation

This section describes how to set up and run the example data, which is useful for verifying that everything is working correctly. It reproduces the steps that are performed during the automated tests on CircleCI, where the latest test results can be viewed.

The example run takes up about 360 MB of disk space and runs in about 15 minutes on 2 cores.

Note

The deploy.py script specifically excludes the various test files, so the commands below must be run in a full clone of the repo, not in a directory in which lcdb-wf has been deployed.

Create conda envs

This assumes you have set up the bioconda channel properly.
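
If you have not yet configured channels, a typical Bioconda setup looks like the following (check the Bioconda documentation for the current recommendation, and adjust if your site already has its own channel configuration):

conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

With the channels configured, create the two environments: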

mamba env create -p ./env --file env.yml
mamba env create -p ./env-r --file env-r.yml

We highly recommend using conda for isolating projects and for analysis reproducibility. If you are unfamiliar with conda, we provide a more detailed look at conda and conda envs in lcdb-wf.

Activate the main env

Depending on how you have set up conda, either

conda activate ./env

or

source activate ./env
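
Either way, you can confirm the environment is active by checking that Snakemake resolves to the copy inside ./env (the exact path depends on where you cloned the repo):

which snakemake
snakemake --version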

Download example data

This will download the example data from our test data repository into the directories workflows/{references,rnaseq,chipseq}/data:

python ci/get-data.py
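
Once it finishes, a quick check that the directories were populated:

ls workflows/{references,rnaseq,chipseq}/data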

A note about test settings

Warning

The default configuration assumes a machine with large amounts of RAM. Running the workflows as-is on a single machine with limited RAM may cause all RAM to be consumed! Use run_test.sh as described below to avoid this.

A major benefit of lcdb-wf is that the code undergoes automated testing on CircleCI. However, that test environment has only 2 cores and 2 GB of RAM. To accommodate this, we developed a small, representative test dataset from real-world data, which allows the workflows to run in their entirety in a reasonable time frame. We also needed to adjust some settings in the workflows; for example, we set the Java VM memory to only 2 GB for Java tools like Picard and FastQC.

We had to make a design decision about the “default” state of the workflows: should the workflows reflect production-ready (high-RAM) settings, or test-ready (low-RAM) settings? We chose to make the defaults the real-world, production-ready settings, because we want to minimize the edits required (and therefore the possibility of introducing errors!) when running on real data.

What this all means is that to run the tests, we need to use the run_test.sh script in each workflow directory to make the adjustments. This script runs a preprocessor, ci/preprocessor.py, which looks for specially-formatted comments in the workflows. It swaps out production settings for test settings and writes the result to a new Snakefile.test file that is then run. In production, especially when running on a cluster, there’s no need to do this.

See the docstring in ci/preprocessor.py for details on how this works.

The run_test.sh script simply passes all arguments on to Snakemake. Take a look at the script to see what it’s doing, and see the examples below for usage.
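
Conceptually, each run_test.sh does something like the following simplified sketch (the exact preprocessor invocation here is an assumption; consult the script itself for the real commands):

python ../../ci/preprocessor.py Snakefile > Snakefile.test   # swap in test settings (assumed invocation)
snakemake -s Snakefile.test "$@"                             # hand all arguments through to Snakemake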

Run the RNA-seq workflow with example data

With the lcdb-wf environment activated, change to the RNA-seq workflows directory:

cd workflows/rnaseq

First, run in dry-run mode, which will print out the jobs to be run. The arguments will be described later; this is just to get things running:

./run_test.sh -n --use-conda

If all goes well, you will get lots of output ending with a summary of the number of jobs that will be run. Then, use the same command but remove the -n, and optionally include the -j argument to specify the number of cores to use, for example -j 8 if you have 8 cores on your machine (this example just uses 2 cores):

./run_test.sh -j 2 --use-conda

This will take ~15 minutes to run.

Then activate the R environment (this assumes you’re still in the workflows/rnaseq subdirectory):

conda activate ../../env-r   # or source activate ../../env-r

and run:

./run_downstream_test.sh
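
This script handles rendering the RMarkdown-based differential expression analysis; very roughly, it is equivalent to something like the following (an assumption, shown only for orientation; see the script for what it actually does):

Rscript -e 'rmarkdown::render("downstream/rnaseq.Rmd")'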

After the workflow runs, here are some useful points of interest in the output:

  • data/rnaseq_samples/*: sample-specific output. For example, individual BAMs and bigWig files can be found here

  • data/aggregation/multiqc.html: MultiQC report.

  • downstream/rnaseq.html: Differential expression results generated from running the downstream/rnaseq.Rmd RMarkdown file.
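
As a quick sanity check that the test run completed, you can confirm those files exist:

ls data/rnaseq_samples/
ls -lh data/aggregation/multiqc.html downstream/rnaseq.html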

See RNA-seq workflow and Configuration for more details.

Run the ChIP-seq workflow with example data

To run the ChIP-seq workflow, follow the same steps as above, but in the workflows/chipseq directory. The most notable difference here is that the downstream analysis in R (e.g. the rmarkdown::render step) is not run.
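
Assuming you are still in workflows/rnaseq, that amounts to:

cd ../chipseq
conda activate ../../env   # reactivate the main env if you switched to env-r above
./run_test.sh -n --use-conda
./run_test.sh -j 2 --use-conda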

Points of interest after running the ChIP-seq workflow:

  • data/chipseq_samples/*: sample-specific output. Individual BAM files for a sample can be found here.

  • data/chipseq_merged/*: technical replicates merged and re-deduped, or if only one tech rep, symlinked to the BAM in the samples directory

  • data/chipseq_peaks/*: peak-caller output, including BED files of called peaks and bedGraph files of signal as output by each algorithm

  • data/chipseq_aggregation/multiqc.html: MultiQC report

See ChIP-seq workflow for more details.

Exhaustive tests

The file .circleci/config.yml configures all of the tests that are run on CircleCI. There’s a lot of configuration happening there, but look for the entries that have ./run_test.sh in them to see the commands that are run.
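
One quick way to find those entries, from the top level of the repo:

grep -n "run_test.sh" .circleci/config.yml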