conda and conda envs in lcdb-wf

Conda basics

If you’re not familiar with conda, it is a way of keeping software isolated on a computer in an “environment” (basically a directory with the executables for all the software you want to use). When you “activate” the environment, it places that location at the beginning of your $PATH variable, so that any executables there are found first. It does not affect any existing installation of any software on your machine and does not need root privileges.
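For example, here is a minimal sketch of what that looks like in practice. The environment name and package below are arbitrary examples (samtools requires the Bioconda channel described next):

# create a throwaway environment containing just samtools
conda create -n demo-env samtools

# activate it; its bin/ directory is prepended to $PATH
conda activate demo-env
echo $PATH          # the demo-env location now appears first
which samtools      # resolves to the copy inside demo-env

# deactivate to restore your previous $PATH
conda deactivate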

If you don’t already have conda installed and the Bioconda channel set up, see the Bioconda docs for details.
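At the time of writing, that setup amounts to something like the following one-time channel configuration, but check the Bioconda docs for the current recommendation:

conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict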

You’ll also probably want mamba. Mamba is a drop-in replacement for conda that is faster and more robust. In fact, it is now the default conda front-end for Snakemake. If you don’t already have mamba, you can install it into your base conda environment with:

conda install -n base -c conda-forge mamba

It’s recommended that you install mamba into the base env (just like conda itself) so that, like conda, it is available no matter which environment is active. It does not need to be installed into each individual environment.
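Once installed, mamba accepts the same subcommands as conda, so you can check that it works and then use it anywhere you would use conda:

mamba --version    # confirm mamba is on your $PATH
mamba env list     # same behavior as `conda env list`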

Building the environments

It is recommended that you create a separate environment directory for each project, rather than a single environment for all projects. That way you can update packages in each project independently of the others, and the environment is always close at hand. This is an especially good practice in shared spaces, since others can easily find and activate the environment specific to the project.
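For example, a collaborator working in a shared project directory only needs the path to the environment in order to use exactly the same software (the path below is hypothetical):

# activate a project-specific environment by its path
conda activate /data/shared/myproject/env

# ...run tools...

conda deactivate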

Note

We recommend using mamba rather than conda for the speed increase and its ability to solve environments more reliably. See the Snakemake docs for more info.

If you use the --build-envs argument when deploying lcdb-wf to a project directory (see Setting up a project), two conda environments will be built in that directory: env, which has all of the non-R requirements, and env-r, which has the R packages used in particular for downstream RNA-seq analysis. These environments are built from the fully-pinned requirements in env.yml and env-r.yml. If you’ve already deployed but didn’t use the --build-envs argument, the equivalent commands to run in the deployed directory are:

mamba env create -p ./env --file env.yml
mamba env create -p ./env-r --file env-r.yml
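
Because these environments are created with -p (a path) rather than -n (a name), activate whichever one you need by its path from within the deployed directory:

conda activate ./env      # main environment (non-R requirements)
# or
conda activate ./env-r    # R environment (downstream RNA-seq analysis)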

Troubleshooting environments

Sometimes there is a problem with creating an environment. For example, the exact package specified in the env yaml might not be available for some reason (this should not happen, but in practice sometimes it does in corner cases).

If this happens, there are a couple of things you can try.

First, some terminology for how packages are specified in the environment yamls. Here’s an example for libpng version 1.6.37:

libpng=1.6.37=hed695b0_2
|____| |____| |________|
 |       |       |
name     |       |
       version   |
               build string

The package name (libpng) and version (1.6.37) are pretty standard and self-explanatory. The build string distinguishes different builds of the conda package for the same version of the underlying software (1.6.37 in this case). For example, if a conda package was built for version 1.1 of a tool but the package itself had an error unrelated to the tool, a fixed build would be made: the package version would remain 1.1, but the build string would change.

In this example, the build string contains a hash hed695b0 which is a hash of all the pinned dependencies for this package at packaging time. The conda-forge pinning docs give more detail on what this pinning is about, but basically if that pinning changes then this hash will change. The _2 on the end of the build string hash indicates that this is the third built package (build numbers start at zero) for this version of libpng using the same pinning. In other words, there also likely exists libpng=1.6.37=hed695b0_1 and libpng=1.6.37=hed695b0_0. At the time of this writing, there is also libpng-1.6.37-h21135ba_2 (notice the different hash) which is the same libpng version but uses different pinnings.
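To see which versions and build strings are currently available for a package, you can query the channels directly. For example (the output is a table of name, version, build string, and channel, and will change over time):

conda search 'libpng=1.6.37' --channel conda-forge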

What does this mean for troubleshooting?

For any package that seems to be problematic, try editing the respective environment yaml (e.g., env.yml) to remove the build string (so in the example above, you would change the entry to just libpng=1.6.37) and build the environment again. If that doesn’t work, remove the version as well (leaving just libpng).
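As a sketch, assuming the problematic entry in env.yml is the libpng example from above, the progressively looser entries would look like this:

# fully pinned (original entry):
- libpng=1.6.37=hed695b0_2
# build string removed (first thing to try):
- libpng=1.6.37
# version removed as well (last resort):
- libpng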

Alternatively, for very problematic cases or cases where there are multiple problematic packages, you can try creating an environment from the “loose” pinning in include/requirements.txt, which effectively does not require any particular versions apart from a few corner cases. Keep in mind that using that file may cause the environment to take a while to build, since conda (or mamba) has to solve the dependencies of all the specified packages.
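The commands for that are the same ones used when preparing a release (shown further down); if a partially built env/ or env-r/ directory was left behind by a failed attempt, you may need to remove it first:

mamba create -p ./env --file include/requirements.txt
mamba create -p ./env-r --file include/requirements-r.txt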

Conda envs in lcdb-wf

Given all of the software used across lcdb-wf, the environments can take a long time to build: the solver needs to work out the entire dependency tree and find a solution that satisfies the full set of specified requirements.

We chose to split the conda environments in two: the main environment and the R environment (see Design decisions). Each environment is described by both a “strict” and a “loose” file. By default we use the “strict” version, which pins all packages to exact versions; this is preferred wherever possible. However, we also provide a “loose” version that is not specific about versions. The following table describes these files:

strict version   loose version                loose version
--------------   ---------------------------  --------------------------------
env.yml          include/requirements.txt     Main Snakefiles
env-r.yml        include/requirements-r.txt   Downstream RNA-seq analysis in R

When deploying new instances, use the --build-envs argument, which uses the strict versions. Alternatively, run the following commands in a deployed directory:

mamba env create -p ./env --file env.yml
mamba env create -p ./env-r --file env-r.yml

When getting ready to release a new lcdb-wf version, create fresh environments from the loose versions, run the tests, and once they pass, export the environments to yaml. That is:

# use loose version when preparing a new version of lcdb-wf
mamba create -p ./env --file include/requirements.txt
mamba create -p ./env-r --file include/requirements-r.txt

# then do testing....

# when tests pass, export the envs
conda env export -p ./env > env.yml
conda env export -p ./env-r > env-r.yml

# commit, push, finalize release

Design decisions

We made the design decision to split the conda envs into two different environments – one for R, one for non-R. We found that by removing the entire sub-DAG of R packages from the main environment, we can dramatically reduce the creation time.

We also made the decision to use large top-level environments rather than smaller per-rule environments created with the conda: directive. There are two reasons for this choice. First, activating a single environment gives access to all of the tools used. This streamlines troubleshooting because we don’t have to dig through the .snakemake/conda directory to figure out which hashed environment corresponds to which environment file, at the up-front cost of building the large environment once. Second, it simplifies running the tests on CircleCI: we can cache the env directories as a whole and re-use them across multiple tests, rather than caching the individual .snakemake directories for each tested workflow.

Given that the conda and Snakemake ecosystems are in flux, this may change in the future to using small per-rule conda environments if that turns out to be more beneficial.

Note

Prior to v1.7, we used requirements.txt files with loose pinning. Moving to yaml files gives us the option of also installing pip packages if needed, and lets us specify channels directly in the yaml file for streamlined installation.

Using strictly-pinned yaml files that are consistently tested will hopefully result in a more stable experience for users. For example, if you happen to create an environment around the time of a new R/Bioconductor release, it may not build correctly with loose pinning. Other transient issues in the packaging ecosystem can cause similar problems.