.. _conda-envs:

conda and conda envs in `lcdb-wf`
==================================

Conda basics
------------

If you're not familiar with ``conda``, it is a way of keeping software isolated
on a computer in an "environment" (basically a directory with the executables
for all the software you want to use). When you "activate" the environment, it
places that location at the beginning of your ``$PATH`` variable, so that any
executables there are found first. It does not affect any existing installation
of any software on your machine and does not need root privileges.

If you don't already have conda installed and the Bioconda channel set up, see
the `Bioconda docs `_ for details.

You'll also probably want `mamba `_. Mamba is a drop-in replacement for conda
that is faster and more robust. In fact, it is now the default conda front-end
for Snakemake. If you don't already have mamba, you can install it into your
base conda environment with:

.. code-block:: bash

    conda install -n base -c conda-forge mamba

It's recommended that you install mamba into the base env (just like conda
itself is) so that it behaves like conda. It does *not* need to be installed
into each individual environment.

Building the environments
-------------------------

**It is recommended that you create a separate environment directory for each
project**, rather than a single environment for all projects. That way you can
update packages in each project independently of any others, and yet the
environment will always be close at hand. This is an especially good practice
in shared space as others can easily find and activate the environment specific
to the project.

.. note::

    We recommend using mamba rather than conda for the speed increase and
    ability to more correctly solve environments. See the `snakemake docs `_
    for more info.

If you use the ``--build-envs`` argument when deploying lcdb-wf to a project
directory (see :ref:`setup-proj`), two conda environments will be built in the
directories: ``env``, which has all of the non-R requirements, and ``env-r``
which has the R packages used in particular for downstream RNA-seq analysis.
These environments will use the fully-pinned environments in ``env.yml`` and
``env-r.yml``.

If you've already deployed but didn't use the ``--build-envs`` argument, then
the equivalent commands to run in the deployed directory are:

.. code-block:: bash

    mamba env create -p ./env --file env.yml
    mamba env create -p ./env-r --file env-r.yml

.. _conda-troubleshooting:

Troubleshooting environments
----------------------------

Sometimes there is a problem with creating an environment. For example, the
exact package specified in the env yaml might not be available for some reason
(this should not happen, but in practice sometimes it does in corner cases).

If this happens, you can try a couple things.

First, some terminology with how packages are specified in the environment
yamls. Here's an example for ``libpng`` version 1.6.37::

    libpng=1.6.37=hed695b0_2
    |____| |____| |________|
      |      |        |
     name    |        |
          version     |
                build string

The package name (libpng) and version (1.6.37) are pretty standard and
self-explanatory. The `build` string refers to different built versions of the
*conda package*, but for the same version (1.6.37 in this case) of the package.
For example, if a conda package was built for version 1.1 of a tool, but that
package itself had an error unrelated to the tool, then a fixed build would be
made. The package version would remain the same (1.1) but the build string
would change. In this example, the build string contains a hash ``hed695b0``
which is a hash of all the pinned dependencies for this package at packaging
time.
The `conda-forge pinning docs `_ give more detail on what this pinning is
about, but basically if that pinning changes then this hash will change. The
``_2`` on the end of the build string hash indicates that this is the third
built package (build numbers start at zero) for this version of ``libpng``
using the same pinning. In other words, there likely also exist
``libpng=1.6.37=hed695b0_1`` and ``libpng=1.6.37=hed695b0_0``. At the time of
this writing, there is also ``libpng-1.6.37-h21135ba_2`` (notice the different
hash) which is the same libpng version but uses different pinnings.

What does this mean for troubleshooting? For any package that seems to be
problematic, try editing the respective environment yaml (e.g., ``env.yml``) to
remove the build string (so in the example above, you would try changing it to
just ``libpng=1.6.37``) and try building the environment again. If that doesn't
work, try removing the version as well (so just ``libpng``).

Alternatively for very problematic cases or cases where there are multiple
problematic packages, you can try creating an environment with the "loose"
pinning in ``include/requirements.txt`` which effectively does not require any
particular versions with the exception of a few corner cases. Keep in mind that
using that file may cause the environment to take a while to build as conda (or
mamba) solves the dependencies of all the specified packages.

Conda envs in lcdb-wf
---------------------

Given all of the software used across all of `lcdb-wf`, the environments can
take a lot of time to build because the solver needs to figure out the entire
dependency tree and come up with a solution that works to satisfy the entire
set of specified requirements.

We chose to split the conda environments in two: the **main** environment and
the **R** environment (see :ref:`conda-design-decisions`). These environments
are described by both "strict" and "loose" files.
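
To make the difference between strict and looser pins concrete, here is a
minimal sketch of the two relaxation steps from the troubleshooting section,
applied to the same ``libpng`` spec. It uses only POSIX shell parameter
expansion. The variable names (``spec``, ``no_build``, ``name_only``) are made
up for illustration and are not part of lcdb-wf.

.. code-block:: bash

    # A fully pinned spec: name=version=build_string
    spec="libpng=1.6.37=hed695b0_2"

    # Step 1: drop the build string (remove the shortest trailing "=...")
    no_build="${spec%=*}"

    # Step 2: drop the version as well (remove the longest trailing "=...")
    name_only="${spec%%=*}"

    echo "$no_build"    # prints libpng=1.6.37
    echo "$name_only"   # prints libpng

Either printed form is what you would put in place of the original line in the
environment yaml before trying to build the environment again.
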
By default we use the "strict" version, which pins all versions of all
packages exactly. This is preferred wherever possible. However we also provide
a "loose" version that is not specific about versions. The following table
describes these files:

+----------------+--------------------------------+----------------------------------+
| strict version | loose version                  | used for                         |
+================+================================+==================================+
| ``env.yml``    | ``include/requirements.txt``   | Main Snakefiles                  |
+----------------+--------------------------------+----------------------------------+
| ``env-r.yml``  | ``include/requirements-r.txt`` | Downstream RNA-seq analysis in R |
+----------------+--------------------------------+----------------------------------+

When deploying new instances, use the ``--build-envs`` argument which will use
the strict version. Or use the following commands in a deployed directory:

.. code-block:: bash

    mamba env create -p ./env --file env.yml
    mamba env create -p ./env-r --file env-r.yml

When getting ready to release a new lcdb-wf version, create a new environment
using the loose version to prepare the env and then when tests pass, export it
to yaml. That is:

.. code-block:: bash

    # use loose version when preparing a new version of lcdb-wf
    mamba create -p ./env --file include/requirements.txt
    mamba create -p ./env-r --file include/requirements-r.txt

    # then do testing....

    # when tests pass, export the envs
    conda env export -p ./env > env.yml
    conda env export -p ./env-r > env-r.yml

    # commit, push, finalize release

.. _conda-design-decisions:

Design decisions
----------------

We made the design decision to split the conda envs into two different
environments -- one for R, one for non-R. We found that by removing the entire
sub-DAG of R packages from the main environment we can dramatically reduce the
creation time.


We also made the decision to use large top-level environments rather than
smaller environments created for each rule using the ``conda:`` directive.
There are two reasons for this choice. First, it allows us to activate a single
environment to give us access to all the tools used. This streamlines
troubleshooting because we don't have to dig through the ``.snakemake/conda``
directory to figure out which hash corresponds to which file, but comes with
the up-front cost of creating the environment initially. Second, it simplifies
running the tests on CircleCI, allowing us to cache the env directories as
a whole to be re-used for multiple tests rather than caching the individual
``.snakemake`` directories for each tested workflow.

Given that the conda and snakemake ecosystems are in flux, this may change in
the future to using small conda environments for each rule separately if it
turns out to be more beneficial to do so.

.. note::

    Prior to v1.7, we used requirements.txt files with loose pinning. Moving to
    yaml files allows us the option of also installing pip packages if needed.
    It also allows us to specify channels directly in the yaml file for
    streamlined installation.

    Using strictly-pinned yaml files that are consistently tested will
    hopefully result in a more stable experience for users. For example, if you
    happen to create an environment around the time of a new R/Bioconductor
    release, the environment may not build correctly using a loose pinning.
    Other transient issues in the packaging ecosystem can similarly cause
    issues.