References workflow

This workflow is not normally run on its own. Rather, it is intended to be pulled into other workflows that depend on reference fastas, indexes, and annotations, using the include: directive. That way, rules in this references workflow are only run for the files requested by the parent workflow.

It is still possible to run this workflow on its own. In that case, it will build all of the references and indexes specified in the config. This can be helpful when setting up the workflows for the first time on a new machine.

In all cases, it requires a references section in config/config.yaml. See References config for details on configuring it.

_images/references.png

A dictionary of references

When run on its own, the references workflow in workflows/references/Snakefile builds all references specified in the config file. This is typically done only when initially setting up a system that will run workflows on many different references.

Most of the time, this workflow is included into the other workflows with the include: directive. This way, any reference files that are needed by, say, the RNA-seq workflow will be created automatically.

The format of the config YAML is designed to be convenient to edit and maintain. That format can be awkward to use within a Snakefile, though, so each workflow converts it into an easier-to-access dictionary, available as c.refdict on the workflow's config object c.

If we have the following references section defined in our config file (see Configuration for more):

references:
  dm6:
    r6-11:
      fasta:
        url: "https://url/to/dm6.fasta"
        indexes:
          - bowtie2
          - hisat2
      gtf:
        url: "https://url/to/dm6.gtf"
        conversions:
          - refflat
    r6-11_transcriptome:
      fasta:
        url: "https://url/to/transcriptome.fa"
        indexes:
          - salmon

then it will be converted to this simplified version where values are filenames:

{
  'dm6': {
     'r6-11': {
         'fasta': '/data/dm6/r6-11/fasta/dm6_r6-11.fasta',
         'refflat': '/data/dm6/r6-11/gtf/dm6_r6-11.refflat',
         'gtf': '/data/dm6/r6-11/gtf/dm6_r6-11.gtf',
         'chromsizes': '/data/dm6/r6-11/fasta/dm6_r6-11.chromsizes',
         'bowtie2': '/data/dm6/r6-11/bowtie2/dm6_r6-11.1.bt2',
         'hisat2': '/data/dm6/r6-11/hisat2/dm6_r6-11.1.ht2',
         },
     'r6-11_transcriptome': {
         'fasta': '/data/dm6/r6-11_transcriptome/fasta/dm6_r6-11_transcriptome.fasta',
         'chromsizes': '/data/dm6/r6-11_transcriptome/fasta/dm6_r6-11_transcriptome.chromsizes',
         'salmon': '/data/dm6/r6-11_transcriptome/salmon/dm6_r6-11_transcriptome/hash.bin',
         },
    },
}
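
To make the mapping from config to filenames concrete, here is a minimal sketch in plain Python that reproduces the example dictionary above. It is not the workflow's actual implementation: the function name build_refdict, the INDEX_SUFFIXES table, and the "/data" base directory are assumptions made for illustration, and the path layout is inferred only from the example paths shown. It also assumes that a chromsizes file always accompanies a fasta, as the example suggests.

```python
import posixpath

# Hypothetical suffix table, inferred only from the example paths above.
INDEX_SUFFIXES = {
    "bowtie2": ".1.bt2",
    "hisat2": ".1.ht2",
    "salmon": "/hash.bin",
}


def build_refdict(references, references_dir):
    """Flatten a nested references config into {organism: {tag: {kind: path}}}."""
    refdict = {}
    for organism, tags in references.items():
        refdict[organism] = {}
        for tag, blocks in tags.items():
            prefix = f"{organism}_{tag}"
            base = posixpath.join(references_dir, organism, tag)
            entry = {}
            for block_type, block in blocks.items():
                # Primary file, e.g. <base>/fasta/dm6_r6-11.fasta
                entry[block_type] = posixpath.join(
                    base, block_type, f"{prefix}.{block_type}"
                )
                if block_type == "fasta":
                    # Assumption: chromsizes sits next to the fasta.
                    entry["chromsizes"] = posixpath.join(
                        base, "fasta", f"{prefix}.chromsizes"
                    )
                for index in block.get("indexes", []):
                    # Each index gets its own directory under the tag.
                    entry[index] = posixpath.join(
                        base, index, prefix + INDEX_SUFFIXES[index]
                    )
                for conversion in block.get("conversions", []):
                    # Conversions are written next to the file they derive from.
                    entry[conversion] = posixpath.join(
                        base, block_type, f"{prefix}.{conversion}"
                    )
            refdict[organism][tag] = entry
    return refdict


# The references section from the example config, as a Python dict.
references = {
    "dm6": {
        "r6-11": {
            "fasta": {
                "url": "https://url/to/dm6.fasta",
                "indexes": ["bowtie2", "hisat2"],
            },
            "gtf": {
                "url": "https://url/to/dm6.gtf",
                "conversions": ["refflat"],
            },
        },
        "r6-11_transcriptome": {
            "fasta": {
                "url": "https://url/to/transcriptome.fa",
                "indexes": ["salmon"],
            },
        },
    },
}

refdict = build_refdict(references, "/data")

# Look up a single file by organism, tag, and kind.
print(refdict["dm6"]["r6-11"]["bowtie2"])

# Every file that a standalone run would need to build.
all_targets = sorted(
    path
    for tags in refdict.values()
    for entry in tags.values()
    for path in entry.values()
)
```

The nested organism, tag, kind lookup is what makes the converted dictionary convenient inside a rule: a parent workflow can request a single path such as the bowtie2 index, while a standalone run can flatten all values into one list of targets.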