References config

The references section defines which genomes, transcriptomes, and annotations to use. It supports arbitrarily many species and assemblies, and supports customizing references for a particular project. For example, there are tools and examples for adding ERCC spike-in controls to references, or adjusting chromosome nomenclature, or removing problematic entries from GTF files, and so on.

Another advantage is that this makes it easy to add multiple genomes to the screening step (which uses fastq_screen).

Specifying a references directory

See references_dir for more information on where the references are built (as well as how and why to adjust this).

Including existing reference configs

We provide a number of pre-configured reference configs for common model organisms. If you just want to use some references configs that work, then put this in your config file:

include_references:
  - '../../include/references_configs'

This will populate the config with the contents of all the files contained in the include/references_configs directory. Any paths provided under the include_references key are relative to the Snakefile using the config. Note that you will still need to inspect the contents of those files to decide which organsim and tag you want to use for your particular experiment (see organism field and aligner config section for more on these fields). For example, if you are working with human RNA-seq data, and you use the above include_references, you may want this in your config:

organism: 'human'
aligner:
  tag: 'gencode-v25'
  index: 'hisat2'
salmon: 'gencode-v25'

The reason for using gencode-v25 is because that tag is configured for the human key in ../../include/references_configs/Homo_sapiens.yaml.

You can provide entire directories of reference configs, a single file, or use the references section below. The prioritization works like this:

  • an organism can show up in multiple configs; if a tag exists for an organism in more than one config, higher-priority configs will overwrite the contents of the tag.

  • directories have lowest priority; when multiple directories are specified the last one has priority

  • files have priority over directories; when multiple files are specified the last one has priority

  • the references:` section always has priority over anything in include_references:.

The remainder of this section of the documentation explains how to customize the references, to add your own or modify the existing examples.

Overview

The references workflow is based on the idea that while each genome’s source files may differ, they can usually be modified to a uniform format. For example, reference files (FASTA, GTF) may come from different providers (Ensembl, FlyBase, UCSC, etc) and have slightly different formatting (strange headers, one big file or a tarball of individual chromosomes, etc), once they are well-formatted they can be used to create a hisat2 index, a bowtie2 index, a list of genes, intergenic regions, and so on without any further customization.

The challenging part is the “well-formatted” part. To solve this, the config file and references building system allows a very flexible specification of how to modify references via a plugin architecture. It works something like this:

  • Each key in the references section refers to an organism.

  • An organism has one or more tags.

  • Each tag has a FASTA file and/or a GTF file associated with it.

  • Each FASTA or GTF specifies one or more URIs from which to download the raw file(s). These can be ftp://, http://, https://, or file:// URIs.

  • An optional postprocess key specifies the import path to a Python module. This is the primary hook for customization, and is described in more detail below.

  • For FASTA files one or more indexes are requested

  • For GTF files, zero or more conversions are requested.

Note

If using a file:// URI, it needs to be gzipped.

It’s probably easiest to show an example config and then describe what’s happening.

Example references config

The following example configures the workflow to:

  • download a fasta file from the GENCODE project for the human genome and build a hisat2 and bowtie2 index

  • download the corresponding GTF file from GENCODE, strip off the dotted version numbers from Ensembl gene and transcript IDs, and create a refFlat format file from it

  • download the SILVA rRNA database and keep only the ribosomal RNA sequence corresponding to Homo sapiens

This example contains sufficient real-world complexity to illustrate the flexibility afforded by the references workflow. It is heavily commented for illustration.

# EXAMPLE REFERENCES CONFIG SECTION

# This configures the directory in which the prepared references will be
# saved (see below for directory structure). If you already have reference
# files saved in the lcdb-wf structure, point this to that directory to
# avoid rebuilding a fresh set of references:

references_dir: 'data/references'

# One of the organisms configured below. We are only configuring a single one
# so "human" is our only option here:

organism: 'human'

# Here we specify which tag under "human" to use for aligning, as well as
# which index we'll be using. This example is RNA-seq, so we'll use HISAT2:

aligner:
  tag: 'gencode-v25'
  index: 'hisat2'

# Top-level section for references:

references:

  # Label for this organism or species:

  human:

    # "gencode-v25" is our tag to describe this particular FASTA and GTF
    # we're preparing:

    gencode-v25:

      # This block will define how to get and postprocess a FASTA file:

      fasta:

        # URL to download:

        url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh38.primary_organism.genome.fa.gz'

        # We can optionally build indexes for various aligners:

        indexes:
          - 'hisat2'
          - 'bowtie2'

      # This next block will define how to get and postprocess a GTF file.
      # The coordinates of the GTF file correspond to the
      # coordinates in the fasta defined above, so we're putting it under
      # the same tag. This is not required; we could also put it under
      # separate tag (perhaps called "gencode-v25-annotations")

      gtf:
        url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz'

        # The GENCODE annotations include the dotted Ensembl versions in
        # the gene IDs. The following function, strip_ensembl_version, is
        # defined in lib/postprocess/hg38.py. It strips off those dotted
        # versions so that our resulting GTF file used by the workflows
        # will not contain them:

        postprocess: 'lib.postprocess.hg38.strip_ensembl_version'

        # Once well-formatted by the postprocessing function, we can now
        # perform standard conversions on the GTF. These conversions are
        # defined as rules in the references Snakefile, and will be run
        # if the conversion is specified here. Here we ask to get a refFlat
        # file, which can be provided to Picard's collectRnaSeqMetrics tool:

        conversions:
          - 'refflat'


    # Here is another tag, to create a FASTA file for ribosomal RNA. It can
    # then be used for fastq_screen, or for the rRNA screening portion of the
    # RNA-seq workflow:

    rRNA:
      fasta:

        # The SILVA database has separate files for large and small subunit
        # sequences. We'd like them all; by providing multiple URLs they will
        # be concatenated:

        url:
          - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_LSURef_tax_silva_trunc.fasta.gz'
          - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz'

        # However, the downloaded files contain many species. Here we only
        # care about human. We already have a function, "filter_fastas()", in
        # lib/common.py that accepts a FASTA and only keeps the records that
        # contain the provided first argument.

        # We specify that first argument here, and it will be passed to that
        # function, resulting in a final FASTA file that only contains the
        # rRNA sequence for Homo sapiens:

        postprocess:
            function: 'lib.common.filter_fastas'
            args: 'Homo sapiens'

        # We only need a bowtie2 index out of it.
        indexes:
            - 'bowtie2'

Without all those comments, it looks like this:

references_dir: 'data/references'
organism: 'human'
aligner:
  tag: 'gencode-v25'
  index: 'hisat2'
references:
  human:
    gencode-v25:
      fasta:
        url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh38.primary_organism.genome.fa.gz'
        indexes:
          - 'hisat2'
          - 'bowtie2'
      gtf:
        url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz'
        postprocess: 'lib.postprocess.hg38.strip_ensembl_version'
        conversions:
          - 'refflat'
    rRNA:
      fasta:
        url:
          - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_LSURef_tax_silva_trunc.fasta.gz'
          - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz'
        postprocess:
            function: 'lib.common.filter_fastas'
            args: 'Homo sapiens'
        indexes:
            - 'bowtie2'

The above file will result in the following directory structure:

data/references/human/gencode-v25/fasta
data/references/human/gencode-v25/bowtie2
data/references/human/gencode-v25/hisat2
data/references/human/gencode-v25/gtf
data/references/human/gencode-v25-transcriptome/fasta
data/references/human/gencode-v25-transcriptome/salmon
data/references/human/rRNA/fasta
data/references/human/rRNA/bowtie2

Each block in the YAML file describes either a fasta or gtf file. Each block has at least the organism, type, and a URL. A block can optionally have a postprocess, which is an arbitrary function (described below) that converts the downloaded URL to something that conforms to the standards of the workflow (also described below). By supplying a tag, we can differentiate between different versions (e.g., FlyBase r6.04 vs r6.11; hg19 vs hg38) or different kinds of postprocessing (e.g, “chr” preprended to chrom names or not; comprehensive annotation vs only coding genes).

fasta blocks can have an optional indexes entry which will build the specified indexes. gtf blocks can have an optional conversions entry which will perform the specified conversion. Available indexes and conversions are described below.

Post processing

All files created by a block are required to be gzipped.

This means that if a URL points to an uncompressed GTF file, a post-processing function must gzip it. It also means that any post-processing functions must write gzipped output files.

Other than that, it’s up to the user to decide what transformations (if any) are required. Examples might include:

  • exluding particular contigs

  • removing or editing problematic genes that have transcripts on both strands – mod(mdg4) I’m looking at you

  • renaming chromosomes (e.g., prepend “chr”)

  • remove unnecessary annotations (e.g., keep only cds/exon/transcript/gene features)

In the example config above, the yeast genome is available as a tarball of separate fasta files, but we’d like to get it into a single fasta file for downstream tools to work with.

The configuration block can define an optional postprocess string which contains a dotted name referring to Python function that is importable by the reference.snakefile workflow. By default, the workflow will find modules in in lib.postprocess directory, so it’s most convenient and organized to put your functions within modules in that directory.

For example, above we used the postprocess function lib.postprocess.sacCer3.fasta_lib.postprocess, and you can view this function in lib/postprocess/sacCer3.py.

Please see lib.common.download_and_postprocess() for more details, and the files in the lib/postproces directory for inspiration.

These two arguments are automatically provided by the references workflow – you don’t have to know or care exactly what the filenames are, just what has to be done to their contents.

See the files in lib/postprocess for inspiration if you need to write your own post-processing functions.

The job of a postprocessing function is to ensure that the fastq/gtf/transcriptome fasta meets the requirements described above and is ready for any intended downstream tasks. For example if we download the fasta file from FlyBase for dm6 but want “chr” prepended to chromosome names, we can create a function in the file dm6.py called add_chr that does this:

# This is dm6.py

from snakemake.shell import shell  # a very convenient function

def add_chr(origfn, newfn):
    shell(
        'zcat {origfn} '       # input is always gzipped
        '| sed "s/>/>chr/g" '  # add chr to names
        '| gzip -c > {newfn} ' # re-zip
        '&& rm {origfn}'       # clean up
    )

We specify this function to be called in the fasta config block like this (note that the module doesn’t have to be the same name as the organism, but it is here for clarity):

dm6:
  fasta:
    url: ...
    postprocess: "dm6.add_chr"

This expects a file dm6.py in the same directory as the references.snakefile workflow, and expects a function add_chr to be defined in that module.

Any downstream rules that operate on the genome FASTA file (like hisat2 index, bowtie2 index, etc) will now use this fixed version with “chr” prepended to chromosome names. In this way, we can apply arbitrary code to modify references to get them into a uniform format.

More advanced postprocessing

If a post-processing function has a keyword argument with starts and ends with a double underscore (__), the config system will assume this is a string that should be interpreted as a dotted function name and the actual function will be resolved and passed to the post-processing function.

This is useful for example when attaching ERCC spike-ins to a reference file that in turn needs to be modified. For example, the S. pombe reference annotations are available as a GFF file, but this needs to be converted to a GTF file. After that, the ERCC spike-in GTF annotations need to be added to the newly-created GTF.

The functions in lib/postprocess/ercc.py support such a use-case. The config looks like this:

genome:
  url:
    # S. pombe fasta
    - 'ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/fasta/schizosaccharomyces_pombe/dna/Schizosaccharomyces_pombe.ASM294v2.dna_sm.toplevel.fa.gz'
    # ERCC fasta
    - 'https://www-s.nist.gov/srmors/certificates/documents/SRM2374_Sequence_v1.FASTA'
  postprocess:
    function: "lib.postprocess.ercc.add_fasta_to_genome"

annotation:
  url:
    # S. pombe GFF, which needs to be converted to GTF
    - 'ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/gff3/schizosaccharomyces_pombe/Schizosaccharomyces_pombe.ASM294v2.41.gff3.gz'

    # ERCC GTF is not available; conversion function needed to convert
    # fasta to GTF
    - 'https://www-s.nist.gov/srmors/certificates/documents/SRM2374_Sequence_v1.FASTA'

  postprocess:
    function: "lib.postprocess.ercc.add_gtf_to_genome"
    kwargs:
      # As per the docs for add_gtf_to_genome, this function will be
      # applied to all but the last input file. It is specified as a string
      # here, but the config-processing system will resolve this to the
      # actual function and pass that along to add_gtf_to_genome
      __preprocess__: "lib.common.gff2gtf"

Added in version 1.7: Ability to use special __-prefixed variables that are interpreted as dotted-path functions to import.

Locations of downloaded-and-post-processed FASTA and GTF files

Generally speaking, the fasta and gtf files will be in:

{references_dir}/{organism}/{tag}/fasta/{organism}_{tag}.fasta
{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.gtf

If a config file looks like this (simplified here for clarity):

references_dir: refs
references:
  human:
    hg38:
      fasta: ...
      gtf: ...

Then the following files will be created:

refs/human/hg38/fasta/human_hg38.fasta
refs/human/hg38/gtf/human_hg38.gtf

If you are running the references workflow directly, or it is included in another workflow that requests a chromsizes file, the following will also be created:

refs/human/hg38/fasta/human_hg38.chromsizes

Note

URLs are expected to be gzipped and any postprocessing functions are expected to output gzipped files. This is because it is most common for providers to offer gzipped reference files, and therefore minimizes the effort required to prepare fasta and gtf files. However, not all downstream tools handle gzipped input. The references workflow therefore stores only the uncompressed versions. We consider the resulting configuration simplicity to be worth the additional space and time cost.

Available indexes and conversions

The following indexes can be currently be specified for fasta files:

hisat2

indexes:
  - hisat2

Output files:

{references_dir}/{organism}/{tag}/hisat2/{organism}_{tag}.*.ht2

bowtie2

indexes:
  - bowtie2

Output files:

{references_dir}/{organism}/{tag}/bowtie2/{organism}_{tag}.*.bt2

salmon

indexes:
  - salmon

Output files:

{references_dir}/{organism}/{tag}/salmon/{organism}_{tag}/*

The following conversions can be specified for GTF files:

refflat

conversions:
  - refflat

Converts GTF to refFlat format. See the conversion_refflat rule in workflows/references/Snakefile.

Output file:

{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.refflat

bed12

conversions:
   - bed12

Converts GTF to BED12 format. See the conversion_bed12 rule in workflows/references/Snakefile.

Output file:

{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.refflat

gffutils

Converts GTF to gffutils database (typically used for downstream work). You can specify arbitrary kwargs to gffutils.create_db by including them as keys. For example, if the GTF file already contains features for genes and transcripts:

conversions:
  - gffutils:
      disable_infer_genes: True
      disable_infer_transcripts: True

Output file:

{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.gtf.db

genelist

Reads the postprocessed GTF file, and extracts the set of gene IDs found, one ID per line. The GTF attribute to use is configured by the gene_id: key, for example, if the file contains gene IDs in the Name attribute of each line, use the following:

conversions:
  - genelist:
      gene_id: 'Name'

Output file:

{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.genelist

mappings

Reads the postprocesses GTF file, and outputs mappings between attributes as a gzipped TSV.

You can include/exclude featuretypes from being checked. For example, if your GTF has genes and transcripts in addition to exons, the gene and transcript lines probably contain all of the attributes you are interested in (like gene_id, symbol, name, etc) and the exon (and any other lines) can be ignored, speeding up the process. In this case you could use include_featuretypes: [gene, transcript].

A __featuretype__ column is always included in the mapping. This is the GTF featuretype of each line, with extra __ to avoid overwriting an attribute that may happen to be called featuretype.

conversions:
  - mappings
conversions:
  - mappings:
      include_featuretypes: [gene, transcript]

Output file:

{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.mapping.tsv.gz