.. _references-config:

References config
=================

The references section defines which genomes, transcriptomes, and annotations
to use. It supports arbitrarily many species and assemblies, and supports
customizing references for a particular project. For example, there are tools
and examples for adding ERCC spike-in controls to references, or adjusting
chromosome nomenclature, or removing problematic entries from GTF files, and so
on.

Another advantage is that this makes it easy to add multiple genomes to the
screening step (which uses fastq_screen).

Specifying a references directory
---------------------------------

See :ref:`cfg-references-dir` for more information on where the references are
built (as well as how and why to adjust this).

Including existing reference configs
------------------------------------
We provide a number of pre-configured reference configs for common model
organisms. If you just want to use some references configs that work, then put
this in your config file:

.. code-block:: yaml

    include_references:
      - '../../include/references_configs'

This will populate the config with the contents of all the files contained in
the ``include/references_configs`` directory. Any paths provided under the
``include_references`` key are relative to the Snakefile using the config.
**Note that you will still need to inspect the contents** of those files to
decide which organsim and tag you want to use for your particular experiment
(see :ref:`cfg-organism` and :ref:`cfg-aligner` for more on these fields). For
example, if you are working with human RNA-seq data, and you use the above
``include_references``, you may want this in your config:

.. code-block:: yaml

    organism: 'human'
    aligner:
      tag: 'gencode-v25'
      index: 'hisat2'
    salmon: 'gencode-v25'

The reason for using ``gencode-v25`` is because that tag is configured for the
``human`` key in ``../../include/references_configs/Homo_sapiens.yaml``.

You can provide entire directories of reference configs, a single file, or use
the references section below. The prioritization works like this:

- an organism can show up in multiple configs; if a tag exists for an organism
  in more than one config, higher-priority configs will overwrite the contents
  of the tag.
- directories have lowest priority; when multiple directories are specified the
  last one has priority
- files have priority over directories; when multiple files are specified the
  last one has priority
- the ``references:``` section always has priority over anything in
  ``include_references:``.

The remainder of this section of the documentation explains how to customize
the references, to add your own or modify the existing examples.

Overview
--------
The references workflow is based on the idea that while each genome's source
files may differ, they can usually be modified to a uniform format. For
example, reference files (FASTA, GTF) may come from different providers
(Ensembl, FlyBase, UCSC, etc) and have slightly different formatting (strange
headers, one big file or a tarball of individual chromosomes, etc), once they
are well-formatted they can be used to create a hisat2 index, a bowtie2 index,
a list of genes, intergenic regions, and so on without any further
customization.

The challenging part is the "well-formatted" part. To solve this, the config
file and references building system allows a very flexible specification of how
to modify references via a plugin architecture. It works something like this:

- Each key in the references section refers to an **organism**.
- An organism has one or more **tags**.
- Each **tag** has a FASTA file and/or a GTF file associated with it.
- Each FASTA or GTF specifies one or more URIs from which to download the raw
  file(s). These can be `ftp://`, `http://`, `https://`, or `file://` URIs.
- An optional **postprocess** key specifies the import path to a Python module.
  This is the primary hook for customization, and is described in more detail
  below.
- For FASTA files one or more **indexes** are requested
- For GTF files, zero or more **conversions** are requested.

.. note::

    If using a ``file://`` URI, it needs to be gzipped.

It's probably easiest to show an example config and then describe what's
happening.

Example references config
-------------------------

The following example configures the workflow to:

- download a fasta file from the GENCODE project for the human genome and build
  a hisat2 and bowtie2 index
- download the corresponding GTF file from GENCODE, strip off the dotted
  version numbers from Ensembl gene and transcript IDs, and create a refFlat
  format file from it
- download the SILVA rRNA database and keep only the ribosomal RNA sequence
  corresponding to *Homo sapiens*

This example contains sufficient real-world complexity to illustrate the
flexibility afforded by the references workflow. It is heavily commented for
illustration.

.. code-block:: yaml

    # EXAMPLE REFERENCES CONFIG SECTION

    # This configures the directory in which the prepared references will be
    # saved (see below for directory structure). If you already have reference
    # files saved in the lcdb-wf structure, point this to that directory to
    # avoid rebuilding a fresh set of references:

    references_dir: 'data/references'

    # One of the organisms configured below. We are only configuring a single one
    # so "human" is our only option here:

    organism: 'human'

    # Here we specify which tag under "human" to use for aligning, as well as
    # which index we'll be using. This example is RNA-seq, so we'll use HISAT2:

    aligner:
      tag: 'gencode-v25'
      index: 'hisat2'

    # Top-level section for references:

    references:

      # Label for this organism or species:

      human:

        # "gencode-v25" is our tag to describe this particular FASTA and GTF
        # we're preparing:

        gencode-v25:

          # This block will define how to get and postprocess a FASTA file:

          fasta:

            # URL to download:

            url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh38.primary_organism.genome.fa.gz'

            # We can optionally build indexes for various aligners:

            indexes:
              - 'hisat2'
              - 'bowtie2'

          # This next block will define how to get and postprocess a GTF file.
          # The coordinates of the GTF file correspond to the
          # coordinates in the fasta defined above, so we're putting it under
          # the same tag. This is not required; we could also put it under
          # separate tag (perhaps called "gencode-v25-annotations")

          gtf:
            url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz'

            # The GENCODE annotations include the dotted Ensembl versions in
            # the gene IDs. The following function, strip_ensembl_version, is
            # defined in lib/postprocess/hg38.py. It strips off those dotted
            # versions so that our resulting GTF file used by the workflows
            # will not contain them:

            postprocess: 'lib.postprocess.hg38.strip_ensembl_version'

            # Once well-formatted by the postprocessing function, we can now
            # perform standard conversions on the GTF. These conversions are
            # defined as rules in the references Snakefile, and will be run
            # if the conversion is specified here. Here we ask to get a refFlat
            # file, which can be provided to Picard's collectRnaSeqMetrics tool:

            conversions:
              - 'refflat'


        # Here is another tag, to create a FASTA file for ribosomal RNA. It can
        # then be used for fastq_screen, or for the rRNA screening portion of the
        # RNA-seq workflow:

        rRNA:
          fasta:

            # The SILVA database has separate files for large and small subunit
            # sequences. We'd like them all; by providing multiple URLs they will
            # be concatenated:

            url:
              - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_LSURef_tax_silva_trunc.fasta.gz'
              - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz'

            # However, the downloaded files contain many species. Here we only
            # care about human. We already have a function, "filter_fastas()", in
            # lib/common.py that accepts a FASTA and only keeps the records that
            # contain the provided first argument.

            # We specify that first argument here, and it will be passed to that
            # function, resulting in a final FASTA file that only contains the
            # rRNA sequence for Homo sapiens:

            postprocess:
                function: 'lib.common.filter_fastas'
                args: 'Homo sapiens'

            # We only need a bowtie2 index out of it.
            indexes:
                - 'bowtie2'

Without all those comments, it looks like this:

.. code-block:: yaml

    references_dir: 'data/references'
    organism: 'human'
    aligner:
      tag: 'gencode-v25'
      index: 'hisat2'
    references:
      human:
        gencode-v25:
          fasta:
            url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh38.primary_organism.genome.fa.gz'
            indexes:
              - 'hisat2'
              - 'bowtie2'
          gtf:
            url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz'
            postprocess: 'lib.postprocess.hg38.strip_ensembl_version'
            conversions:
              - 'refflat'
        rRNA:
          fasta:
            url:
              - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_LSURef_tax_silva_trunc.fasta.gz'
              - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz'
            postprocess:
                function: 'lib.common.filter_fastas'
                args: 'Homo sapiens'
            indexes:
                - 'bowtie2'


The above file will result in the following directory structure::

    data/references/human/gencode-v25/fasta
    data/references/human/gencode-v25/bowtie2
    data/references/human/gencode-v25/hisat2
    data/references/human/gencode-v25/gtf
    data/references/human/gencode-v25-transcriptome/fasta
    data/references/human/gencode-v25-transcriptome/salmon
    data/references/human/rRNA/fasta
    data/references/human/rRNA/bowtie2

Each block in the YAML file describes either a `fasta` or `gtf` file. Each
block has at least the organism, type, and a URL.  A block can optionally have
a `postprocess`, which is an arbitrary function (described below) that converts
the downloaded URL to something that conforms to the standards of the workflow
(also described below). By supplying a tag, we can differentiate between
different versions (e.g., FlyBase r6.04 vs r6.11; hg19 vs hg38) or different
kinds of postprocessing (e.g, "chr" preprended to chrom names or not;
comprehensive annotation vs only coding genes).

`fasta` blocks can have an optional  `indexes` entry which will build the
specified indexes. `gtf` blocks can have an optional `conversions` entry which
will perform the specified conversion. Available indexes and conversions are
described below.


Post processing
---------------

**All files created by a block are required to be gzipped.**

This means that if a URL points to an uncompressed GTF file, a post-processing
function must gzip it. It also means that any post-processing functions must
write gzipped output files.

Other than that, it's up to the user to decide what transformations (if any)
are required. Examples might include:

- exluding particular contigs
- removing or editing problematic genes that have transcripts on both strands
  -- mod(mdg4) I'm looking at you
- renaming chromosomes (e.g., prepend "chr")
- remove unnecessary annotations (e.g., keep only cds/exon/transcript/gene features)

In the example config above, the yeast genome is available as a tarball of
separate fasta files, but we'd like to get it into a single fasta file for
downstream tools to work with.

The configuration block can define an optional `postprocess` string which
contains a dotted name referring to Python function that is importable by the
`reference.snakefile` workflow. By default, the workflow will find modules in
in ``lib.postprocess`` directory, so it's most convenient and organized to put
your functions within modules in that directory.

For example, above we used the postprocess function
``lib.postprocess.sacCer3.fasta_lib.postprocess``, and you can view this
function in ``lib/postprocess/sacCer3.py``.

Please see :func:`lib.common.download_and_postprocess` for more details, and
the files in the ``lib/postproces`` directory for inspiration.

These two arguments are automatically provided by the references workflow --
you don't have to know or care exactly what the filenames are, just what has to
be done to their contents.

See the files in ``lib/postprocess`` for inspiration if you need to write your
own post-processing functions.

The job of a postprocessing function is to ensure that the
fastq/gtf/transcriptome fasta meets the requirements described above and is
ready for any intended downstream tasks. For example if we download the fasta
file from FlyBase for dm6 but want "chr" prepended to chromosome names, we can
create a function in the file ``dm6.py`` called ``add_chr`` that does
this:

.. code-block:: python

    # This is dm6.py

    from snakemake.shell import shell  # a very convenient function

    def add_chr(origfn, newfn):
        shell(
            'zcat {origfn} '       # input is always gzipped
            '| sed "s/>/>chr/g" '  # add chr to names
            '| gzip -c > {newfn} ' # re-zip
            '&& rm {origfn}'       # clean up
        )

We specify this function to be called in the fasta config block like this (note
that the module doesn't have to be the same name as the organism, but it is
here for clarity):

.. code-block:: yaml

    dm6:
      fasta:
        url: ...
        postprocess: "dm6.add_chr"

This expects a file ``dm6.py`` in the same directory as the
`references.snakefile` workflow, and expects a function ``add_chr`` to
be defined in that module.

Any downstream rules that operate on the genome FASTA file (like hisat2 index,
bowtie2 index, etc) will now use this fixed version with "chr" prepended to
chromosome names.  In this way, we can apply arbitrary code to modify
references to get them into a uniform format.


.. _advanced-postprocessing:

More advanced postprocessing
----------------------------

If a post-processing function has a keyword argument with starts and ends with
a double underscore (``__``), the config system will assume this is a string
that should be interpreted as a dotted function name and the actual function
will be resolved and passed to the post-processing function.

This is useful for example when attaching ERCC spike-ins to a reference file
that in turn needs to be modified. For example, the `S. pombe` reference
annotations are available as a GFF file, but this needs to be converted to
a GTF file. After that, the ERCC spike-in GTF annotations need to be added to
the newly-created GTF.

The functions in ``lib/postprocess/ercc.py`` support such a use-case. The
config looks like this:

.. code-block:: yaml

    genome:
      url:
        # S. pombe fasta
        - 'ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/fasta/schizosaccharomyces_pombe/dna/Schizosaccharomyces_pombe.ASM294v2.dna_sm.toplevel.fa.gz'
        # ERCC fasta
        - 'https://www-s.nist.gov/srmors/certificates/documents/SRM2374_Sequence_v1.FASTA'
      postprocess:
        function: "lib.postprocess.ercc.add_fasta_to_genome"

    annotation:
      url:
        # S. pombe GFF, which needs to be converted to GTF
        - 'ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/gff3/schizosaccharomyces_pombe/Schizosaccharomyces_pombe.ASM294v2.41.gff3.gz'

        # ERCC GTF is not available; conversion function needed to convert
        # fasta to GTF
        - 'https://www-s.nist.gov/srmors/certificates/documents/SRM2374_Sequence_v1.FASTA'

      postprocess:
        function: "lib.postprocess.ercc.add_gtf_to_genome"
        kwargs:
          # As per the docs for add_gtf_to_genome, this function will be
          # applied to all but the last input file. It is specified as a string
          # here, but the config-processing system will resolve this to the
          # actual function and pass that along to add_gtf_to_genome
          __preprocess__: "lib.common.gff2gtf"

.. versionadded:: 1.7
    Ability to use special ``__``-prefixed variables that are interpreted as
    dotted-path functions to import.

Locations of downloaded-and-post-processed FASTA and GTF files
--------------------------------------------------------------
Generally speaking, the fasta and gtf files will be in::

    {references_dir}/{organism}/{tag}/fasta/{organism}_{tag}.fasta
    {references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.gtf

If a config file looks like this (simplified here for clarity):

.. code-block:: yaml

  references_dir: refs
  references:
    human:
      hg38:
        fasta: ...
        gtf: ...

Then the following files will be created::

    refs/human/hg38/fasta/human_hg38.fasta
    refs/human/hg38/gtf/human_hg38.gtf


If you are running the references workflow directly, or it is included in
another workflow that requests a chromsizes file, the following will also be
created::

    refs/human/hg38/fasta/human_hg38.chromsizes

.. note::

  URLs are expected to be gzipped and any postprocessing functions are
  expected to output gzipped files. This is because it is most common for
  providers to offer gzipped reference files, and therefore minimizes the
  effort required to prepare fasta and gtf files.  However, not all downstream
  tools handle gzipped input. The references workflow therefore stores only the
  uncompressed versions. We consider the resulting configuration simplicity to
  be worth the additional space and time cost.


Available indexes and conversions
---------------------------------
The following indexes can be currently be specified for fasta files:

hisat2
^^^^^^

    .. code-block:: yaml

        indexes:
          - hisat2

    Output files::

      {references_dir}/{organism}/{tag}/hisat2/{organism}_{tag}.*.ht2

bowtie2
^^^^^^^

    .. code-block:: yaml

        indexes:
          - bowtie2

    Output files::

      {references_dir}/{organism}/{tag}/bowtie2/{organism}_{tag}.*.bt2

salmon
^^^^^^

    .. code-block:: yaml

        indexes:
          - salmon

    Output files::

      {references_dir}/{organism}/{tag}/salmon/{organism}_{tag}/*

The following conversions can be specified for GTF files:

refflat
^^^^^^^

    .. code-block:: yaml

        conversions:
          - refflat

    Converts GTF to refFlat format. See the ``conversion_refflat`` rule in
    ``workflows/references/Snakefile``.

    Output file::

      {references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.refflat

bed12
^^^^^

    .. code-block:: yaml

        conversions:
           - bed12

   Converts GTF to BED12 format. See the ``conversion_bed12`` rule in
   ``workflows/references/Snakefile``.

   Output file::

      {references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.refflat

gffutils
^^^^^^^^
    Converts GTF to gffutils database (typically used for downstream work). You
    can specify arbitrary kwargs to ``gffutils.create_db`` by including them as
    keys. For example, if the GTF file already contains features for genes and
    transcripts:

    .. code-block:: yaml

        conversions:
          - gffutils:
              disable_infer_genes: True
              disable_infer_transcripts: True


    Output file::

        {references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.gtf.db

genelist
^^^^^^^^
    Reads the postprocessed GTF file, and extracts the set of gene IDs found,
    one ID per line. The GTF attribute to use is configured by the
    ``gene_id:`` key, for example, if the file contains gene IDs in the
    ``Name`` attribute of each line, use the following:

    .. code-block:: yaml

        conversions:
          - genelist:
              gene_id: 'Name'

    Output file::

      {references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.genelist

mappings
^^^^^^^^
    Reads the postprocesses GTF file, and outputs mappings between attributes
    as a gzipped TSV.

    You can include/exclude featuretypes from being checked.  For example, if
    your GTF has genes and transcripts in addition to exons, the gene and
    transcript lines probably contain all of the attributes you are interested
    in (like gene_id, symbol, name, etc) and the exon (and any other lines) can
    be ignored, speeding up the process. In this case you could use
    ``include_featuretypes: [gene, transcript]``.

    A ``__featuretype__`` column is always included in the mapping.  This is
    the GTF featuretype of each line, with extra ``__`` to avoid overwriting an
    attribute that may happen to be called ``featuretype``.

    .. code-block:: yaml

        conversions:
          - mappings

    .. code-block:: yaml

        conversions:
          - mappings:
              include_featuretypes: [gene, transcript]

    Output file::

      {references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.mapping.tsv.gz