References config¶
The references section defines which genomes, transcriptomes, and annotations to use. It supports arbitrarily many species and assemblies, and makes it straightforward to customize references for a particular project. For example, there are tools and examples for adding ERCC spike-in controls to references, adjusting chromosome nomenclature, removing problematic entries from GTF files, and so on.
Another advantage is that this makes it easy to add multiple genomes to the screening step (which uses fastq_screen).
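For example, screening reads against more than one genome is just a matter of configuring more than one organism. The following is an illustrative sketch of the references side only; the organism names, tags, and URLs are placeholders rather than entries from the included configs, and each organism requests a bowtie2 index (the index type the rRNA example later in this section builds for fastq_screen):

references:
  human:
    default:
      fasta:
        url: 'file:///path/to/human.fa.gz'   # placeholder; must be gzipped
        indexes:
          - 'bowtie2'
  ecoli:
    default:
      fasta:
        url: 'file:///path/to/ecoli.fa.gz'   # placeholder; must be gzipped
        indexes:
          - 'bowtie2'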
Specifying a references directory¶
See references_dir for more information on where the references are built (as well as how and why to adjust this).
Including existing reference configs¶
We provide a number of pre-configured reference configs for common model organisms. If you just want to use some reference configs that work, then put this in your config file:
include_references:
  - '../../include/references_configs'
This will populate the config with the contents of all the files contained in the include/references_configs directory. Any paths provided under the include_references key are relative to the Snakefile using the config.
Note that you will still need to inspect the contents of those files to decide which organism and tag you want to use for your particular experiment (see the organism field and aligner config section for more on these fields). For example, if you are working with human RNA-seq data and you use the above include_references, you may want this in your config:
organism: 'human'
aligner:
  tag: 'gencode-v25'
  index: 'hisat2'
salmon: 'gencode-v25'
The reason for using gencode-v25 is that this tag is configured for the human key in ../../include/references_configs/Homo_sapiens.yaml.
You can provide entire directories of reference configs, a single file, or use the references section below. The prioritization works like this:
an organism can show up in multiple configs; if a tag exists for an organism in more than one config, higher-priority configs will overwrite the contents of the tag.
directories have lowest priority; when multiple directories are specified the last one has priority
files have priority over directories; when multiple files are specified the last one has priority
the references: section always has priority over anything in include_references (the sketch below illustrates these rules).
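As a concrete sketch of these rules (the my_overrides.yaml filename and the file:// URL below are hypothetical), the directory is loaded first, the single file then overrides any tags it redefines, and the inline references: section wins for anything it defines:

include_references:
  # Lowest priority: every config file in this directory
  - '../../include/references_configs'
  # Higher priority: a single (hypothetical) file overriding tags from the directory
  - 'my_overrides.yaml'
references:
  human:
    # Highest priority: this tag definition overrides both of the above
    gencode-v25:
      gtf:
        url: 'file:///path/to/custom.annotation.gtf.gz'   # placeholder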
The remainder of this section explains how to customize the references, whether by adding your own or by modifying the existing examples.
Overview¶
The references workflow is based on the idea that while each genome’s source files may differ, they can usually be modified into a uniform format. For example, reference files (FASTA, GTF) may come from different providers (Ensembl, FlyBase, UCSC, etc) and have slightly different formatting (strange headers, one big file or a tarball of individual chromosomes, etc). Once they are well-formatted, they can be used to create a hisat2 index, a bowtie2 index, a list of genes, intergenic regions, and so on without any further customization.
The challenging part is the “well-formatted” part. To solve this, the config file and references-building system allow a very flexible specification of how to modify references via a plugin architecture. It works something like this:
Each key in the references section refers to an organism.
An organism has one or more tags.
Each tag has a FASTA file and/or a GTF file associated with it.
Each FASTA or GTF specifies one or more URIs from which to download the raw file(s). These can be ftp://, http://, https://, or file:// URIs.
An optional postprocess key specifies the import path to a Python module. This is the primary hook for customization, and is described in more detail below.
For FASTA files, one or more indexes are requested.
For GTF files, zero or more conversions are requested.
Note
If using a file:// URI, the file it points to must be gzipped.
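For example, a locally prepared genome can be used instead of a download, as long as the file has been gzipped first (a minimal sketch; the path is a placeholder):

fasta:
  url: 'file:///data/local/my_genome.fa.gz'   # local file, already gzipped
  indexes:
    - 'bowtie2'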
It’s probably easiest to show an example config and then describe what’s happening.
Example references config¶
The following example configures the workflow to:
download a fasta file from the GENCODE project for the human genome and build a hisat2 and bowtie2 index
download the corresponding GTF file from GENCODE, strip off the dotted version numbers from Ensembl gene and transcript IDs, and create a refFlat format file from it
download the SILVA rRNA database and keep only the ribosomal RNA sequence corresponding to Homo sapiens
This example contains sufficient real-world complexity to illustrate the flexibility afforded by the references workflow. It is heavily commented for illustration.
# EXAMPLE REFERENCES CONFIG SECTION

# This configures the directory in which the prepared references will be
# saved (see below for directory structure). If you already have reference
# files saved in the lcdb-wf structure, point this to that directory to
# avoid rebuilding a fresh set of references:
references_dir: 'data/references'

# One of the organisms configured below. We are only configuring a single one,
# so "human" is our only option here:
organism: 'human'

# Here we specify which tag under "human" to use for aligning, as well as
# which index we'll be using. This example is RNA-seq, so we'll use HISAT2:
aligner:
  tag: 'gencode-v25'
  index: 'hisat2'

# Top-level section for references:
references:

  # Label for this organism or species:
  human:

    # "gencode-v25" is our tag to describe this particular FASTA and GTF
    # we're preparing:
    gencode-v25:

      # This block will define how to get and postprocess a FASTA file:
      fasta:

        # URL to download:
        url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh38.primary_organism.genome.fa.gz'

        # We can optionally build indexes for various aligners:
        indexes:
          - 'hisat2'
          - 'bowtie2'

      # This next block will define how to get and postprocess a GTF file.
      # The coordinates of the GTF file correspond to the
      # coordinates in the fasta defined above, so we're putting it under
      # the same tag. This is not required; we could also put it under a
      # separate tag (perhaps called "gencode-v25-annotations"):
      gtf:
        url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz'

        # The GENCODE annotations include the dotted Ensembl versions in
        # the gene IDs. The following function, strip_ensembl_version, is
        # defined in lib/postprocess/hg38.py. It strips off those dotted
        # versions so that our resulting GTF file used by the workflows
        # will not contain them:
        postprocess: 'lib.postprocess.hg38.strip_ensembl_version'

        # Once well-formatted by the postprocessing function, we can now
        # perform standard conversions on the GTF. These conversions are
        # defined as rules in the references Snakefile, and will be run
        # if the conversion is specified here. Here we ask to get a refFlat
        # file, which can be provided to Picard's CollectRnaSeqMetrics tool:
        conversions:
          - 'refflat'

    # Here is another tag, to create a FASTA file for ribosomal RNA. It can
    # then be used for fastq_screen, or for the rRNA screening portion of the
    # RNA-seq workflow:
    rRNA:
      fasta:

        # The SILVA database has separate files for large and small subunit
        # sequences. We'd like them all; by providing multiple URLs they will
        # be concatenated:
        url:
          - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_LSURef_tax_silva_trunc.fasta.gz'
          - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz'

        # However, the downloaded files contain many species. Here we only
        # care about human. We already have a function, "filter_fastas()", in
        # lib/common.py that accepts a FASTA and only keeps the records that
        # contain the provided first argument.
        # We specify that first argument here, and it will be passed to that
        # function, resulting in a final FASTA file that only contains the
        # rRNA sequence for Homo sapiens:
        postprocess:
          function: 'lib.common.filter_fastas'
          args: 'Homo sapiens'

        # We only need a bowtie2 index out of it:
        indexes:
          - 'bowtie2'
Without all those comments, it looks like this:
references_dir: 'data/references'
organism: 'human'
aligner:
  tag: 'gencode-v25'
  index: 'hisat2'
references:
  human:
    gencode-v25:
      fasta:
        url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh38.primary_organism.genome.fa.gz'
        indexes:
          - 'hisat2'
          - 'bowtie2'
      gtf:
        url: 'ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz'
        postprocess: 'lib.postprocess.hg38.strip_ensembl_version'
        conversions:
          - 'refflat'
    rRNA:
      fasta:
        url:
          - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_LSURef_tax_silva_trunc.fasta.gz'
          - 'https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz'
        postprocess:
          function: 'lib.common.filter_fastas'
          args: 'Homo sapiens'
        indexes:
          - 'bowtie2'
The above file will result in the following directory structure:
data/references/human/gencode-v25/fasta
data/references/human/gencode-v25/bowtie2
data/references/human/gencode-v25/hisat2
data/references/human/gencode-v25/gtf
data/references/human/gencode-v25-transcriptome/fasta
data/references/human/gencode-v25-transcriptome/salmon
data/references/human/rRNA/fasta
data/references/human/rRNA/bowtie2
Each block in the YAML file describes either a fasta or gtf file. Each block has at least the organism, type, and a URL. A block can optionally have a postprocess, which is an arbitrary function (described below) that converts the downloaded URL to something that conforms to the standards of the workflow (also described below). By supplying a tag, we can differentiate between different versions (e.g., FlyBase r6.04 vs r6.11; hg19 vs hg38) or different kinds of postprocessing (e.g., “chr” prepended to chrom names or not; comprehensive annotation vs only coding genes).
fasta blocks can have an optional indexes entry which will build the specified indexes. gtf blocks can have an optional conversions entry which will perform the specified conversion. Available indexes and conversions are described below.
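For example, keeping two assemblies of the same organism side by side is just a matter of adding another tag under that organism (a sketch; the URLs are placeholders):

references:
  human:
    hg19:
      fasta:
        url: 'file:///path/to/hg19.fa.gz'   # placeholder
        indexes:
          - 'bowtie2'
    hg38:
      fasta:
        url: 'file:///path/to/hg38.fa.gz'   # placeholder
        indexes:
          - 'bowtie2'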
Post processing¶
All files created by a block are required to be gzipped.
This means that if a URL points to an uncompressed GTF file, a post-processing function must gzip it. It also means that any post-processing functions must write gzipped output files.
Other than that, it’s up to the user to decide what transformations (if any) are required. Examples might include:
excluding particular contigs
removing or editing problematic genes that have transcripts on both strands – mod(mdg4), I’m looking at you
renaming chromosomes (e.g., prepending “chr”)
removing unnecessary annotations (e.g., keeping only cds/exon/transcript/gene features)
For example, if a genome is only available as a tarball of separate fasta files, a post-processing function can combine them into the single fasta file that downstream tools expect.
The configuration block can define an optional postprocess string, which contains a dotted name referring to a Python function that is importable by the references workflow. By default, the workflow finds modules in the lib/postprocess directory, so it’s most convenient and organized to put your functions within modules in that directory. For example, the config above used the postprocess function lib.postprocess.hg38.strip_ensembl_version, and you can view that function in lib/postprocess/hg38.py.
A post-processing function is called with two arguments: the downloaded, still-gzipped file (or list of files, if multiple URLs were given) and the final output filename to create. These two arguments are automatically provided by the references workflow; you don’t have to know or care exactly what the filenames are, just what has to be done to their contents. Please see lib.common.download_and_postprocess() for more details, and the files in the lib/postprocess directory for inspiration if you need to write your own post-processing functions.
The job of a postprocessing function is to ensure that the fasta/gtf/transcriptome fasta meets the requirements described above and is ready for any intended downstream tasks. For example, if we download the fasta file from FlyBase for dm6 but want “chr” prepended to chromosome names, we can create a function called add_chr in the file dm6.py that does this:
# This is dm6.py
from snakemake.shell import shell  # a very convenient function

def add_chr(origfn, newfn):
    shell(
        'zcat {origfn} '        # input is always gzipped
        '| sed "s/>/>chr/g" '   # add chr to names
        '| gzip -c > {newfn} '  # re-zip
        '&& rm {origfn}'        # clean up
    )
We specify this function to be called in the fasta config block like this (note that the module doesn’t have to be the same name as the organism, but it is here for clarity):
dm6:
  fasta:
    url: ...
    postprocess: "dm6.add_chr"
This expects a file dm6.py in the same directory as the references Snakefile, and expects a function add_chr to be defined in that module.
Any downstream rules that operate on the genome FASTA file (like hisat2 index, bowtie2 index, etc) will now use this fixed version with “chr” prepended to chromosome names. In this way, we can apply arbitrary code to modify references to get them into a uniform format.
More advanced postprocessing¶
If a post-processing function has a keyword argument whose name starts and ends with a double underscore (__), the config system will assume the value is a string that should be interpreted as a dotted function name; the actual function will be resolved and passed to the post-processing function.
This is useful, for instance, when attaching ERCC spike-ins to a reference file that in turn needs to be modified. For example, the S. pombe reference annotations are available as a GFF file, which needs to be converted to a GTF file; after that, the ERCC spike-in GTF annotations need to be added to the newly-created GTF. The functions in lib/postprocess/ercc.py support such a use case. The config looks like this:
genome:
  url:
    # S. pombe fasta
    - 'ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/fasta/schizosaccharomyces_pombe/dna/Schizosaccharomyces_pombe.ASM294v2.dna_sm.toplevel.fa.gz'

    # ERCC fasta
    - 'https://www-s.nist.gov/srmors/certificates/documents/SRM2374_Sequence_v1.FASTA'
  postprocess:
    function: "lib.postprocess.ercc.add_fasta_to_genome"

annotation:
  url:
    # S. pombe GFF, which needs to be converted to GTF
    - 'ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/gff3/schizosaccharomyces_pombe/Schizosaccharomyces_pombe.ASM294v2.41.gff3.gz'

    # ERCC GTF is not available; conversion function needed to convert
    # fasta to GTF
    - 'https://www-s.nist.gov/srmors/certificates/documents/SRM2374_Sequence_v1.FASTA'
  postprocess:
    function: "lib.postprocess.ercc.add_gtf_to_genome"
    kwargs:
      # As per the docs for add_gtf_to_genome, this function will be
      # applied to all but the last input file. It is specified as a string
      # here, but the config-processing system will resolve this to the
      # actual function and pass that along to add_gtf_to_genome:
      __preprocess__: "lib.common.gff2gtf"
Added in version 1.7: Ability to use special __-prefixed variables that are interpreted as dotted-path functions to import.
Locations of downloaded-and-post-processed FASTA and GTF files¶
Generally speaking, the fasta and gtf files will be in:
{references_dir}/{organism}/{tag}/fasta/{organism}_{tag}.fasta
{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.gtf
If a config file looks like this (simplified here for clarity):
references_dir: refs
references:
  human:
    hg38:
      fasta: ...
      gtf: ...
Then the following files will be created:
refs/human/hg38/fasta/human_hg38.fasta
refs/human/hg38/gtf/human_hg38.gtf
If you are running the references workflow directly, or it is included in another workflow that requests a chromsizes file, the following will also be created:
refs/human/hg38/fasta/human_hg38.chromsizes
Note
Files downloaded from the configured URLs are expected to be gzipped, and any postprocessing functions are expected to output gzipped files. This is because providers most commonly offer gzipped reference files, so this convention minimizes the effort required to prepare fasta and gtf files. However, not all downstream tools handle gzipped input, so the references workflow stores only the uncompressed versions. We consider the resulting configuration simplicity to be worth the additional space and time cost.
Available indexes and conversions¶
The following indexes can currently be specified for fasta files:
hisat2¶
indexes:
  - hisat2

Output files:
{references_dir}/{organism}/{tag}/hisat2/{organism}_{tag}.*.ht2
bowtie2¶
indexes:
  - bowtie2

Output files:
{references_dir}/{organism}/{tag}/bowtie2/{organism}_{tag}.*.bt2
salmon¶
indexes:
  - salmon

Output files:
{references_dir}/{organism}/{tag}/salmon/{organism}_{tag}/*
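salmon indexes transcript sequences rather than the genome, so one way to request it is under a separate tag whose fasta points at a transcriptome file, matching the gencode-v25-transcriptome directories shown in the listing earlier in this section (a sketch; the URL is a placeholder):

references:
  human:
    gencode-v25-transcriptome:
      fasta:
        url: 'file:///path/to/gencode.v25.transcripts.fa.gz'   # placeholder
        indexes:
          - 'salmon'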
The following conversions can be specified for GTF files:
refflat¶
conversions:
  - refflat

Converts GTF to refFlat format. See the conversion_refflat rule in workflows/references/Snakefile.

Output file:
{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.refflat
bed12¶
conversions:
  - bed12

Converts GTF to BED12 format. See the conversion_bed12 rule in workflows/references/Snakefile.

Output file:
{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.bed12
gffutils¶
Converts GTF to gffutils database (typically used for downstream work). You can specify arbitrary kwargs to gffutils.create_db by including them as keys. For example, if the GTF file already contains features for genes and transcripts:

conversions:
  - gffutils:
      disable_infer_genes: True
      disable_infer_transcripts: True

Output file:
{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.gtf.db
genelist¶
Reads the postprocessed GTF file and extracts the set of gene IDs found, one ID per line. The GTF attribute to use is configured by the gene_id key. For example, if the file contains gene IDs in the Name attribute of each line, use the following:

conversions:
  - genelist:
      gene_id: 'Name'

Output file:
{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.genelist
mappings¶
Reads the postprocessed GTF file and outputs mappings between attributes as a gzipped TSV.

You can include/exclude featuretypes from being checked. For example, if your GTF has genes and transcripts in addition to exons, the gene and transcript lines probably contain all of the attributes you are interested in (like gene_id, symbol, name, etc) and the exon (and any other) lines can be ignored, speeding up the process. In this case you could use include_featuretypes: [gene, transcript].

A __featuretype__ column is always included in the mapping. This is the GTF featuretype of each line, with extra __ to avoid overwriting an attribute that may happen to be called featuretype.

conversions:
  - mappings

or, restricting to particular featuretypes:

conversions:
  - mappings:
      include_featuretypes: [gene, transcript]

Output file:
{references_dir}/{organism}/{tag}/gtf/{organism}_{tag}.mapping.tsv.gz