Module lib.common

This module contains helper functions used by the workflows. They fall into two main groups: functions for handling configuration information and functions for handling references.

Functions for handling configuration

resolve_config(config[, workdir])

Finds the config file.

references_dict(config)

Transforms the references section of the config file.

get_references_dir(config)

Identify the references directory based on config and env vars.

get_sampletable(config)

Return samples and pandas.DataFrame of parsed sampletable.

Functions for handling references

gzipped(tmpfiles, outfile)

Cat-and-gzip a list of uncompressed files into a compressed output file.

cat(tmpfiles, outfile)

Simple concatenation of files.

filter_fastas(tmpfiles, outfile, pattern)

Extract records from fasta file(s) given a search pattern.

twobit_to_fasta(tmpfiles, outfile)

Converts .2bit files to fasta.

download_and_postprocess(outfile, config, ...)

Given an output file, figure out what to do based on the config.

Details

lib.common.cat(tmpfiles, outfile)[source]

Simple concatenation of files.

Note that gzipped files can be concatenated as-is, without decompressing and recompressing them.
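The note about gzipped input can be illustrated with a minimal sketch. This is a hypothetical re-implementation written only for illustration, not the module's source; it relies on the fact that a gzip stream may consist of several members laid end to end.

```python
import gzip
import shutil


def cat(tmpfiles, outfile):
    """Concatenate files byte-for-byte into outfile (illustrative sketch)."""
    with open(outfile, "wb") as out:
        for path in tmpfiles:
            with open(path, "rb") as handle:
                # Raw bytes are copied unchanged; nothing is decompressed.
                shutil.copyfileobj(handle, out)


def read_gzipped_text(path):
    """Helper for the sketch: read a (possibly multi-member) gzip file as text."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

Reading the concatenated output with gzip.open yields the contents of the inputs in order, because the gzip reader moves on to the next member when one ends.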

lib.common.check_all_urls_found(verbose=True)[source]

Recursively loads all references that can be included and checks their URLs. Reports any failures.

lib.common.check_url(url, verbose=False)[source]

Try to open – and then immediately close – a URL.

Any exceptions can be handled upstream.
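A minimal sketch of the open-then-close behavior described above. The use of urllib.request.urlopen is an assumption made for illustration; the module may use a different client.

```python
import urllib.request


def check_url(url, verbose=False):
    """Open a URL and close it right away (illustrative sketch).

    No exception is caught here, so failures propagate to the caller.
    """
    if verbose:
        print(f"Checking {url}")
    # The context manager closes the response as soon as it has been opened.
    with urllib.request.urlopen(url):
        pass
```

A caller that checks many URLs can wrap each call in try/except and collect the failures instead of stopping at the first one.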

lib.common.check_urls(config, verbose=False)[source]

Given a config filename or existing object, extract the URLs and check them.

Parameters:
  • config (str or dict) – Config object to inspect

  • verbose (bool) – Print which URL is being checked

  • wait (int) – Number of seconds to wait in between checking URLs, to avoid too-many-connection issues

lib.common.deprecation_handler(config)[source]

Checks the config to see if anything has been deprecated.

Also makes any fixes that can be done automatically.

lib.common.download_and_postprocess(outfile, config, organism, tag, type_)[source]

Given an output file, figure out what to do based on the config.

See notes below for details.

Parameters:
  • outfile (str) –

  • config (dict) –

  • organism (str) – Which organism to use. Must be a key in the “references” section of the config.

  • tag (str) – Which tag for the organism to use. Must be a tag for the organism in the config.

  • type_ (str) – A supported reference type (gtf, fasta) to use.

Notes

This function:

  • uses organism, tag, type_ as a key into the config dict to figure out:

    • what postprocessing function (if any) was specified along with its optional args

    • the URL[s] to download

  • resolves the name of the postprocessing function (if provided) and imports it

  • downloads the URL[s] to tempfile[s]

  • calls the imported postprocessing function using the tempfile[s] and outfile plus any additional specified arguments.

The postprocessing function must have one of the following signatures, where infiles contains the list of temporary files downloaded from the URL or URLs specified, and outfile is a gzipped file expected to be created by the function:

def func(infiles, outfile):
    pass

or:

def func(infiles, outfile, *args):
    pass

or:

def func(infiles, outfile, *args, **kwargs):
    pass

The function is specified as a string that resolves to an importable function, e.g., postprocess: lib.postprocess.dm6.fix will call a function called fix in the file lib/postprocess/dm6.py.

If the value of postprocess: is a dict, it must have at least the key function, and optionally the keys args and/or kwargs. The function key gives the importable path to the function. args can be a string or a list of arguments; these are passed as additional positional arguments to a function with the second kind of signature above. If kwargs is provided, it is a dict that is passed as keyword arguments to a function with the third kind of signature above. For example:

postprocess:
    function: lib.postprocess.dm6.fix
    args:
        - True
        - 3

or:

postprocess:
    function: lib.postprocess.dm6.fix
    args:
        - True
        - 3
    kwargs:
        skip: exon
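
The resolution of a postprocess: entry into a callable plus its arguments can be sketched as follows. The helper names resolve_function and parse_postprocess are hypothetical and exist only for this illustration; only the string and dict forms described above are assumed.

```python
import importlib


def resolve_function(dotted_path):
    """Import 'package.module.func' and return the function object."""
    module_name, func_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


def parse_postprocess(postprocess):
    """Normalize a postprocess entry to (function, args, kwargs).

    Hypothetical helper: accepts either a dotted-path string or a dict
    with a 'function' key and optional 'args' / 'kwargs' keys.
    """
    if isinstance(postprocess, str):
        return resolve_function(postprocess), (), {}
    func = resolve_function(postprocess["function"])
    args = postprocess.get("args", ())
    if isinstance(args, str):
        args = (args,)  # a single string counts as one argument
    kwargs = postprocess.get("kwargs", {})
    return func, tuple(args), dict(kwargs)
```

With the result unpacked as func, args, kwargs, the call then takes the form func(tmpfiles, outfile, *args, **kwargs), which covers all three signatures listed above.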

lib.common.fill_r1_r2(sampletable, pattern, r1_only=False)[source]

Returns a function intended to be used as a rule’s input function.

The returned function, when provided with wildcards, will return one or two rendered versions of a pattern depending on SE or PE respectively. Specifically, given a pattern (which is expected to contain a placeholder for “{sample}” and “{n}”), look up in the sampletable whether or not it is paired-end.

Parameters:
  • sampletable (pandas.DataFrame) – Contains a “layout” column with either “SE” or “PE”, or a “LibraryLayout” column with “SINGLE” or “PAIRED”. If neither column exists, SE is assumed.

  • pattern (str) – Must contain at least a “{sample}” placeholder.

  • r1_only (bool) – If True, then only return the file for R1 even if PE is configured.
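
The closure pattern described above can be sketched without pandas. In this illustration a plain dict of sample to layout stands in for the sampletable DataFrame, and {n} is assumed to take the values 1 and 2; both are assumptions, not the module's actual behavior.

```python
from types import SimpleNamespace


def fill_r1_r2(layouts, pattern, r1_only=False):
    """Return an input function that renders `pattern` for R1, plus R2 if PE.

    `layouts` is a dict of sample -> layout string, a stand-in for the
    sampletable. Unknown samples are treated as single-end.
    """
    def func(wildcards):
        layout = layouts.get(wildcards.sample, "SE").lower()
        paired = layout in ("pe", "paired")
        reads = [1, 2] if paired and not r1_only else [1]
        return [pattern.format(sample=wildcards.sample, n=n) for n in reads]

    return func


# SimpleNamespace mimics an object with a `.sample` attribute.
input_func = fill_r1_r2({"s1": "PE", "s2": "SE"}, "data/{sample}_R{n}.fastq.gz")
paired_files = input_func(SimpleNamespace(sample="s1"))
single_files = input_func(SimpleNamespace(sample="s2"))
```

Returning a function rather than a list defers the SE/PE lookup until the sample wildcard is known.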

lib.common.filter_fastas(tmpfiles, outfile, pattern)[source]

Extract records from fasta file(s) given a search pattern.

Given input gzipped FASTAs, create a new gzipped fasta containing only records whose description matches pattern.

Parameters:
  • tmpfiles (list) – gzipped fasta files to look through

  • outfile (str) – gzipped output fasta file

  • pattern (str) – Look for this string in each record’s description
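
A standard-library sketch of this filtering, assuming that "description" means the whole header line after ">" and that pattern is a plain substring. The actual function may rely on a FASTA parsing library instead.

```python
import gzip


def filter_fastas(tmpfiles, outfile, pattern):
    """Write only the records whose header contains `pattern` (sketch)."""
    with gzip.open(outfile, "wt") as out:
        for path in tmpfiles:
            keep = False
            with gzip.open(path, "rt") as handle:
                for line in handle:
                    if line.startswith(">"):
                        # A header line decides the fate of the whole record.
                        keep = pattern in line[1:]
                    if keep:
                        out.write(line)
```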

lib.common.get_references_dir(config)[source]

Identify the references directory based on config and env vars.

Returns the references dir, preferring the value of an existing environment variable REFERENCES_DIR over the config entry “references_dir”. Raises an error if neither can be found.

Parameters:

config (dict) –
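
The precedence rule can be sketched as below. The exception type (ValueError) is chosen for illustration only; the source does not say which error is raised.

```python
import os


def get_references_dir(config):
    """Prefer $REFERENCES_DIR over config['references_dir'] (sketch)."""
    if "REFERENCES_DIR" in os.environ:
        return os.environ["REFERENCES_DIR"]
    if "references_dir" in config:
        return config["references_dir"]
    # Assumption: the real function may raise a different exception type.
    raise ValueError(
        "Set the REFERENCES_DIR environment variable or add "
        "'references_dir' to the config"
    )
```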

lib.common.get_sampletable(config)[source]

Return samples and pandas.DataFrame of parsed sampletable.

Returns the sample IDs and the parsed sampletable from the file specified in the config.

The sample IDs are assumed to be the first column of the sampletable.

Parameters:

config (dict) –

lib.common.get_techreps(sampletable, label)[source]

Return all sample IDs for which the “label” column is label.

lib.common.gff2gtf(gff, gtf)[source]

Converts a GFF file to GTF format using the gffread tool from Cufflinks.

lib.common.gzipped(tmpfiles, outfile)[source]

Cat-and-gzip a list of uncompressed files into a compressed output file.
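
A minimal sketch of cat-and-gzip using only the standard library; this is a hypothetical re-implementation for illustration, not the module's source.

```python
import gzip
import shutil


def gzipped(tmpfiles, outfile):
    """Stream uncompressed inputs, in order, into one gzipped output (sketch)."""
    with gzip.open(outfile, "wb") as out:
        for path in tmpfiles:
            with open(path, "rb") as handle:
                # Copy in chunks so large files are never held in memory.
                shutil.copyfileobj(handle, out)
```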

lib.common.is_paired_end(sampletable, sample)[source]

Inspects the sampletable to see whether the sample is paired-end.

Parameters:
  • sampletable (pandas.DataFrame) – Contains a “layout” or “LibraryLayout” column (but not both). If the lowercase value is “pe” or “paired”, consider the sample paired-end. Otherwise consider single-end.

  • sample (str) – Assumed to be found in the first column of sampletable

lib.common.load_config(config, missing_references_ok=False)[source]

Loads the config.

Resolves any included references directories/files and runs the deprecation handler.

lib.common.openfile(tmp, mode)[source]

Returns an open file handle; auto-detects gzipped files.
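
One way such auto-detection could work is sketched below. Detecting gzip by the ".gz" extension is an assumption; checking the file's first two bytes (the gzip magic number) would be an alternative for existing files.

```python
import gzip


def openfile(tmp, mode):
    """Return an open handle, using gzip for names ending in .gz (sketch)."""
    if tmp.endswith(".gz"):
        return gzip.open(tmp, mode)
    return open(tmp, mode)
```

Because both branches return file-like objects, callers can use the same read and write code for compressed and uncompressed files.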

lib.common.pluck(obj, kv)[source]

For a given dict or list that somewhere contains keys kv, return the values of those keys.

Named after dplyr::pluck, and implemented based on https://stackoverflow.com/a/1987195.
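
A recursive generator in the spirit of that description is sketched below. It is written for illustration and may differ in detail from the module's implementation, for example in whether it descends into values it has already yielded.

```python
def pluck(obj, kv):
    """Yield every value stored under key `kv` anywhere in nested dicts/lists."""
    if isinstance(obj, list):
        for item in obj:
            yield from pluck(item, kv)
    elif isinstance(obj, dict):
        if kv in obj:
            yield obj[kv]
        # Keep descending so deeper occurrences are found as well.
        for value in obj.values():
            yield from pluck(value, kv)


config = {
    "references": {
        "dm6": {
            "genome": {"url": "a"},
            "annotation": {"url": ["b", "c"]},
        }
    }
}
urls = list(pluck(config, "url"))
```

Here urls holds one entry per "url" key, so a key whose value is a list contributes that list as a single item.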

lib.common.references_dict(config)[source]

Transforms the references section of the config file.

The references section of the config file is designed to be human-editable, and to only need the URL(s). User-specified indexes, conversions, and post-processing functions can also be added.

For example, the config might say:

human:
  gencode:
    fasta: <url to fasta>
    indexes:
      - hisat2

In this function, we need to convert that “indexes: [hisat2]” into the full path of the hisat2 index that can be used as input for a Snakemake rule. In this example, in the dictionary returned below we can then get that path with d['human']['gencode']['hisat2'], or more generally, d[organism][tag][type].

Parameters:

config (dict) –

Notes

The config file is designed to be easy to edit and use from the user’s standpoint. But it’s not so great for practical usage. Here we convert the config file which has the format:

references_dir: "/data"
references:
  dm6:
    r6-11:
      metadata:
        reference_genome_build: 'dm6'
        reference_effective_genome_count: 1.2e7
        reference_effective_genome_proportion: 0.97
      genome:
        url: ""
        indexes:
          - bowtie2
          - hisat2
      annotation:
        url: ""
        conversions:
          - refflat
      transcriptome:
        indexes:
          - salmon

To this format:

'dm6': {
   'r6-11': {
       'annotation':    '/data/dm6/r6-11/annotation/dm6_r6-11.gtf',
       'bowtie2':       '/data/dm6/r6-11/genome/bowtie2/dm6_r6-11.1.bt2',
       'bowtie2_fasta': '/data/dm6/r6-11/genome/bowtie2/dm6_r6-11.fasta',
       'chromsizes':    '/data/dm6/r6-11/genome/dm6_r6-11.chromsizes',
       'genome':        '/data/dm6/r6-11/genome/dm6_r6-11.fasta',
       'hisat2':        '/data/dm6/r6-11/genome/hisat2/dm6_r6-11.1.ht2',
       'hisat2_fasta':  '/data/dm6/r6-11/genome/hisat2/dm6_r6-11.fasta',
       'refflat':       '/data/dm6/r6-11/annotation/dm6_r6-11.refflat',
       'salmon':        '/data/dm6/r6-11/transcriptome/salmon/dm6_r6-11/versionInfo.json',
       'salmon_fasta':  '/data/dm6/r6-11/transcriptome/salmon/dm6_r6-11.fasta',
       'transcriptome': '/data/dm6/r6-11/transcriptome/dm6_r6-11.fasta',
       },
}
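
The path layout in the example output can be reproduced in part with a simplified sketch. Everything here is inferred from that example alone: the function name example_paths, the extension table, and the index suffixes are assumptions, and the chromsizes, salmon, and *_fasta entries are deliberately left out.

```python
import posixpath

# Read off the example output; not an exhaustive list of supported types.
TYPE_EXTENSIONS = {"genome": "fasta", "annotation": "gtf", "transcriptome": "fasta"}
INDEX_SUFFIXES = {"bowtie2": ".1.bt2", "hisat2": ".1.ht2"}


def example_paths(config):
    """Build d[organism][tag][type] paths for a subset of types (sketch)."""
    root = config["references_dir"]
    result = {}
    for organism, tags in config["references"].items():
        for tag, types in tags.items():
            prefix = f"{organism}_{tag}"
            entry = result.setdefault(organism, {}).setdefault(tag, {})
            for type_, block in types.items():
                if type_ not in TYPE_EXTENSIONS:
                    continue  # e.g. "metadata" has no file of its own
                type_dir = posixpath.join(root, organism, tag, type_)
                ext = TYPE_EXTENSIONS[type_]
                entry[type_] = posixpath.join(type_dir, f"{prefix}.{ext}")
                for index in block.get("indexes", []):
                    if index in INDEX_SUFFIXES:  # other indexes are skipped here
                        entry[index] = posixpath.join(
                            type_dir, index, prefix + INDEX_SUFFIXES[index]
                        )
                for conversion in block.get("conversions", []):
                    entry[conversion] = posixpath.join(
                        type_dir, f"{prefix}.{conversion}"
                    )
    return result
```

Given the example config, example_paths(config)['dm6']['r6-11']['hisat2'] evaluates to '/data/dm6/r6-11/genome/hisat2/dm6_r6-11.1.ht2', matching the corresponding entry in the example output.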

lib.common.resolve_config(config, workdir=None)[source]

Finds the config file.

Parameters:
  • config (str, dict) – If str, assume it’s a YAML file and parse it; otherwise pass through

  • workdir (str) – Optional location to specify relative location of all paths in config

lib.common.twobit_to_fasta(tmpfiles, outfile)[source]

Converts .2bit files to fasta.

Parameters:
  • tmpfiles (list) – 2bit files to convert

  • outfile (str) – gzipped output fasta file