Module lib.common
This module contains various helper functions used by the workflows. It has two main types of functions: those for handling configuration information and those for handling references.
Functions for handling configuration
- Finds the config file.
- lib.common.references_dict – Transforms the references section of the config file.
- lib.common.get_references_dir – Identify the references directory based on config and env vars.
- lib.common.get_sampletable – Return samples and pandas.DataFrame of parsed sampletable.
Functions for handling references
- lib.common.gzipped – Cat-and-gzip a list of uncompressed files into a compressed output file.
- lib.common.cat – Simple concatenation of files.
- lib.common.filter_fastas – Extract records from fasta file(s) given a search pattern.
- Converts .2bit files to fasta.
- lib.common.download_and_postprocess – Given an output file, figure out what to do based on the config.
Details
- lib.common.cat(tmpfiles, outfile)
Simple concatenation of files.
Note that gzipped files can be concatenated as-is without un- and re- compressing.
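The note about gzipped files can be checked with a short sketch. It uses only the standard library; `cat_sketch` is a hypothetical stand-in written for this illustration, not the module's own implementation.

```python
import gzip
import os
import shutil
import tempfile


def cat_sketch(tmpfiles, outfile):
    """Byte-for-byte concatenation of the input files into outfile."""
    with open(outfile, "wb") as out:
        for path in tmpfiles:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Write two small gzipped files to a scratch directory.
workdir = tempfile.mkdtemp()
parts = []
for i, text in enumerate([b"first\n", b"second\n"]):
    path = os.path.join(workdir, f"part{i}.gz")
    with gzip.open(path, "wb") as handle:
        handle.write(text)
    parts.append(path)

# Concatenate the compressed bytes without decompressing them.
combined = os.path.join(workdir, "combined.gz")
cat_sketch(parts, combined)

# Reading the result decompresses each gzip member in turn.
with gzip.open(combined, "rb") as handle:
    combined_bytes = handle.read()
```

Because a gzip stream may consist of several members, the concatenated file decompresses to the inputs' contents in order.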
- lib.common.check_all_urls_found(verbose=True)
Recursively loads all references that can be included and checks them, reporting any failures.
- lib.common.check_url(url, verbose=False)
Try to open – and then immediately close – a URL.
Any exceptions can be handled upstream.
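A minimal sketch of the open-then-close check, assuming `urllib.request` from the standard library (the module may use a different HTTP client). `check_url_sketch` is a hypothetical name, and a local `file://` URL is used so the example runs offline.

```python
import os
import pathlib
import tempfile
import urllib.request


def check_url_sketch(url, verbose=False):
    """Open the URL and close it straight away; exceptions propagate."""
    if verbose:
        print("Checking", url)
    response = urllib.request.urlopen(url)
    response.close()


# A local file:// URL keeps the example runnable without network access.
fd, path = tempfile.mkstemp()
os.close(fd)
local_url = pathlib.Path(path).as_uri()
check_url_sketch(local_url)

# A URL that cannot be opened raises, and the caller decides what to do.
try:
    check_url_sketch(local_url + ".missing")
    missing_raised = False
except OSError:
    missing_raised = True
```

Nothing is caught inside the function, which matches the statement that exceptions are handled upstream.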
- lib.common.check_urls(config, verbose=False)
Given a config filename or existing object, extract the URLs and check them.
- Parameters:
config (str or dict) – Config object to inspect
verbose (bool) – Print which URL is being checked
wait (int) – Number of seconds to wait in between checking URLs, to avoid too-many-connection issues
- lib.common.deprecation_handler(config)
Checks the config to see if anything has been deprecated.
Also makes any fixes that can be done automatically.
- lib.common.download_and_postprocess(outfile, config, organism, tag, type_)
Given an output file, figure out what to do based on the config.
See notes below for details.
- Parameters:
outfile (str)
config (dict)
organism (str) – Which organism to use. Must be a key in the “references” section of the config.
tag (str) – Which tag for the organism to use. Must be a tag for the organism in the config.
type_ (str) – A supported references type (gtf, fasta) to use.
Notes
This function:
- uses organism, tag, type_ as a key into the config dict to figure out:
  - what postprocessing function (if any) was specified, along with its optional args
  - the URL[s] to download
- resolves the name of the postprocessing function (if provided) and imports it
- downloads the URL[s] to tempfile[s]
- calls the imported postprocessing function using the tempfile[s] and outfile plus any additional specified arguments.
The postprocessing function must have one of the following signatures, where infiles contains the list of temporary files downloaded from the URL or URLs specified, and outfile is a gzipped file expected to be created by the function:
def func(infiles, outfile): pass
or:
def func(infiles, outfile, *args): pass
or:
def func(infiles, outfile, *args, **kwargs): pass
The function is specified as a string that resolves to an importable function, e.g., postprocess: lib.postprocess.dm6.fix will call a function called fix in the file lib/postprocess/dm6.py.
If the contents of postprocess: is a dict, it must have at least the key function, and optionally args and/or kwargs keys. The function key indicates the importable path to the function. args can be a string or list of arguments that will be provided as additional args to a function with the second kind of signature above. If kwargs is provided, it is a dict that is passed to the function with the third kind of signature above. For example:
postprocess:
  function: lib.postprocess.dm6.fix
  args:
    - True
    - 3
or:
postprocess:
  function: lib.postprocess.dm6.fix
  args:
    - True
    - 3
  kwargs:
    skip: exon
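The step of resolving a postprocess entry into something callable could look roughly like the sketch below. `resolve_function` and `parse_postprocess` are hypothetical helpers written for illustration. Treating a string-valued args as a single argument is an assumption, and `os.path.join` merely stands in for a real postprocessing function.

```python
import importlib


def resolve_function(dotted_path):
    """Import 'package.module.func' and return the function object."""
    module_name, func_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


def parse_postprocess(spec):
    """Normalize a postprocess entry into (function, args, kwargs)."""
    if isinstance(spec, str):
        # Plain string form: just the importable path, no extra arguments.
        return resolve_function(spec), (), {}
    args = spec.get("args", ())
    if isinstance(args, str):
        args = (args,)  # assumption: a lone string counts as one argument
    kwargs = dict(spec.get("kwargs", {}))
    return resolve_function(spec["function"]), tuple(args), kwargs


# os.path.join stands in for a real postprocessing function here.
func, args, kwargs = parse_postprocess(
    {"function": "os.path.join", "args": ["a", "b"]}
)
joined = func(*args, **kwargs)
```

In the real workflow the resolved function would be called with the downloaded tempfiles and outfile first, followed by the extra args and kwargs.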
- lib.common.fill_r1_r2(sampletable, pattern, r1_only=False)
Returns a function intended to be used as a rule’s input function.
The returned function, when provided with wildcards, returns one rendered version of the pattern for single-end (SE) samples or two for paired-end (PE) samples. Specifically, given a pattern (which is expected to contain the placeholders “{sample}” and “{n}”), it looks up in the sampletable whether or not the sample is paired-end.
- Parameters:
sampletable (pandas.DataFrame) – Contains a “layout” column with either “SE” or “PE”, or “LibraryLayout” column with “SINGLE” or “PAIRED”. If column does not exist, assume SE.
pattern (str) – Must contain at least a “{sample}” placeholder.
r1_only (bool) – If True, then only return the file for R1 even if PE is configured.
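The behaviour described above can be sketched without pandas or Snakemake. In this illustration a plain dict of sample to layout stands in for the sampletable, `types.SimpleNamespace` stands in for Snakemake's wildcards object, and `fill_r1_r2_sketch` is a hypothetical name. Treating samples missing from the dict as single-end is also an assumption.

```python
from types import SimpleNamespace


def fill_r1_r2_sketch(layouts, pattern, r1_only=False):
    """Return an input function that renders pattern for R1 (and R2 if PE)."""

    def func(wildcards):
        # Assumption: samples missing from the table are single-end.
        layout = layouts.get(wildcards.sample, "SE").lower()
        paired = layout in ("pe", "paired")
        reads = [1, 2] if paired and not r1_only else [1]
        return [pattern.format(sample=wildcards.sample, n=n) for n in reads]

    return func


layouts = {"sampleA": "SE", "sampleB": "PE"}  # stand-in for the sampletable
pattern = "data/{sample}_R{n}.fastq.gz"
input_func = fill_r1_r2_sketch(layouts, pattern)

se_files = input_func(SimpleNamespace(sample="sampleA"))
pe_files = input_func(SimpleNamespace(sample="sampleB"))
r1_files = fill_r1_r2_sketch(layouts, pattern, r1_only=True)(
    SimpleNamespace(sample="sampleB")
)
```

Returning a closure lets the same pattern serve both layouts, since the decision is deferred until the sample wildcard is known.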
- lib.common.filter_fastas(tmpfiles, outfile, pattern)
Extract records from fasta file(s) given a search pattern.
Given input gzipped FASTAs, create a new gzipped fasta containing only records whose description matches pattern.
- Parameters:
tmpfiles (list) – gzipped fasta files to look through
outfile (str) – gzipped output fasta file
pattern (str) – Look for this string in each record’s description
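A standard-library sketch of this kind of filtering follows. `filter_fastas_sketch` is a hypothetical stand-in, and the module itself may rely on a FASTA parsing library instead. Matching is done here as a plain substring test on the header line, which is an assumption about how pattern is interpreted.

```python
import gzip
import os
import tempfile


def filter_fastas_sketch(tmpfiles, outfile, pattern):
    """Write only records whose description line contains pattern."""
    with gzip.open(outfile, "wt") as out:
        for path in tmpfiles:
            keep = False
            with gzip.open(path, "rt") as src:
                for line in src:
                    if line.startswith(">"):
                        # The header decides the fate of the whole record.
                        keep = pattern in line[1:]
                    if keep:
                        out.write(line)


workdir = tempfile.mkdtemp()
infile = os.path.join(workdir, "in.fa.gz")
with gzip.open(infile, "wt") as handle:
    handle.write(">chr2L type=chromosome\nACGT\nTTGA\n>rRNA_1 type=rRNA\nGGCC\n")

outfile = os.path.join(workdir, "out.fa.gz")
filter_fastas_sketch([infile], outfile, "rRNA")

with gzip.open(outfile, "rt") as handle:
    filtered = handle.read()
```

The function takes (tmpfiles, outfile, pattern), so it fits the postprocessing signature that accepts extra positional arguments.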
- lib.common.get_references_dir(config)
Identify the references directory based on config and env vars.
Returns the references dir, preferring the value of an existing environment variable REFERENCES_DIR over the config entry “references_dir”. Raises an error if neither can be found.
- Parameters:
config (dict)
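The precedence rule can be sketched as follows. `get_references_dir_sketch` is a hypothetical name, the `environ` parameter exists only to make the example testable without touching the real environment, and the exception type is an assumption.

```python
import os


def get_references_dir_sketch(config, environ=None):
    """Prefer $REFERENCES_DIR, fall back to config['references_dir']."""
    environ = os.environ if environ is None else environ
    references_dir = environ.get("REFERENCES_DIR") or config.get("references_dir")
    if not references_dir:
        # Assumption: the real function may raise a different exception type.
        raise ValueError(
            "No references dir: set REFERENCES_DIR or 'references_dir' in the config"
        )
    return references_dir


config = {"references_dir": "/data"}
from_env = get_references_dir_sketch(config, environ={"REFERENCES_DIR": "/shared/refs"})
from_config = get_references_dir_sketch(config, environ={})

try:
    get_references_dir_sketch({}, environ={})
    neither_raised = False
except ValueError:
    neither_raised = True
```

Letting the environment variable win allows a shared references directory to be set once per machine without editing each config file.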
- lib.common.get_sampletable(config)
Return samples and pandas.DataFrame of parsed sampletable.
Returns the sample IDs and the parsed sampletable from the file specified in the config.
The sample IDs are assumed to be the first column of the sampletable.
- Parameters:
config (dict)
- lib.common.get_techreps(sampletable, label)
Return all sample IDs for which the “label” column is label.
- lib.common.gff2gtf(gff, gtf)
Converts a GFF file to GTF format using the gffread program from Cufflinks.
- lib.common.gzipped(tmpfiles, outfile)
Cat-and-gzip a list of uncompressed files into a compressed output file.
- lib.common.is_paired_end(sampletable, sample)
Inspects the sampletable to see if the sample is paired-end or not.
- Parameters:
sampletable (pandas.DataFrame) – Contains a “layout” or “LibraryLayout” column (but not both). If the lowercase value is “pe” or “paired”, consider the sample paired-end. Otherwise consider single-end.
sample (str) – Assumed to be found in the first column of sampletable
- lib.common.load_config(config, missing_references_ok=False)
Loads the config.
Resolves any included references directories/files and runs the deprecation handler.
- lib.common.pluck(obj, kv)
For a given dict or list that somewhere contains keys kv, return the values of those keys.
Named after the dplyr::pluck, and implemented based on https://stackoverflow.com/a/1987195
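A recursive search of this kind might be sketched as below. `pluck_sketch` is a hypothetical generator written for illustration and is not claimed to match the module's implementation or the linked answer. Not descending into a value once its key has matched is a choice made for this sketch.

```python
def pluck_sketch(obj, kv):
    """Yield every value stored under key kv anywhere in nested dicts/lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == kv:
                yield value
            else:
                yield from pluck_sketch(value, kv)
    elif isinstance(obj, list):
        for item in obj:
            yield from pluck_sketch(item, kv)


# The example URLs are placeholders, not real reference locations.
config = {
    "references": {
        "dm6": {
            "r6-11": {
                "genome": {"url": "https://example.org/genome.fa.gz"},
                "annotation": {"url": "https://example.org/annotation.gtf.gz"},
            }
        }
    }
}
urls = list(pluck_sketch(config, "url"))
```

This is the sort of traversal that makes it possible to collect every URL in a config regardless of how deeply it is nested.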
- lib.common.references_dict(config)
Transforms the references section of the config file.
The references section of the config file is designed to be human-editable, and to only need the URL(s). User-specified indexes, conversions, and post-processing functions can also be added.
For example, the config might say:
human:
  gencode:
    fasta: <url to fasta>
    indexes:
      - hisat2
In this function, we need to convert that “indexes: [hisat2]” into the full path of the hisat2 index that can be used as input for a Snakemake rule. In this example, in the dictionary returned below we can then get that path with d['human']['gencode']['hisat2'], or more generally, d[organism][tag][type].
- Parameters:
config (dict)
Notes
The config file is designed to be easy for users to read and edit, but its layout is inconvenient to work with directly in the workflows. Here we convert a config file with this format:
references_dir: "/data"
references:
  dm6:
    r6-11:
      metadata:
        reference_genome_build: 'dm6'
        reference_effective_genome_count: 1.2e7
        reference_effective_genome_proportion: 0.97
      genome:
        url: ""
        indexes:
          - bowtie2
          - hisat2
      annotation:
        url: ""
        conversions:
          - refflat
      transcriptome:
        indexes:
          - salmon
To this format:
'dm6': {
    'r6-11': {
        'annotation': '/data/dm6/r6-11/annotation/dm6_r6-11.gtf',
        'bowtie2': '/data/dm6/r6-11/genome/bowtie2/dm6_r6-11.1.bt2',
        'bowtie2_fasta': '/data/dm6/r6-11/genome/bowtie2/dm6_r6-11.fasta',
        'chromsizes': '/data/dm6/r6-11/genome/dm6_r6-11.chromsizes',
        'genome': '/data/dm6/r6-11/genome/dm6_r6-11.fasta',
        'hisat2': '/data/dm6/r6-11/genome/hisat2/dm6_r6-11.1.ht2',
        'hisat2_fasta': '/data/dm6/r6-11/genome/hisat2/dm6_r6-11.fasta',
        'refflat': '/data/dm6/r6-11/annotation/dm6_r6-11.refflat',
        'salmon': '/data/dm6/r6-11/transcriptome/salmon/dm6_r6-11/versionInfo.json',
        'salmon_fasta': '/data/dm6/r6-11/transcriptome/salmon/dm6_r6-11.fasta',
        'transcriptome': '/data/dm6/r6-11/transcriptome/dm6_r6-11.fasta',
    },
}
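To make the mapping concrete, here is a deliberately simplified sketch that reproduces only some of the keys shown above (the base files, the bowtie2 and hisat2 indexes, and conversions). The file extensions are inferred from the example output. `references_dict_sketch` is a hypothetical name, and naming a conversion's file after the conversion itself is an assumption. The salmon, chromsizes and *_fasta entries are left out.

```python
import os

# Extensions inferred from the example output; only a subset is covered.
TYPE_EXT = {"genome": "fasta", "annotation": "gtf", "transcriptome": "fasta"}
INDEX_EXT = {"bowtie2": "1.bt2", "hisat2": "1.ht2"}


def references_dict_sketch(config):
    """Map config['references'] to d[organism][tag][type] -> file path."""
    root = config["references_dir"]
    d = {}
    for organism, tags in config["references"].items():
        for tag, types in tags.items():
            paths = d.setdefault(organism, {}).setdefault(tag, {})
            prefix = f"{organism}_{tag}"
            for type_, block in types.items():
                if type_ not in TYPE_EXT:
                    continue  # e.g. "metadata" has no file of its own
                base = os.path.join(root, organism, tag, type_)
                paths[type_] = os.path.join(base, f"{prefix}.{TYPE_EXT[type_]}")
                for index in block.get("indexes", []):
                    if index in INDEX_EXT:
                        paths[index] = os.path.join(
                            base, index, f"{prefix}.{INDEX_EXT[index]}"
                        )
                for conversion in block.get("conversions", []):
                    # Assumption: the conversion name doubles as the extension.
                    paths[conversion] = os.path.join(base, f"{prefix}.{conversion}")
    return d


config = {
    "references_dir": "/data",
    "references": {
        "dm6": {
            "r6-11": {
                "metadata": {"reference_genome_build": "dm6"},
                "genome": {"url": "", "indexes": ["bowtie2", "hisat2"]},
                "annotation": {"url": "", "conversions": ["refflat"]},
            }
        }
    },
}
d = references_dict_sketch(config)
hisat2_index = d["dm6"]["r6-11"]["hisat2"]
```

A Snakemake rule can then look up its input with d[organism][tag][type] rather than rebuilding the path from the config each time.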