lcdblib.snakemake package

Submodules

lcdblib.snakemake.aligners module

Helper functions for working with aligners within Snakefiles

lcdblib.snakemake.aligners.bowtie2_index_from_prefix(prefix)[source]

Given a prefix, return a list of the corresponding bowtie2 index files.

lcdblib.snakemake.aligners.hisat2_index_from_prefix(prefix)[source]

Given a prefix, return a list of the corresponding hisat2 index files.
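
To make the prefix-to-files mapping concrete, here is a minimal sketch. It is not lcdblib's implementation: the function names are hypothetical, and it assumes the standard small-index suffixes written by bowtie2-build (six `.bt2` files) and hisat2-build (eight `.ht2` files).

```python
# Hypothetical helpers; names and behavior are assumptions for illustration.
BOWTIE2_SUFFIXES = ['.1.bt2', '.2.bt2', '.3.bt2', '.4.bt2',
                    '.rev.1.bt2', '.rev.2.bt2']


def bowtie2_index_files_sketch(prefix):
    """Return the six small-index files expected for a bowtie2 prefix."""
    return [prefix + suffix for suffix in BOWTIE2_SUFFIXES]


def hisat2_index_files_sketch(prefix):
    """Return the eight small-index files expected for a hisat2 prefix."""
    return ['{0}.{1}.ht2'.format(prefix, n) for n in range(1, 9)]
```

A list like this is convenient as the `output` of an index-building rule, so that Snakemake tracks every file the aligner writes.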

lcdblib.snakemake.aligners.prefix_from_bowtie2_index(index_files)[source]

Given a list of index files for bowtie2, return the corresponding prefix.

lcdblib.snakemake.aligners.prefix_from_hisat2_index(index_files)[source]

Given a list of index files for hisat2, return the corresponding prefix.
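
The reverse direction can be sketched by stripping the numbered suffix from each index file and checking that all files agree. This is an illustrative assumption about how such a helper could work, not the library's code; `prefix_from_index_sketch` and its `extension` parameter are hypothetical.

```python
import re


def prefix_from_index_sketch(index_files, extension):
    """Recover the common prefix from index files such as 'dm6.rev.1.bt2'.

    `extension` is '.bt2' for bowtie2 or '.ht2' for hisat2 (assumed suffixes).
    """
    # Matches an optional '.rev', then '.<number>', then the extension.
    suffix = re.compile(r'(\.rev)?\.\d+' + re.escape(extension) + r'$')
    prefixes = set()
    for filename in index_files:
        stripped, n_subs = suffix.subn('', filename)
        if n_subs == 0:
            raise ValueError('Not an index file: %s' % filename)
        prefixes.add(stripped)
    if len(prefixes) != 1:
        raise ValueError('Index files do not share a single prefix')
    return prefixes.pop()
```

Inside a rule, this lets the aligner command be built from `input.index` without hard-coding the prefix a second time.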

lcdblib.snakemake.helpers module

lcdblib.snakemake.helpers.extract_wildcards(pattern, target)[source]

Return a dictionary of wildcards and values identified from target.

Returns None if the regex match failed.

Parameters:
  • pattern (str) – Snakemake-style filename pattern, e.g. {output}/{sample}.bam.
  • target (str) – Filename from which to extract wildcards, e.g., data/a.bam.

Examples

>>> pattern = '{output}/{sample}.bam'
>>> target = 'data/a.bam'
>>> expected = {'output': 'data', 'sample': 'a'}
>>> assert extract_wildcards(pattern, target) == expected
>>> assert extract_wildcards(pattern, 'asdf') is None
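
The "regex match" mentioned above can be illustrated with a small sketch that turns each `{name}` field into a named group. This is an assumption about the general technique, not the actual implementation; `extract_wildcards_sketch` is a hypothetical name.

```python
import re


def extract_wildcards_sketch(pattern, target):
    """Match `target` against a '{name}'-style pattern via a generated regex."""
    # re.split with a capturing group alternates literal text and field names.
    parts = re.split(r'\{(\w+)\}', pattern)
    regex = ''
    seen = set()
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)          # literal text between fields
        elif part in seen:
            regex += '(?P=%s)' % part         # repeated field must match again
        else:
            seen.add(part)
            regex += '(?P<%s>.+)' % part      # first occurrence captures
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None
```
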
lcdblib.snakemake.helpers.fill_patterns(patterns, fill, combination=<class 'itertools.product'>)[source]

Fills in a dictionary of patterns with the dictionary or DataFrame fill.

>>> patterns = dict(a='{sample}_R{N}.fastq')
>>> fill = dict(sample=['one', 'two'], N=[1, 2])
>>> sorted(fill_patterns(patterns, fill)['a'])
['one_R1.fastq', 'one_R2.fastq', 'two_R1.fastq', 'two_R2.fastq']
>>> patterns = dict(a='{sample}_R{N}.fastq')
>>> fill = dict(sample=['one', 'two'], N=[1, 2])
>>> sorted(fill_patterns(patterns, fill, zip)['a'])
['one_R1.fastq', 'two_R2.fastq']
>>> import pandas as pd
>>> patterns = dict(a='{sample}_R{N}.fastq')
>>> fill = pd.DataFrame({'sample': ['one', 'two'], 'N': [1, 2]})
>>> sorted(fill_patterns(patterns, fill)['a'])
['one_R1.fastq', 'two_R2.fastq']
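
The difference between the default `itertools.product` and `zip` in the examples above can be sketched for a single pattern as follows. This is a simplified illustration that handles only a dictionary `fill`; `expand_sketch` is a hypothetical name and is not part of lcdblib.

```python
from itertools import product


def expand_sketch(pattern, fill, combination=product):
    """Fill one pattern with every combination of values from `fill`."""
    keys = sorted(fill)
    filled = []
    # product yields every pairing of values; zip pairs them position by position.
    for values in combination(*(fill[key] for key in keys)):
        filled.append(pattern.format(**dict(zip(keys, values))))
    return filled
```

With `product`, two samples and two read numbers give four file names; with `zip`, the lists are walked in lockstep and give two.
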
lcdblib.snakemake.helpers.rscript(string, scriptname, log=None)[source]

Saves the string as scriptname and then runs it.

Parameters:
  • string (str) – Filled-in template to be written as R script
  • scriptname (str) – File to save script to
  • log (str) – File to redirect stdout and stderr to. If None, no redirection occurs.
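
A minimal sketch of the save-then-run behavior described above, using only the standard library. It assumes the script is run with an `Rscript` executable on the PATH; the function name and the `interpreter` parameter are hypothetical and are not part of the documented signature.

```python
import subprocess


def run_script_sketch(string, scriptname, log=None, interpreter='Rscript'):
    """Write `string` to `scriptname`, then run it with `interpreter`."""
    with open(scriptname, 'w') as handle:
        handle.write(string)
    command = [interpreter, scriptname]
    if log is None:
        # No redirection: output goes wherever the parent process sends it.
        subprocess.run(command, check=True)
    else:
        # Send both stdout and stderr to the log file.
        with open(log, 'w') as logfile:
            subprocess.run(command, stdout=logfile,
                           stderr=subprocess.STDOUT, check=True)
```

Writing the script to disk before running it leaves a copy behind, which helps when debugging a filled-in template.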

lcdblib.snakemake.interface module

class lcdblib.snakemake.interface.SampleHandler(config)[source]

Bases: object

Basic interface to help handle filenames in snakemake

build_targets(patterns)[source]

Build target file names based on pattern naming scheme.

Given a list of string-formatted patterns, uses config information to fill in the patterns and generate a list of file targets.

Parameters: patterns (list) – List of file names with string formatting marks that can be filled in from the config or the sampleTable.
Returns: Filled-in list of file names.
Return type: list
find_level(prefix)[source]

Figure out which regex the prefix matches.

Scans each level and tries to identify which level the prefix matches.

Parameters: prefix (str) – String to match against the level patterns; it should already have sample information filled in.
Returns: [0] is a string indicating which level matched; [1] is a list of dicts with sample information.
Return type: tuple
Raises: ValueError – If the prefix does not match any of the patterns.

Example

>>> SH = SampleHandler(test_config)
>>> prefix = 'pasilla/treated1/treated1_treated_1_0001_R1'
>>> level, attrs = SH.find_level(prefix)
>>> level
'rawLevel'
>>> assert attrs == [
... {'replicate': '1', 'sampleID': 'treated1', 'treatment': 'treated'}
... ]

This prefix doesn’t resolve to an actual sample defined in the sample table (note the mismatch between the “1” used everywhere else and the final “_2_”). It still matches the run level, with an empty list of attributes. Should this raise ValueError instead?

>>> prefix = 'pasilla_sample/treated1/treated1_treated_2_R1'
>>> level, attrs = SH.find_level(prefix)
>>> level
'runLevel'
>>> assert attrs == []

Run-level prefix:

>>> prefix = 'pasilla_sample/treated2/treated2_treated_2_R1'
>>> level, attrs = SH.find_level(prefix)
>>> level
'runLevel'
>>> assert attrs == [
... {'replicate': '2', 'sampleID': 'treated2', 'treatment': 'treated'}
... ]

Agg-level prefix (note the multiple sets of attributes returned):

>>> prefix = 'pasilla_agg/treated'
>>> level, attrs = SH.find_level(prefix)
>>> level
'aggLevel'
>>> assert attrs == [
... {'treatment': 'treated', 'replicate': '1', 'sampleID': 'treated1'},
... {'treatment': 'treated', 'replicate': '2', 'sampleID': 'treated2'}
... ]
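
The "scan each level" step can be sketched as a loop over named regexes. The level patterns below are hypothetical stand-ins for what a config might define, and this sketch returns only the regex groups; it does not look up rows in a sample table, which the real method evidently does.

```python
import re

# Hypothetical level patterns, loosely modeled on the examples above.
LEVEL_PATTERNS = {
    'runLevel': (r'pasilla_sample/(?P<sampleID>\w+?)/(?P=sampleID)'
                 r'_(?P<treatment>treated|untreated)_(?P<replicate>\d+)_R1'),
    'aggLevel': r'pasilla_agg/(?P<treatment>treated|untreated)',
}


def find_level_sketch(prefix, patterns=LEVEL_PATTERNS):
    """Return (level name, matched groups) for the first level that matches."""
    for level, pattern in patterns.items():
        match = re.fullmatch(pattern, prefix)
        if match:
            return level, match.groupdict()
    raise ValueError('Prefix matches no level: %s' % prefix)
```
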
find_sample(pattern, prefix)[source]

Find which sample(s) to use.

Parameters:
  • pattern (str) – A regular expression for the level
  • prefix (str) – The prefix that is being matched to the pattern
Returns: Sample attributes corresponding to the sample(s) that the prefix contains.
Return type: list of dict

Example

>>> SH = SampleHandler(test_config)
>>> pattern = (
... r'pasilla_sample\/(?P<sampleID>treated1|treated2|untreated1|untreated2)'
... r'\/(?P=sampleID)_(?P<treatment>treated|untreated)_(?P<replicate>1|2)_R1'
... )
>>> prefix = 'pasilla_sample/treated2/treated2_treated_2_R1'
>>> assert SH.find_sample(pattern, prefix) == [
... {'treatment': 'treated', 'sampleID': 'treated2', 'replicate': '2'}
... ]
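
One way to picture how a matched prefix maps to zero, one, or several samples is to filter a sample table by the captured groups. This is a sketch under assumptions: the table contents, the function name, and raising ValueError on a failed match are all hypothetical.

```python
import re

# Hypothetical sample table; the real one comes from the config.
SAMPLE_TABLE = [
    {'sampleID': 'treated1', 'treatment': 'treated', 'replicate': '1'},
    {'sampleID': 'treated2', 'treatment': 'treated', 'replicate': '2'},
    {'sampleID': 'untreated1', 'treatment': 'untreated', 'replicate': '1'},
]


def find_sample_sketch(pattern, prefix, samples=SAMPLE_TABLE):
    """Return every sample row consistent with the groups matched in `prefix`."""
    match = re.fullmatch(pattern, prefix)
    if match is None:
        raise ValueError('Prefix does not match pattern')
    wanted = match.groupdict()
    # Keep rows whose attributes agree with all captured groups.
    return [row for row in samples
            if all(row[key] == value for key, value in wanted.items())]
```

Under this scheme an aggregate-level pattern that captures only `treatment` returns several rows, while a run-level prefix whose attributes contradict each other returns an empty list.
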
make_input(prefix='prefix', midfix='', suffix='', agg=False)[source]

Generates Input Function based on wildcards.

Notes

example: ‘{first_element}{second_element}{third_element}’

Parameters:
  • prefix (str) – This can either be the entire prefix, or the name of the string formatting group that contains the prefix. This would be the ‘first_element’ in the example.
  • midfix (str) – This can either be the entire midfix, or the name of the string formatting group that contains the midfix. This would be the ‘second_element’ in the example.
  • suffix (str) – This can either be the entire suffix, or the name of the string formatting group that contains the suffix. This would be the ‘third_element’ in the example.
  • agg (bool) – True if you want file names from the level above the current prefix.
Returns: A Snakemake input function that generates a list of files.
Return type: function

Examples

>>> SH = SampleHandler(test_config)
>>> func = SH.make_input('prefix', '', '.fastq')
>>> func({'prefix': 'pasilla'})
['pasilla.fastq']

Instead of directly specifying the midfix, we pull it from the wildcards dict entry for “asdf”:

>>> func = SH.make_input('prefix', 'asdf', '.fastq')
>>> func({'prefix': 'pasilla', 'asdf': '.cutadapt'})
['pasilla.cutadapt.fastq']

We want to aggregate into “pasilla_agg/treated.fastq”, but the following doesn’t work because we need the directory slash:

>>> func = SH.make_input('pasilla_agg', 'treated', '.fastq', agg=True)
>>> assertRaises(ValueError, func, {})

Also doesn’t work:

>>> func = SH.make_input('pasilla_agg/', midfix='treated', suffix='.fastq', agg=True)
>>> assertRaises(ValueError, func, {})

However, if we provide most of the path in the prefix, it works:

>>> func = SH.make_input('pasilla_agg/treated', suffix='.fastq', agg=True)
>>> assert sorted(func({})) == [
... 'pasilla_sample/treated1/treated1_treated_1_R1.fastq',
... 'pasilla_sample/treated1/treated1_treated_2_R1.fastq',
... 'pasilla_sample/treated2/treated2_treated_1_R1.fastq',
... 'pasilla_sample/treated2/treated2_treated_2_R1.fastq'
... ]
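
The non-aggregating case from the first two examples can be sketched as a closure: each element is looked up in the wildcards and used literally if it is not a key. This is an illustrative assumption, not the method's code; it treats wildcards as a plain dict, as the examples do, and leaves out the `agg=True` behavior entirely.

```python
def make_input_sketch(prefix='prefix', midfix='', suffix=''):
    """Return an input function that joins prefix, midfix, and suffix."""
    def input_func(wildcards):
        # Use the wildcard value when the element names one, else the literal.
        parts = [wildcards.get(part, part)
                 for part in (prefix, midfix, suffix)]
        return [''.join(parts)]
    return input_func
```

Returning a function rather than a list defers the lookup until Snakemake knows the wildcard values for the job being planned.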

Module contents