Helper functions for working with aligners within Snakefiles
lcdblib.snakemake.aligners.bowtie2_index_from_prefix(prefix)
Given a prefix, return a list of the corresponding bowtie2 index files.
lcdblib.snakemake.aligners.hisat2_index_from_prefix(prefix)
Given a prefix, return a list of the corresponding hisat2 index files.
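As a rough illustration of what "index files for a prefix" means, the sketch below builds the file names that a standard (non-large) bowtie2 index and a standard hisat2 index consist of. The function names are hypothetical, and whether lcdblib returns exactly these names in this order is an assumption.

```python
def bowtie2_index_files(prefix):
    """Sketch: the six files of a standard (non-large) bowtie2 index."""
    suffixes = ['.1.bt2', '.2.bt2', '.3.bt2', '.4.bt2',
                '.rev.1.bt2', '.rev.2.bt2']
    return [prefix + suffix for suffix in suffixes]


def hisat2_index_files(prefix):
    """Sketch: the eight files of a standard (non-large) hisat2 index."""
    return ['{0}.{1}.ht2'.format(prefix, n) for n in range(1, 9)]


# 'idx/dm6' is a made-up prefix used only for illustration.
print(bowtie2_index_files('idx/dm6'))
print(hisat2_index_files('idx/dm6'))
```

Lists like these are convenient as the `output` of an index-building rule and the `input` of an alignment rule, so that Snakemake tracks every index file rather than just the prefix.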
lcdblib.snakemake.helpers.extract_wildcards(pattern, target)
Return a dictionary of wildcards and values identified from target.
Returns None if the regex match failed.
Examples
>>> pattern = '{output}/{sample}.bam'
>>> target = 'data/a.bam'
>>> expected = {'output': 'data', 'sample': 'a'}
>>> assert extract_wildcards(pattern, target) == expected
>>> assert extract_wildcards(pattern, 'asdf') is None
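The behaviour shown in the examples above can be approximated by turning each `{name}` field into a named regex group. This is a minimal sketch, not the library's implementation: the name `extract_wildcards_sketch` is hypothetical, and restricting each wildcard to `[^/]+` (no directory separators) is a simplifying assumption.

```python
import re


def extract_wildcards_sketch(pattern, target):
    """Return {wildcard: value} parsed from target, or None if no match."""
    # re.split with a capturing group alternates literal text (even
    # indices) and wildcard names (odd indices).
    parts = re.split(r'\{(\w+)\}', pattern)
    regex = ''
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)               # literal text
        else:
            regex += '(?P<{0}>[^/]+)'.format(part)  # named wildcard
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None


print(extract_wildcards_sketch('{output}/{sample}.bam', 'data/a.bam'))
```

A pattern that repeats a wildcard name would need a backreference instead of a second named group; the sketch does not handle that case.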
lcdblib.snakemake.helpers.fill_patterns(patterns, fill, combination=itertools.product)
Fills in a dictionary of patterns with the dictionary or DataFrame fill.
>>> patterns = dict(a='{sample}_R{N}.fastq')
>>> fill = dict(sample=['one', 'two'], N=[1, 2])
>>> sorted(fill_patterns(patterns, fill)['a'])
['one_R1.fastq', 'one_R2.fastq', 'two_R1.fastq', 'two_R2.fastq']
>>> patterns = dict(a='{sample}_R{N}.fastq')
>>> fill = dict(sample=['one', 'two'], N=[1, 2])
>>> sorted(fill_patterns(patterns, fill, zip)['a'])
['one_R1.fastq', 'two_R2.fastq']
>>> import pandas as pd
>>> patterns = dict(a='{sample}_R{N}.fastq')
>>> fill = pd.DataFrame({'sample': ['one', 'two'], 'N': [1, 2]})
>>> sorted(fill_patterns(patterns, fill)['a'])
['one_R1.fastq', 'two_R2.fastq']
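The difference between the `itertools.product` and `zip` cases above comes down to how the fill values are combined before formatting. The following sketch reproduces the dictionary cases only; `fill_patterns_sketch` is a hypothetical name, and DataFrame input (which, per the example above, fills row by row like `zip`) is left out.

```python
import itertools


def fill_patterns_sketch(patterns, fill, combination=itertools.product):
    """Fill every pattern with each combination of the values in fill."""
    keys = list(fill)
    # product gives every pairing of values; zip pairs them positionally.
    combos = [dict(zip(keys, values))
              for values in combination(*(fill[key] for key in keys))]
    return {name: [pattern.format(**combo) for combo in combos]
            for name, pattern in patterns.items()}


patterns = dict(a='{sample}_R{N}.fastq')
fill = dict(sample=['one', 'two'], N=[1, 2])
print(fill_patterns_sketch(patterns, fill)['a'])
print(fill_patterns_sketch(patterns, fill, zip)['a'])
```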
lcdblib.snakemake.helpers.rscript(string, scriptname, log=None)
Saves the string as scriptname and then runs it.
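A save-then-run helper of this kind can be sketched with `subprocess`. Everything below is an assumption about how such a helper might work: the name `run_script_sketch` and the `interpreter` parameter are hypothetical, and sending both stdout and stderr to `log` is a guess at the intended behaviour.

```python
import subprocess


def run_script_sketch(string, scriptname, log=None, interpreter='Rscript'):
    """Write string to scriptname, then run it with interpreter."""
    with open(scriptname, 'w') as handle:
        handle.write(string)
    command = [interpreter, scriptname]
    if log is None:
        # No log requested: output goes to the parent's stdout/stderr.
        subprocess.run(command, check=True)
    else:
        # Capture stdout and stderr of the script in the log file.
        with open(log, 'w') as log_handle:
            subprocess.run(command, check=True,
                           stdout=log_handle, stderr=subprocess.STDOUT)
    return scriptname
```

Keeping the script on disk, rather than piping it to the interpreter, leaves a file that can be inspected or rerun by hand when a rule fails.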
lcdblib.snakemake.interface.SampleHandler(config)
Bases: object
Basic interface to help handle filenames in snakemake
build_targets(patterns)
Build target file names based on the pattern naming scheme.
Given a list of string-format patterns, uses config information to fill in the patterns and generate a list of file targets.
| Parameters: | patterns (list) – List of file names with string formatting marks that can be filled in from the config or the sampleTable. |
|---|---|
| Returns: | Filled in list of file names. |
| Return type: | list |
find_level(prefix)
Figure out which regex the prefix matches.
Scans each level and tries to identify which level the prefix matches.
| Parameters: | prefix (str) – String to match against each level's pattern; it should already have sample information filled in. |
|---|---|
| Returns: | [0] is a string naming the matched level; [1] is a list of dicts with sample information |
| Return type: | tuple |
| Raises: | ValueError – if the prefix does not match any of the patterns (no custom exception is defined). |
Example
>>> SH = SampleHandler(test_config)
>>> prefix = 'pasilla/treated1/treated1_treated_1_0001_R1'
>>> level, attrs = SH.find_level(prefix)
>>> level
'rawLevel'
>>> assert attrs == [
... {'replicate': '1', 'sampleID': 'treated1', 'treatment': 'treated'}
... ]
This prefix doesn’t resolve to an actual sample defined in the sampletable (note the mismatch between the “1” everywhere else and the final “_2_”). Should this raise ValueError?
>>> prefix = 'pasilla_sample/treated1/treated1_treated_2_R1'
>>> level, attrs = SH.find_level(prefix)
>>> level
'runLevel'
>>> assert attrs == []
Run-level prefix:
>>> prefix = 'pasilla_sample/treated2/treated2_treated_2_R1'
>>> level, attrs = SH.find_level(prefix)
>>> level
'runLevel'
>>> assert attrs == [
... {'replicate': '2', 'sampleID': 'treated2', 'treatment': 'treated'}
... ]
Agg-level prefix (note the multiple sets of attributes returned):
>>> prefix = 'pasilla_agg/treated'
>>> level, attrs = SH.find_level(prefix)
>>> level
'aggLevel'
>>> assert attrs == [
... {'treatment': 'treated', 'replicate': '1', 'sampleID': 'treated1'},
... {'treatment': 'treated', 'replicate': '2', 'sampleID': 'treated2'}
... ]
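The level scan described above can be pictured as trying one regex per level until one matches. The sketch below is a simplification under stated assumptions: `LEVEL_PATTERNS` and `find_level_sketch` are hypothetical, the regexes are made up to resemble the examples, and the sketch returns the raw matched groups rather than the list of sample-table entries that `find_level` returns.

```python
import re

# Hypothetical level patterns; the real ones would come from the config.
LEVEL_PATTERNS = {
    'runLevel': (
        r'pasilla_sample/(?P<sampleID>treated1|treated2)'
        r'/(?P=sampleID)_(?P<treatment>treated|untreated)'
        r'_(?P<replicate>1|2)_R1'
    ),
    'aggLevel': r'pasilla_agg/(?P<treatment>treated|untreated)',
}


def find_level_sketch(prefix, level_patterns=LEVEL_PATTERNS):
    """Return (level, matched groups) for the first level that matches."""
    for level, pattern in level_patterns.items():
        match = re.fullmatch(pattern, prefix)
        if match is not None:
            return level, match.groupdict()
    raise ValueError(
        'prefix {0!r} does not match any level'.format(prefix))


print(find_level_sketch('pasilla_agg/treated'))
```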
find_sample(pattern, prefix)
Find which sample(s) to use.
Example
>>> SH = SampleHandler(test_config)
>>> pattern = (
... r'pasilla_sample\/(?P<sampleID>treated1|treated2|untreated1|untreated2)'
... r'\/(?P=sampleID)_(?P<treatment>treated|untreated)_(?P<replicate>1|2)_R1'
... )
>>> prefix = 'pasilla_sample/treated2/treated2_treated_2_R1'
>>> assert SH.find_sample(pattern, prefix) == [
... {'treatment': 'treated', 'sampleID': 'treated2', 'replicate': '2'}
... ]
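One way to read the example above: the named groups captured from the prefix act as a filter over the sample table. The sketch below assumes that reading. `SAMPLE_TABLE` is a hypothetical stand-in for the table the config would provide, and `find_sample_sketch` is not the library's implementation.

```python
import re

# Hypothetical sample table standing in for the one defined by the config.
SAMPLE_TABLE = [
    {'sampleID': 'treated1', 'treatment': 'treated', 'replicate': '1'},
    {'sampleID': 'treated2', 'treatment': 'treated', 'replicate': '2'},
]


def find_sample_sketch(pattern, prefix, sample_table=SAMPLE_TABLE):
    """Return the sample-table rows consistent with the matched groups."""
    match = re.fullmatch(pattern, prefix)
    if match is None:
        return []
    wanted = match.groupdict()
    # Keep a row only if it agrees with every captured attribute.
    return [row for row in sample_table
            if all(row.get(key) == value for key, value in wanted.items())]


pattern = (
    r'pasilla_sample/(?P<sampleID>treated1|treated2)'
    r'/(?P=sampleID)_(?P<treatment>treated|untreated)_(?P<replicate>1|2)_R1'
)
print(find_sample_sketch(pattern, 'pasilla_sample/treated2/treated2_treated_2_R1'))
```

Under this reading, a prefix that matches the regex but names an attribute combination absent from the table yields an empty list, and a pattern that captures only `treatment` yields every row sharing that treatment.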
make_input(prefix='prefix', midfix='', suffix='', agg=False)
Generates an input function based on wildcards.
Notes
example: ‘{first_element}{second_element}{third_element}’
| Returns: | A snakemake input function that generates a list of files. |
|---|---|
| Return type: | function |
Examples
>>> SH = SampleHandler(test_config)
>>> func = SH.make_input('prefix', '', '.fastq')
>>> func({'prefix': 'pasilla'})
['pasilla.fastq']
Instead of directly specifying the midfix, we pull it from the wildcards dict entry for “asdf”:
>>> func = SH.make_input('prefix', 'asdf', '.fastq')
>>> func({'prefix': 'pasilla', 'asdf': '.cutadapt'})
['pasilla.cutadapt.fastq']
We want to aggregate into “pasilla_agg/treated.fastq”, but the following doesn’t work because we need the directory slash:
>>> func = SH.make_input('pasilla_agg', 'treated', '.fastq', agg=True)
>>> assertRaises(ValueError, func, {})
Also doesn’t work:
>>> func = SH.make_input('pasilla_agg/', midfix='treated', suffix='.fastq', agg=True)
>>> assertRaises(ValueError, func, {})
However if we provide most of the path in the prefix it works:
>>> func = SH.make_input('pasilla_agg/treated', suffix='.fastq', agg=True)
>>> assert sorted(func({})) == [
... 'pasilla_sample/treated1/treated1_treated_1_R1.fastq',
... 'pasilla_sample/treated1/treated1_treated_2_R1.fastq',
... 'pasilla_sample/treated2/treated2_treated_1_R1.fastq',
... 'pasilla_sample/treated2/treated2_treated_2_R1.fastq'
... ]
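The non-aggregating examples above are consistent with a closure that looks each element up in the wildcards and falls back to the literal text when there is no such key. That rule is inferred from the examples, not confirmed by the source; `make_input_sketch` is a hypothetical name, and the `agg=True` branch (which depends on the sample table) is omitted.

```python
def make_input_sketch(prefix='prefix', midfix='', suffix=''):
    """Return an input function that joins prefix, midfix and suffix."""
    def input_function(wildcards):
        # Use the wildcard value when the element names a wildcard;
        # otherwise treat the element as literal text.
        parts = [wildcards.get(element, element)
                 for element in (prefix, midfix, suffix)]
        return [''.join(parts)]
    return input_function


func = make_input_sketch('prefix', 'asdf', '.fastq')
print(func({'prefix': 'pasilla', 'asdf': '.cutadapt'}))
```

The sketch takes a plain dict, as the examples do; Snakemake itself passes a wildcards object to input functions, so a real implementation would read attributes from that object instead.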