Module lib.chipseq

Handling ChIP-seq peak-calling configuration correctly is complex. The functions in this module help manipulate the config information so we can use it more easily in the ChIP-seq workflow without cluttering the Snakefile.

peak_calling_dict(config[, algorithm])

Returns a dictionary of peak-calling runs from the config.

block_for_run(config, label, algorithm)

Returns the block for the (label, algorithm) run.

samples_for_run(config, label, algorithm, ...)

Returns the sample names configured for a particular peak-calling run.

merged_input_for_ip(sampletable, merged_ip)

Returns the merged input label for a merged IP label.

detect_peak_format(fn)

Figure out if a BED file is narrowPeak or broadPeak.

Details

Helpers for ChIP-seq.

lib.chipseq.block_for_run(config, label, algorithm)

Returns the block for the (label, algorithm) run.

Parameters:
  • config (dict)

  • label (str)

  • algorithm (str)

lib.chipseq.detect_peak_format(fn)

Figure out if a BED file is narrowPeak or broadPeak.

Returns None if undetermined.

This is useful for deciding which autoSql file to use, or which bigBed format to use: BED6 for a plain BED file, BED6+4 for narrowPeak, or BED6+3 for broadPeak.
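
The detection idea can be sketched by counting fields: the UCSC narrowPeak format has 10 tab-separated columns and broadPeak has 9. The function below is a hypothetical illustration only. Its name, the header-skipping rules, and the returned strings are assumptions, not this module's implementation.

```python
import gzip


def sketch_detect_peak_format(fn):
    """Hypothetical sketch, not the module's implementation.

    Guess the peak format from the number of tab-separated fields in the
    first data line: 10 fields for narrowPeak (BED6+4), 9 for broadPeak
    (BED6+3). Returns None for an empty file or any other width.
    """
    # Assumption: gzipped files are recognised by their ".gz" suffix.
    opener = gzip.open if str(fn).endswith(".gz") else open
    with opener(fn, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and UCSC-style header or comment lines.
            if not line or line.startswith(("track", "browser", "#")):
                continue
            n_fields = len(line.split("\t"))
            return {10: "narrowPeak", 9: "broadPeak"}.get(n_fields)
    return None
```

Only the first data line is inspected, so a file whose width changes partway through would not be caught by this sketch.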

lib.chipseq.merged_input_for_ip(sampletable, merged_ip)

Returns the merged input label for a merged IP label.

This is primarily used for the fingerprint rule, where we collect all the available input BAMs together.

Parameters:
  • sampletable (pandas.DataFrame)

  • merged_ip (str) – Label of IP to use, must be present in the label column of the sampletable.

Examples

An example makes this easier to follow.

Samples ip1 and ip2 are technical replicates. They are from a different experiment than ip3 and input3, hence their different biological_material.

We know that input1 should be paired with ip1 and ip2 because it shares the same biological material.

Compare input1 and input9. In the table below they share both the same label and the same biological material, so they contribute a single merged input label.

>>> from io import StringIO
>>> import pandas as pd
>>> df = pd.read_csv(StringIO('''
... samplename  antibody   biological_material  label
... ip1         gaf        s2cell-1             s2cell-gaf-1
... ip2         gaf        s2cell-1             s2cell-gaf-1
... ip3         ctcf       s2cell-2             s2cell-ctcf-1
... input1      input      s2cell-1             s2cell-input-1
... input3      input      s2cell-2             s2cell-input-3
... input9      input      s2cell-1             s2cell-input-1'''),
... sep=r'\s+')
>>> merged_input_for_ip(df, 's2cell-gaf-1')
['s2cell-input-1']
>>> merged_input_for_ip(df, 's2cell-ctcf-1')
['s2cell-input-3']

lib.chipseq.peak_calling_dict(config, algorithm=None)

Returns a dictionary of peak-calling runs from the config.

Parameters:
  • config (dict)

  • algorithm (str or None) – If algorithm is None, the dictionary is keyed by (label, algorithm). Otherwise, only the runs for algorithm are returned, keyed by label.
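
The two keying modes described above can be sketched as follows. This is a minimal illustration under stated assumptions: the config layout (a list of run blocks under config["chipseq"]["peak_calling"], each with "label" and "algorithm" entries), the function name, and the example labels are all hypothetical and may not match the real config schema.

```python
def sketch_peak_calling_dict(config, algorithm=None):
    """Hypothetical sketch of the keying behaviour, not the real function."""
    runs = {}
    # Assumed layout: a list of run blocks under config["chipseq"]["peak_calling"].
    for block in config["chipseq"]["peak_calling"]:
        if algorithm is None:
            # No filter: every run is kept, keyed by (label, algorithm).
            runs[(block["label"], block["algorithm"])] = block
        elif block["algorithm"] == algorithm:
            # Filtered: only matching runs are kept, keyed by label alone.
            runs[block["label"]] = block
    return runs


# Illustrative config only; labels and sample names are made up.
example_config = {
    "chipseq": {
        "peak_calling": [
            {"label": "gaf-1", "algorithm": "macs2",
             "ip": ["s2cell-gaf-1"], "control": ["s2cell-input-1"]},
            {"label": "gaf-1", "algorithm": "spp",
             "ip": ["s2cell-gaf-1"], "control": ["s2cell-input-1"]},
        ]
    }
}

all_runs = sketch_peak_calling_dict(example_config)
macs2_runs = sketch_peak_calling_dict(example_config, algorithm="macs2")
print(sorted(all_runs))    # [('gaf-1', 'macs2'), ('gaf-1', 'spp')]
print(sorted(macs2_runs))  # ['gaf-1']
```

Keying by the (label, algorithm) pair lets the same label be reused across algorithms without collisions, while the filtered form gives shorter keys when only one algorithm is of interest.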

lib.chipseq.samples_for_run(config, label, algorithm, treatment)

Returns the sample names configured for a particular peak-calling run.

Parameters:
  • config (dict)

  • label – Together with algorithm, used as the key into peak_calling_dict()

  • algorithm – Together with label, used as the key into peak_calling_dict()

  • treatment (ip | input)
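
The lookup described by these parameters might look like the sketch below. It is hypothetical: unlike the documented signature, it takes an already-built dictionary of runs keyed by (label, algorithm) instead of the full config, and it assumes each run block stores IP samples under "ip" and input samples under "control". Neither detail is confirmed by this page.

```python
def sketch_samples_for_run(runs, label, algorithm, treatment):
    """Hypothetical sketch: return the samples for one peak-calling run."""
    if treatment not in ("ip", "input"):
        raise ValueError("treatment must be 'ip' or 'input'")
    # (label, algorithm) together select a single run block.
    block = runs[(label, algorithm)]
    # Assumption: input samples are stored under the "control" key.
    key = "ip" if treatment == "ip" else "control"
    return list(block[key])


# Illustrative runs only; labels and sample names are made up.
example_runs = {
    ("gaf-1", "macs2"): {
        "ip": ["s2cell-gaf-1"],
        "control": ["s2cell-input-1"],
    },
}

print(sketch_samples_for_run(example_runs, "gaf-1", "macs2", "ip"))
print(sketch_samples_for_run(example_runs, "gaf-1", "macs2", "input"))
```

Validating treatment up front gives a clear error for a misspelled value instead of a KeyError from deep inside the lookup.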