Integrative workflows

Here we look at integrative workflows which can be used to combine multiple standard or non-standard workflows.

Colocalization workflow

The output of this workflow is a set of heatmaps showing metrics of colocalization between pairs of regions. This can be used to answer questions like “what else does my protein of interest bind with?”.

Several colocalization methods are run, and they all give slightly different results.

  • bedtools fisher

  • bedtools jaccard

  • GAT (Genome Association Test) log2 fold change

  • GAT (Genome Association Test) nucleotide overlap

  • IntervalStats

“External” workflow

Often we want to compare new data with existing published data. We have found that in practice, having a separate workflow to handle downloading and reformatting and various conversion tasks helps with organization.

The test workflow is a working example that:

  • downloads ChIP-seq data from modENCODE in an older fly genome organism (dm3)

  • downloads the chainfile for liftover

  • fixes the formatting of the downloaded files so they can be lifted over

  • lifts over the files to the newer dm6 assembly.

The file is intended to be heavily edited for the particular experiment; it is here mostly as a placeholder and to be used as a template for integrative downstream work. It can then be incorporated into the figures workflow (see “Figures” workflow) to integrate the analysis with other output.

_images/external.png

“Figures” workflow

This workflow is a working example of how you would tie together output from RNA-seq, ChIP-seq, and “external” workflows to automate figure creation and formally link together the dependencies of each figure all the way back to the original fastq files. If any changes are made upstream, they will trigger downstream rules to re-run as needed.

For example, if a new GTF annotation file comes out, you would change the URL in the config, and then re-run the figures workflow. All RNA-seq jobs that depended on some way on the GTF file will be re-run. This will include the feature counts and downstream RNA-seq analysis, but will not re-run any of the jobs like trimming or mapping that do not depend on the GTF. In addition, any figures that depended in some way on that GTF file will also be re-run.

To provide a sufficiently complex example that can be used in real-world applications, this workflow currently:

  • counts the number of peaks in all configured peak-calling runs and stores the output in a TSV report (work is performed in the script scripts/peak_count.py)

  • identifies peaks at promoters of genes, reports a summary of how many peaks from each run were found in a promoter, and creates BED files of these subsets (scripts/peaks_at_promoters.py)

  • builds the DAG image for the ChIP-seq and RNA-seq workflows

  • symlinks over the ChIP-seq peaks and RNA-seq differential expression results

  • extracts the text from the README.txt files created by the scripts (usually from their docstrings), and compiles them into a summary report

  • the figures directory can then be zipped up and distributed to collaborators