lcdblib.plotting package

Submodules

lcdblib.plotting.basic module

Set of basic plotting functions.

class lcdblib.plotting.basic.PairGrid(data, hue=None, hue_order=None, palette=None, hue_kws=None, vars=None, x_vars=None, y_vars=None, diag_sharey=True, size=2.5, aspect=1, despine=True, dropna=True, subplots_kws={})[source]

Bases: seaborn.axisgrid.PairGrid

lcdblib.plotting.basic.corrfunc(x, y, loc=(0.1, 0.5), template='r = {:0.4f}', type='spearman', **kwargs)[source]

Adds text to the current axes showing either the Spearman or Pearson correlation coefficient (r).

Parameters:
  • x (array-like) – Vector of values.
  • y (array-like) – Vector of values.
  • type (str) – Which type of correlation coefficient to use. Either “spearman” or “pearson”. Uses scipy.stats module.
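
To make the difference between the two coefficients concrete, here is a minimal pure-Python sketch. It does not call corrfunc or scipy.stats; the helper names pearson_r, average_ranks, and spearman_r are hypothetical and exist only for illustration. The result is formatted with the same default template, 'r = {:0.4f}'.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic in x, but not linear
label_spearman = 'r = {:0.4f}'.format(spearman_r(x, y))
label_pearson = 'r = {:0.4f}'.format(pearson_r(x, y))
print(label_spearman)
print(label_pearson)
```

Because y increases monotonically with x, the rank-based coefficient is exactly 1, while the linear coefficient is slightly lower.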
lcdblib.plotting.basic.lowerTriangle(df, func, func_kw={}, pairgrid_kw={}, **kwargs)[source]

Create a PairGrid lower triangle panel.

Parameters:
  • df (pandas.DataFrame) – DataFrame containing the columns you want to plot in a PairGrid.
  • func (function or list of functions) – The function you want to plot in a PairGrid
Return type:

seaborn.PairGrid

lcdblib.plotting.basic.maPlot(x, y, data=None, title=None, log=False, **kwargs)[source]

Creates a MA plot.

Parameters:
  • x, y – Input variables. If strings, these should correspond with column names in data. When pandas objects are used, axes will be labeled with the series name.
  • data (DataFrame) – Tidy (“long-form”) dataframe where each column is a variable and each row is an observation.
  • title (str) – Title to add to the plot
  • log (bool) – If True, the log2 of the geometric mean and of the ratio will be used instead of the mean and difference.
  • **kwargs (dict) – Values to pass to seaborn.regplot
Returns:

A matplotlib axes.

Return type:

matplotlib.axes.Axes

Example

Given that ‘x’ and ‘y’ are column names in df:
>>> maPlot('x', 'y', data=df, log=True, fit_reg=True,
... line_kws={'color': 'red'})
Given that ‘x’ and ‘y’ are pandas.Series:
>>> maPlot(x, y, log=True, fit_reg=True, line_kws={'color': 'red'})
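
The quantities behind the log option can be sketched without any plotting. The helper ma_values below is hypothetical and not part of lcdblib. It assumes strictly positive inputs when log=True and takes x relative to y for the M value; the library may use the opposite order.

```python
import math


def ma_values(x, y, log=False):
    """Return (A, M) lists for paired measurements x and y.

    Assumption: M is x relative to y; the real function may flip this.
    """
    a_vals, m_vals = [], []
    for xi, yi in zip(x, y):
        if log:
            # log2 of the geometric mean, and log2 of the ratio
            a_vals.append(0.5 * (math.log2(xi) + math.log2(yi)))
            m_vals.append(math.log2(xi) - math.log2(yi))
        else:
            # arithmetic mean, and plain difference
            a_vals.append((xi + yi) / 2)
            m_vals.append(xi - yi)
    return a_vals, m_vals


a_log, m_log = ma_values([8, 4], [2, 4], log=True)
a_lin, m_lin = ma_values([8, 4], [2, 4])
print(a_log, m_log)
print(a_lin, m_lin)
```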

lcdblib.plotting.colormap_adjust module

Module to handle custom colormaps.

cmap_powerlaw_adjust, cmap_center_adjust, and cmap_center_point_adjust are from https://sites.google.com/site/theodoregoetz/notes/matplotlib_colormapadjust

lcdblib.plotting.colormap_adjust.cmap_center_adjust(cmap, center_ratio)[source]

Returns a new colormap based on the one given, adjusted so that the old center point is shifted higher (center_ratio > 0.5) or lower (center_ratio < 0.5).

Parameters:
  • cmap – colormap instance (e.g., cm.jet)
  • center_ratio
lcdblib.plotting.colormap_adjust.cmap_center_point_adjust(cmap, range, center)[source]

Converts center to a ratio between 0 and 1 of the range given and calls cmap_center_adjust(). Returns a new colormap adjusted accordingly.

Parameters:
  • cmap – colormap instance
  • range – Tuple of (min, max)
  • center – New cmap center
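
The conversion from center to a ratio can be sketched as follows. The helper center_to_ratio is hypothetical, and raising an error for a center outside the range is an assumption; the library may handle that case differently.

```python
def center_to_ratio(value_range, center):
    """Express `center` as a fraction of the way from min to max."""
    vmin, vmax = value_range
    if not vmin < center < vmax:
        # Assumption: treat an out-of-range center as an error.
        raise ValueError('center must lie strictly inside the range')
    return (center - vmin) / (vmax - vmin)


ratio = center_to_ratio((-2.0, 6.0), 0.0)
print(ratio)
```

For data spanning -2 to 6, zero sits a quarter of the way along the range, so the ratio passed on would be 0.25.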
lcdblib.plotting.colormap_adjust.cmap_powerlaw_adjust(cmap, a)[source]

Returns a new colormap based on the one given but adjusted via power-law, newcmap = oldcmap**a.

Parameters:
  • cmap – colormap instance (e.g., cm.jet)
  • a – power
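
One way to read newcmap = oldcmap**a is that each anchor position p along the colormap moves to p**a while its color stays the same. That reading is an assumption about the implementation, and the sketch below uses a plain list of (position, color) pairs rather than a matplotlib colormap. The helper names are hypothetical. Under the same assumption, the exponent that moves the old midpoint to a given ratio follows from 0.5**a = ratio.

```python
import math


def powerlaw_positions(anchors, a):
    """Move each (position, color) anchor from position p to p ** a."""
    return [(pos ** a, color) for pos, color in anchors]


def exponent_for_center(center_ratio):
    """Exponent that sends the old midpoint 0.5 to `center_ratio`."""
    return math.log(center_ratio) / math.log(0.5)


# Three anchors standing in for a diverging colormap.
anchors = [(0.0, 'blue'), (0.5, 'white'), (1.0, 'red')]
a = exponent_for_center(0.25)
shifted = powerlaw_positions(anchors, a)
print(a, shifted)
```

The end points stay at 0 and 1, and only the interior anchors move.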
lcdblib.plotting.colormap_adjust.color_test(color)[source]

Figure filled in with color; useful for troubleshooting or experimenting with colors

lcdblib.plotting.colormap_adjust.smart_colormap(vmin, vmax, color_high='#b11902', hue_low=0.6)[source]

Creates a “smart” colormap that is centered on zero, and accounts for asymmetrical vmin and vmax by matching saturation/value of high and low colors.

It works by first creating a colormap from white to color_high. Setting this color to the max(abs([vmin, vmax])), it then determines what the color of min(abs([vmin, vmax])) should be on that scale. Then it shifts the color to the new hue hue_low, and finally creates a new colormap with the new hue-shifted as the low, color_high as the max, and centered on zero.

Parameters:
  • color_high – a matplotlib color – try “#b11902” for a nice red
  • hue_low – float in [0, 1] – try 0.6 for a nice blue
  • vmin – lowest value in data you’ll be plotting
  • vmax – highest value in data you’ll be plotting
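
The color-matching step described above can be sketched with the standard-library colorsys module. This is an illustration under stated assumptions, not the library's code: the helpers hex_to_rgb and low_color are hypothetical, the white-to-color_high ramp is assumed to be linear in RGB, and vmin and vmax are assumed not to both be zero.

```python
import colorsys


def hex_to_rgb(color):
    """'#rrggbb' -> (r, g, b) floats in [0, 1]."""
    color = color.lstrip('#')
    return tuple(int(color[i:i + 2], 16) / 255 for i in (0, 2, 4))


def low_color(vmin, vmax, color_high='#b11902', hue_low=0.6):
    """Color for the smaller-magnitude end, hue-shifted to `hue_low`."""
    big = max(abs(vmin), abs(vmax))
    small = min(abs(vmin), abs(vmax))
    frac = small / big
    high = hex_to_rgb(color_high)
    # Color at `frac` of the way from white (1, 1, 1) to color_high.
    on_ramp = tuple(1 + frac * (c - 1) for c in high)
    # Keep saturation and value, swap in the new hue.
    _, sat, val = colorsys.rgb_to_hsv(*on_ramp)
    return colorsys.hsv_to_rgb(hue_low, sat, val)


low = low_color(vmin=-1, vmax=2)
hue, sat, val = colorsys.rgb_to_hsv(*low)
print(low)
```

With vmin=-1 and vmax=2, the low end reaches only half the intensity of the high end, so a value of -1 looks as strong as +1 does.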

lcdblib.plotting.compare_rnaseq_and_chipseq module

lcdblib.plotting.compare_rnaseq_and_chipseq.plot(de_results, regions=None, peaks=None, selected=None, x='baseMean', y='log2FoldChange', disable_logx=False, logy=False, pval_col='padj', alpha=0.1, lfc_cutoff=0, plot_filename=None, disable_raster_points=False, genes_to_label=None, label_column=None, report=None, gene_lists=None)[source]

M-A plot showing up- and downregulated genes with optional labeling and Fisher's exact tests.

If --plot-filename is not specified, then the plot will be displayed and points can be clicked for interactive exploration.

If --peaks and --regions are specified, then results from Fisher's exact tests will be printed to stdout, or to --report if specified.

Parameters:
  • de_results (str or pandas.DataFrame) – If str, it’s the filename of a TSV of differential expression results, with first column as gene ID. It will be parsed into a dataframe where the index is gene ID. When called as a library, an already-created pandas.DataFrame can optionally be provided instead.
  • regions (str or pybedtools.BedTool) – Gene regions in which to look for intersections with peaks. BED file where the 4th column contains gene IDs that are also present in first column of de_results. Typically this would be a BED file of promoters or gene bodies. When called as a library, a pybedtools.BedTool object can optionally be provided instead.
  • peaks (str or pybedtools.BedTool) – BED file to be intersected with regions. When called as a library, a pybedtools.BedTool object can optionally be provided instead.
  • selected (str or list-like) – Replaces the regions and peaks arguments; useful when you already know which genes you want to select (e.g., upregulated from a different experiment). If a string, assume it’s a filename and use its first column as an index into the de_results dataframe. When called as a library, if selected is not a string it will be used directly as an index into the dataframe.
  • x (str) – Column to use for x-axis. Default of “baseMean” expects DESeq2 results
  • y (str) – Column to use for y-axis. Default of “log2FoldChange” expects DESeq2 results
  • disable_logx (bool) – Disable default behavior of transforming x values using log10
  • logy (bool) – Transform y values using log2
  • pval_col (str) – Column to use for statistical significance. Default “padj” expects DESeq2 results.
  • alpha (float) – Threshold for calling significance. Applied to pval_col
  • lfc_cutoff (float) – Log2 fold change cutoff to be applied to y values. The threshold is applied after any transformation that was requested (e.g., the logy argument).
  • plot_filename (str) – File to save plot. Format auto-detected by extension. Output directory will be created if needed.
  • disable_raster_points (bool) – Disable the default behavior of rasterizing points in a PDF. Use sparingly, since drawing 30k+ individual points in a PDF may slow down your machine.
  • genes_to_label (str or list-like) – Optional file containing genes to label with text. First column must be a subset of the first column of de_results. Lines starting with ‘#’ and subsequent tab-separated columns will be ignored. When called as a library, a list-like object of gene IDs can be provided.
  • label_column (str) – Optional column from which to take gene labels found in genes_to_label (e.g., “symbol”). If the value in this column is missing, fall back to the index. Use this if your gene IDs are long Ensembl IDs but you want the gene symbols to show up on the plot.
  • report (str) – Where to write out Fisher’s exact test results. Default is stdout
  • gene_lists (str) – Prefix to gene lists. If specified, gene lists corresponding to the cells of the 2x2 Fisher's exact test will be written to {prefix}.up.tsv and {prefix}.dn.tsv. These are subsets of de_results where genes are upregulated and have a peak in region (or are selected), or downregulated and have a peak in region (or are selected), respectively.
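
To show what a 2x2 Fisher's exact test on "upregulated" versus "has a peak" looks like, here is a self-contained sketch using only the standard library. The gene IDs are made up, the helper names are hypothetical, and the two-sided alternative is an assumption; the function documented above may build its table or choose its alternative differently.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


def contingency(category, with_peak, all_genes):
    """Counts: in/out of `category` (rows) by with/without peak (columns)."""
    other = all_genes - category
    return (len(category & with_peak), len(category - with_peak),
            len(other & with_peak), len(other - with_peak))


# Hypothetical gene IDs, for illustration only.
all_genes = {'g1', 'g2', 'g3', 'g4', 'g5', 'g6', 'g7', 'g8'}
up = {'g1', 'g2', 'g3', 'g4'}
with_peak = {'g1', 'g2', 'g3', 'g5'}

table = contingency(up, with_peak, all_genes)
pvalue = fisher_exact_two_sided(*table)
print(table, pvalue)
```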

lcdblib.plotting.results_table module

class lcdblib.plotting.results_table.DESeq2Results(data, db=None, header_check=True, **kwargs)[source]

Bases: lcdblib.plotting.results_table.DESeqResults

Class for working with results from DESeq2.

Just like a DifferentialExpressionResults object, but sets the pval_column, lfc_column, and mean_column to the names used in DESeq2’s output.

The underlying pandas.DataFrame is always available with the data attribute.

Any attributes not explicitly in this class will be looked for in the underlying pandas.DataFrame.

Parameters:
  • data (string or pandas.DataFrame) – If string, assumes it’s a filename and calls pandas.read_table(data, **import_kwargs).
  • db (string or gffutils.FeatureDB) – Optional database that can be used to generate features
  • import_kwargs (dict) – These arguments will be passed to pandas.read_table() if data is a filename.
lfc_column = 'log2FoldChange'
mean_column = 'baseMean'
pval_column = 'padj'
class lcdblib.plotting.results_table.DESeqResults(data, db=None, header_check=True, **kwargs)[source]

Bases: lcdblib.plotting.results_table.DifferentialExpressionResults

Class for working with results from DESeq.

Just like a DifferentialExpressionResults object, but sets the pval_column, lfc_column, and mean_column to the names used in DESeq (v1) output.

The underlying pandas.DataFrame is always available with the data attribute.

Any attributes not explicitly in this class will be looked for in the underlying pandas.DataFrame.

Parameters:
  • data (string or pandas.DataFrame) – If string, assumes it’s a filename and calls pandas.read_table(data, **import_kwargs).
  • db (string or gffutils.FeatureDB) – Optional database that can be used to generate features
  • import_kwargs (dict) – These arguments will be passed to pandas.read_table() if data is a filename.
autosql_file()[source]

Generate the autosql for DESeq results (to create bigBed)

Returns a temp filename containing the autosql defining the extra fields.

This is for creating bigBed files from BED files created by colormapped_bedfile. When a user clicks on a feature, the DESeq results will be reported.

colormapped_bedfile(genome, cmap=None)[source]

Create a BED file with padj encoded as color

Features will be colored according to adjusted pval (phred transformed). Downregulated features have the sign flipped.

Parameters:
  • cmap (matplotlib colormap) – Default is matplotlib.cm.RdBu_r

Notes

Requires a FeatureDB to be attached.
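
A phred transform maps a p-value to -10 * log10(p). The sketch below shows how a signed score of that kind could be computed before it is mapped onto a diverging colormap. The helper signed_phred is hypothetical, and using the sign of the log2 fold change to detect downregulation is an assumption.

```python
import math


def signed_phred(padj, log2_fold_change):
    """-10 * log10(padj), negated when the feature is downregulated.

    Assumes padj > 0; a value of exactly 0 has no finite score.
    """
    score = -10 * math.log10(padj)
    return -score if log2_fold_change < 0 else score


up_score = signed_phred(0.001, 2.3)
down_score = signed_phred(0.001, -1.1)
print(up_score, down_score)
```

Equally significant up- and downregulated features end up at opposite ends of the scale, which is what lets a colormap such as RdBu_r separate them.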

class lcdblib.plotting.results_table.DifferentialExpressionResults(data, db=None, header_check=True, **kwargs)[source]

Bases: lcdblib.plotting.results_table.ResultsTable

A ResultsTable subclass for working with differential expression results.

Adds methods for up/down regulation, ma_plot, and sets class variables for which columns should be considered for pval, log fold change, and mean values. This class acts as a parent for subclasses like DESeqResults, EdgeRResults, and others.

The underlying pandas.DataFrame is always available with the data attribute.

Any attributes not explicitly in this class will be looked for in the underlying pandas.DataFrame.

Parameters:
  • data (string or pandas.DataFrame) – If string, assumes it’s a filename and calls pandas.read_table(data, **import_kwargs).
  • db (string or gffutils.FeatureDB) – Optional database that can be used to generate features
  • import_kwargs (dict) – These arguments will be passed to pandas.read_table() if data is a filename.
changed(alpha=0.1, lfc=0, idx=True)[source]

Changed features.

Helper function to get where the pval is <= alpha and the absolute value of the log2foldchange is >= lfc.

Parameters:
  • alpha (float) –
  • lfc (float) –
  • idx (bool) – If True, a boolean index will be returned. If False, a new object will be returned that has been subsetted.
downregulated(alpha=0.1, lfc=0, idx=True)[source]

Downregulated features.

Helper function to get where the pval is <= alpha and the log2foldchange is <= lfc.

Parameters:
  • alpha (float) –
  • lfc (float) –
  • idx (bool) – If True, a boolean index will be returned. If False, a new object will be returned that has been subsetted.
lfc_column = 'log2FoldChange'
ma_plot(alpha, up_kwargs=None, dn_kwargs=None, zero_line=None, **kwargs)[source]

MA plot.

Plots the average read count across treatments (x-axis) vs the log2 fold change (y-axis).

Additional kwargs are passed to self.scatter (useful ones might include genes_to_highlight)

Parameters:
  • alpha (float) – Features with values <= alpha will be highlighted in the plot.
  • up_kwargs, dn_kwargs – Kwargs passed to matplotlib’s scatter(), used for styling up/down regulated features (defined by alpha and col)
  • zero_line (None or dict) – Kwargs passed to matplotlib.axhline(0).
mean_column = 'baseMean'
pval_column = 'padj'
unchanged(alpha=0.1, lfc=0, idx=True)[source]

Unchanged features.

Helper function to get where the pval is > alpha and the absolute value of the log2foldchange is < lfc.

Parameters:
  • alpha (float) –
  • lfc (float) –
  • idx (bool) – If True, a boolean index will be returned. If False, a new object will be returned that has been subsetted.
upregulated(alpha=0.1, lfc=0, idx=True)[source]

Upregulated features.

Helper function to get where the pval is <= alpha and the log2foldchange is >= lfc.

Parameters:
  • alpha (float) –
  • lfc (float) –
  • idx (bool) – If True, a boolean index will be returned. If False, a new object will be returned that has been subsetted.
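
The thresholds used by changed, upregulated, and downregulated, and the effect of idx, can be sketched on a plain dictionary. This follows the conditions as documented above and does not use the class itself; the gene names, values, and the helper select are hypothetical.

```python
# Hypothetical results: gene ID -> (padj, log2FoldChange).
results = {
    'geneA': (0.01, 2.0),
    'geneB': (0.50, 3.0),
    'geneC': (0.04, -1.5),
    'geneD': (0.20, 0.1),
}


def select(results, keep, idx=True):
    """Boolean mask per gene if idx is True, otherwise the subset itself."""
    mask = {gene: keep(p, fc) for gene, (p, fc) in results.items()}
    if idx:
        return mask
    return {gene: results[gene] for gene, flag in mask.items() if flag}


def upregulated(results, alpha=0.1, lfc=0, idx=True):
    return select(results, lambda p, fc: p <= alpha and fc >= lfc, idx)


def downregulated(results, alpha=0.1, lfc=0, idx=True):
    return select(results, lambda p, fc: p <= alpha and fc <= lfc, idx)


def changed(results, alpha=0.1, lfc=0, idx=True):
    return select(results, lambda p, fc: p <= alpha and abs(fc) >= lfc, idx)


print(upregulated(results))
print(downregulated(results, idx=False))
print(changed(results, idx=False))
```

geneB has a large fold change but fails the alpha threshold, so it appears in none of the three selections.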
class lcdblib.plotting.results_table.EdgeRResults(data, db=None, header_check=True, **kwargs)[source]

Bases: lcdblib.plotting.results_table.DifferentialExpressionResults

Class for working with results from edgeR.

Just like a DifferentialExpressionResults object, but sets the pval_column, lfc_column, and mean_column to the names used in edgeR’s output.

The underlying pandas.DataFrame is always available with the data attribute.

Any attributes not explicitly in this class will be looked for in the underlying pandas.DataFrame.

Parameters:
  • data (string or pandas.DataFrame) – If string, assumes it’s a filename and calls pandas.read_table(data, **import_kwargs).
  • db (string or gffutils.FeatureDB) – Optional database that can be used to generate features
  • import_kwargs (dict) – These arguments will be passed to pandas.read_table() if data is a filename.
lfc_column = 'logFC'
mean_column = 'logCPM'
pval_column = 'FDR'
class lcdblib.plotting.results_table.LazyDict(fn_dict, index_file=None, index_from=None, extra=None, cls=<class 'lcdblib.plotting.results_table.DESeqResults'>)[source]

Bases: object

items()[source]
keys()[source]
values()[source]
class lcdblib.plotting.results_table.MarginalHistScatter(ax, hist_size=0.6, pad=0.05)[source]

Bases: object

add_legends(xhists=True, yhists=False, scatter=True, **kwargs)[source]

Add legends to axes.

append(x, y, scatter_kwargs, hist_kwargs=None, xhist_kwargs=None, yhist_kwargs=None, num_ticks=3, labels=None, hist_share=False, marginal_histograms=True)[source]

Adds a new scatter to self.scatter_ax as well as marginal histograms for the same data, borrowing additional room from the axes.

Parameters:
  • x, y – Data to be plotted
  • scatter_kwargs (dict) – Keyword arguments that are passed directly to scatter().
  • hist_kwargs (dict) – Keyword arguments that are passed directly to hist(), for both the top and side histograms.
  • xhist_kwargs, yhist_kwargs – Additional, margin-specific kwargs for the x or y histograms respectively. These are used to update hist_kwargs
  • num_ticks (int) – How many tick marks to use in each histogram’s y-axis
  • labels (array-like) – Optional NumPy array of labels that will be set on the collection so that they can be accessed by a callback function.
  • hist_share (bool) – If True, then all histograms will share the same frequency axes. Useful for showing relative heights if you don’t want to use the hist_kwarg normed=True
  • marginal_histograms (bool) – Set to False in order to disable marginal histograms and just use as a normal scatterplot.
limits
xmax
xmin
ymax
ymin
class lcdblib.plotting.results_table.ResultsTable(data, db=None, import_kwargs=None)[source]

Bases: object

Wrapper around a pandas.DataFrame that adds additional functionality.

The underlying pandas.DataFrame is always available with the data attribute.

Any attributes not explicitly in this class will be looked for in the underlying pandas.DataFrame.

Parameters:
  • data (string or pandas.DataFrame) – If string, assumes it’s a filename and calls pandas.read_table(data, **import_kwargs).
  • db (string or gffutils.FeatureDB) – Optional database that can be used to generate features
  • import_kwargs (dict) – These arguments will be passed to pandas.read_table() if data is a filename.
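
The attribute fallback described above is commonly implemented with __getattr__, which Python calls only when normal attribute lookup fails. The sketch below illustrates that general pattern with a hypothetical Table class and a plain list standing in for the pandas.DataFrame; it is not the ResultsTable source.

```python
class Table:
    """Hypothetical wrapper that forwards unknown attributes to `data`."""

    def __init__(self, data):
        self.data = data

    def __getattr__(self, name):
        # Reached only when normal lookup on the wrapper has failed.
        if name == 'data':
            # Avoid infinite recursion if `data` was never set.
            raise AttributeError(name)
        return getattr(self.data, name)

    def describe(self):
        return 'table with {} rows'.format(len(self.data))


# A plain list stands in for the pandas.DataFrame here.
table = Table([3, 1, 2])
print(table.describe())  # defined on the wrapper
print(table.count(1))    # falls through to list.count
print(table.index(2))    # falls through to list.index
```

Methods defined on the wrapper take precedence, and everything else is looked up on the wrapped object.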
align_with(other)[source]

Align the dataframe’s index with another.

attach_db(db)[source]

Attach a gffutils.FeatureDB for access to features.

Useful if you want to attach a db after this instance has already been created.

Parameters:
  • db (gffutils.FeatureDB) –
copy()[source]
features(ignore_unknown=False)[source]

Generator of features.

If a gffutils.FeatureDB is attached, returns a pybedtools.Interval for every feature in the dataframe’s index.

Parameters:
  • ignore_unknown (bool) – If True, silently ignores features that are not found in the db.
radviz(column_names, transforms={}, **kwargs)[source]

Radviz plot.

Useful for exploratory visualization, a radviz plot can show multivariate data in 2D. Conceptually, the variables (here, specified in column_names) are distributed evenly around the unit circle. Then each point (here, each row in the dataframe) is attached to each variable by a spring, where the stiffness of the spring is proportional to the value of corresponding variable. The final position of a point represents the equilibrium position with all springs pulling on it.

In practice, each variable is normalized to 0-1 (by subtracting the minimum and dividing by the range).

This is a very exploratory plot. The order of column_names will affect the results, so it’s best to try a couple different orderings. For other caveats, see [1].

Additional kwargs are passed to self.scatter, so subsetting, callbacks, and other configuration can be performed using options for that method (e.g., genes_to_highlight is particularly useful).

Parameters:
  • column_names (list) – Which columns of the dataframe to consider. The columns provided should only include numeric data, and they should not contain any NaN, inf, or -inf values.
  • transforms (dict) – Dictionary mapping column names to transformations that will be applied just for the radviz plot. For example, np.log1p is a useful function. If a column name is not in this dictionary, it will be used as-is.
  • ax (matplotlib.Axes) – If not None, then plot the radviz on this axes. If None, then a new figure will be created.
  • kwargs (dict) – Additional arguments are passed to self.scatter. Note that not all possible kwargs for self.scatter are necessarily useful for a radviz plot (for example, marginal histograms would not be meaningful).

Notes

This method adds two new variables to self.data: “radviz_x” and “radviz_y”. It then calls the self.scatter method, using these new variables.

The data transformation was adapted from the pandas.tools.plotting.radviz function.
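
The spring model can be written down directly: with variable i anchored at angle 2*pi*i/m on the unit circle, the equilibrium position of a row is the weighted average of the anchor positions, using the row's normalized values as weights. The sketch below is a pure-Python illustration with hypothetical helper names and made-up data. Min-max scaling and placing an all-zero row at the origin are assumptions.

```python
import math


def normalize(column):
    """Min-max scale a column to [0, 1] (assumes max > min)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def radviz_point(weights):
    """Equilibrium position for one row of normalized values."""
    m = len(weights)
    total = sum(weights)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls; leave the point at the center
    # Variable i is anchored at angle 2*pi*i/m on the unit circle.
    x = sum(w * math.cos(2 * math.pi * i / m) for i, w in enumerate(weights))
    y = sum(w * math.sin(2 * math.pi * i / m) for i, w in enumerate(weights))
    return (x / total, y / total)


# Made-up columns; each list holds one variable across three rows.
columns = {'a': [0, 5, 10], 'b': [2, 2, 4], 'c': [1, 3, 2], 'd': [7, 7, 9]}
scaled = [normalize(col) for col in columns.values()]
points = [radviz_point(row) for row in zip(*scaled)]
print(points)
```

Reordering the columns moves the anchors, which is why the order of column_names changes the picture.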

References

  1. Hoffman, P.E. et al. (1997) DNA visual and analytic data mining. In Proceedings of IEEE Visualization. Phoenix, AZ, pp. 437-441.
  2. http://www.agocg.ac.uk/reports/visual/casestud/brunsdon/radviz.htm
  3. http://pandas.pydata.org/pandas-docs/stable/visualization.html#radviz
reindex_to(x, attribute=None)[source]

Returns a copy that only has rows corresponding to feature names in x.

Parameters:
  • x (str or pybedtools.BedTool) – If str, then assume it’s a filename. A BED, GFF, GTF, or VCF file whose “Name” field (that is, the value returned by feature['Name']), or any other attribute selected with attribute, gives the feature names to keep.
  • attribute (str or int or None) – If x is GFF or GTF format and attribute is str, then use the attribute containing the name of the feature. If x is BED format and attribute is str, then use getattr on the interval (e.g., 'name' or 'score'). If attribute is int, then use that column. If None, then use the “name” attribute of the Interval, which falls back to one of “gene_id”, “Name”, “transcript_id” for GFF/GTF.
scatter(x, y, xfunc=None, yfunc=None, xscale=None, yscale=None, xlab=None, ylab=None, genes_to_highlight=None, marginal_histograms=False, general_kwargs={'alpha': 0.2, 'color': 'k', 'picker': True}, general_hist_kwargs=None, offset_kwargs={}, label_kwargs=None, ax=None, one_to_one=None, callback=None, hist_size=0.3, hist_pad=0.0, nan_offset=0.015, pos_offset=0.99, linelength=0.01, neg_offset=0.005)[source]

Do-it-all method for making annotated scatterplots.

Includes rugplots for NaN/Inf/-Inf, a default callback that prints the entries of the underlying dataframe when a point is clicked, point labeling options, marginal histograms for multiple subsets, arbitrary styling of points for arbitrary subsets, and more.

Parameters:
  • x, y – Variables to plot. Must be names in self.data’s DataFrame.
  • xfunc, yfunc – Functions to apply to xvar and yvar respectively. If xlab or ylab is not set separately, the function name will be used along with the column name to label the corresponding axis. This lets you play around with transformation functions (e.g., np.log, np.log1p) without having to add the corresponding column to the underlying dataframe.
  • xlab, ylab – Labels for x and y axes; default is to use function names for xfunc and yfunc and variable names xvar and yvar, e.g., “log2(baseMeanA)”
  • ax (None or Axes object) – If not None then plot on the provided Axes, otherwise create a new figure and axes.
  • general_kwargs (dict) – Kwargs for matplotlib.scatter; specifies how all points look. Note that if you override this, you should include at least picker=True so that the callback function will work.
  • genes_to_highlight (list of 2-tuples or 3-tuples) –

    Provides fine-grained control over colors. It is a list of (ind, kwargs) tuples, where each ind specifies genes to plot with kwargs. ind is anything that can be used with DataFrame.ix.

    For example:

    [
        (
            x.log2FoldChange < 0,
            dict(color='b', label='downregulated')
        ),
        (
            x.log2FoldChange > 0,
            dict(color='r', label='upregulated')
        ),
    ]
    

    Each dictionary updates a copy of general_kwargs. If a kwargs dictionary in genes_to_highlight has a “name” key, its value must be a list that’s the same length as ind. It will be used to label the genes in ind using label_kwargs.

    For example:

    [
        (
            ['ENSG001', 'ENSG002'],
            dict(color='r', name=['geneA', 'geneB'])
        ),
    ]
    

    The tuples can also be 3-tuples of (ind, scatter_kwargs, hist_kwargs). The first two items act as above, and the third can be used to control histogram kwargs if marginal_histograms is True.

    Note that, unless overridden, the color and alpha of the histograms will be inherited from the scatter kwargs, which in turn are inherited from general_kwargs. So the following will modify marginal histograms to have 100 bins:

    [
        (
            x.log2FoldChange < 0,
            dict(color='b', label='down'),
            dict(bins=100)
        )
    ]
    
  • callback (callable) – Function to call upon clicking a point. Must accept a single argument which is the gene ID. The function can do whatever it wants with it, but probably will want to access the underlying dataframe. Default is to print the corresponding row from the underlying dataframe.
  • one_to_one (None or dict) – If not None, a dictionary of matplotlib.plot kwargs that will be used to plot a 1:1 line, e.g., dict(color='r', linestyle=':').
  • label_kwargs (dict) – Kwargs for labeled genes, e.g., dict(style='italic'). Will only be used if an entry in genes_to_highlight has a name key.
  • offset_kwargs (dict) – Kwargs to be passed to matplotlib.transforms.offset_copy, used for adjusting the positioning of gene labels in relation to the actual point.
  • xlab_prefix, ylab_prefix – Optional label prefix that will be added to the beginning of xlab and/or ylab.
  • marginal_histograms (bool) – If True, for each subset in genes_to_highlight, add marginal histograms along x and y axes subject to the various histogram controls below.
  • hist_size (float) – Size of marginal histograms
  • hist_pad (float) – Spacing between marginal histograms
  • nan_offset, pos_offset, neg_offset – Offset, in units of “fraction of axes” for the NaN, +inf, and -inf “rug plots”
  • linelength (float) – Line length for the rug plots
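
The inheritance of styles described for genes_to_highlight (each entry updates a copy of general_kwargs, and histograms pick up color and alpha unless overridden) can be sketched with plain dictionaries. The helper resolve_styles is hypothetical and only mirrors the documented behaviour; it is not the method's actual code.

```python
general_kwargs = {'alpha': 0.2, 'color': 'k', 'picker': True}


def resolve_styles(entry, general_kwargs):
    """Return (scatter_kwargs, hist_kwargs) for one highlight entry."""
    scatter_update = entry[1]
    hist_update = entry[2] if len(entry) == 3 else {}
    scatter_kwargs = dict(general_kwargs)  # copy, so the default is untouched
    scatter_kwargs.update(scatter_update)
    # Histograms inherit color and alpha unless explicitly overridden.
    hist_kwargs = {key: scatter_kwargs[key] for key in ('color', 'alpha')
                   if key in scatter_kwargs}
    hist_kwargs.update(hist_update)
    return scatter_kwargs, hist_kwargs


entry = (['ENSG001', 'ENSG002'],
         dict(color='b', label='down'),
         dict(bins=100))
scatter_kwargs, hist_kwargs = resolve_styles(entry, general_kwargs)
print(scatter_kwargs)
print(hist_kwargs)
```

Copying before updating keeps picker=True from the defaults, so the click callback continues to work for highlighted subsets.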
update(dataframe)[source]

Updates the current data with a new dataframe.

This extra step is required to get around the fancy pandas.DataFrame indexing (like .ix, .iloc, etc).

Module contents