Sample tables

A sample table maps sample names to files on disk and provides additional metadata. It is expected to be tab-delimited and to have a header. Empty lines and lines that start with a comment character (#) are skipped.
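The parsing rules above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the workflow's own loader; the function name read_sampletable is hypothetical.

```python
import csv
import io


def read_sampletable(text):
    """Parse a tab-delimited sample table into a list of dicts.

    Empty lines and lines starting with '#' are skipped; the first
    remaining line is treated as the header.
    """
    lines = [
        line
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    return list(csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t"))


example = "# Minimal RNA-seq sample table\nsample\nc1\nc2\n\nt1\nt2\n"
rows = read_sampletable(example)
print([row["sample"] for row in rows])
```

Each row comes back as a dict keyed by the header names, so extra metadata columns are carried along without any further configuration.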

For running new experiments, you will need to write your own sample table. For running experiments uploaded to SRA (Sequence Read Archive), you can use the SRA sample table as-is, with the addition of a new column indicating what you would like to name each sample. This makes it almost trivial to run arbitrary SRA RNA-seq data sets! For ChIP-seq data from SRA, the additional columns antibody, label, and biological_material as described below will need to be added, but often that information is already in the SRA sampletable so the columns just need to be renamed.

RNA-seq sample table

Here is an example minimal sample table for RNA-seq. It only contains sample IDs for four samples:

# Minimal RNA-seq sample table
sample
c1
c2
t1
t2

In this minimal example, the original FASTQ files are expected at paths matching data/rnaseq_samples/{sample}/{sample}_R1.fastq.gz. That pattern is configured in config/rnaseq_patterns.yaml if you would like to change it (see Patterns and targets). Specifically, the workflow expects the following files to already exist (paths relative to the Snakefile):

# The above sample table expects these files to exist:
data/rnaseq_samples/c1/c1_R1.fastq.gz
data/rnaseq_samples/c2/c2_R1.fastq.gz
data/rnaseq_samples/t1/t1_R1.fastq.gz
data/rnaseq_samples/t2/t2_R1.fastq.gz
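To see how the pattern expands into the paths listed above, here is a small sketch using plain Python string formatting. The workflow itself fills in patterns through Snakemake; this only mimics the substitution of {sample}.

```python
# Default pattern quoted in the text; edit it if you changed
# config/rnaseq_patterns.yaml.
pattern = "data/rnaseq_samples/{sample}/{sample}_R1.fastq.gz"

samples = ["c1", "c2", "t1", "t2"]

# Fill in the {sample} placeholder once per sample name.
expected = [pattern.format(sample=s) for s in samples]

for path in expected:
    print(path)
```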

Symlinking FASTQs

To avoid having to copy or symlink files over into the expected directory structure, we can instead list the original filenames in a column called orig_filename and they will be automatically symlinked into data/rnaseq_samples/{sample}/{sample}_R1.fastq.gz. That is, the following sampletable:

# Example RNA-seq sample table with original filenames specified
sample   orig_filename
c1       /data/c1.fastq.gz
c2       /data/c2.fastq.gz
t1       /data/other/t1.fq.gz
t2       ../raw-data/t2.fq.gz

will result in the following symlinks:

data/rnaseq_samples/c1/c1_R1.fastq.gz -> /data/c1.fastq.gz
data/rnaseq_samples/c2/c2_R1.fastq.gz -> /data/c2.fastq.gz
data/rnaseq_samples/t1/t1_R1.fastq.gz -> /data/other/t1.fq.gz
data/rnaseq_samples/t2/t2_R1.fastq.gz -> ../raw-data/t2.fq.gz

Note that orig_filename paths in the sampletable are considered relative to the Snakefile.
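The mapping from orig_filename to symlink location can be sketched as follows. The helper plan_symlinks is hypothetical and only computes the (link, source) pairs; it does not create anything on disk, and it is not the workflow's actual symlinking rule.

```python
# Default pattern quoted in the text.
PATTERN = "data/rnaseq_samples/{sample}/{sample}_R1.fastq.gz"


def plan_symlinks(rows, pattern=PATTERN):
    """Return (link_path, source_path) pairs for rows that set orig_filename."""
    plan = []
    for row in rows:
        source = row.get("orig_filename")
        if source:  # rows without orig_filename are left alone
            plan.append((pattern.format(sample=row["sample"]), source))
    return plan


rows = [
    {"sample": "c1", "orig_filename": "/data/c1.fastq.gz"},
    {"sample": "t2", "orig_filename": "../raw-data/t2.fq.gz"},
]
for link, source in plan_symlinks(rows):
    print(f"{link} -> {source}")
```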

Additional metadata

For RNA-seq, only the first column and optionally the orig_filename column are used directly by the RNA-seq workflow.

However, the sampletable is imported into the downstream/rnaseq.Rmd file (see RNA-Seq downstream analysis for more info). It’s often useful to include any metadata in the sampletable so it’s all in one place, and you’ll get all that information imported into R.

For example, with this sample table we could easily use a DESeq model of ~condition, since the condition column will be imported into R.

# Example RNA-seq sampletable with "condition" metadata included
sample   orig_filename          condition
c1       /data/c1.fastq.gz      control
c2       /data/c2.fastq.gz      control
t1       /data/other/t1.fq.gz   treatment
t2       /data/other/t2.fq.gz   treatment

Paired-end data

Paired-end and single-end data may be mixed in the same sampletable. A sample is specified as paired-end using a separate column in the sampletable. That column can either be named layout (easiest if you’re writing your own sample table) or LibraryLayout (if you’re using an SRA sampletable, in which case you can leave it as-is). An error will be raised if both columns are provided.

If one of these columns exists, the values of the column are converted to lowercase. For each sample, if the value is either pe or paired, the sample will be considered paired-end. In all other cases the sample will be considered single-end.
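The layout rules described above can be written as a short function. This is a sketch of the stated behavior, not the workflow's own code; the name is_paired_end is hypothetical.

```python
def is_paired_end(row):
    """Decide whether one sample-table row describes a paired-end sample.

    `row` is a dict mapping column names to values for a single sample.
    """
    if "layout" in row and "LibraryLayout" in row:
        raise ValueError("Provide either 'layout' or 'LibraryLayout', not both")
    for column in ("layout", "LibraryLayout"):
        if column in row:
            # Values are compared case-insensitively.
            return (row[column] or "").lower() in ("pe", "paired")
    # Neither column present: treat the sample as single-end.
    return False


print(is_paired_end({"sample": "c1", "layout": "PE"}))
print(is_paired_end({"sample": "c2", "LibraryLayout": "SINGLE"}))
```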

For paired-end samples that will be symlinked, both orig_filename and orig_filename_R2 must be specified as paths relative to the Snakefile (see Symlinking FASTQs above). If there is a mix of SE and PE samples, each SE sample must have an empty entry for orig_filename_R2 (in the context of the tab-delimited sampletable, this means two tab characters next to each other with nothing in between).
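To make the "two adjacent tabs" requirement concrete, this sketch splits one paired-end and one single-end line on tabs. The example lines are made up for illustration; the point is that the empty field keeps the layout column aligned with its header.

```python
header = "sample\torig_filename\torig_filename_R2\tlayout"
pe_line = "c1\t/data/c1_R1.fastq.gz\t/data/c1_R2.fastq.gz\tPE"
se_line = "c2\t/data/c2.fastq.gz\t\tSE"  # note the two adjacent tabs

columns = header.split("\t")
pe = dict(zip(columns, pe_line.split("\t")))
se = dict(zip(columns, se_line.split("\t")))

# The single-end sample has an empty string for orig_filename_R2,
# and its layout value still lands in the right column.
print(repr(se["orig_filename_R2"]), se["layout"])
```

If the second tab were omitted, "SE" would be read as the value of orig_filename_R2 and the layout column would be missing for that row.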

Note

If the sample table contains both single- and paired-end samples, the fastq_dump and cutadapt rules will create empty R2 files.

Once the BAM files are created (after alignment in a single- or paired-end fashion as appropriate for the sample), we operate mostly on the BAM.

After the alignment stage, the remaining rules do not differentiate between single- and paired-end reads. Because of this, featureCounts and bamCoverage may need their parameters adjusted to match the library layout.

# Example RNA-seq sample table with original filenames specified,
# where c1 is a paired-end sample
sample   orig_filename         orig_filename_R2      layout
c1       /data/c1_R1.fastq.gz  /data/c1_R2.fastq.gz  PE
c2       /data/c2.fastq.gz                           SE
t1       /data/other/t1.fq.gz                        SE
t2       /data/other/t2.fq.gz                        SE

ChIP-seq sample table

Three additional columns are required for ChIP-seq: antibody, biological_material and label.

antibody

Used for differentiating between input and IP samples. Input samples should be listed with an antibody of exactly input.

biological_material

Ties together which samples came from the same chromatin. This is how we know a particular input sample is the matched control for a particular IP sample. This is primarily used in the fingerprint rule, where we collect all the input BAMs together for performing QC. See the lib.chipseq.merged_input_for_ip function for the technical details of how this is handled.

label

Used to tie together technical replicates, and used to configure the ChIP-seq peak-calling runs (see chipseq config section).

Technical replicates share the same label. If you don’t have technical replicates, then this column can be a copy of the first column containing sample names. Technical replicates will have their BAMs merged together and duplicates removed from the merged BAM.

The ChIP-seq sample table is more complicated than the RNA-seq one because RNA-seq is often analyzed in R, where complicated sample handling (like summing technical replicates) can be performed very flexibly. In contrast, ChIP-seq peak-callers are command-line tools that frequently only take a single biological replicate, and so are run as Snakemake rules. As a result, more complex configuration is required to ensure complex experimental designs are handled correctly.

Minimal ChIP-seq sample table, no replicates

A minimal ChIP-seq sample table, with no biological replicates, looks like this:

# Example minimal ChIP-seq sample table
sampleid    antibody   biological_material  label          orig_filename
ip1         gaf        s2cell-1             s2cell-gaf-1   /data/ip1.fastq.gz
input1      input      s2cell-1             s2cell-input-1 /data/input.fastq.gz
  • The input sample must have an antibody of exactly input.

  • For an IP, its matched input is the sample with antibody == input that also has the same biological material as the IP. Here, we know input1 goes with ip1 because they both have the same biological material.

ChIP-seq sample table, biological replicates

Here is another example, this time with biological replicates:

# Example ChIP-seq sampletable with biological replicates
sampleid    antibody   biological_material  label          orig_filename
ip1         gaf        s2cell-1             s2cell-gaf-1   /data/ip1.fastq.gz
ip2         gaf        s2cell-2             s2cell-gaf-2   /data/run2/ip3.fastq.gz
input1      input      s2cell-1             s2cell-input-1 /data/input.fastq.gz
input2      input      s2cell-2             s2cell-input-2 /data/run2/input2.fastq.gz
  • As before, ip1 and input1 share the same biological material, indicating that input1 is the matched input for ip1.

  • The matched input for ip2 is input2 because they share the same biological material (s2cell-2) and input2 has antibody == input.

  • Each sample has a unique label because there are no technical replicates here.
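The matching rule in the bullets above can be sketched as a simple lookup. This is not the lib.chipseq.merged_input_for_ip function mentioned earlier; matched_inputs is a hypothetical helper that only illustrates the rule "antibody is input and biological_material is the same".

```python
def matched_inputs(rows, ip_id, id_column="sampleid"):
    """Return the IDs of input samples that share the IP's biological material."""
    ip = next(row for row in rows if row[id_column] == ip_id)
    return [
        row[id_column]
        for row in rows
        if row["antibody"] == "input"
        and row["biological_material"] == ip["biological_material"]
    ]


# Rows mirroring the biological-replicates table above (label and
# orig_filename columns omitted because the rule does not use them).
rows = [
    {"sampleid": "ip1", "antibody": "gaf", "biological_material": "s2cell-1"},
    {"sampleid": "ip2", "antibody": "gaf", "biological_material": "s2cell-2"},
    {"sampleid": "input1", "antibody": "input", "biological_material": "s2cell-1"},
    {"sampleid": "input2", "antibody": "input", "biological_material": "s2cell-2"},
]

print(matched_inputs(rows, "ip1"))
print(matched_inputs(rows, "ip2"))
```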

ChIP-seq sample table, biological and technical replicates

Another example, this time with biological and technical replicates:

# Example ChIP-seq sampletable with bio and tech reps
sampleid    antibody   biological_material  label          orig_filename
ip1         gaf        s2cell-1             s2cell-gaf-1   /data/ip1.fastq.gz
ip1a        gaf        s2cell-1             s2cell-gaf-1   /data/ip2.fastq.gz
ip2         gaf        s2cell-2             s2cell-gaf-2   /data/run2/ip3.fastq.gz
input1      input      s2cell-1             s2cell-input-1 /data/input.fastq.gz
input2      input      s2cell-2             s2cell-input-2 /data/run2/input2.fastq.gz
  • ip1 and ip1a are technical replicates because they share the label s2cell-gaf-1. This is often the case when we need to sequence the same sample again for higher depth.

  • ip1 and ip1a will be merged into one BAM file named after their common label, s2cell-gaf-1 (described further below). The remaining ip2, input1, and input2 do not have to be merged with anything, so they will be symlinked.
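Grouping samples by their label shows which ones are technical replicates. The sketch below is an illustration of that grouping only; group_by_label is a hypothetical name and the rows mirror the table above with unused columns omitted.

```python
from collections import defaultdict


def group_by_label(rows, id_column="sampleid"):
    """Map each label to the list of sample IDs that carry it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["label"]].append(row[id_column])
    return dict(groups)


rows = [
    {"sampleid": "ip1", "label": "s2cell-gaf-1"},
    {"sampleid": "ip1a", "label": "s2cell-gaf-1"},
    {"sampleid": "ip2", "label": "s2cell-gaf-2"},
    {"sampleid": "input1", "label": "s2cell-input-1"},
    {"sampleid": "input2", "label": "s2cell-input-2"},
]

groups = group_by_label(rows)
for label, samples in groups.items():
    kind = "technical replicates" if len(samples) > 1 else "single sample"
    print(label, samples, kind)
```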

Merging technical replicates for ChIP-seq

In contrast to technical replicates in RNA-seq, where counts can be summed in R, ChIP-seq is a bit more complicated. The ChIP-seq workflow uses samtools merge to merge together the unique, duplicates-removed BAM files from technical replicates into a single BAM, and then removes the duplicates again from that merged file.

There is a “merged_techreps” key in config/chipseq_patterns.yaml which defines the filenames to which technical replicates will be merged. By default this pattern is data/chipseq_merged/{label}/{label}.cutadapt.unique.nodups.merged.bam. After trimming, aligning, removing multimappers, and removing duplicates, tech reps are merged together. Specifically, these files:

data/chipseq_samples/ip1/ip1.cutadapt.unique.nodups.bam
data/chipseq_samples/ip1a/ip1a.cutadapt.unique.nodups.bam

are merged, and duplicates are then removed again from the merged file, resulting in this file:

data/chipseq_merged/s2cell-gaf-1/s2cell-gaf-1.cutadapt.unique.nodups.merged.bam

For samples with no technical replicates, only symlinks are performed, so for example this file:

data/chipseq_samples/ip2/ip2.cutadapt.unique.nodups.bam

will get symlinked to this file:

data/chipseq_merged/s2cell-gaf-2/s2cell-gaf-2.cutadapt.unique.nodups.merged.bam
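The merge-or-symlink decision can be summarized in a short sketch. MERGED_BAM is the default merged_techreps pattern quoted above; SAMPLE_BAM is inferred from the example per-sample paths and is an assumption, as is the helper name merge_plan. Nothing here runs samtools or touches the filesystem.

```python
# Assumed per-sample pattern, inferred from the example paths in the text.
SAMPLE_BAM = "data/chipseq_samples/{sample}/{sample}.cutadapt.unique.nodups.bam"
# Default merged_techreps pattern quoted in the text.
MERGED_BAM = "data/chipseq_merged/{label}/{label}.cutadapt.unique.nodups.merged.bam"


def merge_plan(groups):
    """Map each merged BAM path to (action, input BAM paths).

    `groups` maps a label to the sample IDs sharing that label.
    """
    plan = {}
    for label, samples in groups.items():
        # More than one sample per label means technical replicates.
        action = "merge" if len(samples) > 1 else "symlink"
        inputs = [SAMPLE_BAM.format(sample=s) for s in samples]
        plan[MERGED_BAM.format(label=label)] = (action, inputs)
    return plan


groups = {"s2cell-gaf-1": ["ip1", "ip1a"], "s2cell-gaf-2": ["ip2"]}
for output, (action, inputs) in merge_plan(groups).items():
    print(action, inputs, "->", output)
```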

For peak-calling (see chipseq config section) and any other downstream analysis, the files to use are these merged (or symlinked) BAM files.