Description of the benchmarking data sets

We provide a large number of simulated and real benchmarking data sets for RNA-seq differential expression methods. All simulated data sets were generated by the generateSyntheticData() function of the compcodeR package. Each data set can be identified by its unique ID, which consists of 10 randomly selected alphanumeric characters. The structure of the data files is described in Section 6.1 of the compcodeR vignette. This page briefly describes the different benchmarking data sets. All simulation parameters that were supplied to generateSyntheticData() can be retrieved from the respective data objects.

For each simulation setting, we provide data for four or five different group sizes, ranging from 2 to 10 (or 100) biological replicates per group. For each group size, we generated 10 replicate data sets with the same simulation parameters.

Simulation procedure

This section describes the simulation procedure. For more detailed information, we refer the reader to the description given in Soneson and Delorenzi (2013).

The basic statistical distribution underlying the simulations is the Negative Binomial (NB). To obtain realistic values of the mean and dispersion parameters of the NB distribution, we estimated these values from real data sets (from Pickrell et al. and Cheung et al., accessed via the ReCount website). The estimation procedure is described in more detail in the supplementary material of Soneson and Delorenzi (2013).

For all data sets, we simulated 12,500 genes. The 'base' library size was set to 75 million reads per sample. To generate data sets with varying library sizes, we multiplied this number by a factor between 0.7 and 1.4 for each individual sample. For each gene, we sampled a (mean, dispersion) pair from the values estimated from the real data sets described above. The sampled values for all 12,500 genes determined the mean and dispersion parameters of the NB distributions from which we sampled the read counts.
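As a rough illustration of the sampling step described above, the following Python sketch draws per-sample library sizes and NB counts for a single gene. It is not the compcodeR implementation: the function names are hypothetical, the library size factor is assumed to be uniform on [0.7, 1.4] (the text gives only the range), and scaling the gene mean by the relative library size is an assumption made for the example. The NB count is drawn as a gamma-Poisson mixture, which gives variance mean + dispersion * mean^2.

```python
import math
import random

BASE_LIBRARY_SIZE = 75_000_000  # 'base' library size given in the text


def library_sizes(n_samples, rng):
    """Return one library size per sample: base size times a factor in [0.7, 1.4].

    Assumption: the factor is drawn uniformly; the text only states the range.
    """
    return [BASE_LIBRARY_SIZE * rng.uniform(0.7, 1.4) for _ in range(n_samples)]


def poisson(lam, rng):
    """Sample Poisson(lam) with Knuth's method.

    The rate is split into chunks of at most 500 so that exp(-chunk) does not
    underflow; a sum of independent Poisson draws is again Poisson.
    """
    total = 0
    while lam > 0:
        chunk = min(lam, 500.0)
        lam -= chunk
        threshold = math.exp(-chunk)
        k, prod = 0, rng.random()
        while prod > threshold:
            k += 1
            prod *= rng.random()
        total += k
    return total


def negative_binomial(mean, dispersion, rng):
    """Draw one count with mean `mean` and variance mean + dispersion * mean**2."""
    if mean <= 0:
        return 0
    if dispersion == 0:
        # Zero dispersion reduces the NB to a Poisson distribution.
        return poisson(mean, rng)
    # Gamma-Poisson mixture: gamma with shape 1/dispersion and scale mean*dispersion.
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return poisson(rate, rng)


rng = random.Random(1)
sizes = library_sizes(6, rng)
# Hypothetical gene with mean 200 and dispersion 0.1 at the base library size.
# Scaling the mean by the relative library size is an assumption of this sketch.
counts = [negative_binomial(200.0 * s / BASE_LIBRARY_SIZE, 0.1, rng) for s in sizes]
print(sizes)
print(counts)
```

The actual parameters used for each data set are stored in the data objects and should be preferred over anything inferred from this sketch.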

For some of the data sets, we also introduced outlier counts, which are extremely high or low counts that are not generated by the same NB distribution as the other read counts for the same gene. There are two types of outliers, which we call 'random' and 'single'. The positions of the random outliers are selected randomly in the count matrix, and for each of these positions we multiply (for high outliers) or divide (for low outliers) the simulated count by a randomly selected number between 5 and 10. Hence, this can introduce multiple outliers for a single gene. For the single outliers, we first select the affected genes randomly. For each affected gene, we multiply or divide the read count of a single sample by a randomly selected number between 5 and 10. Hence, in this case, each gene can have at most one outlier count.

After simulating the count matrices, we filtered out all genes with zero counts in all samples. Hence, the final number of genes may be slightly lower than 12,500.
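The difference between the two outlier types, and the final filtering step, can be sketched in Python as follows. This is an illustrative sketch only, not the compcodeR code: the function names are hypothetical, the outlier fractions are interpreted as fractions of matrix entries (random) or of genes (single), the factor between 5 and 10 is assumed to be uniform, and modified counts are rounded to integers.

```python
import random


def _scale(value, high, rng):
    """Multiply (high outlier) or divide (low outlier) by a factor in [5, 10]."""
    factor = rng.uniform(5, 10)  # assumption: uniform on the stated range
    return round(value * factor) if high else round(value / factor)


def add_random_outliers(counts, frac_high, frac_low, rng):
    """Random outliers: positions are chosen anywhere in the count matrix,
    so a single gene (row) may receive several outliers."""
    out = [row[:] for row in counts]
    positions = [(g, s) for g in range(len(out)) for s in range(len(out[g]))]
    rng.shuffle(positions)
    n_high = round(frac_high * len(positions))
    n_low = round(frac_low * len(positions))
    for g, s in positions[:n_high]:
        out[g][s] = _scale(out[g][s], True, rng)
    for g, s in positions[n_high:n_high + n_low]:
        out[g][s] = _scale(out[g][s], False, rng)
    return out


def add_single_outliers(counts, frac_high, frac_low, rng):
    """Single outliers: genes are chosen first, then one sample per gene,
    so each gene receives at most one outlier."""
    out = [row[:] for row in counts]
    genes = list(range(len(out)))
    rng.shuffle(genes)
    n_high = round(frac_high * len(genes))
    n_low = round(frac_low * len(genes))
    for g in genes[:n_high]:
        s = rng.randrange(len(out[g]))
        out[g][s] = _scale(out[g][s], True, rng)
    for g in genes[n_high:n_high + n_low]:
        s = rng.randrange(len(out[g]))
        out[g][s] = _scale(out[g][s], False, rng)
    return out


def drop_all_zero_genes(counts):
    """Remove genes (rows) whose count is zero in every sample."""
    return [row for row in counts if any(row)]


rng = random.Random(0)
matrix = [[100] * 6 for _ in range(200)]  # toy matrix: 200 genes, 6 samples
with_single = add_single_outliers(matrix, 0.025, 0.025, rng)
with_random = add_random_outliers(matrix, 0.01, 0.01, rng)
```

In the toy matrix, `with_single` changes one entry in each of 10 genes, whereas `with_random` changes 24 entries that may fall in the same gene.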

The table below lists, for each simulation setting, the fractions of genes that are up- and downregulated in condition 2 compared to condition 1, the fraction of genes for which the dispersion was set to 0 (giving a Poisson distribution), the fractions of single and random outliers, and any particular characteristic of the data sets from that simulation setting. Recall that all simulation parameters are stored in the data objects; see Section 5.1 of the compcodeR vignette for how to extract them.

| Simulation setting | Upreg. | Downreg. | Poisson | Single outlier | Random outlier | Comment | Link |
|---|---|---|---|---|---|---|---|
| NB_0_0 | 0 | 0 | 0 | 0 | 0 | | download |
| NB_625_625 | 5% | 5% | 0 | 0 | 0 | | download |
| NB_1250_0 | 10% | 0 | 0 | 0 | 0 | | download |
| NB_625_625_diffdisp | 5% | 5% | 0 | 0 | 0 | Different dispersions in the two conditions | download |
| NB_2000_2000 | 16% | 16% | 0 | 0 | 0 | | download |
| NB_4000_0 | 32% | 0 | 0 | 0 | 0 | | download |
| P_625_625 | 5% | 5% | 50% | 0 | 0 | | download |
| R_625_625 | 5% | 5% | 0 | 0 | 1% high, 1% low | | download |
| S_625_625 | 5% | 5% | 0 | 2.5% high, 2.5% low | 0 | | download |
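The percentages in the table are consistent with reading the two numbers in each setting name as the numbers of up- and downregulated genes out of the 12,500 simulated genes (for example, 625 / 12,500 = 5%). This reading is an inference from the table, not something stated by the authors; a small Python sketch with a hypothetical helper makes the arithmetic explicit:

```python
TOTAL_GENES = 12_500  # number of simulated genes given in the text


def de_fractions(setting):
    """Return (fraction up, fraction down) inferred from a setting name.

    Assumption: the second and third underscore-separated fields are the
    numbers of up- and downregulated genes, e.g. 'NB_625_625'.
    """
    parts = setting.split("_")
    n_up, n_down = int(parts[1]), int(parts[2])
    return n_up / TOTAL_GENES, n_down / TOTAL_GENES


for name in ["NB_625_625", "NB_1250_0", "NB_2000_2000", "NB_4000_0"]:
    up, down = de_fractions(name)
    print(f"{name}: {up:.0%} up, {down:.0%} down")
```

The authoritative values remain the simulation parameters stored in each data object.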

We also provide real data sets that were downloaded from the ReCount website and formatted to fit in the compcodeR workflow. The table below lists the available real data sets.

| Name | Nbr. replicates per condition | Comment | Download |
|---|---|---|---|
| Bottomly (PubMed) | 10/11 | | download |
| BottomlySingleStrain (PubMed) | 5 | One strain from the Bottomly data set, arbitrarily divided into two groups. Repeated 10 times to give 10 'replicated' data sets. | download |
| Gilad (PubMed) | 3 | | download |
| Hammer (PubMed) | 4 | | download |