Author: jomaldon
Date: 2019-08-16
The analyses reported in this document are part of the At.all project. The aim is to find features that are differentially expressed between control and treatment. The statistical analysis process includes data normalization, graphical exploration of raw and normalized data, test for differential expression for each feature between the conditions, raw p-value adjustment and export of lists of features having a significant differential expression between the conditions.
The analysis is performed using the R software [R Core Team, 2014], Bioconductor [Gentleman, 2004] packages including DESeq2 [Anders, 2010 and Love, 2014] and the SARTools package developed at PF2 - Institut Pasteur. Normalization and differential analysis are carried out according to the DESeq2 model and package. This report comes with additional tab-delimited text files that contain lists of differentially expressed features.
For more details about the DESeq2 methodology, please refer to its related publications [Anders, 2010 and Love, 2014].
The count data files and associated biological conditions are listed in the following table.
label | file | group | cond |
---|---|---|---|
SRR1811524 | At.all.separated.quants.txt | 1 | control |
SRR1811525 | At.all.separated.quants.txt | 1 | control |
SRR1811526 | At.all.separated.quants.txt | 1 | control |
SRR1811527 | At.all.separated.quants.txt | 1 | treatment |
SRR1811528 | At.all.separated.quants.txt | 1 | treatment |
SRR1811529 | At.all.separated.quants.txt | 1 | treatment |
After loading the data we first have a look at the raw data table itself. The data table contains one row per annotated feature and one column per sequenced sample. Row names of this table are feature IDs (unique identifiers). The table contains raw count values representing the number of reads that map onto the features. For this project, there are 35386 features in the count data table.
SRR1811524 | SRR1811525 | SRR1811526 | SRR1811527 | SRR1811528 | SRR1811529 | |
---|---|---|---|---|---|---|
AT1G01010.1 | 253 | 279 | 312 | 388 | 350 | 287 |
AT1G01020.1 | 121 | 99 | 78 | 108 | 114 | 78 |
AT1G01020.2 | 0 | 28 | 29 | 13 | 12 | 12 |
AT1G01030.1 | 10 | 7 | 10 | 12 | 7 | 11 |
AT1G01040.1 | 267 | 604 | 452 | 340 | 499 | 386 |
AT1G01040.2 | 233 | 3 | 1 | 155 | 0 | 0 |
Looking at the summary of the count table provides a basic description of these raw counts (min and max values, median, etc).
SRR1811524 | SRR1811525 | SRR1811526 | SRR1811527 | SRR1811528 | SRR1811529 | |
---|---|---|---|---|---|---|
Min. | 0 | 0 | 0 | 0 | 0 | 0 |
1st Qu. | 1 | 1 | 1 | 1 | 1 | 1 |
Median | 44 | 48 | 47 | 52 | 46 | 41 |
Mean | 196 | 214 | 228 | 235 | 213 | 200 |
3rd Qu. | 177 | 196 | 195 | 210 | 186 | 170 |
Max. | 55997 | 69306 | 92362 | 68056 | 72974 | 98208 |
Figure 1 shows the total number of mapped reads for each sample. Reads that map on multiple locations on the transcriptome are counted more than once, as far as they are mapped on less than 50 different loci. We expect total read counts to be similar within conditions, they may be different across conditions. Total counts sometimes vary widely between replicates. This may happen for several reasons, including:
Figure 2 shows the proportion of features with no read count in each sample. We expect this proportion to be similar within conditions. Features with null read counts in the 6 samples are left in the data but are not taken into account for the analysis with DESeq2. Here, 4724 features (13.35%) are in this situation (dashed line). Results for those features (fold-change and p-values) are set to NA in the results files.
Figure 3 shows the distribution of read counts for each sample. For sake of readability, \(\text{log}_2(\text{counts}+1)\) are used instead of raw counts. Again we expect replicates to have similar distributions. In addition, this figure shows if read counts are preferably low, medium or high. This depends on the organisms as well as the biological conditions under consideration.
It may happen that one or a few features capture a high proportion of reads (up to 20% or more). This phenomenon should not influence the normalization process. The DESeq2 normalization has proved to be robust to this situation [Dillies, 2012]. Anyway, we expect these high count features to be the same across replicates. They are not necessarily the same across conditions. Figure 4 illustrate the possible presence of such high count features in the data set.
We may wish to assess the similarity between samples across conditions. A pairwise scatter plot is produced (figure 5) to show how replicates and samples from different biological conditions are similar or different (\(\text{log}_2(\text{counts}+1)\) are used instead of raw count values). Moreover, as the Pearson correlation has been shown not to be relevant to measure the similarity between replicates, the SERE statistic has been proposed as a similarity index between RNA-Seq samples [Schulze, 2012]. It measures whether the variability between samples is random Poisson variability or higher. Pairwise SERE values are printed in the lower triangle of the pairwise scatter plot. The value of the SERE statistic is:
The main variability within the experiment is expected to come from biological differences between the samples. This can be checked in two ways. The first one is to perform a hierarchical clustering of the whole sample set. This is performed after a transformation of the count data which can be either a Variance Stabilizing Transformation (VST) or a regularized log transformation (rlog) [Anders, 2010 and Love, 2014].
A VST is a transformation of the data that makes them homoscedastic, meaning that the variance is then independent of the mean. It is performed in two steps: (i) a mean-variance relationship is estimated from the data with the same function that is used to normalize count data and (ii) from this relationship, a transformation of the data is performed in order to get a dataset in which the variance is independent of the mean. The homoscedasticity is a prerequisite for the use of some data analysis methods, such as hierarchical clustering or Principal Component Analysis (PCA). The regularized log transformation is based on a GLM (Generalized Linear Model) on the counts and has the same goal as a VST but is more robust in the case when the size factors vary widely.
Figure 6 shows the dendrogram obtained from VST-transformed data. An euclidean distance is computed between samples, and the dendrogram is built upon the Ward criterion. We expect this dendrogram to group replicates and separate biological conditions.
Another way of visualizing the experiment variability is to look at the first principal components of the PCA, as shown on the figure 7. On this figure, the first principal component (PC1) is expected to separate samples from the different biological conditions, meaning that the biological variability is the main source of variance in the data.
Normalization aims at correcting systematic technical biases in the data, in order to make read counts comparable across samples. The normalization proposed by DESeq2 relies on the hypothesis that most features are not differentially expressed. It computes a scaling factor for each sample. Normalized read counts are obtained by dividing raw read counts by the scaling factor associated with the sample they belong to. Scaling factors around 1 mean (almost) no normalization is performed. Scaling factors lower than 1 will produce normalized counts higher than raw ones, and the other way around. Two options are available to compute scaling factors: locfunc=“median” (default) or locfunc=“shorth”. Here, the normalization was performed with locfunc=“median”.
SRR1811524 | SRR1811525 | SRR1811526 | SRR1811527 | SRR1811528 | SRR1811529 | |
---|---|---|---|---|---|---|
Size factor | 0.97 | 1.05 | 1.03 | 1.12 | 0.98 | 0.89 |
The histograms (figure 8) can help to validate the choice of the normalization parameter (“median” or “shorth”). Under the hypothesis that most features are not differentially expressed, each size factor represented by a red line is expected to be close to the mode of the distribution of the counts divided by their geometric means across samples.
The figure 9 shows that the scaling factors of DESeq2 and the total count normalization factors may not perform similarly.
Boxplots are often used as a qualitative measure of the quality of the normalization process, as they show how distributions are globally affected during this process. We expect normalization to stabilize distributions across samples. Figure 10 shows boxplots of raw (left) and normalized (right) data respectively.
DESeq2 aims at fitting one linear model per feature. For this project, the design used is counts ~ cond and the goal is to estimate the models' coefficients which can be interpreted as \(\log_2(\texttt{FC})\). These coefficients will then be tested to get p-values and adjusted p-values.
Model outliers are features for which at least one sample seems unrelated to the experimental or study design. For every feature and for every sample, the Cook's distance [Cook, 1977] reflects how the sample matches the model. A large value of the Cook's distance indicates an outlier count and p-values are not computed for the corresponding feature.
The DESeq2 model assumes that the count data follow a negative binomial distribution which is a robust alternative to the Poisson law when data are over-dispersed (the variance is higher than the mean). The first step of the statistical procedure is to estimate the dispersion of the data. Its purpose is to determine the shape of the mean-variance relationship. The default is to apply a GLM (Generalized Linear Model) based method (fitType=“parametric”), which can handle complex designs but may not converge in some cases. The alternative is to use fitType=“local” as described in the original paper [Anders, 2010]. The parameter used for this project is fitType=“”. Then, DESeq2 imposes a Cox Reid-adjusted profile likelihood maximization [Cox, 1987 and McCarthy, 2012] and uses the maximum a posteriori (MAP) of the dispersion [Wu, 2013].
The left panel on figure 11 shows the result of the dispersion estimation step. The x- and y-axes represent the mean count value and the estimated dispersion respectively. Black dots represent empirical dispersion estimates for each feature (from the observed counts). The red dots show the mean-variance relationship function (fitted dispersion value) as estimated by the model. The blue dots are the final estimates from the maximum a posteriori and are used to perform the statistical test. Blue circles (if any) point out dispersion outliers. These are features with a very high empirical variance (computed from observed counts). These high dispersion values fall far from the model estimation. For these features, the statistical test is based on the empirical variance in order to be more conservative than with the MAP dispersion. These features will have low chance to be declared significant. The figure on the right panel allows to check the hypothesis of log-normality of the dispersions.
Once the dispersion estimation and the model fitting have been done, DESeq2 can perform the statistical testing. Figure 12 shows the distributions of raw p-values computed by the statistical test for the comparison(s) done. This distribution is expected to be a mixture of a uniform distribution on \([0,1]\) and a peak around 0 corresponding to the differentially expressed features.
DESeq2 can perform an independent filtering to increase the detection power of differentially expressed features at the same experiment-wide type I error. Since features with very low counts are not likely to see significant differences typically due to high dispersion, it defines a threshold on the mean of the normalized counts irrespective of the biological condition. This procedure is independent because the information about the variables in the design formula is not used [Love, 2014].
Table 6 reports the thresholds used for each comparison and the number of features discarded by the independent filtering. Adjusted p-values of discarded features are then set to NA.
Test vs Ref | Threshold | # discarded |
---|---|---|
treatment vs control | 2.74 | 8852 |
A p-value adjustment is performed to take into account multiple testing and control the false positive rate to a chosen level \(\alpha\). For this analysis, a BH p-value adjustment was performed [Benjamini, 1995 and 2001] and the level of controlled false positive rate was set to 0.05.
Test vs Ref | # down | # up | # total |
---|---|---|---|
treatment vs control | 2123 | 2362 | 4485 |
Figure 13 represents the MA-plot of the data for the comparisons done, where differentially expressed features are highlighted in red. A MA-plot represents the log ratio of differential expression as a function of the mean intensity for each feature. Triangles correspond to features having a too low/high \(\log_2(\text{FC})\) to be displayed on the plot.
Figure 14 shows the volcano plots for the comparisons performed and differentially expressed features are still highlighted in red. A volcano plot represents the log of the adjusted P value as a function of the log ratio of differential expression.
Full results as well as lists of differentially expressed features are provided in the following text files which can be easily read in a spreadsheet. For each comparison:
These files contain the following columns:
The versions of the R software and Bioconductor packages used for this analysis are listed below. It is important to save them if one wants to re-perform the analysis in the same conditions.
Parameter values used for this analysis are: