Exercise solution for Chapter 7
Exercise 7.4 from Modern Statistics for Modern Biology
Let’s revisit the Hiiragi data and compare the weighted and unweighted approaches. 7.4a Make a correlation circle for the unweighted Hiiragi data
xwt
. Which genes have the best projections on the first principal plane (best approximation)? 7.4b Make a biplot showing the labels of the extreme gene-variables that explain most of the variance in the first plane. Add the the sample-points.
Read in and clean data
We start by loading the libraries:
library("Hiiragi2013")
library("ade4")
library("factoextra")
library("pander")
library("knitr")
library("tidyverse")
library("ggrepel")
You can install the libraries using install.packages
for the ade4
and factoextra
packages and such, and the following line for Hiiragi2013
:
BiocManager::install("Hiiragi2013")
We will analyze data from the Hiiragi2013
package containing a gene expression microarray dataset describing the transcriptomes of about 100 cells from mouse embryos at different times during early development.
I copied the code from the textbook to clean the data. A tidyverse version of this code is available on Brooke Andersons’s github. In either case, we select the wildtype (WT) samples. Then we select the 100 features with the largest variance.
data("x", package = "Hiiragi2013")
xwt <- x[, x$genotype == "WT"]
sel <- order(rowVars(Biobase::exprs(xwt)), decreasing = TRUE)[1:100]
xwt <- xwt[sel, ]
The resulting data is of the ExpressionSet
class, a class designed to combine many types of data such as microarray data, metadata, and protocol information. The data is shown below. I transposed the data so the rows correspond to samples and the first 101 columns correspond to different genes. I only showed the first the columns and then skipped to the end for brevity. The last columns give additional data about each sample, such as the total number of cells and the scan date.
kable(head(as.data.frame(xwt)[,c(1:3,102:108)]))
X1434584_a_at | X1437325_x_at | X1420085_at | Embryonic.day | Total.number.of.cells | lineage | genotype | ScanDate | sampleGroup | sampleColour | |
---|---|---|---|---|---|---|---|---|---|---|
1 E3.25 | 13.24888 | 2.332223 | 3.027715 | E3.25 | 32 | WT | 2011-03-16 | E3.25 | #CAB2D6 | |
2 E3.25 | 11.98757 | 2.475742 | 9.293017 | E3.25 | 32 | WT | 2011-03-16 | E3.25 | #CAB2D6 | |
3 E3.25 | 12.72695 | 1.955642 | 2.940142 | E3.25 | 32 | WT | 2011-03-16 | E3.25 | #CAB2D6 | |
4 E3.25 | 12.51926 | 2.061255 | 9.715243 | E3.25 | 32 | WT | 2011-03-16 | E3.25 | #CAB2D6 | |
5 E3.25 | 12.01299 | 2.308667 | 8.924228 | E3.25 | 32 | WT | 2011-03-16 | E3.25 | #CAB2D6 | |
6 E3.25 | 10.50750 | 2.202948 | 11.325952 | E3.25 | 32 | WT | 2011-03-16 | E3.25 | #CAB2D6 |
a) Make a correlation circle
Next we plot a correlation circle. This allows us to plot the original genes projected onto the first two principal axes. We can interpret the angles between the vectors as a measure of the correlation between the two corresponding genes. The length of the arrows indicates the correlation of a gene with the first principal axes. This allows us to visualize the correlation between genes, as well as see which genes are best described by our first two principal components. The first principal component is the linear combination of the genes with maximum variance. The second principal component is the linear combination of the genes with maximum variance while being orthogonal to the first principal component.
First I used dudi.pca
to perform principal component analysis and get a pca and dudi class object. In the text book, they performed a weighted PCA because the groups had very different sample sizes ranging from 4 to 36. A weighted analysis would give the groups equal weight, while an unweighted would give each observation equal weight (and thus give less weight to groups with less members). We are interested in the difference in the genes at the various developmental phases, so in general it would make sense to do a weighted analysis in order to let each developmental phase group have an equal say. To compare the weighted with an unweighted PCA I didn’t use the row.w
option as the textbook did in order to fit an unweighted PCA.
In the following code t
takes the transpose so the rows correspond to samples. The exprs
function is used to access the expression and error measurements of the data. I chose to center and scale the data, meaning make each column have a mean of 0 and a variance of 1. The scannf
option prevents the function from plotting a scree plot and nf
tells it to keep two axes (i.e. 2 principal components).
pcaMouse <- dudi.pca(as.data.frame(t(Biobase::exprs(xwt))),
center = TRUE, scale = TRUE, nf = 2, scannf = FALSE)
Then I used the fviz_pca_var
function to plot the correlation circle. I used geom= "arrow"
to remove the labels of each arrow as they were all overlapping. The default for this argument is to print both arrows and text labels for each line.
fviz_pca_var(pcaMouse, col.circle = "black",geom= "arrow") + ggtitle("")
We can compare that plot to the one below showing the correlation circle from a weighted analysis. We can see in the weighted analysis the first principal plane does a better job of explaining the variation as indicated by the percent of variation explained by each dimension as well as the length of the arrows.
tab = table(xwt$sampleGroup)
xwt$weight = 1 / as.numeric(tab[xwt$sampleGroup])
pcaMouseWeighted <- dudi.pca(as.data.frame(t(Biobase::exprs(xwt))),
row.w = xwt$weight,
center = TRUE, scale = TRUE, nf = 2, scannf = FALSE)
fviz_pca_var(pcaMouseWeighted, col.circle = "black",geom= "arrow") + ggtitle("")
The genes that have the best projections on the first principal plane (meaning those with the strongest correlation with the principal axes) are those with the longest arrows on the plots above. I extracted the genes with arrow lengths greater than 0.8 from the unweighted PCA plot with the following code:
corrCircle <- fviz_pca_var(pcaMouse, col.circle = "black")$data
arrowLengths <- sqrt(corrCircle$x^2+corrCircle$y^2)
cutoff <- 0.8
kpInd <- order(arrowLengths, decreasing=TRUE)[1:sum(arrowLengths>cutoff)]
genes <- corrCircle[kpInd,"name"]
genes
1456270_s_at, 1450624_at, 1449134_s_at, 1418153_at, 1420085_at, 1420086_x_at, 1434584_a_at, 1450843_a_at, 1429483_at, 1437308_s_at, 1456598_at, 1460605_at, 1429388_at, 1426990_at, 1436392_s_at and 1452270_s_at
To summarize, the code above saves output from the fviz_pca_var
function to an object called corrCircle
. I used the output (which included x and y coordinates) to calculate the arrow length for each gene. I then created a vector, kpInd
, with the indice number for genes that had a length greater then the cutoff of 0.8. Finally, I output the names of those genes with length great than 0.8. Below is a plot showing just these genes with the ``best projection," meaning they are most correlated with the plane of maximum variation.
corrCircle %>%
mutate(length = sqrt(x^2 + y^2)) %>%
filter(length >= 0.8) %>%
ggplot(aes(x = 0, xend = x, y = 0, yend = y)) +
geom_segment(arrow = arrow(length = unit(0.1, "inches"))) +
geom_label_repel(aes(x = x, y = y, label = name),
size = 2, alpha = 0.7) +
coord_fixed()
b) Make a biplot
Next we make a biplot to visualize samples and the genes in one plot. I used the fviz_pca_biplot
function. The col.var
and col.ind
options allow you to color the different genes and sample points, respectively, by particular groups. I used label=""
to remove the labels on the vectors and points. The above8
variable is simply a vector of either “Less than 0.8” or “Greater than 0.8” to indicate if each gene’s vector length was less than or greater than our cutoff value on the correlation circle.
above8 <- rep("Less than 0.8", dim(xwt)[1])
above8[1:100 %in% kpInd] <- "Greater than 0.8"
fviz_pca_biplot(pcaMouse, col.var=above8, col.ind=xwt$sampleGroup, label="") +
ggtitle("")
The colors/shapes of the point indicate the sample group of each sample point. The colors of the arrows indicate whether or not the corresponding gene had length greater than or less than 0.8 on the correlation circle.
We can see the EPI and PE groups (the groups with the fewest samples) appear more extreme in the first principle plane when using the unweighted PCA. This was not the case in the weighted PCA (shown below) because then groups were weighted equally. Here, the PCA mostly depends on the larger groups (to best understand the most observations), so the smaller groups appear more extreme on the principal plane.
above8 <- rep("Less than 0.8", dim(xwt)[1])
above8[1:100 %in% kpInd] <- "Greater than 0.8"
fviz_pca_biplot(pcaMouseWeighted, col.var=above8, col.ind=xwt$sampleGroup, label="") +
ggtitle("")