Chapter 6, Exercise 6.4 Make a less extreme example of correlated test statistics than the data duplication at the end of Section 6.5. Simulate data with true null hypotheses only, and let the data morph from having completely independent replicates (columns) to highly correlated as a function of some continuous-valued control parameter. Check type-I error control (e.g., with the p-value histogram) as a function of this control parameter.
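One way to set this up is sketched below in base R. It is an illustration under stated assumptions, not the book's solution: every "base" replicate gets a partner column whose correlation with it is the control parameter `rho`, so `rho = 0` gives independent replicates and `rho = 1` reproduces outright data duplication. The function names `simulate_null` and `row_t_pvalues`, the choice of six columns per group, and the use of an equal-variance two-sample t-test are all assumptions made for this sketch.

```r
set.seed(1)

# Simulate null data: rows are features, columns are replicates.
# Each of the n_base base columns gets a partner column with correlation rho.
simulate_null <- function(n_rows, n_base, rho) {
  base    <- matrix(rnorm(n_rows * n_base), nrow = n_rows)
  noise   <- matrix(rnorm(n_rows * n_base), nrow = n_rows)
  partner <- rho * base + sqrt(1 - rho^2) * noise
  cbind(base, partner)
}

# Row-wise equal-variance two-sample t-test p-values, vectorised over rows.
row_t_pvalues <- function(x, group) {
  g1 <- x[, group == 1, drop = FALSE]
  g2 <- x[, group == 2, drop = FALSE]
  n1 <- ncol(g1); n2 <- ncol(g2)
  m1 <- rowMeans(g1); m2 <- rowMeans(g2)
  v1 <- rowSums((g1 - m1)^2) / (n1 - 1)
  v2 <- rowSums((g2 - m2)^2) / (n2 - 1)
  sp <- sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
  tstat <- (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))
  2 * pt(-abs(tstat), df = n1 + n2 - 2)
}

# Base columns 1-3 and their partners form group 1; columns 4-6 and partners form group 2.
group <- rep(rep(1:2, each = 3), times = 2)
rhos  <- c(0, 0.5, 0.9, 1)

# P-values for each value of the control parameter (all hypotheses are truly null).
pvals <- lapply(rhos, function(rho) {
  row_t_pvalues(simulate_null(n_rows = 5000, n_base = 6, rho = rho), group)
})
names(pvals) <- rhos

# Fraction of true nulls rejected at the 5% level, per value of rho.
type1 <- sapply(pvals, function(p) mean(p < 0.05))
print(round(type1, 3))
```

Calling `hist(pvals[["0"]], breaks = 20)` and `hist(pvals[["1"]], breaks = 20)` shows the p-value histograms at the two extremes. With independent replicates the histogram should look roughly flat, while with correlated replicates small p-values become over-represented because the test overstates the effective sample size.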
To begin, we load the packages we will need for this exercise using library (installing any that are missing first):
library(tidyverse)
# BiocManager::install("EBImage")
library(EBImage)
# install.packages("spatstat")
library("spatstat")

Extending the analysis in section 11.17 for all cell types

This exercise asks us to analyze an image of a lymph node and evaluate the spatial dependence for all of the cell types in the lymph node. In this case, the null hypothesis is that each cell type is distributed uniformly at random (following a homogeneous Poisson process) throughout the lymph node.
Exercise 8.1 Do the analyses of Section 8.5 with the edgeR package and compare the results: make a scatterplot of the \(\log_{10}\) p-values, pick some genes where there are large differences, and visualize the raw data to see what is going on. Based on this, can you explain the differences?
Most of the following code is taken straight from the book in section 8.5 for data cleaning/wrangling and the DESeq2 analysis.
Exercise 9.2 Correspondence Analysis on color association tables: Here is an example of data collected by looking at the number of Google hits resulting from queries of pairs of words. The numbers in Table 9.4 [not reproduced] are to be multiplied by 1000. For instance, the combination of the words “quiet” and “blue” returned 2,150,000 hits. Perform a correspondence analysis of these data. What do you notice when you look at the two-dimensional biplot?
Exercise 7.4 from Modern Statistics for Modern Biology

Let’s revisit the Hiiragi data and compare the weighted and unweighted approaches.

7.4a Make a correlation circle for the unweighted Hiiragi data xwt. Which genes have the best projections on the first principal plane (best approximation)?

7.4b Make a biplot showing the labels of the extreme gene-variables that explain most of the variance in the first plane. Add the sample-points.
This exercise asks us to interpret and validate the consistency within our clusters of data. To do this, we will employ the silhouette index, which gives us a silhouette value measuring how similar an object is to its own cluster compared to other clusters.
The silhouette index is as follows:
\[\displaystyle S(i) = \frac{B(i) - A(i)}{\max\bigl(A(i), B(i)\bigr)} \]
The book explains the equation by first defining the average dissimilarity of a point \(x_i\) to a cluster \(C_k\) as the average of the distances from \(x_i\) to all of the points in \(C_k\).
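To make the definition concrete, here is a minimal base R sketch that computes the silhouette value of every point by hand. It assumes the usual reading of the formula, in which \(A(i)\) is the average dissimilarity of \(x_i\) to its own cluster and \(B(i)\) is the smallest average dissimilarity of \(x_i\) to any other cluster. The function name `silhouette_by_hand`, the toy data, the Euclidean distance, and the convention of returning 0 for singleton clusters are all choices made for this illustration, and the sketch assumes at least two clusters.

```r
# Silhouette value of each row of x, given a vector of cluster labels.
silhouette_by_hand <- function(x, cluster) {
  d <- as.matrix(dist(x))          # pairwise Euclidean distances
  labels <- unique(cluster)
  sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                # exclude the point itself from its own cluster
    if (!any(own)) return(0)       # assumed convention for singleton clusters
    a <- mean(d[i, own])           # A(i): average distance within own cluster
    others <- setdiff(labels, cluster[i])
    # B(i): smallest average distance to any other cluster
    b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
}

# Toy example: two tight, well-separated clusters.
x <- rbind(c(0, 0), c(0, 1), c(10, 0), c(10, 1))
cluster <- c(1, 1, 2, 2)
s <- silhouette_by_hand(x, cluster)
print(round(s, 3))
```

Values close to 1 indicate points that sit comfortably inside their cluster, values near 0 indicate points on a boundary, and negative values suggest a point may be closer to a neighbouring cluster. In practice the `silhouette` function in the cluster package can be used instead of a hand-rolled version.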
Exercise 2.6 The first part of the exercise asks you to:
Choose your own prior for the parameters of the beta distribution. You can do this by sketching it here: https://jhubiostatistics.shinyapps.io/drawyourprior.
After sketching a plot, I chose the parameters to set up a prior: \(\alpha\) = 2.47 and \(\beta\) = 8.5.
Using this prior

Next, the exercise asks you:
Once you have set up a prior, re-analyse the data from Section 2.
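As a rough sketch of what the re-analysis involves: with a beta prior and binomial data, the posterior is again a beta distribution with parameters \(\alpha + y\) and \(\beta + n - y\), where \(y\) is the number of successes in \(n\) trials. The prior parameters below are the ones chosen above; the counts `y` and `n` are placeholder values assumed for illustration only and should be replaced with the data from the book section.

```r
# Prior parameters chosen from the sketched prior.
alpha_prior <- 2.47
beta_prior  <- 8.5

# Placeholder data (assumed for illustration, not taken from this post).
y <- 40    # number of successes
n <- 300   # number of trials

# Conjugate update: Beta(alpha, beta) prior + binomial data -> Beta posterior.
alpha_post <- alpha_prior + y
beta_post  <- beta_prior + n - y

# Posterior mean and a central 95% credible interval.
post_mean <- alpha_post / (alpha_post + beta_post)
ci <- qbeta(c(0.025, 0.975), alpha_post, beta_post)

print(c(mean = post_mean, lower = ci[1], upper = ci[2]))
```

Plotting `dbeta(p, alpha_prior, beta_prior)` and `dbeta(p, alpha_post, beta_post)` over a grid of `p` values between 0 and 1 shows how far the data move the distribution away from the prior.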
As always, load libraries first.
library(ggplot2)
library(tidyverse)
library(dplyr)

Exercise 2.3 from Modern Statistics for Modern Biology

A sequence of three nucleotides codes for one amino acid. There are 4 nucleotides, so there are \(4^3 = 64\) possible three-nucleotide sequences (codons), enough to code for 64 different amino acids. However, there are only 20 amino acids, which require only 20 combinations + 1 for an “end” signal. (The “start” signal is the codon ATG, which also codes for the amino acid methionine, so the start signal does not have a separate codon.)
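The counting argument can be checked with a few lines of base R. This is only an illustrative sketch; the object names `nucleotides` and `codons` are made up for the example.

```r
nucleotides <- c("A", "C", "G", "T")

# Enumerate every ordered triple of nucleotides and paste each into one codon string.
triples <- expand.grid(first = nucleotides, second = nucleotides,
                       third = nucleotides, stringsAsFactors = FALSE)
codons <- apply(triples, 1, paste, collapse = "")

length(codons)       # number of possible codons, 4^3
"ATG" %in% codons    # the start codon is one of them
```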
Once or twice over the semester, each of you will be responsible for creating a blog post that provides a clean, clearly presented solution to the in-class exercise for the week. This blog post provides the technical instructions for writing and submitting that exercise.
Your exercise solution should be posted before the next class meeting. Since it will need to be reviewed by the faculty before it can be officially posted, please plan to submit it by the Tuesday after the class for your exercise.