Exercise 8.1: Do the analyses of Section 8.5 with the edgeR package and compare the results: make a scatterplot of the log10 p-values, pick some genes where there are large differences, and visualize the raw data to see what is going on. Based on this, can you explain the differences?
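One way to approach the comparison is sketched below. It is not the book's solution, only a minimal outline under a few assumptions: the pasilla, DESeq2 and edgeR packages are installed, the pasilla counts are loaded in the way the DESeq2 vignette does it, the simple `~ condition` design is used, and edgeR's quasi-likelihood F-test is chosen as one of several tests that edgeR offers. Names such as `cmp` and `top` are illustrative.

```r
library(pasilla)
library(DESeq2)
library(edgeR)

# Load the pasilla count matrix and sample annotation shipped with the package
cts_file  <- system.file("extdata", "pasilla_gene_counts.tsv",
                         package = "pasilla", mustWork = TRUE)
anno_file <- system.file("extdata", "pasilla_sample_annotation.csv",
                         package = "pasilla", mustWork = TRUE)
cts     <- as.matrix(read.csv(cts_file, sep = "\t", row.names = "gene_id"))
coldata <- read.csv(anno_file, row.names = 1)

# Align sample names and put the samples in the same order in both objects
rownames(coldata) <- sub("fb$", "", rownames(coldata))
cts <- cts[, rownames(coldata)]
coldata$condition <- factor(coldata$condition,
                            levels = c("untreated", "treated"))

# DESeq2: Wald test for treated vs. untreated
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,
                              design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)

# edgeR: quasi-likelihood F-test on the same design (an assumed choice)
y      <- DGEList(counts = cts, group = coldata$condition)
y      <- calcNormFactors(y)
design <- model.matrix(~ condition, data = coldata)
y      <- estimateDisp(y, design)
fit    <- glmQLFit(y, design)
qlf    <- glmQLFTest(fit, coef = 2)

# Collect both sets of p-values; drop genes where DESeq2 reports NA
cmp <- data.frame(gene   = rownames(cts),
                  deseq2 = res$pvalue,
                  edger  = qlf$table$PValue)
cmp <- cmp[complete.cases(cmp), ]

# Guard against log10(0) before taking logs
floor_p    <- .Machine$double.xmin
cmp$deseq2 <- pmax(cmp$deseq2, floor_p)
cmp$edger  <- pmax(cmp$edger, floor_p)

# Scatterplot of the log10 p-values with the identity line
plot(log10(cmp$deseq2), log10(cmp$edger), pch = ".",
     xlab = "log10 p-value (DESeq2)", ylab = "log10 p-value (edgeR)")
abline(0, 1, col = "red")

# Genes where the two methods disagree most, with their normalized counts
cmp$diff <- abs(log10(cmp$deseq2) - log10(cmp$edger))
top <- head(cmp[order(-cmp$diff), ], 5)
print(top)
print(round(counts(dds, normalized = TRUE)[top$gene, ]))
```

Looking at the per-sample counts of the most discordant genes is usually the quickest way to see whether a disagreement comes from a single extreme sample, from very low counts, or from differences in how the two packages estimate dispersion.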
Most of the following code, covering the data cleaning/wrangling and the DESeq2 analysis, is taken straight from Section 8.5 of the book.
Chapter 8 covers high-throughput count data, such as data generated by RNA-Seq, and introduces a number of tools that are useful for analyzing this type of data. The vocabulary terms for Chapter 8 are:
RNA-Seq: sequencing of RNA molecules found in a population of cells or in a tissue
ChIP-Seq: sequencing of DNA regions that are bound to particular DNA-binding proteins (selected by immunoprecipitation)
RIP-Seq: sequencing of RNA molecules, or regions of them, bound to a particular RNA-binding protein
DNA-Seq: sequencing of genomic DNA
HiC: high-throughput chromatin conformation capture; a technique that aims to map the 3D spatial arrangement of DNA
cDNA: complementary DNA made from RNA templates and reverse transcriptase; used in RNA-Seq
genetic screens: a technique looking at the proliferation or survival of cells upon gene knockdown, knockout, or modification
read: the sequence obtained from a fragment
sequencing library: the collection of DNA molecules used as input for the sequencing machine
fragments: molecules being sequenced during a sequencing analysis
count table: a matrix with the tallies of the number of occurrences of subpopulations from a larger population/sample
dynamic range: a ratio between the maximum and minimum values
heteroskedasticity: a phenomenon where the variance and distribution shape of the data in different parts of the dynamic range are very different
normalization: a technique that adjusts for the nature and magnitude of systematic sampling biases
rare events: occurrences in the tail(s) of a distribution; observations that are extraordinarily high or low
dispersion: a measure of the spread of the data; a common measure is the standard deviation or variance
gamma-Poisson: negative binomial distribution with two parameters, 𝛼 and 𝛽
systematic biases: systematic distortions that affect the data generation and need to be accounted for in the analysis; one example would be variations in the total number of reads for each sample in a sequencing experiment
metadata: a set of data that describes or gives information about other data
multifactorial design: an experimental design with more than one independent variable
balanced: in the context of study design, a design with an equal number of observations for every combination of the factors being tested
differential expression analysis: a type of analysis that uses the normalized read count data to investigate quantitative changes in expression levels between different experimental groups
intercept: a coefficient representing the base level of the measurement in the negative control
design factors: binary indicator variables
interaction effect: a parameter in a model that accounts for the effects of two experimental factors that combine in a more complicated fashion than a simple summation
design matrix: a matrix encoding the design of an experiment where the columns correspond to experimental factors and the rows correspond to different experimental conditions
residuals: a term in a model that reflects the experimental fluctuations (i.