Chapter 8 Vocabulary List
Chapter 8 covers high-throughput count data, such as data generated by RNA-Seq. It introduces a number of tools that are useful for analyzing this type of data. The vocabulary terms for Chapter 8 are:
RNA-Seq | sequencing of RNA molecules found in a population of cells or in a tissue |
ChIP-Seq | sequencing of DNA regions that are bound to particular DNA-binding proteins (selected by immunoprecipitation) |
RIP-Seq | sequencing of RNA molecules, or regions of them, bound to a particular RNA-binding protein |
DNA-Seq | sequencing of genomic DNA |
HiC | high-throughput chromatin conformation capture; a technique that aims to map the 3D spatial arrangement of DNA |
cDNA | complementary DNA synthesized from RNA templates by reverse transcriptase; used in RNA-Seq |
genetic screens | a technique looking at the proliferation or survival of cells upon gene knockdown, knockout, or modification |
read | the sequence obtained from a fragment |
sequencing library | the collection of DNA molecules used as input for the sequencing machine |
fragments | molecules being sequenced during a sequencing analysis |
count table | a matrix with the tallies of the number of occurrences of subpopulations from a larger population/sample |
dynamic range | the ratio between the maximum and minimum values that a measurement takes |
heteroskedasticity | a phenomenon where the variance and distribution shape of the data in different parts of the dynamic range are very different |
normalization | a technique that adjusts for the nature and magnitude of systematic sampling biases |
rare events | occurrences in the tail(s) of a distribution; observations that are extraordinarily high or low |
dispersion | a measure of the spread of the data; common measures are the standard deviation and the variance |
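gamma-Poisson | another name for the negative binomial distribution, obtained as a Poisson distribution whose rate follows a gamma distribution; it has two parameters, 𝛼 and 𝛽 |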
systematic biases | systematic distortions that affect the data generation and need to be accounted for in the analysis; one example would be variations in the total number of reads for each sample in a sequencing experiment |
metadata | a set of data that describes or gives information about other data |
multifactorial design | an experimental design with more than one independent variable |
balanced | in the context of study design, describes a design with an equal number of observations for every combination of the factors being tested |
differential expression analysis | a type of analysis that uses the normalized read count data to investigate quantitative changes in expression levels between different experimental groups |
intercept | a coefficient representing the base level of the measurement in the negative control |
design factors | variables that encode the experimental conditions in a model; in the simplest case, binary indicator variables (0 or 1) |
interaction effect | a parameter in a model that accounts for the effects of two experimental factors that combine in a more complicated fashion than a simple summation |
design matrix | a matrix encoding the design of an experiment where the columns correspond to experimental factors and the rows correspond to different experimental conditions |
residuals | terms in a model that reflect the experimental fluctuations (i.e., random noise) |
least sum-of-squares fitting | a type of model fitting that minimizes the sum of the squared residuals |
linear model | a model that is a linear function of its parameters, i.e., takes the form: y_j = sum_k (x_jk * beta_k) + e_j, with a single residual term e_j per observation |
analysis of variance (ANOVA) | an analysis that decomposes patterns in the data into systematic variability and noise |
noise | variability unaccounted for by model parameters |
systematic variability | variability accounted for by model parameters |
breakdown point | a measure of the robustness of an estimator: the fraction of outlying observations it can tolerate before it gives arbitrarily wrong results; larger values indicate more robust estimators |
robust | a “sturdy” estimator that is not heavily influenced by outliers |
least absolute deviations | minimization of the sum of the absolute values of the residuals |
least quantile of squares | a type of regression that minimizes a chosen quantile (e.g., the median) of the squared residuals |
least trimmed sum of squares | a type of regression that minimizes the sum of squared residuals, where the sum is taken over only the fraction of observations with the smallest residuals |
logistic regression | a type of generalized linear regression for binary data where the linear predictor is transformed by the logistic function, so that the predicted probability is bounded between 0 and 1 |
maximum likelihood | a method for parameter estimation that finds the parameter value that maximizes the probability of the observed data under the model |
likelihood | a function of a model parameter which is equal to the probability of the observed data under the model |
maximum-likelihood estimates | model parameters that are estimated by maximizing the probability of the observed data under the model |
nuisance factor / blocking factor | a factor that has some effect on the response but is of no interest to the experiment |
batch effects | hidden factors that affect the data but are not documented; e.g., samples run at the same time tend to be more similar to each other than to samples run in other batches |
pseudocounts | small positive constants added to counts before a log transformation, as in y = log2(n + n_0), where n is the count and n_0 is the chosen pseudocount |
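variance stabilizing transformation | a transformation chosen so that the variance of the transformed data is approximately independent of the mean; for count data it has finite values and finite slope, even for counts close to zero |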
regularized logarithm (rlog) transformation | a technique that transforms the original count data to a log2-like scale by fitting a “trivial” model with a separate term for each sample and a prior distribution on the coefficients which is estimated from the data |
Cook’s distance | a measure of how much a single sample is influencing the coefficients in a model; large values indicate an outlier count |
sampling without replacement | a random sample in which no observation occurs more than one time in the sample |
null hypothesis | often, a hypothesis of “no association” that is used as a counterpart to a more interesting alternative hypothesis in hypothesis testing |
variability | in statistics, the amount by which the observations in a set deviate from their mean |
outlier | a data point that does not follow the pattern of the rest of the data; often this data point will have a large residual |
M-estimation | a type of regression analysis that is more robust than ordinary least squares (OLS) to outliers or to data that do not follow a normal distribution; it minimizes the sum of a penalty function applied to the residuals |
conservative | an approach that prioritizes reducing false positives |
splicing | a process in eukaryotic organisms in which introns are removed from the precursor mRNA and the remaining exons are joined together before translation |
exons | segments of a gene that are retained in the mature mRNA after splicing; they include the segments that encode a protein |
isoforms | different transcript forms of the same gene that result from splicing events that combine different exons in an mRNA transcript |
upregulated | a term used to describe the increased expression of a gene |
gene knockdown | a way of inactivating a gene by targeting its mRNA transcript for inactivation or degradation |
gene knockout | deletion of a gene from the genome |
transcriptome | the total of all of the mRNA expressed from genes in an organism |
polymorphism | genetic variation within a population |
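Several of the terms above (design factors, design matrix, intercept, linear model, least sum-of-squares fitting, residuals) fit together in one small computation. The following is a minimal sketch in plain Python, not code from the chapter: the data values and all variable names are made up for illustration, and the model is the simplest possible case with one binary design factor, y_j = beta0 + beta1 * x_j + e_j.

```python
# Hypothetical toy data (made-up values): two control and two treated samples.
x = [0, 0, 1, 1]          # design factor: binary indicator, 1 = treated
y = [2.0, 4.0, 7.0, 9.0]  # measurements

n = len(y)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least sum-of-squares solution for y_j = beta0 + beta1 * x_j + e_j
# (closed form for a single explanatory variable).
sxy = sum((xj - x_mean) * (yj - y_mean) for xj, yj in zip(x, y))
sxx = sum((xj - x_mean) ** 2 for xj in x)
beta1 = sxy / sxx                # effect of the treatment
beta0 = y_mean - beta1 * x_mean  # intercept: base level in the controls

# Design matrix: one row per observation (each condition appears twice here),
# columns = (intercept, treatment indicator).
design_matrix = [[1, xj] for xj in x]

# Residuals: what the systematic part of the model leaves unexplained.
fitted = [beta0 * row[0] + beta1 * row[1] for row in design_matrix]
residuals = [yj - fj for yj, fj in zip(y, fitted)]

print("intercept:", beta0)
print("treatment effect:", beta1)
print("residuals:", residuals)
```

With this toy input the intercept equals the mean of the control samples and the treatment effect equals the difference between the two group means, which is what the glossary definition of the intercept leads one to expect.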
Sources consulted or cited
Some of the definitions above are based in part or whole on listed definitions in the following sources.
- Holmes and Huber, 2019. Modern Statistics for Modern Biology. Cambridge University Press, Cambridge, United Kingdom.
- Lexico: https://www.lexico.com
- Statistics How To: https://www.statisticshowto.com
- Lavrakas, 2008. Sampling without replacement. Encyclopedia of Survey Research Methods. https://dx.doi.org/10.4135/9781412963947.n516