Chapter 8 Vocabulary List

Chapter 8 covers high-throughput count data, such as data generated by RNA-Seq, and introduces a number of tools for analyzing this type of data. The vocabulary terms for Chapter 8 are:

RNA-Seq sequencing of RNA molecules found in a population of cells or in a tissue
ChIP-Seq sequencing of DNA regions that are bound to particular DNA-binding proteins (selected by immunoprecipitation)
RIP-Seq sequencing of RNA molecules, or regions of them, bound to a particular RNA-binding protein
DNA-Seq sequencing of genomic DNA
HiC high-throughput chromatin conformation capture; a technique that aims to map the 3D spatial arrangement of DNA
cDNA complementary DNA made from RNA templates and reverse transcriptase; used in RNA-Seq
genetic screens a technique looking at the proliferation or survival of cells upon gene knockdown, knockout, or modification
read the sequence obtained from a fragment
sequencing library the collection of DNA molecules used as input for the sequencing machine
fragments molecules being sequenced during a sequencing analysis
count table a matrix with the tallies of the number of occurrences of subpopulations from a larger population/sample
dynamic range a ratio between the maximum and minimum values
heteroskedasticity a phenomenon where the variance and distribution shape of the data in different parts of the dynamic range are very different
normalization a technique that adjusts for the nature and magnitude of systematic sampling biases
rare events occurrences in the tail(s) of a distribution; observations that are extraordinarily high or low
dispersion a measure of the spread of the data; a common measure is the standard deviation or variance
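gamma-Poisson a hierarchical distribution in which counts are Poisson-distributed with a gamma-distributed rate; also called the negative binomial distribution, with two parameters, 𝛼 and 𝛽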
systematic biases systematic distortions that affect the data generation and need to be accounted for in the analysis; one example would be variations in the total number of reads for each sample in a sequencing experiment
metadata a set of data that describes or gives information about other data
multifactorial design an experimental design with more than one independent variable
balanced in the context of study design, describes a design with an equal number of observations for every combination of the factors being tested
differential expression analysis a type of analysis that uses the normalized read count data to investigate quantitative changes in expression levels between different experimental groups
intercept a coefficient representing the base level of the measurement in the negative control
design factors binary indicator variables that record whether each experimental factor (for example, a treatment) applies to a given sample
interaction effect a parameter in a model that accounts for the effects of two experimental factors that combine in a more complicated fashion than a simple summation
design matrix a matrix encoding the design of an experiment where the columns correspond to experimental factors and the rows correspond to different experimental conditions
residuals a term in a model that reflects the experimental fluctuations (i.e. random noise)
least sum-of-squares fitting a type of model fitting that minimizes the sum of the squared residuals
linear model a model that is a linear function of parameters, i.e. takes the form: y_j = sum_k (x_jk * beta_k) + e_j, where the residual e_j is added once per observation, outside the sum
analysis of variance (ANOVA) an analysis that decomposes patterns in the data into systematic variability and noise
noise variability unaccounted for by model parameters
systematic variability variability accounted for by model parameters
breakdown point a measure of the robustness of an estimator; larger values indicate more robust estimators
robust a “sturdy” estimator that is not heavily influenced by outliers
least absolute deviations minimization of the sum of the absolute values of the residuals
least quantile of squares a type of regression that minimizes a chosen quantile (for example, the median) of the squared residuals, rather than their sum
least trimmed sum of squares a type of regression that minimizes the sum of squared residuals, where the sum runs over only the fraction of observations with the smallest residuals
logistic regression a type of generalized linear regression for binary data where the linear predictor is transformed by the logistic function, so that the predicted probability is bounded between 0 and 1
maximum likelihood a method for parameter estimation that finds the parameter value that maximizes the probability of the observed data under the model
likelihood a function of a model parameter which is equal to the probability of the observed data under the model
maximum-likelihood estimates model parameters that are estimated by maximizing the probability of the observed data under the model
nuisance factor / blocking factor a factor that has some effect on the response but is of no interest to the experiment
batch effects hidden factors that affect the data but are not documented; e.g. samples run at the same time share a degree of similarity from being processed in the same batch
pseudocounts small positive constants added to counts before taking logarithms; the transformation takes the form y = log2(n + n_0), where n is the count and n_0 is the chosen positive constant (the pseudocount)
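variance stabilizing transformation a transformation chosen so that the variance of the transformed data is approximately independent of the mean; for count data it has finite values and finite slope, even for counts close to zero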
regularized logarithm (rlog) transformation a technique that transforms the original count data to a log2-like scale by fitting a “trivial” model with a separate term for each sample and a prior distribution on the coefficients which is estimated from the data
Cook’s distance a measure of how much a single sample is influencing the coefficients in a model; large values indicate an outlier count
sampling without replacement a random sample in which no observation occurs more than one time in the sample
null hypothesis often, a hypothesis of “no association” that is used as a counterpart to a more interesting alternative hypothesis in hypothesis testing
variability in statistics, the amount by which a set of observations deviate from their mean
outlier a data point that does not follow the pattern of the rest of the data; often this data point will have a large residual
M-estimation a type of regression analysis that is more robust than ordinary least squares (OLS) to outliers or to data that do not follow a normal distribution; it minimizes the sum of a penalization function applied to the residuals
conservative an approach that prioritizes reducing false positives
splicing a process in eukaryotic organisms where introns are removed from the precursor mRNA and the remaining exons are joined together before the mRNA is translated
exons segments of a gene that are retained in the mature mRNA after splicing; many of them encode protein sequence
isoforms different forms of the same gene that result from splicing events that combine different exons in an mRNA transcript
upregulated a term used to describe the increased expression of a gene
gene knockdown a way of inactivating a gene by targeting its mRNA transcript for inactivation or degradation
gene knockout deletion of a gene from the genome
transcriptome the total of all of the mRNA expressed from genes in an organism
polymorphism genetic variation within a population
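
Several of the terms above (design factors, design matrix, intercept, interaction effect, residuals, and least sum-of-squares fitting) fit together in a single linear model. The following is a minimal sketch in plain Python, not the approach used by the chapter's own software. The 2x2 experiment, the factor names, the measurements, and the helper functions solve and least_squares are all made up for illustration.

```python
# Hypothetical 2x2 experiment with two binary design factors:
# x1 = drug treatment, x2 = gene knockdown. All numbers are invented.


def solve(a, b):
    """Solve the square system a * x = b by Gauss-Jordan elimination
    with partial pivoting (a is a list of rows)."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(X, y):
    """Minimize the sum of squared residuals by solving the
    normal equations (X'X) beta = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y_j for row, y_j in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Balanced design: two samples for each combination of (x1, x2).
conditions = [(0, 0), (0, 0), (1, 0), (1, 0),
              (0, 1), (0, 1), (1, 1), (1, 1)]

# Design matrix, one row per sample.
# Columns: intercept, x1, x2, interaction (x1 * x2).
X = [[1, x1, x2, x1 * x2] for x1, x2 in conditions]

# Invented measurements, one per sample.
y = [1.0, 1.2, 2.0, 2.2, 1.5, 1.7, 4.0, 4.2]

beta = least_squares(X, y)
fitted = [sum(x * b for x, b in zip(row, beta)) for row in X]
residuals = [y_j - f for y_j, f in zip(y, fitted)]

print("coefficients (intercept, x1, x2, interaction):", beta)
print("residuals:", residuals)
```

Because this model has one coefficient for each combination of the two factors, the fitted intercept equals the mean of the samples where neither factor is applied, and the interaction coefficient is whatever remains of the doubly treated mean after adding the two single-factor effects to the intercept. The residuals are the replicate-to-replicate fluctuations that the coefficients do not explain.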

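The gamma-Poisson, dispersion, and pseudocounts entries can also be illustrated with a short simulation. This is only a sketch under stated assumptions: 𝛼 is treated as the gamma shape and 𝛽 as the gamma scale (other texts use a rate instead), the values 2 and 5 are arbitrary, and the function names poisson_draw, gamma_poisson_draw, and pseudocount_log2 are hypothetical helpers written for this example.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method
    (adequate for the modest rates used here)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def gamma_poisson_draw(shape, scale, rng):
    """Draw a rate from a gamma distribution, then a Poisson count
    with that rate. Assumption: shape = alpha, scale = beta."""
    rate = rng.gammavariate(shape, scale)
    return poisson_draw(rate, rng)


def pseudocount_log2(n, n0=1.0):
    """Log2 transform with pseudocount n0, so a count of 0 stays finite."""
    return math.log2(n + n0)


rng = random.Random(0)   # fixed seed so the run is repeatable
shape, scale = 2.0, 5.0  # arbitrary illustrative parameter values

counts = [gamma_poisson_draw(shape, scale, rng) for _ in range(5000)]
mean = sum(counts) / len(counts)
variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)

# For a plain Poisson the variance equals the mean; the gamma-distributed
# rate adds extra spread, so the variance here should exceed the mean.
print("mean:", mean)
print("variance:", variance)

transformed = [pseudocount_log2(c) for c in counts]
print("largest transformed value:", max(transformed))
```

With these parameter choices the theoretical mean is shape * scale = 10 and the theoretical variance is 10 + 10^2 / 2 = 60, so the simulated variance should come out well above the simulated mean. The pseudocount keeps the log2 transform defined for samples where the count is zero.
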
Sources consulted or cited

Some of the definitions above are based in part or whole on listed definitions in the following sources.

Mikaela Elder
Undergraduate Student in Biochemistry with Statistics Minor

I'm an undergraduate student interested in learning how to mathematically model biological systems.
