Vocabulary for Chapter 1
Chapter 1 covers generative modeling for discrete data. It introduces a number of terms covering probability and statistical modeling, as well as a few biological terms. The vocabulary words for Chapter 1 are:
probability model | A mathematical description of the possible outcomes of an experiment and the probability of each of those outcomes. |
vector | In programming, a one-dimensional array of data, all with the same data type. |
discrete event | In statistics, an event that can take a finite or countable number of values (e.g., number of deaths in a community by day). |
categorical variable | A variable that takes one of a finite set of values, called levels. |
levels | In the context of a categorical variable, the set of values to which the variable can be assigned. |
factor | In the context of statistical programming, a data type that can take one of a limited number of possible values (e.g., sex, nationality). |
exchangeable | A property of a vector of random variables that implies the order in which the variables appear in the vector doesn’t matter (i.e., their joint distribution is unchanged when the variables are reordered). |
sufficient statistic | A (summary) statistic that contains all the information about the model parameters that is in the original, uncondensed form of the data. |
Bernoulli distribution | A probability distribution describing a random variable that can take on two possible outcomes (e.g., win / loss). |
parameter | A numerical value that describes a population. |
complementary | A description of two events that are mutually exclusive and whose probabilities sum to one (i.e., exactly one of the two events is guaranteed to happen). |
binomial random variable | A variable whose values occur according to a binomial probability distribution. |
probability mass distribution | A function giving the probability that a discrete random variable is equal to a given value. |
Poisson distribution | A probability distribution for count data that has support on the non-negative integers. This distribution is also used to approximate a binomial distribution when the probability of success is small and the number of trials is large. |
epitope / antigenic determinant | Site on a macromolecular antigen to which an antibody binds. This is the part of an antigen that is recognized by the immune system. |
Enzyme-linked immunosorbent assay (ELISA) | An assay that is used to detect specific epitopes at different positions along a protein. |
conditional on | Given; that is, treating a specified event or value as known. |
cumulative distribution function | A function giving the probability that a random variable is less than or equal to a specified value. |
extreme value analysis | Analysis focused on the behavior of the very large or the very small outcomes of a random distribution, allowing an exploration of the probability of rare events. |
rare event | Something that occurs with a very low probability. |
rank statistic | A data vector sorted from least to greatest. |
Monte Carlo method | A method that uses computer simulation from a generative model to determine probabilities of events. |
probability or generative modeling | A method of modeling where all the parameters are known and the mathematical theory allows us to work by deduction. |
deduction | A top-down method of reasoning, starting from a theory or principle rather than from data. |
statistical modeling | A method of modeling where the distribution of the data is not known. |
fit | In the context of statistical modeling, estimating the parameters of a model based on observed data. |
multinomial | A generalization of the binomial distribution to cases where each trial can have more than two possible outcomes (e.g., a roll of a die). |
power / true positive rate | The probability of detecting something if it is there. |
null hypothesis | Often, a hypothesis of “no association” that is used as a counterpart to a more interesting alternative hypothesis in hypothesis testing. |
matrix | In programming, a two-dimensional array of data, all with the same data type. |
expected value | The average (mean) value of a random variable. |
variability / spread / dispersion | In statistics, the amount by which the observations in a set deviate from their mean. |
statistic | A numerical quantity computed from a sample and known constants only (i.e., involving no unknown parameters). |
null distribution | The probability distribution of a test statistic under the null hypothesis. |
alternative | In the context of a generating process and hypothesis testing, the generating process that is considered in comparison to the generating process under the null hypothesis. |
chi-squared distribution | A distribution on the non-negative real numbers that is often used in assessing goodness-of-fit (e.g., for models fit to contingency tables). |
p-value | The probability of seeing the observed data or something more extreme under the generative model associated with the null hypothesis. |
probability density function | A function giving the relative likelihood that a continuous random variable is equal to a given value. Its integral over the sample space equals 1. |
default | In the context of arguments to an R function, the value that is used if no custom value is specified. |
C. elegans genome nucleotide frequency | How often adenine, cytosine, guanine, and thymine occur in the DNA of a roundworm often used in scientific research. |
Bioconductor | Open-source software that provides contributed programs for bioinformatic data analysis. |
codon | A three-nucleotide sequence that specifies the amino acid to be created next (or to start or stop synthesis). |
DNA read | An inferred sequence of base pairs for a single DNA fragment, based on sequencing. |
nucleotide | In the context of DNA, one of four compounds (adenine (A), cytosine (C), guanine (G), and thymine (T)) that make up the basic information unit. |
genome | An organism’s complete set of DNA, including all of its genes. |
replication cycle | In biology, the process that begins with the infection of a host cell by a virus and ends with the release of mature progeny virus particles. |
point mutation | A change, addition, or deletion of a single nucleotide in a gene sequence. |
genotype | The genetic make-up of an individual’s cells, including how the individual’s genetic make-up differs from others’. |
diploid | Having genetic material in two complete sets of chromosomes, one set from each parent. |
protein | A compound made up of amino acids; one of the four types of macromolecules that make up living organisms. |
antibody | A type of protein made by certain white blood cells in response to an antigen. |
antigen | A foreign substance in the body to which the immune system reacts. |
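Several of the statistical terms above (probability mass distribution, cumulative distribution function, the Poisson approximation to the binomial, and the Monte Carlo method) can be tied together in one small example. The following is a minimal sketch in Python using only the standard library. The function names and the values n = 200 and p = 0.01 are illustrative assumptions and are not taken from the chapter.

```python
import math
import random


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def poisson_cdf(k, lam):
    """Cumulative distribution function: P(X <= k) for a Poisson(lam) variable."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


def simulate_binomial(n, p, rng):
    """One binomial draw, generated as the sum of n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


def monte_carlo_tail(n, p, threshold, n_sim, seed=1):
    """Monte Carlo estimate of P(X >= threshold) for X ~ Binomial(n, p)."""
    rng = random.Random(seed)
    hits = sum(simulate_binomial(n, p, rng) >= threshold for _ in range(n_sim))
    return hits / n_sim


# Illustrative (assumed) values: many trials, small success probability.
n, p = 200, 0.01
lam = n * p  # the approximating Poisson distribution uses rate n * p

# Compare the two probability mass distributions for small counts.
for k in range(6):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))

# Probability of a rare event (6 or more successes), computed two ways:
# by deduction from the Poisson model, and by simulation from the binomial model.
tail_from_cdf = 1 - poisson_cdf(5, lam)
tail_from_simulation = monte_carlo_tail(n, p, threshold=6, n_sim=5000)
print(tail_from_cdf, tail_from_simulation)
```

The two columns printed in the loop should be close because p is small and n is large. The simulated tail probability will not match the calculated one exactly: it carries simulation error, and the calculated value is itself an approximation to the binomial.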
Sources consulted or cited
Some of the definitions above are based in part or whole on listed definitions in the following sources.
- Holmes and Huber, 2019. Modern Statistics for Modern Biology. Cambridge University Press, Cambridge, United Kingdom.
- Everitt and Skrondal, 2010. The Cambridge Dictionary of Statistics (Fourth Edition). Cambridge University Press, Cambridge, United Kingdom.
- Bioconductor: Open Source Software for Bioinformatics. https://www.bioconductor.org/
- Wikipedia: The Free Encyclopedia. https://en.wikipedia.org/wiki/Main_Page
- NIH Genetics Home Reference. https://ghr.nlm.nih.gov/
- NCI Dictionary of Cancer Terms. https://www.cancer.gov/publications/dictionaries/cancer-terms