Vocabulary for Chapter 12

Chapter 12 covers supervised learning and the statistics of predicting categorical variables. Also discussed are the issues of overfitting and generalizability and how to “train” statistical models.

The vocabulary words for Chapter 12 are:

predictors characteristics measured for an observation that may be useful in predicting the target variable
overfitting the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably
generalization refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning
statistical learning framework for machine learning drawing from the fields of statistics and functional analysis. Deals with the problem of finding a predictive function based on data
objective response in the context of supervised learning, the measurable response (target variable) that the model is trained to predict
kernel methods class of algorithms for pattern analysis, whose best known member is the support vector machine (SVM). These use kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space
regression statistical method that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables)
classification the process of grouping observations in a dataset by their similarities in terms of measured characteristics
linear discriminant analysis (LDA) a common technique used both for supervised learning classification and as a pre-processing dimension reduction step that finds a linear combination of features to help in classification
misclassification rate (MCR) in statistical learning, the fraction of times the prediction is wrong, used particularly to assess classification models
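As an illustration, the MCR is just the fraction of mismatches between predicted and true labels; the labels below are made up for the example:

```python
# Misclassification rate: fraction of predictions that differ from the truth.
# The labels here are hypothetical illustration data, not from the chapter.
truth = ["sick", "healthy", "sick", "sick",    "healthy"]
pred  = ["sick", "sick",    "sick", "healthy", "healthy"]

# Count disagreements and divide by the number of observations
mcr = sum(t != p for t, p in zip(truth, pred)) / len(truth)
print(mcr)  # 2 wrong out of 5 -> 0.4
```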
leave-one-out cross-validation k-fold cross-validation taken to its logical extreme, with k equal to n, the number of data points in the set
k-fold cross-validation a technique where observations are repeatedly split into a training set of size around n(k-1)/k and a test set of size around n/k. Mainly used in prediction when one wants to estimate how accurately a predictive model will perform in practice
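The index bookkeeping behind k-fold cross-validation can be sketched in a few lines. This is a minimal, index-only sketch (the model-fitting step is omitted), and assigning observation i to fold i mod k is just one arbitrary assignment scheme:

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    # Assign observation i to fold i mod k (one arbitrary scheme;
    # real implementations usually shuffle first)
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]                                     # ~n/k points
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]  # ~n(k-1)/k
        yield train, test

for train, test in kfold_indices(10, 5):
    print(len(train), len(test))  # 8 2 for each of the 5 folds
```

With k equal to n, each test set contains a single point, which recovers leave-one-out cross-validation.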
curse of dimensionality refers to the fact that high-dimensional spaces are very hard, if not impossible, to sample thoroughly, because data in any particular region becomes very sparse as dimensions increase
confusion table in the field of machine learning, and specifically the problem of statistical classification, a table layout that allows visualization of the performance of an algorithm, typically a supervised learning one, built by counting the number of observations truly within each class versus the number predicted by the model to be in each class
sensitivity true positive rate or recall, measures the proportion of actual positives that are correctly identified as such
specificity true negative rate, measures the proportion of actual negatives that are correctly identified as negative
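Both quantities fall directly out of the four cells of a 2x2 confusion table; the counts below are invented for illustration:

```python
# Cells of a hypothetical 2x2 confusion table:
# TP = true positives, FN = false negatives,
# TN = true negatives, FP = false positives
TP, FN, TN, FP = 40, 10, 45, 5

sensitivity = TP / (TP + FN)  # proportion of actual positives caught: 0.8
specificity = TN / (TN + FP)  # proportion of actual negatives caught: 0.9
```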
receiver operating characteristic (ROC)/precision-recall curve a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied
Jaccard index a statistic used in quantifying the similarities between sample sets, which is formally defined as the size of the intersection between two sets divided by the size of the union of the sets
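The definition translates directly to code via set operations; the two gene sets below are hypothetical:

```python
# Jaccard index: |intersection| / |union| of two sets.
# The gene names are hypothetical illustration data.
a = {"geneA", "geneB", "geneC"}
b = {"geneB", "geneC", "geneD"}

jaccard = len(a & b) / len(a | b)  # 2 shared / 4 total distinct = 0.5
```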
mean-squared error (MSE) the average squared error
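For example, with hypothetical observed values y and predictions y_hat:

```python
# Mean-squared error: average of the squared differences between
# observations and predictions. Values are made up for illustration.
y     = [3.0, 5.0, 2.5]
y_hat = [2.5, 5.0, 3.5]

mse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)
# (0.25 + 0.0 + 1.0) / 3
```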
risk function/cost function/objective function the function that you optimize during the training of a predictive model (e.g., the maximum likelihood function for a classic regression model)
bias a measure of how different the average of all the different estimates is from the truth
variance how much an individual estimate might scatter from the average value
penalization a tool to actively control and exploit the variance-bias tradeoff
regularization a method used to ensure stable estimates by helping to prevent overfitting of the model to the training data
logistic regression a statistical model that in its basic form uses a logistic function to model a binary dependent variable. A binary logistic model has a dependent variable with two possible values (e.g., healthy/sick) which are represented by indicator variables (0, 1)
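The logistic function maps any linear combination of predictors onto a probability between 0 and 1; the coefficients and predictor value below are hypothetical:

```python
import math

def logistic(x):
    """Logistic (sigmoid) function: maps the real line to (0, 1)."""
    return 1 / (1 + math.exp(-x))

# Hypothetical fitted coefficients and a single predictor value
beta0, beta1 = -1.5, 0.8
x = 3.0

# Modeled probability of the "1" class (e.g., sick) for this observation
p = logistic(beta0 + beta1 * x)
```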
penalty function a term added to the objective function that consists of a penalty parameter multiplied by a measure of violation of the constraints
ridge regression a method of regression in which the cost function is altered by adding a penalty equivalent to the square of the magnitude of the coefficients. Doing this shrinks coefficients and helps reduce model complexity and multi-collinearity
lasso in the context of statistical regression modeling, the Least Absolute Shrinkage and Selection Operator, a regression method used to reduce over-fitting and select useful features of the data for predicting the outcome
elastic net in the fitting of linear or logistic regression models, a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods
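The way the elastic net combines the two penalties can be sketched as follows. This uses one common parameterization (exact forms vary between implementations), with a mixing parameter alpha between the L1 (lasso) and L2 (ridge) terms:

```python
def enet_penalty(coefs, lam, alpha):
    """Elastic net penalty added to the loss during fitting.

    lam scales the overall penalty strength; alpha mixes the
    L1 and L2 terms (one common parameterization; implementations differ).
    """
    l1 = sum(abs(b) for b in coefs)   # lasso term: sum of |coefficients|
    l2 = sum(b * b for b in coefs)    # ridge term: sum of squared coefficients
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2)

# alpha = 1 recovers the pure lasso penalty, alpha = 0 the pure ridge penalty
```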
ExperimentHub in the context of Bioconductor, a package that provides a central location where curated data from experiments, publications, or training courses can be accessed
kingdom the second highest taxonomic rank, just below domain
phylum a level of classification of taxonomic rank below kingdom and above class
species the basic unit of classification and a taxonomic rank of an organism, as well as unit of biodiversity
diagnostic plots statistical plots that help to visualize how well a model fits the data (e.g., Normal Q-Q, Residuals vs. Fitted)
tuning parameters parameters that control the strength of the penalty term in certain types of regression algorithms (e.g., ridge and lasso regression), controlling the amount of shrinkage (where parameter estimates are shrunk towards a central point, like the mean) when fitting the model
p-value hacking manipulation of the data until finding a statistic that yields a desired result
workflow in the context of a computational analysis, the chaining of software tools together in a series of steps that operate on data
scale invariance a feature of objects or laws that do not change when scales (length, energy, or other variables) are multiplied by a common factor, and thus represent universality

Sources consulted or cited

Some of the definitions above are based in part or whole on listed definitions in the following source:

Holmes and Huber, 2019. Modern Statistics for Modern Biology. Cambridge University Press, Cambridge, United Kingdom.

https://en.wikipedia.org/

https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/

https://www.statisticshowto.com

https://www.cs.cmu.edu/~schneide/tut5/node42.html

https://towardsdatascience.com

https://bioconductor.org

https://pfern.github.io

Practice

Sherry WeMott
Graduate Student in Environmental Health, Epidemiology

My research focuses on environmental and social determinants of disease processes and outcomes. For my thesis project I'll be sampling and analyzing nasal viromes of young adult Colorado e-cigarette users compared to non-users.
