Exercise solution for Chapter 9
Exercise 9.2
“Correspondence Analysis on color association tables: Here is an example of data collected by looking at the number of Google hits resulting from queries of pairs of words. The numbers in Table 9.4 [not reproduced] are to be multiplied by 1000. For instance, the combination of the words “quiet” and “blue” returned 2,150,000 hits. Perform a correspondence analysis of these data. What do you notice when you look at the two-dimensional biplot?”
In this exercise, we will essentially be repeating the correspondence analysis process seen in section 9.4.2, this time using associations between color and sentiment terms in search engine queries, rather than hair and eye color. Rather than testing whether certain hair/eye combinations are more likely to co-occur, we will be exploring whether a given color is more or less likely to be associated with certain sentiments (e.g. would we expect “orange” and “happy” to co-occur more frequently than would be expected by chance?). The same general steps can be repeated without many changes.
Step 1: Loading libraries
library(ade4)
library(sva)
library(tidyverse)
library(factoextra)
library(janitor) #optional; function `clean_names()` makes column names easier to work with
library(ggplot2)
library(ggrepel)
Step 2: Two ways to load data:
At least working from the online version of the text, there are two ways to obtain (roughly) equivalent data for this exercise. One option is to copy Table 9.4 directly from the book into Excel, shift the header one cell to the right so the columns align with their proper names, export a .csv (ex_9_2_color_table.csv in this example), and load it into R using a command like:
color_matrix <- read_csv("example_datasets/ex_9_2_color_table.csv", col_names = TRUE) %>%
column_to_rownames(var = 'X1') %>%
as.matrix
Something to note is that this data is rounded to units of thousands (e.g. “19” is ~19,000), while the course data in the downloadable data file gives more precise values. I don’t know that this would disrupt any major patterns, but it could cause minor discrepancies when comparing results.
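If you do work from the book’s table, note that the difference in scale doesn’t affect the analysis itself: correspondence analysis is unchanged when every count is multiplied by the same constant. To put the book’s values on the same footing as the course file for side-by-side comparisons, you can simply rescale the matrix built above:
# Table 9.4 reports hits in thousands; rescaling to raw counts matches colorsentiment.csv.
# The CA results are unaffected, but residual magnitudes become directly comparable.
color_matrix <- color_matrix * 1000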
Alternatively, the file is included in the course data as colorsentiment.csv, although in a different, three-column format:
head(read_csv("example_datasets/colorsentiment.csv"))
## Parsed with column specification:
## cols(
## `<X>` = col_character(),
## `<Y>` = col_character(),
## Results = col_double()
## )
## # A tibble: 6 x 3
## `<X>` `<Y>` Results
## <chr> <chr> <dbl>
## 1 blue happy 8310000
## 2 blue depressed 957000
## 3 blue lively 1250000
## 4 blue clever 1270000
## 5 blue perplexed 71300
## 6 blue virtuous 80200
This can be converted to match our correspondence table format using the pivot_wider function and other tidyverse formatting tools:
color_matrix <- read_csv("example_datasets/colorsentiment.csv") %>%
janitor::clean_names() %>% #standardizes column name format
arrange(desc(results)) %>% # Rearranging to match row/col order in table 9.4
pivot_wider(names_from = x, values_from = results) %>% # converts colors from column entries (`x`) to column names
column_to_rownames(var = 'y')
color_matrix
## black white green blue orange purple grey
## happy 19300000 9150000 8730000 8310000 4220000 2610000 1920000
## angry 2970000 1730000 1740000 1530000 1040000 710000 752000
## quiet 2770000 2510000 2140000 2150000 1220000 821000 875000
## lively 1840000 1480000 1350000 1250000 621000 488000 659000
## clever 1650000 1420000 1320000 1270000 693000 416000 495000
## depressed 1480000 1270000 983000 957000 330000 102000 147000
## virtuous 179000 165000 102000 80200 24700 17300 20000
## perplexed 110000 109000 80100 71300 23300 15200 18900
Correspondence Analysis (following section 9.4.2 as a model)
Setting up the correspondence analysis object using the dudi.coa function from the ade4 package:
color_matrix_ca <- dudi.coa(color_matrix, n = 2, scannf = FALSE) # scannf = FALSE stops automatic printing of eigenvalues
This creates a special “Duality Diagram” list object (used by the ade4 package for correspondence analysis, but also for principal component analysis and other methods) containing a variety of data generated by the analysis; the call ?dudi will list what each component contains (axis weights, point coordinates, etc.). Stored features include the base data table (tab), a vector of eigenvalues (eig), and a variety of information on the rows and columns (e.g. weights, coordinates, and principal components).
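For example, a quick look at what the object actually holds (plain base R, nothing specific to ade4):
str(color_matrix_ca, max.level = 1) # names and sizes of the stored components
color_matrix_ca$eig                 # the eigenvalue for each dimension (used again below)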
Question 9.2 specifies that we use two dimensions, but visualizing the eigenvalues with the factoextra package confirms that this is a good representation of the system, with almost all of the variance explained by the first two dimensions:
fviz_eig(color_matrix_ca, geom = 'bar') # visualizing eigenvalues
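To put numbers on that bar chart, the proportion of total inertia (variance) captured by each dimension can be computed directly from the eigenvalue vector; this is the same quantity used later for the biplot axis labels:
# Percent of total inertia explained by each CA dimension
round(100 * color_matrix_ca$eig / sum(color_matrix_ca$eig), 1)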
Following the book’s example in 9.4.2, we can also explicitly calculate a residual matrix, which shows the difference between the values expected under a random (independent) distribution and the observed values for each row/column combination. This doesn’t feed into the visualizations, but it is helpful to have a quantitative reference for the residuals:
rowsums_colors <- as.matrix(apply(color_matrix, 1, sum))
colsums_colors <- as.matrix(apply(color_matrix, 2, sum))
expected_colors <- rowsums_colors %*% t(colsums_colors) / sum(colsums_colors) # using matrix multiplication to see what
# values we would expect if the row and column totals were spread evenly (independently) across the dataset
# sum((color_matrix - expected_colors)^2 / expected_colors) # would give the chi-squared statistic for the table
# subtracting the "expected" matrix from the observed values to see the discrepancies
residual_table_colors <- round(color_matrix - expected_colors)
Inspecting this table shows, for instance, that the combination of “happy” and “black” occurs in searches notably more often than we would expect if colors and sentiments were unassociated (about 19,300,000 observed hits against roughly 16,700,000 expected, i.e. around 2.6 million extra), while e.g. “happy” and “grey” falls short of its expected count by roughly 770,000 hits.
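The mosaic plot in the next step shades its cells by standardized (Pearson) residuals rather than by these raw differences; for reference, those can be computed directly from the objects above (a quick sketch):
# Standardized (Pearson) residuals: (observed - expected) / sqrt(expected);
# mosaicplot(..., shade = TRUE, type = 'pearson') bases its color scale on these
pearson_residuals_colors <- (color_matrix - expected_colors) / sqrt(expected_colors)
round(pearson_residuals_colors, 1)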
To take in these patterns more intuitively, we can use a mosaic plot to visualize the proportional distribution of color/sentiment terms in searches (e.g. “black” and “happy” are both popular terms, so we would expect them to co-occur more often than, say, “perplexed” and “purple” in any case), along with color coding for the residuals:
mosaicplot(t(color_matrix / 2000), las = 2, shade = TRUE, type = 'pearson') # dividing the counts by 2,000 because the mosaic plot doesn't seem to auto-scale its colors: with the raw counts, every Pearson residual is far beyond the +/-4 cutoff, so the whole plot saturates into the extreme category.
# transposing is purely aesthetic; the plot seems easier to follow with sentiments on the y axis, both for the labels and for visual row/column continuity.
From this plot it’s easier to see the broad patterns and associations among colors and sentiments. If we focus on the colors, black is quite distinct from the rest: it co-occurs with happy more than expected and with the other sentiments less than expected. The remaining colors behave more like each other than like black, but show a smaller division between white, blue, and green on one side and orange, purple, and grey on the other; the first group is rarer than expected (under a random distribution) with angry and more common than expected with depressed, virtuous, and perplexed. Based on this, in a 2D projection we might expect to see the largest separation between black and all other colors, with a smaller but still obvious distinction between the two remaining color clusters. Because quiet, lively, and clever are generally more common than expected for everything but black, they will likely sit roughly equidistant between these clusters.
To check how close these eyeballed estimates come, we can use the factoextra biplot function fviz_ca_biplot to draw the correspondence analysis biplot for sentiment and color searches. Using the option to draw one set of points as vector arrows also improves legibility (e.g. the relationship between angry and orange/purple becomes more obvious):
fviz_ca_biplot(color_matrix_ca, arrows = c(FALSE, TRUE), repel = TRUE)
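For comparison, factoextra can also draw the two sets of points separately with fviz_ca_col and fviz_ca_row; these are essentially the plots rebuilt by hand with ggplot further below (a quick sketch):
fviz_ca_col(color_matrix_ca, repel = TRUE) # colors (columns) only
fviz_ca_row(color_matrix_ca, repel = TRUE) # sentiments (rows) only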
In the resulting biplot, the sentiment and color groups show relationships similar to what we might expect from comparing the patterns of positive and negative residuals in the mosaic plot. However, some patterns are easier to see here, such as the strong opposition between black and grey on dimension 1, or the fact that most of dimension 2 reflects the differences of green-white-blue and perplexed-depressed-virtuous from the rest of the data. We can also make out some smaller trends that weren’t obvious (at least to me) from the mosaic visualization, like grey being more distinct from orange and purple than we could make out at the mosaic plot’s effective “resolution”. This separation appears to be driven by higher co-occurrence with lively, quiet, and clever; looking back at our mosaic plot, those three rows do show lighter shades of blue for orange and purple, with no obvious difference for angry.
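If you want the numbers behind that last observation, you can pull the relevant cells out of the Pearson residual table sketched earlier:
# Residuals for the sentiment/color combinations discussed above
round(pearson_residuals_colors[c("lively", "quiet", "clever", "angry"),
                               c("orange", "purple", "grey")], 1)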
Plotting directly with ggplot:
While factoextra uses custom functions to streamline the process, it’s possible to approximate the same visualization using ggplot and the components of the dudi object. In the color_matrix_ca object, the row and column factor coordinates (sentiments and colors, respectively) are stored in .$li and .$co, which allows direct plotting; you could also look at e.g. the normalized scores in .$l1 and .$c1. I was able to get a general sense of how the factoextra authors approached this by calling View(fviz_ca_biplot) to pull up the function’s R code, which ultimately pointed to the more fundamental fviz function.
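For orientation, it can help to glance at those two coordinate tables before plotting:
head(color_matrix_ca$li) # row (sentiment) coordinates; columns Axis1 and Axis2
head(color_matrix_ca$co) # column (color) coordinates; columns Comp1 and Comp2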
Using this information, we can render single plots for color:
#Single plots: (roughly equivalent to default output for `fviz_ca_col` and `fviz_ca_row`)
color_nmds <- color_matrix_ca$co %>%
  ggplot() +
  aes(x = Comp1, y = Comp2) +
  geom_point(color = 'blue') +
  geom_text(label = rownames(color_matrix_ca$co), nudge_y = 0.01, color = 'blue') +
  coord_fixed()
color_nmds
and for sentiment (below). We can also use the geom_segment function in ggplot to replicate the arrows seen in the factoextra biplot:
emotion_nmds <- color_matrix_ca$li %>%
  ggplot() +
  aes(x = Axis1, y = Axis2) +
  geom_point(color = 'red') +
  ggrepel::geom_text_repel(label = rownames(color_matrix_ca$li), nudge_y = 0.01, color = 'red') +
  geom_segment(aes(x = 0, y = 0, xend = Axis1, yend = Axis2),
               arrow = arrow(length = unit(0.3, "cm")),
               color = "red") +
  coord_fixed()
emotion_nmds
We can merge these elements into a single biplot by joining the two coordinate tables above into one dataframe, with one column for each value, and plotting the result:
# Biplot (there are probably more efficient/correct approaches)
# manually joining the two datasets using a common index (this generates a partial row of NAs, since there are 8 sentiments but only 7 colors)
color_component <- color_matrix_ca$co %>%
  rownames_to_column(var = "color") %>%
  rownames_to_column(var = "index")
emotion_component <- color_matrix_ca$li %>%
  rownames_to_column(var = "emotion") %>%
  rownames_to_column(var = "index")
biplot_composite <- color_component %>%
  full_join(emotion_component, by = "index")
biplot_composite
## index color Comp1 Comp2 emotion Axis1 Axis2
## 1 1 black 0.17794592 0.04039331 happy 0.11532145 0.01602901
## 2 2 white -0.05317953 -0.10399481 angry -0.10756934 0.09837664
## 3 3 green -0.03347037 -0.03237667 quiet -0.20048082 -0.01564279
## 4 4 blue -0.03552655 -0.04588027 lively -0.20105512 0.01174266
## 5 5 orange -0.09284092 0.07047109 clever -0.17886743 -0.03423825
## 6 6 purple -0.15049601 0.15422498 depressed 0.03935532 -0.24782949
## 7 7 grey -0.36826914 0.10335528 virtuous 0.05572888 -0.24684976
## 8 8 <NA> NA NA perplexed -0.04792928 -0.22173448
biplot_composite_plot <- biplot_composite %>%
  ggplot() +
  aes(x = Comp1, y = Comp2) +
  geom_point(color = 'red') +
  geom_text(label = biplot_composite$color, nudge_y = 0.01, color = 'red') +
  geom_segment(aes(x = 0, y = 0, xend = Comp1, yend = Comp2),
               arrow = arrow(length = unit(0.3, "cm")),
               color = "red") +
  geom_point(aes(x = Axis1, y = Axis2), color = 'blue') +
  geom_text_repel(aes(x = Axis1, y = Axis2), label = biplot_composite$emotion, nudge_y = 0.01, color = 'blue') +
  geom_hline(yintercept = 0, lty = 2) +
  geom_vline(xintercept = 0, lty = 2) +
  labs(x = paste0("Dim1 (", round(color_matrix_ca$eig[1] / sum(color_matrix_ca$eig) * 100, 1), "%)"),
       y = paste0("Dim2 (", round(color_matrix_ca$eig[2] / sum(color_matrix_ca$eig) * 100, 1), "%)")) +
  coord_fixed()
biplot_composite_plot
biplot_composite_plot
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text).
## Warning: Removed 1 rows containing missing values (geom_segment).
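Those warnings come from the single all-NA row created by the full join (eight sentiments joined against seven colors). They are harmless here, but if you would rather avoid them, one option is to give each layer only the rows it needs via each geom's data argument; a stripped-down sketch of the same plot:
# Split out the complete rows for each point set, then plot each from its own dataframe
color_rows   <- biplot_composite %>% filter(!is.na(color))
emotion_rows <- biplot_composite %>% filter(!is.na(emotion))
ggplot() +
  geom_point(data = color_rows, aes(x = Comp1, y = Comp2), color = 'red') +
  geom_text_repel(data = color_rows, aes(x = Comp1, y = Comp2, label = color), color = 'red') +
  geom_point(data = emotion_rows, aes(x = Axis1, y = Axis2), color = 'blue') +
  geom_text_repel(data = emotion_rows, aes(x = Axis1, y = Axis2, label = emotion), color = 'blue') +
  coord_fixed()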
Second Approach
Dr. Anderson also worked out a more efficient approach, using unclass together with purrr functions such as keep, map, and map_at to unite the co and li elements of the dudi object with fewer intermediate steps:
color_matrix_ca %>%
  # `unclass` to work with this as a regular list
  unclass() %>%
  # `keep` lets you keep just some elements of a list. We'll keep "co" and "li" elements
  keep(names(.) %in% c("co", "li")) %>%
  # both of these have important info in their rownames, so move those into a column.
  # `map` allows you to do the same thing to every element of the list (now just "co" and "li")
  map(rownames_to_column) %>%
  # `map_at` lets you do something to *just* some elements of a list. So here, to be able
  # to bind the two dataframe elements in the list into one dataframe, you need to
  # make sure they have the same column names. Currently, they don't, so we need to
  # change the column names for one of them.
  map_at("co", ~ rename(.x, Axis1 = Comp1, Axis2 = Comp2)) %>%
  # Now that we have a list where each element is a dataframe with the same number
  # of columns, and where those columns have the same names and data types, you
  # can use `bind_rows` to stick them together into one dataframe (it turns out that
  # this is a *super* helpful function). After this step, you have a tidy dataframe. The
  # `.id` parameter is adding a column with the original list element name, so you can
  # tell which rows originally came from "co" and which from "li"
  bind_rows(.id = "id") %>%
  ggplot(aes(x = Axis1, y = Axis2, color = id)) +
  geom_point() +
  # You can add arrows to your segments with the `arrow` function (from the
  # `grid` package, which is very old school graphics and what ggplot is built on)
  # To have everything come from the center, you set the starting point to 0 for
  # both x- and y-axes
  geom_segment(aes(x = 0, y = 0, xend = Axis1, yend = Axis2),
               arrow = arrow(length = unit(0.1, "inches"))) +
  geom_text_repel(aes(label = rowname)) +
  # `coord_fixed` ensures that the x- and y-axes are scaled so they have the
  # same unit size
  coord_fixed() +
  # `str_c` lets you stick character strings together (`paste0` would also work here)
  labs(x = str_c("Dim 1 (",
                 round(100 * color_matrix_ca$eig[1] / sum(color_matrix_ca$eig), 1),
                 "%)"),
       y = str_c("Dim 2 (",
                 round(100 * color_matrix_ca$eig[2] / sum(color_matrix_ca$eig), 1),
                 "%)")) +
  scale_color_manual(values = c("red", "blue")) +
  # Get rid of the color legend
  theme(legend.position = "none") +
  # Add some reference lines for 0 on the x- and y-axes
  geom_hline(aes(yintercept = 0), linetype = 3) +
  geom_vline(aes(xintercept = 0), linetype = 3)