Exercise solution for Chapter 1

Exercise 1.8 (p. 17) Is the mitochondrial sequence of C. elegans consistent with a model of equally likely nucleotides?

This exercise asks us to explore the frequency of each of the four nucleotides (A, C, G, and T) in the genome of C. elegans, a type of worm used frequently in scientific research.

This solution requires that several R extension packages be loaded in your R session. If you do not have these packages installed to your computer yet, you should follow instructions we’ve posted separately describing the required set-up for this exercise. Once you have installed these packages on your computer, you can load them into your current R session using the library function:

library("BSgenome.Celegans.UCSC.ce2")
library("Biostrings")

library("tidyverse")
library("knitr")

Part A

Explore the nucleotide frequencies of chromosome M by using a dedicated function in the Biostrings package from Bioconductor.

Part A of the question asks us to explore the nucleotide frequency of the C. elegans genome. This genome is available in the Celegans data that comes with the BSgenome.Clegans.UCSC.ce2 package and is stored within a BSgenome class, which is a special object class provided by the Biostrings package.

There is a dedicated function called letterFrequency in the Biostrings package that can be used to count the frequency of letters in a string (like a genome) in an R object like this. In a call to this function, you must also include the possible letters in your “alphabet”—that is, the possible letters that each position in your string could take.

(nuc_freq <- letterFrequency(Celegans$chrM, letters=c("A", "C", "G", "T")))

##    A    C    G    T 
## 4335 1225 2055 6179

To explore and plot this data, I put this summary data into a tibble, so I could more easily use tidyverse tools with the data.

nuc_freq_df <- tibble(nucleotide = names(nuc_freq), 
             n = nuc_freq)
nuc_freq_df

## # A tibble: 4 x 2
##   nucleotide     n
##   <chr>      <int>
## 1 A           4335
## 2 C           1225
## 3 G           2055
## 4 T           6179

In this format, you can use tidyverse tools to explore the data a bit more. For example, you can determine the total number of nucleotides in the genome and, with that calculate the proportion of each nucleotide across the genome. Along with the kable function from the knitr package, I created a formatted table with this information:

nuc_freq_df %>% 
  mutate(prop = n / sum(n)) %>% 
  kable(digits = 2, 
        caption = "Nucleotide frequencies and proportions in *C. elegans*",
        col.names = c("Nucleotide", "Frequency", "Proportion"))

Nucleotide frequencies and proportions in *C. elegans*
Nucleotide	Frequency	Proportion
A	4335	0.31
C	1225	0.09
G	2055	0.15
T	6179	0.45

For some presentations, it might be clearer to present this information in a slightly different table format, using pivot_longer and then pivot_wider to reformat the table for presentation:

nuc_freq_df %>% 
  mutate(prop = n / sum(n),
         n = prettyNum(n, big.mark = ","),
         prop = prettyNum(prop, digits = 2)) %>% 
  pivot_longer(cols = c("n", "prop")) %>% 
  pivot_wider(names_from = "nucleotide") %>% 
  mutate(name = case_when(
    name == "n" ~ "Frequency of nucleotide",
    name == "prop" ~ "Proportion of all nucleotides"
  )) %>%  
  rename(` ` = name) %>% 
  kable(align = c("rcccc"), 
        caption = "Nucleotide frequencies and proportions in *C. elegans*")

Nucleotide frequencies and proportions in *C. elegans*
	A	C	G	T
Frequency of nucleotide	4,335	1,225	2,055	6,179
Proportion of all nucleotides	0.31	0.089	0.15	0.45

Here is a plot of the frequency of each of the four nucleotides for the C. elegans nucleotide:

ggplot(nuc_freq_df, aes(x = nucleotide, y = n)) + 
  geom_col(fill = "lavender", color = "black") + 
  theme_classic() + 
  scale_y_continuous(label = scales::comma) + 
  theme(axis.title = element_blank()) + 
  labs(title = expression(paste(italic("C. elegans"), " neucleotide frequency")),
       caption = expression(paste("Based on data from the ", italic("BSgenome.Celegans.UCSC.ce2"), 
                                  " package.")))

This graph uses a few elements to improve its appearance that you might want to explore if you’re not already familiar with them:

The labs function is used to add both a title and a caption to the plot.
The paste, expression, and italic functions are used together to put “C. elegans” and an R package name in italics in some of the labels on the plot.
The scales package is used inside a scale layer for the ggplot2 code to make the y-axis labels a bit nicer.
theme calls are used to apply a simpler overall theme than the default and to remove the x- and y-axis titles (with element_blank).
The color and fill of the bars are customized in the geom layer (geom_col).

From this plot, it certainly looks like the nucleotides are not uniformly distributed in the C. elegans genome. This question will be investigated more in the next part of the exercise.

Part B

Test whether the C. elegans data are consistent with the uniform model (all nucleotide frequencies the same) using a simulation.

The second part of the exercise asks us to test whether the observed nucleotide data for C. elegans is consistent with the uniform model that all nucleotide frequencies are the same.

First, we can simulate several datasets under this null model and see how a plot of nucleotide frequencies compares to the plot that we obtained with the observed C. elegans data. To make these plots, I first simulated 20 samples under the null model that the distribution is uniform across the four nucleotides, using the rmultinom function with the size argument set to the number of nucleotides in the original C. elegans genome data and the prob argument set to have an equal probability of each nucleotide at each spot on the genome:

(sim_nuc_freq <- rmultinom(n = 20, 
                          size = sum(nuc_freq_df$n), 
                          prob = rep(1 / 4, 4)))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 3373 3458 3406 3492 3532 3444 3414 3482 3469  3324  3456  3433  3496  3478
## [2,] 3414 3390 3440 3465 3439 3502 3498 3412 3496  3490  3367  3521  3448  3453
## [3,] 3501 3466 3469 3398 3356 3377 3455 3378 3387  3459  3586  3496  3372  3390
## [4,] 3506 3480 3479 3439 3467 3471 3427 3522 3442  3521  3385  3344  3478  3473
##      [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]  3497  3364  3421  3389  3379  3506
## [2,]  3413  3444  3447  3465  3452  3489
## [3,]  3422  3485  3397  3467  3517  3422
## [4,]  3462  3501  3529  3473  3446  3377

Next, I moved this into a tibble so I could more easily rearrange and plot the data using facetting in ggplot2:

sim_nuc_freq_df <- as_tibble(sim_nuc_freq) %>% 
  mutate(nucleotide = c("A", "C", "G", "T")) %>% 
  pivot_longer(-nucleotide, names_to = "sample") %>% 
  mutate(sample = sample %>% str_remove("V") %>% as.numeric()) %>% 
  arrange(sample, nucleotide)

sim_nuc_freq_df %>% 
  slice(1:10)

## # A tibble: 10 x 3
##    nucleotide sample value
##    <chr>       <dbl> <int>
##  1 A               1  3373
##  2 C               1  3414
##  3 G               1  3501
##  4 T               1  3506
##  5 A               2  3458
##  6 C               2  3390
##  7 G               2  3466
##  8 T               2  3480
##  9 A               3  3406
## 10 C               3  3440

ggplot(sim_nuc_freq_df, aes(x = nucleotide, y = value)) + 
  geom_col(fill = "lavender", color = "black") + 
  theme_classic() + 
  scale_y_continuous(label = scales::comma) + 
  theme(axis.title = element_blank()) + 
  labs(title = "Simulated neucleotide frequencies under a uniform model") +
  facet_wrap(~ sample) + 
  expand_limits(y = max(nuc_freq_df$n))

The y-axis limits were expanded here to cover the same range as that shown for the observed C. elegans nucleotide frequencies, to help make it easier to compare these plots with the plot of our observed data. These plots of data simulated under the null model do show some variation in frequencies among the nucleotides, but it’s certainly much less than in the observed data for C. elegans.

Next, I repeated this simulation process, but I increased the number of simulations to 1,000:

sim_nuc_freq_df <- rmultinom(n = 1000, 
                          size = sum(nuc_freq_df$n), 
                          prob = rep(1 / 4, 4)) %>% 
  as_tibble() %>% 
  mutate(nucleotide = c("A", "C", "G", "T")) %>% 
  pivot_longer(-nucleotide, names_to = "sample") %>% 
  mutate(sample = sample %>% str_remove("V") %>% as.numeric()) %>% 
  arrange(sample, nucleotide)

sim_nuc_freq_df %>% 
  slice(1:10)

## # A tibble: 10 x 3
##    nucleotide sample value
##    <chr>       <dbl> <int>
##  1 A               1  3470
##  2 C               1  3392
##  3 G               1  3479
##  4 T               1  3453
##  5 A               2  3559
##  6 C               2  3364
##  7 G               2  3502
##  8 T               2  3369
##  9 A               3  3457
## 10 C               3  3471

Using this dataframe of simulations, we can measure the mean, minimum, and maximum frequencies of each nucleotide across all 1,000 simulations:

(sim_summary <- sim_nuc_freq_df %>% 
  group_by(nucleotide) %>% 
  summarize(mean_freq = mean(value),
            min_freq = min(value), 
            max_freq = max(value)))

## # A tibble: 4 x 4
##   nucleotide mean_freq min_freq max_freq
##   <chr>          <dbl>    <int>    <int>
## 1 A              3449.     3297     3586
## 2 C              3448.     3311     3646
## 3 G              3448.     3280     3639
## 4 T              3448.     3288     3599

To help compare this with the observed data, we can create a table with information from both the original data and the simulations under the null model:

nuc_freq_df %>% 
  left_join(sim_summary, by = "nucleotide") %>% 
  mutate_at(c("mean_freq", "min_freq", "max_freq", "n"), 
            prettyNum, big.mark = ",", digits = 0) %>% 
  mutate(simulations = paste0(mean_freq, " (", min_freq, ", ", max_freq, ")")) %>% 
  select(nucleotide, n, simulations) %>% 
  kable(col.names = c("Nucleotide",              
        "Frequency in C. elegans genome",
        "Mean frequency (minimum frequency, maximum frequency) across 1,000 simulations"), 
        align = "c")

Nucleotide	Frequency in C. elegans genome	Mean frequency (minimum frequency, maximum frequency) across 1,000 simulations
A	4,335	3,449 (3,297, 3,586)
C	1,225	3,448 (3,311, 3,646)
G	2,055	3,448 (3,280, 3,639)
T	6,179	3,448 (3,288, 3,599)

This helps clarify how unusual the observed data would be under the null model—the counts of all four nucleotides in the C. elegans genome are completely outside the range of frequencies in the simulated data.

Another way to look at this is with histograms of the distribution of frequencies of each nucleotide under the null model compared to the observed frequencies in the C. elegans nucleotide:

ggplot(sim_nuc_freq_df, aes(x = value)) + 
  geom_histogram(binwidth = 10) + 
  facet_wrap(~ nucleotide) + 
  theme_classic() + 
  scale_x_continuous(name = "Frequency of nucleotide in the simulation under the null model",
                     labels = scales::comma) + 
  scale_y_continuous(name = "# of simulations (out of 1,000)") + 
  geom_vline(data = nuc_freq_df, aes(xintercept = n), color = "red") + 
  labs(title = expression(paste("Nucleotide frequency in ",
                                italic("C. elegans"), 
                                " compared null model simulations")),
       caption = "Red line shows the frequency observed for the nucleotide in C. elegans")

Finally, to help in answering this question, it would be interesting to look at a single measure for each simulation (and for the observed data) rather than comparing each nucleotide one at a time. Chapter 1 gives the equation for a statistic to measure variability in multinomial data by calculating the sum of squares for the differences between the observed and expected count of nucleotides for each of the four nucleotides in a sample (p. 12).

I calculated this statistic for the observed data and then for each of the 1,000 simulations.

(obs_stat <- nuc_freq_df %>% 
  mutate(expected = mean(n),
         stat_input = (n - expected) ^ 2 / expected) %>% 
  summarize(variability_stat = sum(stat_input)))

## # A tibble: 1 x 1
##   variability_stat
##              <dbl>
## 1            4387.

sim_stat <- sim_nuc_freq_df %>% 
  mutate(expected = mean(value), 
         stat_input = (value - expected) ^ 2 / expected) %>% 
  group_by(sample) %>% 
  summarize(variability_stat = sum(stat_input))

sim_stat %>% 
  slice(1:5)

## # A tibble: 5 x 2
##   sample variability_stat
##    <dbl>            <dbl>
## 1      1            1.34 
## 2      2            8.27 
## 3      3            0.762
## 4      4            1.00 
## 5      5            0.811

Here is a plot of the distribution of this statistic across the 1,000 simulations:

ggplot(sim_stat, aes(x = variability_stat)) + 
  geom_rect(data = sim_stat, aes(xmin = quantile(variability_stat, prob = 0.025),
                                 xmax = quantile(variability_stat, prob = 0.975),
                                 ymin = 0, ymax = Inf), 
            fill = "beige", alpha = 0.5) +
  geom_histogram(bins = 30, fill = "white", color = "tan", alpha = 0.5) +
  theme_classic() + 
  labs(title = "Variability from expected values",
       subtitle = "Values from simulations under the null",
       x = "Value of variability statistic", 
       y = "Number of simulations with given value",
       caption = "The shaded yellow area shows the region of the central 95% of\nstatistic values for the 1,000 simulations under the null model.")

The value of this statistic for the observed nucleotide frequencies for C. elegans is 4387, which is much larger (indicating greater variability from expected values under the null model) than the value observed under most of the simulations. It is, in fact, far outside the central 95% range of values observed in simulations.

Exercise solution for Chapter 1

Brooke Anderson

2020-02-11

Part A

Part B