Exercise 1.8 (p. 17) Is the mitochondrial sequence of C. elegans consistent with a model of equally likely nucleotides?
This exercise asks us to explore the frequency of each of the four nucleotides (A, C, G, and T) in the genome of C. elegans, a type of worm used frequently in scientific research.
This solution requires that several R extension packages be loaded in your R session. If you do not have these packages installed to your computer yet, you should follow instructions we’ve posted separately describing the required set-up for this exercise. Once you have installed these packages on your computer, you can load them into your current R session using the library
function:
library("BSgenome.Celegans.UCSC.ce2")
library("Biostrings")
library("tidyverse")
library("knitr")
Explore the nucleotide frequencies of chromosome M by using a dedicated function in the Biostrings package from Bioconductor.
Part A of the question asks us to explore the nucleotide frequency of the C. elegans genome. This genome is available in the Celegans
data that comes with the BSgenome.Clegans.UCSC.ce2
package and is stored within a BSgenome
class, which is a special object class provided by the Biostrings
package.
There is a dedicated function called letterFrequency
in the Biostrings
package that can be used to count the frequency of letters in a string (like a genome) in an R object like this. In a call to this function, you must also include the possible letters in your “alphabet”—that is, the possible letters that each position in your string could take.
(nuc_freq <- letterFrequency(Celegans$chrM, letters=c("A", "C", "G", "T")))
## A C G T
## 4335 1225 2055 6179
To explore and plot this data, I put this summary data into a tibble
, so I could more easily use tidyverse
tools with the data.
nuc_freq_df <- tibble(nucleotide = names(nuc_freq),
n = nuc_freq)
nuc_freq_df
## # A tibble: 4 x 2
## nucleotide n
## <chr> <int>
## 1 A 4335
## 2 C 1225
## 3 G 2055
## 4 T 6179
In this format, you can use tidyverse
tools to explore the data a bit more. For example, you can determine the total number of nucleotides in the genome and, with that calculate the proportion of each nucleotide across the genome. Along with the kable
function from the knitr
package, I created a formatted table with this information:
nuc_freq_df %>%
mutate(prop = n / sum(n)) %>%
kable(digits = 2,
caption = "Nucleotide frequencies and proportions in *C. elegans*",
col.names = c("Nucleotide", "Frequency", "Proportion"))
Nucleotide | Frequency | Proportion |
---|---|---|
A | 4335 | 0.31 |
C | 1225 | 0.09 |
G | 2055 | 0.15 |
T | 6179 | 0.45 |
For some presentations, it might be clearer to present this information in a slightly different table format, using pivot_longer
and then pivot_wider
to reformat the table for presentation:
nuc_freq_df %>%
mutate(prop = n / sum(n),
n = prettyNum(n, big.mark = ","),
prop = prettyNum(prop, digits = 2)) %>%
pivot_longer(cols = c("n", "prop")) %>%
pivot_wider(names_from = "nucleotide") %>%
mutate(name = case_when(
name == "n" ~ "Frequency of nucleotide",
name == "prop" ~ "Proportion of all nucleotides"
)) %>%
rename(` ` = name) %>%
kable(align = c("rcccc"),
caption = "Nucleotide frequencies and proportions in *C. elegans*")
A | C | G | T | |
---|---|---|---|---|
Frequency of nucleotide | 4,335 | 1,225 | 2,055 | 6,179 |
Proportion of all nucleotides | 0.31 | 0.089 | 0.15 | 0.45 |
Here is a plot of the frequency of each of the four nucleotides for the C. elegans nucleotide:
ggplot(nuc_freq_df, aes(x = nucleotide, y = n)) +
geom_col(fill = "lavender", color = "black") +
theme_classic() +
scale_y_continuous(label = scales::comma) +
theme(axis.title = element_blank()) +
labs(title = expression(paste(italic("C. elegans"), " neucleotide frequency")),
caption = expression(paste("Based on data from the ", italic("BSgenome.Celegans.UCSC.ce2"),
" package.")))
This graph uses a few elements to improve its appearance that you might want to explore if you’re not already familiar with them:
labs
function is used to add both a title and a caption to the plot.paste
, expression
, and italic
functions are used together to put “C. elegans” and an R package name in italics in some of the labels on the plot.scales
package is used inside a scale layer for the ggplot2
code to make the y-axis labels a bit nicer.theme
calls are used to apply a simpler overall theme than the default and to remove the x- and y-axis titles (with element_blank
).geom_col
).From this plot, it certainly looks like the nucleotides are not uniformly distributed in the C. elegans genome. This question will be investigated more in the next part of the exercise.
Test whether the C. elegans data are consistent with the uniform model (all nucleotide frequencies the same) using a simulation.
The second part of the exercise asks us to test whether the observed nucleotide data for C. elegans is consistent with the uniform model that all nucleotide frequencies are the same.
First, we can simulate several datasets under this null model and see how a plot of nucleotide frequencies compares to the plot that we obtained with the observed C. elegans data. To make these plots, I first simulated 20 samples under the null model that the distribution is uniform across the four nucleotides, using the rmultinom
function with the size
argument set to the number of nucleotides in the original C. elegans genome data and the prob
argument set to have an equal probability of each nucleotide at each spot on the genome:
(sim_nuc_freq <- rmultinom(n = 20,
size = sum(nuc_freq_df$n),
prob = rep(1 / 4, 4)))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 3373 3458 3406 3492 3532 3444 3414 3482 3469 3324 3456 3433 3496 3478
## [2,] 3414 3390 3440 3465 3439 3502 3498 3412 3496 3490 3367 3521 3448 3453
## [3,] 3501 3466 3469 3398 3356 3377 3455 3378 3387 3459 3586 3496 3372 3390
## [4,] 3506 3480 3479 3439 3467 3471 3427 3522 3442 3521 3385 3344 3478 3473
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 3497 3364 3421 3389 3379 3506
## [2,] 3413 3444 3447 3465 3452 3489
## [3,] 3422 3485 3397 3467 3517 3422
## [4,] 3462 3501 3529 3473 3446 3377
Next, I moved this into a tibble so I could more easily rearrange and plot the data using facetting in ggplot2
:
sim_nuc_freq_df <- as_tibble(sim_nuc_freq) %>%
mutate(nucleotide = c("A", "C", "G", "T")) %>%
pivot_longer(-nucleotide, names_to = "sample") %>%
mutate(sample = sample %>% str_remove("V") %>% as.numeric()) %>%
arrange(sample, nucleotide)
sim_nuc_freq_df %>%
slice(1:10)
## # A tibble: 10 x 3
## nucleotide sample value
## <chr> <dbl> <int>
## 1 A 1 3373
## 2 C 1 3414
## 3 G 1 3501
## 4 T 1 3506
## 5 A 2 3458
## 6 C 2 3390
## 7 G 2 3466
## 8 T 2 3480
## 9 A 3 3406
## 10 C 3 3440
ggplot(sim_nuc_freq_df, aes(x = nucleotide, y = value)) +
geom_col(fill = "lavender", color = "black") +
theme_classic() +
scale_y_continuous(label = scales::comma) +
theme(axis.title = element_blank()) +
labs(title = "Simulated neucleotide frequencies under a uniform model") +
facet_wrap(~ sample) +
expand_limits(y = max(nuc_freq_df$n))
The y-axis limits were expanded here to cover the same range as that shown for the observed C. elegans nucleotide frequencies, to help make it easier to compare these plots with the plot of our observed data. These plots of data simulated under the null model do show some variation in frequencies among the nucleotides, but it’s certainly much less than in the observed data for C. elegans.
Next, I repeated this simulation process, but I increased the number of simulations to 1,000:
sim_nuc_freq_df <- rmultinom(n = 1000,
size = sum(nuc_freq_df$n),
prob = rep(1 / 4, 4)) %>%
as_tibble() %>%
mutate(nucleotide = c("A", "C", "G", "T")) %>%
pivot_longer(-nucleotide, names_to = "sample") %>%
mutate(sample = sample %>% str_remove("V") %>% as.numeric()) %>%
arrange(sample, nucleotide)
sim_nuc_freq_df %>%
slice(1:10)
## # A tibble: 10 x 3
## nucleotide sample value
## <chr> <dbl> <int>
## 1 A 1 3470
## 2 C 1 3392
## 3 G 1 3479
## 4 T 1 3453
## 5 A 2 3559
## 6 C 2 3364
## 7 G 2 3502
## 8 T 2 3369
## 9 A 3 3457
## 10 C 3 3471
Using this dataframe of simulations, we can measure the mean, minimum, and maximum frequencies of each nucleotide across all 1,000 simulations:
(sim_summary <- sim_nuc_freq_df %>%
group_by(nucleotide) %>%
summarize(mean_freq = mean(value),
min_freq = min(value),
max_freq = max(value)))
## # A tibble: 4 x 4
## nucleotide mean_freq min_freq max_freq
## <chr> <dbl> <int> <int>
## 1 A 3449. 3297 3586
## 2 C 3448. 3311 3646
## 3 G 3448. 3280 3639
## 4 T 3448. 3288 3599
To help compare this with the observed data, we can create a table with information from both the original data and the simulations under the null model:
nuc_freq_df %>%
left_join(sim_summary, by = "nucleotide") %>%
mutate_at(c("mean_freq", "min_freq", "max_freq", "n"),
prettyNum, big.mark = ",", digits = 0) %>%
mutate(simulations = paste0(mean_freq, " (", min_freq, ", ", max_freq, ")")) %>%
select(nucleotide, n, simulations) %>%
kable(col.names = c("Nucleotide",
"Frequency in C. elegans genome",
"Mean frequency (minimum frequency, maximum frequency) across 1,000 simulations"),
align = "c")
Nucleotide | Frequency in C. elegans genome | Mean frequency (minimum frequency, maximum frequency) across 1,000 simulations |
---|---|---|
A | 4,335 | 3,449 (3,297, 3,586) |
C | 1,225 | 3,448 (3,311, 3,646) |
G | 2,055 | 3,448 (3,280, 3,639) |
T | 6,179 | 3,448 (3,288, 3,599) |
This helps clarify how unusual the observed data would be under the null model—the counts of all four nucleotides in the C. elegans genome are completely outside the range of frequencies in the simulated data.
Another way to look at this is with histograms of the distribution of frequencies of each nucleotide under the null model compared to the observed frequencies in the C. elegans nucleotide:
ggplot(sim_nuc_freq_df, aes(x = value)) +
geom_histogram(binwidth = 10) +
facet_wrap(~ nucleotide) +
theme_classic() +
scale_x_continuous(name = "Frequency of nucleotide in the simulation under the null model",
labels = scales::comma) +
scale_y_continuous(name = "# of simulations (out of 1,000)") +
geom_vline(data = nuc_freq_df, aes(xintercept = n), color = "red") +
labs(title = expression(paste("Nucleotide frequency in ",
italic("C. elegans"),
" compared null model simulations")),
caption = "Red line shows the frequency observed for the nucleotide in C. elegans")
Finally, to help in answering this question, it would be interesting to look at a single measure for each simulation (and for the observed data) rather than comparing each nucleotide one at a time. Chapter 1 gives the equation for a statistic to measure variability in multinomial data by calculating the sum of squares for the differences between the observed and expected count of nucleotides for each of the four nucleotides in a sample (p. 12).
I calculated this statistic for the observed data and then for each of the 1,000 simulations.
(obs_stat <- nuc_freq_df %>%
mutate(expected = mean(n),
stat_input = (n - expected) ^ 2 / expected) %>%
summarize(variability_stat = sum(stat_input)))
## # A tibble: 1 x 1
## variability_stat
## <dbl>
## 1 4387.
sim_stat <- sim_nuc_freq_df %>%
mutate(expected = mean(value),
stat_input = (value - expected) ^ 2 / expected) %>%
group_by(sample) %>%
summarize(variability_stat = sum(stat_input))
sim_stat %>%
slice(1:5)
## # A tibble: 5 x 2
## sample variability_stat
## <dbl> <dbl>
## 1 1 1.34
## 2 2 8.27
## 3 3 0.762
## 4 4 1.00
## 5 5 0.811
Here is a plot of the distribution of this statistic across the 1,000 simulations:
ggplot(sim_stat, aes(x = variability_stat)) +
geom_rect(data = sim_stat, aes(xmin = quantile(variability_stat, prob = 0.025),
xmax = quantile(variability_stat, prob = 0.975),
ymin = 0, ymax = Inf),
fill = "beige", alpha = 0.5) +
geom_histogram(bins = 30, fill = "white", color = "tan", alpha = 0.5) +
theme_classic() +
labs(title = "Variability from expected values",
subtitle = "Values from simulations under the null",
x = "Value of variability statistic",
y = "Number of simulations with given value",
caption = "The shaded yellow area shows the region of the central 95% of\nstatistic values for the 1,000 simulations under the null model.")
The value of this statistic for the observed nucleotide frequencies for C. elegans is 4387, which is much larger (indicating greater variability from expected values under the null model) than the value observed under most of the simulations. It is, in fact, far outside the central 95% range of values observed in simulations.