-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathowl-monkey-figs.Rmd
706 lines (554 loc) · 36.2 KB
/
owl-monkey-figs.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
---
bibliography: refs/owl-monkey-md.xml
csl: refs/nature.csl
output:
html_document:
df_print: paged
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(knitr)
library(reshape2)
library(gridExtra)
library(cowplot)
library(kableExtra)
library(akima)
```
![](img/banner-img.png)
</br>
### Introduction to this document
This document is a brief walkthrough of the methods and code used to generate the figures in our owl monkey pedigree sequencing paper. In this paper, we sequenced 30 owl monkey (*Aotus nancymaae*) individuals within 6 multi-generation pedigrees (14 trios) and calculated the mutation rate per generation ($\mu_g$) for this species. We found that there is a similar paternal age effect in owl monkeys as in humans, and that the underlying mutational machinery doesn't seem to have changed much between owl monkeys, chimpanzees, and humans. Finally, we utilized a simple model of reproductive longevity to show that any observed variation in $\mu_g$ between these species can be explained by the varying age of puberty and average age at reproduction.
For full descriptions see the paper:
Thomas GWC, Wang RJ, Puri A, Harris RA, Raveendran M, Hughes DST, Murali SC, Williams LE, Doddapaneni H, Muzny DM, Gibbs RA, Abee CR, Galinski MR, Worley KC, Rogers J, Radivojac P, Hahn MW. 2018. Reproductive longevity predicts mutation rates in primates. *Current Biology*. 28(19):3193-3197. [10.1016/j.cub.2018.08.050](https://doi.org/10.1016/j.cub.2018.08.050)
To download this document, all accompanying code, and the data visit the following link: [Owl monkey github repository](https://github.com/gwct/owl-monkey)
---
### Figures 1 and S2
#### Owl monkey mutation rate
After filtering putative mutations (see Methods in paper) in the 14 owl monkey trios, we can calculate mutation rates ($\mu_g$) by taking into account false negative rates ($\alpha$) and the length of the callable genome ($C$) for each trio. Estimating a false negative rate allows us to correct the mutation rate for any true mutations that may have been removed by our filtering process. By using the distribution of allelic balance at all heterozygous SNP sites as an empirical CDF, we estimate $\alpha = 0.44$ while removing mutations with allelic balance less than 0.4 or or greater than 0.6 (henceforth noted as [0.4,0.6]; see FigS1 and Methods in paper). We estimate $C$ for each trio by counting the number of sites that pass the MV filters in each trio. Then, with the detected number of de novo mutations after filtering ($m_g$), the mutation rate for each trio is calculated as follows:
**Equation 1:** $$ \mu_g = \frac{m_g}{(1 - \alpha)*(2 * C)} $$
Mutation rates for all 14 trios were calculated in this fashion, with allelic balance cut-offs of [0.4,0.6] and the calculated value for $\alpha$. This information is available in **Table S3** of our paper.
```{r table1}
in_data = read.csv("data/owl-monkey-filters.csv", header=TRUE)
cur_filter = subset(in_data, Min.DP==20 & Max.DP==60 & Min.AB==0.4 & Max.AB==0.6)
# Read data and define current filter
callable_sites = cur_filter$Sites - cur_filter$Uncallable.SNP.sites
mu_calc = cur_filter$MVs / ((1 - cur_filter$Min.FN) * (2 * callable_sites))
cur_filter$CpG.rate = cur_filter$CpG.MVs / (2 * callable_sites)
mu_table = data.frame(cur_filter$Trio, cur_filter$MVs, cur_filter$CpG.MVs, callable_sites, mu_calc)
names(mu_table) = c("Trio", "Mutations", "CpG Mutations", "Callable sites", "Mutation rate")
# Re-calculate the mutation rate for posterity and subset the data for the table.
kable(mu_table, "html", caption="Table 1: Mutation rates for 14 owl monkey trios", digits=11, align=rep('c', 5)) %>%
kable_styling(bootstrap_options="striped", full_width=F)
```
We want to know if these mutation rates are correlated with father's age, similar to the paternal-age effect seen in humans. To answer this question, we perform a simple linear regression on mutation rates with father's age at birth of the offspring ($A_M$; purple dots and line). Additionally, we perform a similar analysis on non-replicative mutations and CpG sites (blue dots and line).
```{r regression, fig.width=6, fig.height=5, fig.align="center"}
om_obs_reg = lm(mu_calc ~ cur_filter$Paternal.GT)
fig1 = ggplot(cur_filter, aes(x=Paternal.GT, y=Mutation.rate.corrected.for.FN)) +
geom_smooth(method='glm', color='#b66dff', fill="#f2e6ff", fullrange=T) +
geom_point(color='#b66dff', size=3) +
geom_smooth(aes(x=Paternal.GT, y=CpG.rate), method='glm', color="#6db6ff", fill=NA, fullrange=T) +
geom_point(aes(x=Paternal.GT, y=CpG.rate), color="#6db6ff", size=3) +
scale_x_continuous(breaks = seq(1, 15, by=2)) +
scale_y_continuous(breaks = seq(0, 1.8e-8, by=0.4e-8)) +
scale_color_manual(breaks=c("All mutations (observed)", "CpG mutations (observed)"), values=c("All mutations (observed)"='#b66dff', "CpG mutations (observed)"="#6db6ff")) +
geom_vline(xintercept=1, linetype=3, color="grey", size=0.75) +
labs(x="Age of father at conception (y)",y="Mutation rate per site\nper generation") +
theme_classic() +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=20),
axis.title.y=element_text(margin=margin(t=0,r=20,b=0,l=0),color="black"),
axis.title.x=element_text(margin=margin(t=20,r=0,b=0,l=0),color="black"),
axis.line=element_line(colour='#595959',size=0.75),
axis.ticks=element_line(colour="#595959",size = 1),
axis.ticks.length=unit(0.2,"cm")
#legend.position=c(0.1,0.87),
#legend.text=element_text(size=18)
)
print(fig1)
```
This results in the following formula for a best fit line (purple):
**Equation 2:** $$ \mu_g = 3.74\text{e-}9 + (A_M * 6.62\text{e-}10) $$
This means that with an average genome size of $2.21$ billion base pairs that $`r signif((coef(om_obs_reg)[1] + (1 * coef(om_obs_reg)[2])) * (2 * mean(callable_sites)), digits=4)`$ mutations are accounted for by puberty at the age of 1, with an additional $`r signif(coef(om_obs_reg)[2] * (2 * mean(callable_sites)), digits=3)`$ mutations per year of the father's life after puberty. As expected, non-replicative mutations are independent of father's age (blue line).
#### Modeling mutation rates
Large-scale pedigree sequencing projects in humans [@kong12; @michaelson12; @besenbacher15; @francioli15; @besenbacher16; @girard16; @goldmann16; @rahbari16; @wong16; @jonsson17] have shown the importance of different life-stages in the determination of mutation rates. Models for predicting mutation rates generally account for the three important life stages in the mammalian germline [@segurel14; @thomas14]. These life stages are (1) female (F), (2) male before puberty (M0), (3) and male after puberty (M1). The relative contribution of each of these stages must be accounted for when estimating mutation rates per generation [@thomas14] or per year [@segurel14; @thomas14; @amster16]. Here, we re-frame this model in terms of reproductive longevity. Reproductive longevity depends on both the age of puberty in males ($P_M$) and the age of the male at conception of his offspring ($A_M$) and we find that it is the main determinant of mutation rate variation in primates. We define the value of reproductive longevity ($RL$) as:
**Equation 3:** $$ RL = (A_M - P_M) $$
$RL$ therefore measures the amount of time mutations have accumulated post-puberty in a male, which only occurs during stage M1.
To see how reproductive longevity affects the per-generation mutation rate, $\mu_g$, we must model the combined contribution from all life stages. In any given period of time, $t$, the mutation rate due to errors in DNA replication, $\mu_t$, is simply a product of the mutation rate per cell division, $\mu_c$, and the number of cell divisions that occur, $d_t$.
**Equation 4:** $$\mu_g = \mu_c * d_t $$
Since females (stage F) and pre-puberty males (stage M0) have a fixed number of cell divisions, their contribution to the mutation rate per-generation is constant and requires only the substitution of appropriate terms into equation 4.
**Equation 5:** Female contribution to $\mu_g$ $$ \mu_{gF} = \mu_c * d_F $$
**Equation 6:** Pre-puberty male contribution to $\mu_g$ $$ \mu_{gM0} = \mu_c * d_{M0} $$
However, in males after puberty (stage M1) the number of cell divisions is a linear function of time, and the mutation rate per-generation in this life stage therefore depends on the yearly rate of cell division ($d_{yM1}$) and reproductive longevity ($RL$).
**Equation 7:** Post-puberty male contriubtion to $\mu_g$ $$ \mu_{gM1} = \mu_c * d_{yM1} * RL $$
Finally, since an autosome will spend roughly half of its time in females and half in males, the mutation rate per generation ($\mu_g$) for a given species is the average of the male and female contributions.
**Equation 8:** $$ \mu_g = \frac{\mu_{gF} + (\mu_{gM0} + \mu_{gM1})}{2} $$
#### Estimating mutational parameters from humans
Empirical observations from developmental studies and large-scale pedigree data from humans inform us about some of the underlying mutational parameters of our model (equations 5, 6, and 7). For example, we use $31$ and $34$ as estimates for the number of cell divisions in human females ($d_F$) and males before puberty ($d_{M0}$) [@drost95]. We use $16$ days as the length of a single spermatogenic cycle ($t_{sc}$) [@heller63], which means we expect $d_{yM1} = 23$ spermatogenic cycles to occur in a year if all spermatagonial cells are constantly dividing (but see below).
The remaining parameter of the model, $\mu_c$, can be estimated from human pedigrees. We confirm the estimate of $\mu_c$ made by Amster and Sella [@amster16] by using the $\mu_{gF}$ observed in Kong et al. [@kong12] of $14.2$, the number of female germline divisions, and rearranging equation 5:
**Equation 9:** $$ \mu_c = \frac{14.2}{31} $$
```{r mu_c}
human_dm0 = 34; human_df = 31; human_num_f = 14.2; human_haploid = 2630000000;
# Observed human mutational parameters
human_mu_c = (human_num_f / human_df)
print(paste("mu_c estimated from Kong params: ", human_mu_c))
human_mu_c = human_mu_c / human_haploid
print(paste("mu_c per site estimated from Kong params: ", human_mu_c))
```
We assume this rate is the same between females, males before puberty, and males after puberty (though see Methods of paper for discussion on this point). Therefore, to accomodate the observed $2.01$ mutations per year of the father ($\mu_{yM1}$) [@kong12], we must use the calculated $\mu_c$ to make adjustments to the expected number of spermatogenic cycles. This stems from the observation that not all spermatagonial cells may be constantly dividing [@plant10; @scally16b] (see Methods of paper for full explanation).
**Equation 10:** $$ dy_{M1}^{Human} = \frac{\mu_{yM1}}{\mu_c} \ = \frac{2.01}{0.458} $$
```{r human_spermatogenesis}
human_sc = 16
human_dy1 = ((2.01 / human_haploid) / human_mu_c)
human_sp = human_dy1 / (365/16)
print(paste("Human expected # of spermatogenic cell divisions/year: ", human_dy1))
print(paste("Human proportion of cells actively dividing given a cycle length of 16 days: ", human_sp))
```
It has been observed the New World monkeys have a lower rate of spermatogenesis than humans [@derooij86], with $t_{sc}^{Owl monkey} = 10.2$ days, approximately. We use the expected proportion of cells actively dividing in humans ($p_{sc}^{Human} = 0.19$) to estimate the expected number of spermatogenic cycles per year in owl monkeys.
**Equation 11:** $$ dy_{M1}^{Owl monkey} = (\frac{365}{t_{sc}^{Owl monkey}}) * p_{sc}^{Human} $$
```{r om_spermatogenesis}
om_sc = 10.2
om_dy1 = (365/om_sc) * human_sp
print(paste("Owl monkey expected # of spermatogenic cell divisions/year: ", om_dy1))
```
#### The mutational model fits observed owl monkey data
We can use the mutational parameters estimated from human data above, along with $t_{sc}^{Owl monkey}$ and an age of puberty of $1$ year [@dixson80] to predict mutation rates for a range of paternal ages (dashed purple line). What follows is an exact replication of **Figure 1** in our paper.
```{r fig1, fig.width=6, fig.height=5, fig.align="center"}
om_puberty = 1
age_range = seq(om_puberty, 15, by=0.1)
rep_long = age_range - om_puberty
# The owl monkey age of puberty set at 1 year and a range of paternal ages
mu_gf = human_mu_c * human_df
# Equation 5
mu_gm0 = human_mu_c * human_dm0
# Equation 6
mu_gm1 = human_mu_c * om_dy1 * rep_long
# Equation 7
mu_g = (mu_gf + (mu_gm0 + mu_gm1)) / 2
# Equation 8
om_pred = data.frame(mu_g, age_range)
names(om_pred) = c("Mutation.rate", "Parental.age")
om_pred_reg = lm(mu_g ~ age_range)
fig1_final = fig1 + geom_line(data=om_pred, aes(x=Parental.age, y=Mutation.rate), linetype=2, size=0.75, color="#b66dff")
print(fig1_final)
```
The fit of our model predicts $`r signif(coef(om_pred_reg)[2] * (2 * 2630000000), digits=3)`$ mutations per year of the father after puberty, compared to the observed $2.92$. An F-test on the residuals of the observed fit (solid purple line) and predicted fit (dashed purple line) shows that they are not significantly different.
```{r fig1-ftest}
cur_mu = mu_calc
cur_gt = cur_filter$Paternal.GT
#plot(cur_gt,cur_mu,ylim=c(min(cur_mu),max(cur_mu)))
#abline(om_obs_reg)
#segments(x0=cur_gt,y0=cur_mu,x1=cur_gt,y1=cur_mu-resid(om_obs_reg))
pred_actual = coef(om_pred_reg)[1] + cur_gt * coef(om_pred_reg)[2]
pred_resids = cur_mu - pred_actual
#abline(om_pred_reg, lty=2)
#segments(x0=cur_gt,y0=cur_mu,x1=cur_gt,y1=cur_mu-pred_resids, lty=2)
# This is the F-test for differences between the two lines in Fig.1
print(var.test(resid(om_obs_reg), pred_resids, alternative="two.sided"))
```
#### Changing the allelic balance filter has little effect on the fit of the model
By using a less stringent allelic balance filter of [0.3,0.7], we do reduce the false negative rate to $\alpha = 0.29$, however we find that this has little effect on our fit. Though we observe $2.63$ mutations per year of the father, a slightly lower estimate than with the more stringent filter. What follows is an exact replication of **Figure S2**.
```{r figS2, fig.width=6, fig.height=5, fig.align="center"}
cur_filter_s2 = subset(in_data, Min.DP==20 & Max.DP==60 & Min.AB==0.3 & Max.AB==0.7)
# Read data and define current filter
callable_sites_s2 = cur_filter_s2$Sites - cur_filter_s2$Uncallable.SNP.sites
mu_calc_s2 = cur_filter$MVs / ((1 - cur_filter_s2$Min.FN) * (2 * callable_sites_s2))
cur_filter_s2$CpG.rate = cur_filter_s2$CpG.MVs / (2 * callable_sites_s2)
figS2 = ggplot(cur_filter_s2, aes(x=Paternal.GT, y=Mutation.rate.corrected.for.FN)) +
geom_smooth(method='glm', color='#b66dff', fill="#f2e6ff", fullrange=T) +
geom_point(color='#b66dff', size=3) +
geom_smooth(aes(x=Paternal.GT, y=CpG.rate), method='glm', color="#6db6ff", fill=NA, fullrange=T) +
geom_point(aes(x=Paternal.GT, y=CpG.rate), color="#6db6ff", size=3) +
geom_line(data=om_pred, aes(x=Parental.age, y=Mutation.rate), linetype=2, size=0.75, color="#b66dff") +
scale_x_continuous(breaks = seq(1, 15, by=2)) +
scale_y_continuous(breaks = seq(0, 1.8e-8, by=0.4e-8)) +
scale_color_manual(breaks=c("All mutations (observed)", "CpG mutations (observed)"), values=c("All mutations (observed)"='#b66dff', "CpG mutations (observed)"="#6db6ff")) +
geom_vline(xintercept=1, linetype=3, color="grey", size=0.75) +
labs(x="Age of father at conception (y)",y="Mutation rate per site\nper generation") +
theme_classic() +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=20),
axis.title.y=element_text(margin=margin(t=0,r=20,b=0,l=0),color="black"),
axis.title.x=element_text(margin=margin(t=20,r=0,b=0,l=0),color="black"),
axis.line=element_line(colour='#595959',size=0.75),
axis.ticks=element_line(colour="#595959",size = 1),
axis.ticks.length=unit(0.2,"cm")
#legend.position=c(0.1,0.87),
#legend.text=element_text(size=18)
)
print(figS2)
om_obs_reg = lm(mu_calc_s2 ~ cur_filter_s2$Paternal.GT)
cur_mu = mu_calc_s2
cur_gt = cur_filter_s2$Paternal.GT
#plot(cur_gt,cur_mu,ylim=c(min(cur_mu),max(cur_mu)))
#abline(om_obs_reg)
#segments(x0=cur_gt,y0=cur_mu,x1=cur_gt,y1=cur_mu-resid(om_obs_reg))
pred_actual = coef(om_pred_reg)[1] + cur_gt * coef(om_pred_reg)[2]
pred_resids = cur_mu - pred_actual
#abline(om_pred_reg, lty=2)
#segments(x0=cur_gt,y0=cur_mu,x1=cur_gt,y1=cur_mu-pred_resids, lty=2)
# This is the F-test for differences between the two lines in Fig.1
print(var.test(resid(om_obs_reg), pred_resids, alternative="two.sided"))
```
---
### Figure 2
#### Primate mutational spectra
Based on the current study in owl monkeys, four studies in humans [@kong12; @goldmann16; @besenbacher16; @rahbari16], and one study in chimpanzees [@venn14], we can compare the mutational spectra across primates. What follows is **Figure 2** from the paper.
```{r fig2, fig.width=8, fig.height=6, fig.align="center", message=FALSE}
spectra_data = read.csv("data/mutational-spectra.csv", header=TRUE)
om_spec = data.frame(spectra_data$Mutation, spectra_data$Owl.monkey.CpG.proportion, spectra_data$Owl.monkey.proportion)
names(om_spec) = c("Type", "CpG", "Non-CpG")
om_spec_m = melt(om_spec)
fig2_om = ggplot(om_spec_m, aes(Type, value, fill = variable)) +
geom_bar(stat = "identity") +
scale_fill_manual("legend", values = c("Non-CpG"="#006ddb","CpG"="#db6d00")) +
ggtitle("Owl monkey") +
labs(x="",y="Proportion of mutations") +
theme_classic() +
theme(axis.text=element_text(size=14),
axis.title.y=element_text(size=14),
axis.line=element_line(colour='#595959',size=0.75),
axis.ticks=element_line(colour="#595959",size = 1),
axis.ticks.length=unit(0.2,"cm"),
legend.title=element_blank(),
legend.position=c(0.1,0.87)
)
human_spec = data.frame(spectra_data$Mutation, spectra_data$Human.average.proportion)
names(human_spec) = c("Type", "Human average")
human_spec_m = melt(human_spec)
fig2_human = ggplot(human_spec_m, aes(Type, value, fill = variable)) +
geom_bar(stat = "identity") +
scale_fill_manual("legend", values = c("Human average" = "#920000")) +
ggtitle("Human") +
labs(x="",y="Proportion of mutations") +
theme_classic() +
theme(axis.text=element_text(size=14),
axis.title.y=element_text(size=14),
axis.line=element_line(colour='#595959',size=0.75),
axis.ticks=element_line(colour="#595959",size = 1),
axis.ticks.length=unit(0.2,"cm"),
legend.position="none"
)
ch_spec = data.frame(spectra_data$Mutation, spectra_data$Chimp.proportion)
names(ch_spec) = c("Type", "Proportion")
ch_spec_m = melt(ch_spec)
fig2_chimp = ggplot(ch_spec_m, aes(Type, value, fill = variable)) +
geom_bar(stat = "identity") +
scale_fill_manual("legend", values = c("Proportion" = "#920000")) +
ggtitle("Chimpanzee") +
labs(x="",y="") +
theme_classic() +
theme(axis.text=element_text(size=14),
axis.title.y=element_text(size=14),
axis.line=element_line(colour='#595959',size=0.75),
axis.ticks=element_line(colour="#595959",size = 1),
axis.ticks.length=unit(0.2,"cm"),
legend.position="none"
)
g1 = arrangeGrob(fig2_human,fig2_chimp,nrow=1)
fig2 = plot_grid(fig2_om,g1, nrow=2, rel_heights=c(2, 1))
print(fig2)
```
We observe strikingly similar mutational spectra between primates. The only notable difference is a small, but significant increase in the number of A>T mutations in owl monkeys compared to humans. However, once the A>T class is removed, there is no significant difference between these species. This means that we see no evidence that the mutational machinery has changed drastically between these species.
```{r chisq}
print("--Chi-squared test using all mutation categories--")
all_chi = chisq.test(spectra_data$Owl.monkey+spectra_data$Owl.monkey.CpG, p=spectra_data$Human.average.proportion)
print(all_chi)
print("--Chi-squared test without the A>T category--")
human_noAT = spectra_data$Human.average.proportion[spectra_data$Mutation!="A>T"]
human_noAT = human_noAT / sum(human_noAT)
om_noAT = spectra_data$Owl.monkey[spectra_data$Mutation!="A>T"] + spectra_data$Owl.monkey.CpG[spectra_data$Mutation!="A>T"]
noAT_chi = chisq.test(om_noAT, p=human_noAT)
print(noAT_chi)
```
---
### Figures 3 and S3
#### Modeling mutation functions in primates using human parameters
We then wanted to know how well the model fits other species with minimal changes to the underlying parameters. We used all parameters estimated from human data above ($d_F$, $d_{M0}$, $d_{yM1}$ $\mu_c$) and varied only the age of puberty ($P_M$) to predict linear mutation functions for owl monkey and chimpanzee. For owl monkey we use $P_M = 1$ year [@dixson80], for chimp we used $P_M = 7.5$ years [@marson91], and for humans we set $P_M = 13.4$ years [@nielsen86]. We then predicted mutation rates based on varying the duration of reproductive longevity by changing the male age of reproduction (lines). We compared these predictions to direct observations of mutation rates from studies done in humans [@kong12; @michaelson12; @besenbacher15; @rahbari16; @wong16; @girard16; @besenbacher16; @jonsson17], chimpanzees [@venn14; @tatsumoto17], and owl monkeys (points). What follows is an exact replication of **Figure 3** from the paper.
```{r fig3, fig.width=10, fig.height=6, fig.align="center", message=FALSE, warning=FALSE}
human_puberty = 13.4
human_range = seq(human_puberty, 50.4, by=0.1)
human_rl = human_range - human_puberty
mu_gf = human_mu_c * human_df
# Equation 5
mu_gm0 = human_mu_c * human_dm0
# Equation 6
mu_gm1 = human_mu_c * human_dy1 * human_rl
# Equation 7
human_mu_g = (mu_gf + (mu_gm0 + mu_gm1)) / 2
# Equation 8
human_pred = data.frame(human_mu_g, human_range)
names(human_pred) = c("Mutation.rate", "Parental.age")
# The human predictions
#####
om_puberty = 1
om_range = seq(om_puberty, 16, by=0.1)
om_rl = om_range - om_puberty
mu_gf = human_mu_c * human_df
# Equation 5
mu_gm0 = human_mu_c * human_dm0
# Equation 6
mu_gm1 = human_mu_c * human_dy1 * om_rl
# Equation 7
om_mu_g = (mu_gf + (mu_gm0 + mu_gm1)) / 2
# Equation 8
om_pred = data.frame(om_mu_g, om_range)
names(om_pred) = c("Mutation.rate", "Parental.age")
# The owl monkey predictions
#####
chimp_puberty = 7.5
chimp_range = seq(chimp_puberty, 35.5, by=0.1)
chimp_rl = chimp_range - chimp_puberty
mu_gf = human_mu_c * human_df
# Equation 5
mu_gm0 = human_mu_c * human_dm0
# Equation 6
mu_gm1 = human_mu_c * human_dy1 * chimp_rl
# Equation 7
chimp_mu_g = (mu_gf + (mu_gm0 + mu_gm1)) / 2
# Equation 8
chimp_pred = data.frame(chimp_mu_g, chimp_range)
names(chimp_pred) = c("Mutation.rate", "Parental.age")
# The chimpanzee predictions
#####
study_data = read.csv("data/mutation-studies.csv", header=TRUE)
# Load the study data
fig3 = ggplot(study_data, aes(x=Paternal.age, y=Mutation.rate, color=Species)) +
geom_line(data=human_pred, aes(x=Parental.age, y=Mutation.rate), color='#db6d00', size=1) +
geom_segment(aes(x=human_puberty,y=human_mu_g[1]-human_mu_g[1]/10,xend=human_puberty,yend=human_mu_g[1]+human_mu_g[1]/10), linetype=1, color="#db6d00", size=0.5) +
geom_line(data=om_pred, aes(x=Parental.age, y=Mutation.rate), color='#b66dff', size=1) +
geom_segment(aes(x=om_puberty,y=om_mu_g[1]-om_mu_g[1]/10,xend=om_puberty,yend=om_mu_g[1]+om_mu_g[1]/10), linetype=1, color="#b66dff", size=0.5) +
geom_line(data=chimp_pred, aes(x=Parental.age, y=Mutation.rate), color='#920000', size=1) +
geom_segment(aes(x=chimp_puberty,y=chimp_mu_g[1]-chimp_mu_g[1]/10,xend=chimp_puberty,yend=chimp_mu_g[1]+chimp_mu_g[1]/10), linetype=1, color="#920000", size=0.5) +
geom_point(size=3) +
scale_x_continuous(limits=c(0,50), breaks = seq(1, 50, by=5)) +
labs(x="Paternal age (y)",y="Mutation rate per site\nper generation") +
theme_classic() +
guides(colour = guide_legend(override.aes = list(size=3))) +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=20),
axis.title.y=element_text(margin=margin(t=0,r=20,b=0,l=0),color="black"),
axis.title.x=element_text(margin=margin(t=20,r=0,b=0,l=0),color="black"),
axis.line=element_line(colour='#595959',size=0.75),
axis.ticks=element_line(colour="#595959",size = 1),
axis.ticks.length=unit(0.2,"cm"),
legend.title=element_blank(),
legend.position=c(0.13,0.87),
legend.text=element_text(size=18)
) +
scale_color_manual(breaks=c("Owl monkey", "Chimpanzee", "Human"), values=c("Human"='#db6d00',"Owl monkey"='#b66dff',"Chimpanzee"='#920000'))
print(fig3)
```
Astoundingly, our predictions with this simple model of reproductive longevity (lines) fit the observed data extremely well (points). In fact, for the human data, the prediction is not significantly different than the best fit line to the study data.
```{r fig3-ftest}
# This is the F-test for differences between the predicted and observed human lines in Fig.3
human_data = data.frame(study_data$Paternal.age[study_data$Species=="Human"], study_data$Mutation.rate[study_data$Species=="Human"])
names(human_data) = c("Age","Mu")
human_data = na.omit(human_data)
human_study_reg = lm(human_data$Mu ~ human_data$Age)
human_pred_reg = lm(human_pred$Mutation.rate ~ human_pred$Parental.age)
pred_actual = coef(human_pred_reg)[1] + human_data$Age * coef(human_pred_reg)[2]
pred_resids = human_data$Mu - pred_actual
#plot(human_pred$Parental.age,human_pred$Mutation.rate, type='l',ylim=c(0.8e-8,1.8e-8), xlim=c(13.4,50))
#points(human_data$Age, human_data$Mu)
#segments(x0=human_data$Age,y0=human_data$Mu,x1=human_data$Age,y1=human_data$Mu-pred_resids)
#abline(human_study_reg, lty=2)
#segments(x0=human_data$Age,y0=human_data$Mu,x1=human_data$Age,y1=human_data$Mu-resid(human_study_reg),lty=2)
print(var.test(resid(human_study_reg), pred_resids, alternative="two.sided"))
```
#### Using species specific rates of spermatogenesis does not affect the fit of our model
Like we did in Figure 1, we can adjust the slope of the line ($d_{yM1}$) by using species specific rates of spermatogenesis ($t_{sc}$). We've previously calculated this for humans and owl monkeys. For chimpanzees, we use set $t_{sc} = 14$ days [@smithwick96].
```{r chimp_sc}
chimp_sc = 14
chimp_dy1 = (365/chimp_sc) * human_sp
print(paste("Chimpanzee expected # of spermatogenic cell divisions/year: ", chimp_dy1))
```
And adjusting these slopes results in no significant change from the fit of our model. What follows is an exact replication of **Figure S3** in our paper.
```{r figS3, fig.width=10, fig.height=6, fig.align="center", message=FALSE, warning=FALSE}
#####
om_puberty = 1
om_range = seq(om_puberty, 16, by=0.1)
om_rl = om_range - om_puberty
mu_gf = human_mu_c * human_df
# Equation 5
mu_gm0 = human_mu_c * human_dm0
# Equation 6
mu_gm1 = human_mu_c * om_dy1 * om_rl
# Equation 7
om_mu_g = (mu_gf + (mu_gm0 + mu_gm1)) / 2
# Equation 8
om_pred = data.frame(om_mu_g, om_range)
names(om_pred) = c("Mutation.rate", "Parental.age")
# The owl monkey predictions
#####
chimp_puberty = 7.5
chimp_range = seq(chimp_puberty, 35.5, by=0.1)
chimp_rl = chimp_range - chimp_puberty
mu_gf = human_mu_c * human_df
# Equation 5
mu_gm0 = human_mu_c * human_dm0
# Equation 6
mu_gm1 = human_mu_c * chimp_dy1 * chimp_rl
# Equation 7
chimp_mu_g = (mu_gf + (mu_gm0 + mu_gm1)) / 2
# Equation 8
chimp_pred = data.frame(chimp_mu_g, chimp_range)
names(chimp_pred) = c("Mutation.rate", "Parental.age")
# The chimpanzee predictions
#####
study_data = read.csv("data/mutation-studies.csv", header=TRUE)
# Load the study data
figS3 = ggplot(study_data, aes(x=Paternal.age, y=Mutation.rate, color=Species)) +
geom_line(data=human_pred, aes(x=Parental.age, y=Mutation.rate), color='#db6d00', size=1) +
geom_segment(aes(x=human_puberty,y=human_mu_g[1]-human_mu_g[1]/10,xend=human_puberty,yend=human_mu_g[1]+human_mu_g[1]/10), linetype=1, color="#db6d00", size=0.5) +
geom_line(data=om_pred, aes(x=Parental.age, y=Mutation.rate), color='#b66dff', size=1) +
geom_segment(aes(x=om_puberty,y=om_mu_g[1]-om_mu_g[1]/10,xend=om_puberty,yend=om_mu_g[1]+om_mu_g[1]/10), linetype=1, color="#b66dff", size=0.5) +
geom_line(data=chimp_pred, aes(x=Parental.age, y=Mutation.rate), color='#920000', size=1) +
geom_segment(aes(x=chimp_puberty,y=chimp_mu_g[1]-chimp_mu_g[1]/10,xend=chimp_puberty,yend=chimp_mu_g[1]+chimp_mu_g[1]/10), linetype=1, color="#920000", size=0.5) +
geom_point(size=3) +
scale_x_continuous(limits=c(0,50), breaks = seq(1, 50, by=5)) +
labs(x="Paternal age (y)",y="Mutation rate per site\nper generation") +
theme_classic() +
guides(colour = guide_legend(override.aes = list(size=3))) +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=20),
axis.title.y=element_text(margin=margin(t=0,r=20,b=0,l=0),color="black"),
axis.title.x=element_text(margin=margin(t=20,r=0,b=0,l=0),color="black"),
axis.line=element_line(colour='#595959',size=0.75),
axis.ticks=element_line(colour="#595959",size = 1),
axis.ticks.length=unit(0.2,"cm"),
legend.title=element_blank(),
legend.position=c(0.13,0.87),
legend.text=element_text(size=18)
) +
scale_color_manual(breaks=c("Owl monkey", "Chimpanzee", "Human"), values=c("Human"='#db6d00',"Owl monkey"='#b66dff',"Chimpanzee"='#920000'))
print(figS3)
# This is the F-test for differences between the predicted and observed human lines in Fig.3
human_data = data.frame(study_data$Paternal.age[study_data$Species=="Human"], study_data$Mutation.rate[study_data$Species=="Human"])
names(human_data) = c("Age","Mu")
human_data = na.omit(human_data)
human_study_reg = lm(human_data$Mu ~ human_data$Age)
human_pred_reg = lm(human_pred$Mutation.rate ~ human_pred$Parental.age)
pred_actual = coef(human_pred_reg)[1] + human_data$Age * coef(human_pred_reg)[2]
pred_resids = human_data$Mu - pred_actual
#plot(human_pred$Parental.age,human_pred$Mutation.rate, type='l',ylim=c(0.8e-8,1.8e-8), xlim=c(13.4,50))
#points(human_data$Age, human_data$Mu)
#segments(x0=human_data$Age,y0=human_data$Mu,x1=human_data$Age,y1=human_data$Mu-pred_resids)
#abline(human_study_reg, lty=2)
#segments(x0=human_data$Age,y0=human_data$Mu,x1=human_data$Age,y1=human_data$Mu-resid(human_study_reg),lty=2)
print(var.test(resid(human_study_reg), pred_resids, alternative="two.sided"))
```
---
### Figure S4
#### Modeling mutation rates per year
To compare rates between species, mutation rates should be converted to an absolute time-scale: the mutation rate per year ($\mu_y$). This is useful when comparing long-term substitution rates ($k$) between species since the neutral mutation rate and the neutral substitution rate are theoretically equivalent ($\mu_g = k$). To calculate $\mu_y$ we again average the mutational contributions from the three germline stages, but weight the average by the duration of time spent in each period: the duration of time spent in females ($F$), the duration of time spent in males before puberty ($M_0$), and the duration of time spent in males after puberty, previously defined as reproductive longevity ($RL$).
**Equation 12:** $$ \mu_y = \frac{\mu_{gF} + \mu_{gM0} + \mu_{gM1}}{F + M_0 + RL} $$
Using a range of female and male ages at reproduction and male ages of puberty, we can predict a parameter space for $\mu_y$. What follows is **Figure S4** from the paper.
```{r figS4, fig.width=7, fig.height=6, fig.align="center", message=FALSE, warning=FALSE}
colfunc <- colorRampPalette(c("#009292", "#920000"))
rt = c();nrt = c();
for(i in seq(5, 51, 1))
{
for(j in seq(5, 51, 1))
{
rt = c(rt, i)
nrt = c(nrt, j)
}
}
mu_y = (((human_df * human_mu_c) + (human_dm0 * human_mu_c) + (human_dy1 * rt * human_mu_c)) / (rt + nrt));
# Calculate mu_y for a range of life history values
par(mar=c(5,5,4,2))
pred_data = interp(rt,nrt,mu_y)
filled.contour(pred_data, xlab='RL (years)', ylab=expression('F + M'[0]*' (years)'), cex.lab=2, col=colfunc(26),
key.title = {par(cex.main=2);title(main=expression(mu[y]))},
plot.axes = {
contour(pred_data, drawlabels=T, axes=F, frame.plot=F, add=T, col="#999999", lwd=1.5, lty=3, labcex=1);
axis(1);
axis(2);
points(16.3,43.1,pch=21,bg='#db6d00',col="white",cex=2)
# Kong human point
points(5.64,7.53,pch=21,bg='#b66dff',col="white",cex=2)
# Our owl monkey point
points(16.8,33.8,pch=21,bg='#920000',col="white",cex=2)
# Venn chimp point
})
human_mu_y = (((human_df * human_mu_c) + (human_dm0 * human_mu_c) + (human_dy1 * 16.3 * human_mu_c)) / (29.7 + 13.4 + 16.3));
# Human mutation rate per year (Equation 12)
om_mu_y = (((human_df * human_mu_c) + (human_dm0 * human_mu_c) + (human_dy1 * 5.64 * human_mu_c)) / (6.53 + 1 + 5.64));
# Owl monkey mutation rate per year (Equation 12)
chimp_mu_y = (((human_df * human_mu_c) + (human_dm0 * human_mu_c) + (human_dy1 * 16.8 * human_mu_c)) / (26.3 + 7.5 + 16.8));
# Chimp mutation rate per year (Equation 12)
```
Again, using the underlying mutational parameters predicted from humans, we can predict $\mu_y$ for humans (orange dot), owl monkeys (purple dot), and chimpanzees (red dot) by varying the ages of reproduction and puberty in these species. What jumps out is that $\mu_y$ is not solely dependent on $RL$, but also on the duration of the other life periods. Increasing the age of concesption for females ($F$) tends to decrease $\mu_y$. Increasing the age of puberty in males increases the duration of $M_0$ and also tends to decrease $\mu_y$, though not as drastically. However, increasing $RL$ by either decreasing age of puberty or increasing age of reproduction in males has more complicated effects. At low $F + M_0$, increasing $RL$ tends to decrease $\mu_y$, while at high $F + M_0$ increaseing $RL$ tends to increase $\mu_y$.
---
### Appendix: Table of parameter definitions
```{r table2}
in_data = read.csv("data/owl-monkey-filters.csv", header=TRUE)
cur_filter = subset(in_data, Min.DP==20 & Max.DP==60 & Min.AB==0.4 & Max.AB==0.6)
# Read data and define current filter
symbols = c(
"$\\mu_g$",
"$\\alpha$",
"$C$",
"$m_g$",
"$A_M$",
"$P_M$",
"$RL$",
"$\\mu_{gF}$",
"$d_F$",
"$\\mu_{gM0}$",
"$d_{M0}$",
"$d_{yM1}$",
"$\\mu_c$",
"$t_{sc}$",
"$\\mu_{yM1}$",
"$p_{sc}$",
"$\\mu_y$"
)
defs = c(
"Mutation rate per site per generation",
"False negative rate",
"Callable sites",
"# of mutations per generation",
"Age of male at birth of offspring",
"Age of male at puberty (years)",
"Reproductive longevity",
"Mutation rate per site per generation in females",
"Number of cell divisions in females",
"Mutation rate per site per generation in males before puberty",
"Number of cell divisions in males before puberty",
"Number of cell divisions per year in males after puberty",
"Mutation rate per site per cell division",
"Spermatogenic cycle length (days)",
"Mutation rate per site per year in males after puberty",
"Proportion of spermatagonial cells actively dividing",
"Mutation rate per site per year"
)
ests = c(
paste("Owl monkey: ", signif(mean(mu_calc), digits=3)),
paste("Owl monkey with allelic balance cut-offs [0.4,0.6]: ", signif(cur_filter$Min.FN[1], digits=3)),
paste("Owl monkey average: ", floor(mean(callable_sites))),
paste("Owl monkey average: ", floor(mean(cur_filter$MVs))),
"NA",
"Owl monkey: 1 ^[^ [@dixson80]^]^, Human: 13.4 ^[^ [@nielsen86]^]^, Chimpanzees: 7.5 ^[^ [@marson91]^]^",
"NA",
paste("Human: ", signif(human_num_f/human_haploid, digits=3), "^[^ [@kong12]^]^"),
paste("Human: ", human_df, "^[^ [@drost95]^]^"),
paste("Human: ", signif(human_mu_c * human_dm0, digits=3)),
paste("Human: ", human_dm0, "^[^ [@drost95]^]^"),
paste("Owl monkey: ", signif(om_dy1, digits=3), ", Human: ", signif(human_dy1, digits=3), ", Chimpanzee: ", signif(chimp_dy1, digits=3)),
signif(human_mu_c, digits=3),
paste("Owl monkey: ", om_sc, "^[^ [@derooij86]^]^, Human: ", human_sc, "^[^ [@heller63]^]^, Chimpanzee: ", chimp_sc, "^[^ [@smithwick96]^]^"),
paste("Owl monkey: ", signif(coef(om_pred_reg)[2], digits=3)),
paste("Human: ", signif(human_sp, digits=3)),
paste("Owl monkey: ", signif(om_mu_y, digits=3), ", Human: ", signif(human_mu_y, digits=3), ", Chimpanzee: ", signif(chimp_mu_y, digits=3))
)
param_table = data.frame(symbols, defs, ests)
names(param_table) = c("Symbol", "Definition", "Estimate")
# Re-calculate the mutation rate for posterity and subset the data for the table.
kable(param_table, "html", caption="Table 2: Parameter symbols, definitions, and estimates", digits=11) %>%
kable_styling(bootstrap_options="striped", full_width=F)
```
---
### References