-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path07_adaptation_idx.Rmd
192 lines (162 loc) · 9.13 KB
/
07_adaptation_idx.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
<link href="http://kevinburke.bitbucket.org/markdowncss/markdown.css" rel="stylesheet"></link>
``` {r setup, include=FALSE}
opts_chunk$set(warning = FALSE, message = FALSE)
require(ggplot2)
require(plyr)
require(reshape)
require(xtable)
require(MASS)
require(car)
require(plotrix)
source('functions.R')
load(file=paste(getwd(),"/rdata/05.Rdata", sep=''))
```
# Step 07 - Looking at Codon and tRNA Adaptation Indices
## TAI Calcluation
I'm using [codonR](http://people.cryst.bbk.ac.uk/~fdosr01/tAI/codonR.tar.gz) to calculate tRNA adaptation for my constructs. I'm getting this first set of commands from `codonR/README`.
I first had to run codonM on the coding sequence only, in perl. I also ran the same codonM script on the E. coli genome, for comparison.
```console
$ perl codonR/codonM \
<(perl -pe '/^[ATGCatgc]/ && s/.*([ATGC]{33})/$1/' \
data/203.norestrict.fa) \
data/203.codonR.m
$ perl codonR/codonM codonR/ecolik12.ffn codonR/ecolik12.m
```
```{r 07.00-calc_tAI, cache=T}
source("codonR/tAI.R")
eco.trna <- scan("codonR/ecolik12.trna")
eco.ws <- get.ws(tRNA=eco.trna, sking=1)
codonR.m <- matrix(scan("data/203.codonR.m"), ncol=61, byrow=T)
codonR.m <- codonR.m[,-33]
tai <- get.tai(codonR.m, eco.ws)
#merge the tai scores with the gene names, and finally with NGS
names <- system('perl -ne "/>/ && s/>// && print" data/203.norestrict.fa',
intern=T)
ngs <- merge(ngs, data.frame(Name=names,tAI=as.numeric(tai)), by="Name")
```
OK, so now we can look at the correlation of tAI with expression level:
```{r 07.01-plot_tAI_corr, fig.width=9, fig.height=14}
multiplot(
ggplot(melt(subset(ngs, Promoter=='BBaJ23100'),
measure.vars= c('RNA','Count.DNA','Prot','Trans')),
aes(x=tAI, color=RBS, y=value)) +
geom_point(alpha=0.05) + theme_bw() + stat_smooth(method=lm, se=F) +
facet_wrap(RBS~variable, scale='free') +
scale_x_continuous(name="Secondary Structure Free Energy") +
scale_y_log10("Log10 of Dependent Variable") +
opts(title="tRNA adaptation correlations (Strong Promoter)"),
ggplot(melt(subset(ngs, Promoter=='BBaJ23108'),
measure.vars= c('RNA','Count.DNA','Prot','Trans')),
aes(x=tAI, color=RBS, y=value)) +
geom_point(alpha=0.05) + theme_bw() + stat_smooth(method=lm, se=F) +
facet_wrap(RBS~variable, scale='free') +
scale_x_continuous(name="Secondary Structure Free Energy") +
scale_y_log10("Log10 of Dependent Variable") +
opts(title="tRNA adaptation correlations (Weak Promoter)"),
cols=1)
```
I checked a few things to see if I did this right, because I expected to see a much larger effect. Maybe I should compare this by gene as well, so I will perform an ANOVA.
```{r}
tai.lm <- lm(log10(Prot) ~ Promoter + RBS + Gene*tAI, data=ngs)
tai.aov <- Anova(tai.lm)
data.frame("Effect"=rownames(tai.aov), "Pct.Explained"=tai.aov$"Sum Sq"/sum(tai.aov$"Sum Sq"))
```
The effect is very very small; tAI seems to account for only about 1% of variance. I remember this being much larger! To shake my concerns about the data being calculated incorrectly (either by codonM, or by me), I will plot the data like I did earlier, in a large grid by gene, cds type, promoter, and RBS.
```{r 07.02-plot_prot_grid, fig.width=12, fig.height=14}
ggplot(ngs, aes(x = reorder(Gene, log10(Prot), mean), y = CDS.type)) +
geom_tile(aes(fill = rescale(log10(Prot), c(-1, 1) ))) +
opts(panel.background = theme_rect(fill = "gray80"),
axis.ticks = theme_blank()) +
scale_fill_gradient2(low="darkred", mid="yellow", high="darkgreen") +
facet_grid(RBS ~ Promoter) +
opts(plot.title = theme_text(size=14, lineheight=.8, face="bold"),
legend.position=FALSE, axis.text.x=theme_text(angle=-90))
```
The effect is there and definitely quite strong. We can check and see if TAI is correctly measured:
```{r 07.03-plot_prot_grid, fig.width=12, fig.height=9, cache=T}
ngs$Gene <- with(ngs, reorder(Gene, log10(Prot), mean))
ggplot(subset(ngs, Promoter=='BBaJ23100' & RBS=='BB0032'),
aes(x = Gene, y = CDS.type)) +
geom_tile(aes(fill = tAI)) +
opts(panel.background = theme_rect(fill = "gray80"),
axis.ticks = theme_blank()) +
scale_fill_gradient2(low="blue", mid="beige", high="red",
midpoint=mean(ngs$tAI), limits=range(ngs$tAI)) +
opts(plot.title = theme_text(size=14, lineheight=.8, face="bold"),
axis.text.x=theme_text(angle=-90))
```
It looks correctly measured and it also looks like quite a strong correlation! In addition to the 'max rare' codon variants all having very low tAI, it looks like the first ~1/4 of the genes with the lowest expression (X axis is sorted by mean Protein level) all have lower tAIs.
So why are the trend lines so flat? Let's look at just the min and max codons:
```{r 07.04-plot_prot_min_max_tAI, fig.width=12, fig.height=14, cache=T}
multiplot(
ggplot(melt(subset(ngs,
Promoter=='BBaJ23100' & grepl('Rare',ngs$CDS.type)),
measure.vars= c('RNA','Count.DNA','Prot','Trans')),
aes(x=tAI, group=CDS.type, color=CDS.type, y=value)) +
geom_point(alpha=0.15) + theme_bw() +
facet_wrap(RBS~variable, scale='free') +
scale_x_continuous(name="Secondary Structure Free Energy") +
scale_y_log10("Log10 of Dependent Variable") +
opts(title="tRNA adaptation correlations (Strong Promoter)") +
geom_boxplot(fill=NA, outlier.shape=NA),
ggplot(melt(subset(ngs,
Promoter=='BBaJ23108' & grepl('Rare',ngs$CDS.type)),
measure.vars= c('RNA','Count.DNA','Prot','Trans')),
aes(x=tAI, group=CDS.type, color=CDS.type, y=value)) +
geom_point(alpha=0.15) + theme_bw() +
facet_wrap(RBS~variable, scale='free') +
scale_x_continuous(name="Secondary Structure Free Energy") +
scale_y_log10("Log10 of Dependent Variable") +
opts(title="tRNA adaptation correlations (Weak Promoter)") +
geom_boxplot(fill=NA, outlier.shape=NA),
cols=1)
```
Alright, there is an effect here, but **why is it so strong for RNA?**. The max protein measurement might be washing out the effect, perhaps? Or the fitness cost makes the cells divide more slowly, increasing the amount of RNA per cell? Something weird is definitely going on here. I split the tAI into regions and plot it with boxplots:
```{r fig.width=12, fig.height=14, cache=T}
multiplot(
ggplot(melt(subset(ngs, Promoter=='BBaJ23100'),
measure.vars= c('RNA','Count.DNA','Prot','Trans')),
aes(y=value, x=cut(tAI, breaks=5), color=RBS)) +
geom_boxplot(alpha=0.15) + theme_bw() +
facet_wrap(RBS~variable, scale='free') +
opts(title="tRNA adaptation correlations (Strong Promoter)") +
scale_y_log10("Log10 of Dependent Variable") +
geom_boxplot(fill=NA, outlier.shape=NA),
ggplot(melt(subset(ngs, Promoter=='BBaJ23108'),
measure.vars= c('RNA','Count.DNA','Prot','Trans')),
aes(y=value, x=cut(tAI, breaks=5), color=RBS)) +
geom_boxplot(alpha=0.15) + theme_bw() +
facet_wrap(RBS~variable, scale='free') +
opts(title="tRNA adaptation correlations (Weak Promoter)") +
scale_y_log10("Log10 of Dependent Variable") +
geom_boxplot(fill=NA, outlier.shape=NA),
cols=1)
```
So it looks like max/min is a strong effect but when I include the tAI measures for all sequences, the effect is not very strong (though still significant) when I include sequences not explicitly designed to have a high rare codon usage. What if we split it up into the extremes, the 1% and 99% quantiles:
```{r fig.width=12, fig.height=14, cache=T}
multiplot(
ggplot(melt(subset(ngs, Promoter=='BBaJ23100'),
measure.vars= c('RNA','Count.DNA','Prot','Trans')),
aes(y=value, x=cut(tAI,
breaks=c(0,quantile(ngs$tAI,c(.01,.99)),1),
labels=c('5%','mid','95%')),
color=RBS)) +
geom_boxplot(alpha=0.15) + theme_bw() +
facet_wrap(RBS~variable, scale='free') +
opts(title="tRNA adaptation correlations (Strong Promoter)") +
scale_y_log10("Log10 of Dependent Variable") +
geom_boxplot(fill=NA, outlier.shape=NA),
ggplot(melt(subset(ngs, Promoter=='BBaJ23108'),
measure.vars= c('RNA','Count.DNA','Prot','Trans')),
aes(y=value, x=cut(tAI,
breaks=c(0,quantile(ngs$tAI,c(.01,.99)),1),
labels=c('5%','mid','95%')),
color=RBS)) +
geom_boxplot(alpha=0.15) + theme_bw() +
facet_wrap(RBS~variable, scale='free') +
opts(title="tRNA adaptation correlations (Weak Promoter)") +
scale_y_log10("Log10 of Dependent Variable") +
geom_boxplot(fill=NA, outlier.shape=NA),
cols=1)
```
So even plotting the top 1% and the bottom 1% of tAI values is not as strong as the difference between min/max rare codon types. It seems as if the min/max rare constructs have some other property that tAI alone is not catching. Could it be re-use of the same tRNA for consecutive codons? Some other metric of codon usage? It is unclear, but I will have to explore further.