Correlated Occurrence and Bypass of Frame-Shifting Insertion-Deletions (InDels) to Give Functional Proteins
Published in the journal: PLoS Genet 9(10): e32767. doi:10.1371/journal.pgen.1003882
Category: Research Article
Summary
Short insertions and deletions (InDels) comprise an important part of the natural mutational repertoire. InDels are, however, highly deleterious, primarily because two-thirds result in frame-shifts. Bypass through slippage over homonucleotide repeats by transcriptional and/or translational infidelity is known to occur sporadically. However, the overall frequency of bypass and its relation to sequence composition remain unclear. Intriguingly, the occurrence of InDels and the bypass of frame-shifts are mechanistically related, occurring through slippage over repeats by DNA or RNA polymerases, or by the ribosome, respectively. Here, we show that the frequency of frame-shifting InDels, and the frequency with which they are bypassed to give full-length, functional proteins, are indeed highly correlated. Using a laboratory genetic drift, we have exhaustively mapped all InDels that occurred within a single gene. We thus compared the naive InDel repertoire that results from DNA polymerase slippage to the frame-shifting InDels tolerated following selection to maintain protein function. We found that InDels repeatedly occurred, and were bypassed, within homonucleotide repeats of 3–8 bases. The longer the repeat, the higher the frequency of InDel formation, and the more frequent their bypass. Besides an expected 8A repeat, other types of repeats, including short ones, and G and C repeats, were bypassed. Although obtained in vitro, our results indicate a direct link between the genetic occurrence of InDels and their phenotypic rescue, thus suggesting a potential role for frame-shifting InDels as bridging evolutionary intermediates.
Introduction
InDels occur in all kingdoms of life, and in some organisms they are as frequent as point mutations [1]. Short sequence repeats, and homonucleotide repeats in particular, are prone to InDels due to misalignment of the DNA strands during replication (polymerase slippage) [2]. In coding regions, at least 2/3 of InDels disrupt the reading frame and are thus considered nonsense mutations leading to loss of function (for comparison, only ∼1/20 of point mutations result in a stop codon). Frame-shifting InDels are therefore thought to be tolerated only when a gene is freed from selection pressure. Indeed, under fluctuating environments, and/or within small populations, frame-shifting InDels in sequence repeats provide a rapid means of switching genes on and off [3]–[7].
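The two fractions quoted above can be checked with a short enumeration. The following is a minimal sketch, assuming the standard genetic code and assuming that all sense codons and all single-base substitutions are equally likely (neither assumption is stated in the text); the function names are illustrative only.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def is_frameshifting(indel_length: int) -> bool:
    """An InDel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


def nonsense_counts():
    """Return (substitutions creating a stop codon, all substitutions)
    over every single-base change in every sense codon."""
    to_stop = total = 0
    for codon in map("".join, product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue  # only sense codons are mutated
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                to_stop += mutant in STOP_CODONS
    return to_stop, total


to_stop, total = nonsense_counts()
print(to_stop, total, round(to_stop / total, 3))

# Of the InDel lengths 1-6 nt, those not divisible by 3 shift the frame.
print(sum(is_frameshifting(n) for n in range(1, 7)), "of 6 lengths shift the frame")
```

Under these uniform-weight assumptions, 23 of 549 possible substitutions create a stop codon (about 4%), which is of the same order as the ∼1/20 quoted above, and two of every three InDel lengths are frame-shifting.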
Frame-shifting InDels are considered by default as nonsense mutations. There are, however, known precedents for their bypass to give functional proteins due to transcriptional and/or translational infidelity [8]–[15]. Frame-shifts are also recruited as a regulatory mechanism through programmed −1/+1 ribosomal frame-shifting [9]–[13]. In other cases, alternative proteins are encoded from the same gene via translational frame-shift [16], [17]. Overall, although rare, bypass of frame-shifts has been identified in organisms from all three domains of life and in viruses [9]–[11]. Most recorded events of bypass occur in long homo-adenine repeats, e.g. ≥10A [8], [14], but bypasses within non-repeat stretches have sporadically been reported [9], [10]. Most cases also concern explicitly evolved mechanisms for the transcriptional and/or translational bypass of frame-shifting InDels, rather than accidental slippage. It therefore remains unknown to what degree randomly occurring frame-shifting InDels can persist in coding regions under purifying selection, and whether and how the likelihood of bypass relates to the sequence context within which a frame-shift occurs.
Our own interest in frame-shifts followed the directed laboratory evolution of a DNA methyltransferase, M.HaeIII, towards new target DNA specificities [18]. During this process, surviving variants were identified that carried frame-shifting InDels within their coding regions. Common to all of these variants was the location of the frame-shift mutation (an ‘A’ insertion within a homonucleotide repeat of 8 bases, Figure S1). Being unaware of the possibility of bypass, we assumed these were ‘false positives’, even though these variants did exhibit a detectable level of methylation of the newly evolved target. Whilst these InDel-carrying variants disappeared in the subsequent rounds of selection, we encountered additional examples of functional variants carrying frame-shifts in the selection of other proteins. We thus became curious as to how frequent the bypass of frame-shifting InDels might be, and whether such InDels may serve as viable evolutionary intermediates.
InDels are of particular interest as they readily create alterations in a protein's length and sequence, and thus go beyond the exchange of a single side chain [19]. This is also the case with our model, M.HaeIII, a DNA methyltransferase isolated from Haemophilus aegyptius that specifically methylates GGCC ds-DNA sites. M.HaeIII belongs to the prokaryotic restriction-methylation system that encompasses hundreds of different methyltransferases, each with a different DNA target specificity. The target recognition domains (TRDs) of DNA methyltransferases exhibit relatively low structural order and are highly diverse, including extensive changes in length [20]. We suspected that the intense diversification of TRDs might relate to InDels. We used the “Path” algorithm, which produces DNA sequence alignments by back-translation of known protein sequences (Figure S2). In this manner, frame-shifting InDels that might have underlain the divergence of these sequences might be detected [21], [22]. As discussed in detail below, frame-shifting InDels were readily identified in the aligned TRDs. However, since protein evolution is assumed to occur via a series of functional intermediates [23], the evolutionary relevance of frame-shifting InDels depends on their potential to be rescued via bypass by transcriptional or translational errors.
Here, we have systematically mapped the accumulation of InDels in a laboratory-performed genetic drift of a single gene/protein. We were interested in measuring M.HaeIII's tolerance of InDels, given that, a priori, 2/3 are expected to be purged due to frame-shifts, and that the in-frame ones are also far more deleterious than point mutations [20]. To this end, we subjected M.HaeIII to iterative rounds of random mutagenesis in-vitro followed by purifying selection. We analyzed the gene repertoires by high-throughput sequencing; both the repertoire before selection, thus mapping the occurrence of all mutations regardless of their effect on M.HaeIII, and the repertoire after selection, thus mapping the repertoire of accepted mutations. We thereby measured, for all 987 bases along the M.HaeIII gene, the occurrence rates of InDels due to DNA polymerase slippage, and the rates of their bypass due to transcriptional/translational errors. The data indicate that the rate of bypass of frame-shifting InDels is unexpectedly high, including in relatively short homonucleotide repeats, and in repeats of nucleotides other than adenine. Foremost, we found that the propensity for the genetic occurrence of InDels, and the rates of transcriptional-translational bypass, are highly correlated.
Results
The laboratory drift
M.HaeIII was subjected to a laboratory genetic drift, namely to repeated rounds of random mutagenesis and purifying selection that eliminated non-functional variants (negative selection, Figure S3). To this end, M.HaeIII's gene was subjected to random mutagenesis using an error-prone DNA polymerase, at an average of 2.2±1.6 mutations per gene. The ensemble of mutated genes was ligated into an expression vector using restriction sites at the very beginning of M.HaeIII's ORF, around the ATG codon, and just after the stop codon. A plasmid vector was necessary for obtaining a large number of transformants such that large repertoires (≥10^5 variants) could be explored. However, when driven from high-copy plasmids, protein expression levels can be unrealistically high, and thus bias the level of bypass. To minimize the levels of expression, the mutated M.HaeIII genes were cloned under the control of the tightly regulated tet promoter, with a constitutively expressed tet repressor encoded downstream. The selections throughout the drift were performed at basal expression, i.e., with no inducer added to the growth media. This basal expression level was nonetheless sufficient for complete methylation of the encoding plasmid, as well as of the genome of the E. coli host, by wild-type M.HaeIII [18].
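To put the mutagenesis load in perspective, the following is a rough sketch that assumes mutations are distributed across genes according to a Poisson distribution. This is our illustrative assumption; the text reports only the mean and spread (2.2±1.6 mutations per gene) and the length of the mutagenized region (987 bases).

```python
import math

MEAN_MUTATIONS = 2.2  # average mutations per gene per round (from the text)
GENE_LENGTH = 987     # mutagenized bases (from the text)


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events under a Poisson model (assumed here)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


per_base_rate = MEAN_MUTATIONS / GENE_LENGTH          # mutations per base per round
unmutated_fraction = poisson_pmf(0, MEAN_MUTATIONS)   # genes carrying no mutation
poisson_sd = math.sqrt(MEAN_MUTATIONS)                # spread implied by the model

print(f"per-base rate: {per_base_rate:.2e}")
print(f"unmutated genes: {unmutated_fraction:.3f}")
print(f"Poisson s.d.: {poisson_sd:.2f}")
```

Under this model roughly one gene in nine would carry no mutation after a round, and the implied standard deviation is of a similar magnitude to the reported ±1.6, although the text does not state which distribution the mutation counts actually follow.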
The ligated plasmids were transformed to E. coli, such that each transformed cell incorporated a different plasmid molecule carrying a different gene variant from the library of M.HaeIII mutants. In each bacterium, the transformed plasmid is replicated, and subsequently methylated at GGCC sites, or not, depending on the functionality of the M.HaeIII variant it encoded. The transformed bacteria were grown, and the plasmid pool was subsequently extracted and treated with the cognate restriction enzyme, HaeIII. Plasmids that encoded a functional M.HaeIII variant survived the digestion and thereby could be retransformed to fresh E. coli cells and propagated for the next round of selection [18].
We maintained ≥10^5 transformants per round of mutagenesis-selection, thus avoiding population bottlenecks and the fixation by chance of deleterious mutations. Overall, M.HaeIII's gene underwent 17 rounds of random mutagenesis and purifying selection through which the drifting population accumulated an average of 2.0±1 point mutations per gene per round.
Systematic mapping of InDels by deep sequencing
The M.HaeIII genes encoded by the plasmid pools were subjected to high-throughput sequencing (Illumina). Sequencing was performed following the first round of random mutagenesis, thus mapping the occurrence of mutations irrespective of selection (G0, or the naive repertoire). Additionally, the pool derived after 17 rounds of mutagenesis and purifying selection was sequenced to map the repertoire of accepted mutations (G17).
The short sequencing reads (∼40 nts) were mapped to the sequence of wild-type M.HaeIII. The analyzed sequence stretch included M.HaeIII's coding region, which was repeatedly mutated and re-cloned into the selection plasmid (987 nts), as well as a plasmid region located upstream of the cloning sites that was not subjected to mutagenesis (98 nts). The latter was used to determine the background frequency of sequencing errors due to the Illumina processing, in both repertoires, G0 and G17. This background frequency was subtracted from the InDel or point mutation frequencies observed at the positions subjected to drift. This procedure allowed us to measure the mutational frequencies for all possible point mutations and InDels throughout M.HaeIII's coding region, from nucleotide position 4 (downstream of the NcoI cloning site) to the stop codon (position 993; 17 nucleotides upstream of the NotI cloning site).
The frequency of a given mutation, namely a given nucleotide exchange or a given InDel, at a specific position, corresponded to the number of contigs that carried this mutation divided by the total number of sequenced contigs that covered this position. We excluded sequenced positions within contigs obtained with low accuracy score, and/or showing biases with respect to their location (e.g. positions located at the edges of the contigs, see Methods). In this manner, all InDels that occurred at a frequency above the background were identified, in both the naive and the selected repertoires (Table 1).
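The per-position frequency and the background subtraction described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the averaging of the background over the unmutated positions, the clamping of negative values to zero, and all read counts are hypothetical.

```python
def mutation_frequency(mutant_reads: int, covering_reads: int) -> float:
    """Reads carrying the mutation divided by all reads covering the position."""
    if covering_reads <= 0:
        raise ValueError("position has no coverage")
    return mutant_reads / covering_reads


def background_rate(control_counts) -> float:
    """Mean apparent frequency over positions that were never mutagenized.
    Averaging is an assumption; the text does not say how the 98 control
    positions were aggregated."""
    freqs = [mutation_frequency(m, c) for m, c in control_counts]
    return sum(freqs) / len(freqs)


def corrected_frequency(freq: float, background: float) -> float:
    """Subtract the sequencing-error background; negative values are
    clamped to zero (an assumption made for this sketch)."""
    return max(freq - background, 0.0)


# Hypothetical counts: (mutant reads, covering reads) at control positions.
control = [(2, 100_000), (1, 100_000), (3, 100_000)]
bg = background_rate(control)
observed = mutation_frequency(30, 100_000)
print(corrected_frequency(observed, bg))
```

A position would then be flagged as carrying a potentially tolerated InDel when its corrected frequency exceeds a chosen threshold relative to the background.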
Mutation types and frequencies in the naive library (G0)
The in-vitro mutagenesis protocol applied here does not reproduce the factors that contribute to InDel formation in natural genomes. Nevertheless, certain patterns observed in the acquisition of mutations in natural genomes were also observed here. For instance, the point mutations to InDels ratio (S/I) in our unselected repertoire (G0) was found to be ∼16. This ratio is within the range observed in natural genomes (e.g., ∼10 in humans, or 16 in S. cerevisiae) [1]. The InDels that relate to polymerase slippage in natural genomes are typically short (≤5 bp in length) with short frame-shifting InDels (i.e., InDels of 1 or 2 nts) comprising over two-thirds [24]. Here, single nucleotide InDels were overwhelmingly represented (∼98% of all detected InDels; Table 1), and were thus all expected to result in a frame-shift and loss of function. Finally, as detailed below, the tendency of InDels to occur within repeat regions of natural genomes is also observed in the in-vitro generated naive library.
InDels occurred within repeat sequences in the naive library (G0)
The predominant mechanism generating InDels in natural genomes is polymerase slippage due to infidelity in repeat sequence pairing [2], [25], [26]. Indeed, essentially all InDels observed here occurred within homonucleotide repeats of 3–8 nucleotides (Table 2). Further, as reported for natural genomes [27]–[29], the frequency of occurrence positively correlated with repeat length (R^2 = 0.97, Table 2).
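Locating the homonucleotide repeats in which InDels are expected to cluster is straightforward. The sketch below is illustrative only: the function name and the example sequence are hypothetical (the sequence is not taken from M.HaeIII), and the minimum length of 3 simply mirrors the shortest repeats reported above.

```python
from itertools import groupby


def homonucleotide_repeats(seq: str, min_len: int = 3):
    """Return (start, base, length) for every run of identical bases that is
    at least min_len long; start is a 0-based index into seq."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Hypothetical sequence containing a 6G, an 8A and a 3C repeat.
example = "AT" + "G" * 6 + "CT" + "A" * 8 + "T" + "C" * 3 + "T"
for start, base, length in homonucleotide_repeats(example):
    print(f"{length}{base} repeat starting at index {start}")
```

Grouping observed InDel frequencies by the length of the repeat returned here is one way to tabulate a frequency versus repeat-length relationship of the kind summarized in Table 2.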
Frame-shifting InDels are frequently bypassed under selection (G17)
Sequencing of the selected repertoire, G17, showed that the overall tolerance of InDels was, as expected, low. Accordingly, under selection, the point mutations to InDels ratio (S/I) within M.HaeIII's coding region increased from ∼16 in G0 to ∼190 in G17 (Table 1). Frame-shifting InDels in M.HaeIII were primarily found to comprise nonsense mutations. Indeed, the purging of InDels in coding regions of natural genomes is intense [20], [30], [31].
The purging of mutations, including InDels, in disordered regions including inter-domain linkers and domain termini is far less intense than within ordered domains [20], [32]. Accordingly, the purging of InDels was >30-fold less intense at the last 7 amino acids of M.HaeIII's C-terminus. This stretch, starting from amino acid position 324, or nucleotide position 969, until the stop codon is structurally disordered and has no functional role (Table 1, lower panel, marked as ‘C-terminus, amino acids 324–330’). Thus, premature stop codons, or completely altered amino acid sequences within this region have little effect on M.HaeIII stability and function.
As expected, most InDels were purged, yet an unexpectedly high fraction was found to be tolerated. Overall, out of 337 positions in which InDels were detected in the naive repertoire, 79 positions were found to carry InDels in G17. Out of the latter, 26 positions carried InDels at a frequency ≥0.2×10^−3 (≥10-fold higher than background frequency; Figure 1, Figure S4). InDels found at these significant frequencies in G17 were therefore considered potentially tolerated.
As observed for the occurrence of InDels in the naive repertoire (G0), the frequencies of tolerated InDels (G17 frequencies) were highly correlated with length of the repeat in which they occurred (Table 2, Figure 2). Thus, the InDels that are most prone to occur are also the ones that are most likely to be rescued by transcriptional/translational errors. Indeed, out of 337 different positions in which InDels were identified in total (Figure S4), only 9 were observed above background rates and not within homonucleotide repeats (Table S1). These were found either at the end of homonucleotide repeats whereby the inserted or deleted nucleotide differed from the repeat one, or in short repeats such as TCTCT. These non-canonical InDels might be bypassed not at the InDel position itself, but at the adjacent repeat.
Individually tested frame-shifting InDels
To verify that the frame-shifting InDels observed under selection were indeed bypassed, we generated 15 different M.HaeIII mutants, each carrying a specific InDel that had been observed in G17 at a frequency above 0.2×10^−3. We also tested four InDels identified at high frequencies in the unselected G0 library yet at near-background frequencies in G17 (# 9, 12, 13 and 14; Figure 1, Table 3). The InDel-carrying variants were individually cloned into the same plasmid that was used for the drift and transformed into E. coli. Cultures derived from cells transformed with individual M.HaeIII InDel variants were grown under basal expression, or under over-expression conditions (with inducer). The functionality of individual variants was determined by the standard plasmid protection test [18], [33], i.e., by the ability of the InDel-carrying variants to protect their encoding plasmids from HaeIII digestion (Figure 3). Upon over-expression, out of the 15 frame-shifted variants that corresponded to InDels found in G17, 13 showed a detectable level of protection, and hence measurable methyltransferase activity (Figure 3A). Out of the four tested InDels that were found to occur in G0 but were purged under selection, the two with the lowest G17 InDel frequencies (#9 and 14) showed no activity, as expected. The other two (#12 and 13) exhibited a detectable level of protection only when over-expressed. At basal expression level (Figure 3A), the InDels differ widely in their effects. Nonetheless, 9 out of 15 InDels were found to be bypassed at basal levels, and these also exhibited the highest G17 frequencies (Table 3). For example, variants #5–7, corresponding to InDels within the longest homorepeat (8A repeat carrying an ‘A’ deletion, or ‘A’/‘AA’ insertions, Table 3), showed high protection levels at basal expression and also the highest frequency of occurrence in G17.
This result is not that surprising as long homo-A repeats are known hotspots for transcriptional/translational slippage (usually ≥10 nucleotides) [8], [14]. However, several variants carrying InDels within shorter repeats were also found to be bypassed at basal expression levels (#1, 15, 17, and 19, that occurred within 5A, 5C, 4A, and 5T homorepeats, respectively). In fact, one of these, with a deletion within a 4A repeat, seems to exhibit the highest protection level at basal expression (#17). Further, the most active InDel-carrying variants (#8 and #17) exhibited physiologically relevant levels of methylation activity as was also indicated by their ability to fully methylate their GGCC sites in the genomes of their host E. coli cells, even at basal expression levels (>12,000 GGCC sites protected from HaeIII digestion versus 19 in the selection plasmid, Figure S5).
Further validation that these frame-shifting InDels are bypassed to yield full length proteins was provided by a Western blot using an M.HaeIII construct that carries an epitope tag at the C-terminus (Figure S6). The observed levels of full-length proteins were well correlated with the plasmid protection levels at basal expression level, and with the G17 frequency of the InDels that these variants carry.
Although showing relatively high G17 frequencies, two variants (#3, 4, insertion and deletion of ‘G’ at the 6G repeat in position 204) showed no detectable methyltransferase activity, even when over-expressed. This, and the fact that some InDels are only bypassed upon over-expression, does not necessarily mean that the detection of these InDels in G17 is an artifact. Due to the short contigs of Illumina sequencing, the sequence composition of the full-length drifted M.HaeIII variants within which these InDels originally occurred is unknown. They may well contain compensatory mutations in the background on which these InDels were tolerated. Indeed, laboratory drifted variants accumulate global suppressor mutations at high rates [34], as was also observed in our laboratory drift of M.HaeIII (unpublished data). In fact, the acceptance of InDels in naturally drifting sequences also seems to be correlated with the acquisition of enabling point mutations [20].
Discussion
The relatively high frequency of tolerated InDels revealed here reinforces the possibility that frame-shifting InDels should not be considered by default as dead-ends. Rather, the sequence context within which InDels occur most frequently may also promote their tolerance. Clearly, this and other conclusions derived from this study need to be considered in view of the in-vitro mutagenesis protocol, the laboratory selection context, and the data that relates to one gene/protein. Nonetheless, key features that are also relevant to the acquisition of InDels in natural genomes were captured – foremost, the tendency of InDels to occur within repeat regions, and the higher frequency of frame-shifting InDels relative to in-frame ones. Our experimental system mimics these two features and thereby enabled us to systematically measure the rate of occurrence of InDels within all positions of the studied gene, and in the absence and in the presence of selection.
The levels of bypass observed in our dataset might be artificially elevated as a result of enhanced gene copy number and/or expression levels. In our experimental setup, M.HaeIII was encoded by a multi-copy plasmid and under an inducible promoter. Nonetheless, a relatively low, basal expression level (i.e., without induction) was maintained due to the tight regulation of the tet promoter with constitutive over-expression of repressor from the same plasmid. The natural restriction-modification system from which M.HaeIII was derived is encoded by a chromosomal gene. However, whereas the restriction enzyme is tightly regulated, the methyltransferase is constitutively expressed [35], [36]. Whilst we have no direct comparison of the protein doses in nature and in our experiment, they are unlikely to differ dramatically (for comparison, a similar plasmid-based experimental setup showed no detectable GFP signal when inducer levels were ≤20 µg/ml [37], whereas in our drift no inducer was added).
Occurrence and bypass of InDels are both correlated with repeat length
The generation of InDels in this laboratory drift was the outcome of polymerase slippage during DNA replication (components such as DNA repair were not included in the in-vitro replication protocol). This assumption is supported by the strict correlation between repeat length and the frequency of InDels within the repeat positions (Figure 2; G0 line). In non-repeat positions (positions where the flanking bases differ from the base in the mutated position), the occurrence frequency of InDels is close to the detection limit. Indeed, extrapolating from the observed linear correlation of repeat length and log[InDel frequency] to repeat length = 1, a frequency of ∼10^−4 is obtained. The InDel frequency increases by ∼2.5-fold per nucleotide as the repeat length increases (Figure 2; G0).
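The log-linear relationship described above can be written as frequency(L) ≈ f1 × fold^(L−1), where L is the repeat length. The sketch below uses the two rounded G0 values quoted in the text (f1 ∼10^−4 at L = 1 and ∼2.5-fold per nucleotide) as defaults. The function names are illustrative, and the fit is demonstrated on synthetic points generated from the model itself, not on the measured data.

```python
import math


def predicted_frequency(repeat_len: int,
                        freq_at_one: float = 1e-4,
                        fold_per_nt: float = 2.5) -> float:
    """Log-linear model: each extra repeat nucleotide multiplies the
    InDel frequency by fold_per_nt."""
    return freq_at_one * fold_per_nt ** (repeat_len - 1)


def fit_log_linear(lengths, freqs):
    """Least-squares fit of log10(freq) = a + b * length; returns (a, b)."""
    ys = [math.log10(f) for f in freqs]
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic points for repeat lengths 3-8, generated from the model above.
lengths = list(range(3, 9))
synthetic = [predicted_frequency(n) for n in lengths]
intercept, slope = fit_log_linear(lengths, synthetic)

print("fold change per nucleotide:", round(10 ** slope, 2))
print("extrapolated frequency at length 1:", 10 ** (intercept + slope))
```

Applied to measured per-repeat frequencies, the same fit would yield the slope (fold change per nucleotide) and the extrapolated frequency at non-repeat positions for either repertoire.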
Frame-shift bypasses are primarily associated with homo-A repeats [8], [14]. Homo-A and homo-T repeats show higher InDel frequencies than G/C repeats, but there are no clear trends regarding composition, primarily because the different base repeats are represented in M.HaeIII's gene at very different frequencies (e.g. 34 A/T 3-nucleotide repeats versus 5 G/C repeats; Table 2). A strict correlation was also observed between the frequency of bypassed frame-shifting InDels and repeat length. Due to the purging of most InDels by the negative/purifying selection, the slope of the correlation curve is steeper: a 3.4-fold higher frequency per nucleotide of repeat length. Additionally, the intercept with the Y-axis indicates that bypass at non-repeat positions is below the background level (∼0.3×10^−5, Figure 2; G17).
The validation of functional variants carrying individual InDels consisted of: (i) their well-above-background frequencies in the repertoire of M.HaeIII genes that passed the selection, G17, (ii) their retention of GGCC methylation activity, and (iii) the detection of full-length proteins. Overall, these data suggest that the bypass of InDels occurs by slippage of the RNA polymerase and/or the ribosome, thus shifting the reading-frame either upstream or downstream (−1 or +1 shifts, respectively) relative to the original frame. This mechanism accounts for the observed circularity in the generation and bypass of the frame-shifting InDels. Namely, the propensity of a given sequence stretch, in this case homonucleotide repeats, towards slippage of the DNA polymerase (the genetic InDel formation), of RNA polymerase (transcriptional bypass), or of the ribosome (translational bypass), is similar.
It should, however, be noted that because frame-shifting InDels are only partly bypassed, they impose a cost. Even when bypass produces enough full-length, functional protein (e.g. in those cases where 100% methylation of both the plasmid and the host's chromosome is observed; Figs. 2B, S5, variants #8, 17), truncated versions derived from the original frame are also produced (Figure S6), thus producing aggregated, deleterious debris. Any advantage afforded by a frame-shifting InDel therefore depends on its benefit outweighing the cost associated with such debris [38]. Nevertheless, competitions of cells carrying wild-type M.HaeIII with cells carrying different InDel variants indicated no growth inhibition by frame-shifts (Figure S7). In fact, the InDel-carrying variants unexpectedly became enriched. The toxicity associated with DNA methylation could bias growth in favor of InDel-carrying variants that are less active. However, an E. coli strain was used in which methylation is not toxic [39] and the most active InDel variant (#17) showed the highest growth advantage (Figure S7B). It therefore seems that, at least for the genes tested here, and within our experimental setup, the growth disadvantage imposed by frame-shifts is undetectable, possibly because expression of wild-type M.HaeIII also produces truncated fragments at comparable levels (Figure S6).
Bypassed InDels are located between conserved motifs
The survival of frame-shifted variants was found to be dependent not only on the nucleotide repeat length, but also on the location of the InDel within the encoded protein. In accordance with previous findings [20], non-functional InDels (e.g. #3, #4, #9 and #14) were located within highly conserved regions, whereas the tolerated ones tend to be located between conserved motifs (Figure 1D). For example, InDels #3 and 4 comprise an insertion or deletion of a G nucleotide within a 6G repeat located in the middle of M.HaeIII's catalytic region (motif IV). These InDels were detected at high frequency in the naive repertoire, but with very low frequency in the selected, G17 repertoire (Table 3). In accordance, the variants carrying these InDels show no methylation activity, even under over-expression (Figure 3). In contrast, and in agreement with the “Path” analysis for predicting frame-shifting sites (Figure S2), InDels tolerated at high frequency tend to reside in connecting loops between conserved motifs. InDels #6–8, for example, reside in a flexible loop that connects motifs V and VI (Figure S8). This tendency suggests that the bypass of most frame-shifting InDels results, as a minimum, in one point mutation. The conserved structural motifs are intolerant to substitutions and hence to InDels, even in-frame ones [20], [40], let alone frame-shifting ones (Figure 1D, Table 3).
Implications: ORF predictions
Altogether, our data indicate that frame-shifting InDels can be tolerated at a surprisingly high frequency. We identified at least eight readily bypassed InDels within M.HaeIII, all comprising homonucleotide repeats of 4–8 nucleotides located between the enzyme's conserved motifs. Variants carrying frame-shifting InDels within the longest 8A repeat were as expected highly tolerated, but other permissive InDels were identified in unexpected repeats, e.g. relatively short repeats (4 nucleotides), homo-G or -C repeats, and even next to repeats rather than within them (Table S1). The identification of multiple permissive InDels in M.HaeIII reinforces the possibility that shifted open reading frames are often read-through to give functional proteins (for examples see [8], [13], [14], [41]). The systematic mapping performed here indicated a strict correlation of the bypass likelihood with the repeat's length, and with its structural location. These parameters may assist the identification of ORFs carrying frame-shifts that may actually encode full-length, functional proteins.
Implications: The bridging potential of frame-shifting InDels
The tolerance of a frame-shifting InDel correlates with the tendency of the position within which it occurred to acquire InDels in the first place. For the very same reason, the likelihood of reversion of an InDel, thus restoring the original frame, is very high. Reversion may occur at the same position, or at other positions within the same repeat (scenarios that are indistinguishable by sequence comparison), or, as observed here, in positions flanking the repeat (Table S1). Repeats therefore comprise hot spots for changes in length and composition, as observed in rapidly evolving proteins related to bacterial pathogenicity, or in organisms that rapidly switch certain genes on and off [3]–[7]. Similarly, the target recognition domains (TRDs) of DNA methyltransferases are highly diverse not only in sequence, but also in length [20]. Indeed, analysis of alignments of M.HaeIII and its orthologs with “Path” supports the hypothesis of diversification via frame-shifting InDels (Figure S2).
The bypass of frame-shifting InDels, although transient and/or accompanied by partial loss of function, greatly increases the likelihood of occurrence of a second InDel in sequential proximity to the original one, thus restoring the frame. This may result in the diversification of both the length and composition of the entire stretch of amino acids between the two InDels, and thus, in drastic structural and functional changes occurring via functional intermediates [23]. Indeed, transcriptional and translational errors, or phenotypic mutations, may play an evolutionary role in shaping protein properties or acting as bridging intermediates [42]–[45].
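The effect of a second, frame-restoring InDel can be illustrated on a toy sequence. The following is a minimal sketch assuming the standard genetic code; the 30-nucleotide sequence, the chosen positions and the helper names are hypothetical and are not taken from M.HaeIII.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate from the first base up to the first stop codon;
    a trailing incomplete codon is ignored."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def delete_base(dna: str, index: int) -> str:
    return dna[:index] + dna[index + 1:]


def insert_base(dna: str, index: int, base: str) -> str:
    return dna[:index] + base + dna[index:]


# Hypothetical ORF with a 5A repeat (indices 6-10) and a GG pair further on.
wild_type = "ATG" "GCT" "AAA" "AAG" "GTC" "CTG" "GAA" "TTC" "CGT" "TAA"
shifted = delete_base(wild_type, 6)        # 'A' deletion within the 5A repeat
restored = insert_base(shifted, 16, "G")   # 'G' insertion a few codons downstream

print(translate(wild_type))  # original protein
print(translate(shifted))    # frame-shifted from the deletion onwards
print(translate(restored))   # frame restored; only the stretch in between differs
```

In this toy case the doubly mutated gene encodes MAKRSWEFR instead of MAKKVLEFR: the three residues between the two InDels are replaced while the flanking sequence, and the original stop codon, are recovered.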
Methods
Plasmids and strains
The M.HaeIII wild-type gene carrying four stabilizing mutations [18] was cloned with an N-terminal His-tag into the pASK-IBA3+ vector (IBA, ampicillin resistance) using NcoI and NotI (Figure S9). Plasmids were transformed into E. coli strain ER2267 (EcoK r- m- McrA- McrBC- Mrr-) in which GGCC DNA methylation is not toxic [39]. Transformants were selected by growth on ampicillin.
Mutagenesis and selection
Random mutagenesis was performed by PCR using an error-prone polymerase (GeneMorph Mutazyme, Stratagene) and primers that flank the M.HaeIII ORF (pASK-F and pASK-R, Table S2). The wild-type gene with 4 stabilizing mutations [18] was used as template for the first round (G0). In subsequent rounds, the selected pool of M.HaeIII variants from the previous round was used as the template. The PCR was optimized to give an average of 2.2 mutations per gene. Each round of evolution, or generation (noted as ‘G’), included the following steps (Figure S3): (i) The pool of M.HaeIII genes from the previous round was randomly mutated, recloned using the NcoI and NotI sites, transformed into E. coli and plated on agar plates containing ampicillin. (ii) About 10^6 individual transformants were obtained in each round, and the cells were grown at 37°C overnight. (iii) The plasmid DNA was extracted and digested with HaeIII (10–20 units, in 50 µl of NEB buffer 2, for 2 hours at 37°C). (iv) The plasmid DNA was purified (PCR purification kit, QIAGEN) and re-transformed for another round of enrichment. Each round of evolution included one cycle of mutagenesis and three cycles of enrichment (transformation, growth, plasmid extraction and digestion). The naive library, G0, refers to the transformed plasmid DNA derived from cloning the repertoire of M.HaeIII genes after the first round of mutagenesis, with no selection by HaeIII digestion.
High-throughput sequencing
The samples of the naive (G0) and the selected libraries from Round 17 (G17) were prepared in the following way: (i) The plasmid pools were PCR amplified with primers pASKXhoI-F and pASKXhoI-R that amplified the M.HaeIII open reading frame while appending XhoI restriction sites at both ends (Figure S9, Table S2). (ii) The amplified products were purified with a PCR purification kit (QIAGEN) and digested with XhoI (20 units, in 60 µl of NEB buffer 4, for 2 hours at 37°C). (iii) The digested products were isolated by gel electrophoresis and a gel extraction kit (QIAGEN). (iv) To avoid bias due to poor sequencing of the edges, the fragments were ligated using the XhoI site to give concatemers. The ligation products were purified by ethanol precipitation. Sequencing libraries were prepared and sequenced according to the manufacturer's protocol at the Weizmann Institute's high-throughput sequencing core facility. The obtained sequencing reads (∼40 nts) were mapped to the reference sequence of wild-type M.HaeIII with two methods: (i) using NCBI blastn v2.2.20 [46] with an e-value cutoff of 0.0001 and a word size of 7, allowing up to 6 mismatches and requiring a minimal alignment length of 24 consecutive nts, as previously described [47], [48]; and (ii) using Novoalign v2.07.00 with parameters: c 4 Hash step-size 6 [47]. Point mutations, insertions and deletions were assigned based on the mapping of the sequencing reads to the reference sequence as previously described [29], [48]. Large insertions (7 or more bases) were determined from the BLAST alignments, owing to BLAST's ability to open long gaps by performing local alignments of the sequences. Single nucleotide mutations and short InDels (7 bases or shorter) were determined from the Novoalign alignments, as these take into account the base-quality information provided by the Genome Analyzer platform (using a quality threshold of Q20 for filtering both InDels and point mutations).
Every mismatch or gap in the read alignments relative to the wild-type reference was recorded for each nucleotide position, and further analyzed using custom Perl scripts. Only InDels that were uniformly distributed along the 40 bp reads were included. InDels that were detected with a high bias towards the edges of reads were individually tested, found to be artifacts, and manually removed (Figure S10). InDel frequencies were determined per nucleotide position as the number of reads with a given InDel(s) divided by the total number of reads that mapped to this position.
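The per-position frequency calculation and the check for edge bias can be sketched as follows. This is a minimal Python illustration, not the custom Perl scripts used in this study; the function names, the input layout and the 5-base edge window are assumptions made for the example, and in practice edge-biased InDels were inspected individually rather than removed by a fixed cutoff.

```python
from collections import Counter

def indel_frequencies(indel_positions, coverage):
    """Per-position InDel frequency.

    indel_positions: iterable of reference positions, one entry per read
                     carrying an InDel at that position (assumed layout).
    coverage: dict mapping position -> number of reads mapped to it.
    Returns dict position -> (reads with InDel) / (reads mapped).
    """
    counts = Counter(indel_positions)
    return {pos: n / coverage[pos]
            for pos, n in counts.items()
            if coverage.get(pos, 0) > 0}

def edge_fraction(read_offsets, read_len=40, edge=5):
    """Fraction of InDel observations lying within `edge` bases of either
    end of the read. With a uniform distribution this is about
    2 * edge / read_len; values far above that suggest an edge artifact.
    The window of 5 bases is a hypothetical choice."""
    if not read_offsets:
        return 0.0
    near_edge = sum(1 for o in read_offsets if o < edge or o >= read_len - edge)
    return near_edge / len(read_offsets)

# Hypothetical counts: 3 reads with an InDel at position 120, 1 at position 305
freqs = indel_frequencies([120, 120, 120, 305], {120: 1500, 305: 2000})
print(freqs)

# Offsets within the read at which one InDel was observed (hypothetical)
print(edge_fraction(list(range(40))))       # uniform along the read
print(edge_fraction([0, 1, 2, 38, 39]))     # concentrated at the read edges
```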
Individual InDel variants
Individual InDels were introduced by all-around PCR using the pASK plasmid encoding wild-type M.HaeIII as template and phosphorylated primers harboring each InDel (Table S3). The PCR products were gel purified, ligated (blunt-end ligation; 10 units of T4 ligase, NEB, 2 hours at room temperature) and transformed into E. coli. Transformants were selected on ampicillin, and InDel incorporation was confirmed by sequencing. The C-terminal HA-tag was appended by PCR using individual InDel constructs as template and primers pASK-F and XhoCtFus-R (Table S2). The PCR products were digested with NcoI and XhoI and ligated into a modified pASK vector containing an in-frame C-terminal HA-tag (Figure S9).
Methylation assays
Sequence-verified InDel variants were transformed into E. coli and grown in LB medium in the presence of ampicillin to OD600∼0.6. The cultures were then split: 200 ng/ml of the expression inducer anhydrotetracycline (AHT) was added to one half, and the other half was left to grow without inducer. Cultures were grown overnight at 37°C. The plasmid DNA was extracted, treated with HaeIII restriction enzyme (10–20 units, 2 hours at 37°C) and analyzed by gel electrophoresis.
Supporting Information
References
1. Lynch M, Sung W, Morris K, Coffey N, Landry CR, et al. (2008) A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A 105: 9272–7.
2. Streisinger G, Okada Y, Emrich J, Newton J, Tsugita A, et al. (1966) Frameshift mutations and the genetic code. Cold Spring Harb Symp Quant Biol 31: 77–84.
3. Kashi Y, King DG (2006) Simple sequence repeats as advantageous mutators in evolution. Trends Genet 22: 253–9.
4. Koch AL (2004) Catastrophe and what to do about it if you are a bacterium: the importance of frameshift mutants. Crit Rev Microbiol 30: 1–6.
5. Moxon ER, Rainey PB, Nowak MA, Lenski RE (1994) Adaptive evolution of highly mutable loci in pathogenic bacteria. Curr Biol 4: 24–33.
6. Rainey P, Moxon R (1993) Unusual mutational mechanisms and evolution. Science 260: 1958; author reply 1959–60.
7. Wernegreen JJ, Kauppinen SN, Degnan PH (2010) Slip into something more functional: selection maintains ancient frameshifts in homopolymeric sequences. Mol Biol Evol 27: 833–9.
8. Wagner LA, Weiss RB, Driscoll R, Dunn DS, Gesteland RF (1990) Transcriptional slippage occurs during elongation at runs of adenine or thymine in Escherichia coli. Nucleic Acids Res 18: 3529–35.
9. Farabaugh PJ (1996) Programmed translational frameshifting. Microbiol Rev 60: 103–34.
10. Farabaugh PJ (1996) Programmed translational frameshifting. Annu Rev Genet 30: 507–28.
11. Baranov PV, Gesteland RF, Atkins JF (2002) Recoding: translational bifurcations in gene expression. Gene 286: 187–201.
12. Cobucci-Ponzano B, Trincone A, Giordano A, Rossi M, Moracci M (2003) Identification of the catalytic nucleophile of the family 29 alpha-L-fucosidase from Sulfolobus solfataricus via chemical rescue of an inactive mutant. Biochemistry 42: 9525–31.
13. Cobucci-Ponzano B, Trincone A, Giordano A, Rossi M, Moracci M (2003) Identification of an archaeal alpha-L-fucosidase encoded by an interrupted gene. Production of a functional enzyme by mutations mimicking programmed −1 frameshifting. J Biol Chem 278: 14622–31.
14. Tamas I, Wernegreen JJ, Nystedt B, Kauppinen SN, Darby AC, et al. (2008) Endosymbiont gene functions impaired and rescued by polymerase infidelity at poly(A) tracts. Proc Natl Acad Sci U S A 105: 14934–9.
15. Meyerovich M, Mamou G, Ben-Yehuda S (2010) Visualizing high error levels during gene expression in living bacterial cells. Proc Natl Acad Sci U S A 107: 11543–8.
16. Tsuchihashi Z, Kornberg A (1990) Translational frameshifting generates the gamma subunit of DNA polymerase III holoenzyme. Proc Natl Acad Sci U S A 87: 2516–20.
17. Tsuchihashi Z, Brown PO (1992) Sequence requirements for efficient translational frameshifting in the Escherichia coli dnaX gene and the role of an unstable interaction between tRNA(Lys) and an AAG lysine codon. Genes Dev 6: 511–9.
18. Rockah-Shmuel L, Tawfik DS (2012) Evolutionary transitions to new DNA methyltransferases through target site expansion and shrinkage. Nucleic Acids Res. doi:10.1093/nar/gks944
19. Bogarad LD, Deem MW (1999) A hierarchical approach to protein molecular evolution. Proc Natl Acad Sci U S A 96: 2591–5.
20. Toth-Petroczy A, Tawfik DS (2013) Protein Insertions and Deletions Enabled by Neutral Roaming in Sequence Space. Mol Biol Evol. doi:10.1093/molbev/mst003
21. Girdea M, Noe L, Kucherov G (2009) Back-translation for discovering distant protein homologies in the presence of frameshift mutations. Algorithms Mol Biol 5: 6.
22. Girdea M, Noe L, Kucherov G (2010) Back-translation for discovering distant protein homologies in the presence of frameshift mutations. Algorithms Mol Biol 5: 6.
23. Smith JM (1970) Natural selection and the concept of a protein space. Nature 225: 563–4.
24. Mills RE, Luttig CT, Larkins CE, Beauchamp A, Tsui C, et al. (2006) An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res 16: 1182–90.
25. Levinson G, Gutman GA (1987) Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol 4: 203–21.
26. Li YC, Korol AB, Fahima T, Beiles A, Nevo E (2002) Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol Ecol 11: 2453–65.
27. Klintschar M, Wiegand P (2003) Polymerase slippage in relation to the uniformity of tetrameric repeat stretches. Forensic Sci Int 135: 163–6.
28. Nishizawa M, Nishizawa K (2002) A DNA sequence evolution analysis generalized by simulation and the Markov chain Monte Carlo method implicates strand slippage in a majority of insertions and deletions. J Mol Evol 55: 706–17.
29. Moran NA, McLaughlin HJ, Sorek R (2009) The dynamics and time scale of ongoing genomic erosion in symbiotic bacteria. Science 323: 379–82.
30. Bhangale TR, Rieder MJ, Livingston RJ, Nickerson DA (2005) Comprehensive identification and characterization of diallelic insertion-deletion polymorphisms in 330 human candidate genes. Hum Mol Genet 14: 59–69.
31. Chen JQ, Wu Y, Yang H, Bergelson J, Kreitman M, et al. (2009) Variation in the ratio of nucleotide substitution and indel rates across genomes in mammals and bacteria. Mol Biol Evol 26: 1523–31.
32. de la Chaux N, Messer PW, Arndt PF (2007) DNA indels in coding regions reveal selective constraints on protein evolution in the human lineage. BMC Evol Biol 7: 191.
33. Szomolanyi E, Kiss A, Venetianer P (1980) Cloning the modification methylase gene of Bacillus sphaericus R in Escherichia coli. Gene 10: 219–25.
34. Bershtein S, Goldin K, Tawfik DS (2008) Intense neutral drifts yield robust and evolvable consensus proteins. J Mol Biol 379: 1029–44.
35. Kobayashi I (2001) Behavior of restriction–modification systems as selfish mobile elements and their impact on genome evolution. Nucleic Acids Res 29: 3742–56.
36. Mruk I, Blumenthal RM (2008) Real-time kinetics of restriction-modification gene expression after entry into a new host cell. Nucleic Acids Res 36: 2581–93.
37. Neuenschwander M, Butz M, Heintz C, Kast P, Hilvert D (2007) A simple selection strategy for evolving highly efficient enzymes. Nat Biotechnol 25: 1145–7.
38. Drummond DA, Wilke CO (2009) The evolutionary consequences of erroneous protein synthesis. Nat Rev Genet 10: 715–24.
39. Raleigh EA, Wilson G (1986) Escherichia coli K-12 restricts DNA containing 5-methylcytosine. Proc Natl Acad Sci U S A 83: 9070–4.
40. McDonald MJ, Wang WC, Huang HD, Leu JY (2010) Clusters of nucleotide substitutions and insertion/deletion mutations are associated with repeat sequences. PLoS Biol 9: e1000622.
41. Cobucci-Ponzano B, Guzzini L, Benelli D, Londei P, Perrodou E, et al. (2010) Functional characterization and high-throughput proteomic analysis of interrupted genes in the archaeon Sulfolobus solfataricus. J Proteome Res 9: 2496–507.
42. Burger R, Willensdorfer M, Nowak MA (2006) Why are phenotypic mutation rates much higher than genotypic mutation rates? Genetics 172: 197–206.
43. Goldsmith M, Tawfik DS (2009) Potential role of phenotypic mutations in the evolution of protein expression and stability. Proc Natl Acad Sci U S A 106: 6197–202.
44. Whitehead DJ, Wilke CO, Vernazobres D, Bornberg-Bauer E (2008) The look-ahead effect of phenotypic mutations. Biol Direct 3: 18.
45. Rajon E, Masel J (2011) Evolution of molecular error rates and the consequences for evolvability. Proc Natl Acad Sci U S A 108: 1082–7.
46. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–402.
47. Avrani S, Wurtzel O, Sharon I, Sorek R, Lindell D (2011) Genomic island variability facilitates Prochlorococcus-virus coexistence. Nature 474: 604–8.
48. Wurtzel O, Dori-Bachash M, Pietrokovski S, Jurkevitch E, Sorek R (2010) Mutation detection with next-generation resequencing through a mediator genome. PLoS One 5: e15628.
49. Ashkenazy H, Erez E, Martz E, Pupko T, Ben-Tal N (2010) ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucleic Acids Res 38: W529–33.
50. Reinisch KM, Chen L, Verdine GL, Lipscomb WN (1995) The crystal structure of HaeIII methyltransferase covalently complexed to DNA: an extrahelical cytosine and rearranged base pairing. Cell 82: 143–53.
51. Blattner FR, Plunkett G 3rd, Bloch CA, Perna NT, Burland V, et al. (1997) The complete genome sequence of Escherichia coli K-12. Science 277: 1453–62.
Article published in the journal: PLOS Genetics, 2013, Issue 10