Systematic Dissection of Coding Exons at Single Nucleotide Resolution Supports an Additional Role in Cell-Specific Transcriptional Regulation

Download PDF České info

Exons that code for protein can also have additional functions, such as regulating gene transcription through enhancer activity. Here, we changed every nucleotide in three different exons that also function as enhancers, and examined their enhancer activity to test whether nucleotide changes in these exons can affect both the protein sequence and enhancer function. We found that mutations with a significant effect on enhancer function can reside both in regions that change the protein sequence (non-synonymous) and regions that do not change it (synonymous). When we conducted a similar analysis in a different cell type, we observed a difference in the nucleotide changes that cause a significant effect on enhancer activity, suggesting that the enhancer functional units can differ between tissues.

Published in the journal: . PLoS Genet 10(10): e32767. doi:10.1371/journal.pgen.1004592
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1004592

Summary

Introduction

Protein coding sequences have been shown to contain additional functional information such as splicing [1], [2], mRNA stability [3], microRNA target sites [4], and transcriptional enhancer activity [5]–[10]. Furthermore, by analyzing various genomic datasets, numerous exons were shown to interact with promoter and enhancer-like regions suggesting that they are involved in alternative splicing, chromatin structure and gene regulation [2]. A study that analyzed enhancer-associated ChIP-seq datasets found that, on average, 7% of peaks overlap coding exons [5], suggesting that numerous eExons are embedded in mammalian genomes. Furthermore, functional characterization of potential eExons in zebrafish [9] and mice [5] found that over half of them are functional developmental enhancers. In addition, it was shown that eExons can regulate the gene they reside in [6]–[10] and also nearby genes [5], suggesting that mutations in these exons could lead to phenotypes that are not due to their protein function.

The functional consequence of non-synonymous mutations is well established [11]. Synonymous mutations have also been shown to have a functional effect, for example causing disease by improper splicing [12]. However, whether mutations in coding exons can lead to phenotypes by impacting enhancer function is not well known. A recent study demonstrated that human chromosomal abnormalities encompassing two eExons, DYNC1I1 exon 15 and 17, that regulate developmental limb expression of two neighboring genes, DLX5 and DLX6, could cause split hand and foot malformations [5], similar to a DLX5 coding mutation [13]. Transcription factors involved in gene regulation were suggested to bind to coding protein sequences and are thought to influence codon choice and, consequently, protein evolution [14]. However, the functional consequence of point mutations in eExons has yet to be determined.

Here, we identified several functional liver eExons and used massively parallel reporter assays (MPRA) [15]–[19] to determine the functional consequences of all possible nucleotide substitutions in three of them. We found that synonymous and non-synonymous mutations have a similar effect on enhancer activity and deleterious enhancer mutations tend to overlap liver-associated transcription factor binding sites (TFBS). Using similar MPRA assays in HeLa cells, we show that their mutational profile can vary in different cell types.

Results

ChIP-seq analysis of liver eExons

In order to identify functional liver eExons for MPRA, we analyzed liver-associated ChIP-seq datasets for coding exons that have enhancer marks. We analyzed enhancer associated ChIP-seq datasets (H3K4me1, H3K27ac and p300) of human hepatocytes generated in a separate study (Smith RP et al., manuscript in preparation) and in 8 week old mouse liver [20] (see methods). Since these enhancer marks could also identify potential promoters, we excluded the first exons of genes that had enhancer marks from our subsequent analyses, as these exons could potentially be promoters and not necessarily enhancers (Table S1). For the H3K4me1 and H3K27ac ChIP-seq datasets, we found that 18–20% and 9–11% of all peaks overlap coding exons in human hepatocytes and adult mouse liver, respectively. Since the average peak size of these two enhancer marks is rather long (∼2 kb), we also analyzed p300 ChIP-seq datasets, which have a shorter average peak size (∼400 bp). We found that 7% and 6% of all peaks overlap coding exons in human hepatocytes and mouse liver, respectively, (Table S1).

We next wanted to increase our chances of obtaining a functional eExon for our enhancer assays and subsequent MPRA. We thus chose the human hepatocyte p300 ChIP-seq dataset for all subsequent analysis (Figure S1). This was done due to its aforementioned shorter peak size, thus increasing the chances that a substantial exon-peak overlap will be indicative of function, since 80% of all human exons are <200 bp in length [21]. We selected only coding exons where at least 25% of the human hepatocyte p300 ChIP-seq peak overlapped the actual exon. Using transcription factor (HNF4A, FOXA1, FOXA2) ChIP-seq datasets from HepG2 cells [22], our eExon candidates were further filtered for peaks that were bound by at least two of these transcription factors. From the remaining 48 eExon candidates, we selected 15 for individual enhancer assays (Table S2).

Functional characterization of liver eExons

To determine whether these exons function as liver enhancers, we tested the fifteen selected candidates for enhancer activity in human hepatocellular carcinoma (HepG2) cells and mouse liver. The selected ChIP-seq exonic peaks were amplified from human genomic DNA and cloned into the pGL4.23 vector (Promega) that contains a minimal promoter followed by the luciferase reporter gene. The vectors were transfected into HepG2 cells and luciferase activity was measured after 24 hours. Ten out of the fifteen tested exons had significant luciferase activity versus the empty vector (p<0.05; t-test), suggesting that they function as enhancers (Figure 1A). We next tested these sequences in mice using the hydrodynamic tail vein injection [23], [24] and found that eight out of the fifteen exons showed significant liver enhancer activity (Figure 1B; p<0.05; t-test). Combined, seven exons functioned as enhancers in both HepG2 and mouse liver.

Mutation analysis of three liver eExons using massively parallel reporter assays

Since eExon mutations could alter both protein coding function and enhancer activity, we were interested to assess the effect of single nucleotide variants (SNVs) on enhancer function at single base pair resolution. We selected three eExons with the highest enhancer activity from the mouse enhancer assay for MPRA: 1) exon 17 of sortilin-related receptor, L (DLR class) A repeats containing (SORL1) gene, which encodes the YWTD domain of this receptor (also named LR11/SorLA). SORL1 is thought to play a role in endocytosis and sorting and is expressed in the human central nervous system and liver [25], but absent in the mouse liver [26], [27]; 2) exon 2 of TRAF3 interacting protein 2 (TRAF3IP2; also known as ACT1) which encodes the Helix-loop-helix (HLH) domain of this protein. TRAF3IP2 is a signaling adaptor protein involved in the regulation of adaptive immunity through the IL-17 pathway [28]–[30] and is expressed in B-cells and liver [31], [32]. 3) exon 6 of the peroxisome proliferator-activated receptor gamma (PPARG) gene, which encodes the ligand binding domain of this protein. PPARG is a nuclear receptor that regulates fatty acid storage and glucose metabolism and is expressed in the liver [33].

Using MPRA, we systematically dissected the functional consequences of all possible SNVs for these three eExons in the mouse liver (Figure 2). By oligonucleotide synthesis and polymerase cycling assembly, we generated for each eExon a low complexity library (≥10,000) of enhancer mutant haplotypes that diverged from the wild type sequence by ∼2–3% (Table S3). These mutant enhancers, along with 20-bp degenerate tags, were cloned into pGL4.23 (see experimental procedures). We performed subassembly on each library to determine the full sequence of each enhancer haplotype and its represented tag [34]. Each library was then assayed in mice using the hydrodynamic tail vein injection, livers were harvested after 24 hours and the relative activities of individual haplotypes were measured by sequencing the transcribed tags. Using linear regression analysis, the effect-size of every possible single-nucleotide change on enhancer activity was estimated (see experimental procedures). We found that the functional consequence of most mutations was modest, with ∼18% affecting activity by ≥1.2-fold and ∼4% by ≥2-fold. The majority of SNVs with ≥2 fold change reduced enhancer activity (92%) and only a few SNVs increased activity (8%) (Table S4). To validate our results, we individually tested selected SNVs using the hydrodynamic tail vein assay and found a high correlation with their MPRA profiles (R = 0.94; Figure S2).

**Fig. 2. MPRA regulatory profiles of three liver eExons.**

We next examined the functional profile of each eExon. In all three eExons, we observed dispersed clusters that had high effect sizes (Figure 2A–C). For example, the first ∼250 bases and the last 120 bases of SORL1 exon 17 did not have a strong effect on enhancer activity, while nucleotide substitutions at positions 259–328, which are mostly coding sequences, had a significant effect on enhancer activity (Figure 2A). For TRAF3IP exon 2, the first ∼330 bases and the last ∼240 bases did not have a significant effect on enhancer activity, while almost all nucleotide substitutions at protein coding positions 333–409 had a significant impact (Figure 2B). PPARG exon 6 had clusters of nucleotide substitutions along the entire tested sequence with a significant impact on enhancer activity (Figure 2C). These results point to the existence of discrete regions in these eExons that modulate enhancer activity and also code for important protein domains.

Enhancer function is integrated with protein function

We next investigated whether the functional regions of the enhancer are subjective to the protein code or impartial to it by analyzing the effect-size of synonymous versus non-synonymous SNVs in these three eExons. We divided the functional SNVs (Fold change (FC) ≥1.2) to synonymous and non-synonymous and compared the effect size of synonymous versus non-synonymous SNVs on enhancer function. We found no significant differences between synonymous and non-synonymous SNVs and their effect-sizes (p-value = 0.67; Fisher test). A similar analysis for just the high functional SNVs (FC≥2) also showed no significant differences (p-value = 0.36; Fisher test). Thus, our results suggest that the essential functional enhancer sequences are intertwined with the protein coding sequences.

Functional clusters co-localize with transcription factor binding sites

We analyzed the profiles of each eExon to determine whether SNVs that significantly affect enhancer activity overlap with liver-associated TFBS. We found several positions separated by less than ∼6 nucleotides that had significantly correlated effect sizes (P≤0.01) (Figure S3 A–C), suggesting that these clusters might represent TFBS. Indeed, when we analyzed these eExons for liver-associated TFBS, we observed a striking overlap between predicted liver-associated TFBS and these clusters (Table S5; Figure 2D,E). Across the three tested eExons, AP-1 and HNF4A were the most prevalent TFBS overlapping these clusters. For example, a predicted HNF4A binding site at position 259–272 (Figure 2D) and a predicted AP-1 binding site at position 322–327 (Figure 2E) in SORL1 exon 17, overlap a cluster of SNVs that significantly affect enhancer activity. All the mutations overlapping the AP-1 binding site had a negative effect on enhancer activity, with the notable exception of a variant with a positive effect (325 G>C) that likely increased the AP-1 binding affinity (Figure 2E). In addition, we also observed mutations that generated novel TFBS that increased enhancer activity. For example, in TRAF3IP2 exon 2, 373G>T, and two neighboring nuclear factor 1(NF-1) binding sites, 393A>T, 399T>A, in SORL1 exon 17 were created by nucleotide substitutions and subsequently increased enhancer activity (Figure S3D–E).

To assess whether the binding of AP-1 and HNF4A is affected by these SNVs, we carried out co-transfection assays in HEK293T, a human embryonic kidney cell line where these three eExons are inactive (Figure 3A). Co-transfection of AP-1 and/or HNF4A (see online methods) along with the three enhancer vectors showed significant enhancer activity for all three eExons, suggesting that AP-1 and HNF4A regulate their activity (Figure 3A). We then evaluated whether SNVs that alter AP-1 and HNF4A binding sites could change their enhancer activity. For SORL1 exon 17, deleterious mutations in predicted HNF4A (260G>T, 264G>A) and AP-1 (322T>G) TFBS reduced enhancer activity when the applicable TF was transfected, while SNVs (393A>T, 399T>A) coinciding with NF-1 had similar enhancer activity as the reference sequence (Figure 3B). A SNV in TRAF3IP2 exon 2 (373G>T) that generated a predicted AP-1 binding site led to significant enhancer activity both with and without the transfected TFs (Figure 3B). Not all SNVs that altered predicted TFBSs led to the anticipated enhancer activity changes in our assay. For example, SNVs 384T>A and 390T>A in PPARG exon 6, that alter an AP-1 TFBS, did not reduce enhancer activity in AP-1 transfected cells (Figure 3B). Combined, our results demonstrate that liver expressed TFs, such as AP-1 and HNF4A, modulate the enhancer activity of these eExons and mutations in their predicted TFBSs could partially explain the effect of SNVs on enhancer activity.

**Fig. 3. HNF4A and AP-1 alter eExon enhancer activity.**

Tissue-specific regulatory profiles of eExons

To determine whether the distinctive TF repertoire of different cells could modulate enhancer activity, we set out to carry a similar MPRA analysis for all three eExons in a different cell type and compare the results to the regulatory profile from mouse liver. Using available TF ChIP-seq datasets [22], we observed that these three eExons overlap several ChIP-seq peaks in various human cell lines (Figure 2C–E). Of note, PPARG exon 6 and TRAF3IP2 exon 2 specifically overlap AP-1 ChIP-seq peaks in human cervical cancer cells (HeLa) cells. Subsequent luciferase reporter assays showed that all three eExons are active enhancers in HeLa cells (Figure S4A). We thus set out to do a similar MPRA experiment in HeLa cells for all three eExons.

Using MPRA, we dissected the mutational profile of these three eExons and found that the functional consequence of most mutations in HeLa cells was low to modest, with ∼12% affecting activity by ≥1.2-fold and ∼1% by ≥2-fold. Comparison to mouse liver mutation profiles showed that the overall effect size of SNVs (with ≥1.2 fold change) was lower in HeLa cells. The regulatory profiles of PPARG exon 6 showed high similarity between mouse liver and HeLa cells, while discrete differences were observed for the other two exons (Figure 4). In SORL1 exon 17 for example, a deleterious mutation cluster in mouse liver that overlaps an HNF4A TFBS at positions 259–272 (Figure 2D) had no effect in HeLa cells (Figure 4A,D), while another mutation cluster at positions 310–317 had a significant effect on enhancer activity in HeLa but not in mouse liver (Figure S4B). Some functional clusters remained unchanged in both experiments (Figure S4B–D). For example, SNVs at positions 322–327 that overlap an AP-1 binding site (Figure 2E) had a similar profile in both cells with a lower effect size in HeLa. To validate our results, individual SNVs overlapping predicted TFBSs were tested in HeLa and were compared to our previous mouse liver results (Figure S2). Overall, we observed luciferase fold-changes that correlated with their corresponding MPRA results (R = 0.77; see methods). Combined, our results demonstrate that the regulatory profile of an enhancer can change between different cell types.

**Fig. 4. Regulatory profiles of eExons in mouse liver and HeLa cells.**

Discussion

In this study, we analyzed enhancer-associated ChIP-seq datasets (H3K4me1, H3K27ac, p300) from human hepatocytes and mouse liver and observed that on average ∼6% of peaks overlap coding exons after excluding the first exon. Enhancer assays for 15 selected exons that overlap these peaks showed that 8 and 10 of them are functional enhancers in adult mice and human HepG2 cells, respectively. Our observed ∼50% success rate is in line with previous reports that tested eExons for their developmental enhancer activity [5], [9]. However, it is worth noting that in this study we further selected for potentially positive enhancers by looking at ≥25% exon-peak overlap and additional transcription factor ChIP-seq datasets which probably increased our functional enhancer success rate. Also of note, while human SORL1 is expressed in the liver [25], mouse Sorl1 is not expressed in this tissue [26], [27], and the human sequence we used showed mouse liver enhancer activity. Previous work has shown that when human sequences are tested for enhancer activity, even in zebrafish, they can portray regulatory activity even if they do not have homologous sequences in that organism [35], [36].

Using MPRAs, we measured the distribution of effect sizes of all possible SNVs in three liver eExons in vivo. The three eExon MPRA displayed comparable functional profiles to previously reported liver enhancer MPRA [17], and similarly show that the majority of SNVs have modest effects on transcriptional activity. Since we tested the ChIP-seq peak which includes not only the exonic sequence but also the adjacent intronic sequences, the distribution of effect sizes allowed us to determine the location of the enhancer core and whether the protein coding sequence is required for enhancer activity. For two eExons, SORL1 exon 17 and TRAF3IP2 exon 6, the distribution of SNVs with a significant effect on enhancer activity were specifically clustered in coding sequences that encode important protein domains. In addition, non-synonymous mutations in all three eExons had deleterious effects on enhancer activity, indicating that the genetic code can contain overlapping functional information. Interestingly, the dual function of eExons is not only restricted to protein coding function, as the cluster with the highest impact on enhancer activity in SORL1 exon 17 overlaps a donor splice site (Figure 2E). Alternative splicing occurs simultaneously with nascent RNA transcription [2] and these eExons could also be splicing regulators. These results suggest that mutations in eExons could lead to multiple phenotypes, both from disruption of the protein function and transcriptional regulation.

Our study has several caveats. One caveat is that we are not testing the enhancer activity of these sequences in their natural setting. Similar to most standard enhancer assays, our study takes out these sequences from their natural settings and places them in front of a minimal promoter that is different than their target promoter. In addition, the transcriptional status of these exons in concert with their enhancer activity is not being assessed. Future studies that create knockin mice to selected exon coding mutations that have a deleterious effect on enhancer activity, as determined by our MPRA, could potentially assay this. We also do not know the target genes that are regulated by these eExons. The genes these exons reside in are expressed in most of the tissues and cell types that we analyzed (Table S2), but this does not necessarily establish that they are their target genes. Furthermore, analysis of available chromatin interaction Hi-C datasets [37], [38] from different cell types (IMR90, hESC) suggest that these eExons interact with their gene promoter region as well as nearby gene promoters and other intergeneic regions (Table S7).

An interesting and still open question is what sort of evolutionary mechanism permitted coding exons to acquire another function as transcriptional enhancers. Two different forces could have been the drivers for their evolutionary constraint: protein function or transcription regulatory function, such as enhancer activity. Regarding protein constraint, previous studies have shown that 70% of amino acids in a protein can be altered while maintaining its structure and function [39]. This suggests that, in addition to synonymous sites, non-synonymous changes could allow for substantial flexibility, accommodating for enhancer function. Comparison between synonymous and non-synonymous SNVs in our study did not find any significant differences in their effect on enhancer activity. Supporting this is a recent study that used genomic deoxyribonuclease I footprinting analysis within synonymous and non-synonymous sequences and found that transcription factor binding could possibly impose a functional constraint of both the regulatory code and amino acid choice [14]. Analysis of synonymous constraint elements (SCE) [40], which are significantly conserved synonymous regions, found that only ∼6% of them overlap exonic human hepatocytes and mouse liver ChIP-seq peaks (Table S6). Four of the 15 tested exons overlap SCE, and PPARG exon 6 was the only eExon tested by MPRA to overlap an SCE (Table S2). These results suggest that at least for our selected eExons, a tendency for synonymous constraint was not observed (Table S2). The evolutionary constraint of noncoding enhancers has frequently served as a proxy for functional constraint [41], [42], but studies have also shown that many noncoding enhancers evolved rapidly and that mammalian genomes contain large numbers of evolutionarily young, sometimes species-specific, enhancers [43], [44]. Therefore, TFs that play a role in enhancer activation might not discriminate between coding and noncoding sequences and eExons might not be under additional evolutionary constraint than typical coding sequences.

The activity of enhancers, both coding and noncoding, is regulated by the binding of TFs. The cell-specific TF repertoire regulates enhancer activity levels and depends on motif positioning and larger regulatory context. Techniques such as ChIP-seq can identify the TF binding repertoire of a certain enhancer, but are largely limited in their resolution, technological biases and require a priori knowledge about potential TFs that bind to the enhancer. By testing the regulatory profile of three enhancers in two cell types, we identified specific TFBS overlapping positions that can have a different impact on enhancer activity in another cell type. Moreover, functional positions that do not overlap predicted TFBS could identify novel TFs that control enhancer activity in that cell type. Enhancer sequences can encompass a wide repertoire of TFBS. Our study suggests that a collection of TFs could dictate the activity of enhancers in a specific cell type, while a different collection of TFs could dictate enhancer activity in another cell type.

Materials and Methods

Data availability

In addition to raw sequencing reads available in the NCBI Short Read Archive (SRA) under accession SRP044727, a full list of mutations, along with their associated effect sizes and p-values, are provided as Table S5.

eExon mutation library construction

Enhancer haplotypes were generated from short, doped oligonucleotides using polymerase cycling assembly (PCA) as previously described [17]. Sets of overlapping oligonucleotides for each eExon were designed using DNAWorks [45]. Oligonucleotides were synthesized by Integrated DNA Technologies. All positions corresponding to the enhancer region (except for the flanking primer landing sites on either side of the enhancer) were synthesized using a hand-mix doped at a ratio of 97∶1∶1∶1 (that is, designated base at a frequency of 97%, and every other base at a frequency of 1%). The degenerate tags were first cloned to create a complex library of tagged pGL4.23 plasmids as described in Patwardhan et al, 2012 [17]. Briefly, the tag oligonucleotide (TAG_OLIGO) was made double-stranded using primer extension in a 50 µl reaction volume with 1× iProof Master Mix, 0.5 µg single-stranded tag oligo, 0.5 µg reverse primer (TAG_EXTEND). The reaction was incubated at 95°C for 3 min, 61°C for 10 min and then 72°C for 5 min. The product was purified using a QIAquick column and eluted in 50 µl EB. It was further subjected to ExoI treatment in 40 µl reaction volume for 1 h at 37°C to degrade any remaining single-stranded DNA, and purified again using QIAquick columns. The resulting double-stranded tag oligo was then cloned into pGL4.23 at the XbaI site (at 1,799 bp) using standard InFusion (Clontech) protocol. The InFusion reaction was diluted to 100 µl using TE8. 1.5 µl of the diluted cloning reaction was used to transform 50 µl of chemically competent FusionBlue cells (Clontech) using standard protocols. The enhancer haplotypes were then cloned into the tagged pGL4.23 plasmids using standard InFusion protocol. 2.5 µl of the cloning reaction was used to transform 50 µl of chemically competent cells (Stellar) using standard protocol. For each of the three enhancers, eight transformation reactions were pooled and grown overnight in 50 ml liquid culture at 37°C in a shaking incubator. DNA was extracted using the Invitrogen ChargeSwitch Midi Prep Kit.

Cell culture and reporter assays

1–2×10⁵ of HepG2, HEK293T and HeLa cells (ATCC) were cultured in 24 well plates overnight using standard protocols and were transfected with 500 ng of the enhancer candidate cloned into pGL4.23 plasmid, along with 50 ng of the Renilla vector, to correct for transfection efficiency, using X-tremeGENE HP DNA transfection reagent (Roche). After 24 hours, the enhancer activity was measured using the Dual-Luciferase reporter assay (Promega) on a Synergy 2 microplate reader (BioTek Instruments). In HEK293T, the enhancer variant vectors were co-transfected along with 100 ng of HNF4α2 and HNF4α8 expression constructs for HNF4A [46] and c-Fos and c-Jun for AP-1, which is a heterodimer of c-Fos and c-Jun [47]. 500 ng of the luciferase reporter construct pZL.HIV.LTR.AI-4 [48], that contains four tandemerized HNF4 response elements was used as a positive control for HNF4A transfection and 500 ng of pGL4.13 (Promega) that contains an AP-1 responsive element was used as a positive control for c-Fos and c-Jun transfections [49]. For MPRA libraries, 1–2×10⁶ HeLa cells were cultured on 10 mm culture dishes overnight using standard protocols and transfected with 10 µg of the constructed library. After 24 hours, cells were harvested and total RNA was purified using the RNAeasy maxi prep kit (Qiagen) and subjected to mRNA selection (Oligotex, Qiagen) following the manufacturer's protocols.

Hydrodynamic tail vein enhancer assay

For the hydrodynamic tail vein assay, each tested sequence or MPRA library, cloned in the pGL4.23[luc2] vector, were injected (10 µg) alongside 2 µg of pGL4.74[hRluc/TK] vector to correct for injection efficiency, into five CD1 mice (Charles River Laboratories) using the TransIT EE hydrodynamic gene delivery system (Mirus Bio LLC) according to the manufacturer's protocol. Negative (empty pGL4.23[luc2]) and positive (ApoE liver enhancer [50] controls (n = 5) were also injected in parallel at each injection date/experiment. After 24 h, the mice were euthanized, livers harvested and used to make RNA in order to sequence transcribed tags (MPRA libraries) and also to measure luciferase activity (MPRA libraries and individual constructs). Approximately 1gr of mouse liver was used to purify total RNA using the RNAeasy maxi prep and 500 µg were used to select for mRNA. For luciferase measurements, livers were homogenized in passive lysis buffer (Promega), followed by centrifugation at 4°C for 30 min at 14,000 rpm. Firefly and Renilla luciferase activity in the supernatant (diluted 1∶20) were measured on a Synergy 2 microplate reader in replicates of six for each liver, using the Dual-Luciferase reporter assay system. The ratios for firefly luciferase:Renilla luciferase were determined and expressed as relative luciferase activity. All mouse work was approved by the UCSF Institutional Animal Care and Use Committee protocol number AN100160.

Sequencing of RNA-derived tags of eExon haplotype libraries

20-bp tags were identified in liver/HeLa mRNA using previously described methods [17]. Sixteen RT-PCR reactions were performed for each of the biological duplicates, which were then multiplexed and sequenced on a GAIIx (Illumina) using a custom sequencing primer (TAG_SEQ_F). Each run was either 20 or 36 cycles with an additional 6 cycles to read the indexing tag using the index sequencing primer (TAG_SEQ_INDEX). For each aliquot, reads were filtered based on the quality scores for the first 20 bases, which correspond to the degenerate tag. The corresponding number of reads per each tag was counted and only tags that were supported by at least ten reads were used for further analysis.

Associating eExon haplotypes with tags

eExon haplotypes were associated with tags as previously described [17]. Briefly, 1,000 bp segments separating eExon haplotypes and tags on the pGL4.23 plasmid were excised by digesting with HindIII, which cuts both the 3′ of the eExon haplotype, and 5′ of the tag. The digested plasmids were purified and recircularized using intramolecular ligation (T4 DNA Ligase from NEB), resulting in the tag being adjacent to the 3′ end of the eExon haplotype. The region spanning the eExon haplotype and tags was amplified from recircularized plasmids by PCR with the forward primer targeting the region immediately 5′ of the eExon haplotype (eEXON_COMMON_F) and the reverse primer targeting the region immediately 3′ of the tag (TAG_PE_R). The amplicons were then subjected to the subassembly protocol as conceptually described in Hiatt et al. [34]. The random fragmentation step was carried out using the Nextera Tn5 transposase (Illumina) instead of mechanical shearing. The Nextera reaction was purified using MinElute column (Qiagen) and size-selected by PAGE. The size selected fragments were subjected to PCR in 25 ul reaction volume with 1× KapaHiFi Hot Start Ready Mix (Kapa BioSystems), 0.5× SYBR Green I, 20% of the size-selected product and each of the primers, Nextera Adapter 1 and TAG_PE_R at 0.3 uM final concentration. Thermal cycling was carried out using BioRad Mini Opticon System using the following program: 95°C for 3 mins; and then 30 cycles of 98°C for 20 sec, 65°C for 15 sec, 72°C for 15 sec. The PCR products were purified using QiaQuick column and then sequenced on a Hi-Seq 2000. Read1 collected 101 bp of the enhancer sequence staring at random breakpoints along the enhancer. Index read collected the 20 bp tag sequence. The reads were then grouped by tag. Reads belonging to each group were then aligned to the wild type enhancer sequence to identify the mutations on the haplotype associated with that tag using custom scripts.

eExon haplotype effect size analysis

All linear regression analyses were done using the lm() or lsfit() functions available in the R Statistical Package as previously described [17]. To quantity the effect of mutation at any given position on the number of aliquots in which an enhancer haplotype was observed, a separate linear regression model was used for every position along the enhancer, with a single predictor representing whether the given position was wild-type or mutant (univariate model). Similarly, a modified model was used to include the three possible nucleotide substitutions at any position that estimates effect sizes for each position being driven by specific nucleotide substitutions (trivariate model).

For each position in each enhancer, we constructed a linear model to assess the extent to which the presence of a mutation at that position is predictive of a change in the number of RNA aliquots in which the tag associated with an enhancer haplotype was observed. This is effectively a proxy for its impact on transcriptional activation, i.e. “effect size”. Specifically, we use the term “effect size” to describe the log2 fold change in the predicted transcriptional activity, as measured by the number of RNA aliquots in which a specific haplotype appeared, relative to the wild-type. For each of the three enhancers, we first calculated effect sizes separately on the data from each mouse (16 RNA aliquots per mouse). The effect sizes for these biological replicates were highly correlated (Tail-vein: SORL1 exon 17: r = 0.88, PPARG exon 6: r = 0.91, TRAF3IP2 exon 2: r = 0.92; HeLa: SORL1 exon 17: r = 0.85, PPARG exon 6: r = 0.85, TRAF3IP2 exon 2: r = 0.8). Based on this high reproducibility and to increase resolving power, we performed all subsequent analyses after combining data across both replicates for each enhancer.

Estimation of effect size of mutation at each position along the enhancer (univariate model)

To quantity the effect of mutation at any given position on the number of aliquots in which an enhancer haplotype was observed, we built a separate linear regression model at every position along the enhancer, with a single predictor representing whether the given position was wild-type or mutant (Table S3). The predictor was thus a binary variable representing presence (1) or absence (0) of a mutation at that position.

where,

y_i = number of aliquots in which the ith haplotype was observed (referred to as aliquot counts),

X_ij = 1 if position j was mutant and 0 if position j was wild-type in the ith haplotype.

To facilitate comparison between positions and between enhancers, we calculated the “effect size” of mutation at a position j as:

All linear regression analyses were performed using the lm() or lsfit() functions available in the R Statistical Package. The p-value reported by the model for was used to judge whether the effect size was significant.

Estimation of effect size of each specific nucleotide change at each position along the enhancer (trivariate model)

To explore whether the estimated effect sizes for each position were being driven by specific nucleotide substitutions, we modified the model just described to include three predictors, each representing one of the three possible nucleotide substitutions at that position. The factors were set up as binary variables representing the presence (1) or absence (0) of the particular change at that position.

Effect sizes were then calculated from the coefficients produced by the models as follows (for k = 1,2,3):

The p-value reported by the model for was used to judge whether the effect of a given nucleotide substitution at a given position was significant.

TFBS identification

Putative TFBSs were identified using MATCH [51] to search for motifs in TRANSFAC Release 2010.3 [52]. MATCH was run independently on each individual sequence with default parameters.

Computing TF mark scores

For each position with an observed nucleotide substitution, a 51 bp segment of DNA centered at the position was used for a Position Specific Scoring Matrix (PSSM) scan. PSSMs for a subset of liver expressed TFs (Table S5) were obtained from the publicly available JASPAR and TRANSFAC databases. For each PSSM of length L, a ‘mark score’ was calculated for all subsequences of length L within the 51 bp DNA segment that overlapped the central position. A mark score for the reference subsequence (S_r) and the mutant subsequence (S_m) were calculated as:

Where B is the background distribution of nucleotides in the genome. TF motifs were only considered if either Score_r or Score_m was greater than the relative entropy score of the TF. Finally, the TF with the largest absolute score change between Score_r and Score_m is listed in Table S5. The relative entropy is defined as:

where m_i,j is entry in row I and column j of the PSSM.

Supporting Information

Zdroje

1. SaunaZE, Kimchi-SarfatyC (2011) Understanding the contribution of synonymous mutations to human disease. Nat Rev Genet 12: 683–691.

2. MercerTR, EdwardsSL, ClarkMB, NephSJ, WangH, et al. (2013) DNase I-hypersensitive exons colocalize with promoters and distal regulatory elements. Nat Genet 45: 852–859.

3. NackleyAG, ShabalinaSA, TchivilevaIE, SatterfieldK, KorchynskyiO, et al. (2006) Human catechol-O-methyltransferase haplotypes modulate protein expression by altering mRNA secondary structure. Science 314: 1930–1933.

4. LewisBP, BurgeCB, BartelDP (2005) Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120: 15–20.

5. BirnbaumRY, ClowneyEJ, AgamyO, KimMJ, ZhaoJ, et al. (2012) Coding exons function as tissue-specific enhancers of nearby genes. Genome Res

6. NeznanovN, UmezawaA, OshimaRG (1997) A regulatory element within a coding exon modulates keratin 18 gene expression in transgenic mice. J Biol Chem 272: 27549–27557.

7. DongX, NavratilovaP, FredmanD, DrivenesO, BeckerTS, et al. (2010) Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res 38: 1071–1085.

8. LampeX, SamadOA, GuiguenA, MatisC, RemacleS, et al. (2008) An ultraconserved Hox-Pbx responsive element resides in the coding sequence of Hoxa2 and is active in rhombomere 4. Nucleic Acids Res 36: 3214–3225.

9. RitterDI, DongZ, GuoS, ChuangJH Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7: e35202.

10. TumpelS, CambroneroF, SimsC, KrumlaufR, WiedemannLM (2008) A regulatory module embedded in the coding region of Hoxa2 controls expression in rhombomere 2. Proc Natl Acad Sci U S A 105: 20077–20082.

11. VeltmanJA, BrunnerHG (2012) De novo mutations in human genetic disease. Nat Rev Genet 13: 565–575.

12. WangGS, CooperTA (2007) Splicing in disease: disruption of the splicing code and the decoding machinery. Nat Rev Genet 8: 749–761.

13. ShamseldinHE, FadenMA, AlashramW, AlkurayaFS (2012) Identification of a novel DLX5 mutation in a family with autosomal recessive split hand and foot malformation. J Med Genet 49: 16–20.

14. StergachisAB, HaugenE, ShaferA, FuW, VernotB, et al. (2013) Exonic transcription factor binding directs codon choice and affects protein evolution. Science 342: 1367–1372.

15. KheradpourP, ErnstJ, MelnikovA, RogovP, WangL, et al. (2013) Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res 23: 800–811.

16. MelnikovA, MuruganA, ZhangX, TesileanuT, WangL, et al. (2012) Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat Biotechnol 30: 271–277.

17. PatwardhanRP, HiattJB, WittenDM, KimMJ, SmithRP, et al. (2012) Massively parallel functional dissection of mammalian enhancers in vivo. Nat Biotechnol 30: 265–270.

18. HaberleV, LenhardB (2012) Dissecting genomic regulatory elements in vivo. Nat Biotechnol 30: 504–506.

19. SharonE, KalmaY, SharpA, Raveh-SadkaT, LevoM, et al. (2012) Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat Biotechnol 30: 521–530.

20. ShenY, YueF, McClearyDF, YeZ, EdsallL, et al. (2012) A map of the cis-regulatory sequences in the mouse genome. Nature 488: 116–120.

21. SakharkarMK, ChowVT, KangueaneP (2004) Distributions of exons and introns in the human genome. In Silico Biol 4: 387–393.

22. DunhamI, KundajeA, AldredSF, CollinsPJ, DavisCA, et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57–74.

23. ZhangG, BudkerV, WolffJA (1999) High levels of foreign gene expression in hepatocytes after tail vein injections of naked plasmid DNA. Hum Gene Ther 10: 1735–1737.

24. KimMJ, AhituvN (2013) The hydrodynamic tail vein assay as a tool for the study of liver promoters and enhancers. Methods Mol Biol 1015: 279–289.

25. MorwaldS, YamazakiH, BujoH, KusunokiJ, KanakiT, et al. (1997) A novel mosaic protein containing LDL receptor elements is highly conserved in humans and chickens. Arterioscler Thromb Vasc Biol 17: 996–1002.

26. PallesenLT, VaegterCB (2012) Sortilin and SorLA regulate neuronal sorting of trophic and dementia-linked proteins. Mol Neurobiol 45: 379–387.

27. Posse De ChavesEI, VanceDE, CampenotRB, KissRS, VanceJE (2000) Uptake of lipoproteins for axonal growth of sympathetic neurons. J Biol Chem 275: 19883–19890.

28. RyzhakovG, BlazekK, UdalovaIA (2011) Evolution of vertebrate immunity: sequence and functional analysis of the SEFIR domain family member Act1. J Mol Evol 72: 521–530.

29. EllinghausE, EllinghausD, StuartPE, NairRP, DebrusS, et al. (2010) Genome-wide association study identifies a psoriasis susceptibility locus at TRAF3IP2. Nat Genet 42: 991–995.

30. HuffmeierU, UebeS, EkiciAB, BowesJ, GiardinaE, et al. (2010) Common variants at TRAF3IP2 are associated with susceptibility to psoriatic arthritis and psoriasis. Nat Genet 42: 996–999.

31. XiaYF, LiYD, LiX, GengJG (2002) Identification of alternatively spliced Act1 and implications for its roles in oncogenesis. Biochem Biophys Res Commun 296: 406–412.

32. QianY, QinJ, CuiG, NaramuraM, SnowEC, et al. (2004) Act1, a negative regulator in CD40- and BAFF-mediated B cell survival. Immunity 21: 575–587.

33. BarakY, NelsonMC, OngES, JonesYZ, Ruiz-LozanoP, et al. (1999) PPAR gamma is required for placental, cardiac, and adipose tissue development. Mol Cell 4: 585–595.

34. HiattJB, PatwardhanRP, TurnerEH, LeeC, ShendureJ (2010) Parallel, tag-directed assembly of locally derived short sequence reads. Nat Methods 7: 119–122.

35. FisherS, GriceEA, VintonRM, BesslingSL, McCallionAS (2006) Conservation of RET regulatory function from human to zebrafish without sequence similarity. Science 312: 276–279.

36. McGaugheyDM, VintonRM, HuynhJ, Al-SaifA, BeerMA, et al. (2007) Metrics of sequence constraint overlook regulatory sequences in an exhaustive analysis at phox2b. Genome Res 10: 10.

37. DixonJR, SelvarajS, YueF, KimA, LiY, et al. (2012) Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485: 376–380.

38. JinF, LiY, DixonJR, SelvarajS, YeZ, et al. (2013) A high-resolution map of the three-dimensional chromatin interactome in human cells. Nature 503: 290–294.

39. NgPC, HenikoffS (2006) Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7: 61–80.

40. LinMF, KheradpourP, WashietlS, ParkerBJ, PedersenJS, et al. (2011) Locating protein-coding sequences under selection for additional, overlapping functions in 29 mammalian genomes. Genome Res 21: 1916–1928.

41. ViselA, PrabhakarS, AkiyamaJA, ShoukryM, LewisKD, et al. (2008) Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat Genet 40: 158–160.

42. MarguliesEH, CooperGM, AsimenosG, ThomasDJ, DeweyCN, et al. (2007) Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res 17: 760–774.

43. BlowMJ, McCulleyDJ, LiZ, ZhangT, AkiyamaJA, et al. (2010) ChIP-Seq identification of weakly conserved heart enhancers. Nat Genet 42: 806–810.

44. SchmidtD, WilsonMD, BallesterB, SchwaliePC, BrownGD, et al. (2010) Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science 328: 1036–1040.

45. HooverDM, LubkowskiJ (2002) DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res 30: e43.

46. BoganAA, Dallas-YangQ, RuseMDJr, MaedaY, JiangG, et al. (2000) Analysis of protein dimerization and ligand binding of orphan receptor HNF4alpha. J Mol Biol 302: 831–851.

47. EferlR, WagnerEF (2003) AP-1: a double-edged sword in tumorigenesis. Nat Rev Cancer 3: 859–868.

48. SladekFM, RuseMDJr, NepomucenoL, HuangSM, StallcupMR (1999) Modulation of transcriptional activation and coactivator interaction by a splicing variation in the F domain of nuclear receptor hepatocyte nuclear factor 4alpha1. Mol Cell Biol 19: 6509–6522.

49. HermansenR, SierraMA, JohnsonJ, FriezM, MilavetzB (1996) Identification of Simian virus 40 promoter DNA sequences capable of conferring restriction endonuclease hypersensitivity. J Virol 70: 3416–3422.

50. SimonetWS, BucayN, LauerSJ, TaylorJM (1993) A far-downstream hepatocyte-specific control region directs expression of the linked human apolipoprotein E and C-I genes in transgenic mice. J Biol Chem 268: 8221–8229.

51. KelAE, GosslingE, ReuterI, CheremushkinE, Kel-MargoulisOV, et al. (2003) MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 31: 3576–3579.

52. MatysV, Kel-MargoulisOV, FrickeE, LiebichI, LandS, et al. (2006) TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34: D108–110.