Genetic Variants and Their Interactions in the Prediction of Increased Pre-Clinical Carotid Atherosclerosis: The Cardiovascular Risk in Young Finns Study


The relative contribution of genetic risk factors to the progression of subclinical atherosclerosis is poorly understood. It is likely that multiple variants are implicated in the development of atherosclerosis, but the subtle genotypic and phenotypic differences are beyond the reach of the conventional case-control designs and the statistical significance testing procedures being used in most association studies. Our objective here was to investigate whether an alternative approach—in which common disorders are treated as quantitative phenotypes that are continuously distributed over a population—can reveal predictive insights into the early atherosclerosis, as assessed using ultrasound imaging-based quantitative measurement of carotid artery intima-media thickness (IMT). Using our population-based follow-up study of atherosclerosis precursors as a basis for sampling subjects with gradually increasing IMT levels, we searched for such subsets of genetic variants and their interactions that are the most predictive of the various risk classes, rather than using exclusively those variants meeting a stringent level of statistical significance. The area under the receiver operating characteristic curve (AUC) was used to evaluate the predictive value of the variants, and cross-validation was used to assess how well the predictive models will generalize to other subsets of subjects. By means of our predictive modeling framework with machine learning-based SNP selection, we could improve the prediction of the extreme classes of atherosclerosis risk and progression over a 6-year period (average AUC 0.844 and 0.761), compared to that of using conventional cardiovascular risk factors alone (average AUC 0.741 and 0.629), or when combined with the statistically significant variants (average AUC 0.762 and 0.651). The predictive accuracy remained relatively high in an independent validation set of subjects (average decrease of 0.043). 
These results demonstrate that the modeling framework can utilize the “gray zone” of genetic variation in the classification of subjects with different degrees of risk of developing atherosclerosis.

Published in the journal: PLoS Genet 6(9): e1001146. doi:10.1371/journal.pgen.1001146
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1001146

Introduction

A major challenge of medical genetics is to determine an optimal set of genetic markers, typically in the form of single nucleotide polymorphisms (SNPs), which, when combined with conventional risk factors, could be used in individual-level risk prediction, classification and clinical decision-making. However, genome-wide association studies (GWAS) have demonstrated that the ubiquitous heritability of most common disorders is due to multiple SNPs of small effect size and even an aggregate of these effects is not yet predictive enough for clinical utility [1]. It has therefore been suggested that the traditional case-control studies, which focus on qualitative phenotypes such as diagnosed cases versus controls, could be complemented by population-based cohort studies, which profile quantitative clinical phenotypes and how they change over time in individuals who are representative of the general population. Consequently, certain common disorders may be interpreted as being the extremes of the quantitative phenotypes that are continuously distributed over the population [1]. Comparing various ranges of the low and high extremes of such quantitative traits, rather than dichotomizing the same distribution exclusively into cases and controls, can offer the means to increase the statistical power of the variants [2]–[5], uncover molecular pathways and networks behind various subtypes and progression stages [6], and eventually even help to improve the early diagnosis, treatment and prevention of the most extreme cases. The objective here was to systematically investigate the potential of this extreme selection strategy to provide predictive insights into the early development of atherosclerosis, using the carotid IMT as a quantitative phenotype and our unique population-based follow-up study of atherosclerosis precursors as a basis for sub-sampling of subjects with increasing disease risk.

Atherosclerosis is a common disorder which develops due to the complex interplay of various genetic and environmental factors, most of which are still poorly understood. It is known that conventional cardiovascular risk factors, such as obesity, elevated blood pressure and high low-density lipoprotein (LDL) cholesterol levels, play an important role in the risk of its progression into severe clinical manifestations, for instance, coronary heart disease (CHD) [7], [8]. Recently, a number of genetic risk markers that associate with coronary disease outcomes and serum lipid concentrations have also been identified in case-control settings [9]–[21]. However, the relative contribution of genetic variation to the early stages of the cardiovascular disease remains unclear. From the experimental design point of view, the subtle inter-individual phenotypic variability makes it difficult to prognosticate clear-cut cases and controls in a pre-clinical setting, thereby limiting the capability of the cross-sectional case-control designs in distinguishing the variants associated with an increased progression risk from the background variability. An additional challenge is that even in the absence of significant single-marker effects, multiple genetic markers from distinct molecular pathways may act synergistically when combined, leading to different atherosclerosis phenotypes. Confounding inter-individual variation and interactions across the genetic and conventional risk factors can also mask the phenotypic variation, especially when studying composite phenotypes such as LDL-cholesterol levels [22]. Therefore, a well-defined quantitative measurement that reflects the full spectrum of the disease progression is needed, together with an efficient computational approach, to systematically explore the genotype-phenotype relationships across different development stages of atherosclerosis.

Measurement of the carotid artery intima-media thickness (IMT) is an established, intermediate phenotype of atherosclerosis that has been used, for instance, to investigate the development of pre-clinical atherosclerosis [23], [24], and to predict the onset of future cardiovascular events, such as myocardial infarction and stroke [25]–[27]. It can be measured non-invasively through the use of ultrasound imaging in large populations of healthy subjects, without the biases related to clinically diagnosed cases and controls [28], making it an ideal quantitative measurement for stratifying subjects into various risk classes. However, comparisons of such risk classes using statistical significance testing procedures that consider only one SNP at a time may yield sub-optimal findings when exploring the genotype-specific effects of a large number of SNPs, given that these modest phenotypic effects are likely to be characterized by substantial genetic heterogeneity among multiple variants [29]–[31]. Accordingly, it has been argued that the statistics being used to identify variants that are significantly associated with the disease risk (typically odds ratios or p-values for association) are not the most appropriate means for evaluating the predictive or clinical value of the genetic profiles [32], [33]. For example, the individual SNPs with the strongest statistical support in coronary artery disease-related case-control studies seem to have only a minor, if any, role in predicting carotid IMT or its progression, when compared to the conventional risk factors [34], [35]. In fact, these susceptibility variants are able to provide only a marginal and inconsistent improvement even in the discrimination of the CHD cases or prediction of cardiovascular events [36]–[41], thus hindering the value of these ‘top hits’ for diagnostic prediction. 
Moreover, additional challenges stem from the identification of gene-gene and gene-environment interactions, which are thought to be profoundly important in the development of many complex diseases [29], [30], [42].

In the present analysis from the Young Finns Study, we took a more holistic approach towards revealing the contribution of genetic variation to the early progression of atherosclerosis. The approach was based on a stratified sampling and comparison of the increasing risk classes from our longitudinal population cohort. Rather than using the conventional single-SNP statistical significance testing in the identification of risk-modifying variants and their interactions, we explicitly searched for those subsets of SNPs that are the most predictive of the increasing risk classes by means of a predictive modeling framework using a machine learning-based SNP-subset selection procedure. The predictive approach was used here to mine those associations that did not necessarily meet the stringent levels of statistical significance at the level of individual SNPs, yet still contributed significantly to the combined predictive power at the level of SNP-subsets. In particular, we addressed the following questions: (i) whether the genetic variants can improve the prediction accuracy of IMT-based risk classes beyond that obtained with conventional risk factors; (ii) which variants are the most predictive of the subjects that show extreme IMT levels either at the baseline or in the follow-up study, or progression over the 6-year period; (iii) whether the predictive SNP-panels also include other variants than those risk markers identified in the previous case-control association studies; and (iv) whether the machine learning-based SNP selection can provide variants with increased predictive power compared to the SNPs with the greatest statistical significance in the present study population. We also illustrate how the predictive modeling framework can be employed to identify epistasis interactions among genetic variants that are related to the disease progression. 
Finally, as the first step toward elucidating functional mechanisms behind the genetic variants and their interactions, we also mapped the biological pathways and processes that underlie those variants most predictive of the extreme progression cases.

Results

The baseline study cohort in 2001 comprised 1,027 subjects from the Finnish general population, aged 24–39 years, with complete data including both the ultrasound-based imaging of the carotid IMT and the blood sample-based genotyping of the candidate SNPs (see Table S1); of these subjects, 813 also participated in the 2007 follow-up study of the IMT progression (see Materials and Methods for details). The relative contribution of the SNPs to the individual IMT levels was evaluated by means of a predictive modeling framework, in which the study subjects were first divided into gradually increasing low-risk and high-risk classes according to the quantile points, say q and (1-q), of their pooled IMT distribution (q ranges from 5% to 25%; see Figure 1). A non-linear Bayesian classifier was implemented here as the predictive model (see Materials and Methods for details). Using both the genetic and conventional risk factors collected in the baseline study in 2001 as predictor variables, we determined the most predictive risk factor combinations separately for both the 2001 and 2007 IMT risk classes, as well as the IMT progression between 2001 and 2007. For comparison, the most significant genetic variants were determined using single-SNP statistical testing for the same risk classes. The area under the receiver operating characteristic curve (AUC), with cross-validation, was used to evaluate the predictive value of the different factor combinations, followed by independent validation set-based assessment of how well the predictive models can generalize to independent sets of subjects.
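The extreme selection step can be illustrated with a minimal Python sketch. It assumes that the low-risk class contains subjects at or below the q-quantile of the pooled IMT distribution and the high-risk class contains subjects at or above the (1-q)-quantile; the function names, the interpolation rule and the IMT values are hypothetical and are not taken from the study.

```python
def quantile(sorted_values, p):
    """Linear-interpolation quantile of an already sorted list (0 <= p <= 1)."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def extreme_classes(imt, q):
    """Return (low_idx, high_idx, low_cut, high_cut) for quantile level q."""
    ordered = sorted(imt)
    low_cut = quantile(ordered, q)        # upper bound of the low-risk class
    high_cut = quantile(ordered, 1 - q)   # lower bound of the high-risk class
    low_idx = [i for i, v in enumerate(imt) if v <= low_cut]
    high_idx = [i for i, v in enumerate(imt) if v >= high_cut]
    return low_idx, high_idx, low_cut, high_cut


# Hypothetical IMT values in mm, for illustration only.
imt_mm = [0.45, 0.52, 0.55, 0.58, 0.60, 0.61, 0.63, 0.66, 0.70, 0.85]
low, high, low_cut, high_cut = extreme_classes(imt_mm, q=0.25)
print(low, high)
```

Subjects between the two cut-offs are left out of the comparison, which is why the classes become larger and phenotypically more heterogeneous as q grows from 5% to 25%.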

**Fig. 1. Distributions of intima-media thickness (IMT) of the study subjects.**

Clinical characteristics of the study subjects

The quantitative distributions of the levels of IMT and its progression over the 6-year period are shown in Figure 1. The IMT levels in the study population showed a slightly positive-skewed enrichment of subjects with higher IMT values indicating an increased risk of atherosclerosis (Figure 1A). There was a significant difference in the IMT distributions between the 2001 and 2007 follow-up studies (Kolmogorov-Smirnov D = 0.234, p<0.001). As expected, the majority of the conventional risk factors measured in 2001, including age, sex and BMI, were strongly correlated with the IMT levels both in the 2001 and 2007 studies (Table 1). However, only two risk factors, waist circumference and apolipoprotein B (ApoB), correlated with the IMT progression (the 2007-2001 change). In particular, even if the age was the most significant correlate of the IMT levels in 2001 and 2007, its linear explanatory power turned out to be insignificant for the IMT progression. Accordingly, the distributions of the IMT progression were similar in the groups of younger and older subjects (D = 0.0791, p>0.10; Figure 1B). To keep the non-linear prediction problem as general as possible, the age-groups and sexes were pooled into a single continuous distribution; however, all the predictive models were adjusted for the baseline conventional risk factors (Table 1). This enabled us to examine, for instance, the added contribution of genetic variation to the IMT progression not explained by the variation in the conventional cardiovascular risk factors.
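The statistic D quoted above is the two-sample Kolmogorov-Smirnov statistic, that is, the largest vertical gap between two empirical distribution functions. The following Python sketch computes D only (not the associated p-value); the sample values are hypothetical and serve purely as an illustration.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every observation equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        # i / len(a) and j / len(b) are the two empirical CDFs evaluated at x.
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical IMT values (mm) from two measurement rounds.
round_one = [0.50, 0.55, 0.60, 0.65]
round_two = [0.60, 0.65, 0.70, 0.75]
print(ks_statistic(round_one, round_two))
```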

**Tab. 1. The baseline characteristics in 2001 along with their correlations with the 2007 level and progression of intima-media thickness (IMT).**

Prediction of baseline IMT using genetic variants

To assess whether the genetic variants can increase the prediction accuracy of the risk classes beyond that obtained with the conventional risk factors alone, we used the predictive modeling framework with a machine learning-based SNP selection. The predictive risk factor combinations selected using this procedure were able to significantly improve the prediction of the subjects across the spectrum of low-risk and high-risk classes in 2001 (Figure 2A), when compared to using the conventional risk factors (CRFs) either alone or combined with those SNPs that were significantly associated with the low- and high-risk differences in the study subjects (the significances of the SNPs are detailed in Table S2). Interestingly, the panel of genetic risk markers established in the previous case-control association studies alone had a predictive power similar to that of a random classifier (average AUC 0.489), and these SNPs could not improve the prediction of the IMT risk classes over and above the conventional risk factors (Established SNPs and CRFs; Figure 2). As expected, the predictive accuracy gradually decreased when moving from the 5% to the 25% quantile level, as the risk classes became phenotypically more heterogeneous in terms of the quantitative IMT-levels (see Figure 1A). The variants most predictive of the subjects with 15% of the lowest and highest IMT-levels in 2001 are listed in Table 2, together with their gene annotation information and the single-SNP statistical and predictive powers.
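To make the AUC values easier to interpret, the following Python sketch computes the AUC as the probability that a randomly chosen high-risk subject receives a higher predicted score than a randomly chosen low-risk subject, with ties counted as one half. A value near 0.5 corresponds to a random classifier and 1.0 to perfect discrimination. The scores and labels are hypothetical, and this pairwise formulation is a generic definition rather than the software used in the study.

```python
def auc(scores, labels):
    """Pairwise AUC: fraction of (high-risk, low-risk) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # high-risk scores
    neg = [s for s, y in zip(scores, labels) if y == 0]  # low-risk scores
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical predicted high-risk probabilities and true classes.
example_scores = [0.9, 0.8, 0.4, 0.3]
example_labels = [1, 0, 1, 0]
print(auc(example_scores, example_labels))
```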

**Fig. 2. Prediction accuracy as a function of increasing risk classes.**

**Tab. 2. The single nucleotide polymorphisms (SNPs) predictive of the subjects with 15% lowest and highest IMT levels in 2001.**

Prediction of follow-up IMT using genetic variants

The predictive power of the genetic variants that were selected using the machine learning-based procedure increased further when predicting the risk classes in the 2007 follow-up, even if the genetic and conventional risk factors collected in only the baseline study were used as predictors (Figure 2B). This result can partly be attributed to the progression of the disease condition over the six years in a part of the study subjects (see Figure 1A). In particular, the classes of the most extreme levels of the IMT could be predicted with reasonably high accuracy also using single-SNP statistical testing, whereas the panel of established SNPs either with or without the conventional risk factors again showed much poorer performance (Figure 2B). These results suggest that the genetic variants, especially those that were identified using the machine learning-based SNP selection (see Table 3), can encode significant information according to which it is possible to predict subjects who will belong to different risk classes later in their lives with accuracies beyond that obtained with the conventional risk factors. We note that the baseline 2001 IMT-level was not used in the reported results when predicting the 2007 risk classes; however, in the case when the baseline IMT-level was used as an additional predictor, the prediction accuracies became very close to perfect discrimination (AUC ranged from 0.920 to 0.999). This shows that the non-linear modeling approach could learn also the significant linear correlation between the 2001 and 2007 IMT-levels (r = 0.582; Table S3).

**Tab. 3. The single nucleotide polymorphisms (SNPs) predictive of the subjects with 15% lowest and highest IMT levels in 2007.**

Genetic variants predisposing to IMT progression

We next searched explicitly for those factors that are most predictive of the subjects who show extreme progression in their IMT-levels between the two follow-up studies. When applying the machine learning-based procedure to prediction of the subjects with increasing changes in their IMT-levels between the study years 2001 and 2007, the selected SNPs could again systematically increase the predictive power across all the progression risk classes, compared to the accuracy obtained with the conventional risk factors either alone or when combined with the panels of variants identified in the previous case-control studies or in the present study population using single-SNP statistical testing (Figure 2C). In this case, however, the prediction accuracies were no longer monotonically decreasing functions of the quantile point (q). In particular, the 10% risk class was found to be problematic, which could be due to the particular IMT cutoff values used in its quantitative definition. Interestingly, the SNP set most predictive of the IMT progression contained a relatively large number of variants with modest contributions to the predictive power; of these variants, only one was among the established markers identified in the previous case-control studies (Table 4). Even if the IMT progression proved relatively difficult to predict, the many novel markers support the potential and added value of genetic variation, especially when evaluating susceptibility to the most extreme progression risk class (q = 5%).

**Tab. 4. The single nucleotide polymorphisms (SNPs) predictive of the subjects with 15% lowest and highest IMT changes from 2001 to 2007.**

Epistasis interactions between the predictive variants

To identify candidate epistasis (or synergistic) interactions between the genetic risk factors, we searched for such pairs of genetic variants that led to the largest drop in the prediction accuracy when removed together from the set of predictive SNPs, relative to the drop resulting from removing either of the variants separately. As a feasibility study, we explored the particular SNP set which was found to be highly predictive of the subjects with the most extreme IMT progression from 2001 to 2007 (Figure 2C, q = 5%). When investigating a specific variant (rs2516839) in the upstream stimulatory factor 1 (USF1), a known regulator of the transcription of several cardiovascular-related genes, we identified a number of potential genetic interaction partners of USF1 (Figure 3), including formin 2 (FMN2, rs17672135), protein tyrosine phosphatase, non-receptor type 22 (PTPN22, rs2476601), hepatic triglyceride lipase (LIPC, rs1800588), and arachidonate 5-lipoxygenase-activating protein (ALOX5AP, rs17222814). It is interesting to note that each of these candidate gene-gene interactions originated from different biological processes, indicating that the disease progression and phenotypic heterogeneity are likely due to genetic alterations within multiple molecular pathways (Table S4). Such interactions and pathways may serve as a basis for more detailed further studies of the molecular mechanisms and disease networks that predispose to such excess levels of the IMT-progression that can lead to clinical cardiovascular events in the future.
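The pairwise removal idea can be sketched in Python as follows. The text does not give an explicit formula, so this sketch assumes one possible reading: the interaction score of a pair is the AUC drop after removing both SNPs minus the sum of the drops after removing each SNP alone. The `evaluate` callback, the `toy_auc` function and the SNP names are hypothetical stand-ins for a cross-validated model evaluation.

```python
from itertools import combinations


def interaction_scores(snps, evaluate):
    """Score each SNP pair by the extra AUC drop seen when both are removed.

    evaluate(subset) must return the prediction accuracy (e.g. AUC) of a
    model built on the given frozenset of SNPs.
    """
    all_snps = frozenset(snps)
    full = evaluate(all_snps)
    single_drop = {s: full - evaluate(all_snps - {s}) for s in snps}
    scores = {}
    for a, b in combinations(snps, 2):
        joint_drop = full - evaluate(all_snps - {a, b})
        # Assumption: compare the joint drop with the sum of the single drops.
        scores[(a, b)] = joint_drop - (single_drop[a] + single_drop[b])
    return scores


def toy_auc(subset):
    """Hypothetical evaluator: snp_a and snp_b carry a shared signal that
    survives as long as at least one of them stays in the model."""
    main = {"snp_a": 0.02, "snp_b": 0.03, "snp_c": 0.04}
    value = 0.60 + sum(main[s] for s in subset)
    if "snp_a" in subset or "snp_b" in subset:
        value += 0.05
    return value


pair_scores = interaction_scores(["snp_a", "snp_b", "snp_c"], toy_auc)
best_pair = max(pair_scores, key=pair_scores.get)
print(best_pair, round(pair_scores[best_pair], 3))
```

In this toy setting only the pair sharing a signal receives a clearly positive score, while pairs with purely additive contributions score approximately zero.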

**Fig. 3. Candidate interaction partners of a variant in USF1 (rs2516839).**

Evaluation on independent and randomized subject sets

To further explore the generalization capability of the prediction models estimated and evaluated on the current study subjects, we constructed a separate validation set consisting of those subjects who were filtered out in the initial subject selection because of missing data, but had a complete set of those SNPs identified for the particular risk class (see Figure S1). These new subjects were then split into the classes of ‘low-risk’ and ‘high-risk’ based on the exact same IMT-cutoff values that were used in the original subjects. In general, the results in the independent validation set scaled as expected (Figure 4). Even if the prediction of the new subject classes using those SNPs identified in the original dataset led to decreased prediction accuracies (average decrease in AUC was 0.043), their prediction capability was shown to extend beyond the original subjects, especially for the extreme 5% IMT cases, whereas the 10% risk class again showed poorer performance. A part of the decreased accuracy can be attributed to the sensitivity of the extreme selection strategy to the particular IMT quantile cut-offs being used (the dotted trace). We also repeated the same model building and evaluation framework for randomized datasets, in which subjects were divided into the low- and high-risk classes at random. This resulted in random prediction accuracies (average AUC 0.496), indicating that the high accuracies obtained with the predictive models were not by chance alone (Figure 4). Based on these results, independent and randomized subject sets were found to be useful for controlling the degree of overfitting, even when cross-validation is used in the model building.
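The logic of the randomization check can be sketched in Python. This simplified example only re-scores a fixed set of predictions against randomly shuffled class labels, whereas the study repeated the whole model building on the randomized datasets; the scores, labels, number of rounds and seed are hypothetical.

```python
import random


def auc(scores, labels):
    """Pairwise AUC with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def randomized_auc(scores, labels, n_rounds=200, seed=0):
    """Average AUC after assigning the class labels to subjects at random."""
    rng = random.Random(seed)
    shuffled = list(labels)
    total = 0.0
    for _ in range(n_rounds):
        rng.shuffle(shuffled)  # breaks any link between scores and classes
        total += auc(scores, shuffled)
    return total / n_rounds


# Hypothetical scores that separate the two classes perfectly.
toy_scores = [i / 40 for i in range(40)]
toy_labels = [0] * 20 + [1] * 20
observed = auc(toy_scores, toy_labels)
null_mean = randomized_auc(toy_scores, toy_labels)
print(observed, round(null_mean, 3))
```

An observed AUC far above the randomized average suggests that the accuracy is not explained by chance alone, which is the argument made with Figure 4.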

**Fig. 4. Prediction accuracies on independent and randomized subject sets.**

Discussion

The present results demonstrate a predictive relationship between an individual's genotypic variation and early signs of atherosclerosis along with its progression over a 6-year period in our population-based longitudinal follow-up study. The relationship was much stronger with the variants identified using the machine learning-based approach compared to the variants identified using single-locus statistical hypothesis testing procedures either in the present study population or in the previous case-control association studies of clinically manifesting CHD [9]–[21]. This latter finding is in line with a recent observation that the genetic scores, constructed from individual SNPs that met the genome-wide level of statistical significance in earlier GWASs, could not improve the prediction of cardiovascular risk after adjustment for conventional cardiovascular risk factors [41]. Similar observations have been made in the context of other diseases when using such a ‘bottom-up’ approach to building discrimination models [33]. In the present study, rather than exclusively using only those variants with the lowest p-values for association, we took here an alternative ‘top-down’ approach to predictive modeling by explicitly searching for all of the genetic and conventional risk factors that positively contribute to the prediction power. It was surprising to note that, among the most predictive variants, there was only a single statistically significant SNP in the present cohort (see Table 2, Table 3, Table 4), supporting the idea that many of the predictive associations are detected much lower down on the ranked list of hits compared to the top hits with the highest statistical support [43]. Ignoring such ‘gray zone’ variants is likely to result in missing an important proportion of the quantitative variation in heritability [44]. 
The proposed predictive modeling framework therefore complements the statistical class comparison procedures traditionally used during the discovery phase.

We used our longitudinal cohort data of carotid atherosclerosis precursors to implement a class prediction model, with the specific aim to build a multivariate discrimination function, or a classifier [45], which can accurately predict the risk class of a new subject on the basis of a panel of key variants. Sampling of the subjects with increasing carotid IMT levels from our follow-up study provided us with the unique opportunity to investigate the genetic variants contributing to the present and future atherosclerosis risk. Evaluation of the genetic variants predictive of the 2001 IMT risk classes was used here to set a baseline for the prediction accuracies and for the corresponding SNP panels. Medically, it is perhaps most interesting to evaluate the ability to predict the future IMT risk classes as well as the progression of the IMT levels over time. The determination of the future atherosclerosis risk is analogous to predicting the 2007 IMT risk classes based on the data reflecting the 2001 baseline genetic variants and confounding risk factors. The IMT progression (i.e., difference between the 2007 and 2001 IMT levels) is relevant in that even though an individual may not be considered to be in the risk group in 2007, the rate of change in the IMT levels between the evaluation years may be large enough to warrant regarding the subject as being at higher risk. The group with extreme IMT progression therefore represents the set of subjects who would be potential candidates for primary prevention in order to offset their likelihood of developing carotid atherosclerosis in the future. The full set of the SNP-panels predictive of the IMT-levels in the 2001 and 2007 studies, as well as of its relative progression from 2001 to 2007, are listed and characterized in Table S1. The genetic interactions between those variants that were highly predictive of the extreme IMT-progression are further discussed in Text S2.

Those SNPs that were found to be the most predictive of the 15% risk classes of IMT-levels and progression (Table 2, Table 3, Table 4) can be interpreted on the basis of a prior knowledge (Table S5). Most of the SNPs and corresponding genes have earlier been associated with cardiovascular disease risk factors such as low serum HDL-cholesterol and high serum LDL-cholesterol, triglycerides, lipoprotein(a) and apolipoprotein B concentrations (i.e., APOB, LPA, WWOX, ABCA1, USF1, PSRC1, ADRB2), inflammation, inflammatory and immunological factors such as serum CRP and interleukin levels (i.e., CRP, IL18, IL1B, LTA, ALOX5AP, IL10, ICOS, PTPN22), blood pressure, hemodynamics as well as serum asymmetric dimethyl arginine concentrations (DDAH2, WRN, WNK1, CDH13, NOS3), obesity, BMI, metabolic syndrome (FTO, ADRB2), and lipoprotein oxidation (PON1). Most of these SNPs are also linked to different cardiovascular traits, such as coronary artery disease, coronary artery calcification and atherosclerosis plaque areas, myocardial infarction, sudden cardiac death, stroke, as well as having phenotypic relationships with subclinical atherosclerotic traits such as carotid IMT (ESR1, APOB, PON1, USF1, ALOX5AP, ESR2, IL10, FCGR2A). Such associations have been found either alone or by interaction with other genes and clinical or environmental factors, including diabetes mellitus and use of alcohol or smoking [46], [47]. There were also novel IMT-related SNP candidates, earlier associated with bone density (C6orf97 and some intergenic SNPs), revealing possible mechanistic links to bone mineral and calcium metabolism. It is known that morphogenetic proteins and vascular calcification are activated in advanced atherosclerotic plaques [48]–[50]. On the basis of the present results, the same seems to hold true already in the sub-clinical stage of carotid atherosclerosis.

Limitations of the study and future developments

As with any association study that evaluates the contribution of a large number of candidate variants to a given phenotype, the question of how well the results will generalize to other study populations remains to be studied. This is a potential limitation in all SNP studies regardless of whether the class comparison or class prediction approach is being applied. It is known that associations identified in one population using the single-SNP statistical hypothesis testing procedures may not be detected in other populations in part due to the p-values being affected by the confounding factors [29], [51]. Measures which directly evaluate the predictive value of multiple factors, such as AUC-values, can overcome some of these limitations but are not without caveats [32], [33], [52]. Unlike many other class prediction studies that have used the AUC to assess the discrimination accuracy within the given cases and control subjects only, here we used cross-validation both when selecting coherent subsets of the most predictive variants, through feature selection, as well as when evaluating their prediction accuracy, as compared to the subsets of the most significant SNPs. Cross-validation was necessary to avoid a selection bias, which can lead to over-optimistic prediction results and the reporting of a large number of over-fitted genetic variants [45], [53]. The final evaluation of the panels of SNPs was done using an independent subject set to confirm that the reported models also generalize to other sub-populations beyond those used in the initial model estimation and validation. Testing on an independent dataset can also help to resolve any biases that may exist due to the fact that the cross-validation folds are far from independent of one another.

In common with many other SNP-studies, our main objective here was to identify those variants that are the most predictive of the atherosclerosis risk and progression in our follow-up study. When the aim is to obtain high prediction accuracies, the rules for including factors in the discrimination model are different from those when searching for the strongest statistical associations [54]. However, regardless of whether the discoveries come from statistical significance testing or from machine learning-based SNP-selection, the selected variants need to be carefully validated in further studies [55]. These two complementary approaches have also been combined, by building prediction models based exclusively on statistically significant SNPs, but this combined approach has been shown to result in poor classification accuracies [33]. In fact, reasonable increases in the prediction accuracies are often obtained only after including hundreds of top variants, depending on the complexity of the disease phenotype and whether or not cross-validation is utilized [32], [38], [39]. When the aim is class prediction, we believe it is better to make use of those methods that are specifically designed for optimal prediction, together with stringent feature selection and cross-validation, to output a modest number of highly predictive and reliable variants for further study [45]. Further evaluation of the prediction power on independent and randomized subject sets was also found to be useful for controlling the degree of over-fitting, as shown in Figure 4, even when systematic cross-validation schemes are being used in the model building process [56], [57].

It was interesting to note here that the simple naïve Bayes classifier performed well in the prediction of the atherosclerosis risk. The conditional independence assumption behind this probabilistic prediction model results in nominal predictive probabilities that are often unrealistic, in the sense of being very close to either zero or one. Therefore, we followed the standard practice and chose the class with the highest posterior probability. Despite this simplifying assumption, the naïve Bayes classifier generally provided the best prediction results across the various risk classes, compared to other classification models, such as Bayes Nets, Support Vector Machines, or Random Forest (see Text S1 for their comparison). Moreover, because of its simplicity, the naïve Bayes classifier is also computationally more efficient than the other, more complex prediction models, enabling its usage in GWAS meta-analyses as well. These observations are in line with previous works, which have shown that the naïve Bayes classifier can perform well even when there are strong dependencies in the dataset [58]–[60]. In particular, it has proven to be effective in the context of the IMT-phenotype and in SNP-data [61], [62]. Standard filtering procedures, such as those based on the Hardy-Weinberg equilibrium, and other quality control measures implemented during the genotyping can result in severe restrictions on the joint distribution of alleles, making them appear independent of one another, which further explains the good performance of the naïve Bayes classifier. However, other efficient SNP-subset selection methods that go beyond the single-SNP testing, such as those based on a penalized maximum-likelihood approach [63], or different filter-wrapper machine learning approaches [31], could be used in the generic modeling framework.

While previous studies have identified sex-related differences in the cardiovascular disease incidence and genetic risk factors [64], the objective of the present study was to demonstrate that a common panel of genetic risk factors can already improve the prediction of subclinical carotid atherosclerosis risk and progression in a general population of young adults. Therefore, we did not stratify the subjects on the basis of any of the conventional risk factors, including sex or age, but the subjects were combined into a single distribution (Figure 1). In future studies, however, it is possible to divide the heterogeneous population into more homogeneous sub-samples to investigate the relationship between the genetic and conventional risk factors in more controlled settings. Further, pathway and network analyses of such sub-sample-specific genetic variants and their interactions could also reveal underlying similarities or differences in the biological processes and genetic networks [6]. We have previously shown that sub-sampling-based automated procedures can help to detect hidden subject sub-groups that present with similar genetic profiles in genome-wide studies and which may associate with divergent clinical outcomes [65]. An automated subject grouping combined with the predictive modeling framework introduced in the present study could offer possibilities to start developing personalized approaches that make the most of genetic variation together with clinical data to predict individual susceptibility to the initiation and progression of carotid atherosclerosis and other complex diseases. Such experimental-computational approaches may also prove to have clinical utility in the early detection and management of sub-clinical atherosclerosis and other quantitative disorders.

Materials and Methods

Subject selection

The Cardiovascular Risk in Young Finns Study is an on-going population-based follow-up study of atherosclerosis precursors from childhood to adulthood [66]. The multi-center study has been carried out in five university hospitals across Finland (Turku, Tampere, Helsinki, Kuopio and Oulu). The baseline cross-sectional study in 1980 included a total of 3,596 children and adolescents, aged 3–18 years, who were randomly chosen from the national population register [67]. Since then, follow-up studies have been conducted in 1983, 1986, 2001 and 2007, in which the conventional risk factor data have systematically been collected from the individuals participating in those studies. In the two most recent follow-ups in 2001 and 2007, which were used in the present analysis, a total of 2,283 and 2,204 participants were re-examined, comprising the age groups of 24, 27, 30, 33, 36, 39 years and 30, 33, 36, 39, 42, 45 years, respectively; out of these, a total of 1,828 subjects participated in both the 2001 and 2007 follow-up studies [68]. The subjects involved in the cohort provided written consent to be included in the study, which was approved by local ethics committees.

The study cohort for the present analysis comprised those subjects who took part in both the ultrasound and the genotyping studies in 2001. The carotid artery intima-media thickness (IMT) was measured from 1,809 subjects in both of the follow-up studies. Genotyping of single nucleotide polymorphisms (SNPs) was based on the DNA collected in 2001. The candidate gene approach was used to explore potentially interesting relationships between several known SNPs and clinical traits. Subjects who had missing values either in their IMT or SNP data in the year 2001 or 2007 were excluded from the present analysis, in order to eliminate their potentially adverse effects on both the reported prediction accuracies and the selected genetic variants. Due to such stringent subject selection criteria (see Figure S1), the complete data matrices from n = 1,027 subjects were used in the search for genetic variants (SNP sets) that are predictive of atherosclerosis (indexed by IMT) at baseline (2001); of these, n = 813 had complete data also in the follow-up study (2007), and could be used when searching for variants predictive of IMT progression (the change from 2001 to 2007).

Clinical characteristics

In the present analysis, we used the conventional risk factor data from the 2001 follow-up study. The physical examination consisted of the measurement of height, weight, systolic and diastolic blood pressure, and waist circumference [66]. The body mass index (BMI) was calculated by dividing the subjects' weight in kilograms by the square of their height in meters. Waist circumference was recorded as the average of two measurements with an accuracy of 0.1 cm. Blood pressure was measured at least three times with a random zero sphygmomanometer, and the average of the three readouts of systolic and diastolic blood pressure was recorded. Lifestyle risk factors, such as smoking, were examined with questionnaires; the subjects who smoked daily were regarded as smokers. For the assessment of serum lipoprotein levels, venous blood samples were drawn after an overnight fast and the serum was separated, aliquoted and stored at −70°C until analysis. Standard enzymatic methods were used for recording the levels of serum total cholesterol, HDL-cholesterol, and LDL-cholesterol, as well as the concentrations of serum triglycerides, apolipoprotein A1 (ApoA1) and B (ApoB) [67], [68].

Genotyping studies

Genomic DNA was extracted from peripheral blood leukocytes with a commercially available kit (Qiagen Inc., Valencia, CA). The DNA samples collected during the 2001 follow-up study were genotyped as described previously [66], [69]. In the present analysis, we included the panel of 17 SNPs with the highest single-SNP statistical significance in the recent GWASs identifying variants for CHD outcomes and serum lipids [9]–[21], as well as a number of other candidate SNPs listed in the first phase of the international pooling project of cardiovascular cohorts [70]. A total of 108 SNPs with complete genotyping data in the selected subjects were considered here in the predictive modeling; these SNPs are generally related to serum lipid and lipoprotein metabolism, oxidation, cellular lipid metabolism, inflammation, immunological system, cell signaling, cell migration, cell growth, homocysteine metabolism, cellular adhesion and blood coagulation (see Table S1 for the full list of SNPs together with information on their gene annotation and chromosomal location, as well as on associated phenotypes available from previous studies).

Ultrasound imaging

Ultrasound studies were performed using Sequoia 512 ultrasound mainframes (Acuson Inc., Mountain View, CA, USA), with 13.0 MHz linear array transducers. Exactly the same scanning protocol was used in both the 2001 and 2007 studies, as previously described [23]. Briefly, carotid IMT was measured on the posterior (far) wall of the left carotid artery. At least four measurements were taken 10 mm proximal to the bifurcation, and the average of the readouts was recorded. The digitally stored scans were manually analyzed by the same reader in both 2001 and 2007, blinded to the subjects' characteristics. The between-visit coefficient of variation of such IMT measurements was 6.4%, as estimated between two visits that were three months apart [23]. Since the IMT correlates with the risk of atherosclerosis progression and subsequent cardiovascular events [23]–[27], it was used here for stratifying the subjects into gradually increasing risk classes. Being non-invasive in its nature, this measurement can be justified in large populations of healthy subjects, without biases related to clinically diagnosed cases and controls [28], making it a convenient quantitative phenotype of atherosclerosis in population-based follow-up studies. The quantitative IMT measurement suffers from a degree of measurement error, which can lead to regression to the mean (Figure S2).

Predictive modeling

The relative contribution of the conventional and genetic risk factors to the individual IMT levels was investigated by means of a predictive modeling framework, similar to that which we and others have used before [61], [62]. Briefly, the study subjects were first divided into several risk classes according to their IMT levels. Based on the concept of extreme selection strategy [1]–[3], the quantile points, say (1-q) and q, of the IMT distribution were used to define the low and high risk classes, respectively (see Figure 1). The prediction of whether a subject belongs to the high-risk (H_q) or low-risk (L_q) class was done on the basis of his or her individual SNP data (S₁, …, S_l), whereas clinical characteristics, smoking habits, sex and age were used as confounding risk factors (C₁, …, C_m). A probabilistic prediction model, the so-called naïve Bayes classifier, was used here because of its low computational cost and good performance in previous studies [61], [62], [71]. Mathematically, the predictive classifier can be formulated as a conditional probability of observing the true class R (either H_q or L_q) given the genetic and confounding risk factors (the predictors P):

$$P(R \mid S_1, \ldots, S_l, C_1, \ldots, C_m) = K \, P(R) \prod_{i=1}^{l} P(S_i \mid R) \prod_{j=1}^{m} P(C_j \mid R) \qquad (1)$$

where K is a scaling factor independent of the risk class R. The a priori probabilities were set to the number of training samples in the low and high classes [71], and for numeric risk factors, the training algorithm estimates the densities using Gaussian distributions [72] (see Text S1 for more details). The subjects in the test material were then classified by choosing the risk class with the highest posterior probability in Eqn (1).
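
As an illustration of the classification step described above, the following is a minimal Python sketch, not the Weka implementation used in the study. It labels the two tails of an IMT distribution and evaluates a naïve Bayes posterior with categorical SNP genotypes and Gaussian numeric risk factors. The function names, the nearest-rank quantile rule, the 0/1/2 genotype coding and the Laplace smoothing of genotype frequencies are all assumptions made for this example.

```python
import math
from collections import Counter


def extreme_classes(imt, q=0.9):
    """Map subject index -> 'L' or 'H' for the two tails of the IMT distribution.

    Cut-offs use a simple nearest-rank quantile (an assumption of this sketch).
    Subjects between the two cut-offs are left unlabeled.
    """
    ranked = sorted(imt)
    n = len(ranked)
    low_cut = ranked[round((1 - q) * (n - 1))]
    high_cut = ranked[round(q * (n - 1))]
    labels = {}
    for i, value in enumerate(imt):
        if value <= low_cut:
            labels[i] = "L"
        elif value >= high_cut:
            labels[i] = "H"
    return labels


def train_naive_bayes(snps, numeric, classes):
    """Estimate class priors, genotype frequencies and Gaussian parameters."""
    model = {"prior": {}, "snp": {}, "gauss": {}}
    n = len(classes)
    for r in set(classes):
        rows = [i for i, c in enumerate(classes) if c == r]
        # Prior proportional to the number of training samples in the class.
        model["prior"][r] = len(rows) / n
        # Laplace-smoothed genotype frequencies (genotypes coded 0, 1, 2).
        model["snp"][r] = []
        for j in range(len(snps[0])):
            counts = Counter(snps[i][j] for i in rows)
            freqs = {g: (counts[g] + 1) / (len(rows) + 3) for g in (0, 1, 2)}
            model["snp"][r].append(freqs)
        # Mean and variance of each numeric risk factor within the class.
        model["gauss"][r] = []
        for j in range(len(numeric[0])):
            vals = [numeric[i][j] for i in rows]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / max(len(vals) - 1, 1)
            model["gauss"][r].append((mean, max(var, 1e-9)))
    return model


def posterior(model, snp_row, numeric_row):
    """Return the normalised posterior probability of each risk class."""
    log_score = {}
    for r, prior in model["prior"].items():
        s = math.log(prior)
        for j, g in enumerate(snp_row):
            s += math.log(model["snp"][r][j][g])
        for j, x in enumerate(numeric_row):
            mean, var = model["gauss"][r][j]
            s += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        log_score[r] = s
    # Dividing by the sum over classes plays the role of the scaling factor K.
    top = max(log_score.values())
    unnorm = {r: math.exp(s - top) for r, s in log_score.items()}
    total = sum(unnorm.values())
    return {r: u / total for r, u in unnorm.items()}


def classify(model, snp_row, numeric_row):
    """Choose the risk class with the highest posterior probability."""
    post = posterior(model, snp_row, numeric_row)
    return max(post, key=post.get)
```

Working in log space avoids numerical underflow when many factors are multiplied, which is also why the nominal posteriors of such a model tend to lie very close to zero or one.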
The predictive power of different risk factor combinations was assessed with the k-fold cross-validation procedure, in which the given sample was divided into k distinct subsets of equal sizes, each of which in turn was used as a validation set, to assess how well the results will generalize to new sets of subjects, while the remaining sub-samples were used in the initial training of the prediction model [71]. The final prediction accuracy was reported as the average over the k validation rounds (here k = 10; see Figure S3).
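
The cross-validation procedure can be sketched as follows. This is a minimal, assumption-laden illustration rather than the Weka routine used in the study: `fit` and `score` are hypothetical callbacks supplied by the caller, the folds are of near-equal rather than exactly equal size, and the shuffling seed is arbitrary.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Split the indices 0..n-1 into k disjoint folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(n, fit, score, k=10, seed=0):
    """Average a validation score over k rounds.

    fit(train_idx) returns a fitted model (hypothetical callback);
    score(model, test_idx) returns a number such as accuracy or AUC.
    """
    results = []
    for held_out in k_fold_indices(n, k, seed):
        held = set(held_out)
        # Every subject outside the held-out fold is used for training.
        train = [i for i in range(n) if i not in held]
        model = fit(train)
        results.append(score(model, held_out))
    return sum(results) / k
```

Each subject is used exactly once for validation and k-1 times for training, so the averaged score reflects performance on subjects the model has not seen during fitting.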

Selection of predictive variants

The selection of predictive genetic and conventional risk factors was performed in two steps, with the aim of identifying a minimal set of informative features for predicting the different risk classes (see Figure S3). The SNP selection was done using a machine-learning-based procedure, similar to the ‘filter-wrapper’ method [73]. The filtering phase starts from the full set of SNPs and uses an entropy-based information gain measure to reduce the high-dimensional search space to the subset of most informative genetic and conventional risk factors (here 40), which could subsequently be traversed thoroughly in the next phase of selection. In the wrapper phase, the best-first iterative search-and-evaluate algorithm was used to further improve this subset by excluding those factors with the least predictive power, using the backward search direction, while the backtracking option allows for escaping from local optima [71]. The predictive power of the selected factor combinations was assessed using the naïve Bayes classifier, run with a 5-fold cross-validation to avoid potential selection bias, and the final prediction accuracy was evaluated using external 10-fold cross-validation (see Figure S3). The predictive modeling and risk factor selection were carried out with the Weka data mining platform (version 3.7; University of Waikato, New Zealand) [71].
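
The two phases can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the Weka procedure used in the study: the filter handles categorical features only (numeric risk factors would first need discretization), the wrapper is a plain greedy backward elimination without the best-first backtracking, and `evaluate` is a hypothetical callback that would return, for example, the cross-validated AUC of a naïve Bayes classifier built on the given feature subset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """H(class) - H(class | feature) for one categorical feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def filter_top(features, labels, keep=40):
    """Filter phase: names of the `keep` features with the highest gain."""
    ranked = sorted(
        features,
        key=lambda name: information_gain(features[name], labels),
        reverse=True,
    )
    return ranked[:keep]


def backward_eliminate(selected, evaluate):
    """Wrapper phase (simplified): greedy backward search, no backtracking.

    A feature is dropped whenever its removal does not lower the score,
    which favours smaller subsets of equal predictive power.
    """
    current = list(selected)
    best = evaluate(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for name in list(current):
            trial = [f for f in current if f != name]
            trial_score = evaluate(trial)
            if trial_score >= best:
                current, best, improved = trial, trial_score, True
                break
    return current, best
```

Running the wrapper inside its own cross-validation loop, separate from the loop used for the final accuracy estimate, is what keeps the selection step from leaking information into the reported accuracy.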

Assessment of prediction accuracy

The predictive accuracy of the classifiers, constructed using either the p-value-based selection of the most significant SNPs or the machine-learning-based selection of the most predictive SNP-sets, was assessed using the receiver operating characteristic (ROC) analyses; ROC curves characterize the relative trade-off between true positive rate (sensitivity) and false positive rate (1 – specificity) of a classifier over the whole range of discrimination thresholds [32], [33], [71]. The overall accuracy of a classifier was summarized using the area under the ROC curve (AUC) measure; for an ideal classifier, AUC = 1, whereas a random classifier obtains an AUC = 0.5 on average [52], [61], [71]. The relative predictive power of each individual SNP or SNP-SNP interaction was assessed in terms of the change in AUC level when the particular SNP (say x) or the SNP-pair (x,y) was deleted from the selected set of variants (denoted by $\Delta\mathrm{AUC}_{x}$ and $\Delta\mathrm{AUC}_{x,y}$, respectively). The interaction score for detecting epistasis effects was defined as $\Delta\mathrm{AUC}_{x,y} - (\Delta\mathrm{AUC}_{x} + \Delta\mathrm{AUC}_{y})$, resembling the additive definition of genetic interactions based on single and double-deletion experiments in model organisms [74]. The AUC-values were calculated using the Weka platform (version 3.7; University of Waikato, New Zealand) [71].
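
To make the deletion-based assessment concrete, the following Python sketch computes an AUC from predicted scores and derives an interaction score from single and pairwise deletions. Several points are assumptions of this example rather than details confirmed by the text: the change in AUC is taken as the AUC of the full set minus the AUC after deletion, the interaction score is taken in the additive form (pair effect minus the sum of the single effects), and `auc_of` is a hypothetical callback returning the cross-validated AUC of a model built on a given variant subset.

```python
def auc(scores, labels, positive="H"):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def delta_auc(auc_of, selected, removed):
    """Drop in AUC when the variants in `removed` are deleted from the set."""
    reduced = [v for v in selected if v not in removed]
    return auc_of(selected) - auc_of(reduced)


def interaction_score(auc_of, selected, x, y):
    """Pair-deletion effect minus the sum of the two single-deletion effects."""
    return (
        delta_auc(auc_of, selected, {x, y})
        - delta_auc(auc_of, selected, {x})
        - delta_auc(auc_of, selected, {y})
    )
```

Under these conventions, a score near zero suggests that the two variants contribute roughly additively, whereas a clearly non-zero score suggests that deleting both changes the accuracy by more or less than the single deletions would predict.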

Statistical procedures

The level of statistical association of single SNPs with the IMT-classes was assessed by determining the genotypic probabilities (p-values), on the basis of the 2×3 contingency matrix that contains the counts of the three genotypes among the low-risk and high-risk subjects [75]. Computationally efficient calculation of the exact p-values for each individual SNP was carried out with the ExactFDR software [76]. The Pearson correlation coefficient was used to assess the linear association between the various conventional risk factors and IMT-levels or changes. The resulting p-values were adjusted for multiple testing using the Bonferroni correction. Although it is known that this correction may be conservative, especially when the test statistics are dependent, it provides an effective means for ensuring that the findings deemed most significant are not due to chance alone when many hypotheses are being tested simultaneously. Differences in the distributions of the IMT-levels or changes between sub-populations were assessed using the Kolmogorov-Smirnov D-statistic, which is based on the maximal vertical distance between the two distributions. The statistical analyses were carried out with the SPSS Statistics software (version 17.0; SPSS Inc., Chicago, IL, USA) and with the statistical computing platform R (http://www.r-project.org/).
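
Three of the quantities named above are simple enough to sketch directly. The Python functions below are an illustration only, under the assumption of 0/1/2 genotype coding and 'L'/'H' class labels; they do not reproduce the exact p-value computation of ExactFDR or the SPSS and R routines used in the study, and the function names are hypothetical.

```python
def genotype_table(genotypes, classes):
    """2x3 contingency table: genotype counts in the low and high risk class."""
    table = {"L": [0, 0, 0], "H": [0, 0, 0]}
    for g, c in zip(genotypes, classes):
        table[c][g] += 1
    return table


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def ks_statistic(sample_a, sample_b):
    """Maximal vertical distance between two empirical distribution functions."""
    d = 0.0
    # The maximum is attained at one of the observed values.
    for x in list(sample_a) + list(sample_b):
        cdf_a = sum(1 for v in sample_a if v <= x) / len(sample_a)
        cdf_b = sum(1 for v in sample_b if v <= x) / len(sample_b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

The quadratic-time loop in `ks_statistic` is chosen for readability; for samples of the size used here it is fast enough, but a sorted-merge pass would be preferable for much larger data.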

Supporting Information

References

1. Plomin R, Haworth CM, Davis OS (2009) Common disorders are quantitative traits. Opinion. Nat Rev Genet 10: 872–878.

2. Schork NJ, Nath SK, Fallin D, Chakravarti A (2000) Linkage disequilibrium analysis of biallelic DNA markers, human quantitative trait loci, and threshold-defined case and control subjects. Am J Hum Genet 67: 1208–1218.

3. Lanktree MB, Hegele RA, Schork NJ, Spence JD (2010) Extremes of unexplained variation as a phenotype: an efficient approach for genome-wide association studies of cardiovascular disease. Circ Cardiovasc Genet 3: 215–221.

4. Zhang G, Nebert DW, Chakraborty R, Jin L (2006) Statistical power of association using the extreme discordant phenotype design. Pharmacogenet Genomics 16: 401–413.

5. Eguchi T, Maruyama T, Ohno Y, Morii T, Hirao K (2009) Possible association of tumor necrosis factor receptor 2 gene polymorphism with severe hypertension using the extreme discordant phenotype design. Hypertens Res 32: 775–779.

6. Torkamani A, Schork NJ (2009) Pathway and network analysis with high-density allelic association data. Methods Mol Biol 563: 289–301.

7. Pearson TA (2002) New tools for coronary risk assessment: what are their advantages and limitations? Circulation 105: 886–892.

8. Koskinen J, Kähönen M, Viikari JS, Taittonen L, Laitinen T (2009) Conventional cardiovascular risk factors and metabolic syndrome in predicting carotid intima-media thickness progression in young adults: the cardiovascular risk in young Finns study. Circulation 120: 229–236.

9. Samani NJ, Erdmann J, Hall AS, Hengstenberg C, Mangino M (2007) Genome-wide association analysis of coronary artery disease. N Engl J Med 357: 443–453.

10. McPherson R, Pertsemlidis A, Kavaslar N, Stewart A, Roberts R (2007) A common allele on chromosome 9 associated with coronary heart disease. Science 316: 1488–1491.

11. Helgadottir A, Thorleifsson G, Manolescu A, Gretarsdottir S, Blondal T (2007) A common variant on chromosome 9p21 affects the risk of myocardial infarction. Science 316: 1491–1493.

12. Larson MG, Atwood LD, Benjamin EJ, Cupples LA, D'Agostino RB Sr (2007) Framingham Heart Study 100K project: genome-wide associations for cardiovascular disease outcomes. BMC Med Genet 8: S5.

13. The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

14. Luke MM, Kane JP, Liu DM, Rowland CM, Shiffman D (2007) A polymorphism in the protease-like domain of apolipoprotein(a) is associated with severe coronary artery disease. Arterioscler Thromb Vasc Biol 27: 2030–2036.

15. Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL (2008) Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet 40: 161–169.

16. Kathiresan S, Melander O, Anevski D, Guiducci C, Burtt NP (2008) Polymorphisms associated with cholesterol and risk of cardiovascular events. N Engl J Med 358: 1240–1249.

17. Shiffman D, Kane JP, Louie JZ, Arellano AR, Ross DA (2008) Analysis of 17,576 potentially functional SNPs in three case-control studies of myocardial infarction. PLoS ONE 3: e2895. doi:10.1371/journal.pone.0002895

18. Abdullah KG, Li L, Shen GQ, Hu Y, Yang Y (2008) Four SNPs on chromosome 9p21 confer risk to premature, familial CAD and MI in an American Caucasian population (GeneQuest). Annals Human Genet 72: 654–657.

19. Sagoo GS, Tatt I, Salanti G, Butterworth AS, Sarwar N (2008) Seven lipoprotein lipase gene polymorphisms, lipid fractions, and coronary disease: a HuGE association review and meta-analysis. Am J Epidemiol 168: 1233–1246.

20. Anderson JL, Horne BD, Kolek MJ, Muhlestein JB, Mower CP (2008) Genetic variation at the 9p21 locus predicts angiographic coronary artery disease prevalence but not extent and has clinical utility. Am Heart J 156: 1155–1162.

21. Paynter NP, Chasman DI, Buring JE, Shiffman D, Cook NR (2009) Cardiovascular disease risk prediction with and without knowledge of genetic variation at chromosome 9p21.3. Ann Intern Med 150: 65–72.

22. Lusis AJ, Pajukanta P (2008) A treasure trove for lipoprotein biology. Comment. Nat Genet 40: 129–130.

23. Raitakari OT, Juonala M, Kähönen M, Taittonen L, Laitinen T (2003) Cardiovascular risk factors in childhood and carotid artery intima-media thickness in adulthood: The Cardiovascular Risk in Young Finns Study. JAMA 290: 2277–2283.

24. Li S, Chen W, Srinivasan SR, Bond MG, Tang R (2003) Childhood cardiovascular risk factors and carotid vascular changes in adulthood: The Bogalusa Heart Study. JAMA 290: 2271–2276.

25. Salonen JT, Salonen R (1991) Ultrasonographically assessed carotid morphology and the risk of coronary heart disease. Arterioscler Thromb 11: 1245–1249.

26. O'Leary DH, Polak JF, Kronmal RA, Manolio TA, Burke GL (1999) Carotid-artery intima and media thickness as a risk factor for myocardial infarction and stroke in older adults. Cardiovascular Health Study Collaborative Research Group. N Engl J Med 340: 14–22.

27. Lorenz MW, Markus HS, Bots ML, Rosvall M, Sitzer M (2007) Prediction of clinical cardiovascular events with carotid intima-media thickness: a systematic review and meta-analysis. Circulation 115: 459–467.

28. O'Leary DH, Polak JF (2002) Intima-media thickness: a tool for atherosclerosis imaging and event prediction. Am J Cardiol 90: 18L–21L.

29. Frazer KA, Murray SS, Schork NJ, Topol EJ (2009) Human genetic variation and its contribution to complex traits. Nat Rev Genet 10: 241–251.

30. Moore JH, Williams SM (2009) Epistasis and its implications for personal genetics. Am J Hum Genet 85: 309–320.

31. Moore JH, Asselbergs FW, Williams SM (2010) Bioinformatics challenges for genome-wide association studies. Bioinformatics 26: 445–455.

32. Kraft P, Wacholder S, Cornelis MC, Hu FB, Hayes RB (2009) Beyond odds ratios: communicating disease risk based on genetic profiles. Perspective. Nat Rev Genet 10: 264–269.

33. Jakobsdottir J, Gorin MB, Conley YP, Ferrell RE, Weeks DE (2009) Interpretation of genetic association studies: markers with replicated highly significant odds ratios may be poor classifiers. PLoS Genet 5: e1000337. doi:10.1371/journal.pgen.1000337

34. Samani NJ, Raitakari OT, Sipilä K, Tobin MD, Schunkert H (2008) Coronary artery disease-associated locus on chromosome 9p21 and early markers of atherosclerosis. Arterioscler Thromb Vasc Biol 28: 1679–1683.

35. Fan YM, Raitakari OT, Kähönen M, Hutri-Kähönen N, Juonala M (2009) Hepatic lipase promoter C-480T polymorphism is associated with serum lipids levels, but not subclinical atherosclerosis: The Cardiovascular Risk in Young Finns Study. Clin Genet 76: 46–53.

36. Humphries SE, Cooper JA, Talmud PJ, Miller GJ (2007) Candidate gene genotypes, along with conventional risk factor assessment, improve estimation of coronary heart disease risk in healthy UK men. Clin Chem 53: 8–16.

37. Morrison AC, Bare LA, Chambless LE, Ellis SG, Malloy M (2007) Prediction of coronary heart disease risk using a genetic risk score: the Atherosclerosis Risk in Communities Study. Am J Epidemiol 166: 28–35.

38. van der Net JB, Janssens AC, Defesche JC, Kastelein JJ, Sijbrands EJ (2009) Usefulness of genetic polymorphisms and conventional risk factors to predict coronary heart disease in patients with familial hypercholesterolemia. Am J Cardiol 103: 375–380.

39. van der Net JB, Janssens AC, Sijbrands EJ, Steyerberg EW (2009) Value of genetic profiling for the prediction of coronary heart disease. Am Heart J 158: 105–110.

40. Ioannidis JP (2009) Prediction of cardiovascular disease outcomes and established cardiovascular risk factors by genome-wide association markers. Circ Cardiovasc Genet 2: 7–15.

41. Paynter NP, Chasman DI, Paré G, Buring JE, Cook NR (2010) Association between a literature-based genetic risk score and cardiovascular events in women. JAMA 303: 631–637.

42. Cordell HJ (2009) Genome-wide association studies: Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet 10: 392–404.

43. Donnelly P (2008) Progress and challenges in genome-wide association studies in humans. Commentary. Nature 456: 728–731.

44. Maher B (2008) Personal genomes: The case of the missing heritability. News Feature. Nature 456: 18–21.

45. Simon R, Radmacher MD, Dobbin K, McShane LM (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 95: 14–18.

46. Rontu R, Karhunen PJ, Ilveskoski E, Mikkelsson J, Kajander O (2003) Smoking-dependent association between paraoxonase 1 M/L55 genotype and coronary atherosclerosis in males: an autopsy study. Atherosclerosis 171: 31–37.

47. McGeachie M, Ramoni RL, Mychaleckyj JC, Furie KL, Dreyfuss JM (2009) Integrative predictive model of coronary artery calcification in atherosclerosis. Circulation 120: 2448–2454.

48. Bostrom K, Watson KE, Horn S, Wortham C, Herman IM (1993) Bone morphogenetic protein expression in human atherosclerotic lesions. J Clin Invest 91: 1800–1809.

49. Bucay N, Sarosi I, Dunstan CR, Morony S, Tarpley J (1998) Osteoprotegerin-deficient mice develop early onset osteoporosis and arterial calcification. Genes Dev 12: 1260–1268.

50. Collin-Osdoby P (2004) Regulation of vascular calcification by osteoclast regulatory factors RANKL and osteoprotegerin. Review. Circ Res 95: 1046–1057.

51. Stephens M, Balding DJ (2009) Bayesian statistical methods for genetic association studies. Nat Rev Genet 10: 681–690.

52. Janssens AC, van Duijn CM (2009) Genome-based prediction of common diseases: methodological considerations for future research. Genome Med 1: 20.

53. Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 99: 6562–6566.

54. Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P (2004) Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol 159: 882–890.

55. Ioannidis JP, Thomas G, Daly MJ (2009) Validating, augmenting and refining genome-wide association signals. Nat Rev Genet 10: 318–329.

56. Reunanen J (2003) Overfitting in making comparisons between variable selection methods. J Machine Learn Res 3: 1371–1382.

57. Anderssen E, Dyrstad K, Westad F, Martens H (2006) Reducing over-optimism in variable selection by cross-model validation. Chemometrics Intell Laborat Systems 84: 69–74.

58. Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29: 103–130.

59. Hand DJ, Yu K (2001) Idiot's Bayes – not so stupid after all? International Statistical Rev 69: 385–398.

60. Zhang H (2005) Exploring conditions for the optimality of naïve Bayes. International J Patt Recogn Artif Intelligence 19: 183–198.

61. Aittokallio J, Polo O, Hiissa J, Virkki A, Toikka J (2008) Overnight variability in transcutaneous carbon dioxide predicts vascular impairment in women. Exp Physiol 93: 880–891.

62. Long N, Gianola D, Rosa GJ, Weigel KA, Avendaño S (2009) Comparison of classification methods for detecting associations between SNPs and chick mortality. Genet Sel Evol 41: 18.

63. Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ (2008) Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet 4: e1000130. doi:10.1371/journal.pgen.1000130

64. Silander K, Alanne M, Kristiansson K, Saarela O, Ripatti S (2008) Gender differences in genetic risk profiles for cardiovascular disease. PLoS ONE 3: e3615. doi:10.1371/journal.pone.0003615

65. Hiissa J, Elo LL, Huhtinen K, Perheentupa A, Poutanen M (2009) Resampling reveals sample-level differential expression in clinical genome-wide studies. OMICS 13: 381–396.

66. Raitakari OT, Juonala M, Rönnemaa T, Keltikangas-Järvinen L, Räsänen L (2008) Cohort profile: the Cardiovascular Risk in Young Finns Study. Int J Epidemiol 37: 1220–1226.

67. Åkerblom HK, Viikari J, Uhari M, Räsänen L, Byckling T (1985) Atherosclerosis precursors in Finnish children and adolescents. I. General description of the cross-sectional study of 1980, and an account of the children's and families' state of health. Acta Paediatr Scand Suppl 318: 49–63.

68. Raiko JR, Viikari JS, Ilmanen A, Hutri-Kähönen N, Taittonen L (2010) Follow-ups of the Cardiovascular Risk in Young Finns Study in 2001 and 2007: Levels and 6-year changes in risk factors. J Intern Med 267: 370–384.

69. Livak KJ (1999) Allelic discrimination using fluorogenic probes and the 5′ nuclease assay. Genet Anal 14: 143–149.

70. Evans A, Salomaa V, Kulathinal S, Asplund K, Cambien F (2005) MORGAM (an international pooling of cardiovascular cohorts). Review. Int J Epidemiol 34: 21–27.

71. Witten IH, Frank E (2005) Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. San Francisco: Morgan Kaufmann Publishers.

72. John G, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. San Mateo: Morgan Kaufmann Publishers. pp. 338–345.

73. Long N, Gianola D, Rosa GJ, Weigel KA, Avendaño S (2007) Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers. J Anim Breed Genet 124: 377–389.

74. Phillips PC (2008) Epistasis: the essential role of gene interactions in the structure and evolution of genetic systems. Review. Nat Rev Genet 9: 855–867.

75. Balding DJ (2006) A tutorial on statistical methods for population association studies. Nat Rev Genet 7: 781–791.

76. Wojcik J, Forner K (2008) ExactFDR: exact computation of false discovery rate estimate in case-control association studies. Bioinformatics 24: 2407–2408.