Inferring causal direction between two traits in the presence of horizontal pleiotropy with GWAS summary data

Authors: Haoran Xue ^aff001; Wei Pan ^aff002
Authors place of work: School of Statistics, University of Minnesota, Minneapolis, Minnesota, United States of America ^aff001; Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, United States of America ^aff002
Published in the journal: Inferring causal direction between two traits in the presence of horizontal pleiotropy with GWAS summary data. PLoS Genet 16(11): e32767. doi:10.1371/journal.pgen.1009105
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1009105

Summary

Orienting the causal relationship between pairs of traits is a fundamental task in scientific research with significant implications in practice, such as in prioritizing molecular targets and modifiable risk factors for developing therapeutic and interventional strategies for complex diseases. A recent method, called Steiger’s method, using a single SNP as an instrument variable (IV) in the framework of Mendelian randomization (MR), has since been widely applied. We report the following new contributions. First, we propose a single SNP-based alternative, overcoming a severe limitation of Steiger’s method in simply assuming, instead of inferring, the existence of a causal relationship. We also clarify a condition necessary for the validity of the methods in the presence of hidden confounding. Second, to improve statistical power, we propose combining the results from multiple, and possibly correlated, SNPs as multiple instruments. Third, we develop three goodness-of-fit tests to check modeling assumptions, including those required for valid IVs. Fourth, by relaxing one of the three IV assumptions in MR, we propose several methods, including an Egger regression-like approach and its multivariable version (analogous to multivariable MR), to account for horizontal pleiotropy of the SNPs/IVs, which is often unavoidable in practice. All our methods can simultaneously infer both the existence and (if so) the direction of a causal relationship, largely expanding their applicability over that of Steiger’s method. Although we focus on uni-directional causal relationships, we also briefly discuss an extension to bi-directional relationships. Through extensive simulations and an application to infer the causal directions between low density lipoprotein (LDL) cholesterol, or high density lipoprotein (HDL) cholesterol, and coronary artery disease (CAD), we demonstrate the superior performance and advantage of our proposed methods over Steiger’s method and bi-directional MR. In particular, after accounting for horizontal pleiotropy, our method confirmed the well known causal direction from LDL to CAD, while other methods, including bi-directional MR, might fail.

Keywords:

Gene expression – Quantitative trait loci – Simulation and modeling – Genome-wide association studies – Cholesterol – Covariance – Single nucleotide polymorphisms

Introduction

Mendelian Randomization (MR) has become increasingly popular and widely used to infer causal relationships between pairs of traits, which can be molecular (such as gene expression or DNA methylation), behavioral/educational or clinical [1–3]. MR (or two-sample MR) utilizes genetic variants, almost exclusively single nucleotide polymorphisms (SNPs), as instruments or instrumental variables (IVs) in two independent GWAS (summary) datasets for the two traits respectively. Although the gold standard for causal inference is randomized clinical trials (RCTs), RCTs are often too costly, unethical and infeasible, in which case MR offers a cost-effective approach. The recent years’ popularity of MR coincides with the ever-increasing availability of large-scale GWAS data and resources/tools like MR-Base [4]. As reviewed in [5], the potential of MR has been fully demonstrated by several notable examples in the studies of cardiovascular and metabolic diseases, such as MR analyses showing of C-reactive protein’s and moderate alcohol consumption’s unlikely role in reducing coronary heart disease risk, and replicating RCT findings of several drug targets (e.g. Niemann-Pick C1‑like protein). In most MR applications, one often assumes that the causal direction between two traits is known. Otherwise, bi-directional MR can be applied [6, 7]: one first applies MR to infer whether there is a causal relationship from one trait, say X, to another trait, say Y; then one reverses the roles of the two traits to infer whether there is a causal relationship from Y to X. If one of the two causal directions does hold, one would hope to detect it with bi-directional MR. Bi-directional MR can be also more intuitively applied graphically [8]: the estimated effect sizes on the two traits from the significant SNPs can be compared in two scatter plots, one from the SNPs significant for each trait; it is expected that, if the causal direction is X → Y, then there should be a strong relationship between the two sets of the effect sizes on the two traits of the SNPs significant for X, but not necessarily so for the SNPs significant for Y. However, whether bi-directional MR works depends on one critical, most often unknown, assumption: two sets of valid SNPs/IVs used in the two directions. That is, if an SNP primarily influences X, but influences Y only through X, then it should be used as an IV to infer X → Y in the first step; at the same time, it cannot be used as an IV to infer Y → X. In practice, due to limited knowledge, it is almost impossible to always choose valid IVs. Instead, often one just uses all significant (and independent) SNPs for X to infer X → Y, and vice versa for Y → X. As pointed out by [9], if an SNP primarily influences X (and influences Y completely through X), it will be associated with both X and Y. Hence, if the sample sizes are large enough, its associations with both X and Y will be detected, and it will be used as an IV in both steps or directions in bi-directional MR, leading to the incorrect conclusion that a causal relationship exists in both directions. This problem with bi-directional MR will be confirmed in our real data examples later. Relatedly, the result of bi-directional MR (or its graphical implementation) is confounded by possibly varying statistical power for the two traits (e.g., due to different sample sizes and different types of the traits such as one quantitative and the other binary). In the above example that an SNP primarily influences trait X but not Y, if the statistical power for detecting the association between the SNP and Y is large while that for the other pair is small, it is likely that the SNP will be significantly associated with Y but not with X, thus incorrectly used as an IV for Y but not for X, likely leading to the false conclusion of a causal effect of Y on X.

Recently a simple and effective approach, called MR Steiger’s method, was proposed [9]. It was discussed in the context of comparison with the main competitor of MR, mediation analysis, which requires one-sample data (i.e. both SNPs and the two traits are observed in the same sample). The proposed method is based on testing the difference of two (possibly correlated) Pearson’s correlation coefficients using Steiger’s test, and thus named. It was shown that MR Steiger’s method is more robust to measurement errors and hidden confounding than the causal inference test (CIT) [10] and likely other mediation analysis approaches. However, MR Steiger’s method has been most widely used in 2-sample MR applications. In [9], since MR Steiger was discussed in the context of mediation analysis, its derivation was mostly done in the absence of hidden confounding, though in an appendix the issue was briefly discussed under confounding; in contrast, an appealing advantage of MR is its valid causal inference while allowing the presence of hidden confounding, which is largely expected in practice. Therefore, first, we will formulate the problem in the presence of confounding; along the way we will point out for the first time that a key condition is needed for valid inference. Although the condition is expected to hold often in practice, it would be useful to have a method to check the condition; we propose such a method. Second, since MR Steiger’s method is based on using only a single SNP as IV, to improve both its statistical power and applicability, we extend it to allow multiple, and possibly correlated, SNPs. In particular, there is increasing interest in applying transcriptome-wide association studies (TWAS, or PrediXcan) to identify causal genes or other molecular/imaging endophenotypes by integrating GWAS with eQTL/xQTL data [11–20]. In these applications, multiple correlated cis-SNPs near a gene are used as IVs to impute or predict the gene’s expression level (or another endophenotype) to infer whether the gene’s expression has a causal effect on a trait, say coronary artery disease (CAD); it is assumed that the causal relationship (if existing) is from the gene to the trait. Albeit reasonable, there is possibility of reverse causation: a gene’s expression change is not the cause, but the consequence, of having CAD. Hence, it would be useful to apply a method to confirm whether the causal direction is indeed from the gene to the trait (or vice versa). In such a case, due to often a small sample size of an eQTL study, it would be low powered to apply MR Steiger’s method with a single SNP; instead, it would be more powerful and thus more desirable to apply a method with all the cis-SNPs used to predict the gene’s expression level. Our proposed new methods are more suitable for such an application. The same argument was made for two traits based on large-scale GWAS when single SNP summary statistics-based MR (SMR) [13] was generalized to GSMR with multiple (possibly correlated) SNPs [21]. Third, we develop corresponding goodness-of-fit tests, similar to Cochran’s Q statistic, to check the IV and other modeling assumptions for our proposed methods. Fourth, most importantly, with multiple SNPs/IVs, we can relax one of the three key assumptions in MR for valid causal inference: allowing the presence of horizontal pleiotropy, which is expected to be commonplace as shown by previous studies [22, 23]. In particular, the likely violation of the IV assumptions for TWAS and its severe consequences on invalid causal conclusions have been increasingly recognized and discussed in the literature [14, 24–26]. We propose a valid approach similar to Egger regression, which has been widely used for exactly the same purpose (i.e. in the presence of horizontal pleiotropy) in MR [27]. As a side product of allowing horizontal pleiotropy, each of our methods can be applied to infer whether there is any causal relationship between two traits and if so, what is the causal direction; in contrast, MR Steiger’s method can only infer causal direction after assuming the existence of one of the two causal directions. Finally, we introduce a multivariable approach, analogous to multivariable MR (MV-MR) [28, 29], as another robust approach; although we focus on uni-directional causal relationships, we also briefly discuss an extension to cases with possibly bi-directional causal relationships.

To demonstrate the wide-ranging utility and superior performance of our proposed methods, we applied both the new and existing methods to infer the causal directions between low density lipoprotein (LDL) cholesterol, or high density lipoprotein (HDL) cholesterol, and CAD. Multiple previous MR analyses have clearly established the causal role of LDL cholesterol on CAD [30–33], confirmed by meta-analyses of RCTs with statins and other cholesterol-lowering interventions [34–36]. Hence, the established causal direction of LDL → CAD can serve as a positive control to test various methods. Indeed, after accounting for horizontal pleiotropy, our methods reached the correct conclusion. In contrast, all other methods, including our method without accounting for horizontal pleiotropy and bi-directional MR (with various MR methods either ignoring or robust to/accounting for horizontal pleiotropy), failed to reach the correct conclusion. On the other hand, contrary to observational studies [37], HDL cholesterol does not appear to have a causal role on CAD, as evidenced by several MR analyses and recent RCTs [31–33, 38, 39]. Hence, it can serve as a negative control. Again our proposed and most other methods, as expected, could not establish any causal direction between HDL cholesterol and CAD. We also conducted extensive simulation studies to support the wide applicability and validity of our proposed methods.

Methods

We will first discuss the ideal case when all chosen SNPs are valid IVs, which means each SNP satisfies the three key IV assumptions: (i) each SNP is associated with the exposure; (ii) conditional on the exposure, each SNP is not associated with the outcome; i.e., there is no direct effect of each SNP on the outcome; (iii) each SNP is not associated with hidden confounders. Then we will consider the case when assumption (ii) is violated when one or more SNPs have direct effects, i.e. horizontal pleiotropic effects, on the outcome. We propose an Egger regression-like approach and an alternative based on the Generalized Least Squares (GLS) estimation method, both performing well and similarly. Finally, we introduce a multivariable extension to the Egger regression method, analogous to (MV-MR), as a more robust approach, and an extension to cases with possibly bi-directional causal relationships.

Ideal case: With valid IVs (and hidden confounding)

True model

If two traits X and Y are causally related, we’d like to determine the true causal direction, which can be either X → Y or Y → X. In the ideal case where all the SNPs used are valid IVs, two possible true models are shown in Fig 1. We use g to represent one SNP (but later more SNPs) to be used as IV, and U is a hidden confounder (or an ensemble of all confounders). Each directed edge represents a direct and causal effect with the effect size above the line. As usual, we also assume that all the relationships in the models are linear.

For X → Y, the data-generative/causal models are:

where ϵ_X and ϵ_Y are independent error terms.

For Y → X, we have the corresponding true models.

Steiger’s method based on a single SNP

Based on true models (1) and (2), we have the (true) correlation between X and g, denoted by ρ_Xg, and correlation between Y and g, denoted by ρ_Yg, as

From Eq (3) we obtain the correlation ratio:

We note that the correlation ratio K_YX is a constant, independent of g. It is easy to see that, if there is no confounder U in the true models, and without loss of generality assuming var(ϵ_Y) > 0, we have

implying |K_YX| < 1. On the other hand, if the true model is Y → X, we can similarly define and derive that |K_XY| ≔ |ρ_Xg/ρ_Yg| < 1.

It is noted that the two conditions, |K_YX| < 1 and |ρ_Yg| < |ρ_Xg|, are equivalent. While we will use the first one, the second one was used in MR Steiger’s method [9]. In spite of the trivial equivalence between the two, a key insight is that, because K_YX is independent of g, it can be used to combine over multiple SNPs/IVs; furthermore, the first one can be used to infer whether there is indeed a causal relationship between the two traits by checking whether K_YX = 0, equivalent to β_YX = 0, in contrast to simply assuming the existence of a causal relationship in Steiger’s method. These two points here are to be elaborated on later.

From two independent GWAS summary datasets with sample sizes n_X and n_Y for traits X and Y respectively, for SNP g, we have its marginal effect estimate on X and the variance, β ^ M , X g and v a r ( β ^ M , X g ), and the corresponding ones for Y as β ^ M , Y g and v a r ( β ^ M , Y g ). Then we can calculate its sample correlations with X and Y, denoted by r_Xg and r_Yg, from the summary data:

MR Steiger’s method [9] first applies Fisher’s Z-transformation to |r_Xg|:

and calculate Z_Yg = F(|r_Yg|). Then it infers the causal direction by comparing Z_Xg and Z_Yg to obtain Z_g = Z_Xg −⁠ Z_Yg and its variance var(Z_g) = 1/(n_X −⁠ 3) + 1/(n_Y −⁠ 3). The test statistic Z g / v a r ( Z g ) is compared to a standard normal distribution N(0, 1) to yield a p-value. If the p-value is larger than a given significance level, MR Steiger’s method would not make any conclusion; otherwise, depending on if Z_g > 0 or not, it would conclude that the causal direction is from X to Y, or from Y to X.

It is noted that, to test the suitable hypothesis on the magnitudes of the two correlations, MR Steiger’s method takes the Fisher’s transformation on the absolute values of the two correlations, i.e., |r_Xg| and |r_Yg|. For finite samples, using the non-negative |r_Xg| and |r_Yg| may lead to poor approximations by their normal distributions, though it is asymptotically valid (under usual regularity conditions, including each correlation is not 0 or ±1). S10 Text gives an illustration.

It is clear that the key condition for Steiger’s method as proposed by [9] is inequality (5). It is expected to hold with the following intuitive explanation: based on true model (2), var(Y) represents the total variation of Y while β Y X 2 · v a r ( X ) is only its partial variation mediated through X. However, in the presence of confounder U, as shown in Eq (2), we have

from which we cannot conclude that (5) holds, simply because the sum of the last three terms in var(Y) can be negative. A sufficient condition for (5) is that β_YX β_YU cov(X, U) is (i) positive, or (ii) negative but small in magnitude. Condition (i) means that the directions of the correlations between X and U and between Y and U are consistent with the sign of β_YX: for example, if β_YX > 0, then the effects of U on X and Y are in the same direction; on the other hand, if β_YX < 0, then the effects of U on X and Y are in different directions. A sufficient condition for (ii) is the small confounding and small causal effect sizes. For complex traits, we believe that these assumptions (i) and (ii) are often reasonable, albeit untestable directly due to that the confounder U is unobserved. Furthermore, [9] showed numerically that, if the proportion of the variance explained between the two traits, R Y X 2, is less than 0.2, then the condition almost always holds. It is noted that for complex human traits X and Y that are influenced by many genetic and environmental factors, R Y X 2<0.2 is expected to hold often.

Nevertheless, as an assumption, it would be desirable to check condition (5) in practice. Next we propose such a method. First, we can obtain an estimate β ^ Y X of the causal effect β_YX, e.g. by MR Egger regression. Second, if we have individual level data, then we can estimate var(X) and var(Y) directly, thus verify whether the condition holds. It is noted that var(X) and var(Y) are just the marginal variances of the two traits, which can be consistently estimated by the corresponding sample variances. On the other hand, if we only have GWAS summary data for both traits, we propose the following procedure to estimate var(X) and var(Y). For each SNP g, we have its marginal association estimate β ^ M , X g, v a r ( β ^ M , X g ) and var(g) = 2MAF(1−MAF) in the marginal model X = β M , 0 X g + β M , X g ⋅ g + ϵ X ′ , where MAF is the minor allele frequency of the SNP. Then it is easy to verify that

where n_X is the sample size. We can similarly estimate var(Y). Since such variance estimates may depend on the SNP g being used, we can obtain many such estimates based on each of many SNPs, then check the distribution of var(Y)/var(X) and compare to β ^ Y X 2 to verify condition (5).

In practice, as to be shown in our real data examples, in order to apply MR Steiger’s method and its extensions to be proposed next, we first assume that, if the causal direction is from X to Y, condition (5) holds; we apply our above proposed procedure to check it.

New method with multiple and possibly correlated SNPs

Steiger’s method is based on using only one SNP as IV. There is a good reason to extend it to multiple and possibly correlated SNPs, e.g. from a genomic locus: to improve statistical efficiency with multiple SNPs. In particular, in some applications such as TWAS [11, 12], it is necessary to use multiple correlated cis SNPs to more effectively predict/impute one gene’s expression level; if we are interested in inferring whether the causal direction is from the gene to a GWAS trait, we have to use multiple correlated SNPs to improve power.

Now we describe the new method for multiple, possibly correlated, SNPs. Due to linkage disequilibrium (LD), it is likely that some SNPs at the same locus are correlated and are significantly associated with both X and Y. Suppose we have m such SNPs denoted in g; for each SNP i, from Eq (6), with summary statistics we can get its sample correlations with X and Y respectively as r X g i and r Y g i. Denote r X g = ( r X g 1 , . . . , r X g m ) T, r Y g = ( r Y g 1 , . . . , r Y g m ) T, ρ X g = ( ρ X g 1 , . . . , ρ X g m ) T and ρ Y g = ( ρ Y g 1 , . . . , ρ Y g m ) T. As discussed in [40], we obtain the asymptotic distributions of r_Xg and r_Yg as

where the covariance matrices V_Xg and V_Yg depend on the correlation matrix of the SNPs in a complicated way, as shown in S1 Text. With GWAS summary data, we use a reference panel, such as one from the 1000 Genomes Project [41], to estimate the correlation matrix of the SNPs. Given the asymptotic distributions of r_{X_g} and r_{Y_g}, and the two independent GWAS datasets, r_{X_g} and r_{Y_g} are also independent, leading to a block-diagonal covariance matrix for their joint distribution. By the delta method, we can obtain:

where “≔” represents “by definition”, 1 = (1, …, 1)^T and

As we have already shown, ρ Y g 1 / ρ X g 1 = ⋯ = ρ Y g m / ρ X g m = K Y X is a constant, independent of the SNPs. Hence, the generalized least squares (GLS) method is applied to estimate K_YX as

We call this method Causal Direction-Ratio, or CD-Ratio for short. It is noted that CD-Ratio can be applied to special situations with only one SNP, or with multiple independent SNPs. Note that for two independent SNPs, their sample correlations with the same trait are in general non-independent.

With each individual estimate, or the combined estimate K ^ Y X and its v a r ( K ^ Y X ), we can construct a confidence interval for K_YX. Similarly, from the other direction we can estimate K_XY as K ^ X Y with v a r ( K ^ X Y ). Assuming condition (5) is satisfied, we infer the causal direction between X and Y by comparing whether the confidence interval of K_XY (or K_YX) is completely within or outside interval [-1, 1], as to be discussed later.

Extension to case with invalid IVs with direct effects

Now consider one or more SNPs/IVs with direct effects on the outcome as shown in Fig 2. The true causal models are

implying

True model for causal direction from <i>X</i> to <i>Y</i> in the presence of pleiotropic/direct effect (i.e. <i>α</i><sub><i>Yg</i></sub>). — **Fig. 2. True model for causal direction from X to Y in the presence of pleiotropic/direct effect (i.e. α_Yg).**

Accordingly, we calculate the (Pearson’s) correlations between the SNP and X and Y,

Again by definition K Y X ≔ β Y X v a r ( X ) / v a r ( Y ); assuming condition (5) is satisfied, we have |K_YX| < 1. Furthermore, K_YX = 0 is the sufficient and necessary condition for β_YX = 0, i.e. no causal relationship from X to Y. Different from that in the Ideal Model, as pleiotropic effects are allowed now, even there is no causal relationship between X and Y, there could still be significant SNPs g for both traits due to pleiotropy. By testing whether K_YX = 0 or not, we can determine the existence of a causal relationship.

It is noted that b_Yg ∝ α_Yg ≠ 0. We next develop an estimation method by treating b_Yg’s as random effects, similar to that in MR Egger regression [27]. For simplicity we assume g’s are standardized to have mean 0 and variance 1, and the covariance matrix of g’s is Σ. Since g’s are possibly correlated, we may have correlated b_Yg’s. Assuming the direct effects from g to Y are independently from a normal distribution, we would have

See S2 Text for the derivation. For simplicity, in the following we denote v = Σ1 = (v₁, ⋯, v_m)^T, b₀ for b 0 / v a r ( Y ), and σ 0 2 for σ 0 2 / v a r ( Y ). Using the sample correlation coefficients to approximate the population correlation coefficients in Eq (16) (while ignoring the errors in r_Xg), we have

in which the InSIDE assumption (i.e. the independence between β_Xg and α_Yg) is imposed (as for MR Egger regression). Given σ 0 2, estimating (b₀, K_YX) is a GLS problem, thus we estimate (b₀, K_YX) and σ 0 2 iteratively, and obtain their standard errors s e ( b ^ 0 ) , s e ( K ^ Y X ); see S2 Text for details. Briefly, we simply regress the two sample correlations with an intercept term to estimate the slope parameter K_YX using the GLS method (which reduces to the weighted least squares with the weight inversely proportional to var(e_Yg) if the SNPs are independent). This is similar to Egger regression in MR, for which the effect size estimates, instead of the sample correlation coefficients, are regressed against each other. We call this method Causal Direction-Egger, or CD-Egger for short.

A possible downside of the regression approach is its ignoring the variability of r_Xg while account for that of r_Yg, similar to that in MR Egger regression [42]. To overcome this shortcoming, we propose a new method called Causal Direction-GLS, or simply CD-GLS. Since it performs similarly to CD-Egger (due to small variability of r_Xg with a usually large sample size of GWAS), we relegate its detail to S2 Text.

Remarks on extension to case with invalid IVs associated with confounders

Now we consider an SNP with a direct effect on the hidden confounder, as shown in the DAG in the left panel of Fig 3. This is the case when the third key IV assumption is violated. Note that, as in [27], for simplicity we assume an effect size 1 of the confounder U on both X and Y, while it can be easily generalized. The case can be converted to that as if the second key IV assumption is violated, as shown in the right panel of Fig 3. It may appear that the two new methods developed in the previous subsection can be applied. However, the InSIDE assumption (i.e. the independence assumption of the IV effect on the exposure X, γ = γ₁ + β, and that on the outcome Y, α = α₁ + β) is clearly violated because of their shared component β. Hence, applying the two new methods developed in the previous section would lead to biased inference, unless the direct effect β is small and thus negligible. We will not pursue this challenging topic further, which is beyond the scope of this paper.

The case where the SNP/IV is directly associated with confounder <i>U</i> (left panel), which, by absorbing the effect of <i>SNP</i> → <i>U</i> into <i>SNP</i> → <i>X</i> and <i>SNP</i> → <i>Y</i>, leads to the case where the SNP/IV has a direct effect on <i>Y</i> (right panel). — Fig. 3. The case where the SNP/IV is directly associated with confounder U (left panel), which, by absorbing the effect of *SNP* → U into *SNP* → X and *SNP* → Y, leads to the case where the SNP/IV has a direct effect on Y (right panel).

Decision rules

For each of three methods CD-Ratio, CD-Egger, and CD-GLS, we would have K ^ X Y and K ^ Y X. Along with their standard errors (SEs), we construct the confidence intervals at any significance level, say α, for K_XY and K_YX respectively as C I X Y = K ^ X Y ± Z α / 2 SE ( K ^ X Y ) and C I Y X = K ^ Y X ± Z α / 2 SE ( K ^ Y X ). Assuming Key Condition (5) holds, then our decision rules are: (i) If CI_XY is completely outside interval [-1, 1], while CI_YX is inside [-1, 0) or (0, 1], we conclude that the causal direction is from X to Y; (ii) If CI_YX is completely outside interval [-1, 1], while CI_XY is inside [-1, 0) or (0, 1], we conclude that the causal direction is from Y to X; (iii) Otherwise, we cannot make a conclusion.

In the above, we have incorporated inferring whether there is indeed any causal relationship between the two traits X and Y, in contrast to Steiger’s method [9] in simply assuming the existence of a causal relationship. As to be shown in simulations, this is advantageous over Steiger’s method and expected to be useful in practice. Our proposal again exploits the use of our formulated estimand K_YX (or K_XY): the sufficient and necessary condition for no causal relationship with β_XY = β_YX = 0 is by definition equivalent to K_YX = K_XY = 0. Hence, we propose using their confidence intervals to determine the existence of a causal relationship.

Goodness-of-fit tests

In MR the intercept estimate and its SE from Egger regression can be used as a goodness-of-fit (GOF) test to check the assumptions of the valid IVs [27]; some extensions have also been developed [43]. Accordingly, we can also use our estimate b ^ 0 and its CI for such a purpose. In addition, as argued in [44], it is likely better and more powerful to conduct homogeneity testing using Cochran’s Q statistics (for the IVW method) and Rucker’s Q′ statistics (for MR-Egger regression). We develop the corresponding three tests in the current context, similar to but different from those for IVW or Egger regression in MR. Using the asymptotic distribution (9), we can construct a chi-squared GOF test statistic for CD-Ratio:

where K ^ Y X is the CD-Ratio (or any other consistent) estimate. Under the null hypothesis that all SNPs are valid IVs, the (asymptotic) null distribution is Q R a t i o Y X ∼ χ m -⁠ 1 2, a chi-squared distribution with degrees of freedom m −⁠ 1, where m = dim(r_Xg) is the number of the SNPs/IVs being used. Rejection of the null hypothesis may imply that the SNPs may have direct effects on the outcome, i.e. there is horizontal pleiotropy.

Based on the asymptotic distribution (17), we can construct a similar GOF test for CD-Egger after accounting for the possible presence of horizontal pleiotropy:

where b ^ 0 and K ^ Y X are the CD-Egger estimates of b₀ and K_YX respectively. Under the null hypothesis that the model assumptions hold, the (asymptotic) null distribution is Q E g g e r Y X ∼ χ m -⁠ 2 2. We can construct a similar GOF test based on CD-GLS, which we found performed similarly to Q E g g e r Y X and thus is relegated to S3 Text. Rejection of the null hypthesis by either test may imply that the SNPs/IVs are possibly correlated with some hidden confounders.

For the other direction Y to X, similarly we can perform corresponding chi-squared GOF tests with Q R a t i o X Y, Q E g g e r X Y and Q G L S X Y.

Further extensions

Multivariable approach

Analogous to the extenstion of MR to multivariable MR (MV-MR) [28, 29], we extend our CD-Egger method to its multivariable version, denoted MV-CD-Egger, to explicitly account for pleiotropic/direct effects through a third (or more) trait. Specifically, we introduce the third trait A, which may be an observed confounder for the two traits of interest, X and Y. When the true causal direction is from X to Y, the true causal model is:

Here e_A, e_X, e_Y are independent random error terms. In model (20), we allow a general relationship between A and X, and between A and Y. When the true causal direction is from Y to X, we have a corresponding true model. The two possible true models are illustrated in Fig 4.

Two possible true models: The left one is for causal direction from <i>X</i> to <i>Y</i>, and the right one is for <i>Y</i> to <i>X</i>. — **Fig. 4. Two possible true models: The left one is for causal direction from X to Y, and the right one is for Y to X.**

Under the true model of the causal direction from X to Y, since the IV g is independent of the hidden confounder U and error term e_Y, we can calculate the covariance between g and Y as:

Let ρ_Yg be the correlation between Y and g, and similarly define ρ_Xg and ρ_Ag, then we have

Let K Y X = β Y X · v a r ( X ) v a r ( Y ), K Y A = β Y A · v a r ( A ) v a r ( Y ), and b Y g = α Y g · v a r ( g ) v a r ( Y ). Paralleling with the key condition (5), we assume |K_YX| < 1 and |K_YA| < 1. Hence we have

As for CD-Egger, we model the distribution of b_Yg similarly, and apply an iterative method to obtain the maximum likelihood estimates of parameters K_YA and K_YX with their standard errors as shown in S2 Text; the only change is to redefine X = (v , r X g , r A g). At the end, we construct the normal-based confidence intervals for K_YX and K_XY, and apply the same decision rules as before.

Bi-directional causal effects

So far we have restricted our discussion on uni-directional causal effects. Inspired by [45], as suggested by a reviewer, we now briefly discuss an extension of our CD-Ratio method to bi-CD-Ratio that works regardless whether there is no, a uni -⁠ or bi-directional causal relationship between two traits. In the presence of bi-directional causal effects, as discussed by [45], for the model identifiability, at least one valid IV is required for each trait as shown in Fig 5, which is the case we consider here.

The true model for two traits <i>X</i> and <i>Y</i> with a bi-directional causal relationship: <i>g</i><sub>1</sub> and <i>g</i><sub>2</sub> are valid IVs for <i>X</i> and <i>Y</i> respectively, and <i>U</i> is unobserved confounder. — **Fig. 5. The true model for two traits X and Y with a bi-directional causal relationship: g₁ and g₂ are valid IVs for X and Y respectively, and U is unobserved confounder.**

The true causal model is:

where ϵ₁ and ϵ₂ are independent errors for X and Y respectively. The reduced form of the true model is

from which it is easy to calculate the correlation between X and g₁, denoted by ρ X g 1, and similarly ρ X g 2, ρ Y g 1 and ρ Y g 2 as:

and thus

As before, since β 1 2 · v a r ( X ) represents a part of the variance in Y due to the direct effect of X on Y, we assume β 1 2 · v a r ( X ) < v a r ( Y ), and thus |K₁| < 1. Similarly we assume |K₂| < 1. These two assumptions parallel with our key condition (5).

From the GWAS summary data for traits X and Y, we calculate the sample correlations r X g 1, r X g 2, r Y g 1 and r Y g 2. Then we apply our CD-Ratio method with g₁ to estimate K₁ as K ^ 1 with its standard error SE ( K ^ 1 ), and construct its 100(1 −⁠ α)% confidence interval C I K 1 = K ^ 1 ± Z α / 2 SE ( K ^ 1 ). If C I K 1 is completely inside [−1, 0) or (0, 1], we conclude that X has a causal effect on Y; if C I K 1 covers 0, then we conclude that X has no causal effect on Y. Similarly we construct a confidence interval for K₂ and infer whether there is a causal effect from Y to X. In this way, we may conclude with bi-directional causal effects between X and Y.

As to be shown in simulations, in the case with valid IVs, both our bi-CD-Ratio and bi-directional MR performed well, being able to distinguish among no, uni -⁠ and bi-directional causal relationships between two traits. In contrast, Steiger’s method did not work because it cannot determine whether there is any causal relationship between two traits.

Other methods

In addition to Steiger’s method, bi-directional MR, also called reciprocal MR, is popular. For the latter, it consists of two steps: first, an MR analysis is applied to infer the causal effect of X on Y, often using a set of (approximately) independent SNPs significantly associated with X as IVs; second, a separate MR analysis is applied to infer the causal effect of Y on X using a possibly different set of independent SNPs that are significantly associated with Y. The decision rules are similar to ours: 1) if the result is significant in Step 1, but not in Step 2, we conclude that the causal direction is from X to Y; 2) If the result is significant in Step 2, but not in Step 1, we conclude that the causal direction is from Y to X; 3) Otherwise, we cannot determine the causal direction. However, an issue with bi-directional MR is that the differing sample sizes, or generally differing statistical power, between the two GWAS for traits X and Y respectively, would possibly confound the result. An extreme example for an ideal case is that, suppose that both GWAS are sufficiently powered to detect all associated SNPs; it is easy to see that, if there is indeed a causal relation from X to Y and that Y is completely determined by X genetically, then all the SNPs associated with X would be associated with Y too, leading to that both Steps 1 and 2 would give significant results.

We will apply a wide range of MR methods, representing both well-established and new ones as implemented in R package TwoSampleMR, for bi-directional MR. These include the inverse-variance–weighted (IVW) method [46], MR-Egger regression [27], the weighted median estimator [47], the weighted mode-based estimation [48, 49], and the Robust Adjusted Profile Score (RAPS) method [42]. To save space, we will only present the results of RAPS with the set-up of overdispersion and a squared-error loss; using other set-ups led to qualitatively similar results. The IVW is perhaps the most popular, requiring all the three key assumptions on valid IVs to hold, under which it is most efficient; the other ones either explicitly account for or are robust to horizontal pleiotropy. Since all these MR methods require using independent SNPs, we selected the most significant SNP from each locus if it contained more than one significant SNPs, then applying MR to only independent SNPs in any simulated or real data.

GWAS data and reference panel

To draw inference on the causal direction between two traits using either our proposed methods or bi-directional MR, we only need two independent GWAS summary datasets for the two traits respectively, and a reference panel to estimate LD-matrices for correlated SNPs.

We showcase an application to infer causal directions between LDL and CAD, and between HDL and CAD. The large-scale GWAS summary data for traits LDL/HDL and CAD were drawn from two year 2017 publications [50, 51] with sample sizes up to 295826 for LDL, 316391 for HDL and 336860 for CAD of primary European ancestry respectively. We used the European samples of 489 unrelated individuals in the 1000 Genomes Project [41] as our reference panel to calculate the LD-matrix.

For our proposed methods and Steiger’s method, we chose the IVs from the SNPs significantly associated with both traits, denoted as before as X and Y. Based on [52], we partitioned the whole genome into 1703 approximately LD-independent loci and assigned the significant SNPs (associated with both X and Y) to one of these 1703 independent loci. We pruned out highly correlated SNPs falling inside each locus in the following two steps. Step 1, we ordered the significant SNPs in each locus according to their combined significance level with X and Y. Suppose there were m significant SNPs with their p-values p_X = (p_X1, …, p_Xm) and p_Y = (p_Y1, …, p_Ym) for two traits X and Y respectively. We sorted p_X and p_Y in an ascending order. For each SNP i, i = 1, …, m, we obtained the ranks of its p-values p_Xi and p_Yi, denoted as I_Xi and I_Yi. Then we calculated its rank sum I_i = I_Xi + I_Yi, i = 1, …, m. After that, we reordered the SNPs in the ascending order of I_i. Step 2, we pruned out some highly correlated (and significant) SNPs in each locus. We kept SNP 1, and for each i = 2, …, m, if |cor(SNP_i, SNP_j)| ≤ 0.8 for all j = 1, …, i −⁠ 1, we kept SNP i; otherwise we removed it. At the end, we had a set of the significant SNPs with pairwise absolute correlations less than 0.8.

Main simulation set-ups

For our main simulations, to mimic the real data analysis for LDL and CAD, in which 22 SNPs from 12 loci were used, we generated simulated data with the same 22 SNPs. We extracted the genotype data of 489 European individuals from the 1000-Genomes Project, then replicated the genotype data of the 22 SNPs from the 489 individuals 200 times, obtaining a sample of size n = 97800.

Following [53], in each simulation, we generated the two traits from true models (12-13), where g was the vector of the genotype scores of the 22 SNPs, β_Xg was the (row) vector of the effect sizes of the 22 SNPs on trait X with each element independently drawn from a truncated standard normal distribution (truncated at 0.5 to ensure | β X g j | ≥ 0 . 5 ), the confounder U ∼ N(0, 1); the causal effect size β_YX from X to Y was fixed at one value, the direct effects of the SNPs on Y, the elements of (row) vector α, were generated independently α i ~ N ( μ α , σ α 2 ) ; and both error terms were independently from N(0, 4). With various values of β_YX, μ_α and σ_α, we considered a total of 75 simulation set-ups, covering a wide range of situations from no pleiotropy (i.e. μ_α = σ_α = 0) to balanced pleiotropy (i.e. μ_α = 0 and σ_α > 0) and direction pleiotropy (i.e. μ_α ≠ 0 and σ_α > 0).

After generating each simulated dataset, we performed single SNP-based univariate/marginal analysis for each trait, obtaining the corresponding two GWAS summary datasets. The proposed methods, CD-Ratio, CD-Egger and CD-GLS, were then applied to the GWAS summary data, and for each method one of three possible decisions was made according to the decision rule. For comparison, we also applied MR Steiger’s method to each of the 22 SNPs; we then calculated the proportions among the 22 SNPs of the three possible decisions reached, denoted by “Steiger-Prop”. In addition, we also pooled together the individual results from each of the 22 SNPs by majority voting to reach a final decision; this method is denoted as “Steiger-MV”.

Results

Real data examples

LDL and CAD

For LDL and CAD, at the significance cutoff of p < 5 × 10⁻⁸, out of 1703 (approximately) independent loci [52], we identified 39 significant loci for CAD, 102 for LDL, and 12 common ones for both traits from the two GWAS summary datasets. There were 40 SNPs in these 12 common loci significant for both LDL and CAD. After pruning out highly correlated ones, we obtained 22 SNPs in the 12 loci. We applied MR Steiger’s method and our proposed methods with these 22 SNPs as IVs. For each of the 22 SNPs, we applied MR Steiger’s method to obtain a 95% CI for Z_g, as shown in Fig 6(A). If a CI covers 0, no conclusion would be made; otherwise, depending on whether it is in the right or left side of the zero point, we conclude the causal direction of LDL to CAD or the opposite. A problem with single SNP-based analysis is its lower power and thus possibly inconsistent results: as shown in Fig 6(A), although we’d conclude the causal direction of LDL driving CAD based on most of the SNPs, we would fail to conclude based on each of five SNPs while concluding CAD driving LDL on one SNP.

The results by Steiger’s method: 95% CI of the test statistics <i>Z</i><sub><i>g</i></sub> for each SNP. — **Fig. 6. The results by Steiger’s method: 95% CI of the test statistics Z_g for each SNP.**

As an alternative, under the Ideal Model, at each locus, we first applied CD-Ratio to all individual SNPs; if there were more than one SNP, we applied CD-Ratio to combine the SNP-specific results at the locus; then we applied CD-Ratio to all 22 SNPs from the 12 loci. For the Pleiotropy Model, we applied CD-Egger and CD-GLS directly to all the SNPs across all the loci. Fig 7 shows the results. First, we emphasize that, perhaps as expected, the single SNP-based results varied with the specific SNP being used as the IV, presumably due to that some SNPs were not valid IVs. For example, for the majority of the SNPs, CD-Ratio would give each a 95% confidence interval (CI) of K_{LDL → CAD} completely inside (0, 1], but a few outside, supporting the causal direction from LDL to CAD. Also note that none of the CIs covered 0, supporting the existence of a causal relationship between the two traits. Second, for the several loci with multiple SNPs, by combining the results over the SNPs, our proposed CD-Ratio improved statistical efficiency by giving shorter locus-specific CIs than the individual SNP-specific ones. Finally, most importantly, under the Ideal Model, after combining over all 22 SNPs from 12 loci, the CD-Ratio method could not conclude on the causal direction, because its CI for K_{LDL → CAD} was inside (0,1] while that for K_{CAD → LDL} covered 1, as given in Table 1. In contrast, after accounting for possible horizontal pleiotropy, our proposed methods CD-Egger and CD-GLS would both conclude the causal direction from LDL to CAD, in agreement with the previous MR analysis and RCT results reported in the literature [30, 34–36].

**Tab. 1. The results for inferring the causal direction between LDL and CAD after combining across all 22 SNPs in 12 loci.**

Fig. 7. Forest plots for inferring the causal direction between LDL and CAD, showing the differently colored SNP- and locus-specific results and the combined results across all SNPs and loci, respectively.

It is noted that, Locus 253 and Locus 1246 showed a negative (i.e. beneficial) effect of LDL on CAD, in contrary to what is known in the literature. Treating each as an outlier, after removing it, we applied the new methods, and all the three (including CD-Ratio) reached the same conclusion on the causal direction from LDL to CAD; see S4 Text. On the other hand, if we only selected and used one SNP from each locus, CD-Ratio could not conclude while both CD-Egger and CD-GLS reached the same conclusion of LDL to CAD, as shown in S4 Text.

It is emphasized that the necessity of accounting for horizontal pleiotropy in this example was supported by the statistically significant non-zero estimates of b₀ in either direction, which was interpreted as representing the mean of the pleiotropic effects; it was notably larger when considering causal direction CAD → LDL than that in the other direction (Table 1). This latter point was also confirmed by the results of MR Egger regression as to be shown. Furthermore, we conducted the GOF test with Q_Ratio = 337.1, leading to a p-value almost 0, rejecting the null hypothesis that the Ideal Model was adequate. The GOF test Q_Ratio = 527.2 for CAD → LDL, leading to a p-value almost 0 too. For CD-Egger, for the direction LDL → CAD we got Q_Egger = 22.2 and p-value = 0.329; for the other direction, we obtained Q_Egger = 22.0 and p-value = 0.340. The similar results were obtained from Q_GLS, supporting the adequacy of the Pleiotropic Model and the good performance of CD-Egger and CD-GLS.

For comparison, we also applied bi-directional MR. In each of 39 loci significant for CAD, we picked up the most significant SNP (i.e. with the smallest p-value) as the IV, then applied various MR methods to estimate and test the causal effect of CAD on LDL. Similarly, in each of 102 significant loci for LDL, we used the most significant SNP as the IV, then applied MR to estimate and test the causal effect of LDL on CAD. The results are shown in Table 2. We can see that, at the significance level 0.05, IVW, Egger and RAPS all showed some significant causal effects of CAD on LDL, and of LDL on CAD. Although Weighted Median and Weighted Mode concluded that CAD had no causal effect on LDL, and that LDL had causal effect on CAD, it could be due to the low statistical efficiency of the two methods; one evidence is that the two methods gave a negative effect of CAD on LDL, contrary to the known positive association between the two. Overall, across all the methods, there was much stronger statistical evidence to support the causal direction LDL → CAD than the other direction. Besides, MR-Egger gave an estimate of the mean pleiotropic effect for direction LDL to CAD as -0.003 with standard error 0.003, and -0.054 with standard error 0.028 for CAD to LDL. In summary, as expected, due to use of possibly invalid IVs, bi-directional MR tended to conclude causal effects for both directions.

**Tab. 2. Results of bi-rectional MR for the causal relationship between LDL and CAD by various MR methods.**

Finally, since our proposed CD-Egger and CD-GLS suggested the causal direction from LDL to CAD, we checked the key condition (5). Using all the 40 signfiicant SNPs in the 12 loci, we obtained the distribution of var(CAD)/var(LDL) ranging from 7.62 to 13.95 with mean 9.44, much larger than any MR β ^ 2 for LDL → CAD listed in Table 2, supporting that condition (5) held here.

HDL and CAD

For HDL and CAD, at the significance cutoff 5 × 10⁻⁸, out of 1703 (approximately) independent loci, we identified 39 significant loci for CAD, 119 for HDL, and 6 common ones for both traits. There were 19 SNPs significant for both HDL and CAD from the 6 common loci; after pruning out highly correlated ones, we obtained 8 SNPs. We applied the MR Steiger’s method and our proposed methods to the 8 SNPs in the 6 common and significant loci. Fig 6(B) shows the results from Steiger’s method, while Fig 8 showing those for CD-Ratio, CD-Egger and CD-GLS.

Again, similar to that for LDL and CAD, the single SNP results varied with different SNPs being used. Combining multiple SNPs in a locus showed gains with higher statistical estimation efficiency: for example, at Locus 1609 with 3 SNPs, CD-Ratio gave a shorter CI than the 3 individual SNP-specific ones. As shown in Table 3, we cannot conclude on the causal direction using CD-Ratio under the Ideal Model, because its CIs for K_HDL→CAD and K_CAD→HDL were both inside [−1, 0). After accounting for possible pleiotropy, both CD-Egger and CD-GLS gave a stronger but not statistically significant result of the causal direction from HDL to CAD; both reached the same conclusion: although each gave a CI of K_HDL→CAD within (-1, 0), each gave an estimate of K_CAD→HDL < −1 but with a CI covering −1, failing to determine the causal direction.

**Tab. 3. Results of inferring the causal direction between HDL and CAD after combining across all 8 SNPs in 6 loci.**

We also conducted GOF testing. First, we obtained Q_Ratio = 78.1, leading to a p-value almost 0, thus rejecting the null hypothesis that the Ideal Model for HDL → CAD was adequate. Similarly, we reached the same conclusion for CAD → HDL with Q_Ratio = 163.9 and a p-value almost 0. Next, for the Pleiotropic Model with CD-Egger, for the direction HDL → CAD, we obtained Q_Egger = 7.99 and p-value = 0.239; for the other direction, we had Q_Egger = 8.07 and p-value 0.233. Similar results were obtained with CD-GLS. Hence, we concluded the adequacy of the Pleiotropic Model.

We applied bi-directional MR for comparison. The results are shown in Table 4. At the significance level 0.05, IVW and RAPS showed significant causal effects of both HDL to CAD and CAD to HDL. MR Egger concluded with a significant causal effect of HDL to CAD but not CAD to HDL, while Weighted Median gave the opposite result. Weighted Mode showed that a causal effect from CAD to HDL was more significant than that from HDL to CAD, though neither was significant at level 0.05. In summary, different bi-directional MR methods gave different and contradictory results.

**Tab. 4. Results of bi-rectional MR for the causal relationship between HDL and CAD by various MR methods.**

Finally, given the inconclusive results by our methods, we checked the key condition (5) for both directions. Using all the 19 significant SNPs in the 6 loci, we obtained the distribution of var(CAD)/var(HDL) ranging from 8.37 to 11.77 with mean 9.84, much larger than all MR β ^ 2 for HDL → CAD; on the other hand, var(HDL)/var(CAD) ranged from 0.0849 to 0.1195 with mean 0.1032, larger than any MR β ^ 2 for CAD → HDL. Hence there was evidence supporting condition (5) (or its analog) in either direction.

LDL/HDL and CAD: Multivariable approaches

Since MV-MR has been shown to be more robust than MR, in particular in estimating causal effects of LDL and HDL on CAD [54], we also applied bi-directional MV-MR and MV-CD-Egger to infer the causal direction between LDL (or HDL) and CAD while adjusting for HDL (or LDL) as a potential confounder. As shown in S5 Text, our proposed MV-CD-Egger concluded with the causal direction from LDL to CAD, but no causal relationship between HDL and CAD. In contrast, both MV-MR-IVW and MV-MR-Egger concluded with LDL and HDL causal to CAD; in addition, MV-MR-Egger also identified a significant causal effect of CAD on LDL.

Main simulation results

Estimation of K_YX: CD-Ratio performed well with no pleiotropy while CD-Egger and CD-GLS always performed well

First we show the simulation results in terms of estimating K_YX by CD-Ratio, CD-Egger and CD-GLS. For each simulation set-up and each method, we show the mean and standard deviation (sd) of its estimates of K_YX from 500 simulated datasets, which are compared to the true value of K_YX and the mean standard error (se) respectively; we would like them to be as close as possible. From Table 5, we can see that CD-Egger and CD-GLS always gave almost unbiased estimates (with good SE estimates too) in all situations, suggesting their correct (and similar) inference on the causal direction. In contrast, while the CD-Ratio estimator was almost unbiased in the absence of pleiotropy, it became more and more biased as the mean pleiotropic effect (i.e. μ_α) increased. We obtained similar results and reached the same conclusion from many other simulation set-ups as shown in S6 Text.

Simulation results for estimating <i>K</i><sub><i>YX</i></sub>. — **Tab. 5. Simulation results for estimating K_YX.**

Inferring causal direction: CD-Ratio outperformed with no pleiotropy while CD-Egger and CD-GLS always outperformed Steiger’s and bidirectional MR methods

We next compare CD-Ratio, CD-Egger and CD-GLS with MR Steiger’s and bi-directional MR in terms of their decisions made on the causal direction. Fig 9 shows some representative results; more are given in S6 Text. First, we consider the case of no causal relationship between the two traits (i.e. β_YX = β_XY = 0). It was confirmed that MR Steiger’s does not work if there is no causal relationship between two traits: it often incorrectly concluded with one of the two directions; in contrast, our CD-Ratio based on single SNPs overcame this shortcoming if there was no pleiotropy, and performed similarly to Steiger’s method otherwise (see S6 Text). In general, bi-directional MR performed reasonably well, though most of them might make incorrect conclusions. For example, in the presence of balanced pleiotropy, all the five MR methods considered made incorrect decisions at least 10% time. In contrast, our proposed three new methods almost always reached the correct decision. Second, when the causal direction was from X to Y, our methods CD-Egger and CD-GLS had high (and similar) power to detect it across all the scenarios, regardless of the presence of balanced or directional pleiotropy. As expected, CD-Ratio performed well only when there was no pleiotropy or only balanced pleiotropy, but it completely lost power in the presence of directional pleiotropy. Steiger’s method was high powered except that it would often conclude with the incorrect causal direction in the presence of directional pleiotropy. On the other hand, all the five MR methods were low powered, and would often conclude with the incorrect causal direction in the presence of directional pleiotropy. Perhaps it is surprising to see so low power of bi-directional MR even in the ideal case of no pleiotropy. The reason is that bi-directional MR tends to detect both directions being significant (due to the use of invalid IVs), leading to no final conclusions by the decision rules (since we assume either no or a uni-directional causal relationship); this is confirmed by Fig 10, in which the significant results in either direction are shown separately (without being combined by the decision rules).

Relative frequencies of the causal directions reached by each method (from 500 simulations in each set-up) when there is no causal relatuonship (top two panels) or a causal direction is from <i>X</i> to <i>Y</i> (<i>β</i><sub><i>YX</i></sub> = 0.2; bottom two) with no pleiotropy, balanced pleiotropy (<i>μ</i><sub><i>α</i></sub> = 0, <i>σ</i><sub><i>μ</i></sub> = 0.1) and directional pleiotropy ((<i>μ</i><sub><i>α</i></sub> = 0.4, <i>σ</i><sub><i>μ</i></sub> = 0.1) respectively. — Fig. 9. Relative frequencies of the causal directions reached by each method (from 500 simulations in each set-up) when there is no causal relatuonship (top two panels) or a causal direction is from X to Y (β_YX = 0.2; bottom two) with no pleiotropy, balanced pleiotropy (μ_α = 0, σ_μ = 0.1) and directional pleiotropy ((μ_α = 0.4, σ_μ = 0.1) respectively.

The GOF tests performed well

In S6 Text, we show the results of our proposed GOF tests: all performed well with controlled type I error rates and high power in our simulations.

Secondary simulation results

Bi-CD-Ratio could detect bi-directional causal relationships

In S7 Text, we also show the results for inferring no, uni -⁠ and bi-directional relationships between two traits with valid IVs. It is confirmed that our proposed bi-CD-Ratio and (bi-directional) MR-Wald-Ratio performed well, while Steiger’s method did not work due to its inability to determine whether there is any causal relationship between two traits as discussed earlier.

CD-Ratio could perform in TWAS with relatively small sample sizes

We also did a simulation study to investigate how the asymptotic theory for the sample correlations and their ratios performs with smaller sample sizes as typical with molecular endophenotypes as in TWAS. We generated simulated data under realistic set-ups mimicking real eQTL datasets with sample sizes from around two hundreds to about a thousand, much smaller than those of typical GWAS as used in our main simulations. We also considered the presence of some highly correlated SNPs/IVs. As shown in S8 Text, our proposed CD-ratio performed very well: it might be a bit conservative with empirical type I error rates smaller than the nominal level when the sample size was small at a few hundreds; but as the sample size increased to a few thousands, its type I error rates approached the nominal level. Importantly, it was confirmed that by combining the results across the SNPs, even including some highly correlated ones, CD-Ratio improved the statistical power over that based on single SNPs.

Selection bias

We did simulations to study possible selection bias due to the double use of the same sample for SNP/IV selection and estimation. The simulation setups are the same as described in Main Simulation Set-ups section. We used 22 SNPs to generate simulated data, and in Main Simulation Results section we applied our proposed three methods to all 22 SNPs with the results shown in Table 5. Now for each simulated dataset, we used a p-value cutoff to select exactly N SNPs that were most significantly associated with both X and Y, then we applied a method with these N selected SNPs. We tried with N = 10, 15 or 22; when N = 22, we used all 22 SNPs as before. S9 Text shows the simulation results with no pleiotropy, balanced pleiotropy, and directional pleiotropy respectively. We summarize the results below. First, when there was no pleiotropy, all three methods gave almost unbiased estimates of K_YX with any of the three values of N, suggesting no or negligible selection bias. Second, with balanced pleiotropy, there were small biases. Third, with directional pleiotropy, there could be large biases. However, interestingly, under the null hypothesis of K_YX = 0, CD-Egger and CD-GLS yielded only small biases, leading to barely any inflation of their Type I error rates; see S9 Text.

Discussion

We have proposed several extensions to the increasingly popular MR Steiger’s method to infer the causal direction between two traits. MR Steiger’s method is based on using a single SNP as a valid IV; if there is strong biological knowledge to justify the use of the SNP as a valid IV, it is both useful and simple to apply the method. However, in practice, due to incomplete knowledge, one often just uses genome-wide significant SNPs as IVs, in which case there are several potential problems. Some of the SNPs are likely invalid IVs due to wide-spread horizontal pleiotropy. In addition, the results may be low-powered and inconsistent, each critically depending on the SNP being used. Our proposed multiple SNP-based approaches would alleviate the above two problems. First, by using multiple SNPs, we allow and thus explicitly model possible pleiotropic effects, leading to two new methods, CD-Egger and CD-GLS, that are robust to and more powerful in the presence of pleiotropy. Second, combining the results over multiple SNPs is both more robust and more powerful than a single SNP-based method. In applications to inferring/confirming causal directions between molecular endophenotypes and complex diseases or quantitative traits, such as in TWAS, one would typically use multiple SNPs in LD from a local region to predict an endophenotype (e.g. a gene’s expression level) with relatively small sample sizes; in these applications, it may be necessary to apply a multiple correlated SNPs-based method like ours to achieve as high power as possible. Third, we can apply some goodness-of-fit tests to check the modeling assumptions. Furthermore, our methods can simultaneously infer both the existence and (if so) the direction of a causal relationship between two traits. For this reason, even for single SNPs, our CD-Ratio is a useful alternative to Steiger’s method because of the former’s ability to avoid incorrect conclusions of a causal direction when there is no causal relationship (and no pleiotropy). In addition, our methods have some advantages, i.e. higher statistical power and lower false positives, over, and are thus useful alternatives to, multiple SNP-based bi-directional MR. These advantages over bi-directional MR are perhaps due to that our proposed methods take advantage of and thus exploit some specific relationships between the Pearson correlations between each SNP/IV and the two traits implicitly implied by the true causal models. These points were confirmed in both extensive simulations and an application to infer the causal direction between LDL cholesterol and CAD. In particular, bi-directional MR was found to be non-robust for inferring causal directions in the presence of invalid IVs even when equipped with a robust MR method. In summary, our proposed methods are expected to be useful in the toolbox for causal inference, for which application of multiple complementary methods, called triangulation [55], is strongly advocated.

Our methods have some limitations. First and foremost, our methods are based on the standard asymptotic theory for sample correlations under the assumption of a large sample size (say n) and a small number of SNPs/IVs (say m). It has some important implications. Although including more relevant SNPs/IVs is expected to improve the statistical estimation efficiency and power in the ideal case, in practice there are some arguments against doing so: 1) including too many SNPs, i.e. making m large relative to n, the standard asymptotic theory may break down, leading to incorrect inference; 2) in particular, including some highly correlated SNPs may gain little while being at the risk of inducing a covariance matrix nearly singular, leading to bad estimates and inference with numerical problems; 3) our methods require estimating a large covariance matrix of dimension in the order of m² × m² for a sample correlation matrix in the order of m × m, demanding too much computer memory and computing time even if m is only in hundreds or thousands; 4) including more SNPs will increase the likelihood of violating IV assumptions; 5) it might be harder to accurately impute with high LD between any two SNPs for GWAS summary data based on a not-too-large reference panel. For the above reasons, in our applications we pruned out those significant SNPs with pairwise correlations larger than 0.8. We also emphasize that we require the selected SNPs/IVs to be statistically significant for both traits. One reason is similar to avoiding biases due to weak IVs in MR. Our methods are based on the ratios of sample correlations, say r_Yg/r_Xg. If the mean of the denominator r_Xg is 0, regardless of whether the normality (approximately) holds for each sample correlation, the ratio cannot be well approximated by a normal approximation; our methods are based on the asymptotic joint normal distribution for the correlation ratios. Second, we have mainly restricted to inferring uni-directional causal relationships. Although we have outlined an extension of our CD-Ratio method (along with bi-directional MR) to the case with bi-directional causal relationships, and shown some promising simulation results under the ideal situation with valid IVs, it remains to see how to extend and apply the methods in practice without knowing which IVs are valid, in which case there may be issues of non-identifiable models [45]. We note by passing that [45] considered a mediating analysis of how to estimate direct/indirect effects given the presence of a bi-directional causal relationship, differing from our task of estimating what causal direction if any. Third, as often done in applications with MR, we have used the same two GWAS samples both for selecting SNPs as IVs and for subsequent estimation and inference. As expected, this double use of the data could lead to selection bias. It was confirmed in our simulations in the presence of directional pleiotropy, though no or only small biases were observed in cases of no or only balanced pleiotropy, possibly due to our recommendation of selecting SNPs/IVs that are significant for both traits, alleviating the problem as compared to selecting SNPs/IVs based on only one trait as commonly adopted in MR. Nevertheless, this issue needs to be further studied; if possible, we may use a third independent GWAS sample for such selection to avoid selection bias as in the three-sample MR design [56]. Fourth, it is noted that our proposed methods are developed in the framework of two-sample MR, which is more common and less restrictive than the one-sample set-up (where both traits and SNPs are obtained from the same sample). Nevertheless, it will be usefull to extend our methods to the one-sample set-up. In particular, such methods may be useful for mediation analysis [57, 58], in which methods to determine causal directions (i.e. to determine between two traits which is the mediator and which is the outcome) lack and are urgently needed while a bi-directional mediation analysis would be problematic as recently reported [59]. In addition, we have not compared with structure equation modeling [60–62], Bayesian network [63] and latent causal variable [64] approaches, mainly because of their lack of statistical significance testing or handling hidden confounding. Finally, we have pointed out a strong connection between our proposed CD-Egger method and MR Egger regression in dealing with horizontal pleiotropy. Taking advantage of this connection, one may be able to develop other extensions as developed for MR [22, 29, 53, 65, 66] in addition to those used here (e.g. weighted median and mode methods in MR) for inference of causal direction. These interesting topics warrant future investigation.

Supporting information

S1 Text [pdf]
Asymptotic covariance matrix of and .

S2 Text [pdf]
Details of CD-Egger and CD-GLS.

S3 Text [pdf]
Degrees of freedom for the goodness-of-fit tests.

S4 Text [pdf]
More results for LDL and CAD.

S5 Text [pdf]
Results for LDL/HDL and CAD with multivariable approaches.

S6 Text [pdf]
More simulation results.

S7 Text [pdf]
Simulations for bi-directional causal effects.

S8 Text [pdf]
Simulations for TWAS with small sample sizes.

S9 Text [pdf]
Simulations to study selection bias.

S10 Text [pdf]
Fisher’s transformation of the absolute value of a correlation.

Zdroje

1. Davey Smith G, Ebrahim S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol. 2003 Feb;32(1):1–22. doi: 10.1093/ije/dyg070 12689998

2. Davey Smith G, Ebrahim S. Mendelian randomization: prospects, potentials, and limitations. Int J Epidemiol. 2004 Feb;33(1):30–42. doi: 10.1093/ije/dyh132 15075143

3. Davey Smith G, Hemani G. Mendelian randomization: genetic anchors for causal inference in epidemiological studies. Hum Mol Genet. 2014 Sep 15;23(R1):R89–98. doi: 10.1093/hmg/ddu328 25064373

4. Hemani G, Zheng J, Elsworth B, Wade KH, Haberland V, Baird D, et al. The MR-Base platform supports systematic causal inference across the human phenome. Elife. 2018 May 30;7:e34408. doi: 10.7554/eLife.34408 29846171

5. Holmes MV, Ala-Korpela M, Davey Smith G. Mendelian randomization in cardiometabolic disease: challenges in evaluating causality. Nat Rev Cardiol. 2017 Oct;14(10):577–590. doi: 10.1038/nrcardio.2017.78 28569269

6. Timpson NJ, Nordestgaard BG, Harbord RM, Zacho J, Frayling TM, Tybjærg-Hansen A, et al. C-reactive protein levels and body mass index: elucidating direction of causation through reciprocal Mendelian randomization. Int J Obes (Lond). 2011 Feb;35(2):300–8. doi: 10.1038/ijo.2010.137 20714329

7. Richmond RC, Davey Smith G, Ness AR, den Hoed M, McMahon G, Timpson NJ. Assessing causality in the association between child adiposity and physical activity levels: a Mendelian randomization analysis. PLoS Med. 2014 Mar 18;11(3):e1001618. doi: 10.1371/journal.pmed.1001618 24642734

8. Pickrell JK, Berisa T, Liu JZ, Ségurel L, Tung JY, Hinds DA. Detection and interpretation of shared genetic influences on 42 human traits. Nat Genet. 2016 Jul;48(7):709–17. doi: 10.1038/ng.3570 27182965

9. Hemani G, Tilling K, Davey Smith G. Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS Genet. 2017 Nov 17;13(11):e1007081. doi: 10.1371/journal.pgen.1007081 29149188

10. Millstein J, Zhang B, Zhu J, Schadt EE. Disentangling molecular relationships with a causal inference test. BMC Genet. 2009 May 27;10 : 23. doi: 10.1186/1471-2156-10-23 19473544

11. Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. 2015 Sep;47(9):1091–8. doi: 10.1038/ng.3367 26258848

12. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BW, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 2016 Mar;48(3):245–52. doi: 10.1038/ng.3506 26854917

13. Zhu Z, Zhang F, Hu H, Bakshi A, Robinson MR, Powell JE, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet. 2016 May;48(5):481–7. doi: 10.1038/ng.3538 27019110

14. Xu Z, Wu C, Wei P, Pan W. A Powerful Framework for Integrating eQTL and GWAS Summary Data. Genetics. 2017 Nov;207(3):893–902. doi: 10.1534/genetics.117.300270 28893853

15. Xu Z, Wu C, Pan W; Alzheimer’s Disease Neuroimaging Initiative. Imaging-wide association study: Integrating imaging endophenotypes in GWAS. Neuroimage. 2017 Oct 1;159 : 159–169. doi: 10.1016/j.neuroimage.2017.07.036 28736311

16. Han S, Lin Y, Wang M, Goes FS, Tan K, Zandi P, et al. Integrating brain methylome with GWAS for psychiatric risk gene discovery. bioRxiv. 2018 Jan 1 : 440206. https://doi.org/10.1101/440206

17. Su YR, Di C, Bien S, Huang L, Dong X, Abecasis G, et al. A Mixed-Effects Model for Powerful Association Tests in Integrative Functional Genomics. Am J Hum Genet. 2018 May 3;102(5):904–919. doi: 10.1016/j.ajhg.2018.03.019 29727690

18. Cai M, Chen L, Liu J, Yang C. Quantifying the impact of genetically regulated expression on complex traits and diseases. bioRxiv. 2019 Jan 1 : 546580. https://doi.org/10.1101/546580

19. Hu Y, Li M, Lu Q, Weng H, Wang J, Zekavat SM, et al. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat Genet. 2019 Mar;51(3):568–576. doi: 10.1038/s41588-019-0345-7 30804563

20. Yang T, Wu C, Wei P, Pan W. Integrating DNA sequencing and transcriptomic data for association analyses of low-frequency variants and lipid traits. Hum Mol Genet. 2020 Feb 1;29(3):515–526. doi: 10.1093/hmg/ddz314 31919517

21. Zhu Z, Zheng Z, Zhang F, Wu Y, Trzaskowski M, Maier R, et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nat Commun. 2018 Jan 15;9(1):224. doi: 10.1038/s41467-017-02317-2 29335400

22. Verbanck M, Chen CY, Neale B, Do R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat Genet. 2018 May;50(5):693–698. doi: 10.1038/s41588-018-0099-7 29686387

23. Watanabe K, Stringer S, Frei O, Umićević Mirkov M, de Leeuw C, Polderman TJC, et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat Genet. 2019 Sep;51(9):1339–1348. doi: 10.1038/s41588-019-0481-0 31427789

24. Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, et al. Opportunities and challenges for transcriptome-wide association studies. Nat Genet. 2019 Apr;51(4):592–599. doi: 10.1038/s41588-019-0385-z 30926968

25. Mancuso N, Freund MK, Johnson R, Shi H, Kichaev G, Gusev A, et al. Probabilistic fine-mapping of transcriptome-wide association studies. Nat Genet. 2019 Apr;51(4):675–682. doi: 10.1038/s41588-019-0367-1 30926970

26. Wu C, Pan W. A powerful fine-mapping method for transcriptome-wide association studies. Hum Genet. 2020 Feb;139(2):199–213. doi: 10.1007/s00439-019-02098-2 31844974

27. Bowden J, Davey Smith G, Burgess S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int J Epidemiol. 2015 Apr;44(2):512–25. doi: 10.1093/ije/dyv080 26050253

28. Burgess S, Thompson SG. Multivariable Mendelian randomization: the use of pleiotropic genetic variants to estimate causal effects. Am J Epidemiol. 2015 Feb 15;181(4):251–60. doi: 10.1093/aje/kwu283 25632051

29. Burgess S, Dudbridge F, Thompson SG. Re: “Multivariable Mendelian randomization: the use of pleiotropic genetic variants to estimate causal effects”. Am J Epidemiol. 2015 Feb 15;181(4):290–1. doi: 10.1093/aje/kwv017 25660081

30. Ference BA, Yoo W, Alesh I, Mahajan N, Mirowska KK, Mewada A, et al. Effect of long-term exposure to lower low-density lipoprotein cholesterol beginning early in life on the risk of coronary heart disease: a Mendelian randomization analysis. J Am Coll Cardiol. 2012 Dec 25;60(25):2631–9. doi: 10.1016/j.jacc.2012.09.017 23083789

31. Holmes MV, Asselbergs FW, Palmer TM, Drenos F, Lanktree MB, Nelson CP, et al. Mendelian randomization of blood lipids for coronary heart disease. Eur Heart J. 2015 Mar 1;36(9):539–50. doi: 10.1093/eurheartj/eht571 24474739

32. Voight BF, Peloso GM, Orho-Melander M, Frikke-Schmidt R, Barbalic M, Jensen MK, et al. Plasma HDL cholesterol and risk of myocardial infarction: a mendelian randomisation study. Lancet. 2012 Aug 11;380(9841):572–80. doi: 10.1016/S0140-6736(12)60312-2 22607825

33. White J, Swerdlow DI, Preiss D, Fairhurst-Hunter Z, Keating BJ, Asselbergs FW, et al. Association of Lipid Fractions With Risks for Coronary Artery Disease and Diabetes. JAMA Cardiol. 2016 Sep 1;1(6):692–9. doi: 10.1001/jamacardio.2016.1884 27487401

34. Cholesterol Treatment Trialists’ (CTT) Collaboration, Baigent C, Blackwell L, Emberson J, Holland LE, Reith C, et al. Efficacy and safety of more intensive lowering of LDL cholesterol: a meta-analysis of data from 170,000 participants in 26 randomised trials. Lancet. 2010 Nov 13;376(9753):1670–81. doi: 10.1016/S0140-6736(10)61350-5 21067804

35. Collins R, Reith C, Emberson J, Armitage J, Baigent C, Blackwell L, et al. Lancet. 2016 Nov 19;388(10059):2532–2561. doi: 10.1016/S0140-6736(16)31357-5 27616593

36. Silverman MG, Ference BA, Im K, Wiviott SD, Giugliano RP, Grundy SM, et al. Association Between Lowering LDL-C and Cardiovascular Risk Reduction Among Different Therapeutic Interventions: A Systematic Review and Meta-analysis. JAMA. 2016 Sep 27;316(12):1289–97. doi: 10.1001/jama.2016.13985 27673306

37. Emerging Risk Factors Collaboration, Di Angelantonio E, Sarwar N, Perry P, Kaptoge S, Ray KK, et al. Major lipids, apolipoproteins, and risk of vascular disease. JAMA. 2009 Nov 11;302(18):1993–2000. doi: 10.1001/jama.2009.1619 19903920

38. Barter PJ, Caulfield M, Eriksson M, Grundy SM, Kastelein JJ, Komajda M, et al. Effects of torcetrapib in patients at high risk for coronary events. N Engl J Med. 2007 Nov 22;357(21):2109–22. doi: 10.1056/NEJMoa0706628 17984165

39. Schwartz GG, Olsson AG, Abt M, Ballantyne CM, Barter PJ, Brumm J, et al. Effects of dalcetrapib in patients with a recent acute coronary syndrome. N Engl J Med. 2012 Nov 29;367(22):2089–99. doi: 10.1056/NEJMoa1206797 23126252

40. Neudecker H, Wesselman AM. The asymptotic variance matrix of the sample correlation matrix. Linear Algebra and its Applications. 1990 Jan 1;127 : 589–99. doi: 10.1016/0024-3795(90)90363-H

41. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015 Oct 1;526(7571):68–74. doi: 10.1038/nature15393 26432245

42. Zhao Q, Wang J, Hemani G, Bowden J, Small DS. Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. Annals of Statistics. 2020;48(3):1742–69. doi: 10.1214/19-AOS1866

43. Dai JY, Peters U, Wang X, Kocarnik J, Chang-Claude J, Slattery ML, et al. Diagnostics for Pleiotropy in Mendelian Randomization Studies: Global and Individual Tests for Direct Effects. Am J Epidemiol. 2018 Dec 1;187(12):2672–2680. doi: 10.1093/aje/kwy177 30188971

44. Bowden J, Hemani G, Davey Smith G. Invited Commentary: Detecting Individual and Global Horizontal Pleiotropy in Mendelian Randomization-A Job for the Humble Heterogeneity Statistic? Am J Epidemiol. 2018 Dec 1;187(12):2681–2685. 30188969

45. Talluri R, Shete S. An approach to estimate bidirectional mediation effects with application to body mass index and fasting glucose. Ann Hum Genet. 2018 Nov;82(6):396–406. doi: 10.1111/ahg.12261 29993118

46. Burgess S, Butterworth A, Thompson SG. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet Epidemiol. 2013 Nov;37(7):658–65. doi: 10.1002/gepi.21758 24114802

47. Bowden J, Davey Smith G, Haycock PC, Burgess S. Consistent Estimation in Mendelian Randomization with Some Invalid Instruments Using a Weighted Median Estimator. Genet Epidemiol. 2016 May;40(4):304–14. doi: 10.1002/gepi.21965 27061298

48. Hartwig FP, Davey Smith G, Bowden J. Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. Int J Epidemiol. 2017 Dec 1;46(6):1985–1998. doi: 10.1093/ije/dyx102 29040600

49. Burgess S, Zuber V, Gkatzionis A, Foley CN. Modal-based estimation via heterogeneity-penalized weighting: model averaging for consistent and efficient estimation in Mendelian randomization when a plurality of candidate instruments are valid. Int J Epidemiol. 2018 Aug 1;47(4):1242–1254. doi: 10.1093/ije/dyy080

50. Liu DJ, Peloso GM, Yu H, Butterworth AS, Wang X, Mahajan A, et al. Exome-wide association study of plasma lipids in >300,000 individuals. Nat Genet. 2017 Dec;49(12):1758–1766. doi: 10.1038/ng.3977 29083408

51. Nelson CP, Goel A, Butterworth AS, Kanoni S, Webb TR, Marouli E, et al. Association analyses based on false discovery rate implicate new loci for coronary artery disease. Nat Genet. 2017 Sep;49(9):1385–1391. doi: 10.1038/ng.3913 28714975

52. Berisa T, Pickrell JK. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics. 2016 Jan 15;32(2):283–5.

53. Slob EAW, Burgess S. A comparison of robust Mendelian randomization methods using summary data. Genet Epidemiol. 2020 Jun;44(4):313–329. doi: 10.1002/gepi.22295

54. Zheng J, Baird D, Borges MC, Bowden J, Hemani G, Haycock P, et al. Recent Developments in Mendelian Randomization Studies. Curr Epidemiol Rep. 2017;4(4):330–345. doi: 10.1007/s40471-017-0128-6 29226067

55. Lawlor DA, Tilling K, Davey Smith G. Triangulation in aetiological epidemiology. Int J Epidemiol. 2016 Dec 1;45(6):1866–1886.

56. Zhao Q, Chen Y, Wang J, Small DS. Powerful three-sample genome-wide design and robust statistical inference in summary-data Mendelian randomization. Int J Epidemiol. 2019 Oct 1;48(5):1478–1492. doi: 10.1093/ije/dyz142

57. Wang K. Understanding Power Anomalies in Mediation Analysis. Psychometrika. 2018 Jun;83(2):387–406. doi: 10.1007/s11336-017-9598-1

58. Schaid DJ, Sinnwell JP. Penalized models for analysis of multiple mediators. Genet Epidemiol. 2020 Jul;44(5):408–424. doi: 10.1002/gepi.22296

59. Lutz SM, Sordillo JE, Hokanson JE, Chen Wu A, Lange C. The effects of misspecification of the mediator and outcome in mediation analysis. Genet Epidemiol. 2020 Jun;44(4):400–403. doi: 10.1002/gepi.22289

60. Ainsworth HF, Shin SY, Cordell HJ. A comparison of methods for inferring causal relationships between genotype and phenotype using additional biological measurements. Genet Epidemiol. 2017 Nov;41(7):577–586. doi: 10.1002/gepi.22061

61. Yuan Y, Shen X, Pan W, Wang Z. Constrained likelihood for reconstructing a directed acyclic Gaussian graph. Biometrika 2019 March;106(1):109–125. doi: 10.1093/biomet/asy057

62. Li C, Shen X, Pan W. Likelihood ratio tests for a large directed acyclic graph. Journal of the American Statistical Association 2020;115(531):1304–1319. doi: 10.1080/01621459.2019.1623042

63. Howey R, Shin SY, Relton C, Davey Smith G, Cordell HJ. Bayesian network analysis incorporating genetic anchors complements conventional Mendelian randomization approaches for exploratory analysis of causal relationships in complex data. PLoS Genet. 2020 Mar 2;16(3):e1008198. doi: 10.1371/journal.pgen.1008198 32119656

64. O’Connor LJ, Price AL. Distinguishing genetic correlation from causation across 52 diseases and complex traits. Nat Genet. 2018 Dec;50(12):1728–1734. doi: 10.1038/s41588-018-0255-0 30374074

65. Rees JMB, Wood AM, Burgess S. Extending the MR-Egger method for multivariable Mendelian randomization to correct for both measured and unmeasured pleiotropy. Stat Med. 2017 Dec 20;36(29):4705–4718. doi: 10.1002/sim.7492 28960498

66. Qi G, Chatterjee N. Mendelian randomization analysis using mixture models for robust and efficient estimation of causal effects. Nat Commun. 2019 Apr 26;10(1):1941. doi: 10.1038/s41467-019-09432-2 31028273

Článek Genetic engineering of sex chromosomes for batch cultivation of non-transgenic, sex-sorted males

Článek Opposing functions of Fng1 and the Rpd3 HDAC complex in H4 acetylation in Fusarium graminearum

Článek Folliculin variants linked to Birt-Hogg-Dubé syndrome are targeted for proteasomal degradation

Článek NIGT1 family proteins exhibit dual mode DNA recognition to regulate nutrient response-associated genes in Arabidopsis

Článek Oxidative stress antagonizes fluoroquinolone drug sensitivity via the SoxR-SUF Fe-S cluster homeostatic axis

Článek A phenome-wide association study of 26 mendelian genes reveals phenotypic expressivity of common and rare variants within the general population

Článek Mms19 promotes spindle microtubule assembly in Drosophila neural stem cells

Článek Mosquito genomes are frequently invaded by transposable elements through horizontal transfer

Článek Genotype imputation using the Positional Burrows Wheeler Transform

Článek Formal commentary

Článek The DNA damage response is required for oocyte cyst breakdown and follicle formation in mice

Článek Loss of hepatocyte cell division leads to liver inflammation and fibrosis