Exponential random graph model parameter estimation for very large directed networks
Authors:
Alex Stivala aff001; Garry Robins aff003; Alessandro Lomi aff001
Authors place of work:
Institute of Computational Science, Università della Svizzera italiana, Lugano, Ticino, Switzerland
aff001; Centre for Transformative Innovation, Swinburne University of Technology, Melbourne, Victoria, Australia
aff002; Melbourne School of Psychological Sciences, The University of Melbourne, Melbourne, Victoria, Australia
aff003; The University of Exeter Business School, The University of Exeter, Exeter, Devon, United Kingdom
aff004
Published in the journal:
PLoS ONE 15(1)
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pone.0227804
Summary
Exponential random graph models (ERGMs) are widely used for modeling social networks observed at one point in time. However the computational difficulty of ERGM parameter estimation has limited the practical application of this class of models to relatively small networks, up to a few thousand nodes at most, with usually only a few hundred nodes or fewer. In the case of undirected networks, snowball sampling can be used to find ERGM parameter estimates of larger networks via network samples, and recently published improvements in ERGM network distribution sampling and ERGM estimation algorithms have allowed ERGM parameter estimates of undirected networks with over one hundred thousand nodes to be made. However the implementations of these algorithms to date have been limited in their scalability, and also restricted to undirected networks. Here we describe an implementation of the recently published Equilibrium Expectation (EE) algorithm for ERGM parameter estimation of large directed networks. We test it on some simulated networks, and demonstrate its application to an online social network with over 1.6 million nodes.
Keywords:
Network analysis – Algorithms – Social networks – Statistical distributions – Directed graphs – Research errors – Random graphs – Network reciprocity
Introduction
Exponential random graph models (ERGMs) are a class of statistical model often used for modeling social networks [1, 2]. Parameter estimation in these models is a computationally difficult problem, and algorithms based on Markov chain Monte Carlo (MCMC) are generally used [2–10]. The computational time required by these methods places a limit on the size of networks for which models can be estimated in practice. A recently published new algorithm for sampling from the ERGM distribution can reduce this time by an order of magnitude [11], and a new estimation algorithm even more [12], however scalability is still a problem when extremely large networks are considered. It is also worth noting that the state space for a directed network is far larger than for an undirected network with the same number of nodes [13], and so this problem is even more difficult in the case of directed networks.
One solution to this problem is to take snowball samples [14–18] from the original network, and estimate ERGM parameters from these [19, 20]. The first description of such a method was [19]. However this method requires that estimation over the entire set of random tie variables is feasible, limiting the size of networks to which the method can be applied in practice. A more recently proposed method [20] is to estimate parameters for each sample in parallel with conditional estimation [21], combining the estimates with a meta-analysis [22] or using bootstrap methods [23] to estimate the standard errors. This work, however, was only applied to undirected networks.
However the problem of directly estimating ERGM parameters for a very large network (rather than from snowball samples) remains, particularly for directed networks where snowball sampling is not straightforward. Here we describe an implementation of the Equilibrium Expectation (EE) method [12] extended to directed networks, which is scalable and efficient enough to be used to estimate ERGM parameters for networks with over one million nodes.
Hunter & Handcock (p. 581 of [6]) note that the largest network estimated to date (in 2006) was the N = 2209 nodes adolescent friendship network estimated by Hunter, Goodreau, & Handcock [24]. However this network was treated as undirected. Larger undirected networks subsequently had ERGM models estimated indirectly by snowball sampling, with the largest having 40 421 nodes [20]. By using an improved ERGM distribution sampler, Byshkin et al. [11] could directly estimate ERGM parameters for a 3061 node patient transfer network (treated as undirected), and the Equilibrium Expectation algorithm was demonstrated on an undirected online social network with 104 103 nodes [12]. Using the implementation described in this paper, modified to use a simplified EE algorithm, Borisenko, Byshkin, & Lomi [25] are able to estimate a simple ERGM model of a 75 879 node directed network.
We note that social networks are typically sparse, and we assume sparsity for efficient data structures. There is some specific work on sampling methods for ERGMs for large sufficiently dense networks with additional assumptions [26] such as the presence of block structure [27, 28] but here we assume only sparsity, and that the network can plausibly be described by an exponential random graph model.
In this paper, we describe an implementation of the EE algorithm, including the improved fixed density ERGM sampler [11] for application to directed networks. By implementing these algorithms and the associated computations of change statistics in a more efficient and scalable manner, we are able to estimate ERGM parameters for networks far larger than previously possible, even using existing implementations of the algorithms used for the computational experiments in the papers originally describing them [11, 12]. The implementation we describe allows ERGM parameter estimation for a model of a directed network with over one million nodes, while existing methods are only practical on networks of a few thousand nodes at most. We test the implementation first on simulated networks with known model parameters, in order to validate that it works correctly, and then demonstrate its application to an online social network with over 1.6 million nodes.
Exponential random graph models
An ERGM is a probability distribution with the form where
-
X = [Xij] is a 0-1 matrix of random tie variables,
-
x is a realization of X,
-
A is a configuration, a (small) set of nodes and a subset of ties between them,
-
zA(x) is the network statistic for configuration A,
-
θA is a model parameter corresponding to configuration A,
-
κ is a normalizing constant to ensure a proper distribution.
Which configurations A are allowed depends on the assumptions as to which ties are independent. Here we will use the social circuit dependence assumption [7, 29], under which two potential ties are conditionally dependent exactly when, if they were observed, they would form a 4-cycle in the network [13]. Configurations allowed by other, simpler, dependence assumptions (Bernoulli, dyad-independent, Markov (pp. 56–57 of [1]) are also allowed in these models.
Under this assumption, we will now describe the structural configurations used in this work. In the following, N is the number of nodes in the network.
The simplest configuration, included in every model, is Arc, analogous to the intercept in a regression. Arc is included to account for the overall density of the network observed. Its corresponding statistic is z L = ∑ i = 1 N ∑ j = 1 N x i j, the number of arcs in the graph. The Reciprocity parameter is used to test for propensity of arcs to be reciprocated, and its statistic is z Reciprocity = ∑ i = 1 N ∑ j = 1 N x i j x j i.
The degree distribution in a directed network is modeled with the alternating k-out-star and alternating k-in-star configurations defined by [29] and illustrated in Fig 1. The statistic for k-out-star is defined as: where S k O u t is the number of k-out-stars and λ ≥ 1 is a damping parameter. We use λ = 2 in this work, as used previously in, for example, [20, 29]. We note that in a more general form of ERGM, the curved exponential family random graph model [6], it is also possible to estimate (a parameter equivalent to) the parameter λ, and this is routinely done using the statnet software [30–33]. However the EE algorithm requires that every model parameter has a corresponding change statistic, and so cannot estimate curved ERGMs [12]. For this reason assume a fixed value for λ.
The zAinS statistic for k-in-stars is defined similarly.
Path closure and multiple connectivity are modeled with the alternating transitive k-triangles and alternating two-paths effects defined by [29] and illustrated in Fig 2. These statistics are defined as [34]: where L 2 ( i , j ) = ∑ h = 1 N x i h x h j is the number of directed two-paths from i to j, and
As well as path closure (AT-T) we can also define cyclic closure (AT-C), in which arcs constituting the triangles form a cycle. Its statistic zAT−C is defined analogously to zAT−T but counting cyclic k-triangles rather than transitive k-triangles:
We also include the shared popularity configuration A2P-D [13], the statistic for which zA2P−D is defined analogously to zA2P−T, but rather than counting directed paths between two nodes via k intermediate nodes, it counts “paths” where the each of the k intermediate nodes have arcs directed towards each of the two nodes (see Fig 2): where L 2 D ( i , j ) = ∑ h = 1 N x h i x h j. We then define the configuration A2P-TD which is the sum of the A2P-D and A2P-T statistics, adjusting for double-counting:
The shared activity configuration A2P-U [13], the statistic for which is zA2P−U is similar to A2P-D, but counts “paths” where each of the k intermediate nodes have arcs directed from the pairs of nodes to the intermediate nodes (see Fig 2): where L 2 U ( i , j ) = ∑ h = 1 N x i h x j h.
The statistics for the closures corresponding to the open path types A2P-D and A2P-U, popularity closure (AKT-D) and activity closure (AKT-U), respectively, are defined similarly to the way path closure (AKT-T) is defined for the corresponding multiple two-paths A2P-T.
These configurations are illustrated in Fig 2.
In addition, we will allow nodes to have binary, categorical, or continuous attributes, and use the following additional configurations using these nodal attributes. For the binary attribute, we use the four configurations Sender, Receiver, and Interaction, illustrated in Fig 3. The Sender parameter indicates increased propensity of a node with the True value of the binary attribute to “send” a tie to another node, and the Receiver the increased propensity of a node with the True value of the attribute to “receive” a tie from another node (both irrespective of the attribute value of the other node). The Interaction parameter indicates increased propensity for two nodes both with the True value of the attribute to have an arc connecting them. The corresponding statistics are defined as follows (where we now use the notation ∑i,j for summation over all pairs of nodes i ∈ {1…N}, j ∈ {1…N}, i ≠ j): where ai ∈ {0, 1} is the value of the binary attribute on node i.
For the categorical attribute, the Matching and Mismatching parameters indicate the increased propensity for a node to send a tie to another node with, respectively, the same, or different, value of the categorical attribute. The Matching reciprocity and Mismatching reciprocity parameters indicate the increased propensity for such ties to be reciprocated. These configurations are illustrated in Fig 4 and the corresponding statistics are defined by: where ci is the value of the categorical attribute at node i and δx,y is the Kronecker delta function
For a continuous attribute ui on a node i, we also define the (continuous) Sender, Receiver and Difference statistics as follows: These indicate, respectively, the increased propensity of a node to send ties for higher values of its continuous attribute, the increased propensity of a node to receive ties for higher values of its continuous attribute, and the increased propensity of nodes to have a tie between them for smaller absolute differences in their continuous attributes. The latter is a simple measure of homophily, the tendency for nodes with similar values of the attribute to have a tie between them.
Note that in ERGM estimation algorithms these statistics as defined above never actually need to be computed directly. Instead only the corresponding change statistics are computed [6, 10, 24, 29, 31]. The change statistic is the change in the statistic due to the addition or deletion of an arc, which is much faster to compute. For example the most basic statistic is zL, the count of the number of arcs in the graph. Computing this statistic therefore requires counting the number of arcs in the graph, however the corresponding change statistic is simply the constant 1 (or -1 for deleting an arc): adding an arc increases the statistic by 1, and deleting an arc decreases it by 1.
Equilibrium expectation algorithm
Monte Carlo based methods, such as Markov chain Monte Carlo maximum likelihood estimation (MCMCMLE) [6] and stochastic approximation [5] as well as Bayesian methods [8], as reviewed for example by Hunter et al. [10], all require drawing simulated networks from the ERGM distribution. This can be achieved using a Metropolis–Hastings algorithm, and a number of samplers are available [5, 10, 11, 35]. However all these methods require that a number of network samples are drawn from the stationary ERGM distribution, for each updated value of the parameter vector being estimated, which may require a very large number of iterations, and limits the size of networks to which these methods can be applied in practice.
In contrast, the EE algorithm [12] does not require these potentially very long MCMC simulations between parameter updates. The EE algorithm is related to persistent contrastive divergence (PCD) [25, 36, 37] and is fast because it adjusts its parameters according the difference between the observed network statistics and the statistics of a current non-equilibrium state of the Markov chain of simulated networks. It may be thought of as a kind of gradient ascent method, and depends on the property of the exponential family (to which the ERGM distribution belongs) that the expected value of a statistic is a monotonically increasing function of the corresponding parameter (see Ch. 8 of [38]). It works by starting the chain of simulated networks at the observed network (not the empty network for example), and taking only a small number of Metropolis–Hastings steps, before adjusting the estimated parameter values according to the divergence of the simulated network statistics from the observed network statistics. After sufficiently many iterations of this process (which in practice is many orders of magnitude smaller than the number of Metropolis–Hastings steps required to find the stationary ERGM distribution), the divergence of each of the statistics from the observed statistics oscillates around zero, and the corresponding parameters oscillate around a value which is taken to be an estimate of the MLE.
A version of contrastive divergence (CD) [39] is used to compute initial values of the ERGM parameter estimates [40, 41] for the EE algorithm [12]. Details of the EE algorithm, as first described in the Supplementary Information of [12], and of the IFD sampler, [11] are provided in S1 Appendix.
Materials and methods
Parameter estimation
Parameters are estimated using a new implementation of the EE algorithm, which we call EstimNetDirected. This implementation has change statistics for directed (rather than undirected as in the original description [12]) networks, and uses efficient data structures in order to scale to very large (over one million node) networks. Both the “basic” ERGM sampler (as used in the PNet [42] software) and the improved fixed density (IFD) ERGM sampler [11] are implemented.
The network is stored as an adjacency list data structure for space efficiency and fast computation of change statistics. A “reversed” adjacency list is also maintained. This stores, for each node j, a list of nodes i for which the arc i → j exists. This allows efficient computation of change statistics that require the in-neighbours of a node. In addition, a flat list of all arcs is maintained, for efficient implementation of the IFD sampler, which requires finding an arc uniformly at random for arc deletion moves [11].
For efficient computation of the “alternating” two-path and triangle change statistics, it is necessary to keep track of the counts of two-paths between each pair of nodes. It is not scalable to store these two-paths matrices as arrays as in earlier implementations [12, 42], so instead hash tables can be used, where the key is a node pair (i, j) and the value is the relevant two-path count (which is zero if the key is not present). This takes advantage of the sparsity (approximately 0.06% nonzero in the empirical example described here) of these matrices and still allows fast (asymptotically constant time) lookup. In addition, a Bloom filter [43] is used so that the overwhelmingly more frequent case of looking up an entry that is not present is faster. During the MCMC ERGM sampling process, in which arcs are added and deleted, entries in the two-path tables that fall to zero are deleted, in order to stop the tables from growing in size indefinitely, however this diminishes the effectiveness of the Bloom filter.
We run a number of estimations independently (and in parallel to minimize elapsed time).
Standard error estimation
For each parallel estimation run, the point estimates and their standard errors are estimated. The point estimate (mean) and asymptotic covariance matrix for MCMC standard error are estimated using the multivariate batch means method [44, 45] using the mcmcse R package [46]. The covariance matrix for the error in inherent in using the MLE is estimated as the inverse of the covariance matrix of the simulated statistics (Fisher information matrix) [5, 6] also using mcmcse. The total estimated covariance matrix is then estimated as the sum of these two covariance matrices, and from this we compute the standard error as the square root of the diagonal.
The overall estimate and its standard error are then computed as the inverse variance weighted average (see Ch. 4 of [47]) of the results calculated from each independent (parallel) estimation.
Implementation and availability
EstimNetDirected is implemented in the C programming language using the message passing interface (MPI) for parallelization on computing clusters. The uthash [48] macro collection is used to implement hash tables (including Bloom filter) and the Random123 counter-based pseudo-random number generator [49] is used to generate pseudo-random numbers for the MCMC process. Scripts for processing network data formats, estimating standard errors, and generating plots and fitting heavy-tailed distributions are written in R and Python and use the igraph [50], SNAP [51], ggplot2 [52], PropCIs [53], mcmcse [46], and poweRlaw [54] packages.
All source code and scripts are publicly available on GitHub at https://github.com/stivalaa/EstimNetDirected.
Simulated networks
To ensure that the parameter estimation algorithm works correctly, we first apply it to estimating ERGM parameters of networks with known true values, and measure the bias, root mean square error (RMSE), coverage, and Type I and Type II error rates in statistical inference. To do so we simulate sets of 100 networks sampled from an ERGM network distribution with known parameters, and then estimate the parameters of each of the 100 networks with EstimNetDirected. This allows the mean bias and RMSE to be estimated. The coverage is then the percentage of the 95% confidence intervals which contain the true value of the parameter. Coverage higher than the nominal 95% indicates overly conservative (high) estimates of the standard error, and coverage lower than the nominal value indicates overly optimistic (low) values of the standard error (uncertainty). In addition, we estimate the Type II error rate in inference (the false negative rate), as the percentage of estimations in which the estimated 95% confidence interval includes zero.
To estimate the Type I error rate (false positive rate) for inference of an ERGM parameter significance, we generate simulated networks in which the parameter in question is zero, and proceed as just described. Then the Type I error rate is estimated as the percentage of estimations in which the 95% confidence interval does not include zero.
We generate two sets of graphs from ERGM distributions, both with N = 2000 nodes. First, a network with binary node attributes and parameters (Arc, Reciprocity, AinS, AoutS, AT-T, A2P-TD, Interaction, Sender, Receiver) = (-1.00, 4.25, -2.00, -1.50, 0.60, -0.15, 2.00, 1.50, 1.00), and, second, a network with categorical node attributes and parameters (Arc, Reciprocity, AinS, AoutS, AT-T, A2P-TD, Matching, Matching reciprocity) = (-1.00, 4.25, -2.00, -1.50, 1.00, -0.15, 1.50, 2.00). For each of the two sets of parameters, we generated 100 samples from a network distribution with those parameters using PNet [42]. For networks with a binary attribute, 50 of the nodes (2.5%), selected at random, have the True value, and the rest False. For networks with a categorical attribute, the attribute at each node is assigned one of three possible values uniformly at random. The networks are sampled with sufficient burn-in (of the order of 109 iterations) to ensure initialization effects are minimized, and samples are taken sufficiently far apart (separation of the order of 108 iterations) to ensure that they are essentially independent. Table 1 shows summary statistics of the simulated networks.
Empirical network
As an example application we use EstimNetDirected to estimate an ERGM model of the Pokec online social network [55] which is publicly available from the Stanford large network dataset collection [56]. Pokec is the most popular online social network in Slovakia [55] and represents a sizable percentage of Slovakia’s population [57]. This publicly available data is also unusual and particularly useful as a test case for social network algorithms as it is the entire online social network at a point in time, rather than a sample as is often the case, and the nodes are annotated with attributes, specifically including gender, age, and region (187 of them, Slovakia or elsewhere) which we use as nodal covariates in the model.
This network has 1 632 803 nodes and 30 622 564 arcs (directed graph density approximately 10−5 and mean degree 37.5). The nodes are users of the Pokec online social network, and arcs represent directed “friendship” relations, i.e. unlike many online social networks, “friendships” are not assumed to be automatically reciprocated (undirected). In fact only approximately 54% of the “friendship” relations are reciprocated. More details and descriptive statistics of this network are in [55].
The Pokec online social network has been previously used as a test bed for social network analysis algorithms, and in particular by Kleineberg & Boguñá [57] who use it to test a model of the evolution of an online social network. They treat the final state of the network as a representation of a true social network, as do we, in order to demonstrate EstimNetDirected estimation of an ERGM, which is a cross-sectional network model. However they treat the network as undirected, including only reciprocated ties, while we maintain the directed nature of the network.
The Pokec degree distribution was described as “scale-free” in [55], but based only on visual examination of the degree-frequency plot. This technique, or similarly, fitting a straight line to a degree-frequency log-log plot, is now well known to be not a sound technique for assessing whether a distribution is scale-free or follows a power law [58–60]. Using the statistical method described by Clauset et al. [58] as implemented in the powerRlaw R package [54], we find that the neither the in- nor out-degree distribution of the Pokec online social network is consistent with a power law distribution (Fig 5).
Nevertheless, it is clear from Fig 5 that there are hubs in the network, that is, nodes with an order of magnitude higher degree than most other nodes. In particular, there is a noticeable “break” in the empirical cumulative distribution function (CDF) plot at degree 1000, most noticeable for the out-degree distribution. According to Takac & Zabovsky, “hubs in Pokec are not people but commercial companies…” (p. 5 of [55]).
Based on these observations, and on the fact that an initial attempts to estimate an ERGM for the entire network did not converge, we remove all nodes with in- or out-degree greater than 1000 from the network. There are only 20 such nodes (0.001% of the nodes), and removing them does not significantly change the density or mean degree. However it breaks the network, which was initially a single connected component, into 577 components, although the giant component has 1 632 199 nodes (99.96% of the total). This indicates that these hubs are not performing the function of holding different components of the network together into a connected whole, as their removal only results in the creation of a relatively small number of isolated nodes, rather than splitting the network into multiple large components, and that therefore their removal is not substantively changing the nature of the network structure. The in-degree distribution of network with hubs removed is consistent with a power law, although the out-degree distribution is not (Fig 6).
The EstimNetDirected parameter settings used in the estimations are detailed in Table A in S1 Appendix.
Convergence tests
As in [12] we use a t-ratio check for convergence, but for larger directed networks we weaken the criterion to conclude non-convergence if the absolute value of any parameter’s t-ratio is greater than 0.3. If the covariance matrix computed in the standard error estimation step is (nearly) computationally singular then that estimate is considered non-converged, possibly due to model degeneracy.
For empirical network estimations where there are not a large number of automated estimations to process, visual inspection of the parameter and statistic trace plots is used as a heuristic to confirm convergence. In the case of the simulated networks where large numbers of estimations are run, we automate the additional heuristic that estimations with numeric overflow (“NaN” values) or “huge” (greater in magnitude than 1010) parameter values are non-converged.
An additional heuristic (visual) convergence test for modeling empirical networks is to plot various network statistics of the observed network on the same plot as the distribution of these statistics in the EE algorithm simulated networks. This is the same principle as the t-ratio test, but it includes statistics other than those explicitly included in the model in order to give some indication of how well the model fits. The statistics included are the degree distribution, reciprocity, giant component size, average local and global clustering coefficients, triad census, geodesic distribution, and edgewise and dyadwise shared partners (similar to goodness-of-fit plots in statnet [30–33]).
Note, however, that these plots are not goodness-of-fit plots, as the simulated networks are not generated ab initio (e.g. from the empty network) from the estimated parameters, but rather are the networks that have been simulated in the EE algorithm where the starting point is the observed network. Hence the plots may be “over-optimistic” in indicating a good model fit: however a plot clearly showing a poor model fit definitely indicates lack of convergence or a poor model.
In addition, for every large networks it is impractical to compute some distributions in reasonable time (such as the edgewise and dyadwise shared partners, and the geodesic distribution) and so these plots are excluded for very large networks. An example of a convergence plot is shown in S1 Fig.
Results and discussion
Simulated networks
Table 2 shows the bias, RMSE, Type II error rate, and coverage in estimating the simulated networks. It can be seen that the Type II error rate in inference is very low on all parameters except for Arc and and A2P-TD. In the case of the Arc parameter, this is of little concern as we do not in practice need to make a statistical inference on this parameter, as previously noted it is analogous to an intercept and simply used to control for network density. Figs 7 and 8, plotting the point estimates and their error bars, shed some more light on this problem. It seems the high Type II error rate for Arc and A2P-TD is at least partly explained by the small magnitude of the true values of these parameters. The coverage in most of these cases indicates overly conservative error estimates (higher than the nominal 95%), and the Type II error rate is high.
Additionally, in the case of the Arc parameter especially, the parameter estimates appear to be biased. Although the Type II error rate is very low, in the case of the networks with binary attribute, the coverage rate is low for the Sender, Interaction, and (to a lesser degree) Reciprocity parameters. Fig 8 shows that this appears to be because of bias in these parameter estimates. In the case of the networks with a categorical attribute, Fig 7 shows positive bias in the Arc parameter resulting in a high Type II error rate despite lower than nominal coverage.
We should not be surprised by bias in the parameter estimates, as the MLE for ERGM canonical parameters is biased precisely because it is unbiased by construction in the mean value parameter space [61] (that is, the mean values of the statistics, not the corresponding ERGM parameters). For this reason, van Duijn, Gile, & Handcock [61] propose a framework for assessing estimators in which bias is compared in the mean value parameter space, by generating large numbers of simulated networks from the estimated parameter values, and comparing them to the statistics of the original simulated networks. However for large directed networks this procedure is impractical due to the time it takes to simulate each set of networks (the very reason the EE algorithm is so fast is that it avoids doing this). In addition, as the purpose of ERGM parameter estimation is usually to make statistical inferences from the estimated parameters, it is useful to measure the inferential error in this procedure.
In Tables 2 and 3, the 95% confidence interval for the Type II and Type I error rate, respectively, is computed as the Wilson score interval [62].
Table 3 shows the coverage and Type I error rates estimated from simulated networks with zero effects. Note that in this case, coverage and Type I error rate are effectively the same thing (as percentages, coverage is 100 − α where α is the Type I error rate) as the known true value of the parameter is zero by design. This table shows that the Type I error rate in all but two cases is within the nominal 5% range. In one case, Matching Reciprocity for the categorical networks, the point estimate of the Type I error rate is 9% but the 95% confidence interval extends down to 5%. However the other case, Matching for categorical networks, the point estimate of the Type I error rate is 11% and the confidence interval extends down to only 6%, so for this particular case the Type I error rate is too high. We note that the coverage for this zero parameter is only 89%, so it would appear that our method is potentially subject to inferential error in such cases where Matching Reciprocity is included in the model but the corresponding baseline Matching parameter is not. We would recommend not using such a model, and always including the corresponding baseline parameters, as is standard practice in ERGM model building, where configurations are nested within one another (Ch. 3 of [1]).
Also note that for two of the sets of simulated networks, the binary node attribute networks with the Interaction or Reciprocity effects set to zero, the number of runs and networks for which estimations converged is very low (less than half). This is not problematic, as these tests, where an effect not present in the network is included in the model, could be considered instances of model mis-specification, so the possibility of estimations not converging is to be expected. Note that for estimations shown in Table 2, where the model is exactly correct (it is the same model that generated the networks), converged estimates are obtained for all the simulated networks.
Empirical network example
Table 4 shows a model estimated for the Pokec online social network with the 20 highest degree hubs removed (N = 1 632 783). This estimation took approximately 22 hours on cluster nodes with Intel Xeon E5-2650 v3 2.30GHz processors using two parallel tasks with 512 GB RAM each.
Regarding structural features, these results confirm centralization on both in- and out-degree in the Pokec online social network. There is a significantly positive reciprocity effect: friendships are more likely to be reciprocated than not, conditional on the other features in the model. There is also positive activity closure: people who send friendship ties to the same people also tend to be friends. There is also positive path (transitive) closure, in combination with negative two-paths: friends of friends tend also to be friends. This can also be interpreted in terms of the “forbidden triad” [63], in which an open two-path (of what we assume to be “strong” ties as they represent friendship) is not closed transitively. Such a triad is indeed less likely than by chance, conditional on the other parameters in the model, in this network.
There is homophily on both region and age: people who live in the same region are more likely to be friends than those who live in different regions, and people of similar ages are more likely to be friends. Interestingly, there is significant heterophily on the gender attribute: people of different genders are more likely to be friends on this online social network.
The convergence test plot for the Pokec online social network estimation is shown in S1 Fig.
The algorithm parameters used for this estimation are shown in Table A in S1 Appendix in the column for the Pokec network. As a general guideline, we recommend using these parameters, with the exception of EEsteps. It may be useful to start with the default value of 1 000 (or even smaller) for this parameter, to relatively quickly obtain initial results and check the trace plots for any obvious failure to converge, such as a parameter value clearly diverging (due to a bad model for instance). If there are no obvious problems then the EEsteps parameter can be increased if the t-ratios or trace plots (as described in the section “Convergence tests”) indicate that the estimation has not yet converged.
Conclusion
We have demonstrated an implementation of the EE algorithm for ERGM parameter estimation capable of estimating models for social networks with over one million nodes, which is far larger than previously possible (without using network sampling). However there are several limitations and scope for future work. The implementation described here requires tuning some algorithm parameters (Table A in S1 Appendix), however a simplified version of the EE algorithm requires fewer parameters [25] and may make it easier to obtain converged models.
Although the use of hash tables to efficiently store the sparse two-path matrices allows scalability to networks of millions of nodes, it depends on sufficient sparsity of these matrices, and not all empirical networks of interest satisfy this requirement. For example the physician referral network described by An et al. [64], although having approximately one million nodes, making it smaller than the Pokec online social network, does not have a sufficiently sparse two-path table for our implementation to work even with the largest memory cluster node available to us (512 GB memory). Further work is required to find a means of alleviating this problem.
Although the convergence heuristic plots we have described resemble a goodness-of-fit test superficially, they are not actually goodness-of-fit tests. The difficulty of generating (simulating) numbers of very large networks from estimated ERGM parameters for conventional goodness-of-fit tests for ERGM makes these methods impractical for such large networks. One possibility to investigate is to simulate snowball samples from the estimated parameters and compare the distribution of network statistics of these simulated samples to the corresponding distributions of network statistics in snowball sample taken from the observed network. More generally, the idea of such goodness-of-fit tests for extremely large networks may have to be fundamentally re-examined, in light of the difficulty of any simple model adequately fitting such a large network—and the homogeneity assumption that the same local processes operate across such a large network may also no longer be realistic.
In addition, the ERGM MLE bias in canonical parameters may need to be addressed by some bias correction technique, as was originally explored for maximum pseudolikelihood estimation [61], but does not appear to have been successfully pursued since for other, now preferred (not least due to the results of [61]), estimation methods.
An important next step is the strengthening of the theoretical basis for the the EE algorithm. Existing commonly used methods for ERGM parameter estimation are based on well-known algorithms with well developed theoretical foundations such as stochastic approximation [5] using the Robbins–Monro algorithm [65] or MCMCMLE [6] based on the Geyer–Thompson method [66] with increasingly sophisticated variants [9]. In contrast, there are no theoretical guarantees behind the EE algorithm [12], and although contrastive divergence [39, 40] and persistent contrastive divergence [36, 37] are becoming widely used, more theoretical work is required to understand their convergence properties in general, and for ERGM parameter estimation in particular [67]. Regarding the EE algorithm, the experiments with simulated networks described in this paper give encouragement for the validity and usefulness of this approach, but further work to understand its convergence properties is a potentially fruitful area of research [25].
Supporting information
S1 Appendix [pdf]
Algorithm descriptions and EstimNetDirected parameter settings.
S1 Fig [pdf]
Convergence test plot for Pokec (hubs removed) estimation.
Zdroje
1. Lusher D, Koskinen J, Robins G, editors. Exponential random graph models for social networks. Structural Analysis in the Social Sciences. New York: Cambridge University Press; 2013.
2. Amati V, Lomi A, Mira A. Social network modeling. Annu Rev Stat Appl. 2018;5:343–369. doi: 10.1146/annurev-statistics-031017-100746
3. Corander J, Dahmström K, Dahmström P. Maximum likelihood estimation for Markov graphs. Stockholm University, Department of Statistics; 1998. 8.
4. Corander J, Dahmström K, Dahmström P. Maximum likelihood estimation for exponential random graph models. In: Hagberg J, editor. Contributions to social network analysis, information theory, and other topics in statistics; a Festschrift in honour of Ove Frank. Department of Statistics, University of Stockholm; 2002. p. 1–17.
5. Snijders TAB. Markov chain Monte Carlo estimation of exponential random graph models. J Soc Struct. 2002;3(2):1–40.
6. Hunter DR, Handcock MS. Inference in curved exponential family models for networks. J Comput Graph Stat. 2006;15(3):565–583. doi: 10.1198/106186006X133069
7. Robins G, Snijders T, Wang P, Handcock M, Pattison P. Recent developments in exponential random graph (p*) models for social networks. Soc Networks. 2007;29(2):192–215. doi: 10.1016/j.socnet.2006.08.003
8. Caimo A, Friel N. Bayesian inference for exponential random graph models. Soc Networks. 2011;33:41–55. doi: 10.1016/j.socnet.2010.09.004
9. Hummel RM, Hunter DR, Handcock MS. Improving simulation-based algorithms for fitting ERGMs. J Comput Graph Stat. 2012;21(4):920–939. doi: 10.1080/10618600.2012.679224 26120266
10. Hunter DR, Krivitsky PN, Schweinberger M. Computational statistical methods for social network models. J Comput Graph Stat. 2012;21(4):856–882. doi: 10.1080/10618600.2012.732921 23828720
11. Byshkin M, Stivala A, Mira A, Krause R, Robins G, Lomi A. Auxiliary parameter MCMC for exponential random graph models. J Stat Phys. 2016;165(4):740–754. doi: 10.1007/s10955-016-1650-5
12. Byshkin M, Stivala A, Mira A, Robins G, Lomi A. Fast maximum likelihood estimation via Equilibrium Expectation for large network data. Sci Rep. 2018;8:11509. doi: 10.1038/s41598-018-29725-8 30065311
13. Robins G, Pattison P, Wang P. Closure, connectivity and degree distributions: Exponential random graph (p*) models for directed social networks. Soc Networks. 2009;31(2):105–117. doi: 10.1016/j.socnet.2008.10.006
14. Coleman JS. Relational analysis: the study of social organizations with survey methods. Hum Organ. 1958;17(4):28–36. doi: 10.17730/humo.17.4.q5604m676260q8n7
15. Goodman LA. Snowball sampling. Ann Math Stat. 1961;32:148–170. doi: 10.1214/aoms/1177705148
16. Goodman LA. Comment: On respondent-driven sampling and snowball sampling in hard-to-reach populations and snowball sampling not in hard-to-reach populations. Sociol Methodol. 2011;41(1):347–353. doi: 10.1111/j.1467-9531.2011.01242.x
17. Heckathorn DD. Comment: Snowball versus respondent-driven sampling. Sociol Methodol. 2011;41(1):355–366. doi: 10.1111/j.1467-9531.2011.01244.x 22228916
18. Handcock MS, Gile KJ. Comment: On the concept of snowball sampling. Sociol Methodol. 2011;41(1):367–371. doi: 10.1111/j.1467-9531.2011.01243.x
19. Handcock MS, Gile KJ. Modeling social networks from sampled data. Ann Appl Stat. 2010;4(1):5–25. doi: 10.1214/08-AOAS221 26561513
20. Stivala AD, Koskinen JH, Rolls DA, Wang P, Robins GL. Snowball sampling for estimating exponential random graph models for large networks. Soc Networks. 2016;47:167–188. doi: 10.1016/j.socnet.2015.11.003
21. Pattison PE, Robins GL, Snijders TAB, Wang P. Conditional estimation of exponential random graph models from snowball sampling designs. J Math Psychol. 2013;57(6):284–296. doi: 10.1016/j.jmp.2013.05.004
22. Snijders TAB, Baerveldt C. A multilevel network study of the effects of delinquent behavior on friendship evolution. J Math Sociol. 2003;27(2–3):123–151. doi: 10.1080/00222500305892
23. Efron B. Better bootstrap confidence intervals. J Am Stat Assoc. 1987;82(397):171–185. doi: 10.2307/2289153
24. Hunter DR, Goodreau SM, Handcock MS. Goodness of fit of social network models. J Am Stat Assoc. 2008;103(481):248–258. doi: 10.1198/016214507000000446
25. Borisenko A, Byshkin M, Lomi A. A simple algorithm for scalable Monte Carlo inference; 2019. Preprint. Available from: arXiv:1901.00533v3. Cited 17 April 2019.
26. Thiemichen S, Kauermann G. Stable exponential random graph models with non-parametric components for large dense networks. Soc Networks. 2017;49:67–80. doi: 10.1016/j.socnet.2016.12.002
27. Babkin S, Schweinberger M. Massive-scale estimation of exponential-family random graph models with local dependence; 2017. Preprint. Available from: arXiv:1703.09301v1. Cited 17 April 2019.
28. Schweinberger M, Krivitsky PN, Butts CT, Stewart J. Exponential-family models of random graphs: Inference in finite-, super-, and infinite-population scenarios; 2019. Preprint. Available from: arXiv:1707.04800v4. Cited 15 October 2019.
29. Snijders TAB, Pattison PE, Robins GL, Handcock MS. New specifications for exponential random graph models. Sociol Methodol. 2006;36(1):99–153. doi: 10.1111/j.1467-9531.2006.00176.x
30. Handcock MS, Hunter DR, Butts CT, Goodreau SM, Morris M. statnet: Software tools for the representation, visualization, analysis and simulation of network data. J Stat Softw. 2008;24(1):1–11. doi: 10.18637/jss.v024.i01
31. Hunter DR, Handcock MS, Butts CT, Goodreau SM, Morris M. ergm: A package to fit, simulate and diagnose exponential-family models for networks. J Stat Softw. 2008;24(3):1–29. doi: 10.18637/jss.v024.i03
32. Handcock MS, Hunter DR, Butts CT, Goodreau SM, Krivitsky PN, Bender-deMoll S, et al. statnet: Software tools for the statistical analysis of network data; 2016. Available from: CRAN.R-project.org/package=statnet.
33. Handcock MS, Hunter DR, Butts CT, Goodreau SM, Krivitsky PN, Morris M. ergm: Fit, Simulate and Diagnose Exponential-Family Models for Networks; 2016. Available from: http://CRAN.R-project.org/package=ergm.
34. Wang P. Exponential random graph models for affiliation networks [PhD thesis]. The University of Melbourne. Melbourne, Australia; 2012.
35. Morris M, Handcock M, Hunter D. Specification of exponential-family random graph models: Terms and computational aspects. J Stat Softw. 2008;24(4):1–24. doi: 10.18637/jss.v024.i04
36. Younes L. Estimation and annealing for Gibbsian fields. Ann Inst Henri Poincaré B. 1988;24:269–294.
37. Tieleman T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In: Proceedings of the 25th international conference on machine learning. ACM; 2008. p. 1064–1071.
38. Barndorff-Nielsen O. Information and exponential families in statistical theory. John Wiley & Sons; 2014.
39. Hinton GE. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002;14(8):1771–1800. doi: 10.1162/089976602760128018 12180402
40. Asuncion A, Liu Q, Ihler A, Smyth P. Learning with blocks: Composite likelihood and contrastive divergence. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics; 2010. p. 33–40.
41. Krivitsky PN. Using contrastive divergence to seed Monte Carlo MLE for exponential-family random graph models. Comput Stat Data Anal. 2017;107:149–161. doi: 10.1016/j.csda.2016.10.015
42. Wang P, Robins G, Pattison P. PNet: program for the simulation and estimation of exponential random graph (p*) models; 2009.
43. Bloom BH. Space/time trade-offs in hash coding with allowable errors. Commun ACM. 1970;13(7):422–426. doi: 10.1145/362686.362692
44. Jones GL, Haran M, Caffo BS, Neath R. Fixed-width output analysis for Markov chain Monte Carlo. J Am Stat Assoc. 2006;101(476):1537–1547. doi: 10.1198/016214506000000492
45. Vats D, Flegal JM, Jones GL. Multivariate output analysis for Markov chain Monte Carlo; 2017. Preprint. Available from: arXiv:1512.07713v4. Cited 17 April 2019.
46. Flegal JM, Hughes J, Vats D. mcmcse: Monte Carlo standard errors for MCMC; 2016. Available from: https://cran.r-project.org/package=mcmcse.
47. Hartung J, Knapp G, Sinha BK. Statistical meta-analysis with applications. Hoboken, NJ: John Wiley & Sons; 2008.
48. Hanson TD. uthash; 2018. https://github.com/troydhanson/uthash.
49. Salmon JK, Moraes MA, Dror RO, Shaw DE. Parallel random numbers: As easy as 1, 2, 3. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM; 2011. p. 16.
50. Csárdi G, Nepusz T. The igraph software package for complex network research. InterJournal Complex Systems: 1695. 2006.
51. Leskovec J, Sosič R. SNAP: A general-purpose network analysis and graph-mining library. ACM Trans Intell Syst Technol. 2016;8(1):1. doi: 10.1145/2898361 28344853
52. Wickham H. ggplot2: elegant graphics for data analysis. Springer New York; 2009. Available from: http://had.co.nz/ggplot2/book.
53. Scherer R. PropCIs: Various confidence interval methods for proportions; 2014. Available from: https://CRAN.R-project.org/package=PropCIs.
54. Gillespie CS. Fitting heavy tailed distributions: The poweRlaw package. J Stat Softw. 2015;64(2):1–16. doi: 10.18637/jss.v064.i02
55. Takac L, Zabovsky M. Data analysis in public social networks. In: International Scientific Conference and International Workshop Present Day Trends of Innovations. vol. 1; 2012. p. 1–6. Available from: http://snap.stanford.edu/data/soc-pokec.pdf.
56. Leskovec J, Krevl A. SNAP Datasets: Stanford large network dataset collection; 2014. http://snap.stanford.edu/data.
57. Kleineberg KK, Boguñá M. Evolution of the digital society reveals balance between viral and mass media influence. Phys Rev X. 2014;4(3):031046.
58. Clauset A, Shalizi CR, Newman ME. Power-law distributions in empirical data. SIAM Rev. 2009;51(4):661–703. doi: 10.1137/070710111
59. Stumpf MP, Porter MA. Critical truths about power laws. Science. 2012;335(6069):665–666. doi: 10.1126/science.1216142 22323807
60. Broido AD, Clauset A. Scale-free networks are rare. Nat Commun. 2019;10(1):1017. doi: 10.1038/s41467-019-08746-5 30833554
61. Van Duijn MA, Gile KJ, Handcock MS. A framework for the comparison of maximum pseudo-likelihood and maximum likelihood estimation of exponential family random graph models. Soc Networks. 2009;31(1):52–62. doi: 10.1016/j.socnet.2008.10.003 23170041
62. Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22(158):209–212. doi: 10.1080/01621459.1927.10502953
63. Granovetter MS. The strength of weak ties. Am J Sociol. 1973;78(6):1360–1380. doi: 10.1086/225469
64. An C, O’Malley AJ, Rockmore DN, Stock CD. Analysis of the US patient referral network. Stat Med. 2018;37(5):847–866. doi: 10.1002/sim.7565 29205445
65. Robbins H, Monro S. A stochastic approximation method. Ann Math Stat. 1951;22(3):400–407. doi: 10.1214/aoms/1177729586
66. Geyer CJ, Thompson EA. Constrained Monte Carlo maximum likelihood for dependent data. J Roy Stat Soc B Met. 1992;54(3):657–699.
67. Fellows IE. Why (and when and how) contrastive divergence works; 2014. Preprint. Available from: arXiv:1405.0602v1. Cited 17 April 2019.
Článek vyšel v časopise
PLOS One
2020 Číslo 1
- S diagnostikou Parkinsonovy nemoci může nově pomoci AI nástroj pro hodnocení mrkacího reflexu
- Je libo čepici místo mozkového implantátu?
- Pomůže v budoucnu s triáží na pohotovostech umělá inteligence?
- AI může chirurgům poskytnout cenná data i zpětnou vazbu v reálném čase
- Nová metoda odlišení nádorové tkáně může zpřesnit resekci glioblastomů
Nejčtenější v tomto čísle
- Severity of misophonia symptoms is associated with worse cognitive control when exposed to misophonia trigger sounds
- Chemical analysis of snus products from the United States and northern Europe
- Calcium dobesilate reduces VEGF signaling by interfering with heparan sulfate binding site and protects from vascular complications in diabetic mice
- Effect of Lactobacillus acidophilus D2/CSL (CECT 4529) supplementation in drinking water on chicken crop and caeca microbiome