Skip to main content
Proceedings of the National Academy of Sciences of the United States of America logoLink to Proceedings of the National Academy of Sciences of the United States of America
. 2002 Apr 16;99(8):5207–5211. doi: 10.1073/pnas.032085699

Winners don't take all: Characterizing the competition for links on the web

David M Pennock *, Gary W Flake *, Steve Lawrence *, Eric J Glover *, C Lee Giles *,†,
PMCID: PMC122747  PMID: 16578867

Abstract

As a whole, the World Wide Web displays a striking “rich get richer” behavior, with a relatively small number of sites receiving a disproportionately large share of hyperlink references and traffic. However, hidden in this skewed global distribution, we discover a qualitatively different and considerably less biased link distribution among subcategories of pages—for example, among all university homepages or all newspaper homepages. Although the connectivity distribution over the entire web is close to a pure power law, we find that the distribution within specific categories is typically unimodal on a log scale, with the location of the mode, and thus the extent of the rich get richer phenomenon, varying across different categories. Similar distributions occur in many other naturally occurring networks, including research paper citations, movie actor collaborations, and United States power grid connections. A simple generative model, incorporating a mixture of preferential and uniform attachment, quantifies the degree to which the rich nodes grow richer, and how new (and poorly connected) nodes can compete. The model accurately accounts for the true connectivity distributions of category-specific web pages, the web as a whole, and other social networks.


The World Wide Web is a reflection of human culture—a massive social network encoding associative links among almost 109 documents (1) authored by millions of people and organizations around the globe. The web's structure has emerged without central planning, the result of a bottom-up distributed process. Yet many aggregate web characteristics display a striking degree of regularity (2), including the distributions of traffic (3, 4), pages per site (5), file sizes (6, 7), and the depth to which a web user surfs (8). Several independent investigations show that the distribution of the number of links to (and from) a web page obeys a power law over many orders of magnitude (912). Power law scaling arises from a variety of physical, biological, and social processes (5, 1315). The emergence of a power law tail seems to characterize the connectivity distribution of many networks in addition to the web, including the graph of movie actor collaborations, the pattern of research paper citations, the topology of the power grid in the western United States, and the metabolic networks of many microorganisms (10, 16, 17).

Barabási and Albert (10, 18) attribute power law scaling to a “rich get richer” mechanism called preferential attachment: as the network grows, the probability that a given vertex receives an edge is proportional to that vertex's current connectivity. Adamic and Huberman (19) give an alternative explanation for power law behavior by adapting their model of the growth of web sites (5) to the case of web links.

Obscured behind the nearly pure power law distribution found for inbound links on the web as a whole, we uncover a richer structure among subsets of web pages in the same category. We find that these category-specific distributions exhibit very large deviations from power law scaling, with the magnitude of deviation varying from category to category. For these subsets of the web, we illustrate that the body of the distribution of incoming links is actually unimodal, rather than power law. Thus the rich get richer character of the web can be much less drastic among competing pages of the same type. In fact, pure power law scaling seems to be the exception rather than the rule. The distributions for outbound web links, and for a variety of other social and biological networks, also display significant deviations from power law, qualitatively similar in nature to those we find for web subsets (912, 16).

We employ a generalized Barabási–Albert (BA) model [similar to recent models (2023) independently proposed elsewhere] that incorporates both preferential attachment and a uniform baseline probability of attachment. The model predicts the observed shape of both the body and tail of typical connectivity distributions, including those observed within specific categories of web pages where the divergence from power law is especially marked. In the model, larger modes arise from faster rates of growth of edges as compared with vertices, suggesting an explanation for the different modes observed within different categories of web pages.

Generic vs. Category-Specific Degree Distributions

Several studies find that the probability that a randomly selected web page has k links is proportional to k−γ for large k (912), where γ is a constant, empirically determined as roughly 2.1 for inbound links and 2.72 for outbound links (11). When displayed on a log–log plot, this so-called power law distribution appears linear with slope −γ. A power law distribution has a heavy tail, which drops off much more slowly than the tail of a Gaussian distribution. As a result, although the vast majority of web pages have relatively small numbers of links, a few pages have enormous numbers of links—enough to skew the mean well above the median. If we interpret the number of inbound links to a web page as a measure of its popularity or impact, then power law scaling implies that a small fraction of web pages receive a disproportionately large share of such endorsements. As a result, these few popular pages typically benefit from a greater volume of traffic from web surfers, a higher probability of being indexed in search engine databases (1), and more prominent ranking within search engine results. Meanwhile, the majority of sites suffer from relatively poor visibility, and new commercial sites may have a difficult time competing for consumer attention. This state of affairs on the web has been referred to (metaphorically, if somewhat inaccurately) as a “winners take all” phenomenon.

At small connectivities k, the distribution of links on the web fails to fit a power law, with the discrepancy larger for outbound links than for inbound links (11). Systematic divergence from power law scaling at small k is also seen in the connectivity distributions of graphs encoding actor collaborations, the western United States power grid, scholarly citations, and outbound links from several subsets of the web (10).

Moreover, for some collections of web pages of the same type, we find that the distribution of inbound links departs drastically from a power law at small and moderate k. We examined the inbound link distributions for a set of public company homepages (obtained from http://www.investorguide.com/StockListA.htm, StockListB.htm, etc.), a set of American university homepages (from http://www.clas.ufl.edu/CLAS/american-universities.html), a set of US newspaper homepages (from http://www.usnewspaperlinks.com/), and a set of scientist homepages (from HPSearch at http://hpsearch.uni-trier.de/hp/). Diamond-shaped points in Fig. 1 graph the connectivity distribution for company homepages as a log-linear histogram. Pages are placed into buckets according to the number of their inbound links. Buckets are of exponentially increasing width, or constant width on the log scale—the same histogram type used in characterizing web file sizes (6), although different from the histograms used in some previous studies (5, 10). Specifically, in Fig. 1, the ith bucket point marks the normalized number of pages with between 10i/6 − 1 and 10(i+1)/6 − 1 inbound links. Although the tail of the distribution continues to fit a power law, the body appears roughly lognormal, with a sharp and singular mode, indicating that a plurality of company homepages have between 99 and 146 inbound links.

Figure 1.

Figure 1

Diamonds plot the empirically observed connectivity distribution for company homepages. Circles display the histogram resulting from a simulation of the model, with parameters t = 4,923, m = 1,356, and α = 0.950 set to match the company data. The dashed line marks the analytic solution (Eq. 3) instantiated with the same parameters.

Diamonds in Fig. 2 display the connectivity distributions of company homepages, university homepages, scientist homepages, and newspaper homepages on log-log scales. All four display the same qualitative shape—unimodal body and power law tail—although the modes vary among the different categories of pages. Heavy tails indicate that a handful of popular pages still gain a disproportionate percentage of all inbound links. Nevertheless, among less popular web pages of the same type, the distribution of inbound links is more evenly balanced. Many web pages can fare well when compared against the mode of all competing pages within the same category. Relative to their community, winners don't quite “take all.” Losing sites and mediocre sites attract a considerably higher proportion of links than would be the case under a pure power law distribution.

Figure 2.

Figure 2

Diamonds display log–log histograms of inbound connectivities for category-specific homepages, and inbound and outbound connectivities for random web pages. Circles mark the connectivity distributions, with m0 = 0, t set equal to the number of web pages, 2m set equal to the average number of inbound links per page, and α chosen according to a nonlinear least-squares fit. Dashed lines indicate the analytic solutions (Eq. 3).

It is an open question exactly how peaked distributions for subsets of the web like those in Figs. 1 and 2 sum together to produce the nearly pure power law for the web as a whole. We conjecture that the vast majority of subsets (or subsets containing the vast majority of pages) exhibit a nearly zero mode and dominate this sum, though more investigation is needed.

Network Growth Model

Generative Process Description.

We employ a generative model of network growth to explain the observed connectivity distributions for the web, for web categories, and for other social networks. The model is similar to other generalized BA models recently developed independently by other authors (2023). The network begins with m0 vertices. At each time step t, one vertex and m edges are added to the network. In the BA model, all m edges connect the new vertex with an old vertex according to preferential attachment: the probability Π(ki) that an edge connects to vertex i is proportional to ki, where ki is the current number of edges incident on vertex i.

We presume instead that every vertex has at least some baseline probability of gaining an edge. Both endpoints of edges are chosen according to a mixture of probability α for preferential attachment and 1−α for uniform attachment. The probability that an endpoint of a new edge connects to vertex i is

graphic file with name M1.gif 1

where m0 + t is the total number of vertices and 2mt is the total connectivity at time t. Edge endpoints are chosen symmetrically, rather than pinned to the newest vertex. Solitary vertices are not destined to remain forever disconnected. Under preferential attachment alone, sites that are already rich in links tend to get richer, resulting in a nearly pure power law distribution over connectivities. On the other hand, with the addition of a component for uniform attachment, the poorer sites (with some luck) can get rich too. Intuitively, the two growth components can be viewed as capturing two common behaviors of web page authors: (i) creating links to pages that the author is aware of because they are popular, and (ii) creating links to pages that the author is aware of because they are personally interesting or relevant, largely independent of popularity.

We generated a simulated network by using Eq. 1, with parameters set to model the company homepages data: t and 2m are set to the actual number of web pages (4,923) and the average number of inbound links per page (2,712), respectively. The seed set size m0 is set to zero. The only tuning parameter, α, is set according to a nonlinear least-squares fit of the analytic solution (Eq. 3) to the data (resulting in α = 0.950). Multiple edges between two vertices are allowed, although self-edges are not. Circles in Fig. 1 plot the resulting connectivity histogram, which corresponds very well with the true distribution. Circles in Fig. 2 display simulation results for all four data sets on log–log scales, again showing good agreement with empirical measurements.

Notice that the simulation builds a graph among subset members only, whereas empirical data includes inbound links originating from the entire web. There are two ways to interpret the model to reconcile this difference. First, one can think of the model as a prescription not for graph generation, but simply for connectivity growth, where each vertex increments its connectivity independently according to Eq. 1; this is the same abstraction adopted in other models (20, 22). Second, one can interpret an edge between vertex i and vertex j as a path between i and j, possibly traversing outside subset boundaries (with both endpoints being inbound links).

Finally, we note that the model can be easily generalized into a directed graph model. Two parameters αin and αout encode mixture probabilities for inbound and outbound links. The source of each new edge is chosen according to Eq. 1 using αout and the destination according to Eq. 1 using αin. Simulation and analytic results describing inbound (or outbound) connectivity growth are unaffected by this modification, modulo a factor of two.

Analytic Solution.

With a continuous mean-field approximation similar to that used by Barabási and Albert (18), we can derive the connectivity distribution for the model in closed form. Assuming that tm0, the probability that a vertex has connectivity k is

graphic file with name M2.gif 2

In the limit as k → ∞, the density Pr(k) is proportional to k−(1+1/α), or a power law with exponent γ = 1 + 1/α. For example, if α = 0.5, then γ = 3, the same as predicted in the BA model. Mixture parameters α of 0.909 and 0.581 yield exponents of 2.10 and 2.72, respectively, the empirically observed exponents for inbound and outbound web links (11). Other values for α yield alternative power law exponents.

Our log-scale histograms in Figs. 1 and 2 used exponentially increasing bucket sizes. We can perform an analogous transformation of the probability density (Eq. 2), to facilitate comparison on log-scale plots. We substitute k = 10k′/6 into the cumulative distribution, take the derivative with respect to k′, and substitute back by using k′ = 6log10 k. The resulting function displays the instantaneous probability mass at each k, where the widths of the infinitesimal “buckets” dk are constant on the logarithmic scale. This transformed density Inline graphic(k), suitable for log-scale visualization, is

graphic file with name M4.gif 3

The maximum of this function, corresponding to the mode of the distribution on a log scale, occurs at k = 2m(1 − α). The location of the mode is directly proportional to m, the rate of edge additions per time step. If α = 0.5, for example, then the mode is simply m, or the number of edges added per vertex. As the growth rate of edges increases compared with that of vertices, the mode shifts toward higher connectivities k. As the mixture parameter α approaches 1, or as m approaches 0, the distribution approaches a pure power law, and the mode appears at much lower connectivities.

Related Models.

Dorogovtsev et al. (20) and Levene et al. (22) independently propose similar generalizations of the BA model (the addition of a uniform component), motivating it in part as a natural way to parameterize the power-law exponent. Dorogovtsev et al. (20) solve for the exact degree distribution, showing that BA's mean-field approximation is correct in determining the asymptotic power-law exponent; the authors go on to study the connectedness properties of the network as it grows (21). Levene et al. (22) reformulate the growth process in terms of an urn transfer model, enabling them to obtain exact solutions in certain cases. Albert and Barabási (24) have proposed their own extension of their original model. Their augmented model involves a parameterized mixture of three processes: vertex additions, edge additions, and edge rewirings. The combination of these three processes leads to a connectivity growth function that is roughly a sum of uniform and preferential terms. Kleinberg et al. (23) propose a model in which some edges are added at random and some edges are copied from existing vertices, again leading to a mixture of uniform and preferential influences on network growth. Even Simon (25) in 1955 invokes a similar process to explain Estoup–Zipf word frequency distributions. Although most authors point out that the generalized form is flexible enough to admit any asymptotic power law exponent in the range (2, ∞), we focus on the fact that the same single additional degree of freedom is also sufficient to explain the often large deviation from power law behavior observed in the low-connectivity region.

Web Data and Model Comparisons.

The model's ability to fit both the body and the tail of typical degree distributions is especially evident for category-specific web data. Fig. 2 illustrates the fit between the model and the actual connectivity distributions for company, university, newspaper, and scientist homepages. The figure overlays web data, simulation data, and the mean-field solution (Eq. 3) for the four sets of web pages on log–log scales; Fig. 1 displays the same information on a log-linear scale for the company homepages. Any discrepancy between the analytic solution and the simulation is a result of the mean-field approximation. For the simulation and the analytic solution, the model parameters t and 2m are set to the number of web pages and the average number of inbound links per page, respectively. The seed set size m0 is set to zero and α is optimized by using a nonlinear regression. In all four cases, the model distributions fit very closely to the true distributions, capturing the same unimodal body and power law tail observed in the data.

Note that the only tuning parameter, α, affects both the mode and the slope of the tail, yet a single best-fit α captures both dimensions well. We also computed distributions for inbound and outbound links for the web as a whole, by using a collection of 100,000 pseudorandom web pages, sampled from roughly one billion URLs in Inktomi Corporation's Webmap. The model fits these distributions closely as well; moreover, the mixture parameters α imply power law slopes γ = 1 + 1/α precisely in line with previous measurements (11). Table 1 reports the best-fit parameters α, modes, and power law exponents γ for the four data sets and for the web as a whole.

Table 1.

Mixture parameters α, modes 2m(1 − α), and power law tail exponents γ = 1 + 1/α for inbound links to category-specific homepages, and for inbound and outbound links on the web as a whole

Data set α Mode γ
Companies 0.950 136 2.05
Newspapers 0.948 87 2.05
Web inlinks 0.909 0 2.10
Universities 0.612 839 2.63
Scientists 0.602 7 2.66
Web outlinks 0.581 8 2.72

The distribution of links to university homepages exhibits the largest deviation from a power law; on the other end of the spectrum, the distribution of inbound links on the web as a whole is closest to a pure power law. In all cases studied, mixture parameters α are greater than 0.5. Thus, preferential attachment appears to play a larger role in web link growth than does uniform attachment. The growth of links to company homepages (α = 0.950) and to newspaper homepages (α = 0.948) is mostly dominated by the “rich get richer” process of preferential attachment, whereas link growth on scientist homepages (α = 0.602) and university homepages (α = 0.612) suggest a more balanced mixture of preferential and uniform terms. The model also appears consistent with the shape of connectivity distributions reported for the graph of actor collaborations, the networks of western United States power stations, the citation pattern among publications, and outbound links from subsets of the web (10). The distribution of file sizes on the web is also qualitatively similar to our category-specific link distributions, with a log-normal body and a power law tail (6). Previous studies characterized the body and tail separately (6); Eq. 3 (when interpreted purely as a growth model rather than a graph generation model) serves as a single-function alternative for describing the full distribution.

Conclusions

The addition of pages and links to the web is a distributed, asynchronous, complex, and continual process: to an outside observer, fine-grained changes must appear almost haphazard. Yet, when examined on the large, discernible patterns emerge (2, 3, 5, 6, 8, 11), some of which are shared with other social and biological networks (10, 16, 17). For one, the distribution of the number of links to (and from) a page has been shown to follow a power law over many orders of magnitude (1012). We demonstrate that, among web pages of the same type, the body of the distribution of inbound links deviates strongly from a power law, exhibiting a roughly log-normal shape. A generative model incorporating uniform as well as preferential attachment explains data from the web as a whole, as well as category-specific data from company, university, newspaper, and scientist homepages.

As commerce and communication move to the web, the dynamics of link accumulation—at both global and local granularities—can strongly influence competition and diversity in business and society. Improved tools for characterizing and modeling these dynamics will have significant scientific and commercial value (23). Beyond the web, understanding commonalities among diverse network types promises to enrich our understanding of the evolution of social and ecological structures.

Acknowledgments

We thank A.-L. Barabási, D. Watts, F. Coetzee, P. Rusmevichientong, A. Khrabrov, S.-T. Park, S. Chakrabarti, and the anonymous reviewers for insights, advice, and pointers to related work. We thank Inktomi Corporation for access to their Webmap.

Abbreviation

BA

Barabási–Albert

References

  • 1.Lawrence S, Giles C L. Nature (London) 1999;400:107–109. doi: 10.1038/21987. [DOI] [PubMed] [Google Scholar]
  • 2.Pitkow J E. Comput Netw ISDN Sys. 1998;30:551–558. [Google Scholar]
  • 3.Adamic L A, Huberman B A. Q J Electr Commerce. 2000;1:5–12. [Google Scholar]
  • 4.Glassman S. Comput Netw ISDN Sys. 1994;27:165–173. [Google Scholar]
  • 5.Huberman B A, Adamic L A. Nature (London) 1999;401:131. [Google Scholar]
  • 6.Barford P, Bestavros A, Bradley A, Crovella M. World Wide Web J Spec Issue Character Perform Eval. 1999;2:15–28. [Google Scholar]
  • 7.Woodruff A, Aoki P M, Brewer E, Gauthier P, Rowe L A. Comput Netw ISDN Sys. 1996;28:963–980. [Google Scholar]
  • 8.Huberman B A, Pirolli P L T, Pitkow J E, Lukose R M. Science. 1994;280:95–97. doi: 10.1126/science.280.5360.95. [DOI] [PubMed] [Google Scholar]
  • 9.Albert R, Jeong H, Barabási A-L. Nature (London) 1999;401:130–131. [Google Scholar]
  • 10.Barabási A-L, Albert R. Science. 1999;286:509–512. doi: 10.1126/science.286.5439.509. [DOI] [PubMed] [Google Scholar]
  • 11.Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J. Comput Netw. 2000;33:309–320. [Google Scholar]
  • 12.Kumar R, Raghavan P, Rajagopalan S, Tomkins A. Comput Netw. 1999;31:1481–1493. [Google Scholar]
  • 13.Casti J L. Complexity. 1995;1:12–15. [Google Scholar]
  • 14.May R M. Science. 1988;214:1441–1449. doi: 10.1126/science.241.4872.1441. [DOI] [PubMed] [Google Scholar]
  • 15.Wasserman S, Faust K. Social Network Analysis: Methods and Applications. Cambridge, U.K.: Cambridge Univ. Press; 1998. [Google Scholar]
  • 16.Jeong H, Tombor B, Albert R, Oltvai Z N, Barabási A-L. Nature (London) 2000;407:651–654. doi: 10.1038/35036627. [DOI] [PubMed] [Google Scholar]
  • 17.Watts D J, Strogatz S H. Nature (London) 1998;393:440–442. doi: 10.1038/30918. [DOI] [PubMed] [Google Scholar]
  • 18.Barabási A-L, Albert R, Jeong H. Physica A. 1999;272:173–187. [Google Scholar]
  • 19.Adamic L A, Huberman B A. Science. 2000;287:2115a. [Google Scholar]
  • 20.Dorogovtsev S, Mendes J, Samukhin A. Phys Rev Lett. 2000;85:4633–4636. doi: 10.1103/PhysRevLett.85.4633. [DOI] [PubMed] [Google Scholar]
  • 21.Dorogovtsev S, Mendes J, Samukhin A. Phys Rev E. 2001;64:66110. doi: 10.1103/PhysRevE.64.025101. [DOI] [PubMed] [Google Scholar]
  • 22. Levene, M., Fenner, T., Loizou, G. & Wheeldon, R. (2001) http://www.arXiv.org/abs/cond-mat/0110016.
  • 23.Kleinberg J M, Kumar R, Raghavan P, Rajagopalan S, Tomkins A S. Proceedings of the Fifth International Conference on Computing and Combinations. Berlin: Springer; 1999. pp. 1–18. [Google Scholar]
  • 24.Albert R, Barabási A-L. Phys Rev Lett. 2000;85:5234–5237. doi: 10.1103/PhysRevLett.85.5234. [DOI] [PubMed] [Google Scholar]
  • 25.Simon H. Biometrika. 1955;42:425–440. [Google Scholar]

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences

RESOURCES