Experience versus talent shapes the structure of the Web

Joseph S Kong; Nima Sarshar; Vwani P Roychowdhury

Proc Natl Acad Sci U S A. 2008 Sep 8;105(37):13724–13729. doi: 10.1073/pnas.0805921105

PMCID: PMC2544521 PMID: 18779560

Abstract

We use sequential large-scale crawl data to empirically investigate and validate the dynamics that underlie the evolution of the structure of the web. We find that the overall structure of the web is defined by an intricate interplay between experience or entitlement of the pages (as measured by the number of inbound hyperlinks a page already has), inherent talent or fitness of the pages (as measured by the likelihood that someone visiting the page would give a hyperlink to it), and the continual high rates of birth and death of pages on the web. We find that the web is conservative in judging talent and the overall fitness distribution is exponential, showing low variability. The small variance in talent, however, is enough to lead to experience distributions with high variance: The preferential attachment mechanism amplifies these small biases and leads to heavy-tailed power-law (PL) inbound degree distributions over all pages, as well as over pages that are of the same age. The balancing act between experience and talent on the web allows newly introduced pages with novel and interesting content to grow quickly and surpass older pages. In this regard, it is much like what we observe in high-mobility and meritocratic societies: People with entitlement continue to have access to the best resources, but there is just enough screening for fitness that allows for talented winners to emerge and join the ranks of the leaders. Finally, we show that the fitness estimates have potential practical applications in ranking query results.

At both the individual and societal levels, we constantly have to make decisions on how we should distribute our limited resources and time. We need to choose who to hire, elect, buy from, get information from, award grants to, or make friends with. In this competitive landscape, each candidate touts a resumé highlighting experience, a more easily quantifiable metric that summarizes past achievements, e.g., the total number of clients a service provider has served or the years a prospective employee has spent at similar jobs, and talent or inherent fitness, a more subjective metric that indicates how well the candidates might perform in the future, e.g., special pedigree or degree from a prestigious college, or knowledge of a brand new technology, or an articulation of an ideal that captures the imagination. How we strike a balance between entitlement/experience and fitness/potential is a key determining factor in how wealth and power get distributed in a society and how nimble it is in adapting to changes. Too much emphasis on experience alone could lead to an ossified social structure that lacks innovation and can collapse dramatically when confronted with change; world history is littered with numerous instances of failed societies that had chosen such a path. The opposite extreme of letting only promising upstarts rule can equally easily lead to a state of anarchy with no dominant institutions to hold the society together; the frequent failures of well intentioned revolutions that supplant existing institutions en masse and make fresh starts provide eloquent testimonies to the perils of such a path. A society-wide quantitative study of how the experience vs. talent question is resolved, however, has been difficult to perform because of the obvious lack of concrete data.

Background.

The World Wide Web (WWW) provides a unique opportunity in this regard. It has emerged as a symbiotic socioeconomic entity, enabling new forms of commerce and social intercourse while being constantly updated and modified by the activities that it itself enables. Given the organic nature of the web, its evolution, structure, and information dynamics should reflect many of the same dynamics that underlie its real-world counterparts, i.e., our social and economic institutions. Thus, we ask how this thriving cyber-society deals with the experience vs. talent issue and how this interplay influences its own structure. The unprecedented scale and transparency of the activities on the web can provide data that hitherto have been unavailable. The web is typically modeled as an evolving network whose nodes are web pages and whose edges are URL links or hyperlinks. A web page's in-degree (i.e., the number of other pages that provide links to it) is a good approximation of its ability to compete because heavily linked web documents are entitled to numerous benefits, such as being easier to find via random browsing, being possibly ranked higher in search engine results, and attracting higher traffic and, thus, higher revenue through online advertisements. Thus, the degree of a node can be considered as a proxy of its experience, and it is a reflection of its entitlement, status, and accomplishments to date.

In fact, motivated by a power-law (PL) distribution of the degree of nodes in the web graph [i.e., P (k) ∝ k^−γ, where k denotes node degree and γ is the PL exponent], the principle of preferential attachment (PA), known to sociologists and economists for decades [e.g., as the “cumulative-advantage” or the “rich-gets-richer” principle (1, 2)], was proposed as a dominant dynamic in the web (3–5). Note that Huberman and Adams (4) modeled web growth in terms of growth in the sizes of web sites/domains, which is identical to the model used by Willis and Yule (6) in 1922 to explain the PL in the sizes of the genus. However, as shown in ref. 7, the Yule model and the Simon model (1) are equivalent to each other, and both rely on the cumulative-advantage principle. Hence, we refer to both the models introduced in ref. 3 and in refs. 4 and 5 as the PA model. Alternate local dynamic models of the web, e.g., via copying of links (8) (again, inspired by analogous social dynamics, such as referral services), account for additional characteristics of the web graph, such as high clustering coefficients and bipartite clique communities, while still retaining the global PA mechanism.

The PA or equivalent models, however, imply that the scale is heavily tilted toward experience: the more experienced or older a page is, the more resources it will get, and the more dominant it will become. For example, PA predicts that almost all nodes with high in-degree are old nodes (disallowing newcomers to catch up) and that the degree distribution of pages introduced at the same time will be an exponential one, with very low variance. This extreme bias of the model was quickly realized, and Adamic and Huberman (5) presented empirical data showing that the degree distribution of nodes of the same age has a very high variance; they also introduced a fitness or talent parameter allowing different domains to grow at different rates to account theoretically for the high variance (4). This also prompted a number of researchers to propose (9, 10) and explore (11, 12) the “preferential attachment with fitness” dynamic model in which a node i acquires a new link with probability proportional to k_i × η_i, the product of its current number of links k_i and its intrinsic fitness or talent η_i. In such a linear fitness model, the degree distribution and the structure of the resulting network depend on the distribution of the talent parameter, η_i, and thus, without any knowledge of the exact distribution, one cannot quite say how exactly the talent vs. experience issue gets played out in the system. For example, a uniform distribution of talent has a very different implication than an exponential distribution. Moreover, a significant potential dynamic that has not been studied in the context of the web is the death or deletion dynamic, which is dominant in most societal settings, where institutions and individuals cease to operate. The deletion dynamic, however, has been studied in the context of other networks (13–16), and a surprising finding is that the heavy-tailed degree distribution disappears in the straight PA model under significant deletion.

This finding prompted us to ask data-driven questions, such as: How dominant is the churn or deletion dynamic in the web? Can a PA model with fitness preserve the heavy tail even in the presence of high deletion rates? Can one empirically verify that the proposed models are truly at work in the web? Can one empirically estimate the relative fitness of a significant number of pages on the web and quantify the distribution of talent on the web? Most interesting of all, how often can talents overtake the more experienced individuals and emerge as the winner? Such issues, although they have been partially theorized about, have not been empirically studied and validated.

Brief Summary of Findings.

By using web crawls that span the period of 1 year (i.e., 13 separate crawls, at 1-month intervals), we tracked both the death and the growth processes of the web pages. In particular, we tracked 17.5 thousand web hosts, via monthly crawls, with each crawl containing in excess of 22 million pages (see Materials and Methods). First, we discovered that there is a high turnover rate, and for every page created on the web, our conservative numerical estimates show that at least ≈0.77 pages are deleted [see Results and supporting information (SI) Appendix]. This is a significant enough deletion rate that it prompted us to analyze a theoretical model that integrates the deletion process with the fitness-based preferential attachment dynamics (see Materials and Methods). Previous models of the web had neglected the death dynamic; recent results, however, show that even a relatively low-grade deletion dynamic could alter network characteristics considerably. Given the distribution of fitness, our model can predict the overall degree distribution and the degree distribution of nodes with similar age.

The empirical crawl data are then used to estimate the parameters of the model. Doing so allows us to validate whether detailed time domain data are consistent with the predictions of the theoretical model. One of the most important assumptions of the model is that each page can be assigned a constant fitness (which can vary from page to page) that determines the rate at which it will accumulate hyperlinks. We perform an estimation of the fitness factor for each month and show that for the period of the crawls, the data do not reject the hypothesis that each page has a constant fitness (see SI Appendix). A further verification of the model is obtained by validating one of its most direct implications. In particular, the dynamic model predicts that the accumulated in-degree (i.e., counting all hyperlinks, including those made by pages that get deleted during our study period) of a page grows as a PL. We find that for a vast majority of pages that show any growth, the degree-vs.-time plots in the log–log scale have linear fits with correlation coefficients in excess of 0.9 (see Results). The slope of the linear fit is an affine function of the fitness of the page.

The robust estimation of the fitness factors of individual pages allows us to determine the overall distribution. We find the fitness on the web to be exponentially distributed (i.e., see Fig. 3A), with a truncation. When inserted into our analytical model, this exponential fitness distribution correctly predicts the PL degree distribution empirically observed in the overall web and for the set of nodes with similar age. For pages with similar age, the initial exponential distribution of fitness gets amplified by the PA mechanism, and as a result, the degree distribution of pages of the same age is a PL distribution with exponent 2, i.e., with high variance. Moreover, the truncated exponential distribution of fitness is one of the few distributions that would generate a constant PL exponent in the overall degree distribution, even as the turnover rate approaches unity (i.e., as many pages are deleted as created on the average). The empirical data agree with this prediction, and the PL degree distribution retains a constant low-magnitude exponent throughout the period of our study (see Results) even though the deletion rate of pages remains high. Thus, the fitness distribution of the pages helps in preserving the heavy-tailed PL degree distribution of the web.

Fig. 3. — Distribution of talent. (A) Distribution of the growth exponents: the log–linear plot is well fitted by a straight line in the range between 0 and 2, which suggests that the distribution is a truncated exponential. The slope of the fitted line is −1.44. (B) We find that the growth exponent distribution also exhibits an exponential form when restricting to sets of nodes with the same initial in-degree in June, 2006; the plot for the set of nodes with an initial in-degree of 10 is displayed here. Note that the growth exponent is an affine function of the underlying fitness parameter (see Eq. 6); hence, the fitness distributions are also truncated exponentials.

The sequential time-sampled data help us in better understanding the interplay between experience and talent (fitness). For example, the initial in-degree of a page (i.e., in June 2006) is a measure of its experience, and the accumulated final in-degree (i.e., in June 2007) is a measure of how it fared based on its fitness and its experience. We define a page to be a winner if its final degree exceeds a specific desired target, while starting with an initial degree less than the target value. Fig. 1A shows the initial in-degree distribution of all pages such that the initial degree was <1,000 and the accumulated final degree >1,000. The case of different target final degree values is discussed in the SI Appendix. If the growth of the number of hyperlinks acquired by a page was based purely on PA (i.e., all pages have the same talent/fitness), then only pages with an initial degree greater than a cutoff would end up with a final degree >1,000. Clearly, the empirical data show that this is not the case: There are talented winners who have very low initial in-degree (i.e., smaller than the cutoff) and yet end up as winners; similarly, there are experienced losers who start with a large in-degree (i.e., greater than the cutoff) yet end up with a cumulative in-degree less than 1,000. Fig. 1B shows the number of talented winners and experienced losers as a function of the cutoff, and for the sake of fair comparison, we pick a value for the cutoff such that the number of talented winners equals the number of experienced losers. Thus, we find that for this sample set, the web collectively picked 48% talented winners and displaced an equal number of more experienced pages, thus striking a balance between talent and experience. As analyzed in the SI Appendix, the percentage of talented winners seems to remain relatively constant as the target degree is varied.

Fig. 1. — Fraction of talented winners on the web. (A) Cumulative distribution of the initial in-degree (or experience) of pages with measurable fitness (see *Results* for the definition of such pages) with the following properties: (i) initial in-degree in June 2006 was <1,000; (ii) the final accumulated in-degree at the end of the observation period (i.e., in June 2007) was >1,000. (B) Count vs. cutoff degree: the downward-slope line denotes the number of experienced losers (i.e., pages with initial in-degree greater than the cutoff degree, but final in-degree <1,000) for different cutoff degrees; the upward-slope line denotes the number of talented winners (i.e., pages with initial in-degree less than the cutoff and final in-degree >1,000) as a function of the cutoff. For comparison's sake, we pick the critical cutoff degree that equalizes the number of talented winners and experienced losers. Inserting a dashed vertical line denoting the critical cutoff degree into A, we find that 48% of the winners are talented winners.
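The counting procedure behind Fig. 1 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `talented_winner_fraction`, the integer scan over cutoffs, and the toy in-degree values are all assumptions made for the example.

```python
import numpy as np

def talented_winner_fraction(initial, final, target):
    """Find the cutoff that best balances talented winners and experienced losers.

    initial, final: in-degrees at the start and end of the observation period.
    Only pages with initial in-degree below `target` are considered.
    Returns (cutoff, fraction of winners that are talented winners).
    """
    initial = np.asarray(initial)
    final = np.asarray(final)
    below = initial < target
    winners = below & (final > target)   # reached the target degree
    losers = below & (final < target)    # stayed under the target degree
    best = None
    for cutoff in range(1, target + 1):
        tw = int(np.sum(winners & (initial < cutoff)))  # talented winners
        el = int(np.sum(losers & (initial > cutoff)))   # experienced losers
        gap = abs(tw - el)
        if best is None or gap < best[0]:
            best = (gap, cutoff, tw)
    _, cutoff, tw = best
    return cutoff, tw / int(winners.sum())

# Toy data (made up for illustration only)
initial = [5, 20, 50, 200, 400, 600, 800, 900]
final = [1500, 40, 1200, 2000, 450, 3000, 850, 5000]
cutoff, fraction = talented_winner_fraction(initial, final, target=1000)
```

The scan keeps the first cutoff at which the two counts are closest, which is one of several reasonable tie-breaking choices.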

For the fitness distributions of pages with similar initial in-degree and hence, similar experience, see Fig. 3B. They all are exponentially distributed, except that the average fitness is a function of the initial degrees of the nodes. Fig. 2 shows the average fitness as a function of initial in-degree. It shows that the average fitness is largest for nodes with least experience and decreases as a PL until approximately an in-degree value of 100; it levels off after that. Thus, the web encourages pages with low or little experience just a bit more than the mature pages; but for any group, it judges talent quite conservatively, keeping the distribution exponential. The concept of fitness has implications on how we rank the importance and attractiveness of web pages. In Discussion, we propose that one can use the fitness estimates of the pages to boost their rankings; this way, pages with low overall degree but that are growing fast will get higher ranking.

Fig. 2. — The average fitness value (as measured by the growth exponent) plotted as a function of the initial in-degree (or experience) of pages. The set of pages considered consists of those with measurable fitness (see *Results* for a definition of such pages). As the plot shows, pages with low initial in-degree have higher average fitness, even though the distribution is always exponential; moreover, the average fitness decreases as a power law of the form k^−0.4 until k ≈ 100 and then levels off to a constant value. Thus, the web on the average gives a slight fitness boost to the pages with low experience but then treats them statistically the same once they have experience above a certain value.

Results

Estimating the Fitness of Web Pages: Talents Are Exponentially Rare.

If the fitness with deletions model is indeed applicable to the web, the accumulated degree of each node should follow Eq. 11 as discussed in Materials and Methods. In particular, from Eq. 11, taking the logarithm of both sides of the accumulated degree of a page, we get:

$$\log k^*(i, t) = \beta_i \log t + B \tag{1}$$

where B is some time-invariant offset. Hence, the slope of the linear fit of the logarithm of the accumulated degree k*(i, t) against the logarithm of time t gives node i's growth exponent β_i. Note that the fitness value is related to the growth exponent of a node by a linear transformation with constant coefficients (see Eq. 6). Thus, the distribution characteristics of fitness can be obtained by measuring the growth exponent of each node.

The methodology for measuring the distribution of the growth exponents is described as follows. First, we identify ≈10 million web pages that persist through all 13 months from June 2006 to June 2007. For each of these web pages, the set of in-neighbors are identified for all months. The accumulated in-degree of a node at any month is the sum of the in-neighbors up to that particular month. In accordance with Eq. 1, after taking the logarithm of the accumulated in-degree and time (measured in months), the slope of the linear ordinary least-square fit (i.e., the empirical growth exponent) along with the Pearson correlation coefficient are obtained for each web page. We will refer to this methodology as the growth method; in the SI Appendix, we present an alternative methodology, the direct-kernel method, to estimate the fitness of web pages; the results from the alternative method are consistent with the results from the growth method.
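The growth method described above amounts to one log–log least-squares fit per page. The following is a minimal sketch under stated assumptions: the helper name `growth_exponent`, the use of NumPy, and the synthetic 13-month series are illustrative and not taken from the paper; pages whose accumulated in-degree is zero in some month would need separate handling before taking logarithms.

```python
import numpy as np

def growth_exponent(accumulated_in_degree):
    """Fit log10(accumulated in-degree) against log10(month) by ordinary least squares.

    Returns (slope, pearson_r); the slope is the empirical growth exponent.
    Assumes one strictly positive value per monthly crawl, month 1 first.
    """
    k = np.asarray(accumulated_in_degree, dtype=float)
    months = np.arange(1, len(k) + 1, dtype=float)
    x = np.log10(months)
    y = np.log10(k)
    slope, _intercept = np.polyfit(x, y, 1)  # degree-1 fit: [slope, intercept]
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation of the fit
    return slope, r

# Synthetic page growing exactly as k*(t) = 5 * t**0.7 over 13 crawls (made-up numbers)
months = np.arange(1, 14)
synthetic = 5.0 * months ** 0.7
beta, r = growth_exponent(synthetic)
```

On real crawl data the fit would be noisy, which is why the correlation coefficient is returned alongside the slope and can be used to filter pages.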

We found that a large fraction of web pages do not gain any in-connection at all during the entire 1-year period. We consider a web page to have a zero-growth exponent if its in-degree values increase two times or less during the 13 months. We found that only 6.5% of the web pages have nonzero growth exponents. We will focus our study on the set of nodes with nonzero growth exponents. Note that the set of web pages with zero-growth exponents essentially introduces a delta function at the origin in a fitness distribution plot. It is simple to check that the delta function does not impact the derivation of results and hence is omitted from discussion for simplicity.

An overwhelming fraction of the linear fits produce a correlation coefficient of 0.8 or more, with an average correlation value of 0.89 (see SI Appendix). Thus, our empirical measurement is consistent with the model that the evolution of node in-degree as a function of time follows a PL as described in Eq. 1 for the majority of the web pages.

We plot the distribution of the growth exponents for the set of nodes with correlation coefficient of 0.8 or more in Fig. 3. The distribution of the growth exponents has a mean of 0.30 and clearly follows an exponential curve with a truncation at ≈2.0 and a slope of −1.44 in the log-linear plot (i.e., a characteristic parameter of λ = 1.44/log e). Because node fitness and the growth exponent are related by a linear transformation involving the constants A and c (see Eq. 6), the fitness distribution is also well modeled by the same form of a truncated exponential.
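As a quick consistency check on these numbers, the mean of a truncated exponential can be computed in closed form. The sketch below assumes that "log e" denotes the base-10 logarithm of e (so that a slope of −1.44 on a log10-linear plot converts to a natural-log rate) and that the density is proportional to exp(−λβ) on [0, 2]; the function name `truncated_exp_mean` is hypothetical.

```python
import math

def truncated_exp_mean(lam, beta_max):
    """Mean of a density proportional to exp(-lam * x) on [0, beta_max]."""
    tail = math.exp(-lam * beta_max)
    return 1.0 / lam - beta_max * tail / (1.0 - tail)

# Assumed conversion: slope -1.44 per unit on a log10 axis -> natural-log rate
lam = 1.44 / math.log10(math.e)
mean_beta = truncated_exp_mean(lam, 2.0)
```

Under these assumptions λ is about 3.3 and the mean comes out close to 0.30, in line with the mean growth exponent reported above.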

Examples of High-Talent Web Pages.

We now conduct checks to see whether the web pages identified to have a large growth exponent indeed contain interesting or important content that warrants the title of being highly fit or “talented.” We manually inspected the several highest-fitness pages in our dataset. One example is a web page from the John Muir Trust web site that calls on people to explore nature (www.jmt.org/journey). Many in-links to this page are from other sites on nature and outdoor activities. Another example is the web page that reports the crime rate of the U.S. from 1960 to 2006 (www.disastercenter.com/crime/uscrime.htm). This URL has many in-links from other sites that discuss different crimes such as murder.

PL Degree Distribution of the Web Pages with the Same Age.

For scientific citation networks, it is known that the in-degree distribution of the papers published in the same year follows a power law (see the ISI dataset in Fig. 1a in ref. 17). However, no parallel study has been performed for the web. By using our temporal web dataset, we studied the in-degree distribution of the set of web pages with the same age. The in-degree distribution is found to follow a PL with an exponent of 2.0 for over 3 decades (see Fig. 4). This result is consistent with the empirical finding by Adamic and Huberman that the degree distribution of web hosts with the same age has a large variance (5). Furthermore, the PL nature of the in-degree distribution for web pages with the same age is consistent with our theoretical prediction from Eq. 10 given that the fitness distribution is found to be a truncated exponential (see Materials and Methods).

Fig. 4. — This figure plots the degree distribution as measured in June 2007 for web pages that first appear in the month July 2006. The fitted PL exponent is γ = 2.0.

In contrast, a network dynamic model that does not account for fitness has a small variance for the nodes with the same age, which leads to the effect that the “rich” node must be the old node. In fact, this is the basis of the issue raised by Adamic and Huberman (5). Thus, the fitness-based model naturally generates the PL degree distribution for the set of nodes with the same age, which is not explained by other existing models that do not account for fitness, such as those in refs. 3 and 13–16.

Ad Hoc Characteristics of the Web and the Resilience of the PL Exponent.

We now discuss the web page removal process as observed in our dataset. In our analytical model, a node is removed uniformly randomly (i.e., independent of node degree). We found empirical evidence to support the uniform random removal assumption: we observed that the degree distribution of the set of removed nodes that disappear in a given month is similar to the degree distribution of all nodes (see SI Appendix).

Recall that the turnover rate is defined as the average number of nodes removed per node added. From our dataset, the turnover rate is measured to be c = 0.91 (i.e., for every new web page inserted, 0.91 web page is removed per unit of time). However, this figure is an overestimate of the true turnover rate on the web because we are examining a fixed set of web hosts. Therefore, we also need to account for the growth in the number of web hosts. Nevertheless, even after accounting for the source of growth from the insertion of web hosts, the web still has a minimum turnover rate of 77% (see SI Appendix).

Despite the high rate of node turnovers, the PL degree distribution is found to be very stable (see SI Appendix). This finding is consistent with our ad hoc fitness model prediction that the PL exponent γ of the degree distribution P(k) ∝ k^−γ stays constant for varying rates of node deletion for a truncated exponential fitness distribution (see Eq. 9 in Materials and Methods). The resilience of the PL exponent is in stark contrast to the result obtained for the PA-with-deletion model (without any fitness variance), where the PL exponent is found to diverge rapidly as γ = 1 + 2/(1 − c) (13, 16). Thus, the natural variation of node fitness provides a self-stabilization force for the power law exponent of the degree distribution under high rate of node turnovers.
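To see how quickly the exponent of the PA-with-deletion model grows, the formula γ = 1 + 2/(1 − c) quoted above can be evaluated at a few turnover rates, including the two values measured in this study. The function name below is illustrative.

```python
def pa_deletion_exponent(c):
    """PL exponent of the PA-with-deletion model (no fitness variance): 1 + 2 / (1 - c)."""
    if not 0.0 <= c < 1.0:
        raise ValueError("turnover rate c must satisfy 0 <= c < 1")
    return 1.0 + 2.0 / (1.0 - c)

# Turnover rates: no deletion, moderate deletion, and the two estimates from the crawl data
exponents = {c: pa_deletion_exponent(c) for c in (0.0, 0.5, 0.77, 0.91)}
```

With no deletion the familiar exponent of 3 is recovered, whereas at c = 0.77 the formula already gives an exponent near 10, far from the low exponent observed on the web.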

Talented Winners versus Experienced Losers.

In the Introduction, we proposed the idea of talented winners and experienced losers and described how they are identified in our empirical web dataset for a given target degree k_tg. For the particular case of k_tg = 1,000, we find that 48% of the winners are talented winners (see Fig. 1) who successfully displaced the experienced losers (i.e., the nodes with higher initial in-degrees that fail to become winners). This observation is seemingly paradoxical: how can talents emerge to win close to half of the time when talents are exponentially rare? We seek to understand the interplay between experience and talent through analytical modeling.

Consider a node with the initial degree k < k_tg in month 1 (i.e., June 2006, the start of our observation period). For the node to achieve k_tg in month 13 (i.e., June 2007, the end of our observation period), the growth exponent of the node must exceed the critical value β_c(k) = log(k_tg/k)/log 13, which follows from the PL growth of Eq. 1 between months 1 and 13. The fraction of nodes that are winners is simply given as:

$$W(k_{tg}) = \int_{k < k_{tg}} P(k)\, C\big(\beta_c(k)\big)\, dk \tag{2}$$

where C(β) is the complementary cumulative distribution function (CCDF) of the growth exponent distribution and P(k) is the initial degree distribution in month 1. Thus, one can find the fraction of winners for a given k_tg by performing numerical integration of Eq. 2.

We now introduce the cutoff k_cut: the set of winners with an initial degree k < k_cut are denoted as the talented winners because they start with a low initial degree but nevertheless reach the target degree k_tg in month 13; the set of losers with an initial degree k > k_cut are denoted as the experienced losers because they start with a high initial degree but still fail to reach the target degree k_tg in month 13. We can solve for the critical cutoff k*_cut such that the number of talented winners TW(k*_cut) is equal to the number of experienced losers EL(k*_cut) (i.e., the talented winners displace the experienced losers):

$$\mathrm{TW}(k^*_{cut}) = \mathrm{EL}(k^*_{cut}) \tag{3}$$

The above equation can be solved numerically to obtain k*_cut. Now, the fraction of talented winners or experienced losers is simply given by: r_TW (k_tg) = TW(k*_cut)/W(k_tg).

From our empirical web data, the growth exponent of the nodes is distributed according to a truncated exponential function C(β) with the parameter λ = 1.44/log e and the truncation β_max = 2.0. The initial degree distribution P(k) is a PL with exponent γ = 1.8. Substituting the empirically obtained C(β) and P(k) functions into Eqs. 2 and 3, we use numerical integration to find the fraction of talented winners r_TW(k_tg) for the target degree k_tg = 1,000 and obtain the theoretical prediction of 48.8%, which matches well with the empirical measurements obtained as described in Fig. 1 (see SI Appendix for theoretical and measurement results for different target degrees). For a given system with known talent and initial degree distribution, one can now estimate the fraction of talented winners by using our analytical model.
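The numerical procedure can be sketched as a discrete sum over integer initial degrees. Everything beyond the quoted parameters is an assumption made for illustration: the critical exponent is taken as log(k_tg/k)/log 13 (PL growth between months 1 and 13), "log e" is read as a base-10 logarithm, P(k) is a pure PL on the integers 1 to k_tg − 1, and the names are hypothetical. Because these choices may differ from the authors' setup, the output is not expected to reproduce the reported figure exactly.

```python
import math

LAM = 1.44 / math.log10(math.e)  # assumed natural-log rate of the exponential
BETA_MAX = 2.0                   # truncation of the growth exponent

def ccdf(beta):
    """CCDF of a truncated exponential on [0, BETA_MAX]."""
    if beta <= 0.0:
        return 1.0
    if beta >= BETA_MAX:
        return 0.0
    tail = math.exp(-LAM * BETA_MAX)
    return (math.exp(-LAM * beta) - tail) / (1.0 - tail)

def critical_exponent(k, k_target, months=13):
    """Growth exponent needed to go from degree k to k_target over the period (assumed form)."""
    return math.log(k_target / k) / math.log(months)

def talented_winner_share(k_target, gamma=1.8, months=13):
    """Return (critical cutoff, share of winners whose initial degree is below it)."""
    ks = list(range(1, k_target))
    raw = [k ** -gamma for k in ks]
    norm = sum(raw)
    win = [(w / norm) * ccdf(critical_exponent(k, k_target, months))
           for k, w in zip(ks, raw)]
    lose = [(w / norm) - x for w, x in zip(raw, win)]
    total_win = sum(win)
    tw, el = 0.0, sum(lose)
    best_gap, best_cut, best_tw = float("inf"), None, 0.0
    for k, w, l in zip(ks, win, lose):
        el -= l                 # losers with initial degree strictly above k
        gap = abs(tw - el)      # tw holds winners with initial degree strictly below k
        if gap < best_gap:
            best_gap, best_cut, best_tw = gap, k, tw
        tw += w
    return best_cut, best_tw / total_win

cut, share = talented_winner_share(1000)
```

The scan exploits the fact that the talented-winner mass increases and the experienced-loser mass decreases with the cutoff, so the two curves cross once.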

Discussion

In this article, taking advantage of the large, open, and dynamic nature of the WWW, we find an intricate interplay between talent and experience. Talents are empirically found to be exponentially rare. However, through empirical measurements and theoretical modeling, we show that the exponentially distributed talent accounts for the following observed phenomena: the heavy-tailed PL in-degree distribution of the web pages born at the same time; the preservation of the low PL exponent even in the face of high rates of node turnovers; and most intriguing of all, talented winners emerge and displace the experienced losers in just slightly less than half of all winning cases!

Beyond the interesting findings, we discuss several issues associated with this work. Although our data are statistically consistent with the model assumption of a constant fitness for each page, our observation period is relatively short, at 1 year. For longer periods, one would expect the fitness of a page to change. For example, occasionally, a page that has been lying dormant for a while might find that its content becomes topical and, hence, its fitness suddenly increases, allowing it to start accumulating links and becoming popular. Such pages can be referred to as sleeping beauties (18). Developing a model that accounts for time-varying fitness can be a subject for future work. In addition, the sample size on the order of tens of millions of nodes used in this work is arguably large, especially compared with studies from the social sciences. However, the size of the web is currently on the order of billions of pages. Nevertheless, the source of our data, the Stanford WebBase project, to the best of our knowledge is the largest publicly accessible web archive available for research studies. Finally, the statistics on node in-degrees reported in this work are measured from the crawled web graph; potential in-links from web pages not included in the crawl are not accounted for. Future work focusing on examining larger web samples can mitigate these limitations.

On the WWW, the problem of search engine bias or the “entrenchment effect” (i.e., the rich-get-richer mechanism) has received considerable attention from a broad audience, from the popular press to researchers (19). However, researchers have shown evidence that the rich-get-richer mechanism might be less dominant than previously thought (20, 21); nevertheless, search engine bias and the entrenchment effect remain concerns. The findings in this article present an alternative perspective on this problem and show that talents, although being exponentially rare, are frequently afforded the opportunity to overtake more entrenched web pages and emerge as the winner.

Currently, for any given query, pages are ranked based on a number of metrics, including the relevance score of the query keywords in a document, and the document's page rank, which is computed based on the in-degree (or experience) of the page and the hyperlink structure of the web at the time of the crawl. To avoid the entitlement bias potentially introduced because of page rank, a number of researchers have advocated that one should also boost low page rank pages, for example, by randomly introducing them among the top pages (22, 23). The fitness of a page could be added as another metric that could influence the ranking. The determination of the exact functional form of how the fitness, η_i, of a page would influence its rank would require considerable experimentation and editorial evaluations.

Besides the web, the methodologies developed in this work are applicable for studying other complex networks and systems such as the citation network of scientific papers and the actor collaboration social network, where the interplay between experience and talent is also interesting. The fitness distribution is arguably an important parameter for dynamically evolving networks. The empirical study and theoretical models presented in this article pave the road for studying the fitness characteristics of other systems, which will allow us to better understand, characterize, and model a broad range of networks and systems.

Materials and Methods

Dataset.

Our dataset of the World Wide Web was obtained from the Stanford WebBase project. WebBase archives monthly web crawls from 2006 to 2007. We downloaded a total of 13 crawls for a 1-year period from June 2006 to June 2007. These crawls track the evolution of 17.5 thousand web hosts, with each crawl containing in excess of 22 million web pages.^¶ The set of hosts consists of a diverse sample of the web: it contains 5.4 thousand .com hosts, 4.7 thousand .org hosts, and 2.6 thousand .edu hosts. This set also includes many foreign hosts, such as hosts from China, India, and Europe.

Fitness-Based Model for Ad Hoc Networks.

The existing “preferential attachment with fitness model” is specified as follows (9). At each time step, a new node i with fitness η_i ≥ 0 joins the network, where η_i is chosen randomly from a fixed fitness distribution ρ(η); node i joins the network and makes m links to m nodes. A link is directed to node l with probability:

$$\Pi_l = \frac{\eta_l k_l}{\sum_j \eta_j k_j} \tag{4}$$

where k_l is the in-degree of the node l. We extend the fitness model to account for node deletion. The new model, which may be called the “fitness with deletion model,” adds the following dynamic to the original fitness model: at each time step, with probability c, a randomly selected node is deleted, along with all of its edges. We analyze the model by using the continuous mean-field rate equation approach introduced in ref. 24. Other approaches include the generating-function method discussed in ref. 16 and the rigorous mathematical analysis presented in ref. 25; we prefer the mean-field approach for its simplicity. In addition, the analytical results are verified by simulations. Finally, because the web is a directed graph, we note that the model can be easily generalized into a dynamic directed network model (details are discussed in the SI Appendix).
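The dynamics described above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the authors' simulation code: the network is treated as undirected, the seed graph is a clique on m + 1 nodes, fitness is drawn from a truncated exponential, and the function names (`sample_fitness`, `pick_targets`, `simulate`) and default parameter values are hypothetical.

```python
import math
import random


def sample_fitness(rng, lam, eta_max):
    """Inverse-CDF draw from rho(eta) proportional to exp(-lam * eta) on [0, eta_max]."""
    u = rng.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-lam * eta_max))) / lam


def pick_targets(rng, neighbors, fitness, m):
    """Choose up to m distinct existing nodes; node l is drawn with weight eta_l * k_l."""
    nodes = list(neighbors)
    weights = [fitness[v] * len(neighbors[v]) for v in nodes]
    # Nodes with zero weight can never be chosen, so cap the request accordingly.
    want = min(m, sum(1 for w in weights if w > 0))
    targets = set()
    while len(targets) < want:
        targets.add(rng.choices(nodes, weights=weights, k=1)[0])
    return targets


def simulate(steps, m=2, c=0.2, lam=1.0, eta_max=5.0, seed=0):
    """Fitness-weighted preferential attachment with random node deletion.

    Each step adds one node with m links; then, with probability c, one
    uniformly chosen node is removed together with all of its edges.
    Returns (neighbors, fitness), both dicts keyed by node id.
    """
    rng = random.Random(seed)
    # Assumption: start from a clique on m + 1 nodes so attachment can begin.
    neighbors = {i: {j for j in range(m + 1) if j != i} for i in range(m + 1)}
    fitness = {i: sample_fitness(rng, lam, eta_max) for i in neighbors}
    next_id = m + 1
    for _ in range(steps):
        targets = pick_targets(rng, neighbors, fitness, m)
        neighbors[next_id] = set(targets)
        fitness[next_id] = sample_fitness(rng, lam, eta_max)
        for v in targets:
            neighbors[v].add(next_id)
        next_id += 1
        if rng.random() < c:
            victim = rng.choice(list(neighbors))
            for u in neighbors.pop(victim):
                neighbors[u].discard(victim)
            del fitness[victim]
    return neighbors, fitness


if __name__ == "__main__":
    nbrs, fit = simulate(2000, m=2, c=0.2, seed=1)
    top = max(nbrs, key=lambda v: len(nbrs[v]))
    print(len(nbrs), "nodes; largest degree", len(nbrs[top]), "at fitness", round(fit[top], 3))
```

Each step costs time linear in the number of nodes, which is adequate for a few thousand steps; a large-scale run would need a faster weighted sampler.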

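In the fitness with deletion model, we show that the evolution of the degree k of node i follows a power law (see SI Appendix):

$$k_i(t, t_0) = m\left(\frac{t}{t_0}\right)^{\beta(\eta_i)} \tag{5}$$

where t_0 is the time at which node i joined the network and the growth exponent β is a function of the fitness η_i:

$$\beta(\eta_i) = \frac{\eta_i}{A} - \frac{c}{1-c} \tag{6}$$

The parameter A is given by:

$$1 = \int_0^{\eta_{max}} \rho(\eta)\left[\frac{1+c}{1-c}\,\frac{A}{\eta} - 1\right]^{-1} d\eta \tag{7}$$

where η_max is the maximum fitness in the system.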
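A growth law of this kind can be motivated by a standard mean-field argument. The steps below are a hedged sketch under the model as stated, not a reproduction of the SI Appendix derivation. They assume an undirected network with one node added per time step, so that the number of nodes is N(t) ≈ (1 − c)t, and they define A through Σ_j η_j k_j ≈ mAt; D(t, t_0) is notation introduced here for the probability that a node born at t_0 is still present at time t.

```latex
\begin{align}
\frac{\partial k_i}{\partial t}
  &= m\,\frac{\eta_i k_i}{\sum_j \eta_j k_j} - \frac{c\,k_i}{N(t)}
   \approx \left(\frac{\eta_i}{A} - \frac{c}{1-c}\right)\frac{k_i}{t}
  && \text{(links gained minus neighbors deleted)} \\
k_i(t,t_0) &= m\left(\frac{t}{t_0}\right)^{\beta(\eta_i)},
  \qquad \beta(\eta) = \frac{\eta}{A} - \frac{c}{1-c} \\
D(t,t_0) &= \left(\frac{t}{t_0}\right)^{-c/(1-c)}
  && \text{(from } \partial D/\partial t = -cD/N(t)\text{)} \\
mAt &= \int_0^{\eta_{\max}} \rho(\eta)\,\eta \int_0^{t} D(t,t_0)\,k(t,t_0)\,dt_0\,d\eta
   = mt \int_0^{\eta_{\max}} \frac{\rho(\eta)\,\eta}{\frac{1+c}{1-c} - \frac{\eta}{A}}\,d\eta \\
\Longrightarrow\quad
1 &= \int_0^{\eta_{\max}} \rho(\eta)\left[\frac{1+c}{1-c}\,\frac{A}{\eta} - 1\right]^{-1} d\eta
\end{align}
```

Setting c = 0 recovers the self-consistency condition of the original fitness model (9), and setting all fitness values equal gives β = 1/2 for any c.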

We now examine the case where the fitness distribution is a truncated exponential, which is shown to characterize the fitness distribution of web pages empirically. When ρ(η) is distributed exponentially in the interval [0, η_max], we have ρ(η) = λe^{−λη}/(1 − e^{−λη_max}). The constant A can be determined from Eq. 7. For η_max large compared with 1/λ, we have A = [(1 − c)/(1 + c)]η_max + ε₁, where ε₁ is negligibly small. Thus, according to Eq. 6, the growth exponent is given by

$$\beta(\eta) \approx \frac{1+c}{1-c}\,\frac{\eta}{\eta_{max}} - \frac{c}{1-c}$$

For maximum fitness η = η_max, we have β(η_max) = 1/(1 − c) − ε₂, where ε₂ is negligibly small. Because the PL exponent is dominated by the highest β, invoking the scaling relation (13), we obtain:

$$\gamma = 1 + \frac{1}{(1-c)\,\beta(\eta_{max})} = 2 + \varepsilon_3$$

where ε₃ is negligibly small. Thus, the PL exponent stays at 2 regardless of the deletion rate (see SI Appendix for the detailed derivations and justifications on assumptions made). This is a rather surprising result. As was shown in refs. 13–16, for plain preferential attachment dynamics (where all nodes have the same fitness), the PL exponent depends on c as γ = 1 + 2/(1 − c), and diverges as c goes to 1. The introduction of fitness with a truncated exponential distribution stabilizes the PL exponent, in the sense that the exponent remains close to 2.0 and does not diverge, regardless of the value of c. To verify the result that the PL exponent does not depend on the turnover rate, we performed large-scale simulations and confirmed that the PL exponent stays constant at 2.0 even under high rates of node turnovers (see SI Appendix).
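The claim that the exponent settles at 2 can be checked numerically. The sketch below is an illustration under assumptions rather than the authors' verification: it takes the mean-field self-consistency condition to be ∫ρ(η)·η/(B − η) dη = 1 with B = η_max(1 + δ), the growth exponent at maximum fitness to be β(η_max) = (1 + c)/[(1 − c)(1 + δ)] − c/(1 − c), and the scaling relation to be γ = 1 + 1/[(1 − c)β(η_max)]. The function names and the bisection bracket are hypothetical.

```python
import math


def rho(eta, lam, eta_max):
    """Truncated-exponential fitness density on [0, eta_max]."""
    return lam * math.exp(-lam * eta) / (1.0 - math.exp(-lam * eta_max))


def consistency_integral(log_delta, lam, eta_max, n=4000):
    """Integral of rho(eta) * eta / (B - eta) over [0, eta_max], B = eta_max * (1 + delta).

    The substitution s = -ln(B - eta) removes the near-singularity at eta_max:
    the integrand becomes rho(eta) * eta ds. Composite Simpson rule, n even.
    """
    delta = math.exp(log_delta)
    big_b = eta_max * (1.0 + delta)
    s_lo = -math.log(big_b)
    s_hi = -(math.log(eta_max) + log_delta)
    h = (s_hi - s_lo) / n
    total = 0.0
    for i in range(n + 1):
        eta = big_b - math.exp(-(s_lo + i * h))
        eta = min(max(eta, 0.0), eta_max)  # guard against rounding at the ends
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * rho(eta, lam, eta_max) * eta
    return total * h / 3.0


def solve_log_delta(lam, eta_max, lo=-200.0, hi=5.0, iters=100):
    """Bisection on ln(delta); the integral decreases as delta grows."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if consistency_integral(mid, lam, eta_max) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def pl_exponent(log_delta, c):
    """gamma = 1 + 1 / ((1 - c) * beta(eta_max)) under the assumptions above."""
    delta = math.exp(log_delta)
    beta_max = (1.0 + c) / ((1.0 - c) * (1.0 + delta)) - c / (1.0 - c)
    return 1.0 + 1.0 / ((1.0 - c) * beta_max)


if __name__ == "__main__":
    for eta_max in (2.0, 4.0, 6.0):
        ld = solve_log_delta(1.0, eta_max)
        gammas = [round(pl_exponent(ld, c), 6) for c in (0.0, 0.5, 0.9)]
        print(f"eta_max={eta_max}: ln(delta)={ld:.2f}, gamma at c=0, 0.5, 0.9 -> {gammas}")
```

Under these assumptions δ does not depend on c, and it shrinks very quickly as λη_max grows, which is why the exponent stays pinned near 2 across deletion rates.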

Degree Distribution of Nodes with the Same Age.

Given the fitness distribution ρ(η) and the degree, k, that grows exponentially with fitness for a fixed time interval t, we have k(η) ∝ t^{η/C}, where C is some constant. The degree distribution of nodes with the same age is given as: P(k) = ρ(η)|dη/dk|. For the case that the fitness distribution is a truncated exponential, the degree distribution follows a power law:

$$P(k) \propto k^{-(1 + C\lambda/\ln t)}$$

where the power law exponent is 1 + Cλ/ln t. Effectively, the light-tailed distribution in fitness is amplified into the heavy-tailed degree distribution for nodes born at the same time through the PA mechanism. The phenomenon of heavy-tailed degree distribution of nodes with the same age has also been observed and analyzed in other contexts (18, 26, 27).
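The amplification from a light-tailed fitness distribution to a heavy-tailed degree distribution can be illustrated with a short sampling experiment. This is a sketch under assumptions: degrees are taken to be exactly k = t^{η/C} with unit prefactor, the parameter values are arbitrary, and the exponent is estimated with the continuous maximum-likelihood (Hill) estimator, which is a choice made here and not a method taken from the article.

```python
import math
import random


def sample_fitness(rng, lam, eta_max):
    """Inverse-CDF draw from rho(eta) proportional to exp(-lam * eta) on [0, eta_max]."""
    u = rng.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-lam * eta_max))) / lam


def same_age_degrees(n, t, big_c, lam, eta_max, seed=0):
    """Degrees of n nodes of equal age t, assuming k = t ** (eta / C)."""
    rng = random.Random(seed)
    return [t ** (sample_fitness(rng, lam, eta_max) / big_c) for _ in range(n)]


def hill_exponent(degrees, k_min=1.0):
    """Maximum-likelihood estimate of gamma for P(k) proportional to k ** -gamma, k >= k_min."""
    logs = [math.log(k / k_min) for k in degrees if k >= k_min]
    return 1.0 + len(logs) / sum(logs)


if __name__ == "__main__":
    t, big_c, lam, eta_max = 100.0, 1.5, 2.0, 8.0
    ks = same_age_degrees(50_000, t, big_c, lam, eta_max, seed=42)
    print("estimated exponent:", round(hill_exponent(ks), 3))
    print("predicted 1 + C*lam/ln(t):", round(1.0 + big_c * lam / math.log(t), 3))
```

Because ln k = (η/C) ln t, the logarithm of the degree is itself exponentially distributed with rate Cλ/ln t, which is exactly the statement that k follows a power law; the truncation at η_max only cuts off the far tail.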

Evolution of the Accumulated Node Degree.

In our model, a node gains neighbors when new nodes link to it and loses neighbors when neighboring nodes are deleted. As a result, when we track the evolution of a node's degree over time, the time series shows a number of upward and downward jumps, making it difficult to estimate the growth exponent β(η) from Eq. 5 accurately. To reduce noise in the data, we can instead track the evolution of a node's accumulated degree over time. We define the set of accumulated neighbors of a node to include previous neighbors that have been deleted in addition to the current set of neighbors. Thus, the accumulated node degree is the size of the set of accumulated neighbors. It is simple to derive that the evolution of the accumulated degree of node i is (see SI Appendix):

$$k_i^*(t, t_0) \propto \left(\frac{t}{t_0}\right)^{\beta^*(\eta_i)}$$

where the growth exponent is found to be β*(η) = η/A − c/(1 − c). Note that the growth exponent for the evolution of the accumulated node degree β*(η) is identical to the growth exponent of node degree as given in Eq. 6 [i.e., β*(η) = β(η)].
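The bookkeeping for accumulated degree is simple enough to show directly. The sketch below assumes a hypothetical data layout in which each crawl yields the set of neighbors of one page; the helper names are illustrative, and fitting the growth exponent by least squares on log-log axes is an assumption made here, not necessarily the estimation procedure used in the article.

```python
import math


def degree_series(snapshots):
    """Return (current, accumulated) degree per snapshot.

    snapshots: iterable of neighbor collections, one per crawl, in time order.
    The accumulated set keeps neighbors that have since been deleted.
    """
    seen = set()
    current, accumulated = [], []
    for nbrs in snapshots:
        seen.update(nbrs)
        current.append(len(nbrs))
        accumulated.append(len(seen))
    return current, accumulated


def loglog_slope(times, values):
    """Least-squares slope of ln(value) against ln(time)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    crawls = [{"a", "b"}, {"a", "c"}, {"c", "d", "e"}, {"d", "e"}]
    current, accumulated = degree_series(crawls)
    print("current degree:    ", current)
    print("accumulated degree:", accumulated)
    print("growth exponent estimate:", loglog_slope([1, 2, 3, 4], accumulated))
```

The accumulated series never decreases, so it avoids the downward jumps that make the raw degree series hard to fit.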

Supplementary Material

Supporting Information

supp_105_37_13724__index.html^{(574B, html)}

Acknowledgments.

We thank Gary Wesley of the Stanford WebBase project for providing patient help and instructions in downloading the web crawl data.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0805921105/DCSupplemental.

^¶ The WebBase crawler extracts a maximum of 10,000 pages per host. This limit does not affect our dataset, because no tracked host has a page count that reaches it.

References

1. Simon HA. On a class of skew distribution functions. Biometrika. 1955;42:425–440.
2. Price DJ. General theory of bibliometric and other cumulative advantage processes. J Am Soc Inform Sci. 1976;27:292–306.
3. Barabási A-L, Albert R. Emergence of scaling in random networks. Science. 1999;286:509–512. doi: 10.1126/science.286.5439.509.
4. Huberman BA, Adamic LA. Growth dynamics of the world wide web. Nature. 1999;401:131.
5. Adamic LA, et al. Power-law distribution of the World Wide Web. Science. 2000;287:2115.
6. Willis JC, Yule GU. Some statistics of evolution and geographical distributions in plants and animals and their significance. Nature. 1922;109:177–179.
7. Simkin M, Roychowdhury V. Reinventing Willis. 2006 Preprint, http://arxiv.org/abs/physics/0601192.
8. Kleinberg JM, Kumar R, Raghavan P, Rajagopalan S, Tomkins AS. The Web as a graph: Measurements, models and methods. Lecture Notes Computer Sci. 1999;1627:1–17.
9. Bianconi G, Barabási A-L. Competition and multiscaling in evolving networks. Europhys Lett. 2001;54:436–442.
10. Bianconi G, Barabási A-L. Bose–Einstein condensation in complex networks. Phys Rev Lett. 2001;86:5632–5635. doi: 10.1103/PhysRevLett.86.5632.
11. Borgs C, Chayes J, Daskalakis C, Roch S. First to Market Is Not Everything: An Analysis of Preferential Attachment with Fitness; Symposium on Theory of Computing 2007; 2007. pp. 135–144.
12. Motwani R, Xu Y. Evolution of Page Popularity Under Random Web Graph Models; Symposium on Principles of Database Systems 2006; 2006. pp. 134–142.
13. Sarshar N, Roychowdhury V. Scale-free and stable structures in complex ad hoc networks. Phys Rev E. 2004;69:026101. doi: 10.1103/PhysRevE.69.026101.
14. Chung F, Lu L. Coupling online and offline analyses for random power law graphs. Internet Math. 2004;1:409–461.
15. Cooper C, Frieze A, Vera J. Random deletion in a scale-free random graph process. Internet Math. 2004;1:463–483.
16. Moore C, Ghoshal G, Newman MEJ. Exact solutions for models of evolving networks with addition and deletion of nodes. Phys Rev E. 2006;74:036121. doi: 10.1103/PhysRevE.74.036121.
17. Redner S. Citation statistics from more than a century of physical review. 2004 Preprint, http://aps.arxiv.org/abs/physics/0407137.
18. Simkin MV, Roychowdhury VP. A mathematical theory of citing. J Am Soc Inf Sci Technol. 2007;58:1661–1673.
19. Cho J, Roy S. Impact of Search Engines on Page Popularity; WWW 2004; 2004. pp. 20–29.
20. Pennock DM, Flake GW, Lawrence S, Glover EJ, Giles CL. Winners don't take all: Characterizing the competition for links on the web. Proc Natl Acad Sci USA. 2002;99:5207–5211. doi: 10.1073/pnas.032085699.
21. Fortunato S, Flammini A, Menczer F, Vespignani A. Topical interests and the mitigation of search engine bias. Proc Natl Acad Sci USA. 2006;103:12684–12689. doi: 10.1073/pnas.0605525103.
22. Cho J, Roy S, Adams RE. Page Quality: In Search of an Unbiased Web Ranking; Special Interest Group on Management Of Data 2005; 2005. pp. 551–562.
23. Pandey S, Roy S, Olston C, Cho J, Chakrabarti S. Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results; Very Large Data Base 2005; 2005. pp. 781–792.
24. Dorogovtsev SN, Mendes JFF. Scaling properties of scale-free evolving networks: Continuous approach. Phys Rev E. 2001;63:056125. doi: 10.1103/PhysRevE.63.056125.
25. Aiello W, Chung F, Lu L. Random Evolution in Massive Graphs; Foundations of Computer Science 2001; 2001. p. 510.
26. Redner S. How popular is your paper? An empirical study of the citation distribution. Eur Phys J B. 1998;4:131–134.
27. Simkin MV, Roychowdhury VP. Theory of aces: Fame by chance or merit? J Math Sociol. 2006;30:33–42.


