Significance
A scientist will encounter many potential collaborators throughout his/her career. As such, the choice to start or terminate a collaboration can be an important strategic consideration with long-term implications. While previous studies have focused primarily on aggregate cross-sectional collaboration patterns, here we analyze the collaboration network from a researcher’s local perspective along his/her career. Our longitudinal approach reveals that scientific collaboration is characterized by a high turnover rate juxtaposed with surprisingly frequent “life partners.” We show that these extremely strong collaborations have a significant positive impact on productivity and citations—the apostle effect—representing the advantage of “super” social ties characterized by trust, conviction, and commitment.
Keywords: computational social science, cooperation, team science, career evaluation, bibliometrics
Abstract
Scientists are frequently faced with the important decision to start or terminate a creative partnership. This process can be influenced by strategic motivations, as early career researchers are pursuers, whereas senior researchers are typically attractors, of new collaborative opportunities. Focusing on the longitudinal aspects of scientific collaboration, we analyzed 473 collaboration profiles using an egocentric perspective that accounts for researcher-specific characteristics and provides insight into a range of topics, from career achievement and sustainability to team dynamics and efficiency. From more than 166,000 collaboration records, we quantify the frequency distributions of collaboration duration and tie strength, showing that collaboration networks are dominated by weak ties characterized by high turnover rates. We use analytic extreme value thresholds to identify a new class of indispensable super ties, the strongest of which commonly exhibit >50% publication overlap with the central scientist. The prevalence of super ties suggests that they arise from career strategies based upon cost, risk, and reward sharing and complementary skill matching. We then use a combination of descriptive and panel regression methods to compare the subset of publications coauthored with a super tie to the subset without one, controlling for pertinent features such as career age, prestige, team size, and prior group experience. We find that super ties contribute to above-average productivity and a 17% citation increase per publication, thus identifying these partnerships—the analog of life partners—as a major factor in science career development.
Science operates at multiple scales, ranging from the global and institutional scale down to the level of groups and individuals (1). Integrating this system are multiscale social networks that are ripe with structural, social, economic, and behavioral complexity (2). A subset of this multiplex is the scientific collaboration network, which forms the structural foundation for social capital investment, knowledge diffusion, reputation signaling, and important mentoring relations (3–8).
Here we focus on collaborative endeavors that result in scientific publication, a process that draws on various aspects of social ties, e.g., colocation, disciplinary identity, competition, mentoring, and knowledge flow (9). The dichotomy between strong and weak ties is a longstanding point of research (10). However, in “science of science” research, most studies have analyzed macroscopic collaboration networks aggregated across time, discipline, and individuals (11–21). Hence, despite these significant efforts, we know little about how properties of the local social network affect scientists’ strategic career decisions. For example, how might creative opportunities in the local collaboration network impact a researcher’s decision to explore new avenues versus exploit old partnerships, and what may be the career tradeoffs in the short versus the long term, especially considering that academia is driven by dynamic knowledge frontiers (22, 23)?
Against this background, we develop a quantitative approach for improving our understanding of the role of weak and strong ties, while also uncovering a third classification, the super tie, which we find to occur rather frequently. We analyzed longitudinal career data for researchers from cell biology and physics, together comprising a set of 473 researcher profiles spanning more than 15,000 career years, 94,000 publications, and 166,000 collaborators. To account for prestige effects, we define two groups within each discipline set, facilitating a comparison of top-cited scientists with scientists who are more representative of the entire researcher population (henceforth referred to as “Other”). From the publication records spanning the first career years of each central scientist i, we constructed longitudinal representations of each scientist’s coauthorship history.
We adopt an egocentric perspective to track research careers from their inception along their longitudinal growth trajectory. By using a local perspective, we control for the heterogeneity in collaboration patterns that exists both between and within disciplines. We also control for other career-specific collaboration and productivity differences that would otherwise be averaged out by aggregate cross-sectional methods. Thus, by simultaneously leveraging multiple features of the data—resolved over the dimensions of time, individuals, productivity, and citation impact—our analysis contributes to the literature on science careers as well as team activities characterized by dynamic entry and exit of human, social, and creative capital. Given that collaborations in business, industry, and academia are increasingly operationalized via team structures, our findings provide relevant quantitative insights into the mechanisms of team formation (15), efficiency (24), and performance (25, 26).
Our study is organized as follows. The longitudinal nature of a career requires that we start by quantifying the tie between two collaborators from two different perspectives: duration and strength. First we analyze the collaboration duration, $L_{ij}$, defined as the time period between the first and last publication between two researchers i and j. Our results indicate that the “invisible college” defined by collaborative research activities (i.e., excluding informal communication channels and arm’s-length associations) is surprisingly dominated by high-frequency interactions lasting only a few years. We then focus our analysis on the collaborative tie strength, $K_{ij}$, defined as the cumulative number of publications coauthored by i and j during the $L_{ij}$ years of activity.
From the entire set of collaborators, we then identify a subset of super tie coauthors: those j with $K_{ij}$ values that are statistically unlikely according to an author-specific extreme value criterion. Because almost all of the researchers we analyzed have more than one super tie, and roughly half of the publications we analyzed include at least one super tie coauthor, we were able to quantify the added value of super ties, for both productivity and citation impact, in two ways: (i) using descriptive measures and (ii) implementing a fixed-effects regression model. Controlling for author-specific features, we find that super ties are associated with increased publication rates and increased citation rates.
We term this finding the “apostle effect,” signifying the dividends generated by extremely strong social ties based upon mutual trust, conviction, and commitment. This term borrows from biblical context, where an apostle represents a distinguished partner selected according to his/her noteworthy attributes from among a large pool of candidates. What we do not connote is any particular power relation (hierarchy) between i and the super tie coauthors, which is beyond the scope of this study. Also, because the perspective is centered around i, our super tie definition is not symmetric, i.e., if j is a super tie of i, i is not necessarily a super tie of j.
Because super ties have significant long-term impact on productivity and citations, our results are important from a career development perspective, reflecting the strategic benefits of cost, risk, and reward sharing via long-term partnership. The implications of research partnerships will become increasingly relevant as more careers become inextricably embedded in team science environments, wherein it can be difficult to identify contributions, signal achievement, and distribute credit. The credit distribution problem has received recent attention from the perspectives of institutional policy (8), team ethics (7), and practical implementation (27–29).
Methods
Our study implements an ego network perspective, centered around each researcher career i, with weighted links connecting the central scientist to the peripheral nodes representing his/her collaborators (indexed by j). We constructed each ego network using longitudinal publication data from Thomson Reuters Web of Knowledge (TRWOK), comprising 193 biology and 280 physics careers in total. Each career profile is constructed by aggregating the publication, citation, and collaboration metadata over the first years of his/her career. We downloaded the TRWOK data in calendar year , which is the citation count census year. Each disciplinary set includes a subset of 100 highly cited scientists (hereafter referred to as “Top”), selected using a ranking of the top-cited researchers in the high-impact journals Physical Review Letters and Cell. The rest of the researcher profiles (Other) are aggregated across physics and cell biology, with subsets that are specifically active in the domains of graphene, neuroscience, molecular biology, and genomics. The Other dataset only includes i with at least as many publications as the smallest among the top-cited researchers: As such, for biology and for physics. This facilitates a reasonable comparison between Top and Other, possibly identifying differences attributable to innate success factors. See SI Text for further details on the data methods and selection.
This longitudinal approach leverages author-specific factors, revealing how career paths are affected by idiosyncratic events. To motivate this point, Fig. 1 illustrates the career trajectory of A. Geim, cowinner of the 2010 Nobel Prize in Physics. This schematic highlights three fundamental dimensions of collaboration ties (duration, strength, and impact): (i) each horizontal line indicates the collaboration of length $L_{ij}$ between i and coauthor j, beginning with the year of their first joint publication and ending with the year of their last observed joint publication; (ii) the circle color indicates the total number of joint publications, $K_{ij}$, representing our quantitative measure of tie strength; and (iii) the circle size indicates the net citations in the census year, summed over the citations $c_{j,p}$ of all publications p that include i and j.
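The three dyadic measures described above (duration, strength, and net citations) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the record layout (the `year`, `authors`, and `citations` keys), the function name `dyadic_measures`, and the inclusive year counting for the duration (last year minus first year plus one) are all assumptions made for the example.

```python
from collections import defaultdict


def dyadic_measures(publications, ego):
    """Return {coauthor: {"L": duration, "K": tie strength, "C": net citations}}.

    publications: iterable of dicts with the hypothetical keys
    "year" (int), "authors" (list of str), and "citations" (int).
    """
    years = defaultdict(list)   # publication years shared with each coauthor
    cites = defaultdict(int)    # citations summed over shared publications
    for pub in publications:
        if ego not in pub["authors"]:
            continue            # only publications of the central scientist count
        for j in set(pub["authors"]) - {ego}:
            years[j].append(pub["year"])
            cites[j] += pub["citations"]
    return {
        j: {
            "L": max(ys) - min(ys) + 1,  # assumed inclusive counting of years
            "K": len(ys),                # number of joint publications
            "C": cites[j],
        }
        for j, ys in years.items()
    }


# Toy records, invented for illustration only.
records = [
    {"year": 2000, "authors": ["ego", "a", "b"], "citations": 10},
    {"year": 2004, "authors": ["ego", "a"], "citations": 50},
    {"year": 2005, "authors": ["x", "a"], "citations": 5},
]
ties = dyadic_measures(records, "ego")
```

Because the measures are computed per pair (i, j), a single publication with several coauthors contributes to several ties at once, which is the source of the covariation between coauthor measures.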
Fig. 1.
Visualizing the embedding of academic careers in dynamic social networks. A career schematic showing A. Geim’s collaborations, ordered by entry year. Notable career events include the first publication in 2000 with K. S. Novoselov (cowinner of the 2010 Nobel Prize in Physics) and their first graphene publication in 2004. An interesting network reorganization accompanies Geim’s institutional move from Radboud University Nijmegen (The Netherlands) to University of Manchester (United Kingdom) in 2001. Moreover, the rapid accumulation of coauthors following the 2004 graphene discovery signals the new opportunities that accompany reputation growth.
This method of representing a science career, as illustrated in Figs. S1–S3, highlights the variability in collaboration strengths, both between and within career profiles. It is also worth mentioning that because multiple j may contribute to the same p, it is possible for coauthor measures to covary. However, for the remainder of the analysis, we focus on the dyadic relations between only i and j, leaving the triadic and higher-order team structures as an avenue for future work. For example, it would be interesting to know the likelihood of triadic closure between any two super ties of i, signaling coordinated cooperation; or, contrariwise, low triadic closure rates may indicate hierarchical organization around i.
Fig. S1.
Complex relations between productivity, collaboration, and impact. A−D are for A. K. Geim, who is characterized by an average collaboration duration of 2.1 y (calculated including the collaborations with but excluding the collaborations active in the last 2 y), a characteristic tie strength publications, a collaboration radius of coauthors, and total publications; E−F are for D. Acemoglu, who is characterized by an average collaboration duration of 1.6 y (also calculated including the collaborations with but excluding the collaborations active in the last 2 y), publications, coauthors, and publications. These schematics demonstrate how the visualization of dynamic ego network changes if we use publication and citation measures that are normalized by , resulting in per-year-of-collaboration (intensity) measures. (A) Collaboration measures calculated per unit time, for comparison with Fig. 1. (B−D) Scatter plots for the profile of A. K. Geim relating collaboration duration (), with (B) collaboration strength (), (D) pairwise team size (), and (C) citations (). is the total number of coauthors (nondistinct) on publications including i and j, a proxy for pairwise collaborative input, conditioned on i and j. The dashed line in each panel represents the ordinary least-squares fit of the log of the variables. As such, the logarithmic slope (scaling exponent) is listed in each panel, and the value in parentheses represents the SE in the last digit reported. (E and F) Economics is a field not traditionally considered to be collaborative at the rates of physics or biology. Nevertheless, prestige and collaboration life cycles are still important factors, independent of discipline. To demonstrate this, we show the career profile of the highly cited economist, Daron Acemoglu. Notable landmark achievements are indicated, including the early partnership with James A. Robinson in 2000, and their groundbreaking book, Economic Origins of Dictatorship and Democracy, published in 2005 (48). (E) Net collaboration measures for D. Acemoglu, analogous to Fig. 1. (F) Collaboration measures calculated per unit time, analogous to A.
Fig. S3.
Collaboration life cycle for the (A and C) Other biology and the (B and D) Other physics datasets. Other datasets: (A and B) Average collaboration strength, normalized to peak value, measured τ years after the initiation of the collaboration tie. (Insets) On log-linear axes, the decay appears as linear, corresponding to an exponential form. (C and D) For each group, we show the average and SD (error bar) of ; we use logarithmically spaced groups that correspond by color to the same as in A and B. The ζ value quantifies the scaling of as a function of the normalized coauthor strength . The sublinear () values indicate that collaborations are distributed over a timescale that grows slower than proportional to x; conversely, this means that longer collaborations are more productive, being characterized by increasing marginal returns (). Fig. 3 shows the analogous plot for the Top physics and biology datasets; all four datasets exhibit similar features.
Results
Quantifying the Collaboration Lifetime Distribution.
We use to measure the duration of the productive interaction between i and j. Across researcher profiles, we find that a remarkable 60−80% of the collaborations have year (see Fig. S4). Considering the overwhelming dominance of the events, in this subsection, we concentrate our analysis on the subset of repeat collaborations with that produced two or more publications. Furthermore, due to censoring bias, values estimated for j who are active around the final career year of the data () may be biased toward small values. To account for this bias, in this subsection, we also exclude those collaborations that were active within the final -year period, defining as an initial average value calculated across all j for each i. Then, we calculate a second representative mean value, , which is calculated excluding the j with and the j active in the final -year period. Fig. 2A shows the probability distribution , with mean values ranging from 4 y to 6 y, consistent with the typical duration of an early career position (e.g., PhD or postdoctoral fellow, assistant professor).
Fig. S4.
Additional collaboration profile measures. (A) Cumulative distribution of the number of super ties . The mean (vertical lines) and SD are (Top biology), (Other biology), (Top physics), and (Other physics). The K-S test P value calculated by comparing the biology distributions is 0.12, and, for the physics distributions, it is 0.34; in both cases, the null hypothesis that the two compared datasets arise from the same distribution is not rejected at the 5% level. (B) Cumulative distribution of the empirical (unnormalized) durations (years). The values dominate the distribution, with 0.73 y (Top biology), 0.78 y (Other biology), 0.61 y (Top physics), and 0.58 y (Other physics). Thus, including the values, the mean are 2.2 y (Top biology), 1.8 y (Other biology), 2.7 y (Top physics), and 2.7 y (Other physics). To avoid age cohort bias, collaborations commenced in the final period of each career profile are excluded from these distributions. (C) Cumulative distribution of the productivity premium defined in Eq. S1. The mean and SD are (Top biology), (Other biology), (Top physics), and (Other physics). Only the two physics datasets are significantly similar (K-S ). (D) Cumulative distribution of the citation premium defined in Eq. S5. The mean and SD are: (Top biology), (Other biology), (Top physics), and (Other physics). The K-S test P values calculated by comparing the two Top datasets and the two Other datasets are both greater than 0.05. An interesting and consistent pattern emerges when considering the distributions of both and : The Top scientist profiles have smaller mean values than their counterparts, and the biology profiles have smaller mean value than for physics. The mean, median, and maximum values across all datasets are 14.1, 11.3, and 134, respectively, with all but two values greater than unity. Because the maximum value is an extreme outlier, we truncate the x axes showing only values of <38, which represents more than 95% of the data.
Fig. 2.
Log-logistic distribution of collaboration duration. (A) The probability distribution is right-skewed and well fit by the log-logistic pdf defined in Eq. 1. (Insets) The probability distribution shows that the characteristic collaboration length in physics and biology is typically between 2 y and 6 y. (B) The decrease in the typical collaboration timescale, , reflects how careers transition from being pursuers of collaboration opportunities to attractors of collaboration opportunities.
Establishing statistical regularities across research profiles requires the use of a normalized duration measure, , which controls for author-specific collaboration patterns by measuring time in units of . The empirical distributions are right-skewed, with approximately of the data with (corresponding to ). Nevertheless, ∼1% of collaborations last longer than 15−20 y. Moreover, Fig. 2A shows that the log-logistic probability density function (pdf),
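[1] $$P(\Delta_{ij}) = \frac{(b/a)\,(\Delta_{ij}/a)^{b-1}}{\left[1 + (\Delta_{ij}/a)^{b}\right]^{2}}, \qquad \Delta_{ij} \equiv L_{ij}/\langle L_i \rangle,$$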
provides a good fit to the empirical data over the entire range of . The log-logistic (Fisk) pdf is a well-known survival analysis distribution with property Median. By construction, the mean value , which reduces our parameter space to just b as . For each dataset, we calculate , estimating the parameter using ordinary least squares. Associated with each is a hazard function representing the likelihood that a collaboration terminates for a given . Because , the hazard function is unimodal, with a maximum value occurring at with bounds for and for ; using the best-fit a and b values, we estimate 0.94 (Top biology), 1.11 (Other biology), 0.77 (Top physics), and 1.08 (Other physics). Thus, represents a tipping point in the sustainability of a collaboration, because the likelihood that a collaboration terminates peaks at and then decreases monotonically for . This observation lends further significance to the author-specific time scale . The log-logistic pdf is also characterized by asymptotic power-law behavior for large .
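The properties of the log-logistic form invoked above can be checked numerically. The sketch below assumes the standard two-parameter log-logistic (Fisk) distribution with scale a and shape b, for which the median equals a, the mean equals a(π/b)/sin(π/b) when b > 1, and the hazard function peaks at a(b − 1)^(1/b). The function names and the value b = 2.5 are illustrative choices, not fitted values from the study.

```python
import math


def loglogistic_pdf(x, a, b):
    """Log-logistic probability density with scale a and shape b."""
    u = (x / a) ** b
    return (b / a) * (x / a) ** (b - 1) / (1.0 + u) ** 2


def loglogistic_cdf(x, a, b):
    """Cumulative distribution; equals 1/2 at x = a (the median)."""
    u = (x / a) ** b
    return u / (1.0 + u)


def loglogistic_hazard(x, a, b):
    """Termination hazard: density divided by the survival probability."""
    return loglogistic_pdf(x, a, b) / (1.0 - loglogistic_cdf(x, a, b))


def scale_for_unit_mean(b):
    """Scale a that makes the mean equal to 1 (requires b > 1)."""
    return math.sin(math.pi / b) / (math.pi / b)


def hazard_peak(a, b):
    """Location of the hazard maximum (requires b > 1)."""
    return a * (b - 1.0) ** (1.0 / b)


b = 2.5                      # illustrative shape parameter, not a fitted value
a = scale_for_unit_mean(b)   # unit mean fixes the scale, leaving b as the only free parameter
peak = hazard_peak(a, b)     # normalized duration at which termination is most likely
```

With the mean fixed at one, a single shape parameter determines both the median and the tipping point, so comparing datasets reduces to comparing their fitted b values.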
To determine how the values are distributed across the career, we calculated the mean duration using a 5-y (sliding window) moving average centered around career age t. If the values were distributed independent of t, then . Instead, Fig. 2B shows a negative trend for each dataset. Interestingly, the values are consistently larger for the Top scientists, indicating that the relatively short are more concentrated at larger t. This pattern of increasing access to short-term collaboration opportunities points to an additional positive feedback mechanism contributing to cumulative advantage (30, 31).
Quantifying the Collaboration Life Cycle.
The duration distribution points to the variability of time scales in the scientific collaboration network: although a small number of collaborations last a lifetime, the remainder decay quite quickly in a collaboration environment characterized by a remarkably high churn rate. Because it is possible that a relatively long $L_{ij}$ corresponds to just the minimum two publications, it is also important to analyze the collaboration rate. To this end, we quantify the patterns of growth and decay in tie strength using the more than 166,000 dyadic collaboration records: $K_{ij}(t)$ is the cumulative number of coauthored publications between i and j up to year t, and $\Delta K_{ij}(t) = K_{ij}(t) - K_{ij}(t-1)$ is the annual publication rate.
To define a collaboration trajectory that is better suited for averaging, we normalize each individual annual publication rate $\Delta K_{ij}$ by its peak value,
[2] $$\Delta K'_{ij}(\tau) \equiv \frac{\Delta K_{ij}(\tau)}{\max_{\tau}\left[\Delta K_{ij}(\tau)\right]}.$$
Here $\tau$ is the number of years since the initiation of a given collaboration. This normalization procedure is useful for comparing and averaging time series that are characterized by just a single peak.
Expecting that the collaboration trajectories depend on the tie strength, we grouped the individual peak-normalized trajectories $\Delta K'_{ij}(\tau)$ according to the normalized coauthor strength, $x \equiv K_{ij}/\langle K_i \rangle$. The normalization factor $\langle K_i \rangle$ is the mean tie strength calculated across the $S_i$ distinct collaborators (the collaboration radius of i), and represents an intrinsic collaboration scale that grows in proportion to both an author’s typical collaboration size and his/her publication rate. We then aggregated the $N_x$ trajectories in each group and calculated the average trajectory,
[3] $$\langle \Delta K'(\tau \mid x) \rangle \equiv \frac{1}{N_x} \sum_{\{ij\} \in x} \Delta K'_{ij}(\tau).$$
Indeed, Fig. 3 shows that the collaboration life cycle depends strongly on the relative tie strength . The trajectories with decay over a relatively long time scale, maintaining a value approximately even 20 y after initiation, reminiscent of a “research life partner.” The trajectories with represent common collaborations that decay exponentially over the characteristic time scale . A mathematical side note, useful as a modeling benchmark, is the linear decay when plotted on log-linear axes, suggesting a functional form that is exponential for large τ, .
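The peak normalization and group averaging described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the list-based trajectory format (one annual publication count per year since the first joint publication), and the bin edges are hypothetical, and a collaboration that has already ended is assumed to contribute zero to the group average in later years.

```python
from collections import defaultdict


def normalize_to_peak(annual_counts):
    """Divide an annual publication-rate series by its maximum value."""
    peak = max(annual_counts)  # at least 1 for any observed collaboration
    return [c / peak for c in annual_counts]


def average_trajectories(trajectories, strengths, bin_edges):
    """Average peak-normalized trajectories within groups of normalized strength x.

    trajectories: list of annual-count lists, indexed by years since initiation.
    strengths:    normalized coauthor strength x for each trajectory.
    bin_edges:    ascending edges; group g holds x values at or above exactly g edges.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for traj, x in zip(trajectories, strengths):
        g = sum(x >= edge for edge in bin_edges)
        counts[g] += 1
        for tau, value in enumerate(normalize_to_peak(traj), start=1):
            sums[g][tau] += value
    # Dividing by the full group size treats ended collaborations as zeros.
    return {g: {tau: s / counts[g] for tau, s in sums[g].items()} for g in counts}


# Toy data: two weak ties (x < 1) and one strong tie (x >= 1).
example = average_trajectories(
    [[1, 2, 1], [2, 4], [3, 3, 3, 3]],
    [0.5, 0.6, 4.0],
    [1.0],
)
```

Normalizing before averaging prevents a few highly productive ties from dominating the shape of the group curve, which is why the procedure suits single-peaked series.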
Fig. 3.
Growth and decay of collaboration ties for (A and C) Top biology and (B and D) Top physics. (A and B) Average collaboration intensity, normalized to peak value, measured years after the initiation of the collaboration tie. (Insets) On log-linear axes, the decay appears as linear, corresponding to an exponential form. (C and D) For each group, we show the average and SD (error bar) of ; we use logarithmically spaced groups that correspond by color to the same as in A and B. The ζ value quantifies the scaling of as a function of the normalized coauthor strength . The sublinear () values indicate that collaborations are distributed over a timescale that grows slower than proportional to x; conversely, this means that longer collaborations are relatively more productive, being characterized by increasing marginal returns (). Fig. S3 shows the analogous plot for the Other physics and biology datasets; all four datasets exhibit similar features.
We further emphasize the ramifications of the life cycle variation by quantifying the relation between the normalized coauthor strength x and the collaboration’s half-life $\tau_{1/2}$, defined as the number of years to reach half of the total collaborative output, i.e., $K_{ij}(\tau_{1/2}) = K_{ij}/2$. We observe a scaling relation $\langle \tau_{1/2} \rangle \sim x^{\zeta}$ for the average half-life, with ζ values ranging from 0.4 to 0.5. Sublinear values (ζ < 1) indicate that a collaboration with twice the strength is likely to have a corresponding $\tau_{1/2}$ that is less than doubled. This feature captures the burstiness of collaborative activities, which likely arises from the heterogeneous overlapping of multiple timescales, e.g., the variable contract lengths in science ranging from single-year contracts to lifetime tenure, the overlapping of multiple age cohorts, and the projects and grants themselves, which are typically characterized by relatively short terms. Nevertheless, $x/\langle \tau_{1/2} \rangle \sim x^{1-\zeta}$ is an increasing function of x for ζ < 1, indicating increasing marginal returns with increasing x, further signaling the productivity benefits of long-term collaborations characterized by formalized roles, mutual trust, experience, and group learning, which together can facilitate efficient interactions.
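A half-life and scaling-exponent calculation of the kind described above might look like the following sketch. The half-life is taken here as the first year (counted from 1) in which cumulative output reaches half of the total. Estimating the exponent by ordinary least squares on log-transformed values is an assumption made for this example, as are the function names.

```python
import math


def half_life(annual_counts):
    """First year, counted from 1, at which cumulative output reaches half the total."""
    total = sum(annual_counts)
    running = 0
    for tau, count in enumerate(annual_counts, start=1):
        running += count
        if 2 * running >= total:
            return tau
    return len(annual_counts)  # only reached for an all-zero series


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    var = sum((u - mean_x) ** 2 for u in lx)
    return cov / var


# Synthetic check: half-lives that scale as the square root of strength.
strengths = [1.0, 2.0, 4.0, 8.0]
half_lives = [math.sqrt(x) for x in strengths]
zeta = loglog_slope(strengths, half_lives)
```

An exponent below one on such a fit means that doubling the tie strength less than doubles the half-life, which is the sublinear behavior reported in the text.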
Quantifying the Tie Strength Distribution.
Here we focus on the cross-sectional distribution of tie strengths within the ego network. We use the final tie strength value $K_{ij}$ to distinguish the strong ties ($K_{ij} \geq \langle K_i \rangle$) from the weak ties ($K_{ij} < \langle K_i \rangle$). Fig. 4A shows the cumulative distribution of the mean tie strength $\langle K_i \rangle$, which can vary over a wide range depending on a researcher’s involvement in large-team science activities. We also quantify the concentration of tie strength using the Gini index $G_i$ calculated from each researcher’s $K_{ij}$ values; the distribution of $G_i$ is shown in Fig. 4B. Together, these two measures capture the variability in collaboration strengths across and within disciplines, with physics exhibiting larger $\langle K_i \rangle$ and $G_i$ values.
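The two summary measures used here, mean tie strength and a Gini index of concentration, can be computed from a single researcher's list of tie strengths. The text does not specify which Gini estimator was used, so the sorted-rank formula below is an assumption, as are the function names and the toy data. The weak-tie fraction uses the mean as the dividing line between weak and strong ties.

```python
def mean_tie_strength(tie_strengths):
    """Average number of joint publications per coauthor."""
    return sum(tie_strengths) / len(tie_strengths)


def weak_fraction(tie_strengths):
    """Fraction of ties that fall below the mean tie strength."""
    mean_k = mean_tie_strength(tie_strengths)
    return sum(k < mean_k for k in tie_strengths) / len(tie_strengths)


def gini(values):
    """Gini index from the sorted-rank formula: 0 is equal, near 1 is concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Toy profile: many weak ties and one dominant collaborator.
profile = [1, 1, 1, 2, 3, 12]
g = gini(profile)
w = weak_fraction(profile)
```

A profile dominated by one collaborator yields both a high weak-tie fraction and a high Gini value, so the two measures tend to move together for skewed ego networks.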
Fig. 4.
Characteristic measures of collaboration tie strength. (A) Cumulative distribution of the mean collaboration strength, . The K-S test indicates that the are similar for biology () and significantly different for physics (). Vertical lines indicate median value. (B) Cumulative distribution of . The pairwise K-S test indicates that the are similar for biology () but not for physics (). Vertical lines indicate the mean value, with physics indicating significantly higher than for biology. In (C) biology and (D) physics, for each dataset, the cumulative distribution of normalized collaboration strength shows excellent agreement with the exponential distribution (gray line) over the bulk of the distribution, with the deviations in the tail regime representing less than 0.1% of the data.
Another important author-specific variable is the publication overlap between each researcher and his/her top collaborator. This measure is defined as the fraction of a researcher’s publications including his/her top collaborator, . We observe surprisingly large variation in , with mean and SD in the range of for the Top scientists and for the Other scientists. Across all profiles, the min and max values are 0.03 and 0.99, respectively, representing nearly the maximum possible variation in observed publication overlap. An example of this limiting scenario is shown in Fig. S2, highlighting the “dynamic duo” of J. L. Goldstein and M. S. Brown, winners of the 1985 Nobel Prize in Physiology or Medicine; Goldstein and Brown published more than 450 publications each, with roughly coauthored together. Remarkably, we find that overlaps larger than 50% are not uncommon, observing (biology) and (physics) of i having more than half of their publications with their strongest collaborator.
Fig. S2.
Visualizing the dynamic collaboration profile of individual researchers: the longitudinal coauthor trajectories of (A) Anderson, (B) Geim, (C) Blackburn, and (D) Goldstein; the cross-sectional rank-citation profiles of (E) Anderson, (F) Geim, (G) Blackburn, and (H) Goldstein. For each discipline, we show the collaboration profile of two Nobel laureates (A. K. Geim and J. L. Goldstein) whose top-cited research was done with their most intense collaborator, and two collaboration profiles for two Nobel laureates (P. W. Anderson and E. H. Blackburn) whose top-cited research did not exhibit this feature. Despite their common achievement, we observe a wide variation in the entry, strength, and saturation of their collaborations. To illustrate the variation in tie strength, both within and between researcher profiles, we show the rank−coauthor profile , which is defined for any given t by sorting the coauthors in decreasing order by rank r, . In this way, provides a cross-sectional representation of . As such, snapshots of taken at different t capture the temporal evolution of a researcher’s tie strength distribution, as illustrated by the gray data points in E−H. (A−D) Longitudinal growth of , the cumulative number of publications with coauthor j (colored curves), and the central author’s total number of publications (black curve). To reduce graphical clutter, we truncate each at the year of the last observed collaboration; otherwise, each panel would be dominated by horizontal lines. The gray dashed line indicates , which distinguishes the trajectories corresponding to super ties. The distance between the vertical yellow line and the right edge of each panel indicates the mean collaboration duration, , for each researcher. (E−H) To convey the dynamics of the rank-coauthor profile, we show snapshots of for t = 5 y, 10 y, and 20 y (increasing gray dot size), in addition to the final (colored circles) calculated for the most recently available career year . 
The lower dashed gray line indicates , which separates the weak from the strong ties. The upper dashed gray line indicates , which distinguishes the super ties within the subset of strong ties. Recently, the analog of the h-index has been suggested as a way to measure the “author core” derived from the rank-coauthor distribution (49). For all panels, to facilitate visual comparison, the color scale used in the left and right column is the same for each i. To identify the coauthors with the highest net citation impact, we plot curves (circles) using thickness (radius) and color that are scaled proportional to , which is the log of the total citation share of coauthor j in profile i (see Eq. S4).
However, within a researcher profile, it is likely that more than just the top collaborator was central to his/her career. Indeed, key to our investigation is the identification of the extremely strong collaborators, the super ties, that are distinguished within the subset of strong ties. Hence, using the empirical information contained within each researcher’s tie strength distribution, $P(K_{ij})$, we develop an objective super tie criterion that is author specific. First, to gain a better understanding of the statistical distribution of $K_{ij}$, we aggregated the tie strength data across all research profiles, using the normalized collaboration strength $x \equiv K_{ij}/\langle K_i \rangle$. Fig. 4 C and D shows the cumulative distribution of x for each discipline. Each is in good agreement with the exponential distribution (with mean value $\langle x \rangle = 1$ by construction), with the exception of the far tail, which is home to extreme collaborator outliers. Thus, by a second means, we find that roughly 2/3 of the ties we analyzed are weak (i.e., the fraction of observations with $x < 1$ is given by $1 - 1/e \approx 0.63$).
Based upon this empirical evidence, we use the discrete exponential distribution as our baseline model, $P(K_{ij}) \propto \exp(-\kappa_i K_{ij})$. We then use extreme statistics arguments to precisely define the author-specific super tie threshold $K_i^c$. The extreme statistic criterion posits that, out of the $S_i$ empirical observations, there should be just a single observation with $K_{ij} > K_i^c$. The threshold is operationalized by integrating the tail of $P(K_{ij})$ according to the equation $1/S_i = \sum_{K_{ij} > K_i^c} P(K_{ij}) = \exp(-\kappa_i K_i^c)$, with the analytic relation $\langle K_i \rangle - 1 \approx 1/\kappa_i$ for small $\kappa_i$. In this limit, $K_i^c$ is given by the simple relation
[4] $$K_i^c = (\langle K_i \rangle - 1) \ln S_i.$$
The advantage of this approach is that is nonparametric, depending only on the observables and . Thus, the super tie threshold is proportional to (the arises because the minimum value is 1), with a logarithmic factor reflecting the sample size dependence. This extreme value criteria is generic, and can be derived for any data following a baseline distribution; for a succinct explanation of this analytic method, see page 17 of ref. 32.
In what follows, we label each coauthor j whose tie strength exceeds the threshold in Eq. 4 as a super tie, with the indicator variable set to 1. The remaining ties have the indicator variable set to 0. This method has limitations, specifically in the case that the collaboration profile does not follow an exponential distribution. For example, consider the extreme case where every tie has strength 1, meaning that the mean tie strength is 1 and the threshold is 0 (independent of the number of coauthors), resulting in all coauthors being super ties (indicator equal to 1 for all j). This scenario is rare and unlikely to occur for researchers with a relatively large mean tie strength and number of coauthors, as in our researcher sample.
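The labeling procedure can be illustrated with a minimal Python sketch. It assumes the threshold takes the form (mean tie strength minus 1) times the natural log of the number of coauthors, and that a tie at or above the mean counts as strong; the boundary convention, the function names, and the example profile are hypothetical and not taken from the study data.

```python
import math

def super_tie_threshold(tie_strengths):
    """Return (<K> - 1) * ln(S) for one researcher's list of tie strengths."""
    s = len(tie_strengths)                 # number of distinct coauthors
    mean_k = sum(tie_strengths) / s        # mean tie strength
    return (mean_k - 1.0) * math.log(s)

def classify_ties(tie_strengths):
    """Label each tie as 'weak', 'strong', or 'super'."""
    mean_k = sum(tie_strengths) / len(tie_strengths)
    k_c = super_tie_threshold(tie_strengths)
    labels = []
    for k in tie_strengths:
        if k > k_c:
            labels.append("super")         # exceeds the extreme-value threshold
        elif k >= mean_k:
            labels.append("strong")        # assumed convention: at or above the mean
        else:
            labels.append("weak")
    return labels

# Hypothetical profile: joint publication counts with 10 coauthors
profile = [1, 1, 1, 1, 2, 2, 3, 4, 5, 30]
threshold = super_tie_threshold(profile)   # (5 - 1) * ln(10)
labels = classify_ties(profile)
degenerate = classify_ties([1, 1, 1])      # every tie has strength 1
```

In the degenerate profile the threshold is 0, so every coauthor is labeled a super tie, which reproduces the limitation described above.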
Quantifying the Prevalence and Impact of Super Ties.
How common are super ties? For each profile, we denote the number of coauthors that are super ties by (with complement ). Fig. S4 shows that the distribution of is rather broad, with mean and SD values (Top biology), (Other biology), (Top physics), and (Other physics). The super tie coauthor fraction, , measures the super tie frequency on a per-collaborator basis, with mean value (i.e., typically one super tie for every 25 coauthors). Furthermore, Fig. 5A shows that the distribution is common across the four datasets. We tested the universality of the probability distribution between the Top and Other researcher datasets using the Kolmogorov−Smirnov (K-S) statistic, which tests the null hypothesis that the data come from the same underlying pdf. The smallest pairwise K-S test P value between any two is , indicating that we fail to reject the null hypothesis that the distributions are equal, highlighting that the four datasets are remarkably well matched with respect to the distribution of .
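For readers unfamiliar with the two-sample K-S comparison, the following Python sketch computes the test statistic, which is the largest vertical gap between two empirical cumulative distributions. It is an illustration only: the sample values are hypothetical, and the P value (not computed here) would normally come from a statistics library such as `scipy.stats.ks_2samp`.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: maximum gap between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n_a, n_b = len(a), len(b)
    i = j = 0
    d_max = 0.0
    while i < n_a and j < n_b:
        x = min(a[i], b[j])
        # Step past every value equal to x in both samples (handles ties)
        while i < n_a and a[i] == x:
            i += 1
        while j < n_b and b[j] == x:
            j += 1
        d_max = max(d_max, abs(i / n_a - j / n_b))
    return d_max

# Hypothetical super tie fractions for two small groups of researchers
group_1 = [0.02, 0.03, 0.04, 0.05]
group_2 = [0.03, 0.04, 0.05, 0.06]
d = ks_statistic(group_1, group_2)
```

A small statistic relative to the sample sizes corresponds to a large P value, in which case the null hypothesis of a common underlying distribution is not rejected.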
Fig. 5.
The frequency of super ties. Vertical lines indicate the distribution mean. (A) Cumulative distribution of the fraction of the coauthors that are super ties. All pairwise comparisons of the distributions have K-S P values greater than 0.21, indicating a common underlying distribution . (B) Cumulative distribution of the fraction of publications that include at least one super tie coauthor. The Top scientist distributions show mean values that are significantly smaller than their counterparts. (C) Cumulative distribution of the fraction of publications coauthored with his/her top collaborator. The mean and SD for biology (Top) is , for biology (Other) is , for physics (Top) is , and for physics (Other) is . (D) The mean rate of super ties per new collaboration, , averaged over all of the profiles in each dataset using observations aggregated over consecutive 3-y periods.
On a per-paper basis, Fig. 5B shows that the fraction of a researcher’s portfolio coauthored with at least one super tie, , can vary over the entire range of possibilities, with mean and SD (Top biology), (Other biology), (Top physics), and (Other physics). Furthermore, we found that 41% of the Top scientists have . Interestingly, the distributions of and indicate that top scientists have lower levels of super tie dependency than their counterparts.
We also analyzed the arrival rate of super ties. For each profile, we tracked the number of super ties initiated in year t and normalized this number by the total number of new collaborations initiated in the same year. This ratio, , estimates the likelihood that a new collaboration eventually becomes a super tie as a function of career age t. For example, using the set of collaborations initiated in each scientist’s first year, we estimate the likelihood that a first-year collaborator (mentor) becomes a super tie at (Top biology), (Other biology), (Top physics), and (Other physics). Fig. 5D shows the mean arrival rate, , calculated by averaging over all profiles in each dataset. The super tie arrival rate declines across the career, reaching a 5% likelihood per new collaborator at and 2.5% likelihood by . The decay is not as fast for the top-cited scientists, possibly reflecting their preferential access to outstanding collaborators. However, the estimate for large t is biased toward smaller values because collaborations initiated late in the career may not have had sufficient time to grow.
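The arrival rate calculation can be sketched in Python as follows. This is a simplified illustration that works year by year rather than over the 3-y windows used for Fig. 5D; the record layout (a start year and a super tie flag per collaboration) and the example profiles are assumptions made for the sketch.

```python
from collections import defaultdict

def super_tie_arrival_rate(collaborations):
    """collaborations: list of (start_year, is_super_tie) for one profile.
    Returns {career_year: fraction of new collaborations that became super ties}."""
    new_ties = defaultdict(int)
    new_super = defaultdict(int)
    for start_year, is_super in collaborations:
        new_ties[start_year] += 1
        if is_super:
            new_super[start_year] += 1
    return {t: new_super[t] / new_ties[t] for t in sorted(new_ties)}

def mean_arrival_rate(profiles):
    """Average the per-profile rates, year by year, over profiles with data in that year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for collaborations in profiles:
        for t, rate in super_tie_arrival_rate(collaborations).items():
            totals[t] += rate
            counts[t] += 1
    return {t: totals[t] / counts[t] for t in sorted(totals)}

# Two hypothetical profiles
profile_a = [(1, True), (1, False), (2, False), (2, False)]
profile_b = [(1, False), (1, False), (2, True), (2, False), (2, False), (2, False)]
rates_a = super_tie_arrival_rate(profile_a)
mean_rates = mean_arrival_rate([profile_a, profile_b])
```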
In The Apostle Effect I and The Apostle Effect II, we investigate the role of super ties at the microlevel by analyzing productivity at the annual time resolution and the citation impact of individual publications. In SI Text, we provide additional evidence for the advantage of super ties by developing descriptive methods that measure the net productivity and citations of the super ties relative to all other ties.
The Apostle Effect I: Quantifying the Impact of Super Ties on Annual Productivity.
We analyzed each research profile over the career years , separating the data into nonoverlapping -year periods, and neglecting the first 5 y to allow the and sufficient time to grow. We then modeled the dependent variable, , which is the productivity aggregated over -year periods, normalized by the baseline average calculated over the period of analysis. Recent analysis of assistant and tenured professors has shown that the annual publication rate is governed by slow but substantial growth across the career, with fluctuations that are largely related to collaboration size (24).
To better understand the factors contributing to productivity growth, we include controls for career age t along with four additional variables measuring the composition of collaborators from each -year period. First, we calculated the average number of authors per publication, , a proxy for labor input, coordination costs, and the research technology level. Second, we calculated the mean duration, , by averaging the values (from the previous period) across only the j who are active in t, i.e., those coauthors with . In this way, we account for the possibility that j was not active in the previous period , in which case is even smaller than . Thus, measures the prior experience between i and his/her collaborators. Third, for the same set of coauthors as for , we calculated the Gini index of the collaboration strength, , using the tie strength values up to the previous period, . Thus, provides a standardized measure of the dispersion in coauthor activity, with values ranging from 0 (all coauthors published equally in the past with i) to 1 (extreme inequality in prior publication with i). Thus, whereas measures the lifetime of the group’s prior collaborations, measures the concentration of their prior experience. Finally, for each period t, we calculated the contribution of super tie collaborators normalized by the contribution of all other collaborators,
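$$\rho_{i,t} \equiv \frac{\sum_{j \in \text{super ties}} \Delta K_{ij,t}}{\sum_{j \notin \text{super ties}} \Delta K_{ij,t}}, \qquad [5]$$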
accounting for the possibility that the relative contribution of super ties may affect productivity. Although the total coauthor contribution is highly correlated with , the correlation coefficient between and is only 0.07. We only include researchers in this analysis if there are data points for which the denominator of Eq. 5 is nonzero.
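Two of the period-level covariates described above, the Gini index of prior tie strength and the relative super tie contribution, can be sketched in Python. The Gini estimator shown (based on the mean absolute difference) is one common choice and may differ from the estimator used in the analysis; the coauthor labels and counts are hypothetical.

```python
def gini(values):
    """Gini index via the mean absolute difference; 0 means all values are equal."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    abs_diff_sum = sum(abs(x - y) for x in values for y in values)
    return abs_diff_sum / (2.0 * n * total)

def super_tie_contribution(period_pubs, super_ties):
    """period_pubs: {coauthor: publications with the central author in this period}.
    Returns super tie publications divided by all other coauthors' publications,
    or None when the denominator is zero (such periods are excluded)."""
    super_sum = sum(k for j, k in period_pubs.items() if j in super_ties)
    other_sum = sum(k for j, k in period_pubs.items() if j not in super_ties)
    if other_sum == 0:
        return None
    return super_sum / other_sum

# Hypothetical period: coauthor "A" is the only super tie
period = {"A": 3, "B": 2, "C": 2, "D": 2}
rho = super_tie_contribution(period, {"A"})    # 3 / 6
equal_history = gini([2, 2, 2])                # perfectly equal prior activity
unequal_history = gini([0, 0, 10])             # one coauthor holds all prior activity
```

Note that with this estimator the maximum value for n coauthors is (n - 1)/n, which approaches 1 only for large groups.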
We implemented a fixed-effects regression of the model
[6] |
which accounts for author-specific time-invariant features (), using robust SEs to account for autocorrelation within each i. Because the predictors are calculated from the same ego profile, covariance is expected; for example, the highest correlation coefficient between any two independent variables is 0.32 between and , because the variance in increases proportional to the sample size (i.e., ). Table 1 shows the results of our model estimates for year, and Table S1 shows the results for years. We also ran the regression for all of the datasets together, “All,” and provide standardized coefficients that better facilitate a comparison of the coefficient magnitudes.
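The idea behind the fixed-effects estimator can be illustrated with a minimal Python sketch. It is deliberately simplified to a single explanatory variable and omits the robust SE calculation, so it is not the full model of Eq. 6; the data and names are hypothetical. Subtracting each author's own mean from both variables removes any author-specific time-invariant offset before the slope is estimated.

```python
from collections import defaultdict

def within_estimator(rows):
    """rows: list of (author_id, x, y). Returns the fixed-effects slope of y on x."""
    by_author = defaultdict(list)
    for author, x, y in rows:
        by_author[author].append((x, y))
    sxy = 0.0
    sxx = 0.0
    for obs in by_author.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            # Demeaning within each author removes that author's fixed effect
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx

# Two hypothetical authors with different baselines but the same slope of 0.5
rows = [
    ("a", 0.0, 1.0), ("a", 1.0, 1.5), ("a", 2.0, 2.0),
    ("b", 0.0, 3.0), ("b", 2.0, 4.0), ("b", 4.0, 5.0),
]
slope = within_estimator(rows)
```

A pooled regression that ignored the author labels would mix the difference in baselines into the slope estimate, which is what the within transformation avoids.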
Table 1.
Parameter estimates for the productivity model for ni,t in Eq. 6 using -y-long periods
Dataset | A | Mean team size | Mean prior duration | Gini index | Super tie contribution | Career age t | Observations | Adj. R²
All | 466 | | | | | | 8,483 | 0.19
(Std. coeff.) | | | | | | | |
P value | | 0.943 | 0.000 | 0.000 | 0.000 | 0.000 | |
Biology (Top) | 99 | | | | | | 2,202 | 0.24
P value | | 0.031 | 0.519 | 0.000 | 0.000 | 0.000 | |
Biology (Other) | 95 | | | | | | 1,467 | 0.29
P value | | 0.275 | 0.008 | 0.000 | 0.003 | 0.000 | |
Physics (Top) | 100 | | | | | | 2,056 | 0.15
P value | | 0.012 | 0.002 | 0.000 | 0.000 | 0.000 | |
Physics (Other) | 172 | | | | | | 2,758 | 0.15
P value | | 0.079 | 0.000 | 0.000 | 0.000 | 0.000 | |
Each fixed-effects model was calculated using robust SEs, implemented by the Huber/White/sandwich method. Values significant at the level are indicated in boldface. Std. coeff., the estimates of the standardized (beta) coefficients; All, the combination of all datasets.
Table S1.
Apostle effect productivity model (): Parameter estimates for the fixed-effects regression model in Eq. 6 with -y-long periods, using robust SEs implemented by the Huber/White/sandwich method
Dataset | A | Mean team size | Mean prior duration | Gini index | Super tie contribution | Career age t | Observations | Adj. R²
All | 406 | | | | | | 2,890 | 0.16
(Std. coeff.) | | | | | | | |
P value | | 0.004 | 0.000 | 0.000 | 0.000 | 0.000 | |
Biology (Top) | 99 | | | | | | 782 | 0.24
P value | | 0.110 | 0.199 | 0.000 | 0.016 | 0.000 | |
Biology (Other) | 84 | | | | | | 492 | 0.31
P value | | 0.184 | 0.104 | 0.000 | 0.146 | 0.000 | |
Physics (Top) | 99 | | | | | | 753 | 0.11
P value | | 0.514 | 0.000 | 0.000 | 0.000 | 0.000 | |
Physics (Other) | 124 | | | | | | 863 | 0.13
P value | | 0.047 | 0.001 | 0.000 | 0.000 | 0.000 | |
See Table 2 for results with . Only profiles with four or more data values were included in the regression. Values significant at the level are indicated in boldface.
We observed a positive coefficient ( for all datasets), meaning that larger contributions by super ties are associated with above-average productivity. By way of example, consider a scenario where the super ties contribute a third of the total coauthor input, corresponding to , the average value we observed. Consider a second scenario with , corresponding to equal input by the super ties and their counterparts ( for 14% of the observations). If all other parameters contribute a baseline productivity value 1, then the additional contribution from corresponds to a % productivity increase. This value is consistent with the productivity spillover observed in a study of star scientists (33).
We also found that periods corresponding to higher levels of prior experience are associated with below-average productivity (, for all datasets except for Top biology). Despite the costs associated with tie formation, this result demonstrates that productivity can benefit from collaborator turnover. Nevertheless, above-average productivity is associated with higher inequality in the concentration of prior experience (, level for all datasets). Together, these results point to the benefits of strategically pairing new collaborators with incumbent ones to promote the atypical combination of knowledge backgrounds and to achieve higher scientific impact (34). The standardized coefficients in Table 1 indicate that is twice as strong as and ; interestingly, and have opposite signs yet are balanced in magnitude, suggesting a compensation strategy for group managers.
The age coefficient is also positive ( level for all datasets), consistent with patterns of steady productivity growth observed for successful research careers (5, 24, 31). Possible explanatory variables to consider in extended analyses are the SD in , a contact frequency () measure of tie strength intensity per Granovetter’s original operationalization (10), and absolute calendar year y, variables that we omit here to keep the model streamlined.
The Apostle Effect II: Quantifying the Impact of Super Ties on the Long-Term Citation of Individual Publications.
The impact of super ties on a publication’s long-term citation tally is difficult to measure, because, clearly, older publications have had more time to accrue citations than newer ones—a type of censoring bias—and so a direct comparison of raw citation counts for publications from different years is technically flawed. To address this measurement problem, we map each publication’s citation count in census year to a normalized z score,
$$z_{i,p} \equiv \frac{\ln c_{i,p} - \langle \ln c_{y} \rangle}{\sigma[\ln c_{y}]}. \qquad [7]$$
This citation measure is well suited for the comparison of publications from different y because is measured relative to the mean number of citations by publications from the same year y, in units of the SD, (31). Thus, we take advantage of the fact that the distribution of citations obeys a universal log-normal distribution for p from the same y and discipline (35). In this way, z is defined such that the distribution is sufficiently time invariant. To confirm this property, we aggregated within successive 8-y periods, and calculated the conditional distributions , which are stable and approximately normally distributed over the entire sample period (Fig. S5).
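The normalization can be sketched in Python under two stated assumptions: that the z score is computed on log citation counts (suggested by the log-normal citation distribution noted above), and that the SD is the population SD of the baseline set. Both choices, the function name, and the toy baseline values are illustrative rather than taken from the study.

```python
import math
import statistics

def citation_z_score(citations, baseline_citations):
    """Normalize one publication's citation count against a same-year baseline set.
    Assumes all counts are positive so that the logarithm is defined."""
    logs = [math.log(c) for c in baseline_citations]
    mu = statistics.mean(logs)          # mean log citations for the publication year
    sigma = statistics.pstdev(logs)     # population SD of log citations for that year
    return (math.log(citations) - mu) / sigma

# Hypothetical baseline of three publications from the same year
baseline = [10, 100, 1000]
z_typical = citation_z_score(100, baseline)    # at the log-mean of the baseline
z_high = citation_z_score(1000, baseline)      # above the log-mean
```

Because each publication is compared only with publications from its own year, older and newer publications end up on a common scale.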
Fig. S5.
Distribution of normalized citation impact z. Each panel shows the pdf using z values aggregated over successive nonoverlapping 8-y periods. These panels demonstrate the distribution stability of over time, where z is the dependent variable in the citation apostle effect model in Eq. 8.
To define the detrending indices and , we use the baseline journal set m comprising all research articles collected from the journals Nature, Proceedings of the National Academy of Sciences, and Science. We use this aggregation of three multidisciplinary journals only to control for the time-dependent feature of citation counts. We chose these journals as our baseline because they have relatively large impact factors (high citation rates), and so the temporal information contained in and is less noisy than other m with lower citation rates. Furthermore, because most publications reach their peak citation rate within 5−10 y after publication (5), we only analyze with . In this way, the values we analyze are less sensitive to fluctuations early in the citation lifecycle, in addition to recent paradigm shifts in science such as the Internet, which affects the search, the retrieval, and the citation of prior literature, and the rise of open access publishing.
In our regression model, we use five explanatory variables that are author (i) and publication (p) specific. The first is the number of coauthors, $a_p$, which controls for the tendency for publications with more coauthors to receive more citations (4). This variable is also a gross measure of technology and coordination costs, because larger teams typically reflect endeavors with higher technical challenge distributed across a wider range of skill sets. We use $\ln a_p$ because the range of $a_p$ values is rather broad, appearing to be approximately log-normally distributed in the right tail (7). The second explanatory variable is the dummy variable $R_p$, which takes the value 1 if p includes a super tie and the value 0 otherwise. Remarkably, the percentage of publications including a super tie is rather close to parity for three of the four datasets: 54% (Top biology), 45% (Top physics), 74% (Other biology), and 54% (Other physics). The third variable, $t_p$, is the career age of i at the time of publication. The fourth variable, $N_i(t_p)$, is the total number of publications up to year $t_p$, which is a non-citation-based measure of the central author’s reputation, visibility, and experience within the scientific community. The final explanatory variable is the collaboration radius, $S_i(t_p)$, which is the cumulative number of distinct coauthors up to $t_p$, representing the central author’s access to collaborative resources, as well as an estimate of the number of researchers in the local community who, having published with i, may preferentially cite i. Hence, by including $N_i(t_p)$ and $S_i(t_p)$, we control for two dimensions of cumulative advantage that could potentially affect a publication’s citation tally.
We then implement a fixed-effects regression to estimate the parameters of the citation impact model,
[8] |
using the Huber/White/sandwich method to calculate robust SE estimates that account for heteroskedasticity and within-panel serial correlation in the idiosyncratic error term . We excluded publications with , and, in order that the Top and Other datasets are well balanced, we also excluded the Other researchers with less than 43 (biology) and 33 (physics) publications (observations) as of 2003. Table 2 lists the (standardized) parameter estimates. We provide the data used for both regression models in Dataset S1.
Table 2.
Parameter estimates for the citation model for zi,p in Eq. 8 using only the publications with
Dataset | A | Number of coauthors | Super tie indicator | Career age | Publications to date | Collaboration radius | Observations | Adj. R²
All | 377 | | | | | | 68,589 | 0.27
(Std. coeff.) | | | | | | | |
P value | | 0.000 | 0.000 | 0.000 | 0.347 | 0.367 | |
Biology (Top) | 100 | | | | | | 22,135 | 0.12
P value | | 0.000 | 0.000 | 0.000 | 0.177 | 0.578 | |
Biology (Other) | 55 | | | | | | 4,801 | 0.20
P value | | 0.000 | 0.026 | 0.040 | 0.065 | 0.029 | |
Physics (Top) | 100 | | | | | | 22,673 | 0.19
P value | | 0.002 | 0.000 | 0.000 | 0.021 | 0.380 | |
Physics (Other) | 122 | | | | | | 18,980 | 0.19
P value | | 0.000 | 0.000 | 0.000 | 0.389 | 0.870 | |
Each fixed-effects model was calculated using robust SEs, implemented by the Huber/White/sandwich method. Values significant at the level are indicated in boldface. Std. coeff., the estimates of the standardized (beta) coefficients; All, the combination of all datasets.
We estimated a positive super tie coefficient ($P \le 0.026$ in each regression; Table 2), indicating a significant relative citation increase when a publication is coauthored with at least one super tie. The standardized team size and super tie coefficients are roughly equal, meaning that increasing the number of coauthors from 1 (a solo author publication) to $e \approx 2.7$ coauthors produces roughly the same effect as a change in the super tie indicator from 0 to 1. Thus, although larger team size correlates with more citations (4), the relative strength of the super tie coefficient stresses the importance of who in addition to how many.
Interestingly, the career age parameter is negative (significant at the level in each regression), meaning that researchers’ normalized citation impact decreases across the career, possibly due to finite career and knowledge life cycles. This finding is consistent with a large-scale analysis of researcher histories within high-impact journals, which also shows a negative trend in the citation impact across a career (31). Neither the reputation () nor collaboration radius () parameters were consistently statistically significant in explaining , likely because they are highly correlated with for established researchers. Modifications to consider in followup analysis are controls for the impact factor of the journal publishing p, the absolute year y to account for shifts in citation patterns in the post-Internet era, and removing self-citations from super ties. Unfortunately, this last task requires a substantial increase in data coverage, far beyond the relatively small amount needed to construct individual ego network collaboration profiles.
We develop three additional descriptive methods in SI Text to compare the subset of publications with at least one super tie to the complementary subset of publications without one. These investigations provide further evidence for the apostle effect. First, we defined an aggregate career measure, the productivity premium (see Eq. S1), which measures the average value among the super ties relative to all of the other collaborators. Second, we defined a similar career measure, the citation premium (see Eq. S5), which quantifies the average citation impact attributable to super ties relative to all of the other collaborators.
Independent of dataset, we observed rather substantial premium values. For example, the productivity premium has an average value , meaning that on a per-collaborator basis, productivity with super ties is roughly 8 times higher than with the remaining collaborators. Similarly, the citation premium is also significantly right-skewed, with average value , meaning that net citation impact per super tie is 14 times larger than the net citation impact from all other collaborators. We emphasize that appropriately accounts for team size by using an equal partitioning of citation credit across the coauthors, remedying the multiplicity problem concerning citation credit.
Third, we calculated an additional estimation of the publication-level citation advantage due to super ties (Fig. S6). For both biology and physics, we found that the publications with super ties receive roughly 17% more citations than their counterparts. In basic terms, this means that the average publication with a super tie has 21 more citations in biology and 8 more citations in physics than the average publication without a super tie. This is not a tail effect, because the citation boost factor applies a multiplicative shift to the entire citation distribution, , thereby impacting publications above and below the average.
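The two ways of estimating the per-publication citation advantage (a ratio of distribution means, and a within-researcher comparison of medians) can be sketched in Python. The inputs are assumed to be citation counts that have already been detrended; the function names and the numbers are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import statistics

def citation_boost(with_super, without_super):
    """Ratio of mean citations: publications with a super tie vs. those without."""
    return statistics.mean(with_super) / statistics.mean(without_super)

def share_of_researchers_with_premium(profiles):
    """profiles: list of (citations_with_super, citations_without_super) per researcher.
    Returns the fraction whose with-super median exceeds the without-super median."""
    above = sum(
        1
        for with_s, without_s in profiles
        if statistics.median(with_s) > statistics.median(without_s)
    )
    return above / len(profiles)

# Hypothetical pooled citation counts
boost = citation_boost([117, 234], [100, 200])    # 175.5 / 150

# Hypothetical per-researcher citation lists
profiles = [
    ([50, 60, 70], [40, 45, 50]),
    ([30, 35], [20, 25]),
    ([10, 12], [15, 18]),
    ([80, 90], [60, 70]),
]
share = share_of_researchers_with_premium(profiles)
```

A boost of 1.17 corresponds to a 17% citation increase, and the share plays the role of the fraction of researchers lying above the diagonal in a median-versus-median scatter plot.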
Fig. S6.
Comparing the citation distribution for papers with and without super ties: (A and C) Top and Other biology datasets combined, and (B and D) Top and Other physics datasets combined. (A and B) The cumulative citation distribution, , of the detrended citations defined in Eq. S2. The solid orange curve represents for publications with , and the dashed black curve represents for publications with . Pairwise comparison of the distributions yield K-S P values less than , indicating that the distributions are significantly different. The distribution means are indicated by the vertical lines with corresponding numerical value shown in each panel. The ratio between the means yields the value 1.17 for biology and 1.16 for physics. Estimating using the ratio of the median values yields approximately the same value. Thus, represents a 16−17% citation boost for p with , which translates, on average, to a 21-citation difference for biology and an 8-citation difference for physics. (C and D) Scatter plots of the median values for p with versus the median values for p with . Values are calculated within researcher profiles; thus each dot represents a single researcher. The majority of researchers have , with 73% of the biology researchers and 76% of the physics researchers above the (dashed black) line. The μ value estimates the per-publication citation premium that accounts for heterogeneity across i. Because , these two methods yield consistent estimates of the citation premium per publication.
Discussion
The characteristic collaboration size in science has been steadily increasing over the last century (4, 7, 21), with consequences at every level of science, from education and academic careers to universities and funding bodies (8). Understanding how this team-oriented paradigm shift affects the sustainability of careers, the efficiency of the science system, and society’s capacity to overcome grand challenges will be of great importance to a broad range of scientific actors, from scientists to science policy makers.
Collaborative activities are also fundamental to the career growth process, especially in disciplines where research activities require a division of labor. This is especially true in biology and physics research, where computational, theoretical, and experimental methods provide complementary approaches to a wide array of problems. As a result, a contemporary research group leader is likely to find the assembly of a team, one composed of individuals with diverse yet complementary skill sets, a daunting task, especially when under constraints to optimize financial resources, valuable facilities, and other material resources. Online social network platforms, such as VIVO (www.vivoweb.org/) and Profiles RNS (profiles.catalyst.harvard.edu/), which serve as match-making recommendation systems, have been developed to ease the challenges of team assembly.
Our analysis indicates that 2/3 of the collaborations analyzed here are weak. Nevertheless, the remaining strong ties represent social capital investments that can indeed have important long-term implications, for example, on information spreading (17), career paths (36), and access to key strategic resources (37). In the private sector, strong ties facilitate access to new growth opportunities, playing an important role in sustaining the competitiveness of firms and employees (38). These considerations further identify why it is important for researchers to understand the opportunities that exist within their local network. Understanding the redundancies in the local network (39) and the interaction capacity of team members (25) can help a group leader optimize group intelligence (26) and monitor team efficiency (24), thereby constituting a source of strategic competitive advantage.
In summary, we developed methods to better understand the diversity of collaboration strengths. We focused on the career as the unit of analysis, operationalized by using an ego perspective so that collaborations, publications, and impact scores fit together into a temporal framework ideal for cross-sectional and longitudinal modeling. Analyzing more than 166,000 collaborations, we found that a remarkable 60−80% of the collaborations last only year. Within a subset of repeat collaborations ( 2 y), we find that roughly 2/3 of these collaborations last less than a scientist’s average duration 5 y, yet 1% last more than y. This wide range in duration and the disparate frequencies of long and short together point to the dichotomy of burstiness and persistence in scientific collaboration. Closer inspection of individual career paths signals how idiosyncratic events, such as changing institutions or publishing a seminal study or book, can have significant downstream impact on the arrival rate of new collaboration opportunities and tie formation (see Fig. 1 and Fig. S1). Also, the frequency of relatively large publication overlap measures ( and ) indicates that career partners occur rather frequently in science.
In the first part of the study, we provided descriptive insights into basic questions such as how long are typical collaborations, how often does a scientist pair up with his/her main collaborator, and what is the characteristic half-life of a collaboration. We also found that as the career progresses, researchers become attractors rather than pursuers of new collaborations. This attractive potential can contribute to cumulative advantage (30, 31), as it provides select researchers access to a large source of collaborators, which can boost productivity and increase the potential for a big discovery.
We operationalized tie strength using an egocentric perspective of the collaboration network. Because the number of publications between the central scientist i and a given coauthor j was found to be exponentially distributed, the mean value is a natural author-specific threshold that distinguishes the strong () from the weak ties (). Within the subset of strong ties, we identified super tie outliers using an analytic extreme-statistics threshold defined in Eq. 4. Also, because the number of publications produced by a collaboration is highly correlated with its duration, a super tie also represents persistence that is in excess of the stochastic churn rate that is characteristic of the scientific system. On a per-collaborator basis, the fraction of coauthors within a research profile that are super ties () was remarkably common across datasets, indicating that super ties occur at an average rate of 1 in 25 collaborators.
There are various candidate explanations for why such extremely strong collaborations exist. Prosocial motivators may play a strong role, i.e., for some researchers, doing science in close community may be more rewarding than going it alone. Also, the search and formation of a compatible partnership requires time and other social capital investment, i.e., networking. Hence, for two researchers who have found a collaboration that leverages their complementarity, the potential benefits of improving on their match are likely outweighed by the long-term returns associated with their stable partnership. Complementarity, and the greater skill set the partnership brings, can also provide a competitive advantage by way of research agility, whereby a larger collective resource base can facilitate rapid adjustments to new and changing knowledge fronts, thereby balancing the risks associated with changing research direction. After all, a first-mover advantage can make a significant difference in a winner-takes-all credit and reward system (2).
Scientists may also strategically pair up to share costs, rewards, and risk across their careers. In this light, an additional incentive to form super ties may be explained, in part, by the benefits of reward sharing in the current scientific credit system, wherein publication and citation credit arising from a single publication are multiplied across the coauthors in everyday practice. Considered in this way, the career risk associated with productivity lulls can be reduced if a close partnership is formed. For example, we observed a few “twin profiles” characterized by a publication overlap fraction between the researcher and his/her top collaborator that was nearly 100%. Moreover, we found that 9% of the biologists and 20% of the physicists shared 50% or more of their papers with their top collaborator. This highlights a particularly difficult challenge for science, which is to develop a credit system that appropriately divides the net credit but, at the same time, does not reduce the incentives for scientists to collaborate (8, 27–29). Thus, it will be important to consider these relatively high levels of publication and citation overlap in the development of quantitative career evaluation measures; otherwise, there is no penalty to discourage coauthor free riding (7).
We concluded the analysis by implementing two fixed-effects regression models to determine the sign and strength of the apostle effect represented by (productivity) and (citations). Together, these two coefficients address the fundamental question: Is there a measurable advantage associated with heavily investing in a select group of research partners?
In the first model, we measured the impact of super ties on a researcher’s annual publication rate, controlling for career age, average team size, the prior experience of i with his/her coauthors, and the relative contribution of super ties within year t as measured by in Eq. 5. We found larger to be associated with above-average productivity (), indicating that super ties play a crucial role in sustaining career growth. We also found increased levels of prior experience to be associated with decreased productivity (), suggesting that maintaining older ties conflicts with the potential benefits from mixing new collaborators into the environment. Nevertheless, higher inequality in the concentration of prior experience was found to have a counterbalancing positive effect on productivity ().
In the second regression model, we analyzed the impact of super ties on the citation impact of individual publications, using the detrended citation measure defined in Eq. 7. This citation measure is normalized within publication year cohorts, thus allowing for a comparison of citation counts for research articles published in different years. We found that publications coauthored with super ties, corresponding to 52% of the papers we analyzed, have a significant increase in their long-term citations (a positive super tie coefficient in Eq. 8). In SI Text, we provide additional evidence for the apostle effect, showing that publications with super ties receive 17% more citations. This added value may arise from the extra visibility the publications receive, because the super tie collaborator may also contribute a substantial reputation and future productivity that promote the visibility of the publication. This type of network-mediated reputation spillover is corroborated by a recent study finding a significant citation boost attributable to a researcher’s centrality within the collaboration network (40).
This data-oriented analysis also contributes to the literature on the science of science policy (41), providing insight and guidance in an increasingly metrics-based evaluation system on how to account for individual achievement in team settings. As such, we conclude with some policy recommendations. One particularly relevant scenario is fellowship, tenure, and career award evaluations, where it is a common practice to consider “independence from one’s thesis advisor” as a selection criterion. We show that to assess a researcher’s independence, evaluation committees should also take into consideration the level of publication overlap between a researcher and his/her strongest collaborator(s). However, at the same time, the beneficial role of super ties, as we have quantitatively demonstrated, should also be acknowledged and supported. For example, funding programs might consider career awards that are specifically multipolar (8), which would also benefit the research partners in academia who are actually life partners, and who may face the daunting “two-body problem” of coordinating two research careers. Furthermore, understanding the basic levels of publication overlap in science is also important for the ex post facto review of funding outcomes as a means to evaluate the efficiency of science. In large-team settings, measuring the efficiency of a laboratory or project is difficult without a better understanding of how to measure overlapping labor inputs (i.e., collaborator contributions) relative to the project outputs (e.g., publications, patents, etc.). Finally, our study informs early career researchers, who are likely to face important decisions concerning the (possibly strategic) selection of collaborative opportunities, on the positive impact that the right research partner can have on their career’s long-term sustainability and growth.
In all, our results provide quantitative insights into the benefits associated with strong collaborative partnerships, pointing to the added value derived from skill-set complementarity, social trust, and long-term commitment.
SI Text
Aggregate Measures for Super Tie Impact
In The Apostle Effect I and The Apostle Effect II, we implemented regression models that elucidate the role of super ties at the annual level for productivity and at the paper level for citations. To provide additional quantitative evidence for the apostle effect, in this section we develop descriptive measures that compare the contributions by super ties to the contributions from the rest of the collaborators.
Productivity Premium.
A researcher is likely to have relatively few super ties, which on average make up only a small fraction of his/her coauthors (see Fig. 5A). However, these coauthors, by definition, contribute to a large fraction of the total output of i (corresponding on average to 40–75% of all publications; see Fig. 5B). Thus, it is important to know the contributions of the super ties relative to the nonsuper ties, because there are typically very many nonsuper tie coauthors whose inputs also contribute to the output of i.
To facilitate a comparison of productivity at the aggregate career level, we first separated the sum of the tie strengths over all coauthors j of i into the contribution from the super ties (those j with super tie indicator value 1) and the complementary contribution from the other ties (indicator value 0). We then define the productivity premium as the ratio of the mean tie strengths,
[S1] |
between the subset of super tie coauthors and the subset of all other coauthors. This quantity increases as the super tie fraction of coauthors decreases and as the ratio of the total super tie strength to the total nonsuper tie strength increases; its maximum value is equal to the total number of publications published by the central scientist, and it is also bounded from below.
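The verbal definition above (mean tie strength of the super ties divided by mean tie strength of the other coauthors) can be sketched as follows. This is an illustrative helper, not the code used for the analysis; it assumes that tie strength is a count of joint publications and that the super tie flags have already been assigned. All names and numbers are hypothetical.

```python
def productivity_premium(tie_strengths, is_super):
    """Ratio of mean tie strength: super ties over all other coauthors.

    tie_strengths: joint-publication counts, one per coauthor (assumed).
    is_super:      matching booleans, True for a super tie.
    """
    super_k = [k for k, s in zip(tie_strengths, is_super) if s]
    other_k = [k for k, s in zip(tie_strengths, is_super) if not s]
    if not super_k or not other_k:
        raise ValueError("need at least one coauthor in each subset")
    mean_super = sum(super_k) / len(super_k)
    mean_other = sum(other_k) / len(other_k)
    return mean_super / mean_other


# Hypothetical profile: one dominant partner and five occasional coauthors.
strengths = [40, 3, 2, 1, 1, 1]
flags = [True, False, False, False, False, False]
premium = productivity_premium(strengths, flags)

# Limiting case from the text: a single super tie present on all 12
# publications, every other coauthor appearing once.
limit = productivity_premium([12, 1, 1], [True, False, False])
```

The limiting case reproduces the stated maximum: with one super tie on every paper and all other coauthors appearing once, the premium equals the central scientist's publication count.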
Fig. S4C shows the cumulative distribution of the productivity premium, with average values between 7 and 10 across the datasets. Interestingly, the Top scientists from biology tend to have smaller values than the Other scientists (Mann–Whitney difference in median test P value = 0.0008, and K-S difference in distribution test P value = 0.0007). However, the same tests failed to indicate any significant difference for physics.
Citation Premium.
In economic analyses, to compare nominal prices across time, it is fundamentally necessary to account for price inflation/deflation by means of an appropriate deflator index. For the same reason, it is equally important to use deflators when comparing success measures derived from other socioeconomic systems. In professional sports, for example, the rate of achievement can be era dependent—e.g., the nonstationary home run rate in Major League Baseball is an implication of the steroids era (42, 43). In science, the publication rate in physics and biology is growing at roughly a 5% rate (5). Nevertheless, this persistent growth has been subject to periods of nonstationary growth spurts, such as during the period of the US National Institutes of Health budget doubling between 1998 and 2003 (2). Thus, with these considerations in mind, in developing comparative citation measures, it is important to appropriately account for two nonstationary features of citation credit.
First, there is the time dependence of citations, arising from the fact that papers published in different years are at different points in their citation life cycle in the citation census year. The citation tallies are also affected by the underlying growth of the citation supply, due to “inflation” or “secular growth” of scientific output, which also systematically biases the comparison of raw citation counts for p from different y. Second, it is also important to divide the citation credit among the coauthors of each publication p, in this way placing a cap on the net credit introduced by p, and accounting for the slow but steady exponential growth in the mean number of coauthors per paper over time (7).
To address these two underlying trends, we applied two normalizations to the raw citation count of a paper p published in year y, as measured in the census year. First, we “deflated” the raw count by dividing by the mean citation value for publications from the same year, and then rescaled this ratio by the mean citation value for the (arbitrary) baseline year, giving the rescaled value
[S2] |
This also accounts for the fact that more recent publications have had less time to accrue citations than older publications. Second, we control for trends in team size, choosing a naive approach that divides the citations into equal shares among the coauthors (44). As such, we define the normalized citations credited to coauthor j of p as
[S3] |
Similar to the normalization procedure used for the citation z score in Eq. 7, the mean citation value is the average number of citations for publications in a benchmark set m, where m is the aggregation of articles appearing in the multidisciplinary journals Nature, Proceedings of the National Academy of Sciences, and Science. We restricted our query to publications denoted as “Articles,” which excludes reviews, letters to the editor, corrections, and other content types. We use these high-impact journals because they have high citation rates and hence provide a robust detrending baseline for the time-dependent component of the citation count. Again, the choice of baseline year is arbitrary (as is the deflation year 2000 commonly used in economic analyses) and is mainly used to recover the units of citations for the measure. Because the same constant factor is used for all values, it does not affect our results. The advantage of the normalized citation share over the z score is that the former is nonnegative and hence can be added across p; the z score, however, can be negative and is centered around 0, and, therefore, summing it across p has a different interpretation that is not suitable for what follows.
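The two normalization steps described above (deflate by the same-year mean, rescale to the baseline year, then split equally among coauthors) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the benchmark means are taken as given inputs, the author count includes every listed author of the paper, and the function name, argument names, and numbers are hypothetical.

```python
def normalized_citation_share(raw_citations, mean_same_year,
                              mean_baseline_year, n_authors):
    """Deflated, equally shared citation credit for one publication.

    raw_citations:      citations of paper p counted in the census year.
    mean_same_year:     benchmark mean for papers published in the same year.
    mean_baseline_year: benchmark mean for the (arbitrary) baseline year.
    n_authors:          number of authors sharing the credit (assumed to
                        include every listed author).
    """
    if mean_same_year <= 0:
        raise ValueError("benchmark mean must be positive")
    if n_authors < 1:
        raise ValueError("a paper needs at least one author")
    # Step 1: deflate by the cohort mean, then restore citation units.
    rescaled = raw_citations / mean_same_year * mean_baseline_year
    # Step 2: naive equal split of the rescaled credit among the authors.
    return rescaled / n_authors


# Hypothetical numbers: a paper with twice the cohort-average citations,
# a baseline-year mean of 100, and four authors.
share = normalized_citation_share(120, 60.0, 100.0, 4)
```

Because the baseline mean multiplies every paper by the same constant, changing it rescales all shares equally and leaves any ratio of shares unchanged.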
We define the cumulative measure of citation impact for coauthors i and j as
[S4] |
where the sum includes only those publications in the profile of i that also include coauthor j. In the extreme case that j is a coauthor of every publication of i, this pairwise measure reaches its upper limit, equal to the total citation share of the central scientist. The sum across all j including i yields the net detrended citation value, which is independent of how that total is distributed across coauthors.
To define a similar citation premium, we also separated the citations into the contributions from the super ties and the contributions from the nonsuper tie collaborators. Because the total is conserved, we split the pairwise citation measures into two groups: the total for the super tie coauthors and the total for the remaining coauthors. We then define the citation premium to be the ratio of the average citation shares of the coauthors in each subset,
[S5] |
which has a minimum possible value equal to 0 and, in principle, has no upper bound. Fig. S4D shows the distribution of the citation premium, with mean, median, and maximum values across all datasets of 14.1, 11.3, and 134, respectively. We observed only two profiles (2 out of 473) with a citation premium below unity. Thus, using a group-to-group comparison, this measure shows that the relative citation impact contribution of super ties to other ties is significantly greater than unity. There may be a self-selection effect, because high-quality work may induce follow-up research, presumably with a similar set of collaborators. Hence, the citation premium is also evidence for the value of persistent collaboration, which can leverage and build upon prior experience and cumulative pairwise achievement.
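As a rough illustration of the two steps just described, the sketch below first accumulates a pairwise citation total for each coauthor and then takes the ratio of the group averages. It assumes each publication already carries its normalized per-author citation share and that the set of super ties is known; the data layout and all names are hypothetical.

```python
from collections import defaultdict


def citation_premium(publications, super_ties):
    """Ratio of average cumulative citation share: super ties over others.

    publications: list of (share, coauthors) pairs, where share is the
                  normalized per-author citation share of the paper and
                  coauthors lists everyone on it except the central
                  scientist (assumed layout).
    super_ties:   set of coauthor names flagged as super ties.
    """
    cumulative = defaultdict(float)
    for share, coauthors in publications:
        for j in set(coauthors):          # count each coauthor once per paper
            cumulative[j] += share
    supers = [c for j, c in cumulative.items() if j in super_ties]
    others = [c for j, c in cumulative.items() if j not in super_ties]
    if not supers or not others:
        raise ValueError("need at least one coauthor in each subset")
    return (sum(supers) / len(supers)) / (sum(others) / len(others))


# Hypothetical profile: A is the super tie; B and C are occasional coauthors.
pubs = [
    (10.0, ["A", "B"]),
    (5.0, ["A"]),
    (2.0, ["A", "C"]),
    (1.0, ["C"]),
]
citation_prem = citation_premium(pubs, {"A"})
```

In this toy profile, A accumulates 17 units while B and C average 6.5, so the premium is 17/6.5.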
Also of interest, we observe a consistent pattern in the distributions of both the productivity premium and the citation premium: The Top scientist profiles have smaller mean values than their counterparts, and the biology profiles have smaller mean values than the physics profiles. In the case of productivity, this may follow from their privileged access to short-term collaboration opportunities. In the case of the citation impact, this pattern may emerge due to the reputation asymmetry of top scientists, who, by way of their prestige, may have more control over their choice of collaborators, possibly aimed at reducing redundancy within the team and reducing the team size, which also increases the citation credit per coauthor. In large-team efforts, because most collaboration durations are short with relatively small tie strength, increasing the team size is most likely to decrease the citation premium by way of decreasing the numerator and increasing the denominator.
Because the citation premium is an aggregate career measure, and the dependent variable in our citation regression model (Eq. 7) is a normalized measure that does not have the dimensionality of citations, it is difficult to use these quantities to measure the citation boost on a per-publication basis. Thus, to estimate the apostle effect on the long-term citation tally of individual publications, we separated the set of publications with at least one super tie coauthor from the complementary set of publications without any super tie coauthors. To compare p from a similar era, we took all of the publications from the 11-y window 1990–2000. Also, because citation rates are discipline dependent, we distinguished between biology and physics publications. During this period, 62% (7,814) of the p include a super tie coauthor for biology and 57% (10,128) of the p include one for physics. From these well-balanced subsets, we then estimated the citation impact due to super ties in two ways.
First, we calculated the cumulative citation distribution for each of the two publication subsets. Fig. S6 A and B shows each distribution on log-linear axes, which emphasizes the log-normal features of the citation distribution. On this log-linear scale, the two distributions are characterized by a horizontal offset, which is visible for the majority of the range. This graphical feature indicates that, in distribution, the citation counts of publications with a super tie coauthor are larger by an approximately constant factor αR. We estimate αR by comparing the means and the median values of the distributions. For example, the ratio between the means yields the value 1.17 for biology and 1.16 for physics. Estimating αR using the ratio of the median values yields approximately the same value. Thus, αR represents a 16–17% citation boost for p with a super tie coauthor. For the average-cited p, this boost translates to a 21-citation difference for biology and an 8-citation difference for physics. These numbers, however, arise from an aggregated dataset, so it is not necessarily true that αR is representative of all scientists.
To confirm the per-publication citation premium at the researcher level, we grouped the publications by super tie status within each profile i. To reduce the sensitivity to fluctuations, we analyzed only the i with at least 10 publications in each of the two subsets. Then, to obtain a characteristic citation measure for each of the two subsets, we calculated the median citation value for the subset of p with a super tie coauthor and the median citation value for the complementary publication subset.
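The aggregate estimate of the boost factor can be sketched as a ratio of means and a ratio of medians between the two publication subsets. The snippet below is an illustration only: the citation values are synthetic, constructed so that one subset is exactly 1.2 times the other, and the function name is hypothetical.

```python
from statistics import mean, median


def boost_factors(with_super, without_super):
    """Return (ratio of means, ratio of medians) between two citation samples."""
    return (
        mean(with_super) / mean(without_super),
        median(with_super) / median(without_super),
    )


# Synthetic citation counts: the second sample is the first scaled by 1.2,
# mimicking a constant horizontal offset on a logarithmic citation axis.
without_st = [2.0, 5.0, 10.0, 20.0, 50.0]
with_st = [1.2 * c for c in without_st]
ratio_mean, ratio_median = boost_factors(with_st, without_st)
```

When the offset really is a constant factor, both ratios recover it; a disagreement between them on real data would suggest that the shift is not uniform across the distribution.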
Fig. S6 C and D shows the scatter plot of the two median values for each i. The line of equality distinguishes the researchers whose median citation value is larger for publications with a super tie coauthor. There is notable heterogeneity across the i in terms of the citation premium from super ties. Nevertheless, the majority of researchers have a larger median for their super tie publications, with 73% of the biology researchers and 76% of the physics researchers above the line. We then obtained a second estimate of the per-publication citation premium by fitting a least-squares model relating the two medians, where ϵ is an ordinary least squares (OLS) error term, obtaining separate best-fit values for biology and physics.
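The researcher-level comparison can be sketched as follows: keep only profiles with at least 10 publications in each subset, compute the two medians, and count how many profiles fall above the line of equality. The data layout and names are hypothetical, and the least-squares fit mentioned in the text is not reproduced here because its functional form is not given in this section.

```python
from statistics import median

MIN_PUBS = 10  # minimum subset size, as stated in the text


def profile_medians(profiles):
    """Map each eligible profile to (median without super tie, median with).

    profiles: dict name -> (citations_with_super, citations_without_super).
    """
    points = {}
    for name, (with_super, without_super) in profiles.items():
        if len(with_super) >= MIN_PUBS and len(without_super) >= MIN_PUBS:
            points[name] = (median(without_super), median(with_super))
    return points


def fraction_above_diagonal(points):
    """Share of profiles whose super tie median exceeds the other median."""
    if not points:
        raise ValueError("no eligible profiles")
    above = sum(1 for x, y in points.values() if y > x)
    return above / len(points)


# Toy profiles (hypothetical citation values).
toy = {
    "r1": (list(range(11, 21)), list(range(1, 11))),   # above the diagonal
    "r2": (list(range(1, 11)), list(range(11, 21))),   # below the diagonal
    "r3": (list(range(5, 17)), list(range(1, 13))),    # above the diagonal
    "r4": ([100.0] * 3, list(range(1, 11))),           # too few papers: dropped
}
points = profile_medians(toy)
frac_above = fraction_above_diagonal(points)
```

Using medians rather than means keeps a single highly cited paper from deciding on which side of the diagonal a profile falls.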
Thus, these last two methods provide consistent estimates of the citation boost at the publication level, corresponding to a 16−24% citation boost, pointing to a significant long-term citation impact attributable to the presence of super ties.
Data Description
Name Disambiguation Strategy.
We obtained the top-cited researcher publication data using the Distinct Author Sets function provided by TRWOK to increase the likelihood that only publications actually authored by each central author i are analyzed. On a case by case basis, we performed further author disambiguation within each profile. The Other (matched set) profiles were also downloaded from TRWOK, either by using the Distinct Author database option, or by collecting distinct researcher profile data from ResearcherID.com.
In this latter case of ResearcherID.com profiles, we collected biology and physics profiles by querying the database for profiles listing any of the following keywords: graphene, neuroscience, molecular biology, or genomics. For further details on the selection procedure and for extensive analysis of the statistical properties of these datasets, see the data descriptions in refs. 45–47.
The data census year refers to the calendar year in which the researcher profile data were downloaded. The observed career of researcher i runs from the calendar year of his/her first publication to the calendar year of the last observed publication, and these two years determine the total number of years of data for i. Depending on whether the career of i was completed by the census year, there are two possible scenarios: (scenario a) the researcher i was still active in the census year, in which case the observed number of years is only a lower bound on the eventual career length; or (scenario b) his/her career terminated at some time before the census year, in which case the observed number of years corresponds to the final career length. The datasets comprise profiles with census year varying from 2010 to 2012 (47). These relatively small variations in census year do not alter the citation results because all citation measures are appropriately detrended to make possible comparisons across time. Moreover, the regression data are longitudinal, meaning that the observations are made according to t, and so the results do not depend on the census year or the completeness of the career. Furthermore, the regression models each include an author-level fixed-effect parameter that controls for time-invariant author-specific properties, thereby absorbing factors related to the starting calendar year and its lag relative to the census year.
For a given central author i, we aggregate the TRWOK publications and create a registry of surname and first/middle-initial pairs, {Surname, FM}, where FM can consist of one, two, or three alphabetic first-letter abbreviations (initials). Because the number of distinct coauthors per i is relatively small, on the order of 10–1,000 distinct names per profile, we assume that name ambiguity among the coauthors does not introduce significant levels of type 1 “splitting” or type 2 “clumping” disambiguation errors. Hence, we perform a string matching on similar last names and first initials, ignoring the second and third initials so that publications with variable listings of the middle initials do not result in a type 1 “profile splitting” error. We then aggregate the publication information into the profile of coauthor j of central author i. Because our approach is egocentric, we do not analyze the publications of j that do not include i. Clearly, this would require nearly comprehensive TRWOK publication data, which is a major data limitation.
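A minimal sketch of this kind of coauthor registry is given below. It simplifies the procedure in two labeled ways: surnames are matched exactly after lowercasing (the text refers to “similar” last names, whose similarity rule is not specified here), and only the first initial is retained. The function names and sample records are hypothetical.

```python
from collections import defaultdict


def coauthor_key(surname, initials):
    """Reduce a {Surname, FM} pair to (surname, first initial).

    Assumption: exact lowercase surname match; second and third
    initials are discarded so that 'Doe A' and 'Doe AB' merge.
    """
    letters = [ch for ch in initials if ch.isalpha()]
    if not letters:
        raise ValueError("initials must contain at least one letter")
    return (surname.strip().lower(), letters[0].lower())


def build_registry(author_lists, central_key):
    """Map each coauthor key to the set of publications shared with i."""
    registry = defaultdict(set)
    for pub_id, authors in author_lists.items():
        for surname, initials in authors:
            key = coauthor_key(surname, initials)
            if key != central_key:        # skip the central author
                registry[key].add(pub_id)
    return registry


# Hypothetical records: the same coauthor listed with varying initials.
papers = {
    "p1": [("Smith", "J"), ("Doe", "AB")],
    "p2": [("Smith", "JA"), ("Doe", "A")],
    "p3": [("Smith", "J"), ("Doe", "A.B."), ("Roe", "C")],
}
registry = build_registry(papers, coauthor_key("Smith", "J"))
doe_pubs = registry.get(("doe", "a"), set())
```

Discarding the later initials trades a lower risk of splitting one coauthor into several profiles against a higher risk of merging distinct people, which the text argues is tolerable because each profile contains relatively few distinct names.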
Matched Profile Selection Criteria.
To account for possible prestige effects, we compared top-cited profiles to a set of Other profiles that we matched within each discipline. To match the datasets, we collected “not top-cited” researcher profiles that had levels of career length and productivity similar to the top-cited profiles. More specifically, we introduced a productivity criterion requiring that an Other profile must have at least as many publications as the least productive researcher in the corresponding top-cited dataset; this minimum threshold was set separately for biology and for physics. Altogether, our career dataset comprised 100 top-cited and 93 matched profiles from biology, and 100 top-cited and 180 matched profiles from physics.
Throughout our analysis, we introduced various quantities that summarize the career (career length, total publications, etc.) and collaboration pattern (mean duration, mean strength, strength Gini coefficient, etc.) of any given research profile i. We found that the Top and Other datasets are statistically well matched with respect to some variables, in the sense that the K-S test did not reject the null hypothesis that the underlying distributions are the same. For example, the super tie coauthor fraction exhibits the same distribution across all four datasets, as shown in Fig. 5A. Other variables were well matched only within discipline, or were well matched only within the Top or Other datasets.
One variable worth mentioning, for which the Top and Other datasets were not well matched, was the career length distribution. Because the Top scientists were selected on account of cumulative citation tallies, they are biased toward longer careers, many of which are completed careers. Because the maximum possible collaboration duration is bounded by the career length, duration-based variables may be biased toward longer values for the top-cited researcher profiles. As such, we avoid making any comparisons on account of this type of measure. Instead, our comparisons in the manuscript are based on more intensive measures, e.g., the super tie coauthor fraction, which are less sensitive to biases arising from systematic differences in career length.
Moreover, our analysis of the apostle effect, by design, avoids the potential bias due to career length. For example, the productivity premium and the citation premium are ratios in which both the numerator and the denominator should have approximately the same dependence on career length, and so the effect cancels out. In the case of the regression models, the dependent and independent variables are all specific to a particular career year t.
Acknowledgments
The author is grateful for helpful discussions with O. Doria, M. Imbruno, B. Tuncay, and R. Metulini and constructive criticism and keen insights from two anonymous referees. The author also acknowledges feedback from participants in the European Union Cooperation in Science and Technology (COST) Action TD1210 (KnowEscape) workshop on “Quantifying scientific impact: Networks, measures, insights?” and support from the Italian Ministry of Education for the National Research Project (PNR) “Crisis Lab.”
Footnotes
The author declares no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1501444112/-/DCSupplemental.
References
- 1. Börner K, et al. A multi-level systems perspective for the science of team science. Sci Transl Med. 2010;2(49):49cm24. doi: 10.1126/scitranslmed.3001399.
- 2. Stephan P. How Economics Shapes Science. Harvard Univ Press; Cambridge, MA: 2012.
- 3. Nahapiet J, Ghoshal S. Social capital, intellectual capital, and the organizational advantage. Acad Manage Rev. 1998;23(2):242–266.
- 4. Wuchty S, Jones BF, Uzzi B. The increasing dominance of teams in production of knowledge. Science. 2007;316(5827):1036–1039. doi: 10.1126/science.1136099.
- 5. Petersen AM, et al. Reputation and impact in academic careers. Proc Natl Acad Sci USA. 2014;111(43):15316–15321. doi: 10.1073/pnas.1323111111.
- 6. Malmgren RD, Ottino JM, Nunes Amaral LA. The role of mentorship in protégé performance. Nature. 2010;465(7298):622–626. doi: 10.1038/nature09040.
- 7. Petersen AM, Pavlidis I, Semendeferi I. A quantitative perspective on ethics in large team science. Sci Eng Ethics. 2014;20(4):923–945. doi: 10.1007/s11948-014-9562-8.
- 8. Pavlidis I, Petersen AM, Semendeferi I. Together we stand. Nat Phys. 2014;10:700–702.
- 9. Borgatti SP, Mehra A, Brass DJ, Labianca G. Network analysis in the social sciences. Science. 2009;323(5916):892–895. doi: 10.1126/science.1165821.
- 10. Granovetter MS. The strength of weak ties. Am J Sociol. 1973;78(6):1360–1380.
- 11. Newman MEJ. The structure of scientific collaboration networks. Proc Natl Acad Sci USA. 2001;98(2):404–409. doi: 10.1073/pnas.021544898.
- 12. Newman MEJ. Scientific collaboration networks. I. Network construction and fundamental results. Phys Rev E Stat Nonlin Soft Matter Phys. 2001;64(1 Pt 2):016131. doi: 10.1103/PhysRevE.64.016131.
- 13. Barabási AL, et al. Evolution of the social network of scientific collaborations. Physica A. 2002;311(3–4):590–614.
- 14. Newman MEJ. Coauthorship networks and patterns of scientific collaboration. Proc Natl Acad Sci USA. 2004;101(Suppl 1):5200–5205. doi: 10.1073/pnas.0307545100.
- 15. Guimerà R, Uzzi B, Spiro J, Amaral LAN. Team assembly mechanisms determine collaboration network structure and team performance. Science. 2005;308(5722):697–702. doi: 10.1126/science.1106340.
- 16. Palla G, Barabási AL, Vicsek T. Quantifying social group evolution. Nature. 2007;446(7136):664–667. doi: 10.1038/nature05670.
- 17. Pan RK, Saramäki J. The strength of strong ties in scientific collaboration networks. Europhys Lett. 2012;97(1):18007.
- 18. Martin T, Ball B, Karrer B, Newman MEJ. Coauthorship and citation patterns in the Physical Review. Phys Rev E Stat Nonlin Soft Matter Phys. 2013;88(1):012814. doi: 10.1103/PhysRevE.88.012814.
- 19. Ke Q, Ahn YY. Tie strength distribution in scientific collaboration networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2014;90(3):032804. doi: 10.1103/PhysRevE.90.032804.
- 20. Börner K, Maru JT, Goldstone RL. The simultaneous evolution of author and paper networks. Proc Natl Acad Sci USA. 2004;101(Suppl 1):5266–5273. doi: 10.1073/pnas.0307625100.
- 21. Milojević S. Principles of scientific research team formation and evolution. Proc Natl Acad Sci USA. 2014;111(11):3984–3989. doi: 10.1073/pnas.1309723111.
- 22. March JG. Exploration and exploitation in organizational learning. Organ Sci. 1991;2(1):71–87.
- 23. Lazer D, Friedman A. The network structure of exploration and exploitation. Adm Sci Q. 2007;52(4):667–694.
- 24. Petersen AM, Riccaboni M, Stanley HE, Pammolli F. Persistence and uncertainty in the academic career. Proc Natl Acad Sci USA. 2012;109(14):5213–5218. doi: 10.1073/pnas.1121429109.
- 25. Pentland A. The new science of building great teams. Harv Bus Rev. 2012;90:60–69.
- 26. Woolley AW, Chabris CF, Pentland A, Hashmi N, Malone TW. Evidence for a collective intelligence factor in the performance of human groups. Science. 2010;330(6004):686–688. doi: 10.1126/science.1193147.
- 27. Stallings J, et al. Determining scientific impact using a collaboration index. Proc Natl Acad Sci USA. 2013;110(24):9680–9685. doi: 10.1073/pnas.1220184110.
- 28. Allen L, Scott J, Brand A, Hlava M, Altman M. Publishing: Credit where credit is due. Nature. 2014;508(7496):312–313. doi: 10.1038/508312a.
- 29. Shen HW, Barabási AL. Collective credit allocation in science. Proc Natl Acad Sci USA. 2014;111(34):12325–12330. doi: 10.1073/pnas.1401992111.
- 30. Petersen AM, Jung WS, Yang JS, Stanley HE. Quantitative and empirical demonstration of the Matthew effect in a study of career longevity. Proc Natl Acad Sci USA. 2011;108(1):18–23. doi: 10.1073/pnas.1016733108.
- 31. Petersen AM, Penner O. Inequality and cumulative advantage in science careers: A case study of high-impact journals. EPJ Data Sci. 2014;3:24.
- 32. Krapivsky P, Redner S, Ben-Naim E. A Kinetic View of Statistical Physics. Cambridge Univ Press; Cambridge, UK: 2010.
- 33. Azoulay P, Zivin JSG, Wang J. Superstar extinction. Q J Econ. 2010;125(2):549–589.
- 34. Uzzi B, Mukherjee S, Stringer M, Jones B. Atypical combinations and scientific impact. Science. 2013;342(6157):468–472. doi: 10.1126/science.1240474.
- 35. Radicchi F, Fortunato S, Castellano C. Universality of citation distributions: Toward an objective measure of scientific impact. Proc Natl Acad Sci USA. 2008;105(45):17268–17272. doi: 10.1073/pnas.0806977105.
- 36. Clauset A, Arbesman S, Larremore DB. Systematic inequality and hierarchy in faculty hiring networks. Sci Adv. 2015;1(1):e1400005. doi: 10.1126/sciadv.1400005.
- 37. Duch J, et al. The possible role of resource requirements and academic career-choice risk on gender differences in publication rate and impact. PLoS One. 2012;7(12):e51332. doi: 10.1371/journal.pone.0051332.
- 38. Uzzi B. Embeddedness in the making of financial capital: How social relations and networks benefit firms seeking financing. Am Sociol Rev. 1999;64(4):481–505.
- 39. Burt RS. Structural Holes. Harvard Univ Press; Cambridge, MA: 1992.
- 40. Sarigöl E, Pfitzner R, Scholtes I, Garas A, Schweitzer F. Predicting scientific success based on coauthorship networks. EPJ Data Sci. 2014;3:9.
- 41. Fealing KH, editor. The Science of Science Policy: A Handbook. Stanford Business Books; Stanford, CA: 2011.
- 42. Petersen AM, Jung WS, Stanley HE. On the distribution of career longevity and the evolution of home run prowess in professional baseball. Europhys Lett. 2008;83(5):50010.
- 43. Petersen AM, Penner O, Stanley HE. Methods for detrending success metrics to account for inflationary and deflationary factors. Eur Phys J B. 2011;79(1):67–78.
- 44. Petersen AM, Wang F, Stanley HE. Methods for measuring the citations and productivity of scientists across time and discipline. Phys Rev E Stat Nonlin Soft Matter Phys. 2010;81(3 Pt 2):036114. doi: 10.1103/PhysRevE.81.036114.
- 45. Petersen AM, Stanley HE, Succi S. Statistical regularities in the rank-citation profile of scientists. Sci Rep. 2011;1:181. doi: 10.1038/srep00181.
- 46. Petersen AM, Succi S. The Z-index: A geometric representation of productivity and impact which accounts for information in the entire rank-citation profile. J Informetrics. 2013;7(4):823–832.
- 47. Penner O, Pan RK, Petersen AM, Kaski K, Fortunato S. On the predictability of future impact in science. Sci Rep. 2013;3:3052. doi: 10.1038/srep03052.
- 48. Acemoglu D, Robinson JA. Economic Origins of Dictatorship and Democracy. Cambridge Univ Press; Cambridge, UK: 2005.
- 49. Ausloos M. A scientometrics law about co-authors and their ranking: The co-author core. Scientometrics. 2013;95:895–909.