Significance
What are the mechanisms behind one’s research success as measured by one’s papers’ citability? By acknowledging that perceived esteem might be a consequence not only of how valuable one’s works are but also of pure luck, we arrived at a model that can accurately recreate a citation record based on just three parameters: the number of publications, the total number of citations, and the degree of randomness in the citation patterns. As a by-product, we show that a single index will never be able to embrace the complex reality of scientific impact. However, three of them can already provide us with a reliable summary.
Keywords: science of science, scientometrics, bibliometric indexes, rich get richer
Abstract
The growing popularity of bibliometric indexes (whose most famous example is the h index by J. E. Hirsch [J. E. Hirsch, Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005)]) is opposed by those claiming that one’s scientific impact cannot be reduced to a single number. Some even believe that our complex reality fails to submit to any quantitative description. We argue that neither of the two controversial extremes is true. By assuming that some citations are distributed according to the rich get richer rule (success breeds success, preferential attachment) while some others are assigned totally at random (all in all, a paper needs a bibliography), we have crafted a model that accurately summarizes citation records with merely three easily interpretable parameters: productivity, total impact, and how lucky an author has been so far.
Ever since Garfield’s (1) impact factor for journals and Hirsch’s (2) index for individual researchers, the popularity of bibliometric impact measures has been growing rapidly. The fact that they summarize one’s scientific performance with just a single number is appealing to many. However, some argue (3) that the nature of scientific activities is too multidimensional for such a simple description to be possible and a few quantitative metrics will never be sufficient to capture this complex reality in its entirety.
In this paper we address this issue from the perspective of the increasingly popular science of science (Sci-Sci) (4, 5) approach, which can be dated back to the classical book by de Solla Price, Little Science, Big Science (6). The modern Sci-Sci utilizes complex systems methodology and can be considered a fusion of agent-based modeling and big data analysis.
We have developed a model of an author’s research activity that is based on two simple assumptions: 1) In each time step one new paper is added into the simulation. 2) Each newly added paper cites the existing publications according to a combination of a) the preferential attachment rule—highly cited papers are more likely to attract even more citations [compare the rich get richer mechanism (7), the success breeds success phenomenon (8), and the effect of a scientist’s reputation (9)]—and b) sheer chance—papers might be discovered by the citing authors by accident or be included in the bibliography completely at random.
While the importance of the rich get richer rule (7) in bibliometrics is unquestionable [first part of Merton’s (10) Matthew effect, referred to as the cumulative advantage process by de Solla Price (8) or success-breeds-success phenomenon (6, 11), confirmed experimentally (12)], we argue here that a purely preferential model is incapable of explaining our reality well enough and the accidental component is necessary (13, 14).
Furthermore, in our case we adopt different levels of analysis [as known from social sciences (15)] (Fig. 1) for generated bibliometric data. Agent-based models are formulated at the microlevel—from the perspective of an individual paper. The Sci-Sci perspective usually investigates the structure of the citation network in its entirety, for instance to describe general citation patterns across the whole scientific discipline (macrolevel). Here we are mainly focusing on the rarely considered mesolevel (Table 1), which is the perspective of a single scientist, i.e., a small-sample one. As such, the above publication–citation process can be thought of as an extension of the iterative procedure known as the Ionescu–Chopard model (16, 17) (Materials and Methods, Model Description).
Table 1.
Citation mechanism | Microlevel | Mesolevel | Macrolevel |
Purely preferential | Distribution of the number of citations (21–30) | Lotkaian informetrics (19), Ionescu–Chopard model (16, 17) | Barabási–Albert model and its modifications (31) |
Preferential and/or accidental | Microscopic model (14) implies Tsallis–Pareto distribution (32) | This paper | Empirical data (33, 34), models studied in refs. 35–46 |
By assuming that citations might be assigned completely at random as well as follow the rich get richer rule, we revealed the underlying dimensionality of the mesolevel, leading to an accurate description of the output of an individual author.
Model Derivation
Assume $X_1 \ge X_2 \ge \dots \ge X_N$ is a descending sequence of citation counts for each of the $N$ papers of an author. In other words, $X_1$ denotes the number of bibliographic references to the author’s most cited paper, $X_2$ is the number for the second most cited one, …, and $X_N$ is the number for the least cited one. Famous approaches (18) to the problem of approximating observed citation records with simple mathematical models that depend on a small number of parameters were mostly based on the power law (19) or other functions (20). Unfortunately, they do not provide a good fit at the mesolevel; they are usually applied for describing papers sampled from the whole citation network (21, 22).
Our model, on the other hand, not only has a clear interpretation (recall the two simple assumptions above), but also provides high-accuracy approximations of citation records of individuals. Due to this, we are able to describe this complex reality with merely three self-explanatory parameters: the number of papers $N$; the total number of citations $C$; and the ratio of citations distributed according to the preferential attachment rule $\rho$, where $\rho = 0$ means that all papers receive citations completely at random and $\rho = 1$ that all of them follow the rich get richer rule.
For the derivation of the model please refer to Materials and Methods, Model Description. The citation process proposed above, after all of the papers have been published and all of the citations have been distributed, yields the following analytic formula for the estimated number of citations of the $k$th most cited paper (Materials and Methods, Exact Solution of the Model):
$$X_k = \frac{1-\rho}{\rho}\,\frac{C}{N}\left(\prod_{i=k}^{N}\frac{i}{i-\rho} - 1\right). \tag{1}$$
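As a minimal sketch of how the three-parameter formula can be evaluated, the Python function below computes the expected citation count of every paper at once. It assumes the closed form stated in the docstring (a scaled product of the ratios i/(i - rho), minus one); the function and argument names are illustrative and are not taken from the authors' repository.

```python
def citation_profile(n_papers, n_citations, rho):
    """Expected citations X_1 >= ... >= X_N under the assumed closed form

        X_k = (1 - rho) / rho * (C / N) * (prod_{i=k}^{N} i / (i - rho) - 1).
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    scale = (1.0 - rho) / rho * n_citations / n_papers
    profile = [0.0] * n_papers
    running_product = 1.0
    # Accumulate the product backward, from k = N down to k = 1,
    # so that each paper costs a single multiplication.
    for k in range(n_papers, 0, -1):
        running_product *= k / (k - rho)
        profile[k - 1] = scale * (running_product - 1.0)
    return profile


if __name__ == "__main__":
    record = citation_profile(20, 300.0, 0.6)
    print([round(x, 2) for x in record[:5]], round(sum(record), 6))
```

A useful sanity check on any implementation is that the returned values sum to the total number of citations, since every distributed citation lands on exactly one paper.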
Dataset Description
To demonstrate the usefulness of the model, we study the DBLP Computer Science Bibliography (47) dataset of computer science papers; see Materials and Methods, Data Availability for a description. We consider citation records of all 123,621 scholars whose $h$ index is at least 5. To determine the three model parameters characterizing each author, we omit the papers with no citations (as overfitting to a tail composed of zeros cannot lead to a good overall description). Then we compute the author’s $N$ (number of papers that were cited at least once) and $C$ (the total number of citations) and then estimate $\rho$ using the least-squares fit with respect to the Cauchy loss to weaken the influence of any potential outliers.
Once we obtain an author’s $N$, $C$, and $\rho$, we can reproduce the author’s citation record quite accurately (Fig. 2). The high variance of $\rho$ for each fixed $N$ and $C$ (Fig. 3) indicates that this parameter is necessary for a precise description of data. This suggests that indeed the modeled reality might be three-dimensional (3D), which roughly agrees with the estimates in ref. 48.
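The estimation step described above can be sketched as follows. This is only an illustration under several assumptions that the text does not settle: residuals are taken on the raw (not logarithmic) scale, the Cauchy loss is log(1 + r^2) with unit scale, and a plain grid search over rho stands in for whatever numerical optimizer the authors used. All names are hypothetical.

```python
import math


def model_profile(n_papers, n_citations, rho):
    """Assumed closed form: X_k = (1-rho)/rho * C/N * (prod_{i=k}^N i/(i-rho) - 1)."""
    scale = (1.0 - rho) / rho * n_citations / n_papers
    profile = [0.0] * n_papers
    running_product = 1.0
    for k in range(n_papers, 0, -1):
        running_product *= k / (k - rho)
        profile[k - 1] = scale * (running_product - 1.0)
    return profile


def cauchy_loss(observed, rho):
    """Sum of log(1 + r^2) over residuals r between data and model."""
    fitted = model_profile(len(observed), sum(observed), rho)
    return sum(math.log1p((o - f) ** 2) for o, f in zip(observed, fitted))


def fit_rho(citations, grid_size=999):
    """Return (N, C, rho) for one author; uncited papers are dropped first."""
    observed = sorted((c for c in citations if c > 0), reverse=True)
    # Candidate values 1/(grid_size+1), ..., grid_size/(grid_size+1) in (0, 1).
    candidates = [(j + 1) / (grid_size + 1) for j in range(grid_size)]
    best_rho = min(candidates, key=lambda r: cauchy_loss(observed, r))
    return len(observed), sum(observed), best_rho


if __name__ == "__main__":
    print(fit_rho([120, 64, 40, 22, 15, 9, 6, 3, 1, 1, 0, 0]))
```

Because $N$ and $C$ are read directly off the record, only $\rho$ is searched for, which keeps the fit one-dimensional and cheap even for hundreds of thousands of authors.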
Results and Discussion
It turns out that ca. of the authors have their corresponding , which means that, under our model, their citations appear to be distributed in an almost purely accidental manner. These authors publish on average half as many papers as those with , which might indicate that they are at the beginning of their careers or that their best papers are yet to come. We observe a positive correlation between $\rho$ and $N$ as well as $C$ (Fig. 3). In other words, more productive and/or influential authors tend to have more citations distributed according to the rich get richer rule. This observation is consistent with the well-known fact (5) that one’s highest-impact paper can occur at any time during the course of one’s career; thus, authors with more papers are more likely to have published their best work already. However, as there is a considerable variability in $\rho$ at all levels, even some outstanding careers might still be a result of more luck than reason (13, 49).
By indicating that the citation record space is 3D, we have shown that any single citation measure, including the $h$ index and the author ranking it generates, necessarily yields an oversimplified projection of a more complex space (3). In other words, whenever one chooses a single citation index, some information must inherently be lost; we will never be able to see the whole picture through the lens of any single measure.
The proposed model emphasizes the use of multiple indexes in the evaluation of scientific work. We have indicated that merely three parameters are sufficient to provide an accurate description of our reality. In the near future, we plan to perform a broad study of bibliometric indexes to come up with an intuitive and insightful classification of which of the three dimensions each index focuses on the most. This will allow policy makers to make better-informed decisions when choosing particular evaluation tools. The questions of how to best combine $N$, $C$, and $\rho$ to cause the least information loss and how well popular citation indexes perform with regard to the quality of data approximation will also be explored.
Materials and Methods
Model Description.
Let us introduce the proposed model in a formal manner. For the description of the citation dynamics we use the following parameters: the total number of papers $N$, the total number of citations $C$ that will be distributed among all papers, and the ratio of the number of preferential citations to the total number of citations $\rho$.
Due to the assumed boundary conditions in Eq. 3, we disallow both $\rho = 0$ and $\rho = 1$.
The stages of the model’s simulation are strictly connected to the scientific activity of the considered author. Each of the $N$ steps corresponds to the publication of one of the author’s papers. At the $t$th step, the $t$ articles already in existence are to receive $m = C/N$ citations in total, where $p = \rho m$ citations are distributed according to the preferential attachment rule, and $a = (1-\rho)m$ citations are uniformly distributed between the $t$ papers.
Note that both $p$ and $a$ do not need to be integers; we consider them as averages.
The rate equation for the number of citations of the $k$th most cited paper at the $t$th stage of the simulation, $X_k(t)$, takes the form
$$X_k(t) = X_k(t-1) + \frac{a}{t} + p\,\frac{X_k(t-1) + \dfrac{a}{t}}{\sum_{i=1}^{t}\left(X_i(t-1) + \dfrac{a}{t}\right)} \tag{2}$$
for $t = k, k+1, \dots, N$. As each paper has initially no citations, we introduce the following boundary conditions:
$$X_k(k-1) = 0, \qquad k = 1, \dots, N. \tag{3}$$
Note that in the rightmost term in Eq. 2, i.e., the preferential part, we assume that accidental citations are distributed first to avoid singularities with the very natural boundary conditions of the form given by Eq. 3. This explains the occurrence of $a/t$ there. The structure of the preferential part is the expected value of the binomial distribution with the number of trials $p$ and the probability resulting from the assumed rich get richer mechanism: the number of citations thus obtained is proportional to the actual number of citations (i.e., $X_k(t-1) + a/t$).
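The iterative process described in this subsection can be simulated directly. The sketch below assumes the following reading of the model: at step t the newly published paper joins with zero citations, the t existing papers first share the accidental citations equally, and the preferential citations are then split in proportion to the updated counts. Variable names (`m`, `a`, `p`) and both functions are illustrative assumptions. The second function evaluates the closed form that this reading leads to, so the two can be compared.

```python
def simulate_profile(n_papers, n_citations, rho):
    """Iterate the publication-citation process in expectation."""
    m = n_citations / n_papers      # citations handed out per step
    a = (1.0 - rho) * m             # accidental share per step
    p = rho * m                     # preferential share per step
    counts = []
    for t in range(1, n_papers + 1):
        counts.append(0.0)          # paper t is published with no citations
        # Accidental citations first: a / t to each of the t papers.
        counts = [x + a / t for x in counts]
        total = sum(counts)
        # Preferential citations: proportional to the updated counts.
        counts = [x + p * x / total for x in counts]
    return counts


def closed_form_profile(n_papers, n_citations, rho):
    """X_k = (1-rho)/rho * C/N * (prod_{i=k}^N i/(i-rho) - 1)."""
    scale = (1.0 - rho) / rho * n_citations / n_papers
    profile = [0.0] * n_papers
    running_product = 1.0
    for k in range(n_papers, 0, -1):
        running_product *= k / (k - rho)
        profile[k - 1] = scale * (running_product - 1.0)
    return profile


if __name__ == "__main__":
    sim = simulate_profile(10, 50.0, 0.5)
    exact = closed_form_profile(10, 50.0, 0.5)
    print(max(abs(s - e) for s, e in zip(sim, exact)))
```

Distributing the accidental citations before the preferential ones matters at the very first step: without it the total in the denominator would be zero.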
Exact Solution of the Model.
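Below we derive the exact formula for $X_k(t)$. Note that Eq. 2 can be simplified as
$$X_k(t) = \left(X_k(t-1) + \frac{a}{t}\right)\left(1 + \frac{p}{a + \sum_{i=1}^{t} X_i(t-1)}\right).$$
Moreover, the second factor can be further simplified due to the fact that in each of the steps, the papers receive $m$ citations; i.e.,
$$\sum_{i=1}^{t} X_i(t-1) = (t-1)\,m.$$
Therefore,
$$X_k(t) = \left(X_k(t-1) + \frac{a}{t}\right)\left(1 + \frac{p}{(t-1)\,m + a}\right).$$
Furthermore, since $a = (1-\rho)m$ and $p = \rho m$, the following holds:
$$X_k(t) = \left(1 + \frac{\rho}{t-\rho}\right)\left(X_k(t-1) + \frac{a}{t}\right). \tag{4}$$
Moreover,
$$1 + \frac{\rho}{t-\rho} = \frac{t}{t-\rho}. \tag{5}$$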
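Keeping in mind that the Euler gamma function (e.g., ref. 50, chap. 5), defined as
$$\Gamma(z) = \int_0^{\infty} x^{z-1} e^{-x}\,dx,$$
satisfies the factorial-like relation (equation 5.5.1 in ref. 50)
$$\Gamma(z+1) = z\,\Gamma(z) \tag{6}$$
for every number $z > 0$, we can transform Eq. 5 as
$$1 + \frac{\rho}{t-\rho} = \frac{t}{t-\rho} = \frac{\Gamma(t+1)\,\Gamma(t-\rho)}{\Gamma(t)\,\Gamma(t+1-\rho)}. \tag{7}$$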
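By continuing evaluation of Eq. 4 of the form given by Eq. 7, we obtain
$$X_k(t) = \frac{\Gamma(t+1)\,\Gamma(t-\rho)}{\Gamma(t)\,\Gamma(t+1-\rho)}\left[\frac{a}{t} + \frac{\Gamma(t)\,\Gamma(t-1-\rho)}{\Gamma(t-1)\,\Gamma(t-\rho)}\left(\frac{a}{t-1} + X_k(t-2)\right)\right]. \tag{8}$$
In Eq. 8 we can stop the nesting procedure by using the boundary conditions given by Eq. 3. The final formula for $X_k(t)$ with the change of the summation variable takes the form
$$X_k(t) = a\,\frac{\Gamma(t+1)}{\Gamma(t+1-\rho)}\sum_{j=k}^{t}\frac{\Gamma(j-\rho)}{\Gamma(j+1)}.$$
This can be simplified further, because the sum of the ratios of gamma functions satisfies the identity
$$\sum_{j=k}^{t}\frac{\Gamma(j-\rho)}{\Gamma(j+1)} = \frac{1}{\rho}\left(\frac{\Gamma(k-\rho)}{\Gamma(k)} - \frac{\Gamma(t+1-\rho)}{\Gamma(t+1)}\right) \tag{9}$$
(each summand equals $\left[\Gamma(j-\rho)/\Gamma(j) - \Gamma(j+1-\rho)/\Gamma(j+1)\right]/\rho$, so the sum telescopes), which leads to
$$X_k(t) = \frac{a}{\rho}\left(\frac{\Gamma(t+1)\,\Gamma(k-\rho)}{\Gamma(t+1-\rho)\,\Gamma(k)} - 1\right). \tag{10}$$
Finally, we put $t = N$, which leads to the situation where each paper has been published and every citation has been distributed. This yields $X_k = X_k(N)$ such that
$$X_k = \frac{1-\rho}{\rho}\,\frac{C}{N}\left(\frac{\Gamma(N+1)\,\Gamma(k-\rho)}{\Gamma(N+1-\rho)\,\Gamma(k)} - 1\right). \tag{11}$$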
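Gamma functions, although very elegant, are not numerically well behaved: their values overflow quickly as the argument grows. This is the reason why we should be interested in deriving the following equivalent of Eq. 11. Due to Eq. 6, we can substitute the gamma functions with the following product:
$$\frac{\Gamma(N+1)\,\Gamma(k-\rho)}{\Gamma(N+1-\rho)\,\Gamma(k)} = \prod_{i=k}^{N}\frac{i}{i-\rho}. \tag{12}$$
The Pochhammer symbol (section 5.2 in ref. 50) is defined as
$$(x)_n = \frac{\Gamma(x+n)}{\Gamma(x)} = x\,(x+1)\cdots(x+n-1). \tag{13}$$
Employing it in Eq. 11 yields
$$X_k = \frac{1-\rho}{\rho}\,\frac{C}{N}\left(\frac{(k)_{N-k+1}}{(k-\rho)_{N-k+1}} - 1\right). \tag{14}$$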
Note that the Pochhammer symbol is implemented in many numerical software packages, thus enabling fast and accurate computations.
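Where a Pochhammer routine is not at hand, one hedged alternative is to work with logarithms of gamma functions, which avoids overflow for long citation records. The sketch below uses only the Python standard library (`math.lgamma`, `math.expm1`) and assumes the gamma-ratio form of the solution stated in the docstring; the function name is illustrative.

```python
import math


def profile_via_lgamma(n_papers, n_citations, rho):
    """Evaluate, on the log scale,

        X_k = (1-rho)/rho * C/N * (G(N+1) G(k-rho) / (G(N+1-rho) G(k)) - 1),

    where G denotes the gamma function.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    scale = (1.0 - rho) / rho * n_citations / n_papers
    # Part of the log-ratio that does not depend on k.
    log_const = math.lgamma(n_papers + 1) - math.lgamma(n_papers + 1 - rho)
    profile = []
    for k in range(1, n_papers + 1):
        log_ratio = log_const + math.lgamma(k - rho) - math.lgamma(k)
        # expm1(x) = exp(x) - 1, accurate when the ratio is close to 1.
        profile.append(scale * math.expm1(log_ratio))
    return profile


if __name__ == "__main__":
    record = profile_via_lgamma(2000, 40000.0, 0.8)
    print(round(record[0], 3), round(record[-1], 6), round(sum(record), 6))
```

The ratio of Pochhammer symbols equals the same gamma ratio, so this is a numerically stable route to the same quantity rather than a different formula.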
Data Availability.
The empirical data analysis presented in this paper is based on the DBLP V10 bibliography database (47) (https://aminer.org/citation), consisting of 3,079,007 papers and 25,166,994 citation relationships. DBLP includes most of the journals related to computer science. It also tracks numerous conference proceedings papers from the field.
We have extracted citation records of 1,762,044 authors. Most of them have published a small number of papers or have received very few citations. Therefore, we restricted the analysis to the subset of researchers characterized by the $h$ index not less than 5. This gave 123,621 citation records. Moreover, papers with 0 citations have been omitted from the analysis, as they are problematic when performing computations on the log scale. Note that most impact indexes, including the $h$ index, ignore zeros anyway.
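The selection rule described above (keep authors whose $h$ index is at least 5, then drop uncited papers) can be sketched as follows. The helper names and the choice to return `None` for excluded authors are assumptions made for illustration; the authors' actual preprocessing code is in the repository cited below.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    # In a descending list the condition c >= rank holds for a prefix only,
    # so counting the ranks that satisfy it gives h.
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)


def prepare_record(citations, min_h=5):
    """Return the descending nonzero citation record, or None if h < min_h."""
    if h_index(citations) < min_h:
        return None
    return sorted((c for c in citations if c > 0), reverse=True)


if __name__ == "__main__":
    print(h_index([10, 8, 5, 4, 3, 0]))
    print(prepare_record([5, 9, 0, 7, 6, 2, 5, 0]))
    print(prepare_record([3, 1, 0]))
```

Dropping zeros after the $h$ index check does not change the index, since uncited papers never count toward it.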
The raw citation sequences, estimated parameters, and source code used to perform the data analysis can be accessed at the GitHub repository: https://github.com/gagolews/three_dimensions_of_scientific_impact (51).
Acknowledgments
We thank Maciej J. Mrowiński, Tessa Koumoundouros, and the reviewers for valuable feedback and constructive remarks.
Footnotes
The authors declare no competing interest.
This article is a PNAS Direct Submission. A.V.R. is a guest editor invited by the Editorial Board.
Data deposition: The raw citation sequences, estimated parameters, and source code used to perform the data analysis can be accessed at the GitHub repository: https://github.com/gagolews/three_dimensions_of_scientific_impact.
References
- 1. Garfield E., Citation indexes for science: A new dimension in documentation through association of ideas. Science 122, 108–111 (1955).
- 2. Hirsch J. E., An index to quantify an individual’s scientific research output. Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005).
- 3. Gagolewski M., Scientific impact assessment cannot be fair. J. Informetrics 7, 792–802 (2013).
- 4. Clauset A., Larremore D. B., Sinatra R., Data-driven predictions in the science of science. Science 355, 477–480 (2017).
- 5. Fortunato S., et al., Science of science. Science 359, eaao0185 (2018).
- 6. de Solla Price D. J., Little Science, Big Science (Columbia University Press, New York, NY, 1963).
- 7. Perc M., The Matthew effect in empirical data. J. R. Soc. Interface 11, 20140378 (2014).
- 8. de Solla Price D. J., A general theory of bibliometric and other cumulative advantage processes. J. Am. Soc. Inf. Sci. 27, 292–306 (1976).
- 9. Petersen A. M., et al., Reputation and impact in academic careers. Proc. Natl. Acad. Sci. U.S.A. 111, 15316–15321 (2014).
- 10. Merton R. K., The Matthew effect in science. Science 159, 56–63 (1968).
- 11. Tague J., The success-breeds-success phenomenon and bibliometric processes. J. Am. Soc. Inf. Sci. 32, 280–286 (1981).
- 12. van de Rijt A., Kang S. M., Restivo M., Patil A., Field experiments of success-breeds-success dynamics. Proc. Natl. Acad. Sci. U.S.A. 111, 6934–6939 (2014).
- 13. Barabási A. L., Luck or reason. Nature 489, 507–508 (2012).
- 14. Néda Z., Varga L., Biró T. S., Science and Facebook: The same popularity law! PLoS ONE 12, 1–11 (2017).
- 15. Blalock H. M., Social Statistics (McGraw-Hill, New York, NY, ed. 2, 1972).
- 16. Ionescu G., Chopard B., An agent-based model for the bibliometric h-index. Eur. Phys. J. B 86, 426 (2013).
- 17. Żogała-Siudem B., Siudem G., Cena A., Gagolewski M., Agent-based model for the h-index – Exact solution. Eur. Phys. J. B 89, 21 (2016).
- 18. Petersen A. M., Stanley H. E., Succi S., Statistical regularities in the rank-citation profile of scientists. Sci. Rep. 1, 181 (2011).
- 19. Egghe L., Lotkaian informetrics and applications to social networks. Bull. Belg. Math. Soc. Simon Stevin 16, 689–703 (2009).
- 20. Sangwal K., Comparison of different mathematical functions for the analysis of citation distribution of papers of individual authors. J. Informetrics 7, 36–49 (2013).
- 21. Thelwall M., Are the discretised lognormal and hooked power law distributions plausible for citation data? J. Informetrics 10, 454–470 (2016).
- 22. Radicchi F., Fortunato S., Castellano C., Universality of citation distributions: Toward an objective measure of scientific impact. Proc. Natl. Acad. Sci. U.S.A. 105, 17268–17272 (2008).
- 23. Redner S., How popular is your paper? An empirical study of the citation distribution. Eur. Phys. J. B Condens. Matter Complex Syst. 4, 131–134 (1998).
- 24. Wallace M. L., Larivière V., Gingras Y., Modeling a century of citation distributions. J. Informetrics 3, 296–303 (2009).
- 25. Brzezinski M., Power laws in citation distributions: Evidence from Scopus. Scientometrics 103, 213–228 (2015).
- 26. Fenner T., Levene M., Loizou G., A model for collaboration networks giving rise to a power-law distribution with an exponential cutoff. Soc. Network. 29, 70–80 (2007).
- 27. Thelwall M., Are there too many uncited articles? Zero inflated variants of the discretised lognormal and hooked power law distributions. J. Informetrics 10, 622–633 (2016).
- 28. Moreira J. A. G., Zeng X. H. T., Amaral L. A. N., The distribution of the asymptotic number of citations to sets of publications by a researcher or from an academic department are consistent with a discrete lognormal model. PLoS ONE 10, e0143108 (2015).
- 29. Thelwall M., Wilson P., Distributions for cited articles from individual subjects and years. J. Informetrics 8, 824–839 (2014).
- 30. Thelwall M., The discretised lognormal and hooked power law distributions for complete citation data: Best options for modelling and regression. J. Informetrics 10, 336–346 (2016).
- 31. Barabási A. L., Scale-free networks: A decade and beyond. Science 325, 412–413 (2009).
- 32. Thurner S., Kyriakopoulos F., Tsallis C., Unified model for network dynamics exhibiting nonextensive statistics. Phys. Rev. E 76, 036111 (2007).
- 33. Leicht E. A., Clarkson G., Shedden K., Newman M. E., Large-scale structure of time evolving citation networks. Eur. Phys. J. B 59, 75–83 (2007).
- 34. Barabási A., et al., Evolution of the social network of scientific collaborations. Phys. Stat. Mech. Appl. 311, 590–614 (2002).
- 35. Barabási A. L., Albert R., Jeong H., Mean-field theory for scale-free random networks. Phys. Stat. Mech. Appl. 272, 173–187 (1999).
- 36. Papadopoulos F., Kitsak M., Serrano M. A., Boguñá M., Krioukov D., Popularity versus similarity in growing networks. Nature 489, 537–540 (2012).
- 37. Shao Z. G., Zou X. W., Tan Z. J., Jin Z. Z., Growing networks with mixed attachment mechanisms. J. Phys. Math. Gen. 39, 2035–2042 (2006).
- 38. Shao Z. G., Chen T., Ai B.-q., Growing networks with temporal effect and mixed attachment mechanisms. Phys. Stat. Mech. Appl. 413, 147–152 (2014).
- 39. Goldstein M. L., Morris S. A., Yen G. G., Group-based yule model for bipartite author-paper networks. Phys. Rev. E 71, 026108 (2005).
- 40. Wu Z. X., Holme P., Modeling scientific-citation patterns and other triangle-rich acyclic networks. Phys. Rev. E 80, 037101 (2009).
- 41. Xie Z., Ouyang Z., Zhang P., Yi D., Kong D., Modeling the citation network by network cosmology. PLoS ONE 10, e0120687 (2015).
- 42. Zalányi L., et al., Properties of a random attachment growing network. Phys. Rev. E 68, 066104 (2003).
- 43. Goldberg S. R., Anthony H., Evans T. S., Modelling citation networks. Scientometrics 105, 1577–1604 (2015).
- 44. Simkin M. V., Roychowdhury V. P., A mathematical theory of citing. J. Am. Soc. Inf. Sci. Technol. 58, 1661–1673 (2007).
- 45. Golosovsky M., Solomon S., Growing complex network of citations of scientific papers: Modeling and measurements. Phys. Rev. E 95, 012324 (2017).
- 46. Eom Y. H., Fortunato S., Characterizing and modeling citation dynamics. PLoS ONE 6, e24926 (2011).
- 47. Tang J., et al., “ArnetMiner: Extraction and mining of academic social networks” in Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2008) (Association for Computing Machinery, New York, NY, 2008), pp. 990–998.
- 48. Clough J. R., Evans T. S., What is the dimension of citation space? Phys. Stat. Mech. Appl. 448, 235–247 (2016).
- 49. Heesen R., Academic superstars: Competent or lucky? Synthese 194, 4499–4518 (2017).
- 50. Olver F. W. J., et al., Eds., NIST digital library of mathematical functions, Version 1.0.24. http://dlmf.nist.gov/. Accessed 1 January 2020.
- 51. Siudem G., Żogała-Siudem B., Cena A., Gagolewski M., Three dimensions of scientific impact: Supplementary files and data, estimated_parameters_aminer_dblp_v10.csv.gz. https://github.com/gagolews/three_dimensions_of_scientific_impact. Deposited 26 April 2020.