Scientific Data. 2023 Jun 1;10:315. doi: 10.1038/s41597-023-02198-9

SciSciNet: A large-scale open data lake for the science of science research

Zihang Lin 1,2,3,4, Yian Yin 1,2,3,5, Lu Liu 1,2,3, Dashun Wang 1,2,3,5,
PMCID: PMC10235093  PMID: 37264014

Abstract

The science of science has attracted growing research interest, partly due to the increasing availability of large-scale datasets capturing the inner workings of science. These datasets, and the numerous linkages among them, enable researchers to ask a range of fascinating questions about how science works and where innovation occurs. Yet as datasets grow, it becomes increasingly difficult to track available sources and linkages across datasets. Here we present SciSciNet, a large-scale open data lake for the science of science research, covering over 134M scientific publications and millions of external linkages to funding and public uses. We offer detailed documentation of pre-processing steps and analytical choices in constructing the data lake. We further supplement the data lake by computing frequently used measures in the literature, illustrating how researchers may contribute collectively to enriching the data lake. Overall, this data lake serves as an initial but useful resource for the field, by lowering the barrier to entry, reducing duplication of efforts in data processing and measurements, improving the robustness and replicability of empirical claims, and broadening the diversity and representation of ideas in the field.

Subject terms: Scientific community, Publishing

Background & Summary

Modern databases capturing the inner workings of science have been growing exponentially over the past decades, offering new opportunities to study scientific production and use at larger scales and finer resolution than previously possible. Fuelled in part by the increasing availability of large-scale datasets, the science of science community turns scientific methods on science itself1–6, helping us understand in a quantitative fashion a range of important questions that are central to scientific progress—and of great interest to scientists themselves—from the evolution of individual scientific careers7–18 to collaborations19–25 and science institutions26–28 to the evolution of science2,3,5,29–34 to the nature of scientific progress and impact35–55.

Scholarly big data have flourished over the past decade, with several large-scale initiatives providing researchers free access to data. For example, CiteSeerX56, one of the earliest digital library search engines, offers a large-scale scientific library focusing on the literature in computer and information science. Building on a series of advanced data mining techniques, AMiner57 indexes and integrates a wide range of data about academic social networks58. Crossref (https://www.crossref.org/)59, as well as other initiatives in the open metadata community, has collected metadata such as the Digital Object Identifier (DOI) for each publication record and linked these records to a broad body of event data covering scholarly discussions. OpenAlex (https://openalex.org/)60, based on Microsoft Academic Graph (MAG)61–63, aims to build a large-scale open catalog for the global research system, incorporating scholarly entities and their connections across multiple datasets. In addition to data on scientific publications and citations capturing within-science dynamics, researchers have also tracked interactions between science and other socioeconomic spheres by tracing, for example, how science is referenced in patented inventions64–66, covering both front-page and in-text citations from patents to publications67,68. Table 1 summarizes several exemplary datasets commonly used in the science of science literature, with information on their coverage and accessibility.

Table 1.

Brief summary of major data sources commonly used in the science of science literature.

Data source Highlights API Data dump
Crossref Data on publications with DOIs registered in Crossref.
OpenAlex Data connecting publications, authors, institutions, and concepts.
Dimensions Data connecting publications, grants, datasets, trials, and patents.
Overton Policy documents and their citations to science and policy.
OpenCitations DOI-DOI open citation links.
AMiner Advanced information generated through data mining techniques.
CiteSeerX Full-text publications, one of the earliest digital library search engines.
ORCID Data on researchers with ORCID IDs (funding, works, peer review, etc.).
ROR Data on research organizations with ROR IDs, seeded by GRID.
Retraction Watch Data on retracted papers and reasons for retraction.
Semantic Scholar Publication dataset featuring AI-derived products (e.g., embeddings).
Web of Science Curated by in-house experts, basis for Journal Citation Reports.
PubMed Biomedical literature with PubMed IDs, linked to NIH projects, clinical trials, and other biomedical entities.
NIH RePORTER Data on NIH-funded projects, with linkages to publications, patents, and clinical studies.
NSF Awards Data on NSF-funded projects, with linkages to publications.
Clinical Trials Information on clinical studies and linkages to references worldwide.
PatentsView Data on USPTO patents (citations, classifications, inventors, etc.).
Patent Citation to Science Patent-science citations extracted from USPTO and EPO patents.
Publications of Nobel laureates Publication records and prize-winning papers of Nobel laureates.
Altmetric Data on online attention (e.g., mainstream and social media).
CORE Metadata and full-text information of 87M+ papers.
Unpaywall Publication metadata and open-access related information.
DOAJ Community-curated data on open-access journals and papers.
OpenAIRE Research Graph Data connecting scientific products, organizations, funded projects, etc. from 70K+ sources.
Faculty Opinions with Gender Metadata of authors from Faculty Opinions with gender classification from Faculty Opinions and Web of Science.
Scopus Documents selected by an independent review board of experts.
Lens Citation relationships within and across papers and patents.
Springer Nature SciGraph Triples connecting multiple entities in the research landscape, including publications, funders, and affiliations.
Google Scholar Large-scale data on publications, citations, and disambiguated scholar profiles indexed by Google.

✓: publicly available, —: available upon application or subscription, ✗: not available to the best of our knowledge (a more detailed summary is given in Table S1).

The rapid growth of the science of science community69–71, combined with its interdisciplinary nature, raises several key challenges confronting researchers in the field. First, it becomes increasingly difficult to keep track of available datasets and their potential linkages across disparate sources, raising the question of whether there are research questions that are underexplored simply due to a lack of awareness of the data. Second, as data and their linkages become more complex, there are substantial data pre-processing steps involved prior to analyses. Many of these steps are often too detailed to document in publications, with researchers making their own analytical choices when processing the data. Third, as tools and techniques used in the science of science grow in sophistication, measurements on these datasets can be computationally involved, requiring substantial investment of time and resources to compute these measures.

All these challenges highlight the need for a common data resource designed for research purposes, which could benefit the community in several important ways. First, it provides a large-scale empirical basis for research, helping to strengthen the level of evidence supporting new findings as well as increase the replicability and robustness of these findings. Second, it helps to reduce duplication of efforts across the community in data preprocessing and common measurements. Third, by compiling various datasets, linkages, and measurements, the data resource significantly lowers the barrier to entry, hence has the potential to broaden the diversity and representation of new ideas in the field.

To support these needs in the community, we present SciSciNet, a large-scale open data lake for the science of science research. The data lake not only incorporates databases that capture scientific publications, researchers, and institutions, but also tracks their linkages to related entities, ranging from upstream funding sources like NIH and NSF to downstream public uses, including references of scientific publications in patents, clinical trials, and media and social media mentions (see Fig. 1 and Table 2 for more details of entities and their relationships). Building on this collection of linked databases, we further calculate a series of commonly used measurements in the science of science, providing benchmark measures to facilitate further investigations while illustrating how researchers can further contribute collectively to the data lake. Finally, we validate the data lake using multiple approaches, including internal data validation, cross-database verification, as well as reproducing canonical results in the literature.

Fig. 1. The entity relationship diagram of SciSciNet. SciSciNet includes “SciSciNet_Papers” as the main data table, with linkages to other tables capturing data from a range of sources. For clarity, here we show a subset of the tables (see the Data Records section for a more comprehensive view of the tables). PK represents primary key, and FK represents foreign key.

Table 2.

Dataset descriptions.

File Lines Short Description (all files are in TSV format)
SciSciNet_Papers 134,129,188 File containing primary papers with Paper IDs, categories, counts, and calculated foundational metrics.
SciSciNet_PaperAuthorAffiliations 413,869,501 File containing paper-author-affiliation linkages.
SciSciNet_PaperReferences 1,588,739,703 File containing paper reference pairs within primary papers that appear in SciSciNet_Papers.
SciSciNet_Fields 311 File containing Field IDs with names and types (top-level or sub-level).
SciSciNet_Journals 49,066 File containing Journal IDs with names, ISSNs, publishers, and official webpages.
SciSciNet_ConferenceSeries 4,551 File containing Conference Series IDs with names.
SciSciNet_Authors_Gender 134,197,162 File containing Author IDs with names and individual career-level metrics.
SciSciNet_PaperFields 277,494,994 File containing linkages between Paper ID and Field ID.
SciSciNet_PaperDetails 136,726,948 File containing detailed information of papers (covering retracted papers and affiliated papers in paper families as well) including titles, journals, and publishers.
SciSciNet_Affiliations 26,998 File containing Affiliation IDs with names and institution-level metrics.
SciSciNet_Link_NSF 1,309,518 File containing linkages between Paper ID and NSF Award Number.
SciSciNet_Link_NIH 6,013,187 File containing linkages between Paper ID and NIH Project Number.
SciSciNet_Link_ClinicalTrials 438,220 File containing linkages between referenced Paper ID and NCT Number.
SciSciNet_Link_NobelLaureates 87,316 File containing linkages between Paper ID and Nobel Laureate ID.
SciSciNet_Link_Twitter 55,846,550 File containing linkages between Paper ID and Tweet ID.
SciSciNet_Link_Newsfeed 595,241 File containing linkages between Paper ID and Newsfeed ID.
SciSciNet_Link_Patents 38,740,313 File containing linkages between Paper ID and Patent ID.
SciSciNet_NSF_Metadata 489,446 File containing metadata of NSF awards from nsf.gov.
SciSciNet_Newsfeed_Metadata 947,160 File containing metadata of scientific mentions in Newsfeed from Crossref Event API.
SciSciNet_Twitter_Metadata 59,593,281 File containing metadata of scientific mentions in Twitter from Crossref Event API.

The data lake, SciSciNet, is freely available at Figshare72. At the core of the data lake is the Microsoft Academic Graph (MAG) dataset61–63. MAG is one of the largest and most comprehensive bibliometric datasets in the world and a popular data source for the science of science research. However, MAG was sunset by Microsoft at the end of 2021. Since then, there have been several important efforts in the community to ensure the continuity of data and services. For example, there are mirror datasets73 available online for MAG, and the OpenAlex (https://openalex.org) initiative builds on the MAG data, not only making it open to all but also providing continuous updates60. While these efforts have minimized potential disruptions, the sunsetting of MAG has also accelerated the need to construct open data resources designed for research purposes. Indeed, large-scale systematic datasets for the science of science mostly come in the form of raw data, which require further pre-processing and filtering operations to extract fine-grained, high-quality research data. It usually takes substantial effort and expertise to clean the data, and many of these steps are often too detailed to document in publications, with researchers making their own analytical choices. This suggests there is value in constructing an open data lake that aims to continue extending the usefulness of MAG, with substantial data pre-processing steps documented. Moreover, the data lake links together several disparate sources and pre-computed measures commonly used in the literature, serving as an open data resource for researchers interested in the quantitative studies of science and innovation.

Importantly, the curated data lake is not meant to be exhaustive; rather it represents an initial step toward a common data resource to which researchers across the community can collectively contribute. Indeed, as more data and measurements in the science of science become available, researchers can help to contribute to the continuous improvement of this data lake by adding new data, measurements, and linkages, thereby further increasing the utility of the data lake. For example, if a new paper reports a new measurement, the authors could publish a data file linking the new measurement with SciSciNet IDs, which would make it much easier for future researchers to build on their work.
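As a minimal illustration of this contribution workflow (the file name and the measurement column below are hypothetical, not part of the released data), a new measure keyed by the SciSciNet Paper ID can be merged directly into the primary paper table, for example with pandas:

```python
import pandas as pd

# Hypothetical contributed measure keyed by the SciSciNet/MAG Paper ID.
new_measure = pd.read_csv("my_new_measure.tsv", sep="\t")  # columns: PaperID, MyMeasure

# Primary paper table from the data lake (TSV format, as documented in Table 2).
papers = pd.read_csv("SciSciNet_Papers.tsv", sep="\t",
                     usecols=["PaperID", "Year", "Citation_Count"])

# Left join keeps all primary papers; papers without the new measure get NaN.
enriched = papers.merge(new_measure, on="PaperID", how="left")
```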

Methods

Data selection and curation from MAG

The Microsoft Academic Graph (MAG) dataset61–63 covers a wide range of publication records, authors, institutions, and citation records among publications. MAG has a rich set of prominent features, including the application of advanced machine learning algorithms to classify fields of study in large-scale publication records, identify paper families, and disambiguate authors and affiliations. Here we use the MAG edition released on December 6, 2021, covering 270,694,050 publication records in total.

The extensive nature of the MAG data highlights a common challenge. Indeed, using the raw data for research often requires substantial pre-processing and data-cleaning steps to arrive at a research-ready database. For example, one may need to perform a series of data selection and curation operations, including the selection of scientific publications with reliable sources, aggregation of family papers, and redistribution of citation and reference counts. After going through these steps, one may generate a curated publication data table, which serves as the primary scientific publication table in SciSciNet (Table 3, “SciSciNet_Papers”). However, each of these steps requires specific analytical choices, and given their detailed nature, these choices have remained difficult to document in research publications.

Table 3.

Data type for records of SciSciNet_Papers.

Index Format Short Description
PaperID Integer Unique MAG Paper ID of the paper.
DOI String Digital Object Identifier (DOI) of the paper.
DocType String Book, BookChapter, Conference, Dataset, Journal, Repository, Thesis, or NULL (unknown).
Year Integer Publication year of the paper.
Date DateTime Publication date of the paper formatted as YYYY-MM-DD.
JournalID Integer MAG Journal ID for published journal of the paper.
ConferenceSeriesID Integer MAG ConferenceSeries ID for published conference series of the paper.
Reference_Count Integer Total reference count of the paper.
Citation_Count Integer Total citation count of the paper.
C5 Integer The number of citations within 5 years of publication.
C10 Integer The number of citations within 10 years of publication.
Disruption Float Disruption score of the paper defined in Wu et al.20
Atyp_Median_Z Float Median Z-score of the paper defined in Uzzi et al.47
Atyp_10pct_Z Float 10th percentile Z-score of the paper defined in Uzzi et al.47
Atyp_Pairs Integer The number of journal pairs cited by the paper defined in Uzzi et al.47
WSB_mu Float Immediacy μ of the paper as introduced in WSB model46.
WSB_sigma Float Longevity σ of the paper as introduced in WSB model46.
WSB_Cinf Integer Ultimate impact of the paper predicted by WSB model46.
SB_B Float Beauty coefficient of the paper as introduced in Ke et al.93
SB_T Integer Awakening time of the paper as introduced in Ke et al.93
Team_Size Integer The number of researchers in the paper.
Institution_Count Integer The number of institutions in the paper.
Patent_Count Integer The number of citations by patents from USPTO and EPO.
Newsfeed_Count Integer The number of mentions by news from Newsfeed.
Tweet_Count Integer The number of mentions by tweets from Twitter.
NCT_Count Integer The number of citations by clinical trials from ClinicalTrials.gov.
NIH_Count Integer The number of supporting grants from NIH.
NSF_Count Integer The number of supporting grants from NSF.

Here we document in detail the various procedures we took in constructing the data lake. From the original publication data in MAG, we use MAG Paper ID as the primary key, and consider a subset of main attributes, including DOI (Digital Object Identifier), document type, and publication year. As we are mainly interested in scientific publications within MAG, we first remove paper records whose document type is marked as patent. We also remove those with neither document type nor DOI information. Each scientific publication in the database may be represented by different entities (e.g., preprint and conference), indicated as a paper “family” in MAG. To avoid duplication, we aggregate all papers in the same family into one primary paper. We also do not include retracted papers in the primary paper table in SciSciNet. Instead, we include records of retracted papers and affiliated papers in paper families in another data table “SciSciNet_PaperDetails” (Table 8) linked to the primary paper table, recording their DOIs, titles, original venue names, and original counts for citations and references in MAG. Following these steps, the primary data table “SciSciNet_Papers” contains 134,129,188 publication records with unique primary Paper IDs, including 90,764,813 journal papers, 4,629,342 books, 3,932,366 book chapters, 5,123,597 conference papers, 145,594 datasets, 3,083,949 repositories, 5,998,509 thesis papers, and 20,451,018 other papers with DOI information.
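For illustration, the selection logic above can be sketched in pandas as follows; the input file name is hypothetical, the column names follow the MAG schema (PaperId, Doi, DocType, FamilyId), and the actual pipeline involves additional steps (e.g., handling retracted papers and aggregating family-level citations):

```python
import pandas as pd

# Raw MAG paper records, assumed pre-loaded as a TSV with MAG-style headers.
mag = pd.read_csv("mag_papers_with_headers.tsv", sep="\t", low_memory=False)

# 1. Remove patent records and records lacking both a document type and a DOI.
mag = mag[mag["DocType"] != "Patent"]
mag = mag[~(mag["DocType"].isna() & mag["Doi"].isna())]

# 2. Keep one record per paper family: the primary paper is the one whose PaperId
#    equals the family's FamilyId (papers without a family are their own primary).
is_primary = mag["FamilyId"].isna() | (mag["FamilyId"] == mag["PaperId"])
primary_papers = mag[is_primary]
```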

Table 8.

Data type for records of SciSciNet_PaperDetails.

Index Format Short Description
PaperID Integer MAG Paper ID of the paper.
DOI String Digital Object Identifier (DOI) of the paper.
DocType String Book, BookChapter, Conference, Dataset, Journal, Repository, Thesis, or NULL (unknown).
PaperTitle String Title of the paper.
BookTitle String Book title of the paper.
Year Integer Publication year of the paper.
Date DateTime Publication date of the paper formatted as YYYY-MM-DD.
Publisher String Publisher name of the paper.
JournalID Integer MAG Journal ID for published journal of the paper.
ConferenceSeriesID Integer MAG ConferenceSeries ID for published conference series of the paper.
OriginalVenue String Original published venue name of the paper.
Volume String Volume of the paper.
Issue String Issue of the paper.
FirstPage String First page of the paper.
LastPage String Last page of the paper.
FamilyID Integer Primary MAG Paper ID of the paper in the same paper family.
RetractionType String “Retracted Publication” or “Retraction Notice”.
ReferenceCount Integer Reference count of the paper in MAG original papers data table.
CitationCount Integer Citation count of the paper in MAG original papers data table.

For consistency, we recalculate the citation and reference counts within the subset of 134 M primary papers, such that each citation or reference record is also included in this subset and can be found in “SciSciNet_PaperReferences” (Table 5). For papers in the same family, we aggregate their citations and references into the primary paper and drop duplicated citation pairs. Building on the updated citations, we recalculate the number of references and citations for each primary paper.
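A minimal sketch of this recounting step, assuming the documented “SciSciNet_PaperReferences” schema (Citing_PaperID, Cited_PaperID) and a TSV file name:

```python
import pandas as pd

# Citation pairs restricted to primary papers (schema from Table 5).
refs = pd.read_csv("SciSciNet_PaperReferences.tsv", sep="\t").drop_duplicates()

# Reference count: number of distinct primary papers each paper cites.
reference_count = refs.groupby("Citing_PaperID").size().rename("Reference_Count")

# Citation count: number of distinct primary papers citing each paper.
citation_count = refs.groupby("Cited_PaperID").size().rename("Citation_Count")
```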

Table 5.

Data type for records of SciSciNet_PaperReferences.

Index Format Short Description
Citing_PaperID Integer MAG Paper ID of the citing paper in the citation pair.
Cited_PaperID Integer MAG Paper ID of the cited paper in the citation pair.

MAG also contains information on authors, institutions, and fields. While author disambiguation58,74–79 remains a major challenge, we adopt the author disambiguation method from MAG and create an author table, which offers a baseline for future studies of individual careers. We also supplement the author table with empirical name-gender associations to support gender research, drawing from work by Van Buskirk et al.80; this allows us to build “SciSciNet_Authors_Gender” (Table 9) with 134,197,162 author records including their full names.

Table 9.

Data type for records of SciSciNet_Authors_Gender.

Index Format Short Description
AuthorID Integer MAG Author ID of the author.
Author_Name String Original name of the author.
H-index Integer H-index of the author.
Productivity Integer Total number of publications of the author.
Average_C10 Float Average c10 of the author.
Average_LogC10 Float Average logc10 of the author.
Inference_Sources Integer The number of name-gender inference source datasets80.
Inference_Counts Integer The empirical count of humans with the given first name and gendered label in the source datasets80.
P(gf) Float The probability that the name belongs to an individual gendered female80.

For fields, we use the fields of study records from MAG and focus on the records related to the selected primary papers (19 Level-0 fields and 292 Level-1 fields, Table 6). We incorporate this information into two tables, the “SciSciNet_PaperAuthorAffiliations” (Table 4) and “SciSciNet_PaperFields” (Table 7), with 413,869,501 and 277,494,994 records, respectively.

Table 6.

Data type for records of SciSciNet_Fields.

Index Format Short Description
FieldID Integer MAG Field ID of the field of study.
Field_Name String Original field name of the field of study.
Field_Type String Top or Sub. Top indicates the top-level field. Sub indicates the subfield.

Table 4.

Data type for records of SciSciNet_PaperAuthorAffiliations.

Index Format Short Description
PaperID Integer MAG Paper ID in the paper-author-affiliation record.
AuthorID Integer MAG Author ID in the paper-author-affiliation record.
AffiliationID Integer MAG Affiliation ID in the paper-author-affiliation record.
AuthorSequenceNumber Integer Original author sequence number starting with 1.

Table 7.

Data type for records of SciSciNet_PaperFields.

Index Format Short Description
PaperID Integer MAG Paper ID in the paper-field linkage record.
FieldID Integer MAG Field ID in the paper-field linkage record.
Hit_1pct Integer 1 if the paper is a hit paper (top 1% of total citations within the same field level and year), and 0 otherwise.
Hit_5pct Integer 1 if the paper is a hit paper (top 5% of total citations within the same field level and year), and 0 otherwise.
Hit_10pct Integer 1 if the paper is a hit paper (top 10% of total citations within the same field level and year), and 0 otherwise.
C_f Float Normalized citation as defined by Radicchi et al.48

We further use the “PaperExtendedAttributes” table from MAG to construct high-quality linkages between MAG Paper IDs and PubMed Identifiers (PMIDs). We drop duplicate links by keeping only the MAG primary paper record (if one PMID was linked to multiple MAG Paper IDs) or the latest updated PubMed record (if one MAG Paper ID was linked to multiple PMIDs), obtaining 31,230,206 primary MAG Paper ID-PMID linkages (95.6% of the original records) to further support linkage with external sources.

Together, the resulting SciSciNet includes 134,129,188 publications (Table 3), 134,197,162 authors (Table 9), 26,998 institutions (Table 10), 49,066 journals (Table 21), 4,551 conference series (Table 22), 19 top-level fields of study, 292 subfields (Table 6), and the internal links between them, including 1,588,739,703 paper-reference records (Table 5), 413,869,501 paper-author-affiliation records (Table 4), and 277,494,994 paper-field records (Table 7).

Table 10.

Data type for records of SciSciNet_Affiliations.

Index Format Short Description
AffiliationID Integer MAG Affiliation ID of the affiliation.
Affiliation_Name String Original name of the affiliation.
GridID String GRID (Global Research Identifier Database) ID of the affiliation.
Official_Page String Official webpage of the affiliation.
ISO3166Code String ISO 3166 two-letter country codes of the affiliation.
Latitude Float Latitude of the affiliation.
Longitude Float Longitude of the affiliation.
H-index Integer H-index of the affiliation.
Productivity Integer Total number of publications of the affiliation.
Average_C10 Float Average c10 of the affiliation.
Average_LogC10 Float Average log c10 of the affiliation.

Table 21.

Data type for records of SciSciNet_Journals.

Index Format Short Description
JournalID Integer MAG Journal ID of the journal.
Journal_Name String Original name of the journal.
ISSN String ISSN (International Standard Serial Number) of the journal.
Publisher String Original publisher of the journal.
Webpage String Original web link of the journal.

Table 22.

Data type for records of SciSciNet_ConferenceSeries.

Index Format Short Description
ConferenceSeriesID Integer MAG ConferenceSeries ID of the conference series.
Abbr_Name String Abbreviated name of the conference series.
ConferenceSeries_Name String Original name of the conference series.

Linking publication data with external sources

While the main paper table captures citation relationships among scientific publications, there has been growing interest in studying how science interacts with other socioeconomic institutions35,36,41,55,81,82. Here, we further trace references of scientific publications in data sources that go beyond publication datasets, tracking linkages from papers to both their upstream funding support and their downstream uses in public domains. Specifically, we link papers to the grants they acknowledge from NSF and NIH, as well as to public uses of science, by tracking references of scientific publications in patents, clinical trials, and news and social media.

NIH funding

The National Institutes of Health (NIH) is the largest public funder of biomedical research in the world. The recent decade has witnessed increasing interest in understanding the role of NIH funding in the advancement of biomedicine81,82 and its impact on individual career development83,84. NIH ExPORTER provides bulk NIH RePORTER (https://report.nih.gov/) data on research projects funded by the NIH and other major HHS operating divisions. The database also provides link tables (updated on May 16, 2021) that connect funded projects with resulting publications over the past four decades.

To construct the funded project-paper linkages between SciSciNet Paper ID and NIH Project Number, we use the PMID of MAG papers (from our previously curated “PaperExtendedAttributes” table based on MAG) as the intermediate key, matching more than 98.9% of the original NIH link table records to primary Paper ID in SciSciNet. After dropping duplicate records, we end up with a collection of 6,013,187 records (Table 11), linking 2,636,061 scientific papers (identified by primary MAG Paper IDs) to 379,014 NIH projects (identified by core NIH-funded project numbers).
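A minimal sketch of this PMID-based join, assuming hypothetical file names and ExPORTER-style column names (PMID, PROJECT_NUMBER); the actual pipeline documented above includes additional checks:

```python
import pandas as pd

# MAG Paper ID <-> PMID mapping curated from the MAG "PaperExtendedAttributes" table
# (file name is hypothetical; columns: PaperID, PMID).
pmid_map = pd.read_csv("mag_paperid_pmid.tsv", sep="\t")

# NIH link table connecting funded projects to PubMed publications
# (file and column names are assumptions about the ExPORTER export).
nih_links = pd.read_csv("nih_publink.csv", usecols=["PMID", "PROJECT_NUMBER"])

# Join through PMID, then keep unique paper-project pairs.
linked = nih_links.merge(pmid_map, on="PMID", how="inner")
linked = linked[["PaperID", "PROJECT_NUMBER"]].drop_duplicates()
```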

Table 11.

Data type for records of SciSciNet_Link_NIH.

Index Format Short Description
PaperID Integer MAG Paper ID.
NIH_Project_Number String NIH core project number.

NSF funding

Beyond biomedical research, the National Science Foundation (NSF) funds approximately 25% of all federally supported basic research conducted by the United States’ colleges and universities across virtually all fields of science and engineering. NSF provides downloadable information on research projects it has funded, including awardee, total award amount, investigator, and so forth, but no information on funded research publications. While Federal RePORTER offered downloadable files on NSF awards with links to supported publications (662,072 NSF award-publication records by 2019), it only covers a limited time period and was retired in March 2022. To obtain a more comprehensive coverage of records linking NSF awards to supported papers, we crawl the webpages of all NSF awards to retrieve information on their resulting publications. In particular, we first created a comprehensive list of all NSF award numbers from https://www.nsf.gov/awardsearch/download.jsp. We then iterate over this list to download the entire webpage document of each NSF award (from the URL https://www.nsf.gov/awardsearch/showAward?AWD_ID=[Award number]), and use the “Publications as a result of this research” section to identify scientific publications related to each award. We then extract paper titles and related information using the Python library ElementTree to navigate and parse the webpage document structurally. We end up collecting 489,446 NSF awards since 1959 (Table 20), including linkages between 131,545 awards and 1,350,915 scientific publications.

Table 20.

Data type for records of SciSciNet_NSF_Metadata.

Index Format Short Description
NSF_Award_Number String Unique NSF award number of the NSF award.
Title String Original title of the NSF award.
Publication_Research String Publications associated with the NSF award.
Date DateTime Date when the NSF award is signed by the NSF Grants Officer.

To process information crawled from NSF.gov, which is presented as raw text strings, we design a text-based multi-level matching process to link NSF awards to SciSciNet scientific publications:

  1. For records with DOI information in the raw texts of funded research publications, we perform an exact match with SciSciNet primary papers through DOI. If the DOI in an NSF publication record matches that of exactly one primary paper, we create a linkage between the NSF Award Number and the primary Paper ID. We matched 458,463 records from NSF awards to SciSciNet primary papers, where each DOI appeared only once in the entire primary paper table and could thus be associated with a unique Paper ID (exact match). After dropping duplicates where the same DOI appears repeatedly in the same NSF award, this yields 350,611 records (26.0%) linking NSF awards to SciSciNet primary papers.

  2. To process the rest of the records, we then use the title information of each article for further matching. After extracting the title from NSF records and performing a standardization procedure (e.g., converting each letter to lowercase and removing punctuation marks, extra spaces, tabs, and newline characters; a minimal sketch of this step is given after this list), our exact matches between paper titles in the NSF award data and SciSciNet primary paper data yield 246,701 unique matches (18.3% in total) in this step.

  3. We further develop a search engine for records that have not been matched in the preceding steps. Here we use Elasticsearch, a free and open search and analytics engine, to index detailed information (paper title, author, journal or conference name, and publication year) of all SciSciNet primary papers. We then feed the raw texts of the crawled NSF publications into the system and obtain the results with the top two highest scores among the indexed primary papers. Similar to a previous study55, we use the scores of the second-matched primary papers as a null model, and identify the first-matched primary paper as a match if its score is significantly higher than the right-tail cutoff of the second-score distribution (P = 0.05). Following this procedure, we match 467,159 (34.6%) of the records left unmatched by the two previous steps (Fig. 2a). Note that this procedure likely represents a conservative strategy that prioritizes precision over recall. Manually inspecting the remaining potential matches, we find that those with large differences between the top two Z-scores (Fig. 2b) are also likely to be correct matches. To this end, we also include these heuristic links, together with the difference of their Z-scores, as fuzzy-matching linkages between SciSciNet papers and NSF awards.

  4. We further supplement these matchings with information from the Crossref data dump, an independent dataset that links publications to over 30,000 funders including NSF. We collect all paper-grant pairs where the funder is identified as NSF. We then use the raw grant number from Crossref and link paper records between Crossref and SciSciNet using DOIs. We obtain 305,314 records after cleaning, including 196,509 SciSciNet primary papers with DOIs matched to 83,162 NSF awards.
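As a hedged illustration of the title standardization and exact-title matching in step 2 (the NSF input file name and its RawTitle column are assumptions for illustration, not the exact pipeline):

```python
import re
import pandas as pd

def normalize_title(title) -> str:
    """Lowercase and strip punctuation, tabs, newlines, and extra spaces."""
    title = str(title).lower()
    title = re.sub(r"[^\w\s]", " ", title)      # remove punctuation marks
    return re.sub(r"\s+", " ", title).strip()   # collapse whitespace

# Hypothetical crawled NSF publication strings and the documented paper details table.
nsf_pubs = pd.read_csv("nsf_publication_records.tsv", sep="\t")   # NSF_Award_Number, RawTitle (assumed)
papers = pd.read_csv("SciSciNet_PaperDetails.tsv", sep="\t",
                     usecols=["PaperID", "PaperTitle"]).dropna(subset=["PaperTitle"])

nsf_pubs["norm_title"] = nsf_pubs["RawTitle"].map(normalize_title)
papers["norm_title"] = papers["PaperTitle"].map(normalize_title)

# Keep only titles that are unique in the primary paper table, then match exactly.
unique_titles = papers.drop_duplicates("norm_title", keep=False)
matches = nsf_pubs.merge(unique_titles, on="norm_title", how="inner")
```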

Fig. 2. Matching NSF reference strings to MAG records. (a) Distribution of Z-scores for papers matched in Elasticsearch with the first and second highest scores. The vertical red line denotes the right-tail cutoff of the second score distribution (P = 0.05). (b) Distribution of pairwise Z-score differences for papers matched in the search engine but with the first score no higher than the right-tail cutoff of the second score distribution (P = 0.05).

By combining records collected from all these steps, we collect 1,130,641 unique linkages with high confidence levels and 178,877 additional possible linkages from fuzzy matches (Table 12). Together these links connect 148,148 NSF awards and 929,258 SciSciNet primary papers.

Table 12.

Data type for records of SciSciNet_Link_NSF.

Index Format Short Description
PaperID Integer MAG Paper ID.
NSF_Award_Number String NSF award number.
Type String “First” and “Crossref” denote exact matches, and “Second” denotes a fuzzy match; the “Crossref” type is derived from Crossref funder-paper links.
Diff_ZScore Float The difference of Z-scores from the heuristic method, for records of the “Second” type.

Patent citations to science

The process by which knowledge transfers from science to marketplace applications has received much attention in the science and innovation literature35,41,85–88. The United States Patent and Trademark Office (USPTO) makes patenting activity data publicly accessible, with the PatentsView platform providing extensive metadata related to patent assignees, inventors, and lawyers, along with patents’ internal citations and full-text information. The European Patent Office (EPO) also provides open access to patent data containing rich attributes.

Building on recent advances in linking papers to patents35,67,68, Marx and Fuegi developed a large-scale dataset of over 40 M citations from USPTO and EPO patents to scientific publications in MAG. Using this corpus (version v34, as of December 24, 2021), we merge 392 K patent citations received by affiliated MAG papers onto their respective primary IDs in the same paper family. Dropping possible duplicate records with the same pair of primary Paper ID and Patent ID results in 38,740,313 paper-patent citation pairs between 2,360,587 patents from USPTO and EPO and 4,627,035 primary papers in SciSciNet (Table 15).

Table 15.

Data type for records of SciSciNet_Link_Patents.

Index Format Short Description
PaperID Integer MAG Paper ID.
PatentID String Patent ID from the dataset by Marx and Fuegi67,68.
Type Integer 1 if the patent is from USPTO, and 0 if not.

Clinical trials citations to science

Understanding bench-to-bedside translation is essential for biomedical research81,89. ClinicalTrials.gov, sourced from the U.S. National Library of Medicine, provides publicly available clinical study records covering 50 U.S. states and 220 countries. The Clinical Trials Transformation Initiative (CTTI) makes clinical trials data available through the database for Aggregate Analysis of ClinicalTrials.gov (AACT), an aggregated relational database helping researchers better study drugs, policies, publications, and other items related to clinical trials.

Overall, the data covers 686,524 records linking clinical trials to background or result papers (as of January 26th, 2022). We select 480,893 records in which papers serve as background references supporting clinical trials, of which 451,357 records contain 63,281 unique trials matched to 345,797 reference papers with PMIDs. Similar to the process of linking scientific publications to NIH-funded projects, we again establish linkages between SciSciNet primary Paper ID and NCT Number (National Clinical Trial Number) via PMID, aided by the curated “PaperExtendedAttributes” table as the intermediary. After standardizing the data format of the intermediate index PMID to merge publications and clinical trials, we obtain 438,220 paper-clinical trial linkages between 61,447 NCT clinical trials and 337,430 SciSciNet primary papers (Table 13).

Table 13.

Data type for records of SciSciNet_Link_ClinicalTrials.

Index Format Short Description
PaperID Integer MAG Paper ID.
NCT_Number String National Clinical Trial number.

News and social mentions of science

Understanding how science is mentioned in media has been another important research direction in the science of science community44,90. The Newsfeed mentions in Crossref Event Data link scientific papers in Crossref59 with DOIs to news articles or blog posts in RSS and Atom feeds, providing access to the latest scientific news mentions from multiple sources, including Scientific American, The Guardian, Vox, The New York Times, and others. Also, Twitter mentions in Crossref Event Data link scientific papers to tweets created by Twitter users, offering an opportunity to explore scientific mentions in Twitter.

We use the Crossref Event API to collect 947,160 records between 325,396 scientific publications and 387,578 webpages from news blogs or posts (from April 5th, 2017 to January 16th, 2022) and 59,593,281 records between 4,661,465 scientific publications and 58,099,519 tweets (from February 7th, 2017 to January 17th, 2022).

For both news media and social media mentions, we further link Crossref’s publication records to SciSciNet’s primary papers. To do so, we first normalize the DOI format of these data records and convert all alphabetic characters to lowercase. We then use the normalized DOI as the intermediate index, as detailed below:

For news media mentions, we construct linkages between primary Paper ID and Newsfeed Object ID (i.e., the webpage of news articles or blog posts) by inner joining normalized DOIs. We successfully link 899,323 records from scientific publications to news webpages in the Newsfeed list, accounting for 94.9% of the total records. The same news mention may be collected multiple times. After removing duplicate records, we end up with 595,241 records, linking 307,959 papers to 370,065 webpages from Newsfeed (Table 17).

Table 17.

Data type for records of SciSciNet_Link_Newsfeed.

Index Format Short Description
PaperID Integer MAG Paper ID.
NewsfeedID String Newsfeed ID.

Similarly, for social media mentions, we connect primary Paper IDs with Tweet IDs through inner joining normalized DOIs, yielding 56,121,135 records, more than 94% of the total records. After dropping duplicate records, we keep 55,846,550 records, linking 4,329,443 papers to 53,053,505 tweets (Table 16).
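A minimal sketch of the DOI normalization and inner join used for both Newsfeed and Twitter linkages (shown here for Twitter; the event file and its column names are assumptions about the Crossref Event Data export):

```python
import pandas as pd

def normalize_doi(doi: str) -> str:
    """Lowercase the DOI and strip a possible resolver prefix."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/"):
        doi = doi.removeprefix(prefix)
    return doi

# Hypothetical Twitter event records pulled from the Crossref Event Data API.
events = pd.read_csv("crossref_twitter_events.tsv", sep="\t")   # columns: TweetID, obj_id (assumed)
papers = pd.read_csv("SciSciNet_Papers.tsv", sep="\t",
                     usecols=["PaperID", "DOI"]).dropna(subset=["DOI"])

events["doi_norm"] = events["obj_id"].map(normalize_doi)
papers["doi_norm"] = papers["DOI"].map(normalize_doi)

# Inner join on the normalized DOI, then drop repeated mentions of the same pair.
linked = events.merge(papers[["PaperID", "doi_norm"]], on="doi_norm", how="inner")
linked = linked[["PaperID", "TweetID"]].drop_duplicates()
```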

Table 16.

Data type for records of SciSciNet_Link_Twitter.

Index Format Short Description
PaperID Integer MAG Paper ID.
TweetID Integer Tweet ID.

We also provide metadata of paper-news linkages, including the mention time and the detailed mention information in Newsfeed, to better support future research on this topic (Table 18). Similarly, we also offer the metadata of paper-tweet links, including the mention time and the original collected Tweet ID so that interested researchers can merge with further information from Twitter using the Tweet ID (Table 19).

Table 18.

Data type for records of SciSciNet_Newsfeed_Metadata.

Index Format Short Description
NewsfeedID String Newsfeed ID of the news article or blog post.
Occurred_Time DateTime Publication time of the news.
ObjectID String DOI object link of the mention.
Subject_Infomation String Detailed information of the subject news mention.
Table 19.

Data type for records of SciSciNet_Twitter_Metadata.

Index Format Short Description
TweetID Integer Unique Tweet ID of the tweet.
Occurred_Time DateTime Publication time of the tweet.
ObjectID String DOI object link of the mention.
OriginalTweetID String Web link of the tweet.

Nobel Prize data from the dataset of publication records for Nobel laureates

We integrate a recent dataset by Li et al.91 into the data lake, containing the publication records of Nobel laureates in science from 1900 to 2016, including both Nobel prize-winning works and other papers produced over their careers. After mapping affiliated MAG Paper IDs to primary ones, we obtain 87,316 publication records of Nobel laureates in the SciSciNet primary paper table (20,434 in physics, 38,133 in chemistry, and 28,749 in physiology/medicine, Table 14).

Table 14.

Data type for records of SciSciNet_Link_NobelLaureates.

Index Format Short Description
PaperID Integer MAG Paper ID.
LaureateID Integer Nobel Laureate ID mentioned in Li et al.91
Type Integer 1 if the paper is a prize-winning paper, and 0 if not.

Calculation of commonly used measurements

Using the constructed dataset, we further calculate a range of commonly used measurements of scientific ideas, impacts, careers, and collaborations. Interested readers can find more details and validations of these measurements in the literature15,19,20,46–48,92–98.

Publication-level

The number of researchers and institutions in a scientific paper

Building on team science literature19,27, we calculate the number of authors and the number of institutions for each paper as recorded in our data lake. We group papers by primary Paper ID in the selected “SciSciNet_PaperAuthorAffiliations” table and aggregate the unique counts of Author IDs and Affiliation IDs as the number of researchers (team size) and institutions, respectively.

Five-year citations (c5), ten-year citations (c10), normalized citation (cf), and hit paper

The number of citations of a paper evolves over time46,48,99,100. Here we calculate c5 and c10, defined as the number of citations a paper received within 5 years and 10 years of publication, respectively. For the primary papers, we calculate c5 for all papers published up to 2016 (as the latest version of the MAG publication data extends through 2021) by counting the number of citation pairs with a time difference of no more than 5 years. Similarly, we calculate c10 for all papers published up to 2011.
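A minimal sketch of the c5 computation, assuming the documented table schemas and TSV file names (c10 is analogous with a 10-year window and a 2011 cutoff):

```python
import pandas as pd

refs = pd.read_csv("SciSciNet_PaperReferences.tsv", sep="\t")
years = pd.read_csv("SciSciNet_Papers.tsv", sep="\t", usecols=["PaperID", "Year"])

# Attach the publication years of the citing and cited papers to each citation pair.
pairs = (refs
         .merge(years.rename(columns={"PaperID": "Citing_PaperID", "Year": "CitingYear"}),
                on="Citing_PaperID")
         .merge(years.rename(columns={"PaperID": "Cited_PaperID", "Year": "CitedYear"}),
                on="Cited_PaperID"))

# c5: citations arriving within 5 years of publication, for papers published up to 2016.
within5 = pairs[(pairs["CitingYear"] - pairs["CitedYear"]) <= 5]
c5 = (within5[within5["CitedYear"] <= 2016]
      .groupby("Cited_PaperID").size().rename("C5"))
```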

To compare citation counts across disciplines and time, Radicchi et al.48 proposed the relative citation indicator cf, defined as the total number of citations c divided by the average number of citations c0 of papers in the same field and the same year. Here we calculate this normalized citation indicator for each categorized paper in both top-level fields and subfields, i.e., the Level-0 fields (19 in total) and Level-1 fields (292 in total) defined by MAG. Note that each paper may be associated with multiple fields; hence we report the calculated normalized citations for each paper-field pair in the “SciSciNet_PaperFields” data table.

Another citation-based measure widely used in the science of science literature16,19,83 is “hit papers”, defined as papers in the top 5% of citations within the same field and year. Similar to our calculation of cf, we use the same grouping by fields and years, and identify all papers with citations greater than the top 5% citation threshold. We also perform similar operations for the top 1% and top 10% hit papers.
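The two field-year-normalized measures can be sketched together as follows, assuming the documented column names; the exact percentile-threshold convention here is an assumption of this sketch:

```python
import pandas as pd

pf = pd.read_csv("SciSciNet_PaperFields.tsv", sep="\t", usecols=["PaperID", "FieldID"])
papers = pd.read_csv("SciSciNet_Papers.tsv", sep="\t",
                     usecols=["PaperID", "Year", "Citation_Count"])
pf = pf.merge(papers, on="PaperID")

# Group papers by field and publication year.
grouped = pf.groupby(["FieldID", "Year"])["Citation_Count"]

# Normalized citation c_f: citations divided by the field-year average.
pf["C_f"] = pf["Citation_Count"] / grouped.transform("mean")

# Hit paper indicator: 1 if the paper exceeds the top-5% citation threshold in its field-year.
threshold = grouped.transform(lambda c: c.quantile(0.95))
pf["Hit_5pct"] = (pf["Citation_Count"] > threshold).astype(int)
```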

Citation dynamics

A model developed by Wang, Song, and Barabási (the WSB model)46 captures the long-term citation dynamics of individual papers after incorporating three fundamental mechanisms, including preferential attachment, aging, and fitness. The model predicts the cumulative citations received by paper i at time t after publication as $c_i^t = m\left(e^{\lambda_i \Phi\left(\frac{\ln t - \mu_i}{\sigma_i}\right)} - 1\right)$, where $\Phi(x)$ is the standard cumulative normal distribution of x, m captures the average number of references per paper, and $\mu_i$, $\sigma_i$, and $\lambda_i$ indicate the immediacy, longevity, and fitness parameters characterizing paper i, respectively.

We implement the WSB model with prior for papers published in the fields of math and physics. Following the method proposed by Shen et al.92, we adopt the Bayesian approach to calculate the conjugate prior, which follows a gamma distribution. The method allows us to better predict the long-term impact through the posterior estimation of $\lambda_i$, while helping to avoid potential overfitting problems. Fitting this model to empirical data, we compute the immediacy $\mu_i$, the longevity $\sigma_i$, and the ultimate impact $c_i^{\infty} = m\left(e^{\lambda_i} - 1\right)$ for all math and physics papers with at least 10 citations within 10 years after publication (published no later than 2011). To facilitate research on citation dynamics across different fields48, we have also used the same procedure to fit the citation sequences for papers that have received at least 10 citations within 10 years across all fields of study from the 1960s to the 1990s.
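For reference, the WSB citation curve and its ultimate-impact limit can be evaluated directly from the fitted parameters; the sketch below uses illustrative (not fitted) parameter values:

```python
import numpy as np
from scipy.stats import norm

def wsb_cumulative_citations(t, m, lam, mu, sigma):
    """Cumulative citations at time t under the WSB model:
    c_i^t = m * (exp(lam * Phi((ln t - mu) / sigma)) - 1)."""
    return m * (np.exp(lam * norm.cdf((np.log(t) - mu) / sigma)) - 1.0)

def wsb_ultimate_impact(m, lam):
    """Ultimate impact c_i^inf = m * (exp(lam) - 1), the t -> infinity limit."""
    return m * (np.exp(lam) - 1.0)

# Illustrative parameter values for a single paper (not fitted to real data).
t_years = np.arange(1, 21)
print(wsb_cumulative_citations(t_years, m=30, lam=2.0, mu=1.5, sigma=1.0))
print(wsb_ultimate_impact(m=30, lam=2.0))
```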

Sleeping beauty coefficient

Sometimes it may take years or even decades for papers to gain attention from the scientific community, a phenomenon known as the “Sleeping Beauty” in science93. The sleeping beauty coefficient B is defined as $B = \sum_{t=0}^{t_m} \frac{\frac{c_{t_m} - c_0}{t_m} \cdot t + c_0 - c_t}{\max(1, c_t)}$, where the paper receives its maximum yearly citation count $c_{t_m}$ in year $t_m$ and $c_0$ citations in the year of publication. Here we calculate the sleeping beauty coefficient from the yearly citation records of a paper. We match the publication years for each citing-cited paper pair published in journals and then aggregate yearly citations since publication for each cited paper. Next, we group the “SciSciNet_PaperReferences” table by each cited paper and compute the coefficient B, along with the awakening time. As a result, we obtain 52,699,363 records with sleeping beauty coefficients for journal articles with at least one citation.
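A minimal implementation of the coefficient B from a paper's yearly citation counts (the awakening time is omitted; the convention B = 0 when the citation peak falls in the publication year is an assumption of this sketch):

```python
import numpy as np

def sleeping_beauty_B(yearly_citations):
    """Sleeping beauty coefficient B (Ke et al.) from yearly citation counts c_0..c_T:
    B = sum_{t=0}^{t_m} [((c_tm - c_0)/t_m * t + c_0 - c_t) / max(1, c_t)],
    where t_m is the year of the citation peak."""
    c = np.asarray(yearly_citations, dtype=float)
    t_m = int(np.argmax(c))
    if t_m == 0:
        return 0.0  # peak in the publication year: B set to 0 in this sketch
    t = np.arange(t_m + 1)
    line = (c[t_m] - c[0]) / t_m * t + c[0]   # reference line from c_0 to c_tm
    return float(np.sum((line - c[: t_m + 1]) / np.maximum(1.0, c[: t_m + 1])))

# Example: a paper dormant for years, then suddenly cited.
print(sleeping_beauty_B([1, 0, 0, 1, 0, 2, 1, 3, 10, 25]))
```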

Novelty and conventionality

Research shows that the highest-impact papers in science tend to be grounded in exceptionally conventional combinations of prior work yet simultaneously feature an intrusion of atypical combinations47. Here, following this work47, we calculate the novelty and conventionality score of each paper by computing the Z-score for each combination of journal pairs. We further calculate the distribution of journal-pair Z-scores by traversing all possible pairs of references cited by a particular paper. A paper’s median Z-score characterizes the median conventionality of the paper, whereas a paper’s 10th percentile Z-score captures the tail novelty of the paper’s atypical combinations.

More specifically, we first use the publication years of each citing-cited paper pair (both published in journals) and shuffle the reference records within each citing-cited year group to generate 10 randomized citation networks, while controlling for the naturally skewed citation distributions. We then traverse each focal paper published in a given year, aggregate the frequency of reference journal pairs for papers in the real citation network and the 10 randomized citation networks, and calculate the Z-score of each reference journal pair for papers published in the same year. Finally, for each focal paper, we obtain the 10th percentile and median of its Z-score distribution, yielding 44,143,650 publication records with novelty and conventionality measures for journal papers from 1950 to 2021.

Disruption score

The disruption index quantifies the extent to which a paper disrupts or develops the existing literature20,51. Disruption, or D, is calculated through citation networks. For a given paper, one can separate its future citations into two types: one type cites the focal paper itself while ignoring all the references that the paper builds upon, and the other cites both the focal paper and its references. D is expressed as $D = p_i - p_j = \frac{n_i - n_j}{n_i + n_j + n_k}$, where $n_i$ is the number of subsequent works that cite only the focal paper, $n_j$ is the number of subsequent works that cite both the focal paper and its references, and $n_k$ is the number of subsequent works that cite only the references of the focal paper. Following this definition, we calculate the disruption scores for all papers that have at least one forward and one backward citation (48,581,274 in total).
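A minimal sketch of computing D for a single focal paper from the citation-pair table; for brevity it ignores the restriction that counted works must be published after the focal paper:

```python
import pandas as pd

def disruption(focal_id, refs):
    """Disruption index D = (n_i - n_j) / (n_i + n_j + n_k) for one focal paper,
    given a citation table `refs` with Citing_PaperID and Cited_PaperID columns."""
    focal_refs = set(refs.loc[refs["Citing_PaperID"] == focal_id, "Cited_PaperID"])
    citers_of_focal = set(refs.loc[refs["Cited_PaperID"] == focal_id, "Citing_PaperID"])
    citers_of_refs = set(refs.loc[refs["Cited_PaperID"].isin(focal_refs), "Citing_PaperID"])
    citers_of_refs.discard(focal_id)              # exclude the focal paper itself

    n_i = len(citers_of_focal - citers_of_refs)   # cite the focal paper only
    n_j = len(citers_of_focal & citers_of_refs)   # cite the focal paper and its references
    n_k = len(citers_of_refs - citers_of_focal)   # cite the references only
    denom = n_i + n_j + n_k
    return (n_i - n_j) / denom if denom else None

# Usage: refs = pd.read_csv("SciSciNet_PaperReferences.tsv", sep="\t")
#        disruption(some_paper_id, refs)
```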

The number of NSF and NIH supporting grants

For external linkages from scientific publications to upstream supporting funding sources, we calculate the number of NSF/NIH grants associated with each primary paper in SciSciNet.

The number of patent citations, Newsfeed mentions, Twitter mentions, and clinical trial citations

For external linkages from scientific publications to downstream public uses of science, we also calculate the number of citations each primary paper in SciSciNet received from domains that go beyond science, including patents from USPTO and EPO, news and social media mentions from Newsfeed and Twitter, and clinical trials from ClinicalTrials.gov.

Individual- and Institutional-level measures

Productivity

Scientific productivity is a widely used measure for quantifying individual careers9,15. Here we group the records in the “SciSciNet_PaperAuthorAffiliations” data table by Author ID or Affiliation ID, aggregate the unique primary Paper IDs, and calculate the number of publications produced by each author or affiliation.

H-index

The H-index is a popular metric for estimating a researcher’s career impact. The index of a scientist is h if h of her papers have at least h citations each and each of the remaining papers has fewer than h citations94,101. Here we compile the full publication list associated with each author, sort these papers by their total number of citations in descending order, and calculate the maximum value satisfying the condition above as the H-index. By repeating the same procedure for each research institution, we also provide an institution-level H-index.
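A minimal sketch of the H-index computation from a list of citation counts:

```python
def h_index(citation_counts):
    """H-index: the largest h such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

# Example: 6 papers with these citation counts give an H-index of 4.
print(h_index([10, 8, 5, 4, 3, 0]))   # -> 4
```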

Scientific impact

Building on our c10 measure at the paper level, here we further calculate the average c10 (⟨c10⟩) for each author and affiliation, which offers a proxy for individual- and institution-level scientific impact. Similarly, we calculate the average log c10 (⟨log c10⟩), which is closely related to the Q parameter15 of individual scientific impact.

Here we group the “PaperAuthorAffiliations” table by Author ID and Affiliation ID, and then aggregate the c10 and log c10 values (pre-calculated at the paper level) of all papers published by the same ID. Following previous works15,16,102, to avoid taking the logarithm of zero, we increase c10 by one when calculating ⟨log c10⟩.

Name-gender associations

The availability of big data also enables a range of studies focusing on gender disparities, ranging from scientific publications and careers17,103–106 to collaboration patterns25,107 and the effects of the pandemic on women scientists45,108–110. Here we apply the method from a recent statistical model80 to infer author gender based on first names in the original author table. The method feeds unique author names into a cultural consensus model of name-gender associations incorporating 36 separate sources across over 150 countries. Note that of the 134,197,162 authors, 23.26% (31,224,458) have only first initials and are excluded from the inference. By fine-tuning the annotated names from these data sources following the original method, we obtain 409,809 unique names, with the maximum uncertainty threshold set to 0.26 and 85% of the sample classified. Finally, we merge these name-gender inference records into the original SciSciNet_Authors table, resulting in the SciSciNet_Authors_Gender table, which contains 86,286,037 authors with the inferred probability that a name belongs to an individual gendered female, denoted as P(gf), as well as the number of inference source datasets and empirical counts. Together, by combining new statistical models with our systematic authorship information, this new table provides name-gender information useful for studying gender-related questions. It is important to note that such name-based gender inference algorithms, including the one used here as well as other popular tools such as genderize.io, have limitations and are necessarily imperfect. These limitations should be considered carefully when applying these methods96.

Data Records

The data lake, SciSciNet, is freely available at Figshare72.

Data structure

Table 2 presents the size and descriptions of these data files.

Table 3 contains information about “SciSciNet_Papers”, which is the data lake’s primary paper table, containing information on the primary scientific publications, including Paper ID, DOI, and others, along with the Journal ID or Conference Series ID, which can link papers to corresponding journals or conference series that take place regularly. The short description in each data field includes the corresponding explanation of that field.

Tables 4–22 include the data fields and corresponding descriptions of each data table. The meaning of each data field is clear from its index name. A data field containing an ID can be linked across tables if it shares the same ID name with a field in another table. Further, the data link tables provide linkages from scientific publications to external socioeconomic institutions. For example, the paper with primary “PaperID” “246319838”, which studied hereditary spastic paraplegia111, links to three core NIH project numbers, “R01NS033645”, “R01NS036177”, and “R01NS038713”, in Table 11 “SciSciNet_Link_NIH”. We can not only extract detailed information and metrics of the paper in the data lake (e.g., its title from Table 8 “SciSciNet_PaperDetails”, or citation counts from the primary paper Table 3 “SciSciNet_Papers”) but also obtain further information on the funded projects, such as the total funding amount, from NIH RePORTER (https://report.nih.gov).

Descriptive statistics

Next, we present a set of descriptive statistics derived from the data lake. Figure 3a–c show the distribution of papers across 19 top-level fields, the exponential growth of scientific publications in SciSciNet over time, and the average team size of papers by field over time.

Fig. 3. Summary statistics of scientific publications in SciSciNet. (a) The number of publications in 19 top-level fields. For clarity, we aggregated the field classification into the top level (e.g., a paper is counted as a physics paper if it is associated with physics or any of its subfields). (b) The exponential growth of science over time. (c) Average team size by field from 1950 to 2020. The bold black line is for papers in all 19 top-level fields. Each colored line indicates one of the 19 fields (color coded according to (a)).

Building on the external linkages we constructed, Fig. 4a–f show the distribution of paper-level upstream funding sources from NIH and NSF, and downstream applications and mentions of science, including USPTO/EPO patents, clinical trials, news mentions from Newsfeed, and social media mentions from Twitter.

Fig. 4. Linking scientific publications with socioeconomic institutions. Panels (a,b and d,e) show the distribution of paper-level downstream applications (a: Twitter mentions; b: Newsfeed mentions; d: Patents; e: Clinical trials). Panels (c and f) show the distribution of supporting scientific grants from NIH (c) and NSF (f).

Figure 5 presents the probability distributions of various commonly used metrics in the science of science using our data lake, which are broadly consistent with the original studies in the literature.

Fig. 5. Commonly used metrics in SciSciNet. (a) The distribution of disruption scores for 48,581,274 papers20 (50,000 bins in total). (b) Cumulative distribution function (CDF) of the 10th percentile and median Z-scores of 44,143,650 journal papers47. (c) Distribution of e^⟨log c10⟩ for scholars15 with at least 10 publications in SciSciNet. The red line corresponds to a log-normal fit with μ = 2.14 and σ = 1.14. (d) Survival distribution function of sleeping beauty coefficients93 for 52,699,363 papers, with a power-law fit: exponent α = 2.40. (e) Data collapse for a selected subset of papers with more than 30 citations within 30 years across journals in physics in the 1960s, based on the WSB model46. The red line corresponds to the cumulative distribution function of the standard normal distribution.

Technical Validation

Validation of publication and citation records

Because we select the primary papers from the original MAG dataset, we re-count citations and references within this subset of primary papers. To test the reliability of the updated citation and reference counts in SciSciNet, we compare the two versions (i.e., raw MAG counts and redistributed SciSciNet counts) by calculating Spearman correlation coefficients for both citations and references. The coefficients are 0.991 for citations and 0.994 for references, indicating that these counts are highly correlated before and after the redistribution process.
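As an illustration of this check, the Python sketch below re-counts citations and references from a citation edge list restricted to primary papers and compares them with the raw MAG counts via Spearman correlation; the file and column names are illustrative assumptions rather than the released schema.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical inputs: a citation edge list restricted to primary papers,
# and the raw per-paper counts from MAG.
edges = pd.read_csv("citations_primary.tsv", sep="\t")   # columns: CitingPaperID, CitedPaperID
raw = pd.read_csv("mag_paper_counts.tsv", sep="\t")      # columns: PaperID, CitationCount_raw, ReferenceCount_raw

# Re-count citations and references within the subset of primary papers.
new_citations = edges.groupby("CitedPaperID").size().rename("CitationCount_new")
new_references = edges.groupby("CitingPaperID").size().rename("ReferenceCount_new")

merged = (
    raw.set_index("PaperID")
       .join(new_citations)
       .join(new_references)
       .fillna(0)
)

# Rank correlation between raw MAG counts and redistributed SciSciNet counts.
rho_cites, _ = spearmanr(merged["CitationCount_raw"], merged["CitationCount_new"])
rho_refs, _ = spearmanr(merged["ReferenceCount_raw"], merged["ReferenceCount_new"])
print(rho_cites, rho_refs)
```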

We also examine the coverage of our publication data through cross-validation with an external dataset, Dimensions112. Using DOI as a standardized identifier, we find that the two databases contain a similar number of papers, with 106,517,016 papers in Dimensions and 98,795,857 papers in SciSciNet associated with unique DOIs. We further compare the overlap of the two databases, finding that they share the vast majority of papers (84,936,278 papers with common DOIs, accounting for 79.74% of Dimensions and 85.97% of SciSciNet).

Further, the citation information recorded by the two datasets appears highly consistent. Among the 84.9 M papers matched by common DOIs, SciSciNet records a similar, though slightly higher, number of citations on average (16.75) compared with Dimensions (14.64). Our comparison also reveals a high degree of consistency in paper-level citation counts between the two independent corpora, with a Spearman correlation coefficient of 0.946 and a concordance coefficient98,113 of 0.940. Together, these validations provide further support for the coverage of the data lake.
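For reference, a small sketch of Lin’s concordance correlation coefficient98,113 used in this comparison is shown below; in practice the two input vectors would be the DOI-matched paper-level citation counts from SciSciNet and Dimensions.

```python
import numpy as np

def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between two paired vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                  # population variances, as in Lin (1989)
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Toy example; real inputs would be the DOI-matched citation counts of the two sources.
print(concordance_ccc([1, 5, 10, 50], [2, 4, 12, 45]))
```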

Validation of external data linkages

We further perform additional cross-validation to assess the reliability of data linkages from scientific publications to external data sources. Here we focus on the NSF-SciSciNet publication linkages, which we constructed end to end from raw data collection to final data linkage, and we apply the same approach to validate the NIH-SciSciNet publication linkages.

Here we compare the distribution and coverage of paper-grants linkages between SciSciNet and Dimensions—one of the state-of-the-art commercial databases in publication-grant linkages112. Figure 6a,b present the distribution of the number of papers matched to each NSF award and NIH grant, showing that our open-source approach offers a comparable degree of coverage. We further perform individual grant level analysis, by comparing the number of papers matched to each grant reported by the two sources (Fig. 6c,d), again finding high degrees of consistency (Spearman correlation coefficient: 0.973 for NIH grants and 0.714 for NSF grants).

Fig. 6.

Validation of data linkages between SciSciNet and Dimensions. Panels (a,b), The distribution of number of papers matched to each NIH and NSF grant, respectively. Panels (c,d), The number of papers matched to each NIH and NSF grant, respectively. All panels are based on data in a 20-year period (2000–2020).

We further calculate confusion matrices for the linkages in SciSciNet and Dimensions. By connecting the two datasets through paper DOIs and NSF/NIH grant project numbers, we compare their overlaps and differences in grant-paper pairs. For NSF, the confusion matrix is shown in Table 23. The two datasets provide a similar level of coverage, with Dimensions containing 670,770 pairs and SciSciNet containing 632,568 pairs; 78.9% of the pairs in Dimensions (and 83.7% of the pairs in SciSciNet) can be found in the other dataset, documenting a high degree of consistency between the two sources. While some linkages appear in Dimensions but not in SciSciNet, a similar number of records appear in SciSciNet but not in Dimensions. Table 24 shows the confusion matrix of NIH grant-paper pairs between the two datasets. Again, the two datasets share the vast majority of grant-paper pairs, with 95.3% of the pairs in Dimensions (and 99.7% of the pairs in SciSciNet) also found in the other dataset. These validations further support the overall quality and coverage of data linkages in SciSciNet.
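The confusion matrices reduce each source to a set of (grant number, DOI) pairs and count the overlaps; a minimal sketch of that computation follows, with file and column names as illustrative assumptions.

```python
import pandas as pd

def pair_set(path):
    """Load a table of grant-paper links and return it as a set of (grant, DOI) pairs."""
    df = pd.read_csv(path, sep="\t")
    return set(zip(df["GrantNumber"], df["DOI"]))

sciscinet_pairs = pair_set("sciscinet_nsf_pairs.tsv")      # hypothetical export
dimensions_pairs = pair_set("dimensions_nsf_pairs.tsv")    # hypothetical export

both = len(sciscinet_pairs & dimensions_pairs)
only_dimensions = len(dimensions_pairs - sciscinet_pairs)
only_sciscinet = len(sciscinet_pairs - dimensions_pairs)

print(f"in both: {both}; only in Dimensions: {only_dimensions}; only in SciSciNet: {only_sciscinet}")
print(f"share of Dimensions pairs also in SciSciNet: {both / len(dimensions_pairs):.1%}")
print(f"share of SciSciNet pairs also in Dimensions: {both / len(sciscinet_pairs):.1%}")
```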

Table 23.

Confusion table of pairs of NSF grant-paper with DOI between SciSciNet and Dimensions.

NSF grant-paper pairs In SciSciNet Not in SciSciNet
In Dimensions 529,382 141,388
Not in Dimensions 103,186 —

Table 24.

Confusion table of pairs of NIH grant-paper with DOI between SciSciNet and Dimensions.

NIH grant-paper pairs In SciSciNet Not in SciSciNet
In Dimensions 5,356,652 264,119
Not in Dimensions 15,157 —

Validation of calculations of commonly used measurements

We also seek to validate the calculated metrics included in SciSciNet. In addition to manually inspecting independent data samples during data processing and presenting the corresponding distributions of indicators in the Descriptive statistics section, which capture general patterns, we further double-check the calculations of these popular measurements by reproducing canonical results in the science of science through a series of standardized and transparent procedures.

Disruption

For disruption scores, we plot the median disruption percentile and the average citations across team sizes for the 48,581,274 publications with at least one citation and one reference record in SciSciNet. As shown in Fig. 7a, as team size increases, the median disruption percentile decreases while the average citation count increases, consistent with the empirical finding that small teams disrupt whereas large teams develop20. In addition, the probability of being among the top 5% most disruptive publications is negatively correlated with team size, while the probability of being among the most impactful publications is positively correlated with team size (Fig. 7b). These results are consistent with those reported in the literature.
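For readers implementing the measure themselves, the sketch below computes the disruption index D = (n_i − n_j)/(n_i + n_j + n_k)20,51 for a single focal paper from a per-paper reference list; the toy citation graph is illustrative only.

```python
def disruption(focal, references_of):
    """Disruption index D = (n_i - n_j) / (n_i + n_j + n_k) for one focal paper.

    references_of maps each paper ID to the set of papers it references.
    """
    focal_refs = references_of.get(focal, set())
    n_i = n_j = n_k = 0
    for paper, refs in references_of.items():
        if paper == focal:
            continue
        cites_focal = focal in refs
        cites_focal_refs = bool(refs & focal_refs)
        if cites_focal and not cites_focal_refs:
            n_i += 1      # cites the focal paper but none of its references
        elif cites_focal and cites_focal_refs:
            n_j += 1      # cites both the focal paper and its references
        elif cites_focal_refs:
            n_k += 1      # cites the focal paper's references but not the paper itself
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else None

# Toy example: paper "F" references "A" and "B"; three later papers cite various subsets.
refs = {"F": {"A", "B"}, "P1": {"F"}, "P2": {"F", "A"}, "P3": {"B"}}
print(disruption("F", refs))   # (1 - 1) / (1 + 1 + 1) = 0.0
```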

Fig. 7.

Calculating commonly used measurements in the science of science literature. (a,b) Small teams disrupt while large teams develop in SciSciNet. (c) The cumulative distribution functions (CDFs) of the proportion of external citations for papers with high (top 10,000, B > 307.55), medium (from the 10,001st paper to the top 2% of SBs, 33 < B ≤ 307.55), and low (B ≤ 33) sleeping beauty indexes. (d) The probability of a 5% hit paper, conditional on novelty and conventionality, for all journal articles in SciSciNet from 1950 to 2000.

Novelty and conventionality

The combination of conventional wisdom and atypical knowledge tends to predict higher citation impact47. Here we repeat the original analysis by categorizing papers based on (1) median conventionality, i.e., whether the median Z-score of a paper is in the upper half, and (2) tail novelty, i.e., whether the paper is within the top 10th percentile of the novelty score. We then identify hit papers (within the subset of our analysis), defined as papers that rank in the top 5% of ten-year citations within the same top-level field and year. The four quadrants in Fig. 7d show that, among SciSciNet papers published from 1950 to 2000, papers with high median conventionality and high tail novelty present the highest hit rate, 7.32%. Papers with high median conventionality but low tail novelty show a hit rate of 4.18%, roughly similar to the baseline rate of 5%, while those with low median conventionality but high tail novelty display a hit rate of 6.48%. Meanwhile, papers with both low median conventionality and low tail novelty exhibit the lowest hit rate, 3.55%. These results are broadly consistent with the canonical results reported in the original study47.
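A minimal sketch of this four-quadrant analysis is given below, assuming a per-paper table with a median Z-score, a 10th-percentile Z-score, and a hit-paper flag (top 5% of ten-year citations within field and year). The file and column names are assumptions, and following the original study47, higher tail novelty corresponds to a lower (more atypical) 10th-percentile Z-score.

```python
import pandas as pd

# Hypothetical per-paper table with novelty/conventionality scores and a hit flag.
df = pd.read_csv("paper_novelty_conventionality.tsv", sep="\t")

# (1) Median conventionality: is the paper's median Z-score in the upper half?
df["high_conventionality"] = df["MedianZScore"] > df["MedianZScore"].median()

# (2) Tail novelty: is the paper in the top 10% of novelty, i.e., does its
#     10th-percentile Z-score fall in the lowest decile?
df["high_novelty"] = df["TenthPercentileZScore"] <= df["TenthPercentileZScore"].quantile(0.10)

# Hit rate in each of the four quadrants.
hit_rates = df.groupby(["high_conventionality", "high_novelty"])["IsHitPaper"].mean()
print(hit_rates)
```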

WSB model

In Fig. 5e, we select 36,802 physics papers published in the 1960s with more than 30 citations within 30 years of publication. Rescaling their citation dynamics using the fitted parameters produces a remarkable collapse of the rescaled citation histories, which appears robust across fields and decades. We further validate the predictive power of the model with priors, following Shen et al.92, by calculating the out-of-sample prediction accuracy. With a training period of 15 years, the prediction accuracy (defined using a strict absolute tolerance threshold of 0.1) stays above 0.65 for 10 years after the training period, with a Mean Absolute Percentage Error (MAPE) of less than 0.1; the MAPE remains below 0.15 for 20 years after the training period.
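The collapse in Fig. 5e follows from the functional form of the WSB model46, c_t = m[exp(λΦ((ln t − μ)/σ)) − 1], which implies that (1/λ)ln(1 + c_t/m) plotted against (ln t − μ)/σ falls on the standard normal CDF Φ. The sketch below illustrates this rescaling; the parameter values are illustrative, not fitted values from the data lake.

```python
import numpy as np
from scipy.stats import norm

m = 30.0  # number of prior references assumed in the original WSB model

def rescale(t, c_t, lam, mu, sigma):
    """Map (paper age, cumulative citations) into the collapsed coordinates (x, y)."""
    x = (np.log(t) - mu) / sigma
    y = np.log(1.0 + np.asarray(c_t) / m) / lam
    return x, y

# Toy check: a citation history generated from the model itself collapses exactly onto Phi(x).
t = np.linspace(0.5, 30.0, 60)            # paper age in years
lam, mu, sigma = 2.0, 1.5, 1.0            # illustrative parameters
c_t = m * (np.exp(lam * norm.cdf((np.log(t) - mu) / sigma)) - 1.0)

x, y = rescale(t, c_t, lam, mu, sigma)
print(np.allclose(y, norm.cdf(x)))        # True
```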

Sleeping beauty

We first fit the distribution of sleeping beauty coefficients in SciSciNet (Fig. 5d) to a power-law form using maximum likelihood estimation114, obtaining a power-law exponent α = 2.40 and a minimum value Bm = 23.59. Using the fine-grained subfield information provided by MAG, we further calculate the proportion of external citations for each paper. Consistent with the original study93, we find that papers with high B scores are more likely to receive a higher proportion of external citations from other fields (Fig. 7c).
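Both steps above can be sketched in a few lines: the sleeping beauty coefficient B of Ke et al.93 is computed from a paper’s yearly citation history as B = Σ_{t=0}^{t_m} [((c_{t_m} − c_0)/t_m · t + c_0 − c_t)/max(1, c_t)], and the resulting values are fitted with the maximum-likelihood estimator of Clauset et al.114 via the powerlaw Python package. The citation history and the fitted sample below are toy placeholders.

```python
import numpy as np
import powerlaw

def sleeping_beauty_B(yearly_citations):
    """B coefficient for a yearly citation history c_t, with t = 0 the publication year."""
    c = np.asarray(yearly_citations, dtype=float)
    t_m = int(np.argmax(c))                        # year of the citation peak
    if t_m == 0:
        return 0.0
    t = np.arange(t_m + 1)
    reference_line = (c[t_m] - c[0]) / t_m * t + c[0]
    return float(np.sum((reference_line - c[: t_m + 1]) / np.maximum(1.0, c[: t_m + 1])))

# Toy "sleeping beauty": almost no citations for years, then a late burst.
print(sleeping_beauty_B([0, 1, 0, 0, 1, 0, 2, 1, 3, 25]))

# Power-law fit of many B values (placeholders drawn with exponent ~2.4) by MLE.
b_values = np.random.pareto(1.4, size=10_000) + 1.0
fit = powerlaw.Fit(b_values)
print(fit.power_law.alpha, fit.power_law.xmin)
```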

Usage Notes

Note that, recognizing the recent surge of interest in the quantitative understanding of science95,97,98,115,116, the measurements currently covered in the data lake are not meant to be comprehensive; rather, they serve as examples to illustrate how researchers from the broader community can collectively contribute to and enrich the data lake. There are also limitations that readers should keep in mind when using the data. For example, our grant-publication linkage is focused on scientific papers supported by NSF and NIH; the patent-publication linkage is limited to citations from USPTO and EPO patents; the clinical trial-publication linkage is derived from ClinicalTrials.gov (where the geographical distribution may be heterogeneous across countries, Table 25); and the media-publication linkage is based on sources tracked by Crossref. Further, while our data linkages are based on state-of-the-art methods of data extraction and cleaning, as with any matching, the methods are necessarily imperfect and may be further improved through integration with complementary commercial products such as Altmetric and Dimensions. Finally, our data represent a static snapshot, drawing primarily from the final edition of MAG (Dec 2021 version). While this snapshot is already sufficient for answering many research questions that arise in the field, future work may continuously improve and update the data lake to maximize its potential.

Table 25.

Top 10 countries ranked by the number of clinical trials (left) and the number of clinical trials linked to scientific papers (right).

Country # Clinical Trials Country # Clinical Trials linked to papers
United States 153,632 United States 22,358
France 31,328 Canada 3,666
Canada 26,036 China 3,099
China 24,095 France 3,036
Germany 23,669 Italy 2,907
United Kingdom 22,304 Germany 2,703
Spain 17,454 United Kingdom 2,554
Italy 17,163 Spain 2,351
Korea, Republic of 13,213 Turkey 1,712
Belgium 12,182 Netherlands 1,456

Overall, this data lake serves as an initial step toward serving the community in studying publications, funding, and broader impact. At the same time, there are several promising directions for future work expanding the present effort. For example, the rapid development of natural language processing (NLP) models and techniques, together with the increasing availability of text from scientific articles, offers new opportunities to collect and curate more detailed content information. One could link SciSciNet to other sources such as OpenAlex or Semantic Scholar to analyze large-scale data on abstracts, full texts, or text-based embeddings. Such efforts will not only enrich the metadata associated with each paper but also enable more precise identification and linkage of bio/chemical entities studied in these papers117. Further, although platforms like MAG have implemented advanced algorithms for name disambiguation and topic/field classification at scale, these algorithms are inherently imperfect and not necessarily consistent across datasets; it is therefore essential to further validate and improve the accuracy of name disambiguation and topic classification118. Relatedly, in this paper we primarily focus on paper-level linkages across different datasets. Using these linkages as intermediary information, one can further construct and enrich individual-level profiles, combining researchers’ professional information (e.g., education background, grants, publications, and other broad impact) with important demographic dimensions (e.g., gender, age, race, and ethnicity). Finally, the data lake could contribute to an ecosystem for the collective science of science community. For example, there are synergies with the development of related programming packages, such as pySciSci119. By making the data lake fully open, we also hope it inspires other researchers to contribute to the data lake and enrich its coverage; when a research team publishes a new measure, for instance, they could release a data file that computes their measure based on SciSciNet, effectively adding a new column to the data lake. Lastly, science forms a complex social system and often offers an insightful lens on broader social science questions, suggesting that SciSciNet may see greater utility by benefiting adjacent fields such as computational social science120,121, network science122,123, complex systems124, and more125.

Supplementary information

SUPPLEMENTARY INFORMATION (22.5KB, docx)

Acknowledgements

The authors thank Alanna Lazarowich, Krisztina Eleki, Jiazhen Liu, Huawei Shen, Benjamin F. Jones, Brian Uzzi, Alex Gates, Daniel Larremore, YY Ahn, Lutz Bornmann, Ludo Waltman, Vincent Traag, Caroline Wagner, and all members of the Center for Science of Science and Innovation (CSSI) at Northwestern University for their help. This work is supported by the Air Force Office of Scientific Research under award numbers FA9550-17-1-0089 and FA9550-19-1-0354, National Science Foundation grant SBE 1829344, the Alfred P. Sloan Foundation G-2019-12485, and Peter G. Peterson Foundation 21048. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author contributions

D.W. and Y.Y. conceived the project and designed the experiments; Z.L. and Y.Y. collected the data; Z.L. performed data pre-processing, statistical analyses, and validation with help from Y.Y., L.L. and D.W.; Z.L., Y.Y. and D.W. wrote the manuscript; all authors edited the manuscript.

Code availability

The source code for data selection and curation, data linkage, and metrics calculation is available at https://github.com/kellogg-cssi/SciSciNet.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41597-023-02198-9.

References

  • 1.Liu, L., Jones, B. F., Uzzi, B. & Wang, D. Measurement and Empirical Methods in the Science of Science. Nature Human Behaviour, 10.1038/s41562-023-01562-4 (2023). [DOI] [PubMed]
  • 2.Fortunato S, et al. Science of science. Science. 2018;359:eaao0185. doi: 10.1126/science.aao0185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Wang, D. & Barabási, A.-L. The science of science. (Cambridge University Press, 2021).
  • 4.Zeng A, et al. The science of science: From the perspective of complex systems. Physics reports. 2017;714:1–73. doi: 10.1016/j.physrep.2017.10.001. [DOI] [Google Scholar]
  • 5.Azoulay P, et al. Toward a more scientific science. Science. 2018;361:1194–1197. doi: 10.1126/science.aav2484. [DOI] [PubMed] [Google Scholar]
  • 6.Clauset A, Larremore DB, Sinatra R. Data-driven predictions in the science of science. Science. 2017;355:477–480. doi: 10.1126/science.aal4217. [DOI] [PubMed] [Google Scholar]
  • 7.Liu L, Dehmamy N, Chown J, Giles CL, Wang D. Understanding the onset of hot streaks across artistic, cultural, and scientific careers. Nature communications. 2021;12:1–10. doi: 10.1038/s41467-021-25477-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Jones BF. The burden of knowledge and the “death of the renaissance man”: Is innovation getting harder? The Review of Economic Studies. 2009;76:283–317. doi: 10.1111/j.1467-937X.2008.00531.x. [DOI] [Google Scholar]
  • 9.Way SF, Morgan AC, Clauset A, Larremore DB. The misleading narrative of the canonical faculty productivity trajectory. Proceedings of the National Academy of Sciences. 2017;114:E9216–E9223. doi: 10.1073/pnas.1702121114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Jones BF, Weinberg BA. Age dynamics in scientific creativity. Proceedings of the National Academy of Sciences. 2011;108:18910–18914. doi: 10.1073/pnas.1102895108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Malmgren RD, Ottino JM, Amaral LAN. The role of mentorship in protege performance. Nature. 2010;465:622–U117. doi: 10.1038/nature09040. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Liénard JF, Achakulvisut T, Acuna DE, David SV. Intellectual synthesis in mentorship determines success in academic careers. Nature communications. 2018;9:1–13. doi: 10.1038/s41467-018-07034-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Petersen AM, et al. Reputation and Impact in Academic Careers. Proceedings of the National Academy of Science USA. 2014;111:15316–15321. doi: 10.1073/pnas.1323111111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Ma Y, Mukherjee S, Uzzi B. Mentorship and protégé success in STEM fields. Proceedings of the National Academy of Sciences. 2020;117:14077–14083. doi: 10.1073/pnas.1915516117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Sinatra, R., Wang, D., Deville, P., Song, C. M. & Barabasi, A. L. Quantifying the evolution of individual scientific impact. Science354 (2016). [DOI] [PubMed]
  • 16.Liu L, et al. Hot streaks in artistic, cultural, and scientific careers. Nature. 2018;559:396–399. doi: 10.1038/s41586-018-0315-8. [DOI] [PubMed] [Google Scholar]
  • 17.Larivière V, Ni C, Gingras Y, Cronin B, Sugimoto CR. Bibliometrics: Global gender disparities in science. Nature News. 2013;504:211. doi: 10.1038/504211a. [DOI] [PubMed] [Google Scholar]
  • 18.Sugimoto CR, et al. Scientists have most impact when they’re free to move. Nature. 2017;550:29–31. doi: 10.1038/550029a. [DOI] [PubMed] [Google Scholar]
  • 19.Wuchty S, Jones BF, Uzzi B. The increasing dominance of teams in production of knowledge. Science. 2007;316:1036–1039. doi: 10.1126/science.1136099. [DOI] [PubMed] [Google Scholar]
  • 20.Wu L, Wang D, Evans JA. Large teams develop and small teams disrupt science and technology. Nature. 2019;566:378–382. doi: 10.1038/s41586-019-0941-9. [DOI] [PubMed] [Google Scholar]
  • 21.Milojevic S. Principles of scientific research team formation and evolution. Proceedings of the National Academy of Sciences. 2014;111:3984–3989. doi: 10.1073/pnas.1309723111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Newman ME. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences. 2001;98:404–409. doi: 10.1073/pnas.98.2.404. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.AlShebli BK, Rahwan T, Woon WL. The preeminence of ethnic diversity in scientific collaboration. Nature communications. 2018;9:1–10. doi: 10.1038/s41467-018-07634-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Shen H-W, Barabási A-L. Collective credit allocation in science. Proceedings of the National Academy of Sciences. 2014;111:12325–12330. doi: 10.1073/pnas.1401992111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Leahey E. From Sole Investigator to Team Scientist: Trends in the Practice and Study of Research Collaboration. Annual Review of Sociology, Vol 42. 2016;42:81–100. doi: 10.1146/annurev-soc-081715-074219. [DOI] [Google Scholar]
  • 26.Clauset A, Arbesman S, Larremore DB. Systematic inequality and hierarchy in faculty hiring networks. Science advances. 2015;1:e1400005. doi: 10.1126/sciadv.1400005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Jones BF, Wuchty S, Uzzi B. Multi-university research teams: shifting impact, geography, and stratification in science. science. 2008;322:1259–1262. doi: 10.1126/science.1158357. [DOI] [PubMed] [Google Scholar]
  • 28.Deville, P. et al. Career on the move: Geography, stratification, and scientific impact. Scientific reports4 (2014). [DOI] [PMC free article] [PubMed]
  • 29.Chu, J. S. & Evans, J. A. Slowed canonical progress in large fields of science. Proceedings of the National Academy of Sciences118 (2021). [DOI] [PMC free article] [PubMed]
  • 30.Azoulay P, Fons-Rosen C, Graff Zivin JS. Does science advance one funeral at a time? American Economic Review. 2019;109:2889–2920. doi: 10.1257/aer.20161574. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Jin C, Ma Y, Uzzi B. Scientific prizes and the extraordinary growth of scientific topics. Nature communications. 2021;12:1–11. doi: 10.1038/s41467-021-25712-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Nagaraj A, Shears E, de Vaan M. Improving data access democratizes and diversifies science. Proceedings of the National Academy of Sciences. 2020;117:23490–23498. doi: 10.1073/pnas.2001682117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Evans JA, Reimer J. Open access and global participation in science. Science. 2009;323:1025–1025. doi: 10.1126/science.1154562. [DOI] [PubMed] [Google Scholar]
  • 34.Peng H, Ke Q, Budak C, Romero DM, Ahn Y-Y. Neural embeddings of scholarly periodicals reveal complex disciplinary organizations. Science Advances. 2021;7:eabb9004. doi: 10.1126/sciadv.abb9004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Ahmadpoor M, Jones BF. The dual frontier: Patented inventions and prior scientific advance. Science. 2017;357:583–587. doi: 10.1126/science.aam9527. [DOI] [PubMed] [Google Scholar]
  • 36.Yin Y, Gao J, Jones BF, Wang D. Coevolution of policy and science during the pandemic. Science. 2021;371:128–130. doi: 10.1126/science.abe3084. [DOI] [PubMed] [Google Scholar]
  • 37.Ding WW, Murray F, Stuart TE. Gender differences in patenting in the academic life sciences. Science. 2006;313:665–667. doi: 10.1126/science.1124832. [DOI] [PubMed] [Google Scholar]
  • 38.Bromham L, Dinnage R, Hua X. Interdisciplinary research has consistently lower funding success. Nature. 2016;534:684. doi: 10.1038/nature18315. [DOI] [PubMed] [Google Scholar]
  • 39.Larivière V, Vignola-Gagné E, Villeneuve C, Gélinas P, Gingras Y. Sex differences in research funding, productivity and impact: an analysis of Québec university professors. Scientometrics. 2011;87:483–498. doi: 10.1007/s11192-011-0369-y. [DOI] [Google Scholar]
  • 40.Li D, Azoulay P, Sampat BN. The applied value of public investments in biomedical research. Science. 2017;356:78–81. doi: 10.1126/science.aal0010. [DOI] [PubMed] [Google Scholar]
  • 41.Fleming L, Greene H, Li G, Marx M, Yao D. Government-funded research increasingly fuels innovation. Science. 2019;364:1139–1141. doi: 10.1126/science.aaw2373. [DOI] [PubMed] [Google Scholar]
  • 42.Lazer DM, et al. The science of fake news. Science. 2018;359:1094–1096. doi: 10.1126/science.aao2998. [DOI] [PubMed] [Google Scholar]
  • 43.Scheufele DA, Krause NM. Science audiences, misinformation, and fake news. Proceedings of the National Academy of Sciences. 2019;116:7662–7669. doi: 10.1073/pnas.1805871115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Kreps SE, Kriner DL. Model uncertainty, political contestation, and public trust in science: Evidence from the COVID-19 pandemic. Science advances. 2020;6:eabd4563. doi: 10.1126/sciadv.abd4563. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Myers KR, et al. Unequal effects of the COVID-19 pandemic on scientists. Nature Human Behaviour. 2020 doi: 10.1038/s41562-020-0921-y. [DOI] [PubMed] [Google Scholar]
  • 46.Wang DS, Song CM, Barabasi AL. Quantifying Long-Term Scientific Impact. Science. 2013;342:127–132. doi: 10.1126/science.1237825. [DOI] [PubMed] [Google Scholar]
  • 47.Uzzi B, Mukherjee S, Stringer M, Jones B. Atypical combinations and scientific impact. Science. 2013;342:468–472. doi: 10.1126/science.1240474. [DOI] [PubMed] [Google Scholar]
  • 48.Radicchi F, Fortunato S, Castellano C. Universality of citation distributions: Toward an objective measure of scientific impact. Proceedings of the National Academy of Sciences. 2008;105:17268–17272. doi: 10.1073/pnas.0806977105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.de Solla Price DJ. Networks of Scientific Papers. Science. 1965;149:510–515. doi: 10.1126/science.149.3683.510. [DOI] [PubMed] [Google Scholar]
  • 50.Price DdS. A general theory of bibliometric and other cumulative advantage processes. Journal of the American society for Information science. 1976;27:292–306. doi: 10.1002/asi.4630270505. [DOI] [Google Scholar]
  • 51.Funk RJ, Owen-Smith J. A Dynamic Network Measure of Technological Change. Management Science. 2017;63:791–817. doi: 10.1287/mnsc.2015.2366. [DOI] [Google Scholar]
  • 52.Thelwall, M., Haustein, S., Larivière, V. & Sugimoto, C. R. Do altmetrics work? Twitter and ten other social web services. PloS one8 (2013). [DOI] [PMC free article] [PubMed]
  • 53.Wang, R. et al. in Proceedings of the 27th ACM International Conference on Information and Knowledge Management 1487–1490 (Association for Computing Machinery, Torino, Italy, 2018).
  • 54.Tan, Z. et al. in Proceedings of the 25th International Conference Companion on World Wide Web 437–442 (International World Wide Web Conferences Steering Committee, Montréal, Québec, Canada, 2016).
  • 55.Yin Y, Dong Y, Wang K, Wang D, Jones BF. Public use and public funding of science. Nature Human Behaviour. 2022 doi: 10.1038/s41562-022-01397-5. [DOI] [PubMed] [Google Scholar]
  • 56.Wu J, et al. CiteSeerX: AI in a Digital Library Search Engine. AI Magazine. 2015;36:35–48. doi: 10.1609/aimag.v36i3.2601. [DOI] [Google Scholar]
  • 57.Wan H, Zhang Y, Zhang J, Tang J. AMiner: Search and Mining of Academic Social Networks. Data Intelligence. 2019;1:58–76. doi: 10.1162/dint_a_00006. [DOI] [Google Scholar]
  • 58.Zhang, Y., Zhang, F., Yao, P. & Tang, J. in Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1002–1011.
  • 59.Hendricks G, Tkaczyk D, Lin J, Feeney P. Crossref: The sustainable source of community-owned scholarly metadata. Quantitative Science Studies. 2020;1:414–427. doi: 10.1162/qss_a_00022. [DOI] [Google Scholar]
  • 60.Priem, J., Piwowar, H. & Orr, R. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833 (2022).
  • 61.Sinha, A. et al. in Proceedings of the 24th International Conference on World Wide Web 243–246 (Association for Computing Machinery, Florence, Italy, 2015).
  • 62.Wang K, et al. A Review of Microsoft Academic Services for Science of Science Studies. Frontiers in Big Data. 2019;2:45. doi: 10.3389/fdata.2019.00045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Wang K, et al. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies. 2020;1:396–413. doi: 10.1162/qss_a_00021. [DOI] [Google Scholar]
  • 64.Pinski G, Narin F. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information processing & management. 1976;12:297–312. doi: 10.1016/0306-4573(76)90048-0. [DOI] [Google Scholar]
  • 65.Carpenter MP, Cooper M, Narin F. Linkage between basic research literature and patents. Research Management. 1980;23:30–35. doi: 10.1080/00345334.1980.11756595. [DOI] [Google Scholar]
  • 66.Narin F, Hamilton KS, Olivastro D. The increasing linkage between US technology and public science. Research policy. 1997;26:317–330. doi: 10.1016/S0048-7333(97)00013-9. [DOI] [Google Scholar]
  • 67.Marx M, Fuegi A. Reliance on science: Worldwide front‐page patent citations to scientific articles. Strategic Management Journal. 2020;41:1572–1594. doi: 10.1002/smj.3145. [DOI] [Google Scholar]
  • 68.Marx, M. & Fuegi, A. Reliance on science by inventors: Hybrid extraction of in‐text patent‐to‐article citations. Journal of Economics & Management Strategy (2020).
  • 69.de Solla Price, D. Little science, big science. (Columbia University Press, 1963).
  • 70.Sinatra R, Deville P, Szell M, Wang D, Barabási A-L. A century of physics. Nature Physics. 2015;11:791–796. doi: 10.1038/nphys3494. [DOI] [Google Scholar]
  • 71.de Solla Price, D. Science since babylon. (Yale University Press, 1961).
  • 72.Lin Z, Yin Y, Liu L, Wang D. 2022. SciSciNet: A large-scale open data lake for the science of science research. Figshare. [DOI] [PMC free article] [PubMed]
  • 73.Microsoft Academic. Microsoft Academic Graph. Zenodo, 10.5281/zenodo.6511057 (2022).
  • 74.Smalheiser NR, Torvik VI. Author name disambiguation. Annual review of information science and technology. 2009;43:1–43. doi: 10.1002/aris.2009.1440430113. [DOI] [Google Scholar]
  • 75.Tang J, Fong AC, Wang B, Zhang J. A unified probabilistic framework for name disambiguation in digital library. IEEE Transactions on Knowledge and Data Engineering. 2011;24:975–987. doi: 10.1109/TKDE.2011.13. [DOI] [Google Scholar]
  • 76.Ferreira AA, Gonçalves MA, Laender AH. A brief survey of automatic methods for author name disambiguation. Acm Sigmod Record. 2012;41:15–26. doi: 10.1145/2350036.2350040. [DOI] [Google Scholar]
  • 77.Sanyal DK, Bhowmick PK, Das PP. A review of author name disambiguation techniques for the PubMed bibliographic database. Journal of Information Science. 2021;47:227–254. doi: 10.1177/0165551519888605. [DOI] [Google Scholar]
  • 78.Morrison G, Riccaboni M, Pammolli F. Disambiguation of patent inventors and assignees using high-resolution geolocation data. Scientific data. 2017;4:1–21. doi: 10.1038/sdata.2017.64. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Tekles A, Bornmann L. Author name disambiguation of bibliometric data: A comparison of several unsupervised approaches1. Quantitative Science Studies. 2020;1:1510–1528. doi: 10.1162/qss_a_00081. [DOI] [Google Scholar]
  • 80.Van Buskirk, I., Clauset, A. & Larremore, D. B. An Open-Source Cultural Consensus Approach to Name-Based Gender Classification. arXiv preprint arXiv:2208.01714 (2022).
  • 81.Cleary EG, Beierlein JM, Khanuja NS, McNamee LM, Ledley FD. Contribution of NIH funding to new drug approvals 2010–2016. Proceedings of the National Academy of Sciences. 2018;115:2329–2334. doi: 10.1073/pnas.1715368115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Packalen M, Bhattacharya J. NIH funding and the pursuit of edge science. Proceedings of the National Academy of Sciences. 2020;117:12011–12016. doi: 10.1073/pnas.1910160117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Wang Y, Jones BF, Wang D. Early-career setback and future career impact. Nature communications. 2019;10:1–10. doi: 10.1038/s41467-019-12189-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Hechtman LA, et al. NIH funding longevity by gender. Proceedings of the National Academy of Sciences. 2018;115:7943–7948. doi: 10.1073/pnas.1800615115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Agrawal A, Henderson R. Putting patents in context: Exploring knowledge transfer from MIT. Management science. 2002;48:44–60. doi: 10.1287/mnsc.48.1.44.14279. [DOI] [Google Scholar]
  • 86.Bekkers R, Freitas IMB. Analysing knowledge transfer channels between universities and industry: To what degree do sectors also matter? Research policy. 2008;37:1837–1853. doi: 10.1016/j.respol.2008.07.007. [DOI] [Google Scholar]
  • 87.Owen-Smith J, Powell WW. To patent or not: Faculty decisions and institutional success at technology transfer. The Journal of Technology Transfer. 2001;26:99–114. doi: 10.1023/A:1007892413701. [DOI] [Google Scholar]
  • 88.Mowery DC, Shane S. Introduction to the special issue on university entrepreneurship and technology transfer. Management Science. 2002;48:v–ix. doi: 10.1287/mnsc.48.1.0.14277. [DOI] [Google Scholar]
  • 89.Williams RS, Lotia S, Holloway AK, Pico AR. From Scientific Discovery to Cures: Bright Stars within a Galaxy. Cell. 2015;163:21–23. doi: 10.1016/j.cell.2015.09.007. [DOI] [PubMed] [Google Scholar]
  • 90.Hmielowski JD, Feldman L, Myers TA, Leiserowitz A, Maibach E. An attack on science? Media use, trust in scientists, and perceptions of global warming. Public Understanding of Science. 2014;23:866–883. doi: 10.1177/0963662513480091. [DOI] [PubMed] [Google Scholar]
  • 91.Li J, Yin Y, Fortunato S, Wang D. A dataset of publication records for Nobel laureates. Scientific data. 2019;6:33. doi: 10.1038/s41597-019-0033-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Shen, H., Wang, D., Song, C. & Barabási, A.-L. in Proceedings of the AAAI Conference on Artificial Intelligence.
  • 93.Ke, Q., Ferrara, E., Radicchi, F. & Flammini, A. Defining and identifying Sleeping Beauties in science. Proceedings of the National Academy of Sciences, 201424329 (2015). [DOI] [PMC free article] [PubMed]
  • 94.Hirsch JE. An index to quantify an individual’s scientific research output. Proceedings of the National academy of Sciences of the United States of America. 2005;102:16569–16572. doi: 10.1073/pnas.0507655102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Waltman L, Boyack KW, Colavizza G, van Eck NJ. A principled methodology for comparing relatedness measures for clustering publications. Quantitative Science Studies. 2020;1:691–713. doi: 10.1162/qss_a_00035. [DOI] [Google Scholar]
  • 96.Santamaría L, Mihaljević H. Comparison and benchmark of name-to-gender inference services. PeerJ Computer Science. 2018;4:e156. doi: 10.7717/peerj-cs.156. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Bornmann L, Williams R. An evaluation of percentile measures of citation impact, and a proposal for making them better. Scientometrics. 2020;124:1457–1478. doi: 10.1007/s11192-020-03512-7. [DOI] [Google Scholar]
  • 98.Haunschild R, Daniels AD, Bornmann L. Scores of a specific field-normalized indicator calculated with different approaches of field-categorization: Are the scores different or similar? Journal of Informetrics. 2022;16:101241. doi: 10.1016/j.joi.2021.101241. [DOI] [Google Scholar]
  • 99.Yin Y, Wang D. The time dimension of science: Connecting the past to the future. Journal of Informetrics. 2017;11:608–621. doi: 10.1016/j.joi.2017.04.002. [DOI] [Google Scholar]
  • 100.Stringer MJ, Sales-Pardo M, Amaral LAN. Statistical validation of a global model for the distribution of the ultimate number of citations accrued by papers published in a scientific journal. Journal of the American Society for Information Science and Technology. 2010;61:1377–1385. doi: 10.1002/asi.21335. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Bornmann L, Daniel H-D. What do we know about the h index? Journal of the American Society for Information Science and Technology. 2007;58:1381–1385. doi: 10.1002/asi.20609. [DOI] [Google Scholar]
  • 102.Li J, Yin Y, Fortunato S, Wang D. Nobel laureates are almost the same as us. Nature Reviews Physics. 2019;1:301. doi: 10.1038/s42254-019-0057-z. [DOI] [Google Scholar]
  • 103.Abramo G, D’Angelo C, Caprasecca A. Gender differences in research productivity: A bibliometric analysis of the Italian academic system. Scientometrics. 2009;79:517–539. doi: 10.1007/s11192-007-2046-8. [DOI] [Google Scholar]
  • 104.Huang J, Gates AJ, Sinatra R, Barabási A-L. Historical comparison of gender inequality in scientific careers across countries and disciplines. Proceedings of the National Academy of Sciences. 2020;117:4609–4616. doi: 10.1073/pnas.1914221117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105.Dworkin JD, et al. The extent and drivers of gender imbalance in neuroscience reference lists. Nature neuroscience. 2020;23:918–926. doi: 10.1038/s41593-020-0658-y. [DOI] [PubMed] [Google Scholar]
  • 106.Squazzoni F, et al. Peer review and gender bias: A study on 145 scholarly journals. Science advances. 2021;7:eabd0299. doi: 10.1126/sciadv.abd0299. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Yang Y, Tian TY, Woodruff TK, Jones BF, Uzzi B. Gender-diverse teams produce more novel and higher-impact scientific ideas. Proceedings of the National Academy of Sciences. 2022;119:e2200841119. doi: 10.1073/pnas.2200841119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Squazzoni, F. et al. Only second-class tickets for women in the COVID-19 race. A study on manuscript submissions and reviews in 2329 Elsevier journals. (2020).
  • 109.Vincent-Lamarre, P., Sugimoto, C. R. & Larivière, V. The decline of women’s research production during the coronavirus pandemic. Nature index19 (2020).
  • 110.Staniscuaski F, et al. Gender, race and parenthood impact academic productivity during the COVID-19 pandemic: from survey to action. Frontiers in psychology. 2021;12:663252. doi: 10.3389/fpsyg.2021.663252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 111.Fink JK. Hereditary spastic paraplegia. Neurologic Clinics. 2002;20:711–726. doi: 10.1016/S0733-8619(02)00007-5. [DOI] [PubMed] [Google Scholar]
  • 112.Herzog C, Hook D, Konkiel S. Dimensions: Bringing down barriers between scientometricians and data. Quantitative Science Studies. 2020;1:387–395. doi: 10.1162/qss_a_00020. [DOI] [Google Scholar]
  • 113.Lawrence, I. & Lin, K. A concordance correlation coefficient to evaluate reproducibility. Biometrics, 255–268 (1989). [PubMed]
  • 114.Clauset A, Shalizi CR, Newman ME. Power-law distributions in empirical data. SIAM review. 2009;51:661–703. doi: 10.1137/070710111. [DOI] [Google Scholar]
  • 115.Bornmann L, Wohlrabe K. Normalisation of citation impact in economics. Scientometrics. 2019;120:841–884. doi: 10.1007/s11192-019-03140-w. [DOI] [Google Scholar]
  • 116.van Eck NJ, Waltman L. Citation-based clustering of publications using CitNetExplorer and VOSviewer. Scientometrics. 2017;111:1053–1070. doi: 10.1007/s11192-017-2300-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 117.Xu J, et al. Building a PubMed knowledge graph. Scientific Data. 2020;7:205. doi: 10.1038/s41597-020-0543-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 118.Torvik VI, Smalheiser NR. Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD) 2009;3:1–29. doi: 10.1145/1552303.1552304. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 119.Reproducible Science of Science at scale: pySciSci. Quantitative Science Studies 1–17, 10.1162/qss_a_00260.
  • 120.Lazer DM, et al. Computational social science: Obstacles and opportunities. Science. 2020;369:1060–1062. doi: 10.1126/science.aaz8170. [DOI] [PubMed] [Google Scholar]
  • 121.Lazer D, et al. Computational social science. Science. 2009;323:721–723. doi: 10.1126/science.1167742. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 122.Barabási, A.-L. Network science. (Cambridge University, 2015).
  • 123.Newman, M. Networks: an introduction. (Oxford University Press, 2010).
  • 124.Castellano C, Fortunato S, Loreto V. Statistical physics of social dynamics. Reviews of modern physics. 2009;81:591. doi: 10.1103/RevModPhys.81.591. [DOI] [Google Scholar]
  • 125.Dong, Y., Ma, H., Shen, Z. & Wang, K. in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1437–1446 (ACM).


