The OpenScience Slovenia metadata dataset

Mladen Borovič; Marko Ferme; Janez Brezovnik; Sandi Majninger; Albin Bregant; Goran Hrovat; Milan Ojsteršek

Data in Brief. 2019 Dec 5;28:104942. doi: 10.1016/j.dib.2019.104942

PMCID: PMC6928342 PMID: 31890793

Abstract

The OpenScience Slovenia metadata dataset contains metadata entries for Slovenian public domain academic documents, including undergraduate and postgraduate theses, research and professional articles, and other academic document types. The data was collected as part of the establishment of the Slovenian Open-Access Infrastructure, which defined a unified process for collecting and cataloguing documents from Slovenian universities in the infrastructure repositories. The data was collected from several established but separate library systems in Slovenia and merged into a single metadata scheme using metadata deduplication and merging techniques. It consists of text and numerical fields representing attributes that describe documents. These attributes include document titles, keywords, abstracts, typologies, authors, issue years and identifiers such as URL and UDC. The dataset is particularly suited to text mining and text classification tasks and can also be used to develop or benchmark content-based recommender systems on real-world data.

Keywords: Metadata, Real-world data, Text data, Text mining, Text classification, Natural language processing

Specifications Table

Subject	Computer Science Applications
Specific subject area	Text mining, text classification, natural language processing
Type of data	Table
How data were acquired	Document metadata was acquired by fetching available public domain documents in academic digital library systems and document repositories in Slovenia
Data format	Raw, filtered
Parameters for data collection	Document metadata was filtered to publications that are in the public domain, are electronically available and have non-empty values for the title, keywords, publication year, UDC, typology and organization fields. Where available, certain metadata columns were also included in an alternative language. Metadata columns from different schemas (COMARC/XML, OAI-PMH, Dublin Core, ePrints) were merged where appropriate.
Description of data collection	Document metadata was collected from the document repositories included in the Slovenian open-access infrastructure. The metadata, stored in different schemas (COMARC/XML, OAI-PMH, Dublin Core, ePrints), was processed, and duplicates were merged using matching criteria based on titles, abstracts, publication years, authors and other identifiers (URL, UDC). The document metadata schema was extended accordingly.
Data source location	Document metadata was collected at University of Maribor in Maribor, Slovenia (latitude 46.559244, longitude 15.642718)
Data accessibility	Data is publicly accessible via the Mendeley Data repository in JSON and CSV formats. Repository name: Mendeley Data. Direct URL to data: https://doi.org/10.17632/7wh9xvvmgk.1
Related research article	Milan Ojsteršek, Janez Brezovnik, Mojca Kotar, Marko Ferme, Goran Hrovat, Albin Bregant, Mladen Borovič. Establishing of a Slovenian open access infrastructure: a technical point of view. Program: electronic library and information systems, 2014, 48 (4), 384–412. https://doi.org/10.1108/PROG-02-2014-0005

Value of the Data

• The data consists of metadata fields which represent segmented information and features of publicly published academic documents in Slovenia.
• The dataset can be used to test categorization and clustering approaches and techniques used in text mining and text similarity detection as well as content-based recommendation system approaches.
• The data can be used in future studies for examination of document categorization or clustering techniques and content-based approaches to recommendations.
• The data includes the UDC (Universal Decimal Classification) field which is determined manually by experts (in this case trained librarians) and can be used in machine learning approaches in UDC prediction or recommendation scenarios for new documents.
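
As one possible starting point for the UDC prediction scenario mentioned above, a coarse class label can be derived from each UDC value. The sketch below assumes that the UDC field holds a notation whose main number comes first and begins with a digit (the UDC main classes are numbered 0 to 9). The function name udc_main_class and the sample values are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from typing import Optional


def udc_main_class(udc: str) -> Optional[str]:
    """Return the leading digit of a UDC notation, or None if it has none.

    Assumption: the main number comes first, e.g. "004.8" or "33:004".
    Notations that start with an auxiliary such as "(497.4)" yield None.
    """
    text = udc.strip()
    if text and text[0] in "0123456789":
        return text[0]
    return None


# Hypothetical UDC values, for illustration only.
sample_udc = ["004.8", "33:004", "821.163.6", "(497.4)", ""]

# One coarse label per document; None marks values that need manual handling.
labels = [udc_main_class(value) for value in sample_udc]

# Class distribution, useful for checking label balance before training.
label_counts = Counter(label for label in labels if label is not None)
```

A label this coarse discards most of the detail that librarians encode in a UDC notation, so it is better suited to a first baseline than to a final prediction target.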

1. Data

The OpenScience Slovenia metadata dataset comprises metadata and author data for diploma, master's and doctoral theses and research publications of Slovenian universities and research institutions. The data is available in CSV and JSON formats. The dataset consists of four tables: metadata, authors, organizations and typology. The metadata table consists of the following attributes:

Attribute	Description/remark
ID	Metadata identifier; table linking attribute
UDC	Universal Decimal Classification
Typology	Typology according to the COBISS.SI system; table linking attribute
Language	Primary language of the document
Title	Title in the primary language
Subtitle	Subtitle in the primary language (can be missing)
Abstract	Abstract in the primary language (can be missing)
Keywords	Keywords in the primary language (semicolon “; ” separated)
AlternativeLanguage	Alternative language of the document (can be missing)
AlternativeTitle	Title in the alternative language (can be missing)
AlternativeSubtitle	Subtitle in the alternative language (can be missing)
AlternativeAbstract	Abstract in the alternative language (can be missing)
AlternativeKeywords	Keywords in the alternative language (semicolon “; ” separated, can be missing)
IssueYear	Year of issue/publication
URL	URL link to the publicly available file (links can be dead)
OrganizationID_COBISS	Organization identifier according to the COBISS.SI system (https://plus.cobiss.si/opac7/help/cobib/fc/codelist)
OrganizationID_eVS	Organization identifier according to the eVŠ (https://www.gov.si/assets/ministrstva/MIZS/Dokumenti/Visoko-solstvo/eVS-evidenca-VSZ-in-SP/eVS_VSZ_31052019.xlsx); table linking attribute
COBISSID	COBISS.SI system identifier; can be used to query the COBISS.SI system or to link metadata entries in this dataset with the KAS dataset
NRID	National repository identifier; can be used to query the OpenScience Open Data web API
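
The attribute list above can be read with a standard CSV parser. The following is a minimal sketch that assumes a comma-delimited file with a header row matching the attribute names; the actual delimiter and file names in the repository may differ, and the two sample rows are invented for illustration. It splits both keyword columns on semicolons and treats a missing alternative-language value as an empty list.

```python
import csv
import io

# Hypothetical two-row excerpt; the real delimiter and values may differ.
SAMPLE_CSV = (
    "ID,Language,Title,Keywords,AlternativeKeywords,IssueYear\n"
    '1,slv,Naslov A,"rudarjenje besedil; klasifikacija",'
    '"text mining; classification",2015\n'
    "2,eng,Title B,metadata; repositories,,2017\n"
)


def split_keywords(field):
    """Split a semicolon-separated keyword field into a clean list."""
    if not field:
        return []
    return [part.strip() for part in field.split(";") if part.strip()]


def read_metadata(handle):
    """Read metadata rows and parse the keyword and year columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["Keywords"] = split_keywords(row.get("Keywords"))
        row["AlternativeKeywords"] = split_keywords(row.get("AlternativeKeywords"))
        row["IssueYear"] = int(row["IssueYear"])
        rows.append(row)
    return rows


# An in-memory buffer stands in for an opened CSV file.
records = read_metadata(io.StringIO(SAMPLE_CSV))
```

To read a downloaded file instead, the same function could be called with a handle from open(path, newline="", encoding="utf-8"), where the encoding is also an assumption.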

The authors table consists of the following attributes:

Attribute	Description/remark
MetadataID	Metadata identifier; table linking attribute
AuthorID	Author identifier
Name	Name of the author
Surname	Surname of the author

The organizations table consists of the following attributes:

Attribute	Description/remark
eVSID	eVŠ identifier; table linking attribute
Abbreviation	Organization abbreviation
Name	Organization name in Slovenian
EnglishName	Organization name in English

The typology table consists of the following attributes:

Attribute	Description/remark
Typology	Typology designation according to the COBISS.SI system (https://plus.cobiss.si/opac7/help/cobib/td/codelist); table linking attribute
Language	Language of the typology designation description (can be either “slv” for Slovenian or “eng” for English)
Description	Typology designation description
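
The four tables are connected through their table linking attributes: ID and MetadataID, OrganizationID_eVS and eVSID, and Typology in both the metadata and typology tables. The sketch below shows one way to join them using plain dictionaries. The function name join_tables and all row values are hypothetical; only the attribute names come from the tables above.

```python
from collections import defaultdict

# Hypothetical rows shaped like the four tables; all values are invented.
metadata = [
    {"ID": "1", "Title": "Title A", "Typology": "2.11", "OrganizationID_eVS": "10"},
]
authors = [
    {"MetadataID": "1", "AuthorID": "100", "Name": "Ana", "Surname": "Novak"},
    {"MetadataID": "1", "AuthorID": "101", "Name": "Jan", "Surname": "Kovač"},
]
organizations = [
    {"eVSID": "10", "Abbreviation": "UM", "Name": "Univerza v Mariboru",
     "EnglishName": "University of Maribor"},
]
typology = [
    {"Typology": "2.11", "Language": "eng", "Description": "Undergraduate thesis"},
    {"Typology": "2.11", "Language": "slv", "Description": "Diplomsko delo"},
]


def join_tables(metadata, authors, organizations, typology, language="eng"):
    """Attach author names, organization name and typology text to each entry."""
    authors_by_doc = defaultdict(list)
    for author in authors:
        full_name = f'{author["Name"]} {author["Surname"]}'
        authors_by_doc[author["MetadataID"]].append(full_name)

    org_by_id = {org["eVSID"]: org for org in organizations}

    # The typology table holds one row per language, so filter before indexing.
    description_by_code = {
        row["Typology"]: row["Description"]
        for row in typology
        if row["Language"] == language
    }

    joined = []
    for doc in metadata:
        org = org_by_id.get(doc["OrganizationID_eVS"])
        joined.append({
            **doc,
            "Authors": authors_by_doc.get(doc["ID"], []),
            "OrganizationName": org["EnglishName"] if org else None,
            "TypologyDescription": description_by_code.get(doc["Typology"]),
        })
    return joined


documents = join_tables(metadata, authors, organizations, typology)
```

Passing language="slv" selects the Slovenian typology descriptions instead of the English ones.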

2. Experimental design and methods

As project partners in establishing the Slovenian Open-Access Infrastructure [1], all universities in Slovenia and other included college, higher education and research institutions agreed to provide us with access to their existing digital library systems in order to obtain documents and their metadata. The data collection process consisted of three phases: the existing repository analysis, document and metadata acquisition, and metadata deduplication and enrichment.

2.1. Existing repository analysis

The first phase was an analysis of the institutions to be included in the Slovenian Open-Access Infrastructure and of the capabilities of their digital library systems. We analysed which types of documents are stored in these systems and the existing options for data and metadata transfer. The analysis covered digital library systems such as COBISS.SI [2], ePrints and dLib.si, together with their data and metadata transfer services: SRU (Search/Retrieve via URL), SRW (Search/Retrieve Web Service) and OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). The formats used were COMARC/B XML and Dublin Core.

2.2. Document and metadata acquisition

The second phase was the acquisition of documents and their metadata from the analysed digital library systems. First, the document metadata was acquired using services in the existing digital library systems. SRU and SRW services were used for the COBISS.SI system, while OAI-PMH services were used for systems based on ePrints and dLib.si. The metadata from the COBISS.SI system was retrieved in COMARC/B XML format, and the metadata acquired via OAI-PMH services was retrieved in Dublin Core format. The documents were then acquired using identifiers found in the metadata, with the majority of documents being publicly available online in PDF format. URL, URN and DOI identifiers were used to resolve the online source of each document, which was then downloaded to the Slovenian Open-Access Infrastructure servers. Documents that were not publicly available online were either made publicly available in one or more of the analysed digital library systems or transferred to our servers using custom-built software after being provided to us on FTP servers and storage devices.
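
To illustrate the OAI-PMH part of the acquisition step, the sketch below builds a ListRecords request for the oai_dc metadata format and extracts a few Dublin Core fields from a response. It is not the software used by the authors. The base URL, the helper names and the sample response are hypothetical; the verb, the metadataPrefix parameter and the XML namespaces follow the OAI-PMH 2.0 and Dublin Core specifications. No network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Extract selected Dublin Core fields from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall(".//oai:record", NS):
        dc = record.find(".//oai_dc:dc", NS)
        if dc is None:  # for example, deleted records carry no metadata
            continue
        records.append({
            "oai_identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": dc.findtext("dc:title", default="", namespaces=NS),
            "creators": [el.text for el in dc.findall("dc:creator", NS)],
            "date": dc.findtext("dc:date", default="", namespaces=NS),
            "identifiers": [el.text for el in dc.findall("dc:identifier", NS)],
        })
    return records


# Hypothetical response from a hypothetical repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example thesis</dc:title>
          <dc:creator>Novak, Ana</dc:creator>
          <dc:date>2015</dc:date>
          <dc:identifier>https://repository.example.org/files/1.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

request_url = list_records_url("https://repository.example.org/oai")
harvested = parse_records(SAMPLE_RESPONSE)
```

A full harvester would also need to follow resumptionToken elements to page through large result sets, which this sketch omits.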

2.3. Metadata deduplication and enrichment

The document metadata was stored in different schemas, depending on its source. In the third phase, we developed custom software and used it to resolve duplicate metadata entries describing the same document. The same software was used to detect fields missing from a metadata entry in one schema and to fill them with values from the corresponding entries in other schemas. The matching criteria during the merging process were based on titles, abstracts, publication years, authors and other identifiers (URL, UDC). In the case of duplicate metadata entries, fields obtained from COBISS.SI services were deemed to be of higher quality and were considered final, while metadata from other sources was used to fill in missing fields.
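
The merging rule described above can be sketched as follows. This is a simplified illustration and not the authors' software: it matches entries only on a normalized title and the issue year, whereas the original process also used abstracts, authors and other identifiers. The "source" field, the helper names and the sample entries are hypothetical.

```python
import unicodedata


def normalize(text):
    """Lowercase, strip diacritics and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text or "")
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


def match_key(entry):
    """Simplified matching criterion: normalized title plus issue year."""
    return (normalize(entry.get("Title")), str(entry.get("IssueYear", "")).strip())


def merge_group(entries):
    """Merge duplicates; COBISS.SI values win, other sources fill the gaps."""
    # False sorts before True, so COBISS.SI entries are processed first.
    ordered = sorted(entries, key=lambda e: e.get("source") != "COBISS.SI")
    merged = {}
    for entry in ordered:
        for field, value in entry.items():
            if field == "source":
                continue
            if value not in (None, "") and merged.get(field) in (None, ""):
                merged[field] = value
    return merged


def deduplicate(entries):
    """Group entries by match key and merge each group into one entry."""
    groups = {}
    for entry in entries:
        groups.setdefault(match_key(entry), []).append(entry)
    return [merge_group(group) for group in groups.values()]


# Hypothetical entries describing two documents, one of them twice.
entries = [
    {"source": "OAI-PMH", "Title": "STROJNO  UČENJE", "IssueYear": "2015",
     "Abstract": "An abstract.", "UDC": ""},
    {"source": "COBISS.SI", "Title": "Strojno učenje", "IssueYear": "2015",
     "Abstract": "", "UDC": "004.8"},
    {"source": "OAI-PMH", "Title": "Another title", "IssueYear": "2016",
     "Abstract": "", "UDC": ""},
]

merged_entries = deduplicate(entries)
```

In this example the first two entries collapse into one: the title and UDC come from the COBISS.SI entry, and the abstract is filled in from the OAI-PMH entry. Exact matching on a normalized title is a deliberate simplification; near-duplicate titles would need a fuzzy comparison.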

Acknowledgments

The provided data stemmed from the establishment of the Slovenian open access infrastructure. The Slovenian open access infrastructure is partly financed by the European Union, European Regional Development Fund, and the Ministry of Education, Science and Sports of the Republic of Slovenia within the framework of the Operational Programme for Strengthening Regional Development Potentials for Period 2007–2013.

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Ojsteršek M., Brezovnik J., Kotar M., Ferme M., Hrovat G., Bregant A., Borovič M. Establishing of a Slovenian open access infrastructure: a technical point of view. Program Electron. Libr. Inf. Syst. 2014;48(4):384–412.
2. Seljak M., Seljak T. The development of the COBISS system and services in Slovenia. Program Electron. Libr. Inf. Syst. 2002;36(2):89–98.

