Modern data analysis faces challenging problems, which brings the incorporation of intelligent methods to the forefront. Making efficient decisions in industrial and government applications requires rich real-world data, the means to interpret them, and guidance on how to use them to inform decisions about a process. Researchers are shifting their focus from the theoretical background toward ways of making these methods more robust and efficient, and real-world data are necessary for that purpose. Machine Learning (ML) approaches require complex data collections to develop models with strong generalization abilities. In addition, intelligent optimization algorithms require real-world instances that capture all aspects of a problem and reflect its actual complexity.
In response to this need, we organized the Special Issue “Data in Management and Decision Engineering”, aiming to collect data articles describing datasets valuable for industrial and research purposes, i.e., for ML and intelligent optimization algorithms.
We received 9 articles covering a broad range of data types, e.g., real-world operational data, survey data, and artificial benchmark data. This range showcases the value of the collected datasets. The operational data described in this Special Issue contain real-world scenarios that consider several aspects of their respective problems. Therefore, they are meaningful for practitioners incorporating intelligent methods into their operations. The survey data provide insight into the various factors affecting a decision in government applications. This enables the development of intelligent methods with strong generalization abilities. The artificial benchmark data aim to explore the capabilities of modern approaches, such as quantum computing algorithms, and to facilitate the investigation of more robust and efficient methods.
In this first attempt to collect valuable datasets from various management and decision engineering applications, we focused on areas of high research interest and thus primarily invited experts in topics of management and decision engineering that have recently garnered broad academic and lay interest. Such areas include logistics and delivery problems [1–4], energy management and renewables [5,6], and stock and resource management [7,8]. Considering the value of the submitted data articles, we believe that more focused special issues on these topics could be organized in the near future. In addition, the current Special Issue includes a dataset related to sports science [9] that enables the development and optimization of algorithms for generating cycling training routes. Moreover, this dataset can be modified to enable the development of bike delivery route generation algorithms that can be used in relevant delivery problems.
The most widespread repository for data science is perhaps the UCI Machine Learning Repository [10]. The ML community maintains the repository by contributing datasets and updating the website's functionality. Another repository of community-published data is Kaggle [11]. Both repositories contain a large number of datasets, ranging from Natural Language Processing data to data in the physical sciences. However, they lack adequate data descriptions. A proper description enables interested researchers to easily integrate the corresponding data into their study, understand how the data were collected, and thus gain an appreciation for experimental replication as well as novel uses of existing data. In that spirit, this Special Issue aims to provide well-described datasets useful for data analysis and benchmarking of intelligent methods.
Additionally, problem-related repositories contain benchmark data for various optimization problems. Such an example is the TSPLIB [12], a library of sample instances for routing problems, mainly the Travelling Salesman Problem (TSP). Another example is the PSPLIB [13], a library of benchmark instances for resource-constrained project scheduling problems. However, such repositories contain artificial benchmark datasets that do not always depict real-world problem complexity. The current Special Issue contains several real-world operational data sets enabling the development of intelligent optimization algorithms that consider various aspects of the corresponding problem.
The popularity of repositories such as the UCI Machine Learning Repository and PSPLIB showcases the importance of the current collection comprising the Special Issue. The UCI Machine Learning Repository has been cited more than 6300 times, according to Google Scholar. This number can grow significantly considering its individual datasets' citations. PSPLIB has been cited 1738 times, according to Google Scholar, and 1033 times, according to Scopus.
The current Special Issue consists of 9 data articles describing real-world operational data, surveyed data, and artificial benchmark data.
In [1], a real operational dataset of 263 instances for the Concrete Delivery Problem (CDP) is described. The authors cleaned, anonymized, and processed the raw data provided by a concrete producer to form instances useful for benchmarking optimization algorithms developed to solve the CDP. Researchers and practitioners studying the CDP can further process the dataset to create artificial data for variations of the CDP. Furthermore, selected instances from the dataset are useful for studying the dynamic aspect of the CDP, which considers real-time orders.
In [2], a dataset for the Van-Drone Routing Problem with Multiple Delivery Points and Cooperation (VDRPMDPC) is presented. The continuous development of both electronic and quick commerce has driven carriers and courier operators to identify more effective methods for express parcel delivery. Along these lines, the VDRPMDPC assesses the design of more sustainable and cost-effective delivery routes in urban and semi-urban environments via the use of Unmanned Aerial Vehicles (UAVs). The corresponding dataset consists of 14 instances with 20, 40, 60, or 100 client nodes corresponding to real geographical positions located in two different areas of Athens, Greece.
Survey data aiming to provide insight into the factors affecting a residence's electricity consumption are presented in [5]. The imminent energy crisis highlights the value of such insight. The data were collected in Greece through an anonymous survey that comprised 26 questions, resulting in 188 data points from 104 households from different periods. Each data point contains four categories of attributes: (a) the household data, such as the type and properties of the residence, (b) the occupants’ socio-economic features, such as the employment status, and the total income of the residents, (c) the energy-related occupants’ behavior, and (d) the location of the household to estimate the weather conditions for the provided time. The authors investigated non-trivial relationships between the data points using data augmentation. As a result, they computed and included a secondary set of features based on the raw attributes.
The authors of [9] describe a graph-based dataset of cycling routes in Slovenia comprised of 152,659 nodes representing individual road intersections and 410,922 edges representing the roads between them. Cycling training plans usually consist of a group of nodes and edges of a bi-directed graph to be covered by the athlete. An optimal training plan considers the cyclist training parameters, e.g., the suitable distance, ascent, and descent. The dataset enables the development and optimization of cycling training generation algorithms considering various factors, such as distance, ascent, descent, and road type.
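As a rough illustration of how such a graph could be used, the following minimal sketch sums distance, ascent, and descent along a candidate route. The toy graph, the attribute names (`distance_km`, `ascent_m`, `descent_m`), and the helper `route_totals` are hypothetical and are not taken from the dataset in [9].

```python
from typing import Dict, List, Tuple

# Hypothetical toy graph: each directed edge (u, v) carries illustrative attributes.
# A bi-directed road appears as two edges with ascent and descent swapped.
Edge = Tuple[str, str]
EDGES: Dict[Edge, Dict[str, float]] = {
    ("A", "B"): {"distance_km": 2.0, "ascent_m": 30.0, "descent_m": 5.0},
    ("B", "A"): {"distance_km": 2.0, "ascent_m": 5.0, "descent_m": 30.0},
    ("B", "C"): {"distance_km": 3.5, "ascent_m": 10.0, "descent_m": 40.0},
    ("C", "B"): {"distance_km": 3.5, "ascent_m": 40.0, "descent_m": 10.0},
}


def route_totals(route: List[str]) -> Dict[str, float]:
    """Sum the edge attributes over consecutive node pairs of a route."""
    totals = {"distance_km": 0.0, "ascent_m": 0.0, "descent_m": 0.0}
    for u, v in zip(route, route[1:]):
        attrs = EDGES[(u, v)]  # raises KeyError if no such road exists
        for key in totals:
            totals[key] += attrs[key]
    return totals


# An out-and-back route: ascent and descent balance out over the round trip.
out_and_back = route_totals(["A", "B", "C", "B", "A"])
print(out_and_back)
```

A route generation algorithm could compare such totals against a cyclist's target distance and ascent when ranking candidate routes.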
The work in [6] presents data collected using standard communication equipment and invoices provided by an established civil construction and renewable energy development and operation company. The authors provide four types of data: (a) Project Management Data, (b) Life Cycle Inventory (LCI), (c) Electricity Generation Data, and (d) Operational Cost Data. The Project Management Data can be further combined with the costs from different geographical and time regions to estimate overall project implementation costs for similar projects. The LCI data for the materials and transportation used can set the basis for life cycle assessment modeling of ground-mounted photovoltaic farms of that size and type. The Electricity Generation Data, along with meteorological parameters and location coordinates, can be further enhanced to predict and manage energy generation as well as cash flow expectations for installations of this type and size over time. Finally, the Operational Cost Data, especially when combined with the previously mentioned types of data, can support a holistic techno-economic and environmental assessment of comparable commercial photovoltaic installations. Furthermore, these data can be used for a comparative multi-disciplinary evaluation between photovoltaics and various renewable electricity generation alternatives as well as traditional fossil fuel-based options.
Data for another UAV-related routing problem, the Cumulative Unmanned Aerial Vehicle Routing Problem (CUAVRP), are presented in [3]. These synthetic data were generated by applying an algorithmic procedure to well-known vehicle routing problem instances under different parameter settings, such as the viewing window of the UAVs’ sensor, their maximum range, the size of the UAV fleet, and the unknown locations of the targets within the area of interest. The dataset provides a benchmark for CUAVRP solution methodologies and can be extended to form benchmarks for other variants of the CUAVRP with additional constraints and different objectives.
In [7], the authors introduced a benchmark dataset for real-world bin-packing problems, encouraging quantum computing researchers to work on such problems. This dataset consists of 12 instances of varying levels of complexity regarding size and user-defined requirements. The instances consider several real-world-oriented restrictions, e.g., item and bin dimensions, weight restrictions, affinities among package categories, preferences for package ordering, and load balancing. Moreover, the authors provide a Python script for dataset generation to allow the construction of general-purpose benchmarks. This benchmark dataset aims to provide an evaluation floor for the performance of quantum solvers, and thus, the instances were designed according to the current limitations of quantum devices.
A large dataset for resource-leveling optimization in project management is described in [8]. Specifically, the data correspond to a large real-world ship construction problem consisting of 1178 activities. Resource leveling is a highly complex optimization problem where the goal is to adjust a project's activities to achieve better resource allocation. Only small benchmark problems consisting of ten to twenty activities can be found in the literature. Therefore, these data are valuable for testing and benchmarking robust and efficient optimization algorithms.
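To make the leveling goal concrete, the sketch below scores a schedule by the sum of squared per-period resource usage, a commonly used leveling measure in which lower values indicate a flatter profile. The three activities, their fields, and the function names are hypothetical and unrelated to the ship-construction data in [8]. The choice of measure is likewise an assumption, and precedence constraints are ignored for brevity.

```python
from typing import Dict, List

# Hypothetical activities: duration in periods and resource units needed per period.
ACTIVITIES: Dict[str, Dict[str, int]] = {
    "a": {"duration": 2, "demand": 3},
    "b": {"duration": 2, "demand": 2},
    "c": {"duration": 1, "demand": 4},
}


def usage_profile(starts: Dict[str, int], horizon: int) -> List[int]:
    """Total resource usage in each period for the given activity start times."""
    profile = [0] * horizon
    for name, start in starts.items():
        activity = ACTIVITIES[name]
        for t in range(start, start + activity["duration"]):
            profile[t] += activity["demand"]
    return profile


def leveling_cost(profile: List[int]) -> int:
    """Sum of squared usage: peaks are penalized more than flat profiles."""
    return sum(u * u for u in profile)


# Starting everything at once creates a peak; shifting two activities flattens it.
early = usage_profile({"a": 0, "b": 0, "c": 0}, horizon=4)
shifted = usage_profile({"a": 0, "b": 2, "c": 2}, horizon=4)
print(early, leveling_cost(early))
print(shifted, leveling_cost(shifted))
```

An optimization algorithm for resource leveling would search over feasible start times to reduce such a cost, which becomes far harder at the scale of the 1178 activities mentioned above.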
In [4], a dataset for the Electric Capacitated Travelling Salesman Problem (EC-TSP) is presented. Numerous cities seek to shift toward more sustainable routing models and tackle traffic congestion. Along those lines, the cargo bike is proving to be an attractive and versatile last-mile delivery alternative. The EC-TSP describes the e-cargo bike parcel distribution problem in urban environments. To enable the assessment of e-cargo bikes versus typical delivery vans in terms of operational efficiency and CO2e emissions, [4] introduces a dataset of 9 instances of 14–29 nodes each, based on the real city centers of Athens, Thessaloniki, Patra, and Larisa in Greece.
Over the years, the focus of research appears to have moved from classical production and operations management problems and the monitoring and fault diagnosis of real-time systems to logistics and supply chain problems, as well as to energy management and environmental matters. Therefore, special issues such as this one, devoted to datasets for testing intelligent optimization, classification, or forecasting approaches to real-world managerial applications in logistics, transportation, energy, and the environment, will be of great interest and value to the research community. Furthermore, future Special Issues can also include datasets from other promising management engineering areas, such as financial engineering, reliability engineering, and quality engineering, as well as quantitative managerial decision-making (e.g., credit scoring, bankruptcy prediction, and credit fraud detection applications).
Acknowledgments
The guest editors would like to thank everyone who contributed to the Special Issue, particularly the Editors-in-Chief and the Scientific Editors for the support provided throughout the process, the authors who submitted their articles, and the anonymous reviewers who greatly helped in the selection process.
References
- 1. Tzanetos A., Blondin M. Real operational data for the concrete delivery problem. Data Br. 2023;48. doi: 10.1016/j.dib.2023.109189.
- 2. Athanasiadis E., Koutras V., Zeimpekis V. Dataset for the van-drone routing problem with multiple delivery drop points. Data Br. 2023;48. doi: 10.1016/j.dib.2023.109192.
- 3. Kyriakakis N.A., Stamadianos T., Marinaki M., Marinakis Y. Dataset for the cumulative unmanned aerial vehicle routing problem. Data Br. 2023;48. doi: 10.1016/j.dib.2023.109296.
- 4. Gialos A., Zeimpekis V. Dataset for the electric capacitated traveling salesman problem. Data Br. 2023;50. doi: 10.1016/j.dib.2023.109464.
- 5. Mischos S., Gkalinikis N.V., Manolopoulou A., Dalagdi E., Zaikis D., Lazaridis A., Vlachava D., Lagouvardos K., Vrakas D. Household electricity consumption in Greece: a dataset based on socio-economic features. Data Br. 2023;48. doi: 10.1016/j.dib.2023.109232.
- 6. Tsinarakis G., Kouloumpis V., Pavlidou A., Arampatzis G. Data for the project management, life cycle inventory, costings and energy production of a ground-mounted photovoltaic farm in Greece. Data Br. 2023;48. doi: 10.1016/j.dib.2023.109260.
- 7. Osaba E., Villar-Rodriguez E., Romero S.V. Benchmark dataset and instance generator for real-world three-dimensional bin packing problems. Data Br. 2023;49. doi: 10.1016/j.dib.2023.109309.
- 8. Kyriklidis C., Dounias G. A ship-construction dataset for resource leveling optimization in large project management problems. Data Br. 2023;49. doi: 10.1016/j.dib.2023.109340.
- 9. Rajšp A., Fister I. Neo4j graph dataset of cycling paths in Slovenia. Data Br. 2023;48. doi: 10.1016/j.dib.2023.109251.
- 10. Kelly M., Longjohn R., Nottingham K. UCI machine learning repository. (n.d.). https://archive.ics.uci.edu/.
- 11. Kaggle: your machine learning and data science community. (n.d.). https://www.kaggle.com/ (accessed August 17, 2023).
- 12. Reinelt G. TSPLIB: a library of sample instances for the TSP (and related problems) from various sources and of various types. (2014). http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/ (accessed December 26, 2023).
- 13. Kolisch R., Sprecher A. PSPLIB - A project scheduling problem library: OR software - ORSEP operations research software exchange program. Eur. J. Oper. Res. 1997;96:205–216. doi: 10.1016/S0377-2217(96)00170-1.
