Data in Brief. 2024 Jun 3;55:110587. doi: 10.1016/j.dib.2024.110587

Daily electric vehicle charging dataset for training reinforcement learning algorithms

Nastaran Gholizadeh a, Petr Musilek a,b
PMCID: PMC11209004  PMID: 38939017

Abstract

Reinforcement learning algorithms are increasingly utilized across diverse domains within power systems. One notable challenge in training and deploying these algorithms is the acquisition of large, realistic datasets. It is imperative that these algorithms are trained on extensive, realistic datasets over numerous iterations to ensure optimal performance in real-world scenarios. In pursuit of this goal, we curated a comprehensive dataset capturing electric vehicle (EV) charging details over a span of 29,600 days within a designated parking facility. This dataset encompasses necessary information such as connection times, charging durations, and energy consumption of individual EVs. The methodology involved employing conditional tabular generative adversarial networks (CTGAN) to craft a pool of synthetic data from a smaller initial dataset collected from an EV charging facility located on the Caltech campus. Subsequently, multiple post-processing techniques were implemented to extract data from this pool, ensuring compliance with the charging station's capacity constraint while maintaining a realistic daily EV demand profile derived from historical data. Using kernel density estimation (KDE), the distributional characteristics of the historical data, especially concerning the timing of EV connections, were faithfully replicated. The developed dataset is particularly useful for training offline reinforcement learning algorithms.

Keywords: Generative adversarial networks, conditional tabular GAN, Kernel density estimation, Charging station, Adaptive charging networks, ACN-data


Specifications Table

Subject Renewable Energy, Sustainability and the Environment/Artificial Intelligence
Specific subject area Realistic electric vehicle charging dataset generation in a public parking lot
Type of data Table, Analyzed
Data collection Conditional tabular generative adversarial networks (CTGAN) was employed to generate a pool of electric vehicle charging data from a smaller historical dataset. Kernel density estimation was utilized to sample data from this pool in a way that the daily electric vehicle connection profiles replicated distributional characteristics of the daily profiles in the historical data. Additionally, through post-processing of the dataset, the charging station capacity constraint was checked and enforced. The resultant dataset includes electric vehicle connection times, charging durations, energy consumption, and a day indicator column.
Data source location Institution: University of Alberta
City: Edmonton
Country: Canada
Primary data source: https://ev.caltech.edu/dataset
Data accessibility Repository name: Mendeley Data
Data identification number: 10.17632/5zrtmp7gwd.2
Direct URL to data: https://data.mendeley.com/datasets/5zrtmp7gwd/2

1. Value of the Data

  • In recent years, reinforcement learning algorithms have gained significant popularity across various power system applications, including electric vehicle (EV) charging management. However, their effectiveness heavily relies on the quality and quantity of training data. For successful training, extensive datasets are essential, spanning a substantial time frame (e.g., over 20,000 days). Most available EV datasets either primarily consist of residential EV data or are too short to adequately support reinforcement learning algorithms. Public charging stations play a crucial role in collecting EV data. Monitoring, storing, and cleaning this data over a period of 20,000 days is an exhaustive and costly endeavor for these stations. Additionally, the surge in popularity of EV charging stations has occurred only within the last 15 years. Remarkably, 20,000 days of data corresponds to over 54 years, making it practically impossible to gather such a vast amount of real-world data from a single public charging station.

  • We have developed a synthetic dataset spanning 29,600 days, modeled on a public charging station located on the Caltech campus. The original Caltech dataset, by contrast, covers only 185 days of EV charging data. The primary challenge in generating synthetic data for EV charging lies in maintaining the relationships between various columns (such as charging duration and energy consumption) while ensuring a realistic load profile. For instance, at the selected public charging station, most EVs charge between 2:00 p.m. and 5:00 p.m. Our goal is to preserve such patterns across the generated EV dataset as well.

  • We utilized conditional tabular generative adversarial networks (CTGAN) to maintain the inter-column relationships within the dataset, while also employing kernel density estimation (KDE) to replicate the distributional characteristics of the EV connection time data from the Caltech dataset. This data proves particularly valuable for training reinforcement learning algorithms aimed at controlling various aspects related to EV charging such as designing pricing mechanisms for EV charging or controlling EV charging rate in a public charging station.

2. Background

The motivation behind creating this dataset [1] originated from the growing popularity of reinforcement learning algorithms for managing EV charging in public charging stations [2,3]. However, most available EV datasets are either too short or primarily consist of residential EV data [4]. Public charging stations play a crucial role in collecting EV data, but monitoring, storing, and cleaning this data over a 20,000-day period is a labor-intensive and costly endeavor. Because 20,000 days corresponds to over 54 years of monitoring, and EV charging stations have only gained popularity in the past 15 years, it is practically impossible to gather such a vast amount of real-world data from a single public charging station.

Data augmentation is a technique used to increase training data by adding synthetic data, thereby improving the model's robustness to new, unseen data. It is widely utilized in computer vision and has recently been applied to model-free reinforcement learning algorithms. Various studies have explored the use of data augmentation to boost the performance of reinforcement learning algorithms. To this end, methods such as Koopman-mixup [5], differential privacy-based data augmentation [6], variational autoencoder-based augmentation [7], model-assisted experience augmentation [8], random translation and random amplitude scaling [9], and domain randomization [10] have been developed.

While generative adversarial networks have been used in previous literature to create synthetic residential load data [11], photovoltaic (PV) generation data [12], and other types of data, these datasets are typically single-column time-series data, which makes their generation straightforward. However, in the case of generating EV charging data, the charging duration and energy consumption columns are inherently correlated. Additionally, EV consumption exhibits patterns across various days. For instance, creating data that suggests EVs have their highest consumption at 6:00 a.m. is unrealistic for a public charging station. To address these issues, our study employed CTGAN [13] to preserve the correlation between charging duration and energy consumption in the generated data. Additionally, we utilized KDE to replicate the distributional characteristics of EV connection time data from a real dataset collected at a public charging station on the Caltech campus [14]. To ensure data quality, we continuously applied various post-processing functions, checking both the station capacity constraint and connection time ranges. Despite the Caltech dataset containing data for only 185 days, we successfully generated a valuable resource containing 29,600 days of data for EV charging research.

3. Data Description

The EV dataset comprises a single CSV file named ‘SYNTHETIC_EV_DATA.csv’ with four columns. The first column, labeled ‘connectionTime_decimal’, represents the connection time of individual EVs for charging in the parking area. This column is presented in a 24-hour time format. The subsequent columns, ‘chargingDuration’ and ‘kWhDelivered’, provide information on the duration of charging and the energy consumed by each EV during the charging process. The duration is indicated in hours, and the energy is in kWh. Lastly, the ‘dayIndicator’ column indicates the specific day for which the data was recorded. All EVs sharing the same day indicator value belong to the same day. The dataset contains a total of 1,965,239 rows, corresponding to EV charging data spanning 29,600 days. The parking lot's capacity in this dataset is assumed to be 37 EVs, as no more than 37 EVs were observed charging simultaneously in the Caltech dataset.
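As a rough illustration of how the four columns fit together, the following sketch groups rows by day and counts the largest number of overlapping sessions within each day. It uses only the Python standard library and a tiny made-up sample in place of the real file. The function names, the sample values, and the simplification that a session is only compared against sessions of the same day (ignoring sessions that run past midnight) are assumptions for illustration, not part of the published dataset tooling.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the column layout described above (values are illustrative only).
SAMPLE_CSV = """connectionTime_decimal,chargingDuration,kWhDelivered,dayIndicator
8.50,2.00,6.1,1
9.25,4.50,12.3,1
10.00,1.00,3.0,1
15.00,3.00,9.8,2
"""

def load_sessions_by_day(handle):
    """Group sessions by dayIndicator; each session is (start_h, duration_h, kwh)."""
    days = defaultdict(list)
    for row in csv.DictReader(handle):
        day = int(float(row["dayIndicator"]))
        days[day].append((
            float(row["connectionTime_decimal"]),
            float(row["chargingDuration"]),
            float(row["kWhDelivered"]),
        ))
    return dict(days)

def peak_concurrency(sessions):
    """Largest number of sessions that overlap at any instant (sweep line)."""
    events = []
    for start, duration, _ in sessions:
        events.append((start, 1))               # an EV connects
        events.append((start + duration, -1))   # an EV finishes
    # At equal times, process departures (-1) before arrivals (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak

days = load_sessions_by_day(io.StringIO(SAMPLE_CSV))
peaks = {day: peak_concurrency(sessions) for day, sessions in days.items()}
print(peaks)  # {1: 3, 2: 1}
```

To run the same check on the real data, the `io.StringIO(SAMPLE_CSV)` handle could be replaced with `open("SYNTHETIC_EV_DATA.csv", newline="")`, assuming the file has been downloaded from the repository listed above.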

4. Experimental Design, Materials and Methods

In the EV dataset, there is a considerable correlation between the columns for charging duration and energy consumption. To preserve this relationship in the synthesized data, the CTGAN method was employed on the Caltech dataset to generate a pool of EV data. Additionally, EV connection times typically display a daily pattern. For example, a notable surge in charging activity occurs around 3:00 p.m. in the Caltech parking lot. Consequently, to enhance realism, the synthetic data for this location must mirror this trend. To achieve this, first, a random day from the Caltech dataset was selected to follow its connection time distribution. The KDE method was utilized to derive the probability density function for this day, which was then used to assign scores to the data rows in the synthetic data pool. Subsequently, a random count of EVs was determined for the day, and an equivalent number of data entries were selected from the synthetic pool, with the KDE scores guiding the selection probabilities. Furthermore, the maximum capacity of the charging station was verified against the sampled data to ensure that no more than 37 EVs were charging at the same time—a figure established by analyzing the Caltech dataset, which indicated this as the upper limit for simultaneous charging events. When this limit was exceeded, the specific time causing the breach was identified, and data entries were systematically removed until the constraint was fulfilled. The newly generated data was then added to an empty list, and the procedure was repeated for another day and appended to the same list. Finally, it was verified that all generated data across various columns remained positive and that connection times fell within the 0 to 24-hour range. Coding was performed using the Python programming language. Table 1 compares the descriptive statistics for the generated and Caltech datasets.
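The sampling and capacity steps described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the authors' code: the hand-written Gaussian KDE, the fixed bandwidth, sampling with replacement, and the rule of dropping the most recently connected session at a breach are all assumptions, and the uniformly random `pool` merely stands in for the CTGAN-generated pool. In practice, a library routine such as `scipy.stats.gaussian_kde` would be a more usual choice for the density estimate.

```python
import math
import random

def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels centred on the samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def sample_day(pool, reference_times, n_evs, rng, bandwidth=1.0):
    """Draw n_evs rows from the pool, weighted by the KDE score of each row's connection time."""
    density = gaussian_kde(reference_times, bandwidth)
    weights = [density(row[0]) for row in pool]
    # Assumption: sampling with replacement; the source does not specify this.
    return rng.choices(pool, weights=weights, k=n_evs)

def first_breach(sessions, capacity):
    """Earliest arrival time at which more than `capacity` sessions overlap, else None."""
    for start, _, _ in sorted(sessions):
        active = sum(1 for s, d, _ in sessions if s <= start < s + d)
        if active > capacity:
            return start
    return None

def enforce_capacity(sessions, capacity):
    """Remove sessions one at a time until no instant exceeds the capacity."""
    sessions = list(sessions)
    while (t := first_breach(sessions, capacity)) is not None:
        active = [s for s in sessions if s[0] <= t < s[0] + s[1]]
        # Assumption: drop the most recently connected session active at the breach time.
        sessions.remove(max(active, key=lambda s: s[0]))
    return sessions

rng = random.Random(0)
# Hypothetical connection times (hours) for one historical reference day.
reference_times = [14.0, 14.5, 15.0, 15.5, 16.0]
# Stand-in pool of (connection time, duration, energy) rows; the real pool comes from CTGAN.
pool = [(rng.uniform(0, 24), rng.uniform(0.5, 6), rng.uniform(1, 20)) for _ in range(500)]
day = enforce_capacity(sample_day(pool, reference_times, n_evs=60, rng=rng), capacity=37)
```

Repeating `sample_day` and `enforce_capacity` with a new reference day and EV count, and tagging each result with a day indicator, would mirror the loop described in the text.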

Table 1. Descriptive statistics comparison between the generated and Caltech datasets.

| Statistic | Connection Time (h), Generated | Connection Time (h), Caltech | Charging Duration (h), Generated | Charging Duration (h), Caltech | Energy Consumption (kWh), Generated | Energy Consumption (kWh), Caltech |
|---|---|---|---|---|---|---|
| Count | 1,965,239 | 12,663 | 1,965,239 | 12,663 | 1,965,239 | 12,663 |
| Mean | 14.486 | 13.435 | 3.800 | 3.280 | 9.437 | 8.998 |
| Standard Deviation | 6.427 | 7.006 | 3.387 | 3.060 | 5.700 | 6.638 |
| Minimum | 0.000 | 0.016 | 0.000 | 0.001 | 0.000 | 0.501 |
| 25% Quantile | 14.247 | 6.200 | 1.443 | 1.287 | 4.868 | 4.228 |
| 50% Quantile | 16.168 | 15.633 | 2.547 | 2.275 | 9.174 | 7.633 |
| 75% Quantile | 18.041 | 17.733 | 5.436 | 4.275 | 13.566 | 13.207 |
| Maximum | 23.999 | 23.983 | 39.411 | 40.375 | 62.537 | 69.373 |
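Statistics of the kind listed in Table 1 can be computed for any single column along these lines. The sketch below uses Python's `statistics` module, and the sample standard deviation and the "inclusive" quantile method are assumptions, since the source does not state which conventions were used. The `durations` list is hypothetical.

```python
import statistics

def describe(values):
    """Summary statistics in the same layout as Table 1 for one column of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumption)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

durations = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical charging durations in hours
summary = describe(durations)
```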

Following the data generation, the correlations among the dataset's columns were computed and compared against the Caltech dataset. This analysis is illustrated in Fig. 1. Notably, the correlation between charging duration and energy consumption remains consistent. Moreover, the correlation of connection time with both charging duration and energy consumption increases, which can be attributed to the sampling of data from recurrent connection time profiles. Nonetheless, this heightened correlation is reasonable, as EV drivers who initiate charging after 10:00 a.m. are likely to engage in longer charging sessions, possibly due to work commitments, resulting in extended charging durations and increased energy consumption.

Fig. 1. Correlation heatmap comparison between the generated and Caltech datasets.
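A pairwise correlation comparison of this kind could be computed along the following lines. The sketch assumes the Pearson coefficient, which the source does not name explicitly, and the column values shown are hypothetical. A column with zero variance would cause a division by zero and is not handled here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a mapping of column name to values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical values; real use would pass the dataset's three numeric columns.
columns = {
    "connectionTime_decimal": [8.0, 9.5, 14.0, 15.5, 17.0],
    "chargingDuration": [1.0, 2.0, 3.5, 4.0, 6.0],
    "kWhDelivered": [3.0, 6.5, 9.0, 12.0, 18.0],
}
matrix = correlation_matrix(columns)
```

Computing one such matrix for the generated data and one for the Caltech data, then comparing them entry by entry, corresponds to the comparison shown in the heatmaps.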

To evaluate the distribution patterns of the generated data in comparison to the Caltech dataset, a subset comprising 185 consecutive days was selected from the generated data. Fig. 2 illustrates the scatter plots for connection time, charging duration, and energy consumption of the sampled data and Caltech dataset. A visual inspection reveals a notable resemblance between the scatter plots of the two datasets. Particularly, the connection times for both datasets exhibit a reduced frequency of EV charging during the hours of 6:00 a.m. to 10:00 a.m. Furthermore, the majority of charging sessions in both datasets do not exceed 11 h. In addition, the energy consumption predominantly falls below 20 kWh for both datasets.

Fig. 2. Comparison between the scatter plots of the sample generated data and Caltech dataset.

Finally, Fig. 3 presents the density plot alongside the KDE, offering a refined visualization of the data's distribution through a smooth approximation of its underlying probability density function. This assists in understanding the distribution's shape and inherent characteristics. Notably, the probability density functions for connection time, charging duration, and energy consumption exhibit remarkable similarity across both datasets. This similarity was also tested and verified on the rest of the generated dataset.

Fig. 3. Comparison between the density plots of the sample generated data and Caltech dataset.

Limitations

This dataset is primarily intended for training reinforcement learning models, which require a substantial volume of data from a consistent distribution. It should be noted that the dataset excludes irregular EV charging behaviors, such as a surge in charging demand at uncommon times (for instance, at 5:00 a.m.) caused by power outages or emergencies.

Ethics Statement

All authors have read and followed the ethical requirements for publication in Data in Brief and confirm that this work meets these requirements. This work does not involve human subjects, animal experiments, or any data collected from social media platforms.

CRediT authorship contribution statement

Nastaran Gholizadeh: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Visualization. Petr Musilek: Conceptualization, Resources, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Acknowledgments

This research has been supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada CREATE program "From Data to Decision (fD2D): Artificial Intelligence from Data Value Chain to Human Value", grant number 565078-2022.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability

References

  • 1. Gholizadeh N., Musilek P. Electric vehicle charging dataset. Mendeley Data. 2024:V2. doi: 10.17632/5zrtmp7gwd.2.
  • 2. Yan L., Chen X., Chen Y., Wen J. A cooperative charging control strategy for electric vehicles based on multiagent deep reinforcement learning. IEEE Trans. Ind. Inform. 2022;18:8765–8775. doi: 10.1109/TII.2022.3152218.
  • 3. Zhao Z., Lee C.K.M. Dynamic pricing for EV charging stations: a deep reinforcement learning approach. IEEE Trans. Transport. Electrif. 2022;8:2456–2468. doi: 10.1109/TTE.2021.3139674.
  • 4. Amara-Ouali Y., Goude Y., Massart P., Poggi J.-M., Yan H. A review of electric vehicle load open data and models. Energies. 2021;14:2233. doi: 10.3390/en14082233.
  • 5. Jang J., Han J., Kim J. K-mixup: data augmentation for offline reinforcement learning using mixup in a Koopman invariant subspace. Expert Syst. Appl. 2023;225. doi: 10.1016/j.eswa.2023.120136. ISSN 0957-4174.
  • 6. Liu T., Chen H., Hu J., Yang Z., Yu B., Du X., Miao Y., Chang Y. Generalized multi-agent competitive reinforcement learning with differential augmentation. Expert Syst. Appl. 2024;238(Part C). doi: 10.1016/j.eswa.2023.121760. ISSN 0957-4174.
  • 7. Wang Y., Jia Y., Zhong Y., Huang J., Xiao J. Balanced incremental deep reinforcement learning based on variational autoencoder data augmentation for customer credit scoring. Eng. Appl. Artif. Intell. 2023;122. doi: 10.1016/j.engappai.2023.106056. ISSN 0952-1976.
  • 8. Lin R., Chen J., Xie L., Su H. Accelerating reinforcement learning with case-based model-assisted experience augmentation for process control. Neural Netw. 2023;158:197–215. doi: 10.1016/j.neunet.2022.10.016. ISSN 0893-6080.
  • 9. Laskin M., Lee K., Stooke A., Pinto L., Abbeel P., Srinivas A. Reinforcement learning with augmented data. Proceedings of the 34th International Conference on Neural Information Processing Systems; Red Hook, NY, USA; Curran Associates Inc.; 2020.
  • 10. Tobin J., Fong R., Ray A., Schneider J., Zaremba W., Abbeel P. Domain randomization for transferring deep neural networks from simulation to the real world. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); Vancouver, BC, Canada; 2017. pp. 23–30. doi: 10.1109/IROS.2017.8202133.
  • 11. Rizzato M., Morizet N., Maréchal W., Geissler C. Stress testing electrical grids: generative adversarial networks for load scenario generation. Energy AI. 2022;9. doi: 10.1016/j.egyai.2022.100177. ISSN 2666-5468.
  • 12. Liu J., Zang H., Zhang F., Cheng L., Ding T., Wei Z., Sun G. A hybrid meteorological data simulation framework based on time-series generative adversarial network for global daily solar radiation estimation. Renew. Energy. 2023;219. doi: 10.1016/j.renene.2023.119374. ISSN 0960-1481.
  • 13. Xu L., Skoularidou M., Cuesta-Infante A., Veeramachaneni K. Modeling tabular data using conditional GAN. Proceedings of the 33rd International Conference on Neural Information Processing Systems; Red Hook, NY, USA; Curran Associates Inc.; 2019.
  • 14. Lee Z.J., Li T., Low S.H. ACN-data: analysis and applications of an open EV charging dataset. Proceedings of the Tenth ACM International Conference on Future Energy Systems (e-Energy '19); New York, NY, USA; Association for Computing Machinery; 2019. pp. 139–149.

Articles from Data in Brief are provided here courtesy of Elsevier
