Scientific Data. 2026 Feb 27;13:548. doi: 10.1038/s41597-026-06821-3

Open data, private learners: a de-identified student activity and performance dataset for learning analytics

Elena Tiukhova 1, Dimitri Van Landuyt 1, Bart Baesens 1,2, Monique Snoeck 1
PMCID: PMC13066476  PMID: 41748579

Abstract

The data-driven study and optimization of learning processes has become possible through collecting the digital traces left by learners as they interact with digital study materials and learning tools, a field known as Learning Analytics (LA). The increased availability of such data has fueled the growth of the LA field, enabling the development of new frameworks and models, supporting the transfer of findings across domains, and fostering collaboration on joint educational initiatives. However, given the sensitive and personal nature of intermediary learning results and outcomes, such data availability also raises significant ethical and privacy concerns. To support and foster the open and transparent development and evaluation of LA solutions, we present a detailed clickstream dataset collected at KU Leuven across two first-year bachelor courses over three academic years. The public dataset is accompanied by transparent documentation of the de-identification process, and we report on privacy and utility validation results.

Subject terms: Education

Background & Summary

Advances in data collection and processing techniques have enabled data-driven decision-making across various domains, including education. A substantial amount of data is generated throughout educational processes, such as enrollment information, study activity logs, and academic performance records, with data management powered by modern Learning Management Systems (LMSs). These data can be further leveraged to optimize learning processes, a key focus of the field of Learning Analytics (LA)1.

The availability and publication of datasets is crucial for supporting LA research and the experimental reproducibility of its findings. This importance is illustrated by published datasets that include demographic and aggregated clickstream data from a large-scale distance-learning university2, data on high school students’ physics learning outcomes3, and eye-tracking data capturing cognitive and metacognitive engagement4. The public availability of realistic and representative educational datasets supports the development of open, validated and generalizable LA solutions.

This paper contributes the publication of a detailed clickstream dataset collected as part of the ALPACAS project5 over three academic years (AYs): 2018-2019, 2019-2020, and 2020-2021. The ALPACAS project, launched at KU Leuven in 2019, aims to foster active learning among first-year bachelor students5. The instructors of the first-year bachelor courses make extensive use of an LMS (Blackboard, in the case of KU Leuven in the respective academic years) in delivering their courses, making it possible to collect data about student online activity. Moreover, the switch to an online teaching mode during COVID-19 resulted in even higher volumes of educational data collected worldwide6, as was also the case at KU Leuven. The ALPACAS initiative seeks to enhance students’ self-regulation, thereby improving retention and academic success7,8. It has empowered instructional teams to use LA insights to better understand student behavior and course dynamics within the LMS.

The published dataset has demonstrated merits: prior research has applied various techniques to this specific dataset, including anomaly detection to identify irregular study patterns9, explainable AI to examine the robustness of success prediction models10, and LA dashboards to inform instructional design11. The dataset introduced here corresponds to those used in Tiukhova et al.9,10, thereby enhancing the reproducibility and transparency of this existing LA research. In addition to supporting the replication of these scientific outcomes, and given that public LA datasets are generally scarce, our dataset also supports a number of promising avenues of future LA research. According to self-regulated learning (SRL) theory12, trace data can be interpreted as a manifestation of motivated learning choices and used to construct higher-level features that reflect diverse aspects of learning13. These features can be further leveraged in downstream LA tasks such as student success prediction10 or process mining14.

However, the publication of educational data involving learners (in this case, bachelor students) raises significant concerns regarding privacy and confidentiality15, which makes the proactive protection of these data subjects a critical priority in LA applications. Chicaiza et al.16 emphasize that privacy is often overlooked in LA research, where privacy preservation is rarely embedded in methodological frameworks. They argue for its routine inclusion, alongside and in complement to efforts to ensure ethical and legal compliance. To address privacy issues, Marshall et al.17 propose a comprehensive blueprint for embedding ethics and privacy in LA systems. Their framework introduces risk metrics for quantifying privacy exposure and provides actionable guidance for integrating privacy safeguards into institutional practices. Similarly, Tzoni et al.18 present a data pipeline that generates anonymous, low-risk data for analysis, employing aggregation techniques and automated risk assessments to ensure compliance. Maintaining data utility while protecting privacy is another key challenge. Joksimovic et al.19 argue that privacy-risk measurement should precede mitigation strategies to ensure the retention of dataset-specific features. Data de-identification20 refers to a class of techniques that apply specific data transformations (e.g., generalization, scrubbing or aggregation) prior to publication or further processing to proactively address privacy risks.

The dataset introduced in this paper is released with transparent documentation of the de-identification process and reports of the associated privacy and utility metrics used in this process. The use of these methods not only helps preserve privacy but also promotes fairness and expands opportunities for collaboration and open LA practices.

This paper is structured as follows. The Methods section presents the overall data collection and processing methodology adopted in preparation of this dataset, including the active measures taken to preserve privacy while maintaining data utility. The Data Records section then documents the overall data model of the dataset components. The Technical Validation section then discusses the results of validating the dataset, both in terms of utility and privacy. Finally, the Usage Notes section outlines a number of promising avenues for future LA research that can benefit from the dataset.

Methods

The data was prepared for publication through a series of steps, as illustrated in Fig. 1. The final dataset is published in Zenodo21.

Fig. 1. Data preparation.

First, data was sourced from two primary information systems: the Blackboard LMS and the SAP Student Information System (SIS), for three consecutive academic years. Blackboard provides detailed records of student learning activities, while the SIS contributes student information and academic performance data, including exam grades. Next, the dataset was narrowed in scope by selecting specific courses and filtering relevant tables and attributes. The data was then preprocessed and de-identified to ensure privacy and consistency. After institutional ethical review and approval, the processed dataset was published for use.

Scope Definition and Attribute Selection

The dataset is a representative sample drawn from a larger student population, capturing student activity and academic performance in two compulsory first-year bachelor courses – Accountancy and Global Economics – across three academic years: 2018-2019, 2019-2020, and 2020-2021. At KU Leuven, each academic year is divided into two semesters, with three exam periods: January, June, and September. The Accountancy course is taught in the first semester with its exam in January, while Global Economics is offered in the second semester with the exam in June. Retake opportunities for both courses are available during the September exam period. Each semester consists of 13 teaching weeks, followed by 1-2 study weeks and 3-4 exam weeks.

Both the Accountancy and Global Economics courses are followed by bachelor’s students across four study programs at the Faculty of Economics and Business at KU Leuven. The course instructor or teaching team has full autonomy regarding instructional design, pedagogical approach, and the selection of LMS tools. The courses are delivered in a blended learning format, with instructional materials made available through the LMS. Weekly practical sessions are organized and facilitated by a monitor, during which students work independently and may request assistance as needed.

Student performance is measured using exam grades from a written end-of-semester exam, which includes both multiple-choice and open-ended questions (for the Accountancy course) or multiple-choice questions only (for the Global Economics course).

Participation in the discussion forum was voluntary, and students were informed of its availability within the LMS and their option to post questions there.

For our analysis, we selected tables containing student online activity data relevant to LA tasks, aligned with SRL theory12. First, we include a table detailing course content items. We then extract fine-grained data on student interactions with these course items via an LMS page, where each click on a content item is logged with a timestamp.

Additional sources include data on discussion forum activity. Forum contribution data captures detailed logs of student posting behavior, while forum consumption data summarizes the number of posts read in each discussion forum by each student.

Due to KU Leuven’s privacy regulations on LA, no personal information beyond the student’s ID is available to researchers. Course enrollment data is incorporated as course participation data, including grades from multiple exam attempts as well as the study program under which the student enrolled in the course.

Figure 2 shows the data model of the data sources before preprocessing and de-identification. The attributes that can potentially identify a student (as direct identifiers or quasi-identifiers) are marked in red. The privacy-preserving treatment of these attributes is discussed in the Data De-Identification section.

Fig. 2. Data diagram before de-identification (Source: Author’s own illustration). Primary keys are marked in bold, foreign keys are marked in italics. THREAD_ID and REPLY_TO_POST_ID columns are only available in AYs 2019-2020 and 2020-2021.

The dashed entities in Figure 2 are not represented by dedicated tables in the dataset; instead, they exist only through their primary keys and contain no additional attributes. Their inclusion in the diagram is solely to improve the clarity of the domain model. A course (course entity) may be linked to zero or more course participations (course_participation entity), representing students who registered for and took exams for that course. It may also be associated with zero or more content items (course_content entity), such as lecture materials posted on the LMS page, and zero or more discussion forums (forum entity). Each course participation (course_participation entity) is tied to a student (student entity) and may be further linked to the student’s activity (log_activity entity) on course materials (course_content entity) as well as the student’s engagement in discussion forums, both consumption (df_consumption entity) and contribution (df_contribution entity). The df_consumption and df_contribution entities are indirectly linked through the forum entity, as both share the CONTEXT_ID attribute.

Data De-Identification

As can be seen in Figure 2, the data obtained from the source systems includes student system numbers (i.e., unique LMS-wide student identifiers) as direct identifiers. These student numbers, through which system administrators can directly identify individuals, are replaced by pseudonyms to avoid direct identifiability while retaining referential integrity across records. Specifically, in all data sources, student numbers were replaced with a randomly assigned, unique number from the range [0, N], where N is the total number of unique students, while preserving referential integrity between the different data sources. Pseudonyms are not reused across academic years and courses, to prevent identification of students retaking a course. Furthermore, the mapping between original identifiers and assigned numbers was discarded immediately and irreversibly after this pseudonymization step to minimize re-identification risk. Course identifiers are replaced by the course names, since these are already publicly known.
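The pseudonymization step described above can be sketched as follows. This is an illustrative Python sketch, not the authors’ published code; the table layout (a dict of table names to lists of row dicts) and all names are our own assumptions:

```python
import random

def pseudonymize(tables, column="USER_ID"):
    """Replace identifiers in `column` with random unique numbers from
    [0, N-1], consistently across all tables, then discard the mapping.
    `tables` maps table names to lists of row dicts (hypothetical layout)."""
    # Collect all unique original identifiers across every table.
    originals = sorted({row[column] for rows in tables.values()
                        for row in rows if column in row})
    # Draw a distinct random pseudonym for each original identifier.
    pseudonyms = random.sample(range(len(originals)), len(originals))
    mapping = dict(zip(originals, pseudonyms))
    # Apply the same mapping everywhere to preserve referential integrity.
    for rows in tables.values():
        for row in rows:
            if column in row:
                row[column] = mapping[row[column]]
    del mapping  # discarded immediately: the step is irreversible
    return tables
```

Because the same mapping is applied to every table before being discarded, records of one student remain linkable across tables, while the link back to the original identifier is lost.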

The subsections below provide more details on how each entity or table was further preprocessed to ensure de-identification of students.

Course participation (course_participation entity)

KU Leuven uses a 20-point grading system, with a grade of 10 as the passing threshold. If eligible, bachelor students can also use tolerance credits, which let them skip retaking exams for narrowly failed courses (grades of 8 or 9 out of 20). Tolerance credits do not award course credit but prevent study delays.

To make sure the combination of final exam grades for the two courses cannot be used as a fingerprint for (re-)identifying a specific student, we apply the generalization technique, i.e., we bin the grades (for the exam attempts in January, June and September as well as the final score itself) into the categories defined by Vemuri et al.7:

Grade Category
0–6 Fail
7–9 Able to Push
10–12 Borderline
13–15 Pass
16–20 Excellent
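This generalization amounts to a simple binning function. A minimal sketch (the bin boundaries follow the table above; the function name is ours):

```python
def bin_grade(grade):
    """Generalize a 0-20 exam grade into the categories of Vemuri et al."""
    if grade is None:
        return None          # no exam attempt in this period
    if grade <= 6:
        return "Fail"
    if grade <= 9:
        return "Able to Push"
    if grade <= 12:
        return "Borderline"
    if grade <= 15:
        return "Pass"
    return "Excellent"
```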

The PROGRAM_NAME attribute is excluded from the data to prevent the identification of a student enrolled in a specific program and assigned to a particular grade bin.

Course content data (course_content entity)

Since the course content data originally includes detailed information about content items – such as their titles and file paths – we assign each item to a general category based on this metadata, i.e., we generalize them. The possible categories are: Course Main Page, Assessments, Course Material, Forum, Grades, and Other. The original titles and paths are then discarded so that no identifying details of the course items are disclosed. The content identifier (as well as the parent identifier) is also replaced by a random number from the range [0, N], where N refers to the total number of unique content items (the same strategy is applied in the log_activity entity, where content identifiers also appear). The mapping between original identifiers and assigned numbers was discarded immediately after the replacement to ensure anonymity.
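A keyword-based categorization of this kind could look roughly as follows. The keyword lists here are hypothetical; the actual mapping rules used for the dataset are institution- and course-specific:

```python
def categorize_content(title, path):
    """Assign a content item to a general category based on its title
    and file path, so the originals can be discarded afterwards.
    The keyword lists below are illustrative assumptions only."""
    text = f"{title} {path}".lower()
    rules = [
        ("Assessments", ("quiz", "exam", "assignment")),
        ("Course Material", ("slides", "lecture", "video", ".pdf")),
        ("Forum", ("forum", "discussion")),
        ("Grades", ("grade", "gradebook")),
        ("Course Main Page", ("main page", "home")),
    ]
    for category, keywords in rules:
        if any(k in text for k in keywords):
            return category
    return "Other"
```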

LMS interactions log data (log_activity entity)

To reduce the risk of identifying students through exact access timestamps, we applied the noise addition strategy. Specifically, for each student, a random value from the interval [-5, 5] (in seconds) was selected and added to the timestamp. This value remains consistent for each individual student but differs across students. Using distinct values per student enhances anonymity, as applying a single offset to all timestamps would increase the risk of that offset being discovered and the data being de-anonymized. Importantly, since only the seconds component of each timestamp is modified, the impact on the order of the learning actions is minimal which is important when the data is used for downstream LA tasks (e.g., process mining14 or engineering learning features representing study regularity10).
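The per-student noise addition can be sketched as follows (an illustrative sketch; the function and row layout are our own assumptions):

```python
import random
from datetime import datetime, timedelta

def perturb_timestamps(rows, seed=None):
    """Add a fixed per-student offset, drawn once from [-5, 5] seconds,
    to every access timestamp of that student. Rows are dicts with
    USER_ID and a datetime TIMESTAMP (hypothetical layout)."""
    rng = random.Random(seed)
    offsets = {}  # one constant offset per student, drawn lazily
    for row in rows:
        uid = row["USER_ID"]
        if uid not in offsets:
            offsets[uid] = rng.randint(-5, 5)
        row["TIMESTAMP"] += timedelta(seconds=offsets[uid])
    return rows
```

Keeping the offset constant per student preserves the relative order and spacing of that student’s actions, which matters for the downstream tasks mentioned above.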

Following the same approach as for the student identifiers, session identifiers were replaced by randomly assigning each session a unique number from the range [0, N] where N refers to the total number of unique sessions. The mapping between original identifiers and assigned numbers was discarded immediately after the replacement to ensure anonymity.

Discussion forum contribution data (df_contribution entity)

The post/context identifiers (POST_ID and CONTEXT_ID) were replaced by randomly assigning each post/context a unique number from the range [0, N], where N is the total number of unique posts/contexts. The thread and reply identifiers relate to the post identifiers (a thread is identified by the first post written in it), so their identifiers are replaced accordingly. The mapping between original identifiers and assigned numbers was discarded immediately after the replacement to ensure anonymity.

Similar to the access timestamps in the interaction log data, the creation and modification timestamps of each post were adjusted by adding a random value from the interval [-5, 5] seconds. The same value was applied to both timestamps for a given post, preserving their relative consistency.

Discussion forum consumption data (df_consumption entity)

Since the original file already has a high level of aggregation (see the df_consumption table in Fig. 2), the only adjustment made to this file is the replacement of student and context identifiers (explained above).

Ethics statement

The research activities in this study have been reviewed and approved by the Social and Societal Ethics Committee (in Dutch, SMEC – Sociaal-maatschappelijke Ethische Commissie) at KU Leuven under the ethical approval number G-2020-2673-R4(AMD). With regard to the General Data Protection Regulation (GDPR) assessment, the committee has confirmed that the processing of personal data meets the requirements of the GDPR. With regard to the ethical review of the project, the committee has confirmed that the project meets the standards for academic research.

Data Records

Table 1 shows the number of unique students present in the data after preprocessing (across all tables). This includes all students enrolled in the LMS, regardless of whether they took the exam.

Table 1.

Total number of students in the dataset, per course and per academic year.

Course              AY 2018-2019   AY 2019-2020   AY 2020-2021
Accountancy         936            918            944
Global Economics    898            876            829

Figure 3 shows the data model of the public dataset. Below, we provide a detailed description of the columns in each table after de-identification.

Fig. 3. Data diagram after de-identification. Primary keys are marked in bold, foreign keys are marked in italics. THREAD_ID and REPLY_TO_POST_ID columns are only available in AYs 2019-2020 and 2020-2021.

Table course_participation

  • COURSE_ID – a course name.

  • USER_ID – a unique identifier of a student.

  • SCORE_CATEGORY_JANUARY – a bin of an exam grade obtained in January (first exam period).

  • SCORE_CATEGORY_JUNE – a bin of an exam grade obtained in June (second exam period).

  • SCORE_CATEGORY_SEPTEMBER – a bin of an exam grade obtained in September (third exam period).

  • SCORE_CATEGORY_FINAL – a bin of a final exam grade (an end score in case several exam attempts were taken – the score of the highest attempt is recorded).

  • PASSED_FIRST_ATTEMPT – an indicator of whether the exam was passed on the first attempt.

  • PASSED – an indicator of whether the exam was passed.

Table course_content

  • COURSE_ID – a course name.

  • CONTENT_ID – a unique identifier of the course content item (e.g., video, slides, etc.).

  • PARENT_ID – an identifier of a parent of the content item.

  • CONTENTLENGTH – a length of the readable text in the content item, in characters.

  • DTCREATED – a creation date of a content item.

  • DTMODIFIED – a modification date of a content item.

  • START_DATE – a start date of a content item availability (if available).

  • END_DATE – an end date of a content item availability (if available).

  • AR_ENABLED – an indicator whether adaptive release is enabled for a content item.

  • UNAVAILABLE – an indicator whether a content item is unavailable.

  • AR_ENABLED_PARENT – an indicator whether adaptive release is enabled for at least one parent of this item.

  • UNAVAILABLE_PARENT – an indicator whether a parent of a content item is unavailable.

  • GROUP_ASSIGNMENT – an indicator whether an item is a group assignment.

  • CONTENT_TYPE – a type of a content item (a choice from Assessments, Attempts, Course Material, Course Main Page, Forum, Grades, and Other).

Table log_activity

  • ACTION_ID – a unique identifier of an action.

  • COURSE_ID – a course name.

  • USER_ID – a unique identifier of a student.

  • SESSION_ID – an indicator of a session (a period of using an LMS to access content items for a maximum of 8 hours).

  • CONTENT_ID – an indicator of the content item that was accessed.

  • TIMESTAMP – a timestamp indicating when the content item was accessed.

Table df_contribution

  • POST_ID – a unique identifier of each post on a discussion forum.

  • COURSE_ID – a course name.

  • USER_ID – a unique identifier of a student.

  • CONTEXT_ID – a unique identifier for each forum; each of these can have several posts linked to them.

  • THREAD_ID – a unique identifier of a thread; in each context (forum), there can be many posts, with a possibility to reply to each of them; when such replies exist, they are grouped together as a thread. A thread takes an identifier of the first post that was written in it.

  • REPLY_TO_POST_ID – an identifier of a post that a current post replies to (if it is a reply, otherwise the value is missing).

  • DTMODIFIED – date and time of post modification.

  • MSG_TEXT_LENGTH – length of the readable text.

  • NUM_COMS_OR_REPLS_FOR_POST – the number of comments/replies to a post.

  • DTCREATED – a date and time of post creation.

Table df_consumption

  • CONSUMPTION_ID – a unique consumption identifier.

  • COURSE_ID – a course name.

  • USER_ID – a unique identifier of a student.

  • CONTEXT_ID – a unique identifier for each forum; each of these can have several posts linked to them.

  • NUM_READ_POSTS – a total number of posts read by a student on a discussion forum of a course.

Technical Validation

This section validates the resulting data set, first from the perspective of privacy and then in terms of utility. Finally, it summarizes the main findings.

Privacy

In order to analyze the effects of data curation on data privacy, we calculate the k-anonymity metric22. k-anonymity is a privacy metric that measures how well a dataset has been anonymized by grouping records so that each one is indistinguishable from at least k − 1 others based on a set of quasi-identifiers (1). Higher metric values mean higher privacy.

k = min_{C ∈ E} |C|    (1)

where E is the set of equivalence classes formed by the quasi-identifiers, and |C| is the number of records in class C.

In our case, students’ precise grades from different exam attempts can be considered quasi-identifiers, as their combination can point to a particular student. Therefore, generalization was applied to the course_participation entity.
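Computing k-anonymity as in Eq. (1) amounts to finding the smallest equivalence class induced by the quasi-identifier columns. A minimal sketch over lists of row dicts (names are ours):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """k = size of the smallest equivalence class formed by grouping
    rows on the quasi-identifier columns (Eq. 1)."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers)
                      for row in rows)
    return min(classes.values()) if classes else 0
```

For instance, four records whose grade bins pair up two-and-two yield k = 2, while any unique combination drags k down to 1.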

Table 2 presents the k-anonymity metric calculated for various grade-related attributes before and after data curation. In most cases, the metric shows a significant improvement, indicating enhanced data anonymity. Two notable exceptions are the September score for the Global Economics course in the academic year 2018-2019, and the September scores for both courses in the 2020-2021 academic year. In both instances, the de-identification approach has not led to an increased k-value. We attribute this to the limited number of non-empty data points in this specific subset of the data – reflecting only the small subset of students who retook the exam – which results in sparsely populated grade bins and restricts further privacy improvement. We emphasize that direct student identifiers were removed during the pseudonymization step, so any re-identification would require access to external or prior knowledge. Only students who know (or remember) that they had to retake the exam for these courses in these specific academic years might recognize themselves in the dataset and link their pseudonymous identity to their real identity. Even then, the information gained would pertain only to themselves and would already be known to them (their engagement in the LMS). Adversaries without such external or prior knowledge cannot perform this re-identification at all.

Table 2.

k-anonymity metrics.

Academic Year (AY)   Score       Before de-identification          After de-identification
                                 Accountancy   Global Economics    Accountancy   Global Economics
2018-2019            Final       2             6                   90            60
                     January     2             9
                     June        6             8
                     September   1             1                   5             1
2019-2020            Final       4             3                   88            60
                     January     4             5
                     June        3             4
                     September   2             2                   8             6
2020-2021            Final       1             6                   74            49
                     January     1             5
                     June        6             13
                     September   1             2                   1             2

Utility

To demonstrate the utility of the de-identified data, we use it for a downstream LA task focused on learning feature engineering and quantify the overall utility loss by comparing the task outcomes on the original and de-identified datasets. Specifically, we reproduce the features used in the research studies by Tiukhova et al.9,10, which address anomaly detection in learning behavior and student success prediction, respectively. Table 3 provides a brief description of each engineered feature, along with the corresponding tables and columns from the de-identified dataset used in their construction. A complete feature engineering code package is available in Zenodo21. The final column reports the potential effect of data curation on the engineered variable, where a filled circle denotes no potential utility loss and a half-filled circle indicates partial potential loss of utility. The partial potential loss is mostly expected for the features that use timestamp columns in their calculation. Moreover, in the published dataset, we improved the CONTENT_TYPE attribute compared to the version used by Tiukhova et al.9. While their study included only three categories – Course Main Page, Course Material, and Other – our research further analyzed the Other category, leading to a more detailed classification. As a result, some items were reclassified into the Course Main Page and Course Material types, and several new categories were introduced, including Assessments, Forum, and Grades. Consequently, we expect discrepancies due to these refinements in the features derived using this column, hence the half-circles.

Table 3.

Feature engineering.

Feature Description Tables Columns U
Total number of sessions (cont.) Total number of sessions of non-zero duration log_activity USER_ID, SESSION_ID
Total sessions duration (cont.) Sum over all the sessions duration during the course, seconds log_activity USER_ID, SESSION_ID, TIMESTAMP
Median session duration (cont.) Median calculated over all the sessions’ duration during the course, seconds log_activity USER_ID, SESSION_ID, TIMESTAMP
Median number of actions per session (discrete) Median calculated over all the sessions’ total learning actions counts log_activity USER_ID, SESSION_ID
Proportion of active days (cont.) Total number of active days relative to the duration of the course in days log_activity USER_ID, TIMESTAMP
Median number of active days per week (discrete) Median calculated over all the weeks with active days during the course log_activity USER_ID, TIMESTAMP
Median difference between active days (discrete) Median calculated over the time distances between consecutive active days during the course. log_activity USER_ID, SESSION_ID, TIMESTAMP
Proportion of active weeks (cont.) Total number of active weeks relative to the duration of the course in weeks log_activity USER_ID, TIMESTAMP
Proportion of active days: course material/main page (cont.) Total number of active days with course materials/main page views relative to the duration of the course in days log_activity, course_content USER_ID, TIMESTAMP, CONTENT_TYPE
Proportion of active weeks: course material/main page (cont.) Total number of active weeks with course materials/main page views relative to the duration of the course in weeks log_activity, course_content USER_ID, TIMESTAMP, CONTENT_TYPE
Proportion of posts read (cont.) Total number of posts read on the forum relative to the total number of posts available on a discussion forum df_consumption, df_contribution POST_ID, NUM_READ_POSTS, USER_ID
Total number of created posts (discrete) Total number of posts written on discussion forum during the course’s duration df_contribution POST_ID, USER_ID
Constancy of clicks (cont.) Entropy calculated with the probabilities estimated as the proportion of the number of learning actions per session relative to the total number of learning actions across all sessions log_activity USER_ID, SESSION_ID, TIMESTAMP
Constancy of session length (cont.) Entropy calculated based on the probabilities estimated as the proportion of a session’s length relative to the total sessions length across all sessions log_activity USER_ID, SESSION_ID, TIMESTAMP
Proportion of weeks with first-day activity (cont.) Total number of weeks with activity on the first day relative to the duration of the course in weeks log_activity USER_ID, TIMESTAMP
Proportion of first-day-of-week activity (cont.) Median calculated over all the weeks for the proportion of the learning actions performed on the first day of the week relative to the total number of actions performed in this week log_activity USER_ID, TIMESTAMP
Constancy of clicks: course material/main page daily (cont.) Entropy calculated based on the probabilities estimated as the proportion of the number of course material/main page views per day relative to the total number of course material/main page views log_activity, course_content USER_ID, SESSION_ID, TIMESTAMP, CONTENT_TYPE
Constancy of clicks: course material/main page weekly (cont.) Entropy calculated based on the probabilities estimated as the proportion of the number of course material/main page views per week relative to the total number of course material/main page views log_activity, course_content USER_ID, SESSION_ID, TIMESTAMP, CONTENT_TYPE
Bingeing of sessions The share of a student’s sessions that fall within their three most active weeks, relative to their overall distribution of activity across all weeks log_activity USER_ID, SESSION_ID, TIMESTAMP
Uniformity of sessions How evenly a student spreads their sessions across time, specifically a semester log_activity USER_ID, SESSION_ID, TIMESTAMP
Regularity of sessions Differences in behavioral patterns captured by comparing a user’s Uniformity and Bingeing over a semester study weeks with those same measures over the entire semester log_activity USER_ID, SESSION_ID, TIMESTAMP
Passed exam? An indicator whether a final exam was successfully passed course_participation PASSED

Features have been adopted from Tiukhova et al.10.

● Utility fully preserved.

◐ Utility partially preserved.
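As an illustration, one of the simpler features in Table 3, Proportion of active days, can be reproduced from the log_activity table roughly as follows (a minimal sketch with names of our own choosing; the published Zenodo package contains the full implementation):

```python
from datetime import date, datetime

def proportion_of_active_days(timestamps, course_start, course_end):
    """Number of distinct days with at least one logged action,
    divided by the course duration in days (inclusive)."""
    active_days = {ts.date() for ts in timestamps}
    duration = (course_end - course_start).days + 1
    return len(active_days) / duration
```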

Another difference from the data used by Tiukhova et al.10 concerns the discussion forum reading data. There, for AY 2018-2019, the raw data was extracted in an aggregated format, whereas the data for AYs 2019–2020 and 2020–2021 included more detailed information. To ensure consistency across all academic years, we constructed the df_consumption table using only aggregated data. As a result, there may be minor discrepancies in the Proportion of posts read feature compared to the version constructed by Tiukhova et al.10, which relied on the original, more detailed data for AYs 2019–2020 and 2020–2021.

Figure 4 shows the distributions of the features from Tiukhova et al.9 constructed with the original and with the de-identified data for the Accountancy course in AY 2018-2019 (Figs. 5-9 show the distributions for the other course and AYs). We apply the two-sample Kolmogorov-Smirnov test for goodness of fit, which compares the underlying distributions of two independent samples and tests the null hypothesis that the two distributions are identical. To reduce discrepancies caused by numerical precision, we round values to five decimal places before testing.
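The statistic underlying this test is the maximum vertical distance between the two empirical CDFs. A stdlib sketch of the statistic alone (in practice a library routine such as scipy.stats.ks_2samp would be used, since it also supplies the p-value needed for the significance decision):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|
    over all observed values x."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Identical samples give a statistic of 0, fully separated samples give 1; the smaller the statistic, the less the de-identification has distorted a feature’s distribution.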

Fig. 4.

Distributions of learning features built on original (light-blue) and de-identified (orange) data: Accountancy course in AY 2018-2019.

Fig. 5.

Distributions of learning features built on original (light-blue) and de-identified (orange) data: Accountancy course in AY 2019-2020.

Fig. 6.

Distributions of learning features built on original (light-blue) and de-identified (orange) data: Accountancy course in AY 2020-2021.

Fig. 7.

Distributions of learning features built on original (light-blue) and de-identified (orange) data: Global Economics course in AY 2018-2019.

Fig. 8.

Distributions of learning features built on original (light-blue) and de-identified (orange) data: Global Economics course in AY 2019-2020.

Fig. 9.

Distributions of learning features built on original (light-blue) and de-identified (orange) data: Global Economics course in AY 2020-2021.

Across Figs. 4–9, we generally fail to reject the null hypothesis: for most features, the distribution computed from the original data is statistically indistinguishable from that computed from the de-identified data. The few exceptions involve features that depend on CONTENT_TYPE (e.g., Proportion of active days: course main page, Proportion of active weeks: course main page, and Constancy of clicks: course main page, weekly); however, even in these cases the differences are not statistically significant.

The only statistically significant differences, all observed in AY 2020–2021, concern the feature Constancy of clicks: course main page, weekly (in both Accountancy and Global Economics) and the features Constancy of clicks: course main page, daily, Proportion of active weeks: course material, and Proportion of active days: course material (Global Economics only). These differences are explained by the aforementioned regrouping of content types, which reassigned more items to the Course Material category; the observed deviations therefore reflect data-quality improvements rather than effects of de-identification. The expected utility loss associated with the modification of the TIMESTAMP column did not materialize in the learning feature engineering task.

For features constructed from the de-identified data, we also report the best-fitting probability distribution. For continuous features, the candidate families are Normal, Exponential, Gamma, Beta, Student’s t, Cauchy, Weibull (minimum), Weibull (maximum), Uniform, Triangular, and Pareto; for discrete features, they are Poisson, Geometric, Negative Binomial, and Binomial. For each candidate, we estimate parameters by maximum likelihood, compute the log-likelihood of the data under the fitted model, and obtain the Akaike Information Criterion (AIC); the distribution with the smallest AIC is selected. In Figs. 4–9, we overlay the fitted probability density functions (for continuous features) or probability mass functions (for discrete features) on the empirical histograms, shown in dark blue, and report the corresponding parameter estimates.
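This model-selection procedure can be sketched as follows; the data and the (reduced) candidate set are illustrative, not the exact implementation behind the figures:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Illustrative engagement-like continuous feature
data = rng.weibull(a=1.8, size=500) * 3.0

# A subset of the continuous candidate families named in the text
candidates = {
    "weibull_min": stats.weibull_min,
    "gamma": stats.gamma,
    "norm": stats.norm,
    "expon": stats.expon,
}

best_name, best_aic = None, np.inf
for name, dist in candidates.items():
    params = dist.fit(data)                      # maximum-likelihood estimates
    loglik = np.sum(dist.logpdf(data, *params))  # log-likelihood under the fit
    aic = 2 * len(params) - 2 * loglik           # Akaike Information Criterion
    if aic < best_aic:
        best_name, best_aic = name, aic

print(f"best family: {best_name}, AIC = {best_aic:.2f}")
```

The family with the lowest AIC is retained, which penalizes families that need extra parameters to achieve the same fit.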

Findings

We summarize the findings and conclusions of our privacy and utility evaluations:

  • In almost all cases, the de-identification approach has led to an increased k-value, indicating an improvement in terms of privacy.

  • The applied transformations have largely retained the feature distributions.

  • All features could be successfully fitted with a distribution selected from a set of well-known candidates. For most continuous features, the best fit was provided by the Weibull Minimum Extreme Value distribution. In the majority of cases, its shape parameter exceeded 1, indicating that students are unlikely to be completely inactive. Moreover, once a moderate level of engagement is reached, the likelihood of observing higher engagement increases gradually. This implies that most students demonstrate stable engagement patterns. This outcome is consistent with the data preprocessing step, in which students without any recorded activity on the LMS platform were excluded (see Tiukhova et al.10 for more details). These findings align with the Weibull distribution’s ability to model a wide range of random variables, including, for example, the time a user spends on a webpage23.

  • For the discrete features, the best-fitting distribution varied by feature. At the day level, daily activity (active vs. inactive) naturally follows a Binomial distribution in most of the cases, which fits the feature Median number of active days per week (with seven fixed trials). Similarly, the feature Median difference between active days corresponds to a Geometric distribution in most of the cases, as it captures the waiting time until the next success. At the within-session level, actions resemble arrivals in a Poisson process: the feature Median number of actions per session is well modeled by a Poisson distribution. In contrast, the feature Total number of created posts reflects a memoryless “continue/stop” decision after each post, making the Geometric distribution the most appropriate fit.
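As a sketch of how such discrete fits compare (using simulated session counts, not the dataset itself, and only two of the candidate families):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Illustrative counts of actions per session
actions_per_session = rng.poisson(lam=6.0, size=400)

# MLE for the Poisson rate is simply the sample mean
lam_hat = actions_per_session.mean()
loglik_pois = stats.poisson.logpmf(actions_per_session, lam_hat).sum()
aic_poisson = 2 * 1 - 2 * loglik_pois

# Geometric fit on the shifted support {1, 2, ...}; MLE: p = 1 / (1 + mean)
p_hat = 1.0 / (1.0 + actions_per_session.mean())
loglik_geom = stats.geom.logpmf(actions_per_session + 1, p_hat).sum()
aic_geom = 2 * 1 - 2 * loglik_geom

print(f"Poisson AIC: {aic_poisson:.1f}, Geometric AIC: {aic_geom:.1f}")
```

For arrival-like counts such as these, the Poisson family attains the lower AIC, mirroring the finding for the Median number of actions per session feature.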

Usage Notes

We first discuss usage limitations and then motivate the broader suitability of the presented dataset for various LA use cases.

Limitations

LMS idiosyncrasies impose some important usage limitations. When a directory is opened, event logs are also created for opening all of its sub-directories, resulting in multiple clicks with the same timestamp, which does not correspond to physical clicks. Such cases need to be accounted for when processing the data (see an example of doing so in the source code for the Median number of actions per session feature).
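One way to account for such simultaneous events, sketched here with pandas on a toy log_activity table (collapsing events by shared session and timestamp is our reading of the intent, not a verbatim excerpt of the source code):

```python
import pandas as pd

# Toy log_activity rows: opening a directory also logs its sub-directories,
# so the first three rows share one session and one timestamp
log_activity = pd.DataFrame({
    "USER_ID":    ["u1", "u1", "u1", "u1", "u2"],
    "SESSION_ID": ["s1", "s1", "s1", "s2", "s3"],
    "TIMESTAMP":  ["2020-10-01 10:00:00"] * 3
                  + ["2020-10-01 11:00:00", "2020-10-02 09:00:00"],
})

# Collapse simultaneous events into a single action per (session, timestamp)
actions = log_activity.drop_duplicates(subset=["SESSION_ID", "TIMESTAMP"])

# Actions per session, then the per-user median
actions_per_session = actions.groupby(["USER_ID", "SESSION_ID"]).size()
median_actions = actions_per_session.groupby("USER_ID").median()
print(median_actions)
```

Without the deduplication step, user u1's first session would count three actions for a single physical click.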

In AY 2020–2021, the Global Economics course was managed through two separate LMS pages, each assigned to a distinct group of students based on the program they follow. Consequently, this course has two different COURSE_ID values for that academic year: Global Economics 1 and Global Economics 2. Any activity a student may have generated on the LMS page of the other program is not taken into account during feature engineering; only the activity on the page matching the student’s program is used.

Suitability for Learning Analytics use cases

The dataset presented in this paper possesses several characteristics that make it valuable for a wide range of downstream LA tasks.

First, the nature of the data itself is conducive to diverse analytical approaches. In the Utility section, we demonstrated how high-level learning features can be engineered from granular data. These features can be leveraged in both supervised (e.g., binary or multiclass classification with decision trees or neural networks) and unsupervised (e.g., clustering, principal component analysis) machine learning tasks. Moreover, the temporal structure of student interactions with the LMS enables time-aware analyses such as process mining. The transparent publication of this dataset facilitates its use in explainable AI initiatives, particularly explainable AI in education (XAI-ED), which has been strongly advocated in the LA community24. In addition, the availability of detailed forum contribution and consumption data also enables social network analysis, widely recognized in the LA domain25.

Second, the data originates from two distinct courses, making it suitable for research on the generalizability of findings across different learning contexts (e.g., examining how predictive model performance shifts when applied to a different course).

Third, the dataset spans three academic years, providing opportunities to investigate the stability and generalizability of findings over time and under varying external conditions. As discussed, the dataset also covers the COVID-19 pandemic, which supports the evaluation of performance stability under such disruptive external factors (e.g., concept drift).

Finally, this paper offers a detailed and standalone description of the dataset, including its domain, structure, and the statistical characteristics of the derived features. Future efforts to publish learning data may adopt a data curation approach similar to the one presented in this paper.

Acknowledgements

We would like to thank our colleagues from ICTS at KU Leuven for their assistance with exporting data from the LMS.

Author contributions

Elena Tiukhova was responsible for data curation, preprocessing, technical validation, and drafting the manuscript. Dimitri Van Landuyt contributed to data de-identification and manuscript review. Monique Snoeck led the ideation and project management and also reviewed the manuscript. Bart Baesens provided critical manuscript review.

Data availability

The dataset files can be found on Zenodo21.

Code availability

The code for feature engineering can be found on Zenodo21.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Siemens, G. & Long, P. Penetrating the fog: Analytics in learning and education. EDUCAUSE Review 46, 30 (2011).
  • 2. Kuzilek, J., Hlosta, M. & Zdrahal, Z. Open university learning analytics dataset. Scientific Data 4, 170171, 10.1038/sdata.2017.171 (2017).
  • 3. Santoso, P. H., Setiaji, B., Kurniawan, Y. et al. Students’ performance dataset for using machine learning technique in physics education research. Scientific Data 12, 987, 10.1038/s41597-025-04913-0 (2025).
  • 4. Juřík, V., Juhaňák, L., Ružičková, A. et al. Experimental dataset on eye-tracking activity during self-regulated learning. Scientific Data 12, 967, 10.1038/s41597-025-05304-1 (2025).
  • 5. KU Leuven Learning Lab. Alpacas: Adaptieve leerlijnen voor activering en assessment van studenten [Adaptive learning paths for student activation and assessment]. https://www.kuleuven.be/english/study/onderwijs/learninglab/projecten/idl-2019/alpacas-adaptieve-leerpaden-voor-activering-assessment-studenten (2019).
  • 6. Flores, N. L., Islind, A. S. & Óskarsdóttir, M. A learning analytics-driven intervention to support students’ learning activity and experiences. In Digitalization and Digital Competence in Educational Contexts, 81–102 (Routledge, 2023).
  • 7. Vemuri, P., Snoeck, M. & Poelmans, S. Adaptive release learning paths to motivate active learning and engagement in students. In Proceedings of the 18th International Conference on Cognition and Exploratory Learning in Digital Age (CELDA 2021), 287–290 (ERIC, 2021).
  • 8. Vemuri, P., Poelmans, S., Pandya, H. & Snoeck, M. Studying cohort influence on student performance prediction in multi-cohort university courses. In European Conference on Technology Enhanced Learning, 623–630 (Springer, 2022).
  • 9. Tiukhova, E. et al. Discovering unusual study patterns using anomaly detection and XAI. In Proceedings of the 57th Hawaii International Conference on System Sciences, 1427–1436, https://hdl.handle.net/10125/106555 (2024).
  • 10. Tiukhova, E. et al. Explainable learning analytics: Assessing the stability of student success prediction models by means of explainable AI. Decision Support Systems 182, 114229 (2024).
  • 11. Tiukhova, E. et al. Should I change my course? Instructional design theory-informed learning analytics dashboard for teachers. In Machine Learning and Principles and Practice of Knowledge Discovery in Databases: International Workshops of ECML PKDD 2024 (2024).
  • 12. Winne, P. H. & Baker, R. S. The potentials of educational data mining for researching metacognition, motivation and self-regulated learning. J. Educ. Data Min. 5, 1–8 (2013).
  • 13. Jovanović, J., Saqr, M., Joksimović, S. & Gašević, D. Students matter the most in learning analytics: The effects of internal and instructional conditions in predicting academic success. Comput. Educ. 172, 104251 (2021).
  • 14. Deeva, G., De Smedt, J., De Koninck, P. & De Weerdt, J. Dropout prediction in MOOCs: a comparison between process and sequence mining. In Business Process Management Workshops: BPM 2017 International Workshops, Barcelona, Spain, September 10-11, 2017, Revised Papers 15, 243–255 (Springer, 2018).
  • 15. Francis, M., Avoseh, M., Card, K., Newland, L. & Streff, K. Student privacy and learning analytics: Investigating the application of privacy within a student success information system in higher education. Journal of Learning Analytics 10, 102–114 (2023).
  • 16. Chicaiza, J., Cabrera-Loayza, M. C., Elizalde, R. & Piedra, N. Application of data anonymization in learning analytics. In Proceedings of the 3rd International Conference on Applications of Intelligent Systems, 1–6 (2020).
  • 17. Marshall, R., Pardo, A., Smith, D. & Watson, T. Implementing next generation privacy and ethics research in education technology. British Journal of Educational Technology 53, 737–755 (2022).
  • 18. Tsoni, R., Zorkadis, V. & Verykios, V. S. A data pipeline to preserve privacy in educational settings. In Proceedings of the 25th Pan-Hellenic Conference on Informatics, PCI ’21, 138–142, 10.1145/3503823.3503850 (Association for Computing Machinery, New York, NY, USA, 2022).
  • 19. Joksimović, S. et al. Privacy-driven learning analytics. In Manage Your Own Learning Analytics: Implement a Rasch Modelling Approach, 1–22 (Springer, 2021).
  • 20. Garfinkel, S., Near, J., Dajani, A., Singer, P. & Guttman, B. De-identifying Government Datasets: Techniques and Governance (US Department of Commerce, National Institute of Standards and Technology, 2023).
  • 21. Tiukhova, E., Van Landuyt, D. & Snoeck, M. Open data, private learners: A de-identified dataset for learning analytics research. Zenodo, 10.5281/zenodo.17087849 (2025).
  • 22. Samarati, P. & Sweeney, L. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression (1998).
  • 23. Liu, C., White, R. W. & Dumais, S. Understanding web browsing behaviors through Weibull analysis of dwell time. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’10, 379–386, 10.1145/1835449.1835513 (Association for Computing Machinery, New York, NY, USA, 2010).
  • 24. Khosravi, H. et al. Explainable artificial intelligence in education. Computers and Education: Artificial Intelligence 3, 100074 (2022).
  • 25. Flores, N. G. L., Islind, A. S. & Óskarsdóttir, M. Exploring study profiles of computer science students with social network analysis. In HICSS, 1–10 (2022).


