Data in Brief. 2020 Jan 3;28:105090. doi: 10.1016/j.dib.2019.105090

WET: Word embedding-topic distribution vectors for MOOC video lectures dataset

Zenun Kastrati a, Arianit Kurti a, Ali Shariq Imran b
PMCID: PMC6950834  PMID: 31921958

Abstract

In this article, we present a dataset containing word embeddings and document topic distribution vectors generated from MOOC video lecture transcripts. Transcripts of 12,032 video lectures from 200 courses were collected from the Coursera learning platform. This large corpus of transcripts was used as input to two well-known NLP techniques, namely Word2Vec and Latent Dirichlet Allocation (LDA), to generate word embeddings and topic vectors, respectively. We used the Word2Vec and LDA implementations from the Gensim package in Python. The data presented in this article are related to the research article entitled “Integrating word embeddings and document topics with deep learning in a video classification framework” [1]. The dataset is hosted in the Mendeley Data repository [2].

Keywords: Word embedding, Document topics, Video lecture transcript, MOOC, LDA, Word2Vec


Specifications Table

Subject: Computer Science
Specific subject area: Machine Learning, Natural Language Processing, Text Classification, eLearning
Type of data: Table in csv format
How data were acquired: The dataset was collected and created using video lectures and their corresponding transcripts gathered from a MOOC learning platform
Data format: Raw and analyzed
Parameters for data collection: Standard text processing methods were applied
Description of data collection: The dataset contains word embeddings and document topics generated from MOOC video lecture transcripts using Python's Gensim implementations of the Word2Vec and LDA algorithms.
Data source location: Coursera MOOC learning platform
Data accessibility: Accessible on the Mendeley Data repository [2]
Related research article: Zenun Kastrati, Ali Shariq Imran, and Arianit Kurti, Integrating word embeddings and document topics with deep learning in a video classification framework, Pattern Recognition Letters, 128C (2019), pp. 85–92, https://doi.org/10.1016/j.patrec.2019.08.019
Value of the Data
  • This dataset is useful to the research community for two reasons. First, it is the first dataset of its kind aimed at video lecture classification using NLP techniques that is collected from real-world massive open online courses (MOOCs). Second, its large corpus size allows it to serve as a standard benchmark for these research areas and for evaluating the performance of existing and new methods and techniques.

  • Researchers in machine learning, information retrieval, video processing, and education can use these data in various research tasks, such as transfer learning, video lecture classification and recommendation, contextual analysis, short-text enrichment with topics, performance analysis of deep learning models and techniques, and personalized learning.

  • These data could also be adopted by content providers and managers of MOOC learning platforms to organize educational resources for maximum visibility and to make the best content easy to search and find.

1. Data

The key summary statistics of the MOOC video lecture transcript corpus used to generate word embeddings and topic representation vectors are presented in Table 1. The dataset contains 12,032 video lecture transcripts, comprising over 878 thousand sentences and more than 79 million tokens. The vocabulary size is over 68 thousand unique words.

Table 1.

Statistics of the data used to generate word embeddings and topic representation vectors.

# of documents    # of sentences    # of tokens    Vocabulary size
12,032            878,209           79,680,144     68,176

Video transcripts vary in length from 228 to 32,767 tokens, with an average of 6,622 tokens per transcript. Fig. 1 shows the length variation as a box plot and the distribution of tokens across the entire transcript corpus as a density function.

Fig. 1. Length of video transcripts and distribution of tokens among them.
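The Fig. 1 summary can be reproduced directly from per-transcript token counts. Below is a minimal sketch, assuming a hypothetical tokenized_transcripts list that holds one token list per video transcript; the original plotting code is not part of the dataset.

```python
# Minimal sketch: per-transcript token counts, box plot and density.
# `tokenized_transcripts` is a hypothetical list of token lists.
import matplotlib.pyplot as plt

lengths = [len(tokens) for tokens in tokenized_transcripts]
print(f"min={min(lengths)}, max={max(lengths)}, "
      f"mean={sum(lengths) / len(lengths):.0f}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.boxplot(lengths)                        # length variation (box plot)
ax1.set_ylabel("tokens per transcript")
ax2.hist(lengths, bins=50, density=True)    # token distribution (density)
ax2.set_xlabel("tokens per transcript")
plt.show()
```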

Table 2 shows the general-level and fine-grained categories of the dataset, along with the number of video lecture transcripts associated with each category.

Table 2.

Distribution of video transcripts among general-level and fine-grained categories.

General-level category    Fine-grained category    # of docs

Art and Humanities
    History    310
    Music and Art    338
    Philosophy    267
Physical Sciences and Engineering
    Electrical Engineering    516
    Mechanical Engineering    287
    Chemistry    411
    Environmental Science and Sustainability    340
    Physics and Astronomy    455
    Research Methods    199
Computer Science
    Software Development    284
    Mobile and Web Development    390
    Algorithms    338
    Computer Security and Networks    351
    Design and Product    228
Data Science
    Data Analysis    205
    Machine Learning    549
    Probability and Statistics    283
Business
    Leadership and Management    281
    Finance    346
    Marketing    242
    Entrepreneurship    216
    Business Essentials    223
    Business Strategy    261
Information Technology
    Cloud Computing    171
    Security    139
    Data Management    236
    Networking    153
    Support and Operations    349
Health
    Animal Health    227
    Basic Science    480
    Health Informatics    209
    Healthcare Management    167
    Patient Care    325
    Public Health    210
    Research    274
    Psychology    299
Social Sciences
    Economics    516
    Education    293
    Governance and Society    331
    Law    333

A visual representation of word embeddings generated from the MOOC video lecture corpus, projected into a geometric space using principal component analysis (PCA), is illustrated in Fig. 2. More specifically, Fig. 2 shows the mapping of the word ‘studying’ and its neighbours, e.g., academic, studies, institution, and reading, in three-dimensional space.

Fig. 2. 3D word embeddings visualization.
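A visualization in the style of Fig. 2 can be obtained by projecting trained vectors with PCA. The sketch below is an illustration, not the authors' code: the file name wet_word2vec_300d.kv is a hypothetical placeholder for saved Gensim KeyedVectors, and scikit-learn's PCA stands in for whatever projection tool was actually used.

```python
# Minimal sketch: project a query word and its nearest neighbours
# from 300 dimensions down to 3 with PCA, then plot them.
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

vectors = KeyedVectors.load("wet_word2vec_300d.kv")  # hypothetical file
query = "studying"
neighbours = [w for w, _ in vectors.most_similar(query, topn=8)]
words = [query] + neighbours

# Stack the selected embeddings and reduce 300 -> 3 dimensions.
matrix = np.stack([vectors[w] for w in words])
coords = PCA(n_components=3).fit_transform(matrix)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2])
for (x, y, z), w in zip(coords, words):
    ax.text(x, y, z, w)
plt.show()
```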

2. Experimental design, materials, and methods

A new real-world dataset from the education domain is presented in this article. The dataset contains word embeddings and document topic distribution vectors generated from a corpus of 12,032 video lecture transcripts. The steps involved in collecting and creating the WET dataset are illustrated in Fig. 3.

Fig. 3. Dataset collection and creation scheme.

As a first step, we downloaded video lecture transcripts from the Coursera learning platform and annotated them. For annotation, we used the two-level hierarchical organizational structure of Coursera, where each downloaded video transcript is associated with one fine-grained category and one general-level category of the structure. The dataset comprises 8 general-level and 40 fine-grained categories; the distribution of lecture transcripts among these categories is given in Table 2.

Prior to creating the corpus and dictionary for generating word embeddings and topic distribution vectors, the video lecture transcripts underwent several preprocessing steps: converting text to lowercase and removing stop words, punctuation, words not composed purely of alphabetical characters, and single-character words. In addition, the WordNet lemmatizer was used to lemmatize all words in the transcripts. The Natural Language Toolkit (NLTK), an open-source Python library for symbolic and statistical natural language processing, was used to perform these preprocessing tasks.
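As an illustration, the following minimal sketch implements the described preprocessing with NLTK; the exact step order and tokenizer used in the original pipeline may differ.

```python
# Minimal sketch of the described preprocessing with NLTK: lowercase,
# tokenize, drop stop words / punctuation / non-alphabetic tokens /
# single characters, then lemmatize with WordNet.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(transcript):
    """Return the cleaned, lemmatized token list for one transcript."""
    tokens = word_tokenize(transcript.lower())
    tokens = [t for t in tokens
              if t.isalpha() and len(t) > 1 and t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]
```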

2.1. Word embeddings

To train and generate word embeddings, we used the Word2Vec [3] word embedding technique implemented in Python's Gensim package [5]. Word2Vec is an unsupervised learning method in which word embeddings are learned from the distribution of word co-occurrences within a local context, that is, a text window scanned across the whole corpus. Word2Vec has two model architectures, namely the continuous bag-of-words (CBOW) and the skip-gram.

We set the Word2Vec parameters as shown in Table 3. Word embeddings of four different vector sizes, namely 50, 100, 200, and 300 dimensions, are generated.

Table 3.

Parameter settings used for Word2Vec.

Parameter settings
window size = 5
min_count = 1
alpha = auto
workers = 4

Each of the four csv files contains 68,176 lines, each comprising a unique word followed by 50, 100, 200, or 300 real numbers that correspond to the respective embedding dimensions.
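To illustrate how such files can be produced, the sketch below trains Word2Vec with the Table 3 settings in Gensim and writes one csv file per vector size. The sentences variable and the output file names are assumptions; Table 3's "alpha = auto" is represented here by Gensim's default, automatically decayed learning rate.

```python
# Minimal sketch: train Word2Vec for each vector size and export the
# vocabulary with its vectors, one word per csv line.
import csv
from gensim.models import Word2Vec

for dim in (50, 100, 200, 300):
    model = Word2Vec(
        sentences=sentences,   # token lists from preprocessing (assumed)
        vector_size=dim,       # `size=dim` in Gensim 3.x
        window=5,
        min_count=1,
        workers=4,
    )
    # One line per vocabulary word: the word followed by `dim` floats.
    with open(f"wet_word2vec_{dim}d.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for word in model.wv.index_to_key:   # `index2word` in Gensim 3.x
            writer.writerow([word, *model.wv[word]])
```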

2.2. Topic distribution vectors

We conducted unsupervised topic modeling on the MOOC lecture transcript corpus. Latent Dirichlet allocation (LDA) [4], a conventional topic modeling scheme implemented in Python's Gensim package, is used for generating document topic distribution vectors. LDA is a generative statistical model in which each document of a corpus is represented by a finite mixture of topics/themes which, in turn, are represented by groups of words. The parameter settings given in Table 4 are used for LDA, and we performed training for varying numbers of topics: 50, 100, 200, and 300.

Table 4.

Parameter settings used for LDA.

Parameter settings
chunksize = 500
iterations = 400
passes = 20
alpha = auto
eta = auto

Each csv file contains 12,032 lines, one per video lecture transcript, each comprising the transcript followed by 50, 100, 200, or 300 real numbers that correspond to the respective number of topics.
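A minimal sketch of how such document-topic vectors can be generated with Gensim's LdaModel under the Table 4 settings is given below; texts is assumed to hold the preprocessed transcripts, one token list per document.

```python
# Minimal sketch: build the dictionary and bag-of-words corpus, train
# LDA with the Table 4 settings, and extract a dense topic vector for
# every transcript.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=100,    # also run with 50, 200, and 300
    chunksize=500,
    iterations=400,
    passes=20,
    alpha="auto",
    eta="auto",
)

# Full (dense) topic distribution for each transcript.
topic_vectors = [
    [p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
    for bow in corpus
]
```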

Concatenating the document topics obtained from the LDA model with either the general-level or the fine-grained category labels produces the final csv files, as shown in Fig. 3.
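The concatenation step itself can be as simple as the following sketch, which assumes the topic_vectors list from the previous sketch and a hypothetical fine_grained_labels list holding one category label per transcript; the column and file names are placeholders.

```python
# Minimal sketch: attach category labels to the topic vectors and
# write the final csv, one row per transcript.
import pandas as pd

df = pd.DataFrame(topic_vectors)          # one row per transcript
df["category"] = fine_grained_labels      # or general_level_labels
df.to_csv("wet_lda_100topics_fine_grained.csv", index=False)
```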

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.dib.2019.105090.

Multimedia component 1: mmc1.xml (1.2KB)

References

1. Kastrati Z., Imran A.S., Kurti A. Integrating word embeddings and document topics with deep learning in a video classification framework. Pattern Recognit. Lett. 2019;128C:85–92.
2. Kastrati Z., Kurti A., Imran A.S. Embeddings and Topic Vectors for MOOC Lectures Dataset. Mendeley Data, v1; 2019.
3. Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS’13); 2013. pp. 3111–3119.
4. Blei D.M., Ng A.Y., Jordan M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003;3(4–5):993–1022.
5. Rehurek R., Sojka P. Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks; 2010. pp. 45–50.
