PLOS One. 2022 Feb 17;17(2):e0263448. doi: 10.1371/journal.pone.0263448

Employing automatic content recognition for teaching methodology analysis in classroom videos

Muhammad Aasim Rafique 1, Faheem Khaskheli 2, Malik Tahir Hassan 3, Sheraz Naseer 2, Moongu Jeon 1,*
Editor: Felix Albu
PMCID: PMC8853489  PMID: 35176072

Abstract

A teacher plays a pivotal role in grooming a society and paves the way for its social and economic development. Teaching is a dynamic role and demands continuous adaptation: a teacher adopts teaching techniques suited to a particular discipline and situation. A thorough, detailed, and impartial observation of a teacher is a prerequisite for adopting an effective teaching methodology, yet it is a laborious exercise. This work suggests an automatic strategy for analyzing a teacher's teaching methodology in a classroom environment. The proposed strategy recognizes a teacher's actions in videos of delivered lectures. In this study, 3D CNN and Conv2DLSTM with time-distributed layers are used for experimentation. A range of actions is recognized over complete classroom sessions, and the reported results are considered effective for the analysis of a teacher's teaching technique.

1 Introduction

Quality education is one of the seventeen sustainable development goals of the Department of Economic and Social Affairs of the United Nations. A pivotal component of quality education is a skilled and erudite teacher. Quality teaching is essential for quality education, and a teacher continuously improves skills such as communication, delivery, enthusiasm, confidence, and gestures. Hence, an authentic and thorough evaluation of teaching skills in a classroom environment, with supporting feedback, is essential. Teaching is a dynamic role that demands adaptation of teaching style as courses and audiences change. A complete assessment while sitting in a classroom lecture is an exhausting exercise. However, recording video lectures has become the norm and was practiced religiously during the recent pandemic. These recorded videos can conveniently be used for the analysis of a teacher's teaching methods. A contemporary and effective strategy is to employ advancements in artificial intelligence for automated teaching-style analysis in recorded video lectures.

One route to automated analysis of teaching style in recorded videos is automatic content recognition (ACR), and the underlying task is action recognition in the recorded videos. Action recognition in videos is a challenging problem; many of its challenges arise from changes in viewpoint, camera motion, the scale of objects, the pose of the person, lighting conditions, and background. Automatic action recognition in videos is a long-standing problem in computer vision, and thriving solutions provide desirable results in specific domains such as surveillance [1], entertainment, content-based video retrieval [2], human-computer interaction, and robotics [3]. However, a general solution may not work, as humans and their actions are not universal. Thus, automatic analysis of teaching style in videos is a specialized ACR problem with its own peculiarities.

A recorded video is a sequence of images with two implicit components: the spatial content of each image and the temporal ordering of the sequence. The two components are often handled with different techniques, suited respectively to exploiting spatial features in images and to capturing relations across a sequence of images in video datasets. Convolutional Neural Networks (CNNs) [4] are state-of-the-art for object detection and classification in image data [5, 6]. A CNN is composed of multiple layers of artificial neural networks (ANN) in which each layer comprises filters that detect features from the images at different scales. The hierarchical composition of filter layers learns simple features such as edges and curves in early layers, while higher layers learn abstractions of meaningful components such as faces and shapes present in the training data. Temporal relationships in a sequence of images have been captured in learned parameters using recurrent neural networks (RNN) [7] and extended in recent years with attention modeling [8, 9]. CNNs and RNNs need a training dataset of videos with labeled sequences. Some of the publicly available datasets are UCF101 [10], HMDB51 [11], ViHASi [12], MuHAVi [13], and BOSS [14].

This study proposes a framework for the automated analysis of teaching methods in the wake of the pandemic that hit academia along with all other fields of life. The main contributions are as follows: a) an ACR framework with two deep learning architectures possessing a hard inductive bias for recognizing a teacher's actions in a recorded video lecture; b) statistical fact generation about the teaching methodology, connecting education studies and technological advancement, where the collected statistics provide feedback that may benefit educationists striving for quality teaching; c) a choice of potent parameters to understand and measure the time a teacher spends on content delivery, whiteboard usage, and student engagement.

This paper is organized as follows: Section 2 contains a survey of related works. We explain our proposed method in section 3, while section 4 discusses experiments and results. In the end, section 5 concludes this study.

2 Related work

Sustaining the quality of education requires continuous and efficient analysis of academic processes, learning objectives and outcomes, and the teaching methodologies applied. It is well recognized that teachers and their teaching methods significantly influence students' learning in a classroom environment. Monitoring of a teacher's teaching methodology can be online (i.e., in a classroom environment) or offline (e.g., through recorded lecture videos). The objective of monitoring teachers, though, should be improved student learning, not criticism of respectable teachers. Researchers have used various techniques for human monitoring, action recognition, and behavior analysis.

Prieto et al. [15] contributed to teacher behavior analysis by applying learning algorithms to sensor data. They used sensors such as accelerometers, EEG, and eye-trackers to collect data and generated statistics using the random forest algorithm. Their model showed good accuracy on training data but achieved only 63% accuracy on the test dataset. They further extended their work by classifying cases where a teacher is explaining a concept versus monitoring the students' work; using a gradient boosting tree, they achieved an improved accuracy of 89% on the test dataset. Many different sensors have been used to collect data for action recognition. Biying et al. [16] surveyed these sensors and their hardware and software limitations, dividing the sensors used for activity recognition into categories such as acoustic, electric, mechanical, optical, and electromagnetic.

Video and image data are available in abundance these days, as closed-circuit television (CCTV) cameras and storage have become cheaper. Images and video data are frequently used for human activity analysis in various applications, which has also inspired action recognition techniques in computer vision (CV), machine learning (ML), and artificial intelligence (AI). Overall, these techniques follow two types of approaches: the first focuses on feature engineering by extracting useful features such as optical flow, SIFT, SURF, manifold learning, and improved dense trajectories (IDT); the second uses deep learning techniques on raw videos. Popular deep learning architectures for object detection include variants of CNN architectures such as ResNet [17], EfficientNet [18], and Inception [19]. In contrast to simple image data, video data have a temporal (or sequential) aspect as well. Popular deep learning architectures for prediction on such sequential data are recurrent networks with gated units [20] and attention-based models [8].

For better readability, the related works have been organized into three subsections. The works that use sensor-based data come first, then the approaches that use feature engineering, then the approaches that use raw input for the analysis.

2.1 Sensor based data

Anna et al. [21] used a smartphone embedded with an accelerometer to collect data and detect human activities. They collected a personalized dataset and created a personalized model for each person based on age, weight, and height. Their model was evaluated on three different accelerometer-based public datasets: UniMiB-SHAR, MobiAct, and MotionSense. Two classifiers were used: Support Vector Machine and K-Nearest Neighbor. In subject-dependent testing, they achieved 84.79% accuracy on UniMiB-SHAR, 45.57% on MobiAct, and 43.55% on MotionSense, with an average accuracy of 57.79%. In subject-independent testing, they achieved a higher average accuracy of 70.19%.

Yu Liang et al. [22] used wearable sensors, a triaxial accelerometer and a triaxial gyroscope, on 23 different people. The participants wore the sensors on their ankles and wrists to perform ten different activities of everyday life and eleven different sports activities in a laboratory environment. The data from the sensors were used in a wearable inertial sensor network to recognize the activity performed by the user. Their model achieved 98.23% validation accuracy on everyday activities and 99.55% validation accuracy on sports activities.

Abel et al. [23] also used sensors to compute analytics related to educational design methodologies and lecture delivery by teachers. Four different aspects were observed including integration of learning and teaching analytics, analysis of real-time data collected through sensors and devices in the classroom, teacher’s digital literacy in analytics, and planning and evaluation of teaching activities. The major problems they faced were the analysis of results after lecture delivery, and not having standard measures of performance.

Gregor et al. [24] used sensor data for activity recognition of elderly people. The dataset was collected from several sensors, including 51 motion sensors, 4 item sensors on selected items, 15 door sensors, and 5 temperature sensors installed in different rooms of an apartment. They used these sensors' data to recognize 8 different activities, i.e., a bed-to-toilet transition, cleaning, cooking, grooming, shower, sleep, wake up, and work, using a hidden Markov model (HMM) based system. Finally, they used a dataset from the CASAS project to test their model. The accuracy was 94.52% on individual activities and 70.95% over the combination of all concurrent activities.

Jun Huai et al. [25] used sensor data to recognize basic activities (BA) and transitional activities (TA), evaluating their method on the public SBHAR dataset. Fragments between adjacent basic activities were used to determine whether an activity is a disturbance activity or a transitional activity. They first split the sensor data into segments and then extracted activity features based on window segmentations of different sizes. A random forest model classified each activity as either basic or transitional, achieving 90% accuracy over windows of different durations.

Besides sensor data, some researchers have used notes, comments, or audio data collected during a lecture to analyze a teacher's behavior and lecture delivery. Anmol et al. [26] created software that uses audio and visual input from a presentation to summarize it. The speaker's audio, the presentation slides, and handwritten notes are combined to analyze and summarize the session and to log the summary to a server accessible to authorized users. The authors used a microphone for audio input and a camera for visual input. Google Speech Recognition was used to extract text from the audio, an R-CNN model was used on the visual input for text detection, and OpenCV libraries were used for text extraction. Finally, natural language processing techniques were applied for the analysis and for generating the summary.

Zhao et al. [27] analyzed teacher behavior in an online teaching environment. They used comments provided by 1168 students about 9 different teachers from different mainstream live platforms, processed through the NVivo software. Their results were subjective: they found that teachers need to focus on characteristics including professional attitude, scientific knowledge, logical reasoning, and rhythmic language, and on teaching behaviors such as precise teaching, flexible interaction, and after-school counseling.

Some researchers have focused on students' behavior so that teacher-student interaction can be improved. Ku Yu-Te et al. [28] introduced a system called ClassFu which focuses on students' behavior in class activities. ClassFu uses image sensors to analyze a student's behavior. By monitoring the classroom environment and the level of a student's interaction, ClassFu can collect data during online classes as well as in a physical classroom setup. A summary of the data is presented to the teacher, which can be used to improve student-teacher interaction in the future.

RoboThespian is a life-size humanoid robot that was used to teach children in grades 5 to 7 (Igor et al. [29]). The study focuses on human-robot interaction and on how well a robot can teach a human student. Two different groups of school students were taught in two different classroom environments. Afterwards, effectiveness in learning new science concepts and in creating positive perceptions of the robot teacher was evaluated. The results show that the students were able to understand the concepts taught by the robot.

2.2 Feature engineering approach

Popular feature extraction algorithms include optical flow, SIFT, SURF, manifold learning, and IDT features. Tran et al. [30] used 3D convolutions on video data to create a model that takes multiple frames as input. The model proposed in [30] was trained on the Sports-1M dataset [31], yielding 82 to 85% accuracy. Tran et al. further proposed the use of handcrafted IDT features, which increased the accuracy of their model to up to 90%.

Donahue et al. [32] extracted spatial features from videos using blocks of convolutional layers and used a recurrent neural network (RNN) layer based on Long Short-Term Memory (LSTM) cells for temporal features. LSTM, proposed by Hochreiter and Schmidhuber [7], is a more powerful generalization of the vanilla RNN with certain differences in cell architecture. Donahue et al. [32] trained their model using raw RGB frames and optical flow, achieving 82% accuracy on the UCF101 dataset.

Sheng et al. [33] used OpenPose to extract the coordinates of human joints, i.e., key points, from an image or video of a person. These key points were used to classify the actions of a teacher using DenseNet. Along with the teacher's pose, facial emotions were also analyzed using Microsoft's emotion analysis model. They used a self-made dataset to test the model.

Chao and Qiushi et al. [34] used image data of faces to recognize facial expressions, creating their own dataset of 10 different people to test their models. They used two different approaches to extract features: in the first, they used the LBP operator to extract facial contours, after which they created a pseudo-3D model used to define six facial-expression sub-regions. They used two classification algorithms, SVM and Softmax, to classify two different types of expression representation, the basic emotion model and the circumplex emotion model. Their final results show that the eyes and mouth are the major factors in identifying facial expressions.

2.3 Raw input approach

Feichtenhofer et al. [35] proposed a two-stream network consisting of spatial and temporal components. They fused two different 3D CNN models, one for the spatial component and one for the temporal component. The UCF101 [10] and HMDB51 [11] datasets were used to train and evaluate the model, with the best results reaching 92% accuracy on UCF101 and 65% on HMDB51. In another work, Feichtenhofer et al. [36] used spatial and temporal components of video inputs to recognize actions performed in a video. The input was extracted based on the difference between successive frames of a video, and these differences between multiple frames were fed to CNN models constructed using different fusion techniques. They employed different deep neural network (DNN) architectures, including their previous two-stream network combined with LSTM, two-stream combined with CNN, two-stream combined with pre-trained ImageNet weights of VGG16, and ST-ResNet (Spatio-Temporal Residual Networks) [37]. Across all experiments, the ST-ResNet model with ImageNet weights achieved the best results. The UCF101 and HMDB51 datasets were used to evaluate the model, and the best results reported by Feichtenhofer et al. [36] were 93% accuracy on the UCF101 dataset and 66% on the HMDB51 dataset.

Wang et al. [38] created a different model architecture using CNN layers which accepts both spatial and temporal data as input. The model proposed in [38] was also trained and tested on the UCF101 and HMDB51 datasets, with reported test accuracies of 94% and 68% respectively, which were the best on the two datasets at the time. Ullah et al. [39] also proposed a deep neural network comprising CNN and LSTM layers for action recognition. The model proposed in [39] used CNN layers to extract features from each frame; these features were fed to an LSTM layer used to capture the connections between frames in a video. The model was evaluated on the UCF101, HMDB51, and YouTube datasets, achieving 87% accuracy on HMDB51, 91% on UCF101, and 92% on the YouTube dataset.

Takuhiro et al. [40] used a video dataset to recognize group activities performed by humans. Their approach is also based on spatio-temporal features: they used a CNN for the extraction of spatial features and an LSTM for the extraction of temporal features. After extracting these features, they used a fully connected conditional random field (FCCRF) to classify the actions performed by people in the video. They used two different datasets, the Collective Activity Dataset and the Collective Activity Extended Dataset, to evaluate their model, and were able to improve upon the baseline accuracy on these datasets. Sharma et al. [41] present a human action recognition dataset in classroom environments, called EduNet. The dataset is a collection of classroom activities from schools of grades 1 to 12. It has 20 action classes containing both teacher and student activities. The authors report an accuracy of 72.3% for a standard I3D-ResNet-50 model on the EduNet dataset. Sun et al. [42] have also worked on classroom videos and have contributed a dataset, but their focus is on students' classroom behavior, not teachers', e.g., a student listening, yawning, or sleeping. Nida et al. [43] propose a deep learning method for a teacher's activity recognition in a classroom. They develop a dataset, IAVID-I (Instructor Activity Video Dataset-I), with nine action classes and report an accuracy of 81.43% for their model on this dataset. Gang et al. [44] propose a method to recognize eight kinds of teacher behavior (action classes) in an actual teaching scene, achieving an accuracy of 81% on their TAD-08 dataset. In comparison to the above related works, the focus of our work is primarily on a teacher's classroom activities, not the students'. Our videos are university classroom videos. We have 11 action classes, and the proposed 3DCNN model achieves a high accuracy of 94%.

3 Methodology

Our study progresses in five steps, starting with data collection and ending in experimentation and results analysis. Each of these steps is discussed in detail in the coming subsections.

3.1 Dataset introduction and preprocessing

Data abundance plays a vital role [45, 46] in state-of-the-art deep learning techniques. Although data is abundantly produced in today's gadget-loaded environments, usable data is scarce: AI techniques need labeled data for supervised learning algorithms. In this study, data are collected from CCTV lecture recordings of multiple class sessions held at a university campus. The research has been approved by the Research Ethics and Support Committee, University of Management and Technology, Lahore, with a signed approval letter RE-016-2021. The consent of the participants visible in the CCTV videos was obtained and approved by the committee. The dataset contains ten videos of five different teachers, and each video is more than one hour long. The videos are recorded at a frame rate of 25 frames per second (fps), and the size of each frame is 704 pixels in width and 576 pixels in height (704 x 576). The lecture videos are split into three-second clips, i.e., 75 frames in each data chunk, and there are around 1050 clips (after removing noisy data). The images from the dataset are augmented using horizontal flip, zoom-in, zoom-out, change of brightness, image rotation, and image blur. Clips generated from the videos are manually annotated with labels of a teacher's actions from the following categories:

  • Standing: contains videos of a teacher standing or walking. A standing action facing the students suggests a teacher’s openness.

  • Writing: contains videos of a teacher writing on a board. These videos are based on hand movement since we cannot see what is being written on board because the quality of the video is not good enough. A writing action suggests the engagement of the students using a visible medium.

  • Pointing: contains videos of a teacher pointing to a board. A pointing action suggests interaction.

  • Talking: contains videos of a teacher attending to students. In all these videos, students are standing close to the teacher. Talking also suggests interaction, delivery, and engagement.

  • Cleaning: contains videos of a teacher cleaning the board with a duster or hand.

  • Delivering Presentation, Teacher Standing: contains videos of the teacher standing while the projector displays slides. It is self-explanatory.

  • Presentation, Writing: contains videos of the teacher writing on a board while the projector displays slides. It is self-explanatory as it suggests teacher’s engagement.

  • Delivering Presentation, Pointing on Board: This class contains videos of the teacher pointing to the board while the projector shows slides.

  • Presentation, Talking: contains videos of a teacher and a student talking while the projector displays slides. It is self-explanatory as it suggests teacher’s delivery.

  • Presentation, Cleaning: contains videos of a teacher cleaning the board while the projector shows slides.

  • Writing, Talking: contains videos of a teacher writing on a board while students stand close to the teacher and interact with him. A writing action suggests the engagement of the students using a visible medium.

Sample images from selected videos from the dataset are depicted in Fig 1.

Fig 1. In this figure, two sample images from the dataset are shown.

3.2 Background

Inductive bias [47] has matched assorted compositions of ANNs with the problem domains where their utilities fit best. A challenge, however, is the modeling of event-based segmentation; an event-based segmentation problem is often mapped onto a clock-based segmentation. Motivated to explore models with a suitable inductive bias for our framework, this study probes multiple DNN models including 3D CNN, Conv2DLSTM, and time-distributed 2D CNN with LSTM. Fig 2 shows block diagrams of the investigated models. Details of the selected models are given in the following sections.

Fig 2. This figure depicts the architectures of the three reported DNN models, side by side. (a) CNN and LSTM implemented using a time-distributed layer: an image frame from the video is given as input at each time step and the action is predicted at the end. (b) Conv2DLSTM: the image frame at each time step is presented to the network and the features from all the images are used for action prediction. (c) 3DCNN: all images are concatenated and presented to the network for an action prediction.

3.2.1 Conv2D LSTM DNN

A video dataset is spatio-temporal, and a sequential learning algorithm can be employed to explore the relationships between frames. A recurrent neural network (RNN) exhibits an inductive bias toward temporal data and thus corresponds to equivariance over time in the learning algorithm. An LSTM unit in an RNN is a complex activation function that carries useful information from previous time steps in the form of representations and integrates it with the current time step for prediction. The representations, called cell states, are trimmed and augmented with the activations of the hidden units at the previous time step and the inputs at the current time step, using the forget and input activation functions. The output activation of the current time step is computed using the hidden activation and cell state of the previous time step and the input at the current time step. An LSTM unit is depicted in Fig 3, and the following equations define the computation of an RNN with LSTM units:

$$
\begin{aligned}
\tilde{C}_t &= \tanh(W_c[h_{t-1}, X_t] + b_c)\\
i_t &= \sigma(W_i[h_{t-1}, X_t] + b_i)\\
o_t &= \sigma(W_o[h_{t-1}, X_t] + b_o)\\
f_t &= \sigma(W_f[h_{t-1}, X_t] + b_f)\\
C_t &= i_t * \tilde{C}_t + f_t * C_{t-1}\\
h_t &= o_t * \tanh(C_t)
\end{aligned} \tag{1}
$$

where the W are weight matrices; i, o, f, and C are the input, output, and forget gates and the cell state; t is the time step; and b denotes a bias. The convolution operation for a single location (i, j) is given as:

$$
r_{ij} = \phi\left(\sum_{m=0}^{M-1}\sum_{n=0}^{N-1}\sum_{k} w_{m,n,k}\, v_{i+m-1,\, j+n-1,\, k}\right) \tag{2}
$$

where M and N are the dimensions of the kernel, k indexes the filter channels, and ϕ is the activation function.
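
To make the notation of Eq 1 concrete, the following NumPy sketch evaluates one LSTM time step. It is illustrative only; the dictionary-based weight layout and the toy dimensions are our own assumptions, not the implementation used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eq 1.

    W and b are dicts of weight matrices and bias vectors for the candidate
    cell state ("c") and the input/output/forget gates ("i", "o", "f");
    each weight matrix acts on the concatenation [h_{t-1}, X_t].
    """
    z = np.concatenate([h_prev, x_t])
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    c_t = i_t * c_tilde + f_t * c_prev       # new cell state
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

# Toy dimensions: 4 hidden units, 3 inputs (so each weight matrix is 4 x 7).
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((4, 7)) for k in "ciof"}
b = {k: np.zeros(4) for k in "ciof"}
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), W, b)
```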

Fig 3. This figure depicts an LSTM node of an RNN. Ct−1 is the cell state at the previous frame and Ct is the cell state at the current image frame. ht−1 and ht are the hidden state activations at the previous and current time steps, respectively. The patterned boxes depict the forget (ft), input (it), and output (ot) gates at the current time step.

However, the image frames at each time step are spatially diverse, and their pixel dimensions are too many inputs to be passed to LSTM layers in an RNN. A CNN, on the other hand, works well with image data by extracting localized features, and it exhibits group equivariance over space. Conv2DLSTM [48] is a combination of the 2D convolution of a CNN and the LSTM unit (Fig 3) of an RNN which reduces the implicit redundancy of pixels in an RNN. In a Conv2DLSTM layer, the weights are applied to the input through a convolution operation rather than a full matrix multiplication. The composite LSTM units are retained, and the gates as well as the input and cell states are localized within a neighborhood, as shown in Fig 4. The modified equations for Conv2DLSTM are:

$$
\begin{aligned}
\tilde{C}_t &= \tanh(W_{hc} \star h_{t-1} + W_{xc} \star X_t + b_c)\\
i_t &= \sigma(W_{hi} \star h_{t-1} + W_{xi} \star X_t + W_{ci} * C_{t-1} + b_i)\\
o_t &= \sigma(W_{ho} \star h_{t-1} + W_{xo} \star X_t + W_{co} * C_{t-1} + b_o)\\
f_t &= \sigma(W_{hf} \star h_{t-1} + W_{xf} \star X_t + W_{cf} * C_{t-1} + b_f)\\
C_t &= i_t * \tilde{C}_t + f_t * C_{t-1}\\
h_t &= o_t * \tanh(C_t)
\end{aligned} \tag{3}
$$

where ⋆ is a convolution operation and the rest of the operations are the same as given in Eq 1.

Fig 4. This figure illustrates the convolutional LSTM. Xt depicts the image, and the red block in the middle encloses the local representation. ht−1 is the hidden state from the processing of the previous image frame and Ct−1 is the corresponding cell state. ht and Ct are the hidden and cell states, respectively, computed from Xt, ht−1, and Ct−1. A red box shows the neighboring states that contribute to the computation.

The input of Conv2DLSTM is a tensor with 5 dimensions, where the first two dimensions are the samples and the frames in the videos, and the last three dimensions are the height, the width, and the channels of each frame.
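
For illustration, this 5-dimensional layout can be exercised directly with the ConvLSTM2D layer of tf.keras. The filter count and the reduced toy dimensions below are placeholders to keep the example light; the actual clips have the shape (samples, 75, 240, 360, 3) described in sections 3.1 and 4.

```python
import numpy as np
import tensorflow as tf

# Real clips are (samples, 75, 240, 360, 3); a smaller toy batch keeps the demo light.
clips = np.random.rand(2, 10, 120, 180, 3).astype("float32")

conv_lstm = tf.keras.layers.ConvLSTM2D(
    filters=8,                  # placeholder filter count
    kernel_size=(3, 3),
    padding="same",
    return_sequences=False)     # keep only the last time step's feature map

features = conv_lstm(clips)
print(features.shape)           # (2, 120, 180, 8)
```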

3.2.2 3D CNNs

A CNN is an established supervised learning technique for image data that works with individual spatial samples. Adding a new, humanly perceptible dimension of the input data, namely time, as a new computational dimension is a potent extension of the hard inductive bias. 3DCNN [49] is a composition of convolutional neural networks in which a stack of subsequent frames is used as an input tensor, with the assumption that the tensor encompasses the information for a single prediction. An illustration of a 3DCNN is depicted in Fig 5. In comparison to the 2D convolution of Eq 2, the 3D convolution at a location (i, j, l) (where l is the time dimension) is given as:

$$
r_{ijl} = \phi\left(\sum_{m=0}^{M-1}\sum_{n=0}^{N-1}\sum_{o=0}^{O-1}\sum_{k} w_{m,n,o,k}\, v_{i+m-1,\, j+n-1,\, l+o-1,\, k}\right) \tag{4}
$$

where M, N, and O are the dimensions of the kernel, k indexes the filter channels, and ϕ is the activation function.

Fig 5. This figure illustrates the 3DCNN. The rectangle on top depicts slices of video frames, where each slice is a color-coded image. Contiguous frames convolve with a 3D kernel, depicted with regular and dashed lines. The rectangle at the bottom depicts the 3D convolution of the image tensor with the kernel tensor (it should not be confused with the shared filters).

3.3 Proposed study

In this study, the architectures possessing a hard inductive bias (sections 3.2.1 and 3.2.2) are trained with the data prepared for teaching-methodology analysis. The inductive bias of a deep neural network can be improved for a specific domain, and it has been experienced that the choice of hyper-parameters, the post-processing of domain-specific data, and the initialization of the parameters (hyper-parameters and learnable parameters) let such models transcend general ones. The three architectures depicted in Fig 2 are trained with an image sequence {I1, I2, …, IN} extracted from a video Va of action category a. In this section, we discuss the composition of the proposed networks. The selected hyper-parameters are discussed in section 4.

Fig 6. The graph in this image depicts the distribution of classes in the training data of a LOOCV sample.

3.3.1 ConvLSTM for teaching methodology analysis

Fig 7a gives the detailed composition of the first model. The network has four Conv2DLSTM layers with interleaving sub-sampling layers. The input frame It at time t is clamped to the first layer, which extracts features from the images using 3x3 convolutional kernels, whereas the LSTM units keep the context from the preceding images {I1, …, It} of the whole video. The encoded features from the final sub-sampling layer are forwarded to a feed-forward network (FC3) with three fully connected layers, including an output layer followed by a softmax layer. A dropout layer with 50% probability is added after the final sub-sampling layer. The network predicts an action at the end of the whole sequence of image frames in a video at test time. The number of filters in each layer is written under its shape in Fig 7a. The size of the receptive field of each convLSTM layer is calculated from the input image height and width (HxW) and the kernel sizes of the previous layer.
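
A minimal tf.keras sketch of this composition is given below. The four Conv2DLSTM layers with 3x3 kernels, the interleaved sub-sampling, the 50% dropout, and the three fully connected layers follow the description above, while the filter counts, pooling factors, and dense-layer widths are placeholders, since those values are specified only in Fig 7a.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 11
FRAMES, H, W, C = 75, 240, 360, 3   # one clip: 75 frames at 240x360 RGB

inputs = tf.keras.Input(shape=(FRAMES, H, W, C))
x = inputs
# Four Conv2DLSTM layers with interleaving sub-sampling (filter counts assumed).
for filters, last in [(16, False), (32, False), (32, False), (64, True)]:
    x = layers.ConvLSTM2D(filters, (3, 3), padding="same",
                          return_sequences=not last)(x)
    pool = layers.MaxPooling2D((2, 2))
    x = pool(x) if last else layers.TimeDistributed(pool)(x)
x = layers.Dropout(0.5)(x)                     # 50% dropout after the final sub-sampling
x = layers.Flatten()(x)
# FC3: three fully connected layers, the last being the softmax output (widths assumed).
x = layers.Dense(256, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

conv_lstm_model = tf.keras.Model(inputs, outputs, name="conv2dlstm_action")
conv_lstm_model.summary()
```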

Fig 7. (a) shows the composition of the Conv2DLSTM network. The receptive field is adjusted using sub-sampling and strides (convLSTM is a Conv2DLSTM layer, fc is a fully connected layer, and SM is a softmax layer). (b) shows the composition of the 3DCNN. The receptive fields of the subsequent layers are adjusted using strides only (3DConv is a 3DCNN layer). The numbers at the bottom of each shape give the number of filters in that layer. The I/x term given at the bottom corner of each layer shows the size of the features after the sub-sampling layer reduces the size (H and W separately) of image I by a factor of x.

In this study, the convolution and LSTM are tested in two different compositions. The Conv2DLSTM recurrent network builds the context from the cell states and hidden states generated by the LSTM layers over all preceding image frames {I1, …, It} in the sequence for a current input It, whereas with the convolutional time-distributed layers, the network builds context only from the LSTM outputs for the last image frame It−1 when processing the input It. Furthermore, the output of a Conv2DLSTM layer can be either 5- or 4-dimensional, depending on whether the implementation treats the whole video as a single input or presents each image of the temporal sequence as an input.
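
For comparison, a sketch of the time-distributed composition is shown below: the same 2D CNN is applied to every frame through TimeDistributed wrappers, and an LSTM layer aggregates the per-frame features. Layer sizes are again placeholders rather than the exact configuration used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 11
FRAMES, H, W, C = 75, 240, 360, 3

inputs = tf.keras.Input(shape=(FRAMES, H, W, C))
# The same 2D CNN is applied to every frame via TimeDistributed wrappers.
x = layers.TimeDistributed(layers.Conv2D(16, (3, 3), padding="same", activation="relu"))(inputs)
x = layers.TimeDistributed(layers.MaxPooling2D((4, 4)))(x)
x = layers.TimeDistributed(layers.Conv2D(32, (3, 3), padding="same", activation="relu"))(x)
x = layers.TimeDistributed(layers.MaxPooling2D((4, 4)))(x)
x = layers.TimeDistributed(layers.Flatten())(x)
# The LSTM aggregates the per-frame features over the 75 time steps.
x = layers.LSTM(128)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

td_model = tf.keras.Model(inputs, outputs, name="td_cnn_lstm_action")
```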

3.3.2 3DCNN for teaching methodology analysis

Fig 7b gives the detailed composition of the 3DCNN model used for a teacher's action classification. A video is a sequence of stacked images that can be represented as a tensor, where width and height are two dimensions of the tensor and time is the third dimension. An action spans contiguous image frames and can be visualized as a certain pattern in the tensor. The proposed 3DCNN architecture is composed of seven 3D convolution layers, with two dropout layers of 50% probability after the first and third 3D convolutional layers. As shown in Fig 2, the whole sequence of images {I1, …, IN} in a video is concatenated and clamped to the input layer. The convolutional layers extract features and learn the association of features across the sequence of image frames. The features from the final layer are passed to a fully connected layer, and a final output layer predicts an action a using a softmax layer.

The input of Conv3D is five-dimensional: the 1st dimension is the number of samples, the 2nd is the depth, the 3rd is the width, the 4th is the height, and the 5th is the number of channels. When the video dataset is used with the 3DCNN, the samples are the video clips in a batch and the depth is the number of frames in a video clip.
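
A corresponding tf.keras sketch of the 3DCNN is given below. The seven Conv3D layers, the two 50% dropout layers after the first and third convolutions, and the single fully connected layer before the softmax output follow the description above; the filter counts, kernel sizes, and strides are assumptions, since those details appear only in Fig 7b.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 11
FRAMES, H, W, C = 75, 240, 360, 3   # (depth, height, width, channels) of one clip

inputs = tf.keras.Input(shape=(FRAMES, H, W, C))
x = inputs
# Seven 3D convolution layers; the strides stand in for the receptive-field
# adjustments of Fig 7b, and the filter counts are assumed.
strides = [(1, 2, 2), (1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]
filters = [16, 16, 32, 32, 64, 64, 64]
for i, (f, s) in enumerate(zip(filters, strides)):
    x = layers.Conv3D(f, (3, 3, 3), strides=s, padding="same", activation="relu")(x)
    if i in (0, 2):                     # 50% dropout after the 1st and 3rd conv layers
        x = layers.Dropout(0.5)(x)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)   # single fully connected layer (width assumed)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

cnn3d = tf.keras.Model(inputs, outputs, name="cnn3d_action")
```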

4 Results, discussion and limitations

Several experiments are performed with the collected video dataset discussed in section 3.1 using the deep learning models discussed in section 3.2. Fig 7a and 7b give details of the compositions of the DNNs employed for experimentation in this study. The activation units in the convolution layers are ReLU, and for optimization the Adam stochastic optimizer is used with the following hyper-parameters: α = 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 10^−8. Here α is the learning rate, β1 is the exponential decay rate for the first-moment estimate, β2 is the exponential decay rate for the second-moment estimate, and ϵ is used to avoid division by zero in the implementation in case of a zero gradient [50]. All experiments use the multi-class cross-entropy loss function and a batch size of one.
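
These optimizer settings translate directly into a tf.keras compile call. The sketch below uses a trivial placeholder model standing in for any of the architectures of Fig 7, and the number of epochs shown in the comment is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Trivial placeholder model standing in for any of the architectures in Fig 7.
inputs = tf.keras.Input(shape=(75, 240, 360, 3))
x = layers.GlobalAveragePooling3D()(inputs)
outputs = layers.Dense(11, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,    # alpha
    beta_1=0.9,             # exponential decay rate of the first-moment estimate
    beta_2=0.999,           # exponential decay rate of the second-moment estimate
    epsilon=1e-8)           # guard against division by zero

model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",   # multi-class cross entropy
              metrics=["accuracy"])

# Training uses a batch size of one, as stated above (number of epochs assumed):
# model.fit(train_clips, train_labels_onehot, batch_size=1, epochs=30)
```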

The training and testing datasets are drawn from the labeled video clips. The proportion of video clips in each category is depicted in the histograms in Figs 6 and 8. Before feeding the dataset into the deep learning models, it is normalized by adjusting the channel information and pixel values: the initial pixel values lie between 0 and 255 and are normalized to between 0 and 1 during preprocessing. We also reduce the size of the frames from 704x576 to 360x240.
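
A minimal sketch of this preprocessing (clip extraction, resizing, and scaling to [0, 1]), assuming OpenCV for decoding, is given below; the helper name is illustrative rather than the authors' code.

```python
import cv2
import numpy as np

CLIP_LEN = 75            # 3 seconds at 25 fps
TARGET_W, TARGET_H = 360, 240

def video_to_clips(path):
    """Split a lecture video into normalized 75-frame clips."""
    cap = cv2.VideoCapture(path)
    frames, clips = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (TARGET_W, TARGET_H))   # 704x576 -> 360x240
        frames.append(frame.astype(np.float32) / 255.0)   # pixel values to [0, 1]
        if len(frames) == CLIP_LEN:
            clips.append(np.stack(frames))                # (75, 240, 360, 3)
            frames = []
    cap.release()
    return np.array(clips)
```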

Fig 8. The graph in this image depicts the distribution of classes in the testing data of a LOOCV sample.

The dataset is also augmented to increase the number of available samples and to add a data inductive bias. Data augmentation includes image rotation with an angle of ±5° (shock/jerk/jolt invariance), horizontal flipping (viewpoint invariance), changing brightness (illuminance invariance), and zooming in or out (scale invariance). The advantage of data augmentation is that the training data become more diverse, and the increased number of training samples also improves the performance of deep neural networks through better generalization. The flipping and brightness operations are performed on a complete labeled clip, whereas the other operations are performed on image frames within a clip. For the former, a clip is selected at random with a probability of 0.2. The image rotation is performed on an image frame randomly selected from a clip with a probability of 0.1, and the sign of the angle is selected with a probability of 0.5. The rotation is applied to two consecutive frames at the selected angle, the opposite rotation is applied over the next four frames, and the selected rotation is then applied again over two consecutive frames; this imitates a complete motion of the camera in the case of a shock or jolt. Zooming in and zooming out are likewise performed on an image frame randomly selected from a clip with a probability of 0.1. The zooming operations are performed with interpolation, and in the case of zooming out the borders are filled with mirror padding.
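
The augmentation policy can be sketched as follows. The probabilities, the ±5° angle, and the rotate/counter-rotate pattern follow the text, while the brightness range is an assumed value and the zoom operation is omitted for brevity.

```python
import random
import cv2
import numpy as np

def augment_clip(clip):
    """Augment one labeled clip of shape (75, H, W, 3) with float32 values in [0, 1]."""
    clip = clip.copy()
    # Clip-level operations, each applied with probability 0.2.
    if random.random() < 0.2:
        clip = np.ascontiguousarray(clip[:, :, ::-1, :])           # horizontal flip
    if random.random() < 0.2:
        clip = np.clip(clip * random.uniform(0.7, 1.3), 0.0, 1.0)  # brightness (range assumed)
        clip = clip.astype(np.float32)
    # Frame-level rotation (probability 0.1): rotate +/-5 degrees for 2 frames,
    # reverse the rotation for 4 frames, then rotate again for 2 frames,
    # imitating a camera shock or jolt.
    if random.random() < 0.1:
        angle = 5.0 if random.random() < 0.5 else -5.0
        start = random.randint(0, clip.shape[0] - 8)
        h, w = clip.shape[1:3]
        for offset, a in enumerate([angle] * 2 + [-angle] * 4 + [angle] * 2):
            m = cv2.getRotationMatrix2D((w / 2, h / 2), a, 1.0)
            clip[start + offset] = cv2.warpAffine(clip[start + offset], m, (w, h))
    # Zooming in/out (probability 0.1) with interpolation and mirror padding is omitted here.
    return clip
```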

The DNN models with Conv2DLSTM and convolutional time-distributed layers are trained with one frame at a time, and the 3DCNN is trained with all frames of a clip (as mentioned in section 3.1) stacked together. A leave-one-out cross-validation (LOOCV) technique is used for evaluation: the chunks of the video selected for testing are not included in the training data, to achieve a true out-of-distribution generalization. Moreover, the chunks from the different videos are interleaved and shuffled for training, in addition to the data augmentation discussed in the previous paragraph. Accumulated validation accuracy results for the tested models are presented in Table 1. It is observed that the 3DCNN network is effective on the video dataset, since its composite layers work well with spatio-temporal data. Although time-distributed layers with CNN and LSTM, and Conv2DLSTM, have a hard inductive bias for retaining long sequential associations in the data, the quantitative results suggest that the complex composition of 3DCNN layers adds benefits in this particular problem: it extracts more valuable features and keeps the right temporal associations needed for accurate predictions with a small and noisy dataset of sequential images.

Table 1. This table presents the test accuracy averaged over nine runs of the LOOCV technique.

Models Average accuracy ± SD
TD 2D CNN + LSTM 91.00% ±2
Conv2DLSTM 91.00% ±1.5
3D CNN 94.00% ±3
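
The accuracies in Table 1 are averaged over leave-one-out folds in which every clip of the held-out lecture video is excluded from training. A minimal sketch of such fold construction, assuming each clip carries the identifier of its source video, is shown below.

```python
import random

def loocv_folds(video_ids):
    """Yield (train, test) index splits, holding out one source video per fold.

    video_ids: list giving, for each clip, the identifier of the lecture video
    it was extracted from.
    """
    for held_out in sorted(set(video_ids)):
        test_idx = [i for i, v in enumerate(video_ids) if v == held_out]
        train_idx = [i for i, v in enumerate(video_ids) if v != held_out]
        random.shuffle(train_idx)   # interleave clips from the remaining videos
        yield train_idx, test_idx
```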

In this study, we also use a complete lecture video to generate summary statistics. First, the video is split into three-second chunks. Each chunk is fed separately into the model to predict the action performed by the teacher. Finally, the results are generated by counting the classified chunks per action. In this experiment, we obtained 90% accuracy on a single one-hour lecture video. Fig 9 shows the specific actions performed during the lecture; the y-axis shows the number of three-second clips in which each action was performed. A confusion matrix of the results for this video is presented in Fig 10, showing which actions are confused with other actions.
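
The per-lecture statistics of Fig 9 can be reproduced by tallying the predicted class of every three-second chunk; the sketch below assumes a trained model and clips prepared as in the preprocessing example above.

```python
import numpy as np
from collections import Counter

def lecture_statistics(model, clips, class_names):
    """Tally the predicted action class of every 3-second chunk of a lecture.

    clips: array of shape (num_chunks, 75, 240, 360, 3), e.g. produced by the
    video_to_clips helper sketched earlier.
    """
    probs = model.predict(clips, batch_size=1)   # per-chunk class probabilities
    predicted = np.argmax(probs, axis=1)
    return Counter(class_names[i] for i in predicted)
```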

Fig 9. This chart depicts statistics of the classified actions in a complete lecture video of one hour. The results are generated with Conv2DLSTM. These results can be correlated with the teaching standards.

Fig 10. This image depicts a confusion matrix of the statistics presented in Fig 9.

According to lecturing guidelines published online by the University of Waterloo (https://uwaterloo.ca/centre-for-teaching-excellence/teaching-resources/teaching-tips/lecturing-and-presenting/delivery/lecturing-effectively-university, last visited 11 May 2021) [51, 52], it is recommended that a teacher should turn her face towards the students and should not spend most of her time just writing on the board or looking at the slides. Communicating with students is considered the most vital part of a lecture. Based on this observation, the interesting actions of a teacher according to our labeled categories are: talking to a student, pointing to the board, and standing while facing students. A lecture in which a teacher spends a substantial amount of time on these interesting categories of actions, along with presenting and writing on the board, can be considered an effective teaching session.

4.1 Discussion

In this section, we reflect on the challenges peculiar to teaching-methodology analysis using CCTV videos. Significant among them are the poor quality of CCTV videos (Fig 11), imprecise class labels, and high inter-class similarity. The videos used in this study are CCTV videos and lack important details due to the point of view and quality of the CCTV cameras. Often, it is not discernible to the naked eye whether a teacher is holding a marker or chalk. Moreover, in multiple videos the whiteboard is not visible, and the teacher's writing on the board is illegible, as depicted in Fig 11.

Fig 11. Poor quality video example, as writing on the board is not visible.

Another limitation is overlapping actions in consecutive video chunks of the collected dataset. Some of the intermediate actions, such as the teacher walking, waving a hand, or standing still, do not have precise labels. Some video clips contain a very brief recognizable action either at the beginning or at the end of the clip, such as a three-second clip with less than one second of writing on the board; since it lasts less than a second, we chose not to label the clip with that action. In some videos, a certain action lasting less than three seconds is divided across two different video clips, because the videos are split evenly at the three-second mark; this makes it hard to obtain accurate statistics. To overcome this challenge, a strategy of frame dropout was adopted, but it did not improve the results and instead made them worse. A detailed analysis is nevertheless needed to determine whether this is an inherent bias of the dataset.

Another challenge we faced in this study is high inter-class similarity; for example, pointing at the board and writing on the board are almost similar actions, so it becomes difficult to classify them as separate classes. This problem is aggravated by the poor quality of the videos. Another case is a student and a teacher talking: sometimes a few students stand close to the teacher or pass by the teacher, and it is not easy to tell when they are talking, since our model does not include sound input. A further example is delivering a presentation, where sometimes the teacher does not move much or points towards the presentation, making the action hard to recognize.

It is important to discuss here the utility of background segmentation and region demarcation in a video clip for training and testing. First, background segmentation is not used in this study because modern cameras have the enhanced functionality of following a person; moreover, PTZ cameras are frequently used in video recording these days. Region demarcation is not used to restrict the focus to the teacher, because communication with students and their response is significant for some classes. However, these two preprocessing techniques can be tested in the future for certain objectives and could help to improve results.

We compared our dataset with the IAVID-1 dataset. This dataset is similar to ours in terms of output classes, and its purpose aligns with our study, but our dataset is more general, as is evident from the facts in Table 2. The IAVID-1 dataset is very simple in terms of the number of videos per output class, and the view of its videos is focused on the teacher only. This may not be considered a true representation of a complete classroom environment, since communication with students is an essential component of a lecture. However, the two datasets can be merged and used for further analysis and evaluations.

Table 2. Comparison with IAVID-1 dataset.

Our Dataset | IAVID-1 Dataset
1050 training + testing videos | 100 training + testing videos
11 total actions | 8 total actions
6 teachers | 12 actors, each performing 8 different actions
Front view from top left or top right (whole-class view) | Front view (teacher-focused view)
Mixed actions, at most 2 types of action per video | Single action per video
Highest accuracy was 94% | Highest accuracy was 82%
704x576 scaled down to 360x240 | 1088x1920

5 Conclusion and future work

In this study, automatic content recognition (ACR) is applied to real-world classroom video data for the analysis of teaching methods. It facilitates monitoring, self-reflection, and the adoption of an improved teaching style. The actions of a teacher in classroom videos are categorized and labeled, and a comprehensive dataset of labeled video clips is generated. The dataset is used to train deep learning action recognition models. The proposed deep neural network-based models generate valuable statistics and are effectively tested on around 1000 video clips for quantitative assessment. Furthermore, the use of ACR techniques for the analysis of online lecture videos is a pertinent study and has become indispensable with online teaching due to the Covid-19 pandemic, which caused a forced shutdown of educational campuses around the world. The proposed models can be extended with experiments on online sessions, with students' action recognition, and with a comprehensive analysis of the effectiveness of online teaching sessions.

Data Availability

All relevant data are within the manuscript.

Funding Statement

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2014-3-00077, AI National Strategy Project).

References

  • 1. Flammini F, Pragliola C, Pappalardo A, Vittorini V. A robust approach for on-line and off-line threat detection based on event tree similarity analysis. In: Advanced Video and Signal-Based Surveillance (AVSS), 2011 8th IEEE International Conference on; 2011. p. 414–419. doi: 10.1109/AVSS.2011.6027364
  • 2. Shyu ML, Haruechaiyasak C, Chen SC, Zhao N. Collaborative filtering by mining association rules from user access sequences. In: Web Information Retrieval and Integration, 2005. WIRI'05. Proceedings. International Workshop on Challenges in. IEEE; 2005. p. 128–135.
  • 3. Ngxande M, Tapamo JR, Burke M. Driver drowsiness detection using behavioral measures and machine learning techniques: A review of state-of-art techniques. In: Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech), 2017. IEEE; 2017. p. 156–161.
  • 4. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;86(11):2278–2324. doi: 10.1109/5.726791
  • 5. Naseer S, Saleem Y, Khalid S, Bashir MK, Han J, Iqbal MM, et al. Enhanced Network Anomaly Detection Based on Deep Neural Networks. IEEE Access. 2018; p. 1–1. doi: 10.1109/ACCESS.2018.2863036
  • 6. Hu J, Shen L, Sun G. Squeeze-and-Excitation Networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018. p. 7132–7141.
  • 7. Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997;9(8):1735–1780. doi: 10.1162/neco.1997.9.8.1735
  • 8. Sutskever I, Vinyals O, Le QV. Sequence to Sequence Learning with Neural Networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2. NIPS'14. Cambridge, MA, USA: MIT Press; 2014. p. 3104–3112.
  • 9. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc.; 2017. p. 5998–6008.
  • 10. Soomro K, Zamir AR, Shah M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. 2012.
  • 11. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T. HMDB: a large video database for human motion recognition. In: 2011 International Conference on Computer Vision. IEEE; 2011. p. 2556–2563.
  • 12. Ragheb H, Velastin S, Remagnino P, Ellis T. ViHASi: virtual human action silhouette data for the performance evaluation of silhouette-based action recognition methods. In: 2008 Second ACM/IEEE International Conference on Distributed Smart Cameras. IEEE; 2008. p. 1–10.
  • 13. Murtaza F, Yousaf MH, Velastin SA. Multi-view human action recognition using 2D motion templates based on MHIs and their HOG description. IET Computer Vision. 2016;10(7):758–767. doi: 10.1049/iet-cvi.2015.0416
  • 14. Velastin SA, Gómez-Lira DA. People Detection and Pose Classification Inside a Moving Train Using Computer Vision. In: International Visual Informatics Conference. Springer; 2017. p. 319–330.
  • 15. Prieto LP, Sharma K, Dillenbourg P, Jesús M. Teaching analytics: towards automatic extraction of orchestration graphs using wearable sensors. In: Proceedings of the Sixth International Conference on Learning Analytics & Knowledge. ACM; 2016. p. 148–157.
  • 16. Fu B, Damer N, Kirchbuchner F, Kuijper A. Sensing Technology for Human Activity Recognition: A Comprehensive Survey. IEEE Access. 2020;8:83791–83820. doi: 10.1109/ACCESS.2020.2991891
  • 17. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. p. 770–778.
  • 18. Tan M, Le Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of the 36th International Conference on Machine Learning. vol. 97 of Proceedings of Machine Learning Research. PMLR; 2019. p. 6105–6114.
  • 19. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015. p. 1–9.
  • 20. Cho K, van Merrienboer B, Bahdanau D, Bengio Y. On the properties of neural machine translation: Encoder-decoder approaches. In: Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8). Doha, Qatar; 2014.
  • 21. Ferrari A, Micucci D, Mobilio M, Napoletano P. On the Personalization of Classification Models for Human Activity Recognition. IEEE Access. 2020;8:32066–32079. doi: 10.1109/ACCESS.2020.2973425
  • 22. Hsu Y, Yang S, Chang H, Lai H. Human Daily and Sport Activity Recognition Using a Wearable Inertial Sensor Network. IEEE Access. 2018;6:31715–31728. doi: 10.1109/ACCESS.2018.2839766
  • 23. Hoyos AAC, Velásquez JD. Teaching Analytics: Current Challenges and Future Development. IEEE Revista Iberoamericana de Tecnologias del Aprendizaje. 2020;15(1):1–9. doi: 10.1109/RITA.2020.2979245
  • 24. Donaj G, Sepesy Maučec M. Extension of HMM-Based ADL Recognition With Markov Chains of Activities and Activity Transition Cost. IEEE Access. 2019;7:130650–130662. doi: 10.1109/ACCESS.2019.2937350
  • 25. Li J, Tian L, Wang H, An Y, Wang K, Yu L. Segmentation and Recognition of Basic and Transitional Activities for Continuous Physical Human Activity. IEEE Access. 2019;7:42565–42576. doi: 10.1109/ACCESS.2019.2905575
  • 26. Bhat A, Rao AC, Bhaskar A, Adithya V, Pratiba D. A Cost-Effective Audio-Visual Summarizer for Summarization of Presentations and Seminars. In: 2018 3rd International Conference on Computational Systems and Information Technology for Sustainable Solutions (CSITSS); 2018. p. 271–276.
  • 27. Zhao C, Li H, Jiang Z, Xiong Z. Learners' Appeal: An Analysis of Teachers' Behavior in Online Live Teaching. In: 2017 International Symposium on Educational Technology (ISET); 2017. p. 44–47.
  • 28. Yu-Te K, Han-Yen Y, Yi-Chi C. A Classroom Atmosphere Management System for Analyzing Human Behaviors in Class Activities. In: 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC); 2019. p. 224–231.
  • 29. Verner IM, Polishuk A, Krayner N. Science Class with RoboThespian: Using a Robot Teacher to Make Science Fun and Engage Students. IEEE Robotics Automation Magazine. 2016;23(2):74–80. doi: 10.1109/MRA.2016.2515018
  • 30. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision; 2015. p. 4489–4497.
  • 31. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L. Large-scale Video Classification with Convolutional Neural Networks. In: IEEE Conference on Computer Vision and Pattern Recognition; 2014. p. 1725–1732.
  • 32. Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, et al. Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015. p. 2625–2634.
  • 33. Li S, Ding Z, Chen H. A Neural Network-Based Teaching Style Analysis Model. In: 2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC). vol. 2; 2019. p. 154–157.
  • 34. Qi C, Li M, Wang Q, Zhang H, Xing J, Gao Z, et al. Facial Expressions Recognition Based on Cognition and Mapped Binary Patterns. IEEE Access. 2018;6:18795–18803. doi: 10.1109/ACCESS.2018.2816044
  • 35. Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 1933–1941.
  • 36. Feichtenhofer C, Pinz A, Wildes RP. Spatiotemporal multiplier networks for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 4768–4777.
  • 37. Chen E, Bai X, Gao L, Tinega HC, Ding Y. A spatiotemporal heterogeneous two-stream network for action recognition. IEEE Access. 2019;7:57267–57275. doi: 10.1109/ACCESS.2019.2910604
  • 38. Wang Y, Long M, Wang J, Yu PS. Spatiotemporal pyramid network for video action recognition. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition; 2017. p. 1529–1538.
  • 39. Ullah A, Ahmad J, Muhammad K, Sajjad M, Baik SW. Action recognition in video sequences using deep bi-directional LSTM with CNN features. IEEE Access. 2017;6:1155–1166. doi: 10.1109/ACCESS.2017.2778011
  • 40. Liu J, Wang C, Gong Y, Xue H. Deep Fully Connected Model for Collective Activity Recognition. IEEE Access. 2019;7:104308–104314. doi: 10.1109/ACCESS.2019.2929684
  • 41. Sharma V, Gupta M, Kumar A, Mishra D. EduNet: A New Video Dataset for Understanding Human Activity in the Classroom Environment. Sensors. 2021;21(17). doi: 10.3390/s21175699
  • 42. Sun B, Wu Y, Zhao K, He J, Yu L, Yan H, et al. Student Class Behavior Dataset: a video dataset for recognizing, detecting, and captioning students' behaviors in classroom scenes. Neural Computing and Applications. 2021;33(14):8335–8354. doi: 10.1007/s00521-020-05587-y
  • 43. Nida N, Yousaf MH, Irtaza A, Velastin SA. Instructor Activity Recognition through Deep Spatiotemporal Features and Feedforward Extreme Learning Machines. Mathematical Problems in Engineering. 2019;2019:1–13. doi: 10.1155/2019/2474865
  • 44. Gang Z, Wenjuan Z, Biling H, Jie C, Hui H, Qing X. A simple teacher behavior recognition method for massive teaching videos based on teacher set. Applied Intelligence. 2021;51:8828–8849. doi: 10.1007/s10489-021-02329-y
  • 45. Goodfellow I, Bengio Y, Courville A. Deep Learning. MIT Press; 2016.
  • 46. Tompson J, Goroshin R, Jain A, LeCun Y, Bregler C. Efficient object localization using Convolutional Networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015. p. 648–656.
  • 47. Goyal A, Bengio Y. Inductive Biases for Deep Learning of Higher-Level Cognition; 2021.
  • 48. Shi X, Chen Z, Wang H, Yeung DY, Wong WK, Woo WC. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In: Proceedings of the 28th International Conference on Neural Information Processing Systems—Volume 1. NIPS'15. Cambridge, MA, USA: MIT Press; 2015. p. 802–810.
  • 49. Ji S, Xu W, Yang M, Yu K. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2013;35(1):221–231. doi: 10.1109/TPAMI.2012.59
  • 50. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. In: Bengio Y, LeCun Y, editors. 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings; 2015. Available from: http://arxiv.org/abs/1412.6980.
  • 51. Harrington C, Zakrajsek T. Dynamic Lecturing: Research-Based Strategies to Enhance Lecture Effectiveness. USA: Stylus Publishing; 2017.
  • 52. Brown S. Lecturing: a practical guide. London: Kogan Page; 2002.

Decision Letter 0

Felix Albu

20 Oct 2021

PONE-D-21-18820
Employing Automatic Content Recognition for Teaching Methodology Analysis in Classroom Videos
PLOS ONE

Dear Dr. Jeon,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 04 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Felix Albu, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE does not allow footnotes, so please include all text in the footnotes in the main text. Please also include the name of the Ethics committee which approved the study, and clarify how participants provided consent.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“We would like to thank the University of Management and Technology for providing the CCTV camera lecture videos. This work was supported by a grant from the Institute of Information & Communications Technology Planning & Evaluation (IITP) funded by the Korean government (MSIT) (No. 2014-3-00077, AI National Strategy Project).”

We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2014-3-00077, AI National Strategy Project).”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. We note that Figure 1 includes an image of a [patient / participant / in the study]. As per the PLOS ONE policy (http://journals.plos.org/plosone/s/submission-guidelines#loc-human-subjects-research) on papers that include identifying, or potentially identifying, information, the individual(s) or parent(s)/guardian(s) must be informed of the terms of the PLOS open-access (CC-BY) license and provide specific permission for publication of these details under the terms of this license. Please download the Consent Form for Publication in a PLOS Journal (http://journals.plos.org/plosone/s/file?id=8ce6/plos-consent-form-english.pdf). The signed consent form should not be submitted with the manuscript, but should be securely filed in the individual's case notes. Please amend the methods section and ethics statement of the manuscript to explicitly state that the patient/participant has provided consent for publication: “The individual in this manuscript has given written informed consent (as outlined in PLOS consent form) to publish these case details”.

If you are unable to obtain consent from the subject of the photograph, you will need to remove the figure and any other textual identifying information or case descriptions for this individual.

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Additional Editor Comments (if provided):

The authors have to address the comments of the reviewers.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The article presents the importance of a module for recognizing human actions that can be used to improve teaching. It also includes a description of how the dataset was collected and of the three types of video classifiers.

I suggest the following improvements:

Section 2.1 should detail the preprocessing stage (what transforms are used for data preprocessing before training). At present, the use of augmentation is specified only in the results section.

Figure 2 can be improved because it is not explicit enough.

- it is not clear if the first architecture uses only the first and last image or the whole video;

- in the second architecture, it is not clear if only three frames are used;

- it is not clear how the prediction of the action is determined, starting from the output of the LSTMs;

- it is not clear whether a convolutional layer is used for each image or all images are passed through the same layer.

Section 3 contains general information about various types of neural networks but does not include a clear presentation of the proposed methods and the pipeline used to train them. This section could be renamed and used as Background and introduced a new section with the presentation of the proposed methods (containing a presentation specific to the problem solved).

In section 4, it is not specified how the ADAM optimizer was chosen, what value was used for the learning rate and if other variants were tested.

Figure 6 could also include the percentages for each type of action. The size used for the batch is not specified. The error function used is not specified.

It would be useful if Table 2 turned into a figure. It would be easier to understand the proposed model.

The authors should point out the contributions they bring in the context of human activity recognition or video classification compared to the other approaches presented in the state of the art section.

A comparison should be made between the results obtained by these three methods presented.

Reviewer #2: Comments to the Author:

This paper proposes a teaching methodology analysis in classroom videos. The work of this paper is practical. However, I think some revisions to the manuscript are needed. The comments are listed as follows.

(1) The author needs to analyze the difference of effect for each label.

(2) The authors do not convey the novelty of the method. The method used is an existing one; whether it is advanced in this field needs to be established by comparison with existing advanced methods in this field.

(3) The cited references include too few recent articles in this direction.

(4) Abbreviations should be spelled out at their first appearance, such as CCTV in line 71 and IDTs in line 77.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Review for PONE-D-21-18820.docx

PLoS One. 2022 Feb 17;17(2):e0263448. doi: 10.1371/journal.pone.0263448.r002

Author response to Decision Letter 0


9 Nov 2021

We are grateful to the editor, the editorial team, and the reviewers for this opportunity to improve our manuscript. The insightful comments and valuable suggestions encouraged us to strengthen the proposed study and the quality of the manuscript. We carefully considered each suggestion, supplemented the study, and revised the manuscript accordingly. We address the journal requirements and the comments of each reviewer separately below:

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Response: The revised manuscript complies with PLOS ONE’s style requirements.

2. Please note that PLOS ONE does not allow footnotes, so please include all text in the footnotes in the main text. Please also include the name of the Ethics committee which approved the study, and clarify how participants provided consent.

Response: The footnotes have been incorporated into the main text, and a letter naming the members of the ethics committee, together with details of how participants provided consent, is included as a supplementary document.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“We would like to thank the University of Management and Technology for providing the CCTV camera lecture videos. This work was supported by a grant from the Institute of Information & Communications Technology Planning & Evaluation (IITP) funded by the Korean government (MSIT) (No. 2014-3-00077, AI National Strategy Project).”

We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2014-3-00077, AI National Strategy Project).”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Response: A revised funding and acknowledgment statement is provided in the cover letter.

4. We note that Figure 1 includes an image of a [patient / participant / in the study]. As per the PLOS ONE policy (http://journals.plos.org/plosone/s/submission-guidelines#loc-human-subjects-research) on papers that include identifying, or potentially identifying, information, the individual(s) or parent(s)/guardian(s) must be informed of the terms of the PLOS open-access (CC-BY) license and provide specific permission for publication of these details under the terms of this license. Please download the Consent Form for Publication in a PLOS Journal (http://journals.plos.org/plosone/s/file?id=8ce6/plos-consent-form-english.pdf). The signed consent form should not be submitted with the manuscript, but should be securely filed in the individual's case notes. Please amend the methods section and ethics statement of the manuscript to explicitly state that the patient/participant has provided consent for publication: “The individual in this manuscript has given written informed consent (as outlined in PLOS consent form) to publish these case details”.

If you are unable to obtain consent from the subject of the photograph, you will need to remove the figure and any other textual identifying information or case descriptions for this individual.

Response: A letter from the ethics committee confirming the participants’ consent is provided as a supplementary document.

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Response: The requirement is accommodated in the revised manuscript.

Additional Editor Comments (if provided):

The authors have to address the comments of the reviewers.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Response: We are grateful for the reviewers’ assessment and have addressed the noted deficiencies with the help of the suggested changes and feedback.

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

Response: We acknowledge the reviewers’ insightful feedback and have revised the manuscript in light of their observations.

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

Response: We are grateful for the acknowledgment by the reviewers.

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Response: We are grateful for the acknowledgment by the reviewers.

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The article presents the importance of a module for recognizing human actions that can be used to improve teaching. It also includes a description of how the dataset was collected and of the three types of video classifiers.

I suggest the following improvements:

Section 2.1 should detail the preprocessing stage (what transforms are used for data preprocessing before training). At present, the use of augmentation is specified only in the results section.

Response: Section 2.1 has been updated with the pre-processing and data augmentation information, and the details are also given in section 3 (Results, Discussion and Limitations). Briefly, this study uses horizontal flip, zoom-in, zoom-out, change of brightness, image rotation, and image blur for data augmentation.
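To make this concrete, the sketch below applies the listed transforms to one video clip; it is a minimal illustration assuming OpenCV and NumPy, and the probabilities and parameter ranges (flip probability, zoom and brightness ranges, rotation angle, blur kernel) are placeholder values, not the parameters reported in the manuscript. The same randomly drawn transform is applied to every frame so a clip stays temporally consistent, which is one reasonable design choice rather than the authors' stated one.

```python
import numpy as np
import cv2  # OpenCV, assumed available


def augment_clip(clip, rng=None):
    """Apply one randomly drawn set of transforms to every frame of a clip.

    clip: float32 array of shape (frames, height, width, 3) with values in [0, 1].
    All probabilities and ranges below are illustrative placeholders.
    """
    rng = rng or np.random.default_rng()
    frames, h, w, _ = clip.shape
    flip = rng.random() < 0.5               # horizontal flip
    zoom = rng.uniform(0.9, 1.1)            # zoom-in / zoom-out
    brightness = rng.uniform(0.8, 1.2)      # change of brightness
    angle = rng.uniform(-10.0, 10.0)        # rotation in degrees
    blur = rng.random() < 0.3               # occasional image blur

    # One affine matrix encodes both rotation and zoom about the frame centre.
    affine = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, zoom)
    out = np.empty_like(clip)
    for i in range(frames):
        frame = clip[i]
        if flip:
            frame = np.ascontiguousarray(frame[:, ::-1])
        frame = cv2.warpAffine(frame, affine, (w, h))
        frame = np.clip(frame * brightness, 0.0, 1.0)
        if blur:
            frame = cv2.GaussianBlur(frame, (5, 5), 0)
        out[i] = frame
    return out
```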

Figure 2 can be improved because it is not explicit enough.

Response: Figure 2 has been updated in the manuscript and its caption provides additional information to remove the ambiguities. The accompanying description in section 2 of the manuscript has also been updated for better readability and understanding.

- it is not clear if the first architecture uses only the first and last image or the whole video;

Response: The first architecture takes one image at each time step t and predicts an action after t_N (i.e., after all N images), rather than using only the first and last image.

- in the second architecture, it is not clear if only three frames are used;

Response: The second architecture likewise takes one image at each time step t and predicts an action after t_N (N images), not only three frames. In the implementation, however, all N images can be passed at the same time.

- it is not clear how the prediction of the action is determined, starting from the output of the LSTMs;

Response: The action is predicted from the LSTM output at the end of the sequence.

- it is not clear whether a convolutional layer is used for each image or all images are passed through the same layer.

Response: A CNN is applied to each image in architectures 1 and 2, whereas in architecture 3 all the images are fed to the 3D CNN at once.
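As an illustration of the distinction drawn in this response, here is a minimal tf.keras sketch: a per-frame CNN wrapped in TimeDistributed followed by an LSTM that predicts once at the end of the sequence, versus a 3D CNN that consumes the whole clip at once. The clip length, frame size, layer widths, and number of action classes are placeholder values, not the configurations reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_FRAMES, H, W, C, N_ACTIONS = 16, 112, 112, 3, 8   # placeholder shapes

# Architectures 1 and 2 (schematically): the same Conv2D stack is applied to
# every frame via TimeDistributed; the LSTM emits a single prediction at the
# end of the sequence (return_sequences=False by default).
frame_cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(H, W, C)),
    layers.MaxPooling2D(),
    layers.Flatten(),
])
cnn_lstm = models.Sequential([
    layers.TimeDistributed(frame_cnn, input_shape=(N_FRAMES, H, W, C)),
    layers.LSTM(64),                                  # one prediction per clip
    layers.Dense(N_ACTIONS, activation="softmax"),
])

# Architecture 3 (schematically): the whole clip is passed at once to a 3D CNN,
# whose batched input is five-dimensional: (samples, depth, height, width, channels).
cnn_3d = models.Sequential([
    layers.Conv3D(32, 3, activation="relu", input_shape=(N_FRAMES, H, W, C)),
    layers.MaxPooling3D(),
    layers.Flatten(),
    layers.Dense(N_ACTIONS, activation="softmax"),
])
```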

Section 3 contains general information about various types of neural networks but does not include a clear presentation of the proposed methods and the pipeline used to train them. This section could be renamed and used as Background and introduced a new section with the presentation of the proposed methods (containing a presentation specific to the problem solved).

Response: As per the suggestion, a new subsection has been added to the manuscript with precise details of the proposed techniques relevant to the selected problem.

In section 4, it is not specified how the ADAM optimizer was chosen, what value was used for the learning rate and if other variants were tested.

Response: The requested details of the Adam optimizer, including how it was selected and the learning rate used, have been added in section 3 (Results, Discussion and Limitations).

Figure 6 could also include the percentages for each type of action. The size used for the batch is not specified. The error function used is not specified.

Response: As suggested, Figure 6 has been updated, and the batch size and the error/loss function are now specified in section 3 (Results, Discussion and Limitations).
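For completeness, a hedged sketch of how such a model could be compiled and trained follows, reusing the cnn_3d model from the sketch above. The learning rate, categorical cross-entropy loss, batch size, and epoch count are illustrative assumptions only, not the settings reported in the revised manuscript; the dummy tensors merely show the expected shapes.

```python
import numpy as np
import tensorflow as tf

# `cnn_3d` refers to the 3D CNN sketched earlier; values below are assumptions.
cnn_3d.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",          # one-hot encoded action labels
    metrics=["accuracy"],
)

# Dummy tensors standing in for real clips and labels, just to show shapes:
# (clips, frames, height, width, channels) and one-hot labels per clip.
x = np.zeros((8, 16, 112, 112, 3), dtype=np.float32)
y = tf.keras.utils.to_categorical(np.zeros(8, dtype=int), num_classes=8)
cnn_3d.fit(x, y, batch_size=2, epochs=1, validation_split=0.25)
```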

It would be useful if Table 2 turned into a figure. It would be easier to understand the proposed model.

Response: As suggested, Table 1 and Table 2 have been converted into figures (Figure 10(a) and (b)).

The authors should point out the contributions they bring in the context of human activity recognition or video classification compared to the other approaches presented in the state-of-the-art section.

Response: The introduction section has been updated with elaborated contributions.

A comparison should be made between the results obtained by these three methods presented.

Response: The comparison is presented in the discussion section and in the updated Table 3.

Reviewer #2: Comments to the Author:

This paper proposes a teaching methodology analysis in classroom videos. The work of this paper is practical. However, I think some revisions to the manuscript are needed. The comments are listed as follows.

(1) The author needs to analyze the difference of effect for each label.

Response: This analysis has been added to section 2.1 of the manuscript.

(2) The authors do not convey the novelty of the method. The method used is an existing one; whether it is advanced in this field needs to be established by comparison with existing advanced methods in this field.

Response: The introduction section has been updated to elaborate the contributions and the novelty of this study.

(3) The cited references include too few recent articles in this direction.

Response: The reference section is updated with recent relevant articles.

(4) Abbreviations should be spelled out at their first appearance, such as CCTV in line 71 and IDTs in line 77.

Response: The manuscript has been carefully revised and every abbreviation is now spelled out at its first use.

Attachment

Submitted filename: Response to Reviewers.pdf

Decision Letter 1

Felix Albu

30 Dec 2021

PONE-D-21-18820R1
Employing Automatic Content Recognition for Teaching Methodology Analysis in Classroom Videos
PLOS ONE

Dear Dr. Jeon,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Feb 13 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Felix Albu, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

The comments of the reviewers should be better addressed.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

********** 

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

********** 

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

********** 

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Preprocessing information has been added, but it is too minimal. It was not specified with what probability these transformations were applied and what parameters are used for each transform (e.g., the scale used for zooming in / zooming out).

The authors did not specify all parameters for the neural networks (ConvLSTM / 3DCNN). For example, no parameters were specified for LSTMs such as the number of features in the hidden state h, the number of recurrent layers. The two proposed models should be better detailed. Some of this information existed in the previous table.

The authors provide general details: "The input of Conv3D is five-dimensional; the 1st dimension is the number of samples, the 2nd dimension is depth, the 3rd is width, the 4th is height, and the 5th is the number of channels. In the video dataset, samples are numbers of videos, and depth is the number of frames in a video". What is the significance of the number of videos in the context of the proposed solution?

The authors provided details for the optimization algorithm using notation symbols. It is necessary to explain what each notation represents. For example, alpha represents the learning rate, epsilon represents the term added to the denominator to improve numerical stability.

Reviewer #2: The authors have made substantial modifications to the structure of the paper. Recent references have been added as required, and the related methods are compared. The paper meets the basic requirements of the journal. I have no further concerns and suggest accepting it.

********** 

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Feb 17;17(2):e0263448. doi: 10.1371/journal.pone.0263448.r004

Author response to Decision Letter 1


6 Jan 2022

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Preprocessing information has been added, but it is too minimal. It was not specified with what probability these transformations were applied and what parameters are used for each transform (e.g., the scale used for zooming in / zooming out).

Response: As suggested by the reviewer, the data augmentation operations used in this study, together with the selected parameters, are now detailed in section 3 (lines 409-420, page 22, “Results, Discussion and Limitations”).

The authors did not specify all parameters for the neural networks (ConvLSTM / 3DCNN). For example, no parameters were specified for LSTMs such as the number of features in the hidden state h, the number of recurrent layers. The two proposed models should be better detailed. Some of this information existed in the previous table.

Response: As suggested by the reviewer, the details from the mentioned tables have been added to the figures. The hyper-parameters of the neural networks are now annotated in the figures and elaborated in the captions, and the corresponding text has been expanded for clarity and readability (Figure 10, image and caption; Section 2.3.1, lines 359-362, page 22).

The authors provide general details: "The input of Conv3D is five-dimensional; the 1st dimension is the number of samples, the 2nd dimension is depth, the 3rd is width, the 4th is height, and the 5th is the number of channels. In the video dataset, samples are numbers of videos, and depth is the number of frames in a video". What is the significance of the number of videos in the context of the proposed solution?

Response: The sentence in question has been rephrased to express the authors’ intention clearly. The number of video clips in a batch often plays a critical role as an implicit regularizer, which in this study cannot have much impact because a single clip is used per batch. The frames in a clip form a volume and are assumed to serve the purpose of a batch, but this claim needs further inquiry and can be pursued in future work.
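A small illustration of this point: with one clip per batch, the Conv3D input still has the five dimensions described above, and the frame (depth) axis supplies the only within-batch variation. The frame count and frame size below are assumed placeholder values.

```python
import numpy as np

# Sixteen frames (assumed clip length) stacked into a volume, then given a
# leading samples axis of 1: one clip per batch, as described in the response.
frames = [np.zeros((112, 112, 3), dtype=np.float32) for _ in range(16)]
clip = np.stack(frames, axis=0)   # (depth, height, width, channels)
batch = clip[np.newaxis, ...]     # (samples, depth, height, width, channels)
print(batch.shape)                # (1, 16, 112, 112, 3)
```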

The authors provided details for the optimization algorithm using notation symbols. It is necessary to explain what each notation represents. For example, alpha represents the learning rate, epsilon represents the term added to the denominator to improve numerical stability.

Response: As suggested by the reviewer, descriptions of the notation have been added to the manuscript in Section 3 (lines 393-395, page 22, “Results, Discussion and Limitations”). Moreover, the following reference has been added for detailed reading.

“Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. In: 3rd International Conference on Learning Representations, ICLR , San Diego, CA, USA, May 7-9, 2015.”
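For readers of this exchange, the parameter update from the cited Adam reference can be summarized as below, where α is the learning rate and ε the small term added to the denominator for numerical stability, as noted by the reviewer; g_t is the gradient at step t and β1, β2 are the decay rates of the moment estimates.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2} \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```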

Reviewer #2: Comments to the Author:

The authors have made substantial modifications to the structure of the paper. Recent references have been added as required, and the related methods are compared. The paper meets the basic requirements of the journal. I have no further concerns and suggest accepting it.

Response: We are grateful to the reviewer for the valuable comments, which helped us improve the manuscript and prepare it for positive appraisal.

Attachment

Submitted filename: Revision_Response_2.pdf

Decision Letter 2

Felix Albu

20 Jan 2022

Employing Automatic Content Recognition for Teaching Methodology Analysis in Classroom Videos

PONE-D-21-18820R2

Dear Dr. Jeon,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Felix Albu, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

The decision is Accept.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have clearly described the research problem and the proposed solution in this revision. The previous missing details are clarified in this version.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Felix Albu

8 Feb 2022

PONE-D-21-18820R2

Employing Automatic Content Recognition for Teaching Methodology Analysis in Classroom Videos

Dear Dr. Jeon:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Felix Albu

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Review for PONE-D-21-18820.docx

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Attachment

    Submitted filename: Revision_Response_2.pdf

    Data Availability Statement

    All relevant data are within the manuscript.

