NIHPA Author Manuscript. Author manuscript; available in PMC: 2022 Jul 1.
Published in final edited form as: Soft comput. 2021 Jun 11;25(14):9365–9375. doi: 10.1007/s00500-021-05916-w

POCASUM: Policy Categorizer and Summarizer Based on Text Mining and Machine Learning

Rushikesh Deotale, Shreyash Rawat, V Vijayarajan, V B Surya Prasath*
PMCID: PMC8932938  NIHMSID: NIHMS1738975  PMID: 35308599

Abstract

Having control over one's data is a right and a duty of every citizen in our digital society. Users often skip the entire policies of applications or websites to save time and energy, without realizing the potential sticky points in these policies. Due to obscure language and verbose explanations, the majority of hypermedia users do not bother to read them. Further, digital media companies sometimes do not spend enough effort stating their policies clearly, and the policies can also be incomplete. A summarized version of these privacy policies, categorized into useful information, can help users. To solve this problem, in this work we propose machine learning based models for a policy categorizer that classifies policy paragraphs under proposed attributes such as security, contact, etc. By benchmarking different machine learning based classifier models, we show that an artificial neural network model performs with higher accuracy on a challenging dataset of textual privacy policies. We thus show that machine learning can help summarize the relevant paragraphs under the various attributes so that the user can get the gist of a topic within a few lines.

Keywords: Text classification, text summarization, privacy policy, text mining, machine learning, artificial neural network

1. Introduction

Controlling and protecting your personal data is your prerogative unless you allow someone else to have control over it. Privacy policies mention the regulations that a particular application promises to follow regarding the user's data. They usually state what kind of data is collected from you, how the data is used, under which circumstances the data is shared, who has access to your data, the security of data storage, the update policies and much more. They intend to mention beforehand all the risks the user might undertake by agreeing to the policies of the application. The real problem lies in the inability of the user to comprehend the policy and understand the risks associated with it, for a plethora of reasons. With increasing technology almost everyone has a smartphone today, which allows even illiterate people to gain access to these applications. In general, due to their lack of technical knowledge, they have to accept the policies without reading and understanding them. However, that is a relatively small percentage; the remaining majority of users of these applications and websites, in spite of having the required education, simply click to agree to the policies without reading them in whole, since understanding the verbose and technical terms used in these privacy policies requires arduous effort.

The formal language of the policies, along with their confusing semantics, makes them irksome for users, who then accept these policies without understanding them. The companies also hold users responsible for any kind of misuse of data, as the users had already agreed to the privacy policy that mentions it. Some applications make it optional for the user to go through the policy, which leads to even fewer users referring to it. The enormous power of data can offer insights but can also create obstacles due to its huge volume. To help users comprehend the policies, we present a policy summarizer which shows the various attributes that have been covered by the policy and the ones which have been omitted. The attributes include the various measures or topics that a privacy policy should include based on the directives, regulations and practices an ideal policy should incorporate. After being shown the topics the policy covers, users can also view a summarized form of the passage containing a particular topic in the policy. Hence this saves the user plenty of time, by reading only the short summarized points for the topics the user wishes to view. Users are more likely to go through a policy if they encounter short summarized points related to the topics that matter to them, rather than huge uncategorized blocks of information. Reading the part of the policy that mentions an attribute in one or two lines is much more beneficial than searching through the whole text. Users do not always go through the content of privacy policies while installing apps; the problem lies in the length of the content.
To address the problems faced by the user, there have been advancements in text mining aimed at facilitating users of webpages and apps with summarized textual data (Izumi, Matsui, & Matsuo, 2007; J. Li, Fong, Zhuang, & Khoury, 2016). Costante, Sun, Petković, and den Hartog (2012) proposed a method to resolve such issues by providing a completeness analyzer. They check the completeness of privacy policies and give them a grade, which helps users understand policies based on the grades given by the analyzer. They classified each of the paragraphs present in a policy, checked whether all classes are present, and gave a completeness grade for the policy. Harkous et al. (2018) focused on creating a full-fledged framework so users can easily automate the entire process. They focused not only on low-level features but also on high-level, more complex classes. Privacy policies also have options for users to opt out of some policies and take control of their data; efforts have been made to find these particular options in a privacy policy (Sathyendra, Wilson, Schaub, Zimmeck, & Sadeh, 2017).

Privacy policies comprise a lot of text and many attributes, so summarizing them directly would miss salient text. To tackle that problem, classification of the text based on the attributes it covers is needed. Multi-class text classification is a classification problem in which text is classified into more than two classes (Cherfi, Napoli, & Toussaint, 2006). Existing solutions to the problem of multi-class classification focus on machine learning techniques like naive Bayes and support vector machines (SVM) (Rennie & Rifkin, 2001; Silva & Ribeiro, 2007). Another method is to treat the multi-class problem as n binary classification problems, solve them one by one, and combine the results of these binary classifications. There has been exciting progress in utilizing neural networks and deep learning based techniques for classifying text (Minaee et al., 2021; Satapathy, Li, Cavallari, & Cambria, 2019), document modeling (Majumder, Poria, Gelbukh, & Cambria, 2017) and various natural language processing (NLP) tasks (Young, Hazarika, Poria, & Cambria, 2018). Among these techniques, convolutional and recurrent neural network based architectures for labeling multiple categories (Chen, Ye, Xing, Chen, & Cambria, 2017) and generative models for synthetically generating text for training and classification (Y. Li, Pan, Wang, Yang, & Cambria, 2018; Russell, Li, & Tian, 2019) are important. In text summarization, deep recurrent belief networks can effectively model word dependencies (Chaturvedi, Ong, Tsang, Welsch, & Cambria, 2016). Recent improvements in deep learning involve the usage of capsule networks (Zhao, Peng, Eger, Cambria, & Yang, 2019) for challenging NLP applications such as multi-label text classification and question answering, and the long short-term memory (LSTM) model (Ma, Peng, Khan, Cambria, & Hussain, 2018) for aspect-based sentiment analysis.

Text summarization models were also considered before the deep learning era. For example, Nomoto and Matsumoto (2001) present a novel approach to unsupervised text summarization. The novelty lies in exploiting the diversity of concepts in text for summarization, which had not received much attention in the summarization literature. The diversity-based approach there is a principled generalization of the maximal marginal relevance (MMR) criterion by Carbonell and Goldstein (1998). They presented a new summarization scheme where evaluation does not rely on matching extracts against human-made summaries but on measuring the loss. Many text summarization techniques have been presented using classical machine learning models like the hidden Markov model (HMM), SVM, and Bayes models. Text summarization can also be done using lexical chains (Barzilay & Elhadad, 1999) formed with the help of lexical cohesion. Autoencoders have also been used for text summarization (Yousefi-Azar & Hamey, 2017), using sentence ranking to extract sentences for the summary. For training the model on more specific kinds of policies, a web-based information retrieval framework (Vijayarajan, Dinakaran, Tejaswin, & Lohani, 2016) can be used to ingest peculiar policies.

In this work, we consider automatic text mining and machine learning based summarization of policies. Our policy categorizer and summarizer (POCASUM) allows us to test various plug-n-play machine learning classifiers. We test classical machine learning models such as K-nearest neighbors (KNN) (Abu Alfeilat et al., 2019), the support vector classifier (SVC) (Suykens & Vandewalle, 1999), random forests (RF) (Breiman, 2001), and stochastic gradient descent (SGD) (Kabir, Siddique, Kotwal, & Huda, 2015), along with deep learning artificial neural networks (ANNs). Experimental results on the APP-350 dataset (Zimmeck et al., 2019) indicate that the ANN driven model obtains the highest accuracy among the KNN, SVC, SGD, and RF based approaches.

The rest of the work is organized as follows. Section 2 introduces our machine learning driven approach to policy categorization and summarization. Section 3 provides experimental results with different machine learning models. Section 4 concludes the paper.

2. Policy Categorizer and Summarization with Machine Learning

2.1. Aim and Methodology

The goal of our proposed policy categorizer and summarizer (POCASUM) is to make it easy for users to comprehend policies by reading short paragraphs, and to make them clearly aware of how complete the policy is in terms of the attributes an ideal privacy policy should mention. Hence, this includes two main tasks, namely categorization and summarization. The policy paragraphs are classified under the proposed attributes so that a completeness score can be assigned. After checking the completeness, the user might want to see the policy under a certain topic (say, security). Users can then read the summarized text of the security attribute in about 2 to 3 lines, which makes it tremendously easy to comprehend and understand properly.

The proposed flow of our POCASUM approach is given in Figure 1, and consists of the following steps.

Figure 1:

Overall flow of our proposed policy categorizer and summarizer (POCASUM) approach. Here we can utilize plug-n-play machine learning classifiers at the fitting models stage.

  1. Data Annotation - The dataset was created manually, with each paragraph of every policy labelled according to the categories mentioned above.

  2. Cleaning - Includes tokenization, stemming, lemmatization, and removal of stop words.

  3. Feature Extraction - Extracting features from text using a tf-idf vectorizer or word2vec.

  4. Data Split - The reduced vectors are then split into training and testing data.

  5. Dimensionality Reduction - The number of features is reduced using LSI, which uses SVD to bring the feature count to a bare minimum.

  6. Fitting Models - Training various machine learning models to find the best-performing model.

  7. Testing - Checking the accuracy using various evaluation metrics.

  8. Sentence Embedding - Representing each sentence as a vector that captures it semantically and syntactically.

  9. Clustering - Clustering the embeddings by proximity gives us the most relevant sentences.

  10. Extractive Summarization - Picking the n most relevant lines and presenting them as a paragraph.

We note the plug-and-play methodology of our POCASUM: at the fitting-models stage one can utilize various machine learning models, from classical models such as KNN, SVC, RF, and SGD to deep learning ANNs. In what follows, we benchmark various machine learning models and test their accuracy on text summarization of privacy policies.
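
The steps above can be sketched end to end as a single scikit-learn pipeline. This is an illustrative sketch only, not the paper's actual configuration: the example paragraphs, labels, and hyperparameters are placeholders, with LSI realized via truncated SVD on tf-idf features and SGD as one of the plug-n-play classifiers.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier

# Hypothetical labelled policy paragraphs (one per category, for illustration).
paragraphs = [
    "We collect your name, email address and device identifiers.",
    "Your data may be shared with third-party advertising partners.",
    "Contact our privacy team at privacy@example.com for questions.",
    "We use your information to personalise content and improve the app.",
]
labels = ["collection", "share", "contact", "info_usage"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),   # feature extraction
    ("lsi", TruncatedSVD(n_components=3, random_state=0)),  # LSI via SVD
    ("clf", SGDClassifier(random_state=0)),             # plug-n-play classifier
])
pipeline.fit(paragraphs, labels)
print(pipeline.predict(["Email us with any privacy concerns."]))
```

Any of the other classifiers benchmarked below could be substituted at the `clf` stage without changing the rest of the pipeline.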

2.2. Dataset

The dataset was manually created from 60 privacy policies of applications in the APP-350 dataset (Zimmeck et al., 2019) (https://usableprivacy.org/data). The policies were divided into discrete paragraphs and each paragraph was labelled manually by the annotator. Five attributes or categories were chosen for classifying the paragraphs of a policy, taken from the important topics in a privacy policy that would affect users. These attributes are:

  • Collection - The type of personal data that the company collects from the user

  • Info Usage - How the data is being utilized by the company

  • Location - How the location information is being used

  • Share - Under what circumstances the user's data will be disclosed or shared

  • Contact - Whether they provide any contact information about the company

All 60 policies were labelled manually under these categories by 3 annotators in order to create the dataset. Each policy was segmented into clusters of 2 or 3 sentences, depending on the context, which were then annotated with these attributes.

Algorithm 1: Text Preprocessing
Input: raw text R_t
Output: vectors V
for each sentence s in R_t do
    T = Tokenize(s)
    for each token t in T do
        if t in stop_words then remove t; else continue
        if t matches the compound-word RE then continue; else remove t
        l = Lemmatize(t)
        v = tf-idf(l)
        add v to V

2.3. Text Preprocessing

Text preprocessing plays a vital role in the classification of the policy statements, as it provides the features that are fed into the model (Uysal & Gunal, 2014). There is no universal way of performing text preprocessing, as it depends on the problem and the dataset available. In the case of privacy policies it was important to include compound words and headings, which may not be necessary for another problem. The step-by-step process is stated in Algorithm 1. The raw text was initially converted into tokens and the stop words were removed. Stop word removal is the process of removing very frequently occurring words that do not contribute to the prominence of the text, and was performed using a list of stop words in the English language. Certain regular expression patterns were then compiled to retain particular kinds of words, like emails or compound words, which might contribute to the prediction process.

The tokens were further normalized using a lemmatizer and then marshalled into a list of cleaned tokens. These tokens were then converted into tf-idf vectors (Fautsch & Savoy, 2010) instead of count vectors, as tf-idf suppresses the importance of raw word occurrence and re-weights the words. This allows us to take into consideration the words that are important to the topic rather than those that appear very frequently, like "an" or "the". Figure 2 shows the flowchart of the text preprocessing steps. Due to the large number of features, the vectors were further reduced by applying dimensionality reduction using the latent semantic indexing (LSI) approach. Again, the number of dimensions to reduce to depends on the dataset at hand, and the optimal number can only be found through experimentation.
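
A minimal sketch of the cleaning steps in Algorithm 1 follows, using only the standard library. The stop-word list, the compound-word/email pattern, and the toy suffix-stripping normalizer (a stand-in for a full lemmatizer) are all assumptions for illustration, not the paper's actual resources.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in", "for", "with"}
# Hypothetical pattern for compound words and emails, e.g. "opt-out".
COMPOUND_RE = re.compile(r"^\w+(?:[-_]\w+|@\w+\.\w+)$")

def normalize(token):
    # Crude lemmatization stand-in: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(paragraph):
    tokens = re.findall(r"[\w@.-]+", paragraph.lower())
    cleaned = []
    for t in tokens:
        if t in STOP_WORDS:
            continue  # remove stop words
        if COMPOUND_RE.match(t):
            cleaned.append(t)  # keep compound words / emails intact
            continue
        cleaned.append(normalize(t))
    return cleaned

print(preprocess("We are collecting emails and the opt-out settings"))
```

The cleaned token lists would then be fed to a tf-idf vectorizer followed by LSI, as described above.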

Figure 2:

Text preprocessing steps employed here for our POCASUM approach.

2.4. Modeling

After the text has been cleaned and converted into vectors, it is used for model training. Constructing a multi-class text classifier involves assigning a class label to the particular vector of text that represents the features of the raw text. Supervised learning techniques were used for this purpose, as the dataset had been annotated, which gives the models a better chance at higher accuracy. We used machine learning (Kotthoff, Gent, & Miguel, 2011) and deep learning techniques in order to get the best accuracy on the multi-class classification task. Among classical machine learning models we trained four different models, namely K-nearest neighbors (KNN), support vector classifier (SVC), stochastic gradient descent (SGD) and random forest (RF) classifiers.

The KNN (Abu Alfeilat et al., 2019; Yang & Liu, 1999) is a go-to algorithm for many classification problems due to its simple, model-free approach, and is very effective in many cases. The number of neighbors is an important parameter, usually found empirically; the elbow method can be used to find an ideal value, although in supervised learning the number of classes is already given by the dataset. The SVC (Burges, 1998) has proved to be immensely accurate in classification tasks, in part because it has few parameters to tune. The SGD classifier (Kabir et al., 2015) is the most commonly used linear model, and for good reason, as it often fits the data well; it reduces the error by updating the chosen parameters at each descent step. The ensemble random forest (RF) classifier (Breiman, 2001) was also used, as its multitude of decision trees gives a holistic prediction rather than a specific one, improving the chances of higher accuracy and better prediction. We also tried different architectures of deep artificial neural networks (ANNs) (Ghiassi, Olschimke, Moon, & Arnaudo, 2012) for the multi-class classification problem due to their past success. One drawback of ANNs is that many choices must be made before training can start: first and foremost the architecture of the network, which is decided empirically, and then parameters such as batch size, number of epochs, cross validation, metrics, and the loss function, all of which are tuned to the problem at hand and found optimally through experimentation.
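
The four classical models can be compared on a common feature matrix as sketched below. This is a hedged illustration: the synthetic data and hyperparameters are placeholders, not the paper's dataset or tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the LSI-reduced policy vectors (5 classes).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(),
    "SGD": SGDClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.2f}")
```

The same fit/score loop extends naturally to an ANN model (e.g. scikit-learn's MLPClassifier) for the deep learning comparison.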

2.5. Testing

After the models have been trained on the training data, we need to test them on data the model has never seen before and validate the results. Many evaluation metrics are available. For an efficient classifier, precision and recall are two ubiquitous evaluation metrics to rely on. Apart from these two, the F1 score and classification accuracy will also be used to judge the model. The F1 score is a measure combining precision and recall; however, one might want to look at specific metrics according to the type of problem.
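
For concreteness, the four metrics above can be computed with scikit-learn as follows; the true and predicted labels here are a small hypothetical example, not results from the paper.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = ["collection", "share", "contact", "collection", "share", "contact"]
y_pred = ["collection", "share", "share", "collection", "contact", "contact"]

# Macro averaging weights every class equally, which suits the
# imbalanced class sizes typical of annotated policy paragraphs.
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1       :", f1_score(y_true, y_pred, average="macro"))
```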

K-fold cross validation (Fushiki, 2011) was also used to increase the robustness of the model: the data is divided into k equal parts, and in each iteration k-1 parts are used for training and the remaining part for testing. We used k = 5 for training the model. All experiments were performed under the same conditions. K-fold cross validation greatly improves the adaptability of the model by increasing the variance in the test data provided to it.
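
The 5-fold procedure is a one-liner with `cross_val_score`; the synthetic data and SGD model below are illustrative placeholders for the actual policy vectors and tuned classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the LSI-reduced feature vectors and labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# cv=5 gives the 5-fold split described above: each fold serves once
# as the held-out test set while the other four are used for training.
scores = cross_val_score(SGDClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", scores.round(2))
print("mean accuracy  :", scores.mean().round(2))
```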

2.6. Sentence Embedding

After the data has been correctly classified by the model, we combine the texts of all attributes in order to summarize the text topic-wise. The raw combined topic-wise text is then summarized so that users can easily comprehend its meaning. Just as we represent each word as a vector in tf-idf or any other vectorizer, in order to carry out extractive summarization each sentence needs to be represented as a vector (Ahamad, 2019). That vector should capture the syntactic, semantic and other important properties of the sentence it represents. Sentence embeddings are vectors that represent a sentence with its various properties, so that we can measure the proximity between sentences. Hence sentences that share similar semantic and syntactic properties lead to similar vector representations (Barzilay & Elhadad, 1999).

The sentence embedding was facilitated using skip-thought vectors, which use an encoder-decoder model to encode sentences into vectors. The encoder maps the words to a sentence vector and the decoder generates the sentences surrounding the current sentence. A ConvNet and an RNN were used to create the encoder and decoder models that generate the sentence embeddings. These sentences can then be represented as vectors in a space, which enables us to examine the characteristics and neighborhood of each vector. Since we only require the sentence embeddings and not the surrounding sentences, we utilize only the encoder part of the model, as described in Algorithm 2.

Algorithm 2: Sentence Embedding
Input: words {w_i} of each sentence s ∈ S, with word embeddings x_t
Output: sentence embeddings {v_s : s ∈ S}
for each s ∈ S do
    for each word embedding x_t in s do
        r_t = σ(W_r x_t + U_r h_{t-1})
        z_t = σ(W_z x_t + U_z h_{t-1})
        h̄_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))
        h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̄_t
    the final hidden state h_t represents the full encoded sentence embedding
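
A minimal NumPy sketch of the GRU-style encoder update in Algorithm 2 follows. The weight matrices here are random placeholders (in the skip-thought model they are learned), and the embedding dimensions are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_h = 4, 3  # word-embedding and hidden sizes (illustrative)
rng = np.random.default_rng(0)
Wr, Wz, W = (rng.standard_normal((d_h, d_in)) for _ in range(3))
Ur, Uz, U = (rng.standard_normal((d_h, d_h)) for _ in range(3))

def gru_step(x_t, h_prev):
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)          # reset gate
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)          # update gate
    h_bar = np.tanh(W @ x_t + U @ (r_t * h_prev))  # candidate state
    return (1 - z_t) * h_prev + z_t * h_bar        # new hidden state

# Encode a "sentence" of three random word embeddings; the final
# hidden state serves as the sentence embedding.
h = np.zeros(d_h)
for x in rng.standard_normal((3, d_in)):
    h = gru_step(x, h)
print("sentence embedding:", h)
```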

2.7. Clustering

Once the sentences have been converted into vectors, we need to take the n most relevant sentences from the sample space using an unsupervised clustering algorithm (Bennani-Smires, Musat, Hossmann, Baeriswyl, & Jaggi, 2018; Valdivia et al., 2020). These embeddings are clustered in high-dimensional vector space into a pre-defined number of clusters according to the number of sentences required in the summary of the text. K-nearest neighbors (KNN) (Abu Alfeilat et al., 2019) was used to cluster the vectors and the top 3 vectors were chosen. Euclidean distance was the measure used for clustering, although many other metrics exist. These vectors were then mapped back to their text and the corresponding sentences extracted and displayed as a paragraph. This gives the extractive summary of the particular attribute the user wishes to read.
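
The extractive step can be sketched as follows. This is an illustrative stand-in, not the paper's exact setup: K-means is used here as a concrete unsupervised clustering algorithm, and tf-idf vectors stand in for the skip-thought sentence embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "We collect your email address and phone number.",
    "Personal data such as your name may be collected.",
    "Data is shared with advertising partners.",
    "Third parties may receive your information.",
    "Contact us at privacy@example.com.",
]
# Stand-in sentence embeddings (tf-idf instead of skip-thought vectors).
vectors = TfidfVectorizer().fit_transform(sentences).toarray()

n_summary = 2  # number of sentences wanted in the summary
km = KMeans(n_clusters=n_summary, n_init=10, random_state=0).fit(vectors)

# Extract the sentence closest (Euclidean distance) to each cluster centre.
summary = []
for centre in km.cluster_centers_:
    idx = int(np.argmin(np.linalg.norm(vectors - centre, axis=1)))
    summary.append(sentences[idx])
print(" ".join(summary))
```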

3. Experimental Results

We tested different machine learning classifiers in our POCASUM approach, namely the K-nearest neighbors (KNN), support vector classifier (SVC), stochastic gradient descent (SGD), random forest (RF) and artificial neural network (ANN) models. A plethora of empirical choices were made throughout the process: what kinds of words to keep, which cleaning techniques to apply, the number of dimensions, and which features to reduce were all determined experimentally to get the best possible results. Different numbers of dimensions were tested for each classification model, as shown in Figure 3. Even the kind of machine learning model to choose, the parameters to tune, and the architecture of the neural network are all subjective choices without ideal answers for the input data. These choices vary from problem to problem, and empirical methods are the only way to arrive at a suitable value. As can be seen, the performance of KNN, RF, and SGD is, in general, lower than that of SVC and ANN.

Figure 3:

Accuracy of different machine learning classifiers with respect to the number of dimensions. Here, we tested K-nearest neighbors (KNN), support vector classifier (SVC), stochastic gradient descent (SGD) and random forest (RF), and artificial neural network (ANN) classifiers.

In terms of the performance of the various machine learning classifiers, our experimental testing was done with the classical KNN, SVC, RF, and SGD models, as well as various architectures of ANN. Among the obtained F1 scores of KNN (68%), SVC (70%), SGD (71%) and RF (67%), SGD proved to be the best classical machine learning model in our sample space with an F1 score of 71%. The deep learning ANN, however, performed very well with an F1 score of 78% and an accuracy of 75.62%, thus outperforming the rest of the models. Figure 4 shows the performance of the various models in terms of F1 score and accuracy. To assess the summarizer we took a random policy on which the model had not been trained. The policy was segmented into chunks of 2 to 3 lines which were fed into the ANN model for classification, as shown in the input text in Figure 5. For testing purposes we took the first category, the type of information the particular company collected. All the sentences from the policy that belong to this class were combined and appended. These sentences were then fed into the summarizer to check whether the results made sense semantically. The results shown in Figure 5 turned out to be comprehensible, covering the main points of the text paragraph and hence being convenient for the user.

Figure 4:

Comparison of different classical (KNN, SVC, RF, SGD), and deep ANN machine learning models in terms of (a) F1 score, and (b) accuracy.

Figure 5:

Examples of POCASUM with ANN result from the APP-350 data set. (a) Input, and (b) output from the system.

We have tried and tested different architectures of the ANN model for the classification task, and the ANN with architecture (64+32+6), as in Table 1, proved to be the best of them all, obtaining a 78% F1 score. The performance of this best ANN model in terms of training and validation is shown in Figure 6. Out of all available classical machine learning models, SGD turned out to be the most accurate classifier, as shown in Table 2. As can be seen, the SGD model obtained reduced F1 scores as the number of classes increased. Overall, the ANN model within POCASUM obtained the best performance, indicating promise in obtaining comprehensible text summarization. However, our tested ANN architectures, as shown in Table 1, are limited to 3 or 4 layers, unlike the typical 'deeper' architectures considered in other works in text mining and visual computing. The reason for this is the limitation of the dataset we considered here: deeper networks tend to overfit, and designing an optimal ANN remains a challenge.

Table 1:

Different architectures of ANN tested in our POCASUM approach for text summarization.

Architecture F1 Score Precision
64+32+6 0.78 76%
64+32+32+6 0.73 72.4%
32+32+6 0.72 72.35%
128+64+32+6 0.69 70.1%
128+32+6 0.70 70.9%

Figure 6:

Performance of the best ANN model (64+32+6 architecture) with epoch steps in terms of training and validation (a) loss, and (b) accuracy.

Table 2:

Classification with stochastic gradient descent (SGD) for different classes.

Classes F1 score Precision Recall Support
0 0.79 0.78 0.814 340
1 0.44 0.41 0.483 60
2 0.39 0.41 0.37 58
3 0 0 0 4
4 0.52 0.58 0.47 44
5 0.52 0.65 0.43 30

In summary, in this work we have provided a framework to analyze the completeness of a privacy policy by carrying out multi-class classification and summarization techniques. Further, we have combined the text that belongs to a particular class or attribute so that the paragraphs can be summarized and only the necessary information is presented to the user, which makes it easy for them to comprehend the policy. Despite the positive results obtained with the ANN-combined POCASUM, there are still improvements that can be made to make this work more convenient for users: (i) the tested architectures represent our experimental heuristics, and model reduction within the ANN context remains to be tested further, and (ii) an expanded dataset that includes other textual policies is required to study the feasibility of the machine learning driven POCASUM approach. Moreover, semantic analysis of sentences (Nguyen, Duong, & Cambria, 2019; Tur, Deng, Hakkani-Tür, & He, 2012) is one area in which the NLP based text summarization considered here may be lacking. Due to the changing dynamics of language and the several ways of writing the same sentence, it is very difficult to make the algorithm understand the meaning behind a sentence. If semantic analysis were facilitated within our POCASUM approach, it would be very convenient for users, as we could directly tell them whether a certain topic abides by the rules mentioned in an ideal privacy policy. We could directly assign a score telling them how safe the policy is, so users would automatically know this without even having to see the content of the policy.

4. Conclusion

In this work, we used text mining and machine learning models for policy notice categorization and summarization to aid users with relevant information in a succinct way. Our policy categorizer and summarizer (POCASUM) approach can help users of hypermedia be more careful about their rights and data, and makes it easy for them to understand a digital media company's motives. We benchmarked different plug-n-play machine learning classifiers including KNN, SVC, RF, SGD, and deep ANN models. Our experiments on a dataset of policies indicate that the ANN based model performs well with good accuracy. Our initial experiments indicate that we can reduce the time taken to analyze a policy from 15-20 minutes per user down to a few seconds. This suggests that deeper networks could be used in our POCASUM approach to further improve the summarization capabilities; however, this requires a bigger dataset for training the learning model. Further, benchmarking this reduction in reading time requires a deeper user-based feedback study. Another area of improvement is the sentence embedding step. Apart from skip-thought vectors, many other embedding techniques can be used to better represent sentences as vectors, such as skip-gram vectors and quick-thought vectors. Quick-thought vectors (Russell et al., 2019) are a recent development over skip-thought vectors in which the task of forecasting the next sentence given the previous one is handled as a classification problem.

Funding

VBSP is supported by NCATS/NIH grant U2CTR002818, NHLBI/NIH grant U24HL148865, NIAID/NIH grant U01AI150748, Cincinnati Children’s Hospital Medical Center–Advanced Research Council (ARC) Grants 2018–2020, and the Cincinnati Children’s Research Foundation–Center for Pediatric Genomics (CPG) grants 2019–2021.

Footnotes

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Publisher's Disclaimer: This Author Accepted Manuscript is a PDF file of an unedited peer-reviewed manuscript that has been accepted for publication but has not been copyedited or corrected. The official version of record that is published in the journal is kept up to date and so may therefore differ from this version.

References

  1. Abu Alfeilat HA, Hassanat AB, Lasassmeh O, Tarawneh AS, Alhasanat MB, Eyal Salman HS, & Prasath VS (2019). Effects of distance measure choice on k-nearest neighbor classifier performance: A review. Big Data, 7 (4), 221–248.
  2. Ahamad A (2019). Generating text through adversarial training using skip-thought vectors. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 53–60).
  3. Barzilay R, & Elhadad M (1999). Using lexical chains for text summarization. Advances in Automatic Text Summarization, 111–121.
  4. Bennani-Smires K, Musat C, Hossmann A, Baeriswyl M, & Jaggi M (2018). Simple unsupervised keyphrase extraction using sentence embeddings. In 22nd Conference on Computational Natural Language Learning (CoNLL) (pp. 221–229).
  5. Breiman L (2001). Random forests. Machine Learning, 45 (1), 5–32.
  6. Burges CJ (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2 (2), 121–167.
  7. Carbonell J, & Goldstein J (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 335–336).
  8. Chaturvedi I, Ong Y-S, Tsang IW, Welsch RE, & Cambria E (2016). Learning word dependencies in text by means of a deep recurrent belief network. Knowledge-Based Systems, 108, 144–154.
  9. Chen G, Ye D, Xing Z, Chen J, & Cambria E (2017). Ensemble application of convolutional and recurrent neural networks for multi-label text categorization. In 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 2377–2383).
  10. Cherfi H, Napoli A, & Toussaint Y (2006). Towards a text mining methodology using association rule extraction. Soft Computing, 10 (5), 431–441.
  11. Costante E, Sun Y, Petković M, & den Hartog J (2012). A machine learning solution to assess privacy policy completeness. In ACM Workshop on Privacy in the Electronic Society (pp. 91–96).
  12. Fautsch C, & Savoy J (2010). Adapting the tf-idf vector-space model to domain specific information retrieval. In ACM Symposium on Applied Computing (pp. 1708–1712).
  13. Fushiki T (2011). Estimation of prediction error by using k-fold cross-validation. Statistics and Computing, 21 (2), 137–146.
  14. Ghiassi M, Olschimke M, Moon B, & Arnaudo P (2012). Automated text classification using a dynamic artificial neural network model. Expert Systems with Applications, 39 (12), 10967–10976.
  15. Harkous H, Fawaz K, Lebret R, Schaub F, Shin KG, & Aberer K (2018). Polisis: Automated analysis and presentation of privacy policies using deep learning. In 27th USENIX Security Symposium (pp. 531–548).
  16. Izumi K, Matsui H, & Matsuo Y (2007). Integration of artificial market simulation and text mining for market analysis. Soft Computing, 1199–1205.
  17. Kabir F, Siddique S, Kotwal MRA, & Huda MN (2015). Bangla text document categorization using stochastic gradient descent (SGD) classifier. In International Conference on Cognitive Computing and Information Processing (pp. 1–4).
  18. Kotthoff L, Gent IP, & Miguel I (2011). A preliminary evaluation of machine learning in algorithm selection for search problems. In Fourth Annual Symposium on Combinatorial Search. Barcelona, Catalonia, Spain.
  19. Li J, Fong S, Zhuang Y, & Khoury R (2016). Hierarchical classification in text mining for sentiment analysis of online news. Soft Computing, 20 (9), 3411–3420.
  20. Li Y, Pan Q, Wang S, Yang T, & Cambria E (2018). A generative model for category text generation. Information Sciences, 450, 301–315.
  21. Ma Y, Peng H, Khan T, Cambria E, & Hussain A (2018). Sentic LSTM: a hybrid network for targeted aspect-based sentiment analysis. Cognitive Computation, 10 (4), 639–650.
  22. Majumder N, Poria S, Gelbukh A, & Cambria E (2017). Deep learning-based document modeling for personality detection from text. IEEE Intelligent Systems, 32 (2), 74–79.
  23. Minaee S, Kalchbrenner N, Cambria E, Nikzad N, Chenaghlu M, & Gao J (2021). Deep learning-based text classification: A comprehensive review. ACM Computing Surveys (CSUR), 54 (3), 1–40.
  24. Nguyen HT, Duong PH, & Cambria E (2019). Learning short-text semantic similarity with word embeddings and external knowledge sources. Knowledge-Based Systems, 182, 104842.
  25. Nomoto T, & Matsumoto Y (2001). A new approach to unsupervised text summarization. In 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 26–34).
  26. Rennie JD, & Rifkin R (2001). Improving multiclass text classification with the support vector machine (Tech. Rep. No. 210). Cambridge, MA, USA: MIT Artificial Intelligence Laboratory.
  27. Russell D, Li L, & Tian F (2019). Generating text using generative adversarial networks and quick-thought vectors. In IEEE International Conference on Computer and Communication Engineering Technology (pp. 129–133).
  28. Satapathy R, Li Y, Cavallari S, & Cambria E (2019). Seq2seq deep learning models for microtext normalization. In 2019 International Joint Conference on Neural Networks (IJCNN) (pp. 1–8).
  29. Sathyendra KM, Wilson S, Schaub F, Zimmeck S, & Sadeh N (2017). Identifying the provision of choices in privacy policy text. In 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2774–2779).
  30. Silva C, & Ribeiro B (2007). On text-based mining with active learning and background knowledge using SVM. Soft Computing, 11 (6), 519–530.
  31. Suykens JA, & Vandewalle J (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9 (3), 293–300.
  32. Tur G, Deng L, Hakkani-Tür D, & He X (2012). Towards deeper understanding: Deep convex networks for semantic utterance classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5045–5048).
  33. Uysal AK, & Gunal S (2014). The impact of preprocessing on text classification. Information Processing & Management, 50 (1), 104–112.
  34. Valdivia A, Martinez-Camara E, Chaturvedi I, Luzón MV, Cambria E, Ong Y-S, & Herrera F (2020). What do people think about this monument? Understanding negative reviews via deep learning, clustering and descriptive rules. Journal of Ambient Intelligence and Humanized Computing, 11 (1), 39–52.
  35. Vijayarajan V, Dinakaran M, Tejaswin P, & Lohani M (2016). A generic framework for ontology-based information retrieval and image retrieval in web data. Human-Centric Computing and Information Sciences, 6 (1), 18.
  36. Yang Y, & Liu X (1999). A re-examination of text categorization methods. In 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 42–49).
  37. Young T, Hazarika D, Poria S, & Cambria E (2018). Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13 (3), 55–75.
  38. Yousefi-Azar M, & Hamey L (2017). Text summarization using unsupervised deep learning. Expert Systems with Applications, 68, 93–105.
  39. Zhao W, Peng H, Eger S, Cambria E, & Yang M (2019). Towards scalable and reliable capsule networks for challenging NLP applications. In 57th Annual Meeting of the Association for Computational Linguistics (pp. 1549–1559).
  40. Zimmeck S, Story P, Smullen D, Ravichander A, Wang Z, Reidenberg J, … Sadeh N (2019). MAPS: Scaling privacy compliance analysis to a million apps. Proceedings on Privacy Enhancing Technologies, 2019 (3), 66–86.