PLOS ONE. 2024 Apr 26;19(4):e0302070. doi: 10.1371/journal.pone.0302070

Using full-text content to characterize and identify best seller books: A study of early 20th-century literature

Giovana D da Silva 1, Filipi N Silva 2,*, Henrique F de Arruda 3, Bárbara C e Souza 1, Luciano da F Costa 4, Diego R Amancio 1
Editor: Heba El-Fiqi
PMCID: PMC11051604  PMID: 38669247

Abstract

Artistic pieces can be studied from several perspectives, one example being their reception among readers over time. In the present work, we approach this topic from the standpoint of literary works, particularly assessing the task of predicting whether a book will become a best seller. Unlike previous approaches, we focused on the full content of books and considered both visualization and classification tasks. We employed visualization for the preliminary exploration of the data structure and properties, involving SemAxis and linear discriminant analyses. To obtain quantitative and more objective results, we employed various classifiers. These approaches were applied to a dataset containing (i) books published from 1895 to 1923 and consecrated as best sellers by the Publishers Weekly Bestseller Lists and (ii) literary works published in the same period but not mentioned in those lists. Our comparison of methods revealed that the best-achieved result—combining a bag-of-words representation with a logistic regression classifier—led to an average accuracy of 0.75 for both the leave-one-out and 10-fold cross-validations. Such an outcome underscores the difficulty of predicting the success of books with high accuracy, even when using the full content of the texts. Nevertheless, our findings provide insights into the factors leading to the relative success of a literary work.

1 Introduction

Understanding the factors and reasons determining the effectiveness and acceptance of given pieces of artistic or scientific work represents a continuing challenge in artificial intelligence (e.g., [1–5]). As is often the case with complex systems, not only are a large number of possible factors potentially involved, but their individual and combined effects also tend to be highly non-linear. In this manner, small effects can lead to considerable impacts and are also likely to vary across time and space in ways that are hard to predict.

Among the several aspects likely to influence the visibility and accomplishment of an artistic piece are its intrinsic quality, its innovation, and its affinity with the main trends, interests, and expectations predominating in a given period and place. All three of these aspects are not only challenging to define but even more so to predict, which has motivated growing interest from the scientific community (e.g., [6–11]).

A better understanding of the reasons why an artistic piece becomes successful constitutes a particularly interesting objective for a handful of reasons: (i) this type of study can motivate the development of new concepts and methods capable of quantifying the three main aspects identified above, namely the quality, innovation, and affinity of an artistic piece; (ii) such research has great potential for revealing important aspects of the mechanisms underlying human preferences for specific subjects and styles across time and space; (iii) such developments can lead to strategies for predicting the acceptance of certain types of works, which may provide subsidies and motivation for developing new and more effective artistic pieces.

The present work studies whether it is feasible to characterize and identify stories and narratives listed as best sellers by combining full-text content information and machine learning models. In this regard, the textual content of a set of books was modeled, and a series of experiments assessed the possibility of automatically differentiating a best seller from an ordinary book. In particular, we employed a dataset encompassing the full-text content of literary works collected from the Project Gutenberg platform. The dataset was split into two categories: success (books that appear at least once in the Publishers Weekly Bestseller Lists) and others. After applying a preprocessing step (removal of stopwords, lemmatization, and tokenization), the content of each book was encoded as a vector representation using the bag-of-words [12] and doc2vec [13] approaches. Finally, we employed different strategies to assess the prediction of the success of books in terms of their embedding representations, including: (i) visualization approaches, namely the linear discriminant analysis (LDA) [14] and SemAxis [15] techniques; and (ii) classification approaches, encompassing different models and cross-validation strategies.

In contrast to previous studies, here we rely on one of the prime published sources of best seller book lists, namely the Publishers Weekly Bestseller Lists, which have recorded the best selling books every year since 1895. Although the criteria used to define a book as a best seller are not entirely specified, it is established that every considered paperbound book sold at least 2,000,000 copies and every selected hardbound book sold 750,000 copies or more. It is also established that Publishers Weekly only regards books distributed through the trade—that is, bookstores and libraries—not including those sold by mail or book clubs [16]. Furthermore, our work joins the small set of studies that analyzed the success factor through the full-text content of the texts, subsequently modeling it through embeddings and analyzing it both qualitatively (applying visualization and identifying words that lead to discrimination) and quantitatively (involving supervised classifiers).

Ultimately, the results obtained from the considered approaches using only a book’s full-text content were insufficient to predict the success of a literary work with high accuracy. The best classification accuracy, 0.75, was achieved by combining a bag-of-words representation with a logistic regression model, which is a fair-to-middling outcome. Nonetheless, our experiments indicate that the subject (literary genre provided by Gutenberg) of a book alone does not seem to be enough to determine whether a title will become a best seller; rather, they point to the importance of content, since certain words are more typically found in this category of books.

This work is organized as follows. Section 2 presents and discusses related works. In Section 3, we present the research questions. Section 4 describes the datasets used. Section 5 describes the methodology adopted to analyze the books, including text preprocessing, representation, visualization, and classification. The results and discussions are reported in Section 6. Finally, in Section 7, we present the conclusions and future work.

2 Related works

The study conducted in [17] analyzed the success of books using as reference The New York Times Best Sellers, a list of best selling books in the United States. The authors considered the books appearing on the list between August 2008 and March 2016. As additional information, the sales patterns of books were also considered by using data from NPD BookScan [17]. Several interesting results were reported. Fiction books were found to be more likely to become best sellers, while nonfiction books tended to sell at a lower rate. The authors also proposed a model that can accurately measure long-term impact, since it can predict the number of copies sold by best sellers shortly after their release. The proposed description was found to be consistent with a previous model devised to describe the attention received by scientific papers [1]. The authors argue, therefore, that the underlying processes of attention are similar—despite the differences in time scale.

A model to predict book sales was proposed in [6]. The authors used the NPD BookScan dataset, focusing on a list of the 10 thousand top-selling books in a given period. A machine learning approach was proposed using different book features. Authors’ visibility was taken into account by measuring the public interest in authors via Wikipedia page views. Previous sales were also considered as a feature to measure the previous success of authors. Book features included genre (e.g., horror and science fiction) and topic information (as provided by readers). In addition, publishers’ information was used. All features were combined in the so-called Learning to Place (L2P) machine learning algorithm [18], which aims at placing a new instance (i.e., predicting book sales) within a sequence of previously published books. This study found that for fiction and nonfiction books, publisher quality tends to play an important role in the prediction. The visibility of authors was also found to be an important feature, as more visible authors are more likely to sell more copies. Finally, the factors related to the text content itself (e.g., genre and topic information) were found to play a relatively minor role in the prediction model.

Unlike the previous works, which did not take the textual content into account [6, 17], the relevance of writing style was analyzed in [19]. The authors analyzed full books from different genres (e.g., adventure, mystery, and fiction). The dataset was collected from the Project Gutenberg repository. Several linguistic markers of writing style were used to characterize the texts, including lexical features, the distribution of grammar rules, and sentiment analysis. The authors used an SVM classifier [20], and download counts were used as a surrogate for the visibility of books. Additional information, such as awards received and the number of copies sold, was also used to quantify success. The authors concluded that the adopted stylistic metrics are effective in quantifying the success of novels.

Because only a few works have analyzed the content of books to predict if they will become best sellers, in the current study we focus our analysis on full-textual features to discriminate between best sellers and ordinary literary works.

3 Research questions

This study aims to test whether the full-text content of a book alone can indicate whether it will become a best seller. While there are several ways to represent a text, we focused on the most common approaches for representing long texts. For this reason, we also investigate which text representation better captures the information about a book becoming a success. Finally, to recognize patterns in the textual data, we also examined which classifier is the most appropriate for discriminating between successes and ordinary books.

Briefly, the main research questions here are:

  1. Is it possible to predict the inclusion of books into best sellers lists by analyzing only their full-text content?

  2. Can one use bag-of-words and neural network embeddings to detect informative attributes for identifying best sellers?

  3. Can the abovementioned embeddings be influenced by the subject headings available in the dataset (such as genre or literary class)?

  4. How different is the performance of supervised classifiers in discriminating between the two categories of books analyzed?

4 Dataset

As the main objective of this work is to understand whether it is possible to identify and characterize styles and stories classified as best sellers, our dataset was composed of two categories: success and others. The first includes books considered best sellers; the second, literary works not listed as such (at least not in the analyzed period and in the consulted list). All considered instances were written in English.

To define the candidate books for the success category, we resorted to well-known annual lists: The New York Times Best Sellers, first published in 1931, and Publishers Weekly Bestseller Lists, first published in 1895. Concerning the first one, from 1931 to the present day, only 18 titles were available on the Project Gutenberg platform (a digital library whose collection is composed of full texts of books in the public domain). For the second, we mapped 110 available titles—published from 1895 to 1923—which became part of our dataset.

To select the titles of the others category, we considered the collection of books (a) published in the same period as the selected successful ones and (b) not included in the Publishers Weekly best seller lists. In this sense, if the success class had ten titles published in 1923, the others class would have the same number of titles published in that year, randomly selected from the Gutenberg repository. At the end of the process, this category contained 109 titles (one fewer than the success category, as it was infeasible to collect the same number of titles for all the years considered).

The best seller lists used in this study can be found in [16]. It is important to emphasize that the criteria for inclusion in the list are not entirely clear. Every hardbound book in it has sold at least 750,000 copies, and every paperbound book has sold at least 2 million copies; there is no clarification as to why these numbers were chosen as the minimum quantities to define a best selling title. Besides that, only sales of books distributed in the trade (bookstores and libraries) are accounted for, a criterion that excludes those sold by mail order or reading clubs. It is also not specified why only these sales were considered—a reasonable explanation being that, since these are somewhat old books, it was not easy to keep track of all kinds of markets.

Additionally, it is worth mentioning that some factors constrained the number of books in the dataset (namely, 219 instances). First, we adhered to titles in the public domain only. Although there are discussions about the fair use of such content in scientific works, there is no consensus on the validity of using copyrighted pieces. Second, we considered only one book from each author, to avoid the identification of authorship by the machine learning algorithms applied later. Third, because one of the design decisions was to work with a balanced database, the number of best sellers becomes a limiting factor for the number of non-best-selling books. Lastly, we collected the same number of successes and non-successes per year of publication (which led to one fewer non-successful book due to the unavailability of another title in one of the years considered). We emphasize, nonetheless, that such a temporal factor is essential because titles from different periods may be very distinct in terms of content and writing style. An additional discussion about the temporal aspect of books and the success and non-success instances of the dataset can be found in Section II of the S1 File.

Once the dataset was ready, we cleaned up the textual content of the 219 texts to maintain only the relevant contents of the books. In this process, the header and footer included by Project Gutenberg were removed, as well as editor/translator/author notes, captions and illustration indications, glossaries, footnotes, side-notes, annexes, and appendices. The dataset, in its final format, was made available at GitHub (https://github.com/giovanadanieles/bestSellersDataset).

5 Methodology

5.1 Text preprocessing

Using the dataset described in the previous section, we started the preprocessing step of our analysis. First, we replaced all capital letters with their corresponding lowercase counterparts. Then, the stopwords (i.e., words that provide little or no additional meaning to the context, such as articles and connectives) were removed. Next, we performed the tokenization of the books, in which elements like punctuation marks and numbers were disregarded. Finally, the obtained words were lemmatized—lemmatization being a technique whose objective is to reduce a vocable to its canonical form, grouping different forms of the same word (e.g., the term “boys” is reduced to “boy” and “took” becomes “take”). Table 1 shows an example of this preprocessing.
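
As an illustration, this pipeline can be approximated with NLTK as in the sketch below. This is not the authors' code: the token filter, the stopword list, and the use of WordNetLemmatizer are our assumptions, and reproducing the verb example above ("took" to "take") requires passing a part-of-speech hint to the lemmatizer.

```python
# A minimal preprocessing sketch with NLTK (assumed libraries, not the authors' code).
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)  # one-time resource downloads (names vary by NLTK version)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, tokenize, and keep only alphabetic tokens (drops punctuation and numbers).
    tokens = [t for t in nltk.word_tokenize(text.lower()) if t.isalpha()]
    # Remove stopwords; lemmatize verbs first and then the default (noun) form,
    # so that e.g. "took" -> "take" and "boys" -> "boy".
    return [lemmatizer.lemmatize(lemmatizer.lemmatize(t, pos="v"))
            for t in tokens if t not in STOPWORDS]

print(preprocess("It is difficult to live up to this kind of thing."))
# -> ['difficult', 'live', 'kind', 'thing']
```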

Table 1. Preprocessing example.

Preprocessing of the excerpt “It is difficult to live up to this kind of thing, and my thoughts drift to the auld schule-house and Domsie.”, obtained from the book Beside the Bonnie Brier Bush, by Ian Maclaren. The column titled Initial contains the original excerpt; the next, the phrase without capital letters and stopwords; and the last, the extract after the tokenization and lemmatization processes.

Initial: It is difficult to live up to this kind of thing, and my thoughts drift to the auld schule-house and Domsie.
Removed capital letters and stopwords: difficult live kind thing, thoughts drift auld schule-house domsie.
Tokenized and lemmatized: [‘difficult’] [‘live’] [‘kind’] [‘thing’] [‘thought’, ‘drift’] [‘auld’] [‘schule’, ‘house’] [‘domsie’]

5.2 Text embeddings

Techniques to embed textual content have been extensively used for a variety of tasks, including assessing text similarity, sentiment analysis, and classification. Among the most widely used techniques is the bag-of-words [12] approach, in which the relative frequencies of the words appearing in a document are organized as a vector.
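
For concreteness, a bag-of-words representation of this kind can be sketched with scikit-learn as below; the variable names and toy documents are ours, and the min_df fraction mirrors the N/2 vocabulary threshold described in Section 6.1.

```python
# Sketch of a bag-of-words matrix of relative word frequencies (assumed setup).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-preprocessed books (one string per book).
documents = [
    "difficult live kind thing thought drift",
    "thought drift auld schule house",
]

# min_df=0.5 keeps only words occurring in at least half of the books,
# analogous to the N/2 vocabulary threshold used later in the paper.
vectorizer = CountVectorizer(min_df=0.5)
counts = vectorizer.fit_transform(documents).toarray().astype(float)

# Relative frequencies: each row (book) is normalized by its total word count.
M = counts / counts.sum(axis=1, keepdims=True)
print(vectorizer.get_feature_names_out(), M)
```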

More recently, approaches based on neural networks have been developed to obtain dense embedding representations of words, sentences, or entire documents; these approaches are trained to predict masked parts of texts. Among the most used of these techniques is word2vec, which is based on a network comprising one hidden layer and a softmax output layer. The output layer is trained to predict the context (words appearing together) given a focus word in a sentence [21]. For a given set of sentences, such a process provides an embedding for each word.

More sophisticated techniques, such as BERT [22] and Sentence-BERT [23], generate embeddings that capture richer context and semantic information of words or sentences. However, these techniques, like word2vec and GloVe [24], are limited to a small number of tokens and cannot be applied to large portions of text, such as entire books. For this reason, we opted to use the doc2vec (D2V) method to extract a vector representation of each book [13], since it has been successfully used in text classification involving large external corpora [25].

The doc2vec approach is based on the traditional word2vec [21] pipeline with the addition of document tags as input. More specifically, it constitutes a neural network of three layers (input, hidden, and softmax), as illustrated in Fig 1a. Just as in word2vec with the continuous bag-of-words (CBOW) architecture, the inputs are one-hot vectors representing a sequence of words from a sentence in a book. A target word is omitted from the input and used to train the neural network. In addition, the input includes an extra one-hot vector identifying the book. The model is trained to predict the target word from the context (the words adjacent to the target) using a negative sampling strategy. The hidden-layer vectors connected directly to the one-hot-encoded books are used as the book embeddings. Here, we opted to use the Gensim [26] software to obtain the doc2vec representations of the books.
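
A minimal Gensim sketch of this setup follows (Gensim 4.x API assumed); the token lists are placeholders, and the hyperparameters are those reported later in Section 6.2.

```python
# Sketch of training doc2vec book embeddings with Gensim (assumed 4.x API).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical tokenized books (one token list per book after preprocessing).
tokenized_books = [
    ["difficult", "live", "kind", "thing"],
    ["thought", "drift", "auld", "schule", "house"],
]

# Each book gets a unique tag; the vector trained for that tag is the book embedding.
documents = [TaggedDocument(words=tokens, tags=[i])
             for i, tokens in enumerate(tokenized_books)]

# vector_size in {32, 64, 128, 256}, window=5, min_count=1, epochs=40 (see Section 6.2).
model = Doc2Vec(documents, vector_size=64, window=5, min_count=1, epochs=40)

book_vector = model.dv[0]  # 64-dimensional embedding of the first book
```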

Fig 1. Representation of doc2vec and SemAxis approaches.


In (a), we illustrate the neural network employed to obtain the embedding representation of books based on sequences of words (encoded as one-hot vectors) extracted from a book. The network is trained to predict a target word in the sentence based on the adjacent terms. Additionally, the original book ID is also encoded as input to the neural network, and its trained vector corresponds to a point in the embedding space of books. In (b), we illustrate the SemAxis approach, in which the line connecting the centroids of the two categories (success vs. others) defines an axis onto which all the books are projected. This process results in a continuous one-dimensional (scalar) representation of the books, which is employed for visualization purposes.

5.3 Visualization

Neural network embeddings usually result in high-dimensional dense vectors that are not correlated among themselves, which limits the use of linear techniques to reduce the dimensionality of these spaces (such as PCA [27]). Thus, the process of visualizing such structures is usually undertaken using non-linear projections, such as t-SNE [28] and UMAP [29].

However, embeddings can encode many different aspects of the data; for instance, a certain axis in a book embedding may be related to its number of pages or its adherence to the non-fiction or fantasy genres. The SemAxis approach [15] is a way to find an axis in a high-dimensional embedding that describes a certain aspect of the data. This is accomplished by first obtaining the centroids of two classes, e.g., small vs. large books or non-fiction vs. fantasy books. The line connecting the two centroids defines an axis onto which all the remaining books are projected. This process is illustrated in Fig 1b. Since in the current work we are interested in encoding the success of books, we employed the SemAxis approach to find an axis for samples of the success and others classes. In addition to SemAxis, we also employed linear discriminant analysis (LDA) [14], which likewise results in an axis encoding a continuous representation of the two classes.
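
The centroid-based axis can be written in a few lines of NumPy, as in the hedged sketch below; X and y are hypothetical placeholders for a book-embedding matrix and its binary success labels.

```python
# Sketch of a SemAxis-style projection onto the success-vs-others axis.
import numpy as np

def semaxis_scores(X, y):
    """Project rows of X onto the unit vector joining the two class centroids.

    X: (n_books, dim) embedding matrix; y: binary labels (1 = success, 0 = others).
    """
    axis = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    axis /= np.linalg.norm(axis)  # normalize so scores are comparable
    return X @ axis               # one scalar score per book

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 64))   # toy embeddings
y = np.array([1, 1, 1, 0, 0, 0])   # toy labels
print(semaxis_scores(X, y))
```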

In contrast to neural network-based approaches, the bag-of-words embedding can result in highly correlated and sparse vectors. For instance, the frequency patterns of two closely related words can correlate strongly, and rare words may be present in only a small set of documents. Nonetheless, both LDA and SemAxis remain applicable under these conditions.

5.4 Classification: Distinguishing successes from others

The identification and classification of textual patterns were performed using traditional, well-known machine learning classifiers [30]. We considered different classification strategies, including k-nearest neighbors (KNN) [31] (based on the similarity of the nearest neighbors), naive Bayes (NB) [32] (which estimates the class-conditional probability based on Bayes’ theorem, assuming conditional independence between attributes), decision tree (DT) [33] (which classifies a test record based on a series of discriminating questions about its attributes), support vector machine (SVM) [30, 34] (based on finding a hyperplane that linearly separates the data, defined by the closest training points—the support vectors), and, finally, the two that yielded the best results: random forest (RF) [35] and logistic regression (LR) [36].

In short, random forest is a class of ensemble methods built over DT classifiers. It uses multiple decision trees, each constructed from a set of random vectors, and combines their predictions to yield a final classification. Logistic regression, on the other hand, is based on estimating the conditional probability of an event; it models this probability by minimizing the negative log-likelihood over the labeled classes.

All these tests were implemented in the Python language [37] using the classifiers of the Scikit-Learn library [38]. Following the guidelines described in related works [39, 40], we used the default parameters of the methods to classify texts. As an exception, in the case of the SVM, we changed the parameter “max_iter” (the maximum number of iterations) to 10,000.
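
A sketch of this setup is shown below; the choice of LinearSVC and GaussianNB among scikit-learn's variants is our assumption, since the paper only names the classifier families, the default parameters, and the SVM max_iter change.

```python
# Sketch of the classifier comparison with scikit-learn defaults (assumed variants).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

classifiers = {
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(),
    "SVM": LinearSVC(max_iter=10_000),  # the only non-default parameter in the paper
    "RF": RandomForestClassifier(),
    "LR": LogisticRegression(),
}

# Toy stand-ins for the book embedding matrix and success labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
y = np.tile([0, 1], 20)

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold accuracy
    print(f"{name}: {scores.mean():.2f} ± {scores.std():.2f}")
```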

6 Results and discussions

This section describes the experiments performed to study the task of automatically characterizing and identifying best seller books. The proposed data analysis pipeline is illustrated in Fig 2. First, we obtain vector representations of each book by employing two distinct techniques (Fig 2a): bag-of-words and doc2vec (the latter with different dimensions, namely 32, 64, 128, and 256). Next, we investigate the proposed classification problem through two main approaches: visualization and classification. In the first, we employed a simple visualization pipeline to verify and illustrate the potential of using embeddings to identify best seller books (Fig 2b–2d). The objective of this approach is to provide a preliminary and simple way to visually inspect the considered high-dimensional embeddings by summarizing them into a single continuous axis.

Fig 2. Overall diagram of the main approaches.


All the methods within the blue and orange boxes are applied to the two considered embeddings in a combined fashion. For example, a valid path would be: (i) embedding: doc2vec with dimension equal to 32; (ii) preprocessing: standardized; (iii) learning method: logistic regression; (iv) validation: leave-one-out.

The visualization pipeline starts with the standardization of the obtained embeddings (Fig 2b). To reduce the dimensionality of the embeddings, we employed SemAxis [15] and LDA [14]. Since these methods are supervised, the final visualizations are performed by employing the leave-one-out technique to avoid overfitting.
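
A hedged sketch of this leave-one-out projection, using LDA as the supervised axis, follows; the helper name and toy data are ours.

```python
# Sketch of the leave-one-out visualization pipeline: standardize and fit the
# supervised projection on all books but one, then project the held-out book.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler

def loo_axis_scores(X, y):
    scores = np.empty(len(X))
    for train, test in LeaveOneOut().split(X):
        scaler = StandardScaler().fit(X[train])
        lda = LinearDiscriminantAnalysis(n_components=1)
        lda.fit(scaler.transform(X[train]), y[train])
        # The held-out book never influences the axis onto which it is projected.
        scores[test] = lda.transform(scaler.transform(X[test])).ravel()
    return scores  # one scalar per book, usable for kernel density plots

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))  # toy embeddings
y = np.tile([0, 1], 10)           # toy labels
print(loo_axis_scores(X, y))
```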

The second approach considered in this work is the direct application of classification methods, allowing quantitative comparison of the respective performance. For that, we employed a pipeline comprising the same embedding configurations as before but followed by three successive stages: preprocessing, learning method, and validation, each presented as a box in Fig 2. All combinations between the components of each of these boxes are considered in our evaluation.

In this sense, the first two subsections detail the visualization task followed by the classification, first using the bag-of-words and then the doc2vec representation. In the last subsection, we repeat these experiments to evaluate a specific variation of the constructed dataset. Additionally, for those interested in results using non-full-text content, there is an additional discussion in Section I of the S1 File, in which we explored only the beginning of each book. Moreover, we also discuss using readability measures and textual features to discriminate between best sellers and non-best sellers in Section III of the S1 File.

6.1 Bag-of-words analysis

The first experiment evaluates whether the frequency of the words composing the books can discriminate between best sellers and ordinary literary works. For this purpose, we considered the set S, built from the 3,585 different words that appeared in at least N/2 texts of the dataset, N being the total number of books. The proportion N/2 was chosen because smaller ones (such as N/3 or N/4) admitted archaic words and words not belonging to the vernacular of the English language, while higher proportions led to poorer results in the experiments.

Considering each entry in S, we computed its frequency for all books in the dataset, resulting in a 219 × 3585 matrix of frequencies, henceforth called M. Next, the rows of M—each representing a book—were standardized and transformed according to two approaches, LDA and SemAxis, with the results cross-validated through leave-one-out. As shown in Fig 3, such processing led to a visual separation in both (a) and (b), giving evidence that the bag-of-words model can provide a good—although not exact—split between the two studied categories.

Fig 3. Kernel density estimation of the 219 investigated literary works.


(a) LDA projection and (b) SemAxis projection of M.

Moreover, to quantitatively assess the obtained separation, M was used as input to supervised classification methods (namely KNN, logistic regression, naive Bayes, decision tree, random forest, and SVM). We applied leave-one-out and k-fold (with k = 10) cross-validation and considered both the standardized and non-standardized versions of M (the standardized version denoted by M^). Here, we adopted the standard hyperparameters of each model—in other words, no tuning was involved.

As shown in Table 2, the logistic regression model was the best choice for capturing discrepancies between classes, leading to an average classification accuracy of 0.75 for both leave-one-out (LOO) and k-fold cross-validation. This result shows that the approach is apt—to a reasonable extent—to identify successful literary works. Furthermore, it is worth observing that the standardization positively impacts the outcomes, leading to performances as good as or better than the non-standardized case in ten out of twelve scenarios—the only exceptions involving the KNN model. Please consult Section IV of the S1 File for complementary information concerning the precision, recall, and F1-score metrics.

Table 2. Classification accuracy for different models and arrangements.

Results for configurations M or M^ and leave-one-out or k-fold cross-validation. Highlighted in bold is the best result for each configuration.

        M                    M^
        LOO    10-fold       LOO    10-fold
KNN     0.64   0.64 ± 0.11   0.58   0.56 ± 0.10
LR      0.65   0.64 ± 0.13   0.75   0.75 ± 0.09
NB      0.63   0.62 ± 0.11   0.63   0.62 ± 0.11
DT      0.65   0.58 ± 0.14   0.65   0.58 ± 0.14
RF      0.68   0.68 ± 0.11   0.68   0.68 ± 0.11
SVM     0.66   0.63 ± 0.11   0.72   0.74 ± 0.09

In addition, we retrieved the 40 words of S with the greatest impact on the SemAxis projection, aiming to analyze what sort of vocable is characteristic of best sellers and of the other books. As presented in Table 3, the most meaningful words for successful books encompass six adjectives, nine nouns, one adverb, and four verbs; for the non-best seller books, we have three adjectives, seven nouns, one adverb, and nine verbs. Similar to a result previously reported in [19], words referring to body parts (such as eye, face, and hand) play a central role in less successful titles. Furthermore, none of the 20 most relevant terms for successes ranks among the 40 most frequent words of S; in contrast, for the non-best seller books, the principal words eye, face, hand, and back represent, respectively, the 5th, 6th, 10th, and 12th most common words of the dataset.
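
For illustration, this ranking of words by their weight on the SemAxis axis can be sketched as follows; M_std, vocab, and y are hypothetical stand-ins for the standardized frequency matrix, its vocabulary, and the class labels.

```python
# Sketch: reading the most discriminative words off the SemAxis axis weights.
import numpy as np

rng = np.random.default_rng(0)
M_std = rng.standard_normal((10, 50))              # toy standardized frequencies
vocab = np.array([f"word{i}" for i in range(50)])  # toy vocabulary of S
y = np.array([1] * 5 + [0] * 5)                    # toy success/others labels

# Axis between class centroids, as in the SemAxis projection above.
axis = M_std[y == 1].mean(axis=0) - M_std[y == 0].mean(axis=0)
axis /= np.linalg.norm(axis)

order = np.argsort(axis)
print("success-side words:", vocab[order[-20:][::-1]])  # largest positive weights
print("others-side words:", vocab[order[:20]])          # largest negative weights
```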

Table 3. Forty most significant words to the SemAxis projection discrimination between best sellers and others.

The importance of each term for the method dictates its allocation order in the table: the element on the first row and the first column (of success/other) is the most important for the class; the one on the second row and the second column is the second most important; and so forth.

Vocables
Success   ordinary      evidence    motive        exhibit
          grey          substance   improve       copy
          instruction   contain     examination   practice
          accordingly   teacher     numerous      interesting
          school        large       attach        average
Other     breath        drop        draw          sharply
          face          eye         turn          stun
          break         hand        push          back
          reckless      caught      arm           shake
          tone          instant     quick         glance

6.2 Doc2vec analysis

The second experiment evaluates whether doc2vec’s representation of literary works can capture the dissimilarity between the two analyzed classes. With this aim, we instantiated D2V models with 32, 64, 128, and 256 dimensions (a feature commonly called vector size, hereafter referred to as #2D). We also set the minimum word count to 1 (so that no word is discarded by frequency), the window (maximum distance between the current and predicted term within a sentence) to 5, and the number of epochs (iterations over the corpus) to 40. Lastly, the model was trained using all 219 instances of the dataset.

Next, each model vector (the set henceforth called D)—each vector representing a different book—was transformed employing the LDA and SemAxis techniques along with leave-one-out cross-validation, yielding the results shown in Fig 4. As can be observed, the method was able to characterize best seller and non-best seller works in a contrasting fashion in both (a) and (b). This result shows that it is possible to emphasize the differences between the two classes with two noticeably distinct approaches (either BoW or D2V).

Fig 4. Kernel density estimation of the 219 investigated literary works.


(a) LDA projection and (b) SemAxis projection of D2V representation (adopting #2D = 64).

Furthermore, we used D as the supervised classification methods’ input to quantitatively assess the obtained separation. The models used here were the same as those applied in the BoW experiment, and we also considered LOO and 10-fold cross-validations and both the standardized and the non-standardized versions of D (the standardized version denoted by D^). The chosen models’ hyperparameters were the standard ones.

As shown in Tables 4 and 5, naive Bayes was the model that best performed the task of distinguishing classes considering the D2V representation, leading to a maximum classification accuracy of 0.71 for LOO and 0.72 ± 0.12 for 10-fold. In the LOO version, #2D = 32 yields the best results, while #2D = 256 performs better for 10-fold. Although the standardization did not affect the naive Bayes classifier, it led to the same or slightly better outcomes for the others—the exceptions being some arrangements, namely KNN (for #2D = 128) and LR (for #2D = 64 and 128) in the LOO version, and KNN (#2D = 256), LR (#2D = 256), and SVM (#2D = 64) for 10-fold.

Table 4. Classification accuracy for LOO cross-validation combined with different models and arrangements.

Results for configurations D or D^ and for D2V vector size: 32, 64, 128, or 256. Highlighted in bold is the best result for each configuration.

LOO
D D^
32 64 128 256 32 64 128 256
KNN 0.65 0.65 0.65 0.67 0.65 0.65 0.64 0.67
LR 0.68 0.65 0.63 0.66 0.70 0.64 0.61 0.67
NB 0.71 0.68 0.69 0.68 0.71 0.68 0.69 0.68
DT 0.48 0.58 0.53 0.40 0.48 0.58 0.53 0.40
RF 0.66 0.63 0.67 0.64 0.66 0.63 0.67 0.64
SVM 0.68 0.63 0.62 0.63 0.69 0.66 0.65 0.64

Table 5. Classification accuracy for 10-fold cross-validation combined with different models and arrangements.

Results for configurations D or D^ and for D2V vector sizes 32, 64, 128, or 256. Highlighted in bold is the best result for each configuration.

10-fold
D
32 64 128 256
KNN 0.67 ± 0.08 0.63 ± 0.06 0.65 ± 0.13 0.69 ± 0.11
LR 0.66 ± 0.13 0.67 ± 0.10 0.59 ± 0.08 0.71 ± 0.09
NB 0.68 ± 0.10 0.70 ± 0.13 0.70 ± 0.08 0.72 ± 0.12
DT 0.58 ± 0.12 0.52 ± 0.08 0.56 ± 0.10 0.50 ± 0.06
RF 0.68 ± 0.10 0.65 ± 0.10 0.66 ± 0.09 0.70 ± 0.10
SVM 0.66 ± 0.13 0.66 ± 0.09 0.57 ± 0.07 0.69 ± 0.10
D^
32 64 128 256
KNN 0.68 ± 0.11 0.64 ± 0.05 0.66 ± 0.10 0.68 ± 0.08
LR 0.68 ± 0.11 0.69 ± 0.12 0.60 ± 0.08 0.70 ± 0.10
NB 0.68 ± 0.10 0.70 ± 0.13 0.70 ± 0.08 0.72 ± 0.12
DT 0.58 ± 0.12 0.52 ± 0.08 0.56 ± 0.10 0.50 ± 0.06
RF 0.68 ± 0.10 0.65 ± 0.10 0.66 ± 0.09 0.70 ± 0.10
SVM 0.67 ± 0.14 0.64 ± 0.11 0.57 ± 0.08 0.70 ± 0.10

6.3 Are the subjects being grasped by the approaches?

What if the above-mentioned approaches are only capturing (or being biased by) the subjects of the books? That would be a valid inquiry, since we did not account for this type of information during the construction of the database. To assess this possibility, we retrieved the list of subjects of each book provided by the Gutenberg platform and then analyzed the ten most common ones in the dataset. In Fig 5, we plot those subjects against the SemAxis projection of the books’ D2V vector representation (using #2D = 64), stratifying the results by category. As one can see, the only subjects with a representative number of instances are PS and PR, which also seem to explain, to some degree, the separation obtained through the D2V method.
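
A hedged matplotlib sketch of this kind of figure follows; the data variables and plotting choices are ours, not the authors'.

```python
# Sketch of plotting Gutenberg subjects (y-axis) against SemAxis scores (x-axis),
# stratified by category, in the spirit of Fig 5 (assumed data placeholders).
import matplotlib.pyplot as plt

def plot_subjects(scores, subjects, labels, top_subjects):
    fig, ax = plt.subplots()
    for row, subj in enumerate(top_subjects):
        for label, color, name in [(1, "tab:orange", "success"), (0, "tab:blue", "others")]:
            # Collect the SemAxis scores of books with this subject and class.
            xs = [s for s, sb, lb in zip(scores, subjects, labels)
                  if sb == subj and lb == label]
            ax.scatter(xs, [row] * len(xs), color=color, alpha=0.6,
                       label=name if row == 0 else None)
    ax.set_yticks(range(len(top_subjects)))
    ax.set_yticklabels(top_subjects)
    ax.set_xlabel("SemAxis projection")
    ax.legend()
    plt.show()
```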

Fig 5. On the y-axis, the ten most common subjects in the dataset.


On the x-axis, the SemAxis projection of the books’ D2V representation (adopting #2D = 64).

PR and PS are classifications used by the Library of Congress [41] to catalog British and American literature, respectively. In our case, the PR subject represents 102 instances of the dataset: 34 best sellers and 68 non-best seller works. The PS subject, by contrast, encompasses 98 books—72 best sellers and 26 other works. Since, in principle, the success category is the only one with a limited number of instances (given its criteria), we created a new dataset (with 72 successes and 72 others, following the same standards stated in the creation of the former dataset) containing only literary works belonging to subject PS; for this, 46 new non-best seller titles were selected from the Gutenberg platform. Using this dataset, we repeated the previous experiments, aiming to understand whether the fact that a book belongs to British or American literature was enough to explain the separation provided by the BoW and D2V methods. The results are presented and discussed below.

6.3.1 Bag-of-words analysis

As in the previous experiment, we considered the set S, now built from the 3,257 different words that appeared in at least N/2 books of the PS dataset. Then, we calculated their frequencies, resulting in a 144 × 3257 matrix (henceforth called MPS). Next, the rows of MPS were standardized, transformed using LDA and SemAxis, and verified via LOO cross-validation, as shown in Fig 6. It is possible to observe that the separation between the classes is still perceptible—in both (a) and (b)—although now only American literature is being considered.

Fig 6. Kernel density estimation of the 144 investigated literary works belonging to the PS subject.


(a) LDA projection and (b) SemAxis projection of MPS.

To quantitatively analyze the separation of the categories, we applied the supervised classification methods using the standardized and non-standardized versions of MPS as input. The models, hyperparameters, and cross-validation methods were the same as in the previous experiments. As shown in Table 6, the random forest model was the best option for distinguishing the best seller and other instances, leading to an average accuracy of 0.71 in both LOO and 10-fold cross-validation. The standardization did not affect the results. Even though this accuracy is lower than that obtained with the previous dataset (whose highest accuracy was 0.75), it is worth mentioning that the PS dataset has 35% fewer instances than the other, leading us to expect lower accuracies and higher standard deviations. Thus, it is possible to state that the BoW method is not classifying the corpus instances based predominantly or solely on their literary class.

Table 6. Classification accuracy for different models and arrangements considering the PS dataset.

Results for configurations MPS or M^PS and leave-one-out or k-fold cross-validation.

        MPS                  M^PS
        LOO    10-fold       LOO    10-fold
KNN     0.62   0.60 ± 0.09   0.63   0.62 ± 0.13
LR      0.64   0.63 ± 0.15   0.69   0.67 ± 0.13
NB      0.65   0.65 ± 0.15   0.65   0.65 ± 0.15
DT      0.62   0.57 ± 0.14   0.62   0.57 ± 0.14
RF      0.71   0.71 ± 0.14   0.71   0.70 ± 0.12
SVM     0.61   0.61 ± 0.16   0.66   0.61 ± 0.16

6.3.2 Doc2vec analysis

The D2V models were instantiated for the new dataset with the same hyperparameters as in the previous tests—the only difference being that we trained the model using 144 books instead of 219. We repeated the former configurations (vector sizes 32, 64, 128, and 256, with LOO and 10-fold cross-validations) and adopted the standardized and non-standardized versions of the model vectors—called D^PS and DPS, respectively. Fig 7 shows the results of transforming the model vectors via LDA and SemAxis. The split between best seller and non-best seller works was again observed, suggesting that the method is insensitive to the literary class.

Fig 7. Kernel density estimation of the 144 investigated literary works belonging to the PS subject for the D2V representation.


(a) LDA projection and (b) SemAxis. In both cases, #2DPS = 64 was adopted.

The quantitative assessment using supervised classification led to the results shown in Tables 7 and 8. From Table 7, it is possible to conclude that the models that best performed for LOO cross-validation were logistic regression and naive Bayes—the highest accuracy (0.67) given by the latter, with #2DPS = 256. In this case, the KNN model was the only one whose performance was harmed by standardization. From Table 8, we conclude that no model stood out for 10-fold, with the highest accuracy of 0.67 given by the SVM model, with #2DPS = 32 and non-standardized input. The standardization process induced better results on six distinct occasions, although the best-obtained accuracy relies on a non-standardized vector.

Table 7. Classification accuracy for LOO cross-validation combined with different models and arrangements (i.e., whether DPS or D^PS were employed and with which D2V vector size: 32, 64, 128, or 256), considering the PS dataset.

Highlighted in bold is the best result for each configuration.

LOO
D PS D^PS
32 64 128 256 32 64 128 256
KNN 0.56 0.59 0.60 0.56 0.51 0.57 0.56 0.55
LR 0.62 0.62 0.58 0.60 0.62 0.65 0.60 0.60
NB 0.60 0.62 0.62 0.67 0.60 0.62 0.62 0.67
DT 0.51 0.51 0.44 0.53 0.51 0.51 0.44 0.53
RF 0.53 0.60 0.58 0.56 0.53 0.60 0.58 0.56
SVM 0.62 0.61 0.60 0.58 0.62 0.61 0.62 0.58

Table 8. Classification accuracy for 10-fold cross-validation combined with different models and arrangements (i.e., whether DPS or D^PS were employed and with which D2V vector size: 32, 64, 128, or 256), considering the PS dataset.

Highlighted in bold is the best result for each configuration.

10-fold
D PS
32 64 128 256
KNN 0.59 ± 0.13 0.58 ± 0.14 0.49 ± 0.12 0.60 ± 0.09
LR 0.66 ± 0.13 0.54 ± 0.06 0.59 ± 0.12 0.66 ± 0.08
NB 0.61 ± 0.08 0.59 ± 0.12 0.58 ± 0.14 0.61 ± 0.14
DT 0.45 ± 0.13 0.49 ± 0.10 0.66 ± 0.13 0.48 ± 0.09
RF 0.59 ± 0.10 0.60 ± 0.10 0.57 ± 0.17 0.59 ± 0.13
SVM 0.65 ± 0.12 0.49 ± 0.10 0.58 ± 0.12 0.63 ± 0.10
D^PS
32 64 128 256
KNN 0.60 ± 0.13 0.61 ± 0.16 0.53 ± 0.07 0.59 ± 0.08
LR 0.66 ± 0.12 0.58 ± 0.09 0.59 ± 0.12 0.65 ± 0.09
NB 0.61 ± 0.08 0.59 ± 0.12 0.58 ± 0.14 0.61 ± 0.14
DT 0.45 ± 0.13 0.49 ± 0.10 0.66 ± 0.13 0.48 ± 0.09
RF 0.59 ± 0.10 0.60 ± 0.10 0.57 ± 0.17 0.59 ± 0.13
SVM 0.67 ± 0.11 0.53 ± 0.07 0.55 ± 0.14 0.63 ± 0.10

For the 219-instance dataset, the best-achieved accuracy was 0.72. Again, a drop in accuracy was expected, as the new dataset has 35% fewer instances than the other. Thus, just as with the BoW method, it is possible to infer that the separation between classes obtained by the D2V approach does not rely solely on whether a book belongs to British or American literature.

7 Conclusions

The study of the characteristics leading literary pieces to become best sellers constitutes an intriguing and challenging research subject. The present work addressed this issue by considering aspects derived from the full content of a list of more and less successful books retrieved from Project Gutenberg, based on the best seller lists of Publishers Weekly. Several alternative content representation, standardization, visualization, and classification approaches were considered, as summarized in the diagram shown in Fig 2.

We started our analysis by examining the data using visualization techniques. The visualization enabled a preliminary direct inspection of the embedding by looking at a single axis that maximizes the separation between best sellers and ordinary books. Specifically, we employed the SemAxis and LDA techniques—the former providing better discrimination between classes than the latter, for both the bag-of-words and doc2vec representations. Furthermore, SemAxis helped us to: (i) understand the most characteristic words in best sellers and non-best sellers; and (ii) check whether the respective success was related to the subjects of the books (e.g., love stories, adventure stories, and fiction, among others). In line with earlier work [19], words related to body parts (like face, eye, and hand) played a central role in non-best seller books, while more varied and less common vocables (such as ordinary, accordingly, and examination) were characteristic of more successful books. Moreover, we found no evidence that the subject of the books impacted the class discrimination obtained.

For the classification tasks, we tested two preprocessing strategies for the two distinct representations: (i) standardizing and (ii) not standardizing the embeddings. Then, we evaluated the proposed representations via different classifiers (namely KNN, LR, NB, DT, RF, and SVM). The best result was obtained with the complete dataset (219 books) using the LR classifier with the standardized bag-of-words representation, yielding a final classification accuracy of 0.75. Still dealing with the complete set, the best accuracy obtained for the D2V embedding was 0.72, combining the standardized representation with the NB model. For the dataset considering only the PS subject (144 books), the bag-of-words approach yielded the most promising results for the standardized data inputted to the RF classifier. The D2V representation, in contrast, returned better outcomes for the standardized data combined with the NB classifier. These results agree with the tendency toward separation of the two classes found in the visualization analysis. Interestingly, the standardization did not significantly affect the results of the doc2vec approach for either dataset.

The reported methodology and results pave the way for several related studies. Firstly, it would be interesting to adapt the reported method to other types of embeddings, for example, a BERT transformer modified to work with long texts. Secondly, the described approach could be used to better understand: (i) other types of documents, such as scientific books and articles, and (ii) additional types of artistic production, including music, poetry, and theater. Lastly, another point that could be explored concerns explaining the additional reasons why some literary works become best sellers and others do not.

Concerning the limitations of the work, the three main points we stress are (i) the absence of modern books in the database, (ii) the absence of more modern modeling techniques, and (iii) the limitation in dataset size imposed by the number of available best-selling books. As previously discussed, the scarcity of books is due to copyright laws that protect the complete contents of modern books; even though such content can be found free of charge on the internet, we do not have the right to use it. Regarding modeling, more modern techniques, such as BERT, do not deal well with long texts [42]. As the median book length in our dataset is approximately 90,000 characters, it would not be appropriate to apply such a technique. Even if the modeling yielded highly accurate results, they would not be reliable. Finally, concerning the limited size of the dataset, we are restricted by the number of books listed as best sellers that are also available in the public domain. As previously stated, we cannot leverage books whose content is not free. Moreover, best-selling books are scarce by nature: if all books were best sellers, this study would not even exist. In addition to these points, it is also worth mentioning that, although an accuracy greater than 80 or 90% would be desirable, it would be unrealistic to predict the success of works with such a high outcome, given the intricate and multifaceted nature of the task. Factors like marketing and trends can influence the popularity of books in ways that are difficult to predict or measure. Therefore, the 75% accuracy becomes reasonable, although somewhat limited, considering that we explore solely the textual content of each book. Future models incorporating additional factors, such as marketing, author popularity, and contextual elements, could offer a more comprehensive understanding of what drives a book’s success.

Regardless of the dataset limitations recognized in this research, we made a careful effort to minimize any undesirable impact of external factors, such as authorship, publication period, and literary genre, with the book’s full text being the main feature considered in the hypothesis tested in this work. Ultimately, we were able to provide valuable insights into the factors that seem to lead to the relative success of a book in becoming a best seller, namely, the content of the text.

Supporting information

S1 File

(PDF)

pone.0302070.s001.pdf (1.5MB, pdf)

Data Availability

The dataset of books was obtained from Project Gutenberg and is distributed in the public domain. The specific subset employed in this study can be found at http://dx.doi.org/10.5281/zenodo.7622473.

Funding Statement

G. D. da S. acknowledges the São Paulo Research Foundation (FAPESP) from Brazil for sponsorship (grant no. 2021/01744-0). B. C. e S. acknowledges the Coordination for the Improvement of Higher Education Personnel (CAPES) from Brazil for sponsorship (Finance Code 001). L. da F. C. thanks the Brazilian National Council for Scientific and Technological Development (CNPq) (grant no. 307085/2018-0) and the São Paulo Research Foundation (FAPESP) (grant no. 15/22308-2). D. R. A. acknowledges financial support from the Brazilian National Council for Scientific and Technological Development (CNPq) (grant no. 311074/2021-9) and the São Paulo Research Foundation (FAPESP) (grant no. 2020/06271-0). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Wang D, Song C, Barabási A. Quantifying long-term scientific impact. Science. 2013;342(6154):127–132. doi: 10.1126/science.1237825
  • 2. Barabási A. The formula: the universal laws of success. Hachette UK; 2018.
  • 3. Lee K, Park J, Kim I, Choi Y. Predicting movie success with machine learning techniques: ways to improve accuracy. Information Systems Frontiers. 2018;20(3):577–588. doi: 10.1007/s10796-016-9689-z
  • 4. Tohalino JAV, Amancio DR. On predicting research grants productivity via machine learning. Journal of Informetrics. 2022;16(2):101260. doi: 10.1016/j.joi.2022.101260
  • 5. Cetinic E, Lipic T, Grgic S. A deep learning perspective on beauty, sentiment, and remembrance of art. IEEE Access. 2019;7:73694–73710. doi: 10.1109/ACCESS.2019.2921101
  • 6. Wang X, Yucesoy B, Varol O, Eliassi-Rad T, Barabási A. Success in books: predicting book sales before publication. EPJ Data Science. 2019;8(1):1–20. doi: 10.1140/epjds/s13688-019-0208-6
  • 7. Harvey J. The content characteristics of best-selling novels. Public Opinion Quarterly. 1953;17(1):91–114. doi: 10.1086/266441
  • 8. Lee S, Ji H, Kim J, Park E. What books will be your bestseller? A machine learning approach with Amazon Kindle. The Electronic Library. 2021;39(1):137–151. doi: 10.1108/EL-08-2020-0234
  • 9. Lee S, Kim J, Park E. Can book covers help predict bestsellers using machine learning approaches? Telematics and Informatics. 2023:101948. doi: 10.1016/j.tele.2023.101948
  • 10. Maity SK, Panigrahi A, Mukherjee A. Book reading behavior on Goodreads can predict the Amazon best sellers. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining; 2017. p. 451–454.
  • 11. Zhang C, Zhou Q. Assessing books’ depth and breadth via multi-level mining on tables of contents. Journal of Informetrics. 2020;14(2):101032. doi: 10.1016/j.joi.2020.101032
  • 12. Manning C, Schütze H. Foundations of statistical natural language processing. MIT Press; 1999.
  • 13. Le Q, Mikolov T. Distributed representations of sentences and documents. In: International Conference on Machine Learning. PMLR; 2014. p. 1188–1196.
  • 14. Ghojogh B, Crowley M. Linear and quadratic discriminant analysis: tutorial. arXiv preprint arXiv:1906.02590. 2019.
  • 15. An J, Kwak H, Ahn YY. SemAxis: a lightweight framework to characterize domain-specific word semantics beyond sentiment. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics; 2018. p. 2450–2461. Available from: https://aclanthology.org/P18-1228.
  • 16. Hackett AP, Burke JH. 80 years of best sellers. R. R. Bowker Company; 1977.
  • 17. Yucesoy B, Wang X, Huang J, Barabási A. Success in books: a big data approach to bestsellers. EPJ Data Science. 2018;7:1–25. doi: 10.1140/epjds/s13688-018-0135-y
  • 18. Wang X, Varol O, Eliassi-Rad T. L2P: an algorithm for estimating heavy-tailed outcomes. arXiv preprint arXiv:1908.04628. 2019.
  • 19. Ashok VG, Feng S, Choi Y. Success with style: using writing style to predict the success of novels. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing; 2013. p. 1753–1764.
  • 20. Fan R, Chang K, Hsieh C, Wang X, Lin C. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research. 2008;9:1871–1874.
  • 21. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
  • 22. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
  • 23. Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084. 2019.
  • 24. Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP); 2014. p. 1532–1543. Available from: http://www.aclweb.org/anthology/D14-1162.
  • 25. Lau JH, Baldwin T. An empirical evaluation of doc2vec with practical insights into document embedding generation. In: Proceedings of the 1st Workshop on Representation Learning for NLP. Berlin, Germany: Association for Computational Linguistics; 2016. p. 78–86. Available from: https://aclanthology.org/W16-1609.
  • 26. Řehůřek R, Sojka P. Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA; 2010. p. 45–50.
  • 27. Gewers FL, Ferreira GR, Arruda HFD, Silva FN, Comin CH, Amancio DR, et al. Principal component analysis: a natural approach to data exploration. ACM Computing Surveys (CSUR). 2021;54(4):1–34. doi: 10.1145/3447755
  • 28. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9(11).
  • 29. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018.
  • 30. Mitchell TM. Machine learning. Vol. 1. New York: McGraw-Hill; 1997.
  • 31. Fix E, Hodges JL. Discriminatory analysis, nonparametric discrimination: consistency properties. Randolph Field, Texas: USAF School of Aviation Medicine; 1951.
  • 32. Zhang H. The optimality of naive Bayes. In: Barr V, Markov Z, editors. Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004). AAAI Press; 2004.
  • 33. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. New York: Routledge; 2017.
  • 34. Murthy SK. Automatic construction of decision trees from data: a multi-disciplinary survey. Data Mining and Knowledge Discovery. 1998;2(4):345–389. doi: 10.1023/A:1009744630224
  • 35. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32. doi: 10.1023/A:1010933404324
  • 36. Yu H, Huang F, Lin C. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning. 2011;85(1):41–75. doi: 10.1007/s10994-010-5221-8
  • 37. Van Rossum G, Drake FL. Python 3 reference manual. Scotts Valley, CA: CreateSpace; 2009.
  • 38. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
  • 39. Amancio DR, Comin CH, Casanova D, Travieso G, Bruno OM, Rodrigues FA, et al. A systematic comparison of supervised classifiers. PLoS ONE. 2014;9(4):e94137. doi: 10.1371/journal.pone.0094137
  • 40. Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, da F Costa L, et al. Clustering algorithms: a comparative approach. PLoS ONE. 2019;14(1):e0210236. doi: 10.1371/journal.pone.0210236
  • 41. US Government. Library of Congress classification outline; 2013. https://www.loc.gov/catdir/cpso/lcco/.
  • 42. Gao S, Alawad M, Young MT, Gounley J, Schaefferkoetter N, Yoon HJ, et al. Limitations of transformers on clinical text classification. IEEE Journal of Biomedical and Health Informatics. 2021;25(9):3596–3607. doi: 10.1109/JBHI.2021.3062322

Decision Letter 0

Heba El-Fiqi

10 May 2023

PONE-D-23-04137
Using Full-Text Content to Characterize and Identify Best Seller Books
PLOS ONE

Dear Dr. Silva,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jun 24 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Heba El-Fiqi

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Review

I think this article is basically an interesting study. The “mathematical side” of the argument is sound, but crucial aspects of the study are flawed. The dataset is not adequate for the conclusions that are drawn. The main approach should be better motivated. Its limitations should be discussed in deeper theoretical detail, and the conclusions should be stated with more caution.

The authors state: “This study aims to probe whether the full-text content of the book alone can indicate if it will become a best seller.” That would require surveying all possible methods, which isn’t feasible.

Same point again: “The obtained results suggest that it is infeasible to predict the success of a literary work with high accuracy by using only its full-text content.” But that conclusion cannot be drawn. A study of this kind can only show that the approaches that have been tested fail.

The authors rightly say that “the main trends, interests, and expectations predominating in a given period” are crucial for bestsellerhood. This theoretical argument is a stronger one than the empirical one proposed in the paper. The empirical results are however compatible with that theoretical statement. Important factors are text-external.

The conclusion “our experiments evince that the subject of the books does not seem to be a core factor for a title becoming a best seller” seems to be carelessly stated. It is hard to deny “that the subject of the books IS A CRUCIAL FACTOR for a title becoming a best seller”. The experiments cannot disprove that. I actually find the accuracy of 0.75 surprisingly high. It would rather support the conclusion that content is very important, had the data behind it been more convincingly curated.

The data selection is of the convenience kind. It is unclear how it would be representative of some relevant populations. The Gutenberg repository is hardly a representative selection of books. It is likely to have books interesting by their quality and/or popularity. Books belonging to typical bestseller genres which have failed to become bestsellers, say an unsuccessful crime novel, tend to be forgotten, and they are less likely to find their way to Gutenberg. If, as the authors correctly state, “the main trends, interests, and expectations” are important, bestsellers should be compared to “fail-sellers” in the same time and place aligning with the same trends, interests, and expectations. The size of the dataset should also be motivated.

So, “the other [class set] would have the same number of titles published in the same year — the titles randomly selected from the Gutenberg repository”. There are other relevant parameters to consider, mainly genre. But the “other” books include both literary works and non-fiction, whereas, I guess, the bestsellers to a high extent comprise genre fiction. (“the PS dataset” is more adequate, but smaller). In short, the “other” class is likely to be biased in a way that makes it more or less useless for the argument that is advanced. For instance, it includes Joyce’s Ulysses, which is of course completely unrepresentative of failing books. It has also probably been a major bestseller in the longer run.

The methods applied by the authors to compute vector representations have not been developed for the representation of the content of novels, but rather for classification of shorter documents. Again, it seems to me that the authors have made convenience decisions, which come without deeper theoretical motivation. In particular, the representations are static, whereas it seems evident that there is a time-related dynamics to narratives, which is important for the enjoyment of fiction and for bestsellerhood.

Reviewer #2: This paper aimed to study whether it is feasible to characterize and identify stories and narratives listed as best sellers by combining full-text content information and machine learning models. In this regard, the textual content of a set of books was modeled, and a series of experiments assessed the possibility of automatically differentiating a best seller from an ordinary book. In particular, the authors employed a dataset encompassing the full-text content of literary works collected from the Project Gutenberg platform.

Overall, this paper is interesting for the community of text mining and machine learning applications.

The weaknesses of this paper are as follows:

1. This paper utilized full-text content to identify best-selling books. However, there are two issues that the authors should address:

(1) The paper only employed shallow features of the full-text content, neglecting deeper features such as discourse or writing styles of the books.

(2) The authors only provided the identification results of best-selling books based on full-text content. They should also present results based on non-full-text content for comparison.

2. The dataset used in this paper consists of 219 books published between 1895 and 1924. There are two issues that the authors should address:

(1) The size of the dataset is relatively small.

(2) Many contemporary books could be utilized in this study. It is worth noting that the full-text content of current books can be accessed online. The authors should discuss this context, which differs from the past.

3. Concerning the related works in this paper, there are the following two issues:

(1) Several relevant studies have been overlooked, such as Harvey (1953), Lee et al. (2021), Lee et al. (2023), and Maity et al. (2017), among others.

Harvey, J. (1953). The content characteristics of best-selling novels. Public Opinion Quarterly, 17(1), 91-114.

Lee, S., Ji, H., Kim, J., & Park, E. (2021). What books will be your bestseller? A machine learning approach with Amazon Kindle. The Electronic Library, 39(1), 137-151.

Lee, S., Kim, J., & Park, E. (2023). Can book covers help predict bestsellers using machine learning approaches?. Telematics and Informatics, 101948.

Maity, S. K., Panigrahi, A., & Mukherjee, A. (2017, July). Book reading behavior on goodreads can predict the amazon best sellers. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017 (pp. 451-454).

(2) Additionally, some related studies have employed full-text content from book tables of contents to evaluate book quality, such as Zhang & Zhou (2020).

Zhang C., Zhou Q. Assessing Books’ Depth and Breadth via Multi-level Mining on Tables of Contents. Journal of Informetrics, 2020, 14(2): 101032.

4. The methods used in this paper are relatively simple. I recommend that the authors briefly describe less important methods, while conversely, some methods should be explained in more detail, such as the classification method described in Section 5.4. Additionally, the title in Section 5.4 should be made more specific.

5. Some expressions need to be more rigorous, such as providing the full names for abbreviations that appear for the first time, like LOO.

6. The paper's structure could benefit from further refinement. It is recommended to create a separate section for related discussions, encompassing the theoretical and practical implications of this study, as well as the paper's limitations.

In summary, this study has some significance, but there are issues with the research methods and the analysis of experimental results.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Apr 26;19(4):e0302070. doi: 10.1371/journal.pone.0302070.r002

Author response to Decision Letter 0


7 Aug 2023

Response to reviewers is also attached as a formatted PDF file.

---------

Dear Editor,

Please find attached a revised version of the manuscript ``Using Full-Text Content to Characterize and Identify Best Seller Books'', in which we have considered all the reviewers' comments. The most substantial changes made are marked in magenta in the revised version. Also attached is a list of responses to the reviewers' comments.

Yours sincerely,

The authors.

Reviewers' comments:

Reviewer #1:

I think this article is basically an interesting study. The “mathematical side” of the argument is sound, but crucial aspects of the study are flawed. The dataset is not adequate for the conclusions that are drawn. The main approach should be better motivated. Its limitations should be discussed in deeper theoretical detail, and the conclusions should be stated with more caution.

>>> Answer: Thank you for your comment. We agree that some of the conclusions were carelessly stated, and we have rephrased them. All the major changes made are marked in magenta in the revised version, and some of these are also discussed later in this letter.

The authors state: ``This study aims to probe whether the full-text content of the book alone can indicate if it will become a best seller.'' That would require surveying all possible methods, which isn’t feasible.

>>> Answer: Thank you for appropriately pointing this out, we have rewritten this statement as follows:

>>> ``This study aims to test whether the full-text content of the book alone can indicate if it will become a best seller.''

Same point again: ``The obtained results suggest that it is infeasible to predict the success of a literary work with high accuracy by using only its full-text content.'' But that conclusion cannot be drawn. A study of this kind can only show that the approaches that have been tested fail.

>>> Answer: Thank you for your comment, we have also changed the writing to emphasize this as a limitation in our conclusions.

>>> ``Ultimately, the results obtained from the considered approaches using only a book's full-text content were insufficient to predict the success of a literary work with high accuracy.''

The authors rightly say that ``the main trends, interests, and expectations predominating in a given period'' are crucial for bestsellerhood. This theoretical argument is a stronger one than the empirical one proposed in the paper. The empirical results are however compatible with that theoretical statement. Important factors are text-external.

>>> Answer: Thank you for your comment. We were glad to perceive the compatibility between theory and the results yielded in our research.

The conclusion “our experiments evince that the subject of the books does not seem to be a core factor for a title becoming a best seller” seems to be carelessly stated. It is hard to deny ``that the subject of the books IS A CRUCIAL FACTOR for a title becoming a best seller''. The experiments cannot disprove that. I actually find the accuracy of 0.75 surprisingly high. It would rather support the conclusion that content is very important, had the data behind it been more convincingly curated.

>>> Answer: Thank you for pointing this out. We believe that our writing and the usage of the term ``subject'' might have caused confusion between a book's genre and its content. We changed the text for more clarity, as follows:

>>> ``Nonetheless, our experiments evince that the subject (literary genre provided by Gutenberg) of a book, alone, does not seem to be enough to determine if a title will become a best seller, but rather point to the importance of content, since there are words that are more typically found in this category of books.''

The data selection is of the convenience kind. It is unclear how it would be representative of some relevant populations. The Gutenberg repository is hardly a representative selection of books. It is likely to have books interesting by their quality and/or popularity. Books belonging to typical bestseller genres which have failed to become bestsellers, say an unsuccessful crime novel, tend to be forgotten, and they are less likely to find their way to Gutenberg. If, as the authors correctly state, “the main trends, interests, and expectations” are important, bestsellers should be compared to ``fail-sellers'' in the same time and place aligning with the same trends, interests, and expectations. The size of the dataset should also be motivated.

So, “the other [class set] would have the same number of titles published in the same year — the titles randomly selected from the Gutenberg repository”. There are other relevant parameters to consider, mainly genre. But the “other” books include both literary works and non-fiction, whereas, I guess, the bestsellers to a high extent comprise genre fiction. (“the PS dataset” is more adequate, but smaller). In short, the “other” class is likely to be biased in a way that makes it more or less useless for the argument that is advanced. For instance, it includes Joyce’s Ulysses, which is of course completely unrepresentative of failing books. It has also probably been a major bestseller in the longer run.

>>> We chose to work with the Project Gutenberg repository not simply for its accessibility but due to its extensive collection of public domain books. This repository offers a diverse range of both best sellers and other preserved titles, spanning various genres and time periods. The nature of this selection underpins our analysis and allows us to conduct a broad yet meaningful comparison between best sellers and other titles that managed to be preserved over time.

>>> Concerning the size of the dataset, it is important to clarify that our constraints are dictated by the number of best sellers, which inherently are not numerous. While a larger dataset might have provided additional insights, we had to account for author-specific effects to avoid conflating book success with author popularity. This necessary control reduced the size of our dataset to its current amount.

>>> Regarding James Joyce's `Ulysses', we concur with your comment. It indeed faced numerous issues at the time of publication, which hindered its initial success. Nonetheless, our methodology has identified it as a success, echoing its subsequent recognition and popularity. While `Ulysses' may not fit the traditional best seller mold at its time of publication, its later success underscores the potential validity and foresight of our method. We added an additional discussion about this specific book in Section II of the Supplementary Material.

>>> About ``fail sellers'', if a book was completely forgotten (or never read) after its publication, it would be very difficult to find its complete content available on the internet. In fact, not even some of the best sellers listed in the consulted list were found for download. It was with this in mind that we never used the nomenclature ``fail seller'', ``failure'', or ``unsuccessful'': what we are dealing with here are the less successful (or non-successful) books in the period studied (from 1895 to 1923).

The methods applied by the authors to compute vector representations have not been developed for the representation of the content of novels, but rather for classification of shorter documents. Again, it seems to me that the authors have made convenience decisions, which come without deeper theoretical motivation. In particular, the representations are static, whereas it seems evident that there is a time-related dynamics to narratives, which is important for the enjoyment of fiction and for bestsellerhood.

>>> Answer: Thank you for your comment. We have included further explanations of our method choices for handling full-text content. As follows:

>>> ``More sophisticated techniques such as BERT and sentence BERT generate embeddings that capture richer context and semantic information of words or sentences. However, these techniques, similar to W2V and GloVe, are limited to a small number of tokens and cannot be applied to large portions of texts, such as entire books. For this reason, we opted to use the doc2vec (D2V) method to extract a vector representation of each book, since it has been successfully used in text classification tasks involving large external corpora.''
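For concreteness, the D2V extraction described above can be sketched in a few lines of gensim (the topic-modelling library cited in the references); the book texts, identifiers, and hyperparameters below are illustrative placeholders, not the authors' exact configuration:

```python
# Minimal sketch of extracting one doc2vec (D2V) vector per book with gensim.
# Texts, tags, and hyperparameters are illustrative placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# books: book identifier -> full plain text (assumed loaded elsewhere)
books = {
    "book_001": "full text of the first book ...",
    "book_002": "full text of the second book ...",
}

corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[book_id])
    for book_id, text in books.items()
]

model = Doc2Vec(vector_size=100, min_count=5, epochs=40)  # illustrative settings
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

vector = model.dv["book_001"]  # a fixed-length representation of the whole book
```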

Reviewer #2:

This paper aimed to study whether it is feasible to characterize and identify stories and narratives listed as best sellers by combining full-text content information and machine learning models. In this regard, the textual content of a set of books was modeled, and a series of experiments assessed the possibility of automatically differentiating a best seller from an ordinary book. In particular, the authors employed a dataset encompassing the full-text content of literary works collected from the Project Gutenberg platform.

Overall, this paper is interesting for the community of text mining and machine learning applications.

The weaknesses of this paper are as follows:

1. This paper utilized full-text content to identify best-selling books. However, there are two issues that the authors should address:

(1) The paper only employed shallow features of the full-text content, neglecting deeper features such as discourse or writing styles of the books.

(2) The authors only provided the identification results of best-selling books based on full-text content. They should also present results based on non-full-text content for comparison.

>>> Answer: Thank you for your comment. We have performed a new experiment where we address non-full-text content. Its discussion is in Section I of the Supplementary Material provided. Additionally, we have also included in Section III of the Supplementary Material a new discussion concerning readability scores and other textual features.

2. The dataset used in this paper consists of 219 books published between 1895 and 1924. There are two issues that the authors should address:

(1) The size of the dataset is relatively small.

(2) Many contemporary books could be utilized in this study. It is worth noting that the full-text content of current books can be accessed online. The authors should discuss this context, which differs from the past.

>>> Answer: Thank you for your comment. Unfortunately, as we discussed in the text (Section 4, copied below), the size and the publication time of the books in the dataset are, indeed, among the limitations of this work. We recognize how this can restrict our research, but we took careful measures to work with the data that was publicly available, and we believe that our work yields valuable and valid results.

>>> ``Additionally, it is worth mentioning that some factors were imperative in the limited number of books of the dataset (namely, 219 instances). First, we adhere to titles in the public domain only. Although there are discussions about the fair use of such content in scientific works, there is no consensus on the validity of using copyrighted pieces. Second, we considered only one book from each author to avoid identification of authorship by machine learning algorithms to be applied later. Third, because one of the design decisions was to work with a balanced database, the number of bestsellers becomes a limiting factor for the number of non-bestselling books. Lastly, we collected the same number of successes and non-successes per year of publication (which even led to one less non-successful book due to the unavailability of another title in one of the years considered). We emphasize, nonetheless, that such a temporal factor is essential because there will always be a possibility that titles from different periods may be very distinct in terms of content and writing style.''

>>> We highlight the excerpt ``Although there are discussions about the fair use of such content in scientific works, there is no consensus on the validity of using copyrighted pieces''. While it would be possible, yes, to obtain the complete contents of more contemporary books, this would be ethically (and legally) debatable.

3. Concerning the related works in this paper, there are the following two issues:

(1) Several relevant studies have been overlooked, such as Harvey (1953), Lee et al. (2021), Lee et al. (2023), and Maity et al. (2017), among others.

Harvey, J. (1953). The content characteristics of best-selling novels. Public Opinion Quarterly, 17(1), 91-114.

Lee, S., Ji, H., Kim, J., & Park, E. (2021). What books will be your bestseller? A machine learning approach with Amazon Kindle. The Electronic Library, 39(1), 137-151.

Lee, S., Kim, J., & Park, E. (2023). Can book covers help predict bestsellers using machine learning approaches?. Telematics and Informatics, 101948.

Maity, S. K., Panigrahi, A., & Mukherjee, A. (2017, July). Book reading behavior on goodreads can predict the amazon best sellers. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017 (pp. 451-454).

(2) Additionally, some related studies have employed full-text content from book tables of contents to evaluate book quality, such as Zhang & Zhou (2020).

Zhang C., Zhou Q. Assessing Books’ Depth and Breadth via Multi-level Mining on Tables of Contents. Journal of Informetrics, 2020, 14(2): 101032.

>>> Answer: Thank you for your comment, we have included the references in our discussion.

4. The methods used in this paper are relatively simple. I recommend that the authors briefly describe less important methods, while conversely, some methods should be explained in more detail, such as the classification method described in Section 5.4. Additionally, the title in Section 5.4 should be made more specific.

>>> Answer: Thank you for pointing this out. We have changed this section's title and added more discussion on both the classification methods used and the classification problem we are addressing, as follows:

>>> ``The identification and classification of textual patterns were performed using traditional, well-known machine learning classifiers. We considered different classifier strategies, including $k$-nearest neighbors (KNN) (based on the similarity of an example to its nearest neighbors), naive Bayes (NB) (which estimates the class-conditional probability based on the Bayes theorem, assuming conditional independence between attributes), decision tree (DT) (which classifies a test example based on a series of discriminating questions about its attributes), support-vector machine (SVM) (based on finding hyperplanes that linearly separate the data, positioned by the closest training examples, called support vectors), and, finally, the two that yielded the best results: random forest (RF) and logistic regression (LR).

>>> In just a few words, Random Forest is a class of ensemble methods built on top of DT classifiers. It uses multiple decision trees, constructed from sets of random vectors, combining each of their predictions to yield a final classification.

>>> On the other hand, Logistic Regression is based on determining the conditional probability of an event happening. It models this probability by minimizing a negative likelihood function for the labeled classes.''
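As a rough illustration of how such a comparison could be set up with scikit-learn (the library the paper relies on), the sketch below evaluates the classifiers named above under 10-fold and leave-one-out cross-validation; the feature matrix and labels are random placeholders standing in for the book representations:

```python
# Rough sketch of the classifier comparison with scikit-learn. X and y are
# random placeholders for the book features and the best seller /
# non-best seller labels (219 books in the paper's dataset).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

classifiers = {
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(),
    "SVM": SVC(kernel="linear"),
    "RF": RandomForestClassifier(),
    "LR": LogisticRegression(max_iter=1000),
}

rng = np.random.default_rng(0)
X = rng.random((219, 100))        # placeholder feature vectors
y = rng.integers(0, 2, size=219)  # placeholder binary labels

for name, clf in classifiers.items():
    acc_10fold = cross_val_score(clf, X, y, cv=10).mean()
    acc_loo = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(f"{name}: 10-fold = {acc_10fold:.3f}, LOO = {acc_loo:.3f}")
```

With only 219 books, leave-one-out is affordable and makes the most of the data, which is presumably why it was used alongside 10-fold cross-validation.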

5. Some expressions need to be more rigorous, such as providing the full names for abbreviations that appear for the first time, like LOO.

>>> Answer: Thank you for your comment. We have carefully corrected these mistakes in the text.

6. The paper's structure could benefit from further refinement. It is recommended to create a separate section for related discussions, encompassing the theoretical and practical implications of this study, as well as the paper's limitations.

>>> Answer: Thank you for your comment. We have included a separate supplementary material to complement some valuable discussions about our research. We also included a final paragraph in the Conclusion section of the paper to address the implications of our study, as follows:

>>> ``Regardless of the dataset limitations we recognized in this research, we have employed careful effort to minimize any undesirable impact of external factors, such as authorship and publication period. (...)''

Attachment

Submitted filename: ResponseLetter.pdf

pone.0302070.s002.pdf (38.9KB, pdf)

Decision Letter 1

Heba El-Fiqi

10 Oct 2023

PONE-D-23-04137R1
Using Full-Text Content to Characterize and Identify Best Seller Books
PLOS ONE

Dear Dr. Silva,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 24 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Heba El-Fiqi

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: (No Response)

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

Reviewer #3: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: The dataset used in this paper consists of 219 books published between 1895 and 1924. The title of this paper is "Using Full-Text Content to Characterize and Identify Best Seller Books". Therefore, one issue that needs to be addressed is whether the method in this paper is applicable to identification of current Best Seller Books. This is because the full-text content of current books can be accessed online. Therefore, the author needs to limit the title of the paper and conduct necessary discussions.

Reviewer #3: Pros:

1- The paper is organized and well written.

2- This paper shows a comparison analysis between the proposed model with numerous traditional classifiers.

3- The proposed method and results help publishing companies and writers.

Cons:

1- The results were not good (an average accuracy of 75%). These results may make the model’s decisions untrustworthy.

2- It may be better to check deep learning models like BERT or a large language model (LLM).

3- The pictures are blurry.

Note:

It may be better to remove “Then” in the text “Then, to obtain quantitative and more objective results, we employed various classifiers”

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #3: Yes: Hashim Abu-Gellban

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Apr 26;19(4):e0302070. doi: 10.1371/journal.pone.0302070.r004

Author response to Decision Letter 1


21 Nov 2023

Dear Editor,

Please find attached a revised version of the manuscript “Using Full-Text Content to Characterize and Identify Best Seller Books”, in which we have considered all the reviewers’ comments. The most substantial changes made are marked in magenta in the revised version. Also attached is a list of responses to the reviewers’ comments.

Yours sincerely,

The authors.

Reviewers’ comments:

Reviewer #2:

The dataset used in this paper consists of 219 books published between 1895 and 1924. The title of this paper is “Using Full-Text Content to Characterize and Identify Best Seller Books”. Therefore, one issue that needs to be addressed is whether the method in this paper is applicable to identification of current Best Seller Books. This is because the full-text content of current books can be accessed online. Therefore, the author needs to limit the title of the paper and conduct necessary discussions.

Answer: Thanks for pointing that out. We updated the paper title to “Using full-text content to identify best sellers: a study of early 20th-century literature”. In this way, we hope to reflect the period of the books analyzed, ensuring that the reader is aware of this information before starting to read. Furthermore, we added a paragraph in the Conclusions section that reaffirms that one of the limitations of this work is that we cannot analyze books published after the beginning of the 20th century due to copyright reasons. The paragraph is as follows:

(...) Concerning the limitations of the work, the two main points we stress are (i) the absence of modern books in the database and (ii) the absence of more modern modeling techniques. As previously discussed, the scarcity of books is due to copyright laws that protect the complete contents of modern books. Even though such content is found free of charge on the internet, we do not have the right to use it.

Reviewer #3:

Pros:

1. The paper is organized and well written.

2. This paper shows a comparison analysis between the proposed model with numerous traditional classifiers.

3. The proposed method and results help publishing companies and writers.

Cons:

1. The results were not good (the average accuracy of 75%). These results may make the model’s decisions untrusted.

Answer: Thanks for pointing that out. We recognize that 75% accuracy is not a lot, but we highlight the intricate nature of the problem, which makes it difficult to achieve high accuracies (such as 80 or 90%). To make this clear, we added an excerpt in the Conclusions section. The excerpt is as follows:

(...) In addition to these points, it is also worth mentioning that although an accuracy greater than 80 or 90% would be desirable, it would be unrealistic to predict the success of works with such a high outcome, given the intricate and multifaceted nature of such a task. Factors like marketing and trends can influence the popularity of books in ways difficult to predict or measure. Therefore, the 75% accuracy result becomes reasonable, although somewhat limited, if we think we are exploring solely the textual content of each book.

2. It may be better to check deep learning models like BERT or a large language model (LLM).

Answer: Thanks for pointing that out. In the Text embeddings subsection, we mention the following:

(...) More sophisticated techniques such as BERT and sentence BERT generate embeddings that capture richer context and semantic information of words or sentences. However, these techniques, similar to W2V and GloVe, are limited to a small number of tokens and cannot be applied to large portions of texts, such as entire books.

However, to highlight this limitation (which may not have become so clear during reading), we added a new excerpt in a Conclusions section paragraph that mentions the limitations of the work. The excerpt is as follows:

(...) Regarding modeling, more modern techniques, such as BERT, do not deal well with long texts. As the median size of our dataset is approximately 90,000 characters, it would not be appropriate to apply such a technique. Even if the modeling yielded highly accurate results, they would not be reliable.

3. The pictures are blurring.

Answer: Thanks for your comment. We double-checked, and the images are of good quality. The PLOS ONE submission system automatically converts the images from the PDF to lower-quality versions. To access the figures in higher resolution, click on “download figure” in the PDF.

Note: It may be better to remove “Then” in the text “Then, to obtain quantitative and more objective results, we employed various classifiers”

Answer: Indeed, thanks for spotting it. We updated it accordingly.

Attachment

Submitted filename: response.pdf

pone.0302070.s003.pdf (26.9KB, pdf)

Decision Letter 2

Heba El-Fiqi

12 Dec 2023

PONE-D-23-04137R2
Using full-text content to characterize and identify best seller books: a study of early 20th-century literature
PLOS ONE

Dear Dr. Silva,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 26 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Heba El-Fiqi

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

Reviewer #3: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: Thank you for your revision. After the revision, the motivation for the research is clearly stated, the research methods are appropriate, and the research results are credible. I believe this paper has satisfactorily answered my previous doubts. The paper can be accepted for publication.

Reviewer #3: Pros:

1- The paper is organized and well written.

2- This paper shows a comparison analysis between the proposed model with numerous traditional classifiers.

3- The proposed method and results help publishing companies and writers.

Cons:

1- The obtained results, with an average accuracy of 75%, fall short of expectations, potentially undermining the reliability of the model's decisions. Despite the author's assertion that the results were reasonable, they lie in the middle ground between randomness (50%) and high accuracy (<99%). Additionally, depending solely on the accuracy metric is insufficient. The paper should include other evaluation metrics, such as precision, recall, and F1-score, to provide a more comprehensive assessment of the classification models.

2- Exploring deep learning models like BERT or a large language model (LLM) could be more beneficial. The revised paper (R2) notes that "BERT does not deal well with long texts," referencing a study that utilized only the first 510 tokens of extensive text. However, this truncation limits classifiers' awareness of most of the input text. There are alternative techniques to handle lengthy text when applying BERT, such as segmenting the text into chunks to fit the model's input size. Moreover, LLMs exhibit a capability to manage larger input sizes.

3- The dataset limitation is apparent, consisting of only 219 examples.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Apr 26;19(4):e0302070. doi: 10.1371/journal.pone.0302070.r006

Author response to Decision Letter 2


9 Mar 2024

Reviewers’ comments:

Reviewer #3:

Pros:

1. The paper is organized and well written.

2. This paper shows a comparison analysis between the proposed model with numerous traditional classifiers.

3. The proposed method and results help publishing companies and writers.

Cons:

1. The obtained results, with an average accuracy of 75%, fall short of expectations, potentially undermining the reliability of the model’s decisions. Despite the author’s assertion that the results were reasonable, they lie in the middle ground between randomness (50%) and high accuracy (<99%). Additionally, depending solely on the accuracy metric is insufficient. The paper should include other evaluation metrics, such as precision, recall, and F1-score, to provide a more comprehensive assessment of the classification models.

Answer: Thank you for your comment. The proposed model aims to tackle the very complex task of predicting the success of a book solely based on its content. While a 75% accuracy rate may seem low in terms of reliability, for instance, for deployment in the industry, we never anticipated the model would achieve high values anyway. We acknowledge that several external factors, beyond just the content, significantly influence whether a book becomes a bestseller. These include the author's popularity, adaptations into other media forms (e.g., movies inspired by the book), and the political and social context at the time of publication. Such an accuracy value showcases that the content seems to be an important factor in determining the success of books. Future work may improve these numbers by employing more sophisticated methods. For the prediction task, future models could also incorporate content, author information, and more context in order to achieve better performance. We included such a suggestion in the conclusions.

Concerning the other evaluation metrics, we added a new section (Section IV - Assessing precision, recall, and f1-score metrics) in the supplementary material, where we report the precision, recall, and f1-score for the modeling/configuration that achieved the best results. As stated there, these metrics do not add much information, as they yield results very similar to the accuracy. However, we recognize the importance of reporting them so that the reader can understand that (i) accuracy is an appropriate metric for this case and (ii) all the other metrics sustain the value of the accuracy.
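A minimal sketch of how those complementary metrics can be computed with scikit-learn is given below; the label vectors are placeholders, since the actual predictions come from the cross-validated models described in the paper:

```python
# Minimal sketch: precision, recall, and F1 alongside accuracy with
# scikit-learn. Label vectors are placeholders; in the paper these
# predictions come from the cross-validated best configuration.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # placeholder gold labels (1 = best seller)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # placeholder predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
```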

2. Exploring deep learning models like BERT or a large language model (LLM) could be more beneficial. The revised paper (R2) notes that "BERT does not deal well with long texts," referencing a study that utilized only the first 510 tokens of extensive text. However, this truncation limits classifiers' awareness of most of the input text. There are alternative techniques to handle lengthy text when applying BERT, such as segmenting the text into chunks to fit the model's input size. Moreover, LLMs exhibit a capability to manage larger input sizes.

Answer: Indeed, there are alternative methodologies in the literature, such as aggregating vector representations or using larger token sizes. However, even these approaches are constrained by a relatively small token limit, extending up to 32,000 tokens (e.g., the Longformer [1] and Transformer-XL [2]). Considering that a typical book may contain between 70,000 and 120,000 tokens, these models still fall short of covering entire texts. Moreover, the inherent computational memory requirements for processing with large models add another layer of complexity. Additionally, pre-trained large language models may not be ideally suited for our analysis due to potential biases in their training datasets. For instance, a preliminary examination of ChatGPT revealed its ability to detect whether a book was a bestseller based solely on its title, indicating possible data contamination. Given our lack of control over the training datasets of these large language models, their applicability in our study is not clear.

[1] Beltagy, Iz, Matthew E. Peters, and Arman Cohan. “Longformer: The long-document transformer.” arXiv preprint arXiv:2004.05150 (2020).

[2] Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Although fine-tuning and exploring the performance of large language models for predicting the success of books is beyond the current scope of our research, we propose in the conclusions that future studies should assess the efficacy of LLMs for this specific task. Two significant challenges must be addressed: the limited token size, which might be mitigated through a strategy of piecewise summarization, and ensuring the training dataset used in the model does not influence its analysis. One potential solution could involve using only inputs from books published after the model training, which poses challenges due to restricted access to contemporary works. Alternatively, models could be carefully trained with datasets excluding references to the analyzed books.
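Purely to illustrate the chunking idea raised by the reviewer (not a method adopted in the paper), a long book could be split into windows that fit BERT's input limit and the chunk embeddings averaged; the model name, window size, and mean-pooling aggregation below are assumptions made for the sketch:

```python
# Sketch of the chunking strategy the reviewer suggests: encode a long book
# in BERT-sized windows and average the chunk embeddings. The model choice,
# window size, and mean pooling are illustrative assumptions only.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_long_text(text: str, window: int = 400) -> torch.Tensor:
    words = text.split()
    chunks = [" ".join(words[i:i + window]) for i in range(0, len(words), window)]
    vectors = []
    with torch.no_grad():
        for chunk in chunks:
            inputs = tokenizer(chunk, return_tensors="pt",
                               truncation=True, max_length=512)
            hidden = model(**inputs).last_hidden_state    # (1, seq_len, 768)
            vectors.append(hidden.mean(dim=1).squeeze(0))  # mean-pool the tokens
    return torch.stack(vectors).mean(dim=0)               # average over chunks

book_vector = embed_long_text("full text of a book ...")
```

Note that averaging chunk vectors flattens the narrative's temporal structure, which echoes the limitation of static representations discussed earlier in this review history.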

3. The dataset limitation is apparent, consisting of only 219 examples.

Answer: Thank you for your comment; we really appreciate it. We acknowledge the limitation of our dataset and added a new excerpt to the Conclusions to make this restriction explicit. The excerpt is as follows:

(...) Concerning the limitations of the work, the three main points we stress are (i) the absence of modern books in the database, (ii) the absence of more modern modeling techniques, and (iii) the limitation in dataset size imposed by the number of available best-selling books. (...) We are restricted by the number of books that are both listed as best sellers and available in the public domain. As previously stated, we cannot use books whose content is not freely available. Also, best-selling books are scarce by nature: if all books were best sellers, this study would not even exist.

Attachment

Submitted filename: best_sellers_response_to_reviewers.pdf

pone.0302070.s004.pdf (31.1KB, pdf)

Decision Letter 3

Heba El-Fiqi

27 Mar 2024

Using full-text content to characterize and identify best seller books: a study of early 20th-century literature

PONE-D-23-04137R3

Dear Dr. Silva,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the 'Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Heba El-Fiqi

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File

    (PDF)

    pone.0302070.s001.pdf (1.5MB, pdf)
    Attachment

    Submitted filename: ResponseLetter.pdf

    pone.0302070.s002.pdf (38.9KB, pdf)
    Attachment

    Submitted filename: response.pdf

    pone.0302070.s003.pdf (26.9KB, pdf)
    Attachment

    Submitted filename: best_sellers_response_to_reviewers.pdf

    pone.0302070.s004.pdf (31.1KB, pdf)

    Data Availability Statement

    The dataset of books was obtained from the Gutenberg dataset; the books are distributed in the public domain. The specific subset employed in this study can be found at http://dx.doi.org/10.5281/zenodo.7622473.

