PLoS One. 2021 Apr 8;16(4):e0249957. doi: 10.1371/journal.pone.0249957

A computational lens into how music characterizes genre in film

Benjamin Ma 1,*,#, Timothy Greer 1,*,#, Dillon Knox 1,*,#, Shrikanth Narayanan 1,2
Editor: Stavros Ntalampiras
PMCID: PMC8031455  PMID: 33831109

Abstract

Film music varies tremendously across genre in order to bring about different responses in an audience. For instance, composers may evoke passion in a romantic scene with lush string passages or inspire fear throughout horror films with inharmonious drones. This study investigates such phenomena through a quantitative evaluation of music that is associated with different film genres. We construct supervised neural network models with various pooling mechanisms to predict a film’s genre from its soundtrack. We use these models to compare handcrafted music information retrieval (MIR) features against VGGish audio embedding features, finding similar performance with the top-performing architectures. We examine the best-performing MIR feature model through permutation feature importance (PFI), determining that mel-frequency cepstral coefficient (MFCC) and tonal features are most indicative of musical differences between genres. We investigate the interaction between musical and visual features with a cross-modal analysis, and do not find compelling evidence that music characteristic of a certain genre implies low-level visual features associated with that genre. Furthermore, we provide software code to replicate this study at https://github.com/usc-sail/mica-music-in-media. This work adds to our understanding of music’s use in multi-modal contexts and offers the potential for future inquiry into human affective experiences.

Introduction

Music plays a crucial role in the experience and enjoyment of film. While the narrative of movie scenes may be driven by non-musical audio and visual information, a film’s music carries a significant impact on audience interpretation of the director’s intent and style [1]. Musical moments may complement the visual information in a film; other times, they flout the affect conveyed in film’s other modalities (e.g., visual, linguistic). In every case, however, music influences a viewer’s experience in consuming cinema’s complex, multi-modal stimuli. Analyzing how these media interact can provide filmmakers and composers insight into how to create particular holistic cinema-watching experiences.

We hypothesize that musical properties, such as timbre, pitch, and rhythm, achieve particular stylistic effects in film, and are reflected in the display and experience of a film’s accompanying visual cues, as well as its overall genre classification. In this study, we characterize differences among movies of different genres based on their film music scores. While this paper focuses on how music is used to support specific cinematic genres, created to engender particular film-watching experiences, this work can be extended to study other multi-modal content experiences, such as viewing television, advertisements, trailers, documentaries, music videos and musical theatre.

Related work

Music use across film genre

Several studies have explored music use in cinema. Music has been such an integral part of the film-watching experience that guides for creating music for movies have existed since the Silent Film era of the early 20th century [2]. Gorbman [3] noted that music in film acts as a signifier of emotion while providing referential and narrative cues, and Rodman [4] pointed out that these cues can be discreetly “felt” or overtly “heard.” That stylistic musical effects and their purposes in film are well attested provides an opportunity to study how these musical structures are used.

Previous work has made preliminary progress in this direction. Brownrigg presented a qualitative study on how music is used in different film genres [5]. He hypothesized that film genres have distinctive musical paradigms existing in tension with one another. By this token, the conventional score associated with one genre can appear in a “transplanted” scene in another genre. As an example, a Science Fiction movie may use musical conventions associated with Romance to help drive the narrative of a subplot that relates to love. In this paper, we use a multiple instance machine learning approach to study how film music may provide narrative support to scenes steeped in other film genres.

Other studies have taken a more quantitative approach, extracting audio from movies to identify affective content [6, 7]. Gillick and Bamman analyzed soundtracks from over 40,000 movies and television shows, extracting song information and audio features such as tempo, danceability, instrumentalness, and acousticness [8]. They found that a majority of these audio features were statistically significant predictors of genre, suggesting that studying music in film can offer insights into how a movie will be perceived by its audience. In this work, we use musical features and state-of-the-art neural embeddings to study film genre.

Another study that used machine learning techniques, by Austin et al., found timbral features most discriminatory in separating movie genres [1]. In prior work, soundtracks were analyzed without accounting for whether, or for how long, the songs were actually used in a film. We extend these studies by investigating how timestamped musical clips that are explicitly used in a film relate to that film’s genre.

Musical-visual cross-modal analysis

Previous research has established a strong connection between the visual and musical modes as partners in delivering a comprehensive narrative experience to the viewer [9–12]. Cohen [10] argued that music “is one of the strongest sources of emotion in film” because it allows the viewer to subconsciously attach emotional associations to the visuals presented onscreen. Wingstedt [13] advanced this theory by proposing that music serves not only an “emotive” function, but also a “descriptive” function, which allows the soundtrack to describe the setting of the story-world (e.g., by using folk instruments for a Western setting). In combination with its emotive function, music’s descriptive function is critical in supporting (or undermining) the film genre characterized by the visuals of the film.

In this study, we use screen brightness and contrast as two low-level visual features to describe the visual mode of the film. Chen [14] found that different film genres have characteristically different average brightness and contrast values: Comedy and Romance films have higher contrast and brightness, while Horror, Sci-Fi, and Action films are visually darker with less contrast. Tarvainen [15] established statistically significant correlations of brightness and color saturation with feelings of “beauty” and “pleasantness” in film viewers, while darkness and lack of color were associated with “ugliness” and “unpleasantness.” This result is complementary to Chen’s finding: Comedy and Romance films tend to evoke “beauty” and “pleasantness,” while Action, Horror, and Sci-Fi tend to emphasize gritty, muddled, or even “unpleasant” and “ugly” emotions.

Multiple instance learning

Multiple instance learning (MIL) is a supervised machine learning method where ground truth labels are not available for every instance; instead, labels are provided for sets of instances, called “bags.” The goal of classification in this paradigm is to predict bag-level labels from information spread over instances. In our study, we treat each of the 110 films in the dataset as a bag, and each musical cue within the film as an instance. A musical cue is a single timestamped instance of a track from the soundtrack that plays in the film.

Strong assumptions about the relationship between bags and instances are common, including the standard multiple instance (MI) assumption where a bag (movie) contains a label if and only if there exists at least one instance (a cue within that movie) that is tagged with that label. In this work, we make the soft bag assumption, which allows for a negative-labeled bag to contain positive instances [16]. In other words, a film can contain musical moments characteristic of genres that are outside its own.

Simple MI

Simple MI is an MI method in which a summarization function is applied to all instances within a bag, resulting in a single feature vector for the entire bag. Then, any number of classification algorithms can be applied to the resulting single instance classification problem. Here, the arithmetic mean is used as a straightforward summarization function, as applied in [17].
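
A minimal sketch of this approach, assuming cue-level feature matrices are already available as NumPy arrays and a single binary genre label per film (names here are illustrative, not taken from the released code):

```python
import numpy as np
from sklearn.svm import SVC

def simple_mi_features(bags):
    """Summarize each bag (film) by the arithmetic mean of its instances (cues)."""
    # bags: list of (n_cues_i, n_features) arrays, one per film
    return np.vstack([bag.mean(axis=0) for bag in bags])

def train_simple_mi(train_bags, train_labels):
    """Fit a standard single-instance classifier on the bag-level mean vectors."""
    X = simple_mi_features(train_bags)
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X, train_labels)
    return clf

def predict_simple_mi(clf, test_bags):
    return clf.predict(simple_mi_features(test_bags))
```

Under the binary relevance setup described later, one such classifier would be trained per genre.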

Instance majority voting

In instance majority voting, each instance within a given bag is naïvely assigned the labels of that bag, and a classifier is trained on all instances. Bag-level labels are then assigned during inference using an aggregation scheme, such as majority voting [18]. As an example, a movie that is labeled as a “comedy” would propagate that label to the cues it contains during model training, and then majority voting across cues would be used to predict the final label for the movie during inference.
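
A comparable sketch of instance majority voting for one binary genre label, again with illustrative names and a random forest base classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_imv(train_bags, train_labels):
    """Assign each instance its bag's label and train one classifier over all instances."""
    X = np.vstack(train_bags)
    y = np.concatenate([[lbl] * len(bag) for bag, lbl in zip(train_bags, train_labels)])
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X, y)
    return clf

def predict_imv(clf, bag):
    """Bag-level label by majority vote over the instance-level predictions."""
    votes = clf.predict(bag)
    return int(votes.mean() >= 0.5)
```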

Neural network approaches

Neural network approaches within an MIL framework have been used extensively for sound event detection (SED) tasks with weak labeling. Ilse et al. [19] proposed an attention mechanism over instances and demonstrated competitive performance on several benchmark MIL datasets. Wang et al. [20] compared the performance of five MIL pooling functions, including attention, and found that linear softmax pooling produced the best results. Kong et al. [18] proposed a new feature-level attention mechanism, where attention is applied to the hidden layers of a neural network. Gururani et al. [21] used an attention pooling model for a musical instrument recognition task, and found improved performance over other architectures, including recurrent neural networks. In this work, we compare each of these approaches for the task of predicting a film’s genre from its music.

Contribution of this work

In this work, we objectively examine the effect of musical features on perception of film. We curate and release a dataset of processed features from 110 popular films and soundtracks, and share the code we use for our experiments (https://github.com/usc-sail/mica-music-in-media). To our knowledge, this is the first study that applies deep learning models on musical features to predict a film’s genre. Additionally, we interpret these models via a permutation feature importance analysis on MIR features. This analysis suggests which interpretable musical features are most predictive of each film genre studied. Lastly, we conduct a novel investigation on the interaction between the musical and low-level visual features of film, finding that musical and visual modes may exhibit characteristics of different genres in the same film clips. We believe that this work also sets the foundation that can be extended to help us better understand music’s role as a significant and interactive cinematic device, and how viewers respond to the cinematic experience.

Research data collection and curation

Film and soundtrack collection

Soundtracks

We compiled an in-house list of the highest-grossing movies from 2014-2019 using Box Office Mojo (boxofficemojo.com). We identified 110 films from this database with commercially available soundtracks that include the original motion picture score, and purchased these soundtracks as MP3 digital downloads (see S1 Appendix for details).

Film genre

We labeled the genres of every film in our 110-film dataset by extracting genre tags from IMDb (imdb.com). Although IMDb lists 24 film genres, we only collect the tags of six genres for this study: Action, Comedy, Drama, Horror, Romance, and Science Fiction (Sci-Fi). This reduced taxonomy is well-attested in previous literature [1, 2224], and every film in our dataset represents at least one of these genres.

We use multi-label genre tags because many movies span more than one of the genres of interest. Further, we conjecture that these movie soundtracks would combine music that has characteristics from each genre in a label set. Statistics of the dataset that we use are given in Table 1.

Table 1. A breakdown of the 110 films in our dataset.

Only 33 of the films have a single genre tag; the other 77 films are multi-genre. A list of tags for every movie is given in S1 Appendix.

Genre Tag Number of Films
Action 55
Comedy 37
Drama 44
Horror 11
Romance 13
Science Fiction 36

Automatically extracting musical cues in film

We developed a methodology we call Score Stamper that automatically identifies and timestamps musical cues from a soundtrack that are used in its corresponding film. A given track from the soundtrack may be part of multiple cues if clips from that track appear in the film on multiple occasions.

The Score Stamper methodology uses Dejavu’s audio fingerprinting tool [25], which is robust to dialogue and sound effects. Default settings were used for all Dejavu parameters. The Score Stamper pipeline is explained in Fig 1. At the end of the Score Stamper pipeline, each film has several “cue predictions.”

Fig 1. The Score Stamper pipeline.


A film is partitioned into non-overlapping five-second segments. For every segment, Dejavu predicts whether a track from the film’s soundtrack is playing. Cues, or instances of a song’s use in a film, are built by combining window predictions. In this example, the “Cantina Band” cue lasts for 15 seconds because it was predicted by Dejavu in two nearby windows.
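
The merging step described in this caption can be sketched as follows; the per-segment recognition is treated as a black box, and the gap tolerance used to decide that two matches are “nearby” is an assumption:

```python
def build_cues(segment_matches, window_s=5, max_gap_s=10):
    """Merge per-window track predictions into timestamped cues.

    segment_matches: list of (start_time_s, track_name or None), one entry per
    non-overlapping five-second segment, in temporal order. Consecutive or
    nearby matches of the same track are merged into a single cue.
    """
    cues = []
    for start, track in segment_matches:
        if track is None:
            continue
        if cues and cues[-1]["track"] == track and start - cues[-1]["end"] <= max_gap_s:
            cues[-1]["end"] = start + window_s  # extend the current cue
        else:
            cues.append({"track": track, "start": start, "end": start + window_s})
    return cues

# e.g. two nearby "Cantina Band" windows become one 15-second cue (5 s to 20 s)
example = [(0, None), (5, "Cantina Band"), (10, None), (15, "Cantina Band"), (20, None)]
print(build_cues(example))
```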

We evaluated Score Stamper’s prediction performance on a test set of three films: “The Lion King” (1994), “Love Actually,” and “Star Wars: Episode IV—A New Hope.” These films were selected to determine if Dejavu would be robust to markedly different film genres, with different composers, directors, and instrumentation for each set of cues. Additionally, “Love Actually” uses a compilation soundtrack, while the other two films feature composed soundtracks. Musical cues were annotated manually in-house. A total of 84 cues, spanning 162 minutes, were identified across the three films.

Score Stamper’s predictions reached an average precision of 0.94 (SD = .012) and an average recall of 0.47 (SD = .086). We deemed these metrics acceptable for the purposes of this study, as a high precision score indicates that almost every cue prediction Dejavu provides will be correct, assuming these test results generalize to the other films in our dataset. The recall is sufficient because the cues recognized are likely the most influential on audience response, as they are included on the commercial soundtrack and mixed clearly over sound effects and dialogue in the film. High recall is made difficult or impossible by several confounding factors: the omission of some songs in a film from its purchasable soundtrack, variations on the soundtrack rendition of the song, and muted placement of songs in the mix of the film’s audio.

This result also suggests that Score Stamper overcomes a limitation encountered in previous studies: in prior work, the whole soundtrack was used for analysis (which could be spurious given that soundtrack songs are sometimes not entirely used, or even used at all, in a film) [1, 8, 26]. By contrast, only the music found in a film is used in this analysis. Another benefit of this method is a timestamped ordering of every cue, opening up opportunity for more detailed temporal analysis of music in film.

Musical feature extraction

MIR features

Past research in movie genre classification suggests that auditory features related to energy, pitch, and timbre are predictive of film genre [27]. We apply a process similar to that of [1, 28, 29] in this study: we extract features that relate to dynamics, pitch, rhythm, timbre, and tone using the eponymous functions in MATLAB’s MIRtoolbox [30] with default parameters. Spectral flatness and spectral crest are not available in MIRtoolbox, so we compute them using the eponymous functions in Audio Toolbox [31] with default parameters (see Table 2). To capture high-level information and align features extracted at different frequencies, all features are then “texture-windowed” by calculating the mean and standard deviation of five-second windows with 33% overlap, as in [32].

Table 2. Auditory features used and feature type.
Feature Type Feature
Dynamics RMS Energy
Pitch Chroma
Rhythm Pulse Clarity [33], Tempo
Timbre MFCCs, ΔMFCCs, ΔΔMFCCs, Roughness, Spectral Centroid, Spectral Crest, Spectral Flatness, Spectral Kurtosis, Spectral Skewness, Spectral Spread, Zero-crossing Rate
Tone Key Mode, Key Strength, Spectral Brightness, Spectral Entropy, Inharmonicity
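
The texture-windowing step described above can be sketched as follows; the exact hop implied by “33% overlap” and the input frame rate are assumptions:

```python
import numpy as np

def texture_window(features, frame_rate_hz, win_s=5.0, overlap=0.33):
    """Mean and standard deviation of frame-level features over 5-second windows.

    features: (n_frames, n_dims) array of frame-level MIR features.
    Returns a (n_windows, 2 * n_dims) array of [mean, std] per window.
    """
    win = int(round(win_s * frame_rate_hz))
    hop = max(1, int(round(win * (1.0 - overlap))))
    out = []
    for start in range(0, max(1, features.shape[0] - win + 1), hop):
        chunk = features[start:start + win]
        out.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.array(out)
```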

VGGish features

In addition to the aforementioned features, we also extract embeddings from every cue using VGGish’s pretrained model [34]. In this framework, 128 features are extracted from the audio every 0.96 seconds, which we resample to 1 Hz to align with the MIR features. These embeddings have shown promise in tasks like audio classification [35], music recommendation [36], and movie event detection [37]. We compare the utility of these features with that of the MIR features.
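
The VGGish extraction itself is not reproduced here. Given an (n_frames, 128) embedding matrix with one row per 0.96 s, a nearest-frame resampling to the 1 Hz grid of the MIR features might look like the following sketch (the paper’s exact resampling method may differ):

```python
import numpy as np

def resample_to_1hz(embeddings, hop_s=0.96, duration_s=None):
    """Nearest-frame resampling of VGGish embeddings (one row per 0.96 s) to 1 Hz."""
    n_frames = embeddings.shape[0]
    if duration_s is None:
        duration_s = int(n_frames * hop_s)
    frame_times = np.arange(n_frames) * hop_s        # times of existing frames
    target_times = np.arange(duration_s)             # 1 Hz target grid
    idx = np.abs(frame_times[None, :] - target_times[:, None]).argmin(axis=1)
    return embeddings[idx]
```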

Visual features

Following previous works in low-level visual analysis of films [14, 15], we extract two features from each film in our dataset: brightness and contrast. These features were sampled at 1 Hz to align with musical features. Brightness and contrast were calculated as in [14], given by:

B_t = \frac{1}{|P_t|} \sum_{p \in P_t} B_t(p)    (1)

and

C = (B_{Max} - B_{Min}) \left( 1 - \left| Area_{B_{Max}} - Area_{B_{Min}} \right| \right)    (2)

where C is the contrast, P_t is the set of all pixels onscreen at timestep t, B_t(p) is the brightness at pixel p at timestep t, and B_{Max} and B_{Min} refer to the maximum and minimum average brightness across pixels, evaluated per timestep.
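
A minimal sketch of the brightness feature in Eq 1, sampled at 1 Hz with OpenCV; Eq 2 additionally requires the screen-area terms defined in [14], which are not reproduced here, so only brightness is shown, and the sampling details are assumptions:

```python
import cv2
import numpy as np

def brightness_per_second(video_path):
    """Approximate B_t (Eq 1): mean pixel brightness of one frame per second."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 24.0
    values, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % int(round(fps)) == 0:  # sample one frame per second
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # luma as brightness proxy
            values.append(float(np.mean(gray)))
        frame_idx += 1
    cap.release()
    return np.array(values)
```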

Methods

Genre prediction model training procedure

In order to select the model architecture which could best predict film genre from musical features, we use leave-one-out cross-validation, meaning that a model is trained for each of the 110 films in the corpus using the other 109 films. As the ground truth label for each movie can contain multiple genres, the problem of predicting associated genres was posed as multi-label classification. For Simple MI and Instance Majority Voting approaches, the multi-label problem is decomposed into training independent models for each genre, in a method called binary relevance. The distribution of genre labels is unbalanced, with 55 films receiving the most common label (Action), and only 11 films receiving the least common label (Horror). In order to properly evaluate model performance across all genres, we calculate precision, recall, and F1-score separately for each genre, and then report the macro-average of each metric taken over all genres.
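
As an illustrative sketch of the evaluation step only (the leave-one-out predictions are assumed to have been stacked into binary matrices; this is not the authors’ training code):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

GENRES = ["Action", "Comedy", "Drama", "Horror", "Romance", "Sci-Fi"]

def macro_metrics(y_true, y_pred):
    """y_true, y_pred: (n_films, 6) binary matrices of leave-one-out predictions."""
    per_genre = {}
    for g, name in enumerate(GENRES):
        p, r, f1, _ = precision_recall_fscore_support(
            y_true[:, g], y_pred[:, g], average="binary", zero_division=0)
        per_genre[name] = (p, r, f1)
    # macro-average: unweighted mean of per-genre precision, recall, and F1
    macro = tuple(np.mean([v[i] for v in per_genre.values()]) for i in range(3))
    return per_genre, macro
```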

Model architectures

For the genre prediction task, we compare the performance of several MIL model architectures. First, we explore a Simple MI approach where instances are averaged with one of the following base classifiers: random forest (RF), support vector machine (SVM), or k-nearest neighbors (kNN). Using the same base classifiers, we also report the performance of an instance majority voting approach.

For neural network-based models, the six different pooling functions shown in Table 3 are explored. We adopt the architecture given in Fig 2, which has achieved state-of-the-art performance on sound event detection (SED) tasks [18]. Here, the input feature representation is first passed through three dense embedding layers before going into the pooling mechanism. At the output layer, we convert the soft output to a binary prediction using a fixed threshold of 0.5. A form of weighted binary cross-entropy was used as the loss function, where weights for the binary positive and negative class for each genre are found by using the label distribution for the input training set. An Adam optimizer [38] with a learning rate of 5e-5 was used in training, and the batch size was set to 16. Each model was trained for 200 epochs.

Table 3. The six pooling functions, where x_i refers to the embedding vector of instance i in a bag B and k indexes a particular element of the output vector h.

In the multi-attention equation, L refers to the attended layer and w is a learned weight. The attention module outputs are concatenated before being passed to the output layer. In the feature-level attention equation, q(⋅) is an attention function on a representation of the input features, u(⋅).

Max pooling: h_k = \max_i x_{ik}
Average pooling: h_k = \frac{1}{|B|} \sum_i x_{ik}
Linear softmax: h_k = \frac{\sum_i x_{ik}^2}{\sum_i x_{ik}}
Single attention: h_k = \frac{\sum_i w_{ik} x_{ik}}{\sum_i w_{ik}}
Multi-attention: h_k^{(L)} = \frac{\sum_i w_{ik}^{(L)} x_{ik}^{(L)}}{\sum_i w_{ik}^{(L)}}
Feature-level attention: h_k = \sum_{x \in B} q(x)_k u(x)_k

Fig 2. Neural network model architecture.

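A hedged PyTorch sketch of the single-attention variant of this architecture: three dense embedding layers followed by attention pooling over the instances in a bag, as in Table 3 and Fig 2. Layer sizes, activation choices, and the sigmoid attention weights are assumptions rather than the authors’ exact implementation, and the per-genre class weighting of the loss is omitted:

```python
import torch
import torch.nn as nn

class SingleAttentionMIL(nn.Module):
    def __init__(self, in_dim=128, hidden=128, n_genres=6):
        super().__init__()
        # three dense embedding layers
        self.embed = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.classify = nn.Linear(hidden, n_genres)  # per-instance genre scores (x_ik)
        self.attend = nn.Linear(hidden, n_genres)    # per-instance attention weights (w_ik)

    def forward(self, bag):                          # bag: (n_instances, in_dim)
        h = self.embed(bag)                          # (n_instances, hidden)
        scores = torch.sigmoid(self.classify(h))
        weights = torch.sigmoid(self.attend(h))
        # single-attention pooling from Table 3: weighted mean over instances
        pooled = (weights * scores).sum(0) / weights.sum(0).clamp_min(1e-8)
        return pooled                                # (n_genres,) soft outputs in [0, 1]

model = SingleAttentionMIL()
loss_fn = nn.BCELoss()                               # weighted BCE in the paper; unweighted here
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
```

A fixed threshold of 0.5 on the pooled outputs would then yield the binary genre predictions described above.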

Frame-level and cue-level features

For each cue predicted by Score Stamper, a sequence of feature vectors grouped into frames is produced (either VGGish feature embeddings or hand-crafted MIR features). For instance, a 10-second cue represented using VGGish features will have a sequence length of 10 and a feature dimension of 128. One way to transform the problem to an MIL-compatible representation is to simply treat all frames for every cue as instances belonging to a movie-level bag, ignoring any ordering of the cues. This approach is called frame-level representation.

A simplifying approach is to construct cue-level features by averaging frame-level features per cue, resulting in a single feature vector for each cue. Using MIL terminology, these cue-level feature vectors then become the instances belonging to the film, which is a “bag.” We evaluate the performance of each model type when frame-level features are used and when cue-level features are used.
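
A short sketch of the cue-level construction (frame-level features per cue are assumed to be NumPy arrays):

```python
import numpy as np

def cue_level_bag(frames_per_cue):
    """frames_per_cue: list of (n_frames_j, n_dims) arrays, one per cue in a film.

    Returns the film's bag of cue-level instances: one mean vector per cue.
    """
    return np.vstack([frames.mean(axis=0) for frames in frames_per_cue])
```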

Results

Genre prediction

Table 4 shows the performance of several model architectures on the 110-film dataset, using either VGGish features or MIR features as input. All of our models outperform both a random guess baseline, using class frequencies, and a zero rule baseline, where the most common (plurality) label set is predicted for all instances. We observe that a previous study, which predicted uni-labeled film genres from music tracks, reported a macro F1-score of 0.54 [1]. While important aspects of the two studies differ (track-level vs. film-level prediction, uni-label vs. multi-label genre tags), macro-F1 scores of 0.62 from our best-performing models demonstrate improved performance on an arguably more difficult task.

Table 4. Classification results on the 110-film dataset.

Performance metrics using leave-one-out cross-validation for each cue-level feature model are reported. IMV stands for Instance Majority Voting; FL Attn for Feature-Level Attention. Simple MI and IMV results represent performance with the best base classifier (kNN, SVM, and random forest were tried). All models reported mean-averaged precision significantly better than the random guess baseline (p <.01), as given by a paired t-test.

Features Model Precision Recall F1-Score
None Random Guess .32 .32 .32
Plurality Label .14 .34 .19
VGGish kNN—Simple MI .64 .59 .61
SVM—IMV .64 .42 .44
Max Pooling .56 .48 .49
Avg. Pooling .55 .78 .62
Linear Softmax .62 .59 .59
Single Attn .60 .73 .65
Multi-Attn .45 .72 .52
FL Attn .53 .74 .57
MIR SVM—Simple MI .60 .52 .56
SVM—IMV .55 .40 .42
Max Pooling .40 .10 .15
Avg. Pooling .55 .78 .61
Linear Softmax .55 .61 .57
Single Attn .49 .76 .55
Multi-Attn .44 .70 .51
FL Attn .53 .67 .56

We note that cue-level feature representations outperform frame-level feature representations across all models, so only values from cue-level feature models are reported. We further observe that Simple MI and IMV approaches perform better in terms of precision, recall, and F1-score when using VGGish features than when using MIR features. This result makes sense, as VGGish embeddings are already both semantically meaningful and compact, allowing for these relatively simple models to produce competitive results. Indeed, we find that Simple MI with an SVM as a base classifier on VGGish features produces the highest precision of all the models we tested. We report precision-recall curves for the top-performing MIR and VGGish models in S2 Appendix. In S3 Appendix, we present a scatter plot with precision and recall for each film (micro-averaged across all cues), for both VGGish and MIR average pooling models.

Finally, we observe that models trained using VGGish features generally outperform their counterparts trained using MIR features. Here, we note that the overall best-performing model in terms of macro-averaged F1-score is a single-attention model with 128 nodes per hidden layer, and trained using VGGish features. Interestingly, pooling mechanisms that are most consistent with the standard MI assumption—Max Pooling and Linear Softmax Pooling [20]—perform worse than other approaches. This result is consistent with the idea that a film’s genre is characterized by all the musical cues in totality, and not by a single musical moment.

Musical feature relevance scoring

To determine the importance of different musical features toward predicting each film genre, we used the method of Permutation Feature Importance (PFI), as described in [39]. PFI scores the importance of each feature by evaluating how prediction performance degrades after randomly permuting the values of that feature across all validation set examples. The feature importance score s_k for feature k is calculated as:

s_k = 1 - \frac{F1_{perm_k}}{F1_{orig}}    (3)

where F1_{perm_k} is the F1-score of the model across all leave-one-out cross-validation instances with feature k permuted, and F1_{orig} is the F1-score of the model without any permutations. A high score s_k means that the model’s performance degraded heavily when feature k was permuted, indicating that the model relies on that feature to make predictions. This analysis was used to provide an understanding of which features contributed the most to genre predictions, not to provide the best-performing model.
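
A sketch of the group-wise permutation procedure used below; the re-scoring of the model is abstracted behind a hypothetical callable `f1_with_features`, and Eq 3 is applied per run:

```python
import numpy as np

def group_pfi(X, feature_groups, f1_with_features, f1_orig, n_runs=100, seed=0):
    """X: (n_examples, n_features); feature_groups: dict name -> list of column indices.

    Each feature in a group is permuted independently across examples, and the
    importance score is averaged over n_runs (Eq 3).
    """
    rng = np.random.default_rng(seed)
    scores = {}
    for name, cols in feature_groups.items():
        run_scores = []
        for _ in range(n_runs):
            X_perm = X.copy()
            for c in cols:  # permute each feature in the group independently
                X_perm[:, c] = rng.permutation(X_perm[:, c])
            run_scores.append(1.0 - f1_with_features(X_perm) / f1_orig)
        scores[name] = float(np.mean(run_scores))
    return scores
```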

To generate the F1-scores, we used our best-performing model trained on MIR features: an average-pooling model with 64 nodes per hidden layer (F1-score = 0.61). We did not analyze a model trained on VGGish features, because VGGish features are not interpretable: a PFI analysis using these features would not illuminate which human-understandable musical qualities contribute most to genre predictions. Since we had a large set of 140 features, many of which were closely related, we performed PFI on feature groups rather than individual features, as in [40]. We evaluated eight feature groups: MFCCs, ΔMFCCs, ΔΔMFCCs, Dynamics, Pitch, Rhythm, Timbre, and Tone. One feature group was created for each feature type in Table 2 (see section “Research data collection and curation”). MFCCs, ΔMFCCs, and ΔΔMFCCs were separated from the “Timbre” feature type into their own feature groups in order to prevent one group from containing a majority of the total features (and thus having an overwhelmingly high feature importance score). For each feature group, we randomly permuted each feature independently of the others to remove any information encoded in the interactions between those features. We report results averaged over 100 runs in order to account for the effects of randomness. The results of our PFI analysis are shown in Fig 3.

Fig 3. Feature importance by genre and feature group, reported with 95% CI error bars.


Fig 3 shows that MFCCs were the highest scoring feature group in every genre. Across all genres (i.e., the “Overall” column in Fig 3), the next-highest scoring feature groups were Tone, Pitch, and ΔMFCCs. This corroborates past research finding MFCCs to be the best-performing feature group for various music classification tasks [41, 42]. MFCC and ΔMFCC features were the only ones to significantly degrade performance when permuted for the Comedy and Drama genres, suggesting that those genres may be most characterized by timbral information encoded in MFCCs.

The model relied heavily on the Tone feature group in distinguishing Horror films. Brownrigg [5] qualitatively posits that atonal music or “between-pitch” sounds are characteristic of Horror film music, and the model’s reliance on Tone features (including key mode, key strength, spectral brightness, spectral entropy, and inharmonicity) supports this notion. The Tone feature group also scored highly for the Romance and Sci-Fi genres, whose scores often include modulations in texture and key strength or mode to evoke feelings of passion and wonder, respectively.

Finally, we note that the model’s predictions for Horror and Romance exhibited greater score variance during feature permutation than for the other genres, likely because Horror and Romance were under-represented in the 110-film corpus.

Musical-visual cross-modal analysis

To investigate whether visual features associated with a genre correlate with music that the model has learned to be characteristic of that genre, we compare median screen brightness and contrast from film clips with labeled musical cues. For instance: if the model finds that music from a given film is highly characteristic of Comedy (regardless of the actual genre labels of the film), do we observe visual features in that film that are characteristic of Comedy?

We consider three different sources of genre labels: the true labels, the predicted labels from the best-performing model, and the predicted genre labels where only false positives are counted (that is, true positive genre predictions are removed from the set of all genre predictions). By comparing the brightness and contrast averages using the actual and predicted labels, we can analyze whether musical patterns that the model finds characteristic of each genre correspond to visual patterns typical of the genre.

We use a single-attention pooling model trained on VGGish features (F1-score = 0.65). For each genre, we report the difference between the median brightness or contrast value in film clips labeled with that genre against the median value of all other clips. Table 5 shows the results.
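
The comparison reported in Table 5 can be sketched as follows for one genre and one visual feature; the Bonferroni-corrected Mann-Whitney U test follows the table caption below, and variable names are illustrative:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def genre_median_difference(values, has_genre, alpha=0.01, m=6):
    """values: per-clip brightness (or contrast); has_genre: boolean mask per clip.

    Returns the median difference between in-genre and out-of-genre clips and
    whether it is significant after Bonferroni correction over m genres.
    """
    in_genre, out_genre = values[has_genre], values[~has_genre]
    diff = np.median(in_genre) - np.median(out_genre)
    _, p = mannwhitneyu(in_genre, out_genre, alternative="two-sided")
    return diff, p < (alpha / m)
```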

Table 5. Difference in median brightness and contrast (×10¹) across all films labeled with a given genre against median brightness and contrast of the set of films excluding the given genre.

Bold values show a statistically significant difference, as given by a Mann-Whitney U test with Bonferroni correction (α = 0.01, m = 6) between the median of films including a given genre against those excluding it, within a given prediction source (Actual, Predicted, or False Positive).

Brightness
Actual Predicted False Positive
Action 0.08 0.06 -0.18
Comedy 0.23 0.19 0.16
Drama -0.07 0.01 0.31
Horror -0.70 -0.44 -0.15
Romance 0.11 0.17 0.16
Sci-Fi 0.01 -0.04 -0.07
Contrast
Actual Predicted False Positive
Action -0.08 0.00 0.13
Comedy 0.53 0.45 0.38
Drama -0.25 -0.16 0.31
Horror -0.28 -0.25 0.27
Romance 0.35 0.16 -0.08
Sci-Fi 0.30 0.05 -0.20

From the “Actual” metrics, we observe that for both brightness and contrast, our dataset follows the trends illustrated in [14]: Comedy and Romance films have high average brightness and contrast, while Horror films have the lowest values for both features. However, we also note that clips from Sci-Fi films in our dataset also have high contrast, which differs from the findings of [14].

When comparing the brightness and contrast of clips by their “Predicted,” rather than “Actual,” genre, we note that the same general trends are present, but tend more toward the global median for both metrics. This movement toward the median suggests that the musical styles the model has learned to associate with each film genre do not necessarily correspond to their visual styles; e.g., a clip with music befitting Comedy may not keep the Comedy-style visual attributes of high brightness and contrast. This gap is partially explainable by the fact that the model has imperfectly learned the musical differences between genres. However, insofar as the model has learned an approximation of musical characteristics distinguishing film genres, we contend that the difference between the “Actual” and “Predicted” visual averages is an approximation of the difference between visual styles in a film’s labeled genre(s) against those genre(s) that its music alone would imply.

To further support this notion, we present the “False Positive” measure, which isolates the correlation between musical genre characteristics and visual features in movies outside that genre. For instance, in an Action movie with significant Romance musical characteristics (causing the model to assign a high Romance confidence score), do we observe visual features associated with Romance? For half of the genres’ brightness values, and a majority of the genres’ contrast values, we actually found the opposite: “False Positive” metrics tended in the opposite direction to the “Actual” metrics. This unexpected result warrants further study, but we suspect that even when musical style subverts genre expectations in a film, the visual style may stay consistent with the genre, causing the observed discrepancies between the two modes.

Conclusion

In this study, we quantitatively support the notion that characteristic music helps distinguish major film genres. We find that a supervised neural network model with attention pooling produces competitive results for multi-label genre classification. We use the best-performing MIR feature model to show that MFCC and tonal features are most suggestive of differences between genres. Finally, we investigate the interaction between musical and low-level visual features across film genres, but do not find evidence that music characteristic of a genre implies low-level visual features common in that genre. This work has applications in film, music, and multimedia studies.

Supporting information

S1 Appendix. Complete list of films used in this study.

(PDF)

S2 Appendix. Precision-recall curves for top-performing MIR and VGGish models.

(PDF)

S3 Appendix. Scatter plot displaying precision and recall for each film (micro-averaged across all cues), for both VGGish and MIR average pooling models.

(PDF)

S1 Data

(ZIP)

Data Availability

All relevant data are within the paper and its Supporting information files.

Funding Statement

The study was done at the Center for Computational Media Intelligence at USC, which is supported by a research award from Google. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Austin A, Moore E, Gupta U, Chordia P. Characterization of movie genre based on music score. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE; 2010. p. 421–424.
  • 2. Lang E, West G. Musical Accompaniment of Moving Pictures: A Practical Manual for Pianists and Organists and an Exposition of the Principles Underlying the Musical Interpretation of Moving Pictures. Boston Music Company; 1920.
  • 3. Gorbman C. Unheard melodies: Narrative film music. Indiana University Press; 1987.
  • 4. Rodman R. The popular song as leitmotif in 1990s film. In: Changing tunes: The use of pre-existing music in film. Routledge; 2017. p. 119–136.
  • 5. Brownrigg M. Film music and film genre. University of Stirling; 2003.
  • 6. Xu M, Chia LT, Jin J. Affective content analysis in comedy and horror videos by audio emotional event detection. In: 2005 IEEE International Conference on Multimedia and Expo. IEEE; 2005. 4 pp.
  • 7. Hanjalic A. Extracting moods from pictures and sounds: Towards truly personalized TV. IEEE Signal Processing Magazine. 2006;23(2):90–100. doi: 10.1109/MSP.2006.1621452
  • 8. Gillick J, Bamman D. Telling stories with soundtracks: an empirical analysis of music in film. In: Proc. of the 1st Workshop on Storytelling; 2018. p. 33–42.
  • 9. Chion M, Gorbman C, Murch W. Audio-vision: sound on screen. Columbia University Press; 1994.
  • 10. Cohen AJ. Music as a source of emotion in film. In: Music and emotion: Theory and research. 2001. p. 249–272.
  • 11. Wingstedt J, Brändström S, Berg J. Narrative Music, Visuals and Meaning in Film. Visual Communication. 2010;9(2):193–210. doi: 10.1177/1470357210369886
  • 12. Simonetta F, Ntalampiras S, Avanzini F. Multimodal music information processing and retrieval: Survey and future challenges. In: 2019 International Workshop on Multilayer Music Representation and Processing (MMRP). IEEE; 2019. p. 10–18.
  • 13. Wingstedt J. Narrative music: towards an understanding of musical narrative functions in multimedia. Luleå tekniska universitet; 2005.
  • 14. Chen I, Wu F, Lin C. Characteristic color use in different film genres. Empirical Studies of the Arts. 2012;30(1):39–57. doi: 10.2190/EM.30.1.e
  • 15. Tarvainen J, Westman S, Oittinen P. The way films feel: Aesthetic features and mood in film. Psychology of Aesthetics, Creativity, and the Arts. 2015;9(3):254. doi: 10.1037/a0039432
  • 16. Herrera F, Ventura S, Bello R, Cornelis C, Zafra A, Sánchez-Tarragó D, et al. Multiple Instance Learning: Foundations and Algorithms. Springer; 2016.
  • 17. Dong L. A Comparison of Multi-instance Learning Algorithms. The University of Waikato; 2006.
  • 18. Kong Q, Yu C, Xu Y, Iqbal T, Wang W, Plumbley M. Weakly Labelled AudioSet Tagging With Attention Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2019;27(11):1791–1802. doi: 10.1109/TASLP.2019.2930913
  • 19. Ilse M, Tomczak J, Welling M. Attention-based deep multiple instance learning. CoRR. 2018;abs/1802.04712.
  • 20. Wang Y, Li J, Metze F. A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling; 2018.
  • 21. Gururani S, Sharma M, Lerch A. An attention mechanism for musical instrument recognition. In: Proc. of the International Society for Music Information Retrieval Conference; 2019.
  • 22. Zhou H, Hermans T, Karandikar AV, Rehg JM. Movie genre classification via scene categorization. In: Proc. of the 18th ACM International Conference on Multimedia; 2010. p. 747–750.
  • 23. Rasheed Z, Shah M. Movie genre classification by exploiting audio-visual features of previews. In: Object recognition supported by user interaction for service robots. vol. 2. IEEE; 2002. p. 1086–1089.
  • 24. Simões GS, Wehrmann J, Barros RC, Ruiz DD. Movie genre classification with convolutional neural networks. In: 2016 International Joint Conference on Neural Networks (IJCNN). IEEE; 2016. p. 259–266.
  • 25. Drevo W. Audio Fingerprinting with Python and Numpy; 2013.
  • 26. Shan MK, Kuo FF, Chiang MF, Lee SY. Emotion-based music recommendation by affinity discovery from film music. Expert Systems with Applications. 2009;36(4):7666–7674. doi: 10.1016/j.eswa.2008.09.042
  • 27. Jain S, Jadon R. Movies genres classifier using neural network. In: 2009 24th International Symp. on Computer and Information Sciences. IEEE; 2009. p. 575–580.
  • 28. Eerola T. Are the emotions expressed in music genre-specific? An audio-based evaluation of datasets spanning classical, film, pop and mixed genres. Journal of New Music Research. 2011;40(4):349–366. doi: 10.1080/09298215.2011.602195
  • 29. Greer T, Ma B, Sachs M, Habibi A, Narayanan S. A Multimodal View into Music’s Effect on Human Neural, Physiological, and Emotional Experience. In: Proc. of the 27th ACM International Conference on Multimedia; 2019. p. 167–175.
  • 30. Lartillot O, Toiviainen P, Eerola T. A MATLAB toolbox for music information retrieval. In: Data analysis, machine learning and applications. Springer; 2008. p. 261–268.
  • 31. MathWorks. MATLAB Audio Toolbox; 2019.
  • 32. Tzanetakis G, Cook P. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing. 2002;10(5):293–302. doi: 10.1109/TSA.2002.800560
  • 33. Lartillot O, Eerola T, Toiviainen P, Fornari J. Multi-Feature Modeling of Pulse Clarity: Design, Validation and Optimization. In: ISMIR. Citeseer; 2008. p. 521–526.
  • 34. Hershey S, Chaudhuri S, Ellis D, Gemmeke J, Jansen A, Moore R, et al. CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE; 2017. p. 131–135.
  • 35. El Hajji M, Daniel M, Gelin L. Transfer Learning based Audio Classification for a noisy and speechless recordings detection task, in a classroom context. In: Proc. SLaTE 2019: 8th ISCA Workshop on Speech and Language Technology in Education; 2019. p. 109–113.
  • 36. Lee S, Lee J, Lee K. Content-based feature exploration for transparent music recommendation using self-attentive genre classification; 2018.
  • 37. Ziai A. Detecting Kissing Scenes in a Database of Hollywood Films. CoRR. 2019;abs/1906.01843.
  • 38. Kingma D, Ba J. Adam: A Method for Stochastic Optimization; 2014.
  • 39. Molnar C. Interpretable Machine Learning; 2019.
  • 40. Ma B, Greer T, Sachs M, Habibi A, Kaplan J, Narayanan S. Predicting Human-Reported Enjoyment Responses in Happy and Sad Music. In: 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII); 2019. p. 607–613.
  • 41. Kim YE, Schmidt EM, Migneco R, Morton BG, Richardson P, Scott J, et al. Music emotion recognition: A state of the art review. In: Proc. ISMIR. Citeseer; 2010. p. 255–266.
  • 42. Eronen A. Comparison of features for musical instrument recognition. In: Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575); 2001. p. 19–22.

Decision Letter 0

Stavros Ntalampiras

4 Dec 2020

PONE-D-20-31042

A computational lens into how music characterizes genre in film

PLOS ONE

Dear Dr. Ma,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 18 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Stavros Ntalampiras

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Financial Disclosure section:

"The study was done at the Center for Computational Media Intelligence at USC, which is supported by a research award from Google. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

We note that you received funding from a commercial source: Google.

Please provide an amended Competing Interests Statement that explicitly states this commercial funder, along with any other relevant declarations relating to employment, consultancy, patents, products in development, marketed products, etc.

Within this Competing Interests Statement, please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests).  If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include your amended Competing Interests Statement within your cover letter. We will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This work presents interesting and original findings about the use of music genre in films. Authors provide a proper literature review and the language is clear.

=======

Methods

=======

You could provide more details about p-values and statistical significance, but, honestly, I do not mind about this too much. The main issue, however, is that the difference among the models used should be inspected further. One method that I suggest is to show the results (i.e. tables 4 and 5) using violin plots, which allow a general qualitative overview of the distribution without falling into type I and II errors. Moreover, violin plots are easy to build. Another option is to just use a scatter plot in a Precision-Recall space and different colors for different models.

You say that you have used p-value for checking results in table 5, but what test have you used? Why have you chosen that test? Have you corrected it with some method (e.g. Bonferroni/Holm methods)? Have you used a multi-distribution test such as Kruskal-Wallis or ANOVA?

About the features, why have you chosen those features? Which previous studies have you followed? Have you simply used the Matlab standard features? If yes, why?

In figure 3, you show the feature importance plot for MIR features; I did not get why you have not computed the feature importance for the VGG features: reducing the number of features used for a classification model may lead to an increase in the overall performance. Again, why have you not used confidence intervals/violin plots/box plots or similar in this figure? It is hard to understand the importance of each feature otherwise.

============

Ease of reading

============

You should also declare more precisely the contribution of your work in the abstract and possibly in the introduction: the sentence that is there now is almost useless.

To my understanding, the paragraph "Visual-musical cross-modal analysis" has almost no reason to be. You repeat everything later, while previous works should be put in "Related works". Note that an extensive survey about multi-modal and cross-modal music studies exists [1].

Paragraph "Multiple Instance Learning" is very unclear. You should say as soon as possible what is a bag and what is an instance in your study. Everything should then be referenced to your case (e.g. hypothesis etc.). This allows the reader to understand MPI with a concrete example.

Results reported about ScoreStamper in paragraph "Automatically extracting musical cues in film" are not scientifically useful. You have tested it on only 3 bags. How many instances, stamps, and multiple occasions were there? How different were the music pieces in the soundtrack? The reasons given for the low recall are unclear.

You should describe with more details the structure of the models used, even if these models were already used in previous papers.

=========

References

=========

[1] F. Simonetta, S. Ntalampiras, and F. Avanzini, “Multimodal Music Information Processing and Retrieval: Survey and Future Challenges,” in Proceedings of 2019 International Workshop on Multilayer Music Representation and Processing, Milan, Italy, 2019, pp. 10--18. https://arxiv.org/abs/1902.05347

Reviewer #2: This paper presents a study on the use of films' soundtracks to help automatic classification of their broad genre (e.g. Action, Comedy). To do so, the authors 1) curated a new dataset of films and their corresponding soundtracks with automatic fingerprinting annotations; 2) explored different features for describing the music content of soundtracks, in particular they explored common MIR features related to dynamics, rhythm, pitch, timbre and tone (such as MFCCs, tempo and chroma) and deep learning derived features (VGGish); 3) investigated different variations of classifiers and problem definitions (SVMs, kNNs, DNNs, Multiple Instance Learning, different pooling strategies); 4) performed an ablation study on the importance of the MIR features for the classification of the different genres and its relation with simple image clues such as brightness and contrast.

The paper is well written, has good references to previous work and a thoughtful discussion of the results, so I'd recommend it to be accepted. Even though the paper is in good shape already, there is room for improvement in the structure and more details about implementation and experiments should be added. See my comments below.

Improvement in structure ===

In my opinion the contributions should be better highlighted. It is a bit difficult to understand what exactly the contributions are with respect to previous work since the discussion in related work doesn't clearly highlight the limitations in the literature besides some isolated comments spread out in the text (i.e. the first clear statement I saw was in line 46). The innovations in methodology, and the analysis should be also listed as contributions if they're more extensive than previous works.

More detail on the experiments ===

Authors should explain how they assessed statistical significance. It is mentioned in the text but not clarified. Also it would be good to have more details on the network architecture besides a reference (capacity or number of parameters, layer sizes, input size a bit more clearly). Note that the dataset is not very big and this raises some questions on the suitability of the architecture that could be partially answered with more information on the implementation.

I think some reference to the performance of previous works is needed to understand if the models presented here are performing reasonably (which they seem to be when compared to [1]) but an explicit comment would help understand the work better.

I'm not sure I followed the conclusions in L298-L302 about the music style of a clip. The conclusion is that because the brightness and contrast in the clips using predicted labels are not correlating with the "expected behaviour" of each genre then the music in the clip doesn't necessarily correspond to the visual style? Then why do you see that effect in the clips when you use "actual labels"? Could it be an artifact of the model's performance?

Nit comments ===

- The figures are not in the main text, not sure if this is an artifact of the reviewing template or something to correct.

- L154: Briefly explain "texture windowed"

- L166: Would prefer a short explanation on how brightness and contrast were obtained and refer to paper for further details.

- L221: I don't understand this phrase, couldn't parse it.

- L283 - L286: Are you trying to say that investigating brightness and contrast mean scores on the model's predictions help understand associations between those visual features and what the model learned? And that could potentially be applied to unknown genres? Maybe rephrase to make it clearer.

Reviewer #3: The authors build on research in the field, utilizing well-constructed computational models to retrieve information and data to help us understand how film music operates across several genres and interacts with other film modalities. They have applied this to over 100 films, and I find that their approach in identifying the music from the soundtracks that is actually timestamped in the film itself is a sound and even essential one. Though this study does involve necessary technical information appropriate for a study like this, they are careful to take the reader step-by-step through their process, carefully explaining terms, and leading us to their conclusions in a logical, cogent manner. They have also provided strong data in support of the study. I believe that this study can open the way to further research, as they even suggest in the paper. As a practicing musician and musical scholar myself, I will be interested in seeing this lead to further published work that will help us better understand music’s role as a significant and interactive cinematic device, and how viewers respond to the cinematic experience, emotively, perceptively, and cognitively.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Joseph L. Rivers Jr.

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Apr 8;16(4):e0249957. doi: 10.1371/journal.pone.0249957.r002

Author response to Decision Letter 0


25 Jan 2021

(This response is also available in the Response to Reviewers file upload.)

To the academic editor and reviewers:

Thank you for your insightful comments on our manuscript submission. We have reviewed your comments one-by-one and prepared a revised document with changes based on your feedback. In this letter, we describe each reviewer comment and the changes we have made to address it.

We received editorial feedback to amend our Competing Interests statement to acknowledge this study’s partial funding from a corporation, Google. We have updated our Cover Letter to include an amended Competing Interests statement. Please let us know if other clarifications must be made to the manuscript in this regard.

We received feedback from reviewers advising us to provide more details about p-values and the statistical significance of our results. To this end, we conducted Mann-Whitney U tests on our results from Table 5 and applied Bonferroni correction to the resulting p-values.
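
For illustration, the following is a minimal sketch (not the code used in the study) of per-genre Mann-Whitney U tests with a Bonferroni-corrected significance threshold. The genre names, group sizes, and per-film scores below are purely hypothetical stand-ins.

    # Illustrative sketch (not the study's code): per-genre Mann-Whitney U tests
    # on a hypothetical per-film score, with a Bonferroni-corrected significance
    # threshold across six genre-level hypotheses.
    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    # Hypothetical per-film scores grouped by genre label (sizes are made up).
    scores_by_genre = {g: rng.normal(0.5, 0.1, size=n)
                       for g, n in [("Action", 30), ("Comedy", 25), ("Drama", 40),
                                    ("Horror", 20), ("Romance", 22), ("SciFi", 18)]}

    alpha = 0.05 / len(scores_by_genre)  # Bonferroni: divide alpha by the number of tests

    for genre, in_group in scores_by_genre.items():
        # Compare films of this genre against all remaining films.
        out_group = np.concatenate([v for g, v in scores_by_genre.items() if g != genre])
        stat, p = mannwhitneyu(in_group, out_group, alternative="two-sided")
        print(f"{genre}: U={stat:.1f}, p={p:.3f}, significant={p < alpha}")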

To show that our results were replicable, we re-ran all of our experiments with leave-one-out cross-validation. We believe that this further strengthens the conclusions drawn from the experimental modeling results. At the behest of one of the reviewers, we also include scatter plots in the Precision-Recall space to better illustrate the flexibility and performance of our models.
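
As a minimal sketch of leave-one-out evaluation, the snippet below uses synthetic data and a simple scikit-learn classifier as a stand-in for the neural models; the feature dimension and film count are illustrative assumptions, not values from the study.

    # Minimal leave-one-out sketch with synthetic data; a simple classifier
    # stands in for the neural models described in the paper.
    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score

    rng = np.random.default_rng(0)
    X = rng.random((110, 64))              # one feature vector per film (synthetic)
    y = rng.integers(0, 2, size=(110, 6))  # multi-label genre targets (synthetic)

    scores = np.zeros(y.shape, dtype=float)
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
        clf.fit(X[train_idx], y[train_idx])
        scores[test_idx] = clf.predict_proba(X[test_idx])

    print("mAP:", average_precision_score(y, scores, average="macro"))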

We hope that the statistical tests that we have implemented in the revised submission will help further substantiate our experimental findings, and the conclusions drawn from our work.

One reviewer asked us to justify our choice of music information retrieval (MIR) features, as well as to clarify how we calculated them. The MIR features we chose have shown utility in music emotion recognition and music genre classification in prior published work. Default parameters were used for computation, and the resulting features were then texture-windowed to provide our models with feature sets of equal window lengths. In the manuscript, we have clarified how we computed our MIR features and cited the prior works that inspired our choice of features. Our revised submission now includes specific details of which previous studies we were following and how we computed the features used in our study.
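
For readers unfamiliar with texture windowing, a rough sketch of the idea is shown below: frame-level features are summarized with a running mean and standard deviation over a longer "texture" window. The window and hop lengths here are assumptions for illustration, not the values used in the study.

    # Illustrative texture-windowing step: summarize frame-level MIR features
    # (e.g., MFCC frames) with a running mean and standard deviation.
    import numpy as np

    def texture_window(frames, win=43, hop=22):
        """frames: (n_frames, n_features) array of frame-level features.
        Returns an (n_windows, 2 * n_features) array of per-window mean and std."""
        out = []
        for start in range(0, len(frames) - win + 1, hop):
            chunk = frames[start:start + win]
            out.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
        return np.vstack(out)

    frame_feats = np.random.rand(1000, 20)    # synthetic 20-dimensional frame features
    print(texture_window(frame_feats).shape)  # -> (44, 40)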

We thank the reviewers for inquiring why the feature importance analysis was not conducted for VGGish features. We have added a sentence to our manuscript noting that VGGish features are not interpretable in the way MIR features are (because VGGish features do not correspond to human-understandable musical descriptors, such as loudness), so a feature importance plot for VGGish features would likewise not be interpretable. Our motivation for performing PFI was not to improve model performance, but rather to get a sense of which features contributed most to genre predictions. We have amended our manuscript to address this comment. Finally, we have added confidence interval indicators to Figure 3, which displays the results of the PFI analysis.
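
A rough sketch of the PFI idea is given below; the fitted `model` (assumed to expose a predict_proba method) and the `feature_groups` mapping of group names to column indices are hypothetical inputs, and this is not the authors' implementation.

    # Permutation feature importance on a held-out set: shuffle one feature
    # group at a time and record the drop in mean average precision.
    import numpy as np
    from sklearn.metrics import average_precision_score

    def permutation_feature_importance(model, X_val, y_val, feature_groups,
                                       n_repeats=10, seed=0):
        rng = np.random.default_rng(seed)
        base = average_precision_score(y_val, model.predict_proba(X_val), average="macro")
        importances = {}
        for name, cols in feature_groups.items():
            drops = []
            for _ in range(n_repeats):
                X_perm = X_val.copy()
                # Shuffle the rows of this feature group only, leaving the rest intact.
                X_perm[:, cols] = rng.permutation(X_perm[:, cols], axis=0)
                drops.append(base - average_precision_score(
                    y_val, model.predict_proba(X_perm), average="macro"))
            importances[name] = (np.mean(drops), np.std(drops))
        return importances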

The reviewers advised us to more explicitly state the contributions and potential applications of our work. To address this feedback, we have added in a new section to our manuscript called “Contribution of this Work,” which we hope will allow the reader to more precisely understand what we add to this area of study.

We appreciate the reviewers for pointing out a useful reference for cross-modal studies involving music. We have included the citation in our updated manuscript. Additionally, we have trimmed the “Visual-musical cross-modal analysis” section to avoid redundancy in the results section.

We received feedback that the “Multiple instance learning” sub-section in Related Works is unclear. In our most recent revision, we feel that the section is better motivated and clearer with regard to what a bag and an instance are in the context of our study. We believe the manuscript now better reflects why our problem could be framed as an MIL task, and we thank the reviewer for pointing out that this will allow the readers of our manuscript to better understand multiple instance learning and its relationship to this study.

A reviewer asked for more details on Score Stamper’s performance on three films that were manually annotated in-house. We have added to our manuscript the number of instances per movie, as well as our justification for choosing the three movies (they are of different genres, by different directors, and contain different style soundtracks).

At the request of the reviewers, we have added more information about the models and architectures that we used in our study. Additionally, we have expanded our study to include leave-one-out error analysis, which we hope provides more rigor to our study, especially given the relatively small size of our dataset. In our new manuscript, we more clearly show our method’s efficacy in predicting film genres and give more details about our models so that our study may be more easily reproduced. Finally, we have added a comparative analysis of our models’ performance against the baseline classifiers and prior studies.

Thank you for the comments about the need for additional clarity in the conclusions on “Musical-visual cross-modal analysis”. We have now added remarks on the various implications of our results. When we compare visual and musical styles by genre, the musical styles in question are the characteristics the model has learned to associate with every genre, which are not necessarily the actual musical characteristics that distinguish different film genres. Just as the model has learned an approximation of musical characteristics distinguishing film genres, our analysis approximates the relationship between visual and musical characteristics in different film genres.

Our goal in our musical-visual cross-modal analysis is to determine if visual features associated with a genre correlate with music that the model has learned to be characteristic of that same genre. We have updated the manuscript to clarify this motivation for readers.

We thank our reviewers for their interest in our manuscript and for helping us improve it. We hope that our work will be of use to our readers and to others interested in understanding music’s role as a significant and interactive cinematic device.

Attachment

Submitted filename: PLOS Rebuttal Letter FINAL.pdf

Decision Letter 1

Stavros Ntalampiras

17 Feb 2021

PONE-D-20-31042R1

A computational lens into how music characterizes genre in film

PLOS ONE

Dear Dr. Ma,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Apr 03 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Stavros Ntalampiras

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1) Statistical significance tests were only performed for the visual feature experiments and not for the VGG-ish and MIR features.

2) The caption of Table 5 is unclear about what the bold style means: the authors say they computed p-values using Bonferroni correction, which means they compared multiple tests, but I cannot understand what these tests are: did they compare all the tests for which they show the average at once? Or did they compare "actual", "predicted" and "false positives" in each row? Or all rows of "predicted", then all rows of "actual", and then all rows of "false positives"?

3) Moreover, statistical tests need the same cardinality between the sets tested. This would mean that the number of "films labeled with a given genre" is the same as the number of "films excluding the given genre", which seems unlikely. How did the authors manage this problem for the statistical test?

4) I appreciate the addition of supplementary figure S2, but it shows precision-recall curves, not scatter plots. PR curves come out when the classification is made based on some threshold, and they are a method for evaluating the model, not the distribution of the predictions. AUC (area under the curve) is similar to the F1-score, and in this case PR curves don't add any knowledge. Even if other reviewers find this plot useful, it's not clear what the "no skill" line is. It would be clearer if the MIR and VGG-ish points were on the same plot.

When I wrote "scatter plot in a precision-recall space", I meant a scatter plot, not a curve. Scatter plots show points at their coordinates; in my example, the coordinates were precision and recall. Points could be, for instance, each film. Multiple distributions can be plotted using different colors, e.g., multiple models: see for instance the following image https://www.researchgate.net/profile/Mohit_Bansal7/publication/267783907/figure/fig3/AS:669377024258060@1536603327607/BCubed-Precision-Recall-scatter-plot-for-the-Japanese-English-dataset-Each-point.png A plot like that could be useful to qualitatively evaluate the different models without running into type I and type II errors.

Reviewer #2: In my previous review I mentioned that the paper was already in good shape, but I recommended that the authors 1) clarify their contributions and structure, and 2) provide more details in the experiments and discussion, i.e., details of the architecture, statistical significance, how the visual features were calculated, among others.

The authors addressed all my comments and also improved the results and discussion section, which is much clearer now, so I recommend that the paper be accepted in this new version.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No


PLoS One. 2021 Apr 8;16(4):e0249957. doi: 10.1371/journal.pone.0249957.r004

Author response to Decision Letter 1


5 Mar 2021

(This response is also available in the Response to Reviewers file upload.)

We thank the reviewers for their feedback, which has helped us further strengthen our manuscript. Below, we provide details of how we addressed their comments.

1) Statistical significance tests were only performed for the visual feature experiments and not for the VGG-ish and MIR features.

Response: We overlooked showing statistical significance tests for the experiments involving VGGish and MIR features in the previous manuscript. We have now included a paired t-test on the mean average precision (mAP) of our models, where the null hypothesis is that the mAP of our proposed model is not higher than that of the random-guess baseline. Perhaps not surprisingly, our best-performing models perform significantly better than the baseline at the 0.01 level. The table has been updated with more details, and we hope readers will see that these models significantly outperform our baselines.
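
For illustration, a one-sided paired t-test of this kind can be run as in the sketch below; the per-fold mAP values shown are synthetic placeholders, not results from the study.

    # Illustrative one-sided paired t-test on per-fold mAP scores (made-up numbers).
    import numpy as np
    from scipy.stats import ttest_rel

    model_map    = np.array([0.62, 0.58, 0.65, 0.60, 0.63])  # model mAP per fold (synthetic)
    baseline_map = np.array([0.41, 0.44, 0.40, 0.43, 0.42])  # random-guess baseline per fold

    # H0: the model's mAP is not greater than the baseline's; H1: it is greater.
    stat, p = ttest_rel(model_map, baseline_map, alternative="greater")
    print(f"t = {stat:.2f}, p = {p:.4f}")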

2) The caption of Table 5 is unclear about what the bold style means: the authors say they computed p-values using Bonferroni correction, which means they compared multiple tests, but I cannot understand what these tests are: did they compare all the tests for which they show the average at once? Or did they compare "actual", "predicted" and "false positives" in each row? Or all rows of "predicted", then all rows of "actual", and then all rows of "false positives"?

Response: We appreciate the feedback about our caption for Table 5: we have clarified the caption to indicate that we used 6 hypotheses in our Bonferroni correction, corresponding to comparing the median of a particular genre in a particular column (“Actual,” “Predicted,” and “False Positives”) with the median of all of the other genres in that same column. Our caption has been reworded and the relevant parameters are included to elucidate the tests that we ran.

3) Moreover, statistical tests need the same cardinality between the sets tested. This would mean that the number of "films labeled with a given genre" is the same as the number of "films excluding the given genre", which seems unlikely. How did the authors manage this problem for the statistical test?

Response: While some statistical tests do require the same cardinality between the two samples tested, the results reported in Table 5 use a Mann-Whitney U test, which does not require that the number of films labeled with a given genre equal the number of films excluding that genre.

4) I appreciate the addition of supplementary figure S2, but it shows precision-recall curves, not scatter plots. PR curves come out when the classification is made based on some threshold, and they are a method for evaluating the model, not the distribution of the predictions. AUC (area under the curve) is similar to the F1-score, and in this case PR curves don't add any knowledge. Even if other reviewers find this plot useful, it's not clear what the "no skill" line is. It would be clearer if the MIR and VGG-ish points were on the same plot. When I wrote "scatter plot in a precision-recall space", I meant a scatter plot, not a curve. Scatter plots show points at their coordinates; in my example, the coordinates were precision and recall. Points could be, for instance, each film. Multiple distributions can be plotted using different colors, e.g., multiple models: see for instance the following image https://www.researchgate.net/profile/Mohit_Bansal7/publication/267783907/figure/fig3/AS:669377024258060@1536603327607/BCubed-Precision-Recall-scatter-plot-for-the-Japanese-English-dataset-Each-point.png A plot like that could be useful to qualitatively evaluate the different models without running into type I and type II errors.

Response: We appreciate our reviewer’s clarification about the PR plot. We have amended our manuscript to show models based on VGGish features and MIR features on the same PR plot to facilitate easy comparison, as in the reviewer’s example. Each film’s cues were compiled, and the predictions for these cues were used to calculate micro-averaged precision and recall scores for that film. We label the films with the highest and lowest precision and recall, and we believe the plot shows that per-film performance is similar between models using MIR features and models using VGGish features. We hope this new figure complements the PR curves shown in supplementary figure S2 by allowing readers to easily compare the MIR- and VGGish-based models.
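
A sketch of how such a per-film precision-recall scatter plot could be produced is shown below, using synthetic cue-level labels and predictions in place of the study's actual model outputs; the film count, cue count, and genre count are illustrative assumptions.

    # Per-film precision-recall scatter: pool each film's cue-level predictions,
    # compute micro-averaged precision and recall, and plot one point per film
    # for each feature set. All data here is synthetic.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import precision_score, recall_score

    rng = np.random.default_rng(2)

    def per_film_pr(y_true_by_film, y_pred_by_film):
        """Each element is an (n_cues, n_genres) binary array for one film."""
        pr = []
        for yt, yp in zip(y_true_by_film, y_pred_by_film):
            pr.append((precision_score(yt, yp, average="micro", zero_division=0),
                       recall_score(yt, yp, average="micro", zero_division=0)))
        return np.array(pr)

    # Synthetic stand-ins for cue-level labels and predictions of 110 films.
    films_true = [rng.integers(0, 2, size=(12, 6)) for _ in range(110)]
    films_mir  = [rng.integers(0, 2, size=(12, 6)) for _ in range(110)]
    films_vgg  = [rng.integers(0, 2, size=(12, 6)) for _ in range(110)]

    for preds, label in [(films_mir, "MIR"), (films_vgg, "VGGish")]:
        pr = per_film_pr(films_true, preds)
        plt.scatter(pr[:, 1], pr[:, 0], alpha=0.6, label=label)
    plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend(); plt.show()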

Attachment

Submitted filename: Rebuttal submission3 PLOSONE.pdf

Decision Letter 2

Stavros Ntalampiras

29 Mar 2021

A computational lens into how music characterizes genre in film

PONE-D-20-31042R2

Dear Dr. Ma,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Stavros Ntalampiras

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: All the comments have finally been addressed. Statistical significance tests have been carried out over all the evaluations shown in the text, and an additional plot has been added, giving a rough idea of how different the two models are. Since the PLOS ONE guidelines instruct reviewers to stress the rigor of the scientific procedure with respect to the results, even though the added plot shows that VGG-ish features only marginally outperform classic MIR features, I find that the paper still adds knowledge worthy of publication.

Reviewer #2: The authors addressed my comments previously, and in this iteration they have carefully answered and made the changes requested by the other reviewer. The paper is in good shape, in my view.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Stavros Ntalampiras

31 Mar 2021

PONE-D-20-31042R2

A computational lens into how music characterizes genre in film

Dear Dr. Ma:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Stavros Ntalampiras

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Complete list of films used in this study.

    (PDF)

    S2 Appendix. Precision-recall curves for top-performing MIR and VGGish models.

    (PDF)

    S3 Appendix. Scatter plot displaying precision and recall for each film (micro-averaged across all cues), for both VGGish and MIR average pooling models.

    (PDF)

    S1 Data

    (ZIP)

    Attachment

    Submitted filename: PLOS Rebuttal Letter FINAL.pdf

    Attachment

    Submitted filename: Rebuttal submission3 PLOSONE.pdf

    Data Availability Statement

    All relevant data are within the paper and its Supporting information files.

