Proceedings of the National Academy of Sciences of the United States of America
2024 Nov 4;121(46):e2409770121. doi: 10.1073/pnas.2409770121

Measuring diversity in Hollywood through the large-scale computational analysis of film

David Bamman a,1, Rachael Samberg b, Richard Jean So c, Naitian Zhou a
PMCID: PMC11573682  PMID: 39495931

Significance

Computational research on film promises to teach us about the broad social effects of culture, but large-scale analysis has been chilled in the United States by the Digital Millennium Copyright Act. A regulatory change in 2021 by the U.S. Copyright Office, however, has opened up a new frontier of research investigating the science of film. We use this exemption to create one of the largest known collections of digitized movies for research. Applying computational methods to this collection, we confirm several existing qualitative studies of representation and diversity in Hollywood film while also uncovering several new findings. This research opens up a range of analytical questions on film that can now begin to be answered with empirical methods.

Keywords: film, computer vision, culture

Abstract

Movies are a massively popular and influential form of media, but their computational study at scale has largely been off-limits to researchers in the United States due to the Digital Millennium Copyright Act. In this work, we illustrate the use of a new regulatory framework that enables computational research on film by permitting circumvention of technological protection measures on digital video discs (DVDs). We use this exemption to legally digitize a collection of 2,307 films representing the top 50 movies by U.S. box office over the period 1980 to 2022, along with award nominees. We design a computational pipeline for measuring the representation of gender and race/ethnicity in film, drawing on computer vision models for recognizing actors and human perceptions of gender and race/ethnicity. Doing so allows us to learn substantive facts about representation and diversity in Hollywood over this period, confirming earlier studies that see an increase in diversity over the past decade, while allowing us to use computational methods to uncover a range of ad hoc analytical findings. Our work illustrates the affordances of the data-driven analysis of film at a large scale.


Film is a massively popular form of culture and media, with sales of movie tickets and home entertainment routinely reaching over $30 billion in the United States and Canada alone (1, 2). For decades, researchers across a number of disciplines, such as communications and cultural studies, have sought to understand the effects of watching movies on human social behavior and belief (3, 4). Cultural historians have long argued that individual works of film, such as Birth of a Nation, have played an important role in propagating racial stereotypes in U.S. society (5–7), while media effects researchers have empirically demonstrated how films can normalize ideas about sex and violence for large audiences (8, 9). Most recently, in 2015 and 2016, the “#OscarsSoWhite” social media campaign was launched to critique the lack of racial minority directors and actors both in award nominations and within Hollywood in general. In all of these cases, there exists the strong belief that negative or false cultural representations in cinema can produce harmful social effects, particularly toward women and racial minorities.

Methodologically, past research has examined these representations in film using traditional close reading techniques, examining a small number of texts in order to reveal and critique the biases that inhere in them (10–14). Over the past twenty years, efforts have also adopted an empirical approach: Researchers watch hundreds of hours of movies and manually code character gender, ethnicity, and how often characters appear in leading roles or have dialog (15–19). In many ways, this data-driven human work represents a gold standard for analysis, carrying out careful observational research that ties determinations of gender and race to constructs like “leading role.” At the same time, the past 10 y have also seen computational methods from computer vision reach a level of maturity that permits their use in measurements of culture (20). This includes James Cutting’s early work on motion and luminescence in film (21), work by Google, the Geena Davis Institute, and the University of Southern California developing models of gender (22), and Arnold and Tilton’s work opening up the field of “distant viewing” for questions of humanistic inquiry (23–25). This foundational work has applied models in computer vision for recognizing the people present on screen—along with their poses and blocking within shots—to make claims about visual representation, supporting these claims with a rich set of theoretical concepts that also interrogate bias and other potential sources of error. Work from computer vision has created accurate measuring instruments, and distant viewing has provided a cogent theoretical framework for applying those instruments to culture. Our paper extends this work by tackling one critical component that has been missing but is required to drive forward the large-scale analysis of film and its impact on society: data.

In the United States, film data for copyrighted materials have generally been protected by the Digital Millennium Copyright Act (which prohibits breaking digital locks on DVDs) or by licensing terms from streaming services such as Netflix and Amazon. In this work, we draw on a recent exemption to the Digital Millennium Copyright Act (DMCA) issued by the U.S. Copyright Office to legally build a collection of 2,307 movies for analysis. This dataset, coupled with validated techniques in computer vision, allows us to interrogate, as one case study, the representation of race/ethnicity and gender in Hollywood movies over the period 1980 to 2022. Our findings confirm existing research on race and diversity in Hollywood while also uncovering several important facts. In terms of confirmation, we find that Hollywood movies are getting more diverse, with increasing representation for actors who are women, Black, Hispanic/Latino, East Asian, and South Asian over the past decade. We find that Black actors in particular have been historically underrepresented in award-nominated films relative to their occurrence in popular ones, complementing existing knowledge about biases in individual nominations. We additionally find that screen time allotted to non-White actors within films is also getting more diverse, so that, shot by shot, viewers are more likely to see a racial mix of characters over time; and that all groups except White men tend to be underrepresented in leading roles relative to their representation in nonleading performances, stressing the importance of examining disparities in the long tail of casting.

We argue that the new ability to build datasets for the analysis of film, paired with computer vision methods adapted for cinematic analysis, allows others to measure an open-ended set of concepts and provide new forms of evidence for cultural inquiry.

Data

Previously, building a large-scale collection of film to carry out empirical analysis had largely not been possible in the United States due to copyright restrictions, including restrictions on circumventing digital locks (sometimes called “technological protection measures” or “TPMs”) that the film industry applies to motion picture DVDs. Ordinarily, copyright grants exclusive rights to the creators of original expression as an incentive to advance societal knowledge and progress. Copyright law contains exceptions to these exclusive rights, however; the most relevant here is fair use, which flexibly supports many educational and research activities, including text and data mining research (26). However, fair use does not support breaking TPMs to undertake that text and data mining research. Instead, a separate exemption needed to be created under the DMCA to authorize “breaking” TPMs for this research with films.

In 2021, the US Copyright Office granted such an exemption to the DMCA, codified at 37 CFR 201.40(b)(4). We have used that exemption to build a collection of 2,307 movies for analysis. We select films for this analysis following two selection criteria:

Popular Films.

To capture popular films, we gather the top 50 nonanimated, narrative movies by U.S. box office over the period 1980 to 2022, using calendar year grosses recorded by www.boxofficemojo.com (SI Appendix, Fig. S1). This yields a total of 2,023 distinct movies.

Prestige Films.

To capture prestige films, we select films nominated for “Best Picture” equivalent awards by six different organizations: Academy Awards, Golden Globes, British Academy of Film and Television Arts, Los Angeles Film Critics Association, National Board of Review, and National Society of Film Critics. This yields a total of 545 distinct movies.

Given overlap between popular and award-nominated films, the union of the two sets is a total of 2,307 movies. We build a collection by purchasing DVDs, breaking the TPMs on them, and digitizing them. In accordance with 37 CFR 201.40(b)(4), all DVDs are owned by UC Berkeley, and all computation performed on this collection is carried out in the UC Berkeley Secure Research Data and Compute (SRDC) environment, which protects data to the level required by the Health Insurance Portability and Accountability Act (HIPAA) and the Family Educational Rights and Privacy Act (FERPA) (27).

Measuring Identity

Our core research question examines the representation of race/ethnicity and gender over time in Hollywood, where much prior work has revealed a strong imbalance over the past 20 y. For gender, this work has shown that men appear roughly twice as often as women across a variety of measures, including rate of protagonists, major characters, and speaking roles (17, 22, 28). For race and ethnicity, this work has documented underrepresentation and erasure for many non-White racial/ethnic groups relative to U.S. census estimates (17, 19). Much of the important work on gender has relied on viewing the film to make judgments about the rate with which characters occupy roles, with several important exceptions: Prior work has used face detection paired with gender recognition models (22, 29, 30) to examine gender on screen; and Arnold and Tilton (23–25) use face detection and recognition models from OpenFace and VGGFace2 to study gender performance of main characters in the television sitcoms Bewitched and I Dream of Jeannie in their work on distant viewing. We build directly on this work by bringing together two sources of information to measure representation on screen: computational methods to recognize the actors who appear in frames and human judgments about their race/ethnicity and gender.

Recognizing Actors.

To identify the actors who are present on screen, we draw on previous work leveraging cast lists and actor photos from the Internet Movie Database (IMDB) (31–34). At a high level, our method generates a vector representation for each face track in a movie (a sequence of overlapping faces) and a vector representation for each actor in the cast list, and finds the closest actor for each face track in that representation space; see Materials and Methods below for details and validation. Fig. 1 illustrates the output of this process.

Fig. 1.

Example output of our pipeline on the movie La La Land, courtesy of Lions Gate Films Inc. (This image is excluded from the Creative Commons license.)

This method allows us to measure the concept of facetime: The total amount of time that an actor’s face is recognizable on screen. As noted in SI Appendix, facetime is correlated with other measures of screentime (such as those that measure whether any part of an actor can be seen, or if they can be heard). Table 1 lists the actors with the most facetime in our collection of popular movies released between 1980 and 2022.
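As an illustrative sketch (not the production pipeline; the tuple format, frame rate, and names below are hypothetical), facetime can be computed by summing the duration of each face track per recognized actor, discarding tracks whose actor remains Unknown:

```python
from collections import defaultdict

def facetime(tracks, fps=24):
    """Sum on-screen time in seconds for each recognized actor.

    `tracks` is a list of (actor, start_frame, end_frame) tuples, a
    simplified stand-in for the pipeline's face tracks; tracks whose
    actor is None (Unknown) are ignored.
    """
    totals = defaultdict(float)
    for actor, start, end in tracks:
        if actor is not None:
            totals[actor] += (end - start + 1) / fps
    return dict(totals)

tracks = [("Tom Hanks", 0, 239), ("Meryl Streep", 100, 339), (None, 0, 47)]
print(facetime(tracks))  # each named track spans 240 frames = 10.0 s at 24 fps
```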

Table 1.

Actors with most facetime in top Hollywood movies, 1980 to 2022

Actor Hours:Minutes
Tom Hanks 17:36
Tom Cruise 16:55
Denzel Washington 14:41
Meryl Streep 12:12
Robert De Niro 12:08
Julia Roberts 11:31
Robin Williams 11:18
Eddie Murphy 10:57
Steve Martin 10:09
Brad Pitt 10:05

Actor Gender and Race/Ethnicity.

Computational prediction of gender and race/ethnicity has been widely explored in the computer vision community, both as a classification task (35) and as a way to benchmark the diversity and fairness of datasets (36). Recent work by science and technology scholars (37–39), however, has exposed issues with this automatic analysis, including the potential for both bias (40–42) and misrepresentation (43) in gender and racial categories. Yet these works also agree that race and gender must not be ignored; rather, the operationalization of these social constructs should be tailored to the research question (44), where potential social harms tied to measurement can be offset by what measurement reveals and critiques about race and gender as unstable and political social constructions (45, 46).

Our research question in this study is centered around representation from the perspective of an average viewer: When seeing an actor on screen, what race/ethnicity and gender do viewers see represented? Rather than rely on computational methods for these important objects of study, we draw on human perceptions for the set of actors we identify. For gender, we draw on Wikidata, which contains information about actors that has been sourced by the Wikipedia community, and captures a variety of gender expressions beyond a simple binary. For race/ethnicity, we carry out a user study soliciting perceptions of race/ethnicity for a set of 6,740 actors who together comprise 90% of all faces seen on-screen (SI Appendix, Figs. S2–S4). For each actor, we solicit 10 perceptions from survey participants, one each from respondents who self-identify as {Black, White, Hispanic/Latino, East Asian, South Asian} × {man (including trans man), woman (including trans woman)}. Importantly, our object of study here is not the racial/ethnic self-identity of the actor (which is unknowable outside of assertion by the actor themselves), but rather the perception of race/ethnicity as seen by an average viewer.

We assess the performance of actor recognition with respect to gender and race/ethnicity, measuring the degree to which computational methods undercount or overcount actors for each identity category we study. Drawing on methods for statistical bias correction for prevalence estimation (e.g., how often we see a Black actor among all faces recognized), we find relative parity for Black, White, and East Asian actors; we note overcounting for men and Hispanic/Latino and South Asian actors, and undercounting for women; those biases, however, are comparatively small, such that applying bias correction measures would change prevalence rates by a maximum of 1.6 absolute percentage points (SI Appendix, section 2.D).
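One standard correction of this kind is the Rogan–Gladen estimator, which adjusts an observed prevalence using a classifier's sensitivity and specificity estimated on labeled validation data; whether this is the exact estimator applied in SI Appendix, section 2.D is an assumption, and the numbers below are illustrative:

```python
def corrected_prevalence(observed, sensitivity, specificity):
    """Rogan-Gladen correction: recover a true prevalence estimate from
    an observed rate, given the detector's sensitivity and specificity
    measured on labeled validation data."""
    return (observed + specificity - 1) / (sensitivity + specificity - 1)

# With a near-perfect detector the correction barely moves the estimate,
# consistent with the small (<=1.6 point) shifts reported above.
print(corrected_prevalence(0.112, 0.95, 0.99))
```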

Findings

Our first set of findings confirms existing qualitative research and provides a measure of concurrent validity for our methods.

Hollywood Movies Collectively Are Getting More Diverse.

We carry out the process described above for all movies in our popular dataset and measure the amount of screentime for men and women over the period 1980 to 2022. Fig. 2 presents these results for women, displaying the gender rate for each movie as a point and a locally estimated scatterplot smoothing (LOESS) best-fit curve for the trend (along with 95% bootstrap CIs on the prediction). We bootstrap treating the film as the resampling unit, since films have strong correlations among the gender and race/ethnicity of their cast; see SI Appendix, section 2.F.5 for yearly averages with independent CIs (SI Appendix, Figs. S12 and S13) and alternative bootstrap specifications resampling by actor (SI Appendix, Figs. S14 and S15). This work further confirms the repeated finding that men have far greater screentime than women but that the rate by which viewers see women appearing on screen has been increasing over time (SI Appendix, Fig. S5), moving from a relatively flat rate of occurrence of around 25% from 1980 to 2010 toward 40% by 2022. As Fig. 3 illustrates, we see similar patterns of increasing representation for Black, East Asian, Hispanic/Latino, and South Asian actors over roughly the same time period.
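The film-level bootstrap can be sketched as follows (the per-film rates are hypothetical; resampling whole films rather than individual actors preserves the within-film correlation in cast composition):

```python
import random

def bootstrap_ci(per_film_rates, n_boot=2000, alpha=0.05, seed=0):
    """95% bootstrap CI for the mean rate, resampling films (not actors)
    so that within-film correlation among cast members is preserved."""
    rng = random.Random(seed)
    n = len(per_film_rates)
    means = sorted(
        sum(rng.choices(per_film_rates, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

rates = [0.25, 0.30, 0.40, 0.35, 0.20, 0.45, 0.33]  # toy per-film rates
print(bootstrap_ci(rates))
```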

Fig. 2.

Representation of actors who are women has increased over time; each point represents one movie, illustrated by a sample of movies from 2022.

Fig. 3.

Representation of Black, East Asian, Hispanic/Latino, and South Asian actors has increased over time (Left, detail), while representation of White actors has decreased (Right).

Previous work has found a similar gender disparity for screentime in English-language fiction (47), significantly explained by the gender of the author (men as authors give three times more screentime to men than women, while women provide equal screentime). We investigate a similar hypothesis by focusing on the gender of the director. We identify the directors of movies using IMDB and use Wikidata to provide information about their gender, supplementing gaps using the referential gender of the director’s Wikipedia biography. We see again a strong disparity: Women direct movies with relative gender parity (50.1% women [45.6 to 54.6], n=81), while men direct films in which men have three times more screentime than women (29.0% [28.2 to 29.9], n=1,941). Among the 2,023 distinct movies in this dataset, women have directed only 4.0% of them.

Black Actors Are Historically Underrepresented in Award-Nominated Films.

Much prior work has examined the lack of non-White nominees and winners for individual roles such as best actor and director (48) and demonstrated the transformative impact of the #OscarsSoWhite campaign on increasing representation among those nominations (15). We test the degree to which there is a difference in representation for popular films and award-nominated films, examining here the representation within each movie as a whole. We compare the average facetime rate for each identity category among popular films with that same measure for award-nominated films and carry out a permutation test to assess the significance of the difference in means between them.
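A permutation test of this form can be sketched as follows (the rates are toy values, not our data): pool the per-film rates from the two sets, repeatedly reshuffle the group labels, and count how often the shuffled difference in means is at least as extreme as the observed one.

```python
import random

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in means between two
    lists of per-film rates; returns the estimated P value."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[: len(a)], pooled[len(a):]
        if abs(sum(perm_a) / len(a) - sum(perm_b) / len(b)) >= observed:
            hits += 1
    return hits / n_perm

# Identical groups: every shuffled difference is at least as extreme,
# so the P value is 1.
print(permutation_test([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]))
```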

As Table 2 illustrates, we see strong differences in the representation of Black actors across these two sets of films, with Black actors underrepresented in award-nominated films relative to popular ones. As illustrated in SI Appendix, Figs. S7 and S8, the difference is largely due to underrepresentation of Black actors in award-nominated films during the period 1990 to 2010.

Table 2.

Average representation by group in award-nominated films vs. popular films, along with P value for permutation test assessing difference in means

Prestige Popular P
% Black 0.068 0.112 **
% East Asian 0.027 0.023
% Hispanic/Latino 0.048 0.052
% South Asian 0.015 0.013
% White 0.866 0.831 *
% Men 0.670 0.703 *
% Women 0.330 0.297 *

* denotes P<0.01, **P<0.001 after Bonferroni correction.

Our remaining analyses offer findings about race and diversity in Hollywood film by investigating this phenomenon at a large scale in ad hoc ways. We focus in particular on within-film diversity (as opposed to industry-wide diversity), and diversity in lead vs. nonlead roles.

Hollywood Movies Individually Are Getting More Diverse.

While the findings above generally comport with labor-intensive previous studies over shorter time frames (17, 22, 28), our methodology allows us to ask more granular, ad hoc questions without re-viewing the entire collection. The analysis above demonstrates that representation in Hollywood as a whole has become more diverse from the perspective of viewers. But that could be explained by movies that feature majority non-White casts (e.g., Crouching Tiger, Hidden Dragon; The Woman King); a viewer not watching those movies would not be exposed to that increasing representation in the industry.

We test whether movies are becoming more diverse internally by measuring the average race/ethnicity entropy within an individual movie over all frames in which at least two actors are present (SI Appendix, section 2.F.2). An entropy of 0 denotes that all actors on screen have the same race/ethnicity; as entropy goes up, diversity increases as well. As Fig. 4 illustrates, we see increasing within-movie diversity over this same time period as well, especially over the past decade. This finding is robust to the definition of “mutual presence” on screen, showing similar increasing trends over time for actors who appear together within a fixed window of time (SI Appendix, Fig. S6).
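The entropy measure can be sketched as follows (the frame labels are toy data; the exact frame-selection details are given in SI Appendix, section 2.F.2):

```python
from collections import Counter
from math import log2

def frame_entropy(labels):
    """Shannon entropy (bits) of the race/ethnicity labels of actors
    visible together in one frame; 0 means all share one label."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

def movie_entropy(frames):
    """Average frame entropy over frames with at least two actors
    present, mirroring the within-movie diversity measure."""
    multi = [f for f in frames if len(f) >= 2]
    return sum(frame_entropy(f) for f in multi) / len(multi)

frames = [["White", "White"], ["White", "Black"], ["White"]]
print(movie_entropy(frames))  # (0 + 1) / 2 over the two multi-actor frames
```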

Fig. 4.

Diversity within individual movies is increasing over time, as measured by entropy.

Representation Among Lead vs. Nonlead Roles.

Finally, we consider representation among lead vs. nonlead roles, establishing an actor’s performance as a lead as a function of their overall facetime relative to the actor with the greatest facetime in the film (SI Appendix, section 2.F.4 and Fig. S9). When measuring representation within these roles, we see a stark difference between representation for White men and all other gender and race/ethnicity categories. While lead and nonlead roles have both seen increasing representation over time (SI Appendix, Figs. S10 and S11), leading roles have far less representation for non-White actors and women than nonleading ones, as Table 3 illustrates. While much prior work necessarily focuses on leading roles alone, this finding points to the importance of measuring representation within the long tail of roles within a film.
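A facetime-based lead/nonlead split can be sketched as follows; the 0.5 cutoff and the cast facetimes here are illustrative assumptions, with the actual threshold defined in SI Appendix, section 2.F.4:

```python
def split_roles(facetimes, threshold=0.5):
    """Partition a film's cast into lead and nonlead roles by facetime
    relative to the most-seen actor. `facetimes` maps actor -> seconds;
    the 0.5 threshold is illustrative, not the paper's exact cutoff."""
    top = max(facetimes.values())
    leads = {a for a, t in facetimes.items() if t >= threshold * top}
    return leads, set(facetimes) - leads

leads, nonleads = split_roles({"A": 40.0, "B": 25.0, "C": 5.0})
print(leads, nonleads)  # B clears half of A's facetime; C does not
```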

Table 3.

Average representation in lead vs. nonlead roles in popular movies, along with P value for two-tailed paired t test assessing difference in means

Lead Nonlead P
% Black 0.103 0.119
% East Asian 0.016 0.030 ***
% Hispanic/Latino 0.036 0.066 ***
% South Asian 0.009 0.016 ***
% White 0.860 0.809 ***
% Men 0.759 0.667 ***
% Women 0.240 0.332 ***

*** denotes P<0.0001 after Bonferroni correction.

Discussion

The large-scale computational analysis of film allows us to combine the maturity of methods in computer vision with new access in the United States to film data for research, opening up a range of new analytical questions. We focus here on illustrating this power by measuring how the film industry has historically engaged with diversity in its representation of actors on screen, and we observe real changes over the past decade. The availability of past manual research examining representation in film offers a test of concurrent validity for this method, where examining the “facetime” of all actors has strong connections with other aspects of representation (leads, presence of dialog) that have been measured in the past (17, 22, 28). Our work confirms these findings and offers insight into ad hoc questions on diversity within films (as opposed to the industry as a whole) and on disparities that exist within film awards.

Film analytics of this kind presents new challenges—and affordances—for open science. Under 37 CFR 201.40(b)(4), researchers are permitted to break technological protection measures on DVDs for their own research (subject to restrictions outlined above), but not to distribute any TPM-stripped films to others. We can, however, encourage openness in two ways. First, we release our computational pipeline so that others who purchase DVDs at their own institutions are able to reproduce our results; to this end, we also release all of the universal product codes (UPC) for the DVDs in our dataset so that others are able to purchase the identical version. Second, we release all derived measures that we have calculated in our collection; this includes the locations of faces within each frame of a movie, the face track they belong to, and any actor that we have recognized for that track (along with its confidence); shot boundary locations; and structured metadata about those films (aspect ratio, frame rate, etc.). This represents the largest (to our knowledge) open collection of granular film data, so that the analysis we make here can be interrogated by further examination and critique. In much the same way that computational text analysis has generated new knowledge of culture expressed in written form (49), we hope that this new availability of film data and computational mode of inquiry can drive a range of similar pursuits—not only for these questions of representation, but for broader aspects of film style (50) as well.

Materials and Methods

Actor Recognition.

We recognize faces in each movie frame using YuNet (51) and assemble sequences of overlapping faces into face tracks (52). For each face track, we select the single face with highest detection confidence and generate a representation for it using the buffalo_l recognition model of InsightFace (53), a ResNet-50 architecture trained on the WebFace-12M dataset (54). For each movie in our dataset, we gather its cast list from IMDB and sample up to 10 images from IMDB per actor on that list. We run the same face detection and recognition models on each actor image and represent an actor as the average representation over all such faces. For each track representation, we find the closest match (by cosine similarity) between its vector representation and the vectors for all actors from the cast, enforcing a minimum similarity of 0.18 (tuned on the development data, as described below); faces below this threshold remain Unknown.
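The track-to-actor matching step can be sketched as follows; the 2-d vectors and names are toy stand-ins for real InsightFace embeddings, but the logic (nearest cast member by cosine similarity, with sub-threshold tracks left Unknown) mirrors the description above:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def match_tracks(track_vecs, actor_vecs, min_sim=0.18):
    """Assign each face-track embedding to its most similar cast member;
    tracks whose best similarity falls below `min_sim` stay Unknown (None)."""
    matches = []
    for v in track_vecs:
        name, sim = max(
            ((n, cosine(v, a)) for n, a in actor_vecs.items()),
            key=lambda pair: pair[1],
        )
        matches.append(name if sim >= min_sim else None)
    return matches

cast = {"Ryan Gosling": [1.0, 0.0], "Emma Stone": [0.0, 1.0]}
print(match_tracks([[0.9, 0.1], [-1.0, -1.0]], cast))  # ['Ryan Gosling', None]
```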

We evaluate the in-domain performance of this pipeline by creating a benchmark dataset consisting of 129 movies, stratified by year over our period of study (3 movies for each year between 1980 and 2022). We digitize each movie a second time (relying on 37 CFR 201.40(b)(1) for the digitization of short clips for research purposes) and sample 25 frames from each movie for the manual annotation of actor identity. We manually label the bounding boxes for all faces present in a frame, and label the actor identity for all faces that are recognizable from the IMDB cast list; all faces that are unidentifiable (e.g., background characters, uncredited roles) are labeled Unknown. We partition this dataset into separate training, development, and test splits (43 movies per split, with no overlap between splits); we optimize all hyperparameters for the methods above on the training and development sets and report accuracy on the test set. On this test data, we report standard metrics of average precision (AP) for face detection and F1 for person identification. In this latter measure, all non-Unknown gold labels count toward recall trials, and all non-Unknown predictions count toward precision trials. A prediction for an actor identity marked as Unknown by a human would penalize precision; an Unknown prediction for an actor whose identity was labeled by a human would penalize recall. We see an AP@50 of 0.869 for face detection, and an F1 score of 0.852 for person identification.
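The Unknown-aware F1 described above can be sketched as follows (the gold/predicted labels are toy examples, with None standing for Unknown):

```python
def identification_f1(gold, pred):
    """F1 for person identification where Unknown (None) gold labels do
    not count toward recall trials and Unknown predictions do not count
    toward precision trials, as in the evaluation described above."""
    tp = sum(1 for g, p in zip(gold, pred) if g is not None and g == p)
    precision_trials = sum(1 for p in pred if p is not None)
    recall_trials = sum(1 for g in gold if g is not None)
    p = tp / precision_trials if precision_trials else 0.0
    r = tp / recall_trials if recall_trials else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ["Hanks", None, "Streep", "Cruise"]   # None: unidentifiable face
pred = ["Hanks", "Roberts", None, "Cruise"]  # one spurious, one missed
print(identification_f1(gold, pred))
```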

Measuring Gender.

In order to gather information about a community’s perception of the gender of actors, we draw on Wikidata, which contains information about actor gender that has been sourced by the Wikipedia community. This information captures a variety of gender expressions beyond a simple binary, including categories for male, female, nonbinary, trans man, trans woman, genderqueer, etc. Attestations of gender on Wikidata are often linked to primary and secondary sources as evidence. Actor gender perceptions change over time within this community concurrent with public information about the actor [such as Elliot Page’s public identification as a trans man in December 2020 (55)]. To capture historical gender perceptions, we use yearly snapshots of Wikidata data beginning in 2014 and treat an actor’s gender perception for a film as the Wikidata gender label in the year of the film’s release. Actor gender for any year prior to their first appearance on Wikidata is considered identical to the gender perception in the first attested year.
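The snapshot lookup can be sketched as follows (the snapshot labels are hypothetical; the function implements the release-year rule with fallback to the earliest attested year described above):

```python
def gender_at_release(snapshots, release_year):
    """Look up an actor's Wikidata gender label in the snapshot for the
    film's release year; years before the first attested snapshot fall
    back to the earliest one. `snapshots` maps year -> label."""
    years = sorted(snapshots)
    usable = [y for y in years if y <= release_year]
    return snapshots[usable[-1]] if usable else snapshots[years[0]]

snaps = {2014: "male", 2021: "trans man"}  # toy snapshot history
print(gender_at_release(snaps, 2010))  # predates snapshots: earliest label
print(gender_at_release(snaps, 2022))
```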

Measuring Race/Ethnicity.

As recent research points out, the social construct of “race” encompasses many things (42, 44): racial self-identity (an individual’s self-perceived race) and observed race (ascribed onto an individual by others), which can be judged from genealogy (“race” inherited from parents), phenotype (appearance), behavior, and other factors. The racial self-identity of an actor is fundamentally unknowable outside assertion by the actor themselves. Our work, therefore, does not attempt to infer that self-identity, but rather to understand how viewers themselves see race/ethnicity represented on screen. To do so, we carry out a user study soliciting perceptions of race/ethnicity for a set of 6,970 actors who together account for 90% of all recognized faces in our dataset (SI Appendix, Fig. S3). We solicit perceptions of actor race/ethnicity from participants on the survey platform Prolific. For each actor in our dataset, we solicit 10 judgments: one each from men (including trans men) who identify as Black, White, Hispanic/Latino, East Asian, and South Asian, and one each from women (including trans women) who identify as the same groups. Participants are paid on average $15 per hour. This research has been approved by the University of California, Berkeley Institutional Review Board under protocol ID 2024-01-17041. More details can be found in SI Appendix, section 2.C.

Supplementary Material

Appendix 01 (PDF)

Acknowledgments

We thank the reviewers and managing editor for their insightful comments. Funding: The research reported in this article was supported by funding from the Mellon Foundation. This work was made possible by the use of the Secure Research Data and Compute Platform at the University of California, Berkeley.

Author contributions

D.B., R.S., R.J.S., and N.Z. designed research; D.B., R.S., R.J.S., and N.Z. performed research; D.B. and N.Z. analyzed data; and D.B., R.S., R.J.S., and N.Z. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

Data, Materials, and Software Availability

Code and data have been deposited in GitHub (https://github.com/dbamman/movie-representation) (56). Some study data are available: As noted in the “Discussion” section, we are not able to directly republish the original movies extracted from DVDs under the current exemption to the DMCA, but we make available several other forms of data to encourage openness and reproducibility. We release our computational pipeline so that others are able to run our methods on their own collections; we release all UPCs for the DVDs in our dataset so that others are able to purchase the same versions; and we release all derived measures that we have calculated in our collection (including the locations of faces within films, the face track and actor they belong to, shot boundaries, and film-level metadata).

Supporting Information

References

1. Motion Picture Association, Theme report (2019). https://www.motionpictures.org/wp-content/uploads/2020/03/MPA-THEME-2019.pdf. Accessed 18 October 2024.
2. Motion Picture Association, Theme report (2021). https://www.motionpictures.org/wp-content/uploads/2022/03/MPA-2021-THEME-Report-FINAL.pdf. Accessed 18 October 2024.
3. J. Staiger, Interpreting Films: Studies in the Historical Reception of American Cinema (Princeton University Press, Princeton, NJ, 1992).
4. E. M. Perse, J. Lambe, Media Effects and Society (Routledge, New York, 2016).
5. M. Stokes, D. W. Griffith's The Birth of a Nation: A History of the Most Controversial Motion Picture of All Time (Oxford University Press, Oxford, 2007).
6. B. Urwand, The Black image on the White screen: Representations of African Americans from the origins of cinema to the Birth of a Nation. J. Am. Stud. 52, 45–64 (2018).
7. C. Sackl, "Screening Blackness: Controversial visibilities of race in Disney's fairy tale adaptations" in On Disney: Deconstructing Images, Tropes and Narratives, U. Dettmar, I. Tomkowiak, Eds. (J. B. Metzler, 2022), pp. 81–96.
8. E. Donnerstein, D. Linz, Mass media sexual violence and male viewers: Current theory and research. Am. Behav. Sci. 29, 601–618 (1986).
9. M. G. Weisz, C. M. Earls, The effects of exposure to filmed sexual violence on attitudes toward rape. J. Interpers. Violence 10, 71–84 (1995).
10. b. hooks, "Eating the Other: Desire and resistance" in Black Looks: Race and Representation (South End Press, 1992).
11. V. Smith, Representing Blackness: Issues in Film and Video (Rutgers University Press, New Brunswick, NJ, 1997).
12. A. Nama, Black Space: Imagining Race in Science Fiction Film (University of Texas Press, 2008).
13. F. B. Wilderson III, Red, White & Black: Cinema and the Structure of U.S. Antagonisms (Duke University Press, 2010).
14. K. Gabbard, Black Magic: White Hollywood and African American Culture (Rutgers University Press, 2004).
15. Annenberg Inclusion Initiative, A 96-year historical analysis of gender and race/ethnicity of all Academy Award nominees and winners (2024). https://www.inclusionlist.org/oscars. Accessed 18 October 2024.
16. A. C. Ramón, M. Tran, D. Hunt, "Hollywood diversity report 2023, Part 1" (Tech. Rep., UCLA Entertainment & Media Research Initiative, 2023). https://socialsciences.ucla.edu/wp-content/uploads/2024/06/UCLA-Hollywood-Diversity-Report-2023-Film-3-30-2023.pdf. Accessed 18 October 2024.
17. S. L. Smith, K. Pieper, S. Wheeler, Inequality in 1,600 popular films: Examining portrayals of gender, race/ethnicity, LGBTQ+ & disability from 2007 to 2022 (2023). https://assets.uscannenberg.org/docs/aii-inequality-in-1600-popular-films-20230811.pdf. Accessed 18 October 2024.
18. S. Eschholz, J. Bufkin, J. Long, Symbolic reality bites: Women and racial/ethnic minorities in modern film. Sociol. Spectr. 22, 299–334 (2002).
19. R. Tukachinsky, D. Mastro, M. Yarchi, Documenting portrayals of race/ethnicity on primetime television over a 20-year span and their association with national-level racial/ethnic attitudes. J. Soc. Issues 71, 17–38 (2015).
20. K. Somandepalli et al., Computational media intelligence: Human-centered machine analysis of media. Proc. IEEE 109, 891–910 (2021).
21. J. E. Cutting et al., Changes in Hollywood film over 75 years. i-Perception 2, 569–576 (2011).
22. Google, The women missing from the silver screen and the technology used to find them (2017). https://about.google/intl/ALL_us/main/gender-equality-films/. Accessed 18 October 2024.
23. T. Arnold, L. Tilton, A. Berke, Visual style in two network era sitcoms. J. Cult. Anal. 4, 11045 (2019).
24. T. Arnold, L. Tilton, Distant viewing: Analyzing large visual corpora. Digit. Scholarsh. Humanit. 34, i3–i16 (2019).
25. T. Arnold, L. Tilton, Distant Viewing: Computational Exploration of Digital Images (MIT Press, 2023).
26. S. M. Fiil-Flynn et al., Legal reform to enhance global text and data mining research. Science 378, 951–953 (2022).
27. J. Christopher et al., "Corralling sensitive data in the Wild West: Supporting research with highly sensitive data" in Practice and Experience in Advanced Research Computing 2022: Revolutionary: Computing, Connections, You (Association for Computing Machinery, New York, NY, 2022), pp. 1–5.
28. M. M. Lauzen, It's a man's (celluloid) world: Portrayals of female characters in the top grossing U.S. films of 2022 (2023). https://womenintvfilm.sdsu.edu/wp-content/uploads/2023/03/2022-its-a-mans-celluloid-world-report-rev.pdf. Accessed 18 October 2024.
29. T. Guha, C. W. Huang, N. Kumar, Y. Zhu, S. S. Narayanan, "Gender representation in cinematic content: A multimodal approach" in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (Association for Computing Machinery, New York, NY, 2015), pp. 31–34.
30. A. Mazières, T. Menezes, C. Roth, Computational appraisal of gender representativeness in popular movies. Humanit. Soc. Sci. Commun. 8, 1–9 (2021).
31. R. Aljundi, P. Chakravarty, T. Tuytelaars, "Who's that actor? Automatic labelling of actors in TV series starting from IMDB images" in Proceedings of ACCV 2016: 13th Asian Conference on Computer Vision (Springer-Verlag, Berlin, 2017), pp. 467–483.
32. P. Vicol, M. Tapaswi, L. Castrejon, S. Fidler, "MovieGraphs: Towards understanding human-centric situations from videos" in Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE Computer Society, Los Alamitos, CA, 2018), pp. 8581–8590.
33. A. Nagrani, A. Zisserman, "From Benedict Cumberbatch to Sherlock Holmes: Character identification in TV series without a script" in British Machine Vision Conference (The British Machine Vision Association and Society for Pattern Recognition, 2017).
34. M. Bain, A. Nagrani, A. Brown, A. Zisserman, "Condensed movies: Story based retrieval with contextual embeddings" in Proceedings of the 15th Asian Conference on Computer Vision (ACCV), H. Ishikawa, C. Liu, T. Pajdla, J. Shi, Eds. (Springer, Cham, 2020), pp. 460–479.
35. S. Fu, H. He, Z. G. Hou, Learning race from face: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 36, 2483–2509 (2014).
36. K. Kärkkäinen, J. Joo, "FairFace: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation" in Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) (IEEE, 2021).
37. S. U. Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (New York University Press, 2018).
38. C. O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (Crown Books, 2016).
39. R. Benjamin, Race After Technology: Abolitionist Tools for the New Jim Code (John Wiley & Sons, 2019).
40. M. K. Scheuerman, K. Wade, C. Lustig, J. R. Brubaker, How we've taught algorithms to see identity: Constructing race and gender in image databases for facial analysis. Proc. ACM Hum. Comput. Interact. 4, 1–35 (2020).
41. Z. Khan, Y. Fu, "One label, one billion faces: Usage and consistency of racial categories in computer vision" in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Association for Computing Machinery, New York, NY, 2021), pp. 587–597.
42. S. Benthall, B. D. Haynes, "Racial categories in machine learning" in Proceedings of the Conference on Fairness, Accountability, and Transparency (Association for Computing Machinery, New York, NY, 2019), pp. 289–298.
43. O. Keyes, The misgendering machines: Trans/HCI implications of automatic gender recognition. Proc. ACM Hum. Comput. Interact. 2, 1–22 (2018).
44. A. Hanna, E. Denton, A. Smart, J. Smith-Loud, "Towards a critical race methodology in algorithmic fairness" in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Association for Computing Machinery, New York, NY, 2020), pp. 501–512.
45. L. F. Klein, Dimensions of scale: Invisible labor, editorial work, and the future of quantitative literary studies. PMLA 135, 23–39 (2020).
46. R. J. So, Redlining Culture: A Data History of Racial Inequality and Postwar Fiction (Columbia University Press, 2021).
47. T. Underwood, D. Bamman, S. Lee, The transformation of gender in English-language fiction. J. Cult. Anal. 3, 1–25 (2018).
48. E. Berman, See the entire history of the Oscars diversity problem in one chart (2016). https://labs.time.com/story/oscars-diversity/. Accessed 18 October 2024.
49. J. B. Michel et al., Quantitative analysis of culture using millions of digitized books. Science 331, 176–182 (2011).
50. A. Cooper, F. Nascimento, D. Francis, Exploring film language with a digital analysis tool: The case of Kinolab. Digit. Hum. Q. 15, 1 (2021).
51. W. Wu, H. Peng, S. Yu, YuNet: A tiny millisecond-level face detector. Mach. Intell. Res. 20, 656–665 (2023).
52. E. Bochinski, V. Eiselein, T. Sikora, "High-speed tracking-by-detection without using image information" in Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (IEEE, Piscataway, NJ, 2017), pp. 1–6.
53. J. Guo, J. Deng, InsightFace: 2D and 3D face analysis project (2019). https://github.com/deepinsight/insightface. Accessed 18 October 2024.
54. Z. Zhu et al., "WebFace260M: A benchmark unveiling the power of million-scale deep face recognition" in Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE Computer Society, Los Alamitos, CA, 2021), pp. 10492–10502.
55. M. Donnelly, Oscar-nominated 'Umbrella Academy' star Elliot Page announces he is transgender (2020). https://variety.com/2020/film/news/elliot-page-transgender-ellen-page-juno-umbrella-academy-1234843023/. Accessed 18 October 2024.
56. D. Bamman, Data and code to support "Measuring diversity in Hollywood through the large-scale computational analysis of film." GitHub. https://github.com/dbamman/movie-representation. Deposited 19 October 2024.


Supplementary Materials

Appendix 01 (PDF)


