Proceedings of the National Academy of Sciences of the United States of America. 2025 Apr 30;122(18):e2508428122. doi: 10.1073/pnas.2508428122

Historians use data science to mine the past

Carolyn Beans
PMCID: PMC12067229  PMID: 40305052

In 2006, historian Jo Guldi recalls, Google Books was nothing less than a revelation. The company had started digitizing millions of books from university and public libraries. With a single click, Guldi, then a PhD candidate, mined this vast collection for written records of a little-known 18th-century Scottish landowner named John Sinclair, the subject of her research. The result, she recalls: 500 documents.


With data science tools in hand, historians can ask questions that would be difficult to answer with a lifetime of perusing tomes or scrolling through microfiche. Image credit: Shutterstock/BPTU.

Those millions of volumes of digitized texts offered more than the opportunity for a simple word search. Guldi realized it was possible to mine texts for patterns and relationships among people and places. She knew her field was forever changed.

Historians today can perform computational tasks as they dig for insights—using, for example, computational models to reveal how often two words appear together in texts, launching network analyses to link individuals who appear in the same documents, or training computer vision models to recognize key features on digitized maps.

With these tools in hand, historians ask questions that would be difficult to answer with a lifetime of scrolling through microfiche. When and where did the originators of the movement against government expansion in the United States first meet? Who led the spy group that sent intelligence to the United States from Cuba in the late 1800s?

But practitioners of this field, known as digital history, understand that they are working with incomplete data. Most records have yet to be digitized, and many of those that have been digitized contain biases. And there are dangers in mining words from decades past without a deep understanding of their context.

The field of history, therefore, has approached data science with a wary eye. “The bar for historical proof in history is as high as we can go,” says Guldi, now a historian and data scientist at Emory University in Atlanta, Georgia. “Getting the discipline to a place where we can believe the results of the computer has been very difficult.” Yet, the promise of unearthing historical truths hidden in mounds of records makes digital history a tantalizing approach for a small, but growing, number of scholars.

Counting Words

Much of digital history is about tallying up words in texts. “We count words over time to understand changing ideas and politics,” Guldi explains. A gradual increase in the instances of a person’s name over decades of meeting minutes might show growing influence in a company. A sudden uptick in a particular phrase in US congressional records might reveal the moment a societal issue, such as climate change or desegregation, entered the political debate. Digital historians also use computational models from natural language processing to sort words and documents into categories. Or they may use statistical algorithms to reveal changes in word use. “But all of those mechanisms begin with counting words,” Guldi says.

The process starts when researchers—often computational scientists—scan letters, newspapers, or other documents and then use a machine learning tool known as optical character recognition (OCR) to help the computer recognize words. With the power of natural language processing, they teach the computer to understand which words refer to a date, for example, versus an author or a speaker, so that the text can be organized into a database.

Guldi then can begin asking questions. “I might say to the computer, ‘I’m really interested in how one decade is different than another decade,’” she says. “‘Show me all of the words that appear in the 1980s but not the 1970s.’”
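
A query like the one Guldi describes can be sketched as a set difference between the vocabularies of two decades. The snippet below is only a minimal illustration, not her actual tooling: the function names and the toy records are assumptions made up for this example, and a real corpus would need proper tokenization and OCR cleanup first.

```python
import re
from collections import defaultdict

def words_by_decade(records):
    """Map each decade (e.g. 1970) to the set of lowercase words seen in it."""
    vocab = defaultdict(set)
    for year, text in records:
        decade = (year // 10) * 10
        vocab[decade].update(re.findall(r"[a-z']+", text.lower()))
    return vocab

def new_words(records, later, earlier):
    """Words that appear in the `later` decade but never in the `earlier` one."""
    vocab = words_by_decade(records)
    return sorted(vocab[later] - vocab[earlier])

# Toy (year, text) records invented for illustration only.
records = [
    (1974, "the pipeline debate continues"),
    (1978, "debate over the budget"),
    (1983, "the budget and the internet"),
    (1988, "debate over the deficit"),
]

print(new_words(records, later=1980, earlier=1970))
# prints ['and', 'deficit', 'internet']
```

Note that the raw difference surfaces filler words such as "and" alongside meaningful ones, which is one reason historians return to the source documents before drawing conclusions.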

Digital historians often return to original documents to confirm that words plucked from texts without context mean what the scholar thinks they do. Computer algorithms can make this fact-checking easier by finding quotes that contain the word. Historians can ask AI to summarize what’s happening in a document, along with the text snippets supporting its conclusion. But large language models, like the sort that power ChatGPT, sometimes invent these snippets, Guldi says. “We don’t take the computer’s word for it.”


By combining geospatial railway data with census data, historians explored the occupations of people living at different distances from tracks. The research, which used MapReader to train computer vision models to recognize features on digitized maps, shows that there were job opportunity upsides to living close to railroads. This map shows railspace (red squares) in east and southeast London circa 1900. Image credit: Katherine McDonough. Reprinted from ref. 7, which is licensed under CC BY.

Tracing a Trend

So, how can patterns of word choice elucidate societal trends? In a study reported in her 2023 book The Dangerous Art of Text Mining: A Methodology for Digital History, Guldi explored the beginnings of the anti-environmentalist movement in the US Congress by mining a database of congressional debates (1). Historians knew that bipartisan support for environmental causes ended with debates over building the Trans-Alaska Pipeline in the early 1970s. Guldi wanted to reveal the changes in word choice at the heart of the anti-environmentalist movement, as well as the number and identity of actors involved.

She directed a computational model to list the words that occur alongside the word “environmentalist” within five-year blocks from 1970 through 2009. She learned, for example, that “radical environmentalist” was first said four times in the window from 1970 to 1974. From 2005 through 2009, it was spoken 83 times. Over the decades, other disparaging descriptors followed, including “militant” and “rabid.”
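
The kind of tally described here can be approximated by counting the word that immediately precedes "environmentalist" and grouping the counts into five-year blocks. This is a simplified sketch under stated assumptions, not the model Guldi used: the helper names, the choice to look only at the preceding word, and the sample speeches are all hypothetical.

```python
import re
from collections import Counter, defaultdict

def block_start(year, first_year=1970, width=5):
    """Return the first year of the fixed-width block containing `year`."""
    return first_year + ((year - first_year) // width) * width

def preceding_words(speeches, target="environmentalist"):
    """Count the word immediately before `target`, per five-year block."""
    counts = defaultdict(Counter)
    for year, text in speeches:
        tokens = re.findall(r"[a-z]+", text.lower())
        for prev, word in zip(tokens, tokens[1:]):
            # Treat the plural form as the same term.
            if word in (target, target + "s"):
                counts[block_start(year)][prev] += 1
    return counts

# Invented sample speeches: (year, text). Not real congressional records.
speeches = [
    (1972, "A radical environmentalist opposed the pipeline."),
    (1973, "Every environmentalist I know supports the bill."),
    (2006, "Radical environmentalists and militant environmentalists blocked it."),
    (2008, "The radical environmentalist agenda."),
]

counts = preceding_words(speeches)
print(counts[1970]["radical"], counts[2005]["radical"])  # prints 1 2
```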

Yet, over four decades, only six politicians were responsible for 90% of the utterances of anti-environmentalist phrases. Such results, she says, help her and others “gain insight about how cultural and political shifts happen.”
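
A concentration figure of this kind comes down to simple arithmetic: count utterances per speaker, then divide the total for the most frequent speakers by the overall total. The sketch below uses made-up speaker labels and counts purely to show the calculation; it does not reproduce the study's data.

```python
from collections import Counter

def top_speaker_share(utterances, top_n):
    """Fraction of all utterances made by the `top_n` most frequent speakers."""
    per_speaker = Counter(utterances)
    if not per_speaker:
        return 0.0
    top_total = sum(count for _, count in per_speaker.most_common(top_n))
    return top_total / sum(per_speaker.values())

# Hypothetical data: one entry per utterance, labeled by speaker.
utterances = ["A"] * 5 + ["B"] * 4 + ["C"] * 1

share = top_speaker_share(utterances, top_n=2)
print(share)  # prints 0.9, since speakers A and B made 9 of the 10 utterances
```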

Making Connections

Another powerful tool, known as a network analysis, can help historians tease out who knew whom, and when, and perhaps offer clues as to how these relationships spurred change. “Everyone’s always trying to understand influence,” says Peter Roady, a historian at the University of Utah in Salt Lake City. But how can a historian learn more about unrecognized movers and shakers without names to go after?

Roady wanted to uncover the players behind the birth of a conservative movement aimed at limiting government expansion in the United States. He created a database of digitized records from early conservative organizations, such as letters and board meeting minutes. He then fed his database into a network analysis tool that draws direct lines between individuals and organizations that appear in the same documents, placing them within a larger web of interactions.
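
The core idea of such a tool, linking names that appear in the same document, can be sketched in a few lines. This is an illustrative approximation only, not the software Roady used: the function names, document labels, and surnames below are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(documents):
    """Count, for each pair of names, how many documents mention both."""
    edges = Counter()
    for names in documents.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges

def degree(edges):
    """Number of distinct neighbours each name is linked to."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {name: len(linked) for name, linked in neighbours.items()}

# Invented documents and names, for illustration only.
documents = {
    "minutes_1": ["Adams", "Baker", "Clark"],
    "letter_1": ["Adams", "Baker"],
    "letter_2": ["Adams", "Davis"],
}

edges = cooccurrence_edges(documents)
print(edges[("Adams", "Baker")])  # prints 2
print(degree(edges)["Adams"])     # prints 3
```

In this toy network, "Adams" has the most connections, which is the sort of signal that might prompt a historian to go back to the archive and look more closely at that person.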

The analysis, reported in 2023, suggests that the main authors of this conservative movement were interacting by the late 1920s and exerting influence on American politics by the 1940s, more than 10 years earlier than historians had previously dated the beginnings of the movement’s influence (2). One previously unrecognized figure, Colby Chester, then chairman of General Foods, was well connected. “He was a hidden but very important figure behind the scenes within the American conservative movement,” Roady says.

Proceed with Caution

The data science approach, of course, does have limitations. If a historian isn’t careful, what they gain in breadth can be lost in depth—the deep understanding of a record’s position in place and time.

In 2016, Lara Putnam of the University of Pittsburgh in Pennsylvania became one of the earliest historians to sound the alarm about how mining for records through online search engines and databases was transforming the field in ways that were not all positive (3). Today, she still worries that these quick searches risk presenting historians with data they lack the context to interpret. But she endorses historians who go beyond simple record searches to adopt more complex tools from data science knowingly. “[They] are generally doing so with well-honed caution about source origins and the risks of decontextualization,” she says, noting that Guldi “really exemplifies the mindful use of computational approaches.”

One challenge is that databases to which scholars turn with these new methods may be incomplete in ways that aren’t obvious. For example, a newspaper might be included in a database, but if OCR didn’t capture its pages properly, the text wouldn’t be searchable.

Case in point: In 2023, research led by historians at The Alan Turing Institute in London, United Kingdom, revealed that searching the British Library’s digitized newspaper collection for information about life during the Industrial Revolution would return politically biased results. The reason: OCR was better at reading the fonts favored by more expensive and conservative papers than those used by less expensive, liberal ones (4).

Even when OCR works as planned, databases only capture a sliver of archival records worldwide, though those numbers are steadily growing, Guldi says. In the United States, as of 2023, the Library of Congress had digitized more than 9 million records, including newspapers, manuscripts, and the personal papers of every president from George Washington to Calvin Coolidge (5).


A network analysis of digitized records from US conservative organizations in the 1920s through the 1950s shows that a previously unrecognized figure, Colby Chester, was at the center of early efforts to shrink government expansion. Image credit: Peter Roady.

Tailored Tools

Altering the computational tools could make it easier for historians to ask questions and ground the answers in historical context. Many network analysis tools were built to identify marketable information, such as which products an Amazon customer might like, says Kalani Craig, a digital historian at Indiana University (IU) Bloomington. Her team recently used a network tool to identify the leader of the Pinkerton National Detective Agency’s spy group that sent intelligence information to the United States from Cuba in the late 19th century. To do so, the team fed agency reports into the tool to create a network of names. Lines connecting pairs of names showed that those names appeared together in a report. The results, not yet published, revealed one unfamiliar name that only linked to a single other spy, who himself had many other connections. IU Bloomington colleague Arlene Díaz, a historian of Latin America and the Caribbean, then went back to the archives looking specifically for the unknown figure. She found a letter suggesting that he was, in fact, the head of the group. “He ended up being the spy ring master,” Craig says. “That one tiny little node.”

But this process was difficult, in part because the network tool only allowed Craig to include a single name for each individual. It recognized a “Mr.” followed by a surname as a completely different person from the complete first and last name. Nicknames, too, were problematic. At times, the spies referred to a diplomat they regularly shadowed by his last name, Congosta. At other times, they used their own nickname for him, Langosta, the Spanish word for lobster. To the off-the-shelf network tool she was using, these names were seen as two separate people. The workaround required Craig to modify her database.

To address this challenge, Craig recently designed her own network analysis software, called Net.Create, in collaboration with educational researchers, software developers, and other digital historians. The program, not yet published, accommodates multiple names for a single individual. Net.Create also includes a required citation field, so users can move from a connection between two individuals to the original document that links them, providing key context beyond a single line connecting dots.
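
The two ideas described above, several names for one person and a citation attached to every connection, can be illustrated with a small sketch. It is not Net.Create's code or data model, which the article does not detail: the alias table, function names, report labels, and the "Agent X" figure are assumptions invented for this example. Only the Congosta/Langosta pairing comes from the spy reports discussed above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical alias table: each variant (lowercased) maps to one canonical name.
ALIASES = {
    "congosta": "Congosta",
    "mr. congosta": "Congosta",
    "langosta": "Congosta",  # the spies' nickname for the same diplomat
}

def canonical(name):
    """Resolve a name variant to its canonical form, if one is known."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, name.strip())

def edges_with_citations(reports):
    """Link people named in the same report, keeping the source of each link."""
    edges = defaultdict(list)
    for citation, names in reports:
        people = sorted({canonical(n) for n in names})
        for pair in combinations(people, 2):
            edges[pair].append(citation)
    return edges

# Invented reports: (citation, names mentioned).
reports = [
    ("Report 1", ["Langosta", "Agent X"]),
    ("Report 2", ["Mr. Congosta", "Agent X"]),
]

edges = edges_with_citations(reports)
print(edges[("Agent X", "Congosta")])  # prints ['Report 1', 'Report 2']
```

Without the alias table, the two reports would produce two separate nodes for the same diplomat, and the link to "Agent X" would look half as strong.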

More Than Words

Because historians work with far more than words, Katherine McDonough, a historian at Lancaster University in the United Kingdom and senior research fellow at The Alan Turing Institute, collaborates with scientists to build software that gathers data from digitized maps. Her first project enables historians to train computer vision models, which are, in this case, a form of AI that classifies data in images, to identify a specific feature on a map, such as tree cover or farmland (6).

The program first divides maps into grid spaces called patches. A researcher then views individual patches randomly selected across many maps. For each patch, they mark whether the feature of interest is present or absent. The process can entail marking thousands of these patches. But they can then use these hand-annotated patches to train the computer to recognize the feature on any other map it encounters. The software, known as MapReader, was developed as part of the Living with Machines project, a five-year British initiative launched in 2018 that brought historians and scientists together from a variety of universities and institutes to envision new ways to use digitized archives to understand life during the Industrial Revolution. McDonough and colleagues have since developed software allowing MapReader to read text on maps (7).
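
The patch-and-annotate workflow can be sketched without any image library by treating each map as a grid of pixels and computing patch boundaries. This is a conceptual illustration only and does not use MapReader's actual API: the function names, map identifiers, and dimensions are hypothetical, and training the vision model itself is left out.

```python
import random

def split_into_patches(height, width, patch_size):
    """Return (top, left, bottom, right) boxes tiling a height x width map."""
    boxes = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Edge patches are clipped to the map boundary.
            boxes.append((top, left,
                          min(top + patch_size, height),
                          min(left + patch_size, width)))
    return boxes

def sample_for_annotation(maps, patch_size, k, seed=0):
    """Randomly pick up to k patches across all maps for hand labeling."""
    rng = random.Random(seed)
    all_patches = [
        (map_id, box)
        for map_id, (height, width) in maps.items()
        for box in split_into_patches(height, width, patch_size)
    ]
    return rng.sample(all_patches, min(k, len(all_patches)))

# Invented map sheets: id -> (height, width) in pixels.
maps = {"sheet_a": (200, 300), "sheet_b": (100, 100)}

patches = split_into_patches(200, 300, 100)
to_label = sample_for_annotation(maps, patch_size=100, k=3)
print(len(patches))  # prints 6
```

Each sampled patch would then be shown to a researcher, who marks the feature as present or absent; those labels become the training data for the classifier.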

McDonough’s team first used MapReader to recognize railroad infrastructure on maps of Great Britain from the mid-19th through 20th centuries (6). By combining these geospatial railway data with census data, they could explore the occupations of people living at different distances from tracks. One might have expected that the advent of trains would be entirely detrimental to communities along tracks but far from stations, McDonough says. Tracks, after all, brought noise and pollution. But the research shows that there were also some upsides in proximity to employment opportunities. People living near “railspace” often worked as railway inspectors, signalmen, or railway coach and wagon makers, or held other train-related jobs. The study’s approach offers a new look at “occupations and economic classes that have thus far been excluded from any positive narrative about the impact of rail,” McDonough says.

None of these tools need replace a historian’s traditional path to knowledge: a well-read eye on hard-won records. But Roady hopes that more historians will see the potential in viewing their research as data. He’d also like to see historians pool those data with colleagues to reach even deeper insights.

“Libraries are full of books written by historians making an argument that such and such organization, such and such person, was really important. But they’re often doing that based on an understandably limited amount of data,” Roady says. “The hope is that if you get a big aggregate dataset, you’ll be able to allow folks to feel more confident in the hypotheses they put forward.”

References

  • 1. Guldi J., The Dangerous Art of Text Mining: A Methodology for Digital History (Cambridge University Press, 2023).
  • 2. Roady P., Selling selective anti-statism: The conservative persuasion campaign and the transformation of American politics since the 1920s. Mod. Am. Hist. 6, 21–43 (2023).
  • 3. Putnam L., The transnational and the text-searchable: Digitized sources and the shadows they cast. Am. Hist. Rev. 121, 377–402 (2016).
  • 4. The Alan Turing Institute, “Living with machines: The environmental scan” (video recording, 2024). https://youtu.be/vTc4S3Zx9IA?si=Za2RZrdPIpf9UOTP. Accessed 17 April 2025.
  • 5. Owens T., “Library of Congress digitization strategy: 2023-2027” Library of Congress Blogs (2023). https://blogs.loc.gov/thesignal/2023/02/library-of-congress-digitization-strategy-2023-2027/. Accessed 12 March 2025.
  • 6. Rhodes J., et al., “Beyond the tracks: Re-connecting people, places and stations in the history of late-Victorian railways” in Living with Machines: Computational Histories in the Age of Industry, Ahnert R., Griffin E., & Lawrence J., Eds. (University of London, 2024).
  • 7. McDonough K., Beelen K., Wilson D. C. S., Wood R., Reading maps at a distance: Texts on maps as new historical data. Imago Mundi 76, 296–307 (2025).
