AI assistance: Drafted with AI assistance and edited by Auburn AI editorial.
History LLMs: 7 Essential Facts About Models Trained on Pre-1913 Texts
Picture a library that contains every book published before the ink of the twentieth century had fully dried — Dickens, Darwin, Thucydides, the Federalist Papers, Ibn Battuta’s travel accounts, the Domesday Book transcriptions, every newspaper filed before the guns of August 1914 changed the world’s vocabulary forever. Now imagine feeding the entirety of that library into a machine designed to understand and generate language. That is, roughly speaking, what a growing number of researchers and digital humanities projects are attempting to build: history LLMs, models trained exclusively on pre-1913 texts. The implications for historians, archivists, educators, and ordinary readers are only beginning to come into focus.
Why 1913? The Copyright Cliff and the History LLMs Trained Before It
The year 1913 is not arbitrary. It sits comfortably inside a legal boundary that shapes almost everything about how AI systems access written knowledge.
In the United States, works published before January 1, 1928 had entered the public domain as of 2023, and that line advances by one year each January. Many other jurisdictions, Canada among them, measure the copyright term from the author’s death rather than from the publication date, so the precise cutoff varies by country and publication circumstances. For practical purposes, researchers building history LLMs tend to treat roughly 1913 as a working horizon. Texts published up to that point are overwhelmingly free to digitize, redistribute, and use as training data without licensing negotiations, takedown notices, or the legal fog that surrounds more recent material.
That legal clarity is enormously valuable. Training a large language model requires vast quantities of text: billions of tokens, which at a few hundred words per page translates to millions of printed pages. Assembling that volume of in-copyright material without infringing is, practically speaking, nearly impossible for any project without the resources of a major technology corporation. Pre-1913 material, by contrast, is available through Project Gutenberg, the Internet Archive, HathiTrust, Google Books’ public domain corpus, and dozens of national library digitization programs.
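A rough conversion between tokens and printed pages can be sketched as follows. The figures of 400 words per page and 1.3 tokens per word are illustrative assumptions, not measurements; real values vary with typography, language, and tokenizer.

```python
def tokens_to_pages(tokens, words_per_page=400, tokens_per_word=1.3):
    """Estimate how many printed pages a token count corresponds to.

    Both defaults are assumed round numbers chosen for illustration.
    """
    tokens_per_page = words_per_page * tokens_per_word
    return tokens / tokens_per_page


# Example: a hypothetical ten-billion-token corpus.
pages = tokens_to_pages(10_000_000_000)
print(f"{pages:,.0f} pages")
```

Under these assumptions, ten billion tokens corresponds to roughly nineteen million pages, which gives a sense of how much digitized material a project has to gather.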
But the copyright argument is only part of the story. The more compelling case for a pre-1913 cutoff is intellectual. The texts produced before the First World War represent a fundamentally different epistemic world. The vocabulary of totalitarian politics and of nuclear physics does not exist yet, and Freudian psychology has only begun to reach English-language readers. A model trained on pre-1913 text has never encountered the word “genocide,” which was not coined until 1944, has no schema for “fascism,” and understands “wireless” to mean telegraphy, not Wi-Fi. That linguistic and conceptual purity (if purity is even the right word) makes these models unusually useful for specific historical research tasks.
What surprised us when researching this was how much the 1913 cutoff also functions as a proxy for a shift in prose style. Writing from before the First World War tends to be denser, more formally structured, and more reliant on classical allusion. Training a model on that corpus produces something that reads and reasons in a noticeably different register than a general-purpose LLM trained on Reddit posts and Wikipedia.
For readers wanting to understand the broader history of how language and knowledge have been organized, The Information by James Gleick offers a superb account of how humans have stored and transmitted knowledge across centuries.
What Goes Into the Corpus: History LLMs Trained on the Full Sweep of Human Writing
Building a pre-1913 training corpus is not simply a matter of downloading Project Gutenberg and calling it done. The decisions made about what to include shape every output the model will ever produce, in ways that are not always visible to end users.
The obvious sources are well-known literary and philosophical texts: Homer, Chaucer, Shakespeare, Montaigne, Voltaire, Jane Austen, Karl Marx, Charles Darwin, Frederick Douglass. These are heavily digitized and relatively clean in terms of OCR quality. But a model trained only on canonical Western literature would have a distorted view of pre-1913 human thought. Serious projects therefore work to include legal documents — court records, parliamentary debates, colonial administrative files. They include scientific journals: the Philosophical Transactions of the Royal Society stretch back to 1665. They include newspapers, diaries, letters, and government reports.
The geographic scope matters enormously. Arabic manuscripts translated into English, Chinese philosophical texts, Indigenous oral traditions recorded by ethnographers in the nineteenth century, Sanskrit epics in Victorian translation — all of these represent pre-1913 human thought, even if the digitized English-language versions carry their own translation biases. A history LLM trained only on texts originally written in English will have a profoundly Anglophone worldview baked into its weights.
There is also the question of OCR noise. Much of the pre-1913 digitized text available today was scanned from physical books and processed by optical character recognition software of varying quality. An 1847 newspaper column might contain dozens of scanning errors: “tlie” instead of “the,” “rn” misread as “m,” entire lines dropped or duplicated. Training on noisy text produces models that are more tolerant of irregular input, including genuine historical spelling variation, which is actually useful for researchers working with original documents. But it also means the model may reproduce scanning errors with confident fluency, which is a subtler problem.
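One common way to reduce this kind of noise before training is a dictionary-guided repair pass. The sketch below is a minimal illustration, not a description of any particular project’s pipeline: the confusion table and the tiny vocabulary are hypothetical, and a real cleaner would need a period-appropriate word list and far more care around proper nouns.

```python
import re

# Hypothetical confusion table: (what OCR produced, what was likely printed).
CONFUSIONS = [("li", "h"), ("m", "rn"), ("rn", "m")]


def candidates(word):
    """Yield variants of `word` with one confusion applied at one position."""
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def repair_token(word, vocab):
    """Return a corrected word if a single substitution lands in `vocab`."""
    lowered = word.lower()
    if lowered in vocab:
        return word
    for cand in candidates(lowered):
        if cand in vocab:
            # Preserve an initial capital from the original token.
            return cand.capitalize() if word[0].isupper() else cand
    return word  # No confident fix: leave the token untouched.


def repair_line(line, vocab):
    """Apply `repair_token` to every alphabetic token in a line."""
    return re.sub(r"[A-Za-z]+", lambda m: repair_token(m.group(0), vocab), line)


# Tiny illustrative vocabulary; a real one would come from clean period text.
period_vocab = {"the", "modern", "age"}
print(repair_line("Tlie modem age", period_vocab))
```

Leaving unknown tokens untouched is a deliberate choice here: a token outside the vocabulary may be a legitimate historical spelling rather than a scanning error.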
Military history, legal history, and economic history are all well-represented in pre-1913 digitized archives. Social history — the lives of working people, women, enslaved people, colonized populations — is systematically underrepresented, because those lives generated fewer printed documents that were subsequently preserved and digitized. A history LLM trained on this corpus will know considerably more about the deliberations of the British Parliament than about the daily experience of a mill worker in Bradford or an indentured laborer in Natal. That is not a technical problem. It is a historical one, and it mirrors the silences in the archive itself.
The emerging field of digital humanities has spent two decades wrestling with exactly these questions of archival bias, and the best history LLM projects draw directly on that scholarship.
Why History LLMs Trained on Old Text Actually Matter for Research Today
The practical applications are more grounded than the hype around general AI tends to suggest.
Paleographers — scholars who study historical handwriting — are already using models trained on historical text to assist with transcription. A model that has been trained extensively on nineteenth-century English prose is better positioned to interpret an ambiguous word in an 1860 letter than a general-purpose model whose training skews heavily toward contemporary internet language. The model’s prior probability for what a word might be in context is calibrated to the right era.
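The idea of an era-calibrated prior can be illustrated with a toy stand-in. The sketch below uses a smoothed bigram model in place of a full language model, and the training sentence and candidate readings are invented for the example; a real history LLM would supply far richer context-dependent probabilities.

```python
from collections import Counter


def train_bigrams(text):
    """Count word pairs and single words in a whitespace-tokenized sample."""
    words = text.lower().split()
    return Counter(zip(words, words[1:])), Counter(words)


def rank_readings(prev_word, readings, bigrams, unigrams):
    """Order candidate readings by add-one smoothed P(reading | prev_word)."""
    vocab_size = len(unigrams)

    def score(reading):
        return (bigrams[(prev_word, reading)] + 1) / (unigrams[prev_word] + vocab_size)

    return sorted(readings, key=score, reverse=True)


# Invented period-style sample text, for illustration only.
sample = "the wireless telegraph office sent the message by wireless telegraph"
bigrams, unigrams = train_bigrams(sample)

# An ambiguous word following "wireless" in a hypothetical 1860s letter.
print(rank_readings("wireless", ["network", "telegraph", "telephone"], bigrams, unigrams))
```

Because the sample only ever pairs “wireless” with “telegraph,” that reading ranks first; a model trained on contemporary text would carry very different expectations.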
Historians working with large document collections — say, the complete correspondence of a Victorian government ministry, or the full run of a colonial newspaper — can use history LLMs to perform what researchers call “distant reading”: identifying patterns, themes, and shifts across thousands of documents that no single human scholar could read in a career. This is not a replacement for close reading. It is a different kind of tool, one that works at scale.
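At its simplest, distant reading can mean counting how often a term appears over time. The following sketch assumes a hypothetical collection of (year, text) pairs and reports a term’s frequency per thousand words in each decade; it stands in for the far more sophisticated pattern-finding a history LLM could support.

```python
import re
from collections import defaultdict


def term_rate_by_decade(docs, term):
    """Return {decade: occurrences of `term` per 1,000 words}.

    `docs` is an iterable of (year, text) pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    target = term.lower()
    for year, text in docs:
        decade = (year // 10) * 10
        words = re.findall(r"[a-z]+", text.lower())
        totals[decade] += len(words)
        hits[decade] += words.count(target)
    return {d: 1000 * hits[d] / totals[d] for d in sorted(totals) if totals[d]}


# Invented miniature collection, for illustration only.
docs = [
    (1845, "The telegraph line opened."),
    (1848, "The harvest failed."),
    (1872, "Telegraph, telegraph, news!"),
]
for decade, rate in term_rate_by_decade(docs, "telegraph").items():
    print(decade, round(rate, 1))
```

Normalizing by total word count matters because surviving material is rarely spread evenly across decades, so raw counts would mostly track how much text exists.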
Translation assistance is another genuine use case. Pre-1913 texts exist in dozens of languages, many of them in forms that differ substantially from their modern equivalents. Middle English, Early Modern German, Ottoman Turkish, Classical Arabic — a model trained specifically on these corpora can assist translators in ways that a model trained primarily on contemporary text cannot.
There is also a more speculative but intellectually honest application: using history LLMs to model how a writer from a specific era might have framed a question. This is not about generating fake historical documents. It is about using the model as a kind of calibrated historical imagination — asking, in effect, how would the conceptual vocabulary available in 1850 have shaped the way this problem was understood? Our reading of the sources suggests this is where some of the most careful digital humanities work is happening, even if it rarely makes headlines.
For a grounding in how historians actually use quantitative and computational methods, Distant Horizons by Ted Underwood is the clearest introduction available.
Lesser-Known Facts and Myths About Pre-1913 AI Training Corpora
Several assumptions about history LLMs circulate in online discussions that are worth examining carefully.
The first is that pre-1913 text is somehow more neutral or objective than modern text. It is not. Victorian-era texts are saturated with the racial theories, imperial assumptions, and class prejudices of their time. A model trained on nineteenth-century anthropology will have absorbed a great deal of scientific racism. A model trained on colonial administrative records will have absorbed the logic of empire. “Old” does not mean “unbiased.” It means differently biased, in ways that researchers need to understand explicitly.
The second myth is that these models can read manuscripts. Digitized pre-1913 text is overwhelmingly typeset printed material. Handwritten documents require specialized handwritten text recognition systems, which are a separate technical problem. A history LLM trained on Gutenberg texts has never “seen” a manuscript page; at most it can help choose among candidate readings once a person or a recognition system has proposed them.
A third misconception is that the pre-1913 corpus is static and well-defined. In reality, digitization is ongoing. HathiTrust adds volumes regularly. National libraries in France, Germany, Spain, and elsewhere continue to upload material. The corpus a model was trained on in 2023 is different from the corpus available in 2026. This matters for reproducibility: two models trained on “pre-1913 public domain text” at different times may have meaningfully different knowledge bases.
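One way a project might make its corpus snapshot verifiable is to publish a fingerprint of exactly which documents went into training. The sketch below is a minimal illustration using SHA-256 from the Python standard library; the document identifiers are hypothetical, and a real manifest would also record sources, retrieval dates, and preprocessing steps.

```python
import hashlib


def corpus_fingerprint(documents):
    """Hash a {doc_id: text} mapping into one order-independent hex digest."""
    digest = hashlib.sha256()
    for doc_id in sorted(documents):  # Sorting makes insertion order irrelevant.
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\0")  # Separator between the id and the content hash.
        digest.update(hashlib.sha256(documents[doc_id].encode("utf-8")).digest())
    return digest.hexdigest()


# Hypothetical snapshots taken at two different times.
earlier = {"gutenberg-1": "text one", "hathi-2": "text two"}
later = dict(earlier, **{"ia-3": "text three"})

print(corpus_fingerprint(earlier) == corpus_fingerprint(later))
```

Two models whose fingerprints differ were trained on different text, however similar their descriptions sound.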
One genuinely underappreciated fact is the sheer linguistic diversity of the pre-1913 public domain. The corpus includes texts in Welsh, Nahuatl, Swahili, Tamil, and dozens of other languages, many of them first recorded or printed by missionary organizations or colonial linguistic surveys. The political context of that record-making is complicated, but the linguistic data exists.
The history of archives and libraries is itself a story worth knowing before drawing conclusions about what any corpus does or does not contain.
The Legacy of Pre-1913 Texts and What History LLMs Carry Forward
Every language model is, in a meaningful sense, a compressed representation of the texts it was trained on. A history LLM trained on pre-1913 material carries forward the assumptions, the gaps, the rhetorical habits, and the worldviews of the people who produced those texts — which is to say, disproportionately literate, male, European, and relatively wealthy people, with important exceptions.
That is not a reason to abandon these models. It is a reason to use them with the same critical awareness that historians bring to any primary source. A Victorian encyclopedia is not discarded because it contains outdated science and colonial geography. It is read carefully, with knowledge of its context. The same discipline applies here.
What these models do offer is something genuinely useful: a form of linguistic and conceptual immersion in the pre-1913 world that general-purpose AI cannot provide. Ask a standard LLM to explain how a Victorian physician understood germ theory in 1870, and it will answer from a perspective that already knows how the story ends. A model trained exclusively on pre-1870 medical texts will answer differently: not better in every respect, but calibrated to the actual epistemic state of that moment.
For historians, archivists, educators building period-accurate materials, and writers working on historical fiction, that calibration has real value. The technology is not magic. It is a tool shaped by the choices made about what to feed it — and understanding those choices is the beginning of using it well.
Keep Exploring
History LLMs are not a finished product. They are a set of ongoing experiments, shaped by the same archival silences and interpretive choices that have always shaped historical knowledge. The models trained on pre-1913 texts are only as good as the digitization projects that produced their training data — and those projects are still incomplete, still unevenly funded, and still making choices about whose texts get preserved and whose do not.
If you found this piece useful, explore our coverage of digital humanities tools for historical research and the long history of how libraries have decided what to keep. Both threads connect directly to the questions raised here. The technology is new. The underlying problem — who controls the archive — is very old indeed.
The most honest thing that can be said about history LLMs is that they are mirrors: they show us the texts we chose to preserve, in the languages we chose to digitize, with all the priorities and blind spots those choices encode.
– Auburn AI editorial