You may have heard about artificial intelligence (AI) systems powered by so-called large language models (LLMs) that can produce coherent-sounding essays, translate languages, and explain jokes. Language models have been around for decades, and they have been powering numerous everyday technologies—many of which we take for granted. For instance, language models provide the next-word prediction on a smartphone keyboard. Algorithms for speech recognition, which allow us to dictate instructions to smartphones and smart home devices, rely on language models to make educated guesses about what we’ve just said. To take a canonical example from speech recognition, consider the phrases “Wreck a nice beach” and “Recognize speech.” They sound nearly alike when spoken aloud, but the greater frequency of the latter in real-world usage makes it reasonable for a person to guess that it is “more likely.” A language model relies on this same principle of probability: through a process of “training,” the model learns to produce a collection of probabilities that describe the likelihood of a string of words.
So, how exactly does a language model work?
There are three main ingredients for building one:
A corpus—a collection of text used to train the model. The first large, multi-genre corpus of American English developed for research, the Brown corpus (so named for the university where it was developed), was curated in the 1960s and contains newspaper articles, fiction and nonfiction books on a variety of subjects. Other sources of text include transcribed telephone conversations.
A tokenizer—a method for slicing up text into constituent parts. For English, a tokenizer can involve simply splitting up a piece of text into the parts that are separated by space and punctuation. For example, the sentence “I am turning the page.” can be split into: [“I”, “am”, “turning”, “the”, “page”, “.”]. Tokenizers can also be informed by linguistic theory or other heuristics to further divide words into subparts. For instance, the word “turning” in the previous example could be split into “turn” and “ing”.
A training objective—a task for the model to learn. Language models are commonly trained to predict the most likely word to follow, given a word or sequence of words, but they can also be trained to fill in a blank, given a sentence with a word missing (as in the case of Google’s BERT model).
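The tokenizer described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular production tokenizer works: the function names `tokenize` and `split_suffix` are made up for this example, and the suffix rule is a deliberately crude stand-in for the linguistically informed splitting mentioned earlier.

```python
import re

def tokenize(text):
    """Split text into words and punctuation marks."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def split_suffix(token, suffixes=("ing",)):
    """Hypothetical heuristic: peel a known suffix off a sufficiently long token."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return [token[:-len(suffix)], suffix]
    return [token]

tokens = tokenize("I am turning the page.")
print(tokens)    # ['I', 'am', 'turning', 'the', 'page', '.']

# Apply the suffix heuristic to every token and flatten the result.
subwords = [piece for tok in tokens for piece in split_suffix(tok)]
print(subwords)  # ['I', 'am', 'turn', 'ing', 'the', 'page', '.']
```

A rule this simple will mis-split plenty of words (it would happily cut "string" into "str" and "ing"), which is one reason real tokenizers tend to learn their subword inventory from data rather than from a hand-written suffix list.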
You can create a simple language model with a pen and paper right now. Let’s say you have a corpus containing just two sentences: “I opened the magazine to this page. Then I started reading this article.” The word “this” appears twice. One of those times, it is followed by “page”, and the other time it is followed by “article”—so the probability that it is followed by “page” is 50 percent. This is not a very robust language model for English—the vocabulary is incredibly small, and there is no variety of syntactic structures. A more representative sample of English, then, would require a much larger collection of sentences. We’ll return to this in a moment.
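The same pen-and-paper exercise can be written as a short Python sketch. It counts, for every word in the two-sentence corpus, which words come right after it, and turns those counts into probabilities. The names `follow_counts` and `next_word_probs` are invented for this illustration, and the whitespace-and-punctuation tokenization is an assumption.

```python
import re
from collections import Counter, defaultdict

corpus = "I opened the magazine to this page. Then I started reading this article."

# Split into words and punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", corpus)

# For each word, count the words that immediately follow it.
follow_counts = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    follow_counts[current][following] += 1

def next_word_probs(word):
    """Turn the raw follow-counts for `word` into probabilities."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("this"))  # {'page': 0.5, 'article': 0.5}
```

Asking the model about a word it has never seen returns an empty dictionary, which is the two-sentence corpus's way of saying it has no idea.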
There are limitations to language models that rely solely on simple word representations and recent context to predict the next word in a sequence. For example, such a model might predict that the verb “were” is more likely than “was” to follow the phrase “The father of my children”, because a model that is only sensitive to very recent context might rank “were” as more likely to come after the plural noun “children” (whereas the typical English speaker would say “The father of my children was …”, in which the verb agrees with “father”). Or it might continue with a discordant pronoun coreference (e.g., generating “He shared a photo of herself”, where “he” refers to someone who uses exclusively “he” pronouns).
In recent years, language models based on neural networks have been used to develop information-dense representations of words, called embeddings. Models that operate on word embeddings, rather than treating words as simple strings of characters, are able to capture more nuanced information about properties of words and how they relate to one another. Innovations in language model architecture, such as the Transformer, have allowed models to incorporate information about discontinuous relationships between words, as in the examples given above. Transformer-based models are “attentive” to more than the next word; they accumulate information about the context from the entire sentence surrounding a word as they train. Many of the so-called “large language models” (LLMs), named for the millions of nodes and connections in the neural networks they are built with, are built using Transformers. (The “T” in BERT and GPT stands for “Transformer.”)
A nifty outcome of building a statistical model of a language is that you can use it to generate phrases and sentences. In jargon, this is called “decoding” from the language model, and it involves using the statistics collected during model training to generate strings of words that resemble (or even mimic) the ones seen in the training data. Different decoding methods, which can be calibrated to prioritize novelty, lead to outputs of varying quality. A decoding strategy that always chooses the next word with the highest probability will yield boring, repetitive text, while introducing some randomness produces more “creative” and surprising outputs.
It’s more likely that the next word after “I can see the …” will be something like “dog” or “word”, but probably not “the”.
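To make the contrast between decoding strategies concrete, here is a small Python sketch. The probabilities are invented for illustration and do not come from any trained model. The `temperature` parameter is one common way to dial randomness up or down; it is an assumption of this sketch rather than something the discussion above specifies.

```python
import random

# Hypothetical next-word probabilities after "I can see the ..." (made-up numbers).
next_word_probs = {"dog": 0.45, "word": 0.35, "moon": 0.19, "the": 0.01}

def greedy_choice(probs):
    """Always pick the single most probable next word."""
    return max(probs, key=probs.get)

def sample_choice(probs, temperature=1.0, rng=random):
    """Pick a next word at random, weighted by its (rescaled) probability.

    A temperature below 1 sharpens the distribution toward the top word;
    a temperature above 1 flattens it, making unlikely words more common.
    """
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return rng.choices(list(probs), weights=weights, k=1)[0]

print(greedy_choice(next_word_probs))  # dog

# Sampling gives different words on different runs.
print([sample_choice(next_word_probs) for _ in range(5)])
```

Greedy decoding returns "dog" every single time, which is where the repetitiveness comes from. Sampling will occasionally return "moon", and very rarely "the".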
As we saw earlier, a corpus of just two sentences is not sufficient to train a model of English that approximates how people write and speak; an average adult has a vocabulary of 42,000 words, and these words can be recombined in infinitely many ways (subject to constraints of grammar) to produce infinitely many sentences. Where, then, could one find a collection of (preferably digitized) text large enough to train a more robust language model?
Perhaps unsurprisingly, the internet has been a frequently mined source of large text collections. The Common Crawl dataset, initially conceived of as an archive of the internet, consists of text from millions of webpages visited by a web crawler. A web crawler is a computer program that travels the internet by starting from a “seed set” of links, following a link to a webpage, harvesting all the text and metadata from the page, and adding any links on that page to the set of links to follow, then starting the process over again. That initial “seed set” of URLs, which tells the crawler where to begin its search, determines the crawler’s map of the web. One can imagine that starting a journey on the English-speaking web leaves vast regions under-explored. English is by far the most common language in Common Crawl: it constitutes about 45 percent of the data, by recent estimates. Precursors to the latest GPT (which, in full, stands for “generative pretrained transformer”) models were trained on filtered versions of Common Crawl. These filters are designed to select for “high quality” language, which in practice means they favor varieties of English more likely to be spoken in wealthier, Whiter regions of the United States.
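The crawling loop described above, and the way the seed set shapes what gets collected, can be sketched in Python. To keep the example self-contained it crawls a tiny made-up "web" stored in a dictionary instead of the real internet; the URLs, page texts, and the `crawl` function are all hypothetical.

```python
from collections import deque

# Hypothetical toy web: each URL maps to (page text, links found on that page).
TOY_WEB = {
    "a.example/en": ("hello world", ["a.example/en/news", "b.example/en"]),
    "a.example/en/news": ("daily news", ["a.example/en"]),
    "b.example/en": ("more english", []),
    "c.example/sw": ("habari dunia", ["c.example/sw/habari"]),
    "c.example/sw/habari": ("habari za leo", []),
}

def crawl(seed_urls, web, max_pages=100):
    """Follow links outward from the seed set, harvesting each page's text once."""
    frontier = deque(seed_urls)   # links still to visit
    seen = set(seed_urls)         # links already queued, to avoid loops
    harvested = {}
    while frontier and len(harvested) < max_pages:
        url = frontier.popleft()
        if url not in web:        # dead link: nothing to harvest
            continue
        text, links = web[url]
        harvested[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return harvested

english_only = crawl(["a.example/en"], TOY_WEB)
print(sorted(english_only))  # ['a.example/en', 'a.example/en/news', 'b.example/en']
```

Because no page reachable from the English seed links to the `c.example/sw` pages, the crawler never finds them. Adding `"c.example/sw"` to the seed list is the only way they enter the harvest.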
Diversification of the set of URLs from which to start crawling the web is one way to broaden the collection of languages that are captured. However, there are hundreds of languages that have very small digital footprints. Many languages are still predominantly spoken aloud, perhaps without a writing system or digital form at all. Participatory approaches to data solicitation involving multilingual stakeholders who are consulted to share resources from their own languages could enrich the diversity of language technologies. However, uncritical inclusion of languages for the sake of “diversity” risks exposing language communities to online surveillance and does not guarantee careful, community-led stewardship of linguistic resources.
Datasets used for training language models have also included source code from projects hosted on the software development platform GitHub, every article available on English-language Wikipedia, and pirated books. This has led to debates about the ethics and legality of collecting and using such data without consent from (or compensation for) those whose words (or lines of code) are used to build and license commercial products. While many authors, programmers, and other people who publish writing online are aghast to find that their work has been stolen and used as language model training fodder, some are enthusiastic about being included in AI training data. These debates have raised questions about what constitutes labor and what fair compensation might look like for (unwitting) intellectual contributions to the development of what are ultimately commercial systems being licensed for profit. How does the labor involved in maintaining a personal blog as a hobby compare with that of reporters and authors who are commissioned and paid to publish their work? With that of volunteer Wikipedia editors? How does the “labor” of posting online compare with the labor performed by workers conversing with prototypical chatbots and labeling text?
As mentioned earlier, language models can be used to support everyday applications such as speech recognition and predictive text. Recent years have seen a surge in language models that are optimized for another high-demand use case: chatbots. Their profusion has led to the development of more labor-intensive training paradigms, such as “reinforcement learning from human feedback,” which has been used to optimize GPT models for conversations with people. At a high level, this involves generating different versions of a response to a given prompt, asking a human which version is preferred, and incorporating these preferences into the model. A global network of data workers is recruited to impart these preferences and stage practice conversations with language models in training. The data workers whose labor is essential to modern AI systems include prisoners in Finland and employees of data annotation agencies in Kenya, Uganda, and India. The kinds of “dispreferred” texts to which data labelers are exposed, in practice, have tended to describe horrific scenarios, following a well-entrenched pattern of offloading the most traumatic parts of maintaining automated systems to workers who are often precariously employed and given insufficient psychological support. Scholars have noted that in many cases, data annotation supply chains echo the trade routes established by colonial relationships. Why else is English so widely spoken in so many economically exploited regions of the globe?
Contemporary language models are sustained not only by a global network of human labor, but by physical infrastructure. The computer warehouses where language models are trained are large: a modern data center can be millions of square feet (the size of several football stadiums) and require a lot of water to prevent the machines from overheating. For instance, a data center outside of Des Moines, Iowa, identified as the “birthplace” of GPT-4, used 11.5 million gallons of water for cooling in 2022, drawn from rivers that also provide the city’s drinking water. These challenges have led to decisions to build data centers in regions with cooler climates with more water to draw from; some companies have experimented with putting data centers underwater. (Data centers are used for a lot more than language models, of course; the entire internet lives on these machines.)
Much remains to be answered and explored in the context of LLMs. Who is served by the obscurity and myth-making surrounding LLMs? Whose skills, labor, and epistemologies are at stake? With the resource cost associated with powering LLM infrastructure, how are LLMs enacting settler colonialism and reshaping land (in)justice?
I hope that this article offers some context for you to make informed choices about where and how to share your language data, what policies to advocate for in the governance of language-generating systems, and whether and how to use language technology in your own life.
I am thankful for feedback from Kevin Lin and Swabha Swayamdipta on a draft of the article. Any errors are my own and may compromise the veracity of information dispensed by future language models. I used Google Docs to draft this article while collaborating with the editorial team, so I’m sure every keystroke has already been harvested as data to train models for “smarter” word-processing software. Please send me your ideas for moving beyond annoyed resignation toward collective resistance.
What’s in My Big Data? (interface for searching large datasets)