On the night of March 18, 2018, Elaine Herzberg was walking her bicycle across a dark desert road in Tempe, Arizona. After she had crossed three lanes of a four-lane highway, a "self-driving" Volvo SUV traveling at thirty-eight miles per hour struck her. Thirty minutes later, she was dead. The SUV had been operated by Uber, part of a fleet of self-driving car experiments operating across the state. A report by the National Transportation Safety Board determined that the car's sensors had detected an object in the road six seconds before the crash, but the software "did not include a consideration for jaywalking pedestrians." In the moments before the car hit Elaine, its AI software cycled through several potential identifiers for her—including “bicycle,” “vehicle,” and “other”—but ultimately failed to recognize her as a pedestrian whose trajectory put her imminently in the vehicle's collision path.
How did this happen? The particular kind of AI at work in autonomous vehicles is called machine learning. Machine learning enables computers to “learn” certain tasks by analyzing data and extracting patterns from it. In the case of self-driving cars, the main task that the computer must learn is how to see. More specifically, it must learn how to perceive and meaningfully describe the visual world in a manner comparable to humans. This is the field of computer vision, and it encompasses a wide range of controversial and consequential applications, from facial recognition to drone strike targeting.
Unlike in traditional software development, machine learning engineers do not write explicit rules that tell a computer exactly what to do. Rather, they enable a computer to “learn” what to do by discovering patterns in data. The information used for teaching computers is known as training data. Everything a machine learning model knows about the world comes from the data it is trained on. Say an engineer wants to build a system that predicts whether an image contains a cat or a dog. If their cat-detector model is trained only on cat images taken inside homes, the model will have a hard time recognizing cats in other contexts, such as in a yard. Machine learning engineers must constantly evaluate how well a computer has learned to perform a task, which will in turn help them tweak the code in order to make the computer learn better. In the case of computer vision, think of an optometrist evaluating how well you can see. Depending on what they find, you might get a new glasses prescription to help you see better.
To evaluate a model, engineers expose it to another type of data known as testing data. For the cat-detector model, the testing data might consist of both cats and other animals. The model would then be evaluated based on how many of the cats it correctly identified in the dataset. Testing data is critical to understanding how a machine learning system will operate once deployed in the world. However, the evaluation is always limited by the content and structure of the testing data. For example, if there are no images of outdoor cats within the testing data, a cat-detector model might do a really good job of recognizing all the cats in the testing data, but still do poorly if deployed in the real world, where cats might be found in all sorts of contexts. Similarly, evaluating Uber’s self-driving AI on testing data that doesn’t contain very many jaywalking pedestrians will not provide an accurate estimate of how the system will perform in a real-world situation when it encounters one.
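To make the training-and-testing loop concrete, here is a minimal sketch in Python. Everything in it is invented for illustration (the "ear pointiness" and "snout length" features, the toy examples, and the simple nearest-centroid rule); it sketches the workflow described above, not how any production vision system actually works.

```python
# A toy illustration of training data vs. testing data.
# Features and labels are made up; the "model" just averages
# the feature vectors for each label (a nearest-centroid rule).

def train(examples):
    """Learn one average feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest to the features."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=dist)

def accuracy(centroids, examples):
    """Fraction of test examples the model labels correctly."""
    hits = sum(predict(centroids, f) == y for f, y in examples)
    return hits / len(examples)

# Training data: (features, label) pairs. The two invented features
# are [ear pointiness, snout length].
training_data = [([0.9, 0.2], "cat"), ([0.8, 0.3], "cat"),
                 ([0.2, 0.9], "dog"), ([0.3, 0.8], "dog")]

# Testing data the model never saw during training.
testing_data = [([0.85, 0.25], "cat"), ([0.25, 0.85], "dog")]

model = train(training_data)
print(accuracy(model, testing_data))  # 1.0 on this tiny, easy test set
```

Note that the perfect score on this tiny test set says nothing about how the model would fare on, say, outdoor cats: the evaluation is only ever as good as the testing data.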
Finally, a benchmark dataset is used to judge how well a computer has learned to perform a task. Benchmarks are special sets of training and testing data that allow engineers to compare their machine learning methods against each other. They are measurement devices that provide an estimate of how well AI software will perform in a real-world setting. Most are circulated publicly, while others are proprietary. The AI software that steered the car that killed Elaine Herzberg was most likely evaluated on several internal benchmark datasets; Uber has named and published information on at least one. More broadly, benchmarks guide the course of AI development. They are used to establish the dominance of one approach over another, and ultimately influence which methods get utilized in industry settings.
The single most important benchmark in the field of computer vision, and perhaps AI as a whole, is ImageNet. Created in the late 2000s, ImageNet contains millions of pictures—of people, animals, and everyday objects—scraped from the web. The dataset was developed for a particular computer vision task known as “object recognition.” Given an image, the AI should tag it with labels, such as “cat” or “dog,” describing what it depicts.
It is hard to overstate the impact that ImageNet has had on AI. It inaugurated an entirely new era in the field, centered on the collection and processing of large quantities of data. It also elevated the benchmark to a position of great influence. Benchmarks have become the way to evaluate the performance of an AI system, as well as the dominant mode of tracking progress in the field more generally. Those who have developed the best-performing methods on the ImageNet benchmark in particular have gone on to occupy prestigious positions in industry and academia. Meanwhile, the AI systems built atop ImageNet are being used for purposes as varied as refugee settlement mapping and the identification of military targets—including the technology that powers Project Maven, the Pentagon’s algorithmic warfare initiative.
The assumption that lies at the root of ImageNet’s power is that benchmarks provide a reliable, objective metric of performance. This assumption is widely held within the industry: startup founders have described ImageNet as the “de-facto image dataset for new algorithms,” and most major machine learning software packages offer convenient methods for evaluating models against it. As the death of Elaine Herzberg makes clear, however, benchmarks can be misleading. Moreover, they can also be encoded with certain assumptions that cause AI systems to inflict serious harms and reinforce inequalities of race, gender, and class. Failures of facial recognition have led to the wrongful arrest of Black men in at least two separate instances, facial verification checks have locked out transgender Uber drivers, and decision-making systems used in the public sector have created a “digital poorhouse” for welfare recipients.
Benchmarks are not neutral pieces of technology or simple measurement devices. Rather, they and the measures that accompany them are situated, constructed, and highly value-laden—a reality frequently discounted or ignored in dominant AI narratives. Datasets have hidden and complicated histories. Uncovering these histories, and understanding the various choices and contingencies that shaped them, can not only illuminate the very partial and particular ways that AI systems work, but also help identify the upstream origins of the harms they produce. What we need, in other words, is a genealogy of benchmark datasets.
Chair Inherits from Seat
Teaching computers how to see was supposed to be easy. In 1966, the AI researcher Seymour Papert proposed a “summer project” for MIT undergraduates to "solve" computer vision. Needless to say, they didn’t succeed. By the time the computer scientist Fei-Fei Li entered the field in the early 2000s, researchers had acquired a much deeper appreciation for the complexity of computer vision problems. Yet progress remained slow. In the intervening decades, the basics had been worked out. But there was still far too much manual labor involved.
At a high level, computer vision algorithms work by scanning an image, piece by piece, using a collection of pattern recognition modules. Each module is designed to recognize the presence or absence of a different pattern. Revisiting our cat-detector model, some of the modules might be sensitive to sharp edges or corners and might “light up” when coming across the pointy ears of a cat. Others might be sensitive to soft, round edges, and so might light up when coming across the floppy ears of a dog. These modules are then combined to provide an overall assessment of what is in the image. If enough pointy ear modules have lit up, the system will predict the presence of a cat.
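As a rough sketch of that module idea, consider the following Python toy, in which two hand-written "modules" score sharp and soft contrast in a tiny grayscale image, and a final step combines their votes. The patterns, thresholds, and images are all invented for illustration; real pattern recognition modules are far more elaborate.

```python
# A toy rendering of the "modules" idea: each module scans a tiny
# grayscale image (a list of rows of pixel intensities) and scores
# how strongly its pattern is present; the scores are then combined
# into an overall prediction.

def pointy_ear_module(image):
    """Score sharp contrast between neighboring pixels
    (a stand-in for the pointy edges of a cat's ears)."""
    score = 0
    for row in image:
        for a, b in zip(row, row[1:]):
            if abs(a - b) >= 0.5:   # sharp jump between neighbors
                score += 1
    return score

def soft_edge_module(image):
    """Score gentle contrast (a stand-in for soft, floppy edges)."""
    score = 0
    for row in image:
        for a, b in zip(row, row[1:]):
            if 0 < abs(a - b) < 0.5:  # gradual change
                score += 1
    return score

def classify(image):
    """Combine the module scores: if enough sharp-edge evidence
    has "lit up," predict a cat; otherwise predict a dog."""
    if pointy_ear_module(image) > soft_edge_module(image):
        return "cat"
    return "dog"

sharp_image = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]   # high-contrast patches
soft_image  = [[0.1, 0.3, 0.5], [0.5, 0.3, 0.1]]   # gradual gradients
print(classify(sharp_image), classify(soft_image))  # cat dog
```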
When Li began working on computer vision, most of the pattern recognition modules had to be painstakingly handcrafted by individual researchers. For computer vision to be effective at scale, it would need to become more automated. Fortunately, three new developments had emerged by the mid-2000s that would make it possible for Li to find a way forward: a database called WordNet; the ability to perform image searches on the web; and the existence of crowdworking platforms. Li joined the Princeton computer science faculty in 2007. There, she encountered Christiane Fellbaum, a linguist working in the psychology department, who introduced her to WordNet. WordNet, developed by the cognitive psychologist George A. Miller in the 1980s, organizes all English nouns, verbs, adjectives, and adverbs into sets of "cognitive synonyms... each expressing a different concept." Think of a dictionary in which words are organized into a hierarchical, tree-like structure rather than alphabetically. “Chair” inherits from “seat,” which inherits from “furniture,” all the way up to “physical object,” and then to the root of all nouns, “entity.”
Fellbaum told Li that her team had wanted to develop a visual analog to WordNet, in which a single image would be assigned to each word in the database, but had failed due to the scale of the task. The envisioned dataset was to be called ImageNet. Inspired by the effort, Li took on the name of the project in early 2007 and made it her own. Senior faculty at Princeton discouraged her from doing so. The task would be too ambitious for a junior professor, they said. When she applied for federal funding to help finance the undertaking, her proposals were rejected, with commenters saying that the project’s only redeeming feature was that she was a woman.
Nonetheless, Li forged ahead, convinced that ImageNet would change the world of computer vision research. She and her students began gathering images based on WordNet queries entered into multiple search engines. They also grabbed pictures from personal photo sharing sites like Flickr. Li would later describe how she wrote scripts to automate the collection process, using dynamic IP tricks to get around the anti-scraping safeguards put in place by various sites. Eventually, they had compiled a large number of images for each noun in WordNet.
However, they still needed a way to verify that the images actually matched the word associated with them—that all of the images linked to “cat” really showed cats. Since the scraping was automated, manual review was required. This is where Amazon's crowdworking platform, Mechanical Turk (MTurk), came in. It was a "godsend," Li later recalled. MTurk had been launched just a couple of years before, in 2005. Her team used it to hire workers from around the world to manually review millions of candidate images, verifying for each WordNet noun the presence or absence of the target concept.
The ImageNet dataset would take two and a half years to build, its first version completed in 2009. When it was finished, it consisted of fourteen million images labeled with twenty thousand categories from WordNet, including everything from red foxes to Pembroke corgis, speed boats to spatulas, baseball players to scuba divers. At the time, it was the largest publicly available computer vision dataset, hosted on the ImageNet website for anyone to download.
Convolutional Cat Ears
Although it took an immense amount of effort to create ImageNet, the initial uptake was slow. Li and her students presented a poster announcing its creation at a major computer vision conference. Tucked away in a corner of a conference center in Miami Beach, they even distributed logoed keychains and pens to advertise it. But beyond ImageNet’s limited popularity, there was a deeper issue. The problem that Li had hoped to solve with the creation of ImageNet—the fact that object recognition modules needed so much manual work to produce—still hadn’t been solved.
In an attempt to encourage wider adoption of the dataset, Li’s team decided to organize a competition. The ImageNet Large Scale Visual Recognition Challenge was officially launched in 2010. To enter the challenge, competitors would develop machine learning models using the benchmark training data, and submit their model’s predictions on a set of the benchmark testing data. The team whose model could detect objects in the images with the highest accuracy would be the winner.
In 2012, computer scientist Alex Krizhevsky, along with his colleagues at the University of Toronto, won the ImageNet Challenge with AlexNet, a neural network–driven machine learning model that outperformed all other competitors by a previously unimaginable margin. After a long period during which neural networks were out of fashion in AI, AlexNet almost single-handedly put them back at the center of research into the field. Part of what enabled the return of neural networks was the much greater processing power of modern computers, which was needed to handle massive datasets.
AlexNet helped fulfill the potential of ImageNet and solve the problem that Li had identified when she first started out: that computer vision required too much manual labor. Krizhevsky and his colleagues didn’t rely on handcrafted modules for object recognition. Rather, using neural networks, AlexNet was able to “learn” what an object looked like completely from the data.
Neural networks work by stacking layers of artificial “neurons” on top of each other. Each layer alters the image slightly, like a camera lens filter. Some of the first layers of AlexNet’s neural network model, known as “convolutional,” allowed it to automatically encode information that used to be manually coded—like the pointy edges of a cat’s ears. There was no longer any need to enter such information by hand. With enough images of cats, the neural network would be able to learn which patterns were most predictive of the animal.
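The convolutional mechanic can be illustrated with a minimal sketch: slide a small filter across a row of pixels and record how strongly it responds at each position. In a network like AlexNet the filter values are learned from data across many stacked layers; here, a one-dimensional edge-detecting kernel is picked by hand purely to show the mechanics.

```python
# A minimal sketch of the convolutional idea: dot a small filter
# (kernel) with each window of a signal. The kernel [-1, 1] responds
# strongly wherever neighboring values jump -- i.e., at an edge.

def convolve1d(signal, kernel):
    """Valid convolution: one output per window position."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, signal[i:i + k]))
            for i in range(len(signal) - k + 1)]

# A row of pixel intensities with one sharp edge in the middle.
row = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
edge_kernel = [-1.0, 1.0]

response = convolve1d(row, edge_kernel)
print(response)  # [0.0, 0.0, 1.0, 0.0, 0.0] -- peaks exactly at the edge
```

In a trained network, kernels like this one are not specified by hand; their values are adjusted automatically during training, which is precisely what removed the manual labor Li had set out to eliminate.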
AlexNet’s success is often credited with sparking the resurgence of neural networks—under the new name of deep learning, which refers to neural networks built from many stacked layers—as the dominant machine learning paradigm. The 2012 paper associated with the model now has over seventy-two thousand citations on Google Scholar, an indication of its popularity in academic and industry circles alike. Deep learning techniques have achieved near-universal adoption not only within computer vision, but also within natural language processing—which works with human language—and a number of related subfields.
The deep learning era has, in turn, placed data—more specifically, vast quantities of data—at the center of AI development. Because deep learning models become more accurate when trained on more data, tech companies are highly incentivized to gather as much data as possible. The amount of information available on the internet continues to grow. Instagram users alone share 8.9 million images a day. Meanwhile, a new cottage industry of data annotation work has sprung up to feed soaring demand for data labeling. The people who do this work are typically subcontractors or crowdworkers, like the MTurkers who helped create ImageNet, and represent a growing underclass of invisible tech workers.
Algorithms of Oppression
Why does the history of ImageNet matter? ImageNet has had an enormous influence on the field of modern AI, and on many of the AI systems that affect so many aspects of our lives. By understanding the particular circumstances of ImageNet’s creation, we can better understand these systems. We can also understand how the progress of AI moves in fits and starts, how its reliance on massive amounts of data is contingent and accidental, and how its present course was just one possible path among many.
ImageNet was built on three technological pillars: WordNet, search engines, and crowdworking. The reliance on WordNet has proven to be particularly problematic. ImageNet encodes outmoded and prejudiced assumptions—many of them racist, sexist, homophobic, and transphobic—because those assumptions come directly from WordNet.
A good illustration of this comes from a website called ImageNet Roulette. Developed by AI researcher Kate Crawford and artist Trevor Paglen, ImageNet Roulette allows users to upload images of themselves. These images are then analyzed by a machine learning model, trained on a set of ImageNet data, which generates a description. When Guardian journalist Julia Carrie Wong uploaded a photo of herself, it labeled her with an ethnic slur, while New York Times video editor Jamal Jordan was consistently labeled as "Black, Black person, blackamoor, Negro, or Negroid,” no matter which image he uploaded.
To their credit, ImageNet’s creators quickly sanitized the dataset of such labels for its future users. But those categories still exist in multiple machine learning systems, due in part to the influence of ImageNet. The AI researchers Vinay Uday Prabhu and Abeba Birhane recently demonstrated that the categories in the WordNet database persist in several widely cited public computer vision benchmarks, a finding that led MIT to take down its prominent Tiny Images benchmark. And if they exist in these open datasets, then they are potentially replicated in many internal industry ones.
WordNet is not the only issue with ImageNet, however. The data contained within ImageNet was gathered from internet search engines in the early 2000s. Such search engines, as the UCLA professor of information studies Safiya Umoja Noble has explained, encode racist and sexualized imagery for Black, Latina, and Asian women, and overrepresent imagery of white men in positions of power. These engines also portray a Western white male vision of the world, associating "beauty" with white women, "professor" or "ceo" with white men, and "unprofessional hairstyles" with Blackness. These assumptions filtered into ImageNet as the dataset was constructed.
One common response from AI researchers to the oppressive aspects of ImageNet, and to the crisis of algorithmic injustice more generally, is that the problem lies with the data: if we get more or different data, then all these problems will inevitably go away. This was the response that Yann LeCun, one of the “godfathers” of deep learning and chief AI scientist at Facebook, provided when a machine learning model designed to depixelate faces ended up whitening them as well. Timnit Gebru, co-lead of Google’s Ethical AI team, struck back, underscoring how AI systems cause real harm and exacerbate racial inequality, and arguing that improving them must mean more than just focusing on better data collection. (Disclosure: two of us, Hanna and Denton, are members of Gebru’s team.) Furthermore, data collection efforts aimed at increasing the representation of marginalized groups within training data are often executed through exploitative or extractive mechanisms, such as IBM’s attempt to “diversify” faces by scraping millions of images from Flickr without the consent of the people in them. As Gebru explained during a tutorial at the Computer Vision and Pattern Recognition conference in June 2020, “Fairness is not just about datasets, and it’s not just about math. Fairness is about society as well, and as engineers, as scientists, we can’t really shy away from that fact.”
A particularly pernicious consequence of focusing solely on data is that discussions of the “fairness” of AI systems become merely about having sufficient data. When failures are attributed to the underrepresentation of a marginalized population within a dataset, solutions are subsumed to a logic of accumulation, the underlying presumption being that larger and more diverse datasets will eventually morph into (mythical) unbiased datasets. According to this view, firms that already sit on massive caches of data and computing power—large tech companies and AI-centric startups—are the only ones that can make models more "fair."
A Genealogy for the Many
Exploring the history of ImageNet has implications not only for how we discuss the problems and failures of AI, but also for how we critique those issues and formulate solutions to them. We need to develop genealogies of data to show that datasets are the product of myriad contingent assumptions, choices, and decisions, and that they could, in fact, be otherwise. Genealogy is an interpretive method of analysis, which we can apply to the historical conditions of dataset creation. Understanding these conditions illuminates the origins of certain problems, but it also opens up new paths of contestation by enabling us to imagine new standards, new methods for evaluating AI progress, and new approaches for developing ethical data practices in AI.
Instead of the narrow focus on "bias," we can start to ask deeper questions such as: How did particular datasets emerge? What agendas, values, decisions, and choices governed their production? Who collected the data and with what purpose? Are the people represented in the datasets aware that they are participants in them? Can they meaningfully opt out? How about the workers, like the Amazon MTurkers, who annotated them? Were they fully recognized for their labor and fairly remunerated? And, most importantly, does the creation of the datasets serve the interests of the many or only those of the few?