The War On Data

Laurie Allen, DataRefuge

Portrait of Laurie Allen, by Gretchen Röehrs

The federal government produces and publishes tons of climate data. This data is an essential resource for scientists, activists, and policymakers. Under Trump, however, it may be at risk. His Administration has deep links to the fossil fuel industry, and has every interest in discrediting the scientific evidence behind man-made global warming. As a result, federal agencies could remove critical climate data from publicly available websites, or degrade it in a variety of other ways.

University of Pennsylvania librarian Laurie Allen is a coordinator of the DataRefuge project, an interdisciplinary undertaking of librarians and coders who are rescuing data from government websites before it can be removed by the Trump Administration. They’ve staged hackathons and partnered with other institutions to begin building a reliable backup of public datasets. We spoke to Laurie about the politics of data, and how to save the information on which the future of our planet depends.

Could you start by telling us about what the DataRefuge project is, and how it got started?

I've been a librarian for 15 years, and I currently work at the University of Pennsylvania Library, where I head up the Digital Scholarship Department.

Scholarship in academia is mainly produced as journal articles and scholarly books. Yet, increasingly, forms of scholarly production are emerging that include data sets and databases and interfaces and interactive collaboration and all sorts of other things. So my work involves engaging with those new modes of digital scholarship.

In that role, I’ve worked closely with a collaborator, Bethany Wiggin, a faculty member who directs the program in Environmental Humanities at Penn. After the election, Bethany’s students brought up the concern that environmental and climate data would disappear from federal government sites. They decided to do something, and the libraries decided to help.

If any of us had understood how much work would be involved, it’s possible we might not have taken on this project. It's so vast! But this was right after the election. I think I felt like, “Okay. Let's do stuff. Yup, that sounds impossible. Let's do it anyway.” We're going to have to do things that aren't anyone's job if we're going to make the changes we need to make.

When did you actually begin to harvest environmental and climate data from federal sites?

We started meeting regularly in late November, and began planning an event for January 13th and 14th. We wanted to do two things. First, to provide an opportunity for people to come together and learn about why environmental and climate data is so important. Second, to actually harvest the data, gather it, and store it in a safe place.

What are the federal agencies that publish environmental and climate data? Are we talking mostly about the Environmental Protection Agency (EPA), or are there others?

Oh, there are so many agencies. And that’s the breadth of the problem. Every agency has something to do with the environment or climate.

Yes, the EPA is obviously a big one, and one that we feel is quite vulnerable politically. After the EPA, the National Oceanic and Atmospheric Administration (NOAA) is the next biggest one that scientists are most concerned about. The National Aeronautics and Space Administration (NASA) is a huge one. There's the Department of Energy, the Department of the Interior, and the Department of Agriculture, which contains the US Forest Service. Even Housing and Urban Development (HUD) has data.

What form does all this data take? Can you go to a central location and download a bunch of Excel spreadsheets?

I love that question because it assumes that the federal ecosystem is much more rational than we have reason to believe that it is.

The Obama Administration tried to move the government towards an open data governance structure with data.gov. But that was a process. And people who understand technology know that moving big systems that have been around for a long time into a new architecture is not simple. It's tremendously expensive and difficult.

So when you think about federal data, you have to think of the word “data” very capaciously. It includes datasets of the kind you would find at data.gov. It also includes information scattered across hundreds of thousands of government webpages.

Sometimes those pages all point to the same dataset, and there’s an API for that dataset. So you can go grab the data from the API, and then remake the webpages. Sometimes it’s that simple. In other cases, the data is stored in some really old technology. It's served up through ColdFusion and God knows what the database underneath is. And so the only way to get it is by scraping, which is why these events require technical skill.
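The scraping fallback described above can be sketched in a few lines of Python, using only the standard library's `html.parser` to pull rows out of an HTML table and re-serialize them as CSV. Everything here is illustrative: the `TableScraper` class, the sample page, and the station data are invented for the example, not taken from any agency site.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a cell of an open row.
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def scrape_table(html):
    parser = TableScraper()
    parser.feed(html)
    return parser.rows


def rows_to_csv(rows):
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


# In practice the HTML would come from urllib.request.urlopen(url).read();
# a canned snippet stands in for a legacy agency page here.
SAMPLE = """
<table>
  <tr><th>station</th><th>year</th><th>mean_temp_c</th></tr>
  <tr><td>KPHL</td><td>2016</td><td>13.9</td></tr>
</table>
"""

print(scrape_table(SAMPLE))
```

A real harvesting run would add the messy parts this sketch omits: fetching pages politely, following pagination, and recording where and when each file was captured so the copy can be trusted later.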

But it's really not easy to do this systematically. There's no list of all government data.

When you talk about data being “vulnerable,” can you speak more specifically about what that means?

We talk about four factors of vulnerability.

The first factor of vulnerability is political. This Administration seems eager to throw into doubt the well-established scientific evidence on climate change, so databases related to that might be withdrawn from public websites.

The second is technical. There's a ton of data that's stored in brittle technology. So by defunding an agency just a little bit, the government can break the technology where a particular dataset lives. Let’s say a dependency needs updating, but updating that dependency is going to throw off the whole system. All of a sudden, a data source becomes unavailable.

The third is legislative. With the Census, for example, there are laws that ensure the data is collected, maintained, and made available. But that’s not the case for many other sources of government data. There is often no law requiring that certain kinds of data be collected or distributed.

The fourth is the content itself. What happens if the data is compromised? If you take one piece of data out, how many other kinds of scientific argument will be damaged?

...


This has been a free excerpt from Tech Against Trump, a new book by Logic chronicling the rising tide of anti-Trump resistance by tech workers and technologists.

To read the rest of the interview, head on over to our store and buy the book. Like our work? Subscribe for future issues!

