Could you start by telling us about what the DataRefuge project is, and how it got started?
I’ve been a librarian for 15 years, and I currently work at the University of Pennsylvania Library, where I head up the Digital Scholarship Department.
The main production of scholarship in academia comes through journal articles and scholarly books. Yet, increasingly, forms of scholarly production are emerging that include data sets and databases and interfaces and interactive collaboration and all sorts of other things. So my work involves engaging with those new modes of digital scholarship.
In that role, I’ve worked closely with a collaborator, Bethany Wiggin, a faculty member who directs the program in Environmental Humanities at Penn. After the election, Bethany’s students brought up the concern that environmental and climate data would disappear from federal government sites. They decided to do something, and the libraries decided to help.
If any of us had understood how much work would be involved, it’s possible we might not have taken on this project. It’s so vast! But this was right after the election. I think I felt like, “Okay. Let’s do stuff. Yup, that sounds impossible. Let’s do it anyway.” We’re going to have to do things that aren’t anyone’s job if we’re going to make the changes we need to make.
When did you actually begin to harvest environmental and climate data from federal sites?
We started meeting regularly in late November, and began planning an event for January 13th and 14th. We wanted to do two things. First, to provide an opportunity for people to come together and learn about why environmental and climate data is so important. Second, to actually harvest the data, gather it, and store it in a safe place.
What are the federal agencies that publish environmental and climate data? Are we talking mostly about the Environmental Protection Agency (EPA), or are there others?
Oh, there are so many agencies. And that’s the breadth of the problem. Every agency has something to do with the environment or climate.
Yes, the EPA is obviously a big one and one that we feel is quite vulnerable politically. After the EPA, the National Oceanic and Atmospheric Administration (NOAA) is the next biggest one that scientists are most concerned about. The National Aeronautics and Space Administration (NASA) is a huge one. There’s the Department of Energy, the Department of the Interior, and the Department of Agriculture, which contains the US Forest Service. Even Housing and Urban Development (HUD) has data.
What form does all this data take? Can you go to a central location and download a bunch of Excel spreadsheets?
I love that question because it assumes that the federal ecosystem is much more rational than we have reason to believe that it is.
The Obama Administration tried to move the government towards an open data governance structure with data.gov. But that was a process. And people who understand technology know that moving big systems that have been around for a long time into a new architecture is not simple. It’s tremendously expensive and difficult.
So when you think about federal data, you have to think of the word “data” very capaciously. It includes datasets of the kind you would find at data.gov. It also includes information scattered across hundreds of thousands of government webpages.
Sometimes those pages all point to the same dataset, and there’s an API for that dataset. So you can go grab the data from the API, and then remake the webpages. Sometimes it’s that simple. In other cases, the data is stored in some really old technology. It’s served up through ColdFusion and God knows what the database underneath is. And so the only way to get it is by scraping, which is why these events require technical skill.
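To make the scraping case concrete, here is a minimal sketch in Python using only the standard library’s `html.parser`. The HTML fragment, station name, and reading are all hypothetical stand-ins; a real harvest would first fetch the page over HTTP from the agency’s site, and a legacy ColdFusion page would be far messier than this.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every table cell from an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# Hypothetical fragment standing in for a legacy agency page.
page = "<table><tr><td>Station A</td><td>12.4</td></tr></table>"
scraper = TableScraper()
scraper.feed(page)
print(scraper.cells)  # ['Station A', '12.4']
```

The point of the sketch is the contrast with the API case: when an API exists, the harvester requests structured data directly; when it doesn’t, the only option is to parse values back out of rendered HTML like this, which is why these events need people with technical skill.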
But it’s really not easy to do this systematically. There’s no list of all government data.
When you talk about data being “vulnerable,” can you speak more specifically about what that means?
We talk about four factors of vulnerability.
The first factor of vulnerability is political. This Administration seems eager to throw into doubt the well-established scientific evidence on climate change, so databases related to that might be withdrawn from public websites.
The second is technical. There’s a ton of data that’s stored in brittle technology. So by defunding an agency just a little bit, the government can break the technology where a particular dataset lives. Let’s say a dependency needs updating, but updating that dependency is going to throw off the whole system. All of a sudden, a data source becomes unavailable.
The third is legislative. With the Census, for example, there are laws that ensure the data is collected, maintained, and made available. But that’s not the case for many other sources of government data. There is often no law requiring that certain kinds of data be collected or distributed.
The fourth is the content. What happens if the content of the data itself is compromised? If you take one piece of data out, how many other kinds of scientific argument will be damaged?
When you think about political vulnerability, it seems clear that the Trump Administration will want to suppress any data that confirms the existence of climate change. But what are the other kinds of data that might be politically vulnerable under Trump?
The Department of Agriculture just took down a big animal welfare database. Luckily, most of the stuff was in the Internet Archive. It was basically a big set of PDFs. And there’s no law saying that the Department of Agriculture needs to keep publishing it. And there’s no staff whose job that is. So it’s gone from the web.
There were also bills recently introduced in both the House and the Senate to stop the production of geospatial information that would keep track of housing fairness by race.
So it’s not just climate and environmental data that’s under threat. We should be worried about a much wider range of data.
Of course. It’s evidence—it’s a way to prop up your agenda. I think it’s a mistake to imagine that science is immune from politics. It never has been—and neither is technology. They all work together and they always have.
Once you’ve retrieved the data—either by locating the files and downloading them or by scraping text from the sites—what do you do with it? How do you store it, and how do you make it usable?
Great question. And I’ll answer it by first answering a different question. When scientists ask about the project, their first question is: How will we know that the data you have is trustworthy?
Let’s say that the Internet Archive didn’t have the disappeared Department of Agriculture data about animal welfare. And someone on the internet says, “Don’t worry, I got it. Here, I’m going to put it on my website.” What if they’re funded by some Big Ag corporation? Why should we trust their data?
So the question isn’t how to make it available. It’s very easy to make it available. You put it on the web and then anyone can take it and do whatever they want with it. The question is how to ensure that your data can be trusted.
That’s the hard work. And it’s very technical: it involves checksums and hashes and that sort of thing. But it’s also interpersonal, because there aren’t only technical ways to ensure trust.
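As a minimal illustration of the technical side, here is how a checksum catches tampering, using Python’s standard `hashlib`. The CSV bytes are invented for the example; the mechanism is the point: any change to the data, however small, yields a completely different digest, so a checksum published by a trusted party lets anyone verify that a copy is intact.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

original = b"site_id,reading\nA-101,42.7\n"
tampered = b"site_id,reading\nA-101,41.7\n"  # one digit changed

# The two digests will not match, exposing the altered copy.
print(sha256_hex(original) == sha256_hex(tampered))  # False
```

Of course, a checksum only shifts the question: you now have to trust whoever published the checksum, which is exactly why the interpersonal side of the process matters.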
I’ll give you an example. I’m a librarian, and we have works of art in our collection. The artists who made them are dead, so how do you know they’re real? Of course, there are technical measures to determine their authenticity. But those technical measures can be tampered with. So we also rely on people. We keep track of whose hands the work has passed through. Chain of custody is the term for that.
So how does chain of custody work for a harvested government dataset?
At our DataRefuge hackathons, anyone can harvest the data—they can be anonymous. They place their harvested data into an S3 bucket, and then someone else checks it. The checker has to be approved, and they’re checking the data for two things. Is the data what the harvester says it is? And is this collection everything that a researcher would need in order to use the data?
The next stage is called bagging. And again, we have to know who those people are. The baggers use the BagIt format to turn the data into a “bag,” which is essentially a zip file packed with directories and a manifest of all the files inside. And the manifest contains checksums for every file and checksums for the whole bag.
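A rough sketch of what a bagger produces, mimicking the basic BagIt layout (payload files under `data/`, a `manifest-sha256.txt` of per-file checksums, and a `bagit.txt` declaration). This is a simplified illustration with invented file contents, not the DataRefuge tooling; in practice tools such as the Library of Congress’s bagit-python handle the full specification, including tag manifests.

```python
import hashlib
from pathlib import Path
from tempfile import mkdtemp

def make_bag(payload):
    """Write payload files (a dict of name -> bytes) under data/
    and a manifest of their SHA-256 checksums."""
    bag = Path(mkdtemp())
    (bag / "data").mkdir()
    lines = []
    for name, content in payload.items():
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{name}")
    (bag / "manifest-sha256.txt").write_text("\n".join(lines) + "\n")
    (bag / "bagit.txt").write_text("BagIt-Version: 0.97\n")
    return bag

# Hypothetical harvested file.
bag = make_bag({"readings.csv": b"site,temp\nA,12.4\n"})
print((bag / "manifest-sha256.txt").read_text())
```

Because the manifest records a checksum for every file, anyone who receives the bag can recompute the digests and confirm that nothing inside has changed since it was checked.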
So that’s how we store the data. And after it’s been checked and re-checked, we actually make it available as a bag.
It sounds like there’s a wide range of skillsets that come into play for a project like DataRefuge to be successful. You’ve got the technical folks who are building the tools to retrieve, verify, and store the data. You’ve got librarians like yourself who bring the deep knowledge of how to preserve and curate the data.
It’s wonderful when I can get into the weeds with other librarians. And I know that my friends in the software development community feel the same way. But we’re never going to build new systems if we don’t stop enjoying the marvelous feeling of talking to other people who use all the same acronyms.
If I could just sit down with a bunch of fellow librarians and solve this problem, it wouldn’t be a problem. We would have already solved it.
The same is true for the tech community. We all let this happen. We all let ourselves get here. So the thing that feels like the right thing to do is probably not, because what we’ve been doing hasn’t been working. And considering the urgency of the problem we’re facing, this is the only way to make progress: by coming together and bringing our best minds to the table.
And it’s not just librarians and technologists. We’re also working with the people who are using this data: scientists and researchers and city planners and so on. And we’re working with the people whose lives are affected by the use of this data.
What’s next? How would you like to see the DataRefuge project evolve over the next year or so?
We want to connect with as many university libraries as we can. The problem of disappearing datasets under this Administration can be seen as an academic problem, because it’s the responsibility of university libraries to make available the data that faculty and students use to do their work. And the removal of government data from public sources means those researchers might not have access to the collections they need.
We also need to activate the expertise we have in our communities. My colleague Bethany has been leading our local engagement in Philadelphia, working with city officials and residents to explore how climate and environmental data is used in our community. We must fight to retain access to the data that is needed for Philadelphia to work, and advocate for the continued creation of that data.
Ideally, of course, none of this would need to happen. Hopefully, this is all just a giant waste of time. Unfortunately, I think it’s becoming increasingly clear that it’s not.