Castle in the cloud: UC Berkeley Library balances access, security as purveyor and protector of research data

The Library’s mission is to share information, to sow the seeds of knowledge as widely and meaningfully as possible. It’s a core value of librarianship, notes the American Library Association.

Right up there with another one: privacy. If you’re sensing a little dramatic tension, you’re right.

By day, librarians fight tirelessly to make research as open as possible, helping scholars share their evidence in the name of scientific truth. But bookending that work is the promise to secure and protect that information, ensuring contracts of confidentiality are intact.

The information now in question is data: a colossal trove expanding ever faster as data science redefines the research landscape. As more and more research at Berkeley draws on sensitive data, navigating between openness and privacy has become more important than ever.

Data consultant Amy Neeser helps researchers understand the sensitivity of their data and how to protect it. (Photo by J. Pierre Carrillo for the UC Berkeley Library)

“We create data not only for ourselves, but for people downstream and in the future,” says Jon Stiles, data services lead at Berkeley’s D-Lab. “That’s been the library’s role for centuries and centuries. And we just need to realize that in an increasingly electronic world, that presents risks.”

Safe and sound

In some ways, you can think of research data as a baby, Stiles says. You want to put it in environments where it can grow and be successful. But you also have to defend it from harm’s way.

“The philosophy is, if data can be open, it needs to be,” says Amy Neeser, who collaborates with the Library as consulting and outreach lead for Research IT. “But some of it is highly sensitive, and in that case, it’s our responsibility to be good stewards in that way, too.”

At UC Berkeley, that’s the work of a village, involving many units on campus. One such partnership is Research Data Management, led by the Library and the campus’s Research IT unit. Consultants from that group show researchers how to safely store and transfer data, and how to safeguard their devices against a breach.

The first step is to figure out how sensitive the data is, whether it consists of accounting records or the whereabouts of endangered species. For that work, consultants put on the cap of a reference librarian, Neeser says, to find the “question behind the question.”

“Researchers will come to us, think they have already figured out a solution, and say, ‘I need storage in the cloud,’” Neeser says. “And a lot of times, what they asked for is not what they need.

“We learn that, actually, this data is really sensitive, and we need to make sure we do things the right way.”

During class visits and instruction sessions, business librarian Hilary Schiraldi teaches students about data security and what is at stake. (Photo by J. Pierre Carrillo for the UC Berkeley Library)

One common type of highly sensitive information at Berkeley is human genomics data, Neeser points out. Campus researchers look at DNA to track everything from the migratory patterns of our ancestors to the crucial links between a specific gene and disease.

To meet the growing security needs, the Library and its partners are now spearheading an initiative called Secure Research Data Compute. The program focuses on creating safe computing environments, including an encrypted data repository; building communities invested in data security and expanding services; and enacting clearer policies for data security.

Meanwhile, the Library has launched a new initiative to train all librarians at UC Berkeley in basic data literacy, including data security.

In fact, technical help often includes input from subject librarians, Neeser says. In the social sciences, for example, data can be culturally sensitive, so consultants rely on librarians to guide those conversations with an eye to the nuances of that field.

“They’re the subject experts, and we’re the data side, so by joining forces, we can provide a really holistic experience for the researcher,” Neeser says.

For Stiles, it’s about getting everyone on the same page, so “everyone understands where to go and what to do at each step.”

‘It doesn’t feel good to say no’

There are many reasons data might need to be protected. It may contain personal information about people, gathered in a study. Or it might be commercial data, such as financial records, owned by companies and purchased by the Library on researchers’ behalf. The former is protected to maintain confidentiality; the latter, to serve a company’s business model.

Vendors of commercial data often wrap their products in restrictive contracts. The Library’s job, then, is twofold: to try to peel back those limits, where possible, and to help researchers do their work within those binds.

Jim Church, the librarian for economics and international and foreign government information, says his first priority is getting researchers the data they need. (Photo by Jami Smith for the UC Berkeley Library)

“Librarians got into this business because we wanted to share information,” business librarian Hilary Schiraldi says. “It doesn’t feel good to say no.”

One goal is to convince vendors to let researchers “scrape” information from across entire databases. If researchers could scan, or text mine, thousands of newspapers for certain phrases, for instance, they could do key historical work without actually diving too deep into the content.

“It allows us a bird’s-eye view of large swaths of information,” says Stacy Reardon, literatures and digital humanities librarian. Negotiating for text-mining rights is important, she says, as databases typically do not allow data to be downloaded at scale.

“We ... strive for the best possible licenses,” says Jim Church, librarian for economics and international and foreign government information. “My first concern is getting researchers the data they need.”

Another challenge is figuring out how to crack research data open just enough so that fellow researchers can verify their peers’ work — a crucial tenet of the scientific process. You want your study to be replicable, Stiles says, but if someone can’t access your data, it “throws some grit in … the mechanism.”

It’s complicated, of course, and many researchers and data scientists on campus are working out solutions to tread that line. One option is to publish as much about a dataset as possible and provide reviewers with the code used to analyze it. It’s a way to be open, but not too open.

“It’s about trying to preserve as much privacy and confidentiality as possible while still providing as full and meaningful research access as we can,” Stiles says. “That’s the balance that the Library and all these other institutions on campus are trying to figure out.”