Library launches initiative to boost data science expertise, services at UC Berkeley

Data specialist Amy Neeser
Amy Neeser, consulting and outreach lead for Research IT, helped create the Library’s new data program. (Photo by J. Pierre Carrillo for the UC Berkeley Library)

As with any healthy relationship, the marriage between the UC Berkeley Library and the world’s information is evolving.

Since the dawn of documentation, information — whether in cuneiform script or HTML — has been sifted, curated, and organized by the stewards of libraries. Today, as technological advances redefine our capacity to store, share, and process information, the roles of libraries have shifted once more.


“We have easier access to all of this information, derived from all over the world,” says Anthony Suen, director of programs at Berkeley’s Division of Data Sciences. “But there needs to be a central place where people can be interactive and think about these issues on a deeper level. The Library is the home base.”

In step with the explosion of data science across campus — including a new division, major, and undergraduate course that is the fastest-growing in university history — the Library has launched the UC Berkeley Library Data Initiatives Plan. Developed over years of conversations with librarians and campus partners, the plan is a multifaceted strategy for supporting Berkeley’s changing research landscape.

Data are, basically, blocks of information — numbers, survey results, images, map coordinates, and much more — that can be strung together and sorted at scale to answer a larger research question. With fine-tuned algorithms and the right computing power, data scientists can examine trends to explore issues ranging from humankind’s impact on climate change to the morphing of languages over time.

“With the right ideas and framework, you can do work that would have taken eons or billions of dollars within minutes or hours,” says Suen, who is helping the Library plan stations for data science consultation at the Center for Connected Learning, the new vision for Moffitt Library. “That’s the potential we have right now.”

The Library’s new initiative focuses on four major goals: increasing access to and resources for data collections; building all librarians’ data expertise; promoting data literacy across campus; and engaging with university partners to build communities around data science.

“I’m a librarian, so my ethos is, ‘I’m here to help lower the barrier for people doing research,’” says Josh Quan, data services librarian, who spearheaded the data initiative alongside Amy Neeser, consulting and outreach lead for Research IT. “I’m committed to allowing researchers to do the best work they can in order to change the world.”

Through its Data Acquisition and Access Program, or DAAP, the Library acquires datasets on researchers’ behalf — including, in recent years, the entire U.S. congressional record, newspaper archives, housing rates in major cities, and more. One of the biggest datasets is Catalist, which includes voter registration data paired with U.S. census records. With that information, students can study how voters divide themselves, and along what lines.

Women transplant paddy seedlings in India
An all-woman transplanting team moves paddy seedlings from a nursery into the main field in Telangana, India, in 2017. The image is part of the fieldwork of Manaswini Rao, a graduate student in the campus’s Department of Agricultural and Resource Economics. Rao is studying the role of gender in agricultural labor and uses employment data from the National Sample Survey, provided by the Library through its Data Acquisition and Access Program. (Photo courtesy of Manaswini Rao)

As part of its data initiative, the Library is also building a campuswide repository to store the datasets it purchases.

“It’s the Library’s job to think about, ‘OK, if one researcher requested something — if we bought it for them — how can we maximize its usage?’” Neeser says.

John Loeser, a campus Ph.D. candidate in agricultural and resource economics, used DAAP to access survey data from the National Sample Survey on household income and spending in India. According to Loeser, who was looking at the impact of road construction on agriculture, easy access to datasets not only increases transparency and reproducibility, but also opens the door to undergraduate students, who can download datasets and practice their research skills.

“That’s something that is not possible unless you have an institutional arrangement providing this data,” Loeser says. “It’s a great asset for Berkeley to have, and it’s going to be really valuable in the coming years.”

For the Library, one constant goal is to demystify data science for the campus community, building new pipelines into the field from all directions.

“It’s about making a really safe environment,” Neeser says. “It’s OK to not know things — nobody is an expert. We’re all learning this together.”