Body-camera footage and gender differences in literature: Grant to help scholars navigate legal issues of text data mining

Today, the National Endowment for the Humanities announced it is awarding a $165,000 grant to a UC Berkeley Library-led team of legal experts, librarians, and scholars who will help humanities researchers and staff navigate complex legal questions in the cutting-edge field of text data mining.

But first, what is text data mining?

Crack open a popular English-language novel written in the 1850s (say, one by Brontë, Hawthorne, Dickens, or Melville) and you’ll notice a difference between how the authors describe male characters and female characters.

For example, the word “mind” might be used to describe a man, while “heart” is more likely to be used for a woman. Male characters might “get” something, while female characters are more likely to have “felt” it. But as the 20th century rolled around, these differences faded.

How do we know this? Text data mining. Text data mining, also called “text and data mining” or “computational text analysis,” refers to the technique of extracting information from across a broad set of digital content, and capturing and analyzing the trends that emerge. Text data mining has numerous applications, such as detecting racial disparity by evaluating language from police body-camera footage; developing new tools to allow large-scale analysis of TV series and photos; and capturing and designing new physical representations of naturally occurring laughter.
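
To make the idea concrete, here is a minimal sketch in Python of the kind of counting that sits underneath such an analysis. It is not the method used in any of the studies mentioned here; the sample sentences are invented for illustration, and the function name `words_after` is hypothetical. It simply tallies which word immediately follows "he" and which follows "she."

```python
import re
from collections import Counter

# Toy sentences invented for illustration; not drawn from any real novel.
SAMPLE_TEXT = (
    "He got the letter and he read it twice. "
    "She felt the cold and she read the reply. "
    "He got up early. She felt tired."
)


def words_after(text, pronoun):
    """Count the word that immediately follows each occurrence of `pronoun`."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    # Walk through adjacent token pairs and tally the follower of the pronoun.
    for current, following in zip(tokens, tokens[1:]):
        if current == pronoun:
            counts[following] += 1
    return counts


he_counts = words_after(SAMPLE_TEXT, "he")
she_counts = words_after(SAMPLE_TEXT, "she")

print("After 'he': ", he_counts.most_common(3))
print("After 'she':", she_counts.most_common(3))
```

Real projects run this sort of tally across thousands of digitized books and use far more careful language processing, which is where questions about access to, and rights in, the underlying texts come in.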

This kind of work can raise issues of copyright, contract, and privacy law — not to mention ethics, if culturally sensitive content is at stake.

The NEH grant will support a national team, led by Rachael Samberg of the UC Berkeley Library’s Office of Scholarly Communication Services, that will help humanities researchers, research staff, and librarians untangle the web of law and policy issues that arise in text data mining.

Read more about the grant on Library Update.