Text Mining


"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts."

- from What is Text Mining? by Marti Hearst (2003)

Text mining (sometimes called "text data mining," or "text and data mining") describes a research approach in which scholars use automated methods to identify, extract, and analyze patterns and trends in large volumes of digital content. How do the Copyright Basics and our Digital Publishing Workflow apply when you're conducting text data mining? Are there any special copyright considerations?

Fair Use

Copyright comes into play with text mining when, for instance, researchers use algorithms to download and mine copyright-protected text, and work with or build a collection of materials (called a "corpus") for their project. Courts have largely determined that using these automated mining techniques to conduct scholarly research constitutes fair use. You can read more in Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014).

However, researchers sometimes seek to publish from or circulate the copyright-protected corpus they've compiled, either to allow other researchers to validate the algorithms applied or to enable different research queries entirely. Publishing large portions of a copyright-protected corpus, or circulating the entire corpus to other scholars, can push the limits of what constitutes "fair use": the corpus could then be considered a market substitute for the underlying works, because other researchers would no longer need to purchase or license them.

If you have questions about how fair use fits into your plans for text data mining, we can help you understand the issues.

License Agreements & Terms of Use

As when you're publishing any digital scholarship that includes content about or by someone else, copyright isn't the only legal question you may need to consider. If you've created or are using a corpus of materials from a library-licensed database or other online resource, there may be contractual arrangements that override the fair use you would otherwise have been able to make. For instance, let's say you want to scrape content from a database that the Library has licensed. Some of our agreements allow this, and some don't. Using Python, Selenium, or other programmatic tools to scrape database search results (even cleverly) can breach the agreement and can also result in database access being shut down for the entire campus.

You should check out this guide regarding library license agreements and text data mining. Read any license agreements or website terms of use carefully to make informed decisions about how to proceed. If you have questions, e-mail tdm-access@berkeley.edu.

Digital Scholarship Workflow

Other digital scholarship workflow questions remain equally important with text data mining. For instance, you may need to think about protecting privacy or dealing with policies related to Indigenous knowledge. We are here as a resource if you have questions.