Text Mining


"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts."

- from What is Text Mining? by Marti Hearst (2003)


Sample text mining visualization, by Vsimonyan.


Mind Before You Mine!

Given what we know from the other Copyright in Scholarship & Publication pages, how should we approach text mining or computational text analysis?

How can we help ensure that what we're doing with potentially copyrighted texts complies with legal standards when we publish our results?

For the most part, you just need to apply the Workflow we detailed for copyright issues in your scholarship (and for more, you can consult our Copyright & Digital Projects guide).

Below, though, we discuss a few issues that frequently come up in text mining & computational text analysis.


There Are Probably Licenses At Play

In many cases, the text mining you do for non-commercial scholarship can be fair use for purposes of Workflow Step 1. In fact, there is a great court opinion discussing this issue if you're curious: Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014).

But if you're using a corpus that is online or offered through a licensed database, Workflow Step 3--i.e. addressing license or terms of use issues--becomes really important. Even if use would be fair under Workflow Step 1, you may have consented to contractual arrangements that override the fair use that you would otherwise have been able to make.

So, read website Terms of Use or license agreements carefully. Sometimes they regulate text mining directly, including the mechanisms by which you can conduct the mining. Or, they might allow text mining but prohibit sharing excerpts from the materials they made available for mining. Licenses vary greatly.


And if you're using Library resources...

This terms of use & licensing point is really important--particularly if the license at issue is one the Library has signed for campus use.

Let's say you want to scrape content from a database that the Library has licensed. Well, some of our agreements allow this, and some don't. Using Python, Selenium, or other programmatic tools to scrape database search results (even cleverly) can result in access being shut down for the entire campus.

So, if you're using Library-licensed resources for text and data mining, please consult this flowchart, which builds in a step to check whether the Library's licensing agreement allows for what you're trying to do:

Flowchart for Performing Text Analysis With Library Databases

Mining vs. Republishing

You also have to think about what happens after you've completed the analysis. Subsequently republishing the corpus content, itself, rather than just your visualization or analysis of that content, may or may not be fair use--you'll have to undertake a separate fair use analysis for that.

And again, remember that even if text mining is fair use, we have to consider whether we've signed contracts that constrict what would otherwise be fair use when it comes to republishing.
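
To make the distinction between analysis output and corpus content concrete, here is a minimal Python sketch. It assumes a tiny in-memory corpus standing in for your texts, and the function names (term_frequencies, counts_to_csv) are hypothetical helpers invented for illustration. It counts terms and serializes only the aggregated counts, so the exported file contains no running text from the corpus. Whether you may share even derived data like this still depends on the fair use analysis and any license terms that apply.

```python
import csv
import io
import re
from collections import Counter


def term_frequencies(documents):
    """Count lowercase word tokens across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def counts_to_csv(counts, min_count=1):
    """Serialize (term, count) rows, most frequent first.

    Only aggregated counts are written; no sentences from the
    source texts appear in the output.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["term", "count"])
    # Sort by descending count, then alphabetically for stable output.
    for term, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            writer.writerow([term, count])
    return buffer.getvalue()


# Hypothetical stand-in corpus; in practice this would be your licensed texts.
corpus = [
    "The whale surfaced. The crew watched the whale.",
    "A whale is not a fish, the crew agreed.",
]

counts = term_frequencies(corpus)
report = counts_to_csv(counts, min_count=2)
print(report)
```

The report here is the kind of analysis artifact you might publish. The corpus list itself is what would call for a separate fair use analysis, and a look at your license terms, before you republish it.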


Putting It All Together

So, here is an easy way to think about all this:

  • In many cases, the text mining you do for non-commercial scholarship can be fair use.

  • Irrespective of whether use would be fair, however, you may have consented to contractual arrangements that override the fair use that you would otherwise be able to make. For example, a database or website you are using might regulate text mining, or might allow text mining but prohibit sharing excerpts from the materials that they made available for mining.

  • Therefore, it's important to carefully read any database or website terms of use or licenses as you prepare your project for publication. In particular, if you are using Library-licensed resources, you need to check in on the licenses *we've* signed! We have a guide on that.


Learn More

For more on text mining and computational text analysis in general, and to find data sets to work with, the other pages of the Text Mining & Computational Text Analysis guide should prove very useful.

With questions, please contact us at schol-comm@berkeley.edu.