"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts."
- from What is Text Mining? by Marti Hearst (2003)
Mind Before You Mine!
Given what we know from the other Copyright in Scholarship & Publication pages, how should we approach text mining or computational text analysis?
How can we help ensure that what we're doing with potentially copyrighted texts complies with legal standards when we publish our results?
Below, we discuss a few issues that frequently come up in text mining & computational text analysis.
There Are Probably Licenses At Play
In many cases, the text mining you do for non-commercial scholarship can be fair use for purposes of Workflow Step 1. In fact, there is a great court opinion discussing this issue if you're curious: Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014).
And if you're using Library resources...
Let's say you want to scrape content from a database that the Library has licensed. Well, some of our agreements allow this, and some don't. Using Python, Selenium, or other programmatic tools to scrape database search results (even cleverly) can result in access being shut down for the entire campus.
So, if you're using Library-licensed resources for text and data mining, please consult the Library's flow chart, which builds in a step to check whether the Library's licensing agreement allows what you're trying to do.
Mining vs. Republishing
You also have to think about what happens after you've completed the analysis. Republishing the corpus content itself, rather than just your visualization or analysis of that content, may or may not be fair use; you'll have to undertake a separate fair use analysis for that.
And again, remember that even if text mining is fair use, we have to consider whether we've signed contracts that restrict what would otherwise be fair use when it comes to republishing.
Putting It All Together
So, here is an easy way to think about all this:
Mining databases or corpora to conduct scholarly analysis, without subsequently republishing the contents of those databases or corpora, can be fair use under Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014).
Irrespective of whether a use would be fair, however, you may have agreed to contractual terms that override the fair use you would otherwise be able to make. For example, a database or website you are using might restrict text mining, or might allow text mining but prohibit sharing excerpts from the materials it makes available for mining.
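If it helps to see the two questions above side by side, here is a minimal Python sketch of them as a checklist. Everything in it is a hypothetical illustration: the `MiningPlan` fields, the `review` function, and the wording of the notes are our own assumptions, not part of any Library tool, and the output is a reminder of what to look into, not a conclusion about whether a particular use is fair.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical note texts, paraphrasing the two points above.
FAIR_USE_NOTE = "Mining for scholarly analysis without republishing can be fair use."
REPUBLISH_NOTE = "Republishing corpus content needs its own fair use analysis."
CHECK_LICENSE_NOTE = "Check the license or terms of use before you start mining."
MINING_RESTRICTED_NOTE = "The license restricts mining, whether or not the use would be fair."
SHARING_NOTE = "Confirm that the license permits sharing excerpts before republishing."


@dataclass
class MiningPlan:
    """Illustrative description of a project (all fields are assumptions)."""
    republishes_corpus: bool                 # will you share the source texts themselves?
    license_allows_mining: Optional[bool]    # None means you have not checked yet
    license_allows_sharing: Optional[bool]   # None means you have not checked yet


def review(plan: MiningPlan) -> List[str]:
    """Return reminders for a plan, keeping copyright and contract questions separate."""
    notes = []

    # Question 1: copyright. Analysis alone and republication are treated differently.
    if plan.republishes_corpus:
        notes.append(REPUBLISH_NOTE)
    else:
        notes.append(FAIR_USE_NOTE)

    # Question 2: contracts. These apply irrespective of the fair use answer.
    if plan.license_allows_mining is None:
        notes.append(CHECK_LICENSE_NOTE)
    elif plan.license_allows_mining is False:
        notes.append(MINING_RESTRICTED_NOTE)

    # Sharing excerpts only matters if you plan to republish.
    if plan.republishes_corpus and plan.license_allows_sharing is not True:
        notes.append(SHARING_NOTE)

    return notes


if __name__ == "__main__":
    plan = MiningPlan(republishes_corpus=True,
                      license_allows_mining=True,
                      license_allows_sharing=None)
    for note in review(plan):
        print("-", note)
```

The sketch keeps the copyright question and the contract question in separate branches on purpose, since a favorable answer to one does not settle the other.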
For more on text mining and computational text analysis in general, and to find data sets to work with, the other pages of the Text Mining & Computational Text Analysis guide should prove very useful.
With questions, please contact us at firstname.lastname@example.org.