
Why we should care about text and data mining

By Rob Johnson

I was in Paris yesterday to present the findings of a new study on the use of text and data mining in public research. The event was organised by the ADBU, the French association of university library directors, and I was joined by speakers from ContentMine and NaCTeM in the UK, and ISTEX and INRA in France.

I would hazard a guess that very few of my readers have heard of any of these organisations. It’s fair to say that TDM, as it is commonly known, is a somewhat esoteric subject. Getting to grips with it requires a working knowledge of intellectual property law, some expertise in software coding – and, in many cases, a love of darkened rooms. Check out the Twitter hashtag #tdm and you’ll find a smattering of posts about ‘text and data mining’ liberally sprinkled with others on ‘total dad moves’. This is not a subject that garners much public attention.

What exactly is TDM?

Part of the reason for this is that TDM is rather hard to define. It encompasses a huge range of techniques for analysing text and data in digital form and generating new information. Yet the fact is TDM is part of all our lives, even if we don’t realise it.

Ever wondered how your phone prompts you about upcoming events based only on your emails? Or how your email provider filters out spam? That’s because Google, Apple or some other big company has developed algorithms to text mine your inbox.

What about Amazon’s eerily accurate recommendations for things you might want to buy? Or Facebook’s controversial news feeds? Yep, that’s all driven by data mining.

The growing use of these and other TDM techniques in the corporate world has opened up a Pandora’s box of privacy concerns and ethical issues. But these technologies also have tremendous potential to be used by researchers for the public good.

Why TDM is important for research

Having spent 6 months looking at this area on behalf of the ADBU, I’m convinced that TDM is important – not just for researchers, but for society as a whole. We invest a lot of money in public research – about €16 billion per year in France and €12 billion per year in the UK, according to Eurostat. There is intrinsic value in the pursuit of knowledge and cultural enrichment, but, increasingly, this investment is based on an expectation of economic and social returns. The UK government’s recent announcement of an extra £2 billion per year for research and development is testament to this.

Yet researchers face a growing problem – there is too much data and knowledge out there for them to keep up with. The literature review, a staple part of almost any research investigation, is becoming a Herculean task in many fields. TDM can address this problem, helping researchers sift through mountains of text and data to find the parts that are relevant to their research.

Yet it can also do much more. We found TDM being used to make new connections between chemicals and diseases, to accelerate research into Alzheimer’s disease, and to track the evolution of science through time. From life sciences to the humanities, chemistry to social science, the potential of TDM to transform research is enormous.

So what’s the problem?

If TDM has such potential, why isn’t it used more widely? This is the question we set out to address over the last 8 months, speaking with researchers in the UK, France and beyond to understand their experiences of using TDM. Our findings are set out in full in our report and accompanying case studies, published yesterday, and summarised in my slides here.

Part of the issue lies in archaic copyright legislation in Europe, meaning researchers don’t have the right to use TDM techniques even where they have lawful access to content. This is changing, with a new law recently passed in France, and a new Directive on the table from the EC. But the challenges run much deeper than that. The UK introduced a copyright exception for TDM in 2014, but so far it hasn’t led to a significant increase in uptake by researchers. When climbing the Everest of TDM, it’s clear that changing the law only gets us as far as base camp.


Our report attributes the slow rate of TDM adoption by UK and French researchers to a range of different factors, namely:

  • Uncertainty over the scope and application of copyright legislation
  • Difficulties in gaining access to content
  • Inadequate infrastructure
  • Gaps in skills and support
  • Lack of funding and incentives

Where do we go from here?

There are already indications that the centre of gravity for text and data mining is shifting away from Europe to other parts of the world, most notably Asia. If we are to put European researchers on a level playing field, action is needed in a number of areas:

  • Librarians need to offer better support and advice to researchers
  • Legislators need to provide greater clarity on what is permissible under the law
  • Research leaders and institutions need to endorse the use of TDM
  • Funders need to invest in the infrastructure needed to support TDM
  • Policymakers need to develop incentives for researchers to adopt the technique
  • Publishers and infrastructure providers need to adopt open, harmonised standards, and streamline access to content

How quickly these changes are delivered will determine which countries and continents stand to benefit the most from TDM. What is clear is that it is going to be a critical tool for the researchers of tomorrow. Move over, ‘total dad moves’: text and data mining is coming, and it’s going to be big.
