In this blog, Stephen Ryan takes us through what Natural Language Processing is and how it can help you interpret large volumes of text by summarising them, using a recent report on Life Sciences as an example.
A computer can read through a cleaned-up file and summarise it in a second or two using natural language processing (NLP). NLP is a branch of artificial intelligence that helps computers understand and manipulate human language. It can break down language into shorter, elemental pieces, which is really helpful when you need to retrieve the essential information from text-based sources such as research reports.
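To make "breaking language into shorter, elemental pieces" concrete, here is a minimal sketch of tokenisation using only Python's standard library. The regular expressions are a deliberately naive stand-in for the proper tokenisers found in NLP libraries such as NLTK or spaCy, and the function name is illustrative, not part of any particular toolkit.

```python
import re

def tokenise(text):
    """Break text into sentences, then each sentence into lower-cased words."""
    # Naive sentence split: a full stop, ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Within each sentence, keep only runs of letters/apostrophes as "words".
    return [re.findall(r"[a-z']+", s.lower()) for s in sentences]

tokens = tokenise("NLP breaks language down. It works on elemental pieces!")
# tokens[0] -> ['nlp', 'breaks', 'language', 'down']
```

Real tokenisers handle abbreviations, hyphenation and punctuation far more carefully, but the principle is the same: turn a wall of text into units a program can count and rank.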
At Didobi, we research commercial real estate for clients such as the Urban Land Institute, the Investment Property Forum, institutional investors and others. When embarking on a new research project, being able to summarise reports (especially long ones) can save hours or even days. There are different types of summary, but we usually get the computer to assemble what is called an extractive summary – a subset of words and sentences containing the key points, using only words found in the original data. The summary is extracted by applying rules to rank words, phrases and sentences. Armed with a summary of each potentially interesting file, we can identify those which merit closer (i.e. human) scrutiny.
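The ranking idea behind an extractive summary can be sketched in a few lines: score each sentence by how often its words occur across the whole document, then keep the top scorers in their original order. This is a minimal frequency-based illustration, not Didobi's actual method; the stopword list and function name are assumptions for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real pipelines use much longer ones.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "it"}

def extractive_summary(text, n_sentences=2):
    """Score each sentence by the document-wide frequency of its
    non-stopword words; return the top sentences in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)
    scored = [(sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())
                   if w not in STOPWORDS), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:n_sentences]
    return [s for _, i, s in sorted(top, key=lambda t: t[1])]

text = ("Life sciences demand lab space. "
        "Lab space is scarce. "
        "The weather was pleasant.")
summary = extractive_summary(text, n_sentences=2)
# -> ['Life sciences demand lab space.', 'Lab space is scarce.']
```

Because every sentence in the output comes verbatim from the source, the summary uses only words found in the original data, exactly as described above.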
So let’s say a Google search points us towards 50 reports that might be relevant to a research project. By summarising all of them and scanning their summaries, the top candidates for in-depth reading reveal themselves.
Staying up to date is time-consuming. Yesterday, an impressive-looking 82-page report came across our radar. The authors were very reputable, and their chosen topic is of interest…but a report of this length is not something you could absorb in a few minutes. So the question was: should we read it or not? Very quickly we were able to convert the report, which was a PDF file, to plain text and summarise it. We extracted the 15 most representative sentences, and this summary indicated that the report is indeed well worth reading. The report is called Life Sciences Innovation: Building the Fourth Industrial Revolution by Blackstock Consulting.
By the way, when summarising text data we tailor our summarisation technique to the length and nature of the source. An 82-page report is summarised in sentences but a shorter text such as a collection of emails might be summarised in “chunks”, which are small clusters of words.
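Splitting text into fixed-size "chunks" is simple to sketch. This is a generic illustration of the idea, assuming chunks are just consecutive runs of words; the chunk size and function name are arbitrary choices for the example.

```python
def chunk(words, size=5):
    """Split a token list into fixed-size chunks - small clusters of words.
    The final chunk may be shorter if the text doesn't divide evenly."""
    return [words[i:i + size] for i in range(0, len(words), size)]

tokens = "the report covers life sciences real estate demand".split()
chunks = chunk(tokens, size=3)
# -> [['the', 'report', 'covers'],
#     ['life', 'sciences', 'real'],
#     ['estate', 'demand']]
```

Each chunk can then be scored and ranked just like a sentence, which is why chunking suits short, fragmentary sources such as emails.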
But summaries are just part of the story. NLP is also a fantastic tool for handling the outputs from roundtables and interviews. As part of a recent research project on life sciences real estate, we had to dissect the transcripts of two virtual roundtable sessions. Each session lasted an hour and generated about 7,000 words, which is roughly 14 A4 pages of text. Using NLP, we could quickly spot patterns (such as repeated noun phrases) in the delegates’ contributions. For that project, we also held a series of telephone interviews; once again, NLP assisted us in extracting key phrases from the transcripts of those interviews.
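Spotting repeated phrases can be approximated by counting n-grams (runs of n consecutive words) and keeping those that recur. This is a crude stand-in for proper noun-phrase detection, which needs part-of-speech tagging from a library such as NLTK or spaCy; the function name and thresholds here are illustrative.

```python
import re
from collections import Counter

def repeated_phrases(text, n=2, min_count=2):
    """Count n-word phrases (n-grams) in the text and return those
    that occur at least min_count times."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {" ".join(g): c for g, c in grams.items() if c >= min_count}

phrases = repeated_phrases(
    "Lab space is tight. Delegates said lab space drives rents.")
# -> {'lab space': 2}
```

Run over 7,000 words of transcript, a table like this surfaces the themes delegates kept returning to, ready for a human to investigate.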
As researchers, we need to handle large volumes of data. That data is either structured or unstructured. Structured data is neatly arranged in rows and columns, like a spreadsheet. Unstructured data is messy and disorganised, shambling towards us as PDFs, Word documents, presentations, open-ended survey responses, webpages and the transcripts from interviews and roundtable discussions. In an ideal world, we would be able to manipulate words and sentences as easily as a spreadsheet can handle numbers. With NLP, we can (almost) do that.