
My computer reads and summarises files for me. Does yours?

In this blog, Stephen Ryan explains what natural language processing is and how it can help you interpret large volumes of text by summarising them, using a recent report on life sciences as an example.

A computer can read through a cleaned-up file and summarise it in a second or two using natural language processing (NLP). NLP is a branch of artificial intelligence that helps computers understand and manipulate human language. It can break down language into shorter, elemental pieces, which is really helpful when you need to retrieve the essential information from text-based sources such as research reports.

At Didobi, we research commercial real estate for clients such as the Urban Land Institute, the Investment Property Forum, institutional investors and others. When embarking on a new research project, being able to summarise reports (especially long ones) can save hours or even days. There are different types of summary, but we usually get the computer to assemble what is called an extractive summary: a subset of sentences containing the key points, using only words found in the original data. The summary is extracted by applying rules to rank words, phrases and sentences. Armed with a summary of each potentially interesting file, we can identify those which merit closer (i.e. human) scrutiny.
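To make the idea of ranking words and sentences concrete, here is a minimal sketch of one common rule: score each sentence by how frequent its words are across the whole text, then keep the top scorers in their original order. This is an illustration only, not necessarily the method we use in practice. The function names, the tiny stop-word list and the naive sentence splitter are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a real project would use a much fuller one.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for",
    "on", "with", "that", "this", "it", "as", "by", "be", "was", "at",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lower-case words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarise(text, n_sentences=3):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Word frequencies across the whole document.
    word_freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        if not words:
            return 0.0
        # Average frequency, so long sentences are not favoured for length alone.
        return sum(word_freq[w] for w in words) / len(words)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in keep]
```

Calling `summarise(report_text, n_sentences=15)` on the plain text of a report would return its 15 highest-scoring sentences. Because every sentence is copied verbatim from the source, the result is extractive by construction.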

So let’s say a Google search points us towards 50 reports that might be relevant to a research project. By summarising all of them and scanning their summaries, the top candidates for in-depth reading reveal themselves.

Staying up to date is time-consuming. Yesterday, an impressive-looking 82-page report came across our radar. The authors were very reputable, and their chosen topic is of interest, but a report of this length is not something you could absorb in a few minutes. So the question is whether to read it or not. Very quickly we were able to convert the report, which was a PDF file, to plain text and summarise it. We extracted the 15 most representative sentences and this summary indicated that the report is indeed well worth reading. The report is called Life Sciences Innovation: Building the Fourth Industrial Revolution by Blackstock Consulting.

Our research using NLP told us:

  1. Centres of life sciences innovation, like the Crick Institute, play an essential role in creating ecosystems for life sciences companies to thrive in, but also help attract world-leading talent.
  2. The term “life sciences” includes companies working within life sciences, medical technology and healthcare services and products as per Beauhurst’s categories.
  3. With COVID-19 further highlighting the resilience of life sciences companies, the global competition for life sciences investment will only increase in the coming years.
  4. This will allow key life sciences stakeholders to benefit while also promoting better life sciences integration and collaboration across sub-sectors of the industry.
  5. Yet, despite increasing levels of investment and world-renowned academic research, the UK and Europe both face a critical undersupply of lab space, particularly for intermediate-sized life sciences companies.
  6. Whilst the US has traditionally been thought of as the leader in creating life sciences ecosystems, the UK has also successfully created several thriving life sciences hubs.
  7. To develop the UK life sciences industry, a UK equivalent tech and life sciences index needs to be created.
  8. There are certain elements which are critical when investing in life sciences real estate: abundant qualified human capital, robust invested capital, and government-sponsored funding into research and development.
  9. Flexible and adaptable space are universal needs for all life sciences companies, but certain niche specialisms require additional, specific design considerations.
  10. Design plays a significant role in supporting collaboration, encouraging innovation, and accelerating the commercialisation of research across every life sciences discipline.
  11. We have learnt from our research and real life experience of accommodating life sciences occupiers at White City.
  12. In turn, this will impact the lab space requirements of life sciences companies in this sector, who no longer require spaces that can accommodate heavy chemicals.
  13. According to Deloitte, in 2019, life sciences companies announced deals to acquire 37 technology companies, while the medical technology sector had a turnover of £25.6 billion.
  14. Much of the work conducted by life sciences companies, pharmaceutical, biotech, and other medical research fields, is simply impossible to conduct remotely.
  15. As well as supporting the growth of life sciences companies, innovation districts also play an essential role in recruiting and retaining talent.

By the way, when summarising text data we tailor our summarisation technique to the length and nature of the source. An 82-page report is summarised in sentences, but a shorter text such as a collection of emails might be summarised in “chunks”, which are small clusters of words.

But summaries are just part of the story. NLP is also a fantastic tool for handling the outputs from roundtables and interviews. As part of a recent research project on life sciences real estate, we had to dissect the transcripts from two virtual roundtable sessions. Each session lasted an hour and generated about 7,000 words, which is roughly 14 A4 pages of text. Using NLP, we could quickly spot patterns (such as repeated noun phrases) in the delegates’ contributions. For that project, we also held a series of telephone interviews; once again, NLP assisted us in extracting key phrases from the transcripts of those interviews.
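As a rough illustration of spotting repeated phrases in a transcript, the sketch below counts short word sequences (n-grams) that occur more than once. Note the swap: identifying true noun phrases needs a part-of-speech tagger, so this example uses repeated n-grams that neither start nor end with a stop word as a crude stand-in. The function name and the stop-word list are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical minimal stop-word list (an assumption for this sketch).
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for",
    "on", "with", "that", "this", "it", "as", "by", "be", "was", "at", "we",
}


def repeated_phrases(text, n=2, min_count=2):
    """Return (phrase, count) pairs for n-word phrases seen at least min_count times.

    Phrases that begin or end with a stop word are skipped. Sentence
    boundaries are ignored, so a phrase may occasionally span two sentences.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
            continue
        counts[" ".join(gram)] += 1
    # most_common() sorts by count, highest first.
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]
```

Run over a 7,000-word roundtable transcript with `n=2` or `n=3`, a function like this gives a quick list of the topics delegates kept returning to, which can then be checked against the text by a human reader.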

As researchers, we need to handle large volumes of data. That data is either structured or unstructured. Structured data is neatly arranged in rows and columns, like a spreadsheet. Unstructured data is messy and disorganised, shambling towards us as PDFs, Word documents, presentations, open-ended survey responses, webpages and the transcripts from interviews and roundtable discussions. In an ideal world, we would be able to manipulate words and sentences as easily as a spreadsheet can handle numbers. With NLP, we can (almost) do that.