Case Study

Leveraging NLP for Improved Data Extraction

One of our large pharmaceutical clients needed a custom solution to aggregate and index data from a variety of documents so that they could augment, assemble, curate, and use healthcare datasets. They wanted an automated way to extract preclinical and clinical data from published literature and other sources.


Arrayo’s solution combined machine learning and other artificial intelligence-based approaches, learning from prior data and continuously improving against a set of QC metrics. Advanced NLP algorithms improved data extraction accuracy. This approach enabled high-quality resolution of ambiguous terms, synonyms, and abbreviations, as well as of lengthy text constructs that express relationships between distant terms.

The software addressed several challenges that our client’s team encountered on a day-to-day basis. Its capabilities included:

  • Enhanced data search that enabled numeric queries, use of the semantic network generated by the NLP module, exact named entity matching, and fuzzy query mapping (finding terms and concepts similar to the one entered).
  • Dataset and controlled vocabulary management, including automated triggering of the data processing as soon as documents were uploaded to the system.
  • Discovery of related documents through content clustering, similarity metrics, ID matching, and other methods (for example, locating all documents with data from a given clinical trial).
  • Data extraction from composite entries such as “number (%)” and “(n=)” patterns.
  • Pre-processing of documents into a format usable by downstream analytical tools (R, MATLAB, Python).
  • Image extraction into files that could be stored and retrieved by document ID and text queries.
  • Search term analysis and indexing of full-text document content (in addition to extracting tables and figures).
  • Extraction of headers and sub-headers from the first column in tables.
  • Processing of HTML (either already downloaded or by using URLs).
  • Extraction of data from Microsoft Word and PowerPoint documents.
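
As an illustration of the composite-entry extraction listed above, the following is a minimal sketch in Python and not Arrayo’s actual implementation. The function name parse_composite, the two regular expressions, and the output field names are all hypothetical, and the sketch assumes plain integers without thousands separators.

```python
import re

# Hypothetical pattern for "number (%)" cells such as "45 (37.5%)" or "45 (37.5)".
COUNT_PCT = re.compile(
    r"(?P<count>\d+)\s*\(\s*(?P<pct>\d+(?:\.\d+)?)\s*%?\s*\)"
)

# Hypothetical pattern for sample-size markers such as "(n=120)" or "(N = 120)".
SAMPLE_N = re.compile(r"\(\s*n\s*=\s*(?P<n>\d+)\s*\)", re.IGNORECASE)


def parse_composite(cell: str) -> dict:
    """Split a composite table cell into separate numeric fields.

    Fields that are not present in the cell are returned as None.
    """
    result = {"count": None, "percent": None, "n": None}

    match = COUNT_PCT.search(cell)
    if match:
        result["count"] = int(match.group("count"))
        result["percent"] = float(match.group("pct"))

    match = SAMPLE_N.search(cell)
    if match:
        result["n"] = int(match.group("n"))

    return result


if __name__ == "__main__":
    for cell in ["45 (37.5%)", "Placebo (n=120)", "no numbers here"]:
        print(cell, "->", parse_composite(cell))
```

Returning separate numeric fields, rather than the original string, is what makes such values usable in numeric queries and in downstream analytical tools.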

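Fuzzy query mapping, mentioned in the search capabilities above, can be sketched in a similar way. The case study does not say which matching method was used, so this example substitutes simple character-level similarity from Python’s standard difflib module for the NLP-generated semantic network. The function name fuzzy_lookup, the sample vocabulary, and the cutoff value are assumptions made for illustration only.

```python
from difflib import get_close_matches


def fuzzy_lookup(query, vocabulary, max_results=3, cutoff=0.75):
    """Return vocabulary terms that resemble the query, best match first.

    Matching is case-insensitive and based on character similarity only;
    it does not capture synonyms or semantic relatedness.
    """
    # Map lowercased terms back to their original spelling.
    by_lower = {term.lower(): term for term in vocabulary}
    hits = get_close_matches(
        query.lower(), list(by_lower), n=max_results, cutoff=cutoff
    )
    return [by_lower[hit] for hit in hits]


if __name__ == "__main__":
    # Hypothetical controlled vocabulary.
    terms = ["Hypertension", "Hypotension", "Hyperglycemia"]
    print(fuzzy_lookup("hypertenson", terms))
```

A production system would more likely combine this kind of string similarity with synonym and abbreviation resolution from a controlled vocabulary.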

Arrayo’s automated solution generated data reproducibly and reliably, directly from incoming documents. This allowed our client to eliminate its reliance on vendor data products and to reduce the labor and licensing costs of its existing data acquisition processes.
