Leveraging NLP for Improved Data Extraction

One of our large pharmaceutical clients needed a custom solution to aggregate and index data from a variety of documents so that they could augment, assemble, curate, and use healthcare datasets. They were looking for an automated way to extract preclinical and clinical data from published literature and other sources.

Delivery

Arrayo’s solution used machine learning and other artificial intelligence-based approaches that learned from prior data and improved continuously against a set of QC metrics. Advanced NLP algorithms improved data extraction accuracy. This approach enabled high-quality resolution of ambiguous terms, synonyms, and abbreviations, as well as of lengthy text constructs that express distant relationships between terms.

The software addressed several challenges encountered by our client’s team on a day-to-day basis, including:

Enhanced data search that enabled numeric queries, use of the semantic network generated by the NLP module, exact named entity matching, and fuzzy query mapping (finding terms and concepts similar to the one entered).
Dataset and controlled vocabulary management, including automated triggering of the data processing as soon as documents were uploaded to the system.
Discovery of related documents (content clustering and similarity metrics, ID matching and other methods, such as locating all documents with data from a given clinical trial).
Data extraction from composite entries such as “number (%)” and “(n=)” patterns.
Pre-processing of documents into a format usable by downstream analytical tools (R, Matlab, Python).
Image extraction into files that could be stored and retrieved by document ID and text queries.
Search term analysis and indexing of full-text document content (in addition to extracting tables and figures).
Extraction of headers and sub-headers from the first column in tables.
Processing of HTML (either already downloaded or by using URLs).
Extraction of data from Microsoft Word and PowerPoint documents.
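
To illustrate two of the items above (composite-entry extraction and fuzzy query mapping), here is a minimal Python sketch. It is not Arrayo’s implementation: the regular expressions, the function names (`parse_count_pct`, `parse_n`, `fuzzy_lookup`), the sample vocabulary, and the 0.8 similarity cutoff are all assumptions made for this example, and it uses only the standard library.

```python
import re
from difflib import get_close_matches

# Hypothetical patterns for illustration; real documents would need more variants.
# Matches composite cells such as "12 (34.5%)" or "12 (34.5)".
COUNT_PCT = re.compile(
    r"(?P<count>\d+)\s*\(\s*(?P<pct>\d+(?:\.\d+)?)\s*%?\s*\)"
)
# Matches sample-size annotations such as "(n=45)" or "(N = 120)".
N_EQUALS = re.compile(r"\(\s*n\s*=\s*(?P<n>\d+)\s*\)", re.IGNORECASE)


def parse_count_pct(cell):
    """Split a 'number (%)' cell into (count, percent), or return None."""
    match = COUNT_PCT.search(cell)
    if match is None:
        return None
    return int(match.group("count")), float(match.group("pct"))


def parse_n(text):
    """Return the integer from an '(n=...)' annotation, or None if absent."""
    match = N_EQUALS.search(text)
    return int(match.group("n")) if match else None


def fuzzy_lookup(query, vocabulary, cutoff=0.8):
    """Return up to three vocabulary terms similar to the query, best first.

    Similarity is difflib's character-based ratio, a simple stand-in for
    whatever concept-level matching a production system would use.
    """
    by_lowercase = {term.lower(): term for term in vocabulary}
    hits = get_close_matches(
        query.lower(), list(by_lowercase), n=3, cutoff=cutoff
    )
    return [by_lowercase[hit] for hit in hits]


if __name__ == "__main__":
    print(parse_count_pct("12 (34.5%)"))   # (12, 34.5)
    print(parse_n("Placebo (n=45)"))       # 45
    vocabulary = ["Hypertension", "Hypotension", "Hyperglycemia"]
    print(fuzzy_lookup("hypertention", vocabulary))
```

Character-level similarity only catches spelling variants; mapping a query to related concepts, as described above, would additionally rely on the semantic network and controlled vocabularies.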

Value

Arrayo’s automated solution generated data reproducibly and reliably, directly from incoming documents. This allowed our client to eliminate its reliance on vendor data products and to reduce the labor and licensing costs of its existing data acquisition processes.

Arrayo

Your new source for insights at the intersection of data, financial services, and life sciences.
