Using NLP to Extract Biologics Data for R&D Support

Arrayo delivered a cloud-based solution comprising the following components:

Document repository connector for managing input documents, including document metadata management and tracking.
Data ingestion and processing pipeline engine that enabled reproducible document parsing and robust operational support.
Customizable document parser with user-friendly features, including template administration for standardized document processing, controlled vocabulary management, and standardized testing to assess accuracy.
Indexed content storage and infrastructure for exporting data in customizable output formats, as well as data analysis, reporting, and advanced search capabilities.
Administration interface, built on AWS capabilities, that enabled users to view parser input and output, maintain pipeline jobs and runs, and review logs.
Built-in NLP platform with integration of custom machine learning algorithms.
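
To make the idea of a template-driven parser with standardized accuracy testing more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the delivered implementation: the template fields, the regular expressions, the sample text, and the function names parse_with_template and field_accuracy are all hypothetical, and the text is assumed to have already been extracted from the PDF.

```python
import re

# Hypothetical template: field name -> regex with exactly one capture group.
STABILITY_TEMPLATE = {
    "entity_id": re.compile(r"Molecule ID:\s*(\S+)"),
    "method": re.compile(r"Method:\s*(.+)"),
    "temperature_c": re.compile(r"Storage temperature:\s*(-?\d+(?:\.\d+)?)\s*C"),
}


def parse_with_template(text, template):
    """Return field -> extracted string, or None when a field is not found."""
    record = {}
    for field, pattern in template.items():
        match = pattern.search(text)
        record[field] = match.group(1).strip() if match else None
    return record


def field_accuracy(predicted, expected):
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 0.0
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)


# Hypothetical report excerpt, used only to exercise the template.
SAMPLE_TEXT = """Molecule ID: mAb-001
Method: SEC-HPLC
Storage temperature: 40 C
"""

parsed = parse_with_template(SAMPLE_TEXT, STABILITY_TEMPLATE)
```

Comparing parsed records against a small hand-labeled set with a function like field_accuracy is one simple way to run the same accuracy check every time a template changes.
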
For each PDF report, we identified the biologic entity ID or name, the efficacy data, and the characterization methods used.
We extracted key parameters, conditions, SOPs, and results, and transformed them into an Excel or database format for later use.
Where different names were used for the same drug, method, parameter, condition, or result, we unified them.
The pipeline was developed and run on AWS infrastructure.
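
The unification of differing names can be sketched as a lookup against a controlled vocabulary. The following Python example is a minimal sketch under assumptions: the vocabulary entries are illustrative, the helper names normalize_key, build_lookup, and unify are hypothetical, and the source does not state how the actual matching was done (it may well have used NLP or machine learning rather than exact lookup).

```python
# Hypothetical controlled vocabulary: canonical term -> known variants.
CONTROLLED_VOCABULARY = {
    "SEC-HPLC": ["size exclusion chromatography", "SE-HPLC"],
    "DSC": ["differential scanning calorimetry"],
}


def normalize_key(term):
    """Lowercase, treat hyphens as spaces, and collapse repeated whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def build_lookup(vocabulary):
    """Map every normalized variant (and the canonical term itself) to the canonical term."""
    lookup = {}
    for canonical, variants in vocabulary.items():
        for term in [canonical, *variants]:
            lookup[normalize_key(term)] = canonical
    return lookup


def unify(term, lookup):
    """Return the canonical term, or the stripped input unchanged if it is unknown."""
    return lookup.get(normalize_key(term), term.strip())


LOOKUP = build_lookup(CONTROLLED_VOCABULARY)
```

Returning unknown terms unchanged, rather than guessing, keeps unmatched names visible so they can be reviewed and added to the vocabulary.
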

Data was extracted, cleaned, and transformed with the maximum accuracy possible.

The bulk of the documents were historical and had already been assembled into a document store. They needed to be processed in batch, as a one-time effort, with no human-powered workflow involved. The NLP data processing pipeline would automatically process the historical biologics manufacturing stability reports and feed the results into the Biologics Developability Platform. There were no substantial scalability or deployment considerations once the initial historical data had been processed. If additional, similarly structured documents were generated and used by the R&D processes, those new documents would be processed automatically on an ongoing basis with similar accuracy.
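
A one-time batch run that can later pick up new documents might look roughly like the Python sketch below. It is not the delivered pipeline: it assumes the report text has already been extracted to .txt files, and the extractor, the "Molecule ID:" marker, the CSV columns, and the already_done parameter are hypothetical stand-ins.

```python
import csv
from pathlib import Path

FIELDS = ["document", "status", "entity_id"]


def extract_entity_id(text):
    """Placeholder extractor (hypothetical): first token after 'Molecule ID:'."""
    marker = "Molecule ID:"
    if marker not in text:
        raise ValueError("no entity ID found")
    tokens = text.split(marker, 1)[1].split()
    if not tokens:
        raise ValueError("no entity ID found")
    return tokens[0]


def process_batch(input_dir, output_csv, already_done=frozenset()):
    """Process each .txt file once; record failures instead of stopping the run."""
    rows = []
    for path in sorted(Path(input_dir).glob("*.txt")):
        if path.name in already_done:
            continue  # a later run only handles documents it has not seen
        try:
            entity_id = extract_entity_id(path.read_text(encoding="utf-8"))
            rows.append({"document": path.name, "status": "ok", "entity_id": entity_id})
        except ValueError as exc:
            rows.append({"document": path.name, "status": f"failed: {exc}", "entity_id": ""})
    # Write one row per document so the run can be audited afterwards.
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Because no human reviews documents during the run, recording a status per document gives an audit trail, and passing the names of previously processed files as already_done lets the same function serve the ongoing case.
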

Value

The project created a document data processing pipeline that extracted, cleaned, and transformed the biologics stability data and made it available to support R&D research projects, such as building internal models for in silico biologics developability prediction in early R&D. This solution is a good example of how cross-functional collaboration on AI/ML platforms can be leveraged to support R&D.

Related Insights

Splicing Event Navigator

Exon Dashboard

Genomics Pipeline

Custom LIMS System

Scalable NGS Pipelines for Bioinformatics Groups

Intelligent Insight Engine and Data Extraction & Integra...

Molecular Structure Search Engine

Self-Guided Analytics Pipeline

Biomarker Informatics

Leveraging NLP for Improved Data Extraction

Building a Knowledge Graph from Audio Interview Data

Using NLP to Extract Biologics Data for R&D Support

Providing Laboratory Informatics System Services in a Benchl...

LIMS Software Integration

Arrayo

Your new source for insights at the intersection of data, financial services, and life sciences.
