Case Study

Using NLP to Extract Biologics Data for R&D Support

Arrayo delivered a cloud-based solution composed of the following components:

  • Document repository connector for input document management, including document metadata management and tracking.
  • Data ingestion and processing pipeline engine that enabled reproducible document parsing and robust operational support.
  • Customizable document parser with user-friendly features that included administration of templates for standardized document processing, controlled vocabulary management and standardized testing to assess accuracy.
  • Indexed content storage solution and infrastructure for exporting data in customizable output formats, as well as data analysis, reporting, and advanced search capabilities.
  • Administration interface (using built-in AWS capabilities) that enabled users to view parser input and output, maintain pipeline jobs/runs, and review logs.
  • Built-in NLP platform with integration of custom machine learning algorithms.
  • For each PDF report, we identified the biologic entity ID or name, the efficacy data, and the characterization methods used.
  • We extracted key parameters, conditions, SOPs, and results, and transformed them into Excel or database format for later use.
  • Where different names were used for the same drugs, methods, parameters, conditions, results, etc., we unified them.
  • The pipeline was developed and run on AWS infrastructure.
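
The name-unification and export steps described above can be illustrated with a minimal Python sketch. It is not Arrayo's implementation: the vocabulary entries, the record fields (entity, method), the function names, and the sample values are all hypothetical placeholders chosen for illustration. It assumes a controlled vocabulary that maps each canonical term to its known variants, and it writes CSV as a stand-in for the Excel or database output.

```python
import csv
import io

# Hypothetical controlled vocabulary (illustrative only): each canonical
# term maps to variant names that might appear in different reports.
CONTROLLED_VOCABULARY = {
    "size-exclusion chromatography": ["SEC", "SE-HPLC", "size exclusion chromatography"],
    "differential scanning calorimetry": ["DSC"],
}


def build_synonym_index(vocabulary):
    """Map every lower-cased variant (and canonical term) to its canonical term."""
    index = {}
    for canonical, variants in vocabulary.items():
        index[canonical.lower()] = canonical
        for variant in variants:
            index[variant.lower()] = canonical
    return index


def normalize_term(term, index):
    """Return the canonical term, or the stripped input if it is not in the vocabulary."""
    cleaned = term.strip()
    return index.get(cleaned.lower(), cleaned)


def normalize_records(records, index, field="method"):
    """Return copies of the records with one field mapped to canonical terms."""
    return [{**record, field: normalize_term(record[field], index)} for record in records]


def to_csv(records, fieldnames):
    """Serialize records as CSV text that Excel or a database loader can read."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    index = build_synonym_index(CONTROLLED_VOCABULARY)
    # Placeholder records standing in for values parsed from PDF reports.
    extracted = [
        {"entity": "EXAMPLE-1", "method": "SE-HPLC"},
        {"entity": "EXAMPLE-2", "method": "dsc"},
    ]
    print(to_csv(normalize_records(extracted, index), ["entity", "method"]))
```

Terms that are not in the vocabulary pass through unchanged in this sketch, so unmapped names remain visible for review instead of being silently dropped.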

Data was extracted, cleaned, and transformed with the maximum accuracy possible.

Most of the documents were historical and had already been assembled in a document store. They needed to be processed in batch, as a one-time effort, with no manual workflow involved. The NLP data processing pipeline would automatically process the historical biologics manufacturing stability reports and feed the results into the Biologics Developability Platform. Once the initial historical data was processed, there were no substantial scalability or deployment considerations. If additional, similarly structured documents were generated and used by the R&D processes, those new documents would be processed automatically on an ongoing basis with similar accuracy.


The project created a document data processing pipeline that extracted, cleaned, and transformed biologics stability data and made it available to support R&D projects, such as building internal models for in silico Biologics Developability Prediction in early R&D. This solution is a good example of how cross-functional collaboration on AI/ML platforms can be leveraged to support R&D.
