Case Study

Intelligent Insight Engine and Data Extraction & Integration from Unstructured Documents

Our client had generated a large volume of data that was processed and stored in a variety of unstructured documents, such as electronic lab notebook (ELN) outputs, spreadsheets, and PDFs. Because the data was embedded in these documents, it could not be easily extracted or transformed, and it ended up in a document store (i.e., SharePoint). The client needed a solution that would enable them to extract, aggregate, and index data from unstructured documents, particularly scientific data found in tables and graphs. The primary challenge was handling a variety of complex table layouts, including merged cells, differing header types, and units of measurement placed in varying locations.


Arrayo provided a customized, machine-learning-based Intelligent Insight Engine that leveraged NLP and controlled vocabularies to extract data from tables and graphs in unstructured ELN output documents, PDF reports, PowerPoint presentations, and Excel spreadsheets. The Intelligent Insight Engine was developed with custom Python code and open-source technologies, and it was delivered as a containerized application hosted in an Amazon Web Services (AWS) cloud environment. The Engine was delivered on the client’s cloud platform of choice and integrated successfully with their digital ecosystem.


The Intelligent Insight Engine successfully identified complex tables, extracted scientific data, and stored the extracted data in a database. The solution enabled the pharmaceutical company to perform downstream data science activities with the newly extracted data, such as AI modeling, analysis, and visualization. The Engine also gave our client the ability to search and explore the data, because the extracted data was stored as key-value pairs.
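
To make the table-layout challenge and the key-value storage more concrete, the following is a minimal illustrative sketch in Python. It is not the Engine's actual implementation: the sample table, the function names (`split_header`, `table_to_records`), and the record fields are all hypothetical, and simple regular expressions stand in for the machine learning and NLP components described above. It shows one way a table with a merged cell and units in varying locations could be flattened into searchable key-value records.

```python
import re

# Hypothetical raw table, as a document parser might emit it. None marks a
# cell that was covered by a merged cell in the original document.
RAW_TABLE = [
    ["Sample", "Concentration (mg/mL)", "pH"],
    ["Batch A", "1.5", "7.2"],
    [None, "1.7", "7.1"],          # "Batch A" spans two rows
    ["Batch B", "2.0 mg/mL", "6.9"],  # unit written in the cell instead
]

# Unit in the header, e.g. "Concentration (mg/mL)".
UNIT_IN_HEADER = re.compile(r"^(?P<name>.+?)\s*\((?P<unit>[^)]+)\)$")
# Numeric value with an optional trailing unit, e.g. "2.0 mg/mL".
UNIT_IN_CELL = re.compile(r"^(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>\S.*)?$")


def split_header(header):
    """Return (name, unit) for a header cell; unit is None if absent."""
    match = UNIT_IN_HEADER.match(header)
    if match:
        return match.group("name"), match.group("unit")
    return header, None


def table_to_records(table):
    """Flatten a table into one key-value record per data cell."""
    headers = [split_header(h) for h in table[0]]
    records = []
    last_label = None
    for row_number, row in enumerate(table[1:], start=1):
        # Forward-fill the row label when a merged cell left it empty.
        label = row[0] if row[0] is not None else last_label
        last_label = label
        for (name, header_unit), cell in zip(headers[1:], row[1:]):
            match = UNIT_IN_CELL.match(cell.strip())
            if match:
                value = float(match.group("value"))
                # A unit in the cell takes precedence over the header unit.
                unit = match.group("unit") or header_unit
            else:
                value, unit = cell, header_unit
            records.append({
                "row_label": label,
                "row": row_number,
                "key": name,
                "value": value,
                "unit": unit,
            })
    return records


if __name__ == "__main__":
    for record in table_to_records(RAW_TABLE):
        print(record)
```

Each resulting record carries its row label, measurement name, value, and unit, so records of this shape could be indexed and queried by key regardless of how the source table was laid out.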
