From library to landscape: integrative annotation workflows for compound libraries in drug repurposing
Abstract
In the rapidly advancing landscape of drug discovery and repurposing, efficient access to and integration of chemical and bioactivity data from public repositories have become essential. To address this need, we developed two complementary annotation pipelines (KNIME- and Python-based) that automate the extraction and integration of curated chemical and bioactivity data. These pipelines support any user-provided compound library, enabling reproducible workflows that integrate data from heterogeneous sources such as ChEMBL and PubChem. As part of the REMEDi4ALL project, which aims to establish a European platform for drug repurposing, we validated our framework using a harmonized subset of the Specs repurposing collection, which includes more than five thousand compounds available at the partner institutes. We also developed two interactive dashboards that support multilayered analyses and visualization by integrating chemical properties, bioactivity profiles, and relational data. Our results demonstrate that this framework streamlines the collection of harmonized data and facilitates analyses that are critical for drug repurposing efforts, while remaining versatile for broader applications in drug discovery. Moreover, the analysis of the annotations reveals that the Specs subset includes chemical scaffolds representative of a significant portion of approved drugs and compounds undergoing clinical evaluation, underscoring its potential as a rich source of drug repurposing candidates. Both pipeline protocols are publicly available online, and the dashboards are open access.
Introduction
Drug repurposing has emerged as a powerful strategy in drug discovery, offering a cost-effective and resource-efficient approach to identifying new therapeutic uses for existing drugs.
This strategy has become increasingly popular, especially for addressing unmet medical needs, e.g., in the case of rare or orphan diseases and urgent health crises. Compared with traditional drug discovery, drug repurposing can shorten development time from 10-17 years to 3-12 years and reduce costs from two to three billion dollars to around three hundred million dollars per approved drug, while increasing the likelihood of success by up to thirty percent. A notable example of successful drug repurposing is azidothymidine, which transitioned from an abandoned anticancer candidate to become the first FDA-approved drug for treating HIV infection. Similarly, thalidomide, which was once withdrawn from the market due to its teratogenic effects, was repurposed first for leprosy and later also for multiple myeloma. Among more recent examples, lonafarnib, which was originally developed for the treatment of cancer, was approved as the first treatment for Hutchinson-Gilford progeria syndrome.
Despite its advantages, data-driven drug repurposing still faces significant challenges. These include difficulties in accessing, integrating, and processing large volumes of heterogeneous chemical, physical, and biological experimental data describing preclinical molecules, clinical candidates, and marketed drugs. Critically, the success of these drug repurposing efforts relies on extensive biological and clinical data found in public repositories, underscoring the need for standardization strategies and analytical methods. Currently, there are more than one hundred public and open databases in the biomedical domain, each covering distinct subjects such as genes, compounds, and diseases. These resources serve as invaluable repositories of scientific knowledge, playing a pivotal role in advancing research and drug discovery. However, maintaining and updating these databases to keep pace with rapid scientific advancements requires dedicated teams of experts.
Repositories such as the Drug Repurposing Hub, ChEMBL, PubChem, DrugBank, Probes and Drugs, and the Guide to PHARMACOLOGY, among others, provide extensive information linking chemical structures, biological activities, mechanisms of action, and clinical data. Because these databases grow continuously, they offer researchers an ever-expanding landscape of molecular and clinical information, allowing increasingly sophisticated drug repurposing strategies. However, the amount and complexity of available data require modern computational approaches to effectively mine and interpret this information. Indeed, advanced techniques in artificial intelligence and machine learning, along with relational databases, are increasingly being employed to uncover hidden patterns and relationships within these datasets, potentially revealing novel drug-disease associations and accelerating the repurposing process.
Despite their individual significance, these databases often remain siloed, limiting their potential to provide holistic insights into complex biological systems and diseases. To overcome this fragmentation, integrated workflows that combine data from multiple sources are essential. Such workflows can uncover hidden relationships among genes, compounds, targets, and diseases, enabling a more comprehensive understanding of biological mechanisms and facilitating translational research. However, significant challenges lie in establishing standardized protocols for data collection, curation, and sharing across the scientific community. A critical factor in the success of these integrative approaches is the selection of resources that adhere to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. Databases compliant with FAIR principles ensure that data can be easily located, accessed, and efficiently integrated across platforms without loss of meaning or context. This compliance is crucial for efficient harmonization of datasets, reducing redundancy, and promoting data interoperability. In the era of data-driven research, the ability to efficiently harmonize and utilize vast biomedical datasets will be a key driver of innovation.
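As a purely illustrative aid, the kind of harmonization described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the pipelines presented in this work: the two sources, their field names, the placeholder keys, and the helpers `normalize_key` and `harmonize` are all hypothetical, and the use of a structure-derived key (such as an InChIKey) as the shared identifier is an assumption made only for the example.

```python
from collections import defaultdict

# Hypothetical records from two siloed sources. Field names and values are
# illustrative placeholders, not real database content.
SOURCE_A = [
    {"inchikey": "KEY-AAA", "name": "compound A", "max_phase": 4},
    {"inchikey": "KEY-BBB", "name": "compound B", "max_phase": 2},
]
SOURCE_B = [
    {"InChIKey": "key-aaa ", "targets": ["T2", "T1"]},
    {"InChIKey": "KEY-CCC", "targets": ["T3"]},
]


def normalize_key(raw):
    """Map differently formatted identifiers onto one canonical form."""
    return raw.strip().upper()


def harmonize(source_a, source_b):
    """Merge both sources into one schema keyed on the shared identifier."""
    merged = defaultdict(
        lambda: {"name": None, "max_phase": None, "targets": [], "sources": []}
    )
    for rec in source_a:
        entry = merged[normalize_key(rec["inchikey"])]
        entry["name"] = rec["name"]
        entry["max_phase"] = rec["max_phase"]
        entry["sources"].append("A")
    for rec in source_b:
        entry = merged[normalize_key(rec["InChIKey"])]
        # Union the target lists so repeated annotations are not duplicated.
        entry["targets"] = sorted(set(entry["targets"]) | set(rec["targets"]))
        entry["sources"].append("B")
    return dict(merged)


unified = harmonize(SOURCE_A, SOURCE_B)
for key in sorted(unified):
    print(key, unified[key])
```

Recording the contributing sources for each entry keeps provenance visible after merging, which is one simple way to preserve context when records from different repositories are combined.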
Harmonized datasets are indeed essential in modern scientific research, particularly in biomedical fields. They enable the integration of diverse data sources into a unified and standardized framework when the FAIR principles are consistently applied. This standardization enhances the ability to validate citations, improves the statistical robustness of analyses, and allows researchers to evaluate the generalizability of findings across different contexts. In drug repurposing, data harmonization significantly accelerates research timelines by streamlining data integration and analysis. Furthermore, harmonized datasets promote interoperability among diverse data sources, facilitating collaboration between researchers and encouraging knowledge sharing throughout the scientific community. For instance, the Alzheimer's Disease Neuroimaging Initiative has contributed to more than six hundred publications on Alzheimer's biomarkers, underscoring the value of standardized datasets. In healthcare, harmonized clinical data enable more precise analyses and diagnoses, facilitate personalized treatments, and enhance the efficiency of AI models. Despite these benefits, challenges such as data heterogeneity, ethical concerns, technical barriers, and varied regional regulations persist. Overcoming these obstacles requires advanced frameworks and universal standards, such as HL7 Fast Healthcare Interoperability Resources (FHIR), an interoperability standard that enables health data exchange between different software systems, to fully harness the potential of harmonized data in driving scientific discovery and innovation.
Despite the development of various methods for collecting and analyzing annotated data from screening libraries, the exponential growth in data volume and complexity requires continued innovation in this field. Accordingly, the development of up-to-date methods remains critical in drug repurposing. Moreover, there is a need for an easy-to-use, standardized method that can be widely adopted across different research settings. Such a protocol would democratize access to powerful data processing and analysis techniques, enabling researchers from diverse backgrounds to effectively explore the information available from chemical repositories.
In this work, we present a framework aimed at advancing drug repurposing efforts through the development of pipelines for annotating compound screening libraries, as well as platforms for visualization and multilayered analysis of annotated data.
The developed workflows are explicitly designed to facilitate automated integration and interpretation of underlying data using dynamic approaches, ensuring alignment with the latest available information. We demonstrate the applicability and reusability of this framework using the Specs Repurposing Library, which is a subset of commercially available compounds described in the Broad Institute's Drug Repurposing Hub. This approach provides solutions for both computational researchers and non-computational scientists, offering robust and practical tools to leverage prior knowledge effectively for their drug repurposing projects. By adhering to FAIR principles, the method ensures high-quality curated data, as well as easy access and reusability. Although the present research focuses on molecules with clinical trial history, aiming to provide a useful resource for informed decision-making in drug repurposing, the reported pipelines can be readily adapted to different drug discovery projects.
To serve a broad community, we developed two distinct, complementary platforms: one based on Python and Neo4j and another built on the KNIME Analytics Platform.
This work is part of a collaborative initiative within the REMEDi4ALL EU project involving scientists from the Fraunhofer Institute for Translational Medicine and Pharmacology and the Karolinska Institute, who have collaborated to unify their compound collections under a shared identification procedure.