Table of contents

Before:

Existing molecular docking methods for RNA-ligand interaction prediction had low accuracy (50-60% AUROC) and struggled with variable-length RNA sequences, limiting drug discovery efficiency.

After:

The custom ML model achieved 65-70% AUROC and EF10 scores of ~3 (vs. 1-1.7 for SoTA), enabling 30% of active binders to be identified in the top 10% of ranked compounds, significantly accelerating RNA-targeted drug discovery.

About the Project

‍

In pharmaceutical research, exploring RNA-ligand interactions is a complex challenge, unlike the more understood protein-ligand interactions. This case study showcases our collaboration with the International Institute of Molecular and Cell Biology in Warsaw (IIMCB), a leading research institute, where we developed machine learning models to predict RNA-ligand binding with remarkable accuracy.

‍

The collaboration resulted in achieving AUROC scores between 65-68% on test datasets, notably outperforming the state-of-the-art (SoTA) methods, which typically score between 50-60%.

‍

Project Overview

‍‍

The Challenge

‍

RNA molecules are emerging as crucial therapeutic targets, such as bacterial ribosomes and human pre-mRNA of the survival of motor neuron 2 (SMN2) protein. However, the flexible and dynamic nature of RNA structures makes accurate targeting difficult. The limited availability of experimental RNA-ligand interaction data further complicates the development process.

Current methods, including molecular docking techniques, struggle to accurately model these interactions, which hinders the development of RNA-targeted drugs. The limited availability of experimental RNA-ligand interaction data further complicates the development process, making it difficult to train and validate computational models.

‍

Key challenges include

The flexible and dynamic nature of RNA structures compared to proteins, making them harder to target accurately
Lack of comprehensive datasets for RNA-ligand interactions
Limitations of existing computational models in handling variable-length RNA sequences
The need for models that can generalize well, given the limited experimental data available for many RNA targets

These challenges collectively contribute to the complexity of developing effective RNA-targeted therapeutics, underscoring the need for innovative approaches in predicting RNA-ligand interactions.Our ApproachOur team adopted a multi-faceted approach to address these challenges:

Interdisciplinary Collaboration: We partnered with bioinformatics experts Filip Stefaniak and Natalia Szulc from IIMCB, combining their domain knowledge with our machine learning expertise.
Data Preparation: IIMCB provided extensive datasets containing experimental results of interactions between variousRNAs and tens of thousands of ligands. The data showed whether or not a ligand binds to a specific RNA. For every pair of RNA and ligand 3D structures, and for each nucleotide in the RNA, their software, fingeRNAt [1], generated a sequence of numbers, representing the nature of the noncovalent interactions.
Feature Engineering: The IIMCB team computed features reflecting the chemistry of RNA-ligand interactions at the nucleotide level, based on experimentally verified RNA structures. Appsilon proposed an improvement in how the features are prepared for modeling.
Advanced Machine Learning Techniques: We implemented custom neural networks using the transformer architecture, capable of handling variable-length RNA sequences.
Rigorous Testing: We trained our models on data from two RNAs and tested them on a third, ensuring the robustness and generalizability of our approach.

‍Results and Achievements

‍

‍Our collaborative efforts yielded impressive results:

Improved Prediction Accuracy: Our models achieved AUROC scores of 65-68% on test sets and 68-70% on evaluation sets, significantly outperforming existing state-of-the-art methods (50-60% AUROC).

**Figure 1:** AUC curves showing the performance of molecular docking and our model.

Efficient Screening: We achieved EF10 scores of around 3, compared to state-of-the-art results of 1-1.7 on the same dataset. This means our model can identify about 30% of active binders in the top 10% of ranked compounds.

**Figure 2:** Percentage of binders ranked in the top 10% of scores by each method, reflecting EF10 scores.

Rapid Development: These results were achieved in just six weeks, demonstrating our team's efficiency in developing and deploying custom machine learning solutions.

‍Implications for Drug DiscoveryThe success of this project can lead to far-reaching implications for the pharmaceutical industry:

Accelerated Drug Discovery: By accurately predicting RNA-ligand interactions, our model can reduce the time and resources required for initial drug screening.
Cost Reduction: The ability to focus on a smaller, more promising subset of compounds for experimental testing can lead to cost savings in the drug development process.
Expanded RNA-Targeted Drug Development: This advancement could open up new possibilities for developing drugs that target RNA, potentially leading to treatments for previously undruggable diseases.
Improved Understanding of RNA Biology: The insights gained from this model could contribute to a deeper understanding of RNA function and regulation in biological systems.

TestimonialEffective prediction of binding of small molecule ligands to RNA, is the ultimate challenge of rational drug discovery. The Machine Learning-based methods developed with Appsilon are taking us closer to that goal.

- Filip Stefaniak, PhD, International Institute of Molecular and Cell Biology

‍

‍References‍

Szulc NA, Mackiewicz Z, Bujnicki JM, Stefaniak F (2022) fingeRNAt—A novel tool for high-throughput analysis of nucleic acid-ligand interactions. PLoS Comput Biol 18(6): e1009783. https://doi.org/10.1371/journal.pcbi.1009783

‍