Accelerating Drug Discovery: Machine Learning for Protein Crystal Detection

November 16, 2023

Understanding the 3D structure of biological molecules like proteins is a crucial component of <a href="https://appsilon.com/data-science-in-pharma/" target="_blank" rel="noopener">drug discovery</a>. This understanding hinges on a process known as <strong>protein crystallization.</strong> However, identifying these tiny crystals amongst a myriad of similar-looking molecules can be a time-consuming and resource-draining process.


In this article, we take a deep dive into this process, its relevance, and existing practices, <strong>and show how an efficient AI-powered computer vision model can help reduce the cost of developing medicines</strong>.

The proposed model establishes a new state of the art in protein crystal identification and overcomes a well-known limitation of earlier approaches: their poor generalization to new data.
<h3>TL;DR:</h3>

We’re presenting a novel machine learning model that significantly accelerates the important process of protein crystal detection in drug discovery.
Our approach, leveraging pre-trained models and a focused binary classification, outperforms the state-of-the-art MARCO model by reducing the likelihood of missing crystals from 11.1% to 7.6% on MARCO’s test set. This represents a decrease of over 30% in the original error rate.
Moreover, we are able to achieve superior performance on new data using as few as 60 images of crystals.
Important Metrics:
- Crystal Identification Accuracy:
  - Old Method (MARCO): 88.9% recall and 93.4% precision on crystals. [*]
  - New Method: 90.8% recall and 93.9% precision on crystals (4-class model), and a recall of 92.4% with a precision of 93.4% (binary classification model).
- Overall Accuracy:
  - Old Method: 93.5% overall accuracy for 4-class classification, 97.7% for detecting crystals.
  - New Method: 94.0% overall accuracy (4-class model) and 98.1% (binary classification model).
- Fine-tuning to new data:
  - Old Method: on new “VIS” and “UV” datasets, MARCO achieves crystal detection accuracy of 91.1% and 75.7%, respectively.
  - New Method: fine-tuned using as few as 60 images of crystals, our model obtains accuracies of 92.9% and 81.1%, respectively; using 2000 images of crystals, the accuracy increases to 95.7% and 91.9%, respectively.
- Computational Effort:
  - Old Method: 260 training epochs across 50 GPUs.
  - New Method: 3 + 2 training epochs (4-class model + binary model) on a single GPU.
This breakthrough has the potential to provide substantial cost and time savings in drug development.

<h3>Table of Contents</h3>
<ul><li><a href="#introduction"><strong>Protein Crystallization</strong></a></li><li><strong><a href="#previous-frontier">The Previous Frontier</a></strong></li><li><strong><a href="#our-innovation">Our Innovation</a></strong></li><li><strong><a href="#benefits">Benefits of this New Approach</a></strong></li><li><strong><a href="#conclusion">Conclusion</a></strong></li></ul>
<h2 id="introduction">Protein Crystallization</h2>
Protein crystallization is crucial in understanding the 3D structure of macromolecules such as proteins or antibodies. This step is essential in drug discovery and development. Drug discovery requires the identification of <a href="https://www.sciencedirect.com/science/article/abs/pii/B9780124172050000225#:~:text=Protein%20Crystallography%20and%20Drug%20Discovery,the%20public%20domain%20at" target="_blank" rel="noopener noreferrer">hot-spots that could be druggable</a>, as well as performing structure activity relationship (SAR) analysis to improve the potency of hit/lead small molecules.

Protein crystallization is a bottleneck in drug discovery. It’s time-consuming and requires multiple lengthy attempts under various conditions, so reliably spotting the crystals that do form is crucial. <a href="https://www.ddw-online.com/the-cost-and-value-of-three-dimensional-protein-structure-1114-200308/" target="_blank" rel="noopener noreferrer">Reports indicate that leveraging structural knowledge in drug discovery</a> can <strong>halve the costs and time</strong> required from targeting a molecule to filing for clinical trials, which represents up to 10 million USD. Better crystal detection therefore raises the chances of finding crystals that can be sent to synchrotrons to obtain the electron density maps that lead to a 3D protein structure.
<blockquote>“Every time you miss a protein crystal, because they are so rare, you risk missing on an important biomedical discovery.”- Patrick Charbonneau, Duke University Dept. of Chemistry and Lead Researcher, MARCO initiative.</blockquote>
<h2 id="previous-frontier">The Previous Frontier: MARCO</h2>
A state-of-the-art model already exists, <a href="https://marco.ccr.buffalo.edu/#:~:text=,experts%20beyond%20the%20crystallography%20community" target="_blank" rel="noopener noreferrer">Machine Recognition of Crystallization Outcomes</a> (MARCO). This model achieves 89% recall on crystals [*], meaning that on the test subset of MARCO, it correctly identifies 89% of the crystals, so 11% go unnoticed.
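To make the recall and precision figures concrete, here is a minimal Python sketch. The counts are hypothetical, chosen only so that the ratios match the 88.9% recall and 93.4% precision we measured for MARCO; they are not the actual test-set counts.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Share of real crystals that the model actually flags."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos: int, false_pos: int) -> float:
    """Share of images flagged as crystals that really are crystals."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts for illustration only (not MARCO's real numbers).
true_pos = 889   # crystals correctly flagged
false_neg = 111  # crystals the model missed
false_pos = 63   # non-crystals wrongly flagged as crystals

crystal_recall = recall(true_pos, false_neg)
crystal_precision = precision(true_pos, false_pos)
miss_rate = 1.0 - crystal_recall  # the crystals that "go unnoticed"

print(f"recall={crystal_recall:.3f} precision={crystal_precision:.3f} "
      f"missed={miss_rate:.3f}")
```

Recall is the metric to watch here: every point of recall lost is a crystal that never reaches the synchrotron.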

<a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0198883" target="_blank" rel="noopener noreferrer">The original research</a> trained a modified Inception-v3 convolutional neural network (CNN) on a 4-class problem. This model achieved up to 93.5% overall accuracy on the test set but saw a performance reduction on crystals, correctly <em>identifying only 89% of crystal images</em> in a holdout set.

This performance difference arises because crystals are a minority class in the dataset, and the authors did not report any attempt to balance the classes. However, the main goal of crystallization screening is to identify crystals, so minimizing the number of misclassified crystals is of utmost importance.

Furthermore, the original model was trained across 50 GPUs for 260 epochs, <strong>a total GPU time of almost 40 days!</strong> This is a significant amount of computational time and expense required to train a classification model.

Finally, the accuracy of MARCO decreases when it is evaluated on data from new labs. Earlier this year, <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0283124" target="_blank" rel="noopener noreferrer">researchers prepared two datasets</a> consisting of crystallization images obtained with different equipment from that used to generate the MARCO dataset. The pictures in the datasets were obtained using visible and ultraviolet light, so we call those datasets “VIS” and “UV”, respectively. The 4-class classification accuracy of MARCO on those sets drops to 82% and 49%, respectively.
<h2>Watch Our Presentation</h2>
Delivered at <strong>Harvard </strong>for the<strong> Center for Computational Biomedicine (CCB)</strong> <strong>seminar series</strong>, our team presents our work in detail, emphasizing its impact on drug discovery, particularly in reducing drug development costs and timeframes. The presentation also includes a demonstration of <strong><a href="https://appsilon.com/shiny-molstar-r-package-molecular-structures-visualizations/" target="_blank" rel="noopener">shiny.molstar</a></strong>, a visualization toolkit for large-scale molecular data.

<iframe title="YouTube video player" src="https://www.youtube.com/embed/FfQPneaNlHc?si=0JKD-NjGuk9jhrCI" width="560" height="315" frameborder="0" allowfullscreen="allowfullscreen"></iframe>
<h2 id="our-innovation">Our Innovation</h2>
Our novel machine learning model outperforms MARCO not only by identifying crystals with higher recall (missing just 7.6% of crystals) while retaining the same precision, but also by doing so with a fraction of the computational effort required previously.

Where MARCO required 260 training epochs across 50 high-powered graphics processing units (GPUs), the equivalent of 40 GPU-days, our model achieved superior accuracy in just 3 epochs (only 19 GPU-hours, a <strong>reduction in the computational cost of 98%</strong>). Moreover, the model can be adapted to novel data sources with significantly less effort. This opens the possibility of automating the identification of crystals at scale!
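The 98% figure follows directly from the two GPU-time numbers. A quick check in Python, assuming a GPU-day means 24 GPU-hours and taking the rounded values of 40 GPU-days and 19 GPU-hours at face value:

```python
HOURS_PER_DAY = 24

marco_gpu_hours = 40 * HOURS_PER_DAY  # ~40 GPU-days reported for MARCO
ours_gpu_hours = 19                   # 3 epochs on a single GPU

# Fraction of the original compute that is no longer needed.
reduction = 1.0 - ours_gpu_hours / marco_gpu_hours
reduction_pct = round(reduction * 100)

print(f"{marco_gpu_hours} GPU-hours -> {ours_gpu_hours} GPU-hours "
      f"({reduction_pct}% less compute)")
```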
<h3>Our Strategy for Success</h3>
Our approach to improving crystal detection is based on a series of strategic steps that have enabled us to overcome the limitations of previous models, specifically the MARCO model. Here's a detailed insight into our methodology:
<h4>Our Improvements</h4>
Our approach improves upon the previous model by:
<ul><li><strong>Prioritizing crystal detection over overall accuracy:</strong> Instead of striving for high overall accuracy across all categories, as was the case with MARCO, we made a conscious decision to prioritize crystal detection. Our primary objective was to minimize errors in this crucial area, especially considering that an 11% misclassification rate of crystals in drug discovery can lead to significant costs and setbacks.</li><li><strong>Significantly reducing computational time and costs:</strong> We utilized pre-trained models and <a href="https://appsilon.com/transfer-learning-introduction/" target="_blank" rel="noopener">transfer learning</a> to ensure high accuracy while drastically reducing the computational time and costs.</li></ul>
<h4>Dataset</h4>
The <a href="https://marco.ccr.buffalo.edu/images" target="_blank" rel="noopener noreferrer">MARCO dataset</a> originates from a comprehensive collection of images from protein
crystallization trials from 5 organizations. These images represent various crystallization
outcomes and are classified into 4 categories:
<ol><li><strong>Crystals:</strong> Images showing successful crystallization of proteins.</li><li><strong>Clear:</strong> Images where the solution remains transparent, indicating an absence of crystals.</li><li><strong>Precipitate:</strong> Images displaying a cloudy or amorphous deposit, meaning the protein has precipitated but not formed crystals.</li><li><strong>Other:</strong> A category for images that don't fit neatly into the aforementioned classes, possibly due to multiple phenomena occurring simultaneously or other unforeseen outcomes.</li></ol>
The <a href="https://zenodo.org/records/4635300" target="_blank" rel="noopener noreferrer">“VIS” and “UV” datasets</a> have been <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0283124" target="_blank" rel="noopener noreferrer">curated</a> by researchers from the Collaborative Crystallization Center to evaluate the performance of the MARCO model on new data. Images in those datasets come from a different crystallization kit than the ones used to capture the images in the MARCO dataset. They are split into the same 4 categories.
<h4>Method</h4>
Here’s our approach towards achieving this.
<ul><li><strong>Leveraging Pre-trained Models:</strong> We leveraged the power of large pre-trained computer vision models, specifically modern CNNs (<a href="https://appsilon.com/convolutional-neural-networks/" target="_blank" rel="noopener">Convolutional Neural Networks</a>) and Vision Transformers. By employing transfer learning, we were able to fine-tune these models on the original 4-class problem. This allowed us to directly compare our results with the MARCO model.</li></ul>
<p style="padding-left: 40px;">Using pre-trained models and transfer learning has two main advantages. Firstly, the baseline model has already been trained on a large and diverse image dataset, meaning it can be easily fine-tuned to produce a highly accurate model for a specific problem, in this case crystal detection. Secondly, since we only have to fine-tune a small fraction of the model parameters, the computational cost is drastically reduced compared to the state-of-the-art models.</p>

<ul><li><strong>Focused Binary Classification:</strong> Recognizing the need to reduce the high error rate in crystal detection, we shifted our strategy to binary classification, specifically targeting crystals. This involved converting the dataset labels to a binary problem: crystal or not crystal, and further fine-tuning our models trained on the original 4-class problem.</li></ul>
<p style="padding-left: 40px;">However, given that only 12.8% of images in the dataset contained crystals, the data was heavily imbalanced. To address this, we incorporated a weighted loss function during the training phase. This ensured that misclassified crystals were penalized more heavily compared to images without crystals. Since our models were already fine-tuned to the 4-class dataset, this further fine-tuning was efficient and rapid.</p>
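The two steps above, relabeling to a binary problem and weighting the loss, can be sketched in plain Python. This is an illustration rather than our training code: the inverse-frequency weight derived from the 12.8% crystal share is an assumption made for the example, and the function names are hypothetical.

```python
import math


def to_binary(label: str) -> int:
    """Collapse the 4 MARCO classes into crystal (1) vs. not crystal (0)."""
    return 1 if label == "Crystals" else 0


def weighted_bce(prob_crystal: float, target: int, pos_weight: float) -> float:
    """Binary cross-entropy where errors on crystals count pos_weight times."""
    eps = 1e-12  # keep log() away from zero
    p = min(max(prob_crystal, eps), 1.0 - eps)
    return -(pos_weight * target * math.log(p)
             + (1 - target) * math.log(1.0 - p))


# Assumption: weight crystals by the inverse class ratio (not-crystal / crystal).
crystal_fraction = 0.128
pos_weight = (1.0 - crystal_fraction) / crystal_fraction  # about 6.8

labels = ["Clear", "Crystals", "Other", "Precipitate"]
binary_labels = [to_binary(label) for label in labels]

# Two equally "wrong" predictions: a crystal scored 0.3, a non-crystal scored 0.7.
missed_crystal_loss = weighted_bce(0.3, target=1, pos_weight=pos_weight)
false_alarm_loss = weighted_bce(0.7, target=0, pos_weight=pos_weight)

print(binary_labels, round(pos_weight, 2),
      round(missed_crystal_loss, 2), round(false_alarm_loss, 2))
```

With this weighting, a missed crystal costs roughly seven times as much as an equally confident false alarm, which pushes the model toward higher recall on the minority class.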

<ul><li><strong>Low-rank Adaptation of Transformer-based Models:</strong> Experimental set-ups vary from one laboratory to another, and gathering hundreds of thousands of images to retrain a model so that it fits new data well is a costly endeavour. We used <a href="https://arxiv.org/abs/2106.09685" target="_blank" rel="noopener noreferrer">Low-Rank Adaptation (LoRA)</a> techniques with our Vision Transformer models to fine-tune them on even small samples of new data, namely the “VIS” and “UV” datasets. This technique has been successfully used for fine-tuning pre-trained Large Language Models due to its robustness to overfitting. LoRA ensures that the model does not forget too much of what it has learned before, while giving it the flexibility it needs to adapt to the new data or task.</li></ul>
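The core idea of LoRA fits in a few lines: the pre-trained weight matrix W stays frozen, and only a low-rank update B·A, scaled by alpha / rank, is trained. Below is a minimal pure-Python sketch of that idea. The layer size, rank, and alpha are hypothetical values picked for illustration and are not the configuration of our models.

```python
import random


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_weight(w, lora_a, lora_b, alpha):
    """Effective weight W + (alpha / rank) * B @ A; W itself is never modified."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical layer size and LoRA hyperparameters.
d_out, d_in, rank, alpha = 64, 64, 4, 4.0
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]          # frozen
lora_a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
lora_b = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero init

# B starts at zero, so before any training the adapted layer equals the original.
adapted = lora_weight(w, lora_a, lora_b, alpha)

full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(f"trainable parameters: {lora_params} instead of {full_params}")
```

Because B is initialized to zero, fine-tuning starts exactly from the pre-trained behaviour, and the small number of trainable parameters is what makes adaptation on a few dozen images feasible.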
<h4>Results</h4>
The new method achieved <strong>comparable accuracy</strong> to the old method but with a<strong> substantially reduced computational effort</strong>. More importantly, it was far <strong>more efficient at identifying crystals</strong>, which is the main goal for advancing drug discovery. Finally, it allows efficient fine-tuning to new data using <strong>very few samples</strong>, making it very cheap to adapt to crystallization data it has not seen before.
<h5>4-class Model</h5>
<ul><li>Our 4-class model achieved an overall accuracy of <strong>94.0%</strong>.</li><li>On crystals, our 4-class model achieves a recall of <strong>90.8%</strong> and a precision of <strong>93.9%</strong>.</li></ul>
<strong>Confusion Matrix:</strong>

Our confusion matrix shows strong classification accuracy across the four classes. The full confusion matrix is shown below:

<img class="size-full wp-image-21639" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b019912a228de4b21ab78d_image_2023-11-08_14-11-30.webp" alt="A confusion matrix heatmap for a four-category classification model: clear, crystals, other, and precipitate." width="569" height="432" /> Figure 1. Confusion matrix showing results

<strong>Receiver Operating Characteristic Curve</strong>

The Receiver Operating Characteristic (ROC) curve for our 4-class model is shown in comparison to the original MARCO model below:
<ul><li>Our model achieves an AUC of 0.993 compared to 0.986 for the MARCO model.</li></ul>
<img class="size-full wp-image-21835" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01993588d6f7ecdfca5d3_image_2023-11-16_10-35-27.webp" alt="Line graph showing ROC curves for crystal detection with three models: MARCO, a 4-class model, and a 2-class model, with AUC scores of 0.986, 0.993, and 0.994 respectively. Each line curves towards the top left corner, indicating low false positive rates and high true positive rates for all models." width="2048" height="2015" /> Figure 2. Receiver operating characteristic curve
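For readers less familiar with AUC: it equals the probability that a randomly chosen crystal image receives a higher score than a randomly chosen non-crystal image. A minimal sketch of that definition, using made-up scores rather than real model outputs:

```python
def roc_auc(scores, labels):
    """AUC as the share of (crystal, non-crystal) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic in the
    number of images, which is fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up crystal scores: three crystals (1) and three non-crystals (0).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

In this toy example one crystal (score 0.35) is ranked below one non-crystal (score 0.6), so 8 of the 9 pairs are ordered correctly.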
<h5>Binary Class Model</h5>
Further fine-tuning our 4-class model for binary classification significantly improves the performance on crystal detection:
<ul><li>Our binary model achieved an <strong>overall accuracy of 98.1%</strong> (MARCO’s crystal detection accuracy is 97.7%).</li><li>On crystals, our binary model achieves a <strong>recall of 92.4% and a precision of 93.4%</strong>!</li></ul>
<h5>Fine-tuning to new data</h5>
Fine-tuning even further to perform crystal detection on the “VIS” and “UV” datasets, our models outperformed MARCO using very little data:
<ul><li>On the “VIS” dataset, our model needed just 60 images of crystals (supplemented by 60 images from each other category) to achieve <strong>an accuracy of 92.9%</strong> on the held-out test set (MARCO’s crystal detection accuracy on this set is 90.7%), which increased to <strong>an accuracy of 95.7%</strong> when using 2000 images of crystals.</li></ul>
<img class="size-full wp-image-21837" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b019948cff4e061342123e_image_2023-11-16_10-36-37.webp" alt="Line chart showing a positive correlation between the size of the training dataset on a logarithmic scale and the accuracy of a crystal detection algorithm. Starting with 92.9% accuracy for 60 images, the chart shows a gradual increase to 95.7% accuracy for 2000 images, with intermediate values of 94.4% for 200 images and 95.0% for 1000 images." width="2048" height="1405" /> Figure 3. Accuracy of models fine-tuned to subsets of the “VIS” dataset. Grey bars represent measurement errors.
<ul><li>On the “UV” dataset, with just 60 images of crystals it achieved <strong>an accuracy of 81.1%</strong> (MARCO’s accuracy is 75.9% in this setting), which increased to <strong>an accuracy of 91.9%</strong> when using 2000 images of crystals.</li></ul>
<img class="size-full wp-image-21840" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b019969ca976faa7d46ed5_image_2023-11-16_10-36-41.webp" alt="A line graph demonstrating the increase in accuracy of a crystal detection algorithm as the number of training images increases. The graph uses a logarithmic scale for the number of images, showing an upward trend from 81.1% accuracy with 60 images to 91.9% with 2000 images, passing through intermediate accuracies at various dataset sizes" width="2048" height="1405" /> Figure 4. Accuracy of models fine-tuned to subsets of the “UV” dataset. Grey bars represent measurement errors.
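A small fine-tuning set like the one described above (60 crystal images plus 60 images from each other category) can be drawn with a few lines of Python. This is a sketch under assumptions: the file names are invented, and the sampling procedure shown here is a plain per-class random draw, not necessarily the one used for the experiments.

```python
import random
from collections import Counter, defaultdict


def sample_per_class(items, k, seed=0):
    """Draw k images per label from a list of (image_id, label) pairs."""
    by_label = defaultdict(list)
    for image_id, label in items:
        by_label[label].append(image_id)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for label in sorted(by_label):
        ids = by_label[label]
        if len(ids) < k:
            raise ValueError(f"only {len(ids)} images available for {label!r}")
        subset.extend((image_id, label) for image_id in rng.sample(ids, k))
    return subset


# Invented pool: 100 images for each of the 4 categories.
categories = ["Clear", "Crystals", "Other", "Precipitate"]
pool = [(f"{label.lower()}_{i:03d}.jpg", label)
        for label in categories for i in range(100)]

subset = sample_per_class(pool, k=60)
counts = Counter(label for _, label in subset)
print(len(subset), dict(counts))
```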
<blockquote>To delve into the intricacies of machine learning evaluation metrics, explore our comprehensive guide on the <a href="https://appsilon.com/machine-learning-evaluation-metrics-classification/" target="_blank" rel="noopener">Top 10 Machine Learning Evaluation Metrics for Classification</a>.</blockquote>
<h2 id="benefits">Benefits of this New Approach</h2><ul><li><strong>Enhanced Accuracy and Efficiency:</strong> This novel strategy outperforms the current state-of-the-art MARCO model on protein crystal classification (achieving a 92.4% recall rate compared to 88.9% with the same precision) while drastically reducing the computational load (requiring only 3 epochs of training versus 260 epochs).</li><li><strong>Streamlined Computational Resources:</strong> The model significantly reduces the computational resources and time required, facilitating the swift development of precise models applicable to a broader range of scenarios.</li><li><strong>Efficient Fine-tuning to New Data:</strong> For experimental set-ups not covered by the MARCO dataset, we need just 60 images of crystals to make the model fare better than the MARCO model, and with 2000 images of crystals, it achieves a crystal detection accuracy of 92-96%.</li><li><strong>Cost and Time Savings in Clinical Development: </strong>Incorporating structural knowledge has the <a href="https://www.ddw-online.com/the-cost-and-value-of-three-dimensional-protein-structure-1114-200308/" target="_blank" rel="noopener noreferrer">potential to cut both expenses</a> and duration required to advance a target molecule to clinical trials. Leveraging this method could provide substantial cost reductions.</li></ul>
<h2 id="conclusion">Conclusion</h2>
This new method is a big step forward as it identifies protein crystals more accurately and efficiently, which can accelerate the drug discovery process. Plus, it can be adjusted to work well with new, different data, making it a flexible tool for future research.

By converting the problem to binary classification and further fine-tuning our models, we observe an increase in the crystal detection rate of over 3 percentage points, <strong>reducing the rate of misclassified crystals by over 30%</strong> without increasing the number of false positives!
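The two headline numbers measure the same improvement in different ways: the gain in recall is an absolute difference in percentage points, while the 30% figure is relative to MARCO's original miss rate. A short check using the miss rates reported in this article:

```python
marco_miss_rate = 0.111  # MARCO misses 11.1% of crystals
ours_miss_rate = 0.076   # our binary model misses 7.6%

# Absolute improvement, in percentage points of recall.
absolute_gain_pp = (marco_miss_rate - ours_miss_rate) * 100

# Relative improvement: what share of MARCO's misses is eliminated.
relative_reduction = (marco_miss_rate - ours_miss_rate) / marco_miss_rate

print(f"+{absolute_gain_pp:.1f} percentage points of recall, "
      f"{relative_reduction:.1%} fewer missed crystals")
```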
<blockquote>At Appsilon, we’re committed to <a href="https://www.appsilon.bio/" target="_blank" rel="noopener">solving complex drug discovery problems with cutting-edge machine learning solutions</a>.</blockquote>
[*] The results for the MARCO model given in this article differ from the ones provided by the authors. We obtained those using the MARCO model available online at <a href="https://github.com/tensorflow/models/tree/master/research/marco" target="_blank" rel="noopener noreferrer">https://github.com/tensorflow/models/tree/master/research/marco</a> and the MARCO validation dataset available at <a href="https://marco.ccr.buffalo.edu/download" target="_blank" rel="noopener noreferrer">https://marco.ccr.buffalo.edu/download</a> .
<h2>We are Open to Collaboration</h2>
We are actively seeking collaborations to enhance our capabilities and provide tailored solutions.

If you have crystal images or datasets, we invite you to share them with us. We could apply our top-notch machine learning model and help you optimize crystal detection. <a href="https://www.appsilon.com/crystal-clear-vision?utm_source=community&utm_medium=website&utm_campaign=blog" target="_blank" rel="noopener">Let's collaborate</a> to unlock the full potential of your data and explore innovative solutions together.

This article was co-authored by ML Engineer, <a href="https://appsilon.com/author/andrew/" target="_blank" rel="noopener">Andrew Cusick</a> and Life Sciences Innovation Lead, <a href="https://appsilon.com/author/ismael/" target="_blank" rel="noopener">Ismael Rodriguez</a>.
