Genetic Research with Computer Vision: A Case Study in Studying Seed Dormancy

Reading time:

time

min

Jędrzej Świeżewski, Ph.D.

December 29, 2020

<h2>Assisting genetic research with computer vision</h2> To improve the detection of genetic signatures in seeds correlated with their dormancy, we have trained computer vision models. These models have captured more than was previously understood about the mechanisms of dormancy. Genetic research with computer vision is opening up the field to new discoveries and potential for growth. <blockquote>New to computer vision? <a class="editor-rtfLink" href="https://wordpress.appsilon.com/image-classification-tutorial/" target="_blank" rel="noopener noreferrer">Read our guide on getting started with fastai, ResNet, MobileNet, and more</a>.</blockquote> Navigate to a section: <ul><li><a href="#introduction">Introduction</a></li><li><a href="#data">Data</a></li><li><a href="#benchmark">Benchmark</a></li><li><a href="#architecture">Architecture</a></li><li><a href="#results">Results</a></li><li><a href="#conclusion">Conclusion</a></li></ul> <h2 id="introduction">Introduction</h2> Seed dormancy is an actively studied biological phenomenon crucial in many areas of the economy (e.g., at early stages of the food supply chain) and ecology (e.g., providing insights for studying the effects of the climate crisis on plants). An active contributor to understanding the genetic mechanisms controlling seed dormancy is the <a class="editor-rtfLink" href="https://swiezewskilab.pl/" target="_blank" rel="noopener noreferrer">Świeżewski lab</a>, hosted at the <a class="editor-rtfLink" href="https://www.ibb.waw.pl/en" target="_blank" rel="noopener noreferrer">Institute of Biochemistry and Biophysics of the Polish Academy of Sciences</a>. The entire cycle of seed generation, dormancy, and germination is investigated by the researchers at the lab, allowing to broaden the understanding of critical triggers in the process and providing the possibility of finding scalable ways of monitoring it. Looking at seeds of thale cress (Arabidopsis thaliana), the researchers had noticed that a morphological parameter differentiates to some extent the seeds germinated and are still dormant after a key period of two weeks from the moment of observation. <img class="wp-image-6333 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0211fce6734eff8381a2c_test-1.webp" alt="Image 1 - Distributions of one of the morphological parameters of germinated and dormant seeds are shown, indicating that the parameter is more likely to be large for germinated seeds." width="1125" height="729" /> Image 1 - Distributions of one of the morphological parameters of germinated and dormant seeds are shown, indicating that the parameter is more likely to be large for germinated seeds. The morphological parameter of seeds was estimated based on pairs of images obtained from a machine handling their planting. The researchers contacted us for assistance in analyzing the data with computer vision. They hypothesized that more visually distinguishable features might be found by such models, allowing them to discern between seeds that in two weeks from the moment the photos were taken will be dormant or germinated. <h2 id="data">Genetic research data for computer vision models</h2> The data given to us consisted of several thousand pairs of images of seeds - each seed being roughly 0.2mm in diameter. The photos came from a machine used to automate their placement on trays where they are monitored. <img class="wp-image-6328 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b021217a16ac39e34c1210_2_1_75.webp" alt="Image 2 - Pair of photos of a single seed (suspended on a pneumatic needle)" width="1196" height="599" /> Image 2 - Pair of photos of a single seed (suspended on a pneumatic needle) The images were taken in a highly controlled environment, so it was clear that only a tiny fraction of the image contains the seed. Hence, we should identify the part of the image containing the seed to help the planned models focus on the seed's features. Additionally, cropping the photos to the seed reduces their size, speeding up the models' training. However, the needle was not always placed in the same location of the image, so we needed to perform adaptive cropping. We decided to analyze the total color intensities for each row and column of pixels in the image - analyzing the obtained signals and automatically identifying the peaks in the red channels. <img class="size-full wp-image-6314" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395865b5a6f8240aaa2bc_3-5.webp" alt="Image 3 - Signals with pixels' intensities summed per row for the above photos (analogous plots were analyzed for the columns). The peaks are marked with a cross. The brightening of the background at the bottom of the images translates to a steadily growing intensity in color in the above plots." width="1570" height="516" /> Image 3 - Signals with pixels' intensities summed per row for the above photos (analogous plots were analyzed for the columns). The peaks are marked with a cross. The brightening of the background at the bottom of the images translates to a steadily growing intensity in color in the above plots. With a fixed square window centered on the locations of the peaks, we were able to crop the images to single out the seeds. <img class="wp-image-6329 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b02123ebdf0e3658806b80_4_75.webp" alt="Image 4 - Photos of a seed with the seeds' location identified using the cropping based on the location of the peak of color intensity. Uncropped images used for genetic research with computer vision. " width="1194" height="642" /> Image 4 - Photos of a seed with the seeds' location identified using the cropping based on the location of the peak of color intensity <img class="size-full wp-image-6315" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b39587df484ee4e661a739_5-5.webp" alt="Image 5 - Cropped photos of a seed with actual resolution. Such images were later fed into the neural networks." width="656" height="330" /> Image 5 - Cropped photos of a seed with actual resolution. Such images were later fed into the neural networks <img class="wp-image-6316 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395e5f32c7f090d664242_6-5.webp" alt="Image 6 - More examples of cropped images (a single image per pair is shown). Images used for genetic research with computer vision. It can be seen that it is next to impossible for a human eye to tell the difference between the classes." width="1306" height="1338" /> Image 6 - More examples of cropped images (a single image per pair is shown). It can be seen that it is next to impossible for a human eye to tell the difference between the classes The resolution is not very high. Even though the original images consist of almost 400000 pixels each, the seeds typically take around 500 of them. This poses a challenge for any attempt to recognize the visual features of the seeds. In addition to cropping the images, the data had to be cleaned concerning the labels we received. The seeds were labeled with one of four categories: germinated, dormant, lost, and dead. The first two were the most exciting classes for our purposes. We dropped the additional classes for the current analysis, as they were products of improper application of seed handling. Interestingly, the data came from four different trays (each containing seeds from a distinct plant specimen), and each of them had a different distribution of the four initial classes. Many interesting observations can be made by visualizing the location of the seeds on the round trays. <img class="size-full wp-image-6335" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0212546af19f939a92d3a_trays-1.webp" alt="Image 7 - The class distributions varied significantly between the trays" width="2160" height="1942" /> Image 7 - The class distributions varied significantly between the trays We selected 10% of the data (stratified by the label) as a validation set. All results were evaluated on that set, and all the models trained only on the remainder training set. <h2 id="benchmark">Benchmark</h2> We knew the morphological parameter of seeds correlates with their dormancy in the above sense, so we first trained a simple extreme gradient boosted tree model to predict the label (whether the seed germinated or stayed dormant). The predictions are based only on the value of that morphological parameter of the seeds identified individually in the two photos of a given seed. The model reached 62% validation accuracy, which is not very impressive for a binary classifier but serves as a benchmark for further modeling. In particular, this benchmark quantifies the extent to which the seed dormancy can be explained with the mentioned morphological parameter alone. <h2 id="architecture">Architecture</h2> To accommodate the data being pairs of RGB photos, we decided to adapt known neural network architectures to accept six input channels, mixing the information from the two photos only in the later dense layers. We've started from training residual networks within the ResNet family as they are the most robust in our experience, offering promising results at a reasonable training pace. We settled on an adapted version of an EfficientNet version B3 which gave us a validation accuracy of 70%. <h2 id="results">Results</h2> Using computer vision analyses of the pairs of images, we were able to predict seed dormancy with accuracy significantly higher than a model based solely on the single morphological parameter known to correlate with the dormancy of the seeds. It's an exciting finding as it supports the hypothesis that the genetic mechanisms controlling dormancy are also reflected in more visually distinguishable features. To confirm the above result, we have analyzed how the model's accuracy differs between different sizes of the seeds. While there is a difference (seeds with the morphological feature more pronounced being classified more accurately), it is not significant. Moreover, the trays in which the seeds were grown were a stronger indicator of their dormancy predictability. Seeds from tray 2 were the hardest to predict, and those from tray 1 and 4 were the easiest (with accuracy reaching 75%). Interestingly, for trays 1 and 4, the morphological parameter distributions for both classes are not misaligned, and the balance between classes is complementary. <img class="wp-image-6334 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b021267e876a3f78c5bbc1_sizes_per_tray-1.webp" alt="Image 8 - Distributions of the morphological parameter for the labels separately and individual trays. Trays 2 and 3 exhibit the shift of germinated seeds towards larger values, while trays 1 and 4 only to a modest extent - the two distributions for those trays are better aligned." width="2400" height="2400" /> Image 8 - Distributions of the morphological parameter for the labels separately and individual trays. Trays 2 and 3 exhibit the shift of germinated seeds towards larger values, while trays 1 and 4 only to a modest extent - the two distributions for those trays are better aligned. <h2 id="conclusion">Conclusion: genetic research with computer vision - a gateway to growth</h2> In this project, we have prepared and modeled on a very interesting dataset - consisting of pairs of tiny seed photos. Our key contribution was the confirmation of the researchers' hypothesis - dormancy is controlled by genetic mechanisms that have visual signatures at a very early stage in the development of the seeds. Learn more about computer vision: <ul><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener noreferrer">YOLO Algorithm and YOLO Object Detection: An Introduction</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/pp-yolo-object-detection/" target="_blank" rel="noopener noreferrer">PP-YOLO Object Detection Algorithm: Why It's Faster than YOLOv4</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/convolutional-neural-networks/" target="_blank" rel="noopener noreferrer">Convolutional Neural Networks: An Introduction</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/transfer-learning-introduction/" target="_blank" rel="noopener noreferrer">Introduction to Transfer Learning: Effective Machine Learning Without Custom Architecture</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/fast-ai-in-r/" target="_blank" rel="noopener noreferrer">Fast.ai in R: How to Make a Computer Vision Model within an R Environment</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/satellite-image-analysis-with-fast-ai-for-disaster-recovery/" target="_blank" rel="noopener noreferrer">Satellite Image Analysis with fast.ai for Disaster Recovery</a></li></ul> <a href="https://appsilon.com/careers/" target="_blank" rel="noopener noreferrer"><img class="aligncenter wp-image-5940 size-large" src="https://wordpress.appsilon.com/wp-content/uploads/2020/11/appsilon.hiring0-1024x576.jpg" alt="Job application call to action " width="1024" height="576" /></a> Appsilon is hiring for remote roles! See our <a class="editor-rtfLink" href="https://wordpress.appsilon.com/careers/" target="_blank" rel="noopener noreferrer">Careers</a> page for all open positions, including <a class="editor-rtfLink" href="https://wordpress.appsilon.com/careers/#r-shiny-developer" target="_blank" rel="noopener noreferrer">R Shiny Developers</a>, <a class="editor-rtfLink" href="https://wordpress.appsilon.com/careers/#fullstack-software-engineer-tech-lead" target="_blank" rel="noopener noreferrer">Fullstack Engineers</a>, <a class="editor-rtfLink" href="https://wordpress.appsilon.com/careers/#frontend-engineer" target="_blank" rel="noopener noreferrer">Frontend Engineers</a>, a <a class="editor-rtfLink" href="https://wordpress.appsilon.com/careers/#senior-infrastructure-engineer" target="_blank" rel="noopener noreferrer">Senior Infrastructure Engineer</a>, and a <a class="editor-rtfLink" href="https://wordpress.appsilon.com/careers/#community-manager" target="_blank" rel="noopener noreferrer">Community Manager</a>. Join Appsilon and work on groundbreaking projects with the world's most influential Fortune 500 companies.<a href="https://appsilon.com/careers/" target="_blank" rel="noopener noreferrer"> </a>

Have questions or insights?

Engage with experts, share ideas and take your data journey to the next level!

Stop Struggling with Outdated Clinical Data Systems

Join pharma data leaders from Jazz Pharmaceuticals and Novo Nordisk in our live podcast episode as they share what really works when building modern, compliant Statistical Computing Environments (SCEs).

Save My Spot

Is Your Software GxP Compliant?

Download a checklist designed for clinical managers in data departments to make sure that software meets requirements for FDA and EMA submissions.

Get the Checklist

Ensure Your R and Python Code Meets FDA and EMA Standards

A comprehensive diagnosis of your R and Python software and computing environment compliance with actionable recommendations and areas for improvement.