Machine Learning and Plankton: Copepod Prosome and Lipid Sac Segmentation

February 9, 2023

This project was done in collaboration with Dr. Emilia Trudnowska from the Department of Marine Ecology at the Institute of Oceanology Polish Academy of Sciences. Her work is primarily focused on Arctic zooplankton: its distribution, size spectra, and ecology. Appsilon's previous work with plankton data, monitoring marine Arctic ecosystems with machine learning, provided a strong platform to continue research in this area. TOC: <ul><li><a href="#background">Project Background</a></li><li><a href="#data">Data</a></li><li><a href="#model">Modeling</a></li><li><a href="#results">Results</a></li><li><a href="#summary">Summary</a></li></ul> <hr /> <h2 id="background">Project Background: Arctic Copepods and Machine Learning</h2> The fundamental idea behind the scientific project led by Dr. Trudnowska is to recognize the ecological plasticity of key Arctic copepods (<i>Calanus</i>) on the basis of a few key morphological traits derived from photographs. The crucial part of this study is to learn more about the relationship between the size and volume of two components of planktonic organisms: the prosome (the main body) and the lipid sac (energy reserves accumulated by copepods before winter). <strong><span style="font-size: 11px;"><a href="https://appsilon.com/yolo-counting-nests-antarctic-birds/" target="_blank" rel="noopener">Shags, drones, and YOLO</a> - The story of how Appsilon and the Polish Antarctic Station are assessing the Antarctic ecosystem.</span></strong> Typically, it takes a tremendous amount of time and human resources to manually compute these features across thousands of photos collected within the scope of several campaigns and research projects. A machine learning-based solution that can segment the prosomes and lipid sacs with sufficient accuracy could solve this problem by automating the area computation process. 
These areas could then be used to approximate the fitness of individuals, and thus their quality as a food source for higher trophic levels. Such a solution holds the potential to turn weeks of work into a matter of minutes! <img class="wp-image-17776" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01bf53c74b4d4f4d0e33f_arctic-copepod-lipid-sac-and-prosome-machine-learning.webp" alt="arctic copepod lipid sac and prosome machine learning" width="1200" height="462" /> Lipid sac (left image, in pink) and prosome (right image, in red) <blockquote> <p style="text-align: justify;">I find the results to be very promising. I know that in the last few years, several research groups started some attempts on automatisation of lipid sac analyses from such kinds of photos, but they were never as satisfactory as the model approach utilized by Appsilon. Such a tool will save a tremendous amount of time, not to mention the repeatability of automated methods, as each human (student) has slightly different eye perception and way of reasoning about the visible extent of specific elements. I am more than happy to apply this method for data analyses of the new article dealing with ecological plasticity of those important planktonic animals in order to understand their morphological response to variations in ocean hydrography (temperature) and food quality in the sea, which determines their quality as food for fish, seabirds and mammals in the region of Svalbard.</p> <p style="text-align: right;"><strong>- Dr. Emilia Trudnowska</strong>, Institute of Oceanology Polish Academy of Sciences</p> </blockquote> <h2 id="data">Data: Machine Learning Approach to Copepod Lipid Sac Analysis</h2> The dataset we used to validate the feasibility of the machine learning approach consisted of about 350 labeled, high-resolution, calibrated binocular images of <i>Calanus</i> with clearly visible prosomes and lipid sacs. 
<img class="alignnone wp-image-17786" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01bf75591d718748e78d8_visible-prosome-and-lipid-sac-for-identification.webp" alt="visible prosome and lipid sac for identification" width="1200" height="900" /> The prosomes and lipid sacs were labeled as n-edged polygons (the number of vertices, shown as white dots, is a good indicator of how tedious manual labeling is!). <img class="alignnone wp-image-17784" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01bf83bcf3cfb907908df_manual-labeling-tedium-before-machine-learning.webp" alt="manual labeling tedium before machine learning" width="1200" height="462" /> <h2 id="model">Modeling - Mask R-CNN</h2> Mask R-CNN, a popular instance segmentation model built around a region proposal network, was trained under the Detectron2 framework to serve as a benchmark for more modern and fine-tuned approaches. Since one class was a spatial subset of the other, a significant number of pixels were double-labeled, something Mask R-CNN is not built to deal with. Hence, two separate instances of the model, with identical architecture and hyperparameters, were trained (one for each class). <img class="wp-image-17796" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01bfa832c47d67310462a_rol-align.webp" alt="Batch MaskRCNN modeling" width="1200" height="556" /> Image from <a href="https://arxiv.org/abs/1703.06870" target="_blank" rel="nofollow noopener">He et al., 2017</a> Of the images, 150 were used for training, while the remaining 200 were reserved for validation and testing. Training each instance for 2000 iterations, with two workers and a batch size of two, took approximately 15 minutes on a K80 GPU (the same GPU available on a standard Colab runtime). 
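The one-model-per-class setup described above can be prepared by splitting the labels into two single-class datasets before training. The sketch below is only an illustration: it assumes COCO-style annotation dictionaries, which the project may or may not have used, and the function name `split_by_category`, the file name, and the category names are hypothetical.

```python
from copy import deepcopy


def split_by_category(coco: dict) -> dict:
    """Split a COCO-style annotation dict into one single-class dict per category.

    Assumption: `coco` has the usual "images", "annotations" and "categories" keys.
    """
    per_class = {}
    for category in coco["categories"]:
        # Keep only this category's polygons and relabel them as class 1,
        # so that each resulting dataset describes exactly one class.
        annotations = [
            dict(annotation, category_id=1)
            for annotation in coco["annotations"]
            if annotation["category_id"] == category["id"]
        ]
        per_class[category["name"]] = {
            "images": deepcopy(coco["images"]),
            "annotations": annotations,
            "categories": [{"id": 1, "name": category["name"]}],
        }
    return per_class


# Hypothetical example: one image with a prosome polygon and a lipid sac polygon inside it.
example = {
    "images": [{"id": 1, "file_name": "calanus_001.jpg"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "segmentation": [[0, 0, 10, 0, 10, 4, 0, 4]]},
        {"id": 2, "image_id": 1, "category_id": 2,
         "segmentation": [[2, 1, 8, 1, 8, 3, 2, 3]]},
    ],
    "categories": [{"id": 1, "name": "prosome"}, {"id": 2, "name": "lipid_sac"}],
}

datasets = split_by_category(example)
print(sorted(datasets))  # ['lipid_sac', 'prosome']
```

Each resulting dictionary can then be used to train its own model instance, so no pixel ever has to carry two labels within a single dataset.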
<h2 id="results">Results of Machine Learning to Identify Copepod Structures</h2> <h3>Qualitative Results</h3> For lipid sacs, 85% of the results met our visual quality criteria! <img class="wp-image-17780" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01bfab1e58f77d3b0d3bd_copepod-lipid-sac-identification-with-machine-learning.webp" alt="copepod lipid sac identification with machine learning" width="1200" height="800" /> Sample results for lipid sac segmentation For prosomes, a staggering 95% of the test results appeared to fit well! <img class="size-full wp-image-17782" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01bfc44295a0f33219e1c_copepod-prosome-identificaiton-with-machine-learning.webp" alt="copepod prosome identification with machine learning" width="2600" height="1734" /> Sample results for prosome segmentation <h3>Quantitative Results</h3> The results were further evaluated using two quantitative metrics: Intersection over Union (Jaccard Index) and F1-score (Dice coefficient). These metrics are typically used by practitioners of machine learning (and other statistical methods) to evaluate the performance of a given predictive solution. <h4>Intersection over Union</h4> This metric is computed by dividing the number of common pixels between the model’s prediction and the ground truth (intersection) by the total number of pixels occupied by either the model’s prediction or the ground truth (union). This is a rather conservative metric in that it gets harder and harder to improve the scores as the results (area of intersection) improve. For lipid sac segmentation, we were able to achieve a mean IoU of <b>86%</b>, whereas for prosome segmentation we were able to achieve a mean IoU of <b>94.4%</b>! Those are very good scores, since an IoU as low as 50% is typically considered good. 
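The definition above can be written as a short function. This is a minimal sketch, not the evaluation code used in the project: it assumes the prediction and the ground truth are same-sized binary masks given as nested lists of 0/1, and it uses plain Python so it runs without dependencies (with NumPy arrays, the same counts could be obtained from `np.logical_and` and `np.logical_or`). The toy masks are made up for illustration.

```python
def iou(pred, truth):
    """Intersection over Union of two same-sized binary masks (nested lists of 0/1)."""
    intersection = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0  # pixel present in both masks
            union += 1 if (p or t) else 0          # pixel present in at least one mask
    # Two empty masks are treated as a perfect match; this convention is a choice made here.
    return intersection / union if union else 1.0


truth = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]  # 8 ground-truth pixels
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]  # 6 predicted pixels, all inside the ground truth

score = iou(pred, truth)
print(score)  # 6 / 8 = 0.75
```

In this toy case the prediction under-covers the ground truth by a single column, and the IoU already drops to 0.75.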
As shown by the histograms below, more than half of the lipid sac results had an IoU above 90%, whereas almost all (95%!) of the prosome results did. Even in the worst case, the prosome results did not fall below 75%. <img class="alignnone wp-image-17792" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01bfd9224e846ad9c1359_IoU_lipid.webp" alt="Intersection over Union - Lipid sac" width="1200" height="286" /> <img class="alignnone wp-image-17794" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01bffd21aabe449a837ed_IoU_prosome.webp" alt="Intersection over Union - Prosome" width="1200" height="286" /> <h4>F1 Score</h4> This metric is computed by dividing two times the number of common pixels between the model’s prediction and the ground truth (intersection) by the sum of the number of pixels covered by the model's prediction and the number covered by the ground truth (which equals intersection + union). This metric indicates the model’s balanced ability to both capture correct pixels (recall) and be accurate with the pixels it does capture (precision), thus making it more useful for our intended application, i.e., area computation. For lipid sac segmentation, we were able to achieve a mean F1-score of <b>0.94</b>, whereas for prosome segmentation we were able to achieve a mean F1-score of <b>0.97</b>! Those are exceptional scores, as values above 0.8 are typically considered good and values above 0.9 very good. As shown by the histograms below, more than half of the results for lipid sacs were over 0.95! For prosomes, more than 92% of the results were over 0.95 and 65% of the results were in the 0.97-1.00 range! 
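The F1 computation described above, together with the precision and recall it balances, can be sketched as follows. As before, this is an illustrative sketch rather than the project's evaluation code: masks are assumed to be same-sized nested lists of 0/1, the function name `precision_recall_f1` is hypothetical, and the toy masks are made up.

```python
def precision_recall_f1(pred, truth):
    """Pixel-wise precision, recall and F1 (Dice) for two same-sized binary masks."""
    intersection = predicted = actual = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            predicted += 1 if p else 0               # pixels covered by the prediction
            actual += 1 if t else 0                  # pixels covered by the ground truth
            intersection += 1 if (p and t) else 0    # pixels covered by both
    # Empty denominators are mapped to 1.0; this convention is a choice made here.
    precision = intersection / predicted if predicted else 1.0
    recall = intersection / actual if actual else 1.0
    total = predicted + actual
    f1 = 2 * intersection / total if total else 1.0
    return precision, recall, f1


truth = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]  # 8 ground-truth pixels
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]  # 6 predicted pixels, all inside the ground truth

precision, recall, f1 = precision_recall_f1(pred, truth)
print(precision, recall, round(f1, 3))  # 1.0 0.75 0.857
```

Here every predicted pixel is correct (precision 1.0), but a quarter of the true pixels are missed (recall 0.75), and the F1 score of 12/14 sits between the two. For a single pair of masks, F1 and IoU are linked by F1 = 2 * IoU / (1 + IoU), so the same example has an IoU of 0.75.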
<img class="alignnone wp-image-17788" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01bffe7ac33b972b4615a_F1_lipid.webp" alt="F1 score - Lipid" width="1200" height="286" /> <img class="alignnone wp-image-17790" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01c01fa26a4961d78caa3_F1_prosome.webp" alt="F1 score - prosome" width="1200" height="286" /> <h3>Areas of Improvement for Modeling Copepod Lipid Sacs and Prosomes</h3> While most of the initial results look great, there were some cases of under-coverage, such as those shown in the images below. We’re currently working on improving those. <img class="alignnone wp-image-17778" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01c0260c4b5e759a7bb3b_arctic-copepod-lipid-sac-identification-with-machine-learning.webp" alt="arctic copepod lipid sac identification with machine learning" width="1200" height="800" /> <h2 id="summary">Copepods and Machine Learning - A Faster Approach to Structure Identification</h2> Here we reported on a standard deep learning solution, often considered a benchmark model for image segmentation tasks. We produced promising results in the segmentation and area computation of two components of the <i>Calanus</i> (Arctic plankton) body: prosomes and lipid sacs. <p style="text-align: left;"><strong><span style="font-size: 11px;"><a href="https://appsilon.com/monitoring-ecosystems-with-computer-vision/" target="_blank" rel="noopener">Computer Vision and Flowers</a> - A budding way to monitor shifting ecosystems.</span></strong></p> This will help reduce the time taken to evaluate the fitness of a <i>Calanus</i> population, and thus its quality as a food source for higher trophic levels, from weeks to minutes! <b>A paper utilizing the results is on its way!</b>
Stay tuned to our <a href="https://appsilon.com/ai-research/" target="_blank" rel="noopener">AI & Research page</a> for future updates. We’re working on fine-tuning the results, perhaps to the point of superseding conventional methods. We also look forward to the possible impact our work can have when used on large datasets collected by various research groups around the world (in particular in Canada and Norway)! <h3>Let's Talk Copepods!</h3> If you have a similar dataset or are interested in how our AI & Research team can assist in your project, <a href="https://appsilon.com/ai-research/#contact">let's talk</a>.
