Using AI to Identify Wildlife in Camera Trap Images from the Serengeti
<h3 style="text-align: center;">By <a href="https://www.linkedin.com/in/marrogala/">Marek Rogala</a> and <a href="https://www.linkedin.com/in/swiezew/">Jędrzej Świeżewski, PhD</a></h3>
<h2>An opportunity for biodiversity research...</h2>
Camera trap imaging (automatic photography of animal species in the wild) is becoming the gold standard in biodiversity conservation efforts. It allows large swaths of land to be monitored accurately and at an unprecedented scale. However, the sheer volume of data these devices generate makes it very difficult for humans to analyse. Recent developments in machine learning and computer vision give us the tools to resolve this issue and let the biodiversity community tap the potential of the imagery generated automatically by these heat- and motion-triggered systems.
<img class="wp-image-3499 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b02297fbced7e6143b1f40_cheetah.webp" alt="a cheetah approaches the camera trap" width="512" height="384" /> Camera traps, arguably less obstructive than human surveyors, still attract attention
<h2>...and a challenge for the Machine Learning community</h2>
We recently took part in <b>Hakuna-ma Data</b>, a competition organised by <a href="https://www.drivendata.org/competitions/59/camera-trap-serengeti/page/145/">DrivenData</a> in partnership with <a href="https://www.microsoft.com/en-us/ai/ai-for-earth">Microsoft’s AI for Earth</a>, which asked participants to build an algorithm for wildlife detection that would generalise well across time and locations. This competition differed from previous iterations in that researchers, data scientists and developers did not have direct access to the test image set. Instead, they were asked to submit their models, which the organisers then executed in Microsoft Azure. 811 participants from around the world trained their models on the publicly available data from 10 seasons and finally submitted their solutions to be tested on the private data set of season 11.
We congratulate all our fellow competitors and the <i>ValAn_picekl</i> team for taking the grand prize of $12,000. We are also proud to announce that we took 5th place on the final leaderboard, and we would like to share how we managed to achieve this.
You can play with our final model in a ready-to-run Google Colab notebook <a href="https://colab.research.google.com/github/Appsilon/serengeti_try_it_yourself/blob/master/classify_images_on_colab.ipynb">here</a>.
<h2>Our approach</h2>
The competition involved processing large volumes of images, so a fast neural network framework with strong GPUs in the backend was a must. We provisioned virtual machines on Google Cloud Platform with sufficiently powerful GPUs (ranging from Tesla K80s to Tesla V100s for the heaviest lifting), and typically ran a few of them in parallel to speed up experimentation. We worked in Python, keeping the crucial parts of the code in a GitHub repository and using notebooks for fast experimentation and visual inspection of model performance. We built the models in Fast.ai (with a PyTorch backend) and used its integration with <a href="https://www.wandb.com/">Weights & Biases</a> to keep track of our experiments.
Based on previous <a href="https://www.pnas.org/content/115/25/E5716">studies</a> of the dataset, we decided to work with ResNet-50, an architecture of rather moderate depth. Since we joined the competition very close to the end, we opted not to experiment with other (or simply deeper) architectures, and after seeing promising initial results from this one, we decided to focus on improving them.
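To make the setup concrete, here is a minimal sketch of how such a learner could be wired up with a recent fastai API and the Weights & Biases callback; the file names, column layout and project name are illustrative, not our actual training code.

```python
import pandas as pd
import wandb
from fastai.vision.all import ImageDataLoaders, Resize, cnn_learner, resnet50, accuracy_multi
from fastai.callback.wandb import WandbCallback

wandb.init(project="serengeti-camera-traps")     # hypothetical project name

# Hypothetical metadata: first column holds the image file name, second column the
# labels as space-separated species (e.g. "wildebeest zebra"), plus a validation flag.
df = pd.read_csv("train_labels.csv")

dls = ImageDataLoaders.from_df(
    df, path="data/images",
    label_delim=" ", valid_col="is_valid",
    item_tfms=Resize(128), bs=64,
)

# ResNet-50 pretrained on ImageNet; metrics and losses are logged to Weights & Biases.
learn = cnn_learner(dls, resnet50, metrics=accuracy_multi, cbs=WandbCallback())
```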
Given the lack of time, we had to optimise our strategy. We used an agile approach: we quickly created a baseline “MVP” (Minimum Viable Product) solution and then iterated on it extensively based on its results. Initially, submissions were allowed only every five days, which gave us a natural sprint-like rhythm.
The dataset was challenging in itself: 6.6 million photos at relatively high resolution, amounting to several terabytes of data. We reused a version of the dataset, provided by one of the participants, in which the photos had been scaled down significantly (by a factor of 4 on each side).
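For a sense of what that downscaling amounts to, here is a rough Pillow sketch; the directories and file pattern are illustrative, and the shared dataset was prepared by a fellow participant, not by this script.

```python
from pathlib import Path
from PIL import Image

SRC, DST = Path("full_res"), Path("scaled")          # illustrative directories

for src in SRC.rglob("*.JPG"):
    img = Image.open(src)
    # Scale each side down by a factor of 4, e.g. 2048x1536 -> 512x384.
    small = img.resize((img.width // 4, img.height // 4), Image.LANCZOS)
    out = DST / src.relative_to(SRC)
    out.parent.mkdir(parents=True, exist_ok=True)
    small.save(out, quality=85)
```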
<img class="aligncenter size-full wp-image-3498" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022999f2634632a51d89d_table.webp" alt="class of image and percentage of occurrence " width="512" height="173" />
The dataset was very imbalanced: around 75% of the images were empty. The most common animals were wildebeest, zebra and Thomson’s gazelle. At the other end of the spectrum were very rare species such as steenbok and bats, visible in only a handful of photos.
<img class="size-full wp-image-3500" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0229a6cd8f7584c836a02_rhino.webp" alt="the white rhinoceros " width="512" height="384" /> As rare as in nature – rhinos were one of the least common species in the dataset
<h2>Top 5 things that worked</h2>
We believe the following 5 key factors contributed the most to our models’ success:
<h3><b>1. Large validation set</b></h3>
This was crucial to ensure that the model generalised well, because we were assessed on the last season – a private, unseen dataset. Each season comes from a different year. We decided to hold out an entire season (season 8) for validation: we wanted it to contain many images (almost a million!) and yet be relatively recent, since the test set consisted of the as-yet-unpublished season 11. This choice was also supported by a study of the distributions of individual species across seasons: we noticed that only the last three seasons (8, 9 and 10) contained photos of a few of the relatively rare species.
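Mechanically, the split simply holds out every image whose metadata points to season 8; a minimal pandas sketch, with a hypothetical metadata file and column name:

```python
import pandas as pd

meta = pd.read_csv("train_metadata.csv")   # hypothetical file with a per-image 'season' column

# Hold out all of season 8 for validation (the exact season encoding may differ, e.g. "SER_S8").
valid_mask = meta["season"] == "SER_S8"
train_df, valid_df = meta[~valid_mask], meta[valid_mask]
print(f"train: {len(train_df):,} images, valid: {len(valid_df):,} images")
```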
<h3><b>2. Training with growing resolution</b></h3>
We divided the training process into three stages. Each stage had a part in which we trained only the final layers of the network, followed by training of all the layers. At each stage we trained the model on images of progressively higher resolution, which let us train longer without overfitting. The maximum resolution we used was 512x384, still a quarter of the original images in each dimension.
The final model was trained in the following stages:
<ol><li>5 epochs on network’s final layers on 128x96 px images</li><li>5 epochs on all layers on 128x96 px images</li><li>5 epochs on network’s final layers on 256x192 px images</li><li>5 epochs on all layers on 256x192 px images</li><li>5 epochs on network’s final layers on 512x384 px images</li><li>5 epochs on all layers on 512x384 px images</li></ol>
This approach significantly sped up our training: even a model operating on 128x96 images achieved good accuracy, and we could train it much faster than the 512x384 model. At this point we would like to give kudos to Pavel Pleskov, a fellow competitor who, in the spirit of healthy rivalry, shared a scaled-down version of the dataset with the community, saving us download time and allowing many more participants to join.
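In fastai terms, the whole schedule compresses to a loop like the one below; `make_dataloaders` is a hypothetical helper that rebuilds the dataloaders at a given image size (fastai sizes are height x width), and the learning rates are illustrative:

```python
# Progressive resizing: repeat the freeze/unfreeze cycle while growing the image size.
for size in [(96, 128), (192, 256), (384, 512)]:
    learn.dls = make_dataloaders(size)   # hypothetical helper: DataLoaders at this resolution
    learn.freeze()                       # stage part 1: train only the final layers
    learn.fit_one_cycle(5)
    learn.unfreeze()                     # stage part 2: fine-tune all layers
    learn.fit_one_cycle(5, lr_max=slice(1e-5, 1e-3))
```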
<h3><b>3. Data augmentation / one cycle fitting</b></h3>
We used standard image augmentations during training (horizontal flips, small rotations, small zooms, small warps and tweaks of lighting) to keep the model from overfitting on a pixel level and help it generalize. We used <a href="https://arxiv.org/pdf/1803.09820.pdf">Leslie Smith’s one cycle policy</a> to speed up the training.
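These choices map directly onto fastai’s built-in helpers; a sketch with illustrative magnitudes (our exact values may have differed):

```python
from fastai.vision.all import aug_transforms

# Horizontal flips, small rotations/zooms/warps and lighting tweaks;
# vertical flips stay off, since camera-trap photos are never upside down.
# The result is passed to the DataLoaders via the batch_tfms argument.
batch_tfms = aug_transforms(
    do_flip=True, flip_vert=False,
    max_rotate=5.0, max_zoom=1.05, max_lighting=0.2, max_warp=0.1,
)

# fit_one_cycle implements Leslie Smith's one-cycle learning-rate policy.
learn.fit_one_cycle(5, lr_max=1e-3)
```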
<h3><b>4. Aggressive undersampling</b></h3>
We eliminated large portions of the most frequent classes (e.g., 95% of empty photos) from our training set to put more emphasis during training on examples from the less frequent classes. This let us train significantly faster and helped the model focus on the challenging part of the input space. A caveat here was to remove images in a smart way: keeping examples with more than one animal in the training set, as they are harder for models to get right.
<img class="wp-image-3501 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0229cdf1312180459a6a5_zebra.webp" alt="a zebra in the savannah. camera trap image" width="512" height="384" /> Multiple animals in one picture – a challenging case for an ML model
<h3><b>5. Inspection of losses / post-processing</b></h3>
What helped us push our accuracy towards the top of the leaderboard was working closely with the data and the results of our initial models. We wanted to see which images we got wrong and what happened in each range of predicted certainty. To minimise our loss and optimise our submission, we studied individual classes and inspected where we lost the most.
At this stage we chose to incorporate a simple inference mechanism based on all the images in a sequence, rather than on single images. Guided again by results on the validation set, we took the predictions from the first image of each sequence, unless the maximum prediction for a given species in a later photo of the sequence was higher and exceeded a certain threshold (this way, for example, birds from the “otherbird” class flying by in later frames could affect the prediction). Analogously for the “empty” category, we took the prediction from the first photo of the sequence unless the prediction for this class in a later photo was below a certain threshold (this way, if an animal appeared only later, the prediction for “empty” was toned down accordingly).
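To make the rule concrete, here is a sketch of that aggregation for a single sequence; the thresholds are placeholders (the actual values were tuned on the validation set):

```python
import numpy as np

SPECIES_THRESH = 0.7   # placeholder; tuned on the validation set
EMPTY_THRESH = 0.3     # placeholder; tuned on the validation set

def aggregate_sequence(preds, empty_idx):
    """preds: (n_images_in_sequence, n_classes) per-image probabilities.
    Start from the first image and let confident later frames override it."""
    out = preds[0].copy()
    if len(preds) == 1:
        return out
    later_max = preds[1:].max(axis=0)

    # A species clearly visible in a later frame boosts the whole sequence.
    boost = later_max > SPECIES_THRESH
    out[boost] = np.maximum(out[boost], later_max[boost])

    # If any later frame is confident the scene is NOT empty, tone "empty" down.
    later_min_empty = preds[1:, empty_idx].min()
    if later_min_empty < EMPTY_THRESH:
        out[empty_idx] = min(out[empty_idx], later_min_empty)
    return out
```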
We next processed the ranges of predicted values for each species with a high occurrence frequency and grid-searched for basic linear transformations minimising our loss on the validation set. This helped the score a lot, most likely by making the model’s returned probabilities better reflect the species distribution in the dataset.
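Schematically, for each frequent class this meant searching over a simple a·p + b rescaling of the predicted probabilities and keeping the parameters with the lowest log loss on season 8; the grid below is illustrative, and scikit-learn is used here just for the loss:

```python
import numpy as np
from sklearn.metrics import log_loss

def best_linear_transform(p, y):
    """Grid-search a*p + b (clipped to [0, 1]) minimising log loss for one class.
    p: predicted probabilities on the validation set, y: 0/1 ground truth."""
    best = (1.0, 0.0, log_loss(y, np.clip(p, 1e-7, 1 - 1e-7), labels=[0, 1]))
    for a in np.linspace(0.5, 1.5, 21):
        for b in np.linspace(-0.05, 0.05, 21):
            q = np.clip(a * p + b, 1e-7, 1 - 1e-7)
            loss = log_loss(y, q, labels=[0, 1])
            if loss < best[2]:
                best = (a, b, loss)
    return best   # (a, b, loss)
```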
<h2>In retrospect – what we would like to have done</h2>
There were a few more ideas we were eager to try but did not, because we joined the competition only 3 weeks before its conclusion. Most notably:
<h3><b>Focal loss</b></h3>
The dataset at hand was highly imbalanced. An alternative to the aggressive undersampling we implemented (see above) would have been to use a modified loss function, namely the focal loss. While <a href="https://arxiv.org/pdf/1708.02002.pdf">first proposed</a> in the context of object detection, it has recently been used in imbalanced classification problems as well.
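For reference, a standard binary (multi-label) focal loss in PyTorch, following the formulation of Lin et al.; the gamma and alpha values below are the commonly used defaults, not something we tuned:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Sigmoid focal loss for multi-label classification (Lin et al., 2017).
    Down-weights easy examples so training focuses on hard, rare ones."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```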
<h3><b>More training time, more epochs</b></h3>
Various signs indicated that our models could simply have been trained longer to achieve even better scores. Wary of overfitting, we approached the training schedules conservatively; however, given more time, we would have liked to explore longer training at the various stages (see above).
<h3><b>Use a 2-stage approach: animal location detection and animal classification</b></h3>
We considered using a two-stage approach, with the first model focusing on animal presence regardless of species and the second model focusing on species classification. Such an approach could have boosted accuracy and was suggested by previous research. Due to limited time, we chose not to try it.
After the competition we learned about great work in this area by Siyu Yang and Dan Morris from Microsoft AI for Earth. They built <a href="https://medium.com/microsoftazure/accelerating-biodiversity-surveys-with-azure-machine-learning-9be53f41e674">a model for animal detection in pictures</a>. It identifies the presence of animals in photos, providing bounding boxes for positive predictions. The strength of the model comes from it being trained on data from multiple geographical locations.
They suggest a very interesting approach: use a generic model for detecting animal locations, and a second, project- and location-specific model for species classification. One clear advantage is that the second model can focus precisely on the animal, which has the potential to give better accuracy. This method can also help a lot when it comes to recognising multiple animals in a picture.
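In pseudo-code terms, such a two-stage pipeline might look as follows; `run_detector` and `classify_crop` are hypothetical placeholders standing in for a generic animal detector and a project-specific species classifier:

```python
def classify_image(image, run_detector, classify_crop, det_threshold=0.8):
    """Two-stage inference: first find animals, then classify the cropped boxes."""
    detections = run_detector(image)                 # assumed: list of (box, confidence) pairs
    boxes = [box for box, conf in detections if conf >= det_threshold]
    if not boxes:
        return {"empty": 1.0}                        # no confident detection -> empty frame
    scores = {}
    for box in boxes:
        crop = image.crop(box)                       # PIL-style crop of the bounding box
        for species, p in classify_crop(crop).items():
            # Keep the highest score per species across all detected animals.
            scores[species] = max(scores.get(species, 0.0), p)
    return scores
```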
<h3><b>Data cleaning</b></h3>
Investigating our models’ performance, we noticed many examples of mislabelled images (e.g., a single image of a sequence showing only the leg of a zebra was labelled as empty but classified as a zebra by our model with high certainty). There were also many images taken at night in which human labellers spotted no animals, while the models were very confident about seeing them. After we manually tweaked the colour levels of such images, the darkness indeed revealed the detected animals. This touches on an important question: should we build models that replicate the labellers’ work (an approach promoted by the rules of the competition), or should we focus on models that are as good as possible at detecting animals in the captured images?
<h2>Why we joined and what it meant for us</h2>
The competition gave us a deep dive into the world of wildlife image classification. We are very satisfied with the result and had a lot of fun whilst working on these models. We were able to demonstrate and polish our computer vision skills, which will be critical to our <a href="https://appsilon.com/ai-for-good/">AI for Good</a> initiative. We are currently in the process of building a pilot version of a wildlife image classifier for the National Parks Agency of Gabon and we will apply our new knowledge to this end. Stay tuned for more information about our upcoming projects. We are determined to show that AI can have a tangible, positive impact on life and the world we all share.
<h2><b>Follow Appsilon Data Science on Social Media</b></h2><ul><li>Follow<a href="https://twitter.com/appsilon"> @Appsilon</a> on Twitter</li><li>Follow Appsilon on<a href="https://www.linkedin.com/company/appsilon"> LinkedIn</a></li><li>Sign up for our company<a href="https://appsilon.com/blog/"> newsletter</a></li><li>Try out our R Shiny<a href="https://appsilon.com/opensource/"> open source</a> packages</li><li>Sign up for the AI for Good<a href="https://appsilon.com/ai-for-good/"> newsletter</a></li><li>We are hiring <a href="https://appsilon.com/careers/">software engineers</a></li></ul>