Identifying Plants in the Wild
Winston Hackett, Keigo Hachisuka
Final Write Up/ Reflection
link
https://docs.google.com/document/d/1wjpuwdVAXcivkMUB7qHC4O-ytODnjIdajnqcBlWnrZw/edit?usp=sharing
Introduction
The goal of this project is to implement a model that can effectively predict a plant species given a picture of a plant. We referenced a paper
(https://bmcecolevol.biomedcentral.com/articles/10.1186/s12862-017-1014-z)
where the researchers used herbarium specimens to develop a model that predicts plant species. At the time the paper was published, training a model on herbarium specimens had not been done before. Many specimens are currently being digitized, so using them for plant identification is a relatively new opportunity. Identifying new specimens by hand is difficult and time consuming, so automating the process may help label specimens that have not yet been identified. The researchers also extended this work by using the model to identify plants in the wild.
We chose this paper because it not only explores identifying plants in a controlled setting, but also discusses how the model could be extended. It is said to be the first use of deep learning to identify plants, and we saw it as a good fit. Both of us took a first-year seminar about plants and drew inspiration from the class. With that inspiration, we came across this paper and felt that its research fit closely with our ideas.
Initially we planned on using a herbarium specimen dataset. However, our initial dataset was compromised by ransomware, so we had to switch datasets. The new dataset consists of plants in the wild as opposed to herbarium specimens. We expected this dataset to be more difficult to work with because the data is noisier. However, it aligns better with our future goals. Our original plan was to build a model on herbarium specimens first and then try that model on plants in real life. Given the difficulties with our original dataset, we decided to build a model that predicts plants in the wild from the start.
Methodology
As explained in the introduction, our dataset had to change from herbarium specimens to plants in the wild. We acquired our data from iNaturalist, a database of verified, research-grade images of plants taken in the wild. We chose 10 species to train on: Acer Macrophyllum, Arnica, Lewisia, Lupinus Latifolius, Salal, Salmonberry, Trillium, Vine maple, Western Pasqueflower, and Western Red Cedar. Although the dataset changed, our approach to the model remained fairly similar. To preprocess the data, we collected the images, turned them into a TensorFlow dataset, and split it into training and validation datasets. Each image is resized to 256x256. We then normalized the images and randomly flipped and rotated them to mitigate overfitting. Our labels are one-hot encoded (OHE), which was handled when creating the TensorFlow dataset.
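To make the one-hot encoding step concrete, here is a minimal sketch in plain Python. It is not the project's code, which relied on TensorFlow to do this when building the dataset. The helper name one_hot and the choice of class index order (the order the species are listed above) are assumptions made for illustration.

```python
# Species names come from the write-up; the index order and the helper
# name are assumptions made for this sketch.
SPECIES = [
    "Acer Macrophyllum", "Arnica", "Lewisia", "Lupinus Latifolius", "Salal",
    "Salmonberry", "Trillium", "Vine maple", "Western Pasqueflower",
    "Western Red Cedar",
]


def one_hot(species_name, classes=SPECIES):
    """Return a list with 1.0 at the index of species_name and 0.0 elsewhere."""
    index = classes.index(species_name)  # raises ValueError for unknown names
    return [1.0 if i == index else 0.0 for i in range(len(classes))]


print(one_hot("Salal"))
```

Each label becomes a vector of length 10 with a single 1.0, which is the target format that a softmax output and a categorical cross-entropy loss expect.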
Our model uses 6 Conv2D layers, all with kernel regularizers to reduce overfitting. We also use max pooling, batch normalization, flatten, and dropout layers. The dropout layer played a significant part in keeping the model from fitting noise in the data, and thus reduced overfitting. We used two dense layers with a leaky ReLU activation, and our final layer is a softmax layer.
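The layer types described above could be arranged roughly as in the following Keras sketch. Only the layer types, the count of six convolutional layers, the 256x256 input, and the 10 output classes come from the write-up. The function name build_plant_model, the filter counts, kernel size, L2 strength, dropout rate, dense layer widths, and the ordering of layers within each block are all assumptions, so this should not be read as the actual Plant_Model.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers


def build_plant_model(num_classes=10, input_shape=(256, 256, 3)):
    """Build a small CNN with six regularized Conv2D layers (illustrative)."""
    # All hyperparameters below are assumed values, not the project's.
    l2 = regularizers.l2(1e-4)
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=input_shape))
    for filters in (16, 32, 64, 64, 128, 128):  # six Conv2D layers
        model.add(layers.Conv2D(filters, 3, padding="same",
                                activation="relu", kernel_regularizer=l2))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D())  # halves height and width
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.5))
    # Two dense layers, each followed by a leaky ReLU activation.
    model.add(layers.Dense(128))
    model.add(layers.LeakyReLU())
    model.add(layers.Dense(64))
    model.add(layers.LeakyReLU())
    # Final softmax layer: one probability per species.
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model


model = build_plant_model()
model.summary()
```

With six pooling steps, a 256x256 input shrinks to 4x4 before the flatten layer, which keeps the dense layers small.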
To train our model, we used an Adam optimizer with a learning rate of 0.0001. After trying different learning rates, we felt that this value performed the best. Our loss function is categorical cross-entropy, and we ran 75 epochs with a batch size of 3.
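As a rough illustration of the loss named above, the sketch below computes categorical cross-entropy for a single example in plain Python. It is not the project's training code. The function name and the small epsilon clamp are assumptions made for the example.

```python
import math


def categorical_cross_entropy(one_hot_label, predicted_probs, eps=1e-7):
    """Cross-entropy for one example: -sum(y_i * log(p_i))."""
    # eps (an assumed value) guards against log(0) for zero probabilities.
    return -sum(
        y * math.log(max(p, eps))
        for y, p in zip(one_hot_label, predicted_probs)
    )


# A one-hot label for class index 4 out of 10 classes.
label = [0.0] * 10
label[4] = 1.0

# A model that guesses uniformly assigns probability 0.1 to every class.
uniform_guess = [0.1] * 10
print(categorical_cross_entropy(label, uniform_guess))  # about 2.303, i.e. ln(10)
```

A uniform guess over ten classes gives a loss of ln(10), which is a handy reference point when reading loss values for a 10-class problem.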
Results
After working through and implementing different models, our best model is the one described above (named Plant_Model in our code). This model reached a training accuracy of 73.21%, a training loss of 1.5475, a validation accuracy of 71.25%, and a validation loss of 1.9150. The validation accuracy fluctuates more than the training accuracy, which we believe is due to overfitting. We also attempted transfer learning, but it did not yield the results we wanted, likely because we lacked a suitable base model.
Challenges
We ran into many challenges in this project. Our first challenge was the loss of our original dataset, which forced us to find and use another one. Because the new dataset consisted of plants in the wild and not herbarium specimens, we knew that the data would be more inconsistent. Due to the nature of the dataset, preprocessing was also a challenge. The other major issue we ran into with the new dataset was overfitting. Several factors contributed to this, the two biggest being a small dataset and too much noise in our data. We did our best to acquire as much data as possible; however, this was a challenge because we were using a different dataset than originally anticipated. Because these images were not herbarium specimens, they contained much more noise, which we believe made it harder for our model to extract features. We mitigated this as best we could by using Dropout layers and adding regularization to our Conv2D layers. Our model relies heavily on a strong dataset, and while we were able to get a fairly robust dataset, more data would have helped our model perform more consistently. We also attempted transfer learning to make up for the smaller dataset; however, without a strong base model, that model did not perform as well. Despite these challenges, we reduced overfitting and were able to create a strong model to identify plants.
Reflection
We feel that our project turned out fairly well. Our goal was to train a model on 10 species of plants and we were able to do so with a relatively high accuracy. Our stretch goal was to train a model on plants in the wild, but due to the change in dataset, this became our base goal.
Our model works as expected. Given an image of a plant from one of the 10 species, it predicts the correct species just over 70% of the time. Our approach changed due to the change in the dataset. Herbarium specimens are consistent in their format, which would have made the model easier to train. However, since we changed datasets to plants in the wild, we knew that there would be noise and inconsistencies. We were wary of overfitting and took steps to mitigate it as much as possible.
If we did this project again, we would have liked to start with the herbarium specimens and build from there. We feel that the herbarium specimens would have given us a far better and more consistent baseline. If we had more time, we would have liked to train a model on the herbarium dataset and then use that model as the base for transfer learning, which we feel would have helped it learn from our current dataset. This approach appears in our reference paper and is something we would like to consider implementing in the future. Another idea we would like to try is segmentation to remove background artifacts like people.
Our biggest takeaways from this project are the dangers of overfitting and the importance of being flexible. We did not know that our original dataset would get compromised; however, when it did, we feel that we handled it gracefully and were able to approach our problem from a different angle. We also had not run into many issues with overfitting in our assignments, so we learned a lot from this project. We realized the importance of preprocessing and of designing a model in a way that minimizes the effect of noise. This project was a learning opportunity for both of us, and we gained skills that we will take beyond the scope of this class.
Initial Write Up
Introduction
https://bmcecolevol.biomedcentral.com/articles/10.1186/s12862-017-1014-z
The objective of this paper is to identify the large number of herbarium specimens by training on specimens that have already been labeled. The paper also examines how well models perform on plants, as this is something that has not been done in the past. Many specimens are currently being digitized, so using them for plant identification is a relatively new opportunity. Identifying new specimens by hand is difficult and time consuming, so automating the process may help label specimens that have not yet been identified.
We chose this paper because it not only explores identifying plants in a controlled setting, but also discusses how the model could be extended. It is said to be the first use of deep learning to identify plants, and we saw it as a good fit. Both of us took a first-year seminar (FYS) about plants and drew inspiration from it. With that inspiration, we came across this paper and felt that its research fit closely with our ideas.
What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc
This is a classification problem as we are trying to identify the species of a herbarium specimen.
Related Work
https://www.frontiersin.org/articles/10.3389/fpls.2021.787127/full
This paper discusses a Kaggle competition in which teams were tasked with identifying the species in a given dataset. Teams used different models with varying degrees of success. The top teams used “deep metric learning in addition to the standard cross entropy loss”. The top teams also used varying base neural networks, which is interesting because no one model was the “best”. The paper notes that teams that did poorly did not necessarily have a poor model, but rather lacked the computational power to process the large dataset.
Data
We are planning to construct a simple dataset from the University of Washington’s digitized herbarium specimen images. We plan to collect 100 images for each of 10 plant species, with 70 training specimens, 20 validation specimens, and 10 test specimens per species. In total there will be 1000 plant images in our dataset. We will have to do some preprocessing, since each herbarium specimen has the plant name written on a label, and we don’t want our model to learn from the label. So, we will apply a Gaussian blur to each label to render it unreadable.
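The planned 70/20/10 split per species could be done along the lines of the sketch below. The function name split_species, the fixed seed, and the file names are hypothetical and stand in for however the images are actually stored.

```python
import random


def split_species(image_paths, n_train=70, n_val=20, n_test=10, seed=0):
    """Shuffle one species' images and split them into train/val/test lists."""
    if len(image_paths) < n_train + n_val + n_test:
        raise ValueError("not enough images for the requested split")
    shuffled = list(image_paths)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Hypothetical file names standing in for 100 images of one species.
paths = [f"species_a/img_{i:03d}.jpg" for i in range(100)]
train, val, test = split_species(paths)
print(len(train), len(val), len(test))  # 70 20 10
```

Splitting each species separately keeps the classes balanced across the three sets, so 10 species yield 700 training, 200 validation, and 100 test images.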
Methodology: What is the architecture of your model?
How are you training the model?
We are planning to train this model using a CNN. We believe that our dataset is a lot like the CIFAR dataset from the CNN project. We also see similarities to the MNIST dataset, as most specimens are recorded in a similar format. The images should be fairly consistent, which will make training easier as long as our model can take into account the different shapes and features of each unique species.
If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.
We believe that the hardest parts of this project are the preprocessing and the tuning of the model. Many of the specimen images include a label (the species name is in the picture). The model will likely learn from that label instead of the plant itself, which would defeat the entire purpose of this model. We need to ensure that those labels are either removed or ignored. Another hurdle will be training on many species. We think that if we train on too many species, the accuracy may drop. Many species are extremely similar, and if we include too many similar species our accuracy may suffer. Picking the right model parameters will be vital to get the best possible results.
Metrics
What experiments do you plan to run?
We plan on running accuracy tests to see how well our model does on test data. We plan on splitting our data into training and testing sets.
Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?
Yes, accuracy applies since we are trying to see if our model can accurately predict what a plant species will be given an image of the herbarium specimen.
If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model.
At a high level the authors were trying to see how accurately plants can be labeled using Deep Learning. They were also trying to see if training with herbarium specimens can translate to accurately predicting the species of a plant in the environment. Lastly, they try to see if a model that identifies species in one region can identify the same species in another region.
The authors quantified their results by measuring the accuracy of their model. They concluded that deep learning has potential to classify plants which may lead to a semi or fully automated system of labeling plants.
What are your base, target, and stretch goals?
Our base goal is to be able to identify a subset of plants. Our target goal is to increase the number of species trained and hopefully be able to accurately identify all plants in the dataset. Our stretch goal is to tweak our model and see if we can use it to identify plants out in the wild.
Ethics
Why is Deep Learning a good approach to this problem?
Deep learning is a good approach to this problem because it is a method that has been successfully used to solve similar problems in the past. We are trying to identify a species of plants and this is similar to problems deep learning has solved in the past such as identifying numbers.
Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?
The major stakeholders are us, the individuals implementing the model, and the botanists who are labeling the herbarium specimens. Mistakes in our algorithm would have to be fixed by hand, which translates to more work for the botanists. On the flip side, if our model is very effective, it could result in botanists losing their jobs because our model would replace them.
How are you planning to quantify or measure error or success? What implications does your quantification have?
We are planning on quantifying success based on how accurately our model determines the species of a given plant. One implication is the threshold at which we declare success. 80% accuracy may be good for one model, but for this model it may mean that too many plants get mislabeled, which means that the botanists must go in to fix those mistakes.
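To show what an accuracy threshold implies in practice, the sketch below computes accuracy from a list of predictions and estimates how many specimens would need manual correction. The function names and the sample predictions are made up for illustration. The 1000-specimen figure matches the planned dataset size.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def expected_corrections(num_specimens, acc):
    """Approximate number of specimens a botanist would have to relabel."""
    return round(num_specimens * (1 - acc))


# Made-up predictions for five specimens; one of them is wrong.
truth = ["Salal", "Arnica", "Trillium", "Lewisia", "Salal"]
preds = ["Salal", "Arnica", "Lewisia", "Lewisia", "Salal"]
print(accuracy(truth, preds))           # 0.8
print(expected_corrections(1000, 0.8))  # 200
```

At 80% accuracy, roughly one in five specimens is mislabeled, so a collection of 1000 specimens would still leave about 200 for manual review.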
Division of labor
Because we are a two-person group, we will mostly be working together. Winston is in charge of collecting the data, and Keigo is in charge of preprocessing. Everything after that will be done together.
Reflection Link
https://docs.google.com/document/d/1hVEQVSaC_0OKUJ1H89uM3NQuoZOKeM0phMaD1cxeBLs/edit?usp=sharing
Built With
- python
- tensorflow