CHECK IN 2
Challenges: What has been the hardest part of the project you’ve encountered so far? We ran into issues with web scraping: the library we were using tried to scrape the HTML of the website before the page had fully loaded. More generally, it is hard to approach a large and ambiguous problem. We have now implemented the web scraping part of our project and still need to run the Scrapy script to download and process our images and class labels.
Insights: Are there any concrete results you can show at this point? How is your model performing compared with expectations? We do not yet have concrete results to show, so we cannot compare performance with our expectations. We anticipate producing results this week. We have written the script that downloads and preprocesses a subset of our dataset, and we will soon download and preprocess the whole dataset using Brown computing resources.
Plan: Are you on track with your project? What do you need to dedicate more time to? We need to dedicate more time to implementing our CNN model, since we have not finished developing the architecture. What are you thinking of changing, if anything? At the moment we are not planning any changes, since we don’t have concrete results yet. However, we have considered using a bag-of-words CV model to identify the architectural style of an image, in addition to classifying the decade the image is from.
CHECK IN 1
Team Name: Model Architecture
Main Idea: The Deep Learning model will use Convolutional Neural Networks (CNNs) to determine the decade in which a building was built from an image of the building. In addition, if suitable data is available, we may use an RNN or an encoder-decoder transformer to caption the test images.
Who: Names and logins of all your group members. Mbeards1, tcolema3
Introduction: What problem are you trying to solve and why? If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper. If you are doing something new, detail how you arrived at this topic and what motivated you.
We are doing something new. We arrived at this topic because of an interest in architecture. Taylor and I have both taken architecture classes at Brown, and we thought this would make a good Deep Learning project because we had never heard of anyone applying Deep Learning models to architecture. At the same time, existing Deep Learning work should transfer well to analyzing architecture, because the problem boils down to image classification and there is sufficient data available online. After we discovered our mutual interest in a Deep Learning model for architecture, we started thinking about which specific problem within architecture we wanted to solve. We decided to work on decade classification because we are interested in the different architectural styles and how they relate to time periods. We are excited to see how well a Deep Learning algorithm can perform on this task!
What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc. Classification
Related Work: Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching.
This paper surveys a large number of research papers on captioning images with deep learning. For image processing, the most common approach the authors found is to use CNN layers to help classify images. The paper then details the methods used to pair images with language (the most common are GRUs, LSTMs, and Transformers) and describes ways to evaluate model architectures and captioning quality. This is highly relevant to our topic, since our implementation falls under image-based deep learning. https://dl.acm.org/doi/abs/10.1145/3295748
In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”--if you stumble across a new implementation later down the line, add it to this list. https://towardsdatascience.com/image-captions-with-deep-learning-state-of-the-art-architectures-3290573712db
Data: What data are you using (if any)? If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it).
We will be using image and text data from https://www.archinform.net/index.htm. This is a publicly available database with a large number of images and accompanying information that are easy to download and parse. Since each building’s URL depends only on a number, it will be easy to iterate over the pages and collect this data.
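As a rough illustration of why number-based URLs make collection simple, the sketch below generates one page URL per building ID. The URL template is a placeholder, not the site's confirmed path, and `building_urls` is a hypothetical helper name.

```python
from typing import Iterator

# Placeholder pattern (assumption): the real path on archinform.net may differ.
URL_TEMPLATE = "https://www.archinform.net/EXAMPLE_PATH/{building_id}.htm"


def building_urls(first_id: int, last_id: int,
                  template: str = URL_TEMPLATE) -> Iterator[str]:
    """Yield one page URL per numeric building ID, both ends inclusive."""
    if first_id > last_id:
        raise ValueError("first_id must not exceed last_id")
    for building_id in range(first_id, last_id + 1):
        yield template.format(building_id=building_id)


if __name__ == "__main__":
    # Print the first three candidate URLs for a quick visual check.
    for url in building_urls(1, 3):
        print(url)
```

A list like this could be handed to a scraper as its start URLs. IDs with no page behind them would have to be skipped at download time.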
How big is it? Will you need to do significant preprocessing?
We will most likely use around 30,000 color images. We will try to resize each image to about 256x256x3 or 512x512x3. We will also need to convert each date from a single year to a class label that represents a decade. Finally, we will need to preprocess and pad the text descriptions of the images.
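The year-to-decade step could look something like the sketch below. The starting decade (1920) and the function name are assumptions made for illustration. The write-up only fixes the number of classes at 10.

```python
def year_to_decade_class(year: int, first_decade: int = 1920,
                         num_classes: int = 10) -> int:
    """Map a construction year to a class index, one class per decade.

    first_decade is an illustrative assumption, not a value from the project.
    """
    index = (year - first_decade) // 10
    if not 0 <= index < num_classes:
        raise ValueError(f"year {year} falls outside the covered decades")
    return index


def decade_label(index: int, first_decade: int = 1920) -> str:
    """Human-readable name for a class index, e.g. 0 -> '1920s'."""
    return f"{first_decade + 10 * index}s"


if __name__ == "__main__":
    for year in (1925, 1968, 2019):
        cls = year_to_decade_class(year)
        print(year, cls, decade_label(cls))
```

Raising an error for out-of-range years, rather than clipping them into the first or last class, keeps mislabeled or very old buildings from silently polluting a class.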
Methodology: What is the architecture of your model? How are you training the model?
Since our model has two tasks (classification and captioning), we will need a somewhat unusual approach to training it. Our architecture will look similar to this: Image -> CNN layers -> dense layer (outputs the classification at this point) -> embedding/encoder layers -> transformer -> dense -> softmax
If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here. If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.
From our research, this layering appears to be commonly used for captioning, and it makes sense as a design. Instead of going from language to language as we did in our homeworks, we now process the image data with CNNs and then treat the result as a source language that we convert to English captions. If we run into issues with the captioning, we may replace the transformer with a GRU-based RNN. For the classification from the CNNs, I don’t expect any issues with our model architecture.
Metrics: What constitutes “success?” What experiments do you plan to run?
Testing classification: There are two tests we can run. First, we can compare the predicted decade to the true decade and measure how accurate the model is. Second, we can look at the percentage of predictions that fall within ±1 decade of the true decade for a more lenient measure.
Testing captioning: We can compare the predicted caption of each image to its true caption and measure perplexity and accuracy, as we did in the translation homeworks.
For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?
Our notion of accuracy still applies to the model, since we are implementing fairly commonly used methods of classification.
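The two classification tests described above can be sketched with one helper, assuming decades are encoded as consecutive integer class indices. The function name and the sample predictions are made up for illustration.

```python
from typing import Sequence


def decade_accuracy(true_classes: Sequence[int],
                    predicted_classes: Sequence[int],
                    tolerance: int = 0) -> float:
    """Fraction of predictions within `tolerance` decades of the truth.

    tolerance=0 gives exact accuracy; tolerance=1 gives the +-1 decade measure.
    """
    if len(true_classes) != len(predicted_classes):
        raise ValueError("inputs must have the same length")
    if not true_classes:
        raise ValueError("inputs must not be empty")
    hits = sum(abs(t - p) <= tolerance
               for t, p in zip(true_classes, predicted_classes))
    return hits / len(true_classes)


if __name__ == "__main__":
    true = [0, 3, 5, 9]   # made-up labels
    pred = [0, 4, 7, 9]   # made-up predictions
    print(decade_accuracy(true, pred))               # exact match: 0.5
    print(decade_accuracy(true, pred, tolerance=1))  # within one decade: 0.75
```

Reporting both numbers side by side shows whether the model's mistakes are near misses or land far from the true decade.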
If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model. If you are doing something new, explain how you will assess your model’s performance.
We will assess our model’s performance by comparing its classification accuracy on the test set with the accuracy of a random guess. The same applies to captioning with perplexity and accuracy, but we can also use a human “eye test” to see whether the captions actually make sense.
What are your base, target, and stretch goals?
Classification (since we will use 10 decades for classification):
- Base: >20% accuracy
- Target: >30% accuracy
- Stretch: >50% accuracy
Captioning:
- Base: <150 perplexity, accuracy >0.2
- Target: <100 perplexity, accuracy >0.4
- Stretch: <70 perplexity, accuracy >0.6
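To put these goals in context, the sketch below computes the accuracy of uniform random guessing and the standard definition of perplexity (the exponential of the mean negative log-probability assigned to the correct tokens). The function names and the sample probabilities are illustrative assumptions, not project code.

```python
import math
from typing import Sequence


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of guessing uniformly at random among the classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be positive")
    return 1 / num_classes


def perplexity(correct_token_probs: Sequence[float]) -> float:
    """exp(mean negative log-probability) over the correct tokens."""
    if not correct_token_probs:
        raise ValueError("need at least one probability")
    mean_nll = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # With 10 decade classes, uniform guessing is right 10% of the time,
    # so the base goal of >20% is twice the chance rate.
    print(chance_accuracy(10))
    # A model that gives every correct token probability 0.25 has
    # perplexity of about 4 (made-up probabilities).
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

If the decades are unevenly represented in the dataset, always predicting the most common decade can beat uniform guessing, so that majority-class rate may be a fairer baseline to compare against.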
Ethics: Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.) What broader societal issues are relevant to your chosen problem space? Why is Deep Learning a good approach to this problem? What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?
Our dataset is an image database of prominent architectural images and buildings. Since most buildings are publicly visible, there are no significant concerns about how these photos were taken. However, the images are most likely not representative of global architecture, as they mainly focus on Western Europe and the USA. In terms of bias, there are two main ones I can think of. First, there is a bias towards famous buildings, which means that the model may fail at classifying common buildings. Second, there will be a bias towards Western architecture, as this is the most studied region and field of architecture. This means that our model will not be trained on a fully representative dataset, which could reinforce the idea that Western architecture is more worth studying than architecture from other regions.
Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?
The major stakeholders in this problem are the architects who designed the buildings, the people who built them, and the people who own them or stand to profit from them. If the algorithm were widely used, incorrect estimates of a building’s construction date could affect its value and lead to financial gain or loss for the stakeholders. For example, if a building were valuable because it was emblematic of a certain decade with an important architectural style, but the algorithm determined that it was not an accurate depiction of the decade’s style, this could reduce the value of the building.
How are you planning to quantify or measure error or success? What implications does your quantification have?
Add your own: if there is an issue about your algorithm you would like to discuss or explain further, feel free to do so.
Division of labor: Briefly outline who will be responsible for which part(s) of the project.
We will divide labor equally overall. However, if a team member has a particular strength for a certain task, they will take the lead on that part of the project. For example, Maggie has taken data science, which teaches web scraping and preprocessing skills, so she will complete a large portion of those tasks. This strategy will help us be most effective as we work through the project tasks, because we can benefit from each other’s different skill sets without being siloed on different parts of the project.
Built With
- keras
- python
- tensorflow