Title
LePerson Finder
FINAL SUBMISSION
Github: https://github.com/teddyam/LePersonFinder/tree/nick
Slides (2470): https://docs.google.com/presentation/d/1pdMufdjjoQBHvgQbuImIx6aAJ5X34K4SKHEXdFmR1X4/edit?usp=sharing
Written Report: https://docs.google.com/document/d/1nGx5vEES-2blBARovCNFZKrWStvvLrhoIG274I1vBOs/edit?usp=sharing
Who
Nicolas Kim (nhkim), Teddy Arida-Moody (taridamo), Sam Duong (stduong), Justin Chan (jchan62)
Introduction
We are not implementing an existing paper, but we are pursuing a similar objective (anomaly/victim detection).
Motivation
There is extensive existing literature on search and rescue (SAR) scenarios and on identifying people in images with supervised CNNs. We aim to evaluate the effectiveness of Transformers on this task as opposed to existing CNN architectures. Transformers currently underperform top-of-the-line CNN architectures in image segmentation, so we wanted to see whether a Transformer could obtain comparable results.
Problem Type
Supervised Learning
Related Work
https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/ We noticed that for search and rescue applications there are many pre-existing CNN architectures (such as U-Net, above), and we wanted to try the novel idea of using a Transformer model instead.
Relevant Paper Summary
One paper relevant to our topic is https://www.mdpi.com/2072-4292/14/13/2977. It is motivated by the fact that many deep learning approaches to object detection in this field have insufficient labels for victims (e.g. bounding-box coordinates). The authors therefore build a pipeline using GANs that generates “unreal” data: realistic compositions of victim bodies against various backgrounds, augmenting existing datasets for downstream use.
Data
We are using the HERIDAL dataset, which contains aerial images of wilderness, some of which contain people in search and rescue situations. The dataset contains 68,750 image patches and 500 full-size 4000x3000-pixel images. Some of the preprocessing we anticipate is filtering the images to select only those with people in them (29,050 of the patches above). Additionally, we would have to flatten each image and divide it into patches.
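The patch-division step can be sketched with NumPy. The 100x100 patch size here is an illustrative assumption, not a value specified by the dataset; edges that do not divide evenly are simply cropped off in this sketch.

```python
import numpy as np

def extract_patches(image: np.ndarray, patch_size: int = 100) -> np.ndarray:
    """Split an H x W x C image into non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size, patch_size, C).
    Rows/columns that do not divide evenly by patch_size are cropped.
    """
    h, w, c = image.shape
    h_crop = (h // patch_size) * patch_size
    w_crop = (w // patch_size) * patch_size
    image = image[:h_crop, :w_crop]
    # reshape into a grid of patches, then flatten the grid dimensions
    patches = image.reshape(h_crop // patch_size, patch_size,
                            w_crop // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size, patch_size, c)

# A HERIDAL-sized 4000x3000 image with 100x100 patches -> 30 * 40 = 1200 patches
img = np.zeros((3000, 4000, 3), dtype=np.uint8)
print(extract_patches(img, 100).shape)  # (1200, 100, 100, 3)
```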
Methodology
General Architecture: We take each image, flatten it, and divide it into sequential patches, mimicking the words of a sentence. We then feed that sequence as a single input into the model and embed the patches. The model is a unidirectional, encoder-only Transformer that adopts the standard Transformer architecture: encoder blocks → MLP → bounding-box/segmentation prediction for victims. This prediction could be formulated as either a classification or a regression task, but we are leaning toward classification (a binary decision of whether a victim is within a given patch).
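A minimal NumPy sketch of one encoder block followed by a per-patch binary head, under simplifying assumptions: single-head attention, no layer normalization or positional embeddings, random untrained weights, and an illustrative embedding dimension of 64. This shows the data flow (patch sequence in, per-patch victim probability out), not a trainable implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(x, wq, wk, wv, wo, w1, b1, w2, b2):
    """One simplified Transformer encoder block (no LayerNorm, single head)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # scaled dot-product attention
    x = x + attn @ v @ wo                           # residual connection
    h = np.maximum(0.0, x @ w1 + b1)                # feed-forward MLP with ReLU
    return x + h @ w2 + b2                          # second residual connection

rng = np.random.default_rng(0)
d = 64                                       # assumed embedding dimension
seq = rng.normal(size=(1200, d))             # 1200 embedded patches of one image
attn_w = [rng.normal(size=(d, d)) * 0.02 for _ in range(4)]
ffn_w = [rng.normal(size=(d, 4 * d)) * 0.02, np.zeros(4 * d),
         rng.normal(size=(4 * d, d)) * 0.02, np.zeros(d)]
out = encoder_block(seq, *attn_w, *ffn_w)

logits = out @ rng.normal(size=(d, 1))       # binary classification head
probs = 1 / (1 + np.exp(-logits))            # P(victim present in each patch)
print(probs.shape)  # (1200, 1)
```

In practice we would stack several such blocks and train the weights end-to-end; the sketch only illustrates the encoder → MLP → per-patch prediction pipeline described above.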
Training
We will batch the image data and feed it to the model iteratively. To reduce overfitting, we also intend to augment the dataset by manipulating the images (e.g. cutout regularization, flipping, rotation).
Design Justification
We chose an encoder-only design because we do not need to generate a novel output for our input, such as a new image or an autoencoder-style reconstruction that improves image quality. We chose a standard Transformer model because we can build on it incrementally rather than adopting a complex model up front and then trying to scale it. If we run into issues, some backup ideas we may experiment with are baseline models (e.g. simple 2D CNNs), in case something is wrong with the way we are formulating the problem or with our understanding of the task; we could also reframe the task from classification to regression.
Metrics
For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate? The notion of accuracy does apply for this project: if the bounding box generated by the model contains the person in the image, the loss for that sample will be lower.
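For the bounding-box formulation, a standard way to score how well a predicted box contains the person is intersection-over-union (IoU); a minimal sketch, assuming boxes in (x1, y1, x2, y2) pixel coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)       # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# a prediction overlapping half of a 10x10 ground-truth box:
# intersection 50, union 150 -> IoU = 1/3
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

Thresholding IoU (commonly at 0.5) turns each prediction into a correct/incorrect call, which recovers an accuracy-style metric for the detection task.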
Goals
Base: get the model to run without unexpected behavior. Target: accuracy/baseline metrics better than random guessing. Stretch: performance comparable with other CNN models in the field in terms of overall metrics (ROC, accuracy).
Ethics
What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain? Our dataset is the IPSAR dataset which contains aerial images of wilderness, some of which contain people in search and rescue situations. Some concerns regarding its collection include bias in the data toward people that were indeed found in a given search and rescue situation and unequal distribution in terms of race and gender that could lead to bias in the model itself.
Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm? The major stakeholders of this algorithm are search and rescue teams, government bodies that manage these scenarios for major landmarks (national parks, wilderness areas, etc), and potential visitors to these landmarks that could find themselves in these scenarios. The consequences of mistakes of the algorithm are misinformation in these search and rescue situations that could negatively influence attempts to save the victim.
Timeline
~3-week timeline, including everyone:
Jchan, Nick, and Teddy: have until 5/5; Systems lab + DL homework, with 4/16 as the aim to get it all done.
Sam: final exam + a final project due the week before the DL deadline (has until 5/1).
:) & rainbows plan:
Week 1 [4/12-4/19] (no individuals break off): 6-10 hours for all of us (baseline); 3-4 sessions of group coding. Preprocessing + math + model architecture/design decisions done together. Get a skeleton of the model into code (Sam takes most of this burden).
Week 2 [4/19-4/26] (people break off, reconvene for debugging/clarifications): 5 hours per week for all of us (baseline). Preprocessing should be fully finished. Get the model fully running without any optimizations. Training, testing, basic metrics (accuracy, loss); metrics can be rough at this stage. 5-10 hours of debugging (mostly between Jchan, Nick, and Teddy).
Week 3 [4/27-5/5] (together as much as possible): 5 hours per week for all of us (baseline). Optimization + final visualizations. Poster prep.
DL Checkin 3 Reflection
https://docs.google.com/document/d/1GeoFUw511SYIg_vDx8pvtlqLtGPdHwOmtCRIUPEaMSQ/edit?usp=sharing