Priyam and Akshay hail from India. While internet penetration and mobile adoption have seen tremendous growth in the last few years, high-speed, reliable internet is still available only in dense metro cities.
When we started the project, our hope was to empower people in our country who could really use the vast resources of the internet: farmers who could learn the latest agricultural techniques, students in remote areas without good schooling, and so on. Given the language barriers and illiteracy that exist in our country, we realised that video resources (i.e. learning by seeing) would be most effective.
But how do we deliver good-quality video content to rural areas that have low bandwidth and poor signal strength? The answer was to build a super-resolution model that can work on mobile devices.
What it does
The system is split into 4 major components explained below:
- Data is prepared using the TensorFlow dataset APIs. All pre-processing, batching, etc. is done in block 1. This gives us batches of small (low-res) frames with their past and future context (frames from the previous and next timesteps), along with the target (high-res) frames for training.
- This data first flows into the motion compensation block (block 2), which estimates coarse as well as fine flow vectors.
- The context frames are warped with their respective flow vectors in block 3. This is achieved using tfa. These warped frames (now sans motion) are then sent to block 4.
- In block 4, we perform an early fusion of the inputs using a TimeDistributed-Conv2D operation, which performs a conv2d op on every context frame. The output filters are then averaged (i.e. fused) for computing the training error.
- Finally, the fused image is compared against the high-res image using a perceptual loss specifically for super-resolution and an additional Huber loss for minimising flow.
- Non-overlapping patches from the test video are extracted in a similar manner (i.e. with their appropriate context patches).
- These are fed to the feed-forward network (Blocks 2, 3, 4).
- The output patches are then tiled to produce a complete high-res frame. Individual frames are merged using a video encoder utility (e.g. ffmpeg) to produce a high-res video.
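The patch-and-tile steps at inference time can be sketched roughly as follows. This is a minimal NumPy illustration, not our TensorFlow code: the function names are hypothetical, the context patches are left out for brevity, and `upscale_placeholder` is a nearest-neighbour stand-in for the feed-forward network (blocks 2, 3, 4) so that the sketch runs without a trained model.

```python
import numpy as np


def extract_patches(frame, patch):
    """Split an (H, W, C) frame into non-overlapping (patch, patch, C) tiles.

    Assumes H and W are multiples of `patch`; a real pipeline would pad first.
    Returns an array of shape (rows, cols, patch, patch, C).
    """
    h, w, c = frame.shape
    if h % patch or w % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    rows, cols = h // patch, w // patch
    tiles = frame.reshape(rows, patch, cols, patch, c)
    return tiles.transpose(0, 2, 1, 3, 4)


def upscale_placeholder(tile, scale):
    """Hypothetical stand-in for the network: nearest-neighbour upsampling."""
    return tile.repeat(scale, axis=0).repeat(scale, axis=1)


def tile_patches(tiles):
    """Stitch a (rows, cols, ph, pw, C) grid of tiles back into one image."""
    rows, cols, ph, pw, c = tiles.shape
    return tiles.transpose(0, 2, 1, 3, 4).reshape(rows * ph, cols * pw, c)


def super_resolve_frame(frame, patch=4, scale=2):
    """Extract patches, upscale each one independently, then tile the results."""
    tiles = extract_patches(frame, patch)
    rows, cols = tiles.shape[:2]
    upscaled = np.stack([
        np.stack([upscale_placeholder(tiles[r, col], scale) for col in range(cols)])
        for r in range(rows)
    ])
    return tile_patches(upscaled)
```

Because each patch is processed independently, the patch size can be chosen to fit the memory budget of the target device; the cost is that a real network may produce visible seams at patch borders, which this placeholder does not show.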
How we built it
- We did a thorough literature survey of recent and classical methods in image super-resolution by reading papers from recent CVPR/ICCV/NeurIPS conferences and review articles on arXiv.
- Then we decided to tackle the problem of video super-resolution, as it is a much more nuanced problem and extremely important for large-scale deployment of super-resolution.
- We created an overview of the subproblems we had to solve in order to make this work. This is uploaded in our GitHub repository's README.
- We then split the work and kept updating the repo as we made progress.
Challenges we ran into
As we live in different time zones, we had to hold our discussions at opposite ends of the day, often late at night or early in the morning. Sometimes it was difficult to move forward with an idea because we had to wait ~12 hours before we could discuss it.
While neither of us is new to the field of machine learning and deep neural networks, we did have a hard time deciphering some of the recent literature on single-image super-resolution, as a large portion of this field stems from classic image processing pipelines. This resulted in some delays during experimentation: most works assumed a fair bit of knowledge of the topic, so we had to make guesses on experimental choices such as whether to fuse our frames early or late, and on the tradeoffs between classical deconvolution-based upsampling, which is used heavily in the image segmentation literature, and the fairly new sub-pixel convolution approach. Fortunately, TensorFlow's state-of-the-art and up-to-date modules gave us access to both.
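To make the sub-pixel convolution idea concrete: an ordinary convolution produces r*r*C channels at low resolution, and a rearrangement step then moves those channels into an r-times larger spatial grid. The NumPy sketch below shows only that rearrangement step. The function name and the channel layout are assumptions made for illustration, not code from our repository; in TensorFlow an operation of this kind is available as `tf.nn.depth_to_space`.

```python
import numpy as np


def depth_to_space(x, r):
    """Rearrange an (H, W, r*r*C) array into (H*r, W*r, C).

    Assumed layout: out[h*r + i, w*r + j, c] = x[h, w, (i*r + j)*C + c].
    """
    h, w, d = x.shape
    if d % (r * r):
        raise ValueError("channel count must be divisible by r*r")
    c = d // (r * r)
    x = x.reshape(h, w, r, r, c)      # axes: [h, w, i, j, c]
    x = x.transpose(0, 2, 1, 3, 4)    # axes: [h, i, w, j, c]
    return x.reshape(h * r, w * r, c)
```

With this approach the convolutions run at the low input resolution and only the final rearrangement produces the high-res grid, whereas deconvolution (transposed convolution) based upsampling learns the upsampling filter directly on the enlarged grid.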
Accomplishments that we're proud of
Despite living in different time zones, we were able to collaborate on this project and make tangible progress on it. When we started toying with the idea of super-resolution, we did not know much about the current state-of-the-art methods for solving this problem, and thus had to spend weeks just reading (and summarizing) papers from the most recent CVPR/ICCV/NeurIPS conferences, which often left us confused. But we braved the literature review and settled on a combination of recent techniques that have been shown to give reliable and perceptually sane super-resolution results. We are also currently in the process of extending this method with a perceptual loss, which makes it even harder to distinguish between a super-resolved image and the original high-res image.
For the interested reader, here is a link to our notes on recent work in image super-resolution (WARNING! Not for the faint of heart!): https://hackmd.io/c/HJEN0RlFE/%2FEe8Qcr_dQdaH4-ihAIlVFA
What we learned
In our quest to understand the field of video super-resolution and make it work, we realized that most current methods only look at parts of the problem and that there is much larger scope in a holistic approach. Streaming services such as Google's own Stadia and the YouTube platform can directly benefit from the current advances in super-resolution, but only if we solve it on all ends: data, model and algorithm. Google is in a unique position, as it has all the components to solve this problem at scale.
What's next for Super Resolution On The Edge
- We are currently looking into training a larger and more diverse video super-resolution model on datasets such as the YouTube-8M video dataset.
- TensorFlow's compression tools and the TensorFlow.js toolkit will allow us to compress our model to a modest size and run it in real time on resource-constrained devices such as mobile phones and ultrabooks.