Enhancing Video Definition through Deep Learning

Haibo Li, Shukai Ni, Xianyang Xie

Introduction

In today’s digital world, high-quality videos and images are essential across many fields, from healthcare and satellite imagery to social media and online content. However, many videos and images suffer from low resolution, owing to limitations in imaging devices or the conditions under which they were captured. Enhancing the resolution of these images can significantly improve their clarity and usability.

Super-resolution is a technique that aims to convert low-resolution images into higher-resolution versions by filling in extra pixels. Traditional methods such as simply stretching the image often produce blurry results. Advances in deep learning, particularly through architectures known as autoencoders and U-Nets, have become a game-changer for improving image resolution.

This project explores the possibilities of autoencoders and integrates insights from the U-Net architecture, commonly used in image-processing tasks such as segmentation and reconstruction. Our model reverses the traditional U-Net structure: it starts by expanding the image dimensions and then contracts them to the desired high resolution. This lets the model progressively enhance image clarity, ensuring the final image is usable and realistic. This approach represents a significant advancement in the application of deep learning to image quality enhancement.

Methodology

We use FFmpeg to split the videos into images. Then, in the ImageProcessor step, given the patch size, we crop the images so that their dimensions align exactly with an integer number of patches across both the image width and the image height.
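Below is a minimal sketch of the frame-splitting and patching steps described above, assuming FFmpeg is installed; the file paths, the ImageProcessor-style helper names, and the 50-pixel patch size are illustrative assumptions rather than the exact implementation.

```python
import subprocess
import numpy as np
from PIL import Image

def video_to_frames(video_path, out_dir):
    # Split the video into numbered PNG frames with FFmpeg.
    subprocess.run(["ffmpeg", "-i", video_path, f"{out_dir}/frame_%05d.png"], check=True)

def crop_to_patch_grid(image_path, patch_size=50):
    # Crop the frame so width and height are exact multiples of the patch size.
    arr = np.asarray(Image.open(image_path))
    h = arr.shape[0] - arr.shape[0] % patch_size
    w = arr.shape[1] - arr.shape[1] % patch_size
    return arr[:h, :w]

def to_patches(arr, patch_size=50):
    # Cut the cropped frame into a row-major list of square patches.
    h, w = arr.shape[:2]
    return [arr[i:i + patch_size, j:j + patch_size]
            for i in range(0, h, patch_size)
            for j in range(0, w, patch_size)]
```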

In the ImagesStitcher step, we reassemble the patches into an image. This is accomplished by positioning each patch precisely at its original grid coordinates, ensuring seamless alignment and uniformity. After this step, we again use FFmpeg to combine the images back into a video.
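A minimal stitching sketch consistent with this description follows: each patch is placed back at its original grid coordinates. The (rows, cols) grid shape is assumed to have been recorded during patch extraction.

```python
import numpy as np

def stitch_patches(patches, rows, cols, patch_size=100):
    # Allocate the full-size canvas, then paste each patch at its grid position.
    channels = patches[0].shape[-1]
    out = np.zeros((rows * patch_size, cols * patch_size, channels), dtype=patches[0].dtype)
    for idx, patch in enumerate(patches):
        r, c = divmod(idx, cols)
        out[r * patch_size:(r + 1) * patch_size,
            c * patch_size:(c + 1) * patch_size] = patch
    return out
```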

The Super Resolution Autoencoder is structured into two primary components: the encoder and the decoder. The encoder is composed of several convolutional layers designed to extract and compress significant features from the input images. These layers use “Conv2D” with progressively increasing filter counts (64, 128, 256). Each convolutional layer is followed by a “MaxPool2D” layer to reduce spatial dimensions, thereby enlarging the receptive field of the filters and enhancing feature capture. Conversely, the decoder reconstructs the high-resolution output from the encoded representation, effectively mirroring the architecture of the encoder in reverse. It uses “UpSampling2D” layers to refine these features so the output matches the dimensions of the target high-resolution images. The inclusion of “Add” layers enables a residual learning approach by combining outputs from corresponding encoder and decoder layers, which aids in the recovery of fine image details.
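A minimal Keras-style sketch of this encoder-decoder follows. The filter counts (64, 128, 256) and the MaxPool2D, UpSampling2D, and Add layers follow the description above; kernel sizes, activations, and the 100-pixel patch shape are assumptions.

```python
from tensorflow.keras import layers, Model

def build_sr_autoencoder(patch_size=100):
    inp = layers.Input(shape=(patch_size, patch_size, 3))

    # Encoder: Conv2D blocks with increasing filters, each followed by MaxPool2D.
    e1 = layers.Conv2D(64, 3, activation="relu", padding="same")(inp)
    p1 = layers.MaxPool2D()(e1)
    e2 = layers.Conv2D(128, 3, activation="relu", padding="same")(p1)
    p2 = layers.MaxPool2D()(e2)
    e3 = layers.Conv2D(256, 3, activation="relu", padding="same")(p2)

    # Decoder: UpSampling2D mirrors the encoder; Add layers form residual skips
    # between corresponding encoder and decoder features.
    d2 = layers.Conv2D(128, 3, activation="relu", padding="same")(layers.UpSampling2D()(e3))
    d2 = layers.Add()([d2, e2])
    d1 = layers.Conv2D(64, 3, activation="relu", padding="same")(layers.UpSampling2D()(d2))
    d1 = layers.Add()([d1, e1])

    out = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(d1)
    return Model(inp, out)
```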

Beyond the encoder-decoder structure, we borrowed insights from an architecture commonly adopted in image-processing tasks such as segmentation, detection, and reconstruction. U-Net consists of a contracting path (encoder) and an expansive path (decoder), shaped like the letter U. In our experiment we reverse this shape, first expanding the image dimensions and then contracting them to the target dimensions. Since the result is larger than the input, the contracting path takes fewer layers than the expanding path. For instance, one of our trained models uses the following basic building block: Conv2D, ReLU, Conv2D, ReLU, Conv2DTranspose. With Conv2D at stride 1 and Conv2DTranspose at stride 2, the unit effectively enlarges the image by a factor of 2. Likewise, by stacking several ×2 units and ÷2 units, we can reach the targeted scaling factor.
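A hedged Keras-style sketch of such a unit follows: two stride-1 convolutions and a stride-2 Conv2DTranspose double the spatial size, while a strided convolution stands in for a ÷2 contracting unit. Filter counts and the exact stacking are illustrative assumptions; the default configuration below expands twice and contracts once, giving an overall 2x output.

```python
from tensorflow.keras import layers, Model

def x2_unit(x, filters=64):
    # One "x2" building block: Conv2D -> ReLU -> Conv2D -> ReLU -> Conv2DTranspose (stride 2).
    x = layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu")(x)
    return layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(x)

def half_unit(x, filters=64):
    # One "/2" contracting block: a stride-2 convolution halves height and width.
    return layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)

def build_reverse_unet(n_up=2, n_down=1, patch_size=50):
    # Expanding more than contracting (two x2 units, one /2 unit) yields an
    # overall 2x output, mapping 50-pixel inputs to 100-pixel targets.
    inp = layers.Input(shape=(patch_size, patch_size, 3))
    x = inp
    for _ in range(n_up):
        x = x2_unit(x)
    for _ in range(n_down):
        x = half_unit(x)
    out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)
    return Model(inp, out)
```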

Results

For the autoencoder architecture, the encoder successfully captured significant features from the input images, leveraging progressively increasing filters and spatial-dimension reduction, which facilitated the extraction and compression of essential image characteristics. However, the decoder was not very effective at reconstructing finer detail than the input contains: it recovered lower-resolution details well, but not higher-resolution details unseen in the inputs.

The reverse U-Net structure, on the other hand, achieved the expected results: we effectively expanded and contracted image dimensions to reach the desired scaling factor. Basic building blocks such as Conv2D, ReLU, and Conv2DTranspose enabled precise control over image enlargement and reduction. We tested 2x and 4x blurred images with training sets ranging from 200 to 1,000 images, and the model converges swiftly, reaching a validation MSE loss below 0.003 in fewer than 5 epochs. Through strategic stacking of scaling units, we can flexibly change the scaling factor and trade off computing power against scaling efficiency.
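A minimal training sketch under this setup (MSE loss, a handful of epochs) might look like the following; `low_res_patches` and `high_res_patches` are assumed to be numpy arrays of matching patch pairs, and the optimizer and batch size are assumptions.

```python
# Reuses build_reverse_unet from the sketch above: 50-pixel inputs, 100-pixel targets.
model = build_reverse_unet()
model.compile(optimizer="adam", loss="mse")
model.fit(low_res_patches, high_res_patches,
          validation_split=0.1, epochs=5, batch_size=32)
```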

Admittedly, a lower loss does not guarantee a visually similar image. We prepared several out-of-sample images and asked volunteers to distinguish the inputs in a double-blind setting. In all cases, they identified the processed image as having the higher ‘resolution’. Here is an example of input and output images; the image is a video screenshot, which suits the intended application of this model. As one can tell, the model performs effectively in realistic settings.

Challenges

The first challenge was finding a general way to unify the image data format. We decided to use patches with a fixed size (50 for low-res images and 100 for high-res images), so that images of different sizes can be preprocessed into patches with identical shapes. In the ImageStitcher step, a great deal of time is spent accurately aligning and stitching the patches back into a single image. Additionally, once we stitch the patches together, saving the image also consumes considerable time, especially for high-resolution images.

Deeper models may capture more complex patterns but are also harder to train and require more data. There’s a need to balance the depth and complexity of the model to avoid overfitting.

Even though our model does output the desired higher resolution, there are visible boundaries between patches when we stitch them together. We have considered several possible solutions. One is to re-patch the images with a different size and take the average pixel value along the boundaries to smooth out the visible lines; however, this requires repeated processing through the entire model pipeline, which is extremely time-consuming. The other is to rework the patching process to use overlapping patches, so that pixels on the boundaries are trained multiple times. This method also considerably increases the time complexity of training, as there are substantially more patches.
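An illustrative sketch of the overlapping-patch idea (not the pipeline actually used): patches are extracted with a stride smaller than the patch size, and overlapping pixels are averaged when stitching, which smooths the visible seams at patch boundaries.

```python
import numpy as np

def stitch_overlapping(patches_with_offsets, out_shape, patch_size=100):
    # `patches_with_offsets` is assumed to be a list of ((row, col), patch) pairs,
    # where (row, col) is the patch's top-left pixel in the full image.
    acc = np.zeros(out_shape, dtype=np.float64)
    weight = np.zeros(out_shape[:2], dtype=np.float64)
    for (r, c), patch in patches_with_offsets:
        acc[r:r + patch_size, c:c + patch_size] += patch
        weight[r:r + patch_size, c:c + patch_size] += 1.0
    # Divide by how many patches covered each pixel to average the overlaps.
    return acc / weight[..., None]
```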

Reflection

Reflecting on this project, we feel that while we met some of the goals we set earlier, there were still multiple areas for improvement. Our final model can successfully enhance image resolution with accurate color. Hence, our base goal and target goal are met with confidence: not only does our model outperform the baseline model, it also matches the performance of state-of-the-art super-resolution models. In the exploratory phase, our autoencoder model failed to predict meaningful pixels, instead outputting only grey and black boxes. However, our approach evolved from relying on traditional autoencoder architectures to integrating a reverse U-Net structure, which was pivotal for manipulating image dimensions effectively.

If we were to redo the project, we would focus more on the early stages of data handling and model setup, particularly experimenting with different patching techniques. We would also consider tuning model parameters more thoroughly and exploring various network architectures given more computation resources.

The project underscored the importance of choosing the right data-handling techniques and model architecture, and revealed the complexities of training deep learning models for high-fidelity outputs. It has been an insightful experience, teaching us about the practical challenges of applying deep learning to real-world image-processing problems.

Built With

  • autoencoder
  • gan
  • vae

Updates

posted an update

Introduction

(Copied from the proposal.)

Rendering high-definition videos or complex 3D scenarios is computationally expensive and time-consuming. It would be cheaper and smoother if the quality of the rendering could be improved with deep learning frameworks. Temporally, it might be possible to render at a lower FPS and insert intermediary frames in between; for example, render shaders at 10 FPS and create 5 extra frames between each pair of frames to achieve 60 FPS. Spatially, it is possible to keep the same FPS but render shaders and materials at lower quality and then improve the rendered definition. However, the purpose of this project is not the rendering pipeline itself. Instead, it assumes a pre-rendered stream and aims to improve the definition of the output.

Frameworks: VAE, Autoencoder, GAN, etc. One existing industrial solution (DLSS by Nvidia) works at the rendering level, aiming to render 3D objects at low definition and improve the output with deep learning super sampling. In this project the goals are to 1) maintain and, if possible, improve the accuracy of the output stream, defined as the rasterized output. In the theoretical scenario f^{-1}(f(x)) = x, meaning the super-sampled image, when undersampled again, should look like the original image. 2) The data can be acquired as higher-definition video and manually undersampled: for example, a 1080p video can be converted to 720p as training input, with the original 1080p video as the baseline.
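A hedged sketch of this data-generation idea: downscale a 1080p source to 720p with FFmpeg to serve as the low-resolution training input, keeping the original as the high-resolution baseline. File names are placeholders.

```python
import subprocess

def make_training_pair(src_1080p="clip_1080p.mp4", dst_720p="clip_720p.mp4"):
    # Downscale the 1080p source to 720p; the original remains the target.
    subprocess.run(
        ["ffmpeg", "-i", src_1080p, "-vf", "scale=1280:720", dst_720p],
        check=True,
    )
```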

Challenges

What has been the hardest part of the project you’ve encountered so far? The first challenge was finding a general way to unify the image data format. We decided to use patches with a fixed size (50 for low-res images and 100 for high-res images), so that images of different sizes can be preprocessed into patches with identical shapes. Deeper models may capture more complex patterns but are also harder to train and require more data; there’s a need to balance the depth and complexity of the model to avoid overfitting. VAEs are known for producing somewhat blurry results compared to deterministic models, which is less desirable for applications where clarity is crucial, as in our goal.

Insights

Are there any concrete results you can show at this point? We have finished implementing the preprocessing and the experimental models. We have also implemented a universal visualization technique that can be applied to all of the models we plan to experiment with. How is your model performing compared with expectations? During this exploration stage, our focus is on testing well-developed architectures; their performance is mediocre. We can generally see that they work for the expected scenario, but they are not very effective at improving definition.

Plan

Are you on track with your project? What do you need to dedicate more time to? What are you thinking of changing, if anything?

  1. The project's progression has been consistent, albeit partially aligned with our initial objectives. There is a clear necessity to allocate further resources and time towards enhancing the efficiency of our models.
  2. We need to explore and implement innovative data-augmentation techniques. By incorporating a wider variety of video types and conditions into our training regimen, we aim to bolster the models' robustness and ensure uniform performance across diverse visual scenarios.


posted an update

Introduction

Rendering high-definition videos or complex 3D scenarios is computationally expensive and time-consuming. It would be cheaper and smoother if the quality of the rendering could be improved with deep learning frameworks. Temporally, it might be possible to render at a lower FPS and insert intermediary frames in between; for example, render shaders at 10 FPS and create 5 extra frames between each pair of frames to achieve 60 FPS. Spatially, it is possible to keep the same FPS but render shaders and materials at lower quality and then improve the rendered definition. However, the purpose of this project is not the rendering pipeline itself. Instead, it assumes a pre-rendered stream and aims to improve the definition of the output.

Frameworks: VAE, Autoencoder, GAN, etc. One existing industrial solution (DLSS by Nvidia) works at the rendering level, aiming to render 3D objects at low definition and improve the output with deep learning super sampling. In this project the goals are to 1) maintain and, if possible, improve the accuracy of the output stream, defined as the rasterized output. In the theoretical scenario f^{-1}(f(x)) = x, meaning the super-sampled image, when undersampled again, should look like the original image. 2) The data can be acquired as higher-definition video and manually undersampled: for example, a 1080p video can be converted to 720p as training input, with the original 1080p video as the baseline.

Related work

Our project draws inspiration from Nvidia's Deep Learning Super Sampling (DLSS) technology, which operates at the rendering level to enhance the definition of 3D objects. While DLSS focuses on real-time rendering enhancement, our project shifts focus towards post-rendered video streams, using deep learning frameworks such as Variational Autoencoders (VAE), Autoencoders, and Generative Adversarial Networks (GAN) to upscale and enhance video quality. This shift represents a novel application of deep learning in video processing, aiming to fill a gap in current video upscaling solutions.

Our project extends the insights from “Frame Rate Upscaling with Deep Neural Networks”, which explores frame interpolation through deep learning techniques, notably CNNs and GANs. This foundational work critiques linear interpolation for its blurriness in 2D animation and tests various models for enhancing video frame rates.

https://paulbridger.com/posts/video-analytics-pipeline-tuning/ https://tedxiao.me/pdf/CS294_Report.pdf

Data

The dataset comprises YouTube videos, specifically selected to include both animated and real-life footage to challenge and validate our models across diverse scenarios. We will utilize 240 FPS footage from 3 distinct videos, applying preprocessing steps such as normalization to generate a suitable training set.

Methodology

Before settling on the target architecture, we decided to systematically explore several deep learning architectures to identify the optimal model for supersampling video content. This exploration focuses on Variational Autoencoders (VAEs), Autoencoders, and Generative Adversarial Networks (GANs), given their demonstrated capabilities in various image- and video-enhancement contexts. The core challenge we aim to address is preserving temporal consistency across video frames while improving their spatial resolution.

We will try to minimize the discrepancy between the upscaled video output and its original high-definition counterpart, ensuring that our model can generalize across different content types. Through iterative experimentation, we anticipate identifying a model that remains robust under different input conditions.

Metrics

Success will be measured by the quality of the upscaled videos compared to their original high-definition counterparts. Metrics such as Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) will be used to quantitatively assess video quality. For upscaling, our base goal is to achieve perceptually noticeable improvements in video quality on standard datasets; the target goal is to match or exceed the quality improvements offered by existing upscaling solutions like DLSS, without requiring integration into the rendering pipeline; and the stretch goal is to develop a model that can be applied in real time to various video streams, including live broadcasts. For frame interpolation, our base goal is to outperform linear interpolation methods (the baseline model), our target goal is to match the performance of the state of the art, and our stretch goal is to surpass current methods in terms of both quality and efficiency for a wide range of video types.
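A minimal sketch of these metrics, assuming scikit-image is available; `upscaled` and `reference` are assumed to be same-sized RGB frames as numpy arrays.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(upscaled, reference):
    # Compare an upscaled frame against its high-definition reference.
    psnr = peak_signal_noise_ratio(reference, upscaled)
    ssim = structural_similarity(reference, upscaled, channel_axis=-1)
    return psnr, ssim
```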

Ethics

Broader Societal Issues

The enhancement of video frame rates touches on several societal issues, including accessibility and misinformation. Higher frame rates can significantly improve the viewing experience for all audiences, including those with visual impairments or seizure sensitivity who may benefit from smoother transitions in video content. However, the technology’s ability to generate realistic video frames can also be misused to create deepfakes, potentially exacerbating problems related to misinformation and privacy violations.

Why is Deep Learning a good approach to this problem?

This is because Deep Learning is good at deciphering complex patterns within extensive datasets. This capacity is particularly beneficial for enhancing the quality of videos after they have been rendered. Utilizing Deep Learning would effectively obviate the need for more sophisticated rendering hardware or advanced rendering techniques, which are usually costly.

Division of labor

  • Shukai Ni: Data preprocessing, Model architecture, Parameter tuning
  • Xianyang Xie: Data generation, Data preprocessing, Parameter tuning
  • Haibo Li: Data collection, Data generation, Model architecture
