Enhancing Video Definition through Deep Learning

Haibo Li, Shukai Ni, Xianyang Xie
Introduction

In today's digital world, high-quality videos and images are essential across fields ranging from healthcare and satellite imagery to social media and online content. However, many videos and images suffer from low resolution due to limitations in imaging devices or the conditions under which they were captured. Enhancing the resolution of these images can significantly improve their clarity and usability.
Super-resolution is a technique that converts low-resolution images into higher-resolution versions by filling in extra pixels. Traditional methods such as simply stretching the image often produce blurry results. Advances in deep learning, particularly through structures known as autoencoders and U-Nets, have become a game-changer in improving image resolution.
This project explores the possibilities of autoencoders and integrates insights from the U-Net architecture, commonly used in image-processing tasks such as segmentation and reconstruction. Our model reverses the traditional U-Net structure: it starts by expanding image dimensions and then contracts them to the desired high resolution. This lets the model progressively enhance image clarity, ensuring the final image is usable and realistic. This approach represents a significant advancement in the application of deep learning for image quality enhancement.

Methodology

We use FFmpeg to split the videos into frames. Then, in the ImageProcessor step, given the patch size, we crop each frame so that its width and height are exact integer multiples of the patch size, yielding a clean grid of patches. A sketch of these two steps appears below.
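The following is a minimal sketch of the frame extraction and patch cropping described above; file paths, function names, and the frame-naming pattern are illustrative assumptions rather than the project's actual code.

```python
import subprocess
import numpy as np
from PIL import Image

def extract_frames(video_path, out_dir):
    # Use FFmpeg to split the video into numbered PNG frames.
    subprocess.run(
        ["ffmpeg", "-i", video_path, f"{out_dir}/frame_%05d.png"],
        check=True,
    )

def crop_to_patches(image_path, patch=50):
    # Crop the frame so its width and height are exact multiples of the
    # patch size, then cut it into a row-major grid of patch x patch tiles.
    img = np.asarray(Image.open(image_path))
    h = img.shape[0] - img.shape[0] % patch
    w = img.shape[1] - img.shape[1] % patch
    img = img[:h, :w]
    return [
        img[r:r + patch, c:c + patch]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]
```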
In the ImagesStitcher step, we reassemble the patches into an image by positioning each patch at its original grid coordinates, ensuring seamless alignment and uniformity. After this step, we use FFmpeg again to combine the frames back into a video, as sketched below.
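A minimal sketch of the stitching and re-encoding steps; the function names, the high-res patch size of 100, and the frame rate are assumptions.

```python
import subprocess
import numpy as np

def stitch_patches(patches, rows, cols, patch=100):
    # patches is a row-major list of (patch, patch, 3) arrays; each one
    # is written back at its original grid coordinates.
    frame = np.zeros((rows * patch, cols * patch, 3), dtype=np.uint8)
    for i, p in enumerate(patches):
        r, c = divmod(i, cols)
        frame[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = p
    return frame

def frames_to_video(frame_dir, out_path, fps=30):
    # Re-encode the numbered frames into a video with FFmpeg.
    subprocess.run(
        ["ffmpeg", "-framerate", str(fps), "-i",
         f"{frame_dir}/frame_%05d.png", "-pix_fmt", "yuv420p", out_path],
        check=True,
    )
```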
The Super Resolution Autoencoder is structured into two primary components: the encoder and the decoder. The encoder is composed of several convolutional layers designed to extract and compress significant features from the input images. These layers use "Conv2D" with progressively increasing filters (64, 128, 256). Each convolutional layer is followed by a "MaxPool2D" layer that reduces spatial dimensions, thereby enlarging the receptive field of the filters and enhancing feature capture. Conversely, the decoder reconstructs the high-resolution output from the encoded representation, effectively mirroring the encoder architecture in reverse. It uses "UpSampling2D" layers to refine these features so that the output matches the dimensions of the target high-resolution images. The inclusion of "Add" layers enables a residual learning approach by combining outputs from corresponding encoder and decoder layers, which aids in the recovery of fine image details.
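A hedged Keras sketch of this architecture follows. Only the 64/128/256 filter progression, the MaxPool2D/UpSampling2D pairing, and the Add skip connections come from the description above; the kernel sizes, layer counts, and the (100, 100, 3) patch shape are assumptions.

```python
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(100, 100, 3))

# Encoder: extract and compress features while shrinking spatial dims.
e1 = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
p1 = layers.MaxPool2D()(e1)                      # 100 -> 50
e2 = layers.Conv2D(128, 3, padding="same", activation="relu")(p1)
p2 = layers.MaxPool2D()(e2)                      # 50 -> 25
bottleneck = layers.Conv2D(256, 3, padding="same", activation="relu")(p2)

# Decoder: mirror the encoder, upsampling back to the target size and
# adding the corresponding encoder features (residual skip connections).
u1 = layers.UpSampling2D()(bottleneck)           # 25 -> 50
d1 = layers.Conv2D(128, 3, padding="same", activation="relu")(u1)
d1 = layers.Add()([d1, e2])
u2 = layers.UpSampling2D()(d1)                   # 50 -> 100
d2 = layers.Conv2D(64, 3, padding="same", activation="relu")(u2)
d2 = layers.Add()([d2, e1])
out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(d2)

model = Model(inp, out)
model.compile(optimizer="adam", loss="mse")
```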
Apart from the encoder-decoder structure, we borrowed insights from an architecture commonly adopted in image-processing tasks such as segmentation, detection, and reconstruction. U-Net consists of a Contracting Path (Encoder) and an Expansive Path (Decoder), shaped like the letter U. In our experiment we reverse this shape, first expanding the image dimensions and then contracting them to the target dimensions. Since the result is larger than the input, the contracting path takes fewer layers than the expanding path. For instance, one of our trained models uses the following basic building block: Conv2D, ReLU, Conv2D, ReLU, Conv2DTranspose. With stride-1 Conv2D layers and a stride-2 Conv2DTranspose, the unit effectively enlarges the image by a factor of 2. Likewise, by chaining several ×2 units and ÷2 units, we can derive the targeted scaling factor.
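A sketch of these scaling units: the Conv2D/ReLU/Conv2D/ReLU/Conv2DTranspose pattern and the stride choices come from the description above, while the filter counts and the exact form of the ÷2 unit (a stride-2 Conv2D here) are our assumptions.

```python
from tensorflow.keras import layers, Model

def up_x2(x, filters=64):
    # Two stride-1 convolutions, then a stride-2 transposed convolution
    # that doubles the spatial dimensions.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)

def down_half(x, filters=64):
    # A stride-2 convolution that halves the spatial dimensions.
    return layers.Conv2D(filters, 3, strides=2, padding="same",
                         activation="relu")(x)

# Example: chaining x2, x2, /2 units yields a net 2x scaling factor.
inp = layers.Input(shape=(50, 50, 3))
x = up_x2(inp)          # 50 -> 100
x = up_x2(x)            # 100 -> 200
x = down_half(x)        # 200 -> 100
out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)
model = Model(inp, out)
```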
Results

For the Autoencoder architecture, the encoder successfully captured significant features from input images, leveraging progressively increasing filters and spatial-dimension reduction to extract and compress essential image characteristics. However, the decoder was not effective at constructing details beyond those in the input: it recovered lower-resolution details well but could not recover higher-resolution details unseen in the inputs.
The reverse U-Net structure, on the other hand, achieved the expected results: we effectively expanded and contracted image dimensions to reach the desired scaling factor. The basic building blocks of Conv2D, ReLU, and Conv2DTranspose enabled precise control over image enlargement and reduction. We tested 2x and 4x blurred images with training sets ranging from 200 to 1000 images, and the model converged swiftly, reaching a validation MSE loss below 0.003 in fewer than 5 epochs. Through strategic chaining of scaling units, we could flexibly change the scaling factor and trade off computing power against scaling efficiency.
Admittedly, a lower loss does not guarantee a visually similar image. We therefore prepared several out-of-sample images and asked volunteers, in a double-blind setting, to identify which image was the input. In all cases, they identified the processed image as the one with higher 'resolution'. One example pair was a video screenshot, which suits the intended application of this model; the output was visibly sharper in this realistic setting.
Challenges

Our first challenge was finding a general solution to unify the image data format. We decided to use fixed-size patches (50 pixels for low-res images and 100 pixels for high-res images) so that images of different sizes could be preprocessed into patches of identical shape. In the ImageStitcher step, accurately aligning and stitching the patches back into a single image takes a lot of time. Additionally, once the patches are stitched into an image, saving it also consumes a considerable amount of time, especially for high-resolution images.
Deeper models may capture more complex patterns but are also harder to train and require more data. There’s a need to balance the depth and complexity of the model to avoid overfitting.
Even though our model does output the desired higher resolution, visible boundaries appear between patches when we stitch them together. We have considered several possible solutions. One is to re-patch the images at a different size and average the pixel values along the boundaries to smooth out the visible lines; however, this requires repeated processing through the entire model pipeline, which is extremely time-consuming. The other is to rework the patching process to use overlapping patches, so that the pixels on boundaries are trained multiple times; this can also considerably increase the time complexity of training, since there are far more patches to process. A sketch of the overlap-and-average idea follows.
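This is our illustration of the proposed overlap fix, not code from the project: patches are extracted with a stride smaller than the patch size, and model outputs are averaged wherever they overlap.

```python
import numpy as np

def blend_overlapping(pred_patches, coords, out_shape, patch=100):
    # pred_patches: list of (patch, patch, 3) model outputs;
    # coords: list of their (row, col) top-left positions in the frame.
    acc = np.zeros(out_shape, dtype=np.float32)
    weight = np.zeros(out_shape[:2] + (1,), dtype=np.float32)
    for p, (r, c) in zip(pred_patches, coords):
        # Accumulate predictions and count how many patches cover each pixel.
        acc[r:r + patch, c:c + patch] += p
        weight[r:r + patch, c:c + patch] += 1.0
    # Average overlapping predictions to smooth out patch boundaries.
    return (acc / np.maximum(weight, 1.0)).astype(np.uint8)
```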
Reflection

Reflecting on this project, we feel that while we met some of the goals we set earlier, there are still multiple areas for improvement. Our final model successfully enhances image resolution while preserving color accuracy. Hence, our base goal and target goal were met with confidence: not only does our model outperform the baseline model, but it also matches the performance of state-of-the-art super-resolution models. In the exploratory phase, our autoencoder model failed to predict meaningful pixels, outputting only grey and black boxes. Our approach then evolved from traditional autoencoder architectures to a reverse U-Net structure, which was pivotal for manipulating image dimensions effectively.
If we were to redo the project, we would focus more on the early stages of data handling and model setup, particularly experimenting with different patching techniques. Given more computational resources, we would also tune model parameters more thoroughly and explore various network architectures.
The project underscored the importance of choosing the right data-handling techniques and model architecture, and revealed the complexities of training deep learning models for high-fidelity outputs. It has been an insightful experience, teaching us about the practical challenges of applying deep learning to real-world image-processing problems.
Built With
- autoencoder
- gan
- vae