LipSink

The Story

I have always wanted to be an animator, but when I tried it for the first time it took forever just to create a short video. For a 10-minute video at 24 frames per second you would have to draw the mouth 14,400 times! So we wanted to make a simple application that helps with lip syncing in animation. The program is just a basic version, and we hope to keep working on it after the Hackathon is over.

What Does It Do?

Here is the overview: you can record through your webcam and the program will perform live face detection and overlay a series of dots where the facial landmarks are. When you are done recording it renders a mouth from the facial landmarks and creates a video file with it.

Upon booting up the program you are presented with an extremely simple user interface. Video manipulation applications sometimes give you so many options that using them feels overwhelming. Well, not with our program! The main page has only two controls: a button that starts the recording and a checkbox that toggles the face detection overlay. After starting the recording you can stop it by hitting the same button again.

When you are done recording, another window pops up with rendering settings (which are also very simple). It lets you change the width, height, frame rate, and export directory of the video file. The webcam records at 24 frames per second, so changing the export frame rate will speed up or slow down the animated mouth.

After you are done, hit the export button and watch the progress bar fill as it renders the video. It also shows an estimated time until completion so you can schedule your time efficiently. The estimate is calculated as frames left to render * 44 milliseconds, where 44 ms appears to be the average time to render a single frame. Since one frame of a 24 frames per second recording lasts about 41.7 ms, it takes about as long to render the video as it does to record it.
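As a rough illustration, the completion estimate can be sketched in a few lines of Java. The class and method names here (RenderEstimate, remaining, format) are made up for this example and are not taken from the project; only the 44 ms per frame figure comes from our measurements.

```java
import java.time.Duration;

/** Sketch of the render-time estimate: frames left * 44 ms. Names are hypothetical. */
final class RenderEstimate {
    // Average time to render one frame, as observed during the hackathon.
    static final long MS_PER_FRAME = 44;

    /** Estimated time remaining for the frames that have not been rendered yet. */
    static Duration remaining(long totalFrames, long framesRendered) {
        long framesLeft = Math.max(0, totalFrames - framesRendered);
        return Duration.ofMillis(framesLeft * MS_PER_FRAME);
    }

    /** Formats a duration as m:ss for display next to the progress bar. */
    static String format(Duration d) {
        long seconds = d.getSeconds();
        return String.format("%d:%02d", seconds / 60, seconds % 60);
    }
}
```

For a 10 second recording (240 frames at 24 frames per second) this gives 10,560 ms, which matches the roughly 1:1 ratio between recording time and render time.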

Tools

We used:

JavaFX and SceneBuilder to make the user interface and IntelliJ for the IDE.
JavaCV to record video.
Java Media Framework for recording audio and writing video to file.
Haar cascade classifier to detect faces.
LBF Landmarks for finding mouth landmark points.
Gson for saving and loading configurations to file.
Marvin Framework for image manipulation.
Apache Commons Math Library for Spline Interpolation.

Problems We Overcame

There were a number of issues we ran into but we managed to conquer them!

One simple yet somewhat difficult problem was how to connect the dots. The facial detection program gives us 18 points where the mouth is, but how do you make nice smooth lines that connect those dots? The basic solution is to draw straight lines between them, but that gives a jagged, unrefined feel. The complicated solution is to do UV mapping from an artist's drawing of a mouth to the video, but we had neither the time nor the knowledge to do that. The third solution finds a nice middle ground between the other two in terms of complexity. We were advised to check out something called Spline Interpolation. I personally had never heard of it before, but apparently it is a method of curve fitting. After downloading the Apache Commons Math Library we were able to separate the points into four functions (two for each lip) and connect them with nice smooth lines.
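To give a feel for what the interpolation does, here is a dependency-free sketch of a natural cubic spline in plain Java. This is a stand-in and not our actual code: the project used the Apache Commons Math Library for this step, and the class name LipSpline is invented for the example. It assumes the points of one lip edge are sorted so that x is strictly increasing, which is why the mouth has to be split into separate functions.

```java
/** Natural cubic spline through points with strictly increasing x (hand-rolled sketch). */
final class LipSpline {
    private final double[] x;
    private final double[] y;
    private final double[] m; // second derivative of the curve at each point

    LipSpline(double[] xs, double[] ys) {
        int n = xs.length;
        if (n < 3 || ys.length != n) {
            throw new IllegalArgumentException("need at least 3 points and matching arrays");
        }
        for (int i = 1; i < n; i++) {
            if (xs[i] <= xs[i - 1]) {
                throw new IllegalArgumentException("x values must be strictly increasing");
            }
        }
        x = xs.clone();
        y = ys.clone();
        m = new double[n]; // natural spline: m[0] and m[n - 1] stay 0

        // Forward sweep of the Thomas algorithm on the tridiagonal system.
        double[] c = new double[n];
        double[] d = new double[n];
        for (int i = 1; i < n - 1; i++) {
            double hPrev = x[i] - x[i - 1];
            double hNext = x[i + 1] - x[i];
            double rhs = 6 * ((y[i + 1] - y[i]) / hNext - (y[i] - y[i - 1]) / hPrev);
            double denom = 2 * (hPrev + hNext) - hPrev * c[i - 1];
            c[i] = hNext / denom;
            d[i] = (rhs - hPrev * d[i - 1]) / denom;
        }
        // Back substitution for the interior second derivatives.
        for (int i = n - 2; i >= 1; i--) {
            m[i] = d[i] - c[i] * m[i + 1];
        }
    }

    /** Evaluates the smooth curve at t, which must lie between the first and last x. */
    double value(double t) {
        int last = x.length - 1;
        if (t < x[0] || t > x[last]) {
            throw new IllegalArgumentException("t outside the range of the points");
        }
        int i = 0;
        while (i < last - 1 && t > x[i + 1]) {
            i++; // find the segment [x[i], x[i + 1]] that contains t
        }
        double h = x[i + 1] - x[i];
        double a = x[i + 1] - t;
        double b = t - x[i];
        return (m[i] * a * a * a + m[i + 1] * b * b * b) / (6 * h)
                + (y[i] / h - m[i] * h / 6) * a
                + (y[i + 1] / h - m[i + 1] * h / 6) * b;
    }
}
```

To draw one lip edge you would build a LipSpline from its landmark points and then evaluate value(t) at every pixel column between the first and last point, connecting the results. The curve passes exactly through each landmark, so the mouth shape is preserved while the corners between points are rounded off.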

Another problem in the render process was how to color the lips. Using a boundary fill algorithm, we first colored the bottom lip and the top lip separately and then colored the inside of the mouth. However, this led to many glitchy looking frames when the mouth was closed, so instead we colored the whole mouth red and then drew the interior lines on top of it.
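For reference, a boundary fill can be sketched as below. This is a generic stack-based, 4-connected version on a plain int grid, not the code from the project (the project worked on real images, and the names here are invented). It also hints at why a closed mouth may be troublesome: when the outlines touch, there may be no interior pixel left to start from. That explanation is a guess on our part rather than something we verified.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Generic boundary fill on a pixel grid indexed as pixels[y][x]. Names are hypothetical. */
final class BoundaryFill {
    /** Colors the region around (startX, startY) with fill, stopping at boundary pixels. */
    static void fill(int[][] pixels, int startX, int startY, int boundary, int fill) {
        if (pixels.length == 0 || pixels[0].length == 0) {
            return; // nothing to color
        }
        int height = pixels.length;
        int width = pixels[0].length;
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[] {startX, startY});
        while (!stack.isEmpty()) {
            int[] p = stack.pop();
            int px = p[0];
            int py = p[1];
            if (px < 0 || py < 0 || px >= width || py >= height) {
                continue; // off the image
            }
            int color = pixels[py][px];
            if (color == boundary || color == fill) {
                continue; // hit the outline or an already colored pixel
            }
            pixels[py][px] = fill;
            // Visit the four direct neighbours.
            stack.push(new int[] {px + 1, py});
            stack.push(new int[] {px - 1, py});
            stack.push(new int[] {px, py + 1});
            stack.push(new int[] {px, py - 1});
        }
    }
}
```

If the start pixel is itself on the outline, the loop exits immediately and nothing gets colored, and if the outline has a gap the color leaks out into the rest of the frame. Coloring the whole mouth first and drawing the interior lines afterwards avoids having to pick a start pixel inside each lip.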

Normally I am not super concerned about optimizing my programs. Sure, I consider the big-O of algorithms, but beyond that I just figure the program takes however long it takes. Readability and good program structure are more important than speed. However, we ran into a bit of a problem while recording. At 24 frames per second the program can take no more than 41 2/3 milliseconds (1000 ms / 24) to detect the face, overlay the points, and display the image. If it takes any longer, the recording lags and the animated mouth looks choppy. Unfortunately it was taking around 130 ms per frame. Not good. Some debugging and timing reports made it clear that the majority of the time was spent detecting the face (I should have kinda figured that one out on my own). Imposing a minimum face size improved the performance dramatically, since the detector no longer had to look for faces smaller than the threshold.
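The frame budget arithmetic, and the kind of timing report that pointed us at face detection, can be sketched like this. The helper names are made up for the example, and the Runnable is just a placeholder for whichever stage is being measured (detection, overlay, display).

```java
/** Sketch of per-frame budget checks. Names are hypothetical, not from the project. */
final class FrameBudget {
    /** Time available per frame in milliseconds, e.g. 1000 / 24 = 41.67 ms. */
    static double budgetMillis(double framesPerSecond) {
        return 1000.0 / framesPerSecond;
    }

    /** Runs one stage of the pipeline and returns how long it took in milliseconds. */
    static double timeMillis(Runnable stage) {
        long start = System.nanoTime();
        stage.run();
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    /** True if a frame that took elapsedMillis keeps up with the recording frame rate. */
    static boolean fitsBudget(double elapsedMillis, double framesPerSecond) {
        return elapsedMillis <= budgetMillis(framesPerSecond);
    }
}
```

With these numbers, the original 130 ms per frame is a little over three times the 41.67 ms budget, so the recording could only keep up with roughly 7 to 8 frames per second before the minimum face size was added.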

Problems We Didn't Overcome

No shame in it. When you only have 24 hours there is only so much you can do. Either we ran out of time or found out about the problem after the time was up.

The export directory feature in the render settings doesn't actually work. For some reason, rendering a video anywhere outside the project's local folder causes an exception to be thrown. The fix isn't too hard: export to a temporary local folder and then move the file once it has been fully written.
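A minimal sketch of that workaround using java.nio.file is shown below. We have not tried it against the actual exception, so treat it as an outline. The Renderer interface is a hypothetical stand-in for the project's render step, and the folder and file names are whatever the caller passes in.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Render into a local work folder first, then move the finished file. Names are hypothetical. */
final class ExportMover {
    /** Stand-in for the render step: writes the finished video to the given file. */
    interface Renderer {
        void renderTo(Path file) throws IOException;
    }

    static Path exportVia(Path localWorkDir, Path exportDir, String fileName, Renderer renderer)
            throws IOException {
        Files.createDirectories(localWorkDir);
        Files.createDirectories(exportDir);
        Path temp = Files.createTempFile(localWorkDir, "render-", ".tmp");
        try {
            renderer.renderTo(temp); // render where it is known to work
            // Only move the file once it has been written completely.
            return Files.move(temp, exportDir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(temp); // clean up if rendering or moving failed
        }
    }
}
```

A side benefit of this approach is that the export directory never contains a half-written video if the render is cancelled or crashes partway through.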

The mouth is pretty jittery. We attempted to stabilize the frames by averaging each point's position with its position in the 2 frames before it. That helped, but it is still far from perfect, and if you average over any more frames than that you can't read the lips anymore. We drew the points with respect to the center of the face, so one hypothesized solution is to draw them with respect to the center of the mouth instead.
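The averaging step is essentially a moving average over the last three frames. Here is a small sketch of the idea, with invented names and with each landmark stored as an {x, y} pair; the real project's data structures differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Moving average over the most recent frames of landmarks. Names are hypothetical. */
final class LandmarkSmoother {
    private final int window; // 3 = current frame plus the 2 before it
    private final Deque<double[][]> history = new ArrayDeque<>();

    LandmarkSmoother(int window) {
        if (window < 1) {
            throw new IllegalArgumentException("window must be at least 1");
        }
        this.window = window;
    }

    /** Takes the raw points of the current frame ({x, y} each) and returns the smoothed points. */
    double[][] smooth(double[][] points) {
        double[][] copy = new double[points.length][];
        for (int i = 0; i < points.length; i++) {
            copy[i] = points[i].clone();
        }
        if (!history.isEmpty() && history.peekLast().length != copy.length) {
            history.clear(); // landmark count changed, start over
        }
        history.addLast(copy);
        if (history.size() > window) {
            history.removeFirst(); // forget frames older than the window
        }
        double[][] out = new double[copy.length][2];
        for (double[][] frame : history) {
            for (int i = 0; i < frame.length; i++) {
                out[i][0] += frame[i][0];
                out[i][1] += frame[i][1];
            }
        }
        for (double[] p : out) {
            p[0] /= history.size();
            p[1] /= history.size();
        }
        return out;
    }
}
```

The trade-off we ran into is visible in the window parameter: a larger window removes more jitter but also blurs fast mouth movements across frames, which is why the lips become unreadable beyond 3 frames.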

By far the worst problem was merging the audio and video. To begin with, connecting to the webcam takes time, so the audio is about a second and a half longer than the video. Since the video is rendered after the recording is finished, the audio and video have to be written to file separately. JMF has a way to merge .WAV and .MOV files, but it seems to only work on 32-bit machines, and as we wrote this on a 64-bit computer, that was a big problem.

The Future and Beyond

I really enjoyed working on this project and hopefully we can keep developing it. Here are some ideas for future features:

Add multiple mouths that can be selected.
Work on optimizing the render process. A 1:1 ratio of render time to recording time is too long.
Allow the user to give an image and a position. The program will render the mouth moving over top of that image.
We already have the nose, eye, and face points so might as well render out the rest of them.