Inspiration
While working on video processing, we realized the need to convert hard subtitles (hardsubs, burned into the video frames) into soft subtitles (separate subtitle files), especially for educational videos, online courses, and documentaries. Users often have only a hardsubbed video with no accompanying subtitle file, which makes it difficult to edit the subtitles, translate them into other languages, or reuse them. Existing subtitle-extraction services are quite expensive, while many groups have only a minimal budget. This motivated us to build the most optimized, cost-effective system we could.

What it does
Our system provides the following main functions:

Extract subtitles from hardsubbed video using automatic OCR.

Create a soft subtitle file (SRT or VTT) that can be edited or translated into other languages.

Support common video formats (MP4, MKV, AVI).

Keep operating costs low enough for individuals, educational centers, and small translation groups.

Integrate into a larger pipeline to process videos at scale.

How we built it
Main language: Python

Libraries: ffmpeg-python, OpenCV, Tesseract OCR, pysrt

Processing pipeline:

Use FFmpeg to extract frames at the required sampling FPS.

Preprocess the frames with OpenCV to increase contrast and threshold the subtitle area.

Use Tesseract OCR to recognize the text in the cropped subtitle region.

Write the recognized text into a standard SRT file with pysrt.
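The four steps above can be condensed into the following sketch. This is a minimal illustration, not our production code: the sampling rate, file names, and output path are assumptions, and it requires ffmpeg, OpenCV, pytesseract, and pysrt to be installed (the heavy imports are kept inside the OCR function so the timing helpers work standalone).

```python
import subprocess
from pathlib import Path


def extract_frames(video: str, out_dir: str, fps: int = 2) -> None:
    """Step 1: use the ffmpeg CLI to dump frames at a fixed sampling FPS."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video, "-vf", f"fps={fps}", f"{out_dir}/frame_%05d.png"],
        check=True,
    )


def frame_seconds(index: int, fps: float) -> float:
    """Map a 1-based sampled-frame number back to its time in the video."""
    return (index - 1) / fps


def ocr_frames_to_srt(out_dir: str, srt_path: str, fps: int = 2) -> None:
    """Steps 2-4: preprocess each frame, OCR it, and write an SRT file."""
    import cv2          # illustrative: heavy deps imported where they are used
    import pysrt
    import pytesseract

    subs = pysrt.SubRipFile()
    for i, png in enumerate(sorted(Path(out_dir).glob("frame_*.png")), start=1):
        img = cv2.imread(str(png), cv2.IMREAD_GRAYSCALE)
        # Step 2: binarize with Otsu thresholding so the text stands out.
        _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Step 3: recognize text in the frame (a real pipeline crops first).
        text = pytesseract.image_to_string(binary).strip()
        if text:
            start_ms = int(frame_seconds(i, fps) * 1000)
            subs.append(pysrt.SubRipItem(
                index=len(subs) + 1,
                start=pysrt.SubRipTime(milliseconds=start_ms),
                end=pysrt.SubRipTime(milliseconds=start_ms + 1000 // fps),
                text=text,
            ))
    # Step 4: write the standard SRT file.
    subs.save(srt_path, encoding="utf-8")
```

In a real run, consecutive frames with identical OCR output would also be merged into one subtitle entry rather than emitted per frame.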

Initial deployment: runs locally on an Ubuntu 20.04 server, using a GPU or an AVX2-capable CPU to accelerate video decoding.

Challenges we ran into
OCR accuracy: on videos with unusual fonts or complex backgrounds, Tesseract's output needs further refinement.

Timing synchronization: extracting frames at the wrong FPS, or skipping frames, leads to mismatched subtitle timestamps.
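To make the timing issue concrete: each sampled frame's SRT timestamp must be derived from its index and the FPS the frames were actually extracted at. A small sketch (the function names are ours):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds using SRT's 'HH:MM:SS,mmm' notation."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def frame_timestamp(frame_index: int, sampling_fps: float) -> str:
    """Timestamp of a 1-based sampled-frame number. `sampling_fps` must be
    the rate frames were actually extracted at; any other value drifts."""
    return srt_timestamp((frame_index - 1) / sampling_fps)
```

For example, frame 121 sampled at 2 FPS maps to 00:01:00,000; if the SRT writer wrongly assumed 2.5 FPS it would emit 00:00:48,000, a 12-second mismatch after only one minute of video.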

Cloud GPU cost: although OCR itself is cheap, decoding high-resolution video requires a reasonably powerful server, so batch processing had to be optimized.

Multilingual subtitles: Tesseract defaults to English, so additional traineddata files must be installed for Vietnamese, Japanese, Chinese, etc.
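A failure mode we hit was OCR silently running without the needed traineddata. A stdlib-only check like the one below can catch this before processing starts; the tessdata path is an assumption (on Ubuntu it is typically something like /usr/share/tesseract-ocr/4.00/tessdata):

```python
from pathlib import Path


def missing_traineddata(langs, tessdata_dir):
    """Return the Tesseract language codes (e.g. 'vie', 'jpn', 'chi_sim')
    whose .traineddata file is absent from the given tessdata directory."""
    d = Path(tessdata_dir)
    return [lang for lang in langs if not (d / f"{lang}.traineddata").is_file()]
```

Once all files are present, the codes are passed to Tesseract joined with '+', e.g. lang="vie+jpn+chi_sim" in pytesseract.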

Accomplishments that we're proud of
Built a working subtitle-extraction system at low cost (< 0.02 USD per minute of video).

Average OCR accuracy of 85-90% on good-quality hardsubbed video.

Wrote a module that is easy to integrate into an existing pipeline to process customers' videos.

Automatic detection and cropping of the subtitle area, reducing manual configuration effort.
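One common way to locate a horizontal subtitle band, sketched here with NumPy, is a row-projection profile over a binarized frame; the threshold ratio is an illustrative assumption, not our tuned value:

```python
import numpy as np


def subtitle_rows(binary: np.ndarray, min_ratio: float = 0.05):
    """Given a binarized frame (text pixels == 255), return the (top, bottom)
    row span covering all text-like rows, or None if nothing qualifies.

    A row counts as text-like when its fraction of white pixels exceeds
    `min_ratio`; subtitles show up as a dense band of such rows.
    """
    h, w = binary.shape
    ratio = (binary > 0).sum(axis=1) / w   # white-pixel ratio per row
    hot = np.flatnonzero(ratio > min_ratio)
    if hot.size == 0:
        return None
    return int(hot[0]), int(hot[-1])
```

The detected span (plus a small margin) is then cropped with frame[top:bottom + 1] before OCR; in practice the search can be restricted to the bottom third of the frame, where subtitles usually sit.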

What we learned
How to optimize image pre-processing to improve OCR accuracy.

The importance of frame sampling rate and video decoding performance in batch processing.

A deeper understanding of Tesseract training and fine-tuning for specific font types.

How to build a system that scales and deploys on a VPS or in the cloud on a minimal budget.

What's next for Build The Hardsub System
Integrate speech-to-text AI models to generate subtitles when the video has no hardsub.

Deploy a SaaS API so users can upload a video and automatically receive an SRT file, with low-cost billing.

Build a web-based UI that allows quick subtitle editing right in the browser.

Add a module that translates subtitles into other languages using Google Translate or an LLM.

Fine-tune Tesseract for subtitle fonts popular in Vietnam and Japan to raise accuracy above 95%.
