Inspiration
Research papers on arXiv are full of great ideas, but they are long, technical, and usually in English. Most learners and creators do not have hours to read a full PDF and then script, record, and edit an explainer video.
We wanted a pipeline that respects the source paper while making it approachable in Bahasa Indonesia, in a format people already consume on their phones: vertical video with a real conversation, not a single robotic voice reading an abstract.
That is why we built paper2video: turn an arXiv paper into a short educational video starring Pak Nam (mentor) and Zaba (curious beginner), with automation from fetch to YouTube.
What it does
paper2video takes an arXiv ID (or picks the newest paper from RSS) and runs an end-to-end pipeline:
- Download metadata, PDF, and extracted text
- Summarize the paper (problem, method, findings, limitations) with RAG over the PDF
- Verify the summary quality and retry if needed
- Generate a two-person dialog script in Indonesian
- Render a 1080x1920 video with subtitles, character sprites, wiggle animation, and per-line TTS
- Optionally upload to YouTube with generated metadata
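The stages above chain together, with the verification step gating the expensive render. A minimal sketch of that driver, including the retry loop; all function names here are illustrative, not the real paper2video module API:

```python
MAX_RETRIES = 2  # mirrors the "retry if needed" verification step

def run_pipeline(paper, summarize, verify, write_dialog, render, publish=None):
    """Hypothetical stage chain: summarize -> verify (with retries) ->
    dialog -> render -> optional publish. Stages are passed in as
    callables so each agent stays swappable."""
    summary = summarize(paper)
    retries = 0
    while not verify(summary):
        if retries == MAX_RETRIES:
            raise RuntimeError("summary failed verification after retries")
        summary = summarize(paper)  # re-summarize and try again
        retries += 1
    video = render(write_dialog(summary))
    return publish(video) if publish else video
```

Keeping the stages as plain callables is what lets the Verifier sit between summary and dialog without the later stages knowing about retries.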
You can trigger everything from Telegram (/e2e latest upload) and get live progress, the dialog script, and the final MP4 in chat.
On our SumoPod VPS, a morning job processes three new papers at 07:00 WIB. Uploads are spread at 09:00, 14:00, and 20:00 so the channel publishes steadily through the day.
How we built it
Stack: Python, ffmpeg, ElevenLabs TTS (edge-tts fallback), OpenCode with multiple LLMs for summary and dialog, YouTube Data API, python-telegram-bot.
Multi-agent design: Roles are split like a small team. Extractor (RAG and summary), Verifier (QA and retry), Writer (dialog), Director (render and TTS), Publisher (YouTube). Steps are logged to agent-run.jsonl for a clear audit trail.
OpenClaw skills map to each stage: paper-extract, paper-verify, dialog-script, video-render, youtube-publish, orchestrate-paper.
Video: Layered composite (background per speaker, active character sprite, top subtitle). Wiggle via an ffmpeg overlay expression for fast renders. TTS is generated per subtitle chunk so the audio matches the on-screen text.
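The wiggle trick is to let ffmpeg's overlay filter move the sprite with a time-based expression instead of compositing thousands of PNG frames. A sketch of building that command; positions, amplitude, and frequency are illustrative defaults, not the project's actual values:

```python
def wiggle_overlay_cmd(bg, sprite, out, amp=12, freq=2.0):
    """Build an ffmpeg command that overlays a character sprite with a
    sinusoidal wiggle. The overlay filter's x/y expressions can use the
    timestamp variable t, so the motion is computed at render time."""
    x = f"100+{amp}*sin(2*PI*{freq}*t)"   # horizontal wiggle around x=100
    y = f"800+{amp}*cos(2*PI*{freq}*t)"   # vertical wiggle around y=800
    filt = f"[0:v][1:v]overlay=x='{x}':y='{y}'"
    return ["ffmpeg", "-y", "-i", bg, "-i", sprite, "-filter_complex", filt, out]
```

Because the motion lives in a filter expression, a full 1080x1920 clip renders in one ffmpeg pass.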
Ops: run_e2e.py for one-command demos. Cron jobs on the VPS via morning_job.py and upload_job.py. RSS discovery with rate limiting for arXiv.
Challenges we ran into
arXiv rate limits: Switched to RSS discovery and global request spacing so batch jobs do not get blocked.
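Global request spacing can be as simple as one shared object that every fetch goes through. A minimal sketch; the 3-second default mirrors arXiv's commonly cited politeness guideline, though the actual interval used in the jobs may differ:

```python
import time

class RequestSpacer:
    """Enforce a minimum gap between outbound requests so batch jobs
    never hammer arXiv, no matter how many papers are queued."""
    def __init__(self, min_interval=3.0):
        self.min_interval = min_interval
        self._last = None  # no request made yet

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = time.monotonic()
        if self._last is not None:
            gap = now - self._last
            if gap < self.min_interval:
                time.sleep(self.min_interval - gap)
        self._last = time.monotonic()
```

Calling `spacer.wait()` before every download centralizes the policy, so adding a new fetch path cannot accidentally bypass the spacing.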
Summary quality: Raw LLM output sometimes missed fields or drifted from the abstract. We added a Verifier agent and up to two automatic retries before dialog generation.
JSON from the LLM: LaTeX in papers produced invalid escapes. We hardened parsing and sanitization.
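The core of the hardening is that LaTeX commands like \alpha start with a backslash that JSON treats as an invalid escape. A simplified version of the sanitizer; the real one may handle more cases:

```python
import json
import re

def sanitize_json_escapes(raw):
    """Double any backslash that does not start a valid JSON escape
    ("\\ / b f n r t u), so LaTeX fragments like \\alpha survive
    json.loads. Caveat: commands whose first letter collides with a
    valid escape (e.g. \\beta -> \\b, \\times -> \\t) are left alone
    and need separate handling."""
    return re.sub(r'\\(?!["\\/bfnrtu])', r'\\\\', raw)
```

Running this before `json.loads` turned most "invalid \escape" failures into clean parses.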
YouTube metadata: Descriptions with comparison symbols were rejected as invalid HTML. We sanitize text and fall back to a minimal description on upload failure.
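A sketch of that sanitization step, assuming the failures came from angle brackets being read as HTML tags; the exact substitutions and length limit in paper2video may differ:

```python
def sanitize_description(text, limit=5000):
    """Replace angle brackets that YouTube rejects as invalid HTML and
    clamp to the description length limit. Substitutions are a
    hypothetical choice, not the project's exact mapping."""
    cleaned = text.replace("<", "(less than)").replace(">", "(greater than)")
    return cleaned[:limit]
```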
Audio vs subtitles: Reusing cached MP3s and splitting one long clip into equal time slices caused audio/subtitle mismatch. We moved to per-chunk TTS with text-hash cache invalidation.
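Text-hash invalidation means the cache key for each audio chunk is derived from the subtitle text itself, so editing a line regenerates only that chunk. A sketch; the directory layout and key format are illustrative:

```python
import hashlib
from pathlib import Path

def tts_cache_path(chunk_text, voice, cache_dir="tts_cache"):
    """Derive the cached MP3 path for one subtitle chunk from a hash of
    its text plus the voice name. Any edit to the text changes the hash,
    which invalidates the stale audio automatically."""
    digest = hashlib.sha256(f"{voice}:{chunk_text}".encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{voice}_{digest}.mp3"
```

If the returned path exists, the cached MP3 is reused; otherwise TTS is called for just that chunk.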
Telegram UX: Early versions gave no feedback until the job finished. We added instant ack, streaming progress lines, dialog preview, and video delivery in chat.
Accomplishments that we're proud of
A working pipeline from arXiv to YouTube, not just a slideshow or text summary.
The Pak Nam and Zaba format, which forces simple explanations through dialogue.
Real deployment: VPS cron plus Telegram control from a phone.
A multi-agent loop with verification and logged steps, not a single opaque script.
Practical demo papers rendered end to end (for example Attention Is All You Need and recent cs.LG and cs.CL uploads).
Fast vertical video renders using an ffmpeg-based wiggle instead of thousands of PNG frames.
What we learned
Splitting agent responsibilities makes failures easier to fix (bad summary vs bad dialog vs bad render).
Verification before the expensive video step saves time and improves trust in the content.
Ops matter as much as models: rate limits, OAuth, cache invalidation, and user visible progress on Telegram all shaped the final product.
RAG on paper.txt beats dumping the first N characters of a PDF for extraction quality.
For video, synchronizing audio per subtitle chunk is simpler and more reliable than seeking one long TTS file.
What's next for paper2video
Shorter cuts tuned for true Shorts length (under 60 seconds) where needed.
Better paper selection (ranking or user chosen topics from Telegram).
Thumbnail generation from paper title and key figure.
Multi-language dialog beyond Indonesian.
Public channel branding and consistent publishing analytics.
Optional human-in-the-loop approval in Telegram before YouTube upload.
Harden VPS monitoring and alerts when morning or upload jobs fail.
Built With
- openclaw
- python