🌌 Vyonix Studio: The Multimodal AI Data Factory
Turning "Unstructured Chaos" into "Diamond-Grade Data" with Gemini 3
💡 Inspiration
The biggest bottleneck in AI today isn't models—it's data. I watched teams spend months building "Frankenstein pipelines" just to label a simple video dataset:
- One tool for audio transcription.
- Another for object bounding boxes.
- A third to scrub PII.
- And a messy spreadsheet to stitch it all together.
I realized: Why are we treating audio, vision, and text as separate problems? Gemini 3 sees the world like we do—holistically. I built Vyonix Studio to prove that a single multimodal model can replace an entire data engineering department, collapsing months of work into minutes.
🚀 What it does
Vyonix Studio is a unified "Glass Box" for AI Data Engineering. It ingests raw, chaotic media (video/audio/text/images) and uses Gemini 3's multimodal intelligence to structure it into training-ready assets with forensic precision.
It is NOT just a passive analyzer. It is a Human-AI Collaboration System with four specialized studios:
🎙️ Audio Intelligence Studio
AI-Powered Features:
- ✅ Precision Transcription: Sub-word timestamp accuracy (HH:MM:SS.mmm format)
- ✅ Indian Accent Mastery: Handles Hinglish and regional phonetics where global models fail
- ✅ Sentiment Analysis: Visualizes pitch, mood (Joy/Anger/Sadness), and speaker shifts
- ✅ Professional TTS: Generates vocal assets using Gemini 2.5 TTS with emotional range. We don't just generate audio; we use Gemini 3 to audit the synthetic output of Gemini 2.5 TTS, ensuring every millisecond is perfectly labeled and structured
Human-in-the-Loop Features:
- ✅ Live Segment Editing: Edit transcriptions with inline text editing
- ✅ Custom Tagging: Add/modify speaker labels and timestamps
- ✅ Smart ZIP Export: Merges AI output with manual edits (zero data loss)
- ✅ JSON Export: Structured output ready for training pipelines
- ✅ Multi-File Processing: Batch upload with history tracking
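The HH:MM:SS.mmm timestamps in the exports can be derived from raw millisecond offsets with a small helper. This is an illustrative sketch, not Vyonix's actual code; the function name `msToTimestamp` is an assumption:

```typescript
// Format a millisecond offset as HH:MM:SS.mmm — the timestamp
// format used in the audio exports. Illustrative sketch only.
function msToTimestamp(totalMs: number): string {
  const ms = Math.floor(totalMs % 1000);
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}
```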
👁️ Vision Pro Studio
AI-Powered Features:
- ✅ Zero-Shot Object Detection: Detect ANY concept (e.g., "Robot Owl", "Defective Chip", "Person in Red Jacket")
- ✅ Normalized Coordinates: 0-1000 precision mapping for universal compatibility
- ✅ Confidence Scoring: Each detection includes AI confidence percentage
- ✅ Structured Video Transcription: Timestamped dialogue extraction from video
- ✅ Synthetic Video Generation: Veo 3.1 integration for creating training videos from text prompts; the generated video is then uploaded back to Gemini 3 Flash for object recognition and transcription
Human-in-the-Loop Features:
- ✅ Manual Annotation Tools: Draw custom bounding boxes with precision
- ✅ Editable Labels: Click-to-rename any object label
- ✅ Dynamic Aspect Ratio: Auto-adjusts to video dimensions (no drift!)
- ✅ Timeline Sync: Jump to any timestamp by clicking annotations
- ✅ JSON Export: HH:MM:SS formatted timestamps for easy integration
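A detection record plus a confidence filter might look like the sketch below. The field names (`label`, `box`, `confidence`) are illustrative assumptions, not the actual export schema; the box follows Gemini's `[ymin, xmin, ymax, xmax]` convention on the 0-1000 scale mentioned above:

```typescript
// Hypothetical shape of one Vision Pro detection. Field names are
// assumptions; box is [ymin, xmin, ymax, xmax] normalized to 0-1000.
interface Detection {
  label: string;
  box: [number, number, number, number];
  confidence: number; // 0-1
}

// Keep only detections at or above a confidence threshold,
// returned highest-confidence first.
function filterDetections(dets: Detection[], minConf: number): Detection[] {
  return dets
    .filter((d) => d.confidence >= minConf)
    .sort((a, b) => b.confidence - a.confidence);
}
```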
🖼️ Image Generation Studio
AI-Powered Features:
- ✅ Photorealistic Synthesis: Generate training images using Gemini 3 Pro
- ✅ Flux Model Integration: Alternative generation pipeline for edge cases
- ✅ Concept Augmentation: Create dataset variations ("same car in rain/snow/sunset")
- ✅ Batch Generation: Queue multiple image generation tasks
📝 NLP & PII Engine
AI-Powered Features:
- ✅ Named Entity Recognition: 10+ entity types (PERSON, ORG, LOC, GPE, DATE, SSN, PHONE, EMAIL, etc.)
- ✅ PII Detection: Instant identification for compliance auditing
- ✅ Sentiment Analysis: Document-level mood classification
- ✅ Text Summarization: Condensed insights from long documents
- ✅ Topic/Keyword Extraction: Automatic tagging and categorization
Human-in-the-Loop Features:
- ✅ Interactive Tagging: Click-to-tag custom entities
- ✅ Index Self-Correction: Forensic-level accuracy alignment
- ✅ Custom Entity Types: Define your own classification schema
- ✅ Bulk Export: JSON/CSV output for downstream processing
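Once entities carry character offsets, PII redaction reduces to span replacement; applying spans right-to-left keeps earlier indices valid. A minimal sketch under assumed field names (`type`, `start`, `end`), not the engine's real schema:

```typescript
// Assumed entity shape: character offsets, end exclusive.
interface Entity { type: string; start: number; end: number }

// Replace each entity span with a [TYPE] placeholder. Spans are
// applied right-to-left so earlier offsets stay valid after each
// replacement.
function redact(text: string, entities: Entity[]): string {
  const sorted = [...entities].sort((a, b) => b.start - a.start);
  let out = text;
  for (const e of sorted) {
    out = out.slice(0, e.start) + `[${e.type}]` + out.slice(e.end);
  }
  return out;
}
```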
💰 Financial Console
- ✅ Real-Time Token Tracking: See costs per request in milliseconds
- ✅ Batch API Integration: 50% cost reduction on heavy workloads
- ✅ Economic Transparency: Unit economics at your fingertips
- ✅ Cost Projections: Estimate enterprise-scale processing costs
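The projection math behind the console is simple token arithmetic. The per-million-token prices below are placeholders, not real Gemini rates, and the flat 50% batch discount mirrors the claim above:

```typescript
// Rough cost projection. Per-million-token prices are PLACEHOLDERS —
// substitute current Gemini pricing. The Batch API is modeled as a
// flat 50% discount.
function projectCostUSD(
  inputTokens: number,
  outputTokens: number,
  pricePerMInput: number,
  pricePerMOutput: number,
  useBatch: boolean,
): number {
  const base =
    (inputTokens / 1e6) * pricePerMInput +
    (outputTokens / 1e6) * pricePerMOutput;
  return useBatch ? base * 0.5 : base;
}
```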
⚙️ How I built it
Vyonix is a Next.js application deployed on Google Cloud Run, powered by the Gemini 3.0 ecosystem.
- The Brain: I used Gemini 3 Flash via the Google AI SDK for its blazing speed and accurate timestamp generation.
- The Canvas: A custom-built React Video/Audio Player that syncs with AI metadata. I had to build a custom "Coordinate Mapper" to translate Gemini's 1000x1000 coordinate space into responsive CSS (top: 23.4%, left: 11.2%).
- The Workflow:
- Upload: The user drags & drops a file.
- Review: The AI output is surfaced for "Human-in-the-Loop" edits (e.g., renaming a bounding box label).
- Batch API: Heavy jobs are sent to the Batch API to cut costs by 50%.
- Export: Data is packaged into a standard JSON/ZIP format ready for PyTorch/TensorFlow.
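The Coordinate Mapper step can be sketched as a pure function. Assuming Gemini's `[ymin, xmin, ymax, xmax]` boxes on the 0-1000 scale, the CSS percentages fall out directly (a box with `ymin=234, xmin=112` yields the `top: 23.4%, left: 11.2%` example above); the real mapper also has to handle letterboxing:

```typescript
// Map a Gemini box ([ymin, xmin, ymax, xmax], normalized to 0-1000)
// to CSS percentage strings for an absolutely positioned overlay.
// Sketch only; aspect-ratio/letterbox handling is omitted.
function boxToCss(box: [number, number, number, number]) {
  const [ymin, xmin, ymax, xmax] = box;
  const pct = (v: number) => `${v / 10}%`;
  return {
    top: pct(ymin),
    left: pct(xmin),
    width: pct(xmax - xmin),
    height: pct(ymax - ymin),
  };
}
```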
🏔️ Challenges I ran into
- The "Drifting Box" Problem: Gemini provides coordinates normalized to 0-1000. Displaying these on a responsive video player that scales with the window was a nightmare. I built a dynamic "Subject-Aware Aspect Ratio" wrapper that ensures the bounding box stays glued to the object, even if you resize the browser.
- Audio "hallucinations": Early prompts gave us narrative summaries ("A man walked in"). I needed data ("TIMESTAMP: 00:04, SPEAKER: Man, TEXT: Hello"). I refined our system prompt to enforce a strict JSON-only output schema, forcing the model to act as a structured database rather than a creative writer.
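The JSON-only fix can also be enforced client-side: validate the model's reply against the expected segment shape and reject anything narrative. A minimal sketch, assuming a `{timestamp, speaker, text}` segment schema (an illustration, not the actual Vyonix schema):

```typescript
// Assumed transcription segment shape.
interface Segment { timestamp: string; speaker: string; text: string }

// Parse a model reply and accept it only if it is an array of
// well-formed segments with MM:SS or HH:MM:SS(.mmm) timestamps.
// Anything else (e.g. a narrative paragraph) is rejected.
function parseSegments(raw: string): Segment[] | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!Array.isArray(data)) return null;
  const tsRe = /^(\d{2}:)?\d{2}:\d{2}(\.\d{3})?$/;
  for (const s of data) {
    if (
      typeof s !== "object" || s === null ||
      typeof s.timestamp !== "string" || !tsRe.test(s.timestamp) ||
      typeof s.speaker !== "string" || typeof s.text !== "string"
    ) {
      return null;
    }
  }
  return data as Segment[];
}
```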
🏆 Accomplishments that I am proud of
- Deployment: The app is live on Google Cloud Run, scaling to zero when not in use.
- Economic Viability: By integrating the Gemini Batch API, I proved I can process 2,000+ hours of video for the cost of a few coffees, making enterprise-grade labelling accessible to startups.
- The "Vibe": I achieved a premium "Glassmorphism" UI that feels like a sci-fi tool, not a boring internal dashboard.
📚 What I learnt
- "Vibe Coding" is real: I used English (prompts) as my primary compilation target. 80% of the backend logic is just... asking Gemini nicely and precisely.
- Context is King: Giving Gemini the previous 5 seconds of context improved transcription accuracy by 40%.
🚀 What's next for Vyonix Studio
- Video Intelligence 2.0: Tracking objects across frames (action recognition).
- Marketplace: A "HuggingFace for Data" where users can sell their cleaner (Vyonix-audited) datasets.
- Enterprise SSO: Integrating with corporate identity providers for secure auditing.
Built with ❤️ using Gemini 3 & Google Cloud Run
Try the Live App
Judges: please check the Testing Instructions section for the access code to start using the app.
Built With
- gemini3
- nextjs
- tailwind
- typescript