## Inspiration

We’ve all experienced "Mechanic Anxiety"—that sinking feeling when your car engine starts making a strange knocking sound, or your AC unit starts buzzing. You know you need a professional diagnosis, but you also fear the bill.

Is it a loose screw (500 PKR fix) or a catastrophic rod failure (50,000 PKR fix)?

We realized that while mechanics rely on years of ear-training to diagnose faults, Artificial Intelligence now has the ability to "see" and "hear" with superhuman precision. We wanted to build a tool that democratizes this expertise—giving every car owner, homeowner, and technician a senior diagnostic engineer in their pocket.

SonicFix was born from a simple question: What if your phone could tell you exactly what’s wrong, just by listening?

## What it does

SonicFix is a multimodal diagnostic assistant that fuses Visual and Acoustic data to identify mechanical failures in real time.

  1. Visual Context: The user snaps a photo of the machine (e.g., a car engine, a washing machine, or an industrial compressor). This grounds the AI, preventing it from guessing blindly.
  2. Acoustic Analysis: The user records the sound of the machine running.
  3. Smart Filtering: On-device models filter out background noise (speech, traffic) to ensure only mechanical sounds are analyzed.
  4. Instant Diagnosis: The app identifies the specific fault (e.g., "Worn Serpentine Belt"), its severity level, actionable repair steps, and even an estimated repair cost tailored to the local market (Pakistan, 2026).

## How we built it

We built SonicFix using a Flutter frontend for cross-platform performance and a robust Serverless Python Backend on Firebase Cloud Functions.
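The backend exposes a single HTTPS endpoint that receives the photo and the audio clip from the Flutter app. Here is a minimal sketch of that entry point, assuming the `firebase-functions` Python SDK; the handler and the `run_fusion_pipeline` helper are illustrative names, not our exact code:

```python
# Sketch of the HTTPS entry point, assuming the firebase-functions Python SDK.
# `diagnose` and `run_fusion_pipeline` are illustrative names.
from firebase_functions import https_fn

@https_fn.on_request()
def diagnose(req: https_fn.Request) -> https_fn.Response:
    """Receive the photo and audio clip from the Flutter app, run the pipeline."""
    image_bytes = req.files["image"].read()  # snapshot of the machine
    audio_bytes = req.files["audio"].read()  # recording of it running
    report_json = run_fusion_pipeline(image_bytes, audio_bytes)  # defined elsewhere
    return https_fn.Response(report_json, mimetype="application/json")
```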

The core innovation is our "Fusion Pipeline":

1. **Signal Pre-processing (The Gatekeeper):**
We integrated **YAMNet** (from TensorFlow Hub) as a first line of defense. Before wasting expensive API tokens, YAMNet analyzes the raw audio waveform to classify the sound source, making independent predictions for each of its 521 audio event classes.

```python
# YAMNet confirms whether the audio is 'Mechanical' or 'Silence/Speech'
# before any API tokens are spent on it.
if yamnet_data["primary_sound"] in NON_MECHANICAL_BLACKLIST:
    flag_for_review()
else:
    proceed_to_fusion()
```
2. **Multimodal Fusion (The Brain):**
We use the **Gemini 3 API** to perform true multimodal reasoning. We don't just send text; we inject the **Image**, the **Audio**, and the **YAMNet Classification Tag** into a single prompt.
This allows Gemini 3 to correlate visual signs of wear (e.g., rust on a pulley) with specific audio frequencies (e.g., a high-pitched squeal), achieving accuracy that single-mode models cannot match.

3. **Resilient Architecture (The Fallback):**
Since Gemini 3 is in Preview, we engineered a production-grade **Fallback Cascade**. Our system prioritizes `gemini-3-flash-preview` for its reasoning power but automatically degrades to `gemini-3.0-pro` or `gemini-1.5-flash` if the preview API returns a `503 Service Unavailable` error, as sketched below.
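For illustration, here is a minimal sketch of how the fusion call (step 2) and the cascade (step 3) fit together, assuming the `google-genai` Python SDK; the prompt wording and the `diagnose_with_fallback` helper are simplified stand-ins for our production code:

```python
# Fusion + fallback sketch, assuming the google-genai SDK (pip install google-genai).
# Prompt wording and helper names are simplified stand-ins.
from google import genai
from google.genai import errors, types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Cascade order from the write-up: preview first, stable models after.
MODEL_CASCADE = ["gemini-3-flash-preview", "gemini-3.0-pro", "gemini-1.5-flash"]

def diagnose_with_fallback(image_bytes: bytes, audio_bytes: bytes, yamnet_tag: str) -> str:
    """Send image + audio + the YAMNet tag to the first model that answers."""
    contents = [
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/wav"),
        f"An on-device classifier labeled this sound as: {yamnet_tag}. "
        "Correlate the visual evidence with the audio and diagnose the fault.",
    ]
    last_error = None
    for model in MODEL_CASCADE:
        try:
            return client.models.generate_content(model=model, contents=contents).text
        except errors.APIError as exc:
            last_error = exc
            if exc.code == 503:  # preview tier overloaded -> degrade gracefully
                continue
            raise  # anything else is a real bug; surface it
    raise last_error
```

Retrying only on `503` keeps genuine client-side errors (bad payloads, invalid requests) loud instead of silently bouncing them down the cascade.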

## Challenges we ran into

* **The "Zombie" Process:** During development, our local Python server kept hanging on Port 8080 due to a conflict between the Firebase CLI and Flask on Windows. We had to learn deep Windows process management (`taskkill /F /IM python.exe`) to keep our deployment pipeline moving.
* **503 Service Errors:** Being on the bleeding edge of **Gemini 3 Preview** meant dealing with stability issues. Our initial deploys went smoothly, but in the final days requests started failing frequently with "Service Unavailable." This forced us to implement the **Fallback Hierarchy** strategy, which turned a weakness into one of our strongest architectural features.
* **Audio Sampling Rates:** YAMNet is strict about 16kHz mono audio. We had to write a custom `ensure_sample_rate()` pre-processor using `scipy` to normalize audio from different mobile devices before analysis (sketched below).
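For reference, the pre-processor follows the standard SciPy resampling pattern from the YAMNet examples; a minimal sketch, assuming the audio has already been decoded to a mono float array:

```python
import numpy as np
import scipy.signal

def ensure_sample_rate(waveform: np.ndarray, original_rate: int,
                       desired_rate: int = 16000):
    """Resample a mono waveform to the 16 kHz rate YAMNet expects."""
    if original_rate != desired_rate:
        desired_length = int(round(len(waveform) * desired_rate / original_rate))
        waveform = scipy.signal.resample(waveform, desired_length)
    return desired_rate, waveform.astype(np.float32)
```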

## Accomplishments that we're proud of

* **True Multimodal Integration:** We aren't just sending text to a chatbot. We successfully fused **Vision (Image)** and **Hearing (Audio)** into a single inference pass using Gemini 3.
* **The YAMNet Integration:** Successfully implementing a TensorFlow Hub model inside a serverless function to act as an "Expert Signal" for the LLM was a major technical win (see the sketch after this list).
* **Regional Pricing Logic:** We successfully prompted the model to understand the specific economic context of Pakistan (PKR), making the tool genuinely useful for our local target audience rather than just giving generic dollar estimates.
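For context, the TensorFlow Hub pattern behind that integration looks roughly like the sketch below; loading the model at module scope lets warm serverless instances reuse it across invocations (`classify_waveform` is an illustrative helper name):

```python
import csv

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Loaded once at import time so warm serverless instances reuse the model.
yamnet_model = hub.load("https://tfhub.dev/google/yamnet/1")

# The model ships its own class map (521 AudioSet classes) as a CSV.
class_map_path = yamnet_model.class_map_path().numpy().decode("utf-8")
with tf.io.gfile.GFile(class_map_path) as f:
    CLASS_NAMES = [row["display_name"] for row in csv.DictReader(f)]

def classify_waveform(waveform_16k: np.ndarray) -> str:
    """Return YAMNet's top class for a 16 kHz mono float32 waveform."""
    scores, _embeddings, _spectrogram = yamnet_model(waveform_16k)
    mean_scores = scores.numpy().mean(axis=0)  # average over time frames
    return CLASS_NAMES[int(mean_scores.argmax())]
```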

## What we learned

* **Edge vs. Cloud Balance:** We learned that "Cloud-only" isn't always best. Using YAMNet as a lightweight filter saves massive amounts of compute time by rejecting non-mechanical audio early.
* **The Power of Prompt Engineering:** We discovered that giving the model a "Role" (`Role: You are SonicFix, a Senior Mechanical Diagnostics AI`) drastically improved the quality and structure of the JSON output compared to generic prompts (example after this list).
* **Handling Instability:** Building with Preview APIs requires defensive programming. We learned to never trust that an endpoint will be up 100% of the time and to always have a stable model backup plan.
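A condensed example of that role-prompt pattern (the wording and JSON fields below are illustrative stand-ins for our production prompt):

```python
# Condensed role prompt; wording and JSON fields are illustrative stand-ins.
DIAGNOSIS_PROMPT = """
Role: You are SonicFix, a Senior Mechanical Diagnostics AI.
You receive a photo of a machine, a recording of it running, and an
on-device sound classification tag. Respond ONLY with JSON in this shape:
{
  "fault": "<specific failing component, e.g. 'Worn Serpentine Belt'>",
  "severity": "<Low | Medium | High | Critical>",
  "repair_steps": ["<actionable step>", "..."],
  "estimated_cost_pkr": {"min": 0, "max": 0}
}
Cost estimates must reflect the Pakistani market (PKR, 2026).
"""
```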

## What's next for SonicFix

* **Real-Time AR Overlay:** Using Gemini 3's video capabilities to overlay repair instructions directly onto the engine block through the phone camera.
* **OBD-II Integration:** Connecting via Bluetooth to the car's computer to combine diagnostic trouble codes with our audio-visual analysis, pushing diagnostic certainty even higher.
* **Enterprise API:** Offering our "Audio-Visual Diagnostic" endpoint to insurance companies for automated claim verification.

## Built With

Flutter · Firebase Cloud Functions (Python) · TensorFlow Hub (YAMNet) · Gemini 3 API · SciPy


## Updates


We just pushed a massive update to SonicFix, effectively completing our "Ear & Eye" pipeline. Here is what's new in the build:

  1. The "Gemini 3 First" Architecture We are now running on the bleeding edge. Our backend attempts to use gemini-3-flash-preview for every diagnosis to leverage its superior reasoning capabilities.

**The Safety Net:** Because preview models can be unstable (we hit a few 503s!), we built a custom Fallback Cascade. If Gemini 3 is busy, the system instantly degrades to `gemini-1.5-flash` without the user ever knowing. It’s the best of both worlds: Innovation + Reliability.

  1. New "Chat-Style" Interface We completely overhauled the UI. Instead of a static result screen, SonicFix now features a Session-Based Chat Interface.

* **Visual Context:** You can see the photo you snapped of the engine/machine.
* **Audio Playback:** You can replay the mechanical sound you recorded right inside the chat bubble.
* **Diagnostic Cards:** The AI response is now rendered as a beautiful "Health Card" showing the Fault, Severity, and Estimated Cost (PKR).

3. **Under the Hood:**

* **Frontend:** Flutter (Riverpod + Material 3)
* **Backend:** Firebase Cloud Functions (Python)
* **AI:** Multimodal Fusion (Image + Audio + Context)

We are now recording the final demo video. Fingers crossed for the submission!
