APK can be found in our release here: https://github.com/ParthPatel00/Vantage/releases/tag/v1.0
Inspiration
Every photographer we know edits afterwards. They take the photo, look at it, then spend longer fixing it than taking it. Auto mode picks for you and you can't argue with it. Pro mode hands you twenty knobs and assumes you know which one to turn.
Pixel's Camera Coach was the closest thing we'd seen to a real fix: a small text tip, generated in the cloud, telling you to recompose. But it took up to a minute, never touched the camera itself, and at the end you still had to do the editing yourself.
This idea became a project the moment we realized the Snapdragon 8 Elite NPU could run Gemma 3n E2B fast enough to make a decision about your photo before you took it. If the model lives on the phone and inference takes a second, you don't need a coaching tip. You just need a different shutter button.
What it does
Modern phone cameras have two failure modes:
- Auto mode is a black box. The phone makes every choice (ISO, white balance, color grade) and you get whatever it decides. If the look isn't right, you're editing afterwards.
- Pro mode exposes every dial: ISO, shutter speed, white balance, focus distance, exposure compensation, eight filter looks. Most people don't know which dial to turn.
Google's Camera Coach on Pixel 10 sits in between, but it's cloud-based, takes up to a minute, only gives text tips, and never actually adjusts the camera. It hands you a checklist; we run the checklist for you.
We wanted to collapse all of that into one button. Frame the shot. Tap once. Get the photo you actually wanted. The AI is the photographer; the user just decides what they want a picture of.
The thing that made this feel possible now was Gemma 3n E2B running on the Snapdragon 8 Elite NPU, a multimodal model small enough to live on the phone and fast enough that "tap, analyze, shoot" feels instant. So we built around one constraint: every decision the camera makes has to happen on-device, by Gemma, in less time than a human takes to lift the phone.
How we built it
One tap, three jobs. When you press the AI shutter, the app does three things at once. It asks Gemma to look at the viewfinder and pick the best technical settings (ISO, shutter speed, white balance, focus, zoom, flash). It asks for a creative call: which of twelve filters fits the mood, plus brightness, contrast, saturation, and gamma adjustments. Then it captures the shot, applying everything in a single pipeline.
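In sketch form, the flow looks something like the code below. This is a minimal illustration, not the app's actual code: the `SceneAnalyzer` and `CameraController` interfaces, and every name in them, are hypothetical stand-ins for the real Gemma session and the CameraX wiring.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope

// Hypothetical shapes for illustration; the app's real types aren't published here.
data class TechnicalSettings(
    val iso: Int, val shutterNs: Long, val whiteBalanceK: Int,
    val focusDistance: Float, val zoom: Float, val flash: Boolean
)
data class CreativeCall(
    val filter: String, val brightness: Int, val contrast: Int,
    val saturation: Int, val gamma: Float
)

interface SceneAnalyzer {                        // wraps the Gemma session
    suspend fun technical(frame: ByteArray): TechnicalSettings
    suspend fun creative(frame: ByteArray): CreativeCall
}
interface CameraController {                     // wraps CameraX + Camera2 interop
    fun apply(settings: TechnicalSettings)
    suspend fun capture(creative: CreativeCall)
}

suspend fun onAiShutter(analyzer: SceneAnalyzer, camera: CameraController, frame: ByteArray) =
    coroutineScope {
        // Both Gemma requests look at the same viewfinder frame.
        val technical = async { analyzer.technical(frame) }
        val creative = async { analyzer.creative(frame) }
        camera.apply(technical.await())          // ISO, shutter, WB, focus, zoom, flash
        camera.capture(creative.await())         // filter + brightness/contrast/saturation/gamma
    }
```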
Two outputs per shot. Every tap produces a pair: the unedited original and the AI-tuned version. The Vintage filter, for instance, isn't a color overlay: it lifts the blacks, warms the whites, and pulls saturation down by about 30%, all baked into the saved image. The gallery shows both with a swipe between them, so you can see exactly what the AI changed.
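To give a flavor of what "baked in" means, here is a rough GPUImage composition of a Vintage-style look. The specific numbers are illustrative, not Vantage's hand-tuned values.

```kotlin
import android.content.Context
import android.graphics.Bitmap
import jp.co.cyberagent.android.gpuimage.GPUImage
import jp.co.cyberagent.android.gpuimage.filter.GPUImageFilterGroup
import jp.co.cyberagent.android.gpuimage.filter.GPUImageLevelsFilter
import jp.co.cyberagent.android.gpuimage.filter.GPUImageSaturationFilter
import jp.co.cyberagent.android.gpuimage.filter.GPUImageWhiteBalanceFilter

// Illustrative recipe for a Vintage-style look; the exact numbers in the app
// are hand-tuned and not published here.
fun vintage(context: Context, original: Bitmap): Bitmap {
    val liftedBlacks = GPUImageLevelsFilter().apply {
        // Raise the output floor so pure black becomes a soft charcoal.
        setMin(0f, 1f, 1f, 0.06f, 1f)
    }
    val warmWhites = GPUImageWhiteBalanceFilter(6200f, 0f)   // >5000K pushes warm
    val mutedColor = GPUImageSaturationFilter(0.7f)          // ~30% saturation cut
    return GPUImage(context).apply {
        setImage(original)
        setFilter(GPUImageFilterGroup(listOf(liftedBlacks, warmWhites, mutedColor)))
    }.bitmapWithFilterApplied
}
```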
Voice as creative direction. Hold the mic and say "cinematic portrait" or "1980s retro look", and the next shot uses your words as the creative target. Gemma was trained on enough photography that those phrases map cleanly to filter choices and tone curves. We anchor the most common requests with a few style examples in the prompt; Gemma generalizes well from there.
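The production prompt is longer and hand-tuned, but the anchoring idea reduces to something like this; the field names and numbers are illustrative, not the real prompt.

```kotlin
// Illustrative style anchors; each line gives Gemma a concrete reference
// point to generalize from when the user's phrase is nearby but not exact.
val styleAnchors = """
    Examples of creative targets and the looks they imply:
    - "cinematic portrait" -> FILTER: CINEMATIC | CONTRAST: +15 | SATURATION: -10
    - "1980s retro look"   -> FILTER: VINTAGE   | CONTRAST: -5  | SATURATION: -30
    - "moody street shot"  -> FILTER: NOIR      | CONTRAST: +25 | SATURATION: -80
    Creative target from the user: "%s"
    Pick the closest anchor and adjust from there.
""".trimIndent()

fun creativePrompt(voiceIntent: String) = styleAnchors.format(voiceIntent)
```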
Live preview filtering. The viewfinder itself runs through an OpenGL ES shader pipeline, so you see the chosen filter in real time, not just on the saved photo. Each of the twelve looks (Cinematic, Vintage, Dramatic, Noir, Vivid, Warm, Cool, Muted, Fade, Mono, Silvertone, Natural) is a hand-tuned fragment shader.
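Each look is a small GLSL program. A stripped-down example in the same mold, for the simplest of the twelve (a plain Mono), might look like the shader below; the shipped shaders are hand-tuned and do more.

```kotlin
// A minimal ES 2.0 fragment shader in the shape of one of the twelve looks.
// Android delivers the camera preview as an external OES texture, hence the
// extension directive and sampler type.
val MONO_FRAGMENT_SHADER = """
    #extension GL_OES_EGL_image_external : require
    precision mediump float;
    varying vec2 vTexCoord;
    uniform samplerExternalOES uCameraTexture;
    void main() {
        vec4 color = texture2D(uCameraTexture, vTexCoord);
        // Rec. 709 luma weights: perceptual grayscale, not a flat channel average.
        float luma = dot(color.rgb, vec3(0.2126, 0.7152, 0.0722));
        gl_FragColor = vec4(vec3(luma), color.a);
    }
""".trimIndent()
```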
The stack. Kotlin and Jetpack Compose for UI, CameraX with Camera2 interop for camera control, LiteRT-LM (Kotlin API) for Gemma, GPUImage for the captured-image enhancement, OpenGL ES 2.0 for the live preview. Built for the Samsung Galaxy S25 Ultra, Snapdragon 8 Elite, Hexagon v79 NPU.
Challenges we ran into
The wrong native libraries. LiteRT-LM talks to the NPU through a stack of .so files. Our first builds crashed instantly with a deep native error. We assumed the Qualcomm QAIRT SDK was the canonical source; it wasn't. The right libraries live in a Google sample app repo, not in Qualcomm's SDK. Once we figured that out, the crash vanished. We lost half a day chasing the wrong thread.
Initialization order matters. The Hexagon DSP loads its runtime through a path that has to be set in the process environment before LiteRT-LM is touched. We had it set after and got an inscrutable native crash. Moving the env-var set to the very top of engine init fixed it.
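In sketch form, the fix looks like this. We assume here the variable in question is ADSP_LIBRARY_PATH, the one the Hexagon runtime conventionally reads, and `createEngine` is a hypothetical stand-in for the real LiteRT-LM entry point.

```kotlin
import android.content.Context
import android.system.Os

// Must run before anything touches LiteRT-LM. ADSP_LIBRARY_PATH (assumed
// here) tells the Hexagon runtime where to find its libraries.
fun primeHexagonEnv(context: Context) {
    Os.setenv("ADSP_LIBRARY_PATH", context.applicationInfo.nativeLibraryDir, true)
}

// Correct order:
//   primeHexagonEnv(context)       // 1. env var first
//   val engine = createEngine()    // 2. only then initialize LiteRT-LM (hypothetical name)
```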
Filter drift across rounds. When we ran two analysis rounds in a row to refine the settings, Gemma sometimes second-guessed itself, picking CINEMATIC in round one, then quietly flipping to NATURAL in round two because the numbers were closer to neutral. The creative decision was getting lost. We added a small piece of state that pins the round-one filter and lets only the technical numbers update, so the look stays decisive.
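The pinning state is tiny; a reduced sketch with illustrative names:

```kotlin
// Round one decides the look; later refinement rounds may only move the
// technical numbers. Names and fields are illustrative.
data class CreativeChoice(val filter: String, val contrast: Int, val saturation: Int)

class FilterPin {
    private var pinned: String? = null

    fun merge(round: Int, fresh: CreativeChoice): CreativeChoice {
        if (round == 1) pinned = fresh.filter              // pin the round-one filter
        return fresh.copy(filter = pinned ?: fresh.filter) // later rounds keep it
    }
}
```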
Working around a 2B-parameter model. Gemma 3n E2B is small by today's standards, and small multimodal models are noticeably less reliable than their bigger cloud cousins at structured reasoning. Swapping in a larger model wasn't an option; the whole point was to keep this on-device. So we engineered around the constraint. We anchor every common creative phrase ("cinematic portrait", "1980s retro") with an example in the prompt so the model has a reference point to generalize from. We use a strict single-line-per-field output template instead of free-form generation so we know what to parse. We clamp every numeric output to a valid range so the model can't push the camera into broken settings, and we strip the occasional repetition Gemma falls into. None of this is glamorous, but most of the work in a small-model deployment is exactly this kind of glue, and the result is roughly 100% parse success and photos that come out the way the user asked for.
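The template-plus-clamp glue is short; a reduced sketch with illustrative field names and ranges:

```kotlin
// The prompt instructs Gemma to answer one field per line, e.g.:
//   FILTER: CINEMATIC
//   ISO: 400
//   CONTRAST: +15
// A forgiving parser plus hard clamps turns that into safe camera settings.
private val FIELD = Regex("""^([A-Z_]+):\s*(.+)$""")

fun parseFields(raw: String): Map<String, String> =
    raw.lineSequence()
        .mapNotNull { FIELD.find(it.trim())?.destructured }
        .associate { (key, value) -> key to value }  // duplicate keys collapse,
                                                     // absorbing Gemma's occasional repeats

fun clampIso(value: String?): Int =
    (value?.filter { it.isDigit() }?.toIntOrNull() ?: 100).coerceIn(50, 6400)
```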
Hiding the dials. Pro photographers love their dials. Most people don't. The hardest design problem wasn't the AI. It was making one button feel like it could replace ten. We spent more time tuning Gemma's prompt (when does a portrait want CINEMATIC vs DRAMATIC? how much contrast is too much?) than wiring up the actual camera.
What we learned
- On-device LLMs are ready for interactive UX. Once the native plumbing was right, Gemma's per-frame inference came in around 1.5–2 seconds on the NPU. That's the line between "AI camera" and "waiting for the AI."
- The NPU is the difference. The same model on the CPU runs 4–5× slower. Subsequent model loads after a cache warmup drop from 10 seconds to under 2. The Snapdragon 8 Elite isn't just an upgrade, it changes what you can build.
- Strict prompts beat fancy parsing. A clean output format in the prompt plus a forgiving parser is a robust pattern even on small multimodal models. You don't need tool-calling to get there.
- Subtraction is harder than addition. A pro camera with twenty controls is easier to design than one button that has to be right every time. Most of the work was in not exposing things.
- Show, don't tell. The original-vs-AI swipe in the gallery does more to explain what Vantage is than any tip text could. Users get it instantly.
Accomplishments that we're proud of
- Sub-2-second multimodal inference on-device. Each scene analysis round, vision plus structured text, lands in roughly 1.5 to 2 seconds on the Snapdragon 8 Elite NPU. Fast enough that the user never thinks of it as "the AI"; it's just the shutter button.
- The dual-output gallery. Saving the original alongside the AI-tuned shot, and letting users swipe between them, turned out to be the single best explanation of what Vantage is. It does more than any tip text could.
- Closed-loop inspiration, all on-device. Gemma writes its own Unsplash search query from the viewfinder and the user's voice intent, the app fetches six matching references, and the user's pick becomes the creative target for the next shot. The model is doing query authoring, scene analysis, and creative direction in the same session, all on the NPU, all on-device (a sketch of the fetch step follows this list).
- Voice intent that actually steers the model. Saying "cinematic portrait" or "1980s retro" doesn't just decorate the chat history, it changes the filter, the contrast curve, and the white balance Gemma picks. The phrase becomes the photograph.
- Twelve hand-tuned live filters at zero lag. A custom OpenGL ES renderer applies each filter directly on the camera texture, so what you see in the viewfinder is exactly what the saved photo looks like.
- Robust structured output without tool calls. Roughly 100% parse success on Gemma's free-form output, using only a strict prompt template and a forgiving parser. No JSON-mode required.
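The reference-fetch step of that closed loop is plain HTTP once Gemma has authored the query. A reduced sketch against Unsplash's public search API; error handling and JSON parsing are omitted, and this belongs off the main thread.

```kotlin
import java.net.HttpURLConnection
import java.net.URL
import java.net.URLEncoder

// Gemma has already written the query from the viewfinder + voice intent;
// this just asks Unsplash for six matching references.
fun fetchReferences(gemmaQuery: String, accessKey: String): String {
    val q = URLEncoder.encode(gemmaQuery, "UTF-8")
    val url = URL("https://api.unsplash.com/search/photos?query=$q&per_page=6")
    val conn = url.openConnection() as HttpURLConnection
    conn.setRequestProperty("Authorization", "Client-ID $accessKey")
    return conn.inputStream.bufferedReader().use { it.readText() }  // raw JSON payload
}
```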
What's next for Vantage
- Use your own photos as the reference. Right now references come from Unsplash. The next step is letting users point Vantage at their own gallery (an album of shots they love, an aesthetic moodboard, a folder of past portraits) and have Gemma analyze those images instead. The same closed loop applies: Gemma reads your reference, the AI shutter tunes the shot to match. The model never has to leave the device, and the references never leave it either.
- Live coaching mode. A second mode next to the AI shutter that runs scene analysis continuously and surfaces guidance in real time, translucent boxes for "where your subject is" vs "where it should be", edge arrows for movement, narrated tips. The single-tap shutter remains the headline; live coaching becomes the second mode.
- ElevenLabs voice. Replace the stock Android text-to-speech with a more natural voice for the spoken feedback. Small change, big perceived-quality jump for narration and for users who prefer voice over text.
- Multi-shot stories. "Take a five-photo wedding album" or "give me a three-shot portrait sequence": Gemma plans the sequence, the app guides the user through each frame, and you end up with a coordinated set instead of a single shot.
- Wider device support. Right now the model is the SM8750-tuned variant for the S25 Ultra's NPU. Adding a CPU/GPU fallback path opens Vantage up to any modern Android device, with the trade-off of slower inference.
- Fine-tuned Gemma for photography. The base 2B model is already strong. A few hundred examples of professional photographer critiques would tighten the composition tips and filter selection further, especially for the harder edge cases (low-light food, backlit portraits, motion).
