Gemini Integration Description

Penny uses five Gemini 3 capabilities together in a single financial app:

  1. Vision API — Photograph brokerage statements and receipts; Gemini parses document structure (tables, charts, text) and extracts holdings with confidence scores.
  2. Configurable Thinking Levels: high for document analysis, medium for voice coaching, low/minimal for tips and alerts. The UI shows which level is active.
  3. Structured Output — Every response is type-safe JSON validated against Zod schemas. No string parsing.
  4. Streaming — Voice coaching streams Gemini text generation directly to text-to-speech.
  5. Autonomous Agent — A background agent monitors portfolios 24/7, detects allocation drift, sends Gemini-generated push notifications, and learns from user response patterns.

Penny follows Google's "Action Era" concept: AI that sees (camera), reasons at variable depth (thinking levels), and acts without prompting (imports holdings, sends alerts). Not a chatbot wrapper — a background copilot.


Inspiration

Most AI apps are chat wrappers. Text in, text out. I wanted to build an app where Gemini 3 watches documents, reasons at different depths, and takes actions without being asked.

I picked personal finance because it demands all three modes of AI interaction:

  • Multimodal input — statements, receipts, and charts require vision
  • Variable reasoning — risk analysis needs depth; market updates need speed
  • Autonomous action — the AI should monitor and act, not wait to be asked

Penny is a financial copilot that uses Gemini 3 for more than conversation.


What it does

1. Multimodal Vision + Document Understanding

Users photograph brokerage statements. Gemini 3 Vision extracts holdings from tables, text, and charts by parsing document structure — not running OCR.

Input: Photo of Fidelity statement
Output: Structured JSON with holdings, quantities, prices, confidence scores

High-confidence holdings import directly. Lower-confidence holdings are flagged for user review.
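
The routing step could look roughly like the sketch below. It assumes each extracted holding carries the confidence field from DocumentAnalysisSchema. The 0.9 cutoff, the triageHoldings name, and the result shape are illustrative assumptions, not values taken from Penny's code.

```typescript
// Shape mirrors one entry of the holdings array in DocumentAnalysisSchema.
interface ExtractedHolding {
  name: string;
  symbol?: string;
  quantity: number;
  price: number;
  confidence: number; // 0..1, as reported by the model
}

// Hypothetical cutoff; a real value would be tuned against labeled statements.
const AUTO_IMPORT_THRESHOLD = 0.9;

interface TriageResult {
  autoImport: ExtractedHolding[];
  needsReview: ExtractedHolding[];
}

// Split holdings into those safe to import and those a user should confirm.
function triageHoldings(
  holdings: ExtractedHolding[],
  threshold: number = AUTO_IMPORT_THRESHOLD,
): TriageResult {
  const result: TriageResult = { autoImport: [], needsReview: [] };
  for (const holding of holdings) {
    if (holding.confidence >= threshold) {
      result.autoImport.push(holding);
    } else {
      result.needsReview.push(holding);
    }
  }
  return result;
}
```

Keeping the threshold as a parameter would let the review screen use a stricter cutoff for low-quality photos without touching the import path.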

2. Configurable Thinking Levels

I adjust thinkingLevel by task:

Task                 Thinking Level   Reason
Document analysis    high             Table extraction needs deep reasoning
Voice coaching       medium           Conversational depth without latency
Portfolio insights   low              Contextual suggestions without heavy compute
Daily tips           minimal          Fast, lightweight generation
Drift alerts         minimal          Speed matters more than depth

The UI displays which level is active so users see how much reasoning is happening.
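
One way to keep the task-to-level mapping above in a single place is a typed lookup table. This is a sketch only: the PennyTask names and the thinkingLevelFor helper are assumed for illustration, while the four level strings come from the table.

```typescript
type ThinkingLevel = "minimal" | "low" | "medium" | "high";

// Hypothetical task identifiers, one per row of the table above.
type PennyTask =
  | "documentAnalysis"
  | "voiceCoaching"
  | "portfolioInsights"
  | "dailyTips"
  | "driftAlerts";

// Record<PennyTask, ...> makes the compiler reject a task with no level assigned.
const THINKING_LEVEL_BY_TASK: Record<PennyTask, ThinkingLevel> = {
  documentAnalysis: "high",
  voiceCoaching: "medium",
  portfolioInsights: "low",
  dailyTips: "minimal",
  driftAlerts: "minimal",
};

function thinkingLevelFor(task: PennyTask): ThinkingLevel {
  return THINKING_LEVEL_BY_TASK[task];
}
```

The same table could drive the UI indicator, so the level shown to the user and the level sent to the API cannot drift apart.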

3. Structured Output with Schema Validation

Gemini returns type-safe JSON validated against Zod schemas:

const DocumentAnalysisSchema = z.object({
  holdings: z.array(z.object({
    name: z.string(),
    symbol: z.string().optional(),
    quantity: z.number(),
    price: z.number(),
    confidence: z.number().min(0).max(1),
  })),
  reasoning: z.string(),
});

Verdict enums, percentages, and consideration arrays are all schema-guaranteed — no regex parsing.

The same pattern powers "Ask Before I Buy" — users enter a purchase and Gemini returns a structured verdict, portfolio impact percentage, opportunity cost, and pros/cons in one call.

4. Autonomous Marathon Agent

The AI runs without user prompts:

  • Monitors portfolios via expo-background-fetch
  • Detects allocation drift from user goals
  • Sends proactive push notifications with Gemini-generated messages
  • Learns from user response patterns to adjust intervention timing
  • Logs every decision in a transparent Activity Feed
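
The "learns from user response patterns" step could be as simple as stretching the cooldown between alerts when the user stops responding. The sketch below is an assumption about how that might work, not Penny's actual logic: the AgentState fields, the one-day base cooldown, and the 4x cap are all illustrative.

```typescript
interface AgentState {
  userResponseRate: number;   // fraction of recent alerts the user acted on, 0..1
  lastInterventionAt: number; // epoch milliseconds; 0 if the agent has never alerted
}

// Hypothetical base cooldown between alerts: one day.
const BASE_COOLDOWN_MS = 24 * 60 * 60 * 1000;

// A lower response rate gives a longer cooldown: 1x at rate 1.0, 4x at rate 0.0.
function cooldownMs(responseRate: number): number {
  const clamped = Math.min(1, Math.max(0, responseRate));
  const multiplier = 1 + 3 * (1 - clamped);
  return BASE_COOLDOWN_MS * multiplier;
}

// Intervene only once the cooldown since the last alert has elapsed.
function shouldIntervene(state: AgentState, now: number = Date.now()): boolean {
  return now - state.lastInterventionAt >= cooldownMs(state.userResponseRate);
}

// Share of past interventions the user responded to.
function calculateResponseRate(interventions: { respondedTo: boolean }[]): number {
  if (interventions.length === 0) return 1; // no history yet: assume an engaged user
  const responded = interventions.filter((i) => i.respondedTo).length;
  return responded / interventions.length;
}
```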

5. Real-Time Voice Coaching

Streaming Gemini responses feed directly into text-to-speech. Users speak to the app and hear portfolio coaching as it generates — no waiting for the full response.
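
To speak while text is still arriving, streamed chunks need to be cut at sentence boundaries before they reach the speech engine. The following is a minimal sketch of such a buffer. SentenceBuffer is a hypothetical helper, and the boundary rule (., !, or ? followed by whitespace) is a simplification that will mis-split some abbreviations.

```typescript
// Buffers streamed text and emits each sentence as soon as it is complete.
class SentenceBuffer {
  private pending = "";
  private readonly onSentence: (sentence: string) => void;

  constructor(onSentence: (sentence: string) => void) {
    this.onSentence = onSentence;
  }

  // Call once per streamed chunk.
  push(chunk: string): void {
    this.pending += chunk;
    // A sentence ends at ., !, or ? followed by whitespace.
    const boundary = /[.!?]\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + 1; // include the punctuation mark
      const sentence = this.pending.slice(start, end).trim();
      if (sentence) this.onSentence(sentence);
      start = match.index + match[0].length;
    }
    // Keep the unfinished tail for the next chunk.
    this.pending = this.pending.slice(start);
  }

  // Call when the stream ends to emit whatever is left.
  flush(): void {
    const rest = this.pending.trim();
    if (rest) this.onSentence(rest);
    this.pending = "";
  }
}
```

In this arrangement a streaming callback would call push for every chunk and flush on completion, with a text-to-speech function passed as onSentence. How queued utterances are sequenced depends on the speech library in use.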


How I built it

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         PENNY APP                           │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Camera    │  │ Voice Input │  │  Background Agent   │  │
│  │  (Vision)   │  │  (Audio)    │  │   (Autonomous)      │  │
│  └──────┬──────┘  └──────┬──────┘  └──────────┬──────────┘  │
│         │                │                     │             │
│         ▼                ▼                     ▼             │
│  ┌────────────────────────────────────────────────────────┐ │
│  │              GEMINI 3 INTEGRATION LAYER                │ │
│  │  ┌──────────────────────────────────────────────────┐  │ │
│  │  │  thinkingLevel: 'minimal'|'low'|'medium'|'high'  │  │ │
│  │  └──────────────────────────────────────────────────┘  │ │
│  │  • Vision API (document/receipt analysis)              │ │
│  │  • Structured Output (Zod schema validation)           │ │
│  │  • Streaming (voice coaching)                          │ │
│  │  • Retry with exponential backoff                      │ │
│  └────────────────────────────────────────────────────────┘ │
│         │                │                     │             │
│         ▼                ▼                     ▼             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │  Portfolio  │  │   Alerts    │  │   Agent Activity    │  │
│  │   Import    │  │   & Tips    │  │       Log           │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Tech Stack

  • Frontend: React Native + Expo (iOS/Android)
  • AI: Gemini 3 Flash Preview (gemini-3-flash-preview)
  • Observability: Opik for LLM tracing
  • Auth: Firebase Authentication
  • Storage: AsyncStorage + Firebase
  • Background Tasks: expo-background-fetch + expo-task-manager

Gemini 3 API Integration

// Vision + structured output + thinking levels in one call
const result = await generateStructuredWithGemini({
  prompt: documentAnalysisPrompt,
  schema: DocumentAnalysisSchema,
  image: base64Image,
  thinkingLevel: 'high',
  temperature: 0.2,
});

// Streaming for voice responses
await streamWithGemini({
  prompt: coachingPrompt,
  thinkingLevel: 'medium',
  onChunk: (text) => appendToUI(text),
  onComplete: (full) => speakAloud(full),
});

Autonomous Agent Loop

async function runAgentLoop() {
  const holdings = await loadPortfolio();
  const goals = await getUserGoals();
  // Persisted agent state (past interventions, response rate).
  // loadAgentState/saveAgentState are assumed helper names.
  const state = await loadAgentState();

  const drift = calculateDrift(holdings, goals.targetAllocation);
  if (drift > THRESHOLD && shouldIntervene(state)) {
    const message = await generateWithGemini({
      prompt: `Portfolio drifted: ${drift}. Write encouraging notification.`,
      thinkingLevel: 'minimal',
    });
    await sendPushNotification('Portfolio Drift', message);
    await logIntervention({ type: 'drift_alert', message });
  }

  // Agent learns: if user ignores alerts, back off
  state.userResponseRate = calculateResponseRate(state.recentInterventions);
  await saveAgentState(state);
}
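
The loop compares a single drift number against a threshold. One plausible definition, sketched here under assumptions, is the largest absolute gap between actual and target weight across asset classes. The assetClass field and the Allocation type are hypothetical, and Penny's real calculateDrift may measure drift differently.

```typescript
interface Holding {
  assetClass: string; // e.g. "stocks" or "bonds" (hypothetical field)
  quantity: number;
  price: number;
}

// Target weight per asset class, expected to sum to 1.
type Allocation = Record<string, number>;

// Returns the largest absolute difference between actual and target weights (0..1).
function calculateDrift(holdings: Holding[], target: Allocation): number {
  const valueByClass: Record<string, number> = {};
  let total = 0;
  for (const h of holdings) {
    const value = h.quantity * h.price;
    valueByClass[h.assetClass] = (valueByClass[h.assetClass] ?? 0) + value;
    total += value;
  }
  if (total === 0) return 0; // empty portfolio: nothing to compare

  // Union of classes held and classes targeted; duplicates are harmless for a max.
  const classes = Object.keys(valueByClass).concat(Object.keys(target));
  let maxDrift = 0;
  for (const c of classes) {
    const actual = (valueByClass[c] ?? 0) / total;
    const wanted = target[c] ?? 0;
    maxDrift = Math.max(maxDrift, Math.abs(actual - wanted));
  }
  return maxDrift;
}
```

With this definition, a portfolio that is 80% stocks against a 60% target reports a drift of about 0.2, so a threshold such as 0.05 would mean "alert when any class is more than five points off target".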

Challenges I ran into

  1. Thinking level trade-offs: high improves document extraction but adds latency. I benchmarked each task and picked levels that balance quality and speed.

  2. Structured output reliability — Gemini sometimes returns malformed JSON. I added auto-correction (case normalization, array parsing) before Zod validation.

  3. Autonomous trust — Users distrust AI acting without permission. The Activity Log shows every agent decision with reasoning, which became the main trust mechanism.

  4. Multimodal prompt tuning — Document extraction needed iteration. Adding "extract EVERY holding, even if partially visible" improved recall significantly.
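
The auto-correction pass from challenge 2 might look like the sketch below, which runs on the raw parsed value before it is handed to a schema validator. The verdict values ("buy", "wait", "skip") and both function names are assumptions for illustration. The source only states that case normalization and array parsing are applied.

```typescript
type Verdict = "buy" | "wait" | "skip"; // hypothetical enum values

const VERDICTS: readonly Verdict[] = ["buy", "wait", "skip"];

// Case normalization: "BUY" or " Buy " becomes "buy". Unknown values return undefined.
function normalizeVerdict(raw: unknown): Verdict | undefined {
  if (typeof raw !== "string") return undefined;
  const lowered = raw.trim().toLowerCase();
  for (const v of VERDICTS) {
    if (v === lowered) return v;
  }
  return undefined;
}

// Array parsing: accept a real array, a JSON-encoded array, or a
// semicolon/newline-delimited string, and always return string[].
function normalizeStringArray(raw: unknown): string[] {
  if (Array.isArray(raw)) return raw.map(String);
  if (typeof raw !== "string") return [];
  const text = raw.trim();
  if (text.charAt(0) === "[") {
    try {
      const parsed: unknown = JSON.parse(text);
      if (Array.isArray(parsed)) return parsed.map(String);
    } catch {
      // Not valid JSON after all; fall through to delimiter splitting.
    }
  }
  return text
    .split(/[;\n]/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```

Values that cannot be repaired still come back as undefined or an empty array, so the schema validation that follows remains the final gate.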


Accomplishments

Typical AI App                Penny
Text in → Text out            Camera → Structured Data → Portfolio Import
Single reasoning mode         4 thinking levels matched to task
Reactive (waits for input)    Proactive (autonomous monitoring)
Generic responses             Type-safe JSON with confidence scores
Opaque                        Transparent agent activity log

Production Details

  • Exponential backoff retry logic
  • Response caching (5 min TTL)
  • LLM observability via Opik
  • Parallel API calls
  • Demo mode for judge access
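
For the first item in the list, a generic retry wrapper with exponential backoff could be sketched as follows. The 500 ms base delay, 8 s cap, four attempts, and the use of random jitter are assumed defaults rather than Penny's actual settings, and withRetry is a hypothetical name.

```typescript
// Delay ceiling before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number = 500, maxMs: number = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Runs `call` up to maxAttempts times, waiting longer after each failure.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts: number = 4,
  baseMs: number = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        // Jitter: wait a random time in [0, ceiling) so concurrent clients do not retry in lockstep.
        await sleep(Math.random() * backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw lastError;
}
```

A production version would usually retry only on transient failures such as rate limits or timeouts, and rethrow validation errors immediately.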

What I learned

  1. Thinking levels change architecture — Variable reasoning depth lets you match compute to task complexity instead of using one mode for everything.

  2. Multimodal is underused — Most projects treat Gemini as text-only. Vision + structured output enables scan-to-import flows that feel magical.

  3. Autonomous AI needs transparency — The Activity Log started as a debug tool. It became the feature that makes users trust the agent.

  4. Structured output beats text parsing — For action-oriented AI, typed JSON is more reliable than parsing prose.

What's next for Penny

  • Gemini Live API — Full duplex voice for conversational coaching
  • Multi-agent architecture — Separate agents for spending, investing, and risk
  • Video analysis — Process portfolio review recordings and earnings calls
