Penny - Talk to Your Money

Screenshots: PENNY home screen with portfolio value, performance chart, and top holdings; camera scanning a brokerage statement; AI portfolio analysis; Penny chatbot voice coaching; receipt scan with Gemini reasoning; "Ask Before I Buy" analysis showing a Think Twice verdict for a $3,000 vacation, with portfolio impact, opportunity cost, and considerations; AI performance dashboard; background agent that monitors the portfolio autonomously.

Gemini Integration Description

Penny uses five Gemini 3 capabilities together in a single financial app:

Vision API: Photograph brokerage statements and receipts; Gemini parses document structure (tables, charts, text) and extracts holdings with confidence scores.
Configurable Thinking Levels: high for document analysis, medium for voice coaching, low for portfolio insights, and minimal for tips and alerts. The UI shows which level is active.
Structured Output: Every response is JSON validated against Zod schemas, so the app never parses free-form strings.
Streaming: Voice coaching streams Gemini text generation directly to text-to-speech.
Autonomous Agent: A background agent checks portfolios on a recurring schedule, detects allocation drift, sends Gemini-generated push notifications, and learns from user response patterns.

Penny follows Google's "Action Era" concept: AI that sees (camera), reasons at variable depth (thinking levels), and acts without prompting (imports holdings, sends alerts). It is a background copilot rather than a chatbot wrapper.

Inspiration

Most AI apps are chat wrappers: text in, text out. I wanted to build an app where Gemini 3 reads documents, reasons at different depths, and takes actions without being asked.

I picked personal finance because it demands all three modes of AI interaction:

Multimodal input — statements, receipts, and charts require vision
Variable reasoning — risk analysis needs depth; market updates need speed
Autonomous action — the AI should monitor and act, not wait to be asked

Penny is a financial copilot that uses Gemini 3 for more than conversation.

What it does

1. Multimodal Vision + Document Understanding

Users photograph brokerage statements. Gemini 3 Vision extracts holdings from tables, text, and charts by parsing document structure — not running OCR.

Input: Photo of Fidelity statement
Output: Structured JSON with holdings, quantities, prices, confidence scores

High-confidence holdings are imported directly. Lower-confidence holdings are flagged for user review.
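
The import rule above can be sketched as a simple partition by confidence score. This is a minimal illustration, not Penny's actual code: the 0.85 cutoff and the names `Holding`, `AUTO_IMPORT_THRESHOLD`, and `partitionByConfidence` are assumptions, and the sample rows are made-up data.

```typescript
// Hypothetical shape mirroring the fields the writeup says Gemini returns.
interface Holding {
  name: string;
  symbol?: string;
  quantity: number;
  price: number;
  confidence: number; // 0..1
}

// Assumed cutoff; the real threshold is not stated in this writeup.
const AUTO_IMPORT_THRESHOLD = 0.85;

// Split extracted holdings into those safe to import and those needing review.
function partitionByConfidence(
  holdings: Holding[],
  threshold: number = AUTO_IMPORT_THRESHOLD,
): { autoImport: Holding[]; needsReview: Holding[] } {
  const autoImport: Holding[] = [];
  const needsReview: Holding[] = [];
  for (const h of holdings) {
    (h.confidence >= threshold ? autoImport : needsReview).push(h);
  }
  return { autoImport, needsReview };
}

// Made-up sample data for illustration only.
const sample: Holding[] = [
  { name: "Apple Inc.", symbol: "AAPL", quantity: 10, price: 190, confidence: 0.97 },
  { name: "Unknown Fund", quantity: 3, price: 42.5, confidence: 0.55 },
];
const routed = partitionByConfidence(sample);
```

Keeping the threshold as a parameter would make it easy to tune per document type if some statement layouts extract less reliably than others.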

2. Configurable Thinking Levels

I adjust thinkingLevel by task:

Task	Thinking Level	Reason
Document analysis	`high`	Table extraction needs deep reasoning
Voice coaching	`medium`	Conversational depth without latency
Portfolio insights	`low`	Contextual suggestions without heavy compute
Daily tips	`minimal`	Fast, lightweight generation
Drift alerts	`minimal`	Speed matters more than depth

The UI displays which level is active so users see how much reasoning is happening.
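
One way to keep the table above in a single place is a typed lookup that both the API layer and the UI read from. The task keys and function names here are hypothetical; only the level values come from the table.

```typescript
type ThinkingLevel = "minimal" | "low" | "medium" | "high";

// Hypothetical task identifiers, one per row of the table.
type Task =
  | "documentAnalysis"
  | "voiceCoaching"
  | "portfolioInsights"
  | "dailyTips"
  | "driftAlerts";

// Record<Task, ...> makes the compiler reject a task with no level assigned.
const THINKING_LEVEL_BY_TASK: Record<Task, ThinkingLevel> = {
  documentAnalysis: "high",
  voiceCoaching: "medium",
  portfolioInsights: "low",
  dailyTips: "minimal",
  driftAlerts: "minimal",
};

function thinkingLevelFor(task: Task): ThinkingLevel {
  return THINKING_LEVEL_BY_TASK[task];
}

// Text a UI badge could show for the active level (format is an assumption).
function thinkingLabel(task: Task): string {
  return `Thinking: ${thinkingLevelFor(task)}`;
}
```

With this shape, adding a new task forces a decision about its reasoning depth at compile time.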

3. Structured Output with Schema Validation

Gemini returns type-safe JSON validated against Zod schemas:

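import { z } from 'zod';

// Shape of the document-analysis response; Gemini output is validated against it.
const DocumentAnalysisSchema = z.object({
  holdings: z.array(z.object({
    name: z.string(),
    symbol: z.string().optional(),        // optional: not every row has a ticker
    quantity: z.number(),
    price: z.number(),
    confidence: z.number().min(0).max(1), // extraction confidence, 0 to 1
  })),
  reasoning: z.string(),
});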

Verdict enums, percentages, and consideration arrays are all checked against the schema, so no regex parsing is needed.

The same pattern powers "Ask Before I Buy" — users enter a purchase and Gemini returns a structured verdict, portfolio impact percentage, opportunity cost, and pros/cons in one call.
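
To make the two numeric fields concrete, here is one plausible way to define portfolio impact and opportunity cost. In Penny these values come back from Gemini, so this is only a sketch of the arithmetic under stated assumptions: the $50,000 portfolio, the 7% annual return, the 10-year horizon, and both function names are hypothetical.

```typescript
// Purchase price as a percentage of total portfolio value.
function portfolioImpactPercent(price: number, portfolioValue: number): number {
  if (portfolioValue <= 0) throw new RangeError("portfolioValue must be positive");
  return (price * 100) / portfolioValue;
}

// Growth forgone by spending `price` now instead of investing it,
// assuming annual compounding at a constant rate.
function opportunityCost(price: number, annualReturn: number, years: number): number {
  return price * Math.pow(1 + annualReturn, years) - price;
}

// The $3,000 vacation from the screenshots, against an assumed $50,000 portfolio.
const impact = portfolioImpactPercent(3000, 50000);
const forgone = opportunityCost(3000, 0.07, 10);
```

Under these assumptions the purchase is 6% of the portfolio and the forgone growth is roughly $2,900 over ten years.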

4. Autonomous Marathon Agent

The AI runs without user prompts:

Monitors portfolios via expo-background-fetch
Detects allocation drift from user goals
Sends proactive push notifications with Gemini-generated messages
Learns from user response patterns to adjust intervention timing
Logs every decision in a transparent Activity Feed
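
The drift check in the list above could be defined as the largest gap between current and target weights across asset classes. This is an assumed definition, since the writeup does not show how drift is computed; `Position`, `Allocation`, `currentAllocation`, and `maxDrift` are hypothetical names.

```typescript
// Asset class -> portfolio weight as a fraction (weights sum to 1).
type Allocation = Record<string, number>;

interface Position {
  assetClass: string;
  marketValue: number;
}

// Convert positions into weights by asset class.
function currentAllocation(positions: Position[]): Allocation {
  const total = positions.reduce((sum, p) => sum + p.marketValue, 0);
  const weights: Allocation = {};
  if (total <= 0) return weights;
  for (const p of positions) {
    weights[p.assetClass] = (weights[p.assetClass] ?? 0) + p.marketValue / total;
  }
  return weights;
}

// Largest absolute difference between current and target weight.
function maxDrift(current: Allocation, target: Allocation): number {
  const classes = Object.keys(current).concat(Object.keys(target));
  let worst = 0;
  for (const c of classes) {
    worst = Math.max(worst, Math.abs((current[c] ?? 0) - (target[c] ?? 0)));
  }
  return worst;
}

// Made-up example: 70/30 stocks/bonds against a 60/40 target.
const weightsNow = currentAllocation([
  { assetClass: "stocks", marketValue: 7000 },
  { assetClass: "bonds", marketValue: 3000 },
]);
const driftNow = maxDrift(weightsNow, { stocks: 0.6, bonds: 0.4 });
```

In this example the drift is about 0.1 (ten percentage points), which the agent would compare against its alert threshold.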

5. Real-Time Voice Coaching

Streaming Gemini responses feed directly into text-to-speech. Users speak to the app and hear portfolio coaching as it generates — no waiting for the full response.
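
A common way to connect a token stream to text-to-speech is to buffer chunks and hand off each sentence as soon as it is complete. The class below is a sketch of that idea, not Penny's implementation; `SentenceBuffer` is a hypothetical name and the sentence-boundary rule is deliberately naive.

```typescript
// Buffers streamed text and emits whole sentences as soon as they complete.
class SentenceBuffer {
  private pending = "";
  private onSentence: (sentence: string) => void;

  constructor(onSentence: (sentence: string) => void) {
    this.onSentence = onSentence;
  }

  // Add a streamed chunk; emit every sentence that is now complete.
  push(chunk: string): void {
    this.pending += chunk;
    // A sentence ends at '.', '!' or '?' followed by whitespace.
    const boundary = /[.!?]\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + 1; // keep the punctuation mark
      this.onSentence(this.pending.slice(start, end).trim());
      start = boundary.lastIndex;
    }
    this.pending = this.pending.slice(start);
  }

  // Emit whatever is left once the stream ends.
  flush(): void {
    const rest = this.pending.trim();
    if (rest) this.onSentence(rest);
    this.pending = "";
  }
}

// Usage with made-up chunks; a real app would pass sentences to its TTS engine.
const spoken: string[] = [];
const buffer = new SentenceBuffer((s) => { spoken.push(s); });
for (const chunk of ["Your portfolio is up 2", "% today. Consider re", "balancing. Nice work"]) {
  buffer.push(chunk);
}
buffer.flush();
```

Splitting on punctuation followed by whitespace leaves decimals such as 2.5 intact, but abbreviations like "e.g." would still split early, so a production version would need a smarter boundary check.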

How I built it

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         PENNY APP                           │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Camera    │  │ Voice Input │  │  Background Agent   │  │
│  │  (Vision)   │  │  (Audio)    │  │   (Autonomous)      │  │
│  └──────┬──────┘  └──────┬──────┘  └──────────┬──────────┘  │
│         │                │                     │             │
│         ▼                ▼                     ▼             │
│  ┌────────────────────────────────────────────────────────┐ │
│  │              GEMINI 3 INTEGRATION LAYER                │ │
│  │  ┌──────────────────────────────────────────────────┐  │ │
│  │  │  thinkingLevel: 'minimal'|'low'|'medium'|'high'  │  │ │
│  │  └──────────────────────────────────────────────────┘  │ │
│  │  • Vision API (document/receipt analysis)              │ │
│  │  • Structured Output (Zod schema validation)           │ │
│  │  • Streaming (voice coaching)                          │ │
│  │  • Retry with exponential backoff                      │ │
│  └────────────────────────────────────────────────────────┘ │
│         │                │                     │             │
│         ▼                ▼                     ▼             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │  Portfolio  │  │   Alerts    │  │   Agent Activity    │  │
│  │   Import    │  │   & Tips    │  │       Log           │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Tech Stack

Frontend: React Native + Expo (iOS/Android)
AI: Gemini 3 Flash Preview (gemini-3-flash-preview)
Observability: Opik for LLM tracing
Auth: Firebase Authentication
Storage: AsyncStorage + Firebase
Background Tasks: expo-background-fetch + expo-task-manager

Gemini 3 API Integration

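// generateStructuredWithGemini and streamWithGemini are Penny's own wrappers
// around the Gemini API, not SDK functions.

// Vision + structured output + thinking levels in one call
const result = await generateStructuredWithGemini({
  prompt: documentAnalysisPrompt,
  schema: DocumentAnalysisSchema,
  image: base64Image,
  thinkingLevel: 'high',
  temperature: 0.2, // low temperature for consistent extraction
});

// Streaming for voice responses
await streamWithGemini({
  prompt: coachingPrompt,
  thinkingLevel: 'medium',
  onChunk: (text) => appendToUI(text),   // show text as it arrives
  onComplete: (full) => speakAloud(full), // receives the full response text
});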

Autonomous Agent Loop

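// AgentState and Intervention are illustrative type names for the agent's
// persisted state and its notification history.
async function runAgentLoop(state: AgentState, recentInterventions: Intervention[]) {
  const holdings = await loadPortfolio();
  const goals = await getUserGoals();

  // Compare the current allocation with the user's target allocation
  const drift = calculateDrift(holdings, goals.targetAllocation);
  if (drift > THRESHOLD && shouldIntervene(state)) {
    const message = await generateWithGemini({
      prompt: `Portfolio drifted: ${drift}. Write an encouraging notification.`,
      thinkingLevel: 'minimal',
    });
    await sendPushNotification('Portfolio Drift', message);
    // Record the decision so it appears in the Activity Feed
    await logIntervention({ type: 'drift_alert', message });
  }

  // Agent learns: if user ignores alerts, back off
  state.userResponseRate = calculateResponseRate(recentInterventions);
}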
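
The back-off behavior in the loop's final comment could work along these lines. The writeup does not show `shouldIntervene` or `calculateResponseRate`, so everything here is an assumption: the field names, the 24-hour base cooldown, and the rule that a lower response rate stretches the cooldown up to 8x.

```typescript
// Hypothetical record of one notification and whether the user acted on it.
interface Intervention {
  sentAt: number; // epoch milliseconds
  respondedTo: boolean;
}

// Hypothetical persisted agent state.
interface AgentState {
  userResponseRate: number; // 0..1
  lastInterventionAt: number | null; // epoch milliseconds
}

const BASE_COOLDOWN_MS = 24 * 60 * 60 * 1000; // assumed: at most one alert per day

// Fraction of recent notifications the user responded to.
function calculateResponseRate(recent: Intervention[]): number {
  if (recent.length === 0) return 1; // no history yet: assume the user is receptive
  const responded = recent.filter((i) => i.respondedTo).length;
  return responded / recent.length;
}

// Lower response rate -> longer wait between alerts, capped at 8x the base.
function cooldownMs(responseRate: number): number {
  const clamped = Math.min(1, Math.max(0.125, responseRate));
  return BASE_COOLDOWN_MS / clamped;
}

function shouldIntervene(state: AgentState, now: number = Date.now()): boolean {
  if (state.lastInterventionAt === null) return true;
  return now - state.lastInterventionAt >= cooldownMs(state.userResponseRate);
}
```

For example, a user who responds to one alert in four would hear from the agent at most once every four days under these assumptions.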

Challenges I ran into

Thinking level trade-offs — high improves document extraction but adds latency. I benchmarked each task and picked levels that balance quality and speed.
Structured output reliability — Gemini sometimes returns malformed JSON. I added auto-correction (case normalization, array parsing) before Zod validation.
Autonomous trust — Users distrust AI acting without permission. The Activity Log shows every agent decision with reasoning, which became the main trust mechanism.
Multimodal prompt tuning — Document extraction needed iteration. Adding "extract EVERY holding, even if partially visible" improved recall significantly.
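
The auto-correction step mentioned under structured output reliability might look like the sketch below, which normalizes enum casing and coerces list-like values into string arrays before schema validation. The helper names and the verdict values are hypothetical; only "Think Twice" appears in the writeup.

```typescript
// Map loosely formatted text ("Think Twice", "think-twice") to an allowed enum value.
function normalizeEnum<T extends string>(value: unknown, allowed: readonly T[]): T | undefined {
  if (typeof value !== "string") return undefined;
  const key = value.trim().toLowerCase().replace(/[\s-]+/g, "_");
  return allowed.find((a) => a === key);
}

// Accept a real array, a JSON-encoded array, or a bulleted/semicolon-separated string.
function coerceStringArray(value: unknown): string[] {
  if (Array.isArray(value)) {
    return value.map((v) => String(v).trim()).filter((v) => v.length > 0);
  }
  if (typeof value !== "string") return [];
  const text = value.trim();
  if (text.charAt(0) === "[") {
    try {
      const parsed: unknown = JSON.parse(text);
      if (Array.isArray(parsed)) return coerceStringArray(parsed);
    } catch {
      // Not valid JSON; fall through to delimiter splitting.
    }
  }
  return text
    .split(/\n|;/)
    .map((s) => s.replace(/^[-*•]\s*/, "").trim())
    .filter((s) => s.length > 0);
}

// Hypothetical verdict values for illustration.
const VERDICTS = ["go_ahead", "think_twice", "skip"] as const;
const verdict = normalizeEnum("Think Twice", VERDICTS);
const considerations = coerceStringArray("- Trip is 6% of portfolio\n- Emergency fund is intact");
```

Running normalization first keeps the schema strict, so values that cannot be repaired still fail validation instead of slipping through.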

Accomplishments

Typical AI App	Penny
Text in → Text out	Camera → Structured Data → Portfolio Import
Single reasoning mode	4 thinking levels matched to task
Reactive (waits for input)	Proactive (autonomous monitoring)
Generic responses	Type-safe JSON with confidence scores
Opaque	Transparent agent activity log

Production Details

Exponential backoff retry logic
Response caching (5 min TTL)
LLM observability via Opik
Parallel API calls
Demo mode for judge access
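
The first two items can be sketched generically as follows. This is not Penny's code: the 500 ms base delay, the 8 s cap, the four-attempt limit, and the names `backoffDelayMs`, `withRetry`, and `TtlCache` are assumptions. Only the 5-minute TTL comes from the list above.

```typescript
// Delay before retrying after failed attempt `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs: number = 500, maxMs: number = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Run `fn`, retrying with exponential backoff; rethrow the last error if all attempts fail.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts: number = 4,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

// In-memory cache whose entries expire after a fixed time-to-live.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number = 5 * 60 * 1000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for tests
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // drop the stale entry
      return undefined;
    }
    return entry.value;
  }
}
```

A call such as `withRetry(() => fetchInsights())` (with `fetchInsights` standing in for any Gemini request) would wait 500 ms, 1 s, then 2 s between its four attempts. Adding random jitter to the delay is a common refinement when many clients retry at once.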

What I learned

Thinking levels change architecture — Variable reasoning depth lets you match compute to task complexity instead of using one mode for everything.
Multimodal is underused — Most projects treat Gemini as text-only. Vision + structured output enables scan-to-import flows that feel magical.
Autonomous AI needs transparency — The Activity Log started as a debug tool. It became the feature that makes users trust the agent.
Structured output beats text parsing — For action-oriented AI, typed JSON is more reliable than parsing prose.

What's next for Penny

Gemini Live API — Full duplex voice for conversational coaching
Multi-agent architecture — Separate agents for spending, investing, and risk
Video analysis — Process portfolio review recordings and earnings calls

Built With

elevenlabs
expo.io
firebase
gemini-3
opik
react-native
tailwind-css
typescript

Updates

Hulya Masharipov started this project — Feb 07, 2026 12:25 PM EST
