Inspiration

Every knowledge-worker vertical wakes up to the same problem: fragmented sources, delayed information, unstructured data. An M&A banker scrapes SEC EDGAR + press wires by hand. A litigator stitches together SCOTUSblog + court dockets + DOJ press. A VC associate reads TechCrunch + Crunchbase + a dozen newsletters. Each spends 2–4 hours every morning rebuilding the same picture. Tier-1 tools (CapIQ, PitchBook, Westlaw) cost $20K+/seat per vertical and don't speak to each other.

The insight: this is one workflow problem, not three. We wrote a persona mapping — five personas × four industries — and saw that the architecture should be the product, with the industry being a configuration knob.

What it does

Dealflow is the morning intelligence platform for deal professionals, litigators, and venture investors. One ingestion pipeline, three verticals.

  • M&A. 57 sources — SEC EDGAR (10 form types), press wires, law-firm tombstones, PE/IB news, AI search. XBRL financial enrichment, EV multiples, 3-year stock chart vs. S&P 500, DCF, and a 1,000-trial Monte Carlo.
  • Legal. 6 sources — SCOTUSblog, Justia, ABA, Reuters Legal, DOJ press, UK CMA. Inline IRAC analysis (Issue · Rule · Application · Conclusion) generated per case, with named precedents and a downstream-impact section.
  • VC. 6 sources — TechCrunch, Crunchbase, VentureBeat, Axios Pro Rata, plus AI-search queries. Lead-investor mapping, post-money tracking, comparable rounds, plausible-exit candidates.

A header pill switches between the three. Underneath, one ClickHouse instance, one scheduler, and one FastAPI service power all three.

How we built it

| Layer | Stack |
| --- | --- |
| Frontend | Next.js 16, React 19, Tailwind, Recharts, TanStack Table |
| Backend | FastAPI, GPT-4o for extraction + briefs + IRAC, APScheduler for daily 7am UTC crons |
| Crawler | One GenericSpider dispatching to 6 parsers (HTML, RSS, EDGAR JSON, NewsAPI, Serper, Nimble) |
| Stores | SQLite (system of record), Chroma (RAG), ClickHouse Cloud (analytics + event firehose) |
| Sponsors | Nimble (replaces Tavily for web search + chat grounding), ClickHouse (deals mirror + pipeline runs + source-yield events + API request log), Senso (GEO + brand layer) |
| Deploy | Railway (API + scheduler + persistent disk), Vercel (frontend), ClickHouse Cloud |
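
The "one spider, many parsers" row can be pictured as a small dispatch table. This is only a sketch of the idea, not the project's GenericSpider: the `PARSERS` registry, the `register` decorator, the `dispatch` function, and the JSON payload shape are all assumptions made up for illustration, and only two of the six parser kinds are shown.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Hypothetical parser signature: raw payload text in, list of record dicts out.
Parser = Callable[[str], List[dict]]

# Hypothetical registry keyed by source kind ("rss", "json_list", ...).
PARSERS: Dict[str, Parser] = {}


def register(kind: str) -> Callable[[Parser], Parser]:
    """Register a parser function under a source kind."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[kind] = fn
        return fn
    return wrap


@register("rss")
def parse_rss(payload: str) -> List[dict]:
    # Pull the <title> of every <item> in an RSS feed.
    root = ET.fromstring(payload)
    return [{"title": item.findtext("title", default="")}
            for item in root.iter("item")]


@register("json_list")
def parse_json_list(payload: str) -> List[dict]:
    # Assumed shape: a JSON array of objects that each carry a "title" key.
    return [{"title": obj.get("title", "")} for obj in json.loads(payload)]


def dispatch(kind: str, payload: str) -> List[dict]:
    """Route a fetched payload to the parser registered for its kind."""
    try:
        parser = PARSERS[kind]
    except KeyError:
        raise ValueError(f"no parser registered for source kind {kind!r}")
    return parser(payload)
```

With this shape, supporting a new source type means registering one more function; the spider's fetch loop does not change.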

The key abstraction is industry_config.py: one registry mapping each vertical to a CSV path, a stage list (M&A: rumor → … → completed; Legal: filed → … → closed; VC: sourced → … → exited), an extraction prompt set, and a UI label set. The DB columns stay generic (acquirer, target, deal_value) and are relabeled per industry at the UI layer. About 80% of the runtime is shared; only the prompts, sources, and labels fork.
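
A registry of this kind might look roughly like the sketch below. It is not the actual industry_config.py: the `IndustryConfig` class, the CSV paths, the prompt-set names, and the exact label strings are assumptions, and the stage tuples hold only the first and last stage because the intermediate stages are not listed here.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class IndustryConfig:
    sources_csv: str          # path to the source list (hypothetical paths below)
    stages: Tuple[str, ...]   # pipeline stages, in order
    prompt_set: str           # name of the extraction prompt set (hypothetical)
    labels: Dict[str, str]    # generic DB column -> display label


# Only the first and last stage of each pipeline appear; the intermediate
# stages are deliberately omitted rather than guessed.
INDUSTRIES: Dict[str, IndustryConfig] = {
    "ma": IndustryConfig(
        sources_csv="sources/ma.csv",
        stages=("rumor", "completed"),
        prompt_set="ma",
        labels={"acquirer": "Acquirer", "target": "Target"},
    ),
    "legal": IndustryConfig(
        sources_csv="sources/legal.csv",
        stages=("filed", "closed"),
        prompt_set="legal",
        # Assumed mapping of generic columns to legal display labels.
        labels={"acquirer": "Plaintiff", "target": "Defendant"},
    ),
    "vc": IndustryConfig(
        sources_csv="sources/vc.csv",
        stages=("sourced", "exited"),
        prompt_set="vc",
        labels={"acquirer": "Lead investor", "target": "Company"},
    ),
}


def display_label(industry: str, column: str) -> str:
    """Return the UI label for a generic column, falling back to the column name."""
    return INDUSTRIES[industry].labels.get(column, column)
```

Under these assumptions, `display_label("legal", "acquirer")` yields "Plaintiff" while the stored column is still called acquirer, which is the relabel-at-the-UI idea in miniature.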

An LLM orchestration agent picks which M&A sources to crawl each run from yield telemetry. The priority score is

$$\text{priority} = \text{base} \times \big(0.4 \cdot \overline{\text{yield}} + 0.4 \cdot \text{yield}_{\text{last}} + 0.2 \cdot \text{recency}\big)$$

so high-signal sources get pulled more often, and drifting sources get flagged after 3 consecutive zero-yield runs.
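
As a rough illustration, the score and the drift rule can be written in a few lines of Python. The function names, the choice to pass yield history as a list, and the handling of a source with no history are assumptions; the writeup does not state the units of yield or recency, so the usage example below simply uses values in [0, 1].

```python
from typing import Sequence


def priority(base: float, yields: Sequence[float], recency: float) -> float:
    """priority = base * (0.4 * mean yield + 0.4 * last yield + 0.2 * recency)."""
    if yields:
        mean_yield = sum(yields) / len(yields)
        last_yield = yields[-1]
    else:
        # Assumption: a source with no history contributes zero yield terms.
        mean_yield = last_yield = 0.0
    return base * (0.4 * mean_yield + 0.4 * last_yield + 0.2 * recency)


def is_drifting(yields: Sequence[float], window: int = 3) -> bool:
    """Flag a source whose last `window` runs all yielded nothing."""
    return len(yields) >= window and all(y == 0 for y in yields[-window:])


# Example: mean yield 0.75, last yield 1.0, recency 0.5
# -> 1.0 * (0.4 * 0.75 + 0.4 * 1.0 + 0.2 * 0.5), which is about 0.8
score = priority(1.0, [0.5, 1.0], 0.5)
```

Because the last run carries the same weight as the whole history, a single strong or empty run moves the score quickly, which fits the goal of reacting to sources that go quiet.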

Challenges we ran into

  • The M&A → multi-industry pivot mid-hackathon. Required adding an industry column to SQLite and ClickHouse, namespacing dedup keys to prevent cross-vertical collisions, gating the SEC XBRL enricher to M&A only, and routing the LLM-orchestration agent (which is M&A-coded) to static-spider mode for the other verticals. ~3,200 net lines across 30+ files, additive only — the existing 121 M&A records were auto-tagged industry='ma' by a DEFAULT 'ma' migration and nothing regressed.
  • NEXT_PUBLIC_API_URL baked in with the wrong Railway service URL. The frontend was pointed at a dead service for a stretch; the surface symptom was every dashboard returning 404 on data load. Because Next.js inlines NEXT_PUBLIC_ variables at build time, the fix needed both a Vercel env update and a forced rebuild.
  • Vercel's GitHub auto-deploy + Railway's GitHub creds both broke at different times and had to be manually re-bootstrapped via the CLI.
  • A router.push('/?…') bug in DealFilters sent every filter change back to the marketing landing page from any dashboard. Fix was a one-line usePathname() swap, but it took clicking 30 filters to spot.
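  • Python 3.9 doesn't support X | None at runtime. from __future__ import annotations only defers evaluation of annotations, so the union still fails wherever it is actually evaluated: in a default value or type alias, or when a library such as Pydantic resolves the hints. Cost us one cycle of stack traces.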
  • iCloud-synced project dir kept creating Finder duplicates (WorkflowDiagram 2.tsx) that broke the Next.js typecheck. Cleaned up with one find + rm.
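
The additive migration and namespaced dedup keys from the first bullet can be sketched as follows. The `deals` table name, the key recipe (lowercased fields hashed with SHA-256), and both function names are assumptions for illustration, not the project's actual schema or code.

```python
import hashlib
import sqlite3


def add_industry_column(conn: sqlite3.Connection) -> None:
    """Additive, idempotent migration: existing rows read back as 'ma'."""
    cols = {row[1] for row in conn.execute("PRAGMA table_info(deals)")}
    if "industry" not in cols:
        conn.execute(
            "ALTER TABLE deals ADD COLUMN industry TEXT NOT NULL DEFAULT 'ma'"
        )


def dedup_key(industry: str, acquirer: str, target: str) -> str:
    """Prefix the key material with the industry so verticals cannot collide."""
    raw = "|".join(part.strip().lower() for part in (industry, acquirer, target))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Usage against a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deals (id INTEGER PRIMARY KEY, acquirer TEXT, target TEXT)"
)
conn.execute("INSERT INTO deals (acquirer, target) VALUES ('A', 'B')")
add_industry_column(conn)
rows = conn.execute("SELECT acquirer, target, industry FROM deals").fetchall()
```

Since the column is added with a default, no existing row has to be rewritten by hand, and the same pair of names in two verticals hashes to two different keys.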

What we learned

  1. Prompts are configuration, not code. The cost of adding a second vertical was a CSV + one prompt set, not a fork of the codebase. The third vertical took two hours total. A fourth (CRE) should cost about a day.
  2. Right tool for the right write. SQLite for the typed system-of-record, Chroma for retrieval, ClickHouse for the firehose. Trying to make any one store do all three is a category error.
  3. Generic data, industry-aware UI. We never renamed the acquirer column in the database — Plaintiff / Defendant / Lead investor are display labels sourced from a registry. This kept the schema migration to a single additive column.
  4. The architecture is the demo. Showing the same engine power three visibly different products beats any architecture slide.

What's next

  • Fourth vertical: Commercial Real Estate — REIT filings + property listings + CMBS data. ~1 day of work.
  • Per-vertical enrichment partners — PACER for legal court dockets; Crunchbase API for VC cap-table depth.
  • Watchlists + email alerts — schema exists; SMTP wiring is the gap.
  • Self-serve source admin UI for adding CSV sources instead of file edits.
