Project Submission: YC Model Content Protocol (MCP)

Overview

The YC Model Content Protocol (MCP) is a structured interface for extracting and querying startup information from the Y Combinator website. It enables AI assistants to retrieve semantically rich, normalized data from YC startup profiles, making the site machine-readable and queryable without manual scraping or brittle parsing.

What It Does

MCP allows AI systems to:
    •    Extract structured data from YC startup pages
    •    Query specific content like founder bios, product descriptions, funding details, or tags
    •    Interpret page structure using semantic cues and hierarchical content modeling
    •    Receive updated content when startup profiles change

Why We Built It

AI assistants need reliable access to structured data to understand and reason about startups. The YC website provides high-value information, but it's designed for human readers. MCP bridges that gap by turning YC content into a clean, machine-readable format that is purpose-built for retrieval, summarization, and analysis by AI systems.

How It Works

  1. Content Extraction
     A modular scraping layer identifies and retrieves consistent content blocks across YC startup profiles, including:
         •    Descriptions
         •    Founders and team info
         •    Tags (industry, location, stage)
         •    Metadata (batch, status, URL)
  2. Transformation Engine
     The raw HTML is converted into a structured, assistant-ready format:
         •    Layout normalization
         •    Semantic labeling of content (e.g., pitch, traction, team)
         •    Hierarchical representation for context-aware queries
  3. Query API
     The resulting content is exposed through a simple interface that lets assistants:
         •    Request content by URL
         •    Query specific fields or sections
         •    Detect updates and changes to existing profiles
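
The three stages above can be sketched end to end in a few dozen lines. This is only an illustration under stated assumptions, not the project's actual code: the `data-field` attribute, the record layout, the dotted query paths, and the sample page (a made-up company) are all hypothetical, and real YC pages do not necessarily carry such markup.

```python
from html.parser import HTMLParser

# Elements that never receive a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class BlockExtractor(HTMLParser):
    """Stage 1: collect text from elements marked with a hypothetical data-field attribute."""

    def __init__(self):
        super().__init__()
        self.blocks = {}   # field name -> list of text fragments
        self._stack = []   # field name (or None) for each currently open element

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(dict(attrs).get("data-field"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Attribute the text to the innermost enclosing element that names a field.
        field = next((f for f in reversed(self._stack) if f), None)
        if text and field:
            self.blocks.setdefault(field, []).append(text)


def transform(blocks, url):
    """Stage 2: turn flat text blocks into a labeled, hierarchical record."""
    return {
        "url": url,
        "company": {
            "name": " ".join(blocks.get("name", [])),
            "description": " ".join(blocks.get("description", [])),
        },
        "team": {"founders": blocks.get("founder", [])},
        "tags": blocks.get("tag", []),
        "metadata": {"batch": " ".join(blocks.get("batch", []))},
    }


def query(record, path):
    """Stage 3: resolve a dotted path such as 'team.founders'; None if absent."""
    node = record
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def extract_profile(html, url):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return transform(parser.blocks, url)


# Made-up sample page; the markup and values are illustrative only.
SAMPLE_HTML = """
<div>
  <h1 data-field="name">ExampleCo</h1>
  <p data-field="description">Tools for <b>robot</b> fleets.</p>
  <ul>
    <li data-field="founder">Ada Example</li>
    <li data-field="founder">Grace Sample</li>
  </ul>
  <span data-field="tag">B2B</span><span data-field="tag">Robotics</span>
  <span data-field="batch">W00</span>
</div>
"""

record = extract_profile(SAMPLE_HTML, "https://example.com/companies/exampleco")
print(query(record, "team.founders"))
```

Keeping extraction, transformation, and querying as separate functions means a layout change should only touch the first stage, while assistants keep using the same query paths.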

Challenges

    •    Layout variability: While YC pages follow a general structure, edge cases required adaptive parsing and fallbacks
    •    Implicit semantics: Many content blocks lacked labels, so we implemented heuristics based on content type, order, and structure
    •    Performance: To support fast queries, we implemented lightweight caching and partial refresh strategies
    •    Context preservation: Relationships between content elements are preserved in a hierarchical model for better querying
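
One way the caching and partial refresh mentioned above could work is sketched below. It is a guess at the general idea rather than the project's implementation: the `ProfileCache` class, its TTL default, and the per-section hashing scheme are assumptions introduced for illustration.

```python
import hashlib
import json
import time


class ProfileCache:
    """Hypothetical per-URL cache that only rewrites sections whose content changed."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the expiry logic can be tested
        self._entries = {}          # url -> {"sections", "hashes", "fetched_at"}

    @staticmethod
    def _digest(value):
        # Serialize deterministically so equal content always hashes the same.
        payload = json.dumps(value, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, url):
        """Return cached sections, or None if missing or older than the TTL."""
        entry = self._entries.get(url)
        if entry is None or self.clock() - entry["fetched_at"] > self.ttl:
            return None
        return entry["sections"]

    def refresh(self, url, fresh_sections):
        """Merge freshly parsed sections and return the names that changed."""
        entry = self._entries.setdefault(
            url, {"sections": {}, "hashes": {}, "fetched_at": 0.0}
        )
        changed = []
        for name, value in fresh_sections.items():
            digest = self._digest(value)
            if entry["hashes"].get(name) != digest:
                entry["sections"][name] = value
                entry["hashes"][name] = digest
                changed.append(name)
        # Sections that disappeared from the page also count as changes.
        for name in list(entry["sections"]):
            if name not in fresh_sections:
                del entry["sections"][name]
                del entry["hashes"][name]
                changed.append(name)
        entry["fetched_at"] = self.clock()
        return changed
```

The list returned by `refresh` tells a caller which sections need to be re-labeled or re-indexed, so an unchanged profile costs only a few hash comparisons.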

What’s Next

    •    Change tracking: Versioning and diffing of startup data over time
    •    Interactive content support: Parsing of structured subcomponents like embedded lists, badges, or expandable sections
    •    User-context adaptation: Filtering or prioritizing content based on assistant use-case (e.g., investor vs. applicant queries)
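
Since change tracking is still planned, the following is only a sketch of what versioning and diffing could look like, assuming profiles are stored as nested dictionaries. The names `ProfileHistory`, `diff_records`, and `flatten`, as well as the snapshot contents, are hypothetical.

```python
import copy

_MISSING = object()  # sentinel distinguishing "absent" from a stored None


def flatten(record, prefix=""):
    """Map a nested dict to {'a.b.c': value} so fields can be compared by path."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_records(old, new):
    """Report fields that were added, removed, or changed between two snapshots."""
    old_flat, new_flat = flatten(old), flatten(new)
    diff = {"added": {}, "removed": {}, "changed": {}}
    for path in sorted(old_flat.keys() | new_flat.keys()):
        before = old_flat.get(path, _MISSING)
        after = new_flat.get(path, _MISSING)
        if before is _MISSING:
            diff["added"][path] = after
        elif after is _MISSING:
            diff["removed"][path] = before
        elif before != after:
            diff["changed"][path] = (before, after)
    return diff


class ProfileHistory:
    """Keeps labeled snapshots of one profile, oldest first."""

    def __init__(self):
        self._versions = []  # list of (label, snapshot)

    def record(self, label, snapshot):
        # Deep-copy so later mutation by the caller cannot rewrite history.
        self._versions.append((label, copy.deepcopy(snapshot)))

    def latest_diff(self):
        """Diff the two most recent snapshots, or None if fewer than two exist."""
        if len(self._versions) < 2:
            return None
        return diff_records(self._versions[-2][1], self._versions[-1][1])


# Made-up snapshots for illustration.
history = ProfileHistory()
history.record("week-1", {"company": {"pitch": "Robots for farms"}, "tags": ["B2B"]})
history.record(
    "week-2",
    {"company": {"pitch": "Robots for warehouses", "status": "Active"}, "tags": ["B2B"]},
)
print(history.latest_diff()["changed"])
```

A path-keyed diff like this would let an assistant check a single field, such as the pitch, without comparing whole pages.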

Impact

MCP makes the YC website programmatically accessible without requiring manual scraping or brittle DOM logic. Assistants can now answer questions like:
    •    “Who founded this startup and what’s their background?”
    •    “Which YC companies in the current batch are focused on AI?”
    •    “Has this startup changed its pitch since last week?”

By turning unstructured pages into structured, protocol-compliant content, MCP unlocks the YC dataset for intelligent systems.
