Project Submission: YC Model Content Protocol (MCP)
Overview
The YC Model Content Protocol (MCP) is a structured interface for extracting and querying startup information from the Y Combinator website. It enables AI assistants to retrieve semantically rich, normalized data from YC startup profiles, making the site machine-readable and queryable without manual scraping or brittle parsing.
What It Does
MCP allows AI systems to:
• Extract structured data from YC startup pages
• Query specific content like founder bios, product descriptions, funding details, or tags
• Interpret page structure using semantic cues and hierarchical content modeling
• Receive updated content when startup profiles change
Why We Built It
AI assistants need reliable access to structured data to understand and reason about startups. The YC website provides high-value information, but it's designed for human readers. MCP bridges that gap by turning YC content into a clean, machine-readable format that is purpose-built for retrieval, summarization, and analysis by AI systems.
How It Works
- Content Extraction
A modular scraping layer identifies and retrieves consistent content blocks across YC startup profiles, including:
• Descriptions
• Founders and team info
• Tags (industry, location, stage)
• Metadata (batch, status, URL)
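As a rough illustration of the extraction step, the sketch below pulls a few fields out of a profile page using only Python's standard `html.parser`. The CSS class names, the `ProfileExtractor` class, and the sample HTML are all hypothetical; the real YC markup and the project's actual scraping layer are not shown in this writeup and may look quite different.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class names to profile fields.
# Real YC markup may differ; these names are assumptions for illustration.
FIELD_CLASSES = {
    "company-description": "description",
    "founder-name": "founders",
    "tag": "tags",
    "batch": "batch",
}
MULTI_VALUE_FIELDS = {"founders", "tags"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # no closing tag


class ProfileExtractor(HTMLParser):
    """Collects the direct text of elements whose class maps to a known field."""

    def __init__(self):
        super().__init__()
        self.profile = {"description": None, "founders": [], "tags": [], "batch": None}
        self._stack = []  # one entry per open element: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((FIELD_CLASSES[c] for c in classes if c in FIELD_CLASSES), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        field = self._stack[-1] if self._stack else None
        if not text or field is None:
            return
        if field in MULTI_VALUE_FIELDS:
            self.profile[field].append(text)
        else:
            self.profile[field] = text


def extract_profile(html):
    parser = ProfileExtractor()
    parser.feed(html)
    parser.close()
    return parser.profile


# Made-up sample page, not real YC content.
SAMPLE_HTML = """
<div class="profile">
  <span class="batch">W24</span>
  <p class="company-description">Example Co builds tools for developers.</p>
  <div class="founder-name">Ada Example</div>
  <div class="founder-name">Grace Sample</div>
  <span class="tag">B2B</span>
  <span class="tag">San Francisco</span>
</div>
"""

profile = extract_profile(SAMPLE_HTML)
```

This sketch only captures text directly inside a matching element, so nested inline markup would need extra handling. A production extractor would likely also need the fallbacks described under Challenges.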
- Transformation Engine
The raw HTML is converted into a structured, assistant-ready format:
• Layout normalization
• Semantic labeling of content (e.g., pitch, traction, team)
• Hierarchical representation for context-aware queries
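The transformation step might look something like the following sketch, which normalizes whitespace, assigns a semantic label to each block with a simple keyword heuristic, and stores the result as a small tree. The `Node` type, the keyword lists, and the rule that the first block is the pitch are assumptions made for illustration, not the project's actual rules.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One labeled content block; children preserve the page hierarchy."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# Hypothetical keyword cues for each semantic label.
LABEL_KEYWORDS = {
    "traction": ("revenue", "users", "customers", "growth"),
    "team": ("founder", "ceo", "cto", "previously"),
}


def label_block(text, position):
    """Guess a label from block order first, then from keyword cues."""
    if position == 0:
        return "pitch"  # assumption: the first block on the page is the pitch
    words = set(re.findall(r"[a-z]+", text.lower()))
    for label, keywords in LABEL_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return label
    return "other"


def build_tree(company, raw_blocks):
    """Normalize each raw block and attach it under a single profile root."""
    root = Node(label="profile", text=company)
    for position, raw in enumerate(raw_blocks):
        text = " ".join(raw.split())  # layout normalization: collapse whitespace
        if text:
            root.children.append(Node(label=label_block(text, position), text=text))
    return root


# Made-up blocks standing in for extracted page content.
tree = build_tree("Example Co", [
    "Example Co builds   tools\nfor developers.",
    "Over 500 customers in the first year.",
    "Ada Example, founder, previously built compilers.",
])
labels = [child.label for child in tree.children]
```

Keeping the labeled blocks as children of one root is one simple way to preserve context: a query for "team" can still reach the company name through the parent node.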
- Query API
The resulting content is exposed through a simple interface that lets assistants:
• Request content by URL
• Query specific fields or sections
• Detect updates and changes to existing profiles
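To make the three query operations concrete, here is a minimal in-memory sketch. The `ProfileStore` class, its method names, the dotted-path syntax, and the placeholder URL are all hypothetical; the writeup does not specify the real interface. Change detection is shown here as a content hash comparison, which is one common approach and not necessarily the one MCP uses.

```python
import hashlib
import json


class ProfileStore:
    """In-memory stand-in for the query interface; names are hypothetical."""

    def __init__(self):
        self._profiles = {}

    def put(self, url, profile):
        self._profiles[url] = profile

    def get(self, url):
        """Request the full structured profile by URL."""
        return self._profiles.get(url)

    def query(self, url, path):
        """Resolve a dotted path such as 'founders.0.name'; None if absent."""
        current = self._profiles.get(url)
        for key in path.split("."):
            if isinstance(current, dict):
                current = current.get(key)
            elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
                current = current[int(key)]
            else:
                return None
        return current

    def fingerprint(self, url):
        """Stable hash of a profile, used to detect changes between fetches."""
        profile = self._profiles.get(url)
        if profile is None:
            return None
        canonical = json.dumps(profile, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


EXAMPLE_URL = "https://example.com/companies/example-co"  # placeholder URL
store = ProfileStore()
store.put(EXAMPLE_URL, {
    "description": "Example Co builds tools for developers.",
    "founders": [{"name": "Ada Example", "role": "CEO"}],
    "tags": ["B2B"],
})

founder_name = store.query(EXAMPLE_URL, "founders.0.name")
before = store.fingerprint(EXAMPLE_URL)
store.put(EXAMPLE_URL, {**store.get(EXAMPLE_URL), "description": "Example Co builds AI tools."})
after = store.fingerprint(EXAMPLE_URL)
profile_changed = before != after
```

Serializing with `sort_keys=True` keeps the fingerprint independent of dictionary ordering, so only real content changes alter the hash.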
Challenges
• Layout variability: While YC pages follow a general structure, edge cases required adaptive parsing and fallbacks
• Implicit semantics: Many content blocks lacked labels, so we implemented heuristics based on content type, order, and structure
• Performance: To support fast queries, we implemented lightweight caching and partial refresh strategies
• Context preservation: Relationships between content elements are preserved in a hierarchical model for better querying
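The caching and partial refresh idea from the Performance item could be sketched as a per-section cache with a time-to-live, so that one stale section is refetched without touching the rest of the profile. The `SectionCache` class, the `fetch` callback signature, and the 60-second TTL are assumptions for illustration only; a fake clock is injected so the behavior can be shown without waiting.

```python
import time


class SectionCache:
    """Caches each (url, section) pair separately so refreshes can be partial."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # (url, section) -> (fetched_at, value)

    def get(self, url, section, fetch):
        """Return a fresh cached value, or call fetch(url, section) and store it."""
        key = (url, section)
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = fetch(url, section)
        self._entries[key] = (now, value)
        return value


# Demonstration with a fake clock and a fake fetcher that records its calls.
fake_now = [0.0]
fetch_calls = []


def fake_fetch(url, section):
    fetch_calls.append(section)
    return f"{section} v{len(fetch_calls)}"


PAGE_URL = "https://example.com/companies/example-co"  # placeholder URL
cache = SectionCache(ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get(PAGE_URL, "description", fake_fetch)   # miss: fetched
second = cache.get(PAGE_URL, "description", fake_fetch)  # hit: served from cache
fake_now[0] = 61.0                                       # advance past the TTL
cache.get(PAGE_URL, "founders", fake_fetch)              # miss: section not cached yet
third = cache.get(PAGE_URL, "description", fake_fetch)   # expired: refetched
```

Keying the cache by section rather than by whole page is what makes the refresh partial: only the expired section triggers a new fetch.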
What’s Next
• Change tracking: Versioning and diffing of startup data over time
• Interactive content support: Parsing of structured subcomponents like embedded lists, badges, or expandable sections
• User-context adaptation: Filtering or prioritizing content based on assistant use-case (e.g., investor vs. applicant queries)
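Since change tracking is still planned work, the following is only a sketch of one way versioning and diffing could be approached: keep a list of snapshots and compare top-level fields between the two most recent ones. The `diff_profiles` function and `ProfileHistory` class are hypothetical names, and the sample profiles are made up.

```python
import copy


def diff_profiles(old, new):
    """Return {field: {'before': ..., 'after': ...}} for fields that differ."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = {"before": before, "after": after}
    return changes


class ProfileHistory:
    """Keeps every recorded snapshot of one profile, oldest first."""

    def __init__(self):
        self._versions = []

    def record(self, profile):
        """Store a snapshot and return the diff against the previous one."""
        previous = self._versions[-1] if self._versions else {}
        self._versions.append(copy.deepcopy(profile))
        return diff_profiles(previous, profile)

    def version_count(self):
        return len(self._versions)


history = ProfileHistory()
first_diff = history.record({"pitch": "Tools for developers", "status": "Active"})
second_diff = history.record({"pitch": "AI tools for developers", "status": "Active"})
```

A diff like `second_diff` is the kind of result that would let an assistant answer whether a startup has changed its pitch since an earlier snapshot. Nested fields would need a recursive comparison, which this sketch leaves out.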
Impact
MCP makes the YC website programmatically accessible without requiring manual scraping or brittle DOM logic. Assistants can now answer questions like:
• “Who founded this startup and what’s their background?”
• “Which YC companies in the current batch are focused on AI?”
• “Has this startup changed its pitch since last week?”
By turning unstructured pages into structured, protocol-compliant content, MCP unlocks the YC dataset for intelligent systems.