Inspiration

Every product team has a MongoDB cluster and a stakeholder asking the same shape of question over and over: how many users are on the pro plan, what's the revenue from shipped orders, who hasn't logged in in 30 days. Each one means writing a fresh aggregation pipeline. What if the stakeholder could ask the database directly in plain English?

What it does

gemini-data-agent treats every data question as a discover-then-query loop. You ask "how many users are on each plan?" and the agent walks the MongoDB MCP tools:

  1. list_databases to see what databases are available on the cluster.
  2. list_collections to inspect the candidate database.
  3. collection_schema to read the actual field types of the target collection (so it doesn't query against fields that don't exist).
  4. count, find, or aggregate, whichever fits the shape of the question: count for "how many", find for "show me the top N", aggregate for "group by".
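
For the example question, step 4 would most plausibly end in an aggregate call. Below is a minimal sketch of such a pipeline, paired with a pure-Python equivalent that shows what the $group stage computes. It assumes the users collection stores the plan in a field named plan; that field name and the sample documents are illustrative, not taken from the project's schema.

```python
from collections import Counter

# Hypothetical field name: the write-up does not say how the plan is stored.
PLAN_FIELD = "plan"

# Aggregation pipeline for "how many users are on each plan?"
pipeline = [
    {"$group": {"_id": f"${PLAN_FIELD}", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]


def group_by_plan(users):
    """In-memory equivalent of the pipeline above, for illustration only."""
    counts = Counter(user[PLAN_FIELD] for user in users)
    # most_common() orders by descending count, like the $sort stage.
    return [{"_id": plan, "count": n} for plan, n in counts.most_common()]


# Made-up sample documents, not the project's canned dataset.
sample_users = [{"plan": "free"}, {"plan": "pro"}, {"plan": "free"}]
result = group_by_plan(sample_users)
```

The same pipeline list is what would be passed as the argument of the aggregate tool call. The in-memory function only exists to make the grouping visible without a database.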

The answer is structured: a one-line direct answer, the exact MongoDB tool + arguments it used, 2-4 evidence bullets with counts copied verbatim from the database response, and one concrete follow-up the analyst could try.
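
One way to picture that answer shape is as a small record type. This is a sketch only: the class name, field names, and tool arguments are hypothetical, and the project may well enforce the structure purely through the system prompt.

```python
from dataclasses import dataclass, field


@dataclass
class AgentAnswer:
    """Hypothetical container mirroring the four parts of an answer."""
    direct_answer: str
    tool_name: str
    tool_arguments: dict
    evidence: list = field(default_factory=list)
    follow_up: str = ""

    def render(self) -> str:
        # The write-up calls for 2-4 evidence bullets.
        if not 2 <= len(self.evidence) <= 4:
            raise ValueError("expected 2-4 evidence bullets")
        lines = [
            self.direct_answer,
            f"Tool: {self.tool_name} {self.tool_arguments}",
        ]
        lines += [f"  • {item}" for item in self.evidence]
        lines.append(f"Follow-up: {self.follow_up}")
        return "\n".join(lines)


answer = AgentAnswer(
    direct_answer="Most users are on the free plan.",
    tool_name="aggregate",
    # Illustrative arguments; the real tool's argument names may differ.
    tool_arguments={"collection": "users", "pipeline": "group by plan"},
    evidence=["free: 269", "starter: 127", "pro: 73", "enterprise: 31"],
    follow_up="How many pro users placed an order this month?",
)
text = answer.render()
```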

How we built it

  • Google Cloud Agent Builder's Agent Development Kit (ADK) for the agent framework. The whole agent fits in six lines of ADK: one LlmAgent, one McpToolset, a Gemini model, and a system prompt that defines the discover-then-query workflow.
  • Gemini 2.5 Flash on Vertex AI for reasoning. Fast enough that a stakeholder's question turns into an answer in a couple of seconds; cheap enough that reviewers can fire as many queries as they want.
  • MongoDB MCP server for tools. The agent talks to the official mongodb-mcp-server (npm) tool surface. A stub MCP server ships in the repo with a small canned e-commerce dataset (500 users, 1200 orders) so reviewers can run the project locally with zero setup. Set MONGODB_CONNECTION_STRING and flip the stub toggle off, and the same agent code targets a real cluster via npx.
  • Streamlit for the dashboard.
  • Cloud Run for hosting.
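
The stub-versus-real switch described above could look roughly like the sketch below. The toggle name USE_STUB_MCP and the script name stub_mcp_server.py are hypothetical; only MONGODB_CONNECTION_STRING comes from the project. Passing the connection string to the server as MDB_MCP_CONNECTION_STRING reflects our reading of the mongodb-mcp-server README and should be checked against the version in use.

```python
import os
import sys


def mcp_server_command(env=None):
    """Return the launch spec for the MCP server the agent should talk to."""
    env = os.environ if env is None else env
    # Hypothetical toggle: any value other than "0" keeps the stub on.
    use_stub = env.get("USE_STUB_MCP", "1") != "0"
    if use_stub:
        # Hypothetical script name for the stub server shipped in the repo.
        return {"command": sys.executable, "args": ["stub_mcp_server.py"], "env": {}}

    conn = env.get("MONGODB_CONNECTION_STRING", "")
    if not conn:
        raise ValueError("MONGODB_CONNECTION_STRING must be set when the stub is off")
    return {
        "command": "npx",
        "args": ["-y", "mongodb-mcp-server"],
        # Assumed variable name read by the server; verify before relying on it.
        "env": {"MDB_MCP_CONNECTION_STRING": conn},
    }


default_spec = mcp_server_command({})
```

Because only the launch spec changes, the agent definition itself would stay identical between the local demo and a real cluster.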

Challenges we ran into

The first version of the system prompt got the agent to call the right tools, but Gemini occasionally hallucinated counts (e.g., "29,200 users on the pro plan" when the canned dataset has 500 users in total). The fix was a sharper system prompt that explicitly instructs the agent to copy numbers verbatim from the tool output and, whenever the number it needs is missing, to call another tool rather than estimate. Subsequent live runs returned 269 free + 127 starter + 73 pro + 31 enterprise = 500 exactly.
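
The prompt fix could be backed by a programmatic check. The sketch below is not part of the project as described; it is a hypothetical guard that flags any number in the answer text that does not appear as a numeric value in the parsed tool output. All function names are invented for illustration.

```python
import re

# Matches integers with or without thousands separators, plus optional decimals.
_NUMBER = re.compile(r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def numbers_in_answer(text):
    """Extract every number mentioned in the answer text."""
    return [float(m.group().replace(",", "")) for m in _NUMBER.finditer(text)]


def numbers_in_tool_output(obj):
    """Collect all numeric values from a parsed JSON tool response."""
    if isinstance(obj, bool):  # bool is a subclass of int; skip it
        return set()
    if isinstance(obj, (int, float)):
        return {float(obj)}
    if isinstance(obj, dict):
        obj = list(obj.values())
    if isinstance(obj, (list, tuple)):
        found = set()
        for item in obj:
            found |= numbers_in_tool_output(item)
        return found
    return set()


def unsupported_numbers(answer, tool_output):
    """Return the numbers in the answer that the tool output does not contain."""
    allowed = numbers_in_tool_output(tool_output)
    return [n for n in numbers_in_answer(answer) if n not in allowed]


# Counts taken from the live run reported above.
tool_output = [
    {"_id": "free", "count": 269},
    {"_id": "starter", "count": 127},
    {"_id": "pro", "count": 73},
    {"_id": "enterprise", "count": 31},
]
good = unsupported_numbers("73 users are on the pro plan.", tool_output)
bad = unsupported_numbers("29,200 users on the pro plan", tool_output)
```

A check this strict would also flag derived figures such as the total of 500, since that sum never appears in the tool response. A real guard would need a rule for totals, or a second tool call that returns them.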

Accomplishments that we're proud of

  • A real Vertex AI Gemini call walked all four tools (list_databases → list_collections → collection_schema → aggregate) in seven events and returned the exact counts.
  • The stub-vs-real split means the demo runs in under 60 seconds locally without a MongoDB cluster.
  • This is the fourth substantively different MCP integration in this hackathon sibling family (Dynatrace, Arize Phoenix, RAG drift, and now MongoDB). All four share the same LlmAgent + McpToolset shape; the MCP protocol carried the abstraction.

What we learned

When the underlying tool calls return ground-truth numbers, the LLM's job is to relay them, not to estimate. A prompt that says "copy numbers verbatim from the tool output, never round, never extrapolate" made the difference in our runs: the reported counts went from hallucinated to an exact match with the dataset.

What's next for gemini-data-agent

  • A "save query" feature so common questions ("monthly active users by plan") run as a one-click button on the dashboard.
  • Multi-collection joins via $lookup aggregation, so the agent can answer questions that span users + orders.
  • Plug in additional partner MCPs (Postgres MCP, Snowflake MCP) and let the same agent target whatever database the team uses.
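
As a rough idea of what the planned $lookup support would have the agent generate, here is a sketch of a pipeline answering "revenue from shipped orders, by plan". It would run against the orders collection. The field names user_id, status, total, and plan are assumptions about the schema, not taken from the project.

```python
def revenue_by_plan_pipeline(status="shipped"):
    """Build a pipeline that joins orders to users and sums revenue per plan."""
    return [
        # Keep only orders in the requested status (hypothetical field name).
        {"$match": {"status": status}},
        # Join each order to its user document.
        {"$lookup": {
            "from": "users",
            "localField": "user_id",   # assumed reference field on orders
            "foreignField": "_id",
            "as": "user",
        }},
        # $lookup yields an array; flatten it to one user per order.
        {"$unwind": "$user"},
        # Sum the order totals per plan (assumed field names).
        {"$group": {"_id": "$user.plan", "revenue": {"$sum": "$total"}}},
    ]


pipeline_stages = [next(iter(stage)) for stage in revenue_by_plan_pipeline()]
```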

Built With

  • agent-development-kit
  • gemini
  • gemini-2.5
  • google-cloud-agent-builder
  • mcp
  • model-context-protocol
  • mongodb
  • mongodb-mcp
  • python
  • streamlit
  • vertex-ai