Inspiration
Every product team has a MongoDB cluster and a stakeholder asking the same shape of question over and over: how many users are on the pro plan, what's the revenue from shipped orders, who hasn't logged in in 30 days. Each one means writing a fresh aggregation pipeline. What if the stakeholder could ask the database directly in plain English?
What it does
gemini-data-agent treats every data question as a discover-then-query loop. You ask "how many users are on each plan?" and the agent walks the MongoDB MCP tools:
- `list_databases` to see what databases are available on the cluster.
- `list_collections` to inspect the candidate database.
- `collection_schema` to read the actual field types of the target collection (so it doesn't query against fields that don't exist).
- The right query shape: `count` for "how many", `find` for "show me the top N", `aggregate` for "group by".
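As a rough illustration of the three query shapes, the sketch below writes out the arguments an agent might pass for each one and emulates the `aggregate` case in plain Python on a tiny in-memory sample. The field names (`plan`, `created_at`), the argument keys, and the sample rows are assumptions made for illustration; they are not the project's schema or the MCP server's exact argument format.

```python
from collections import Counter

# Hypothetical sample documents; the real data lives in MongoDB.
users = [
    {"name": "a", "plan": "free"},
    {"name": "b", "plan": "pro"},
    {"name": "c", "plan": "free"},
]

# "how many" -> count with a filter
count_args = {"collection": "users", "query": {"plan": "pro"}}

# "show me the top N" -> find with a sort and a limit
find_args = {
    "collection": "users",
    "filter": {},
    "sort": {"created_at": -1},
    "limit": 5,
}

# "group by" -> aggregate with a $group stage
aggregate_args = {
    "collection": "users",
    "pipeline": [
        {"$group": {"_id": "$plan", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ],
}


def emulate_group_count(docs, field):
    """Plain-Python equivalent of the $group / $sum / $sort pipeline above."""
    counts = Counter(doc[field] for doc in docs)
    return [{"_id": key, "count": n} for key, n in counts.most_common()]


plan_counts = emulate_group_count(users, "plan")
```

The emulation is only there to show what the `$group` stage computes; in the agent, the database does this work and the model relays the result.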
The answer is structured: a one-line direct answer, the exact MongoDB tool + arguments it used, 2-4 evidence bullets with counts copied verbatim from the database response, and one concrete follow-up the analyst could try.
How we built it
- Google Cloud Agent Builder (ADK) for the agent framework. The whole agent fits in six lines of ADK: one `LlmAgent`, one `McpToolset`, a Gemini model, and a system prompt that defines the discover-then-query workflow.
- Gemini 2.5 Flash on Vertex AI for reasoning. Fast enough that a stakeholder's question turns into an answer in a couple of seconds; cheap enough that reviewers can fire as many queries as they want.
- MongoDB MCP server for tools. The agent talks to the official `mongodb-mcp-server` (npm) tool surface. A stub MCP server ships in the repo with a small canned e-commerce dataset (500 users, 1200 orders) so reviewers can run the project locally with zero setup. Set `MONGODB_CONNECTION_STRING` and flip the stub toggle off, and the same agent code targets a real cluster via `npx`.
- Streamlit for the dashboard.
- Cloud Run for hosting.
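The stub-versus-real switch described in the MongoDB bullet could be expressed as a small function that decides which MCP server process to launch over stdio. This is a minimal sketch under stated assumptions: the toggle name `USE_STUB_MCP`, the stub path `stub_mcp_server.py`, and the forwarding of the connection string through `MDB_MCP_CONNECTION_STRING` are all hypothetical here and may not match the repo or the server's actual configuration.

```python
import os


def mcp_launch_config(env=None):
    """Return the command, args, and env for starting the MCP server over stdio."""
    env = dict(os.environ if env is None else env)

    # Hypothetical toggle name; the stub is the default so a local run needs no setup.
    use_stub = env.get("USE_STUB_MCP", "1") != "0"
    if use_stub:
        # Hypothetical path to the bundled stub server.
        return {"command": "python", "args": ["stub_mcp_server.py"], "env": {}}

    conn = env.get("MONGODB_CONNECTION_STRING")
    if not conn:
        raise ValueError("MONGODB_CONNECTION_STRING must be set when the stub is off")

    return {
        "command": "npx",
        "args": ["-y", "mongodb-mcp-server"],
        # Assumption: the server reads its connection string from this variable.
        "env": {"MDB_MCP_CONNECTION_STRING": conn},
    }
```

The returned dictionary is the kind of value one would hand to the toolset's stdio connection settings, so the agent definition itself stays identical in both modes.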
Challenges we ran into
The first version of the system prompt got the agent to call the right tools, but Gemini occasionally hallucinated counts (e.g., "29,200 users on the pro plan" when the canned dataset has 500 users in total). The fix was a sharper system prompt that explicitly instructs the agent to copy numbers verbatim from the tool output and, when it doesn't see a number, to call another tool rather than estimate. Subsequent live runs returned 269 free + 127 starter + 73 pro + 31 enterprise = 500 exactly.
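The project's fix was the prompt change described above. As a complementary idea that the write-up does not claim to use, a post-hoc check could flag any number in the answer that never appeared in the tool output. The sketch below is one hypothetical way to do that; the function names and the sample strings are invented for illustration, with the counts taken from the run reported above.

```python
import re


def extract_numbers(text, strip_thousands=False):
    """Return the numeric tokens in text as strings."""
    if strip_thousands:
        # Turn "29,200" into "29200"; only applied to prose answers,
        # since JSON tool output does not use thousands separators.
        text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    return re.findall(r"\d+(?:\.\d+)?", text)


def ungrounded_numbers(answer, tool_output):
    """Return numbers that appear in the answer but not in the tool output."""
    seen = set(extract_numbers(tool_output))
    return [n for n in extract_numbers(answer, strip_thousands=True) if n not in seen]


tool_output = (
    '[{"_id": "free", "count": 269}, {"_id": "starter", "count": 127}, '
    '{"_id": "pro", "count": 73}, {"_id": "enterprise", "count": 31}]'
)
good_answer = "73 users are on the pro plan."
bad_answer = "29,200 users are on the pro plan."

good_flags = ungrounded_numbers(good_answer, tool_output)
bad_flags = ungrounded_numbers(bad_answer, tool_output)
```

A check like this would also flag legitimate derived values such as a sum the model computed itself, so it is better suited to triggering a retry or a review than to rejecting answers outright.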
Accomplishments that we're proud of
- A real Vertex AI Gemini call walked all four tools (list_databases → list_collections → collection_schema → aggregate) in seven events and returned the exact counts.
- The stub-vs-real split means the demo runs in under 60 seconds locally without a MongoDB cluster.
- This is the fourth substantively different MCP integration in this hackathon sibling family (Dynatrace, Arize Phoenix, MongoDB, plus RAG drift). All four share the same `LlmAgent` + `McpToolset` shape; MCP carried the abstraction.
What we learned
When the underlying tool calls return ground-truth numbers, the LLM's job is to relay them, not to estimate. A prompt that says "copy numbers verbatim from the tool output, never round, never extrapolate" measurably tightens accuracy.
What's next for gemini-data-agent
- A "save query" feature so common questions ("monthly active users by plan") run as a one-click button on the dashboard.
- Multi-collection joins via `$lookup` aggregation, so the agent can answer questions that span users + orders.
- Plug in additional partner MCPs (Postgres MCP, Snowflake MCP) and let the same agent target whatever database the team uses.
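To make the `$lookup` item concrete, here is a sketch of a pipeline that could answer "revenue from shipped orders by plan" by joining orders to users. The collection names follow the write-up, but the field names (`status`, `user_id`, `total`, `plan`) are assumptions about the schema, not something the project documents.

```python
# Hypothetical pipeline run against the "orders" collection.
revenue_by_plan_pipeline = [
    # Keep only shipped orders.
    {"$match": {"status": "shipped"}},
    # Join each order to its user document.
    {
        "$lookup": {
            "from": "users",
            "localField": "user_id",
            "foreignField": "_id",
            "as": "user",
        }
    },
    # $lookup yields an array; flatten it to one user per order.
    {"$unwind": "$user"},
    # Sum order totals per plan.
    {"$group": {"_id": "$user.plan", "revenue": {"$sum": "$total"}}},
    {"$sort": {"revenue": -1}},
]

stage_names = [next(iter(stage)) for stage in revenue_by_plan_pipeline]
```

In the discover-then-query loop, the agent would need to read the schema of both collections before it could pick the join keys, which is one more schema call than the single-collection case.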
